Compiling wkhtmltopdf on Mac OS X 10.7 Lion

Wkhtmltopdf is extremely cool. I’ve used qtwebkit for generating server-side page images before, using python-webkit2png, and that’s fine (unlike using Firefox running in xvfb!), but I need to produce PDFs. So, I looked around and found several neat, simple PHP wrappers for calling wkhtmltopdf, and even a PHP extension. “Great”, I thought, “I’ll just install that and spend time working on the layouts since the code looks really simple”. I spoke too soon.

Using the extension requires a working copy of wkhtmltox and libwkhtmltox. Getting those is not as straightforward as it should be, and the docs are really pretty inadequate (hence this post). For Linux there is a simple download of a binary, but the OS X version (despite being the most recent version posted) is curiously supplied as an OS X app bundle. When you run it one of two things happens: nothing, or an interminable bounce requiring a force-quit, i.e. as supplied it’s apparently useless, though I eventually solved this mystery. In a bug report (why does anyone use google code? it’s horrible!) I found a reference to a binary lurking inside the app bundle, and sure enough, it’s there, and it works. Here’s the magic to make it accessible in a ‘normal’ way:

sudo ln -s /Applications/wkhtmltopdf.app/Contents/MacOS/wkhtmltopdf /usr/local/bin

That could well be enough for many uses, but this version is built for 32-bit OS X 10.4, which makes it about 327 in computer years. Homebrew has a recipe for wkhtmltopdf, but it’s not built against a custom qt stack, and so is missing several features. I figured it would be worth trying to do better than that, targeting 64-bit 10.7, so I found some build instructions (thanks to comments on Mar 13, 2012 on this page and this one (no, google code doesn’t provide IDs for comments, duh)) which I was able to adapt.

Environment

Before starting, make sure you have the latest Apple toolchain: run system update, then run Xcode, go to preferences -> downloads and make sure you’ve got the latest command line tools installed. You may also want to check your shell’s environment vars. I use these in my /etc/zshenv:

export MACOSX_DEPLOYMENT_TARGET=10.7
export CHOST='x86_64-apple-darwin11'
export CFLAGS='-arch x86_64 -O3 -fPIC -mmacosx-version-min=10.7 -pipe -march=native -m64'
export LDFLAGS='-arch x86_64 -mmacosx-version-min=10.7'
export CXXFLAGS=${CFLAGS}

Those settings suit my MacBook Air: yours may need to be different.
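The CHOST value above is the machine architecture joined to the Darwin kernel's major version (Lion is Darwin 11). As a rough sketch of how that string could be derived rather than typed by hand, here is a small POSIX shell function; `darwin_host_triplet` is a made-up helper name, not part of any toolchain.

```shell
# Hypothetical helper: build a host triplet from an architecture and a
# Darwin kernel release string such as "11.4.0".
darwin_host_triplet() {
    arch=$1
    release=$2
    major=${release%%.*}    # keep only the major version, e.g. "11"
    printf '%s-apple-darwin%s\n' "$arch" "$major"
}

# On the machine itself the inputs would come from uname:
#   darwin_host_triplet "$(uname -m)" "$(uname -r)"
```

Comparing its output with the value in your own environment file is a quick way to spot settings copied from a different machine.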

Building Qt

Compiling against the specially wkhtmltopdf-patched version of Qt adds several features to wkhtmltopdf that are not available in most distributed and/or statically compiled versions:

  • Printing more than one HTML document into a PDF file.
  • Running without an X11 server.
  • Adding a document outline to the PDF file.
  • Adding headers and footers to the PDF file.
  • Generating a table of contents.
  • Adding links in the generated PDF file.
  • Printing using the screen media-type.
  • Disabling the smart shrink feature of webkit.

It’s normal for packaging systems like MacPorts, Homebrew and Fink not to add this dependency, as it makes the build very large and slow, and many users just don’t need these features. Those packagers could perhaps add custom-Qt ‘flavours’ of the builds so it’s at least possible without straying outside the packager, though that could have implications for the many other packages built against Qt.

First we need to compile a copy of the Qt library, and to do that we have to get the whole thing, even though we’re only going to use some of it.

git clone git://gitorious.org/+wkhtml2pdf/qt/wkhtmltopdf-qt.git
cd wkhtmltopdf-qt
git checkout staging

This takes quite a while since it’s a 970M download! In order to make it compile for x86_64 we need to change the arch option in the build config, and tell it where the 10.7 SDKs are (they’ve moved since 10.4). So I edited configure on line 4875 (of 9133 – this is a BIG configure file!) to look like this:

echo "export MACOSX_DEPLOYMENT_TARGET = 10.7" >> "$mkfile"

Now we can set it up to build, specifying the location of the 10.7 SDK and the x86_64 arch, and deleting any references to the x86 arch (if you leave them in it may try to build for both):

QTDIR=. ./bin/syncqt
./configure -sdk /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.7.sdk -arch x86_64 -release -static -fast -exceptions -no-accessibility -no-stl -no-sql-ibase -no-sql-mysql -no-sql-odbc -no-sql-psql -no-sql-sqlite -no-sql-sqlite2 -no-qt3support -xmlpatterns -no-phonon -no-phonon-backend -webkit -no-scripttools -no-mmx -no-3dnow -no-sse -no-sse2 -no-ssse3 -qt-zlib -qt-libtiff -qt-libpng -qt-libmng -qt-libjpeg -openssl -graphicssystem raster -opensource -nomake "tools examples demos docs translations" -no-opengl -no-dbus -no-framework -no-dwarf2 -no-multimedia -no-declarative -largefile -rpath -no-nis -no-cups -no-iconv -no-pch -no-gtkstyle -no-nas-sound -no-sm -no-xshape -no-xinerama -no-xfixes -no-xrandr -no-xrender -no-mitshm -no-xkb -no-glib -no-openvg -no-xsync -no-javascript-jit -no-egl -carbon --prefix=../wkqt/
make -j3
make install

This takes a long time. Using -j3 made a big difference on my 8-core Mac Pro and 4-core MacBook Air. Note that the configure step may use an i386 arch; that doesn’t mean that the build itself will.
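The -j value is a judgment call. If you would rather derive it from the machine than pick a number, here is a minimal sketch; `cpu_count` is a made-up helper name, and whether more jobs actually makes this particular build faster is untested.

```shell
# Hypothetical helper: print the number of logical CPUs, or 1 if unknown.
cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
        n=$(sysctl -n hw.ncpu 2>/dev/null) ||
        n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;    # fall back to 1 if the answer is not a number
    esac
    echo "$n"
}

# Usage:
#   make -j"$(cpu_count)"
```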

Build the wkhtmltopdf app

cd ..
wget http://wkhtmltopdf.googlecode.com/files/wkhtmltopdf-0.11.0_rc1.tar.bz2
tar xvjf wkhtmltopdf-0.11.0_rc1.tar.bz2
rm wkhtmltopdf-0.11.0_rc1.tar.bz2
cd wkhtmltopdf-0.11.0_rc1

This code also needs to be set to build for x86_64, so edit these two files: src/image/image.pro and src/pdf/pdf.pro and change this section in each:

macx {
#    CONFIG -= app_bundle
    CONFIG += x86_64
}

This sets them to build for 64-bit and, because the app_bundle line is commented out, to build as app bundles.
Now build it:

../wkqt/bin/qmake
make
sudo make install

This installs two app bundles in /bin/wkhtmltopdf.app and /bin/wkhtmltoimage.app.
When I tried to actually use it, I ran into the reason why it’s built as an app – it has dependencies on a qt component resource that needs to be bundled with it (why it needs a graphical menu resource when it has no GUI of any kind is beyond me!). To fix this I copied the necessary parts into the apps and set up symlinks to the binaries:

cd ../wkhtmltopdf-qt
sudo cp -pr src/gui/mac/qt_menu.nib /bin/wkhtmltopdf.app/Contents/Resources
sudo cp -pr src/gui/mac/qt_menu.nib /bin/wkhtmltoimage.app/Contents/Resources
sudo ln -s /bin/wkhtmltopdf.app/Contents/MacOS/wkhtmltopdf /usr/local/bin
sudo ln -s /bin/wkhtmltoimage.app/Contents/MacOS/wkhtmltoimage /usr/local/bin

After this running wkhtmltopdf --version gives:

Name:
  wkhtmltopdf 0.10.0 rc2

License:
  Copyright (C) 2010 wkhtmltopdf/wkhtmltoimage Authors.

  License LGPLv3+: GNU Lesser General Public License version 3 or later
  . This is free software: you are free to
  change and redistribute it. There is NO WARRANTY, to the extent permitted by
  law.

Authors:
  Written by Jan Habermann, Christian Sciberras and Jakob Truelsen. Patches by
  Mehdi Abbad, Lyes Amazouz, Pascal Bach, Emmanuel Bouthenot, Benoit Garret and
  Mário Silva.

The version number string is wrong (it’s supposedly 0.11.0-rc1) and there’s a bug report for that. We can check we’ve built for the right architecture too:

file /usr/local/bin/wkhtmltopdf
/usr/local/bin/wkhtmltopdf: Mach-O 64-bit executable x86_64
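If you want to script that check, for example to cover both binaries, a minimal sketch follows. `is_x86_64_macho` is a made-up helper that only inspects the text printed by `file`, so it assumes the output format shown above.

```shell
# Hypothetical helper: succeed if the given `file` output describes a
# 64-bit x86_64 Mach-O binary.
is_x86_64_macho() {
    case $1 in
        *"Mach-O 64-bit"*x86_64*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (-L makes file follow the symlinks created earlier):
#   for bin in /usr/local/bin/wkhtmltopdf /usr/local/bin/wkhtmltoimage; do
#       is_x86_64_macho "$(file -L "$bin")" || echo "$bin: wrong architecture"
#   done
```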

Building the PHP extension

First I needed to copy the libs and include files somewhere the compiler would find them:

cd ..
sudo cp -r wkhtmltopdf-0.11.0_rc1/include/wkhtmltox /usr/local/include
sudo cp wkhtmltopdf-0.11.0_rc1/bin/libwkhtmltox.* /usr/local/lib

For some reason it was building for i386 (which is no use with a 64-bit lib), and specifying a host of x86_64 didn’t work – it builds, but produces a .a library instead of a .so shared object, claiming that libtool couldn’t make shared objects. A bit of rummaging led me to the correct host type for 10.7 which allowed it to link correctly.

git clone https://github.com/mreiferson/php-wkhtmltox.git
cd php-wkhtmltox
phpize
./configure --host=x86_64-apple-darwin11.4.0
make
make install

After that I added extension=phpwkhtmltox.so to an appropriate ini file and PHP then listed the extension in php -m output. There are a couple of test scripts included with the extension files, so I ran php test_pdf.php, which makes a bunch of test PDFs in /tmp, and all looks pretty good. Don’t forget to restart Apache if you want the extension to show up there too.
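The php -m check can be scripted too. This is a small sketch: `has_php_module` is a made-up helper, and it assumes the module is listed under the same name as the .so file (phpwkhtmltox), which may not match what your build reports.

```shell
# Hypothetical helper: read a module list (one name per line) on stdin and
# succeed if the named module is present, ignoring case.
has_php_module() {
    grep -qixF -- "$1"
}

# Usage:
#   php -m | has_php_module phpwkhtmltox && echo "extension loaded"
#   php --ini    # shows which ini files this PHP binary reads
```

The command-line PHP and the Apache module can read different ini files, so a pass here only covers the command-line side.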

I hope someone finds this useful.

Update May 18th 2012

I repeated this build on my MacBook Air and ran into several issues, and one section that worked completely differently to my original, so I’ve updated the article with these changes.

Update June 9th 2012

Added notes about what using the custom Qt libs buys you.

Domain name validation

I was revisiting the validation of domain names and realised that most of the regexes posted around the web have faults.

Many refer to Shaun Inman’s 2006 post, which does a fair job but is prone to break as new TLDs are introduced. This answer on StackOverflow is about the best I’ve found so far: it enforces label and overall lengths; allowing multiple dashes means it works with punycoded domains; it’s generally permissive so won’t break as TLDs change, but there’s one case not handled. RFC 2181 says that labels that are not used as hostnames (i.e. which do not map to an IP, for example in TXT or SRV records) may contain any printable ASCII character, so `_,;:'"!@~$` and friends are all up for inclusion. This is most commonly found in domainkeys, which use the `_domainkey` label. There’s a good article on the use of underscores in DNS.
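As a rough illustration of the hostname-only rules described above (this is not the StackOverflow regex itself), here is a minimal sketch as a POSIX shell function. The name `is_valid_hostname` is made up for this example, and the limits are assumed to be the usual ones: labels of 1 to 63 letters, digits and hyphens with no hyphen at either end, and at most 253 characters overall.

```shell
# Hypothetical helper: succeed if the argument looks like a valid hostname.
is_valid_hostname() {
    name=${1%.}    # tolerate one trailing dot (fully qualified form)
    [ "${#name}" -ge 1 ] && [ "${#name}" -le 253 ] || return 1
    # One label: starts and ends with a letter or digit, 63 characters at most.
    label='[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?'
    printf '%s\n' "$name" | LC_ALL=C grep -Eq "^(${label}\.)*${label}\$"
}

# Usage:
#   is_valid_hostname www.example.com && echo valid
```

It accepts punycoded labels such as xn--bcher-kva, since repeated hyphens inside a label are allowed, and it rejects `_domainkey.example.com`, which is the unhandled case.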

This does relate to the validation of email addresses (which often contain domains), and the best page on that subject is this one. However, you can’t simply extract the domain part from that, as domain names in general are a superset of what’s used in email.

It’s difficult to do this right because you can’t tell whether a label is a hostname or not, or where a hostname stops and a domain begins, and validity varies according to context: `_domainkey.example.com` is invalid in an A record, but valid in a TXT record. I can foresee a parameter to allow you to specify usage context to deal with this. It might be better to process the name backwards so that you have more context available as you encounter each label, for example if you processed `www.example.com` as `com.example.www`, you would stand a better chance of knowing whether www is a hostname or a domain name.
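To make the two ideas above concrete, here is a thinking-out-loud sketch in POSIX shell, not a solution. Everything in it is hypothetical: the function names, the two context values ("host" for something like an A record, "any" for TXT or SRV), and the choice to treat "printable ASCII" as the characters from `!` to `~`, without spaces.

```shell
LC_ALL=C; export LC_ALL    # make the character ranges below byte-based

# Hypothetical helper: print the labels of a name in reverse order.
reverse_labels() {
    rest=${1%.}            # drop one trailing dot, if present
    out=
    while [ -n "$rest" ]; do
        case $rest in
            *.*) label=${rest##*.}; rest=${rest%.*} ;;    # peel off the rightmost label
            *)   label=$rest; rest= ;;
        esac
        out=${out:+$out.}$label
    done
    printf '%s\n' "$out"
}

# Hypothetical validator: context is "host" or "any".
is_valid_name() {
    name=${1%.}
    context=$2
    case $name in ''|.*|*.|*..*) return 1 ;; esac    # empty name or empty label
    [ "${#name}" -le 253 ] || return 1
    rest=$name
    while [ -n "$rest" ]; do                          # visit labels right to left
        case $rest in
            *.*) label=${rest##*.}; rest=${rest%.*} ;;
            *)   label=$rest; rest= ;;
        esac
        [ "${#label}" -le 63 ] || return 1
        case $context in
            host) case $label in -*|*-|*[!A-Za-z0-9-]*) return 1 ;; esac ;;
            any)  case $label in *[!!-~]*) return 1 ;; esac ;;
            *)    return 2 ;;                         # unknown context
        esac
    done
    return 0
}

# Usage:
#   reverse_labels www.example.com
#   is_valid_name _domainkey.example.com any && echo "fine in a TXT record"
```

Because the loop sees the rightmost label first, a rule that depends on position (for instance, stricter checks on the TLD) could be added inside it. What those rules should be is still the open question.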

I’m mainly thinking out loud here; I don’t have a solution as yet!