Compiling wkhtmltopdf on Mac OS X 10.7 Lion

Wkhtmltopdf is extremely cool. I’ve used qtwebkit for generating server-side page images before using python-webkit2png, and that’s fine (unlike using Firefox running in xvfb!), but I need to produce PDFs. So, I looked around and found several neat, simple PHP wrappers for calling wkhtmltopdf, and even a PHP extension. “Great”, I thought, “I’ll just install that and spend time working on the layouts since the code looks really simple”. I spoke too soon.

To use it requires that you have a working copy of wkhtmltox and libwkhtmltox. Getting those is not as straightforward as it should be, and the docs are really pretty inadequate (hence this post). For Linux there is a simple download of a binary, but the OS X version (despite being the most recent version posted) is curiously supplied as an OS X app bundle. When you run it one of two things happens: nothing, or an interminable bounce requiring a force-quit, i.e. as supplied it’s apparently useless, though I eventually solved this mystery. In a bug report (why does anyone use google code? it’s horrible!) I found a reference to a binary lurking inside the app bundle, and sure enough, it’s there, and it works. Here’s the magic to make it accessible in a ‘normal’ way:

sudo ln -s /Applications/wkhtmltopdf.app/Contents/MacOS/wkhtmltopdf /usr/local/bin

That could well be enough for many uses, but this version is built for 32-bit OS X 10.4, which makes it about 327 in computer years. Homebrew has a recipe for wkhtmltopdf, but it’s not built against a custom qt stack, and so is missing several features. I figured it would be worth trying to do better than that, targeting 64-bit 10.7, so I found some build instructions (thanks to comments on Mar 13, 2012 on this page and this one (no, google code doesn’t provide IDs for comments, duh)) which I was able to adapt.

Environment

Before starting, make sure you have the latest Apple toolchain: Run system update, then run XCode, go to preferences -> downloads and make sure you’ve got the latest command line tools installed. You may also want to check your shell’s environment vars. I use these in my /etc/zshenv:

export MACOSX_DEPLOYMENT_TARGET=10.7
export CHOST='x86_64-apple-darwin11'
export CFLAGS='-arch x86_64 -O3 -fPIC -mmacosx-version-min=10.7 -pipe -march=native -m64'
export LDFLAGS='-arch x86_64 -mmacosx-version-min=10.7'
export CXXFLAGS=${CFLAGS}

Those settings suit my MacBook Air: yours may need to be different.

Building QT

Compiling against the specially wkhtmltopdf-patched version of Qt adds several features to wkhtmltopdf that are not available in most distributed and/or statically compiled versions:

  • Printing more then one HTML document into a PDF file.
  • Running without an X11 server.
  • Adding a document outline to the PDF file.
  • Adding headers and footers to the PDF file.
  • Generating a table of contents.
  • Adding links in the generated PDF file.
  • Printing using the screen media-type.
  • Disabling the smart shrink feature of webkit.

It’s normal for packaging systems like MacPorts, HomeBrew and Fink not to add this dependency as it makes the build very large and take a long time, and these features just may not be needed for many users – those packagers could perhaps add custom-Qt ‘flavours’ of the builds so it’s at least possible without straying outside the packager, though it could have implications for other packages built against Qt, of which there are many.

First we need to compile a copy of the the qt library, and to do that we have to get the whole thing, even though we’re only going to use some of it.

git clone git://gitorious.org/+wkhtml2pdf/qt/wkhtmltopdf-qt.git
cd wkhtmltopdf-qt
git checkout staging

This takes quite a while since it’s a 970M download! In order to make it compile for x86_64 we need to change the arch option in the build config, and tell it where the 10.7 SDKs are (they’ve moved since 10.4). So I edited configure on line 4875 (of 9133 – this is a BIG configure file!) to look like this:

echo "export MACOSX_DEPLOYMENT_TARGET = 10.7" >> "$mkfile"

Now we can set it up to build, specifiying the location of the 10.7 SDK and the x86_64 arch, and deleting any references to the x86 arch (if you leave it in it may try to build for both):

QTDIR=. ./bin/syncqt
./configure -sdk /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.7.sdk -arch x86_64 -release -static -fast -exceptions -no-accessibility -no-stl -no-sql-ibase -no-sql-mysql -no-sql-odbc -no-sql-psql -no-sql-sqlite -no-sql-sqlite2 -no-qt3support -xmlpatterns -no-phonon -no-phonon-backend -webkit -no-scripttools -no-mmx -no-3dnow -no-sse -no-sse2 -no-ssse3 -qt-zlib -qt-libtiff -qt-libpng -qt-libmng -qt-libjpeg -openssl -graphicssystem raster -opensource -nomake "tools examples demos docs translations" -no-opengl -no-dbus -no-framework -no-dwarf2 -no-multimedia -no-declarative -largefile -rpath -no-nis -no-cups -no-iconv -no-pch -no-gtkstyle -no-nas-sound -no-sm -no-xshape -no-xinerama -no-xfixes -no-xrandr -no-xrender -no-mitshm -no-xkb -no-glib -no-openvg -no-xsync -no-javascript-jit -no-egl -carbon --prefix=../wkqt/
make -j3
make install

This takes a long time. Using -j3 made a big difference on my 8-core Mac Pro and 4-core MacBook Air. Note that the configure step may use an i386 arch; that doesn’t mean that the build itself will.

Build the wkhtmltopdf app

wget http://wkhtmltopdf.googlecode.com/files/wkhtmltopdf-0.11.0_rc1.tar.bz2
tar xvjf wkhtmltopdf-0.11.0_rc1.tar.bz2
rm wkhtmltopdf-0.11.0_rc1.tar.bz2
cd wkhtmltopdf-0.11.0_rc1

This code also needs to be set to build for x86_64, so edit these two files: src/image/image.pro and src/pdf/pdf.pro and change this section in each:

macx {
#    CONFIG -= app_bundle
    CONFIG += x86_64
}

This sets them to build for 64-bit and not to omit building as an app bundle.
Now build it:

../wkqt/bin/qmake
make
sudo make install

This installs two app bundles in /bin/wkhtmltopdf.app and /bin/wkhtmltoimage.app.
When I tried to actually use it, I ran into the reason why it’s built as an app – it has dependencies on a qt component resource that needs to be bundled with it (why it needs a graphical menu resource when it has no GUI of any kind is beyond me!). To fix this I copied the necessary parts into the apps and set up symlinks to the binaries:

cd wkhtmltopdf-qt
sudo cp -pr src/gui/mac/qt_menu.nib /bin/wkhtmltopdf.app/Contents/Resources
sudo cp -pr src/gui/mac/qt_menu.nib /bin/wkhtmltoimage.app/Contents/Resources
sudo ln -s /bin/wkhtmltopdf.app/Contents/MacOS/wkhtmltopdf /usr/local/bin
sudo ln -s /bin/wkhtmltoimage.app/Contents/MacOS/wkhtmltoimage /usr/local/bin

After this running wkhtmltopdf --version gives:

Name:
  wkhtmltopdf 0.10.0 rc2

License:
  Copyright (C) 2010 wkhtmltopdf/wkhtmltoimage Authors.

  License LGPLv3+: GNU Lesser General Public License version 3 or later
  . This is free software: you are free to
  change and redistribute it. There is NO WARRANTY, to the extent permitted by
  law.

Authors:
  Written by Jan Habermann, Christian Sciberras and Jakob Truelsen. Patches by
  Mehdi Abbad, Lyes Amazouz, Pascal Bach, Emmanuel Bouthenot, Benoit Garret and
  Mário Silva.

The version number string is wrong (it’s supposedly 0.11.0-rc1) and there’s a bug report for that. We can check we’ve built for the right architecture too:

file /usr/local/bin/wkhtmltopdf
/usr/local/bin/wkhtmltopdf: Mach-O 64-bit executable x86_64

Building the PHP extension

First I needed to copy the libs and include files somewhere the compiler would find them:

cd ..
sudo cp -r wkhtmltopdf-0.11.0_rc1/include/wkhtmltox /usr/local/include
sudo cp wkhtmltopdf-0.11.0_rc1/bin/libwkhtmltox.* /usr/local/lib

For some reason it was building for i386 (which is no use with a 64-bit lib), and specifying a host of x86_64 didn’t work – it builds, but produces a .a library instead of a .so shared object, claiming that libtool couldn’t make shared objects. A bit of rummaging led me to the correct host type for 10.7 which allowed it to link correctly.

git clone https://github.com/mreiferson/php-wkhtmltox.git
cd php-wkhtmltopdf
phpize
./configure --host=x86_64-apple-darwin11.4.0
make
make install

After that I added extension=phpwkhtmltox.so to an appropriate ini file and PHP then listed the extension in php -m output. There are a couple of test scripts included with the extension files, so I ran php test_pdf.php, which makes a bunch of test PDFs in /tmp, and all looks pretty good. Don’t forget to restart apache if you want it to show up in there too.

I hope someone finds this useful.

Update May 18th 2012

I repeated this build on my MacBook Air and ran into several issues, and one section that worked completely differently to my original, so I’ve updated the article with these changes.

Update June 9th 2012

Added notes about what using the custom Qt libs buys you.

Domain name validation

I was revisiting the validation of domain names and realised that most of the regexes posted around the web have faults.

Many refer to Sean Inman’s 2006 post, which does a fair job but is prone to break as new TLDs are introduced. This answer on StackOverflow is about the best I’ve found so far: it enforces label and overall lengths; allowing multiple dashes means it works with punycoded domains; it’s generally permissive so won’t break as TLDs change, but there’s one case not handled. RFC2872 says that labels that are not used as hostnames (i.e. which do not map to an IP, for example in TXT or SRV records) may contain any printable ASCII character, so  `_,;:'”!@£~$` and friends are all up for inclusion. This is most commonly found in domainkeys, which use the `_domainkey` label. There’s a good article on the use of underscores in DNS.

This does relate to the validation of email addresses (which often contain domains), and the best page on that subject is this one, however, you can’t simply extract the domain part from that as domain names in general are a superset of what’s used in email.

It’s difficult to do this right because you can’t tell whether a label is a hostname or not, or where a hostname stops and a domain begins, and validity varies according to context: `_domainkey.example.com` is invalid in an A record, but valid in a TXT record. I can foresee a parameter to allow you to specify usage context to deal with this. It might be better to process the name backwards so that you have more context available as you encounter each label, for example if you processed `www.example.com` as `com.example.www`, you would stand a better chance of knowing whether www is a hostname or a domain name.

I’m mainly thinking out loud here, I don’t have a solution as yet!

PHP Base-62 encoding

There’s a really horrible bug (they won’t call it that, but I can’t think of any use case for the default broken behaviour!) in Apache’s mod_rewrite that means that urlencoded inputs in rewrites get unescaped in their transformation to output patterns. The underlying ‘bug’ remains unfixed even in 2.3, though a workaround in the form of the ‘B’ flag first appeared in Apache 2.2.7, but was broken until 2.2.12 (which wasn’t all that long ago). Put it like this: if you’re not using the B flag in your mod_rewrite rules, your site is probably only working due to blind luck.

With that in mind, several years ago I spent ages looking for a base-62 encoder/decoder for PHP to replace mod_rewrite’s broken urlencoding handling. Nobody seemed to have the slightest interest in writing one. Base-62 is interesting as it can be made safe for use in URLs, DNS, email addresses and pathnames, unlike any available encoding of base-64, as it only includes [0-9A-Za-z]. As a workaround for the above bug, I was interested in base-62 encoding URLs for embedding in redirects. At the time I wrote something using bc_math, but it was very slow (and weirdly got ripped off by some dickhead and passed off as his own, despite that fact that I said it was crap!). I eventually gave up on that and switched to base-64, which led to occasional URL corruption. If you include hashes in URLs, keeping them in the default hex representation is quite wasteful, and can contribute to issues with line length in email. Having hashes in base-62 is a nice way of reducing their size.

There are a few posts on base-62 in PHP, notably this one and this one, but they make the assumption that you’re talking about a numeric value, and while a hash is a numeric value, it’s way too big for PHP to handle as an integer. Others take the multiprecision artithmetic route, which treats the input binary as a single very large, and calculates its representation in another base; that works, but it’s horribly slow.

Since then, the gmp and bc_math extensions were improved in PHP 5.3.2, and now they handle (usefully) up to base-62. So here’s a simple function for getting a hash in base-62:

function base62hash($source) {
    return gmp_strval(gmp_init(md5($source), 16), 62);
}

and for converting to and from base-16 hashes:

function hash16to62($hash) {
    return gmp_strval(gmp_init($hash, 16), 62);
}

function hash62to16($hash) {
    return gmp_strval(gmp_init($hash, 62), 16);
}

I could still use a proper base-62 encoder for longer arbitrary strings, but at least now it should be simpler to write something iterative now that these extensions have (ahem) their bases covered.

Update: I’ve written a sufficiently usable PHP base-62 encoder for arbitrary-length binary strings that’s not too slow. You can find it on github in this gist. Let me know if you find it useful

Incidentally I discovered that the gmp functions use [0-9a-f] up to base 16, but [0-9A-Za-z] (i.e. upper case first) from bases 17 to 62. This differs from most of the base-62 implementations I’ve found that tend to use lower case first.

This is all slightly academic now as the apache B-flag workaround works, so standard urlencoding works properly and I don’t need to use a different encoding any more, however, there were so many examples of slow encoders, I thought the world could do with a usable one.

Update Something else worth mentioning is that if you use the apache B flag, you most likely need to turn the AllowEncodedSlashes directive on too, as otherwise you’ll get mysterious 404s. I posted a bug report against the apache docs to make this clearer.

Update Apache used my rewrite of the B-flag docs, yay!

Google Charts API Simple and Extended Encoders in PHP

Google’s charting API has been around for quite a while now, but I’ve only just needed to actually look at it. It became immediately obvious that I needed a PHP encoding function, so off to google I went. Though I found several implementations, they were all incomplete or deficient in one way or another (and it didn’t help that there was an error in google’s extended encoding docs), so I’ve written my own based on several different ones. Both simple and extended encoders support automatic scaling, inflated maximum and lower-bound truncation, so you can pretty much stuff whatever data you like in, with no particular regard for pre-scaling and you’ll get a usable result out. They have an identical interface, so you can use either encoding interchangeably according to the output resolution you need (contrary to popular belief, the encoding to use has very little to do with the range of values you need to graph). By default, the full range of possible values is used as it just seems silly not to. I deliberately omit the ‘s:’ and ‘e:’ prefixes so that you can call these functions for multiple data series, and I include a function that does just that. You still need to generate your own URLs and other formatting, but that’s a different problem. Read on for the code… Continue reading “Google Charts API Simple and Extended Encoders in PHP”