PHP Base-62 encoding

There’s a really horrible bug (they won’t call it that, but I can’t think of any use case for the default broken behaviour!) in Apache’s mod_rewrite: urlencoded inputs to a rewrite get unescaped on their way into the output pattern. The underlying ‘bug’ remains unfixed even in 2.3. A workaround in the form of the ‘B’ flag first appeared in Apache 2.2.7, but it was broken until 2.2.12 (which wasn’t all that long ago). Put it like this: if you’re not using the B flag in your mod_rewrite rules, your site is probably only working due to blind luck.

With that in mind, several years ago I spent ages looking for a base-62 encoder/decoder for PHP to replace mod_rewrite’s broken urlencoding handling. Nobody seemed to have the slightest interest in writing one. Base-62 is interesting as it can be made safe for use in URLs, DNS, email addresses and pathnames, unlike any available encoding of base-64, as it only includes [0-9A-Za-z]. As a workaround for the above bug, I was interested in base-62 encoding URLs for embedding in redirects. At the time I wrote something using bc_math, but it was very slow (and weirdly got ripped off by some dickhead and passed off as his own, despite the fact that I said it was crap!). I eventually gave up on that and switched to base-64, which led to occasional URL corruption. If you include hashes in URLs, keeping them in the default hex representation is quite wasteful, and can contribute to issues with line length in email. Having hashes in base-62 is a nice way of reducing their size.

There are a few posts on base-62 in PHP, notably this one and this one, but they make the assumption that you’re talking about a numeric value, and while a hash is a numeric value, it’s way too big for PHP to handle as an integer. Others take the multiprecision arithmetic route, which treats the input binary as a single very large number and calculates its representation in another base; that works, but it’s horribly slow.
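To make the multiprecision route concrete, here is a minimal sketch of it in plain PHP, with no extensions needed. It is not taken from any of the posts mentioned above: the function name and the 0-9A-Za-z alphabet order are assumptions made for illustration. It repeatedly divides the whole input (read as one big base-256 number) by 62, and each remainder becomes one output character.

```php
<?php
// Assumed alphabet order (upper case first); other implementations differ.
const BASE62_ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Hypothetical helper: encode a binary string as one big number in base 62.
function base62_encode_slow(string $binary): string
{
    if ($binary === '') {
        return '';
    }
    // Bytes of the input, most significant first.
    $digits = array_values(unpack('C*', $binary));
    $out = '';
    while (count($digits) > 0) {
        $remainder = 0;
        $quotient = [];
        // Schoolbook long division of the base-256 number by 62.
        foreach ($digits as $digit) {
            $acc = $remainder * 256 + $digit;
            $q = intdiv($acc, 62);
            $remainder = $acc % 62;
            // Drop leading zeros of the quotient so the loop terminates.
            if (count($quotient) > 0 || $q > 0) {
                $quotient[] = $q;
            }
        }
        // The remainder of each pass is the next least significant digit.
        $out = BASE62_ALPHABET[$remainder] . $out;
        $digits = $quotient;
    }
    return $out;
}
```

Every output character costs a full pass over what is left of the number, so the work grows with the square of the input length, which is where the slowness comes from. Because the input is treated as a number, leading zero bytes are also lost.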

Since then, the gmp and bc_math extensions were improved in PHP 5.3.2, and now they handle (usefully) up to base-62. So here’s a simple function for getting a hash in base-62:

// MD5 the input, then convert the hex digest to base 62 (needs the gmp extension)
function base62hash($source) {
    return gmp_strval(gmp_init(md5($source), 16), 62);
}

and for converting to and from base-16 hashes:

// Hex hash to base 62
function hash16to62($hash) {
    return gmp_strval(gmp_init($hash, 16), 62);
}

// Base-62 hash back to hex
function hash62to16($hash) {
    return gmp_strval(gmp_init($hash, 62), 16);
}

I could still use a proper base-62 encoder for longer arbitrary strings, but at least it should be simpler to write something iterative now that these extensions have (ahem) their bases covered.
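One thing to watch when round-tripping: treating a hash as a number drops any leading zeros, so a hex hash that starts with 0 comes back shorter than it went in. Below is a minimal sketch of a padded round trip, assuming the gmp extension is loaded. The function names are hypothetical, and the default length of 32 assumes an MD5 hash.

```php
<?php
// Hypothetical name, same conversion as a plain hex-to-base-62 call.
function hash16to62_demo(string $hex): string
{
    return gmp_strval(gmp_init($hex, 16), 62);
}

// Convert back to hex and restore any leading zeros lost in the
// numeric conversion. $hexLength = 32 assumes an MD5 digest.
function hash62to16_padded(string $b62, int $hexLength = 32): string
{
    $hex = gmp_strval(gmp_init($b62, 62), 16);
    return str_pad($hex, $hexLength, '0', STR_PAD_LEFT);
}

// Usage (only runs where gmp is available):
if (extension_loaded('gmp')) {
    $hex = md5('example');
    $short = hash16to62_demo($hex);
    var_dump(hash62to16_padded($short) === $hex);
}
```

A 128-bit hash never needs more than 22 base-62 characters, against 32 in hex.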

Update: I’ve written a sufficiently usable PHP base-62 encoder for arbitrary-length binary strings that’s not too slow. You can find it on GitHub in this gist. Let me know if you find it useful.

Incidentally I discovered that the gmp functions use [0-9a-z] (lower case) for bases up to 36, but [0-9A-Za-z] (i.e. upper case first) for bases 37 to 62. This differs from most of the base-62 implementations I’ve found, which tend to use lower case first.
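If you need to exchange values with an implementation that uses the other letter order, swapping the case of every letter is enough, since the two alphabets differ only in which case comes first. A small sketch (the helper name is hypothetical):

```php
<?php
// Hypothetical helper: swap the case of every letter, converting between
// the upper-case-first alphabet (0-9A-Za-z) and the lower-case-first
// alphabet (0-9a-zA-Z). Digits are left alone.
function base62_swap_case(string $encoded): string
{
    $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower = 'abcdefghijklmnopqrstuvwxyz';
    // strtr() translates character by character in a single pass,
    // so a letter is never swapped twice.
    return strtr($encoded, $upper . $lower, $lower . $upper);
}
```

Applying it twice gives back the original string, so the same function works in both directions.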

This is all slightly academic now, as the Apache B-flag workaround works: standard urlencoding works properly and I don’t need to use a different encoding any more. However, there were so many examples of slow encoders that I thought the world could do with a usable one.

Update: Something else worth mentioning is that if you use the Apache B flag, you most likely need to turn the AllowEncodedSlashes directive on too, as otherwise you’ll get mysterious 404s. I posted a bug report against the Apache docs to make this clearer.

Update: Apache used my rewrite of the B-flag docs, yay!

MySQL backups with Percona’s XtraBackup

MySQL backup is sometimes very hard to do effectively. MySQL provides various options for backup, but many of them are simply unsuitable for large systems, particularly if they need to remain active during backups. Percona’s XtraBackup is an open-source clone of InnoBase’s InnoDB Hot Backup utility. So what makes XtraBackup a better solution, and how does it work?

Update: on December 10th 2009, Percona released XtraBackup 1.0.

Google Charts API Simple and Extended Encoders in PHP

Google's charting API has been around for quite a while now, but I've only just needed to actually look at it. It became immediately obvious that I needed a PHP encoding function, so off to Google I went. Though I found several implementations, they were all incomplete or deficient in one way or another (and it didn't help that there was an error in Google's extended encoding docs), so I've written my own based on several different ones. Both simple and extended encoders support automatic scaling, inflated maximum and lower-bound truncation, so you can pretty much stuff whatever data you like in, with no particular regard for pre-scaling, and you'll get a usable result out. They have an identical interface, so you can use either encoding interchangeably according to the output resolution you need (contrary to popular belief, the encoding to use has very little to do with the range of values you need to graph). By default, the full range of possible values is used as it just seems silly not to. I deliberately omit the 's:' and 'e:' prefixes so that you can call these functions for multiple data series, and I include a function that does just that. You still need to generate your own URLs and other formatting, but that's a different problem.
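To show the general idea of scaling data into the simple encoding, here is a minimal sketch. It is not the encoder described above: the function name, its parameters and the clamping behaviour are assumptions made for illustration, and it leaves out the inflated maximum and lower-bound truncation options. It relies only on the simple encoding's 62-character alphabet (A-Z, a-z, 0-9) and the underscore for missing values.

```php
<?php
// Sketch of Google Charts "simple" encoding with automatic scaling.
// Function and parameter names are hypothetical.
function chart_simple_encode_sketch(array $values, ?float $max = null): string
{
    // Simple encoding alphabet: A=0 ... Z=25, a=26 ... z=51, 0=52 ... 9=61.
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
    if ($max === null) {
        // Scale to the largest numeric value present.
        $numeric = array_filter($values, 'is_numeric');
        $max = count($numeric) > 0 ? (float) max($numeric) : 0.0;
    }
    $out = '';
    foreach ($values as $value) {
        if (!is_numeric($value) || $value < 0) {
            $out .= '_'; // missing or unplottable value
            continue;
        }
        // Clamp to $max, then map 0..$max onto the 62 available levels.
        $scaled = $max > 0 ? min((float) $value, $max) / $max : 0.0;
        $out .= $alphabet[(int) round($scaled * 61)];
    }
    return $out;
}
```

As in the description above, the 's:' prefix is left off, so the results for several series can be joined by the caller.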

Subversion 1.5 repository upgrade on SourceForge

I’ve just had a slightly tricky time upgrading a Subversion repository on SourceForge. They have recently added support for Subversion 1.5 at the server end. 1.5 brings major new features for merging, but as it’s not backward compatible with older Subversion clients, the upgrade is not done automatically. SF have also done a major rearrangement of their documentation while transferring everything to Trac, and it’s not always easy to get the right info. Normally to upgrade a Subversion repo, you just run ‘svnadmin upgrade /path/to/repo’. However, it’s not quite so simple on SourceForge, as you don’t have direct access to the repo, and the instructions they give are slightly wrong at the time of writing. You’re likely to get an error like this (it’s not obvious that this is a fatal error) when you reload a dump file:

svnadmin: File already exists: filesystem ‘/svnroot/projectname/db’, transaction ‘443-0’, path ‘tags’
* adding path : tags …

This is because load is intended to add files to an existing repo, not to replace those that are already there, so you need to wipe the repo and start from scratch.

So, here is a working command sequence that needs to be run from a project login shell on sourceforge (it applies to the project you’re logged in through, substitute your project’s name for projectname):

adminrepo --checkout svn
svnadmin dump /svnroot/projectname > svn.dump
rm -rf /svnroot/projectname/*
svnadmin create /svnroot/projectname
svnadmin load /svnroot/projectname < svn.dump
adminrepo --save svn

Yes, you do need to delete the whole thing and re-import it, but it’s quick and easy, and you have a backup in the dump file you take at the start. After the upgrade, make sure you get a new checkout of your project to ensure that you’re using 1.5 all the way through. Now you’ll find that commands like ‘svn merge --reintegrate’ work.