PHP Base-62 encoding

There’s a really horrible bug (they won’t call it that, but I can’t think of any use case for the default broken behaviour!) in Apache’s mod_rewrite that means that urlencoded inputs in rewrites get unescaped in their transformation to output patterns. The underlying ‘bug’ remains unfixed even in 2.3, though a workaround in the form of the ‘B’ flag first appeared in Apache 2.2.7, but was broken until 2.2.12 (which wasn’t all that long ago). Put it like this: if you’re not using the B flag in your mod_rewrite rules, your site is probably only working due to blind luck.

With that in mind, several years ago I spent ages looking for a base-62 encoder/decoder for PHP to replace mod_rewrite’s broken urlencoding handling. Nobody seemed to have the slightest interest in writing one. Base-62 is interesting as it can be made safe for use in URLs, DNS, email addresses and pathnames, unlike any available encoding of base-64, as it only includes [0-9A-Za-z]. As a workaround for the above bug, I was interested in base-62 encoding URLs for embedding in redirects. At the time I wrote something using bc_math, but it was very slow (and weirdly got ripped off by some dickhead and passed off as his own, despite that fact that I said it was crap!). I eventually gave up on that and switched to base-64, which led to occasional URL corruption. If you include hashes in URLs, keeping them in the default hex representation is quite wasteful, and can contribute to issues with line length in email. Having hashes in base-62 is a nice way of reducing their size.

There are a few posts on base-62 in PHP, notably this one and this one, but they make the assumption that you’re talking about a numeric value, and while a hash is a numeric value, it’s way too big for PHP to handle as an integer. Others take the multiprecision artithmetic route, which treats the input binary as a single very large, and calculates its representation in another base; that works, but it’s horribly slow.

Since then, the gmp and bc_math extensions were improved in PHP 5.3.2, and now they handle (usefully) up to base-62. So here’s a simple function for getting a hash in base-62:

function base62hash($source) {
    return gmp_strval(gmp_init(md5($source), 16), 62);
}

and for converting to and from base-16 hashes:

function hash16to62($hash) {
    return gmp_strval(gmp_init($hash, 16), 62);
}

function hash62to16($hash) {
    return gmp_strval(gmp_init($hash, 62), 16);
}

I could still use a proper base-62 encoder for longer arbitrary strings, but at least now it should be simpler to write something iterative now that these extensions have (ahem) their bases covered.

Update: I’ve written a sufficiently usable PHP base-62 encoder for arbitrary-length binary strings that’s not too slow. You can find it on github in this gist. Let me know if you find it useful

Incidentally I discovered that the gmp functions use [0-9a-f] up to base 16, but [0-9A-Za-z] (i.e. upper case first) from bases 17 to 62. This differs from most of the base-62 implementations I’ve found that tend to use lower case first.

This is all slightly academic now as the apache B-flag workaround works, so standard urlencoding works properly and I don’t need to use a different encoding any more, however, there were so many examples of slow encoders, I thought the world could do with a usable one.

Update Something else worth mentioning is that if you use the apache B flag, you most likely need to turn the AllowEncodedSlashes directive on too, as otherwise you’ll get mysterious 404s. I posted a bug report against the apache docs to make this clearer.

Update Apache used my rewrite of the B-flag docs, yay!

14 Replies to “PHP Base-62 encoding”

  1. I’ve actually implemented this functionality more generically in my own PHP library (PHP CryptLib).

    https://github.com/ircmaxell/PHP-CryptLib/blob/master/lib/CryptLib/Core/BaseConverter.php

    So to do base62 encode/decode:

    $characters = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    
    $encoded = BaseConverter::convertFromBinary($data, $characters);
    
    $decoded = BaseConverter::convertToBinary($encoded,$characters);
    

    It will accept any binary data (base 256) including straight text, and output the properly encoded string…

    Basically, the $characters list is just a list of the identifiers for a specific position. For hex, you’d use 0123456789abcdef, for binary you’d use 01. For octal, 012345678.

    It will handle arbitrary length data (I’ve tested it out to about 5000 characters for the input, and all 256 possible bases)…

  2. Apologies for my blog’s comment formatting being so awful!

    Thanks for the pointer, I gave it a try. Your approach is similar to my old one, and for short strings like hashes, it’s about 160x slower than using gmp for me, so I’d say it’s worth special casing your base converter to trap more efficient conversions when possible.

    I suspect the time it takes is exponential – while a 1k string took 0.4 sec, 2k took 1.6sec, 4k 6.5 etc. I tried it on a 32k string and gave up waiting for a result after 10 mins! For long strings, you can’t treat it as a single number, you need to step through it in small chunks. FWIW the internal base64encode function does 32k in 0.000005 sec 🙂

    Incidentally I noticed while trying it out that the gmp internal bases use [0-9A-Za-z] as their char pattern rather than [0-9a-zA-Z], which I guess makes sense as they are in ASCII order.

  3. Note this requires your PHP to be compiled against GMP 4.2.0 or above. If not you’re limited to base 36 with this trick. For example RHEL/Centos 5 PHP is compiled against an older version.

  4. Why not use base64-url encoding?

    Uses in – and _ insteaad of . and / respectively, and does away with = padding.

  5. You’re only the nth person to suggest that! There are no available base-64 encodings that are safe to use in all of URLs, domain names, pathnames and email addresses. Base-62 presents no opportunity for problems in any of those use cases (other than on a case-insensitive FS). This is why pretty much all of the url shorteners use base-62.

  6. True, and that’s mentioned on the PHP manual page. Doesn’t affect me as you won’t find me anywhere near either of those distros… 🙂

  7. The coding colors u used in this website are terrible! use a white background vor the code pieces, my brains blow when i try to read this.

  8. Yep, the page titles kept spilling out too. Picked another theme that’s a bit more readable. Nobody seems to be doing decent themes for s9y any more.

  9. It’s not the fastest thing in the world. I’m working on updating that now to make it at least a little more efficient. (so far, I’ve reduced the runtime by a factor of 4).

    Also, it is exponential (O(n^2)). So doubling n increases runtime by 4.

    There are 2 problems with using the gmp function you specify. First, it’s not an actual base62 *encode*. It’s a base conversion between 16 and 62. That’s a very different task. Encoding is taking an arbitrary string and converting it to a destination base. That’s what my function does. gmp is unable to do that at all.

    The second problem, is that gmp only supports bases between 2 and 62. I need to convert to other bases (namely 256) often. So using it is sadly off the table for my needs…

  10. Yes, your approach is almost identical to my original posting on experts exchange (you might like to use my digit generator function).
    I’d say this type of encoding uses base conversions. For example I would have no trouble writing a base-32 encoder using gmp – it’s possible to treat the input in chunks and get exactly the same result as if you had treated the whole thing as a single multiprecision number.
    However, you can’t do it sensibly with base 62 because it will never fit into an exact number of bits, so you have to pad and chunk for a practical encoder.

Comments are closed.