August 10, 2011March 23, 2017 by Marcus Bointon

PHP Base-62 encoding

There’s a really horrible bug (they won’t call it that, but I can’t think of any use case for the default broken behaviour!) in Apache’s mod_rewrite that means that urlencoded inputs in rewrites get unescaped in their transformation to output patterns. The underlying ‘bug’ remains unfixed even in 2.3, though a workaround in the form of the ‘B’ flag first appeared in Apache 2.2.7, but was broken until 2.2.12 (which wasn’t all that long ago). Put it like this: if you’re not using the B flag in your mod_rewrite rules, your site is probably only working due to blind luck.

With that in mind, several years ago I spent ages looking for a base-62 encoder/decoder for PHP to replace mod_rewrite’s broken urlencoding handling. Nobody seemed to have the slightest interest in writing one. Base-62 is interesting as it can be made safe for use in URLs, DNS, email addresses and pathnames, unlike any available encoding of base-64, as it only includes [0-9A-Za-z]. As a workaround for the above bug, I was interested in base-62 encoding URLs for embedding in redirects. At the time I wrote something using bc_math, but it was very slow (and weirdly got ripped off by some dickhead and passed off as his own, despite that fact that I said it was crap!). I eventually gave up on that and switched to base-64, which led to occasional URL corruption. If you include hashes in URLs, keeping them in the default hex representation is quite wasteful, and can contribute to issues with line length in email. Having hashes in base-62 is a nice way of reducing their size.

There are a few posts on base-62 in PHP, notably this one and this one, but they make the assumption that you’re talking about a numeric value, and while a hash is a numeric value, it’s way too big for PHP to handle as an integer. Others take the multiprecision artithmetic route, which treats the input binary as a single very large, and calculates its representation in another base; that works, but it’s horribly slow.

Since then, the gmp and bc_math extensions were improved in PHP 5.3.2, and now they handle (usefully) up to base-62. So here’s a simple function for getting a hash in base-62:

function base62hash($source) {
    return gmp_strval(gmp_init(md5($source), 16), 62);
}

and for converting to and from base-16 hashes:

function hash16to62($hash) {
    return gmp_strval(gmp_init($hash, 16), 62);
}

function hash62to16($hash) {
    return gmp_strval(gmp_init($hash, 62), 16);
}

I could still use a proper base-62 encoder for longer arbitrary strings, but at least now it should be simpler to write something iterative now that these extensions have (ahem) their bases covered.

Update: I’ve written a sufficiently usable PHP base-62 encoder for arbitrary-length binary strings that’s not too slow. You can find it on github in this gist. Let me know if you find it useful

Incidentally I discovered that the gmp functions use [0-9a-f] up to base 16, but [0-9A-Za-z] (i.e. upper case first) from bases 17 to 62. This differs from most of the base-62 implementations I’ve found that tend to use lower case first.

This is all slightly academic now as the apache B-flag workaround works, so standard urlencoding works properly and I don’t need to use a different encoding any more, however, there were so many examples of slow encoders, I thought the world could do with a usable one.

Update Something else worth mentioning is that if you use the apache B flag, you most likely need to turn the AllowEncodedSlashes directive on too, as otherwise you’ll get mysterious 404s. I posted a bug report against the apache docs to make this clearer.

Update Apache used my rewrite of the B-flag docs, yay!

14 Replies to “PHP Base-62 encoding”

ircmaxell says:

August 10, 2011 at 21:26
I’ve actually implemented this functionality more generically in my own PHP library (PHP CryptLib).

https://github.com/ircmaxell/PHP-CryptLib/blob/master/lib/CryptLib/Core/BaseConverter.php

So to do base62 encode/decode:
```
$characters = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

$encoded = BaseConverter::convertFromBinary($data, $characters);

$decoded = BaseConverter::convertToBinary($encoded,$characters);
```
It will accept any binary data (base 256) including straight text, and output the properly encoded string…

Basically, the $characters list is just a list of the identifiers for a specific position. For hex, you’d use 0123456789abcdef, for binary you’d use 01. For octal, 012345678.

It will handle arbitrary length data (I’ve tested it out to about 5000 characters for the input, and all 256 possible bases)…
Tim says:

August 10, 2011 at 19:36

Interesting. I’ve been writing some similar code to get a random base 62 hash (shorturl style). See https://gist.github.com/1137905
Marcus Bointon says:

August 10, 2011 at 22:31

Apologies for my blog’s comment formatting being so awful!

Thanks for the pointer, I gave it a try. Your approach is similar to my old one, and for short strings like hashes, it’s about 160x slower than using gmp for me, so I’d say it’s worth special casing your base converter to trap more efficient conversions when possible.

I suspect the time it takes is exponential – while a 1k string took 0.4 sec, 2k took 1.6sec, 4k 6.5 etc. I tried it on a 32k string and gave up waiting for a result after 10 mins! For long strings, you can’t treat it as a single number, you need to step through it in small chunks. FWIW the internal base64encode function does 32k in 0.000005 sec :)

Incidentally I noticed while trying it out that the gmp internal bases use [0-9A-Za-z] as their char pattern rather than [0-9a-zA-Z], which I guess makes sense as they are in ASCII order.
John Carter says:

August 11, 2011 at 10:07

Note this requires your PHP to be compiled against GMP 4.2.0 or above. If not you’re limited to base 36 with this trick. For example RHEL/Centos 5 PHP is compiled against an older version.
Anonymous says:

August 11, 2011 at 10:26

Why not use base64-url encoding?

Uses in – and _ insteaad of . and / respectively, and does away with = padding.
Marcus Bointon says:

August 11, 2011 at 11:13

You’re only the nth person to suggest that! There are no available base-64 encodings that are safe to use in all of URLs, domain names, pathnames and email addresses. Base-62 presents no opportunity for problems in any of those use cases (other than on a case-insensitive FS). This is why pretty much all of the url shorteners use base-62.
Marcus Bointon says:

August 11, 2011 at 11:16

True, and that’s mentioned on the PHP manual page. Doesn’t affect me as you won’t find me anywhere near either of those distros… :)
www.scripting4you.com says:

August 11, 2011 at 16:48
M says:

August 11, 2011 at 16:48

The coding colors u used in this website are terrible! use a white background vor the code pieces, my brains blow when i try to read this.
www.scripting4you.com says:

August 11, 2011 at 16:49
www.php-boutique.de says:

August 11, 2011 at 17:12
Marcus Bointon says:

August 11, 2011 at 17:34

Yep, the page titles kept spilling out too. Picked another theme that’s a bit more readable. Nobody seems to be doing decent themes for s9y any more.
ircmaxell says:

August 11, 2011 at 17:35

It’s not the fastest thing in the world. I’m working on updating that now to make it at least a little more efficient. (so far, I’ve reduced the runtime by a factor of 4).

Also, it is exponential (O(n^2)). So doubling n increases runtime by 4.

There are 2 problems with using the gmp function you specify. First, it’s not an actual base62 *encode*. It’s a base conversion between 16 and 62. That’s a very different task. Encoding is taking an arbitrary string and converting it to a destination base. That’s what my function does. gmp is unable to do that at all.

The second problem, is that gmp only supports bases between 2 and 62. I need to convert to other bases (namely 256) often. So using it is sadly off the table for my needs…
Marcus Bointon says:

August 11, 2011 at 17:50

Yes, your approach is almost identical to my original posting on experts exchange (you might like to use my digit generator function).
I’d say this type of encoding uses base conversions. For example I would have no trouble writing a base-32 encoder using gmp – it’s possible to treat the input in chunks and get exactly the same result as if you had treated the whole thing as a single multiprecision number.
However, you can’t do it sensibly with base 62 because it will never fit into an exact number of bits, so you have to pad and chunk for a practical encoder.

Comments are closed.

Share this:

Related

14 Replies to “PHP Base-62 encoding”