PHP Base-62 encoding

There’s a really horrible bug (they won’t call it that, but I can’t think of any use case for the default broken behaviour!) in Apache’s mod_rewrite that means that urlencoded inputs in rewrites get unescaped in their transformation to output patterns. The underlying ‘bug’ remains unfixed even in 2.3, though a workaround in the form of the ‘B’ flag first appeared in Apache 2.2.7, but was broken until 2.2.12 (which wasn’t all that long ago). Put it like this: if you’re not using the B flag in your mod_rewrite rules, your site is probably only working due to blind luck.

With that in mind, several years ago I spent ages looking for a base-62 encoder/decoder for PHP to replace mod_rewrite’s broken urlencoding handling. Nobody seemed to have the slightest interest in writing one. Base-62 is interesting as it can be made safe for use in URLs, DNS, email addresses and pathnames, unlike any available encoding of base-64, as it only includes [0-9A-Za-z]. As a workaround for the above bug, I was interested in base-62 encoding URLs for embedding in redirects. At the time I wrote something using bc_math, but it was very slow (and weirdly got ripped off by some dickhead and passed off as his own, despite that fact that I said it was crap!). I eventually gave up on that and switched to base-64, which led to occasional URL corruption. If you include hashes in URLs, keeping them in the default hex representation is quite wasteful, and can contribute to issues with line length in email. Having hashes in base-62 is a nice way of reducing their size.

There are a few posts on base-62 in PHP, notably this one and this one, but they make the assumption that you’re talking about a numeric value, and while a hash is a numeric value, it’s way too big for PHP to handle as an integer. Others take the multiprecision artithmetic route, which treats the input binary as a single very large, and calculates its representation in another base; that works, but it’s horribly slow.

Since then, the gmp and bc_math extensions were improved in PHP 5.3.2, and now they handle (usefully) up to base-62. So here’s a simple function for getting a hash in base-62:

function base62hash($source) {
    return gmp_strval(gmp_init(md5($source), 16), 62);
}

and for converting to and from base-16 hashes:

function hash16to62($hash) {
    return gmp_strval(gmp_init($hash, 16), 62);
}

function hash62to16($hash) {
    return gmp_strval(gmp_init($hash, 62), 16);
}

I could still use a proper base-62 encoder for longer arbitrary strings, but at least now it should be simpler to write something iterative now that these extensions have (ahem) their bases covered.

Update: I’ve written a sufficiently usable PHP base-62 encoder for arbitrary-length binary strings that’s not too slow. You can find it on github in this gist. Let me know if you find it useful

Incidentally I discovered that the gmp functions use [0-9a-f] up to base 16, but [0-9A-Za-z] (i.e. upper case first) from bases 17 to 62. This differs from most of the base-62 implementations I’ve found that tend to use lower case first.

This is all slightly academic now as the apache B-flag workaround works, so standard urlencoding works properly and I don’t need to use a different encoding any more, however, there were so many examples of slow encoders, I thought the world could do with a usable one.

Update Something else worth mentioning is that if you use the apache B flag, you most likely need to turn the AllowEncodedSlashes directive on too, as otherwise you’ll get mysterious 404s. I posted a bug report against the apache docs to make this clearer.

Update Apache used my rewrite of the B-flag docs, yay!

Google Charts API Simple and Extended Encoders in PHP

Google’s charting API has been around for quite a while now, but I’ve only just needed to actually look at it. It became immediately obvious that I needed a PHP encoding function, so off to google I went. Though I found several implementations, they were all incomplete or deficient in one way or another (and it didn’t help that there was an error in google’s extended encoding docs), so I’ve written my own based on several different ones. Both simple and extended encoders support automatic scaling, inflated maximum and lower-bound truncation, so you can pretty much stuff whatever data you like in, with no particular regard for pre-scaling and you’ll get a usable result out. They have an identical interface, so you can use either encoding interchangeably according to the output resolution you need (contrary to popular belief, the encoding to use has very little to do with the range of values you need to graph). By default, the full range of possible values is used as it just seems silly not to. I deliberately omit the ‘s:’ and ‘e:’ prefixes so that you can call these functions for multiple data series, and I include a function that does just that. You still need to generate your own URLs and other formatting, but that’s a different problem. Read on for the code… Continue reading “Google Charts API Simple and Extended Encoders in PHP”

The web developer’s holy vhost trinity

When you’re developing web stuff, working with projects in path names (i.e. not at the top level of a domain) can be difficult (gets in the way of absolute links, rewrite rules etc), so you often need to set up a local apache virtual host, stick an entry in DNS and create an SSL certificate before you can get on with the serious business of doing some real work. This can get to be a drag when you do it a lot, but there is an extremely elegant solution that means you’ll never have to do it again…
Continue reading “The web developer’s holy vhost trinity”

Web Hooks, Callbacks and Distributed Observers

Someone at AMEE pointed out to me that there’s been a flurry of activity around so-called “Web Hooks” when I referred to the concept. This is quite heartening as I thought of this a couple of years ago and implemented this in Smartmessages early last year! I call them callbacks, but the idea is the same – it’s essentially a distributed observer pattern. I couldn’t figure out why nobody seemed to understand what I was on about… When I get some interesting event (e.g. a message open, mailshot completion, clickthrough etc), I ping a user-supplied URL with the appropriate event data, pretty much the one-liner that Jeff alludes to. The reason we do it is that sync with external systems (usually CRM) is something that were always running into, and there seems to be no sensible, generic way of dealing with it other than this, so I’m surprised it has not been discussed in this context before.
There’s one downside as far as I can see – it is highly dependent on the receiver to be able to handle the event in a timely fashion. This isn’t an issue if you’re connecting say, Yahoo! to Google, but it could be a big problem if you connect Google to your WordPress blog… My experience of CRM systems is that they are simply too slow to cope with the high rates of traffic that we are likely to generate, for example, if we point a stream of ~200 events per second at a CRM system, it will probably just bog down and fail (I’m thinking of the SalesForce API here which typically takes 1-2 sec to deal with a single SOAP API call). Retrying will only make this worse. I have two solutions for this: limit events to those that don’t happen so often (kind of lame!), or alternatively, use an outbound message queue to rate-limit the sending (Amazon SQS and Memcacheq spring to mind). Queueing works, but you lose some of the real-time aspect. Ideally clients would implement their own incoming queue in order to allow them to process events at their leisure, but this is mostly beyond the vast majority of web authors (or at least those that host the CRM systems that we hear from!).
Anyway, it’s nice to know that I’m not completely barking…