I was revisiting the validation of domain names and realised that most of the regexes posted around the web have faults.
Many refer to Sean Inman’s 2006 post, which does a fair job but is prone to break as new TLDs are introduced. This answer on StackOverflow is about the best I’ve found so far: it enforces label and overall lengths, it allows multiple dashes so it works with punycoded domains, and it’s generally permissive so it won’t break as TLDs change. There is one case it doesn’t handle, though. RFC2872 says that labels that are not used as hostnames (i.e. which do not map to an IP, for example in TXT or SRV records) may contain any printable ASCII character, so `_,;:'"!@£~$` and friends are all up for inclusion. This is most commonly found in domainkeys, which use the `_domainkey` label. There’s a good article on the use of underscores in DNS.
This does relate to the validation of email addresses (which often contain domains), and the best page on that subject is this one. However, you can’t simply extract the domain part from that, as domain names in general are a superset of what’s used in email.
It’s difficult to do this right because you can’t tell whether a label is a hostname or not, or where a hostname stops and a domain begins, and validity varies according to context: `_domainkey.example.com` is invalid in an A record, but valid in a TXT record. I can foresee a parameter to allow you to specify usage context to deal with this. It might be better to process the name backwards so that you have more context available as you encounter each label. For example, if you processed `www.example.com` as `com.example.www`, you would stand a better chance of knowing whether `www` is a hostname or a domain name.
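To make the context parameter idea a bit more concrete, here is a rough sketch in Python. The function name `is_valid_name` and the `hostname` flag are made up for illustration, and it assumes the simplest possible rule: in hostname context every label must be letters, digits and hyphens, while in any other context every label may be any printable ASCII except the dot. It makes no attempt to work out where a hostname stops and a domain begins, so it is a starting point rather than a solution.

```python
import re

# Hostname-style label: letters, digits and hyphens, no leading or trailing hyphen.
HOST_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
# Non-hostname label (assumption): any printable ASCII except space and the dot separator.
OTHER_LABEL = re.compile(r"[\x21-\x2d\x2f-\x7e]+")


def is_valid_name(name, hostname=True):
    """Check a domain name against one of two label rules.

    hostname=True  -> strict rule, e.g. for the owner name of an A record
    hostname=False -> permissive rule, e.g. for TXT or SRV owner names
    """
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing root dot
    if not name or len(name) > 253:
        return False  # overall length limit for the textual form
    pattern = HOST_LABEL if hostname else OTHER_LABEL
    for label in name.split("."):
        if not 1 <= len(label) <= 63:
            return False  # empty label (e.g. "a..b") or label too long
        if not pattern.fullmatch(label):
            return False
    return True


print(is_valid_name("_domainkey.example.com"))                  # strict context
print(is_valid_name("_domainkey.example.com", hostname=False))  # permissive context
```

With this rule the first call rejects `_domainkey.example.com` and the second accepts it, which matches the A record versus TXT record distinction described above.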
I’m mainly thinking out loud here; I don’t have a solution as yet!
Generally I don’t like to publicly point the finger in cases like this, but this time I will: Tesco Compare, it’s your turn.
I last used Tesco Compare a couple of years ago while looking for car insurance. About a year ago they began sending me promotional email, clearly realising they had a list they had not been using. I’ve moved to France and I’m really not interested in UK car insurance any more, so I wanted to unsubscribe.
The message they sent me shows they are handling my data correctly to some extent – they have my email address right, and they addressed me by name. It also contains an unsubscribe link. That’s the good bit over – it all goes downhill from there.
PHP Barcelona’s conference site just went live. I’m speaking on email in PHP at the conference, along with PHPLondon regulars Zoë Slattery and Scott McVicar. It all happens on Saturday September 27th. Tell your friends!
Here’s a little rant I’ve been meaning to get out for a while.
The whole point of the multipart/alternative data type is progressive enhancement. A client is free to select from the alternatives presented and render as best it can, with an option for manual selection (that is, as long as you don’t use Outlook, which doesn’t believe in such things). This applies to the common text/plain > text/html combo as much as it would to text/plain > image/jpeg, or perhaps application/pdf > application/vnd.sun.xml.writer. Now, if they restricted their comments to text/html only, I might have some sympathy, as that’s just shoddy behaviour on the part of the sender. However, they usually prefer to throw out the baby with the bath water.
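For anyone who hasn’t looked inside one of these messages, here is a small sketch of the text/plain > text/html combo using Python’s standard `email` package. The addresses, subject and body text are placeholders, and the `get_body` calls only imitate the kind of choice a mail client might make; real clients apply their own rules.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"      # placeholder address
msg["To"] = "recipient@example.com"     # placeholder address
msg["Subject"] = "Hello"

# The simplest alternative goes first, richer versions follow.
msg.set_content("Hello in plain text.")
msg.add_alternative("<p>Hello in <b>HTML</b>.</p>", subtype="html")

print(msg.get_content_type())           # the container type
for part in msg.iter_parts():
    print(part.get_content_type())      # each alternative, in order

# A client that prefers HTML and one that only wants plain text
# can each pick the part that suits them from the same message.
rich = msg.get_body(preferencelist=("html", "plain"))
plain = msg.get_body(preferencelist=("plain",))
print(rich.get_content_type(), plain.get_content_type())
```

The sender supplies both versions once, and the plain text reader loses nothing by the HTML part being there.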
To conclude: MIME is a wonderful thing; some people use it badly; get over it.