Domain name validation

I was revisiting the validation of domain names and realised that most of the regexes posted around the web have faults.

Many refer to Sean Inman’s 2006 post, which does a fair job but is prone to break as new TLDs are introduced. This answer on StackOverflow is about the best I’ve found so far: it enforces label and overall lengths; allowing multiple dashes means it works with punycoded domains; and it’s generally permissive, so it won’t break as TLDs change. There is one case it doesn’t handle, though. RFC 2181 says that the DNS itself restricts only the length of labels, not their content, so labels that are not used as hostnames (i.e. which do not map to an IP, for example in TXT or SRV records) may contain any printable ASCII character, and `_,;:'"!@~$` and friends are all up for inclusion. This is most commonly found in DomainKeys, which use the `_domainkey` label. There’s a good article on the use of underscores in DNS.
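To make the hostname rules concrete, here is a minimal Python sketch of the kind of check described above: labels of 1 to 63 characters, an overall limit of 253 characters, letters, digits and hyphens only, and no hyphen at either end of a label. It is not the StackOverflow regex itself, and the names `HOSTNAME_LABEL` and `is_hostname` are my own. It also shows the unhandled case: an underscore label is rejected outright.

```python
import re

# One hostname label: 1-63 characters drawn from letters, digits and
# hyphens, with no hyphen at either end. Consecutive hyphens are allowed,
# which keeps punycode labels such as "xn--bcher-kva" valid.
HOSTNAME_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_hostname(name: str) -> bool:
    """Return True if every label in `name` follows hostname rules."""
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing root dot
    if not name or len(name) > 253:
        return False
    # fullmatch() avoids the "$ matches before a trailing newline" trap.
    return all(HOSTNAME_LABEL.fullmatch(label) for label in name.split("."))


print(is_hostname("www.example.com"))         # True
print(is_hostname("xn--bcher-kva.example"))   # True
print(is_hostname("_domainkey.example.com"))  # False
```

The last result is the problem: the name is perfectly usable in a TXT record, but a hostname-only check has no way of knowing that.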

This does relate to the validation of email addresses (which often contain domains), and the best page on that subject is this one. However, you can’t simply extract the domain part from that, as domain names in general are a superset of what’s used in email.

It’s difficult to do this right because you can’t tell whether a label is a hostname or not, or where a hostname stops and a domain begins, and validity varies according to context: `_domainkey.example.com` is invalid in an A record, but valid in a TXT record. I can foresee a parameter that lets you specify the usage context to deal with this. It might also be better to process the name backwards, so that you have more context available as you encounter each label. For example, if you processed `www.example.com` as `com.example.www`, you would stand a better chance of knowing whether `www` is a hostname or a domain name.
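As a rough illustration of both ideas (and not a solution), here is a Python sketch with a usage-context parameter that walks the labels from the TLD inwards. Everything in it is an assumption of mine: the function names, the choice to treat only A and AAAA as hostname contexts, and the decision to allow any printable ASCII character except the dot and whitespace elsewhere.

```python
import re
import string

# Hostname label: letters, digits and hyphens, no hyphen at either end.
HOSTNAME_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

# Assumption: printable ASCII minus whitespace and the label separator.
PERMISSIVE_CHARS = set(string.printable) - set(string.whitespace) - {"."}

# Assumption: record types whose owner name must be a hostname.
HOSTNAME_RECORD_TYPES = {"A", "AAAA"}


def labels_from_root(name: str) -> list[str]:
    """Split a name into labels, TLD first: www.example.com -> com, example, www."""
    if name.endswith("."):
        name = name[:-1]
    return list(reversed(name.split(".")))


def is_valid_name(name: str, record_type: str) -> bool:
    """Validate `name` for the given usage context (a DNS record type)."""
    stripped = name[:-1] if name.endswith(".") else name
    if not stripped or len(stripped) > 253:
        return False
    strict = record_type.upper() in HOSTNAME_RECORD_TYPES
    # Walking from the TLD inwards is where smarter, position-aware
    # rules could hook in; this sketch applies the same rule to every label.
    for label in labels_from_root(stripped):
        if not 1 <= len(label) <= 63:
            return False
        if strict and not HOSTNAME_LABEL.fullmatch(label):
            return False
        if not strict and not set(label) <= PERMISSIVE_CHARS:
            return False
    return True


print(labels_from_root("www.example.com"))             # ['com', 'example', 'www']
print(is_valid_name("_domainkey.example.com", "A"))    # False
print(is_valid_name("_domainkey.example.com", "TXT"))  # True
```

This only moves the problem around, of course: the caller still has to know the context, and the backwards walk doesn’t yet use what it has already seen.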

I’m mainly thinking out loud here; I don’t have a solution as yet!