February 3, 2016

How to SURT IDN domains?

Converting URLs to SURT (Sort-friendly URI Reordering Transform) form has many benefits in web archiving and is widely used when configuring crawlers. Notably, the use of SURT prefixes can be used to easily apply rules to a logical segment of the web.

Thus the SURT prefix:
http://(is,
Will match all domains under the .is TLD.

A bit less known ability is to match against partial domain names. Thus the following SURT prefix:
http://(is,a
Would match all .is domains that begin with the letter a (note that there isn't a comma at the end).

This all works quite well, until you hit Internationalized Domain Names (IDNs). As the original infrastructure of the web does not really support non-ASCII characters, all IDNs are designed so that they can be translated into an ASCII equivalent.

Thus the IDN domain landsbókasafn.is is actually represented using the "punycode" representation xn--landsbkasafn-5hb.is.

When matching SURTs against full domain names (trailing comma), this doesn't really matter. But, when matching against a domain name prefix, you run into an issue. Considering the example above, should landsbókasafn.is match the SURT http://(is,l?

The current implementation (at least in Heritrix's much used SurtPrefixedDecideRule) is to evaluate only the punycode version (so no, but it would match http://(is,x).

This seems potentially limiting and likely to cause confusion.

No comments:

Post a Comment