One of the web projects Rob and I are working on at the moment has some internationalisation requirements that are pretty key to its success. The standard user-application interactions aren’t that problematic: there are some things to think about encoding- and storage-wise, but it’s a well understood area.
The tricky bit is that URLs are ASCII only. You can encode non-ASCII characters and handle things in the application to make it look and act like it’s coping with different character sets, but this only really works if you think of a URL as a pure reference that carries no information in itself. For web 2.0 type applications (and when using REST), this doesn’t really work, because the URL itself contains information. If you want a piece of information referenced by a URL like http://mysite/user/page1, making that URL make sense in languages that don’t use ASCII is hard.
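To make the "encode non-ASCII characters" part concrete, here is a minimal sketch in Python (my choice of language purely for illustration; the path is made up and not from our project). It percent-encodes the UTF-8 bytes of a path and decodes it again, which is roughly what an application would do behind the scenes.

```python
from urllib.parse import quote, unquote

# Hypothetical path containing one non-ASCII character (illustration only).
path = "/user/página1"

# quote() percent-encodes the UTF-8 bytes of characters outside its safe set;
# "/" is left untouched by default.
encoded = quote(path)

# unquote() reverses the encoding so the application can display the original.
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

The encoded form is valid in a URL, but it is no longer readable, which is exactly the problem when the URL is supposed to mean something to the person looking at it.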
Tim Bray recently linked to a blog entry by James Holderness at his site: http://www.詹姆斯.com. It makes an interesting read on the troubles of using & and < in RSS feed titles (such as AT&T).
More relevant to this discussion, however, is James’ site URL: http://www.詹姆斯.com or http://www.xn--8ws00zhy3a.com.
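The two forms of that host name are related by the Punycode-based encoding used for internationalised domain names. As a rough sketch, assuming Python and its built-in `idna` codec (which implements the older IDNA 2003 rules), the conversion in both directions looks something like this. The helper names `to_ascii` and `to_unicode` are my own, not part of any library.

```python
def to_ascii(host: str) -> str:
    """Convert a Unicode host name to its ASCII (xn--) form."""
    # The "idna" codec encodes each label separately and adds the xn-- prefix.
    return host.encode("idna").decode("ascii")


def to_unicode(host: str) -> str:
    """Convert an ASCII (xn--) host name back to Unicode."""
    return host.encode("ascii").decode("idna")


ascii_host = to_ascii("www.詹姆斯.com")
print(ascii_host)              # www.xn--8ws00zhy3a.com
print(to_unicode(ascii_host))  # www.詹姆斯.com
```

Only the host name is handled this way; the path part of a URL still needs percent-encoding.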
I wonder what WordPress will make of the characters in this post; hopefully it will do the right thing. If not, go see Tim’s original article.
Looks like we want to be looking at punycode (interesting name…). There’s a converter we can try out here.