Character Encodings

There have been a few links around to an article about AJAX and Multibyte Character Support today. He highlights how he fixed a problem by ensuring that he sent the correct HTTP headers for the character encoding he was using. After stabbing being thoroughly mislead by Internet Explorer.

Now the correct character-encoding ended up being UTF-8. But of course, you actually have to turn your data into UTF-8 first. But it’s still the right answer.

Unless you have really, really peculiar clients, you should always be sending out UTF-8 data. It’s a no-brainer.

But it highlights a deeper issue. You simply have to know what encoding your characters are in. It’s no good saying “ASCII” (I can’t type in “£” then?) or Latin-1 (because invariably it’s cp-1252 instead). UTF-8 being a superset of all the others (with the possible exception of some far-east characters that didn’t make it into the HAN Unification process).

If you don’t know what character encoding you have, then you don’t know how to interpret those bytes. Plain and simple. You need to convert your data to the right character encoding on the way in and send it out in right encoding. Because UTF-8 is a superset, you can transliterate if needed, but it’s an expensive solution.

But it seems that many, many web developers haven’t grasped the concept of a character set yet. It’s just another area when “things sometimes go wrong if you type in funny characters sometimes”. But it needn’t be rocket science.

Oh, and the comments about using escape() in JavaScript are way off the mark. That’s just punting the problem, guys.

Finally, it’s worth noting something else related that popped into my radar today. Sending out character sets with a media type of text/* is absolutely necessary, since it’s perfectly legitimate for a bit of proxy software to then translate to a different character set if it wants to. This is why you should never send out xml as text/*, because then the encoding in the xml declaration would be wrong…