Character Encodings

A few links have been going around today to an article about AJAX and Multibyte Character Support. The author describes how he fixed a problem by making sure he sent the correct HTTP headers for the character encoding he was using, after being thoroughly misled by Internet Explorer.

Now, the correct character encoding ended up being UTF-8. Of course, you actually have to turn your data into UTF-8 first, but it’s still the right answer.
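
As a rough illustration of "turn your data into UTF-8 first, then say so in the header", here is a minimal Python sketch. The function name `build_response` and the sample text are made up for this example; the article being discussed used its own server-side code, which isn't shown here.

```python
def build_response(text):
    """Encode text as UTF-8 and return (headers, body) that agree with each other."""
    body = text.encode("utf-8")  # the actual conversion step
    headers = {
        # The label must describe the bytes that really follow.
        "Content-Type": "text/plain; charset=utf-8",
        # Length is counted in bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_response("Price: £5")
print(headers["Content-Type"])  # text/plain; charset=utf-8
print(body)                     # b'Price: \xc2\xa35'
```

The point is only that the conversion and the label happen together; labelling un-converted bytes as UTF-8 is no better than sending no label at all.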

Unless you have really, really peculiar clients, you should always be sending out UTF-8 data. It’s a no-brainer.

But it highlights a deeper issue: you simply have to know what encoding your characters are in. It’s no good saying “ASCII” (so I can’t type in “£”, then?) or “Latin-1” (because invariably it’s really cp-1252 instead). UTF-8 is the safe choice because its repertoire is a superset of all the others (with the possible exception of some Far East characters that didn’t make it through the Han unification process).
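
A quick way to see the difference between these labels, sketched in Python using its built-in codecs (the variable names are just for illustration):

```python
pound = "£"

# ASCII has no code for the pound sign at all.
try:
    pound.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Latin-1 and cp-1252 agree on the pound sign...
same_pound = pound.encode("latin-1") == pound.encode("cp1252") == b"\xa3"

# ...but disagree on bytes 0x80-0x9F: cp-1252 puts printable characters
# there, while Latin-1 has invisible control codes.
byte = b"\x80"
as_cp1252 = byte.decode("cp1252")   # the euro sign
as_latin1 = byte.decode("latin-1")  # U+0080, a control character

# UTF-8 can represent both characters.
utf8_pound = pound.encode("utf-8")  # b'\xc2\xa3'
utf8_euro = "€".encode("utf-8")     # b'\xe2\x82\xac'

print(ascii_ok, same_pound, as_cp1252, repr(as_latin1))
```

So data that is labelled Latin-1 but was really typed on a Windows machine will look fine until somebody types a euro sign or a curly quote.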

If you don’t know what character encoding you have, then you don’t know how to interpret those bytes. Plain and simple. You need to convert your data to the right character encoding on the way in and send it out in the right encoding. Because UTF-8 is a superset, you can convert to another encoding on the way out if needed (transliterating any characters the target lacks), but it’s an expensive solution.
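
Here is a minimal sketch of that "decode on the way in, encode on the way out" discipline in Python. The helper names `decode_input` and `encode_output` are hypothetical, and replacing unrepresentable characters with "?" is just one possible policy, cruder than real transliteration.

```python
def decode_input(raw, declared_encoding):
    """Turn incoming bytes into text using the encoding the sender declared."""
    return raw.decode(declared_encoding)


def encode_output(text, target_encoding="utf-8"):
    """Turn text back into bytes; characters the target lacks become '?'."""
    return text.encode(target_encoding, errors="replace")


# The bytes for "İstanbul" as a client using ISO-8859-9 would send them.
incoming = b"\xddstanbul"

text = decode_input(incoming, "iso-8859-9")
wrong = decode_input(incoming, "latin-1")  # same bytes, wrong guess

print(text)                            # İstanbul
print(wrong)                           # Ýstanbul
print(encode_output(text))             # b'\xc4\xb0stanbul'
print(encode_output(text, "latin-1"))  # b'?stanbul'
```

The same eight bytes give two different strings depending on the label, which is the whole problem in miniature. Going out as UTF-8 loses nothing; going out as Latin-1 loses the first letter.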

But it seems that many, many web developers haven’t grasped the concept of a character set yet. It’s just another area where “things sometimes go wrong if you type in funny characters”. But it needn’t be rocket science.

Oh, and the comments about using escape() in JavaScript are way off the mark. That’s just punting the problem, guys.
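
To see why escaping only moves the problem, here is a small sketch using Python’s `urllib.parse` as a stand-in for JavaScript’s escape(). It is not the same function, but it shows the same issue: percent-escapes are just bytes in disguise, and both ends still have to agree on what encoding those bytes are in.

```python
from urllib.parse import quote, unquote

# The same character escapes differently depending on the encoding chosen.
as_utf8 = quote("£", encoding="utf-8")      # '%C2%A3'
as_latin1 = quote("£", encoding="latin-1")  # '%A3'

# Unescaping with the matching encoding works...
round_trip = unquote(as_latin1, encoding="latin-1")

# ...but a receiver that assumes UTF-8 gets a replacement character instead.
mismatch = unquote(as_latin1, encoding="utf-8")

print(as_utf8, as_latin1, round_trip, repr(mismatch))
```

The escaped string is plain ASCII, so it survives transport, but the receiver still has to know which encoding was escaped.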

Finally, it’s worth noting something else related that popped up on my radar today. Sending out the charset parameter with a media type of text/* is absolutely necessary, since it’s perfectly legitimate for a bit of proxy software to then translate the body to a different character set if it wants to. This is why you should never send out XML as text/*, because after such a translation the encoding in the XML declaration would be wrong…
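
A sketch of what goes wrong, in Python. The "proxy" here is purely hypothetical (just a decode followed by an encode), and `declared_encoding` is a simplified helper that only looks for a double-quoted encoding in the XML declaration and otherwise assumes UTF-8.

```python
import re


def declared_encoding(xml_bytes):
    """Read the encoding named in the XML declaration (simplified)."""
    match = re.match(rb'<\?xml[^>]*encoding="([^"]+)"', xml_bytes)
    return match.group(1).decode("ascii") if match else "utf-8"


def declaration_matches(xml_bytes):
    """Check whether the bytes decode cleanly in the encoding they declare."""
    try:
        xml_bytes.decode(declared_encoding(xml_bytes))
        return True
    except UnicodeDecodeError:
        return False


original = '<?xml version="1.0" encoding="UTF-8"?><price>£5</price>'.encode("utf-8")

# A hypothetical proxy that transcodes text/* bodies to ISO-8859-1.
# It changes the bytes but knows nothing about the declaration inside them.
transcoded = original.decode("utf-8").encode("iso-8859-1")

print(declaration_matches(original))    # True
print(declaration_matches(transcoded))  # False
```

After the transcoding, the document still claims to be UTF-8 but no longer is, which is exactly the mismatch described above.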

3 Comments to Character Encodings

  1. Taylan Pince says:

    Thanks for the nice wrap-up. You are right about the need to use UTF-8 whenever possible. In my case, I don’t have the luxury of using it due to the requirements of my client.

    As for the JavaScript escape suggestion, I have been researching it but I don’t think it is a real solution.

  2. Taylan Pince says:

    You are right about the use of UTF-8, it should be the encoding specified unless there is a special reason not to. In my case, it was my customer’s requirements: I could only use ISO-8859-9 in that specific case, and that’s why I had to find a solution for AJAX encoding.

    As for the JavaScript escape() suggestion, I have been researching it but I don’t think it is a real solution.

  3. Alastair says:

    > This is why you should never send out xml as text/*, because then the encoding in the xml declaration would be wrong…

    “would be” or “could be”?

    In either case it is my understanding (based on an old post of Mark Pilgrim’s, http://diveintomark.org/archives/2004/02/13/xml-media-types) that the charset parameter of the Content-Type header trumps the xml encoding declaration for the various application/xml variations.

    Meaning that there’s still a risk that the xml encoding declaration may be wrong.

    IIUC the difference with text/* is that the xml encoding declaration is ignored.

    This is still way harder than it needs to be.