Character Encodings

Q: When a program reads input, what is it reading?

A: Bytes.

i.e. not characters.

If you want characters, you have to convert the bytes into them. Thanks to decades of ASCII and Latin-1, with their one-to-one byte-to-character mappings, most programmers have never even noticed that there’s a difference.

But there is. And as soon as somebody feeds your program something like UTF-8, your code is broken.
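To make the difference concrete, here is a minimal sketch that decodes the same two bytes twice. The class name BytesVsChars and its two helper methods are made up for illustration; the only library calls are the String constructor that takes a Charset and the constants in StandardCharsets. It prints lengths rather than the strings themselves, because what a terminal shows for non-ASCII output is itself at the mercy of an encoding.

```java
import java.nio.charset.StandardCharsets;

public class BytesVsChars {
    // Hypothetical helper: interpret the bytes as UTF-8.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: interpret the very same bytes as Latin-1.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        System.out.println(asUtf8(bytes).length());   // 1: a single character
        System.out.println(asLatin1(bytes).length()); // 2: one character per byte
    }
}
```

Same bytes, different characters. Code that assumes one byte per character gets the second answer when it was fed the first kind of input.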

Now some environments are aware of the difference between bytes and characters. Like Perl and Java.

But there’s still a nasty breakage waiting for you in these environments. It’s called the “default character encoding”. And it’s bitten me several times in the last few weeks.

Picking on Java for a moment, take a look at InputStreamReader. It has four constructors, the first of which doesn’t take an encoding. So when you use that one you have (effectively) no idea what encoding you’re reading. It could be Windows-1252 if you’re on a PC. It could be MacRoman on OS X (seriously). On Linux, it’s probably UTF-8. But you’re at the mercy of not only changes in the OS, but also “helpful” administrators, environment variables and system properties. Really, anything could change it.

Which is why when I see somebody saying new InputStreamReader(someInputStream) in a third-party library, I scream. Loudly. And often. Because they’ve suddenly decided that everybody else knows what form my input should be, except me, the person writing the program. Needless to say, this is rather difficult to cope with.
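For contrast, here is a rough sketch of the constructor that does take an encoding. The class ExplicitReader and the helper firstLine are hypothetical names, and UTF-8 is assumed purely for the example; a real library would want to let the caller pass the charset in.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitReader {
    // Hypothetical helper: read the first line of a stream, decoding the bytes
    // as UTF-8 no matter what the platform default happens to be.
    static String firstLine(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis on the i (U+00EF), written as an escape so the
        // example does not depend on the encoding of this source file.
        byte[] bytes = "na\u00efve\n".getBytes(StandardCharsets.UTF_8);
        String line = firstLine(new ByteArrayInputStream(bytes));
        System.out.println(line.length()); // 5 characters, decoded from 6 bytes
    }
}
```

The only difference from the one-argument constructor is the second parameter, and with it the result no longer depends on which machine the code runs on.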

The lesson is:

If you ever do any I/O or bytes ↔ characters conversion without explicitly specifying the character set, you will be fucked.

Don’t do it, kids. Just use UTF-8.
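The same rule applies to in-memory conversions, not just streams. A small sketch, with ExplicitConversion, encode and decode as made-up names, showing both directions with the charset spelled out:

```java
import java.nio.charset.StandardCharsets;

public class ExplicitConversion {
    // Characters -> bytes, with the charset named explicitly.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes -> characters, with the same charset named explicitly.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u20ac5"; // euro sign (U+20AC) followed by the digit 5
        byte[] bytes = encode(original);

        System.out.println(bytes.length);                   // 4: three bytes for the euro sign, one for the digit
        System.out.println(decode(bytes).equals(original)); // true: the round trip is lossless
    }
}
```

The no-argument String.getBytes() and the new String(bytes) constructor fall back to the default character encoding, so they carry the same risk as the one-argument InputStreamReader.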


