Tag Archives: unicode

Go Strings

I’ve been looking at Go recently. It’s a pleasant language, with few surprises. However, I wondered (as always) what the encoding of a string is supposed to be. For example: Python 2 has two types: str, and unicode. Python 3 has sensibly renamed these to bytes and str, respectively. Perl has a magic bit which […]

The joy of Apple keyboards

Recently, I’ve been using a Linux desktop for the first time in ages. It’s Ubuntu (Hardy Heron), and it looks nice. But after using a Mac for three years, I’m really missing quite a few little things. The ability to drag and drop anything anywhere. Being able to type a wide range of Unicode characters […]

Java Platform Encoding

This came up at $WORK recently. We had a Java program that was given input through command line arguments. Unfortunately, it went wrong when passed UTF-8 characters such as U+00A9 COPYRIGHT SIGN (©). Printing out the command line arguments from inside Java showed that we had double-encoded Unicode. Initially, we just slapped -Dfile.encoding=UTF-8 on the […]
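To make "double-encoded" concrete, here is a minimal Python sketch of how it can arise. It assumes the UTF-8 bytes were misread as Latin-1 somewhere along the way; the post does not say which platform encoding was actually involved, so that choice and the variable names are purely illustrative.

```python
# U+00A9 COPYRIGHT SIGN is two bytes in UTF-8.
original = "\u00a9"
utf8_bytes = original.encode("utf-8")        # b'\xc2\xa9'

# Assumption: something decodes those bytes as Latin-1 instead of UTF-8,
# turning one character into two.
misread = utf8_bytes.decode("latin-1")       # two characters: U+00C2, U+00A9

# Writing the misread text back out as UTF-8 gives double-encoded bytes.
double_encoded = misread.encode("utf-8")     # b'\xc3\x82\xc2\xa9'

# The damage is reversible as long as every step was lossless.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(double_encoded)
print(repaired == original)
```

The tell-tale sign is a stray "Â" (or "Ã") appearing in front of the character you expected.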

No escape() from JavaScript

A couple of days ago, we got caught out by a few encoding issues in a site at $WORK. The Perl-related ones were fairly self-explanatory, and ones I’d seen before (e.g. not calling decode_utf8() on the query string parameters). But the JavaScript part was new to me. The problem was that we were using […]

Mixed Character Encodings

I’ve been given a MySQL dump file at work. It’s got problems: Windows-1252 and UTF-8 characters are mixed in. Bleargh. How can we clean it up to be all UTF-8? Perl to the rescue.

    use Encode qw( encode decode );

    # From http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl
    my $utf8_char = qr{ (?: [\x00-\x7f] | [\xc0-\xdf][\x80-\xbf] | [\xe0-\xef][\x80-\xbf]{2} […]
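For comparison, the same idea can be sketched in Python. This is not the Perl regex approach from the post but a swapped-in technique: decode as UTF-8 and, for any byte that is not valid UTF-8, fall back to Windows-1252 via a custom error handler. The handler name `cp1252_fallback` and the function `to_unicode` are hypothetical names chosen for this example.

```python
import codecs


def _cp1252_fallback(err):
    # Called by the UTF-8 decoder for each invalid byte sequence.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Assumption: anything that is not valid UTF-8 is Windows-1252.
    # Bytes undefined in Windows-1252 become U+FFFD.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def to_unicode(data: bytes) -> str:
    """Decode bytes that mix UTF-8 and Windows-1252 into one string."""
    return data.decode("utf-8", errors="cp1252_fallback")


mixed = b"Jim\xe9nez and Jim\xc3\xa9nez"   # Windows-1252 first, UTF-8 second
clean = to_unicode(mixed)
print(clean.encode("utf-8"))               # now UTF-8 throughout
```

One caveat with this kind of heuristic: a Windows-1252 byte pair that happens to form a valid UTF-8 sequence will be read as UTF-8, so the output is worth checking by eye.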

Character Encodings Bite Again

A colleague gave me a nudge today. “This page doesn’t validate because of an encoding error”. It was fairly simple: the string “Jiménez” contained a single-byte character, encoded as Latin-1. Oops. It turned out that we were generating the page as ISO-8859-1 instead of UTF-8 (which is what the page had been declared as in the HTML). So, […]

Character Encodings

Q: When a program reads input, what is it reading? A: Bytes. i.e. not characters. If you want characters, you have to convert from one to the other. Thanks to decades of ASCII and Latin-1, with one-to-one byte to character mappings, most programmers have never even noticed that there’s a difference. But there is. And […]
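A small Python sketch shows the difference between bytes and characters. The sample string is just an illustration; any text with a non-ASCII character would do.

```python
text = "Jim\u00e9nez"                  # 7 characters, one of them non-ASCII

as_utf8 = text.encode("utf-8")         # é becomes two bytes
as_latin1 = text.encode("latin-1")     # é becomes one byte

print(len(text), len(as_utf8), len(as_latin1))

# Going back from bytes to characters requires knowing the encoding.
print(as_utf8.decode("utf-8") == text)

# Guessing wrong fails outright: a lone 0xE9 is not valid UTF-8.
try:
    as_latin1.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
print(wrong_guess_failed)
```

With Latin-1 the byte count and character count happen to match, which is exactly why the distinction is so easy to overlook.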

Mongrel's Default Charset

I suddenly noticed that my last entry had Unicode problems. How embarrassing. It turns out that Mongrel doesn’t set a default charset, so the usual caveats apply. Looking through the Mongrel docs, you can do something with the -m option, but it still seems difficult to apply a default universally. Thankfully, I’m proxying to Mongrel […]

Locales That Work

As I mentioned before, I don’t like locales. But of course, the solution is blindingly obvious and had passed me by. Unicode Support on FreeBSD points out the correct solution, which avoids breaking ls.

    % export LANG=en_GB.UTF-8 LC_COLLATE=POSIX

Marvellous. Now things can autodetect that I’d like UTF-8, please.

Unicode in Rails

Unicode in Rails takes a step forward today, as ActiveSupport::MultiByte is committed to the edge (r5223). More information is available over at fingertips, including a neat demo video. This should really help people who need proper Unicode support. There’s no excuse not to use UTF-8 now!