Tag: unicode

 

Go Strings

I’ve been looking at Go recently. It’s a pleasant language, with few surprises. However, I wondered (as always) what the encoding of a string is supposed to be. For example:

  • Python 2 has two types: str and unicode. Python 3 has sensibly renamed these to bytes and str, respectively.
  • Perl has a magic bit which gets set to state that the string contains characters as opposed to bytes (it’s called the UTF-8 bit, but it means characters).

So how does Go deal with characters in strings? Given that the authors of Go also invented UTF-8, we can hope it’s been thought about.

There are three types to think about.

[]byte

A slice of bytes.

string

A (possibly empty) sequence of bytes. Strings are immutable.

rune

A single Unicode code point. Produced by rune literals: characters in single quotes.

There’s no explicit encoding in the above. Nonetheless, there’s an implicit preference for UTF-8: Go source code is defined to be UTF-8, so string literals end up holding UTF-8 bytes.

But this doesn’t help the common case:

package main

import "fmt"

func main() {
  s := "café"
  fmt.Printf("%q has length %d\n", s, len(s))
}

// "café" has length 5

The unicode/utf8 package can do what’s needed though. This provides functions for, amongst other things, picking runes out of strings.

package main

import (
  "fmt"
  "unicode/utf8"
)

func main() {
  s := "café"
  fmt.Printf("%q has length %d\n", s, utf8.RuneCountInString(s))
}

// "café" has length 4

This is very Go-like. The default is somewhat low-level, but the types and libraries build on top of it. For example, text/scanner provides a nice way of iterating over runes in a UTF-8 input stream.

On a whim, I took a look at the internals of utf8.RuneCountInString(). It’s deceptively simple.

func RuneCountInString(s string) (n int) {
  for _ = range s {
    n++
  }
  return
}

This relies on the spec defining how a string interacts with a for loop: ranging over a string is defined as iterating over its Unicode code points (runes), decoding the UTF-8 as it goes.
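
The mechanism is easy to see: ranging over a string yields the byte offset along with the decoded rune, so the index jumps by the rune’s encoded length.

```go
package main

import "fmt"

func main() {
	// Note the byte offsets: "é" occupies two bytes,
	// so there is no index 4 even though len(s) is 5.
	for i, r := range "café" {
		fmt.Printf("%d: %c\n", i, r)
	}
}
```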

The joy of apple keyboards

Recently, I’ve been using a Linux desktop for the first time in ages. It’s Ubuntu (Hardy Heron), and it looks nice. But after using a mac for three years, I’m really missing quite a few little things.

  1. The ability to drag and drop anything anywhere.
  2. Being able to type a wide range of Unicode characters easily.

On a mac, it’s really, really easy to type in a wide variety of useful characters. All you need is alt (⌥), sometimes known as “option”.

Keys    Character  Name
⌥ ;     …          HORIZONTAL ELLIPSIS
⌥ -     –          EN DASH
⌥ ⇧ -   —          EM DASH
⌥ [     “          LEFT DOUBLE QUOTATION MARK
⌥ ⇧ [   ”          RIGHT DOUBLE QUOTATION MARK
⌥ 2     ™          TRADE MARK SIGN
⌥ 8     •          BULLET
⌥ e, e  é          LATIN SMALL LETTER E WITH ACUTE

How did I find all this out? The lovely keyboard viewer that comes with OS X. You can get the flag in your menu bar by going to International in system preferences and checking “Show input menu in menu bar.”

Selecting the keyboard viewer in the input menu
OS X Keyboard Viewer (normal state)

Now, hold down alt and see what you can get (try alt and shift too).

OS X Keyboard Viewer (alt)

But not everything is attached to a key. In case you need more characters, there’s always the character palette. Usually on the ⌥ ⌘ T key as well as in the Edit menu. Here, you can get access to the vast repertoire of characters in Unicode. Need an arrow?

Arrows in the Character Palette

There’s a lot you can do with the character palette, but the search box is probably the best way in. Just tap in a bit of the name of the character you’re looking for and see what turns up.

This easy access to a wide array of characters is something I’ve rather come to take for granted in OS X. So coming back to the Linux desktop, it was odd to find that I couldn’t as readily type them in. Of course, I haven’t invested the time in figuring out how to set up XKB correctly. Doubtless I could achieve many of the same things. But my past experiences of XKB and its documentation have shown me how complicated it can be, so I don’t rate my ability to pull it off.

The end result is that I’m spending most of my time on the (mac) laptop and ignoring the desktop. I do like my characters. 🙂

Java Platform Encoding

This came up at $WORK recently. We had a java program that was given input through command line arguments. Unfortunately, it went wrong when being passed UTF-8 characters (U+00A9 COPYRIGHT SIGN [©]). Printing out the command line arguments from inside Java showed that we had double encoded Unicode.

Initially, we just slapped -Dfile.encoding=UTF-8 on the command line. But that failed when the site that called this code went through an automatic restart. So we investigated the issue further.

We quickly found that the presence or absence of the LANG environment variable had a bearing on the matter.

NB: ShowSystemProperties.jar is very simple and just lists all system properties in sorted order.

$ java -version
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Server VM (build 14.2-b01, mixed mode)
$ echo $LANG
en_GB.UTF-8
$ java -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=UTF-8
$ LANG= java -jar ShowSystemProperties.jar | grep encoding
file.encoding=ANSI_X3.4-1968
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=ANSI_X3.4-1968

So, setting file.encoding works, but there’s an internal property, sun.jnu.encoding as well.

Next, see what happens when we add the explicit override.

$ LANG= java -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=ANSI_X3.4-1968

Hey! sun.jnu.encoding isn’t changing!

Now, as far as I can see, sun.jnu.encoding isn’t actually documented anywhere. So you have to go into the source code for Java (openjdk’s jdk6-b16 in this case) to figure out what’s up.

Let’s start in main(), which is in java.c. Actually, it’s JavaMain() that we’re really interested in. In there you can see:

int JNICALL
JavaMain(void * _args)
{
  …
  jobjectArray mainArgs;
 
  …
  /* Build argument array */
  mainArgs = NewPlatformStringArray(env, argv, argc);
  if (mainArgs == NULL) {
      ReportExceptionDescription(env);
      goto leave;
  }
  …
}

NewPlatformStringArray() is defined in java.c and calls NewPlatformString() repeatedly with each command line argument. In turn, that calls new String(byte[], encoding). It gets the encoding from getPlatformEncoding(). That essentially calls System.getProperty("sun.jnu.encoding").

So where does that property get set? If you look in System.c, Java_java_lang_System_initProperties() calls:

    PUTPROP(props, "sun.jnu.encoding", sprops->sun_jnu_encoding);

sprops appears to get set in GetJavaProperties() in java_props_md.c. This interprets various environment variables, including those that control the locale. It appears to pull out everything after the period in the LANG environment variable as the encoding, in order to get sun_jnu_encoding.

Phew. So we now know that there is a special property which gets used for interpreting “platform” strings like:

  • Command line arguments
  • Main class name
  • Environment variables

And it can be overridden:

$ LANG= java -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=UTF-8

No escape() from JavaScript

A couple of days ago, we got caught out by a few encoding issues in a site at $WORK. The Perl related ones were fairly self explanatory and I’d seen before (e.g. not calling decode_utf8() on the query string parameters). But the JavaScript part was new to me.

The problem was that we were using JavaScript to create an URL, but this wasn’t encoding some characters correctly. After a bit of investigation, the problem comes down to the difference between escape() and encodeURIComponent().

Input  escape(…)  encodeURIComponent(…)
a&b    a%26b      a%26b
1+2    1+2        1%2B2
café   caf%E9     caf%C3%A9
Ādam   %u0100dam  %C4%80dam

The last is particularly troublesome, as no server I know of will support decoding that %u form.

The takeaway is that encodeURIComponent() always encodes as UTF-8 and doesn’t miss characters out. As far as I can see, this means you should simply never use escape(). Which is why I’ve asked Douglas Crockford to add it as a warning to JSLint.

Once we switched the site’s JavaScript from escape() to encodeURIComponent(), everything worked as expected.

Mixed Character Encodings

I’ve been given a MySQL dump file at work. It’s got problems — Windows-1252 and UTF-8 characters are mixed in. Bleargh. How can we clean it up to be all UTF-8? Perl to the rescue.

use Encode qw( encode decode );
 
# From http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl
my $utf8_char = qr{
    (?:
        [\x00-\x7f]
        |
        [\xc0-\xdf][\x80-\xbf]
        |
        [\xe0-\xef][\x80-\xbf]{2}
        |
        [\xf0-\xf7][\x80-\xbf]{3}
    )
}x;
 
while (<>) {
    s{($utf8_char)|(.)}{
        if ( defined $1 )    { $1 }
        elsif ( defined $2 ) { encode( "utf8", decode( "cp1252", $2 ) ) }
        else                 { "" }
    }ge;
    print $_;
}

Yes, that’s a regex for matching UTF-8 characters (courtesy of Markus Kuhn). I hadn’t considered using a regex when I first started down this road. I started examining bytes by hand. And the code was about three times longer.
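
The same trick translates to other languages. Here’s a Go sketch of the idea (simplified: like the Perl, it assumes anything that parses as UTF-8 is UTF-8, and for brevity it treats stray bytes as Latin-1, whereas true Windows-1252 also remaps 0x80–0x9F):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// fixMixed re-encodes a byte string mixing valid UTF-8 with stray
// single-byte characters into pure UTF-8.
func fixMixed(b []byte) string {
	var out []rune
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			// Not valid UTF-8: treat the byte as Latin-1.
			r = rune(b[0])
		}
		out = append(out, r)
		b = b[size:]
	}
	return string(out)
}

func main() {
	mixed := []byte("caf\xe9 and caf\xc3\xa9") // Latin-1 é, then UTF-8 é
	fmt.Printf("%q\n", fixMixed(mixed))
}
```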

Anyway, this seems to solve the issues I was having.

Character Encodings Bite Again

A colleague gave me a nudge today. “This page doesn’t validate because of an encoding error”. It was fairly simple: the string “Jiménez” contained a single Latin-1 byte for the é. Oops. It turned out that we were generating the page as ISO-8859-1 instead of UTF-8 (which is what the page had been declared as in the HTML).

So, which bit of Spring WebMVC sets the character encoding? A bit of poking around in the debugger didn’t pop up any obvious extension point. So we stuck this in our Controller.

  response.setCharacterEncoding("UTF-8");

This worked, but it’s pretty awful having to do this in every single controller. So, we poked around a bit more and found CharacterEncodingFilter. Installing this into web.xml made things work.

  <filter>
    <filter-name>CEF</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
      <param-name>encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
      <param-name>forceEncoding</param-name>
      <param-value>true</param-value>
    </init-param>
  </filter>
  <filter-mapping>
    <filter-name>CEF</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>

Whilst rummaging around in here, we noticed something interesting: the code is set up like a spring bean—it doesn’t read the init-params directly. There’s some crafty code in GenericFilterBean to get this to work. Check it out.

Anyway, that Filter ensured that we output UTF-8 correctly. The forceEncoding parameter ensured that it was set on the response as well as the request.

Incidentally, we figured out where the default value of ISO-8859-1 gets applied. Inside DispatcherServlet.render(), the LocaleResolver gets called, followed by ServletResponse.setLocale(). Tomcat uses the Locale to set the character encoding if it hasn’t been already. Which frankly is a pretty daft thing to do. Being British does not indicate my preference between Latin-1 and UTF-8.

Then, the next problem reared its head. The “Jiménez” text was actually a link to search for “Jiménez” in the author field. The URL itself was correctly encoded as q=Jim%C3%A9nez. But when we clicked on it, it didn’t find the original article.

Our search is implemented in Solr. So we immediately had a look at the Solr logs. That clearly had Unicode problems (which is why it wasn’t finding any results). The two bytes of UTF-8 were being interpreted as individual characters (i.e. something was interpreting the URI as ISO-8859-1). Bugger.

Working backwards, we looked at the access logs for Solr. After a brief diversion to enable the access logs for tomcat inside WTP inside Eclipse (oh, the pain of yak shaving), we found that the sender was passing doubly encoded UTF-8. Arrgh.

So we jumped all the way back to the beginning of the search, back in the Controller.

  String q = request.getParameter("q");

Looking at q in the debugger, that was also wrong. So at that point, the only thing that could have affected it would be tomcat itself. A quick google turned up the URIEncoding parameter of the HTTP connector. Setting that to UTF-8 in server.xml fixed our search problem by making getParameter return the correct string.

I have no idea why tomcat doesn’t just listen to the request.setCharacterEncoding() that the CharacterEncodingFilter performs, but there you go.

So, the lessons are:

  1. Use CharacterEncodingFilter with Spring WebMVC to get the correct output encoding (and input encoding for POST requests).
  2. Always configure tomcat to use UTF-8 for interpreting URI query strings.
  3. Always include some test data with accents to ensure it goes through your system cleanly.

Character Encodings

Q: When a program reads input, what is it reading?

A: Bytes.

i.e. not characters.

If you want characters, you have to convert from one to the other. Thanks to decades of ASCII and Latin-1, with their one-to-one byte-to-character mappings, most programmers have never even noticed that there’s a difference.

But there is. And as soon as somebody feeds your program something like UTF-8, your code is broken.

Now some environments are aware of the difference between bytes and characters. Like Perl and Java.

But there’s still a nasty breakage waiting for you in these environments. It’s called the “default character encoding”. And it’s bitten me several times in the last few weeks.

Picking on Java for a moment, take a look at InputStreamReader. It has four constructors, the first of which doesn’t take an encoding. So when you use that you have (effectively) no idea of what encoding you’re reading. It could be Windows-1252 if you’re on a PC. It could be MacRoman on OS X (seriously). On Linux, it’s probably UTF-8. But you’re at the mercy of not only changes in the OS, but also “helpful” administrators, environment variables, and system properties. Really, anything could change it.

Which is why when I see somebody saying new InputStreamReader(someInputStream) in a 3rd party library, I scream. Loudly. And often. Because they’ve suddenly decided that everybody else knows what encoding my input should be in, except me, the person writing the program. Needless to say, this is rather difficult to cope with.

The lesson is:

If you ever do any I/O or bytes ↔ characters conversion without explicitly specifying the character set, you will be fucked.

Don’t do it kids, just use UTF-8.

Mongrel's Default Charset

I suddenly noticed that my last entry had Unicode problems. How embarrassing. It turns out that mongrel doesn’t set a default charset, so the usual caveats apply. Looking through the mongrel docs, you can do something with the -m option, but it still seems difficult to apply a default universally.

Thankfully, I’m proxying to mongrel via Apache. So correcting the situation turned out to be as simple as adding this to my VirtualHost config.

  AddDefaultCharset UTF-8

I was actually not sure that this would work, because Apache is proxying rather than serving files directly. But it does work. I suspect that it may not work under Apache 1.3, but that would need to be confirmed.

But now the error is corrected and I’m Unicode happy once more. Hurrah!

Locales That Work

As I mentioned before, I don’t like locales. But of course, the solution is blindingly obvious and had passed me by. Unicode Support on FreeBSD points out the correct solution, which avoids breaking ls.

  % export LANG=en_GB.UTF-8 LC_COLLATE=POSIX

Marvellous. Now things can autodetect that I’d like UTF-8, please.

Unicode in Rails

Unicode in Rails takes a step further today, as ActiveSupport::MultiByte is committed to the edge (r5223). More information is available over at fingertips, including a neat demo video. This should really help people who need proper Unicode support. There’s no excuse to not use UTF-8 now!