Character Encodings Bite Again

A colleague gave me a nudge today: “This page doesn’t validate because of an encoding error.” It was fairly simple: the “é” in the string “Jiménez” had been written as a single byte, i.e. Latin-1, rather than the two bytes that UTF-8 needs. Ooops. It turned out that we were generating the page as ISO-8859-1 instead of UTF-8 (which is what the page had been declared as in the HTML).

So, which bit of Spring WebMVC sets the character encoding? A bit of poking around in the debugger didn’t pop up any obvious extension point. So we set the character encoding on the response by hand in our Controller.


This worked, but it’s pretty awful having to do this in every single controller. So, we poked around a bit more and found CharacterEncodingFilter. Installing this into web.xml made things work.


Whilst rummaging around in here, we noticed something interesting: the code is set up like a spring bean—it doesn’t read the init-params directly. There’s some crafty code in GenericFilterBean to get this to work. Check it out.

Anyway, that Filter ensured that we output UTF-8 correctly. The forceEncoding parameter ensured that it was set on the response as well as the request.

Incidentally, we figured out where the default value of ISO-8859-1 gets applied. Inside DispatcherServlet.render(), the LocaleResolver gets called, followed by ServletResponse.setLocale(). Tomcat uses the Locale to set the character encoding if it hasn’t been set already. Which frankly is a pretty daft thing to do. Being British does not indicate my preference as to Latin-1 vs UTF-8.

Then, the next problem reared its head. The “Jiménez” text was actually a link to search for “Jiménez” in the author field. The URL itself was correctly encoded as q=Jim%C3%A9nez. But when we clicked on it, it didn’t find the original article.

Our search is implemented in Solr. So we immediately had a look at the Solr logs. Those clearly showed Unicode problems (which is why it wasn’t finding any results). The two bytes of UTF-8 were being interpreted as individual characters (i.e. something was interpreting the URI as ISO-8859-1). Bugger.

Working backwards, we looked at the access logs for Solr. After a brief diversion to enable the access logs for tomcat inside WTP inside Eclipse (oh, the pain of yak shaving), we found that the sender was passing doubly encoded UTF-8. Arrgh.

So we jumped all the way back to the beginning of the search, back in the Controller.

  String q = request.getParameter("q");

Looking at q in the debugger, that was also wrong. So at that point, the only thing that could have affected it would be tomcat itself. A quick google turned up the URIEncoding parameter of the HTTP connector. Setting that to UTF-8 in server.xml fixed our search problem by making getParameter return the correct string.
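For the curious, here is a minimal sketch of what was going wrong, using only the JDK rather than Tomcat itself. The class and method names are made up for illustration. It decodes the same query value both ways, then shows that re-encoding the mangled string gives the doubly encoded form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Tomcat, Spring or Solr.
final class QueryDecodingDemo {

    // Percent-decode a query value using the named charset.
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Percent-encode a string using the named charset.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "Jim%C3%A9nez";
        String right = decode(raw, "UTF-8");      // 7 characters, with a real e-acute
        String wrong = decode(raw, "ISO-8859-1"); // 8 characters, each byte became a character
        System.out.println(right.length());          // prints 7
        System.out.println(wrong.length());          // prints 8
        // Passing the mangled string on as UTF-8 yields the double encoding.
        System.out.println(encode(wrong, "UTF-8"));  // prints Jim%C3%83%C2%A9nez
    }
}
```

The lengths are printed rather than the strings themselves, so that the console's own encoding can't muddy the result.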

I have no idea why tomcat doesn’t just listen to the request.setCharacterEncoding() that the CharacterEncodingFilter performs, but there you go.

So, the lessons are:

  1. Use CharacterEncodingFilter with Spring WebMVC to get the correct output encoding (and input encoding for POST requests).
  2. Always configure tomcat to use UTF-8 for interpreting URI query strings.
  3. Always include some test data with accents to ensure it goes through your system cleanly.

Java Concurrency in Practice

I’ve recently purchased a copy of Java Concurrency in Practice. I was always a little bit scared of threaded code, but we have a small amount at work, so I figured I’d better get my head around it. Wow. The big takeaway from the book was that I was nowhere near scared enough. Concurrency is hard.

But it’s not all bad news. Java 5 introduced java.util.concurrent—a set of abstractions for dealing with concurrency at a higher level. Chapters 5 and 6 are an overview of these facilities. There’s extensive coverage of the new data structures like ConcurrentHashMap as well as a very detailed exploration of what the new Executor and ExecutorService can do for you. That’s as well as looking at things like latches, queues and so on.

Chapter 7 was a detailed exploration of the issues surrounding Thread.interrupt() and ensuring that your threads can shut down properly. Java’s threading system is cooperative—you can’t stop a thread, you have to ask it to stop. This is harder to get right than it looks. Incidentally, much of the same discussion is on the web as Java theory and practice: Dealing with InterruptedException (by the same author, Brian Goetz).

These chapters were probably the most immediately useful parts of the book for me. I put it down and rewrote the threaded portions of my application using ConcurrentHashMap and ExecutorService, with interrupt() for cancellation. And I saw improvements. Where the threads had previously not shut down correctly 100% of the time, they now worked just fine. In general, I feel a lot happier using the higher level abstractions. Or maybe that should be “less uneasy about having cocked it up”.
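To give a flavour of the approach, here is a minimal sketch of cancellation via interruption with an ExecutorService. It is not the code from my application or from the book. The class and method names are invented, and the task does nothing but sleep in a loop.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class showing cooperative shutdown.
final class CancellableWorkDemo {

    // Starts a looping task, cancels it, and reports whether the pool terminated.
    static boolean shutsDownCleanly() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        final CountDownLatch started = new CountDownLatch(1);
        pool.execute(() -> {
            started.countDown();
            // Keep working until somebody asks us to stop.
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(10); // stand-in for real (blocking) work
                } catch (InterruptedException e) {
                    // sleep() clears the flag when it throws, so set it again.
                    Thread.currentThread().interrupt();
                }
            }
        });
        try {
            started.await();
            pool.shutdownNow(); // interrupts the running task
            return pool.awaitTermination(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shutsDownCleanly());
    }
}
```

The important detail is in the catch block: swallowing the InterruptedException there would leave the loop spinning forever.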

Stepping back a moment, the rest of the book is still very useful. The first part is a grounding in both concurrency in general and concurrency in Java. It’s all illustrated with very good code examples1. This really taught me about visibility as a threading issue, which I hadn’t previously been aware of at all (and it pointed me straight at a bug in my own code). For example:

  class Foo extends Thread {
    private boolean done = false;
    public synchronized void setDone(boolean done) {
      this.done = done;
    }
    public void run() {
      while (!done) {
        // …
      }
    }
  }

Spot the problem here? Because read access to done isn’t synchronized, an update from another thread might not be noticed. Ooops. (and yes, it’s better to implements Runnable rather than extends Thread [and yes, I should be using Thread.interrupt() rather than my own flag]).
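One possible fix, sketched under my own assumptions rather than lifted from the book, is to make the flag volatile so that the read in run() always sees the latest write. The class name and the stopsWithin() helper are invented for the example.

```java
// Hypothetical rewrite of the flag-based loop; the names are made up.
final class StoppableWorker implements Runnable {
    // volatile makes a write by one thread visible to later reads by other threads.
    private volatile boolean done = false;

    public void setDone(boolean done) {
        this.done = done;
    }

    @Override
    public void run() {
        while (!done) {
            Thread.yield(); // stand-in for real work
        }
    }

    // Starts a worker, asks it to stop, and reports whether it finished in time.
    static boolean stopsWithin(long millis) {
        StoppableWorker worker = new StoppableWorker();
        Thread thread = new Thread(worker);
        thread.start();
        worker.setDone(true);
        try {
            thread.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !thread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(stopsWithin(2000));
    }
}
```

Synchronizing the read as well as the write would also work. volatile is simply the lighter option when the flag is the only shared state.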

Later chapters continue to give excellent, clear advice. I’ve never done (or needed to do) Swing programming, but there appears to be a good discussion of the issues involved in ensuring that you only do GUI actions within the confines of a single thread.

One chapter which really blew me away was the testing chapter. I had absolutely no idea how to go about testing thread safety. The clues flew in abundance at me from the pages however. I confess I haven’t written any tests yet, but I at least have an idea of where to start now…

The later chapters are “advanced topics”. They seemed like cogent explanations, but as an absolute beginner, they weren’t too relevant just yet. I can see I’ll be coming back to the book soon enough, however.

Overall, I reckon this is one of those “must own” books, if you go anywhere near Java and threads.

1 Nothing is more off-putting than bad code examples. I mean, you’re meant to be an expert, right? Brian outlined how to avoid errors in code on his blog.


Character Encodings

Q: When a program reads input, what is it reading?

A: Bytes.

i.e. not characters.

If you want characters, you have to convert from one to the other. Thanks to decades of ASCII and Latin-1, with one-to-one byte to character mappings, most programmers have never even noticed that there’s a difference.

But there is. And as soon as somebody feeds your program something like UTF-8, your code is broken.

Now some environments are aware of the difference between bytes and characters. Like Perl and Java.

But there’s still a nasty breakage waiting for you in these environments. It’s called the “default character encoding”. And it’s bitten me several times in the last few weeks.

Picking on Java for a moment, take a look at InputStreamReader. It has four constructors, the first of which doesn’t take an encoding. So when you use that you have (effectively) no idea of what encoding you’re reading. It could be Windows-1252 if you’re on a PC. It could be MacRoman on OS X (seriously). On Linux, it’s probably UTF-8. But you’re at the mercy of not only changes in the OS, but also “helpful” administrators, environment variables, system properties. Really, anything could change it.

Which is why when I see somebody saying new InputStreamReader(someInputStream) in a 3rd party library, I scream. Loudly. And often. Because they’ve suddenly decided that everybody else knows what form my input should be, except me, the person writing the program. Needless to say, this is rather difficult to cope with.

The lesson is:

If you ever do any I/O or bytes ↔ characters conversion without explicitly specifying the character set, you will be fucked.

Don’t do it kids, just use UTF-8.
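In case it helps, here is a small sketch of doing it properly in Java. The class and helper names are made up. The point is only that the charset is passed in explicitly, so the platform default never gets a look in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
final class ExplicitCharsetDemo {

    // Reads the whole stream as text, decoding with the charset the caller names.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "Jiménez" as UTF-8 bytes: the e-acute takes two bytes, 0xC3 0xA9.
        byte[] utf8 = {0x4A, 0x69, 0x6D, (byte) 0xC3, (byte) 0xA9, 0x6E, 0x65, 0x7A};
        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);
        System.out.println(right.length()); // prints 7
        System.out.println(wrong.length()); // prints 8
    }
}
```

Same bytes, different characters, and the only thing that changed was the charset argument.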



I’ve just found a new feature in junit 4.4: assertThat()1. This gives you a much nicer way of specifying assertions than the usual assertEquals() or assertTrue(). Some examples from the docs above:

  assertThat(x, is(3));
  assertThat(x, is(not(4)));
  assertThat(responseString, either(containsString("color")).or(containsString("colour")));
  assertThat(myList, hasItem("3"));

This is both very readable, and leads to much improved diagnostic messages on failure.

It’s also (relatively) easy to extend to provide your own matchers (although check out hamcrest for some handy predefined ones). For example, I’ve just come up with this class to assert whether a collection is empty or not.

class EmptyMatcher extends TypeSafeMatcher<Collection<?>> {
    public boolean matchesSafely(Collection<?> c) {
        return c.isEmpty();
    }

    public void describeTo(Description desc) {
        // Text shown after "Expected:" when the assertion fails.
        desc.appendText("empty");
    }

    public static <T> Matcher<? super Collection<?>> empty() {
        return new EmptyMatcher();
    }
}

This is based on the example in the hamcrest tutorial. The only thing of real interest is the static factory method empty()2. This is what you would import static, so you can say:

  assertThat(setOfThings, empty());

This is more concise and leads to better error reporting than its predecessor:

  assertThat(setOfThings.size(), is(0));

One of the other nice things about using Matchers is that you can have side effects other than matching something. A great example is describedAs(). Say that you have:

  assertThat(a.getFrog(), is(not(nullValue())));

This would produce an error like:

Expected: is not null
Got: null

You can add in a description like this:

  assertThat(a.getFrog(), describedAs("frog", is(not(nullValue()))));

The error now becomes:

Expected: frog
Got: null

In the use I’ve put it to so far, this seems to be particularly appropriate for null values, which are otherwise quite unhelpful when a test fails.

I’ve completely switched my latest project to using assertThat(). So far, it seems to be leading to some nice readable code and I’m quite pleased with the results.

1 As an aside, “FAIL”: no javadocs, no news. The place is a mess.

2 I recommend checking out the chapter on generics in Effective Java 2nd Edition in order to properly understand the declaration here.


Postfix 2.5.1 TLS on FreeBSD

This is one of those things that I have to put up there in case anybody else has the same obscure setup that I do…

I run postfix on FreeBSD, using the ports system. This means I have a tendency to just use portupgrade to upgrade to the latest version of anything I happen to have installed. Normally, this works just fine. I usually check the output to see if any warnings about upgrading pop out and that’s about it. Slightly seat-of-the-pants, I know.

Anyway, I recently upgraded to postfix 2.5.1 and started seeing these messages in the logs.

Jul 26 21:29:44 gimli postfix/tlsmgr[7789]: fatal: tls_prng_exch_open: cannot open PRNG exchange file /var/lib/postfix/prng_exch: Permission denied

tlsmgr is the bit of postfix that supports SMTP over TLS: it looks after the TLS session cache and the random number pool.

The first port of call is to look through the postfix release notes. This seemed relevant.

[Incompat 20071206] The tlsmgr(8) and verify(8) servers no longer use root privileges when opening the address_verify_map, *_tls_session_cache_database, and tls_random_exchange_name cache files. This avoids a potential security loophole where the ownership of a file (or directory) does not match the trust level of the content of that file (or directory).

So, what’s the problem?

  % sudo -u postfix ls -l /var/lib/postfix
  ls: /var/lib/postfix: Permission denied
  % sudo -u postfix ls -l /var/lib
  total 0
  ls: lib: Permission denied
  % sudo -u postfix ls -ld /var/lib
  drwxr-x---  5 root  wheel  512 26 Jul 08:14 /var/lib

So, it’s basically a permissions problem. Postfix can’t see the directory it’s trying to use. Previously it wasn’t a problem, as postfix was doing things as root, and root sidesteps permissions checks.

What to do? The obvious fix is to change the permissions. But I don’t particularly like doing that on system directories, as they may well get reset in the future (e.g. nightly runs of mtree). So the simplest option is probably to reconfigure postfix to use a different directory. One that it actually has permission to access, like /var/db/postfix.

Annoyingly, when I looked at the port to understand this problem (PR#121236), it had been fixed in April. I wonder why I didn’t get the fix?

As it turns out, a reinstall of postfix (portupgrade -f postfix-2.5.1_2,1) completely fixes the problem, and the directory it uses is now /var/db/postfix by default. I wonder what caused it to go wrong in the first place though?



One more little library that I’ve come to love: jasypt. It’s a simplified veneer over the top of the gargantuan java security apparatus. All I wanted to do was encrypt a String before putting it in a Cookie.

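  BasicTextEncryptor encryptor = new BasicTextEncryptor();
  // A password must be set before use; "password" stands for your own secret.
  encryptor.setPassword(password);
  String cipherText = encryptor.encrypt(clearText);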

It nicely base64 encodes the result, which is ideal for Cookie stuffing.

The reverse operation is just as simple.

  BasicTextEncryptor encryptor = new BasicTextEncryptor();
  // Must be the same password that was used to encrypt.
  encryptor.setPassword(password);
  String recoveredText = encryptor.decrypt(cipherText);

Google Collections to the rescue

A few days ago, I was writing a piece of code that turned a line at a time into an Object. And it was using iterators. I had a RecordStream, which wrapped a LineStream (just a thin veneer over LineNumberReader).

Then I discovered that there was a terminating record at the end of each file. And it was in a completely different format to all the other lines. Bother.

Ok, I know, I’ll insert another iterator in the middle, which specifically ignores that record. Well, easier said than done as it turns out. I spent the best part of a day trying to create an Iterator which reads the next value and pretends that it’s not there. It turns out to have an awful lot of state.

Eventually I managed the task, and it worked. But boy, was it ugly. And it was long—about two pages of code.
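To show where the state comes from, here is a cut-down sketch of that kind of skipping iterator in plain Java. It is not my original two pages, and the class name and the drain() helper are invented. Even this small version has to read ahead and remember what it found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical hand-rolled iterator that hides one unwanted value.
final class SkippingIterator implements Iterator<String> {
    private final Iterator<String> source;
    private final String unwanted;
    private String lookahead;      // next element to hand out, if any
    private boolean hasLookahead;  // whether lookahead is valid

    SkippingIterator(Iterator<String> source, String unwanted) {
        this.source = source;
        this.unwanted = unwanted;
        advance();
    }

    // Reads ahead until a wanted element turns up or the source runs dry.
    private void advance() {
        hasLookahead = false;
        while (source.hasNext()) {
            String candidate = source.next();
            if (!unwanted.equals(candidate)) {
                lookahead = candidate;
                hasLookahead = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasLookahead;
    }

    @Override
    public String next() {
        if (!hasLookahead) {
            throw new NoSuchElementException();
        }
        String result = lookahead;
        advance();
        return result;
    }

    // Collects everything the iterator yields into a list.
    static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<String>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("a", "END", "b").iterator();
        System.out.println(drain(new SkippingIterator(lines, "END"))); // prints [a, b]
    }
}
```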

Then the light bulb went on. I remembered that Google Collections had some tools for dealing with Iterators. In particular, there’s a function filter(), which takes a Predicate. And look! The Predicates class contains some handy builtins!

After about 5 minutes’ work, my two pages of code boiled down to three lines of code.

    import static com.google.common.base.Predicates.*;

    private static final String END_RECORD = "END RECORD,END RECORD,END RECORD";

    public Iterator<T> iterator() {
        // Produce an iterator that returns one line at a time.
        Iterator<String> lines = new LineStream(reader).iterator();
        // A predicate to return all records which are not the end record.
        Predicate<String> notEndRecord = not(isEqualTo(END_RECORD));
        // Apply the predicate to the iterator.
        final Iterator<String> it = Iterators.filter(lines, notEndRecord);
        return new Iterator<T>() { … };
    }

Marvellous and powerful stuff. It’s seriously worth checking out in case you haven’t played with it before. My favourites are the static factory methods, e.g.

  // Before
  Map<String, String> myMap = new HashMap<String,String>();

  // After
  Map<String, String> myMap = Maps.newHashMap();

Isn’t it lovely how the compiler just figures it all out for you? Anything that can save space like that has to be a Good Thing™.
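The trick is nothing more exotic than a generic static method, with the compiler inferring the type parameters from the assignment. Here is a minimal sketch in plain Java. The class name is made up, and this is not the Google Collections source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a static factory class.
final class MyMaps {
    // K and V are inferred from the type of the variable being assigned to.
    static <K, V> Map<K, V> newHashMap() {
        return new HashMap<K, V>();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = MyMaps.newHashMap();
        counts.put("frog", 1);
        System.out.println(counts.get("frog")); // prints 1
    }
}
```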

There are a whole bunch of other useful things in there.

  • Preconditions.checkNotNull() is a compact way of validity checking your arguments.
  • Join.join()—I don’t know how many times I’ve written this by hand (usually badly). Much better to have somebody else do it for me.

Do yourself a favour and go check them out. You won’t regret it.


Busy Times

Blimey, it’s been a while since I posted. Well, it’s been busy times.

I’ve mostly been working on products at $WORK, mostly in Java. I’ve got a whole series of posts from my internal blog that need to be reposted here. Suffice to say I’ve been having fun.

I’ve been on holiday a bit. We went to Cornwall for my daughter’s first birthday (superb weather and if you’re near Looe, you must visit trawlers). We’ve been to Sweden for midsummer celebrations with more family (superb mygga…).

My grandmother visited us from France a few weeks ago. With the rest of that side of the family we attended the annual Battle of Britain memorial service in order to remember my Grandfather (Wing Commander Henry Maynard Mitchell).

It looks like I won’t be getting any less busy either. Small children like attention, so we’re finding that family visits are a very regular occurrence. But this is a welcome distraction.