Character Encodings Bite Again

A colleague gave me a nudge today. “This page doesn’t validate because of an encoding error”. It was fairly simple: the string “Jiménez” contained a single byte—Latin1. Ooops. It turned out that we were generating the page as ISO-8859-1 instead of UTF-8 (which is what the page had been declared as in the HTML).

So, which bit of Spring WebMVC sets the character encoding? A bit of poking around in the debugger didn’t pop up any obvious extension point. So we stuck this in our Controller.


This worked, but it’s pretty awful having to do this in every single controller. So, we poked around a bit more and found CharacterEncodingFilter. Installing this into web.xml made things work.


Whilst rummaging around in here, we noticed something interesting: the code is set up like a spring bean—it doesn’t read the init-params directly. There’s some crafty code in GenericFilterBean to get this to work. Check it out.

Anyway, that Filter ensured that we output UTF-8 correctly. The forceEncoding parameter ensured that it was set on the response as well as the request.

Incidentally, we figured out where the default value of ISO-8859-1 gets applied. Inside DispatcherServlet.render(), the LocaleResolver gets called, followed by ServletResponse.setLocale(). Tomcat uses the Locale to set the character encoding if it hasn’t been already. Which frankly is a pretty daft thing to do. Being british does not indicate my preference as to Latin-1 vs UTF-8.

Then, the next problem reared its head. The “Jiménez” text was actually a link to search for “Jiménez” in the author field. The URL itself was correctly encoded as q=Jim%C3%A9nez. But when we clicked on it, it didn’t find the original article.

Our search is implemented in Solr. So we immediately had a look at the Solr logs. That clearly had Unicode problems (which is why it wasn’t finding any results). The two bytes of UTF-8 were being interpreted as individual characters (i.e. something was interpreting the URI as ISO-8859-1). Bugger.

Working backwards, we looked at the access logs for Solr. After a brief diversion to enable the access logs for tomcat inside WTP inside Eclipse (oh, the pain of yak shaving), we found that the sender was passing doubly encoded UTF-8. Arrgh.

So we jumped all the way back to the beginning of the search, back in the Controller.

  String q = request.getParameter("q");

Looking at q in the debugger, that was also wrong. So at that point, the only thing that could have affected it would be tomcat itself. A quick google turned up the URIEncoding parameter of the HTTP connector. Setting that to UTF-8 in server.xml fixed our search problem by making getParameter return the correct string.

I have no idea why tomcat doesn’t just listen to the request.setContentType() that the CharacterEncodingFilter performs, but there you go.

So, the lessons are:

  1. Use CharacterEncodingFilter with Spring WebMVC to get the correct output encoding (and input encoding for POST requests).
  2. Always configure tomcat to use UTF-8 for interpreting URI query strings.
  3. Always include some test data with accents to ensure it goes through your system cleanly.