Character Encodings Bite Again
A colleague gave me a nudge today: “This page doesn’t validate because of an encoding error.” It was fairly simple: the “é” in the string “Jiménez” was a single byte, i.e. Latin-1. Oops. It turned out that we were generating the page as ISO-8859-1 instead of UTF-8 (which is what the page had been declared as in the HTML).
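To make the “single byte” remark concrete, here is a minimal sketch (not code from our application; the class and method names are made up for illustration) that encodes the same string both ways and then checks each result with a strict UTF-8 decoder, which is roughly what a validator does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original application.
class Latin1VersusUtf8 {

    // Returns true only if the bytes are well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is written as an escape so that the
        // encoding of this source file does not affect the result.
        String name = "Jim\u00e9nez";

        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1); // accent is the single byte 0xE9
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);        // accent is the two bytes 0xC3 0xA9

        System.out.println(latin1.length + " bytes as ISO-8859-1, valid UTF-8: " + isValidUtf8(latin1));
        System.out.println(utf8.length + " bytes as UTF-8, valid UTF-8: " + isValidUtf8(utf8));
    }
}
```

The Latin-1 form is one byte shorter, and that lone 0xE9 followed by an ASCII “n” is not a legal UTF-8 sequence, which is why a page declared as UTF-8 fails validation.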
This worked, but it’s pretty awful having to do this in every single controller. So, we poked around a bit more and found CharacterEncodingFilter. Installing this into web.xml made things work.
<filter>
  <filter-name>CEF</filter-name>
  <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
  <init-param>
    <param-name>forceEncoding</param-name>
    <param-value>true</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>CEF</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
Whilst rummaging around in here, we noticed something interesting: the filter is set up like a Spring bean, so it doesn’t read the init-params directly. There’s some crafty code in GenericFilterBean that maps each init-param onto a bean property of the filter to get this to work. Check it out.
Anyway, that Filter ensured that we output UTF-8 correctly. The forceEncoding parameter ensured that the encoding was set on the response as well as the request.
Incidentally, we figured out where the default value of ISO-8859-1 gets applied. Inside DispatcherServlet.render(), the LocaleResolver gets called, followed by ServletResponse.setLocale(). Tomcat uses the Locale to set the character encoding if it hasn’t been set already. Which frankly is a pretty daft thing to do. Being British does not indicate my preference as to Latin-1 vs UTF-8.
Then, the next problem reared its head. The “Jiménez” text was actually a link to search for “Jiménez” in the author field. The URL itself was correctly encoded as q=Jim%C3%A9nez. But when we clicked on it, it didn’t find the original article.
Our search is implemented in Solr. So we immediately had a look at the Solr logs. That clearly had Unicode problems (which is why it wasn’t finding any results). The two bytes of UTF-8 were being interpreted as individual characters (i.e. something was interpreting the URI as ISO-8859-1). Bugger.
Working backwards, we looked at the access logs for Solr. After a brief diversion to enable the access logs for Tomcat inside WTP inside Eclipse (oh, the pain of yak shaving), we found that the sender was passing doubly encoded UTF-8. Arrgh.
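For anyone who hasn’t met double encoding before, the effect can be reproduced with nothing but the JDK. This is only a sketch of the mechanism, not our actual code: the class and method names are invented, and it simply decodes the query value with one charset and then re-encodes it as UTF-8, the way a sender would when passing the string on.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the original application.
class DoubleEncodingDemo {

    // Decode a percent-encoded value using decodeCharset, then re-encode the
    // resulting string as UTF-8, as if forwarding it to another service.
    static String decodeThenReencode(String rawValue, String decodeCharset)
            throws UnsupportedEncodingException {
        String decoded = URLDecoder.decode(rawValue, decodeCharset);
        return URLEncoder.encode(decoded, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "Jim%C3%A9nez"; // the correctly encoded value from the link

        // Decoded as UTF-8, the value survives the trip unchanged.
        System.out.println(decodeThenReencode(raw, "UTF-8"));      // prints Jim%C3%A9nez

        // Decoded as ISO-8859-1, 0xC3 and 0xA9 each become a separate
        // character, and each of those is then encoded as two UTF-8 bytes.
        System.out.println(decodeThenReencode(raw, "ISO-8859-1")); // prints Jim%C3%83%C2%A9nez
    }
}
```

Four percent-escapes where there should be two is the signature to look for in an access log.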
So we jumped all the way back to the beginning of the search, back in the Controller.
String q = request.getParameter("q");
When we looked at q in the debugger, that was also wrong. So at that point, the only thing that could have affected it would be Tomcat itself. A quick google turned up the URIEncoding parameter of the HTTP connector. Setting that to UTF-8 in server.xml fixed our search problem by making getParameter return the correct string.
I have no idea why Tomcat doesn’t just listen to the request.setCharacterEncoding() that the CharacterEncodingFilter performs, but there you go.
So, the lessons are:
- Use CharacterEncodingFilter with Spring WebMVC to get the correct output encoding (and input encoding for POST requests).
- Always configure Tomcat to use UTF-8 for interpreting URI query strings.
- Always include some test data with accents to ensure it goes through your system cleanly.
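The last point is easy to demonstrate. The sketch below (the class and method names are hypothetical) encodes a value with one charset and decodes it with another, which is the kind of mismatch we had between the browser and Tomcat. Plain ASCII test data sails through, and only the accented name exposes the problem.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the original application.
class AccentedTestData {

    // True if text comes out unchanged after being percent-encoded with
    // senderCharset and decoded again with receiverCharset.
    static boolean survives(String text, String senderCharset, String receiverCharset)
            throws UnsupportedEncodingException {
        String onTheWire = URLEncoder.encode(text, senderCharset);
        return text.equals(URLDecoder.decode(onTheWire, receiverCharset));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String accented = "Jim\u00e9nez"; // escaped so the source file encoding does not matter

        // ASCII is identical in both charsets, so the mismatch goes unnoticed.
        System.out.println(survives("Smith", "UTF-8", "ISO-8859-1"));  // prints true

        // The accented name is mangled by the same mismatch.
        System.out.println(survives(accented, "UTF-8", "ISO-8859-1")); // prints false

        // With both sides agreeing on UTF-8 it is fine.
        System.out.println(survives(accented, "UTF-8", "UTF-8"));      // prints true
    }
}
```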