Java Platform Encoding

This came up at $WORK recently. We had a java program that was given input through command line arguments. Unfortunately, it went wrong when being passed UTF-8 characters (U+00A9 COPYRIGHT SIGN [©]). Printing out the command line arguments from inside Java showed that we had double encoded Unicode.

Initially, we just slapped -Dfile.encoding=UTF-8 on the command line. But that failed when the site that called this code went through an automatic restart. So we investigated the issue further.

We quickly found that the presence of absence of the LANG environment variable had a bearing on the matter.

NB: ShowSystemProperties.jar is very simple and just lists all system properties in sorted order.

$ java -version
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Server VM (build 14.2-b01, mixed mode)
$ echo $LANG
$ java -jar ShowSystemProperties.jar | grep encoding
$ LANG= java -jar ShowSystemProperties.jar | grep encoding

So, setting file.encoding works, but there’s an internal property, sun.jnu.encoding as well.

Next, see what happens when we add the explicit override.

$ LANG= java -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding

Hey! sun.jnu.encoding isn’t changing!

Now, as far as I can see, sun.jnu.encoding isn’t actually documented anywhere. So you have to go into the source code for Java (openjdk’s jdk6-b16 in this case) to figure out what’s up.

Let’s start in main(), which is in java.c. Actually, it’s JavaMain() that we’re really interested in. In there you can see:

JavaMain(void * _args)
  jobjectArray mainArgs;

  /* Build argument array */
  mainArgs = NewPlatformStringArray(env, argv, argc);
  if (mainArgs == NULL) {
      goto leave;

NewPlatformStringArray() is defined in java.c and calls NewPlatformString() repeatedly with each command line argument. In turn, that calls new String(byte[], encoding). It gets the encoding from getPlatformEncoding(). That essentially calls System.getProperty("sun.jnu.encoding").

So where does that property get set? If you look in System.c, Java_java_lang_System_initProperties() calls:

    PUTPROP(props, "sun.jnu.encoding", sprops->sun_jnu_encoding);

sprops appears to get set in GetJavaProperties() in java_props_md.c. This interprets various environment variables including the one that control the locale. It appears to pull out everything after the period in the LANG environment variable as the encoding in order to get sun_jnu_encoding.

Phew. So we now know that there is a special property which gets used for interpreting “platform” strings like:

* Command line arguments
* Main class name
* Environment variables

And it can be overridden:

$ LANG= java -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding


Sometimes, it’s useful to see what subversion’s actually doing when talking to a server. There’s a neon-debug-mask option in ~/.subversion/servers. But what values can it take? They’re not documented in the subversion manual.

As always, the source is informative.

/* Debugging masks. */
#define NE_DBG_SOCKET (1<<0) /* raw socket */
#define NE_DBG_HTTP (1<<1) /* HTTP request/response handling */
#define NE_DBG_XML (1<<2) /* XML parser */
#define NE_DBG_HTTPAUTH (1<<3) /* HTTP authentication (hiding credentials) */
#define NE_DBG_HTTPPLAIN (1<<4) /* plaintext HTTP authentication */
#define NE_DBG_LOCKS (1<<5) /* WebDAV locking */
#define NE_DBG_XMLPARSE (1<<6) /* low-level XML parser */
#define NE_DBG_HTTPBODY (1<<7) /* HTTP response body blocks */
#define NE_DBG_SSL (1<<8) /* SSL/TLS */
#define NE_DBG_FLUSH (1<<30) /* always flush debugging */

Or if you’re not a C coder (or had to spent more than 3 seconds working it out like me):

raw socket
HTTP request/response handling
XML parser
HTTP authentication (hiding credentials)
plaintext HTTP authentication
WebDAV locking
low-level XML parser
HTTP response body blocks
always flush debugging

These are summed to enable the features you want. So, if I want requests, responses and bodies, that’s 2+8+128, so I need this in ~/.subversion/servers:

neon-debug-mask = 138

Of course, interpreting the resulting output is up to you, but if you’re having difficulties, it may give you some clue what’s up. I can immediately see the large “log” command that git svn rebase is using for instance.


No escape() from JavaScript

A couple of days ago, we got caught out by a few encoding issues in a site at $WORK. The Perl related ones were fairly self explanatory and I’d seen before (e.g. not calling decode_utf8() on the query string parameters). But the JavaScript part was new to me.

The problem was that we were using JavaScript to create an URL, but this wasn’t encoding some characters correctly. After a bit of investigation, the problem comes down to the difference between escape() and encodeURIComponent().

#escaper {
border: 1px dotted black;
text-align: center;
#escaper thead {
background: #eee;
#escaper thead th {
border-bottom: 1px solid black;
#escaper td, #escaper th {
padding: .25em;

input escape(…) encodeURIComponent(…)
a&b a%26b a%26b
1+2 1+2 1%2B2
café caf%E9 caf%C3%A9
Ādam %u0100dam %C4%80dam

The last is particularly troublesome, as no server I know of will support decoding that %u form.

The takeaway is that encodeURIComponent() always encodes as UTF-8 and doesn’t miss characters out. As far as I can see, this means you should simply never use escape(). Which is why I’ve asked Douglas Crockford to add it as a warning to JSLint.

Once we switched the site’s JavaScript from escape() to encodeURIComponent(), everything worked as expected.


Which whitespace was that again?

We recently saw this at $WORK. It appears corrupted in Internet Explorer only. Firefox and Safari show it normally.

Corrupted text in internet explorer

After much exploration in the debugger, we eventually found it was caused by using the innerText property in internet explorer. This has the mildly surprising property of turning multiple spaces into U+00A0 (NO-BREAK SPACE) characters (&nbsp; to you and me). This behaviour doesn’t appear to be documented. And before you ask, this was all being done by a third-party — I know to not use proprietary extensions where possible.

Anyway, I nailed it down to a small test. Given this markup.

Then this script demonstrates the problem.

var foo = document.getElementById('foo');

// OK
foo.innerText = ['A', ' ', 'B'].join('');
alert(foo.innerHTML);   // "A B"

foo.innerText = ['A', ' ', ' ', 'B'].join('');
alert(foo.innerHTML);   // "A  B"

If you want to try it yourself, check out (jsbin is awesome! Thanks, rem!)

The quick solution is simple: normalize whitespace before insertion before using innerText.

var text = 'some       where        over      the      rainbow';

foo.innerText = text.split(/s+/).join(' ');

Of course, you should really be using appendChild() and createTextNode().


BarCampBrighton #4

What a weekend — BCB4 has just been and gone. This was my first BarCamp, and it was a superb experience. Great talks; great people; great venue. Somewhat predictably I gave a talk on git. But I misjudged the audience and had a bad case of nerves.

Although I enjoyed pretty much everything I listened to, the one thing that keeps coming back to me is Seb’s Simple 3d in HTML5 Canvas talk. It reminded me that actually, the maths isn’t that hard, and the environment is really easy to work in. And the results are spectacular, even with simple code. I’ve been playing with canvas since in my down time since then.

I also learned about robots (need patience yet very cool); project lombok (sufficently encapsulated magic); bulletproof widgets (with scary CSS); aikido & software development (practice, practice, practice); how car engines work (controlled boom); systems engineering (know your context). But it all pales compared to just meeting people, finding out what they’ve been up to and generally bouncing ideas around.

Check out bcb4 on flickr to get a feel.

Of course none of this could have happened without super-human powers of organisation. Whilst there were others involved, Jay is a hero for pulling it all together (well, and Minna). Thanks, Jay! See you at BCB5!

Update: since at least one person asked for it, here are the slides for my git presentation.


Changing the committer

Quite often, I find myself using git for non-work related activity on my work laptop. Yeah, yeah, I know.

Normally, I remember to set my email to be my home address before starting work.

$ mymail='dom [at] happygiraffe (dot) net'
$ git config $mymail

Of course, you’d use your proper email address, instead of that obfuscated form.

Note that we don’t use --global. This change is specific to the repository that we’re working in.

Unfortunately, I usually just dive in and start working. About four or five commits down the line, I realise I’ve screwed up. What then?

git filter-branch to the rescue! We just need to change a couple of environment variables and redo each commit.

$ git filter-branch --env-filter "export GIT_AUTHOR_EMAIL=$mymail GIT_COMMITTER_EMAIL=$mymail" master
Rewrite 0c5299bf98bf30938bb1d0fc0211aa9f3a9ddcf8 (3/3)
Ref 'refs/heads/master' was rewritten

Like all uses of filter-branch, you should only do this on an unpublished repository, as it’s effectively altering history.

There is a reference to the original commits left behind, in case I screwed something up. When you’ve checked that everything looks OK, you can clean up.

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
$ git reflog expire --expire=now --all
$ git gc --prune=now
Counting objects: 9, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (9/9), done.
Total 9 (delta 1), reused 0 (delta 0)