Java Platform Encoding

This came up at $WORK recently. We had a Java program that was given input through command line arguments. Unfortunately, it went wrong when passed non-ASCII characters encoded as UTF-8 (for example, U+00A9 COPYRIGHT SIGN [©]). Printing out the command line arguments from inside Java showed that we had double-encoded Unicode.
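
To make "double-encoded" concrete, here is a minimal sketch of how it can arise. It assumes the wrong decoding step used ISO-8859-1; that charset is only an illustration, and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class DoubleEncodingDemo {
    /** Decodes the UTF-8 bytes of s with the wrong charset (ISO-8859-1, an assumption). */
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9"; // U+00A9 COPYRIGHT SIGN
        String mangled = misdecode(copyright);
        // Correct UTF-8 encoding of the original character.
        System.out.println(hex(copyright.getBytes(StandardCharsets.UTF_8)));
        // UTF-8 encoding of the mis-decoded string: each original byte is encoded again.
        System.out.println(hex(mangled.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The first line printed is C2 A9 and the second is C3 82 C2 A9: the two bytes of the original character have each been treated as a character and encoded a second time.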

Initially, we just slapped -Dfile.encoding=UTF-8 on the command line. But that failed when the site that called this code went through an automatic restart. So we investigated the issue further.

We quickly found that the presence or absence of the LANG environment variable had a bearing on the matter.

NB: ShowSystemProperties.jar is very simple and just lists all system properties in sorted order.
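
The source of that jar isn't shown here, so the following is only a guess at what such a program might look like: a sketch that copies the system properties into a sorted map and prints them as name=value lines. The helper name sortedProperties is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

final class ShowSystemProperties {
    /** Returns all system properties, sorted by property name. */
    static TreeMap<String, String> sortedProperties() {
        TreeMap<String, String> sorted = new TreeMap<String, String>();
        for (String name : System.getProperties().stringPropertyNames()) {
            sorted.put(name, System.getProperty(name));
        }
        return sorted;
    }

    public static void main(String[] args) {
        // One name=value line per property, suitable for piping into grep.
        for (Map.Entry<String, String> entry : sortedProperties().entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
    }
}
```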

$ java -version
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Server VM (build 14.2-b01, mixed mode)
$ echo $LANG
en_GB.UTF-8
$ java -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=UTF-8
$ LANG= java -jar ShowSystemProperties.jar | grep encoding
file.encoding=ANSI_X3.4-1968
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=ANSI_X3.4-1968

So, file.encoding follows LANG, but there’s an internal property, sun.jnu.encoding, that follows it as well.

Next, see what happens when we add the explicit override.

$ LANG= java -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=ANSI_X3.4-1968

Hey! sun.jnu.encoding isn’t changing!

Now, as far as I can see, sun.jnu.encoding isn’t actually documented anywhere. So you have to go into the source code for Java (OpenJDK’s jdk6-b16 in this case) to figure out what’s up.

Let’s start in main(), which is in java.c. Actually, it’s JavaMain() that we’re really interested in. In there you can see:

int JNICALL
JavaMain(void * _args)
{
  …
  jobjectArray mainArgs;
 
  …
  /* Build argument array */
  mainArgs = NewPlatformStringArray(env, argv, argc);
  if (mainArgs == NULL) {
      ReportExceptionDescription(env);
      goto leave;
  }
  …
}

NewPlatformStringArray() is defined in java.c and calls NewPlatformString() repeatedly with each command line argument. In turn, that calls new String(byte[], encoding). It gets the encoding from getPlatformEncoding(). That essentially calls System.getProperty("sun.jnu.encoding").
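
The effect of that decoding step can be sketched in Java. This is not the launcher's code (that is C calling into the JVM through JNI); the class and method names below are hypothetical, and the sketch only shows how the same raw argument bytes turn into different strings depending on the encoding name passed in.

```java
import java.nio.charset.Charset;

final class PlatformStringSketch {
    /** Decodes raw argument bytes with the named charset, like new String(byte[], encoding). */
    static String newPlatformString(byte[] raw, String encoding) {
        return new String(raw, Charset.forName(encoding));
    }

    /** Lists the code points of s as U+XXXX values. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", s.codePointAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bytes a UTF-8 terminal passes for U+00A9 COPYRIGHT SIGN.
        byte[] raw = {(byte) 0xC2, (byte) 0xA9};
        System.out.println("UTF-8:    " + codePoints(newPlatformString(raw, "UTF-8")));
        // US-ASCII is the charset that the transcript reports as ANSI_X3.4-1968.
        System.out.println("US-ASCII: " + codePoints(newPlatformString(raw, "US-ASCII")));
    }
}
```

With UTF-8 the two bytes decode to the single character U+00A9; with US-ASCII neither byte is valid, so the copyright sign is lost before main() ever sees the argument.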

So where does that property get set? If you look in System.c, Java_java_lang_System_initProperties() calls:

    PUTPROP(props, "sun.jnu.encoding", sprops->sun_jnu_encoding);

sprops appears to get set in GetJavaProperties() in java_props_md.c. This interprets various environment variables, including the ones that control the locale. It appears to pull out everything after the period in the LANG environment variable and use that as sun_jnu_encoding.
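
The "everything after the period" step can be sketched as follows. This is an illustration in Java, not a translation of the C code in java_props_md.c, which does more than this. The method name, the fallback parameter, and the stripping of an @modifier suffix (as in de_DE.ISO-8859-15@euro) are assumptions added for the example.

```java
final class LangEncodingSketch {
    /**
     * Extracts the codeset part of a locale name such as "en_GB.UTF-8".
     * Returns fallback when lang is null or has no codeset part.
     */
    static String encodingFromLang(String lang, String fallback) {
        if (lang == null) {
            return fallback;
        }
        int dot = lang.indexOf('.');
        if (dot < 0) {
            return fallback;
        }
        String encoding = lang.substring(dot + 1);
        // Drop a trailing @modifier, if any (assumption: POSIX-style locale names).
        int at = encoding.indexOf('@');
        if (at >= 0) {
            encoding = encoding.substring(0, at);
        }
        return encoding.isEmpty() ? fallback : encoding;
    }

    public static void main(String[] args) {
        // The fallback mirrors the value seen in the transcript when LANG is empty.
        String fallback = "ANSI_X3.4-1968";
        System.out.println(encodingFromLang(System.getenv("LANG"), fallback));
    }
}
```

For LANG=en_GB.UTF-8 this yields UTF-8, and for an empty LANG it yields the fallback, matching the two sun.jnu.encoding values in the transcripts above.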

Phew. So we now know that there is a special property which gets used for interpreting “platform” strings like:

* Command line arguments
* Main class name
* Environment variables

And it can be overridden:

$ LANG= java -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=UTF-8
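
The transcript only shows the property value. To check that real arguments arrive intact, a small test program can print the code points of each argument. The following is a sketch with a made-up class name, not something from the original investigation.

```java
final class ShowArgs {
    /** Lists the code points of s as space-separated U+XXXX values. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding"));
        // Code points are printed as ASCII text, so the output encoding cannot confuse matters.
        for (String arg : args) {
            System.out.println(codePoints(arg));
        }
    }
}
```

Passing a copyright sign as the argument should print U+00A9 if the argument was decoded correctly; anything else (for example U+00C2 U+00A9) means it was mangled on the way in.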

4 Comments to Java Platform Encoding

  1. John Dell'Aera says:

    For example, I can create files with Greek characters from a terminal via touch and all the UTF-8 characters are displayed correctly.

  2. John Dell'Aera says:

    I have a java server app that creates UTF-8 file system names.
    Unfortunately, when I look at the file names, the non-ASCII characters are displayed as ‘?’. How do I get the system to display the appropriate UTF-8 characters?

    System specs:
    Linux CentOS 6.0 2.6.18.8-xenU #1 SMP Thu May 13 11:11:51 PDT 2010 x86_64 x86_64 x86_64 GNU/Linux

    Tomcat 6
    Java 1.6

    JAVA_OPTS=-Dsun.jnu.encoding=UTF-8
    CATALINA_OPTS=-Dfile.encoding=UTF-8

    locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

    I even execute the following at startup:
    System.setProperty("file.encoding", "UTF-8");
    System.setProperty("encoding", "UTF-8");
    System.setProperty("user.language", "en_US.UTF-8");
    System.setProperty("user.country", "en_US.UTF-8");
    System.setProperty("sun.jnu.encoding", "UTF8");

    And where I create the file:
    fullPathName = new String(fullPathName.getBytes("UTF-8"));
    InputStream is = file.getInputStream();
    input = new BufferedInputStream(is, STREAM_BUFFER_SIZE);
    output = new BufferedOutputStream(new FileOutputStream(fullPathName),
    STREAM_BUFFER_SIZE);

    // Read file from memory and write it to disk.
    int r;
    byte[] buf = new byte[STREAM_BUFFER_SIZE];
    while ((r = input.read(buf)) != -1) {
        output.write(buf, 0, r);
    }

    output.close();
    output = null;
    input.close();
    input = null;

  3. Jayesh Malondkar says:

    The above article was a life-saver for me. We were stuck on an issue where we were not able to set the encoding of the JVM in spite of overriding the LANG variable.
    Thus, overriding the sun.jnu.encoding variable was the last resort. However, as rightly pointed out in the above article, this variable is not well documented in the Java documentation.

    The fine research and detailed explanation given by Dominic Mitchell made my day.
    Thanks a lot once again and keep up the good work!!