Embedding Cocoa Frameworks

I’m (re)learning Cocoa at the moment. I’m working my way through the wonderful Cocoa Programming for Mac OS X. But for my first solo project, I found myself needing to XML-RPC. I took one look at Web Services Core Programming Guide and barfed. After a quick hunt, I found Eric Czarny’s lovely XMLRPC project for cocoa. This looked exactly like what I needed.

I had one small problem though — it’s a framework. I’ve not used one before. How do I actually get this code included in my project?

After looking at the Framework Programming Guide, there are a few options for embedding the framework into my project (embedding is the only sensible choice — I don’t want the user to have to install this on their system).

  1. Build and install the framework locally, then just drag it into my project.
  2. Build the framework, and check the results into my source control system.
  3. Embed the framework into my own project, so it gets built on its own each time.

The first option means that the build is no dependent on my workstation. The second option works, but I dislike binaries in version control (and how non-reproduceable these can be). The third option is ideal as it means the build is self contained. However, it’s also the most non-obvious to implement.

After quite a few misdirected searches, I eventually found Jonathan “wolf” Rentzsch’s marvellous tutorial Embedded Cocoa Frameworks. Unfortunately, it’s gone 404. has a copy, including the 8 minute screencast.

In case it goes away again, here’s what I did to integrate the XMLRPC framework. I’ll start with a brand-new Cocoa project, MyApp.

Brand new Xcode project

Next you need to git clone git:// in a nearby place (a sibling directory is ideal, though you could set up inside MyApp using git submodules).

Before we can get started, we need to open up the XMLRPC project and make a couple of changes. First, we need to ensure that the Configurations match the ones in my project. Otherwise, Xcode will use the default configuration, which may produce unexpected results. I did this by duplicating Development into Debug.

XMLRPC Project Configurations

Now, build the XMLRPC project (⌘-B). You should end up with …/xmlrpc/build/Debug/XMLRPC.framework. Right-click on MyAppFrameworksLinked Frameworks and choose Add…Existing Frameworks….
Navigate to the previously built framework and select it. In the sheet that pops up, ensure that you choose “Reference Type: Relative to Project,” as well as ensuring that MyApp is selected in “Add To Targets.”

Add XMLRPC framework to MyApp

If you’ve been fastidiously keeping an eye on things in git (as I like to do), you can see the changes in your xcode project file. As well as introducing a reference to the framework directory, you will also see that the FRAMEWORK_SEARCH_PATHS build setting has been set up for you:


Unfortunately, this the point at which I realised that what I was doing wouldn’t work. We’re linking against the Debug configuration of the framework. I need to embed a reference in my project that’s independent of which configuration the XMLRPC project was built with. In order to do that, we have to modify the XMLRPC project a bit more.

Go back to the XMLRPC project and right click on the XMLRPC target. Choose Add…New Build Phase…New Run Script Build Phase. Type in the code:

Add Run Script phase to XMLRPC project
set -x

The net effect is to copy the framework from …/xmlrpc/build/Debug to …/xmlrpc/build. This means that there’s a single location to pull in. You may wish to rename the Run Script phase to something like “Run Script: Copy to well-known location.”

In case you’re wondering, I figured out which environment variables to use by looking at the build output, as it has another Run Script phase which shows you what they all are.

Now, go back into MyApp, remove the reference to XMLRPC.framwork. Add back the version in the new location, which should be directly under build. You also need to clean up the FRAMEWORK_SEARCH_PATHS setting on the MyApp target for both Release and Debug configurations. Ooops. I wish I’d spotted my mistake earlier!

So now we’ve linked against the framework, and set up the header search path. To check the latter, add a new Objective-C class and add in:

#import <XMLRPC/XMLRPC.h>

A build of MyApp should succeed without error.

But there’s one more trick. We want to build the XMLRPC project as part of our own build. In order to do this, you need to drag XMLRPC.xcodeproj into MyApp.

Drag XMLRPC project file into MyApp

This should pop up the same sheet where you need to choose “Reference Type: Relative to Project.”

With that done, right-click on the MyApp target and choose Get Info. On the general tab, add XMLRPC as a direct dependency.

Add dependency on XMLRPC to MyApp

Now, when you build MyApp, you should see the XMLRPC project being built too. Even better, if you switch configurations, you should see the framework being built in the same configuration.

Build log from MyApp

In summary, the steps were:

  1. Modify the framework to support the same configurations as your project.
  2. Modify the framework to put the built artifact in a consistent location.
  3. Add the framework to your project (making sure to select “relative to project” and “add to target”).
  4. Add a reference to the framework’s project to your project.
  5. Add the target for the framework as a direct dependency of your target that needs it.

It may seem like a lot of effort, but it really only takes a few seconds. And now you’ve got the ability to pull in hundreds of wonderful open-source projects to make your app better!


What’s in a certificate?

The principle of public-key cryptography is fairly simple to get.

  • Alice has a public/private keypair.
  • Alice gives the public key to Bob.
  • Bob encrypts some data for Alice using the public key.
  • Bob sends the data to Alice, who can decrypt it using her private key.

There’s also some other stuff about multiplying large primes, and how difficult it is to reverse that. But how on earth does this relate to certificates in the real world? If you click on that padlock in your browser, how does that relate to the above scenario?

For this demonstration, let’s use OpenSSL, as it’s commonly available. I’m going to make a “self-signed” key pair. More on that in a moment.

% openssl req -out cert -newkey rsa:2048 -keyout key -x509
Generating a 2048 bit RSA private key
writing new private key to 'key'
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
Country Name (2 letter code) [AU]:GB
State or Province Name (full name) [Some-State]:.
Locality Name (eg, city) []:Brighton
Organization Name (eg, company) [Internet Widgits Pty Ltd]:happygiraffe
Organizational Unit Name (eg, section) []:.
Common Name (eg, YOUR name) []:Dominic Mitchell
Email Address []
hdm-macbookpro% ls -l
total 16
-rw-r--r--  1 hdm  5000  1554 19 Mar 17:51 cert
-rw-r--r--  1 hdm  5000  1751 19 Mar 17:51 key

So, what do we have here? Let’s look at the key first.

% openssl rsa -in key -noout -text
Enter pass phrase for key:
Private-Key: (2048 bit)
publicExponent: 65537 (0x10001)

Hmmm, lots of maths stuff. What’s in the certificate?

% openssl x509 -in cert -noout -text
        Version: 3 (0x2)
        Serial Number:
        Signature Algorithm: md5WithRSAEncryption
        Issuer: C=GB, L=Brighton, O=happygiraffe, CN=Dominic Mitchell/
            Not Before: Mar 19 17:51:38 2010 GMT
            Not After : Apr 18 17:51:38 2010 GMT
        Subject: C=GB, L=Brighton, O=happygiraffe, CN=Dominic Mitchell/
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
            RSA Public Key: (2048 bit)
                Modulus (2048 bit):
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Subject Key Identifier:
            X509v3 Authority Key Identifier:
                DirName:/C=GB/L=Brighton/O=happygiraffe/CN=Dominic Mitchell/

            X509v3 Basic Constraints:
    Signature Algorithm: md5WithRSAEncryption

There’s a great deal of useful information here. Note that “Issuer” and “Subject” fields are identical. This is what I meant by “self-signed” earlier. For a purchased certificate, the issuer would be somebody like Verisign or Thawte. Also for a certificate that’s being used for an SSL server, the CN field in the subject would have to match the hostname of the site it’s serving.

In the middle, under “Subject Public Key Info” there is some of the same information that’s present in the key (Modulus and Exponent). But there’s a lot that isn’t (this is good — you want to keep the private key private).

What’s that signature? Well, it’s a way of ensuring that the contents of the certificate can’t be tampered with. Otherwise, somebody could just alter the Subject Public Key Info to point to a different key that they control. It assures the integrity of the certificate. Think of it like a secure wrapper.

So, we’ve got a certificate with public key information inside, but how is actually used in practice? OpenSSL will let us simulate a client and a server. Open up a terminal window and run:

% openssl s_server -cert cert -key key -msg
Using default temp DH parameters
Enter PEM pass phrase:

Open another command line and run:

% openssl s_client -debug -connect localhost:4433 -msg
>>> SSL 2.0 [length 0074], CLIENT-HELLO
<<< TLS 1.0 Handshake [length 004a], ServerHello
<<< TLS 1.0 Handshake [length 045c], Certificate
depth=0 /C=GB/L=Brighton/O=happygiraffe/CN=Dominic Mitchell/
verify error:num=18:self signed certificate
verify return:1
depth=0 /C=GB/L=Brighton/O=happygiraffe/CN=Dominic Mitchell/
verify return:1
<<< TLS 1.0 Handshake [length 018d], ServerKeyExchange
<<>> TLS 1.0 Handshake [length 0046], ClientKeyExchange
>>> TLS 1.0 ChangeCipherSpec [length 0001]
>>> TLS 1.0 Handshake [length 0010], Finished
<<< TLS 1.0 ChangeCipherSpec [length 0001]
<<< TLS 1.0 Handshake [length 0010], Finished
Certificate chain
 0 s:/C=GB/L=Brighton/O=happygiraffe/CN=Dominic Mitchell/
   i:/C=GB/L=Brighton/O=happygiraffe/CN=Dominic Mitchell/
Server certificate
subject=/C=GB/L=Brighton/O=happygiraffe/CN=Dominic Mitchell/
issuer=/C=GB/L=Brighton/O=happygiraffe/CN=Dominic Mitchell/
No client certificate CA names sent
SSL handshake has read 1670 bytes and written 252 bytes
New, TLSv1/SSLv3, Cipher is DHE-RSA-AES256-SHA
Server public key is 2048 bit
    Protocol  : TLSv1
    Cipher    : DHE-RSA-AES256-SHA
    Session-ID: 422BC3AE9F6B11C32770C6A6DB8B915713754C00694FCCF6BA37FE14A3D5C087
    Master-Key: 32895E90044DFE72B8CAEC4B7D4E14DF9969109AFE9F0504C69CA4B9CB9259E177C85B498F9F850EBF5866DF47A45B17
    Key-Arg   : None
    Start Time: 1269245236
    Timeout   : 300 (sec)
    Verify return code: 18 (self signed certificate)

This is really verbose (it is a debugging tool), so I’ve removed much of the output. But it shows the important steps in the SSL transaction. Of particular interest is the line near the end stating the Master-Key. This reveals the truth behind public key crypto: it’s expensive enough that it’s used for just long enough to exchange a regular two-way encryption key between the two peers. The public key information in the certificate is only used to encrypt a single message to the server (ClientKeyExchange) containing the private key to use for this session.

Of course, there’s a lot more to SSL connections than this. I’ve obtained most of this from SSL & TLS Essentials, which has a very detailed breakdown of precisely what’s happening.

Please don’t rely on the above in any way — it’s just my interpretation. But it does seem useful to explore certificates a little bit more.


jslint4java 1.3.3

I’ve updated jslint4java again. This time, I’ve added:

  • Support for the predef option, so you can specify a list of predefined global variables. I first said I’d do this over a year ago. 😦
  • Updated to JSLint 2009-11-24, which brings a new devel option. Now you can decide if alert(), console.log(), etc are available.

Unfortunately, just after I’d committed the release, I noticed that I’ve managed to (somehow) pick up a junit dependency. I’ll set about removing that for the next release (issue 35).


jslint4java 1.3.2

Just a quick note that I’ve released jslint4java 1.3.2. There’s not a lot of news in here. The main new feature is that I added the ability to specify an external copy of jslint.js. This is quite useful if Doug Crockford introduces new features before I release a new version of jslint4java.

This release also upgrades to JSLint 2009-10-04, which sports a new maxerrs option.

Apart from that, I’m particularly grateful to both Simon Kenyon Shepard and Ryan Alberts for pointing out where my unit tests where non-portable. I really need to get hudson up and running on the games PC…


The Perforce Perspective

I’m a long time user of subversion, and more recently git. Coming to Google, however, everything’s based around perforce. I’m still new enough to it, that I don’t want to criticise it, merely contrast my experiences with it.

The first thing that I noticed with perforce (p4) is quite how server-based it is. Subversion (and CVS) is often criticised for leaving lots of “turds” around: .svn or CVS directories. They’re just clutter that you don’t want to be bothered with. With perforce however, everything lives on the server. There is almost no data stored on the client side (perhaps just a .p4config file). Everything you do has to talk to the server.

The next surprise was how things are checked out. In subversion, you usually check out the trunk of a project, or a branch. You can do that in perforce, but it’s a great deal more flexible. You supply a client spec, which is a small text file describing a mapping from the server’s directory structure to your own workspace. e.g.

Client: myproject-client

Root: /home/dom/myproject

  //depot/myproject/...  //myproject-client/myproject/...
  -//depot/myproject/bigdata/... //myproject-client/myproject/bigdata/...

In this example, I’ve checked out all of myproject, except I’ve also removed some big data which I don’t need for my development. You can create a client workspace which is composed of any part (or parts) of your repository. Unsurprisingly, this is both a blessing and a curse. You can create very complicated setups using these somewhat ephemeral client specs. But they’re not (by default) versioned, so they’re really easy to lose. I’ve also found it very easy to make small mistakes which mean the wrong bits of projects are checked out (or no bits). If you’re new to a project, figuring out the correct client spec is one of the first hurdles you’ll come across.

Once you’ve got some code checked out, it’s not too dissimilar to other version control systems. The most irritating thing that I came across was p4‘s inability to detect added files. So, if I create a file …/myproject/foo.txt and run p4 pending, it says “no change.” You have to explicitly run p4 add. This is terrible — it’s really easy to forget add files. You can convince perforce to list these files, but it’s not trivial:

$ find * -type f | p4 -x - files 2>&1 | awk '/ - no such file/{print $1}'

One feature I quite like is the ability to have “pending changelists.” A changelist is perforce’s equivalent of a commit in subversion. You can create a pending changelist, which essentially allows you to build up a commit a little bit at a time, somewhat like git’s index. But even though you can have multiple pending changelists in a single client, you are still restricted in that a given file can only be in one of them. Personally, I find the git index more useful. Plus, when you submit a pending changelist, it gets assigned a new changelist number. This can make them difficult to track.

The critical feature for perforce is it’s integration (merging) support. Whilst I’ve done a few p4 integrates, I’ve not got the full hang of it yet. But it’s clearly far in advance of svn’s merging.

Internally, perforce is built upon two things:

  1. A collection of RCS *,v files.
  2. A few database files to coordinate metadata.

This architecture is noticeable: as soon as you look at the file log, you can see that each file has its own individual version number.

Over the years, Google have built up tools to work around many of these issues. There’s a nice discussion of perforce and how it’s used at Google in the comments of the LWN article KS2009: How Google uses Linux (which is a fascinating read in and of itself).

In case you’re interested in some of the challenges of running perforce at the scale of Google, it’s worth checking out some of the papers that have been presented at the perforce user conferences:


The joy of apple keyboards

Recently, I’ve been using a Linux desktop for the first time in ages. It’s Ubuntu (Hardy Heron), and it looks nice. But after using a mac for three years, I’m really missing quite a few little things.

  1. The ability to drag and drop anything anywhere.
  2. Being able to type a wide range of Unicode characters easily.

On a mac, it’s really, really easy to type in a wide variety of useful characters. All you need is alt (⌥), sometimes known as “option”.

Keys Character Name
⌥ ⇧ - EM DASH

How did I find all this out? The lovely keyboard viewer that comes with OS X. You can get the flag in your menu bar by going to International in system preferences and checking “Show input menu in menu bar.”

Selecting the keyboard viewer in the input menu
OS X Keyboard Viewer (normal state)

Now, hold down alt and see what you can get (try alt and shift too).

OS X Keyboard Viewer (alt)

But not everything is attached to a key. In case you need more characters, there’s always the character palette. Usually on the ⌥ ⌘ T key as well as in the Edit menu. Here, you can get access to the vast repertoire of characters in Unicode. Need an arrow?

Arrows in the Character Palette

There’s a lot you can do with the character palette, but the search box is probably the best way in. Just tap in a bit of the name of the character you’re looking for and see what turns up.

This easy access to a wide array of characters is something I’ve rather come to take for granted in OS X. So coming back to the Linux desktop, it was odd to find that I couldn’t as readily type them in. Of course, I haven’t invested the time in figuring out how to set up XKB correctly. Doubtless I could achieve many of the same things. But my past experiences of XKB and it’s documentation have shown me how complicated it can be, so I don’t rate my ability to pull it off.

The end result is that I’m spending most of my time on the (mac) laptop and ignoring the desktop. I do like my characters. 🙂


ASL (Apple System Log)

I’ve just been debugging a problem with the pulse-agent on a mac. One of the big questions we had was: where the heck are the logs? The pulse-agent is managed by launchd. Apparently, this logs all stdout and stderr via ASL.

But what’s ASL? Apparently, it’s the Apple System Log. There’s a nice summary on Domain of the Bored. He also gives the key hint: you can use syslog(1) to read the binary ASL files.

I didn’t delve too deeply into the flags. It appeared that just running syslog spat out all the information I required. But, it comes out encoded like cat -v. But you can pipe it through unvis(1) to clean that up.

$ syslog | unvis | less

Normally, would take care of all this transparently. But when you’re ssh’d into a mac, that’s not an option. So it’s good to know about syslog(1).

Looking closer at the flags, you can say syslog -E safe in place of piping through unvis(1).


Book review: Solr 1.4 Enterprise Search Server


I was recently offered a review copy of Solr 1.4 Enterprise Search Server (thanks to Swati Iyer). Whilst this is most fortuitous, I only wish I’d had this a month or two ago, when I was working fairly heavily on a Solr based project at $OLDWORK. Still, I’ll be able to judge whether or not the book would have been useful. 🙂

First, some background. Normally, Solr is documented through its wiki. As wikis go, it’s well maintained and informative. But it suffers from both a lack of narrative structure and by being completely online. The latter point really hit home when I was in ApacheCon 2008, in a Solr training class and couldn’t get at the documentation. So, a book covering Solr has to be a good idea.

Even though this book covers Solr 1.4, most of it is still applicable to earlier versions (my experience is all with 1.3). This is handy, seeing as Solr 1.4 isn’t released yet (and hence not yet in central). Hopefully, it should be any day now, seeing as the version numbers have been bumped in svn (r824893).

The first nice thing about this book is simply that it’s not a massive tome. At only 317pp, it’s really quite approachable. When you open it, the writing is in a friendly, conversational style.

The book starts with a brief introduction to solr and lucene, before moving on to installation. One thing I found unusual were the comparisons to relational database technology. These continue in a few places through the book. Perhaps I’m so used to search that I don’t need this. But given that the focus is on “enterprise,” it’s quite likely that’s the best angle to pull in the target audience. The chapter rounds off with a quick walkthrough of loading and querying data. It’s good to see something practical even at this point.

With that out of the way, the discussion moves to the absolute bedrock of solr: the schema. Defining what data you have and how you want to index and search it is of crucial importance. Particularly useful is the advice to play with Solr’s analysis tool, in order to understand how the fields you define actually work. Whilst the explanations of what the schema is and how design a good one are clear, it’s still likely that this is a chapter you’ll be revisiting as you get to know both Solr and your data more.

This chapter also introduces the data set you’ll work with through the book: the MusicBrainz data. This isn’t an obvious choice for testing out a search engine (gutenberg? shakespeare?), but it is fun. And where it doesn’t fully exercise Solr, this is pointed out.

Next we move on to how to get your data into Solr. This assumes a level of familiarity with the command line, in order to use curl. As well as the “normal” method of POSTing XML documents into Solr, this also covers uploading CSV files and the DataImportHandler. The latter is a contrib module which I hadn’t seen before. This lets you pull your data in to solr (instead of pushing) from any JDBC data source. The only missing thing is something that I spent a while getting right: importing XML data into Solr. There is a confusion which stems from the fact that you can post XML into solr, but not arbitrary XML. If you want to put an arbitrary XML document in a Solr field, you have to escape it and nest it into a solr document. It’s ugly, but can be made to work.

Once you’ve got the data in, what about getting it out again? The chapter on “basic querying” covers the myriad of ways you can alter Solr’s output. But the basic query stuff is handled well. In particular, it has a nice clear explanations of Solr’s variant of “data structure as XML” as well as the full query syntax. There is also detail on the solrconfig.xml which I completely managed to miss in six months of staring at it. Oh well.

At this point, the book has the basics covered. You could stop here and get along very well with Solr. But this is also the bit where the interesting parts start to appear:

  • There’s coverage of function queries, which allow you to manipulate the rankings of results in various ways (e.g. ranking newer content higher). I confess that the function queries looked interesting, but I haven’t used them and the descriptions in the book swiftly go past my limited maths knowledge.
  • The dismax handler is introduced, which gives a far simpler query interface to your users. This is something I wish I’d payed closer attention to in my last project.
  • Faceting is covered in detail. This is one of Solr’s hidden gems, providing information about the complete set of results without performing a second query. There’s also a nice demonstration of using faceting to back up a “suggestions” mechanism.
  • Highlighting results data. I could have saved a lot of time by reading this.
  • Spellchecking (“did you mean”). Again, the coverage highlights several pitfalls you need to be aware of.

Then comes the best surprise of all. A chapter on deployment. So many books forget this crucial step. So, there is coverage of logging, backups, monitoring and security. It might have been nice to also mention integrating it into the system startup sequence.

The remaining chapters cover client integration (with Java, PHP, JavaScript and Rails) and how to scale Solr. Though I never needed the scaling for my project, the advice given is still useful. For example, do you need to make every field stored? (doing so can increase disk usage) The coverage of running Solr on EC² also looked rather useful.

Perhaps the one thing that I’m not entirely happy with is the index (though I acknowledge a good index is hard to achieve). Some common terms I looked up weren’t present.

Overall, I’m really pleased by this book. Given my own experiences figuring out solr through the school of hard debugging sessions, I can say that this would have made my life a great deal easier. If you want to use Solr, you’ll save yourself time with this book.


$WORK =~ s/semantico/google/

A couple of weeks ago, I started at Google. It was time for a change. I’d been at semantico for nine years and had an enormous amount of fun with some excellent people. But I needed to do something different. So I applied for a release engineer post at Google. After the byzantine hiring process, I was accepted. I still reckon I got lucky on the interviews. 🙂

And it took me about two hours to realise I know nothing. To scale up to that size, everything is custom-built. It’s going to be a loooong learning process. And one that doesn’t stop. When you have that may thousands of engineers clustered together, things don’t stand still for long. But it’s going to be fun.

On leaving semantico, I was enormously pleased to be given the Unicode 5.0 book. The one continuing thread throughout my time has been encoding issues. It’s a fitting cap.

The Unicode 5.0 Standard

I’d like to say thanks once again to semantico for all the fun times I’ve had. I wish you the best of luck in the future.


Java Platform Encoding

This came up at $WORK recently. We had a java program that was given input through command line arguments. Unfortunately, it went wrong when being passed UTF-8 characters (U+00A9 COPYRIGHT SIGN [©]). Printing out the command line arguments from inside Java showed that we had double encoded Unicode.

Initially, we just slapped -Dfile.encoding=UTF-8 on the command line. But that failed when the site that called this code went through an automatic restart. So we investigated the issue further.

We quickly found that the presence of absence of the LANG environment variable had a bearing on the matter.

NB: ShowSystemProperties.jar is very simple and just lists all system properties in sorted order.

$ java -version
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Server VM (build 14.2-b01, mixed mode)
$ echo $LANG
$ java -jar ShowSystemProperties.jar | grep encoding
$ LANG= java -jar ShowSystemProperties.jar | grep encoding

So, setting file.encoding works, but there’s an internal property, sun.jnu.encoding as well.

Next, see what happens when we add the explicit override.

$ LANG= java -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding

Hey! sun.jnu.encoding isn’t changing!

Now, as far as I can see, sun.jnu.encoding isn’t actually documented anywhere. So you have to go into the source code for Java (openjdk’s jdk6-b16 in this case) to figure out what’s up.

Let’s start in main(), which is in java.c. Actually, it’s JavaMain() that we’re really interested in. In there you can see:

JavaMain(void * _args)
  jobjectArray mainArgs;

  /* Build argument array */
  mainArgs = NewPlatformStringArray(env, argv, argc);
  if (mainArgs == NULL) {
      goto leave;

NewPlatformStringArray() is defined in java.c and calls NewPlatformString() repeatedly with each command line argument. In turn, that calls new String(byte[], encoding). It gets the encoding from getPlatformEncoding(). That essentially calls System.getProperty("sun.jnu.encoding").

So where does that property get set? If you look in System.c, Java_java_lang_System_initProperties() calls:

    PUTPROP(props, "sun.jnu.encoding", sprops->sun_jnu_encoding);

sprops appears to get set in GetJavaProperties() in java_props_md.c. This interprets various environment variables including the one that control the locale. It appears to pull out everything after the period in the LANG environment variable as the encoding in order to get sun_jnu_encoding.

Phew. So we now know that there is a special property which gets used for interpreting “platform” strings like:

* Command line arguments
* Main class name
* Environment variables

And it can be overridden:

$ LANG= java -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding