Author Archives: Dominic Mitchell

jslint4java 1.3.3

I’ve updated jslint4java again. This time, I’ve added:

  • Support for the predef option, so you can specify a list of predefined global variables. I first said I’d do this over a year ago. :(
  • Updated to JSLint 2009-11-24, which brings a new devel option. Now you can decide whether alert(), console.log(), etc. are available. (There’s a usage example below.)
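
For instance, enabling both new options from the command line looks something like this. (A hedge: the flag spellings below are my recollection of how jslint4java maps option names onto flags, so check the usage output for the exact syntax.)

$ java -jar jslint4java-1.3.3.jar --predef jQuery,MY_GLOBAL --devel myscript.js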

Unfortunately, just after I’d committed the release, I noticed that I’ve managed to (somehow) pick up a junit dependency. I’ll set about removing that for the next release (issue 35).

jslint4java 1.3.2

Just a quick note that I’ve released jslint4java 1.3.2. There’s not a lot of news in here. The main new feature is that I added the ability to specify an external copy of jslint.js. This is quite useful if Doug Crockford introduces new features before I release a new version of jslint4java.
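
If you want to try it, the invocation looks something like this (the --jslint flag name is from memory, so treat it as an assumption and check the usage output):

$ java -jar jslint4java-1.3.2.jar --jslint /path/to/newer/jslint.js myscript.js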

This release also upgrades to JSLint 2009-10-04, which sports a new maxerr option.

Apart from that, I’m particularly grateful to both Simon Kenyon Shepard and Ryan Alberts for pointing out where my unit tests were non-portable. I really need to get hudson up and running on the games PC…

The Perforce Perspective

I’m a long-time user of subversion, and more recently git. Coming to Google, however, everything’s based around perforce. I’m still new enough to it that I don’t want to criticise it; I merely want to contrast it with my previous experience.

The first thing that I noticed with perforce (p4) is quite how server-based it is. Subversion (and CVS) is often criticised for leaving lots of “turds” around: .svn or CVS directories. They’re just clutter that you don’t want to be bothered with. With perforce, however, everything lives on the server. There is almost no data stored on the client side (perhaps just a .p4config file). Everything you do has to talk to the server.

The next surprise was how things are checked out. In subversion, you usually check out the trunk of a project, or a branch. You can do that in perforce, but it’s a great deal more flexible. You supply a client spec, which is a small text file describing a mapping from the server’s directory structure to your own workspace. e.g.

Client: myproject-client

Root: /home/dom/myproject

View:
  //depot/myproject/...  //myproject-client/myproject/...
  -//depot/myproject/bigdata/... //myproject-client/myproject/bigdata/...

In this example, I’ve mapped the whole of myproject into my workspace, except for the bigdata directory, which I don’t need for my development. You can create a client workspace which is composed of any part (or parts) of your repository. Unsurprisingly, this is both a blessing and a curse. You can create very complicated setups using these somewhat ephemeral client specs. But they’re not (by default) versioned, so they’re really easy to lose. I’ve also found it very easy to make small mistakes which mean the wrong bits of projects are checked out (or no bits at all). If you’re new to a project, figuring out the correct client spec is one of the first hurdles you’ll come across.
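
For reference, creating the spec and populating the workspace is plain p4 usage, nothing exotic:

$ p4 client myproject-client    # opens the client spec above in $EDITOR
$ p4 sync                       # pulls the mapped files into your workspace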

Once you’ve got some code checked out, it’s not too dissimilar to other version control systems. The most irritating thing that I came across was p4’s inability to detect added files. So, if I create a file …/myproject/foo.txt and run p4 pending, it says “no change.” You have to explicitly run p4 add. This is terrible — it’s really easy to forget to add files. You can convince perforce to list these files, but it’s not trivial:

$ find * -type f | p4 -x - files 2>&1 | awk '/ - no such file/{print $1}'

One feature I quite like is the ability to have “pending changelists.” A changelist is perforce’s equivalent of a commit in subversion. You can create a pending changelist, which essentially allows you to build up a commit a little bit at a time, somewhat like git’s index. But even though you can have multiple pending changelists in a single client, you are still restricted in that a given file can only be in one of them. Personally, I find the git index more useful. Plus, when you submit a pending changelist, it gets assigned a new changelist number. This can make them difficult to track.
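
For illustration, the basic dance goes like this (changelist number invented):

$ p4 change                  # opens an editor; saving creates pending changelist 1234
$ p4 reopen -c 1234 foo.txt  # move an already-opened file into that changelist
$ p4 submit -c 1234          # on submit, 1234 is renumbered to the next global number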

The critical feature for perforce is its integration (merging) support. Whilst I’ve done a few p4 integrates, I’ve not got the full hang of it yet. But it’s clearly far in advance of svn’s merging.
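
From the little I’ve done, an integrate follows this rough shape (paths invented to match the earlier example):

$ p4 integrate //depot/myproject/... //depot/myproject-rel1/...
$ p4 resolve    # step through the merges interactively
$ p4 submit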

Internally, perforce is built upon two things:

  1. A collection of RCS *,v files.
  2. A few database files to coordinate metadata.

This architecture is noticeable: as soon as you look at the file log, you can see that each file has its own individual version number.

Over the years, Google have built up tools to work around many of these issues. There’s a nice discussion of perforce and how it’s used at Google in the comments of the LWN article KS2009: How Google uses Linux (which is a fascinating read in and of itself).

In case you’re interested in some of the challenges of running perforce at the scale of Google, it’s worth checking out some of the papers that have been presented at the perforce user conferences.

The joy of apple keyboards

Recently, I’ve been using a Linux desktop for the first time in ages. It’s Ubuntu (Hardy Heron), and it looks nice. But after using a mac for three years, I’m really missing quite a few little things.

  1. The ability to drag and drop anything anywhere.
  2. Being able to type a wide range of Unicode characters easily.

On a mac, it’s really, really easy to type in a wide variety of useful characters. All you need is alt (⌥), sometimes known as “option”.

Keys    Character  Name
⌥ ;     …          HORIZONTAL ELLIPSIS
⌥ -     –          EN DASH
⌥ ⇧ -   —          EM DASH
⌥ [     “          LEFT DOUBLE QUOTATION MARK
⌥ ⇧ [   ”          RIGHT DOUBLE QUOTATION MARK
⌥ 2     ™          TRADE MARK SIGN
⌥ 8     •          BULLET
⌥ e, e  é          LATIN SMALL LETTER E WITH ACUTE

How did I find all this out? The lovely keyboard viewer that comes with OS X. You can get the flag in your menu bar by going to International in system preferences and checking “Show input menu in menu bar.”

Selecting the keyboard viewer in the input menu
OS X Keyboard Viewer (normal state)

Now, hold down alt and see what you can get (try alt and shift too).

OS X Keyboard Viewer (alt)

But not everything is attached to a key. In case you need more characters, there’s always the character palette, usually available on ⌥ ⌘ T as well as from the Edit menu. Here, you can get access to the vast repertoire of characters in Unicode. Need an arrow?

Arrows in the Character Palette

There’s a lot you can do with the character palette, but the search box is probably the best way in. Just tap in a bit of the name of the character you’re looking for and see what turns up.

This easy access to a wide array of characters is something I’ve rather come to take for granted in OS X. So coming back to the Linux desktop, it was odd to find that I couldn’t as readily type them in. Of course, I haven’t invested the time in figuring out how to set up XKB correctly. Doubtless I could achieve many of the same things. But my past experiences of XKB and its documentation have shown me how complicated it can be, so I don’t rate my ability to pull it off.
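
The one low-effort approach I’m aware of is a compose key; this is a sketch rather than something I’ve actually set up on that desktop:

$ setxkbmap -option compose:ralt
# Right Alt now acts as Compose:
#   Compose ' e    ->  é
#   Compose - - -  ->  em dash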

The end result is that I’m spending most of my time on the (mac) laptop and ignoring the desktop. I do like my characters. :)

ASL (Apple System Log)

I’ve just been debugging a problem with the pulse-agent on a mac. One of the big questions we had was: where the heck are the logs? The pulse-agent is managed by launchd. Apparently, this logs all stdout and stderr via ASL.

But what’s ASL? Apparently, it’s the Apple System Log. There’s a nice summary on Domain of the Bored, which also gives the key hint: you can use syslog(1) to read the binary ASL files.

I didn’t delve too deeply into the flags; just running syslog appeared to spit out all the information I required, albeit encoded in the style of cat -v. You can pipe it through unvis(1) to clean that up:

$ syslog | unvis | less

Normally, Console.app would take care of all this transparently. But when you’re ssh’d into a mac, that’s not an option. So it’s good to know about syslog(1).

Looking closer at the flags, you can say syslog -E safe in place of piping through unvis(1).

Book review: Solr 1.4 Enterprise Search Server


I was recently offered a review copy of Solr 1.4 Enterprise Search Server (thanks to Swati Iyer). Whilst this is most fortuitous, I only wish I’d had this a month or two ago, when I was working fairly heavily on a Solr based project at $OLDWORK. Still, I’ll be able to judge whether or not the book would have been useful. :)

First, some background. Normally, Solr is documented through its wiki. As wikis go, it’s well maintained and informative. But it suffers from both a lack of narrative structure and from being completely online. The latter point really hit home at ApacheCon 2008, when I was in a Solr training class and couldn’t get at the documentation. So, a book covering Solr has to be a good idea.

Even though this book covers Solr 1.4, most of it is still applicable to earlier versions (my experience is all with 1.3). This is handy, seeing as Solr 1.4 isn’t released yet (and hence not yet in central). Hopefully, it should be any day now, seeing as the version numbers have been bumped in svn (r824893).

The first nice thing about this book is simply that it’s not a massive tome. At only 317pp, it’s really quite approachable. When you open it, the writing is in a friendly, conversational style.

The book starts with a brief introduction to Solr and Lucene, before moving on to installation. One thing I found unusual was the comparisons to relational database technology, which continue in a few places through the book. Perhaps I’m so used to search that I don’t need them. But given that the focus is on “enterprise,” it’s quite likely that’s the best angle to pull in the target audience. The chapter rounds off with a quick walkthrough of loading and querying data. It’s good to see something practical even at this point.

With that out of the way, the discussion moves to the absolute bedrock of Solr: the schema. Defining what data you have and how you want to index and search it is of crucial importance. Particularly useful is the advice to play with Solr’s analysis tool, in order to understand how the fields you define actually work. Whilst the explanations of what the schema is and how to design a good one are clear, it’s still likely that this is a chapter you’ll be revisiting as you get to know both Solr and your data more.

This chapter also introduces the data set you’ll work with through the book: the MusicBrainz data. This isn’t an obvious choice for testing out a search engine (Gutenberg? Shakespeare?), but it is fun. And where it doesn’t fully exercise Solr, this is pointed out.

Next we move on to how to get your data into Solr. This assumes a level of familiarity with the command line, in order to use curl. As well as the “normal” method of POSTing XML documents into Solr, it also covers uploading CSV files and the DataImportHandler. The latter is a contrib module which I hadn’t seen before; it lets you pull your data into Solr (instead of pushing it) from any JDBC data source. The one thing missing is something that I spent a while getting right: importing XML data into Solr. The confusion stems from the fact that you can POST XML into Solr, but not arbitrary XML. If you want to put an arbitrary XML document in a Solr field, you have to escape it and nest it inside a Solr document. It’s ugly, but it can be made to work.
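
To make that concrete, the update message ends up looking something like this (field names invented; the entity-escaping of the embedded markup is the point):

<add>
  <doc>
    <field name="id">42</field>
    <field name="body">&lt;chapter&gt;&lt;title&gt;Intro&lt;/title&gt;&lt;/chapter&gt;</field>
  </doc>
</add>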

Once you’ve got the data in, what about getting it out again? The chapter on “basic querying” covers the myriad ways you can alter Solr’s output, and it handles the basic query stuff well. In particular, it has nice, clear explanations of Solr’s variant of “data structure as XML” as well as the full query syntax. There is also detail on solrconfig.xml, which I completely managed to miss in six months of staring at it. Oh well.

At this point, the book has the basics covered. You could stop here and get along very well with Solr. But this is also the bit where the interesting parts start to appear:

  • There’s coverage of function queries, which allow you to manipulate the rankings of results in various ways (e.g. ranking newer content higher). I confess that the function queries looked interesting, but I haven’t used them and the descriptions in the book swiftly go past my limited maths knowledge.
  • The dismax handler is introduced, which gives a far simpler query interface to your users. This is something I wish I’d paid closer attention to in my last project.
  • Faceting is covered in detail. This is one of Solr’s hidden gems, providing information about the complete set of results without performing a second query. There’s also a nice demonstration of using faceting to back up a “suggestions” mechanism.
  • Highlighting results data. I could have saved a lot of time by reading this.
  • Spellchecking (“did you mean”). Again, the coverage highlights several pitfalls you need to be aware of.

Then comes the best surprise of all. A chapter on deployment. So many books forget this crucial step. So, there is coverage of logging, backups, monitoring and security. It might have been nice to also mention integrating it into the system startup sequence.

The remaining chapters cover client integration (with Java, PHP, JavaScript and Rails) and how to scale Solr. Though I never needed the scaling for my project, the advice given is still useful. For example, do you need to make every field stored? (Doing so can increase disk usage.) The coverage of running Solr on EC2 also looked rather useful.

Perhaps the one thing that I’m not entirely happy with is the index (though I acknowledge a good index is hard to achieve). Some common terms I looked up weren’t present.

Overall, I’m really pleased by this book. Given my own experiences figuring out solr through the school of hard debugging sessions, I can say that this would have made my life a great deal easier. If you want to use Solr, you’ll save yourself time with this book.

$WORK =~ s/semantico/google/

A couple of weeks ago, I started at Google. It was time for a change. I’d been at semantico for nine years and had an enormous amount of fun with some excellent people. But I needed to do something different. So I applied for a release engineer post at Google. After the byzantine hiring process, I was accepted. I still reckon I got lucky on the interviews. :)

And it took me about two hours to realise I know nothing. To scale up to that size, everything is custom-built. It’s going to be a loooong learning process. And one that doesn’t stop. When you have that many thousands of engineers clustered together, things don’t stand still for long. But it’s going to be fun.

On leaving semantico, I was enormously pleased to be given the Unicode 5.0 book. The one continuing thread throughout my time has been encoding issues. It’s a fitting cap.

The Unicode 5.0 Standard

I’d like to say thanks once again to semantico for all the fun times I’ve had. I wish you the best of luck in the future.

Java Platform Encoding

This came up at $WORK recently. We had a java program that was given input through command line arguments. Unfortunately, it went wrong when being passed UTF-8 characters (U+00A9 COPYRIGHT SIGN [©]). Printing out the command line arguments from inside Java showed that we had double encoded Unicode.

Initially, we just slapped -Dfile.encoding=UTF-8 on the command line. But that failed when the site that called this code went through an automatic restart. So we investigated the issue further.

We quickly found that the presence or absence of the LANG environment variable had a bearing on the matter.

NB: ShowSystemProperties.jar is very simple and just lists all system properties in sorted order.
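
The whole thing is roughly this minimal sketch:

import java.util.Map;
import java.util.TreeMap;

public class ShowSystemProperties {
    public static void main(String[] args) {
        // System.getProperties() is an unsorted Hashtable; TreeMap sorts by key.
        Map<Object, Object> sorted =
                new TreeMap<Object, Object>(System.getProperties());
        for (Map.Entry<Object, Object> entry : sorted.entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
    }
}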

$ java -version
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Server VM (build 14.2-b01, mixed mode)
$ echo $LANG
en_GB.UTF-8
$ java -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=UTF-8
$ LANG= java -jar ShowSystemProperties.jar | grep encoding
file.encoding=ANSI_X3.4-1968
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=ANSI_X3.4-1968

So, setting file.encoding works, but there’s an internal property, sun.jnu.encoding as well.

Next, see what happens when we add the explicit override.

$ LANG= java -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=ANSI_X3.4-1968

Hey! sun.jnu.encoding isn’t changing!

Now, as far as I can see, sun.jnu.encoding isn’t actually documented anywhere. So you have to go into the source code for Java (openjdk’s jdk6-b16 in this case) to figure out what’s up.

Let’s start in main(), which is in java.c. Actually, it’s JavaMain() that we’re really interested in. In there you can see:

int JNICALL
JavaMain(void * _args)
{
  …
  jobjectArray mainArgs;

  …
  /* Build argument array */
  mainArgs = NewPlatformStringArray(env, argv, argc);
  if (mainArgs == NULL) {
      ReportExceptionDescription(env);
      goto leave;
  }
  …
}

NewPlatformStringArray() is defined in java.c and calls NewPlatformString() repeatedly with each command line argument. In turn, that calls new String(byte[], encoding). It gets the encoding from getPlatformEncoding(). That essentially calls System.getProperty("sun.jnu.encoding").
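
In Java terms, then, each command line argument is effectively decoded like this (my sketch of the behaviour, not actual JDK code):

// What NewPlatformString() amounts to, seen from the Java side.
static String platformString(byte[] argvBytes)
        throws java.io.UnsupportedEncodingException {
    String enc = System.getProperty("sun.jnu.encoding");
    return new String(argvBytes, enc);
}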

So where does that property get set? If you look in System.c, Java_java_lang_System_initProperties() calls:

    PUTPROP(props, "sun.jnu.encoding", sprops->sun_jnu_encoding);

sprops appears to get set in GetJavaProperties() in java_props_md.c. This interprets various environment variables, including those that control the locale. It appears to pull out everything after the period in the LANG environment variable as the encoding (so en_GB.UTF-8 yields UTF-8), in order to get sun_jnu_encoding.

Phew. So we now know that there is a special property which gets used for interpreting “platform” strings like:

  • Command line arguments
  • Main class name
  • Environment variables

And it can be overridden:

$ LANG= java -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding
file.encoding=UTF-8
file.encoding.pkg=sun.io
sun.io.unicode.encoding=UnicodeLittle
sun.jnu.encoding=UTF-8

neon-debug-mask

Sometimes, it’s useful to see what subversion’s actually doing when talking to a server. There’s a neon-debug-mask option in ~/.subversion/servers. But what values can it take? They’re not documented in the subversion manual.

As always, the source is informative.

/* Debugging masks. */
#define NE_DBG_SOCKET (1<<0) /* raw socket */
#define NE_DBG_HTTP (1<<1) /* HTTP request/response handling */
#define NE_DBG_XML (1<<2) /* XML parser */
#define NE_DBG_HTTPAUTH (1<<3) /* HTTP authentication (hiding credentials) */
#define NE_DBG_HTTPPLAIN (1<<4) /* plaintext HTTP authentication */
#define NE_DBG_LOCKS (1<<5) /* WebDAV locking */
#define NE_DBG_XMLPARSE (1<<6) /* low-level XML parser */
#define NE_DBG_HTTPBODY (1<<7) /* HTTP response body blocks */
#define NE_DBG_SSL (1<<8) /* SSL/TLS */
#define NE_DBG_FLUSH (1<<30) /* always flush debugging */

Or if you’re not a C coder (or had to spend more than 3 seconds working it out like me):

Value       Meaning
1           raw socket
2           HTTP request/response handling
4           XML parser
8           HTTP authentication (hiding credentials)
16          plaintext HTTP authentication
32          WebDAV locking
64          low-level XML parser
128         HTTP response body blocks
256         SSL/TLS
1073741824  always flush debugging

These are summed to enable the features you want. So, if I want requests and responses (2), authentication detail (8) and response bodies (128), that’s 2+8+128 = 138, so I need this in ~/.subversion/servers:

[global]
neon-debug-mask = 138

Of course, interpreting the resulting output is up to you, but if you’re having difficulties, it may give you some clue as to what’s up. For instance, I can immediately see the large “log” command that git svn rebase is using.

No escape() from JavaScript

A couple of days ago, we got caught out by a few encoding issues in a site at $WORK. The Perl-related ones were fairly self-explanatory and ones I’d seen before (e.g. not calling decode_utf8() on the query string parameters). But the JavaScript part was new to me.

The problem was that we were using JavaScript to create a URL, but some characters weren’t being encoded correctly. After a bit of investigation, it came down to the difference between escape() and encodeURIComponent().

input  escape(…)   encodeURIComponent(…)
a&b    a%26b       a%26b
1+2    1+2         1%2B2
café   caf%E9      caf%C3%A9
Ādam   %u0100dam   %C4%80dam

The last is particularly troublesome, as no server I know of will support decoding that %u form.
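
You can see the difference straight from a browser console:

escape('café');              // "caf%E9"    (the Latin-1 byte, not valid UTF-8)
encodeURIComponent('café');  // "caf%C3%A9" (correct UTF-8 percent-encoding)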

The takeaway is that encodeURIComponent() always encodes as UTF-8 and doesn’t miss any characters out. As far as I can see, this means you should simply never use escape(). Which is why I’ve asked Douglas Crockford to add a warning about it to JSLint.

Once we switched the site’s JavaScript from escape() to encodeURIComponent(), everything worked as expected.