Author: Dominic Mitchell


Book review: Solr 1.4 Enterprise Search Server


I was recently offered a review copy of Solr 1.4 Enterprise Search Server (thanks to Swati Iyer). Whilst this is most fortuitous, I only wish I’d had this a month or two ago, when I was working fairly heavily on a Solr based project at $OLDWORK. Still, I’ll be able to judge whether or not the book would have been useful. 🙂

First, some background. Normally, Solr is documented through its wiki. As wikis go, it’s well maintained and informative. But it suffers both from a lack of narrative structure and from being completely online. The latter point really hit home when I was at ApacheCon 2008, in a Solr training class, and couldn’t get at the documentation. So, a book covering Solr has to be a good idea.

Even though this book covers Solr 1.4, most of it is still applicable to earlier versions (my experience is all with 1.3). This is handy, seeing as Solr 1.4 isn’t released yet (and hence not yet in central). Hopefully, it should be any day now, seeing as the version numbers have been bumped in svn (r824893).

The first nice thing about this book is simply that it’s not a massive tome. At only 317pp, it’s really quite approachable. When you open it, the writing is in a friendly, conversational style.

The book starts with a brief introduction to Solr and Lucene, before moving on to installation. One thing I found unusual was the comparisons to relational database technology. These continue in a few places through the book. Perhaps I’m so used to search that I don’t need this. But given that the focus is on “enterprise,” it’s quite likely that’s the best angle to pull in the target audience. The chapter rounds off with a quick walkthrough of loading and querying data. It’s good to see something practical even at this point.

With that out of the way, the discussion moves to the absolute bedrock of Solr: the schema. Defining what data you have and how you want to index and search it is of crucial importance. Particularly useful is the advice to play with Solr’s analysis tool, in order to understand how the fields you define actually work. Whilst the explanations of what the schema is and how to design a good one are clear, it’s still likely that this is a chapter you’ll be revisiting as you get to know both Solr and your data more.

This chapter also introduces the data set you’ll work with through the book: the MusicBrainz data. This isn’t an obvious choice for testing out a search engine (gutenberg? shakespeare?), but it is fun. And where it doesn’t fully exercise Solr, this is pointed out.

Next we move on to how to get your data into Solr. This assumes a level of familiarity with the command line, in order to use curl. As well as the “normal” method of POSTing XML documents into Solr, this also covers uploading CSV files and the DataImportHandler. The latter is a contrib module which I hadn’t seen before. This lets you pull your data into Solr (instead of pushing) from any JDBC data source. The only missing thing is something that I spent a while getting right: importing XML data into Solr. The confusion stems from the fact that you can post XML into Solr, but not arbitrary XML. If you want to put an arbitrary XML document in a Solr field, you have to escape it and nest it inside a Solr document. It’s ugly, but can be made to work.
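
As a rough illustration of that escaping step, here is a minimal JavaScript sketch. The helper names (escapeXml, buildSolrAdd) and the field names (id, raw_xml) are made up for this example, and the fields would need to exist in your own schema. It only builds the update message; posting it is left to curl or whatever client you use.

```javascript
// Escape the characters that are significant in XML text content.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Build a Solr <add> message holding a single document.
// `fields` maps field names to string values; the names used below
// are hypothetical and must match fields defined in your schema.
function buildSolrAdd(fields) {
  var xml = '<add><doc>';
  Object.keys(fields).forEach(function (name) {
    var safeName = escapeXml(name).replace(/"/g, '&quot;');
    xml += '<field name="' + safeName + '">' +
           escapeXml(fields[name]) + '</field>';
  });
  return xml + '</doc></add>';
}

// The arbitrary XML ends up as escaped text inside one field.
var payload = buildSolrAdd({
  id: 'doc-1',
  raw_xml: '<article><title>Fish & chips</title></article>'
});
console.log(payload);
```

The stored field then holds the original markup as plain text, which you have to parse again yourself after retrieving it.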

Once you’ve got the data in, what about getting it out again? The chapter on “basic querying” covers the myriad ways you can alter Solr’s output. But the basic query stuff is handled well. In particular, it has nice, clear explanations of Solr’s variant of “data structure as XML” as well as the full query syntax. There is also detail on solrconfig.xml which I completely managed to miss in six months of staring at it. Oh well.

At this point, the book has the basics covered. You could stop here and get along very well with Solr. But this is also the bit where the interesting parts start to appear:

  • There’s coverage of function queries, which allow you to manipulate the rankings of results in various ways (e.g. ranking newer content higher). I confess that the function queries looked interesting, but I haven’t used them and the descriptions in the book swiftly go past my limited maths knowledge.
  • The dismax handler is introduced, which gives a far simpler query interface to your users. This is something I wish I’d paid closer attention to in my last project.
  • Faceting is covered in detail. This is one of Solr’s hidden gems, providing information about the complete set of results without performing a second query. There’s also a nice demonstration of using faceting to back up a “suggestions” mechanism.
  • Highlighting results data. I could have saved a lot of time by reading this.
  • Spellchecking (“did you mean”). Again, the coverage highlights several pitfalls you need to be aware of.

Then comes the best surprise of all. A chapter on deployment. So many books forget this crucial step. So, there is coverage of logging, backups, monitoring and security. It might have been nice to also mention integrating it into the system startup sequence.

The remaining chapters cover client integration (with Java, PHP, JavaScript and Rails) and how to scale Solr. Though I never needed the scaling for my project, the advice given is still useful. For example, do you need to make every field stored? (Doing so can increase disk usage.) The coverage of running Solr on EC2 also looked rather useful.

Perhaps the one thing that I’m not entirely happy with is the index (though I acknowledge a good index is hard to achieve). Some common terms I looked up weren’t present.

Overall, I’m really pleased by this book. Given my own experiences figuring out Solr through the school of hard debugging sessions, I can say that this would have made my life a great deal easier. If you want to use Solr, you’ll save yourself time with this book.

$WORK =~ s/semantico/google/

A couple of weeks ago, I started at Google. It was time for a change. I’d been at semantico for nine years and had an enormous amount of fun with some excellent people. But I needed to do something different. So I applied for a release engineer post at Google. After the byzantine hiring process, I was accepted. I still reckon I got lucky on the interviews. 🙂

And it took me about two hours to realise I know nothing. To scale up to that size, everything is custom-built. It’s going to be a loooong learning process. And one that doesn’t stop. When you have that many thousands of engineers clustered together, things don’t stand still for long. But it’s going to be fun.

On leaving semantico, I was enormously pleased to be given the Unicode 5.0 book. The one continuing thread throughout my time has been encoding issues. It’s a fitting cap.

The Unicode 5.0 Standard

I’d like to say thanks once again to semantico for all the fun times I’ve had. I wish you the best of luck in the future.

Java Platform Encoding

This came up at $WORK recently. We had a Java program that was given input through command line arguments. Unfortunately, it went wrong when passed non-ASCII characters encoded as UTF-8 (such as U+00A9 COPYRIGHT SIGN [©]). Printing out the command line arguments from inside Java showed that they had been double encoded.
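
For anyone who hasn’t met double encoding before, here is a small JavaScript sketch of the general effect. It assumes the UTF-8 bytes were misread as ISO-8859-1 before being encoded again; that choice of wrong encoding is an assumption for illustration, not something established about the JVM here. The helper names are invented for the example.

```javascript
// UTF-8 encode a string into a plain array of byte values.
function utf8Bytes(text) {
  return Array.from(new TextEncoder().encode(text));
}

// Misread bytes as ISO-8859-1: every byte becomes exactly one character.
// (Assumed wrong encoding, chosen only to illustrate the effect.)
function decodeLatin1(bytes) {
  return bytes.map(function (b) { return String.fromCharCode(b); }).join('');
}

// Format bytes as space-separated hex pairs.
function hex(bytes) {
  return bytes.map(function (b) {
    return b.toString(16).padStart(2, '0');
  }).join(' ');
}

var once = utf8Bytes('\u00a9');     // U+00A9 COPYRIGHT SIGN, encoded once
var misread = decodeLatin1(once);   // now two characters: U+00C2 U+00A9
var twice = utf8Bytes(misread);     // encoded a second time

console.log(hex(once));   // c2 a9
console.log(hex(twice));  // c3 82 c2 a9
```

One character has turned into two, which is the tell-tale sign to look for when printing the arguments.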

Initially, we just slapped -Dfile.encoding=UTF-8 on the command line. But that failed when the site that called this code went through an automatic restart. So we investigated the issue further.

We quickly found that the presence or absence of the LANG environment variable had a bearing on the matter.

NB: ShowSystemProperties.jar is very simple and just lists all system properties in sorted order.

$ java -version
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Server VM (build 14.2-b01, mixed mode)
$ echo $LANG
$ java -jar ShowSystemProperties.jar | grep encoding
$ LANG= java -jar ShowSystemProperties.jar | grep encoding

So, setting file.encoding works, but there’s an internal property, sun.jnu.encoding as well.

Next, see what happens when we add the explicit override.

$ LANG= java -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding

Hey! sun.jnu.encoding isn’t changing!

Now, as far as I can see, sun.jnu.encoding isn’t actually documented anywhere. So you have to go into the source code for Java (openjdk’s jdk6-b16 in this case) to figure out what’s up.

Let’s start in main(), which is in java.c. Actually, it’s JavaMain() that we’re really interested in. In there you can see:

JavaMain(void * _args)
{
  /* ... */
  jobjectArray mainArgs;
  /* ... */
  /* Build argument array */
  mainArgs = NewPlatformStringArray(env, argv, argc);
  if (mainArgs == NULL) {
      goto leave;
  }
  /* ... */
}

NewPlatformStringArray() is defined in java.c and calls NewPlatformString() repeatedly with each command line argument. In turn, that calls new String(byte[], encoding). It gets the encoding from getPlatformEncoding(). That essentially calls System.getProperty("sun.jnu.encoding").

So where does that property get set? If you look in System.c, Java_java_lang_System_initProperties() calls:

    PUTPROP(props, "sun.jnu.encoding", sprops->sun_jnu_encoding);

sprops appears to get set in GetJavaProperties() in java_props_md.c. This interprets various environment variables including the ones that control the locale. It appears to pull out everything after the period in the LANG environment variable as the encoding in order to get sun_jnu_encoding.
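
To make that last step concrete, here is a simplified JavaScript sketch of pulling the codeset out of a locale name of the form language_territory.codeset@modifier. It is not a port of the JDK code: the function name is invented, and returning null when there is no codeset is a placeholder, since what the JVM falls back to in that case isn’t covered here.

```javascript
// Extract the codeset from a POSIX-style locale name:
//   language[_territory][.codeset][@modifier]
// Returns null when no codeset is present (placeholder behaviour only).
function codesetFromLang(lang) {
  if (!lang) {
    return null;
  }
  // Drop any trailing "@modifier" first, e.g. "@euro".
  var withoutModifier = lang.split('@')[0];
  var dot = withoutModifier.indexOf('.');
  if (dot === -1) {
    return null;
  }
  return withoutModifier.slice(dot + 1);
}

console.log(codesetFromLang('en_GB.UTF-8'));            // UTF-8
console.log(codesetFromLang('de_DE.ISO-8859-15@euro')); // ISO-8859-15
console.log(codesetFromLang(''));                       // null
```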

Phew. So we now know that there is a special property which gets used for interpreting “platform” strings like:

* Command line arguments
* Main class name
* Environment variables

And it can be overridden:

$ LANG= java -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 -jar ShowSystemProperties.jar | grep encoding


Sometimes, it’s useful to see what subversion’s actually doing when talking to a server. There’s a neon-debug-mask option in ~/.subversion/servers. But what values can it take? They’re not documented in the subversion manual.

As always, the source is informative.

/* Debugging masks. */
#define NE_DBG_SOCKET (1<<0) /* raw socket */
#define NE_DBG_HTTP (1<<1) /* HTTP request/response handling */
#define NE_DBG_XML (1<<2) /* XML parser */
#define NE_DBG_HTTPAUTH (1<<3) /* HTTP authentication (hiding credentials) */
#define NE_DBG_HTTPPLAIN (1<<4) /* plaintext HTTP authentication */
#define NE_DBG_LOCKS (1<<5) /* WebDAV locking */
#define NE_DBG_XMLPARSE (1<<6) /* low-level XML parser */
#define NE_DBG_HTTPBODY (1<<7) /* HTTP response body blocks */
#define NE_DBG_SSL (1<<8) /* SSL/TLS */
#define NE_DBG_FLUSH (1<<30) /* always flush debugging */

Or if you’re not a C coder (or had to spend more than 3 seconds working it out like me):

1           raw socket
2           HTTP request/response handling
4           XML parser
8           HTTP authentication (hiding credentials)
16          plaintext HTTP authentication
32          WebDAV locking
64          low-level XML parser
128         HTTP response body blocks
256         SSL/TLS
1073741824  always flush debugging

These are summed to enable the features you want. So, if I want requests, responses (including authentication) and bodies, that’s 2+8+128 = 138, so I need this in ~/.subversion/servers:

neon-debug-mask = 138
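
If you’d rather not do the sums by hand, the arithmetic can be sketched in a few lines of JavaScript. The bit values are copied from the neon defines quoted above; the NE_DBG table and the debugMask helper are just names invented for this example.

```javascript
// Bit values taken from the NE_DBG_* defines in the neon source.
var NE_DBG = {
  SOCKET:    1 << 0,
  HTTP:      1 << 1,
  XML:       1 << 2,
  HTTPAUTH:  1 << 3,
  HTTPPLAIN: 1 << 4,
  LOCKS:     1 << 5,
  XMLPARSE:  1 << 6,
  HTTPBODY:  1 << 7,
  SSL:       1 << 8,
  FLUSH:     1 << 30
};

// Combine the named flags into one mask value.
function debugMask(names) {
  return names.reduce(function (mask, name) {
    if (!Object.prototype.hasOwnProperty.call(NE_DBG, name)) {
      throw new Error('unknown debug flag: ' + name);
    }
    return mask | NE_DBG[name];
  }, 0);
}

console.log(debugMask(['HTTP', 'HTTPAUTH', 'HTTPBODY'])); // 138
```

Using bitwise OR rather than addition means that listing a flag twice doesn’t change the result.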

Of course, interpreting the resulting output is up to you, but if you’re having difficulties, it may give you some clue what’s up. I can immediately see the large “log” command that git svn rebase is using for instance.

No escape() from JavaScript

A couple of days ago, we got caught out by a few encoding issues in a site at $WORK. The Perl-related ones were fairly self-explanatory and ones I’d seen before (e.g. not calling decode_utf8() on the query string parameters). But the JavaScript part was new to me.

The problem was that we were using JavaScript to create a URL, but this wasn’t encoding some characters correctly. After a bit of investigation, the problem came down to the difference between escape() and encodeURIComponent().

input   escape(…)    encodeURIComponent(…)
a&b     a%26b        a%26b
1+2     1+2          1%2B2
café    caf%E9       caf%C3%A9
Ādam    %u0100dam    %C4%80dam

The last is particularly troublesome, as no server I know of will support decoding that %u form.

The takeaway is that encodeURIComponent() always encodes as UTF-8 and doesn’t miss characters out. As far as I can see, this means you should simply never use escape(). Which is why I’ve asked Douglas Crockford to add it as a warning to JSLint.

Once we switched the site’s JavaScript from escape() to encodeURIComponent(), everything worked as expected.
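
For reference, here is a short JavaScript sketch that prints both encodings for the sample inputs and then builds a query string the safe way. The buildUrl helper, the /search path and the q parameter are hypothetical, made up for this example rather than taken from the site in question.

```javascript
// Compare the two functions on the sample inputs.
var samples = ['a&b', '1+2', 'caf\u00e9', '\u0100dam'];
samples.forEach(function (s) {
  console.log(s + '\t' + escape(s) + '\t' + encodeURIComponent(s));
});

// Build a URL with a query string, encoding every key and value
// with encodeURIComponent() so that non-ASCII text is sent as UTF-8.
function buildUrl(base, params) {
  var pairs = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  });
  return pairs.length ? base + '?' + pairs.join('&') : base;
}

console.log(buildUrl('/search', { q: 'caf\u00e9 & \u0100dam' }));
// /search?q=caf%C3%A9%20%26%20%C4%80dam
```

Note that each key and value is encoded separately; encoding the whole URL in one go would also escape the ? and & that hold the query string together.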

Which whitespace was that again?

We recently saw this at $WORK. It appears corrupted in Internet Explorer only. Firefox and Safari show it normally.

Corrupted text in Internet Explorer

After much exploration in the debugger, we eventually found it was caused by using the innerText property in Internet Explorer. This has the mildly surprising property of turning multiple spaces into U+00A0 (NO-BREAK SPACE) characters (&nbsp; to you and me). This behaviour doesn’t appear to be documented. And before you ask, this was all being done by a third party; I know not to use proprietary extensions where possible.

Anyway, I nailed it down to a small test. Given this markup.

<p id="foo"></p>

Then this script demonstrates the problem.

var foo = document.getElementById('foo');
// OK
foo.innerText = ['A', ' ', 'B'].join('');
alert(foo.innerHTML);   // "A B"
foo.innerText = ['A', ' ', ' ', 'B'].join('');
alert(foo.innerHTML);   // "A&nbsp; B"

If you want to try it yourself, check out (jsbin is awesome! Thanks, rem!)

The quick solution is simple: normalize whitespace before using innerText.

var text = 'some       where        over      the      rainbow';
foo.innerText = text.split(/\s+/).join(' ');

Of course, you should really be using appendChild() and createTextNode().
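
A minimal sketch of that approach follows, using only standard DOM calls so that innerText isn’t involved at all. The function names are invented for the example, and the browser part is guarded so that it only runs when the <p id="foo"> element from the test markup is present.

```javascript
// Collapse every run of whitespace into a single space.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ');
}

// Replace an element's contents with a single text node.
function setText(element, text) {
  while (element.firstChild) {
    element.removeChild(element.firstChild);
  }
  element.appendChild(element.ownerDocument.createTextNode(text));
}

// Only touch the page when running in a browser that has the test markup.
if (typeof document !== 'undefined' && document.getElementById('foo')) {
  setText(document.getElementById('foo'), 'A  B');
}

console.log(normalizeWhitespace('some       where        over      the      rainbow'));
// some where over the rainbow
```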

BarCampBrighton #4

What a weekend — BCB4 has just been and gone. This was my first BarCamp, and it was a superb experience. Great talks; great people; great venue. Somewhat predictably I gave a talk on git. But I misjudged the audience and had a bad case of nerves.

Although I enjoyed pretty much everything I listened to, the one thing that keeps coming back to me is Seb’s Simple 3d in HTML5 Canvas talk. It reminded me that actually, the maths isn’t that hard, and the environment is really easy to work in. And the results are spectacular, even with simple code. I’ve been playing with canvas in my down time ever since.

I also learned about robots (need patience yet very cool); project lombok (sufficiently encapsulated magic); bulletproof widgets (with scary CSS); aikido & software development (practice, practice, practice); how car engines work (controlled boom); systems engineering (know your context). But it all pales compared to just meeting people, finding out what they’ve been up to and generally bouncing ideas around.

Check out bcb4 on flickr to get a feel.

Of course none of this could have happened without super-human powers of organisation. Whilst there were others involved, Jay is a hero for pulling it all together (well, and Minna). Thanks, Jay! See you at BCB5!

Update: since at least one person asked for it, here are the slides for my git presentation.

Changing the committer

Quite often, I find myself using git for non-work related activity on my work laptop. Yeah, yeah, I know.

Normally, I remember to set my email to be my home address before starting work.

$ mymail='dom [at] happygiraffe (dot) net'
$ git config user.email "$mymail"

Of course, you’d use your proper email address, instead of that obfuscated form.

Note that we don’t use --global. This change is specific to the repository that we’re working in.

Unfortunately, I usually just dive in and start working. About four or five commits down the line, I realise I’ve screwed up. What then?

git filter-branch to the rescue! We just need to change a couple of environment variables and redo each commit.

$ git filter-branch --env-filter "export GIT_AUTHOR_EMAIL='$mymail' GIT_COMMITTER_EMAIL='$mymail'" master
Rewrite 0c5299bf98bf30938bb1d0fc0211aa9f3a9ddcf8 (3/3)
Ref 'refs/heads/master' was rewritten

Like all uses of filter-branch, you should only do this on an unpublished repository, as it’s effectively altering history.

There is a reference to the original commits left behind, in case I screwed something up. When you’ve checked that everything looks OK, you can clean up.

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
$ git reflog expire --expire=now --all
$ git gc --prune=now
Counting objects: 9, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (9/9), done.
Total 9 (delta 1), reused 0 (delta 0)

temporarily ignoring files in git

Quite often, you want to change a file temporarily whilst you work on something, but you know you don’t want to commit it. Right now I want to change my project’s logging from INFO to DEBUG, but I don’t want to commit that.

There’s a command git update-index which has a flag --assume-unchanged. And it just makes those files ignored for a while.

$ git status
# On branch master
# Changed but not updated:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#	modified:   src/main/webapp/WEB-INF/log4j.xml
no changes added to commit (use "git add" and/or "git commit -a")
$ git update-index --assume-unchanged src/main/webapp/WEB-INF/log4j.xml
$ git status
# On branch master
nothing to commit (working directory clean)

Easy! Now, edit away.

… time passes …

And now to get everything back to normal.

$ git update-index --verbose --really-refresh
src/main/webapp/WEB-INF/log4j.xml: needs update
$ git status
# On branch master
# Changed but not updated:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#	modified:   src/main/webapp/WEB-INF/log4j.xml
no changes added to commit (use "git add" and/or "git commit -a")

I have to be honest, this is slightly hacky. It would be nice to be able to tell git “ignore this change,” in the way you can say “add this change”. But it works OK for now.

on OSX

Just a quick note… I was looking at RT#48699 when I noticed that MacPorts didn’t have in its collection. I needed to install it by hand. Unfortunately, the latest version (1.12) doesn’t install cleanly.

So I’ve forked it and fixed it (along with a couple of other minor nits).

Claes said he’ll apply the patch at some point. So hopefully when 1.13 comes out, this won’t be necessary.

Of course, really I should get to grips with MacPorts and submit a Portfile