Jabbering Giraffe

Book review: Solr 1.4 Enterprise Search Server

I was recently offered a review copy of Solr 1.4 Enterprise Search Server (thanks to Swati Iyer). Whilst this is most fortuitous, I only wish I’d had this a month or two ago, when I was working fairly heavily on a Solr-based project at $OLDWORK. Still, I’ll be able to judge whether or not the book would have been useful. 🙂

First, some background. Normally, Solr is documented through its wiki. As wikis go, it’s well maintained and informative. But it suffers both from a lack of narrative structure and from being completely online. The latter point really hit home when I was at ApacheCon 2008, in a Solr training class, and couldn’t get at the documentation. So, a book covering Solr has to be a good idea.

Even though this book covers Solr 1.4, most of it is still applicable to earlier versions (my experience is all with 1.3). This is handy, seeing as Solr 1.4 isn’t released yet (and hence not yet in central). Hopefully, it should be any day now, seeing as the version numbers have been bumped in svn (r824893).

The first nice thing about this book is simply that it’s not a massive tome. At only 317pp, it’s really quite approachable. When you open it, the writing is in a friendly, conversational style.

The book starts with a brief introduction to Solr and Lucene, before moving on to installation. One thing I found unusual was the comparisons to relational database technology. These continue in a few places through the book. Perhaps I’m so used to search that I don’t need this. But given that the focus is on “enterprise,” it’s quite likely that’s the best angle to pull in the target audience. The chapter rounds off with a quick walkthrough of loading and querying data. It’s good to see something practical even at this point.

With that out of the way, the discussion moves to the absolute bedrock of Solr: the schema. Defining what data you have and how you want to index and search it is of crucial importance. Particularly useful is the advice to play with Solr’s analysis tool, in order to understand how the fields you define actually work. Whilst the explanations of what the schema is and how to design a good one are clear, it’s still likely that this is a chapter you’ll be revisiting as you get to know both Solr and your data more.

This chapter also introduces the data set you’ll work with through the book: the MusicBrainz data. This isn’t an obvious choice for testing out a search engine (Gutenberg? Shakespeare?), but it is fun. And where it doesn’t fully exercise Solr, this is pointed out.

Next we move on to how to get your data into Solr. This assumes a level of familiarity with the command line, in order to use curl. As well as the “normal” method of POSTing XML documents into Solr, this also covers uploading CSV files and the DataImportHandler. The latter is a contrib module which I hadn’t seen before. It lets you pull your data into Solr (instead of pushing) from any JDBC data source. The only missing thing is something that I spent a while getting right: importing XML data into Solr. The confusion stems from the fact that you can post XML into Solr, but not arbitrary XML. If you want to put an arbitrary XML document in a Solr field, you have to escape it and nest it inside a Solr document. It’s ugly, but can be made to work.
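To make the escaping point concrete, here’s a minimal sketch in Python (my own, not from the book). It builds a Solr `<add><doc>` update message and tucks an arbitrary XML string into one field. The field names `id` and `raw_xml` are hypothetical and would have to match whatever your schema defines; the sample album XML is made up too.

```python
import xml.etree.ElementTree as ET


def build_add_document(fields):
    """Build a Solr <add><doc>...</doc></add> update message as a string.

    `fields` maps field names to string values. The names are assumptions:
    they must exist in your own schema.xml.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        # ElementTree escapes <, > and & in text when it serializes,
        # so an XML string stored here ends up as escaped character data.
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Arbitrary XML that Solr would reject if posted as-is.
raw_xml = "<album><title>Example Album</title></album>"

# Hypothetical field names, for illustration only.
message = build_add_document({"id": "album-1", "raw_xml": raw_xml})
print(message)
```

The resulting string is the sort of thing you’d then POST to Solr’s update handler with curl. Letting the XML library do the escaping is less error prone than doing it by hand, which is roughly where my time went.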

Once you’ve got the data in, what about getting it out again? The chapter on “basic querying” covers the myriad ways you can alter Solr’s output, and the basic query material is handled well too. In particular, it has nice, clear explanations of Solr’s variant of “data structure as XML” as well as the full query syntax. There is also detail on solrconfig.xml that I completely managed to miss in six months of staring at it. Oh well.
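For a flavour of what altering the output looks like in practice, here’s a small Python sketch (again mine, not the book’s) that assembles a query URL from Solr’s standard request parameters: `q`, `fl`, `start`, `rows` and `wt`. The base URL assumes the stock example setup on localhost, and the field names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, query, fields=None, rows=10, start=0, wt="xml"):
    """Assemble a Solr select URL from common request parameters."""
    params = {"q": query, "start": start, "rows": rows, "wt": wt}
    if fields:
        # fl restricts which fields come back in each result document.
        params["fl"] = ",".join(fields)
    return base_url + "?" + urlencode(params)


# Assumed base URL and hypothetical field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/select",
    "name:giraffe",
    fields=["id", "name", "score"],
    rows=5,
    wt="json",
)
print(url)

# Decode the query string again to show the parameters Solr would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Building the URL with `urlencode` takes care of encoding the colon and commas, which is easy to get wrong when pasting queries into curl by hand.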

At this point, the book has the basics covered. You could stop here and get along very well with Solr. But this is also the bit where the interesting parts start to appear.

Then comes the best surprise of all. A chapter on deployment. So many books forget this crucial step. So, there is coverage of logging, backups, monitoring and security. It might have been nice to also mention integrating it into the system startup sequence.

The remaining chapters cover client integration (with Java, PHP, JavaScript and Rails) and how to scale Solr. Though I never needed the scaling for my project, the advice given is still useful. For example, do you need to make every field stored? (Doing so can increase disk usage.) The coverage of running Solr on EC2 also looked rather useful.

Perhaps the one thing that I’m not entirely happy with is the index (though I acknowledge a good index is hard to achieve). Some common terms I looked up weren’t present.

Overall, I’m really pleased by this book. Given my own experiences figuring out Solr through the school of hard debugging sessions, I can say that this would have made my life a great deal easier. If you want to use Solr, you’ll save yourself time with this book.