Tag: solr

 

Book review: Solr 1.4 Enterprise Search Server

solr-book-thumb.png

I was recently offered a review copy of Solr 1.4 Enterprise Search Server (thanks to Swati Iyer). Whilst this is most fortuitous, I only wish I’d had this a month or two ago, when I was working fairly heavily on a Solr based project at $OLDWORK. Still, I’ll be able to judge whether or not the book would have been useful. 🙂

First, some background. Normally, Solr is documented through its wiki. As wikis go, it’s well maintained and informative. But it suffers from both a lack of narrative structure and by being completely online. The latter point really hit home when I was in ApacheCon 2008, in a Solr training class and couldn’t get at the documentation. So, a book covering Solr has to be a good idea.

Even though this book covers Solr 1.4, most of it is still applicable to earlier versions (my experience is all with 1.3). This is handy, seeing as Solr 1.4 isn’t released yet (and hence not yet in central). Hopefully, it should be any day now, seeing as the version numbers have been bumped in svn (r824893).

The first nice thing about this book is simply that it’s not a massive tome. At only 317pp, it’s really quite approachable. When you open it, the writing is in a friendly, conversational style.

The book starts with a brief introduction to solr and lucene, before moving on to installation. One thing I found unusual were the comparisons to relational database technology. These continue in a few places through the book. Perhaps I’m so used to search that I don’t need this. But given that the focus is on “enterprise,” it’s quite likely that’s the best angle to pull in the target audience. The chapter rounds off with a quick walkthrough of loading and querying data. It’s good to see something practical even at this point.

With that out of the way, the discussion moves to the absolute bedrock of solr: the schema. Defining what data you have and how you want to index and search it is of crucial importance. Particularly useful is the advice to play with Solr’s analysis tool, in order to understand how the fields you define actually work. Whilst the explanations of what the schema is and how design a good one are clear, it’s still likely that this is a chapter you’ll be revisiting as you get to know both Solr and your data more.

This chapter also introduces the data set you’ll work with through the book: the MusicBrainz data. This isn’t an obvious choice for testing out a search engine (gutenberg? shakespeare?), but it is fun. And where it doesn’t fully exercise Solr, this is pointed out.

Next we move on to how to get your data into Solr. This assumes a level of familiarity with the command line, in order to use curl. As well as the “normal” method of POSTing XML documents into Solr, this also covers uploading CSV files and the DataImportHandler. The latter is a contrib module which I hadn’t seen before. This lets you pull your data in to solr (instead of pushing) from any JDBC data source. The only missing thing is something that I spent a while getting right: importing XML data into Solr. There is a confusion which stems from the fact that you can post XML into solr, but not arbitrary XML. If you want to put an arbitrary XML document in a Solr field, you have to escape it and nest it into a solr document. It’s ugly, but can be made to work.

Once you’ve got the data in, what about getting it out again? The chapter on “basic querying” covers the myriad of ways you can alter Solr’s output. But the basic query stuff is handled well. In particular, it has a nice clear explanations of Solr’s variant of “data structure as XML” as well as the full query syntax. There is also detail on the solrconfig.xml which I completely managed to miss in six months of staring at it. Oh well.

At this point, the book has the basics covered. You could stop here and get along very well with Solr. But this is also the bit where the interesting parts start to appear:

  • There’s coverage of function queries, which allow you to manipulate the rankings of results in various ways (e.g. ranking newer content higher). I confess that the function queries looked interesting, but I haven’t used them and the descriptions in the book swiftly go past my limited maths knowledge.
  • The dismax handler is introduced, which gives a far simpler query interface to your users. This is something I wish I’d payed closer attention to in my last project.
  • Faceting is covered in detail. This is one of Solr’s hidden gems, providing information about the complete set of results without performing a second query. There’s also a nice demonstration of using faceting to back up a “suggestions” mechanism.
  • Highlighting results data. I could have saved a lot of time by reading this.
  • Spellchecking (“did you mean”). Again, the coverage highlights several pitfalls you need to be aware of.

Then comes the best surprise of all. A chapter on deployment. So many books forget this crucial step. So, there is coverage of logging, backups, monitoring and security. It might have been nice to also mention integrating it into the system startup sequence.

The remaining chapters cover client integration (with Java, PHP, JavaScript and Rails) and how to scale Solr. Though I never needed the scaling for my project, the advice given is still useful. For example, do you need to make every field stored? (doing so can increase disk usage) The coverage of running Solr on EC² also looked rather useful.

Perhaps the one thing that I’m not entirely happy with is the index (though I acknowledge a good index is hard to achieve). Some common terms I looked up weren’t present.

Overall, I’m really pleased by this book. Given my own experiences figuring out solr through the school of hard debugging sessions, I can say that this would have made my life a great deal easier. If you want to use Solr, you’ll save yourself time with this book.

Search & Replace in XSLT 2

For a project at $WORK, we want to implement Solr’s spelling suggestions. When you ask solr to provide suggestions, it comes back with something like this (the original search was spinish englosh):

  <response><lst name="spellcheck">
      <lst name="suggestions">
        <lst name="spinish">
          <int name="numFound">1</int>
          <int name="startOffset">19</int>
          <int name="endOffset">26</int>
          <arr name="suggestion">
            <str>spanish</str>
          </arr>
        </lst>
        <lst name="englosh">
          <int name="numFound">1</int>
          <int name="startOffset">27</int>
          <int name="endOffset">34</int>
          <arr name="suggestion">
            <str>english</str>
          </arr>
        </lst>
        <lst name="spinish">
          <int name="numFound">1</int>
          <int name="startOffset">60</int>
          <int name="endOffset">67</int>
          <arr name="suggestion">
            <str>spanish</str>
          </arr>
        </lst></lst>
    </lst>
  </response>

What we want to do is transform this into:

<p>Did you mean <a href="?q=spanish%20english">spanish english</a>?</p>

As it turns out, this is a non-trivial task in XSLT. It’s doable, but significantly easier in XSLT 2, since you are less restricted by the rules on result-tree-fragments.

The first problem to solve is getting the data into a sensible data structure for further processing. In a real language, I’d want a list of (from, to) pairs. In XSLT, sequences are always flat. The way to simulate this is to construct an element for the pair.

  <xsl:variable name="suggRoot" select="/response/lst[@name='spellcheck']/lst[@name='suggestions']" />
  <xsl:variable name="suggestions" as="element(sugg)*">
    <xsl:for-each select="distinct-values($suggRoot/lst/@name)">
      <!-- Pick the first suggestion for this name. -->
      <sugg from="{.}" to="{($suggRoot/lst[@name=current()])[1]/arr[@name='suggestion']/str[1]}" />
    </xsl:for-each>
  </xsl:variable>

Note the commented caveat: we always pick the first suggestion for any given name. From my (small) experience, this isn’t an issue as the suggestions for a given word are always identical.

This results in $suggestions containing a sequence of elements looking like this.

  <sugg from="spinish" to="spanish" />
  <sugg from="englosh" to="english" />

Now one of the nice things about XSLT 2 is that you can define functions which are visible to XPath. So we can write a fairly simple recursive function to do the search and replace.

  <!-- Take some input and a list of suggestions, and do a recursive search and
       replace over the input until all have been applied. -->
  <xsl:function name="my:replaceSuggestions" as="xs:string">
    <xsl:param name="input" as="xs:string" />
    <xsl:param name="suggestions" as="element(sugg)*" />
    <xsl:variable name="sugg" select="$suggestions[1]" />
    <xsl:sequence select="
      if (count($suggestions) > 0) then
        my:replaceSuggestions(replace($input, $sugg/@from, $sugg/@to), $suggestions[position() > 1])
      else
        $input" />
  </xsl:function>

There are a few things to note:

  • You have to give your function a namespace prefix.
  • The xsl:param‘s are used in order (not by name) to specify the arity of the function.
  • The as attributes aren’t necessary, but the idea of types in XSLT is growing on me. I’d rather know about type problems as soon as possible.
  • The notion of cdr (tail) in XSLT is rather odd: the sequence of all nodes in the sequence whose position is greater than one.
  • Even though I’m using replace(), I’m not taking any precautions against escaping regex characters. I’m certain that these won’t occur given my data.

So finally, we end up with:

  <xsl:variable name="newQuery">
    <xsl:value-of select="my:replaceSuggestions($input, $suggestions)"/>
  </xsl:variable>
  <p class="spelling">
    <xsl:text>Did you mean </xsl:text>
    <em>
      <a href="?q={encode-for-uri($newQuery)}">
        <xsl:value-of select="$newQuery" />
      </a>
    </em>
    <xsl:text>?</xsl:text>
  </p>

I don’t think all this will win any awards for elegance, but it does work. 🙂

Solr's Lucene Source

I’m debugging a plugin for Solr. I’ve just about got the magic voodoo set up so that I can make Eclipse talk to tomcat and stick breakpoints in and so on. But I’ve immediately run into a problem.

Even though Solr itself comes with -sources jars, the bundled copy of lucene that they’ve used doesn’t. Needless to say, this is a bit of a hindrance.

Thankfully, the apache people have set up git.apache.org, which makes this situation a lot less annoying than it could be.

First, I checked out copies of lucene & solr.

$ git clone git://git.apache.org/solr.git
$ git clone git://git.apache.org/lucene.git

Now, I need to go into solr and figure out which version of lucene is in use. Unfortunately, it’s not a released version, it’s a snapshot of the lucene trunk at a point in time.

$ cd/solr
$ git branch -r
  origin/HEAD -> origin/trunk
  origin/branch-1.1
  origin/branch-1.2
  origin/branch-1.3
  origin/sandbox
  origin/solr-ruby-refactoring
  origin/tags/release-1.1.0
  origin/tags/release-1.2.0
  origin/tags/release-1.3.0
  origin/trunk
$ git whatchanged origin/tags/release-1.3.0 lib
…
commit 904e378b7b4fd18232f657c9daf484a3e63b272c
Author: Yonik Seeley <yonik@apache.org>
Date:   Wed Sep 3 20:31:42 2008 +0000
 
    lucene update 2.4-dev r691741
 
    git-svn-id: https://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.3@691758 13f79535-47bb-0310-9956-ffa450edef68
 
:100644 100644 a297b74... 54442dc... M  lib/lucene-analyzers-2.4-dev.jar
:100644 100644 596625b... 5c6e003... M  lib/lucene-core-2.4-dev.jar
:100644 100644 db13718... f0f93a7... M  lib/lucene-highlighter-2.4-dev.jar
:100644 100644 50c8cb4... a599f43... M  lib/lucene-memory-2.4-dev.jar
:100644 100644 aef3fb8... 79feaef... M  lib/lucene-queries-2.4-dev.jar
:100644 100644 1c733b9... 440fa4e... M  lib/lucene-snowball-2.4-dev.jar
:100644 100644 0195fa2... b5ff08b... M  lib/lucene-spellchecker-2.4-dev.jar
…

So, the last change to lucene was taking a copy of r691741 of lucene’s trunk. So, lets go over there. And see what that looks like.

$ cd/lucene
$ git log --grep=691741

Except that doesn’t return anything. Because there was no lucene commit at that revision in the original repository (it was something to do with geronimo). So we need to search backwards for the commit nearest to that revision. Thankfully, git svn includes the original subversion revision numbers of each commit.

$ cd/lucene
$ git log | perl -lne 'if (m/git-svn-id:.*@(\d+)/ && $1 <= 691741){print $1; exit}'
691694

So now we can go back and find the git commit id that corresponds.

$ cd/lucene
$ git log --grep=691694
commit 71afff2cebd022fe63bdf2ec4b87aaa0cee41dc8
Author: Michael McCandless <mikemccand@apache.org>
Date:   Wed Sep 3 17:34:29 2008 +0000
 
    LUCENE-1374: fix test case to close reader/writer in try/finally; add assert b!=null in RAMOutputStream.writeBytes (matches FSIndexOutput which hits NPE)
 
    git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@691694 13f79535-47bb-0310-9956-ffa450edef68

Hurrah! Now I can checkout the same version of Lucene that’s in Solr. But, probably more useful for Eclipse, is just to zip it up somewhere.

$ cd/lucene
$ git archive --format=zip 71afff2 >/tmp/lucene-2.4-r691741.zip

Excellent. Now I can resume my debugging session. 🙂

NB: I could have just used subversion to check out the correct revision of Lucene. But, I find it quicker to use git to clone the repository, and I get the added benefit that I now have the whole lucene history available. So I can quickly see why something was changed.

Character Encodings Bite Again

A colleague gave me a nudge today. “This page doesn’t validate because of an encoding error”. It was fairly simple: the string “Jiménez” contained a single byte—Latin1. Ooops. It turned out that we were generating the page as ISO-8859-1 instead of UTF-8 (which is what the page had been declared as in the HTML).

So, which bit of Spring WebMVC sets the character encoding? A bit of poking around in the debugger didn’t pop up any obvious extension point. So we stuck this in our Controller.

  response.setContentType("UTF-8");

This worked, but it’s pretty awful having to do this in every single controller. So, we poked around a bit more and found CharacterEncodingFilter. Installing this into web.xml made things work.

  <filter>
    <filter-name>CEF</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
      <param-name>encoding</param-name>
      <param-value>UTF-8</param-name>
    </init-param>
    <init-param>
      <param-name>forceEncoding</param-name>
      <param-value>true</param-name>
    </init-param>
  </filter>
  <filter-mapping>
    <filter-name>CEF</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>

Whilst rummaging around in here, we noticed something interesting: the code is set up like a spring bean—it doesn’t read the init-params directly. There’s some crafty code in GenericFilterBean to get this to work. Check it out.

Anyway, that Filter ensured that we output UTF-8 correctly. The forceEncoding parameter ensured that it was set on the response as well as the request.

Incidentally, we figured out where the default value of ISO-8859-1 gets applied. Inside DispatcherServlet.render(), the LocaleResolver gets called, followed by ServletResponse.setLocale(). Tomcat uses the Locale to set the character encoding if it hasn’t been already. Which frankly is a pretty daft thing to do. Being british does not indicate my preference as to Latin-1 vs UTF-8.

Then, the next problem reared its head. The “Jiménez” text was actually a link to search for “Jiménez” in the author field. The URL itself was correctly encoded as q=Jim%C3%A9nez. But when we clicked on it, it didn’t find the original article.

Our search is implemented in Solr. So we immediately had a look at the Solr logs. That clearly had Unicode problems (which is why it wasn’t finding any results). The two bytes of UTF-8 were being interpreted as individual characters (i.e. something was interpreting the URI as ISO-8859-1). Bugger.

Working backwards, we looked at the access logs for Solr. After a brief diversion to enable the access logs for tomcat inside WTP inside Eclipse (oh, the pain of yak shaving), we found that the sender was passing doubly encoded UTF-8. Arrgh.

So we jumped all the way back to the beginning of the search, back in the Controller.

  String q = request.getParameter("q");

Looking at q in the debugger, that was also wrong. So at that point, the only thing that could have affected it would be tomcat itself. A quick google turned up the URIEncoding parameter of the HTTP connector. Setting that to UTF-8 in server.xml fixed our search problem by making getParameter return the correct string.

I have no idea why tomcat doesn’t just listen to the request.setContentType() that the CharacterEncodingFilter performs, but there you go.

So, the lessons are:

  1. Use CharacterEncodingFilter with Spring WebMVC to get the correct output encoding (and input encoding for POST requests).
  2. Always configure tomcat to use UTF-8 for interpreting URI query strings.
  3. Always include some test data with accents to ensure it goes through your system cleanly.