<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jabbering Giraffe &#187; solr</title>
	<atom:link href="http://happygiraffe.net/blog/tag/solr/feed/" rel="self" type="application/rss+xml" />
	<link>http://happygiraffe.net/blog</link>
	<description></description>
	<lastBuildDate>Tue, 07 Feb 2012 20:49:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/>
	<atom:link rel="hub" href="http://superfeedr.com/hubbub"/>
		<item>
		<title>Book review: Solr 1.4 Enterprise Search Server</title>
		<link>http://happygiraffe.net/blog/2009/10/20/book-review-solr-1-4-enterprise-search-server/</link>
		<comments>http://happygiraffe.net/blog/2009/10/20/book-review-solr-1-4-enterprise-search-server/#comments</comments>
		<pubDate>Tue, 20 Oct 2009 19:25:21 +0000</pubDate>
		<dc:creator>Dominic Mitchell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[book]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://happygiraffe.net/blog/?p=1630</guid>
		<description><![CDATA[I was recently offered a review copy of Solr 1.4 Enterprise Search Server (thanks to Swati Iyer). Whilst this is most fortuitous, I only wish I&#8217;d had this a month or two ago, when I was working fairly heavily on &#8230; <a href="http://happygiraffe.net/blog/2009/10/20/book-review-solr-1-4-enterprise-search-server/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.packtpub.com/solr-1-4-enterprise-search-server?utm_source=happygiraffe.net&#038;utm_medium=bookrev&#038;utm_content=blog&#038;utm_campaign=mdb_000941"><img src="http://happygiraffe.net/blog/wp-content/uploads/2009/10/solr-book-thumb.png" alt="solr-book-thumb.png" border="0" width="100" height="123" align="right" style="padding: 0.5em" /></a></p>
<p>I was recently offered a review copy of <a href="http://www.packtpub.com/solr-1-4-enterprise-search-server?utm_source=happygiraffe.net&#038;utm_medium=bookrev&#038;utm_content=blog&#038;utm_campaign=mdb_000941">Solr 1.4 Enterprise Search Server</a> (thanks to <a href="http://www.linkedin.com/pub/swati-iyer/13/309/804">Swati Iyer</a>).  Whilst this is most fortuitous, I only wish I&#8217;d had this a month or two ago, when I was working fairly heavily on a <a href="http://lucene.apache.org/solr/">Solr</a> based project at <code>$OLDWORK</code>.  Still, I&#8217;ll be able to judge whether or not the book <em>would</em> have been useful. <img src='http://happygiraffe.net/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>First, some background.  Normally, Solr is documented through its <a href="http://wiki.apache.org/solr/FrontPage">wiki</a>.  As wikis go, it&#8217;s well maintained and informative.  But it suffers both from a lack of narrative structure and from being completely online.  The latter point really hit home when I was at ApacheCon 2008, in a Solr training class, and couldn&#8217;t get at the documentation.  So, a book covering Solr <em>has</em> to be a good idea.</p>
<p>Even though this book covers Solr 1.4, most of it is still applicable to earlier versions (my experience is all with 1.3).  This is handy, seeing as Solr 1.4 isn&#8217;t released yet (and hence not yet in <a href="http://repo2.maven.org/maven2/org/apache/solr">central</a>).  Hopefully, it should be any day now, seeing as the version numbers have been bumped in svn (<a href="https://svn.apache.org/viewvc?view=revision&#038;revision=824893">r824893</a>).</p>
<p>The first nice thing about this book is simply that it&#8217;s not a massive tome.  At only 317pp, it&#8217;s really quite approachable.  When you open it, the writing is in a friendly, conversational style.</p>
<p>The book starts with a brief introduction to solr and lucene, before moving on to installation.  One thing I found unusual was the use of comparisons to relational database technology.  These continue in a few places through the book.  Perhaps I&#8217;m so used to search that I don&#8217;t need this.  But given that the focus is on “enterprise,” it&#8217;s quite likely that&#8217;s the best angle to pull in the target audience.  The chapter rounds off with a quick walkthrough of loading and querying data.  It&#8217;s good to see something practical even at this point.</p>
<p>With that out of the way, the discussion moves to the absolute bedrock of solr: the schema.  Defining what data you have and how you want to index and search it is of crucial importance.  Particularly useful is the advice to play with Solr&#8217;s analysis tool, in order to understand how the fields you define <em>actually</em> work.  Whilst the explanations of what the schema is and how to design a good one are clear, it&#8217;s still likely that this is a chapter you&#8217;ll be revisiting as you get to know both Solr and your data more.</p>
<p>This chapter also introduces the data set you&#8217;ll work with through the book: the <a href="http://www.musicbrainz.org">MusicBrainz</a> data.  This isn&#8217;t an obvious choice for testing out a search engine (gutenberg?  shakespeare?), but it is fun.  And where it doesn&#8217;t fully exercise Solr, this is pointed out.</p>
<p>Next we move on to how to get your data into Solr.  This assumes a level of familiarity with the command line, in order to use <a href="http://curl.haxx.se/">curl</a>.  As well as the &#8220;normal&#8221; method of POSTing XML documents into Solr, this also covers uploading CSV files and the DataImportHandler.  The latter is a contrib module which I hadn&#8217;t seen before.  This lets you pull your data in to solr (instead of pushing) from any <abbr title="Java DataBase Connectivity">JDBC</abbr> data source.  The only missing thing is something that I spent a while getting right: importing XML data into Solr.  There is a confusion which stems from the fact that you can post XML into solr, but not arbitrary XML.  If you want to put an arbitrary XML document in a Solr field, you have to escape it and nest it into a solr document.  It&#8217;s ugly, but can be made to work.</p>
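<p>To make that concrete, here is a rough sketch of what such a nested document might look like when POSTed to Solr&#8217;s update handler.  The field names <code>id</code> and <code>raw_xml</code> are invented for illustration and would have to exist in your own schema; the point is only that the inner markup is entity-escaped, so Solr receives it as plain text.</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">&lt;add&gt;
  &lt;doc&gt;
    &lt;!-- Hypothetical field names; adjust to match your schema.xml. --&gt;
    &lt;field name=&quot;id&quot;&gt;article-1&lt;/field&gt;
    &lt;!-- The nested document is escaped, so Solr sees text, not child elements. --&gt;
    &lt;field name=&quot;raw_xml&quot;&gt;&amp;lt;article&amp;gt;&amp;lt;title&amp;gt;Giraffes&amp;lt;/title&amp;gt;&amp;lt;/article&amp;gt;&lt;/field&gt;
  &lt;/doc&gt;
&lt;/add&gt;</pre></div></div>

<p>Whatever reads the field back out then has to parse the text as XML again.</p>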
<p>Once you&#8217;ve got the data in, what about getting it out again?  The chapter on &#8220;basic querying&#8221; covers the myriad of ways you can alter Solr&#8217;s output.  But the basic query stuff is handled well.  In particular, it has nice, clear explanations of Solr&#8217;s variant of &#8220;data structure as XML&#8221; as well as the full query syntax.  There is also detail on the <code>solrconfig.xml</code> which I completely managed to miss in six months of staring at it.  Oh well.</p>
<p>At this point, the book has the basics covered.  You could stop here and get along very well with Solr.  But this is also the bit where the interesting parts start to appear:</p>
<ul>
<li>  There&#8217;s coverage of function queries, which allow you to manipulate the rankings of results in various ways (e.g. ranking newer content higher).  I confess that the function queries looked interesting, but I haven&#8217;t used them and the descriptions in the book swiftly go past my limited maths knowledge.</li>
<li>  The dismax handler is introduced, which gives a far simpler query interface to your users.  This is something I wish I&#8217;d paid closer attention to in my last project.</li>
<li>  Faceting is covered in detail.  This is one of Solr&#8217;s hidden gems, providing information about the complete set of results without performing a second query.  There&#8217;s also a nice demonstration of using faceting to back up a “suggestions” mechanism.</li>
<li> Highlighting results data.  I could have saved a <em>lot</em> of time by reading this.</li>
<li> Spellchecking (“did you mean”).  Again, the coverage highlights several pitfalls you need to be aware of.</li>
</ul>
</ul>
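<p>As a rough illustration of how a few of these fit together (this is not taken from the book), a request handler in <code>solrconfig.xml</code> might set dismax, a function-query boost, faceting and highlighting as defaults.  The handler name and the fields <code>title</code>, <code>body</code>, <code>category</code> and <code>date</code> are assumptions made up for the example.</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">&lt;requestHandler name=&quot;dismax&quot; class=&quot;solr.SearchHandler&quot;&gt;
  &lt;lst name=&quot;defaults&quot;&gt;
    &lt;!-- Use the dismax query parser over two (hypothetical) fields. --&gt;
    &lt;str name=&quot;defType&quot;&gt;dismax&lt;/str&gt;
    &lt;str name=&quot;qf&quot;&gt;title^2.0 body&lt;/str&gt;
    &lt;!-- Function query boost: favour documents with a more recent date. --&gt;
    &lt;str name=&quot;bf&quot;&gt;recip(rord(date),1,1000,1000)&lt;/str&gt;
    &lt;!-- Return facet counts for one field alongside the results. --&gt;
    &lt;str name=&quot;facet&quot;&gt;true&lt;/str&gt;
    &lt;str name=&quot;facet.field&quot;&gt;category&lt;/str&gt;
    &lt;!-- Highlight matches found in the title field. --&gt;
    &lt;str name=&quot;hl&quot;&gt;true&lt;/str&gt;
    &lt;str name=&quot;hl.fl&quot;&gt;title&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;</pre></div></div>

<p>Any of these can equally be passed as request parameters instead, which is probably the easier way to experiment.</p>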
<p>Then comes the best surprise of all.  A chapter on deployment.  So many books forget this crucial step.  So, there is coverage of logging, backups, monitoring and security.  It might have been nice to also mention integrating it into the system startup sequence.</p>
<p>The remaining chapters cover client integration (with Java, PHP, JavaScript and Rails) and how to scale Solr.  Though I never needed the scaling for my project, the advice given is still useful.  For example, do you <em>need</em> to make every field stored? (doing so can increase disk usage)  The coverage of running Solr on <a href="http://aws.amazon.com/ec2/">EC²</a> also looked rather useful.</p>
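<p>To illustrate the stored-field question, a <code>schema.xml</code> fragment along these lines keeps one field retrievable and another searchable only.  The field names are invented, and <code>text</code> is assumed to be a field type defined elsewhere in your schema.</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">&lt;fields&gt;
  &lt;!-- Searchable, and returned in results. --&gt;
  &lt;field name=&quot;title&quot; type=&quot;text&quot; indexed=&quot;true&quot; stored=&quot;true&quot; /&gt;
  &lt;!-- Searchable only: the original text is not kept in the index. --&gt;
  &lt;field name=&quot;body&quot; type=&quot;text&quot; indexed=&quot;true&quot; stored=&quot;false&quot; /&gt;
&lt;/fields&gt;</pre></div></div>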
<p>Perhaps the one thing that I&#8217;m not entirely happy with is the index (though I acknowledge a good index is hard to achieve).  Some common terms I looked up weren&#8217;t present.</p>
<p>Overall, I&#8217;m really pleased by this book.  Given my own experiences figuring out solr through the school of hard debugging sessions, I can say that this would have made my life a great deal easier.  If you want to use Solr, you&#8217;ll save yourself time with this book.</p>
]]></content:encoded>
			<wfw:commentRss>http://happygiraffe.net/blog/2009/10/20/book-review-solr-1-4-enterprise-search-server/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Search &amp; Replace in XSLT 2</title>
		<link>http://happygiraffe.net/blog/2009/07/23/search-replace-in-xslt-2/</link>
		<comments>http://happygiraffe.net/blog/2009/07/23/search-replace-in-xslt-2/#comments</comments>
		<pubDate>Thu, 23 Jul 2009 21:07:45 +0000</pubDate>
		<dc:creator>Dominic Mitchell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://happygiraffe.net/blog/?p=1560</guid>
		<description><![CDATA[For a project at $WORK, we want to implement Solr&#8217;s spelling suggestions. When you ask solr to provide suggestions, it comes back with something like this (the original search was spinish englosh): &#60;response&#62; … &#60;lst name=&#34;spellcheck&#34;&#62; &#60;lst name=&#34;suggestions&#34;&#62; &#60;lst name=&#34;spinish&#34;&#62; &#8230; <a href="http://happygiraffe.net/blog/2009/07/23/search-replace-in-xslt-2/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>For a project at $WORK, we want to implement Solr&#8217;s spelling suggestions.  When you ask solr to provide suggestions, it comes back with something like this (the original search was <em>spinish englosh</em>):</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;response<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    …
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;lst</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;spellcheck&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;lst</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;suggestions&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;lst</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;spinish&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;numFound&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>1<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;startOffset&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>19<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;endOffset&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>26<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;arr</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;suggestion&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
            <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;str<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>spanish<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/str<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/arr<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/lst<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;lst</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;englosh&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;numFound&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>1<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;startOffset&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>27<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;endOffset&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>34<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;arr</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;suggestion&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
            <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;str<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>english<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/str<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/arr<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/lst<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;lst</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;spinish&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;numFound&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>1<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;startOffset&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>60<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;int</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;endOffset&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>67<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/int<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;arr</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;suggestion&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
            <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;str<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>spanish<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/str<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
          <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/arr<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/lst<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        …
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/lst<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/lst<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/response<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p>What we want to do is transform this into:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Did you mean <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;?q=spanish%20english&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>spanish english<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>?<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p>As it turns out, this is a non-trivial task in XSLT.  It&#8217;s doable, but significantly easier in <a href="http://www.w3.org/TR/xslt20/">XSLT 2</a>, since you are less restricted by the rules on result-tree-fragments.</p>
<p>The first problem to solve is getting the data into a sensible data structure for further processing.  In a real language, I&#8217;d want a list of <code>(from, to)</code> pairs.  In XSLT, sequences are always flat.  The way to simulate this is to construct an element for the pair.</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:variable</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;suggRoot&quot;</span> <span style="color: #000066;">select</span>=<span style="color: #ff0000;">&quot;/response/lst[@name='spellcheck']/lst[@name='suggestions']&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:variable</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;suggestions&quot;</span> <span style="color: #000066;">as</span>=<span style="color: #ff0000;">&quot;element(sugg)*&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:for-each</span> <span style="color: #000066;">select</span>=<span style="color: #ff0000;">&quot;distinct-values($suggRoot/lst/@name)&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
      <span style="color: #808080; font-style: italic;">&lt;!-- Pick the first suggestion for this name. --&gt;</span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;sugg</span> <span style="color: #000066;">from</span>=<span style="color: #ff0000;">&quot;{.}&quot;</span> <span style="color: #000066;">to</span>=<span style="color: #ff0000;">&quot;{($suggRoot/lst[@name=current()])[1]/arr[@name='suggestion']/str[1]}&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/xsl:for-each<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/xsl:variable<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p>Note the commented caveat: we always pick the first suggestion for any given name.  From my (small) experience, this isn&#8217;t an issue as the suggestions for a given word are always identical.</p>
<p>This results in <code>$suggestions</code> containing a sequence of elements looking like this.</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;sugg</span> <span style="color: #000066;">from</span>=<span style="color: #ff0000;">&quot;spinish&quot;</span> <span style="color: #000066;">to</span>=<span style="color: #ff0000;">&quot;spanish&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;sugg</span> <span style="color: #000066;">from</span>=<span style="color: #ff0000;">&quot;englosh&quot;</span> <span style="color: #000066;">to</span>=<span style="color: #ff0000;">&quot;english&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span></pre></div></div>

<p>Now one of the nice things about XSLT 2 is that you can define functions which are visible to XPath.  So we can write a fairly simple recursive function to do the search and replace.</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">  <span style="color: #808080; font-style: italic;">&lt;!-- Take some input and a list of suggestions, and do a recursive search and</span>
<span style="color: #808080; font-style: italic;">       replace over the input until all have been applied. --&gt;</span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:function</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;my:replaceSuggestions&quot;</span> <span style="color: #000066;">as</span>=<span style="color: #ff0000;">&quot;xs:string&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:param</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;input&quot;</span> <span style="color: #000066;">as</span>=<span style="color: #ff0000;">&quot;xs:string&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:param</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;suggestions&quot;</span> <span style="color: #000066;">as</span>=<span style="color: #ff0000;">&quot;element(sugg)*&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:variable</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;sugg&quot;</span> <span style="color: #000066;">select</span>=<span style="color: #ff0000;">&quot;$suggestions[1]&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:sequence</span> <span style="color: #000066;">select</span>=<span style="color: #ff0000;">&quot;</span>
<span style="color: #009900;">      if (count($suggestions) &gt; 0) then</span>
<span style="color: #009900;">        my:replaceSuggestions(replace($input, $sugg/@from, $sugg/@to), $suggestions[position() &gt; 1])</span>
<span style="color: #009900;">      else</span>
<span style="color: #009900;">        $input&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/xsl:function<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p>There are a few things to note:</p>
<ul>
<li>You have to give your function a namespace prefix.</li>
<li>The <code>xsl:param</code> elements are matched by position (not by name), and their number sets the arity of the function.</li>
<li>The <code>as</code> attributes aren&#8217;t necessary, but the idea of types in XSLT is growing on me.  I&#8217;d rather know about type problems as soon as possible.</li>
<li>The notion of <a href="http://en.wikipedia.org/wiki/CAR_and_CDR">cdr</a> (tail) in XSLT is rather odd: the sequence of all nodes in the sequence whose position is greater than one.</li>
<li>Even though I&#8217;m using <a href="http://www.w3.org/TR/xpath-functions/#func-replace"><code>replace()</code></a>, I&#8217;m not escaping regex characters in the search terms.  I&#8217;m certain that these won&#8217;t occur given my data.</li>
</ul>
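<p>For data where regex characters <em>could</em> turn up, one option is a pair of small helper functions.  This is only a sketch: the <code>my</code> namespace URI and the names <code>my:escapeRegex</code> and <code>my:escapeReplacement</code> are made up for illustration.  It also shows the namespace declarations that the <code>my:</code> and <code>xs:</code> prefixes rely on.</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">&lt;xsl:stylesheet version=&quot;2.0&quot;
    xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;
    xmlns:xs=&quot;http://www.w3.org/2001/XMLSchema&quot;
    xmlns:my=&quot;http://example.com/my-functions&quot;
    exclude-result-prefixes=&quot;xs my&quot;&gt;

  &lt;!-- Backslash-escape each regex metacharacter so the text matches literally. --&gt;
  &lt;xsl:function name=&quot;my:escapeRegex&quot; as=&quot;xs:string&quot;&gt;
    &lt;xsl:param name=&quot;text&quot; as=&quot;xs:string&quot; /&gt;
    &lt;xsl:sequence select=&quot;replace($text, '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1')&quot; /&gt;
  &lt;/xsl:function&gt;

  &lt;!-- In a replacement string only backslash and dollar are special. --&gt;
  &lt;xsl:function name=&quot;my:escapeReplacement&quot; as=&quot;xs:string&quot;&gt;
    &lt;xsl:param name=&quot;text&quot; as=&quot;xs:string&quot; /&gt;
    &lt;xsl:sequence select=&quot;replace($text, '(\\|\$)', '\\$1')&quot; /&gt;
  &lt;/xsl:function&gt;

&lt;/xsl:stylesheet&gt;</pre></div></div>

<p>The call inside <code>my:replaceSuggestions</code> would then become <code>replace($input, my:escapeRegex($sugg/@from), my:escapeReplacement($sugg/@to))</code>.</p>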
<p>So finally, we end up with:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:variable</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;newQuery&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:value-of</span> <span style="color: #000066;">select</span>=<span style="color: #ff0000;">&quot;my:replaceSuggestions($input, $suggestions)&quot;</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/xsl:variable<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;spelling&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:text<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Did you mean <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/xsl:text<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;em<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;?q={encode-for-uri($newQuery)}&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:value-of</span> <span style="color: #000066;">select</span>=<span style="color: #ff0000;">&quot;$newQuery&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/em<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;xsl:text<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>?<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/xsl:text<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p>I don&#8217;t think all this will win any awards for elegance, but it does work. <img src='http://happygiraffe.net/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://happygiraffe.net/blog/2009/07/23/search-replace-in-xslt-2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Solr&#039;s Lucene Source</title>
		<link>http://happygiraffe.net/blog/2009/07/16/solrs-lucene-source/</link>
		<comments>http://happygiraffe.net/blog/2009/07/16/solrs-lucene-source/#comments</comments>
		<pubDate>Thu, 16 Jul 2009 22:15:19 +0000</pubDate>
		<dc:creator>Dominic Mitchell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[eclipse]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://happygiraffe.net/blog/?p=1558</guid>
		<description><![CDATA[I&#8217;m debugging a plugin for Solr. I&#8217;ve just about got the magic voodoo set up so that I can make Eclipse talk to tomcat and stick breakpoints in and so on. But I&#8217;ve immediately run into a problem. Even though &#8230; <a href="http://happygiraffe.net/blog/2009/07/16/solrs-lucene-source/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m debugging a plugin for <a href="http://lucene.apache.org/solr/">Solr</a>.  I&#8217;ve just about got the magic voodoo set up so that I can make Eclipse talk to tomcat and stick breakpoints in and so on.  But I&#8217;ve immediately run into a problem.</p>
<p>Even though Solr itself comes with <code>-sources</code> jars, the bundled copy of lucene that they&#8217;ve used <em>doesn&#8217;t</em>.  Needless to say, this is a bit of a hindrance.</p>
<p>Thankfully, the apache people have set up <a href="http://git.apache.org/">git.apache.org</a>, which makes this situation a lot less annoying than it could be.</p>
<p>First, I checked out copies of lucene &#038; solr.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #c20cb9; font-weight: bold;">git</span> clone <span style="color: #c20cb9; font-weight: bold;">git</span>:<span style="color: #000000; font-weight: bold;">//</span>git.apache.org<span style="color: #000000; font-weight: bold;">/</span>solr.git
$ <span style="color: #c20cb9; font-weight: bold;">git</span> clone <span style="color: #c20cb9; font-weight: bold;">git</span>:<span style="color: #000000; font-weight: bold;">//</span>git.apache.org<span style="color: #000000; font-weight: bold;">/</span>lucene.git</pre></div></div>

<p>Now, I need to go into solr and figure out <em>which</em> version of lucene is in use.  Unfortunately, it&#8217;s not a released version, it&#8217;s a snapshot of the lucene trunk at a point in time.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #7a0874; font-weight: bold;">cd</span> …<span style="color: #000000; font-weight: bold;">/</span>solr
$ <span style="color: #c20cb9; font-weight: bold;">git</span> branch <span style="color: #660033;">-r</span>
  origin<span style="color: #000000; font-weight: bold;">/</span>HEAD -<span style="color: #000000; font-weight: bold;">&gt;</span> origin<span style="color: #000000; font-weight: bold;">/</span>trunk
  origin<span style="color: #000000; font-weight: bold;">/</span>branch-<span style="color: #000000;">1.1</span>
  origin<span style="color: #000000; font-weight: bold;">/</span>branch-<span style="color: #000000;">1.2</span>
  origin<span style="color: #000000; font-weight: bold;">/</span>branch-<span style="color: #000000;">1.3</span>
  origin<span style="color: #000000; font-weight: bold;">/</span>sandbox
  origin<span style="color: #000000; font-weight: bold;">/</span>solr-ruby-refactoring
  origin<span style="color: #000000; font-weight: bold;">/</span>tags<span style="color: #000000; font-weight: bold;">/</span>release-1.1.0
  origin<span style="color: #000000; font-weight: bold;">/</span>tags<span style="color: #000000; font-weight: bold;">/</span>release-1.2.0
  origin<span style="color: #000000; font-weight: bold;">/</span>tags<span style="color: #000000; font-weight: bold;">/</span>release-1.3.0
  origin<span style="color: #000000; font-weight: bold;">/</span>trunk
$ <span style="color: #c20cb9; font-weight: bold;">git</span> whatchanged origin<span style="color: #000000; font-weight: bold;">/</span>tags<span style="color: #000000; font-weight: bold;">/</span>release-1.3.0 lib
…
commit 904e378b7b4fd18232f657c9daf484a3e63b272c
Author: Yonik Seeley <span style="color: #000000; font-weight: bold;">&lt;</span>yonik<span style="color: #000000; font-weight: bold;">@</span>apache.org<span style="color: #000000; font-weight: bold;">&gt;</span>
Date:   Wed Sep <span style="color: #000000;">3</span> <span style="color: #000000;">20</span>:<span style="color: #000000;">31</span>:<span style="color: #000000;">42</span> <span style="color: #000000;">2008</span> +0000
&nbsp;
    lucene update <span style="color: #000000;">2.4</span>-dev r691741
&nbsp;
    git-svn-id: https:<span style="color: #000000; font-weight: bold;">//</span>svn.apache.org<span style="color: #000000; font-weight: bold;">/</span>repos<span style="color: #000000; font-weight: bold;">/</span>asf<span style="color: #000000; font-weight: bold;">/</span>lucene<span style="color: #000000; font-weight: bold;">/</span>solr<span style="color: #000000; font-weight: bold;">/</span>branches<span style="color: #000000; font-weight: bold;">/</span>branch-<span style="color: #000000;">1.3</span><span style="color: #000000; font-weight: bold;">@</span>691758 13f79535-47bb-0310-<span style="color: #000000;">9956</span>-ffa450edef68
&nbsp;
:<span style="color: #000000;">100644</span> <span style="color: #000000;">100644</span> a297b74... 54442dc... M  lib<span style="color: #000000; font-weight: bold;">/</span>lucene-analyzers-<span style="color: #000000;">2.4</span>-dev.jar
:<span style="color: #000000;">100644</span> <span style="color: #000000;">100644</span> 596625b... 5c6e003... M  lib<span style="color: #000000; font-weight: bold;">/</span>lucene-core-<span style="color: #000000;">2.4</span>-dev.jar
:<span style="color: #000000;">100644</span> <span style="color: #000000;">100644</span> db13718... f0f93a7... M  lib<span style="color: #000000; font-weight: bold;">/</span>lucene-highlighter-<span style="color: #000000;">2.4</span>-dev.jar
:<span style="color: #000000;">100644</span> <span style="color: #000000;">100644</span> 50c8cb4... a599f43... M  lib<span style="color: #000000; font-weight: bold;">/</span>lucene-memory-<span style="color: #000000;">2.4</span>-dev.jar
:<span style="color: #000000;">100644</span> <span style="color: #000000;">100644</span> aef3fb8... 79feaef... M  lib<span style="color: #000000; font-weight: bold;">/</span>lucene-queries-<span style="color: #000000;">2.4</span>-dev.jar
:<span style="color: #000000;">100644</span> <span style="color: #000000;">100644</span> 1c733b9... 440fa4e... M  lib<span style="color: #000000; font-weight: bold;">/</span>lucene-snowball-<span style="color: #000000;">2.4</span>-dev.jar
:<span style="color: #000000;">100644</span> <span style="color: #000000;">100644</span> 0195fa2... b5ff08b... M  lib<span style="color: #000000; font-weight: bold;">/</span>lucene-spellchecker-<span style="color: #000000;">2.4</span>-dev.jar
…</pre></div></div>

<p>So, the last change to lucene was taking a copy of r691741 of lucene&#8217;s trunk.  So, let&#8217;s go over there and see what that looks like.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #7a0874; font-weight: bold;">cd</span> …<span style="color: #000000; font-weight: bold;">/</span>lucene
$ <span style="color: #c20cb9; font-weight: bold;">git</span> log <span style="color: #660033;">--grep</span>=<span style="color: #000000;">691741</span></pre></div></div>

<p>Except that doesn&#8217;t return anything.  Because there was no lucene commit at that revision in the original repository (it was <a href="http://svn.apache.org/viewvc?view=rev&#038;revision=691741">something to do with geronimo</a>).  So we need to search backwards for the commit nearest to that revision.  Thankfully, <a href="http://www.kernel.org/pub/software/scm/git/docs/git-svn.html">git svn</a> includes the original subversion revision numbers of each commit.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #7a0874; font-weight: bold;">cd</span> …<span style="color: #000000; font-weight: bold;">/</span>lucene
$ <span style="color: #c20cb9; font-weight: bold;">git</span> log <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">perl</span> <span style="color: #660033;">-lne</span> <span style="color: #ff0000;">'if (m/git-svn-id:.*@(\d+)/ &amp;&amp; $1 &lt;= 691741){print $1; exit}'</span>
<span style="color: #000000;">691694</span></pre></div></div>

<p>So now we can go back and find the git commit id that corresponds.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #7a0874; font-weight: bold;">cd</span> …<span style="color: #000000; font-weight: bold;">/</span>lucene
$ <span style="color: #c20cb9; font-weight: bold;">git</span> log <span style="color: #660033;">--grep</span>=<span style="color: #000000;">691694</span>
commit 71afff2cebd022fe63bdf2ec4b87aaa0cee41dc8
Author: Michael McCandless <span style="color: #000000; font-weight: bold;">&lt;</span>mikemccand<span style="color: #000000; font-weight: bold;">@</span>apache.org<span style="color: #000000; font-weight: bold;">&gt;</span>
Date:   Wed Sep <span style="color: #000000;">3</span> <span style="color: #000000;">17</span>:<span style="color: #000000;">34</span>:<span style="color: #000000;">29</span> <span style="color: #000000;">2008</span> +0000
&nbsp;
    LUCENE-<span style="color: #000000;">1374</span>: fix <span style="color: #7a0874; font-weight: bold;">test</span> <span style="color: #000000; font-weight: bold;">case</span> to close reader<span style="color: #000000; font-weight: bold;">/</span>writer <span style="color: #000000; font-weight: bold;">in</span> try<span style="color: #000000; font-weight: bold;">/</span>finally; add assert b<span style="color: #000000; font-weight: bold;">!</span>=null <span style="color: #000000; font-weight: bold;">in</span> RAMOutputStream.writeBytes <span style="color: #7a0874; font-weight: bold;">&#40;</span>matches FSIndexOutput <span style="color: #c20cb9; font-weight: bold;">which</span> hits NPE<span style="color: #7a0874; font-weight: bold;">&#41;</span>
&nbsp;
    git-svn-id: https:<span style="color: #000000; font-weight: bold;">//</span>svn.apache.org<span style="color: #000000; font-weight: bold;">/</span>repos<span style="color: #000000; font-weight: bold;">/</span>asf<span style="color: #000000; font-weight: bold;">/</span>lucene<span style="color: #000000; font-weight: bold;">/</span>java<span style="color: #000000; font-weight: bold;">/</span>trunk<span style="color: #000000; font-weight: bold;">@</span>691694 13f79535-47bb-0310-<span style="color: #000000;">9956</span>-ffa450edef68</pre></div></div>

<p>Hurrah!  Now I can check out the same version of Lucene that&#8217;s in Solr.  But probably more useful for Eclipse is just to zip it up somewhere.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #7a0874; font-weight: bold;">cd</span> …<span style="color: #000000; font-weight: bold;">/</span>lucene
$ <span style="color: #c20cb9; font-weight: bold;">git</span> archive <span style="color: #660033;">--format</span>=<span style="color: #c20cb9; font-weight: bold;">zip</span> 71afff2 <span style="color: #000000; font-weight: bold;">&gt;/</span>tmp<span style="color: #000000; font-weight: bold;">/</span>lucene-<span style="color: #000000;">2.4</span>-r691741.zip</pre></div></div>

<p>Excellent.  Now I can resume my debugging session. <img src='http://happygiraffe.net/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>NB: I <em>could</em> have just used subversion to check out the correct revision of Lucene.  But, I find it quicker to use git to clone the repository, and I get the added benefit that I now have the whole lucene history available.  So I can quickly see <em>why</em> something was changed.</p>
]]></content:encoded>
			<wfw:commentRss>http://happygiraffe.net/blog/2009/07/16/solrs-lucene-source/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Character Encodings Bite Again</title>
		<link>http://happygiraffe.net/blog/2008/07/31/character-encodings-bite-again/</link>
		<comments>http://happygiraffe.net/blog/2008/07/31/character-encodings-bite-again/#comments</comments>
		<pubDate>Thu, 31 Jul 2008 23:17:06 +0000</pubDate>
		<dc:creator>Dominic Mitchell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[servlets]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[tomcat]]></category>
		<category><![CDATA[unicode]]></category>

		<guid isPermaLink="false">http://happygiraffe.net/2008/07/31/character-encodings-bite-again/</guid>
		<description><![CDATA[A colleague gave me a nudge today. &#8220;This page doesn&#8217;t validate because of an encoding error&#8221;. It was fairly simple: the string &#8220;Jiménez&#8221; contained a single byte&#8212;Latin1. Ooops. It turned out that we were generating the page as ISO-8859-1 instead &#8230; <a href="http://happygiraffe.net/blog/2008/07/31/character-encodings-bite-again/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A colleague gave me a nudge today.  &#8220;This page doesn&#8217;t validate because of an encoding error&#8221;.  It was fairly simple: the string &#8220;Jiménez&#8221; contained a single byte&#8212;Latin1.  Ooops.  It turned out that we were generating the page as <a href="http://en.wikipedia.org/wiki/ISO/IEC_8859-1"><span class="caps">ISO</span>-8859-1</a> instead of <a href="http://en.wikipedia.org/wiki/UTF-8"><span class="caps">UTF</span>-8</a> (which is what the page had been declared as in the <span class="caps">HTML</span>).</p>
<p>So, which bit of <a href="http://static.springframework.org/spring/docs/2.5.x/reference/mvc.html">Spring WebMVC</a> sets the character encoding?  A bit of poking around in the debugger didn&#8217;t pop up any obvious extension point.  So we stuck this in our <a href="http://static.springframework.org/spring/docs/2.5.x/api/org/springframework/stereotype/Controller.html">Controller</a>.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">  response.<span style="color: #006633;">setCharacterEncoding</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>This worked, but it&#8217;s pretty awful having to do this in every single controller.  So, we poked around a bit more and found <a href="http://static.springframework.org/spring/docs/2.5.x/api/org/springframework/web/filter/CharacterEncodingFilter.html">CharacterEncodingFilter</a>.  Installing this into <code>web.xml</code> made things work.</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">  <span style="color: #ddbb00;">&lt;</span>filter<span style="color: #ddbb00;">&gt;</span>
    <span style="color: #ddbb00;">&lt;</span>filter-name<span style="color: #ddbb00;">&gt;</span>CEF<span style="color: #ddbb00;">&lt;</span>/filter-name<span style="color: #ddbb00;">&gt;</span>
    <span style="color: #ddbb00;">&lt;</span>filter-class<span style="color: #ddbb00;">&gt;</span>org.springframework.web.filter.CharacterEncodingFilter<span style="color: #ddbb00;">&lt;</span>/filter-class<span style="color: #ddbb00;">&gt;</span>
    <span style="color: #ddbb00;">&lt;</span>init-param<span style="color: #ddbb00;">&gt;</span>
      <span style="color: #ddbb00;">&lt;</span>param-name<span style="color: #ddbb00;">&gt;</span>encoding<span style="color: #ddbb00;">&lt;</span>/param-name<span style="color: #ddbb00;">&gt;</span>
      <span style="color: #ddbb00;">&lt;</span>param-value<span style="color: #ddbb00;">&gt;</span>UTF-8<span style="color: #ddbb00;">&lt;</span>/param-value<span style="color: #ddbb00;">&gt;</span>
    <span style="color: #ddbb00;">&lt;</span>/init-param<span style="color: #ddbb00;">&gt;</span>
    <span style="color: #ddbb00;">&lt;</span>init-param<span style="color: #ddbb00;">&gt;</span>
      <span style="color: #ddbb00;">&lt;</span>param-name<span style="color: #ddbb00;">&gt;</span>forceEncoding<span style="color: #ddbb00;">&lt;</span>/param-name<span style="color: #ddbb00;">&gt;</span>
      <span style="color: #ddbb00;">&lt;</span>param-value<span style="color: #ddbb00;">&gt;</span>true<span style="color: #ddbb00;">&lt;</span>/param-value<span style="color: #ddbb00;">&gt;</span>
    <span style="color: #ddbb00;">&lt;</span>/init-param<span style="color: #ddbb00;">&gt;</span>
  <span style="color: #ddbb00;">&lt;</span>/filter<span style="color: #ddbb00;">&gt;</span>
  <span style="color: #ddbb00;">&lt;</span>filter-mapping<span style="color: #ddbb00;">&gt;</span>
    <span style="color: #ddbb00;">&lt;</span>filter-name<span style="color: #ddbb00;">&gt;</span>CEF<span style="color: #ddbb00;">&lt;</span>/filter-name<span style="color: #ddbb00;">&gt;</span>
    <span style="color: #ddbb00;">&lt;</span>url-pattern<span style="color: #ddbb00;">&gt;</span>/*<span style="color: #ddbb00;">&lt;</span>/url-pattern<span style="color: #ddbb00;">&gt;</span>
  <span style="color: #ddbb00;">&lt;</span>/filter-mapping<span style="color: #ddbb00;">&gt;</span></pre></div></div>

<p>Whilst rummaging around in here, we noticed something interesting: the code is set up like a spring bean&#8212;it doesn&#8217;t read the init-params directly.  There&#8217;s some crafty code in <a href="http://static.springframework.org/spring/docs/2.5.x/api/org/springframework/web/filter/GenericFilterBean.html">GenericFilterBean</a> to get this to work.  Check it out.</p>
<p>Anyway, that Filter ensured that we output <span class="caps">UTF</span>-8 correctly.  The <code>forceEncoding</code> parameter ensured that it was set on the response as well as the request.</p>
<p>Incidentally, we figured out where the default value of <span class="caps">ISO</span>-8859-1 gets applied.  Inside <a href="http://static.springframework.org/spring/docs/2.5.x/api/org/springframework/web/servlet/DispatcherServlet.html#render(org.springframework.web.servlet.ModelAndView,%20javax.servlet.http.HttpServletRequest,%20javax.servlet.http.HttpServletResponse)">DispatcherServlet.render()</a>, the <a href="http://static.springframework.org/spring/docs/2.5.x/api/org/springframework/web/servlet/LocaleResolver.html">LocaleResolver</a> gets called, followed by <a href="http://java.sun.com/j2ee/1.4/docs/api/javax/servlet/ServletResponse.html#setLocale(java.util.Locale)">ServletResponse.setLocale()</a>.  Tomcat uses the Locale to set the character encoding if it hasn&#8217;t been already.  Which frankly is a pretty daft thing to do.  Being British does not indicate my preference as to Latin-1 vs <span class="caps">UTF</span>-8.</p>
<p>Then, the next problem reared its head.  The &#8220;Jiménez&#8221; text was actually a link to search for &#8220;Jiménez&#8221; in the author field.  The <span class="caps">URL</span> itself was correctly encoded as <code>q=Jim%C3%A9nez</code>.  But when we clicked on it, it didn&#8217;t find the original article.</p>
<p>Our search is implemented in <a href="http://lucene.apache.org/solr/">Solr</a>.  So we immediately had a look at the Solr logs.  That clearly had Unicode problems (which is why it wasn&#8217;t finding any results).  The two bytes of <span class="caps">UTF</span>-8 were being interpreted as individual characters (i.e. something was interpreting the <span class="caps">URI</span> as <span class="caps">ISO</span>-8859-1).  Bugger.</p>
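<p>To make that concrete, here&#8217;s a minimal sketch of the same effect using nothing but the JDK&#8217;s <code>java.net.URLDecoder</code>.  The class and method names (<code>DecodeDemo</code>, <code>decode</code>) are made up for illustration, and this is not what tomcat does internally; it just decodes the same query value with two different charsets.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of the application described here.
public class DecodeDemo {
    // Decode a percent-encoded query value using the named charset.
    public static String decode(String raw, String charset)
            throws UnsupportedEncodingException {
        return URLDecoder.decode(raw, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "Jim%C3%A9nez";
        // Read as UTF-8, the two bytes C3 A9 become a single character.
        System.out.println(decode(raw, "UTF-8"));
        // Read as ISO-8859-1, each byte becomes its own character.
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}</pre></div></div>

<p>The first line printed is &#8220;Jiménez&#8221;; the second is &#8220;JimÃ©nez&#8221;, which is the sort of thing that shows up when something reads the <span class="caps">URI</span> as <span class="caps">ISO</span>-8859-1.</p>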
<p>Working backwards, we looked at the access logs for Solr.  After a brief diversion to enable the access logs for tomcat inside <span class="caps">WTP</span> inside Eclipse (oh, the pain of yak shaving), we found that the sender was passing doubly encoded <span class="caps">UTF</span>-8.  Arrgh.</p>
<p>So we jumped all the way back to the beginning of the search, back in the Controller.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">  <span style="color: #003399;">String</span> q <span style="color: #339933;">=</span> request.<span style="color: #006633;">getParameter</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;q&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Looking at <code>q</code> in the debugger, that was also wrong.  So at that point, the only thing that could have affected it would be tomcat itself.  A quick google turned up the <code>URIEncoding</code> parameter of the <a href="http://tomcat.apache.org/tomcat-6.0-doc/config/http.html"><span class="caps">HTTP</span> connector</a>.  Setting that to <code>UTF-8</code> in <code>server.xml</code> fixed our search problem by making <code>getParameter</code> return the correct string.</p>
<p>I have no idea why tomcat doesn&#8217;t just listen to the <code>request.setCharacterEncoding()</code> that the CharacterEncodingFilter performs, but there you go.</p>
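<p>If you can&#8217;t get at <code>server.xml</code>, one commonly seen workaround is to undo the damage by hand.  This is only a sketch, and it is not what we did: it assumes the parameter really was read as <span class="caps">ISO</span>-8859-1, and it assumes Java 7 or later for <code>StandardCharsets</code>.  The names <code>ParamRepair</code> and <code>repair</code> are hypothetical.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the application described here.
public class ParamRepair {
    // Undo a Latin-1 misreading of UTF-8 bytes: turn the string back
    // into the original bytes, then decode those bytes as UTF-8.
    public static String repair(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The input is "Jim", then U+00C3 and U+00A9, then "nez": the
        // misread form of the UTF-8 bytes C3 A9.
        System.out.println(repair("Jim\u00c3\u00a9nez"));
    }
}</pre></div></div>

<p>Setting <code>URIEncoding</code> is still the better fix.  A repair like this mangles any value that was decoded correctly in the first place, so it only makes sense when you know for certain the container got it wrong.</p>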
<p>So, the lessons are:</p>
<ol>
<li>Use CharacterEncodingFilter with Spring WebMVC to get the correct output encoding (and input encoding for <span class="caps">POST</span> requests).</li>
<li>Always configure tomcat to use <span class="caps">UTF</span>-8 for interpreting <span class="caps">URI</span> query strings.</li>
<li>Always include some test data with accents to ensure it goes through your system cleanly.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://happygiraffe.net/blog/2008/07/31/character-encodings-bite-again/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

