Categories
Uncategorized

Search & Replace in XSLT 2

For a project at $WORK, we want to implement Solr’s spelling suggestions. When you ask solr to provide suggestions, it comes back with something like this (the original search was spinish englosh):

  
    …
    
      
        
          1
          19
          26
          
            spanish
          
        
        
          1
          27
          34
          
            english
          
        
        
          1
          60
          67
          
            spanish
          
        
        …
      
    
  

What we want to do is transform this into:

Did you mean spanish english?

As it turns out, this is a non-trivial task in XSLT. It’s doable, but significantly easier in XSLT 2, since you are less restricted by the rules on result-tree-fragments.

The first problem to solve is getting the data into a sensible data structure for further processing. In a real language, I’d want a list of (from, to) pairs. In XSLT, sequences are always flat. The way to simulate this is to construct an element for the pair.

  
  
    
      
      
    
  

Note the commented caveat: we always pick the first suggestion for any given name. From my (small) experience, this isn’t an issue as the suggestions for a given word are always identical.

This results in $suggestions containing a sequence of elements looking like this.

  
  

Now one of the nice things about XSLT 2 is that you can define functions which are visible to XPath. So we can write a fairly simple recursive function to do the search and replace.

  
  
    
    
    
    
  

There are a few things to note:

  • You have to give your function a namespace prefix.
  • The xsl:param‘s are used in order (not by name) to specify the arity of the function.
  • The as attributes aren’t necessary, but the idea of types in XSLT is growing on me. I’d rather know about type problems as soon as possible.
  • The notion of cdr (tail) in XSLT is rather odd: the sequence of all nodes in the sequence whose position is greater than one.
  • Even though I’m using replace(), I’m not taking any precautions against escaping regex characters. I’m certain that these won’t occur given my data.

So finally, we end up with:

  
    
  
  

Did you mean ?

I don’t think all this will win any awards for elegance, but it does work. 🙂

Categories
Uncategorized

Practical Fascism

On my recent project at work, I’ve instigated a number of features:

  • An XSLT stylesheet that strips out all inline JavaScript and inline CSS before it ever hits the browser. Go-go Unobtrusive JavaScript!
  • Lots of jUnit tests:
    • Every piece of static HTML gets validated via RelaxNG (we had to use RelaxNG in order to allow mixed namespaces).
    • Ditto for every piece of XSLT.
    • A well-formedness test for every piece of XML.
    • A jslint test to check every file of JavaScript isn’t doing anything silly.
    • A “no tabs” test to ensure that there are no literal tabs in any piece of source code (in accordance with our local conventions).

I haven’t yet gotten around to integrating checkstyle. And I really need to write an XQuery parser that understands the ancient dialect we seem to be saddled with.

On the whole, these tests have proved to be a minor inconvenience up-front, often leading to groans around the office when the build breaks. But as a whole they’ve kept the code base clean and managed to detect problems early. So on the whole they’re a win. I’ll definitely be dragging them along to my next project.

Categories
Uncategorized

Client-Side XSLT

I’ve been playing with XSLT in the Browser. It appears simple enough: Just add an xml-stylesheet processing instruction to the top of your XML, pointing at your XSLT.

  <?xml-stylesheet href="stylish.xsl" type="text/xsl"?>

But there’s one more condition that needs to be met: Inside the stylesheet itself, you have to say <xsl:output type="html"/>. Not xml. Not leaving out the xsl:output tag. It has to be present and set to html or you don’t get any useful output.

My thanks to XSLT in Gecko : Generating HTML on the lovely devmo for pointing out the error of my ways to me.

Categories
Uncategorized

XSLT Character Creation

I’ve just spent most of the afternoon on a character building exercise. I have some XML like this:

  <symbol unicode="2103"/>

And I need to turn that into the numeric character reference . It’s perfectly possible to do so with a bit of fudging around with <xsl:text disable-output-escaping="yes"/>. But there’s a slight caveat: You’re not creating a numeric character reference. You’re just creating something that looks like one. Really, it’s the characters “&”, ”#”, “x”, “2”, “0”, “1”, “3” and ”;”.

Now most of the time, this doesn’t matter. You just output XML that looks correct and the next parser along (probably a browser) will interpret it correctly. But it’s sleight of hand.

Today, I needed to copy the text contents of a node into an attribute. Unfortunately, that text content contained one of these symbol tags. But because it’s only a string, XSLT feels (correctly) that it needs to escape the leading ampersand. So, with this input:

  <name>Fred <symbol unicode="2013"/> Bloggs</name>

I get this output:

  <name attrib="Fred &amp;#x2013; Bloggs">Fred – Bloggs</name>

Yes, I know that the input data is completely stupid. I can’t help that. Unfortunately I also have the restriction that I can’t do this in multiple passes.

I’ve looked at the standard XSLT functions and the standard XPath functions. I’ve looked at the EXSLT functions. All I want is something that works like Perl’s chr.

I noticed that Saxon has the saxon:entity-ref function, but annoyingly, libxslt doesn’t support it.

All I really need is some way of re-invoking the XML parser over a string of my choosing. That way I could just wrap the characters in an element, parse it and call text() to get the character I need.

Right now, the only way that I can see of doing this is to turn UnicodeData.txt into one big XML lookup table, and lookup the numbers in that. Bleeeaaargh.

Thankfully, it’s not my project and the person doing it has just hacked around this in the output layer. But it bugs me that there’s no good way to achieve this.