A Year in XQuery

About a year ago, I wrote The year of XQuery?. I’ve just finished my involvement with the large project at $WORK that was using XQuery. So it’s time to reflect over it a little.

First, a bit of background. The site I was developing is essentially a more-or-less static view of 112,1491 XML documents in 4 collections2. The main interface is a search page, plus there’s a browse-by-title if you really fancy getting lost. It sounds simple, but as with any large collection of data, there is lots of variation in the data, leading to unexpected complications. Plus there are several different views of each document, just to make life more fun.

The site is based on Cocoon talking to an XML database (MarkLogic Server in this instance). We get XML back from the database, and render it through an XSLT pipeline into HTML[3]. Plus there’s some nice jQuery on top to smooth the ride.

Looking inside the codebase, we’re presently running at 5200 lines of XQuery code across 39 files (admittedly, that includes a 1000 line “lookup table”). But that doesn’t include any of the search code, which is dynamically generated using Java.

But in many senses, that’s not the remarkable thing about the XQuery. One of the most important aspects has been the ability to query the existing data. This may not sound particularly remarkable for a query language. But for a collection of XML this size, the ability to do ad-hoc queries that understand the document structure is truly remarkable.

For example, several months ago, I received a bug report stating “the tables are not rendering in article 12345.” I was able to look at the article, see that there were no tables, examine the source markup in the database and discover a TAB element that I’d never seen before4. But how widespread is this problem? Three seconds later, I have:

    for $tab in //TAB
    return base-uri($tab)

Which tells me the 99 affected articles. Now I know that I only need to reload those from the source data instead of all 112,149.

Looking back over my original criticisms, how do they stand up to my experience?

  • The development environment.
    • Well, it’s slightly better than I originally thought. But not good enough when placed next to Java IDEs like Eclipse.
    • I’ve been using my TextMate XQuery Bundle with reasonable success (although it still needs a great deal of improvement). There are modes for Emacs and Vim.
    • I’ve managed to get the OxygenXML debugger talking to MarkLogic, but it was less useful than it initially appeared. The XQuery editor turned out to be worse than useless, because MarkLogic uses an outdated version of XQuery, leading to a lack of syntax colouring and a plethora of error reports.
    • MarkLogic has an addon CQ which is a browser based interactive query tool. It’s pretty useful.
    • Fundamentally, sharing a database between developers (the stored procedure model) doesn’t work well when you have multiple people updating it.
      • We solved this expediently by kicking everybody else off the project. πŸ™‚
  • The verbosity.
    • Like most things, you kind of get used to it. Although I confess that when I started to understand what I was doing, I found that I could write code in a significantly less verbose way.
  • The SQL / functional nature.
    • This is another one of those things that you just get used to. And in this case, start to enjoy.
  • Not a standard.
    • Fixed!—although not MarkLogic, as mentioned above.
  • XML Namespaces bite.
    • And continue to do so. Let’s blame Tim Bray. πŸ™‚
    • Seriously, this continues to be a problem, even now a year later. I lost a whole morning two weeks ago due to my inability to query the correct namespace. My main advice now is to never use a default namespace—prefix everything.
  • The type system.
    • Over time, now that I have come to understand it, I can begin to use the type system to my advantage. In fact, it’s one of the things that I usually have to reinforce in developers just starting in the project.
    • Thankfully, I’ve never needed to dabble in XML schema.
  • Implementation defined areas.
    • It’s a concern, I’ll admit. MarkLogic is profligate in this area, and to get decent performance, you need to use the extensions.
  • Smiley comments
    • They’re just about getting to the point of being ignorable.

But what new things have I learned?

  • I seriously underestimated the utility of function libraries. You can have two kinds of query in XQuery: an inline query, or a “module”. The advantage of a module is that it’s a lot easier to reach in and test an individual function by hand when needed.
  • Speaking of testing, I haven’t come up with a good solution for unit testing. This pains me greatly. I realise that it’s similar to the RDBMS unit testing problem and basically I need a known test database. MarkLogic doesnt make automating this easy (there are no management APIs).
  • Performance is very unpredictable. Not unpredictable as in “varys a lot”, but difficult to tell the performance of a given statement by visual inspection. MarkLogic comes with profiling APIs, which helps somewhat. But compared to EXPLAIN in SQL, it still feels a bit primitive.
    • For example, my XSLT experience told me to avoid things like //p to examine all paragraphs. But in XQuery, everything is indexed up the wazzoo, so it’s likely to be faster than an XPath statement with an explicit path.
  • Thinking in a functional style is an art. I’ve had a few problems, which cry out for an accumulator of some sorts. My whiteboard and I have had some long, intimate moments.
  • Having regexes available is a godsend, after XSLT 1.0.
  • I still really need a decent XQuery pretty printer, alΓ‘ perltidy.

Overall, I have to ask myself: would I do it the same again? And I probably would. For this particular project, I would try to place more emphasis on the XQuery than the XSLT (this was down to our inexperience—you should always try to work as close to the data store as possible). Despite the initial strong learning curve, the XQuery itself was rarely the main problem. But that’s leading into a whole new post…

In short: if you have a bunch of XML data lying around, XQuery is an excellent way to get the most use out it5.

1 count(/doc)

2 There’s also a second site, almost identical, but on a different topic which has 46,876 documents.

3 HTML 4.01. Sadly, XHTML and browsers still interact badly.

4 This is a rather baroque DTD unfortunately.

5 If you’re not up to paying for a MarkLogic licence (it’s pricey), then eXist might be worth checking out.


XQuery Redux

I spent four days last week on a training course, most of which consisted of learning XQuery. I’m a lot happier with it now that I was a couple of months ago. In particular,

  • OxygenXML or Stylus Studio are pretty good IDEs. A little cluttered, but an improvement over vim & Emacs.
  • XQuery 1.0 is now out.
    • It’s a minor shame that the product I’m using doesn’t support it fully yet, but that’s just a matter of time.
  • I’ve come to term with the type system. I’ve worked out where it can help me more than hinder me. I continue to be shocked when I return the wrong kind of element from a function and get a type error. Dunno why, it’s doing the right thing, but it still surprises me.
  • I β™₯ XPath. It just rocks in so many ways.
    • Heh, James Clark was the editor. That explains a lot.

Overall, this is good news, especially given the fact that I’m embarking on a large project that’s utterly dependent on XQuery.

On the downside,

  • XML namespaces suck. You can tell this is true, when in a room where 5 developers have been involved with XML and namespaces for over 5 years each, and they still don’t get it.
  • The implementation defined areas still loom quite large.
    • I suspect that this isn’t really a big issue in practice, though. You’re not likely to be swapping out XQuery implementations with any frequency.
  • I still despise the smileys-for-comments.

Anyway, if you’re playing around with XML in any seriousness, it’s probably worth checking out XQuery. It’s surprisingly useful.


The year of XQuery?

Apparently, it’s the year of xquery. I’ve just started a large project at $WORK, of which XQuery is a fundamental piece. And I have to say I’m not so sure.

From what I’ve seen so far, there are good and bad bits to XQuery. The good bits are that it’s really flexible and very easy to pull apart large quantities of XML documents in ways in which it’s much, much harder with XSLT. I like the way XML fragments are treated as just another datatype, like E4X but done right. I also like the way that it’s built on XPath, as XPath rocks.

Sadly, there are quite a few more bad bits.

  • The development environment leaves a lot to be desired (an eclipse plugin would be nice, as would vim1 or emacs2 syntax highlighting).
  • XQuery itself is really verbose. There appears to be a lot of syntax, and the grammar is on the large side. This means that it’s really hard to pick up large bodies of other peoples code. Coincidentally like the stuff I’ve been dumped with.
  • The spirit of the language is more like SQL than any procedural language. Yet it presents a veneer of procedurality, which fools you into thinking you can get away with things like “just throw an extra statement in”. It’s not that simple. Your XQuery code is returning a list of items3, so instead you have to insert your extra code as a previous item in the list. Essentially, this means you must end your extra code with a comma, instead of the expected semicolon or nothing. This continues to bite me.
  • It’s still not a standard. “Nearly there now”, apparently. It’s beginning to sound like Perl 6…
  • Namespaces continue to cause me much wailing and gnashing of teeth. In XQuery, there is a tendency to use quite a few of them as well. In fairness, this is more of an XML problem than an XQuery problem, but it’s really irritating to still be tripped up by having xpath not match because you’re not looking at a namespace by accident.
  • The type system is a pain. XQuery may or may not be strongly typed, depending upon the code you’re using, the implementation, or the phase of the moon. It has had a tendency to get in the way, in all the stuff I’ve seen so far.
  • For all its verbosity, the language itself is quite limited. I’ve been needing to use a lot of implementation-specific extensions in the work I’ve been doing. Particularly for things like updates.
  • Oh, and whoever chose fucking smileys as the comment syntax needs to be shot. Now.

Having said all that, I think XQuery is still useful in the area I’m working in (large corpus’ of XML documents). Despite my rash of indignation, it’s proved a lot easier to deal with than our previous technology (flat files plus manually created indexes in an RDBMS plus Lucene for searching). It has a future. I just hope it develops quicker than it has so far.

1 That doesn’t suck, anyway, unlike xquery.vim.

2 I haven’t tried xquery-mode.el yet.

3 Except where it isn’t, e.g. defining functions.