About a year ago, I wrote “The year of XQuery?”. I’ve just finished my involvement with the large project at $WORK that was using XQuery, so it’s time to reflect on it a little.
First, a bit of background. The site I was developing is essentially a more-or-less static view of 112,149[1] XML documents in 4 collections[2]. The main interface is a search page, plus there’s a browse-by-title if you really fancy getting lost. It sounds simple, but as with any large collection of data, there is a lot of variation in the data, leading to unexpected complications. Plus there are several different views of each document, just to make life more fun.
The site is based on Cocoon talking to an XML database (MarkLogic Server in this instance). We get XML back from the database, and render it through an XSLT pipeline into HTML[3]. Plus there’s some nice jQuery on top to smooth the ride.
Looking inside the codebase, we’re presently running at 5200 lines of XQuery code across 39 files (admittedly, that includes a 1000-line “lookup table”). But that doesn’t include any of the search code, which is dynamically generated using Java.
But in many senses, the size of the codebase isn’t the remarkable thing about the XQuery. One of the most important aspects has been the ability to query the existing data. That may not sound like much for a query language, but for a collection of XML this size, being able to run ad-hoc queries that understand the document structure is invaluable.
For example, several months ago, I received a bug report stating “the tables are not rendering in article 12345.” I was able to look at the article, see that there were no tables, examine the source markup in the database and discover a TAB element that I’d never seen before[4]. But how widespread is this problem? Three seconds later, I have:
distinct-values( for $tab in //TAB return base-uri($tab) )
Which tells me which 99 articles are affected. Now I know that I only need to reload those from the source data instead of all 112,149.
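As a follow-up sketch, the same kind of ad-hoc query can be extended to show how many TAB elements each affected article contains. It assumes, like the query above, that the processor evaluates a bare // against the whole database (other processors may need an explicit doc() or collection() starting point) and that base-uri() returns the URI the document is stored under.

```xquery
(: List each affected document with its number of TAB elements, worst first. :)
for $uri in distinct-values(for $tab in //TAB return string(base-uri($tab)))
let $count := count(doc($uri)//TAB)
order by $count descending
return concat($uri, ": ", $count)
```

Sorting by count is just one choice; the point is that a question about the whole collection stays a handful of lines.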
Looking back over my original criticisms, how do they stand up to my experience?
- The development environment.
  - Well, it’s slightly better than I originally thought. But not good enough when placed next to Java IDEs like Eclipse.
  - I’ve been using my TextMate XQuery Bundle with reasonable success (although it still needs a great deal of improvement). There are modes for Emacs and Vim.
  - I’ve managed to get the OxygenXML debugger talking to MarkLogic, but it was less useful than it initially appeared. The XQuery editor turned out to be worse than useless, because MarkLogic uses an outdated version of XQuery, leading to a lack of syntax colouring and a plethora of error reports.
  - MarkLogic has an add-on, CQ, which is a browser-based interactive query tool. It’s pretty useful.
  - Fundamentally, sharing a database between developers (the stored procedure model) doesn’t work well when you have multiple people updating it.
  - We solved this expediently by kicking everybody else off the project. :-)
- The verbosity.
  - Like most things, you kind of get used to it. Although I confess that once I started to understand what I was doing, I found that I could write code in a significantly less verbose way.
- The SQL / functional nature.
  - This is another one of those things that you just get used to. And in this case, start to enjoy.
- Not a standard.
  - Fixed! Although not in MarkLogic, as mentioned above.
- XML Namespaces bite.
  - And continue to do so. Let’s blame Tim Bray. :-)
  - Seriously, this continues to be a problem, even now a year later. I lost a whole morning two weeks ago due to my inability to query the correct namespace. My main advice now is to never use a default namespace: prefix everything.
- The type system.
  - Over time, now that I have come to understand it, I can begin to use the type system to my advantage. In fact, it’s one of the things that I usually have to reinforce in developers just starting in the project.
  - Thankfully, I’ve never needed to dabble in XML schema.
- Implementation-defined areas.
  - It’s a concern, I’ll admit. MarkLogic is profligate in this area, and to get decent performance, you need to use the extensions.
- Smiley comments.
  - They’re just about getting to the point of being ignorable.
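To make the namespace advice above concrete, here is a minimal sketch. The namespace URI, the art prefix and the title element are all hypothetical, chosen purely for illustration. With no default element namespace declared, an unprefixed name test only matches elements in no namespace, so the first count can quietly come back as zero even when the documents are full of titles.

```xquery
(: Hypothetical namespace URI and prefix, for illustration only. :)
declare namespace art = "http://example.com/ns/article";

(
  (: Unprefixed: matches only title elements in no namespace. :)
  count(//title),
  (: Prefixed: matches title elements in the declared namespace. :)
  count(//art:title)
)
```

Declaring a prefix for every namespace, rather than a default one, keeps this difference visible in the query text.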
But what new things have I learned?
- I seriously underestimated the utility of function libraries. You can have two kinds of query in XQuery: an inline query, or a “module”. The advantage of a module is that it’s a lot easier to reach in and test an individual function by hand when needed.
- Speaking of testing, I haven’t come up with a good solution for unit testing. This pains me greatly. I realise that it’s similar to the RDBMS unit testing problem and basically I need a known test database. MarkLogic doesn’t make automating this easy (there are no management APIs).
- Performance is very unpredictable. Not unpredictable as in “varies a lot”, but in that it’s difficult to tell the performance of a given statement by visual inspection. MarkLogic comes with profiling APIs, which help somewhat. But compared to EXPLAIN in SQL, it still feels a bit primitive.
  - For example, my XSLT experience told me to avoid things like
    //p
    to examine all paragraphs. But in XQuery, everything is indexed up the wazoo, so it’s likely to be faster than an XPath statement with an explicit path.
- Thinking in a functional style is an art. I’ve had a few problems that cry out for an accumulator of some sort. My whiteboard and I have had some long, intimate moments.
- Having regexes available is a godsend, after XSLT 1.0.
- I still really need a decent XQuery pretty printer, à la perltidy.
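Two of the points above, function libraries and accumulators, can be sketched together. The module below is illustrative only: the util prefix, its namespace URI, the file name util.xqy and the function name are all assumptions. It uses standard XQuery 1.0 syntax, which may need adjusting for an older vendor dialect. The running total is passed through the recursive calls as an extra parameter, which is a common functional stand-in for a mutable accumulator, and the typed signature means bad arguments are rejected at the call rather than deep inside the body.

```xquery
xquery version "1.0";

(: Hypothetical library module; the namespace URI is made up. :)
module namespace util = "http://example.com/xquery/util";

(: Returns the running totals of $nums. $acc carries the total so far. :)
declare function util:running-total(
  $nums as xs:integer*,
  $acc  as xs:integer
) as xs:integer* {
  if (empty($nums)) then ()
  else
    let $next := $acc + $nums[1]
    return ($next, util:running-total(subsequence($nums, 2), $next))
};
```

Because it lives in a module, the function can be exercised by hand from a one-off query (again assuming the module is saved as util.xqy):

```xquery
import module namespace util = "http://example.com/xquery/util" at "util.xqy";

util:running-total((1, 2, 3, 4), 0)  (: yields 1, 3, 6, 10 :)
```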
Overall, I have to ask myself: would I do it the same again? And I probably would. For this particular project, I would try to place more emphasis on the XQuery than the XSLT (this was down to our inexperience; you should always try to work as close to the data store as possible). Despite the initially steep learning curve, the XQuery itself was rarely the main problem. But that’s leading into a whole new post…
In short: if you have a bunch of XML data lying around, XQuery is an excellent way to get the most use out of it[5].
[1] count(/doc)
[2] There’s also a second site, almost identical, but on a different topic, which has 46,876 documents.
[3] HTML 4.01. Sadly, XHTML and browsers still interact badly.
[4] This is a rather baroque DTD, unfortunately.
[5] If you’re not up to paying for a MarkLogic licence (it’s pricey), then eXist might be worth checking out.