That Looks Somewhat Iffy

Posted by Dominic Mitchell Sun, 13 Apr 2008 09:32:00 GMT

Oh, go on, then.

  % history 1 | awk '{a[$2]++ } END{for(i in a){print a[i] " " i}}'|sort -rn|head
  320 git
  112 ll
  80 cd
  48 ls
  24 less
  20 man
  19 happygiraffe.net
  17 sudo
  15 rm
  14 xattr

NB:

  1. This is at home, not work. At work, there’s significantly more git, ant, mvn and mate.
  2. I turn all my ssh known_hosts entries into command line aliases, hence “happygiraffe.net”.
  3. I’ve only just found xattr, so I’ve been playing with it.
  4. zsh only lists 25 history items by default. I have to ask it to start from the beginning.
  5. I obviously do too much as root.

As always, I can’t resist tinkering with the shell. I’d have written it as history 1 | awk '{print $2}' | sort | uniq -c | sort -rn | head.

m2eclipse tip

Posted by Dominic Mitchell Wed, 09 Apr 2008 22:41:00 GMT

I was wondering why one of my computers seemed to have an older version of m2eclipse installed. It turns out that they moved the update site from m2eclipse.codehaus.org/update/ to m2eclipse.sonatype.org/update/. Doh. It’d be nice if they mentioned that one the front page.

Nostalgia (not by Veidt)

Posted by Dominic Mitchell Wed, 02 Apr 2008 22:47:31 GMT

I was thinking about what I needed to do tomorrow. One of the tasks involved writing a DOS batch file for a colleague. That got me thinking. When did I learn to write batch files? Scarily, I realised it was 20 years ago. I must have been 14 at the time. I had an Amstrad PC 1512 and I wanted to learn everything about it (and play Ultima V a lot).

So, I spent ages reading help, playing with commands to see what they did. I even managed to get a book or two (if the books seem expensive now, they’re even more to a 14 year old with practically no income).

Learning batch files was pretty much mandatory — you had to configure AUTOEXEC.BAT somehow. But learning why you needed to prefix “echo off” with an @ was fascinating[1].

Somehow this information has stayed relevant a lot longer than I thought it would. Certainly longer than the DOS assembly coding I did (a PSP was a Program Segment Prefix long before it was a Play Station Portable — but who cares these days?).

I had little realisation quite how bad batch files really were until I came across sh on SunOS 4 at University. The whole Unix thing (including the culture) was pretty mind-blowing.

I guess I’ve just officially joined the “old farts” club. :-)

1 “echo off” stops outputting commands to the console. But the “echo off” itself has already been output to the console by that point. The @ in front stops that. Looking back now, it seems remarkably similar to the syntax used by make(1).

for() $DEITY's sake, why? 2

Posted by Dominic Mitchell Wed, 02 Apr 2008 10:17:00 GMT

I used to think I had a reasonable grasp of Perl. Yesterday, I realised I didn’t even understand a basic foreach loop.

  my $val;
  my @values = qw( a b c );
  foreach $val (@values) {
    print $val, "\n";
  }
  print "[end] $val\n";

I reckoned that this should print:

  a
  b
  c
  [end] c

Instead, it prints:

  a
  b
  c
  Use of uninitialized value in concatenation (.) or string at foo.pl line 11.
  [end] 

This confused me no end. But it’s actually documented behaviour. From perlsyn

The foreach loop iterates over a normal list value and sets the variable VAR to be each element of the list in turn. If the variable is preceded with the keyword my, then it is lexically scoped, and is therefore visible only within the loop. Otherwise, the variable is implicitly local to the loop and regains its former value upon exiting the loop. If the variable was previously declared with my, it uses that variable instead of the global one, but it’s still localized to the loop. This implicit localisation occurs only in a foreach loop.

Wow. You really do learn something new every day. I suspect that this is implementation behaviour that was documented post-fact, rather than designed that way.

Expanding Outline Views in Cocoa

Posted by Dominic Mitchell Sun, 30 Mar 2008 21:42:00 GMT

Thanks to a lengthy commute last week, I’ve been making a toy in Cocoa. It reads an URL, parses some XML and displays it in an NSOutlineView. Simple stuff, but it will hopefully make my life better at work, where I need to do this a fair bit.

By default, when I load the XML into the NSOutlineView, everything is closed up. So all you’re presented with is the root element. I’d like to expand that so it automatically includes all children of the root element. Nice and simple—there’s an expandItem: method.

Except that when I call it from the action that puts the XML into the NSOutlineView, it doesn’t work. Bugger.

After instrumenting my XML data source, I can see that nothing is really happening until after my action. My suspicion is that the NSOutlineView isn’t realising that it has any data until after the first call to display.

I tested this by hooking up the call to expandItem into a secondary action (on another button). And it works great.

So, I need a way to say “call this code back in the next idle period”. And this is where I start to get upset that Objective-C doesn’t have closures.

How to execute code soon isn’t easily determined from the apple docs. My guess is that you use an NSTimer with a very small NSTimeInterval . Let’s try that…

  NSTimer *idle = [NSTimer scheduledTimerWithTimeInterval:0.0
                                                   target:self
                                                 selector:@selector(expandRoot)
                                                 userInfo:nil
                                                  repeats:NO];
  [[NSRunLoop currentRunLoop] addTimer:idle forMode:NSDefaultRunLoopMode];

Well, it works. But! Further investigation finally reveals Deferring the Execution of Operations. This suggests that I should use an NSNotification instead, but posted with a NSPostWhenIdle flag. This means getting involved with the cocoa notifications system...

The code now ends up looking like this:

  - (void)awakeFromNib
  {
    // Listen out for notification's we've posted to ourselves.
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(expandRoot:)
                                                 name:@"expandRoot" 
                                               object:nil];
  }

  -(IBAction)fetch: (id)sender
  {
    // …
    NSNotification* todo = [NSNotification notificationWithName:@"expandRoot" 
                                                         object:self];
    [[NSNotificationQueue defaultQueue] enqueueNotification:todo
                                               postingStyle:NSPostWhenIdle];
  }

  - (void)expandRoot:(NSNotification *)notification
  {
    [outlineView expandItem:[[outlineViewDataSource doc] rootElement]];
  }

Which is quite a bit more code. But it feels more robust doing it this way.

The big take away from all this is how difficult it is to use a non-Open-Source framework. If I had the source to Cocoa, I’d be able to look inside and see what I needed to do simply and quickly. Instead, it took me three train journeys. But there’s still enough to like in Cocoa that I intend to carry on.

q4e 2

Posted by Dominic Mitchell Sat, 01 Mar 2008 10:12:00 GMT

I hate maven. The UI sucks so badly, it’s incredibly painful to use. Anything that can take the edge off this has to be a good thing. In the past, I’ve experimented with m2eclipse, but to be quite frank, it’s not much improvement over the command line. And the tools are why we use Java instead of Perl/Python/Ruby, right?

So now I’m trying out q4e, which is being promoted as the “official” maven integration for Eclipse. At some point. Hopefully.

After installing (0.4.0), first impressions are good. There’s a “new maven project” wizard, which knows about archetypes. Creating archetypes manually is incredibly annoying (I can never remember how). Now, the wizard just lists them for you:

The q4e archetypes wizard

Afterwards, it nicely prompts you for details like groupId and artifactId. Plus a description, which I’d never have thought of otherwise. The more metadata the better! (Actually, the description appears to vanish once the project has been created).

Once created, this is what you end up with.

A new maven project created by q4e

Nice and simple so far. Most of the maven commands are on the right-click menu:

q4e's main menu

When you run something, you get a log viewer, which is a little nicer than the m2eclipse console. At least it’s timestamped.

The q4e log view

Unfortunately, the dependency management appears to be broken. I can’t search for dependencies in the way I could in m2eclipse. That’s not good, as there are lots of them available and the computer should be telling me what the heck they are.

Annoyingly, it seemed to “lose” my src/main/java folder. I had to recreate it in order to get it to work. Very odd.

It will graph dependencies, which is probably more useful on a larger project than my test app.

The q4e dependency viewer

There’s no help yet, which is annoying, though not critical (I don’t know anyone who uses Eclipse help much anyway. Developers—who’d believe it, eh?)

The one thing that’s been really bugging me with maven is the complete inability to get at the source code of dependencies. Oh sure, it’s available for download, and there are features to ask for the source to be downloaded. But I have no idea how on earth to make it work. Not having source code is a real problem when it comes to understanding software.

It didn’t offer much help in the way of creating tests (A default src/test/java would have been nice).

Overall, it’s pretty clear that this is a very early stage of development. But it still looks more promising than m2eclipse, and anything that can make maven easier to use is most welcome.

That was all written last night. This morning, I’ve updated to the development build (0.5.0). It’s managed to grow a “Fetch Source JARs” command, which is great. Unfortunately, I’ve just bumped into issue 153—the compiler settings aren’t synced between maven and eclipse. Oh well. I stand by the previous paragraph. This has potential, but it’s not there yet.

Vendor branches in git

Posted by Dominic Mitchell Thu, 07 Feb 2008 21:00:00 GMT

Recently, I’ve been playing with git. It’s been pretty good so far. There are two things that have really impressed me:

  1. The subversion integration with subversion via git-svn is superb. I’m busy coding in git and I can just push all my changes back into subversion with nobody knowing I’m doing anything different.
  2. The visualisation provided by gitk is wonderful. One of our projects at work has about 10 open branches. gitk allowed me to see this quickly and easily. Most importantly, it immediately alerted me to changes on branches that had not been merged back into the trunk.

There are downsides of course. Support for Windows and Eclipse still isn’t wonderful (though it is progressing). But for my personal use, it’s damned handy.

There’s one thing I make use of in subversion that I didn’t immediately understand how to do in git: vendor branches. If you haven’t come across them before, they’re a way of tracking third party software and incorporating your changes (where you either don’t have access to the repository, or don’t want that level of granularity). At work, I have a wordpress blog, and I keep it up to date (including all my customizations) using a vendor branch. It’s worked really well.

It took a bit of searching, but eventually I found a mail message describing tracking changes when the upstream uses snapshots, which is the same thing as “vendor branches”. The process is really quite elegant and simple, and makes use of git’s superb branching.

To start with, you just unpack the software and set it up as a new git repository.

  $ unzip wordpress-2.3.zip
  $ cd wordpress
  $ git init
  $ git add .
  $ git commit -m "Import wordpress 2.3" 
  $ git tag v2.3
  $ git branch upstream

The clever bit is the last line: create an “upstream” branch, which you can use to store the newer releases.

With that, you can go away and start making changes: wp-config.php, add a few themes and plugins, that sort of thing.

Now, when a new release comes along, there’s a simple process (easily wrapped up in a small shell script) to incorporate it.

  $ cd wordpress
  $ git checkout upstream
  $ rm -r *
  $ (cd .. && unzip wordpress-2.3.1.zip)
  $ git add .
  $ git commit -a -m 'Import wordpress 2.3.1.'
  $ git tag v2.3.1
  $ git checkout master
  $ git merge upstream

There’s a few tricks in there.

  • I know that wordpress always unzips into a wordpress directory, so I can unzip it on top of the current code.
  • git add . catches all the modified and added files.
  • git commit -a catches the removed files.
  • The magic is in the last two lines—switch back to your own lineage, and pull in all the changes in the latest version of the code. If there are any conflicts, git will point them out to you and let you deal with them.

After a few tries, you end up with something like this.

Wordpress on a vendor branch in git

That is, you have a wordpress 2.3.3 install with all your customizations applied.

I confess, this is all probably overkill for something like wordpress. But it’s still a useful technique to know about.

Update: I’ve just realised that the above solution isn’t perfect, particularly for things like wordpress and mediawiki, where there’s a tendency to upload files into the repository area. Frequently these aren’t in version control so the rm -r * above could blow the uploads away. There are two choices:

  1. Configure the software to not modify anything in the repository (e.g. in wordpress, change Options → Miscellaneous → Store uploads in this folder).
  2. Have a cron job to regularly add & commit uploads into the repository. Depending on the size and frequency, this may or may not be preferrable.

Code Examples Should Work

Posted by Dominic Mitchell Wed, 30 Jan 2008 22:48:00 GMT

I’m looking at Wicket this evening. There’s a wicket tutorial on theserverside.com. However, the first code example fails!

Initially I tried to get things working by downloading wicket. However, the article recommends maven (as does the quickstart), so I grit my teeth and went ahead.

Once again, I get 50 lines of gibberish, including an error. But everything seems to work. Clench teeth some more, and move on.

The first code example is the definition of the WicketApplication class, which Maven has helpfully created for us in outline. The first method to add is:

  private ApplicationContext getContext() {
    return WebApplicationContextUtils
      .getRequiredWebApplicationContext(getServletContext());
  }

As usual, I get compile errors. This is normal, just hit quick-fix in Eclipse to import the classes ApplicationContext. Except that there isn’t one.

Strike one.

I take a “lucky guess” and add a dependency on spring-web to my pom.xml. That makes both ApplicationContext and WebApplicationContextUtils appear on my classpath. Yay.

The next problematic method is this one:

  public static WicketApplication get() {
      return (WicketApplication) WebApplication.get();
  }

According to the tutorial:

The third overrides the Application#get() method to allow for covariant return types, in this case the WicketApplication class. This example uses the WicketApplication as a form of Service Locator.

Great! Except for the minor problem that it doesn’t compile:

  The return type is incompatible with Application.get()

Strike two.

For now, I’m going to comment it out and carry on playing until I get by bitten by that. But it’s not what I’d call a great start to a tutorial when your first code example is this buggy.

Update: I spoke to the author of the article, Nick Heudecker, and he pointed out:

  • I should have downloaded the example code, that works.
    • Well, great, but it would have been nice to have that mentioned.
  • Variant return types are supported by Java 5.
    • Which I certainly thought I was using. I don’t know what went wrong here. But I didn’t know that Java 5 could do this, so it’s good to know.

Trusting your tools 3

Posted by Dominic Mitchell Sun, 27 Jan 2008 22:40:00 GMT

After Grzegorz’s piping up, I’m giving cocoon 2.2 another try. Here are some selected errors.

  javax.servlet.ServletException: No block for /favicon.ico
          at org.apache.cocoon.servletservice.DispatcherServlet.service(DispatcherServlet.java:84)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingServlet.service(ReloadingServlet.java:89)
          at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1093)
          at org.apache.cocoon.servlet.multipart.MultipartFilter.doFilter(MultipartFilter.java:119)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingServletFilter.doFilter(ReloadingServletFilter.java:50)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)
          at org.apache.cocoon.servlet.DebugFilter.doFilter(DebugFilter.java:169)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingServletFilter.doFilter(ReloadingServletFilter.java:50)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingSpringFilter.doFilter(ReloadingSpringFilter.java:69)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingServletFilter.doFilter(ReloadingServletFilter.java:50)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
          at org.mortbay.jetty.Server.handle(Server.java:313)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:506)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:830)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:514)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:381)
          at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:396)
          at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

How fabulous! 30 lines to tell me about a 404 I couldn’t care less about! (this is from mvn jetty:run). And in the process, obliterating any messages I did care about.

  [ERROR] VM #displayTree: error : too few arguments to macro. Wanted 2 got 0
  [ERROR] VM #menuItem: error : too few arguments to macro. Wanted 1 got 0
  [INFO] ------------------------------------------------------------------------
  [INFO] BUILD SUCCESSFUL
  [INFO] ------------------------------------------------------------------------

There’s an error, but the build was successful. That makes sense. Not.(from mvn site).

  Caused by: org.codehaus.plexus.util.xml.pull.XmlPullParserException: TEXT must be immediately followed by END_TAG and not START_TAG (position: START_TAG seen ...<reports>\n            <report>... @118:21) 
          at org.codehaus.plexus.util.xml.pull.MXParser.nextText(MXParser.java:1063)
          at org.apache.maven.model.io.xpp3.MavenXpp3Reader.parseReportPlugin(MavenXpp3Reader.java:3572)
          at org.apache.maven.model.io.xpp3.MavenXpp3Reader.parseReporting(MavenXpp3Reader.java:3709)
          at org.apache.maven.model.io.xpp3.MavenXpp3Reader.parseModel(MavenXpp3Reader.java:2347)
          at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read(MavenXpp3Reader.java:4422)
          at org.apache.maven.project.DefaultMavenProjectBuilder.readModel(DefaultMavenProjectBuilder.java:1412)
          ... 17 more
  [INFO] ------------------------------------------------------------------------
  [INFO] Total time: 2 seconds
  [INFO] Finished at: Sun Jan 27 22:25:35 GMT 2008
  [INFO] Final Memory: 1M/2M
  [INFO] ------------------------------------------------------------------------

XML parser exception (and in only 40+ lines!). Complaining about unbalanced tags? Must be non-well-formed XML, right? Wrong. This is down to copying an example from “Better builds with maven”. But the example’s wrong—I’m missing a tag. Can you guess what’s missing from this XML?

  <reportSets>
    <reports>
      <report>dependencies</report>
    </reports>
  </reportSets>

You mean you didn’t spot the missing reportSet element? I’m shocked, I tell you, shocked. Plus the lack of indication that an error actually occurred. The stacktrace is a good indication, but an actual “ERROR” or “BUILD FAILED” message would be nice (there is an error line, but it zoomed past three screens ago. I blinked and missed it).

So that’s two strikes to maven and one to cocoon. My trust in them is basically non-existent at this point. But at least the RCL worked as documented.

Argumentative 1

Posted by Dominic Mitchell Fri, 25 Jan 2008 07:29:00 GMT

I spent a little while looking at mootools yesterday for a colleague. It’s yet another JavaScript library. My colleague was wondering how to restrict the Accordion effect so it applied once to each area of content on the page, rather than once for the whole page (each content area has multiple bits of content where only one should be visible).

As usual, I just headed straight for the source code (Accordion.js). Inside, I found the best abuse of JavaScript I’ve seen in a while.

  initialize: function(){
    var options, togglers, elements, container;
    $each(arguments, function(argument, i){
            switch($type(argument)){
                    case 'object': options = argument; break;
                    case 'element': container = $(argument); break;
                    default:
                            var temp = $$(argument);
                            if (!togglers) togglers = temp;
                            else elements = temp;
            }
    });
    // …
  }

Now what exactly does this do? It’s pretty different from the usual JavaScript function usage:

  var initialize = function (options, togglers, elements, container) { … }

The first thing of interest is the first parameter to $each: arguments. A little known feature of JavaScript is that every function has an array-like object of all its arguments available. You can find get a reference to the function that you’re in, which is sometimes useful.

Here, it’s being used to accept an arbitrary number of arguments, in any order. To be quite frank, it’s a bit of a mess. This is my understanding of the above code:

  • You can pass in up to four arguments.
    • You can pass in more, but we only get 4 parameters out of it.
  • The last JavaScript object will get treated as a set of options.
  • The last element passed in will be used as “container”.
  • The first non object|element will be passed to $$ and treated as the set of “togglers”.
  • The last non object|element arguments will be passed through $$ and treated as the elements to be shown/hidden.

Is that right? I’ve been looking at it for a bit and still not 100% sure.

To me, this is a fine example of taking the flexibility of JavaScript just a little too far.

A Year in Cocoon 7

Posted by Dominic Mitchell Mon, 21 Jan 2008 20:01:00 GMT

The other large part of the project at $WORK I’ve just finished was Cocoon. Cocoon is a Java web framework. It’s got some really neat ideas in it, and it’s main purpose in life is transforming XML. It is (or should be) a perfect match for XML databases.

I described Cocoon about a year ago, towards the start of this project. But how do I feel about it now? Looking back, was it the right choice?

To start with, I’m still very impressed with the core Cocoon technologies. The pipelines are perfect for dealing with XML. FlowScript still impresses the heck out of me.

But there’s a lot that leaves a sour taste. My first original complaint was about the size. Well, 50Mb isn’t so big. But the fact of the matter is that there’s an awful lot of stuff in there, quite a bit of which you don’t want to touch with a bargepole. We wasted a lot of time looking at things like XSP, Actions and implementing our own custom Generators. I wish I’d been made more aware of what FlowScript was up-front, and what it could do for more. I wish I’d realised that it’s basically the “C” in MVC.

Which dovetails straight into another complaint: documentation. There is quite a bit of documentation for Cocoon. But it’s still inadequate given the gargantuan size of the project. And the coverage is extremely spotty. Normally, I’d jump straight to the published literature, but the most recent Cocoon book I could find was hideously out of date. In fact, that’s what caused me to go down several of the rabbit-holes mentioned above.

When I’ve really needed to figure out what’s going on, I’ve invariable had to turn to the cocoon source code. Which due to it’s dependence on the weird-yet-not-wonderful avalon framework made it less than simple to understand.

My complaint about debugging still holds, although less severely. You get used to the seemingly-intractible error messages. You spot the patterns that are causing trouble. Like most things, logging goes a long way.

And then there’s Cocoon 2.2. My site was developed entirely on Cocoon 2.1. This certainly had it’s flaws—figuring out how to deploy it as a war file sensibly was a pain. But Cocoon 2.2 has Maven.

I’ve pointed out my dislike for Maven before. As have other people. Recently, other people in my office have been using it and I’ve witnessed the project overruns thanks to trying to figure out what Maven is doing. Nice idea, bad implementation.

Cocoon 2.2 uses Maven because it’s “modularized”. What this means is you can’t have a single project with everything in it any more. You have a “webapp” project and a “block” project. And when something changes you have to build the block, install it, build the webapp, mount the installed block in the webapp and fire up jetty. It doesn’t make for a good development environment.

Now I could be completely wrong and missing the obvious way to do seamless Cocoon 2.2 web development from with Eclipse. I’d love to be corrected. But for now, Cocoon 2.2 has shot itself in the foot as far as I’m concerned.

So I’m not happy with the future direction of Cocoon. I need to look again at why I chose it in the first place. Initially, it was because all the other web frameworks for Java (Struts, Tapestry, Wicket) all seemed totally focussed on form-based CRUD-style web apps. Cocoon focussed on documents and URLs instead. So it’s time to start working my way through the tutorials of the many java web frameworks in order to find a more suitable one. I may not find a good replacement for Cocoon. But I certainly need to try.

Update: In the comments, Grzegorz Kossakowski points out a screencast about the RCL which slightly lessens the pain of interactive development with Cocoon 2.2.

A Year in XQuery 4

Posted by Dominic Mitchell Sun, 20 Jan 2008 22:45:00 GMT

About a year ago, I wrote The year of XQuery?. I’ve just finished my involvement with the large project at $WORK that was using XQuery. So it’s time to reflect over it a little.

First, a bit of background. The site I was developing is essentially a more-or-less static view of 112,1491 XML documents in 4 collections2. The main interface is a search page, plus there’s a browse-by-title if you really fancy getting lost. It sounds simple, but as with any large collection of data, there is lots of variation in the data, leading to unexpected complications. Plus there are several different views of each document, just to make life more fun.

The site is based on Cocoon talking to an XML database (MarkLogic Server in this instance). We get XML back from the database, and render it through an XSLT pipeline into HTML[3]. Plus there’s some nice jQuery on top to smooth the ride.

Looking inside the codebase, we’re presently running at 5200 lines of XQuery code across 39 files (admittedly, that includes a 1000 line “lookup table”). But that doesn’t include any of the search code, which is dynamically generated using Java.

But in many senses, that’s not the remarkable thing about the XQuery. One of the most important aspects has been the ability to query the existing data. This may not sound particularly remarkable for a query language. But for a collection of XML this size, the ability to do ad-hoc queries that understand the document structure is truly remarkable.

For example, several months ago, I received a bug report stating “the tables are not rendering in article 12345.” I was able to look at the article, see that there were no tables, examine the source markup in the database and discover a TAB element that I’d never seen before4. But how widespread is this problem? Three seconds later, I have:

  distinct-values(
    for $tab in //TAB
    return base-uri($tab)
  )

Which tells me the 99 affected articles. Now I know that I only need to reload those from the source data instead of all 112,149.

Looking back over my original criticisms, how do they stand up to my experience?

  • The development environment.
    • Well, it’s slightly better than I originally thought. But not good enough when placed next to Java IDEs like Eclipse.
    • I’ve been using my TextMate XQuery Bundle with reasonable success (although it still needs a great deal of improvement). There are modes for Emacs and Vim.
    • I’ve managed to get the OxygenXML debugger talking to MarkLogic, but it was less useful than it initially appeared. The XQuery editor turned out to be worse than useless, because MarkLogic uses an outdated version of XQuery, leading to a lack of syntax colouring and a plethora of error reports.
    • MarkLogic has an addon CQ which is a browser based interactive query tool. It’s pretty useful.
    • Fundamentally, sharing a database between developers (the stored procedure model) doesn’t work well when you have multiple people updating it.
      • We solved this expediently by kicking everybody else off the project. :)
  • The verbosity.
    • Like most things, you kind of get used to it. Although I confess that when I started to understand what I was doing, I found that I could write code in a significantly less verbose way.
  • The SQL / functional nature.
    • This is another one of those things that you just get used to. And in this case, start to enjoy.
  • Not a standard.
    • Fixed!—although not MarkLogic, as mentioned above.
  • XML Namespaces bite.
    • And continue to do so. Let’s blame Tim Bray. :)
    • Seriously, this continues to be a problem, even now a year later. I lost a whole morning two weeks ago due to my inability to query the correct namespace. My main advice now is to never use a default namespace—prefix everything.
  • The type system.
    • Over time, now that I have come to understand it, I can begin to use the type system to my advantage. In fact, it’s one of the things that I usually have to reinforce in developers just starting in the project.
    • Thankfully, I’ve never needed to dabble in XML schema.
  • Implementation defined areas.
    • It’s a concern, I’ll admit. MarkLogic is profligate in this area, and to get decent performance, you need to use the extensions.
  • Smiley comments
    • They’re just about getting to the point of being ignorable.

But what new things have I learned?

  • I seriously underestimated the utility of function libraries. You can have two kinds of query in XQuery: an inline query, or a “module”. The advantage of a module is that it’s a lot easier to reach in and test an individual function by hand when needed.
  • Speaking of testing, I haven’t come up with a good solution for unit testing. This pains me greatly. I realise that it’s similar to the RDBMS unit testing problem and basically I need a known test database. MarkLogic doesnt make automating this easy (there are no management APIs).
  • Performance is very unpredictable. Not unpredictable as in “varys a lot”, but difficult to tell the performance of a given statement by visual inspection. MarkLogic comes with profiling APIs, which helps somewhat. But compared to EXPLAIN in SQL, it still feels a bit primitive.
    • For example, my XSLT experience told me to avoid things like //p to examine all paragraphs. But in XQuery, everything is indexed up the wazzoo, so it’s likely to be faster than an XPath statement with an explicit path.
  • Thinking in a functional style is an art. I’ve had a few problems, which cry out for an accumulator of some sorts. My whiteboard and I have had some long, intimate moments.
  • Having regexes available is a godsend, after XSLT 1.0.
  • I still really need a decent XQuery pretty printer, alá perltidy.

Overall, I have to ask myself: would I do it the same again? And I probably would. For this particular project, I would try to place more emphasis on the XQuery than the XSLT (this was down to our inexperience—you should always try to work as close to the data store as possible). Despite the initial strong learning curve, the XQuery itself was rarely the main problem. But that’s leading into a whole new post…

In short: if you have a bunch of XML data lying around, XQuery is an excellent way to get the most use out it5.

1 count(/doc)

2 There’s also a second site, almost identical, but on a different topic which has 46,876 documents.

3 HTML 4.01. Sadly, XHTML and browsers still interact badly.

4 This is a rather baroque DTD unfortunately.

5 If you’re not up to paying for a MarkLogic licence (it’s pricey), then eXist might be worth checking out.

Google Collections

Posted by Dominic Mitchell Sat, 19 Jan 2008 23:11:00 GMT

I’ve been rather stuck in the Java world recently. One of the things that makes this bearable is developing in Java 1.5. For all the flak that generics take, they are far superior to littering your code with casts, every single one an admission that the type system is utterly broken. The most common usage of generics tends to be part of the collections API.

But the collections only go so far. In particular, I noticed my code filling up with a pattern:

  class Thingummybob {
    Map<String, List<String>> params = new HashMap<String, List<String>>();

    void addParam(String name, String value) {
      List<String> values = params.get(name);
      if (values == null) {
        values = new ArrayList<String>();
        params.put(name, values)
      }
      values.add(value);
    }
  }

In short, ridiculously complicated1. Look at that code, then think about fetching or removing a value. It’s ridiculously complex for something that should be simple.

Then I found the google collections framework. It’s an extension of the collections framework and includes the marvellous MultiMap. It’s like an ordinary Map, but it takes multiple values automatically. The above code now looks like this:

  class Thingummybob {
    Multimap<String, String> params = Multimaps.newHashMultimaps();

    void addParam(String name, String value) {
      params.put(name, value);
    }
  }

Much, much nicer. You’ll also notice that factory function newHashMultimaps(). It cunningly avoids having to specify the types twice thanks to the compiler inferring what they should be.

Once I started using google collections, I noticed a few other niceties contained within.

  • There’s a Join class for joining Strings together, just like Perl’s join. This is a serious lack in the standard Java library, and something I keep reinventing (usually badly).
  • The Preconditions class provides a nice standard way of doing assert-like behaviour.
  • See Coding in the small with Google Collections for more examples of little things that can make your life simpler.

Overall, I’d rate the google collections API as “well worth a look.” It has the potential to make your code simpler, and that can only be a good thing.

1 Oh, what I’d give for autovivification!

Clichés are hard 7

Posted by Dominic Mitchell Wed, 19 Dec 2007 06:42:00 GMT

So yesterday, Jeremy asked:

Wondering if accents are valid in class names (so I can mark up some text as being of the class “cliché”)

It’s a damned good question. And you have to consider: character encodings; CSS; HTML; XHTML; JavaScript; HTTP. Needless to say, it’s more complicated than it looks at first.

My first thought was that of CSS files. Is it valid to say:

  p.cliché { color: #f00; }

To answer this question, you have to visit the CSS 2.1 spec. Near the end is §G.1 Grammar. It contains a BNF grammar describing the syntax of CSS. They’re not that difficult to read when you get the hang of them. In this case, I start by finding something I can recognise: a selector. Then, I work down through the grammar to find what I’m interested in.

  • selector : simple_selector [ combinator simple_selector ]* ;
    • a selector is composed of one or more simple_selectors.
  • simple_selector : element_name [ HASH | class | attrib | pseudo ]* | [ HASH | class | attrib | pseudo ]+ ;
    • A simple_selector is composed of an element_name followed by zero or more an ID, class, attribute or pseudo selector. Alternatively, it’s composed of one or more ID/class/attribute/pseudo selector (without an element name).
  • class : '.' IDENT ;
    • A class name is just ”.” followed by an identifier. That’s what we’re interested in here.
  • ident -?{nmstart}{nmchar}*
    • This is now in §G.2. But you can see that an identifier has an optional leading minus, followed by an nmstart and zero or more nmchar. It’s those nmchar that we care about.
  • nmchar [_a-z0-9-]|{nonascii}|{escape}
    • nmchar allows letters, numbers, underscores and minuses, as well as non-ascii characters and escapes. Oooh! Getting closer!
  • nonascii [\200-\377]
    • This is a horrid notation. It’s an octal character range. Octal stopped being in general usage in the early 80s, although Unix and C perpetuate them. Anyway, it says that any character whose code is between 128 and 255 is allowed.

So we get an answer: Because é is U+00E9 (or, 233 decimal), it’s allowed as part of an identifier in a CSS file.

But it’s worth noting the arbitrary limit of 255 here. That means that you don’t get to use any unicode character above that (e.g. Ā [U+0100]) verbatim in a CSS file. Instead, you have to escape it by saying (according to the escape declaration in that grammar) \h100. Which is quite nasty.

There’s one other wrinkle to consider before this will work. You also need to ensure that the CSS file is served over HTTP using the correct character set. If you’ve saved it as Latin-1, you need to ensure that it’s served up with this header:

  Content-Type: text/css; charset="iso-8859-1" 

This is the default, so it could be left off, but it’s usually better to be explicit. Likewise, if the file is saved as UTF-8, you need this header to be added.

  Content-Type: text/css; charset="UTF-8" 

If you’re using Apache, check out the AddDefaultCharset and AddCharset directives.

So that’s CSS. But what about HTML?

HTML is defined in the HTML 4.01 specification. It’s defined using SGML, which means more complication in order to work out what the heck’s going on. Thankfully, everybody knows that there are four ways to get an é into HTML:

  • A literal é.
  • A character entity: &eacute;
  • A decimal character reference: &233;
  • A hex character reference: &xE9;

In order to figure out what characters are allowed in a class attribute, though, you have to go and start looking at the DTD:

  • The coreattrs entity is the first mention. It defines a class as being some CDATA.
  • The definition of CDATA is an intrinsic part of SGML. The details of which can be altered by the SGML Declaration for HTML 4. There’s a section at the beginning which lists which characters are allowed. It includes a large number of unicode characters all above 160 decimal.

That means that it’s safe to include a character via any of the above methods.

But there are a few more wrinkles. Firstly, whilst the two characters references above are intrinsic to HTML (via SGML), where does the character entity come from? Well, they are defined as part of the HTML spec: Character entity references in HTML 4.

There’s also the problem of the character encoding in case you use the literal é. Like the CSS above, you need to ensure that your web server is telling everybody what character encoding the file is served as. Actually, for HTML, it’s less of a problem, as the browser will generally auto-detect character encodings. But that’s not necessarily reliable, so it’s better to be explicit. And in HTML, you can put the character encoding in the file itself:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Yes, this a little bit like having a french dictionary containing the words “ecrit en Français” in the front. But it’s a good idea to have both this and the HTTP declaration (and they must match).

I’m not going to talk about XHTML/XML because it’s not in widespread use (i.e. serving it up as application/xhtml+xml).

Finally, what about JavaScript? Well, it’s defined as ECMA-262 (3rd edition). That spec explicitly defines everything in terms of Unicode, so it’s mostly OK. You can still access characters you can’t type via an escape mechanism: \u00E9 (see the definition of UnicodeEscapeSequence on page 19). Additionally, JavaScript can get at the unicode characters in the DOM quite easily:

  <p id='a1' class="cliché">Lessons will be learned.</p>
  <script type="text/javascript">
    alert(document.getElementById('a1').className)
  </script>

As always, JavaScript files served over HTTP need to be supplied with the correct character-encoding through the Content-Type header. Just like CSS and HTML.

So what’s the take-away from all this?

  • Use literal characters and UTF-8 everywhere. It’s consistent and extensible.
  • Know how to look in the specs when something’s going wrong – you’ll know whether it’s you, or the browser that’s getting it wrong.
  • Characters are hard, let’s go shopping!

Jeremy worked it all out in far less time than I did.

Figuring I should be okay as long as I use a character entity. http://tinyurl.com/7p7qc

Looking at that link, I notice that CDATA is handled specially within STYLE and SCRIPT tags. Yet more exceptions to the rules!

Unit Testing falsehoods 2

Posted by Dominic Mitchell Sat, 15 Dec 2007 08:33:00 GMT

My latest fad is looking at developing Mac apps (in Cocoa). Since the Leopard upgrade, Objective-C 2.0 has Garbage Collection. Congratulations Apple, you’ve lifted it out of the dark ages.

So, I’ve been trying to read up bits in ADC, but my time is limited. This means I turn to podcasts like Late Night Cocoa which I can listen to during my commute.

Today, I was listening to the latest Mac Developer Roundtable, Testing Testing Testing. Neat, I wondered how to do unit testing in Cocoa.

Sadly, the discussion was like asking blind people to critique a painting. None of the participants (save the host, Scotty) actually knew what Unit Testing was, or had ever done any. It was the most ill informed discussion I’ve had the misfortune to hear in a long while.

“asserts are just another form of unit testing.”

“Well, I know my code, so I don’t have to bother with unit testing.”

“Unit testing doesn’t give any value.”

(all paraphrased, but it’s the gist of things).

It’s weird, sort of like jumping back 10 years in time. Like the whole agile thing never happened.

Now, I completely accept that XCode might have terrible support for unit testing. It’s possible, I have’t got that far yet. Likewise, OCUnit might be hard to use. But that doesn’t negate the value of knowing that you’ve fixed something and don’t have to go back to it.

This is particularly surprising, given the high quality of what I’ve seen in Cocoa so far. Apple is big on design patterns. In particular, MVC plays a huge rôle in Cocoa apps. Controllers and Views might well be hard to test, because they’re part of the GUI. But your Model (which is the core of your application) should really be amenable to testing.

I really hope that the rest of the Mac development community isn’t so stuck in the dark ages.

Older posts: 1 2 3 ... 24