Tag: cocoon

 

Using JavaRebel with Cocoon

Normally, the cocoon-maven-plugin includes a reloading classloader, so that changes to class files are automatically picked up when you do mvn jetty:run. Just hit refresh and your changes appear. It’s just like working in PHP. 🙂

This is OK, but it’s not foolproof. This morning, I saw a few errors of the form “expected class SearchManager, but got class SearchManager”. This is a case of the same class being loaded by a different ClassLoader. Annoyingly, I can no longer reproduce this.

There’s a commercial product, JavaRebel, that aims to do a much better reloading ClassLoader. So, I thought I’d give it a try.

The basic idea to use it is twofold:

  • Include the javarebel jar as an agent.
  • Stop jetty from auto-reloading.

Of course, this being cocoon, we also have to stop the cocoon-maven-plugin from using its reloading classloader.

The javarebel documentation is quite clear on how to configure maven and jetty. But it makes no mention of cocoon (understandably).

Thankfully, it’s all fairly simple to configure with a maven profile. This makes it easy to call from the command line.

  <profile>
    <id>javarebel</id>
    <build>
      <plugins>
        <!-- Disable Jetty's auto-reload -->
        <plugin>
          <groupId>org.mortbay.jetty</groupId>
          <artifactId>maven-jetty-plugin</artifactId>
          <configuration>
            <scanIntervalSeconds>0</scanIntervalSeconds>
          </configuration>
        </plugin>
        <!-- Disable cocoon's RCL. -->
        <plugin>
          <groupId>org.apache.cocoon</groupId>
          <artifactId>cocoon-maven-plugin</artifactId>
          <configuration>
            <reloadingClassLoaderEnabled>false</reloadingClassLoaderEnabled>
            <reloadingSpringEnabled>false</reloadingSpringEnabled>
          </configuration>
        </plugin>
      </plugins>
    </build>
  </profile>

With that in place, all that remains is a teeny-tiny shell script to augment the normal call to maven.

#!/bin/sh
javarebel_jar="$HOME/javarebel.jar"
MAVEN_OPTS="$MAVEN_OPTS -noverify -javaagent:$javarebel_jar" mvn -Pjavarebel "$@"

With this, you can immediately see that javarebel is enabled, as it spits out a big message at startup time. But more importantly, as soon as I change a spring bean (and reload the page that uses it), I get this on the console:

JavaRebel: Reloading class 'com.example.Spigot'.
JavaRebel-Spring: Reconfiguring bean 'spigot' [com.example.Spigot]

Hurrah — no errors! It all seems to work rather well. I should probably purchase a licence. 🙂

Update: I’ve seen the error again:

Caused by: org.springframework.beans.PropertyBatchUpdateException; nested PropertyAccessExceptions (1) are:
PropertyAccessException 1: org.springframework.beans.TypeMismatchException: Failed to convert property value of type [com.example.MyService] to required type [com.example.MyService] for property 'myService'; nested exception is java.lang.IllegalArgumentException: Cannot convert value of type [com.example.MyService] to required type [com.example.MyService] for property 'myService': no matching editors or conversion strategy found
at org.springframework.beans.AbstractPropertyAccessor.setPropertyValues(AbstractPropertyAccessor.java:104)
at org.springframework.beans.AbstractPropertyAccessor.setPropertyValues(AbstractPropertyAccessor.java:59)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1198)
… 101 more

Cocoon Settings

I’ve been looking quite extensively at the cocoon-spring-configurator, trying to work out how to make it fit into our preferred java webapp config scheme: context-params.

By default, cocoon-spring-configurator just reads Properties files. The complete list of property files that cocoon-spring-configurator picks up is extensive. If you want to see what’s happening, then add this to log4j.xml:

  <logger name="org.apache.cocoon.spring.configurator">
    <level value="DEBUG" />
  </logger>

But there’s one interesting bit in the docs:

9. If a property provider is configured in the application context, it is queried for a set of properties.

If you have special requirements for property handling, you can add a property provider bean which is a Spring managed bean conforming to the org.apache.cocoon.configuration.PropertyProvider interface. For example, if you want to store your configurations inside a database or configure them through a jndi context, you can provide these values through a custom implementation.

So this means that you:

  1. Write a class that fetches properties from somewhere like (say) the ServletContext.
  2. Add that class into Spring with the name org.apache.cocoon.configuration.PropertyProvider.

So I did that. The code itself is moderately simple:

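import java.util.Enumeration;
import java.util.Properties;
import javax.servlet.ServletContext;
import org.apache.cocoon.configuration.PropertyProvider;
import org.apache.cocoon.configuration.Settings;
import org.springframework.web.context.ServletContextAware;

public class ContextParamsProvider implements PropertyProvider, ServletContextAware {
    private ServletContext servletContext;
    // Spring calls this because we implement ServletContextAware.
    public void setServletContext(ServletContext servletContext) {
        this.servletContext = servletContext;
    }
    // Copy every context-param into a Properties object for the configurator.
    public Properties getProperties(Settings settings, String runningMode,
            String path) {
        Properties props = new Properties();
        Enumeration<String> en = servletContext.getInitParameterNames();
        while (en.hasMoreElements()) {
            String name = en.nextElement();
            String value = servletContext.getInitParameter(name);
            props.setProperty(name, value);
        }
        return props;
    }
}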

This should mean that settings such as which email address to send to can live completely outside the application.

Cocoon sitemap variables

Whilst dragging myself through an issue for a client last night, I found another cocoon feature I wasn’t aware of: sitemap variables.

Of course, I’d seen Input Modules before; in fact, I’ve written one. They’re nice and simple and look like this:

  <map:parameter name="contextPath" value="{request:servletPath}"/>

But then I came across this little line:

  <map:parameter name="email" value="${contact.email}" />

What the heck is that? I immediately jumped into the cocoon source code, thanks to Jukka Zitting. First, I saw _no_ examples of this syntax in any sitemap.xmap. Marvellous.

After a bit of digging, I ended up at VariableExpressionTokenizer (thanks to a comment). This revealed (in yet another comment) that strings of the form ${…} are handled using the new cocoon-expression-language.

… snip several hours of wasted time …

Now, at this point, I’ve gone down the garden path. I’ve spent quite a few hours debugging this and it’s completely in the wrong direction — it seems like cocoon has far too many ways of inserting variables into the sitemap.

Finally, I’ve managed to end up looking at AvalonUtils.replaceProperties() (which gets called from SitemapLanguage.build()). This gets passed in a Settings object (i.e. something you configured with the cocoon-spring-configurator). So any reference to ${something} will look it up directly in your configuration.

It’s actually slightly more generic than that. Looking through the code, AvalonUtils.replaceProperties() also gets called any time that an avalon component is set up. So, any older components can benefit from the new Settings as well.
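To make the substitution concrete, here’s a minimal sketch of the idea in plain Java. To be clear, this is not the code from AvalonUtils.replaceProperties(); the class name, the regex and the decision to leave unknown names untouched are all my own assumptions. A java.util.Properties object stands in for the Settings.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: NOT Cocoon's AvalonUtils, just the same idea in miniature.
final class PropertySubstitutionSketch {
    // Matches ${name}; note the leading '$', so {request:servletPath} is not touched.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${name} with its configured value.
    // Assumption: names that are not configured are left exactly as written.
    static String replaceProperties(String text, Properties settings) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = settings.getProperty(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties settings = new Properties();
        settings.setProperty("contact.email", "dom@example.com");
        String line = "<map:parameter name=\"email\" value=\"${contact.email}\" />";
        // prints: <map:parameter name="email" value="dom@example.com" />
        System.out.println(replaceProperties(line, settings));
    }
}
```

The point of the sketch is only that the lookup is a plain string replacement against your configuration, which is why it works in any Avalon component’s configuration and not just in the sitemap.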

Summary:

  1. Create a property file in your block. e.g. src/main/resources/META-INF/cocoon/properties/app.properties.
  2. Set a value in there. e.g. contact.email=dom@example.com
  3. In src/main/resources/COB-INF/sitemap.xmap, you can now say ${contact.email} and the correct value will be substituted. e.g.
      <map:parameter name="email" value="${contact.email}" />

Sitemap components in Cocoon 2.2

For the cocoon 2.1 project I did last year, I wrote a few components in Java (mostly Generators and one InputModule). It’s a bit of a pain because it’s built on the out-of-date and intrusive avalon framework. Anyway the end result is that you can write things in your cocoon sitemap like:

  <map:match pattern="foo">
    <map:generate type="foo" />
    <map:transform type="xslt" src="{my:something}" />
    <map:serialize type="html" />
  </map:match>

Now with the newer cocoon 2.2 project, we needed to reuse the InputModule. In order to get an Avalon component working in cocoon 2.2, you need:

  • src/main/java/com/myco/MyInputModule.java
  • src/main/resources/META-INF/cocoon/avalon/my-input-module.xconf

The latter is the old style avalon configuration file and looks like this:

  <components>
    <input-modules>
      <component-instance
        class="com.myco.MyInputModule"
        logger="core.modules.input"
        name="my"
        />
    </input-modules>
  </components>

But cocoon 2.2 is meant to be Spring based. So we should be able to write a Spring bean that does the same thing. Unfortunately, this doesn’t appear to be documented anywhere. But there’s always the source code for cocoon itself.

As an aside, I’m hugely grateful to Jukka Zitting for making available git mirrors of apache projects. This let me download all of cocoon’s source, including full history in a very short time. It’s a helluva useful resource, and I hope it gets proper support from the ASF.

So, I started looking through the Cocoon source code for Spring config files that might reference InputModules.

⇒ find . -name '*.xml' | grep META-INF/cocoon/spring | xargs egrep InputModule
./core/cocoon-servlet-service/cocoon-servlet-service-components/src/main/resources/META-INF/cocoon/spring/cocoon-servlet-linkRewritingReader.xml:        <property name="inputModule" ref="org.apache.cocoon.components.modules.input.InputModule/servlet"/>
./core/cocoon-servlet-service/cocoon-servlet-service-components/src/main/resources/META-INF/cocoon/spring/cocoon-servlet-service-complete-path-module.xml:    <bean name="org.apache.cocoon.components.modules.input.InputModule/servlet"
./core/cocoon-servlet-service/cocoon-servlet-service-components/src/main/resources/META-INF/cocoon/spring/cocoon-servlet-service-complete-path-module.xml:        <property name="blockPathModule" ref="org.apache.cocoon.components.modules.input.InputModule/block-path"/>
./core/cocoon-servlet-service/cocoon-servlet-service-components/src/main/resources/META-INF/cocoon/spring/cocoon-servlet-service-path-module.xml:    <bean name="org.apache.cocoon.components.modules.input.InputModule/block-path"
./core/cocoon-servlet-service/cocoon-servlet-service-components/src/main/resources/META-INF/cocoon/spring/cocoon-servlet-service-property-module.xml:	<bean name="org.apache.cocoon.components.modules.input.InputModule/block-property"

That seems to suggest that by creating a Spring bean whose name is the interface followed by a slash and then a name, you should be able to get at it in the sitemap. And it does indeed work like that. This is the bean name I ended up with.

  <bean name="org.apache.cocoon.components.modules.input.InputModule/my"
        class="com.myco.MyInputModule">
    <property name="message" value="aardvark" />
  </bean>

In order for it to not be an Avalon component, I also removed the base class I was using and just implemented the InputModule interface directly. And to my surprise, I could use the bean directly in the sitemap.

But we have the source, so we can see why it’s working. Did you notice the ROLE constant in InputModule? We can grep for that in the source. That quickly led me to PreparedVariableResolver.java, which contains:

  InputModule module;
  try {
    module = (InputModule) this.manager.lookup(InputModule.ROLE + '/' + moduleName);
  } catch (ServiceException e) {
    throw new PatternException("Cannot get module named '" + moduleName +
    "' in expression '" + this.originalExpr + "'", e);
  }

The manager field is an instance of ServiceManager, which just happens to be AvalonServiceManager. And its lookup() boils down to a standard Spring call:

  try {
    return this.beanFactory.getBean(role);
  } catch (BeansException be) {
    throw new ServiceException("AvalonServiceManager",
      "Exception during lookup of component with '" + role + "'.", be);
  }

Further nosing around reveals that Generators are looked up in Spring in a similar manner inside AbstractProcessingPipeline:

  try {
    this.generator = (Generator) this.newManager.lookup(Generator.ROLE + '/' + role);
  } catch (ServiceException ce) {
    throw ProcessingException.throwLocated("Lookup of generator '" + role + "' failed", ce, getLocation(param));
  }

I’m guessing that transformers, serializers and readers follow a similar pattern.
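Here’s the naming convention boiled down to a toy you can run without Cocoon or Spring on the classpath. A plain Map stands in for the Spring BeanFactory, and the class and method names are made up for illustration; only the “role, slash, name” key format comes from the Cocoon source quoted above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in: a Map plays the part of the Spring BeanFactory.
final class RoleLookupSketch {
    // The interface name that prefixes the bean names found by the grep above.
    static final String INPUT_MODULE_ROLE =
            "org.apache.cocoon.components.modules.input.InputModule";

    private final Map<String, Object> beans = new HashMap<String, Object>();

    // Equivalent to declaring <bean name="..."> in a Spring config file.
    void register(String beanName, Object bean) {
        beans.put(beanName, bean);
    }

    // Same shape as manager.lookup(InputModule.ROLE + '/' + moduleName).
    Object lookupInputModule(String moduleName) {
        String beanName = INPUT_MODULE_ROLE + '/' + moduleName;
        Object bean = beans.get(beanName);
        if (bean == null) {
            throw new IllegalArgumentException("No bean named '" + beanName + "'");
        }
        return bean;
    }

    public static void main(String[] args) {
        RoleLookupSketch registry = new RoleLookupSketch();
        registry.register(INPUT_MODULE_ROLE + "/my", "pretend this is MyInputModule");
        // {my:something} in the sitemap ends up as a lookup of the name "my".
        System.out.println(registry.lookupInputModule("my"));
    }
}
```

So the sitemap never needs to know whether a component came from an Avalon xconf file or a Spring bean definition; it only ever asks for a name.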

So, that’s how to do Cocoon components in Spring. It’s a bit simpler than before, which can’t be a bad thing. I just wish it was documented, but that’s a big problem for Cocoon anyway…

Logging in Cocoon 2.2

I’ve had to try and understand logging in Cocoon 2.2 for a project at work recently. It’s been “interesting,” so I thought I’d blog the process in case anybody else needs to do this…

Normally, logging in Java is quite simple: you add log4j to your classpath, then create a log4j.properties to say what gets logged. If you’re running as part of a webapp, you use something like Spring’s Log4jConfigListener to ensure that the configuration gets applied as soon as the webapp is active.

Cocoon is different. By default (in development) you run it using a combination of the cocoon-maven-plugin and mvn jetty:run. This is quite cunning (as Grzegorz explained in a comment a while back): it lets you edit all sorts of stuff and have it dynamically reloaded. In order to make the cocoon “block” work with jetty, the maven plugin creates things like web.xml for you automatically, so there’s no chance to edit things. Drat.

Now, if you follow the documentation for logging in Cocoon, it advises:

The usual Cocoon web application sets up Log4j through the Cocoon Spring Configurator.

Lovely advice, except that by the time Spring has started up, read your logging configuration and applied it, a great many interesting events have already occurred. You really need to enable logging as early as possible using a ServletContextListener.

Thankfully, it’s possible to do so, even when using mvn jetty:run.

First, you need to create a log4j configuration that’s suitable. I think it has to be XML, which is a shame as it’s more complicated than the properties file. I wanted to change the defaults to get my FlowScript calls to log to the console. This is what I ended up with in etc/log4j.xml.

  <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
  <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
    <appender name="stdout" class="org.apache.log4j.ConsoleAppender">
      <param name="target" value="System.err"/>
      <layout class="org.apache.log4j.PatternLayout">
        <param name="ConversionPattern" value="%d{ISO8601} %c{2} %p - %m%n"/>
      </layout>
    </appender>
    <logger name="cocoon">
      <level value="INFO" />
    </logger>
    <logger name="org.apache.cocoon.components.flow.javascript.fom">
      <level value="INFO" />
    </logger>
    <root>
      <priority value="WARN"/>
      <appender-ref ref="stdout"/>
    </root>
  </log4j:configuration>

Note that we shut everything up (WARN) by default, and then explicitly enable messages for things we want to see (INFO). I’ve found that even in development, this helps to tell the wood from the trees.
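If the “quiet root, noisy children” idea isn’t familiar, here’s a small runnable sketch of it. Note that I’ve swapped in java.util.logging so it runs with no extra jars; it is not log4j, but the dotted-name inheritance works the same way. The class name and the logger names passed in main are just examples.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Uses java.util.logging as a stand-in for log4j; the hierarchy rules are analogous.
final class LoggerLevelSketch {
    // Keep strong references: java.util.logging only holds loggers weakly.
    static final Logger ROOT = Logger.getLogger("");
    static final Logger COCOON = Logger.getLogger("cocoon");

    static void configure() {
        ROOT.setLevel(Level.WARNING); // shut everything up by default
        COCOON.setLevel(Level.INFO);  // but let "cocoon" and its children through
    }

    // A logger with no level of its own inherits from its nearest configured ancestor.
    static boolean infoEnabled(String loggerName) {
        return Logger.getLogger(loggerName).isLoggable(Level.INFO);
    }

    public static void main(String[] args) {
        configure();
        System.out.println(infoEnabled("cocoon.sitemap"));    // true, inherited from "cocoon"
        System.out.println(infoEnabled("org.mortbay.jetty")); // false, inherited from the root
    }
}
```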

With that in place, you have to edit pom.xml in order to tell the cocoon-maven-plugin to use that instead of its default. The default pom should have a build/plugins/plugin section, to which this stanza needs adding.

  <configuration>
    <!-- Gets copied to target/.../WEB-INF/log4j.xml -->
    <customLog4jXconf>etc/log4j.xml</customLog4jXconf>
  </configuration>

Finally, you need to arrange for the auto-generated web.xml to be patched with a reference to Log4jConfigListener. This is done through Cocoon’s slightly arcane mechanism, xpatch. Create a file src/main/resources/META-INF/cocoon/xpatch/log4j.xweb which looks like this.

  <xweb xpath="/web-app"
        unless="comment()[contains(., 'Log4j Configuration')]"
        insert-after="node()[1]">
    <!--Log4j Configuration-->
    <context-param>
      <param-name>log4jConfigLocation</param-name>
      <param-value>/WEB-INF/log4j.xml</param-value>
    </context-param>
    <listener>
      <listener-class>org.springframework.web.util.Log4jConfigListener</listener-class>
    </listener>
  </xweb>

Now if you run mvn jetty:run in your block and inspect the generated web.xml, you should see the above patched into place. Also, you should be able to generate messages on the console from within FlowScript by doing:

  cocoon.log.info("hello world");

The procedure above is a hassle. But the benefit of being able to see logging messages coming out on the console in front of you is significant.

One final point to note. When you do run mvn jetty:run, you’ll see a few log4j warnings, e.g.

log4j:WARN No appenders could be found for logger (org.apache.commons.configuration.ConfigurationUtils).
log4j:WARN Please initialize the log4j system properly.

log4j:WARN No appenders could be found for logger (org.apache.commons.jci.stores.MemoryResourceStore).
log4j:WARN Please initialize the log4j system properly.

As far as I can tell these are completely ignorable, just very annoying. They appear to happen before jetty itself starts up, and are irrelevant to the web app (as far as I can see).

Trusting your tools

After Grzegorz’s piping up, I’m giving cocoon 2.2 another try. Here are some selected errors.

  javax.servlet.ServletException: No block for /favicon.ico
          at org.apache.cocoon.servletservice.DispatcherServlet.service(DispatcherServlet.java:84)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingServlet.service(ReloadingServlet.java:89)
          at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1093)
          at org.apache.cocoon.servlet.multipart.MultipartFilter.doFilter(MultipartFilter.java:119)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingServletFilter.doFilter(ReloadingServletFilter.java:50)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)
          at org.apache.cocoon.servlet.DebugFilter.doFilter(DebugFilter.java:169)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingServletFilter.doFilter(ReloadingServletFilter.java:50)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingSpringFilter.doFilter(ReloadingSpringFilter.java:69)
          at org.apache.cocoon.tools.rcl.wrapper.servlet.ReloadingServletFilter.doFilter(ReloadingServletFilter.java:50)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
          at org.mortbay.jetty.Server.handle(Server.java:313)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:506)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:830)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:514)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:381)
          at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:396)
          at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

How fabulous! 30 lines to tell me about a 404 I couldn’t care less about! (this is from mvn jetty:run). And in the process, obliterating any messages I did care about.

  [ERROR] VM #displayTree: error : too few arguments to macro. Wanted 2 got 0
  [ERROR] VM #menuItem: error : too few arguments to macro. Wanted 1 got 0
  [INFO] ------------------------------------------------------------------------
  [INFO] BUILD SUCCESSFUL
  [INFO] ------------------------------------------------------------------------

There’s an error, but the build was successful. That makes sense. Not. (This is from mvn site.)

  Caused by: org.codehaus.plexus.util.xml.pull.XmlPullParserException: TEXT must be immediately followed by END_TAG and not START_TAG (position: START_TAG seen ...<reports>\n            <report>... @118:21)
          at org.codehaus.plexus.util.xml.pull.MXParser.nextText(MXParser.java:1063)
          at org.apache.maven.model.io.xpp3.MavenXpp3Reader.parseReportPlugin(MavenXpp3Reader.java:3572)
          at org.apache.maven.model.io.xpp3.MavenXpp3Reader.parseReporting(MavenXpp3Reader.java:3709)
          at org.apache.maven.model.io.xpp3.MavenXpp3Reader.parseModel(MavenXpp3Reader.java:2347)
          at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read(MavenXpp3Reader.java:4422)
          at org.apache.maven.project.DefaultMavenProjectBuilder.readModel(DefaultMavenProjectBuilder.java:1412)
          ... 17 more
  [INFO] ------------------------------------------------------------------------
  [INFO] Total time: 2 seconds
  [INFO] Finished at: Sun Jan 27 22:25:35 GMT 2008
  [INFO] Final Memory: 1M/2M
  [INFO] ------------------------------------------------------------------------

XML parser exception (and in only 40+ lines!). Complaining about unbalanced tags? Must be non-well-formed XML, right? Wrong. This is down to copying an example from “Better builds with maven”. But the example’s wrong—I’m missing a tag. Can you guess what’s missing from this XML?

  <reportSets>
    <reports>
      <report>dependencies</report>
    </reports>
  </reportSets>

You mean you didn’t spot the missing reportSet element? I’m shocked, I tell you, shocked. Plus the lack of indication that an error actually occurred. The stacktrace is a good indication, but an actual “ERROR” or “BUILD FAILED” message would be nice (there is an error line, but it zoomed past three screens ago. I blinked and missed it).

So that’s two strikes to maven and one to cocoon. My trust in them is basically non-existent at this point. But at least the RCL worked as documented.

A Year in Cocoon

The other large part of the project at $WORK I’ve just finished was Cocoon. Cocoon is a Java web framework. It’s got some really neat ideas in it, and its main purpose in life is transforming XML. It is (or should be) a perfect match for XML databases.

I described Cocoon about a year ago, towards the start of this project. But how do I feel about it now? Looking back, was it the right choice?

To start with, I’m still very impressed with the core Cocoon technologies. The pipelines are perfect for dealing with XML. FlowScript still impresses the heck out of me.

But there’s a lot that leaves a sour taste. My first complaint was about the size. Well, 50Mb isn’t so big. But the fact of the matter is that there’s an awful lot of stuff in there, quite a bit of which you don’t want to touch with a bargepole. We wasted a lot of time looking at things like XSP, Actions and implementing our own custom Generators. I wish I’d been made more aware of what FlowScript was up-front, and what it could do for me. I wish I’d realised that it’s basically the “C” in MVC.

Which dovetails straight into another complaint: documentation. There is quite a bit of documentation for Cocoon. But it’s still inadequate given the gargantuan size of the project. And the coverage is extremely spotty. Normally, I’d jump straight to the published literature, but the most recent Cocoon book I could find was hideously out of date. In fact, that’s what caused me to go down several of the rabbit-holes mentioned above.

When I’ve really needed to figure out what’s going on, I’ve invariably had to turn to the cocoon source code. Which, due to its dependence on the weird-yet-not-wonderful avalon framework, is less than simple to understand.

My complaint about debugging still holds, although less severely. You get used to the seemingly intractable error messages. You spot the patterns that are causing trouble. Like most things, logging goes a long way.

And then there’s Cocoon 2.2. My site was developed entirely on Cocoon 2.1. This certainly had its flaws: figuring out how to deploy it as a war file sensibly was a pain. But Cocoon 2.2 has Maven.

I’ve pointed out my dislike for Maven before. As have other people. Recently, other people in my office have been using it and I’ve witnessed the project overruns thanks to trying to figure out what Maven is doing. Nice idea, bad implementation.

Cocoon 2.2 uses Maven because it’s “modularized”. What this means is you can’t have a single project with everything in it any more. You have a “webapp” project and a “block” project. And when something changes you have to build the block, install it, build the webapp, mount the installed block in the webapp and fire up jetty. It doesn’t make for a good development environment.

Now I could be completely wrong and missing the obvious way to do seamless Cocoon 2.2 web development from within Eclipse. I’d love to be corrected. But for now, Cocoon 2.2 has shot itself in the foot as far as I’m concerned.

So I’m not happy with the future direction of Cocoon. I need to look again at why I chose it in the first place. Initially, it was because all the other web frameworks for Java (Struts, Tapestry, Wicket) all seemed totally focussed on form-based CRUD-style web apps. Cocoon focussed on documents and URLs instead. So it’s time to start working my way through the tutorials of the many java web frameworks in order to find a more suitable one. I may not find a good replacement for Cocoon. But I certainly need to try.

Update: In the comments, Grzegorz Kossakowski points out a screencast about the RCL which slightly lessens the pain of interactive development with Cocoon 2.2.

A Year in XQuery

About a year ago, I wrote The year of XQuery?. I’ve just finished my involvement with the large project at $WORK that was using XQuery. So it’s time to reflect over it a little.

First, a bit of background. The site I was developing is essentially a more-or-less static view of 112,149[1] XML documents in 4 collections[2]. The main interface is a search page, plus there’s a browse-by-title if you really fancy getting lost. It sounds simple, but as with any large collection of data, there is lots of variation in the data, leading to unexpected complications. Plus there are several different views of each document, just to make life more fun.

The site is based on Cocoon talking to an XML database (MarkLogic Server in this instance). We get XML back from the database, and render it through an XSLT pipeline into HTML[3]. Plus there’s some nice jQuery on top to smooth the ride.

Looking inside the codebase, we’re presently running at 5200 lines of XQuery code across 39 files (admittedly, that includes a 1000 line “lookup table”). But that doesn’t include any of the search code, which is dynamically generated using Java.

But in many senses, that’s not the remarkable thing about the XQuery. One of the most important aspects has been the ability to query the existing data. This may not sound particularly remarkable for a query language. But for a collection of XML this size, the ability to do ad-hoc queries that understand the document structure is truly remarkable.

For example, several months ago, I received a bug report stating “the tables are not rendering in article 12345.” I was able to look at the article, see that there were no tables, examine the source markup in the database and discover a TAB element that I’d never seen before[4]. But how widespread is this problem? Three seconds later, I have:

  distinct-values(
    for $tab in //TAB
    return base-uri($tab)
  )

Which tells me the 99 affected articles. Now I know that I only need to reload those from the source data instead of all 112,149.

Looking back over my original criticisms, how do they stand up to my experience?

  • The development environment.
    • Well, it’s slightly better than I originally thought. But not good enough when placed next to Java IDEs like Eclipse.
    • I’ve been using my TextMate XQuery Bundle with reasonable success (although it still needs a great deal of improvement). There are modes for Emacs and Vim.
    • I’ve managed to get the OxygenXML debugger talking to MarkLogic, but it was less useful than it initially appeared. The XQuery editor turned out to be worse than useless, because MarkLogic uses an outdated version of XQuery, leading to a lack of syntax colouring and a plethora of error reports.
    • MarkLogic has an add-on, CQ, which is a browser-based interactive query tool. It’s pretty useful.
    • Fundamentally, sharing a database between developers (the stored procedure model) doesn’t work well when you have multiple people updating it.
      • We solved this expediently by kicking everybody else off the project. 🙂
  • The verbosity.
    • Like most things, you kind of get used to it. Although I confess that when I started to understand what I was doing, I found that I could write code in a significantly less verbose way.
  • The SQL / functional nature.
    • This is another one of those things that you just get used to. And in this case, start to enjoy.
  • Not a standard.
    • Fixed!—although not MarkLogic, as mentioned above.
  • XML Namespaces bite.
    • And continue to do so. Let’s blame Tim Bray. 🙂
    • Seriously, this continues to be a problem, even now a year later. I lost a whole morning two weeks ago due to my inability to query the correct namespace. My main advice now is to never use a default namespace—prefix everything.
  • The type system.
    • Over time, now that I have come to understand it, I can begin to use the type system to my advantage. In fact, it’s one of the things that I usually have to reinforce in developers just starting in the project.
    • Thankfully, I’ve never needed to dabble in XML schema.
  • Implementation defined areas.
    • It’s a concern, I’ll admit. MarkLogic is profligate in this area, and to get decent performance, you need to use the extensions.
  • Smiley comments
    • They’re just about getting to the point of being ignorable.
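The namespace complaint in the list above is easier to believe once you’ve seen it bite. Here’s a small sketch using the XPath support built into the JDK rather than XQuery on MarkLogic, so treat it as an analogy and not a transcript of what I ran. The namespace URI, the prefix “a” and the class names are invented for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class NamespacePitfallSketch {
    static final String NS = "urn:example:articles"; // hypothetical namespace
    // The document declares NS as its default namespace, so <p> lives in NS.
    static final String XML = "<doc xmlns=\"" + NS + "\"><p>hello</p></doc>";

    // Count the nodes selected by nodeExpr, with the prefix "a" bound to NS.
    static int count(String xml, String nodeExpr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new ArticleNamespaces());
            Double n = (Double) xpath.evaluate(
                    "count(" + nodeExpr + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static final class ArticleNamespaces implements NamespaceContext {
        public String getNamespaceURI(String prefix) {
            return "a".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return NS.equals(uri) ? "a" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            List<String> prefixes = NS.equals(uri)
                    ? Collections.singletonList("a")
                    : Collections.<String>emptyList();
            return prefixes.iterator();
        }
    }

    public static void main(String[] args) {
        System.out.println(count(XML, "//p"));   // 0: an unprefixed name means "no namespace"
        System.out.println(count(XML, "//a:p")); // 1: the prefixed name finds the element
    }
}
```

The first query looks right and silently returns nothing, which is exactly the kind of thing that eats a morning.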

But what new things have I learned?

  • I seriously underestimated the utility of function libraries. You can have two kinds of query in XQuery: an inline query, or a “module”. The advantage of a module is that it’s a lot easier to reach in and test an individual function by hand when needed.
  • Speaking of testing, I haven’t come up with a good solution for unit testing. This pains me greatly. I realise that it’s similar to the RDBMS unit testing problem and basically I need a known test database. MarkLogic doesnt make automating this easy (there are no management APIs).
  • Performance is very unpredictable. Not unpredictable as in “varys a lot”, but difficult to tell the performance of a given statement by visual inspection. MarkLogic comes with profiling APIs, which helps somewhat. But compared to EXPLAIN in SQL, it still feels a bit primitive.
    • For example, my XSLT experience told me to avoid things like //p to examine all paragraphs. But in XQuery, everything is indexed up the wazzoo, so it’s likely to be faster than an XPath statement with an explicit path.
  • Thinking in a functional style is an art. I’ve had a few problems, which cry out for an accumulator of some sorts. My whiteboard and I have had some long, intimate moments.
  • Having regexes available is a godsend, after XSLT 1.0.
  • I still really need a decent XQuery pretty printer, à la perltidy.
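
To make the accumulator point concrete, here's the usual trick for getting one without a mutable variable: pass the running total along as a parameter to a recursive function. This is a rough sketch in JavaScript rather than XQuery, and totalWords and its data are made up for illustration.

```javascript
// Hypothetical example: count the words in a list of paragraphs
// without ever reassigning a running total.
function totalWords(paragraphs, acc) {
  // acc carries the running total; it starts at 0 on the first call.
  if (acc === undefined) acc = 0;
  // Base case: nothing left to count, so the accumulator is the answer.
  if (paragraphs.length === 0) return acc;
  // Split on whitespace and drop the empty strings that blank input produces.
  var words = paragraphs[0].split(/\s+/).filter(function (w) { return w !== ""; });
  // Recurse on the rest of the list, handing the new total along.
  return totalWords(paragraphs.slice(1), acc + words.length);
}

var paras = ["one two three", "four five", ""];
var total = totalWords(paras);
```

The same shape carries over to any language without mutable variables: the state you would normally update in a loop becomes an extra argument.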

Overall, I have to ask myself: would I do it the same again? And I probably would. For this particular project, I would try to place more emphasis on the XQuery than the XSLT (this was down to our inexperience—you should always try to work as close to the data store as possible). Despite the initial strong learning curve, the XQuery itself was rarely the main problem. But that’s leading into a whole new post…

In short: if you have a bunch of XML data lying around, XQuery is an excellent way to get the most use out of it5.

1 count(/doc)

2 There’s also a second site, almost identical, but on a different topic which has 46,876 documents.

3 HTML 4.01. Sadly, XHTML and browsers still interact badly.

4 This is a rather baroque DTD unfortunately.

5 If you’re not up to paying for a MarkLogic licence (it’s pricey), then eXist might be worth checking out.

Cocoon

Surprising as it may seem if you only read this blog, I don’t do much Perl or Ruby or Rails. I try to in my spare time, but it’s not what I’m doing at $WORK. That’s mostly concerned with pushing around XML using Java. Right now, I’m trying to learn Cocoon.

Cocoon is a framework (in much the same way that Rails is), but it’s oriented to pushing around XML[1]. The basics of cocoon are pretty simple. There’s a “pipeline” for processing XML:

  • A generator produces XML. Usually, this is just reading a file. At $WORK, it’s pulled from an XML database.
  • Zero or more transformers munge the XML in various ways. Normally, this is XSLT.
  • Finally, it gets output through a serializer. Mostly this will be HTML.

There’s a little bit more to it, but that’s the basics. And for serving up XML directly, in a read-only fashion, it actually works really well.
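
If it helps to see the shape of a pipeline as code, here's a toy model in JavaScript. None of this is Cocoon's API: runPipeline and the three stage functions are invented names, and a plain object stands in for the XML.

```javascript
// Hypothetical model of generator -> transformers -> serializer.
function runPipeline(generator, transformers, serializer) {
  // The generator produces the initial document.
  var doc = generator();
  // Each transformer takes a document and returns a new one.
  for (var i = 0; i < transformers.length; i++) {
    doc = transformers[i](doc);
  }
  // The serializer turns the final document into output text.
  return serializer(doc);
}

// Generator: in real life this would read a file or query a database.
function generateArticle() {
  return { title: "hello", paragraphs: ["first", "second"] };
}

// Transformer: stands in for an XSLT step.
function upcaseTitle(doc) {
  return { title: doc.title.toUpperCase(), paragraphs: doc.paragraphs };
}

// Serializer: emits HTML.
function toHtml(doc) {
  var body = doc.paragraphs.map(function (p) { return "<p>" + p + "</p>"; }).join("");
  return "<h1>" + doc.title + "</h1>" + body;
}

var html = runPipeline(generateArticle, [upcaseTitle], toHtml);
```

The point of the design is that each stage only knows about the document it is handed, so stages can be swapped or chained freely.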

The problems start when you want to get a little bit more interactive. It seems that Cocoon has evolved a number of different approaches over the years, but the current favourite appears to be FlowScript.

FlowScript is server-side JavaScript[2]. When a URL is matched, a little bit of JavaScript gets run in order to determine what to do. It can interact with Java objects and, when it’s figured out what to do, run the appropriate pipeline, passing in parameters. It’s effectively an MVC architecture, with the controller being JavaScript.

But what’s really neat about FlowScript is captured in a single call:

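  function calculator()
  {
    var a, b;

    // Send the first form, then pause until the user submits it.
    cocoon.sendPageAndWait("getA.html");
    // Request parameters arrive as strings, so convert before adding.
    a = parseFloat(cocoon.request.get("a"));

    cocoon.sendPageAndWait("getB.html");
    b = parseFloat(cocoon.request.get("b"));

    // Final page: no need to wait for another request.
    cocoon.sendPage("result.html", {result: a + b});
  }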

cocoon.sendPageAndWait() uses a continuation to pause the execution of the JavaScript and return a page to the browser. When the user submits the form, the FlowScript carries on executing from just after the call to cocoon.sendPageAndWait(). Neat stuff.
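
If continuations are new to you, a JavaScript generator function gives a rough feel for the pause-and-resume behaviour. To be clear, this is an analogy, not how Cocoon implements it: calculatorFlow and the little driver below are made up, and yield is standing in for cocoon.sendPageAndWait().

```javascript
// Sketch only: yield hands a page name back to the caller and pauses
// until the next "request" arrives with the submitted value.
function* calculatorFlow() {
  var a = parseFloat(yield "getA.html");
  var b = parseFloat(yield "getB.html");
  return { page: "result.html", result: a + b };
}

// Hypothetical driver playing the part of the framework and the browser.
var flow = calculatorFlow();
var first = flow.next();      // runs up to the first yield
var second = flow.next("2");  // user submits a = "2"; resumes after the yield
var last = flow.next("3");    // user submits b = "3"; the flow finishes
```

Unlike a real continuation, a generator can only be resumed once from each point, so this doesn’t model what happens when the user hits the back button and resubmits an earlier form.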

Continuations are currently the hot thing because of Seaside, a web framework for Smalltalk. But Cocoon’s had them for a couple of years.

Building on FlowScript is a framework for form handling called CForms. The idea is that you define a model for your form, which then gets rendered into HTML. I’m playing with this for a very complex form at the moment, and I’m not totally sold on the concept. Plus the generated result is some pretty yucky markup.

In fact, there are quite a few things about cocoon that make me feel uncomfortable about it.

  • It’s huge. The download is 50MB, and you get a lot in that. The problem is twofold: firstly, you don’t need most of it most of the time. Secondly, figuring out what you do actually need is bloody hard work. e.g. I still haven’t figured out what the hell the “apples” block is.
  • It gets complicated very quickly when you step outside the core competencies. If you follow the CForms link, you’ll see what I mean.
  • Debugging is hard. Partially, this is down to the nature of XML (and in particular XML Namespaces), but in general, you’re not working with Java, so it’s difficult to get the level of debugging one would be used to. The error messages that do appear are somewhat vague.
  • Cocoon 2.2. The current version, 2.1, is a bit old now. I’ve been trying to find out more about cocoon 2.2 by poking around in the dev list. It appears that cocoon has been converted to a maven project and switched to use Spring internally. It’s Maven that I have a big issue with. It basically means that there isn’t a download any more. Instead, you just tell maven “make me a new cocoon 2.2 project” and it goes and downloads it. From somewhere you may or may not trust. That may or may not be compiled correctly. Oh, and they’ve completely reorganised how you integrate with a standard servlet container. And the docs aren’t updated yet. All this, combined with the fact that maven blew up when I tried it, means I’m not happy with the future direction of the project. Maybe with better docs, I’d be happier. We’ll see: the proper release should be “soon”.

Overall, I’m left with a mixed feeling about Cocoon. For its core purpose, I like it. Beyond that, I’m less certain. The trouble is that pretty much any web site you create these days falls into that “beyond” bit quite quickly, even the large, static ones like we create at $WORK. I kind of wish that it had some competition, but there doesn’t appear to be a lot out there that comes close to dealing with XML as well as Cocoon.

I’m going on a training course in a couple of weeks. We’ll have to see if that reassures me any that Cocoon is the correct choice.

1 XML often gets a lot of stick, but for its intended purpose (documents, as opposed to data), it’s a pretty reasonable solution.

2 Which appears to be coming back into fashion, what with things like Project Phobos and Zimki. Although it does go back a long way to the Netscape web server—see Server Side JavaScript.