Tag: xml

 

To XHTML or not to XHTML?

Today, we had a conversation about HTML 4 vs XHTML 1.0. For me, the matter was neatly settled the very first time I saw an XML system produce XHTML like this:

  <p>An article with an <em/> empty emphasis tag.</p>

Perfectly legal XML, perfectly legal XHTML. But — if you serve up this XHTML as text/html (which 99.99% of the world does), then you end up with this:

[Image: Empty tags considered harmful]

Why? Because it’s parsed as HTML. And the browser sees the start of an em tag, but no close, so everything after it ends up emphasised.

And now I make sure that all our sites emit HTML 4. It’s a lot simpler.

This isn’t to say I don’t use XHTML. It’s a fine medium for further processing (e.g. applying XSLT). But it’s not right for serving up to browsers verbatim.
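To see the difference outside a browser, here’s a minimal sketch using the JDK’s own javax.xml.transform serializer. The class and method names (EmptyTagDemo, serialize) are made up for illustration; the only thing it relies on is that the "xml" and "html" output methods write empty elements differently.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EmptyTagDemo {
    // Hypothetical helper: re-serialize an XML string using the given output method.
    static String serialize(String xml, String method) {
        try {
            // An identity transformer: copies input to output unchanged.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, method);
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<p>An article with an <em/> empty emphasis tag.</p>";
        System.out.println(serialize(xhtml, "xml"));
        System.out.println(serialize(xhtml, "html"));
    }
}
```

With "xml" the empty element should be written as <em/>; with "html" it should be written as <em></em>, which is safe to serve as text/html.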

SAX EntityResolver

I was trying to resolve entities (&weirdChar;) in an XML file. Easy enough, use a validating parser. But here’s the tricky bit: get the entity definitions from the classpath. This should still be easy, as SAX provides an EntityResolver.

Unfortunately, the interactions between JAXP and SAX make life complicated. I found that you have to ignore the SAXParser (from JAXP) and instead focus on the XMLReader interface (part of plain old SAX).

This is what I came up with. First, a small driver.

  // The JAXP and SAX calls below throw checked exceptions.
  public void parseIt() throws Exception {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setValidating(true);
    XMLReader reader = spf.newSAXParser().getXMLReader();
    reader.setEntityResolver(new MyResolver());
    // Look for test.xml on the classpath.
    InputStream testXmlStream = App.class.getClassLoader().getResourceAsStream("test.xml");
    reader.parse(new InputSource(testXmlStream));
  }

That references the EntityResolver implementation I wrote:

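  class MyResolver implements EntityResolver2 {
    public InputSource resolveEntity(String name, String publicId, String baseURI, String systemId)
      throws SAXException, IOException {
      InputStream stream = getClass().getClassLoader().getResourceAsStream(systemId);
      // Returning null hands resolution back to the parser's default behaviour.
      return stream == null ? null : new InputSource(stream);
    }
    // EntityResolver2 also requires these two methods.
    public InputSource getExternalSubset(String name, String baseURI) { return null; }
    public InputSource resolveEntity(String publicId, String systemId)
      throws SAXException, IOException {
      return resolveEntity(null, publicId, null, systemId);
    }
  }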

Actually, I had to use EntityResolver2 for reasons I don’t entirely understand.

On top of this, I found that I had to include Xerces 2.8 explicitly as a dependency. The version bundled with Java 1.5 is Xerces 2.6.2, which has a bug: it passes the entity resolver an absolutized systemId, which makes it very difficult to resolve further. What a pain in the arse.

But it does now work, and I can successfully resolve entities off the classpath.
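For anyone who wants to try the XMLReader approach without setting up a classpath, here’s a self-contained sketch. It is not the code from this post: the DTD lives in a string rather than on the classpath, the file name entities.dtd and the class name EntityDemo are invented, it skips validation, and it uses the plain EntityResolver callback. Because that callback may be handed an absolutized systemId, the sketch only checks how the systemId ends.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EntityDemo {
    // Hypothetical DTD content; in the post this would be a classpath resource.
    static final String DTD = "<!ENTITY weirdChar \"&#x2603;\">";
    static final String XML =
        "<!DOCTYPE doc SYSTEM \"entities.dtd\"><doc>&weirdChar;</doc>";

    // Parses XML and returns the character content with entities expanded.
    static String parse() {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Serve the DTD ourselves; return null for anything else.
            reader.setEntityResolver((publicId, systemId) ->
                systemId != null && systemId.endsWith("entities.dtd")
                    ? new InputSource(new StringReader(DTD))
                    : null);
            final StringBuilder text = new StringBuilder();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(XML)));
            return text.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Swapping the StringReader for getResourceAsStream() would bring it back to the classpath case described above.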

Writing XML in Java

Last night, I was looking at generating SAMS XML from Java objects. It made me realise two things:

  1. Java sucks badly at creating XML (by default).
  2. I should really be looking at XML Binding.

I’m interested in point 1. I’ve done it before, trying to write XML using DOM (as there doesn’t appear to be a built-in SAX writer). This is what I ended up with.

public abstract class AbstractXmlBuilder {
    protected void addXlinkHref(Element e, URI href) {
        e.setAttributeNS(Constants.XLINK_NS, "x:href", href.toString());
    }
 
    // See http://www.cafeconleche.org/books/xmljava/chapters/ch09s09.html
    // (JAXP Serialisation) for details on why all this palaver is necessary.
    protected String domToString(Document doc) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer trans = factory.newTransformer();
            trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            DOMSource source = new DOMSource(doc);
            StringWriter writer = new StringWriter();
            StreamResult streamResult = new StreamResult(writer);
            trans.transform(source, streamResult);
            return writer.toString();
        } catch (TransformerConfigurationException e) {
            throw new RuntimeException(e);
        } catch (TransformerFactoryConfigurationError e) {
            throw new RuntimeException(e);
        } catch (TransformerException e) {
            throw new RuntimeException(e);
        }
    }
 
    protected Document newDocument(String namespaceURI, String qualifiedName) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder docBuilder = dbf.newDocumentBuilder();
            DOMImplementation impl = docBuilder.getDOMImplementation();
            return impl.createDocument(namespaceURI, qualifiedName, null);
        } catch (ParserConfigurationException e) {
            throw new RuntimeException(e);
        }
    }
 
    protected Document newDocument() {
        Document doc = newDocument(Constants.MY_NS, "doc");
        Element root = doc.getDocumentElement();
        root.setPrefix("my");
        // Set up the usual namespaces.
        root.setAttribute("xmlns:x", Constants.XLINK_NS);
        return doc;
    }
 
    protected Element myElem(Document doc, String name) {
        return doc.createElementNS(Constants.MY_NS, name);
    }
}

This is intended to be subclassed to be useful, but even the subclasses still rely on dozens of helper methods, e.g.

    private Element getAccIdElem(SessionCreateRequest s, Document doc) {
        Element accId = myElem(doc, "acc_id");
        addXlinkHref(accId, s.getAccountId());
        return accId;
    }

This is incredibly verbose, but I don’t think we can do better with just the JDK. What are the other options? I’ve been looking at XOM. This appears to have a nice API for writing XML:

    Element root = new Element("root");
    root.appendChild("Hello World!");
    Document doc = new Document(root);
    System.out.println(doc.toXML());

This is much, much simpler. We’re still building up a DOM-like structure, but it’s simpler to use.

But XOM is still a fairly large dependency. This doesn’t particularly matter for what I’m doing. But there are alternatives. Whilst looking for libraries, I also stumbled across xmlwriter. This is much lower-level:

  Writer writer = new java.io.StringWriter();
  XmlWriter xw = new SimpleXmlWriter(writer);
  xw.writeXmlVersion();
  xw.writeComment("Example of XmlWriter running");
  xw.writeEntity("person");
  xw.writeAttribute("name", "fred");
  xw.writeAttribute("age", "12");
  xw.writeEntity("phone");
  xw.writeText("4254343");
  xw.endEntity();
  xw.writeComment("Examples of empty tags");
  xw.writeEntity("friends");
  xw.writeEmptyEntity("bob");
  xw.writeEmptyEntity("jim");
  xw.endEntity();
  xw.writeEntityWithText("foo","This is an example.");
  xw.endEntity();
  xw.close();
  System.err.println(writer.toString());

This is closer to genx, a C library that I like very much. It’s quite weird that “element” and “entity” appear to be confused, though. Plus there doesn’t appear to be any way to generate canonical XML.

For now, I think I’ll stick with XOM, but it’s good to know that there are alternatives to the heavyweight JDK options.
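For comparison, JDKs from Java 6 onwards do ship a streaming writer of their own, javax.xml.stream.XMLStreamWriter (StAX). The following is only a sketch of part of the xmlwriter example redone with it; the class and method names are invented here, and it is not a claim that it beats XOM.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StaxWriterSketch {
    // Writes a small <person> document and returns it as a string.
    static String personXml() {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xw.writeStartElement("person");
            xw.writeAttribute("name", "fred");
            xw.writeAttribute("age", "12");
            xw.writeStartElement("phone");
            xw.writeCharacters("4254343");
            xw.writeEndElement();          // </phone>
            xw.writeStartElement("friends");
            xw.writeEmptyElement("bob");
            xw.writeEmptyElement("jim");
            xw.writeEndElement();          // </friends>
            xw.writeEndElement();          // </person>
            xw.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(personXml());
    }
}
```

Like xmlwriter, this is low-level: it doesn’t build a tree, and it’s up to the caller to close what they open.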

Dynamic JUnit

Recently, I wanted to do something slightly unusual with JUnit. I’m working on a Cocoon project, so there are squillions of little XML files floating around. These all need to be well-formed. So I want a test that parses each one. Then, CruiseControl can let us know when they get broken.

First, I gave the task to a colleague. He came up with something that checked each file, and returned a list of the bad ones. It then asserted that the nonWellFormed list had a zero length. Which is great and all, but didn’t tell you which file was broken, nor why.

What I really wanted to do was have a single test per file, so it could display the errors correctly. This seemed like an easy thing to do… Until I tried it. This is what I eventually arrived at:

  class WellFormedTest extends TestCase {
    public static Test suite() {
        final TestSuite suite = new TestSuite(WellFormedTest.class.getCanonicalName());
        // Stupid bloody Java regexes have to match from the beginning of the
        // string.
        Pattern p = Pattern.compile(".*\\.(xml|xslt|xconf|xmap)$");
        FindFiles ff = new FindFiles(p) {
            protected void processFile(final File file) {
                suite.addTest(new WellFormedTest(file));
            }
        };
        ff.search("web");
        return suite;
    }

    private File file;

    public WellFormedTest(File file) {
      super("Well-Formed? " + file.toString());
      this.file = file;
    }

    protected void runTest() throws Throwable {
      XmlValidator validator = new XmlValidator();
      String result = validator.isWellFormed(file);
      assertEquals(file.toString(), null, result);
    }
  }
  • FindFiles is a utility class to walk a directory tree. Tell me again why Java doesn’t have something this basic in its vast class libraries?
  • You have to call super("blah") in your constructor to name each test sensibly.
  • But if you do this, you have to override runTest() in order for things to actually work. The usual mechanism for determining which tests to run doesn’t work if you supply a custom name. This took forever to work out and required delving into the JUnit source. Hallelujah for Open Source.
    • As part of prodding around in the debugger, I noticed that JUnit creates a new TestCase object for each test in the class. So it’s OK to just do one thing in runTest(), as that’s all that’s going to happen anyway.
  • XmlValidator is another custom helper class. It just parses the file and returns a String containing the error (or null).
  • Yes, this is JUnit 3.8. I know I need to migrate to JUnit 4. That’s a battle for another day, dependent on upgrading ant first.

Originally, I tried to get the test done inside a nested anonymous subclass of TestCase, but there’s no constructor there, so that doesn’t work too well. Plus it bumps the ugliness of the source another level.

The end result works quite well and provides a useful example for doing dynamic tests with JUnit.
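XmlValidator isn’t shown above. Here’s a minimal sketch of what such a helper might look like, assuming only what the bullet says about it (parse the file, return the error as a String, or null). The InputSource overload is an addition of this sketch, there to make it easy to try without files on disk.

```java
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XmlValidator {
    /** Returns null if the file is well-formed, otherwise a description of the error. */
    public String isWellFormed(File file) {
        return isWellFormed(new InputSource(file.toURI().toString()));
    }

    /** Same check for any InputSource (an addition in this sketch). */
    public String isWellFormed(InputSource source) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            // DefaultHandler rethrows fatal errors, so a broken
            // document ends up in the catch block below.
            spf.newSAXParser().parse(source, new DefaultHandler());
            return null;
        } catch (Exception e) {
            return e.toString();
        }
    }
}
```

Returning the exception text rather than a boolean is what lets assertEquals() show why a particular file failed.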

Cocoon

Surprising as it may seem if you only read this blog, I don’t do much Perl or Ruby or Rails. I try to in my spare time, but it’s not what I’m doing at $WORK. That’s mostly concerned with pushing around XML using Java. Right now, I’m trying to learn Cocoon.

Cocoon is a framework (in much the same way that Rails is), but it’s oriented to pushing around XML[1]. The basics of cocoon are pretty simple. There’s a “pipeline” for processing XML:

  • A generator produces XML. Usually, this is just reading a file. At $WORK, it’s pulled from an XML database.
  • Zero or more transformers munge the XML in various ways. Normally, this is XSLT.
  • Finally, it gets output through a serializer. Mostly this will be HTML.

There’s a little bit more to it, but that’s the basics. And for serving up XML directly, in a read-only fashion it actually works really well.
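To make the three stages concrete, here’s a rough analogue in plain JAXP. This is not Cocoon’s API, just a sketch of the same shape with a single XSLT step; the class name, the sample document and the stylesheet are all invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PipelineSketch {
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"/article\">"
      + "<html><body><h1><xsl:value-of select=\"title\"/></h1></body></html>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String run(String xml) {
        try {
            // "Transformer": compile the XSLT step.
            Transformer step = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            // "Serializer": ask for HTML output rather than XML.
            step.setOutputProperty(OutputKeys.METHOD, "html");
            StringWriter out = new StringWriter();
            // "Generator": here just a string; in Cocoon, a file or a database.
            step.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<article><title>Hello</title></article>"));
    }
}
```

What Cocoon adds on top of this is the wiring: matching a URL to a pipeline and chaining several transformers without writing Java.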

The problems start when you want to get a little bit more interactive. It seems that Cocoon has evolved a number of different approaches over the years, but the current favourite appears to be FlowScript.

FlowScript is server-side JavaScript[2]. When a URL is matched, a little bit of JavaScript gets run in order to determine what to do. It can interact with Java objects and when it’s figured out what to do, run the appropriate pipeline, passing in parameters. It’s effectively an MVC architecture, with the controller being JavaScript.

But what’s really neat about FlowScript is captured in a single call:

  function calculator()
  {
    var a, b, operator;

    cocoon.sendPageAndWait("getA.html");
    a = cocoon.request.get("a");

    cocoon.sendPageAndWait("getB.html");
    b = cocoon.request.get("b");

    cocoon.sendPage("result.html", {result: a + b});
  }

cocoon.sendPageAndWait() uses a continuation to pause the execution of the JavaScript and return to the browser. When the user submits the form, the FlowScript carries on executing after the call to cocoon.sendPageAndWait(). Neat stuff.

Continuations are currently the hot thing because of Seaside, a web framework for Smalltalk. But Cocoon’s had them for a couple of years.

Building on FlowScript is a framework for form handling called CForms. The idea is that you define a model for your form, which then gets rendered into HTML. I’m playing with this for a very complex form at the moment, and I’m not totally sold on the concept. Plus the generated result is some pretty yucky markup.

In fact, there are quite a few things about cocoon that make me feel uncomfortable about it.

  • It’s huge. The download is 50MB, and you get a lot in that. The problem is twofold: firstly, you don’t need most of it most of the time. Secondly, figuring out what you do actually need is bloody hard work. e.g. I still haven’t figured out what the hell the “apples” block is.
  • It gets complicated very quickly when you step outside the core competencies. If you follow the CForms link, you’ll see what I mean.
  • Debugging is hard. Partially, this is down to the nature of XML (and in particular XML Namespaces), but in general, you’re not working with Java, so it’s difficult to get the level of debugging one would be used to. The error messages that do appear are somewhat vague.
  • Cocoon 2.2. The current version, 2.1, is a bit old now. I’ve been trying to find out more about Cocoon 2.2 by poking around in the dev list. It appears that Cocoon has been converted to a Maven project and switched to use Spring internally. It’s Maven that I have a big issue with. It basically means that there isn’t a download any more. Instead, you just tell Maven “make me a new cocoon 2.2 project” and it goes and downloads it. From somewhere you may or may not trust. That may or may not be compiled correctly. Oh, and they’ve completely reorganised how you integrate with a standard servlet container. And the docs aren’t updated yet. All this, combined with the fact that Maven blew up when I tried it, means I’m not happy with the future direction of the project. Maybe with better docs, I’d be happier. We’ll see; the proper release should be “soon”.

Overall, I’m left with a mixed feeling about Cocoon. For its core purpose, I like it. Beyond that, I’m less certain. The trouble is that pretty much any web site you create these days falls into that “beyond” bit quite quickly, even the large, static ones like we create at $WORK. I kind of wish that it had some competition, but there doesn’t appear to be a lot out there that comes close to dealing with XML as well as Cocoon.

I’m going on a training course in a couple of weeks. We’ll have to see if that reassures me any that Cocoon is the correct choice.

[1] XML often gets a lot of stick, but for its intended purpose (documents, as opposed to data), it’s a pretty reasonable solution.

[2] Which appears to be coming back into fashion, what with things like Project Phobos and Zimki. Although it does go back a long way to the Netscape web server; see Server Side JavaScript.

The year of XQuery?

Apparently, it’s the year of XQuery. I’ve just started a large project at $WORK, of which XQuery is a fundamental piece. And I have to say I’m not so sure.

From what I’ve seen so far, there are good and bad bits to XQuery. The good bits are that it’s really flexible and very easy to pull apart large quantities of XML documents in ways in which it’s much, much harder with XSLT. I like the way XML fragments are treated as just another datatype, like E4X but done right. I also like the way that it’s built on XPath, as XPath rocks.

Sadly, there are quite a few more bad bits.

  • The development environment leaves a lot to be desired (an Eclipse plugin would be nice, as would vim[1] or emacs[2] syntax highlighting).
  • XQuery itself is really verbose. There appears to be a lot of syntax, and the grammar is on the large side. This means that it’s really hard to pick up large bodies of other people’s code. Coincidentally like the stuff I’ve been dumped with.
  • The spirit of the language is more like SQL than any procedural language. Yet it presents a veneer of procedurality, which fools you into thinking you can get away with things like “just throw an extra statement in”. It’s not that simple. Your XQuery code is returning a list of items[3], so instead you have to insert your extra code as a previous item in the list. Essentially, this means you must end your extra code with a comma, instead of the expected semicolon or nothing. This continues to bite me.
  • It’s still not a standard. “Nearly there now”, apparently. It’s beginning to sound like Perl 6…
  • Namespaces continue to cause me much wailing and gnashing of teeth. In XQuery, there is a tendency to use quite a few of them as well. In fairness, this is more of an XML problem than an XQuery problem, but it’s really irritating to still be tripped up by having xpath not match because you’re not looking at a namespace by accident.
  • The type system is a pain. XQuery may or may not be strongly typed, depending upon the code you’re using, the implementation, or the phase of the moon. It has had a tendency to get in the way, in all the stuff I’ve seen so far.
  • For all its verbosity, the language itself is quite limited. I’ve been needing to use a lot of implementation-specific extensions in the work I’ve been doing. Particularly for things like updates.
  • Oh, and whoever chose fucking smileys as the comment syntax needs to be shot. Now.
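
The namespace complaint in the list above is easy to reproduce outside XQuery. Here’s a small sketch using the JDK’s javax.xml.xpath; the namespace URI, the "my" prefix and the class name are invented for illustration. The same document is queried twice: once forgetting the namespace, once with a prefix bound to it.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceGotcha {
    static final String NS = "urn:example:my";   // hypothetical namespace
    static final String XML = "<doc xmlns=\"" + NS + "\"><title>Hi</title></doc>";

    // Evaluates an XPath expression against XML and returns its string value.
    static String eval(String expr) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the prefix "my" to the document's default namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "my".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
            return xpath.evaluate(expr, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + eval("/doc/title") + "]");        // no namespace: matches nothing
        System.out.println("[" + eval("/my:doc/my:title") + "]");  // namespaced: matches
    }
}
```

The unprefixed path fails silently with an empty result rather than an error, which is exactly why it keeps tripping people up.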

Having said all that, I think XQuery is still useful in the area I’m working in (large corpora of XML documents). Despite my rash of indignation, it’s proved a lot easier to deal with than our previous technology (flat files plus manually created indexes in an RDBMS plus Lucene for searching). It has a future. I just hope it develops quicker than it has so far.

[1] That doesn’t suck, anyway, unlike xquery.vim.

[2] I haven’t tried xquery-mode.el yet.

[3] Except where it isn’t, e.g. defining functions.

XML::Genx Plans

I was talking to Mark Fowler yesterday and XML::Genx came up. He had a couple of good points:

  • It’s not absolutely clear in the documentation that any valid Perl string will work correctly (be it UTF-8 encoded or not). I need to double check the tests for this and amend the docs.
  • The API is still fairly horrible. It mirrors the C API almost exactly, but this feels very odd as a Perl programmer. I need to have a think about what would be better. Ideally, I’d like something more like Ruby’s Builder. In fact, I actually wrote something similar to that before (XML::SAX::Builder), but that uses SAX, which is too slow in Perl.

I really appreciate the feedback. Apart from Aristotle, it’s the only feedback I’ve had since I released it.

If I can figure these out, I should probably slap a 1.0 on it.

I also reckon I should do a talk on “Why we need another XML writer”. There are quite a few on CPAN already and I should say why I wrote another one…

XML::Filter::Normalize 0.1

A couple of days ago, I released SAX event stream where possible. Here’s Robin’s prodding. Anyway, it appears to work more or less as expected. I’m particularly pleased as it appears to have 100% test coverage. Now, I just need to make

new XML::Genx

XML::Genx 0.19 is now out. Changes include:

  • Allow namespace objects to be passed in to StartElementLiteral() and AddAttributeLiteral(). This makes it much easier to put things into the default namespace.
  • Add a missing “static” declaration to some XS helper functions.
  • Allow multiple different default namespaces inside XML::Genx::SAXWriter. Previously you would get a “Duplicate Prefix” error. Bug spotted by Aristotle Pagaltzis.
  • Make the tests work in perl 5.6.1. Not sure when I broke this.

I deliberately left out the SAX changes that Aristotle was talking about on the list, as I wanted some more feedback. I’ll do another new release in a few days if nobody has said anything.

XML::Genx

Aristotle Pagaltzis has mentioned XML::Genx. Yay, it needs all the publicity that it can get. He highlighted some issues in XML::Genx::SAXWriter, which I need to address, although getting some consensus from the list first would be good.

He also mentions a problem with default namespaces a little further on. I’ll have to look into that one. I’m not sure if it’s a bug or Genx is supposed to work that way.

But it has made me realise that I need to upgrade the documentation that comes with Genx. The API usage is not always as clear as it could be, particularly when it comes to namespaces and optional function arguments. So a very good EXAMPLES section would do well I think.

Not only that, it’s also made me realise that there’s some bad behaviour in there.

  1. Passing in a namespace object to StartElementLiteral() doesn’t work properly. It assumes that the stringification of the namespace object is the URL.
  2. The manner for using StartElementLiteral() to declare a default namespace sucks, badly. You have to declare the namespace object with the default prefix (i.e. ""). Then, you have to call StartElementLiteral() passing in the URI yourself. And then you have to call AddNamespace() on the namespace object to switch from the genx created prefix back to the default. Very weird, but fixing the bug above would render this much simpler.

It’s really important to get this sort of thing fixed up so that the API is easy to use and works as expected.

Finally, I’ve also noticed that the last release had an unexpected failure (the Win32 stuff is at least expected failure). What’s irritating is that it’s nothing to do with my code. For some reason, version.xs is being picked up and mingled into my module. I blame the person submitting the CPAN tester report (for now, anyway). I’m sure I saw something about this on a use.perl.org journal recently…