Regex Readability

Piers describes a foul abuse of programming. I’m totally with him on this one. People should just learn to use regexes; they’re an enormously powerful tool for pulling apart strings. And they’re not difficult. Unless, of course, you’re of the “Java is the one true language and I must know nothing else” school of programming. Don’t laugh, there are people like that.

But the second crime, which Piers fails to point out, is parsing XML with a regex. There are so many ways in which this will blow up in your face. Really. What about character encodings? Attribute order? Entities? Even unexpected tag order would blow up most XML regexes I’ve seen. Don’t do it, folks, you’re just setting yourself up for a fall.
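
To make that concrete, here is a minimal sketch in Python. The `user` element, its `id` and `name` attributes, and the two sample documents are all made up for illustration; none of them come from Piers’s article. It shows a plausible-looking regex tripping over attribute order, quote style and entities, while the standard library parser handles both documents the same way.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: assumes id comes first, name second, double quotes only.
NAIVE = re.compile(r'<user id="(\d+)" name="([^"]*)"\s*/>')

def naive_users(xml_text):
    """Extract (id, name) pairs with the regex above."""
    return [(int(i), name) for i, name in NAIVE.findall(xml_text)]

def parsed_users(xml_text):
    """Extract (id, name) pairs with a real XML parser."""
    root = ET.fromstring(xml_text)
    return [(int(u.get("id")), u.get("name")) for u in root.iter("user")]

# Two documents that carry the same information as far as XML is concerned.
DOC_A = '<users><user id="1" name="Tom &amp; Jerry"/></users>'
DOC_B = "<users><user name='Tom &amp; Jerry' id='1'/></users>"

print(naive_users(DOC_A))   # matches, but the entity is left undecoded
print(naive_users(DOC_B))   # attribute order and quotes changed: no match
print(parsed_users(DOC_A))  # entity decoded
print(parsed_users(DOC_B))  # same result as DOC_A
```

You could keep patching the pattern for each of these cases, but every patch is another edge case the parser already gets right.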

Update: Dave and Piers both correctly point out in the comments that I misread the original article. Piers does in fact state that parsing XML is “one of the canonical no nos”. Mea culpa. I still agree with everything he says.

Comments 3

  1. Dave Cross wrote:

    Piers says

    he’s using this regular expression […] to parse XML (one of the canonical no nos that one)

    Maybe he’s clarified his post since you wrote this, but currently it looks like he’s pretty clear on the dangers of parsing XML with a regex. He even gives examples of how this particular attempt is broken.

    Posted 16 Mar 2007 at 08:36
  2. Piers Cawley wrote:

    The first posted draft was dreadfully incoherent, but I definitely pointed out that parsing XML is “bad, m’kay?”. Hopefully it coheres a little better now.

    Posted 16 Mar 2007 at 09:09
  3. Mark Fowler wrote:

    You also can’t parse XML with a pure regexp. Can’t be done. You’ll need a pushdown automaton at a minimum; otherwise, how can you keep track of the opening and closing tags / quotes / etc?

    You need to mix code in there somehow, either by using multiple regexps or by using one of your language’s extensions to embed code in there. Which is decidedly non-trivial and will take about a zillion attempts to get right with all the nasty edge cases.

    And if you’re going to that hassle, well…why not use a standard library?

    Posted 16 Mar 2007 at 09:57