XSLT Character Creation

I’ve just spent most of the afternoon on a character building exercise. I have some XML like this:

  <symbol unicode="2103"/>

And I need to turn that into the numeric character reference &#x2103;. It’s perfectly possible to do so with a bit of fudging around with <xsl:text disable-output-escaping="yes"/>. But there’s a slight caveat: you’re not creating a numeric character reference. You’re just creating something that looks like one. Really, it’s the characters “&”, “#”, “x”, “2”, “1”, “0”, “3” and “;”.

Now most of the time, this doesn’t matter. You just output XML that looks correct and the next parser along (probably a browser) will interpret it correctly. But it’s sleight of hand.
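To make the distinction concrete, here is a minimal sketch in Python rather than XSLT (an assumption made purely so the example runs with the standard library). It compares the look-alike string with what a parser yields when it reads a genuine reference; the variable names and the `<t>` wrapper element are made up for illustration.

```python
import xml.etree.ElementTree as ET

# What the disable-output-escaping trick emits: eight separate characters.
lookalike = "&#x2103;"

# What a parser hands back after reading a genuine reference in a document.
parsed = ET.fromstring("<t>&#x2103;</t>").text

print(len(lookalike))       # 8
print(len(parsed))          # 1
print(parsed == "\u2103")   # True
```

The "next parser along" performs the second step for you, which is why the trick usually goes unnoticed.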

Today, I needed to copy the text contents of a node into an attribute. Unfortunately, that text content contained one of these symbol tags. But because it’s only a string, XSLT feels (correctly) that it needs to escape the leading ampersand. So, with this input:

  <name>Fred <symbol unicode="2013"/> Bloggs</name>

I get this output:

  <name attrib="Fred &amp;#x2013; Bloggs">Fred – Bloggs</name>
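The same escaping shows up in any serialiser that treats the value as a plain string. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in for the XSLT processor (an assumption; it is not what the stylesheet runs on), and `serialise_name` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def serialise_name(attr_value: str) -> bytes:
    """Serialise <name attrib="..."/> as ASCII so non-ASCII characters become references."""
    el = ET.Element("name", {"attrib": attr_value})
    return ET.tostring(el, encoding="us-ascii")

# The look-alike string: the serialiser escapes its leading ampersand.
as_string = serialise_name("Fred &#x2013; Bloggs")

# The real character: the serialiser writes a genuine (decimal) reference.
as_char = serialise_name("Fred \u2013 Bloggs")

print(as_string)
print(as_char)
```

With the look-alike string the ampersand is escaped as `&amp;`; with the real character the output contains `&#8211;`, the decimal form of the same reference.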

Yes, I know that the input data is completely stupid. I can’t help that. Unfortunately I also have the restriction that I can’t do this in multiple passes.

I’ve looked at the standard XSLT functions and the standard XPath functions. I’ve looked at the EXSLT functions. All I want is something that works like Perl’s chr.
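For comparison, this is roughly what the job looks like in a language that does have a `chr`. It is a sketch in Python, not a solution for XSLT, and `flatten_symbols` is a hypothetical helper that only understands the `<symbol unicode="…"/>` markup shown above.

```python
import xml.etree.ElementTree as ET

def flatten_symbols(elem):
    """Return elem's text with each <symbol unicode="HEX"/> child replaced by its character."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "symbol":
            # The attribute holds a hexadecimal code point, e.g. "2013".
            parts.append(chr(int(child.get("unicode"), 16)))
        # Text following the child belongs to the parent's content.
        parts.append(child.tail or "")
    return "".join(parts)

name = ET.fromstring('<name>Fred <symbol unicode="2013"/> Bloggs</name>')
name.set("attrib", flatten_symbols(name))

print(name.get("attrib"))
```

Because the attribute now holds the real character rather than a string that resembles a reference, there is no ampersand left for the serialiser to escape.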

I noticed that Saxon has the saxon:entity-ref function, but annoyingly, libxslt doesn’t support it.

All I really need is some way of re-invoking the XML parser over a string of my choosing. That way I could just wrap the characters in an element, parse it and call text() to get the character I need.

Right now, the only way that I can see of doing this is to turn UnicodeData.txt into one big XML lookup table, and lookup the numbers in that. Bleeeaaargh.
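A rough sketch of how such a table could be generated, again in Python. It builds entries from a range of code points instead of reading UnicodeData.txt, and the `chars`, `char` and `code` names are invented for illustration. A stylesheet could presumably load the resulting file with `document()` and select the entry by its `code` attribute.

```python
import xml.etree.ElementTree as ET

def build_lookup_table(code_points):
    """Build a <chars> element holding one <char code="HEX"> entry per code point."""
    table = ET.Element("chars")
    for cp in code_points:
        entry = ET.SubElement(table, "char", {"code": f"{cp:04X}"})
        entry.text = chr(cp)
    return table

# A small slice for demonstration; a full table would need to skip
# code points that are not legal in XML, such as most control characters.
table = build_lookup_table(range(0x2100, 0x2110))
xml_bytes = ET.tostring(table, encoding="us-ascii")

# Looking a character up by its hex code, as the stylesheet would.
looked_up = table.find("char[@code='2103']").text
print(looked_up == "\u2103")  # True
```

Serialising as ASCII means each entry is written as a real numeric character reference, so the table file itself stays readable in any editor.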

Thankfully, it’s not my project and the person doing it has just hacked around this in the output layer. But it bugs me that there’s no good way to achieve this.

11 replies on “XSLT Character Creation”

I’ve just had a really evil thought. That only works for characters in the BMP. Bwahahahaha. 🙂

Boy, I’m going to love showing this at work on Monday!

Ah, good stuff about the non-BMP characters. I obviously didn’t look closely enough at what you wrote.

As to the comments, I have no idea. I blogged a while ago about Typo getting the times wrong on posts. It appears that it’s doing it on comments as well. Drat. I’ve filed ticket #690 to see if any of the developers have any clue as to what’s going on…

Alistair,

Sadly, it’s not WordML! At work I get to deal with a lot of data from publishers and sadly, it’s often very bad quality. Mostly this is because the data has been recovered from (or still has to deal with) ugly typesetting systems…

My apologies for not saying so earlier. Yes, everybody was utterly flabbergasted that it could be done! And it meant that we took out the nasty special casing in the layer above that was put in to cope with not doing it. So, a big win! Thank you very much!
