I’ve just spent most of the afternoon on a character building exercise. I have some XML like this:
<symbol unicode="2013"/>
And I need to turn that into the numeric character reference &#x2013;. It’s perfectly possible to do so with a bit of fudging around with <xsl:text disable-output-escaping="yes"/>. But there’s a slight caveat: you’re not creating a numeric character reference. You’re just creating something that looks like one. Really, it’s the characters “&”, “#”, “x”, “2”, “0”, “1”, “3” and “;”.
Now most of the time, this doesn’t matter. You just output XML that looks correct and the next parser along (probably a browser) will interpret it correctly. But it’s sleight of hand.
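For reference, the fudge looks roughly like the stylesheet below. This is a minimal sketch rather than the exact code from the project; it assumes the input uses the symbol element and unicode attribute shown above, and the identity template is only there to make the example complete.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything else through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Emit an ampersand, "#x", the hex digits from the attribute,
       then ";". The ampersand is written unescaped only because the
       serializer is asked not to escape it; in the result tree it is
       still an ordinary text node, not a character reference. -->
  <xsl:template match="symbol">
    <xsl:text disable-output-escaping="yes">&amp;#x</xsl:text>
    <xsl:value-of select="@unicode"/>
    <xsl:text>;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Because disable-output-escaping is only a hint to the serializer, it has no effect on the string value that any later XPath expression sees.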
Today, I needed to copy the text contents of a node into an attribute. Unfortunately, that text content contained one of these symbol tags. But because it’s only a string, XSLT feels (correctly) that it needs to escape the leading ampersand. So, with this input:
<name>Fred <symbol unicode="2013"/> Bloggs</name>
I get this output:
<name attrib="Fred &amp;#x2013; Bloggs">Fred &#x2013; Bloggs</name>
Yes, I know that the input data is completely stupid. I can’t help that. Unfortunately I also have the restriction that I can’t do this in multiple passes.
I’ve looked at the standard XSLT functions and the standard XPath functions. I’ve looked at the EXSLT functions. All I want is something that works like Perl’s chr.
I noticed that Saxon has the saxon:entity-ref function, but annoyingly, libxslt doesn’t support it.
All I really need is some way of re-invoking the XML parser over a string of my choosing. That way I could just wrap the characters in an element, parse it and call text() to get the character I need.
Right now, the only way that I can see of doing this is to turn UnicodeData.txt into one big XML lookup table, and lookup the numbers in that. Bleeeaaargh.
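For what it’s worth, the lookup-table idea can be sketched in plain XSLT 1.0 in a single pass. The sketch below embeds a tiny table in the stylesheet itself and reads it back with document(''). The lut prefix, its urn:example:char-lookup namespace and the two sample entries are made up for illustration; a real table would have to be generated from UnicodeData.txt.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:lut="urn:example:char-lookup"
    exclude-result-prefixes="lut">

  <!-- Hypothetical lookup table. The XML parser resolves each
       character reference when it loads the stylesheet, so every
       entry holds a real character. -->
  <lut:chars>
    <lut:char code="2013">&#x2013;</lut:char>
    <lut:char code="2014">&#x2014;</lut:char>
  </lut:chars>

  <!-- Replace a symbol element with the character from the table.
       document('') loads the stylesheet itself as a document. The
       codes are compared as strings, so the case of the hex digits
       must match the table. -->
  <xsl:template match="symbol">
    <xsl:value-of
        select="document('')/*/lut:chars/lut:char[@code = current()/@unicode]"/>
  </xsl:template>

  <!-- Copy the text content of name into an attribute as well as
       into the element body. -->
  <xsl:template match="name">
    <name>
      <xsl:attribute name="attrib">
        <xsl:apply-templates/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </name>
  </xsl:template>

</xsl:stylesheet>
```

Since the template produces a real character rather than a string that looks like a reference, there is no ampersand left for the serializer to escape in the attribute value. A code that is missing from the table silently produces nothing, which is one more reason the approach is unappealing.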
Thankfully, it’s not my project and the person doing it has just hacked around this in the output layer. But it bugs me that there’s no good way to achieve this.
11 replies on “XSLT Character Creation”
I’ve just had a really evil thought. That only works for characters in the BMP. Bwahahahaha. 🙂
Boy, I’m going to love showing this at work on Monday!
Woot! That is a most excellent hack! Thank you, Sir!
Ah, good stuff about the non-BMP characters. I obviously didn’t look closely enough at what you wrote.
As to the comments, I have no idea. I blogged a while ago about Typo getting the times wrong on posts. It appears that it’s doing it on comments as well. Drat. I’ve filed ticket #690 to see if any of the developers have any clue as to what’s going on…
Btw, let me know what they say at work. 🙂
The char-to-utf8bytes I wrote works for the full Unicode range: not only the BMP but all sixteen supplementary planes as well.
Sigh. You made me waste hours of time.
PS.: what the hell is up with the ordering of your comments? It’s pretty much random rather than chronological.
Can I have a guess as to the type of XML document (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/WordXMLCDK/html/cdkwelesym_HV01114886.asp) you are dealing with?
Aristotle, my hat is off to you.
Alistair,
Sadly, it’s not WordML! At work I get to deal with a lot of data from publishers, and it’s often of very poor quality. Mostly this is because the data has been recovered from (or still has to deal with) ugly typesetting systems…
So did they see it at work yet? If so, what did they say?
My apologies for not saying so earlier. Yes, everybody was utterly flabbergasted that it could be done! And it meant that we took out the nasty special casing in the layer above that was put in to cope with not doing it. So, a big win! Thank you very much!