XSLT Character Creation

I’ve just spent most of the afternoon on a character building exercise. I have some XML like this:

  <symbol unicode="2013"/>

And I need to turn that into the numeric character reference &#x2013;. It’s perfectly possible to do so with a bit of fudging around with <xsl:text disable-output-escaping="yes"/>. But there’s a slight caveat: you’re not creating a numeric character reference. You’re just creating something that looks like one. Really, it’s the eight characters “&”, “#”, “x”, “2”, “0”, “1”, “3” and “;”.

Now most of the time, this doesn’t matter. You just output XML that looks correct and the next parser along (probably a browser) will interpret it correctly. But it’s sleight of hand.

Today, I needed to copy the text contents of a node into an attribute. Unfortunately, that text content contained one of these symbol tags. But because the fake reference is only a string, XSLT feels (correctly) that it needs to escape the leading ampersand when it lands in an attribute value. So, with this input:

  <name>Fred <symbol unicode="2013"/> Bloggs</name>

I get this output:

  <name attrib="Fred &amp;#x2013; Bloggs">Fred &#x2013; Bloggs</name>
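
For illustration, here’s roughly the kind of stylesheet that gets you into this situation. It’s a sketch, not the actual stylesheet from the project: both templates are assumptions, and only the element and attribute names (symbol, unicode, name, attrib) come from the examples above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed template: emit text that merely looks like a character
       reference, i.e. "&#x", the hex digits from @unicode, then ";". -->
  <xsl:template match="symbol">
    <xsl:text disable-output-escaping="yes">&amp;#x</xsl:text>
    <xsl:value-of select="@unicode"/>
    <xsl:text>;</xsl:text>
  </xsl:template>

  <!-- Assumed template: copy the same content into an attribute
       and into the element body. -->
  <xsl:template match="name">
    <name>
      <xsl:attribute name="attrib">
        <xsl:apply-templates/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </name>
  </xsl:template>

</xsl:stylesheet>
```

Strictly speaking, XSLT 1.0 treats disable-output-escaping on text that ends up anywhere other than a text node (an attribute value, say) as an error. A processor may report it, or recover by ignoring the attribute, which is how the ampersand ends up escaped in attrib but not in the element body.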

Yes, I know that the input data is completely stupid. I can’t help that. Unfortunately I also have the restriction that I can’t do this in multiple passes.

I’ve looked at the standard XSLT functions and the standard XPath functions. I’ve looked at the EXSLT functions. All I want is something that works like Perl’s chr.

I noticed that Saxon has the saxon:entity-ref extension element, but annoyingly, libxslt doesn’t support it.

All I really need is some way of re-invoking the XML parser over a string of my choosing. That way I could just wrap the characters in an element, parse it and call text() to get the character I need.

Right now, the only way that I can see of doing this is to turn UnicodeData.txt into one big XML lookup table, and look up the numbers in that. Bleeeaaargh.
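
A minimal sketch of what that lookup-table idea might look like, assuming the table is embedded in the stylesheet itself and read back with document(''). The u namespace, the u:map and u:char names, and the two sample entries are all made up for illustration; a real table would be generated from UnicodeData.txt.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:u="urn:example:unicode-lookup"
                exclude-result-prefixes="u">

  <!-- Hypothetical lookup table: one entry per code point needed.
       The parser turns each reference into a real character when
       the stylesheet is loaded. -->
  <u:map>
    <u:char code="2013">&#x2013;</u:char>
    <u:char code="2014">&#x2014;</u:char>
  </u:map>

  <!-- Replace <symbol unicode="..."/> with the real character,
       looked up in the table inside this stylesheet. -->
  <xsl:template match="symbol">
    <xsl:value-of
        select="document('')/*/u:map/u:char[@code = current()/@unicode]"/>
  </xsl:template>

  <!-- Because the result is a real character rather than a string
       that looks like a reference, it can go into an attribute. -->
  <xsl:template match="name">
    <name>
      <xsl:attribute name="attrib">
        <xsl:apply-templates/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </name>
  </xsl:template>

</xsl:stylesheet>
```

The match on @code is a plain string comparison, so the hex digits in the table need the same case and padding as the unicode attribute in the input. A symbol with no matching entry silently produces nothing.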

Thankfully, it’s not my project and the person doing it has just hacked around this in the output layer. But it bugs me that there’s no good way to achieve this.

Comments 11

  1. Dominic Mitchell wrote:

    I’ve just had a really evil thought. That only works for characters in the BMP. Bwahahahaha. :-)

    Boy, I’m going to love showing this at work on Monday!

    Posted 03 Feb 2006 at 8:41 am
  2. Dominic Mitchell wrote:

    Woot! That is a most excellent hack! Thank you, Sir!

    Posted 03 Feb 2006 at 8:46 am
  3. Dominic Mitchell wrote:

    Ah, good stuff about the non-BMP characters. I obviously didn’t look closely enough at what you wrote.

    As to the comments, I have no idea. I blogged a while ago about typo getting the times wrong on posts. It appears that it’s doing it on comments as well. Drat. I’ve filed ticket#690 to see if any of the developers have any clue as to what’s going on…

    Posted 03 Feb 2006 at 8:46 am
  4. Aristotle Pagaltzis wrote:

    Btw, let me know what they say at work. :-)

    Posted 03 Feb 2006 at 9:02 am
  5. Aristotle Pagaltzis wrote:

    The char-to-utf8bytes I wrote works for the full Unicode range; not only the BMP but for all sixteen of them.

    Posted 03 Feb 2006 at 9:02 am
  6. Aristotle Pagaltzis wrote:

    Sigh. You made me waste hours of time.

    Posted 03 Feb 2006 at 10:05 am
  7. Aristotle Pagaltzis wrote:

    PS.: what the hell is up with the ordering of your comments? It’s pretty much random rather than chronological.

    Posted 03 Feb 2006 at 10:05 am
  8. Alastair wrote:

    Can I have a guess as to the type of XML document (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/WordXMLCDK/html/cdkwelesym_HV01114886.asp) you are dealing with?

    Aristotle, my hat is off to you.

    Posted 06 Feb 2006 at 4:05 am
  9. Dominic Mitchell wrote:

    Alastair,

    Sadly, it’s not WordML! At work I get to deal with a lot of data from publishers and sadly, it’s often very bad quality. Mostly this is because the data has been recovered from (or still has to deal with) ugly typesetting systems…

    Posted 06 Feb 2006 at 6:15 am
  10. Aristotle Pagaltzis wrote:

    So did they see it at work yet? If so, what did they say?

    Posted 18 Feb 2006 at 11:42 pm
  11. Dominic Mitchell wrote:

    My apologies for not saying so earlier. Yes, everybody was utterly flabbergasted that it could be done! And it meant that we took out the nasty special casing in the layer above that was put in to cope with not doing it. So, a big win! Thank you very much!

    Posted 19 Feb 2006 at 12:21 am
