Unicode in Rails

I’m really happy to see that Thijs has just pointed out that the unicode_hacks plugin is undergoing further development:

We’re almost ready with a new version of Julik’s ‘Unicode Hacks’ that’s now called ‘ActiveSupport::Multibyte’. You can find more information and code on the ‘Multibyte for Rails’ project site.

I’m particularly pleased to see that: “We hope to get ActiveSupport::Multibyte accepted as a new core extension in the 1.2 release of Ruby on Rails”. That would be a real boon. Check out the FAQ too.


Unicode for Rails

I finally gave my talk this afternoon. I rushed through things in 40 minutes; I was planning on 45, but I started a little late due to microphone difficulties.

The talk seemed to go down well; a few people came up to ask questions afterwards. My official hecklers, Tom and Paul were noticeably silent. They didn’t try to pedant-me-to-death afterwards, which is good. Although it probably means I had too much detail in there for mere mortals!

I’d also like to give a huge bouqet of thanks to the hyper-lovely _why for his fabulously encouraging words along the way.

Anyway, please take a look at the slides for Unicode for Rails if you fancy. One thing I added at the last minute and didn’t get a chance to show on screen was the links slide. In particular, I recommend checking out Julik’s Unicode Slides.

I practised by giving the talk to a collection of fluffy toys that we have around the house. We now have the most well-unicode-educated giraffes in existence, I suspect. 🙂

On a slightly less fun note, I’ve just read Tony Finch’s summary of UTF-8 in email, which is far, far hairier than in HTTP (which has most of the complications built in at least). Worth checking out if you do much email.


Character Info in Textmate

One rather useful feature of vim is that you can pull up information about a character by positioning your cursor over it and hitting ga (get ASCII?). I quite miss this in textmate, so I created a small command to add to the Text bundle. This is “Character Info”, which I’ve assigned to ^⇧I. It takes the selection as input and the information comes back in a tooltip.

  use strict;
  use warnings;
  use charnames qw( :full );
  binmode( STDIN, ':utf8' );
  foreach my $c (split //, do { local $/; <> }) {
      my $code = ord $c;
      my $name = charnames::viacode( $code ) || "unknown character";
      printf "U+%04X %sn", $code, $name;
  exit 0;

This is what it looks like.

Character Info in action

The only caveat is that it only works if you’re using UTF-8 for your files. But really, if not, why not?


Unicode Depresses Me

Perl is meant to have reasonable Unicode support. So why do I still have to write this at the top of a test?

  use utf8;
  use Test::More 'no_plan';
      my $Test = Test::Builder->new;
      binmode( $Test->output,         ":utf8" );
      binmode( $Test->failure_output, ":utf8" );
      binmode( $Test->todo_output,    ":utf8" );

I would have thought that adding the -CS flag to the #! line would have fixed this. But that doesn’t do it. Ah well, I’ve filed a wishlist bug: RT#21091.


Unicode for Rails — accepted

I had a little note today to say that my talk on “Unicode for Rails” has been accepted for RailsConf Europe 2006. Yay!

Now I have to write the thing. This is going to be interesting. I have only a few weeks to go, and most of those weekends are already taken…


Character Encodings

There have been a few links around to an article about AJAX and Multibyte Character Support today. He highlights how he fixed a problem by ensuring that he sent the correct HTTP headers for the character encoding he was using. After stabbing being thoroughly mislead by Internet Explorer.

Now the correct character-encoding ended up being UTF-8. But of course, you actually have to turn your data into UTF-8 first. But it’s still the right answer.

Unless you have really, really peculiar clients, you should always be sending out UTF-8 data. It’s a no-brainer.

But it highlights a deeper issue. You simply have to know what encoding your characters are in. It’s no good saying “ASCII” (I can’t type in “£” then?) or Latin-1 (because invariably it’s cp-1252 instead). UTF-8 being a superset of all the others (with the possible exception of some far-east characters that didn’t make it into the HAN Unification process).

If you don’t know what character encoding you have, then you don’t know how to interpret those bytes. Plain and simple. You need to convert your data to the right character encoding on the way in and send it out in right encoding. Because UTF-8 is a superset, you can transliterate if needed, but it’s an expensive solution.

But it seems that many, many web developers haven’t grasped the concept of a character set yet. It’s just another area when “things sometimes go wrong if you type in funny characters sometimes”. But it needn’t be rocket science.

Oh, and the comments about using escape() in JavaScript are way off the mark. That’s just punting the problem, guys.

Finally, it’s worth noting something else related that popped into my radar today. Sending out character sets with a media type of text/* is absolutely necessary, since it’s perfectly legitimate for a bit of proxy software to then translate to a different character set if it wants to. This is why you should never send out xml as text/*, because then the encoding in the xml declaration would be wrong…


XSLT Character Creation

I’ve just spent most of the afternoon on a character building exercise. I have some XML like this:

  <symbol unicode="2103"/>

And I need to turn that into the numeric character reference . It’s perfectly possible to do so with a bit of fudging around with <xsl:text disable-output-escaping="yes"/>. But there’s a slight caveat: You’re not creating a numeric character reference. You’re just creating something that looks like one. Really, it’s the characters “&”, ”#”, “x”, “2”, “0”, “1”, “3” and ”;”.

Now most of the time, this doesn’t matter. You just output XML that looks correct and the next parser along (probably a browser) will interpret it correctly. But it’s sleight of hand.

Today, I needed to copy the text contents of a node into an attribute. Unfortunately, that text content contained one of these symbol tags. But because it’s only a string, XSLT feels (correctly) that it needs to escape the leading ampersand. So, with this input:

  <name>Fred <symbol unicode="2013"/> Bloggs</name>

I get this output:

  <name attrib="Fred &amp;#x2013; Bloggs">Fred – Bloggs</name>

Yes, I know that the input data is completely stupid. I can’t help that. Unfortunately I also have the restriction that I can’t do this in multiple passes.

I’ve looked at the standard XSLT functions and the standard XPath functions. I’ve looked at the EXSLT functions. All I want is something that works like Perl’s chr.

I noticed that Saxon has the saxon:entity-ref function, but annoyingly, libxslt doesn’t support it.

All I really need is some way of re-invoking the XML parser over a string of my choosing. That way I could just wrap the characters in an element, parse it and call text() to get the character I need.

Right now, the only way that I can see of doing this is to turn UnicodeData.txt into one big XML lookup table, and lookup the numbers in that. Bleeeaaargh.

Thankfully, it’s not my project and the person doing it has just hacked around this in the output layer. But it bugs me that there’s no good way to achieve this.


Java Unicode Characters

Working on jenx, I’ve started looking at Characters. In particular, Astral characters. My first question was “how do I create one in a string literal?” Well I still don’t know. But my researches have shown that to do anything outside the Basic Multilingual Plane (BMP) requires JDK 5. Drat. That kind of limits the usefulness of this library. But I really need the stuff in JSR 204.

Which is a story in itself. It’s good thing that Java can handle the full Unicode range. But the support is (to be quite frank) a bit crap. Mostly down to the fact that char is a UTF-16 codepoint, not a “Unicode Character.” I personally don’t find it helpful that they’ve propogated the C problem of confusing char and int, and generally allowing the two to roam freely amongst each other. Plus, JSR 204 looks like it was extremely careful to avoid breaking backwards compatibility, which is always a noble goal, but in the case makes the end result incredibly difficult to use. I shouldn’t have to test each codepoint to see whether or not it’s a surrogate. Really. This is an OO language, I should be able to get the next Character object from the String. Shocking, I know.

It strikes me that Python is pretty much the only language I know that got Unicode right by making an entirely separate object, the “Unicode String.”

Update: Oh alright. In my whinging, I managed to miss String.codePointAt, which does what I need.