Clichés are hard

So yesterday, Jeremy asked:

Wondering if accents are valid in class names (so I can mark up some text as being of the class “cliché”)

It’s a damned good question. And you have to consider: character encodings; CSS; HTML; XHTML; JavaScript; HTTP. Needless to say, it’s more complicated than it looks at first.

My first thought was of CSS files. Is it valid to say:

  p.cliché { color: #f00; }

To answer this question, you have to visit the CSS 2.1 spec. Near the end is §G.1 Grammar. It contains a BNF grammar describing the syntax of CSS. Such grammars aren’t that difficult to read once you get the hang of them. In this case, I start by finding something I can recognise: a selector. Then, I work down through the grammar to find what I’m interested in.

  • selector : simple_selector [ combinator simple_selector ]* ;
    • A selector is composed of one or more simple_selectors.
  • simple_selector : element_name [ HASH | class | attrib | pseudo ]* | [ HASH | class | attrib | pseudo ]+ ;
    • A simple_selector is composed of an element_name followed by zero or more ID, class, attribute or pseudo selectors. Alternatively, it’s composed of one or more ID/class/attribute/pseudo selectors (without an element name).
  • class : '.' IDENT ;
    • A class name is just “.” followed by an identifier. That’s what we’re interested in here.
  • ident -?{nmstart}{nmchar}*
    • This is now in §G.2. But you can see that an identifier has an optional leading minus, followed by an nmstart and zero or more nmchar. It’s those nmchar that we care about.
  • nmchar [_a-z0-9-]|{nonascii}|{escape}
    • nmchar allows letters, numbers, underscores and minuses, as well as non-ascii characters and escapes. Oooh! Getting closer!
  • nonascii [\200-\377]
    • This is a horrid notation. It’s an octal character range. Octal stopped being in general usage in the early 80s, although Unix and C perpetuate it. Anyway, it says that any character whose code is between 128 and 255 is allowed.

So we get an answer: Because é is U+00E9 (or, 233 decimal), it’s allowed as part of an identifier in a CSS file.
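If you’d rather let a machine do the walking, the grammar fragments above translate fairly directly into a regular expression. This is only a sketch under a few assumptions: it reads the tokens literally (so nonascii stops at 255), it fills in the nmstart, unicode and escape productions from §G.2 that aren’t quoted above, and the function name is made up for illustration.

```javascript
// Tokens from CSS 2.1 §G.2, read literally. The grammar is case-insensitive,
// so upper-case letters are spelled out rather than using the "i" flag.
const nonascii = "[\\u0080-\\u00FF]"; // octal \200-\377
const unicode = "\\\\[0-9a-fA-F]{1,6}(?:\\r\\n|[ \\t\\r\\n\\f])?";
const cssEscape = "(?:" + unicode + "|\\\\[^\\r\\n\\f0-9a-fA-F])";
const nmstart = "(?:[_a-zA-Z]|" + nonascii + "|" + cssEscape + ")";
const nmchar = "(?:[_a-zA-Z0-9-]|" + nonascii + "|" + cssEscape + ")";
const ident = new RegExp("^-?" + nmstart + nmchar + "*$");

// Hypothetical helper: does this name match the literal ident production?
function isLiteralCss21Ident(name) {
  return ident.test(name);
}

console.log(isLiteralCss21Ident("clich\u00E9")); // true  (é is U+00E9)
console.log(isLiteralCss21Ident("\u0100"));      // false (Ā is U+0100)
console.log(isLiteralCss21Ident("9lives"));      // false (digits can't start a name)
```

Note that this checks the grammar as quoted, not what any particular browser accepts.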

But it’s worth noting the arbitrary limit of 255 here. Read literally, that means you don’t get to use any Unicode character above that (e.g. Ā [U+0100]) verbatim in a CSS file. Instead, you have to escape it by writing (according to the escape declaration in that grammar, which is a backslash followed by one to six hex digits) \100. Which is quite nasty.
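Should you need to produce such escapes, here is a rough sketch of generating them. The helper names are hypothetical, and it only deals with characters above U+00FF; it makes no attempt to handle the other things that need escaping in an identifier (a leading digit, say).

```javascript
// Hypothetical helper: build a CSS hex escape for one character.
// The trailing space terminates the escape, so a following hex digit
// is not swallowed into it.
function cssHexEscape(ch) {
  return "\\" + ch.codePointAt(0).toString(16) + " ";
}

// Escape only the characters above U+00FF, leaving everything else as typed.
function escapeAboveLatin1(name) {
  return Array.from(name, (ch) =>
    ch.codePointAt(0) > 0xff ? cssHexEscape(ch) : ch
  ).join("");
}

console.log(escapeAboveLatin1("\u0100b")); // \100 b
```

Without that trailing space, the “b” would be read as part of the hex number, giving U+100B instead of U+0100 followed by “b”.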

There’s one other wrinkle to consider before this will work. You also need to ensure that the CSS file is served over HTTP using the correct character set. If you’ve saved it as Latin-1, you need to ensure that it’s served up with this header:

  Content-Type: text/css; charset="iso-8859-1"

This is the default, so it could be left off, but it’s usually better to be explicit. Likewise, if the file is saved as UTF-8, you need this header instead:

  Content-Type: text/css; charset="UTF-8"

If you’re using Apache, check out the AddDefaultCharset and AddCharset directives.
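To see why the header matters, here is a small sketch of what happens when a stylesheet saved as UTF-8 gets read as Latin-1. It assumes a JavaScript environment with a global TextEncoder (modern browsers and Node), and decodeLatin1 is a made-up helper, not a library function.

```javascript
// The UTF-8 bytes for the rule, as they would sit in a file saved as UTF-8.
const original = "p.clich\u00E9 { color: #f00; }";
const bytes = new TextEncoder().encode(original);

// Decode as ISO-8859-1, where every byte maps to the code point of the
// same number. This is what a reader assuming Latin-1 would see.
function decodeLatin1(byteArray) {
  return Array.from(byteArray, (b) => String.fromCharCode(b)).join("");
}

const misread = decodeLatin1(bytes);
console.log(misread); // p.clichÃ© { color: #f00; }
```

The é takes two bytes in UTF-8 (0xC3 0xA9), so the misread selector names a different class and the rule silently stops matching.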

So that’s CSS. But what about HTML?

HTML is defined in the HTML 4.01 specification. It’s defined using SGML, which means more complication in order to work out what the heck’s going on. Thankfully, everybody knows that there are four ways to get an é into HTML:

  • A literal é.
  • A character entity: &eacute;
  • A decimal character reference: &#233;
  • A hex character reference: &#xE9;
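The two numeric forms are just the same code point written in different bases (233 decimal is E9 hex). A minimal sketch of expanding them, with a hypothetical helper name, and deliberately ignoring named entities since those need a lookup table from the HTML spec:

```javascript
// Hypothetical helper: expand decimal (&#233;) and hex (&#xE9;) character
// references. Named entities such as &eacute; are left untouched.
function decodeNumericRefs(text) {
  return text.replace(
    /&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g,
    (match, hex, dec) =>
      String.fromCodePoint(hex ? parseInt(hex, 16) : parseInt(dec, 10))
  );
}

console.log(decodeNumericRefs("clich&#233;") === decodeNumericRefs("clich&#xE9;")); // true
```

A real HTML parser does rather more than this (it copes with out-of-range numbers, for one), so treat it as an illustration only.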

In order to figure out what characters are allowed in a class attribute, though, you have to go and start looking at the DTD:

  • The coreattrs entity is the first mention. It defines a class as being some CDATA.
  • The definition of CDATA is an intrinsic part of SGML, the details of which can be altered by the SGML Declaration for HTML 4. There’s a section at the beginning which lists which characters are allowed. It includes a large number of Unicode characters, all above 160 decimal.

That means that it’s safe to include a character via any of the above methods.

But there are a few more wrinkles. Firstly, whilst the two character references above are intrinsic to HTML (via SGML), where does the character entity come from? Well, the entities are defined as part of the HTML spec: Character entity references in HTML 4.

There’s also the problem of the character encoding in case you use the literal é. Like the CSS above, you need to ensure that your web server is telling everybody what character encoding the file is served as. Actually, for HTML, it’s less of a problem, as the browser will generally auto-detect character encodings. But that’s not necessarily reliable, so it’s better to be explicit. And in HTML, you can put the character encoding in the file itself:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Yes, this is a little bit like having a French dictionary containing the words “écrit en français” in the front. But it’s a good idea to have both this and the HTTP declaration (and they must match).
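Checking that the two declarations agree is easy enough to script. A rough sketch, assuming you already have both Content-Type values as strings; charsetOf is a hypothetical helper, and it compares names only, so it won’t notice that two different labels can mean the same encoding.

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type
// value, whether it came from the HTTP header or the meta element.
function charsetOf(contentType) {
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

const httpHeader = 'text/html; charset="UTF-8"';
const metaContent = "text/html; charset=utf-8";

console.log(charsetOf(httpHeader) === charsetOf(metaContent)); // true
console.log(charsetOf("text/html"));                           // null
```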

I’m not going to talk about XHTML/XML because it’s not in widespread use (i.e. serving it up as application/xhtml+xml).

Finally, what about JavaScript? Well, it’s defined as ECMA-262 (3rd edition). That spec explicitly defines everything in terms of Unicode, so it’s mostly OK. You can still access characters you can’t type via an escape mechanism: \u00E9 (see the definition of UnicodeEscapeSequence on page 19). Additionally, JavaScript can get at the Unicode characters in the DOM quite easily:

  <p id='a1' class="cliché">Lessons will be learned.</p>
  <script type="text/javascript">
    alert(document.getElementById('a1').className)
  </script>
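The escape mechanism is easy to try outside the browser too. A short sketch showing that \u00E9 yields a single character with code 233:

```javascript
// The escape produces one character, so the string is six characters long.
const escaped = "clich\u00E9";

console.log(escaped.length);                                 // 6
console.log(escaped.charCodeAt(5));                          // 233
console.log(escaped === "clich" + String.fromCharCode(233)); // true

// A literal é in the source compares equal to the escape only when the
// script is decoded using the encoding it was saved in.
console.log(escaped === "cliché");
```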

As always, JavaScript files served over HTTP need to be supplied with the correct character-encoding through the Content-Type header. Just like CSS and HTML.

So what’s the take-away from all this?

  • Use literal characters and UTF-8 everywhere. It’s consistent and extensible.
  • Know how to look in the specs when something’s going wrong – you’ll know whether it’s you, or the browser that’s getting it wrong.
  • Characters are hard, let’s go shopping!

Jeremy worked it all out in far less time than I did.

Figuring I should be okay as long as I use a character entity. http://tinyurl.com/7p7qc

Looking at that link, I notice that CDATA is handled specially within STYLE and SCRIPT tags. Yet more exceptions to the rules!

8 thoughts on “Clichés are hard”

  1. dom

    Bert Bos emailed me a while ago with this tidbit:

    Somebody pointed me to your article[1] about the characters that are allowed in class names in CSS. You correctly deduced that é (e with acute accent) is allowed, but your article may lead people to believe that letters in other scripts, such as Chinese or Arabic, are *not* allowed.

    I readily admit that the specification is difficult to read, but it does say that all characters outside the ASCII range are allowed. In section 4.1.3 there is the phrase “ISO 10646 characters U+00A1 and higher” and in section G.2 it says that the grammar only shows code points up to (octal) 0377, but that in fact everything up to (octal) 04177777 is allowed.

    [1] http://happygiraffe.net/blog/articles/2007/12/19/cliches-are-hard

  2. Mark Fowler

    Of course, in Unicode, “é” has a character code under 255 if and only if you’re using NFC. Use NFD and everything suddenly looks a lot different…

  3. Mot

    Thanks for the article. Proper research done and described. I love that style. For CSS charsets it pays to consider using the declaration inside the CSS file as well, with the @charset rule:

    @charset "<IANA defined charset name>";
    @charset "UTF-8";
    

    It belongs at the very beginning of the document.

    For the HTML a charset can be added as well into meta headers. This helps to read documents when they were saved and are not provided by the server any longer:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    

    does that job. But all this is only a very slight addition to your great article.

  4. Dominic Mitchell

    @steffen: Good point, but terribly anglocentric. Perhaps you would win the IOCCC “Best abuse of the rules.” :-)

    @codingvista: It depends on the server side language. Certainly Java and Perl allow you to use Unicode in identifiers, possibly Ruby too. I think PHP is the only severely Unicode crippled language at this point.

    But in general, this post isn’t really about those specifics. It’s more about: “How do I get a proper answer to my web development problem.” And for that, you do need to go to the specs.

  5. codingvista

    Hey

    I haven’t time to read the whole lot, so you may have already mentioned this, but if you reflect CSS IDs/class names in server side code, and there are good reasons why you’d do this, then using complex characters would be a no-no.

    Just my .5p

    w://

  6. Dan Eastwell

    Great article Dominic! Saves me the time of having to work it out for myself. There will almost certainly come a point where things aren’t working and I’ve ended up using a >255 Unicode character in my CSS. Again!
