Clichés are hard

So yesterday, Jeremy asked:

Wondering if accents are valid in class names (so I can mark up some text as being of the class “cliché”)

It’s a damned good question. And you have to consider: character encodings; CSS; HTML; XHTML; JavaScript; HTTP. Needless to say, it’s more complicated than it looks at first.

My first thought was about CSS files. Is it valid to say:

  p.cliché { color: #f00; }

To answer this question, you have to visit the CSS 2.1 spec. Near the end is §G.1 Grammar. It contains a BNF grammar describing the syntax of CSS. BNF grammars aren’t that difficult to read once you get the hang of them. In this case, I start by finding something I can recognise: a selector. Then, I work down through the grammar to find what I’m interested in.

  • selector : simple_selector [ combinator simple_selector ]* ;
    • A selector is composed of one or more simple_selectors, separated by combinators.
  • simple_selector : element_name [ HASH | class | attrib | pseudo ]* | [ HASH | class | attrib | pseudo ]+ ;
    • A simple_selector is composed of an element_name followed by zero or more ID, class, attribute or pseudo selectors. Alternatively, it’s composed of one or more ID/class/attribute/pseudo selectors (without an element name).
  • class : '.' IDENT ;
    • A class name is just “.” followed by an identifier. That’s what we’re interested in here.
  • ident -?{nmstart}{nmchar}*
    • This is now in §G.2. But you can see that an identifier has an optional leading minus, followed by an nmstart and zero or more nmchar. It’s those nmchar that we care about.
  • nmchar [_a-z0-9-]|{nonascii}|{escape}
    • nmchar allows letters, numbers, underscores and minuses, as well as non-ascii characters and escapes. Oooh! Getting closer!
  • nonascii [\200-\377]
    • This is a horrid notation. It’s an octal character range. Octal stopped being in general usage in the early 80s, although Unix and C perpetuate it. Anyway, it says that any character whose code is between 128 and 255 (octal 200 to 377) is allowed.

So we get an answer: Because é is U+00E9 (or, 233 decimal), it’s allowed as part of an identifier in a CSS file.
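
If you’d rather check this mechanically, here’s a rough sketch of the identifier rule in JavaScript. It only covers what’s quoted above: escapes are left out, the nonascii range is taken as 128 to 255 exactly as read from the grammar, and the name isPlainIdent is made up for illustration.

```javascript
// Rough sketch of the "ident" production quoted above, without {escape}.
// Assumption: nonascii is the octal range \200-\377 (128-255) as read from the grammar.
const NONASCII = '\\u0080-\\u00FF';
const NMSTART = `[_a-zA-Z${NONASCII}]`;
const NMCHAR = `[_a-zA-Z0-9${NONASCII}-]`;
const IDENT = new RegExp(`^-?${NMSTART}${NMCHAR}*$`);

// Hypothetical helper: true if the name needs no escaping under this reading.
function isPlainIdent(name) {
  return IDENT.test(name);
}

// The octal bounds really are 128 and 255.
console.log(parseInt('200', 8), parseInt('377', 8)); // 128 255

console.log(isPlainIdent('clich\u00E9')); // true  (é is U+00E9, 233 decimal)
console.log(isPlainIdent('\u0100bc'));    // false (Ā is U+0100, above 255)
console.log(isPlainIdent('9lives'));      // false (a digit is not an nmstart)
```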

But it’s worth noting the arbitrary limit of 255 here. That means that you don’t get to use any Unicode character above that (e.g. Ā [U+0100]) verbatim in a CSS file. Instead, you have to escape it by saying (according to the escape declaration in that grammar) \100. Which is quite nasty.
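
Here’s a small sketch of how such an escape could be generated, assuming the hex form from the grammar’s escape rule (a backslash followed by one to six hex digits). The helper names are hypothetical. The trailing space matters: it ends the escape, so a following character that happens to be a hex digit isn’t swallowed.

```javascript
// Hypothetical helper: build a CSS hex escape for a single character.
function cssHexEscape(ch) {
  const hex = ch.codePointAt(0).toString(16);
  // The trailing space terminates the escape sequence.
  return '\\' + hex + ' ';
}

// Hypothetical helper: escape only the characters above 255, as described above.
function escapeIdent(name) {
  return Array.from(name, (ch) =>
    ch.codePointAt(0) > 0xff ? cssHexEscape(ch) : ch
  ).join('');
}

console.log(escapeIdent('\u0100bc'));    // \100 bc
console.log(escapeIdent('clich\u00E9')); // cliché (unchanged)
```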

There’s one other wrinkle to consider before this will work. You also need to ensure that the CSS file is served over HTTP using the correct character set. If you’ve saved it as Latin-1, you need to ensure that it’s served up with this header:

  Content-Type: text/css; charset="iso-8859-1"

This is the default, so it could be left off, but it’s usually better to be explicit. Likewise, if the file is saved as UTF-8, you need to add this header:

  Content-Type: text/css; charset="UTF-8"

If you’re using Apache, check out the AddDefaultCharset and AddCharset directives.
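
To see why the declared character set has to match the bytes on disk, here’s a quick sketch using Node.js (my choice of runtime, nothing above depends on it). It encodes the same class name both ways and then deliberately decodes the UTF-8 bytes as Latin-1.

```javascript
// Assumption: run under Node.js, which provides Buffer.
const className = 'clich\u00E9';

const asLatin1 = Buffer.from(className, 'latin1');
const asUtf8 = Buffer.from(className, 'utf8');

// é is one byte in Latin-1 but two bytes in UTF-8.
console.log(asLatin1.toString('hex')); // 636c696368e9
console.log(asUtf8.toString('hex'));   // 636c696368c3a9

// What a client sees if UTF-8 bytes are labelled as iso-8859-1.
const misread = asUtf8.toString('latin1');
console.log(misread);               // clichÃ©
console.log(misread === className); // false
```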

So that’s CSS. But what about HTML?

HTML is defined in the HTML 4.01 specification. It’s defined using SGML, which means more complication in order to work out what the heck’s going on. Thankfully, everybody knows that there are four ways to get an é into HTML:

  • A literal é.
  • A character entity: &eacute;
  • A decimal character reference: &#233;
  • A hex character reference: &#xE9;
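
The two numeric forms are just the character’s code point written in decimal and in hex, so they can be derived mechanically. A minimal sketch (the function name is made up):

```javascript
// Hypothetical helper: the decimal and hex character references for one character.
function characterReferences(ch) {
  const code = ch.codePointAt(0);
  return {
    decimal: `&#${code};`,
    hex: `&#x${code.toString(16).toUpperCase()};`,
  };
}

const refs = characterReferences('\u00E9');
console.log(refs.decimal); // &#233;
console.log(refs.hex);     // &#xE9;
```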

In order to figure out what characters are allowed in a class attribute, though, you have to go and start looking at the DTD:

  • The coreattrs entity is the first mention. It defines a class as being some CDATA.
  • The definition of CDATA is an intrinsic part of SGML, the details of which can be altered by the SGML Declaration for HTML 4. There’s a section at the beginning which lists which characters are allowed. It includes a large number of Unicode characters, all above 160 decimal.

That means that it’s safe to include a character via any of the above methods.

But there are a few more wrinkles. Firstly, whilst the two character references above are intrinsic to HTML (via SGML), where does the character entity come from? Well, it’s defined as part of the HTML spec: Character entity references in HTML 4.

There’s also the problem of the character encoding in case you use the literal é. Like the CSS above, you need to ensure that your web server is telling everybody what character encoding the file is served as. Actually, for HTML, it’s less of a problem, as the browser will generally auto-detect character encodings. But that’s not necessarily reliable, so it’s better to be explicit. And in HTML, you can put the character encoding in the file itself:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Yes, this is a little bit like having a French dictionary containing the words “ecrit en Français” in the front. But it’s a good idea to have both this and the HTTP declaration (and they must match).
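
As a rough illustration of “they must match”, here’s a sketch that pulls the charset parameter out of a Content-Type value (whether it came from the HTTP header or from the meta element’s content attribute) and compares the two case-insensitively. The function names and the simple regular expression are my own; this is not a full header parser.

```javascript
// Hypothetical helper: extract the charset parameter, lower-cased, or null if absent.
function charsetOf(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: do the HTTP header and the meta declaration agree?
function declarationsMatch(httpHeader, metaContent) {
  const a = charsetOf(httpHeader);
  return a !== null && a === charsetOf(metaContent);
}

console.log(charsetOf('text/css; charset="UTF-8"')); // utf-8
console.log(declarationsMatch('text/html; charset="UTF-8"', 'text/html; charset=utf-8'));      // true
console.log(declarationsMatch('text/html; charset=iso-8859-1', 'text/html; charset=utf-8')); // false
```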

I’m not going to talk about XHTML/XML because it’s not in widespread use (i.e. serving it up as application/xhtml+xml).

Finally, what about JavaScript? Well, it’s defined as ECMA-262 (3rd edition). That spec explicitly defines everything in terms of Unicode, so it’s mostly OK. You can still access characters you can’t type via an escape mechanism: \u00E9 (see the definition of UnicodeEscapeSequence on page 19). Additionally, JavaScript can get at the Unicode characters in the DOM quite easily:

  <p id='a1' class="cliché">Lessons will be learned.</p>
  <script type="text/javascript">
    alert(document.getElementById('a1').className);
  </script>

As always, JavaScript files served over HTTP need to be supplied with the correct character-encoding through the Content-Type header. Just like CSS and HTML.
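
Going back to the escape mechanism for a moment: here’s a small sketch showing that the escaped and literal spellings are the same string once the script has been parsed. It avoids the DOM so it can run anywhere, and hasClass is a made-up helper that mimics looking for a name in a class attribute value.

```javascript
// The escape sequence produces exactly the character U+00E9.
const escaped = 'clich\u00E9';
console.log(escaped.length);                     // 6
console.log(escaped.charCodeAt(5).toString(16)); // e9

// Hypothetical helper: is `name` one of the space-separated class names?
function hasClass(classAttr, name) {
  return classAttr.split(/\s+/).includes(name);
}

console.log(hasClass('intro clich\u00E9', escaped)); // true
console.log(hasClass('intro cliche', escaped));      // false
```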

So what’s the take-away from all this?

  • Use literal characters and UTF-8 everywhere. It’s consistent and extensible.
  • Know how to look in the specs when something’s going wrong – you’ll know whether it’s you, or the browser that’s getting it wrong.
  • Characters are hard, let’s go shopping!

Jeremy worked it all out in far less time than I did.

Figuring I should be okay as long as I use a character entity.

Looking at that link, I notice that CDATA is handled specially within STYLE and SCRIPT tags. Yet more exceptions to the rules!