Which whitespace was that again?

We recently saw this at $WORK. It appears corrupted in Internet Explorer only. Firefox and Safari show it normally.

Corrupted text in internet explorer

After much exploration in the debugger, we eventually found it was caused by using the innerText property in Internet Explorer. This has the mildly surprising property of turning multiple spaces into U+00A0 (NO-BREAK SPACE) characters (&nbsp; to you and me). This behaviour doesn’t appear to be documented. And before you ask, this was all being done by third-party code; I know not to use proprietary extensions where possible.

Anyway, I nailed it down to a small test. Given markup containing an element with the id foo, this script demonstrates the problem.

var foo = document.getElementById('foo');

// OK
foo.innerText = ['A', ' ', 'B'].join('');
alert(foo.innerHTML);   // "A B"

// Not OK: in IE the two spaces don't come back as two plain spaces
foo.innerText = ['A', ' ', ' ', 'B'].join('');
alert(foo.innerHTML);   // contains &nbsp; (U+00A0) in IE

If you want to try it yourself, check out (jsbin is awesome! Thanks, rem!)

The quick solution is simple: normalize whitespace before assigning to innerText.

var text = 'some       where        over      the      rainbow';

foo.innerText = text.split(/\s+/).join(' ');

Of course, you should really be using appendChild() and createTextNode().
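As a rough sketch of both ideas together: the helper names normalizeWhitespace and setText are made up for illustration, and doc is assumed to be a DOM Document passed in by the caller. Something like this collapses the whitespace and then inserts the text with standard DOM calls instead of innerText.

```javascript
// Collapse every run of whitespace into a single ordinary space.
function normalizeWhitespace(text) {
  return text.split(/\s+/).join(' ');
}

// Hypothetical helper: replace an element's contents with one text node.
// `doc` is assumed to be a DOM Document (e.g. window.document).
function setText(doc, el, text) {
  // Remove whatever the element currently contains.
  while (el.firstChild) {
    el.removeChild(el.firstChild);
  }
  // A text node is inserted verbatim, with no markup parsing involved.
  el.appendChild(doc.createTextNode(normalizeWhitespace(text)));
}
```

In a page, that would be called as setText(document, document.getElementById('foo'), text).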


To XHTML or not to XHTML?

Today, we had a conversation about HTML 4 vs XHTML 1.0. For me, the matter was neatly settled the very first time I saw an XML system produce XHTML like this:


An article with an empty emphasis tag.

Perfectly legal XML, perfectly legal XHTML. But — if you serve up this XHTML as text/html (which 99.99% of the world does), then you end up with this:

Empty tags considered harmful

Why? Because it’s parsed as HTML, and browsers parsing HTML ignore the trailing slash. So the browser sees the start of an em tag, but no close, and everything after it ends up emphasised.

And now I make sure that all our sites emit HTML 4. It’s a lot simpler.

This isn’t to say I don’t use XHTML. It’s a fine medium for further processing (e.g. applying XSLT). But it’s not right for serving up to browsers verbatim.


Clichés are hard

So yesterday, Jeremy asked:

Wondering if accents are valid in class names (so I can mark up some text as being of the class “cliché”)

It’s a damned good question. And you have to consider: character encodings; CSS; HTML; XHTML; JavaScript; HTTP. Needless to say, it’s more complicated than it looks at first.

My first thought was that of CSS files. Is it valid to say:

  p.cliché { color: #f00; }

To answer this question, you have to visit the CSS 2.1 spec. Near the end is §G.1 Grammar. It contains a BNF grammar describing the syntax of CSS. They’re not that difficult to read when you get the hang of them. In this case, I start by finding something I can recognise: a selector. Then, I work down through the grammar to find what I’m interested in.

  • selector : simple_selector [ combinator simple_selector ]* ;
    • a selector is composed of one or more simple_selectors.
  • simple_selector : element_name [ HASH | class | attrib | pseudo ]* | [ HASH | class | attrib | pseudo ]+ ;
    • A simple_selector is composed of an element_name followed by zero or more ID, class, attribute or pseudo selectors. Alternatively, it’s composed of one or more ID/class/attribute/pseudo selectors (without an element name).
  • class : '.' IDENT ;
    • A class name is just “.” followed by an identifier. That’s what we’re interested in here.
  • ident -?{nmstart}{nmchar}*
    • This is now in §G.2. But you can see that an identifier has an optional leading minus, followed by an nmstart and zero or more nmchar. It’s those nmchar that we care about.
  • nmchar [_a-z0-9-]|{nonascii}|{escape}
    • nmchar allows letters, numbers, underscores and minuses, as well as non-ascii characters and escapes. Oooh! Getting closer!
  • nonascii [\200-\377]
    • This is a horrid notation. It’s an octal character range. Octal stopped being in general use in the early 80s, although Unix and C perpetuate it. Anyway, it says that any character whose code is between 128 and 255 is allowed.

So we get an answer: Because é is U+00E9 (or, 233 decimal), it’s allowed as part of an identifier in a CSS file.

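But it’s worth noting the apparent limit of 255 here. Read literally, it means that you don’t get to use any Unicode character above that (e.g. Ā [U+0100]) verbatim in a CSS file. Instead, you have to escape it by saying (according to the escape declaration in that grammar) \100. Which is quite nasty. (A note attached to the grammar does say that \377 is only the highest character Flex can handle, and should be read as the highest Unicode code point.)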
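To make the escape notation concrete, here is a minimal sketch that rewrites characters above U+00FF as CSS backslash escapes. The function name escapeAboveLatin1 is invented for illustration, and appending a space after every escape is one choice among several that the grammar allows.

```javascript
// Hypothetical helper: rewrite every character above U+00FF in a CSS
// identifier as a backslash escape (backslash, hex digits, one space).
function escapeAboveLatin1(ident) {
  // Array.from walks the string by code point, not by UTF-16 code unit.
  return Array.from(ident, (ch) => {
    const code = ch.codePointAt(0);
    if (code <= 0xFF) {
      return ch;
    }
    // The trailing space terminates the escape, so a following hex digit
    // is not swallowed into the character code.
    return '\\' + code.toString(16) + ' ';
  }).join('');
}
```

So a class name consisting of U+0100 followed by b would be written as \100 b in the stylesheet, while cliché passes through unchanged.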

There’s one other wrinkle to consider before this will work. You also need to ensure that the CSS file is served over HTTP using the correct character set. If you’ve saved it as Latin-1, you need to ensure that it’s served up with this header:

  Content-Type: text/css; charset="iso-8859-1"

This is the default, so it could be left off, but it’s usually better to be explicit. Likewise, if the file is saved as UTF-8, you need this header to be added.

  Content-Type: text/css; charset="UTF-8"

If you’re using Apache, check out the AddDefaultCharset and AddCharset directives.

So that’s CSS. But what about HTML?

HTML is defined in the HTML 4.01 specification. It’s defined using SGML, which means more complication in order to work out what the heck’s going on. Thankfully, everybody knows that there are four ways to get an é into HTML:

  • A literal é.
  • A character entity: &eacute;
  • A decimal character reference: &#233;
  • A hex character reference: &#xE9;

In order to figure out what characters are allowed in a class attribute, though, you have to go and start looking at the DTD:

  • The coreattrs entity is the first mention. It defines a class as being some CDATA.
  • The definition of CDATA is an intrinsic part of SGML, the details of which can be altered by the SGML Declaration for HTML 4. There’s a section at the beginning which lists which characters are allowed. It includes a large number of Unicode characters, all above 160 decimal.

That means that it’s safe to include a character via any of the above methods.

But there are a few more wrinkles. Firstly, whilst the two character references above are intrinsic to HTML (via SGML), where does the character entity come from? Well, it is defined as part of the HTML spec: Character entity references in HTML 4.

There’s also the problem of the character encoding in case you use the literal é. Like the CSS above, you need to ensure that your web server is telling everybody what character encoding the file is served as. Actually, for HTML, it’s less of a problem, as the browser will generally auto-detect character encodings. But that’s not necessarily reliable, so it’s better to be explicit. And in HTML, you can put the character encoding in the file itself:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Yes, this is a little bit like having a French dictionary containing the words “écrit en français” in the front. But it’s a good idea to have both this and the HTTP declaration (and they must match).

I’m not going to talk about XHTML/XML because it’s not in widespread use (i.e. serving it up as application/xhtml+xml).

Finally, what about JavaScript? Well, it’s defined as ECMA-262 (3rd edition). That spec explicitly defines everything in terms of Unicode, so it’s mostly OK. You can still access characters you can’t type via an escape mechanism: \u00E9 (see the definition of UnicodeEscapeSequence on page 19). Additionally, JavaScript can get at the Unicode characters in the DOM quite easily:

  <p id='a1' class="cliché">Lessons will be learned.</p>
  <script type="text/javascript">

As always, JavaScript files served over HTTP need to be supplied with the correct character-encoding through the Content-Type header. Just like CSS and HTML.
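Going back to the escape mechanism for a moment, here is a small sketch of how it behaves. The variable names are made up, and the DOM lookup is left as a comment because it assumes the paragraph with id a1 shown above is on the page.

```javascript
// 'clich\u00E9' uses a UnicodeEscapeSequence for U+00E9 (é).
const className = 'clich\u00E9';

// The escaped character is a single code unit with the value 0xE9 (233).
const lastCode = className.charCodeAt(className.length - 1);

// String.fromCharCode goes the other way, from code to character.
const sameName = 'clich' + String.fromCharCode(233);

// In a browser, the class read back from the DOM could be compared with it:
//   document.getElementById('a1').className === className
```

Because the escape is resolved when the script is parsed, this form works whatever encoding the .js file itself is served in.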

So what’s the take-away from all this?

  • Use literal characters and UTF-8 everywhere. It’s consistent and extensible.
  • Know how to look in the specs when something’s going wrong – you’ll know whether it’s you, or the browser that’s getting it wrong.
  • Characters are hard, let’s go shopping!

Jeremy worked it all out in far less time than I did.

Figuring I should be okay as long as I use a character entity.

Looking at that link, I notice that CDATA is handled specially within STYLE and SCRIPT tags. Yet more exceptions to the rules!


The Wrong Defaults

Aristotle points out a post from Anne van Kesteren about draconian error handling. FWIW, I agree with Aristotle. But both of them are really missing the point.

The simple fact of the matter is that the defaults in all of our tools are wrong. Yup, you heard right.

When you’re working with some server-side framework, variables are almost invariably output as-is. If you need to include data from a source that might contain angle brackets or ampersands, you have to call a function first to escape the output before inserting it into the page. Some frameworks (e.g. Rails) attempt to make this easier by reducing it to a single character (the h function).

But these are the wrong defaults. The default should be to escape everything that’s output, unless the programmer asks otherwise. This is the only way that we will start seeing a reduction in cross-site scripting attacks.
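To show what escape-by-default might look like, here is a minimal sketch using a JavaScript template tag. It is not how any particular framework does it; the names escapeHtml, raw and html are invented for illustration.

```javascript
// Escape the characters that are special in HTML text and attributes.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical marker for output the programmer explicitly trusts.
function raw(markup) {
  return { trustedMarkup: String(markup) };
}

// Template tag: every interpolated value is escaped unless wrapped in raw().
function html(strings, ...values) {
  // reduce() starts from strings[0]; values[i - 1] sits before strings[i].
  return strings.reduce((out, chunk, i) => {
    const value = values[i - 1];
    const isTrusted = value !== null && typeof value === 'object' &&
      'trustedMarkup' in value;
    return out + (isTrusted ? value.trustedMarkup : escapeHtml(value)) + chunk;
  });
}
```

The point of the design is that forgetting something gives you safe output: html`<p>${userInput}</p>` is escaped, and you have to type raw() on purpose to get unescaped markup.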

The fact that I can flip this default is one of the reasons that I like working with HTML::Mason, even though it’s otherwise a bad mix of code and template.

The SQL guys learned years ago that if you rely on programmers to do the escaping, they won’t bother half the time. Result: lots of applications with SQL injection attacks. The solution: placeholders. Make it simpler and easier to do the correct thing and the problem goes away.

Sadly, I don’t hold out much hope for this being achieved.


Citation Sources

One of the useful things about blockquotes is that you can specify where you’re quoting from by adding a cite attribute. But current browsers don’t offer you an interface to use this information. Sounds like a job for Greasemonkey! So I wrote cite.user.js to get around it.

Unfortunately, I should have looked around first—I might have seen Citeable Blockquotes, which does what my script does, but more and better. Ah well. It was an interesting learning experience; it took only minutes to put together.
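For the curious, the general idea can be sketched in a few lines. This is not the original cite.user.js, just a minimal illustration: the function name addCiteLinks and the "Source" label are invented, and doc is assumed to be a DOM Document.

```javascript
// Hypothetical sketch: append a visible link to every blockquote that
// carries a cite attribute. Returns the number of links added.
function addCiteLinks(doc) {
  const quotes = doc.querySelectorAll('blockquote[cite]');
  let added = 0;
  for (let i = 0; i < quotes.length; i++) {
    const link = doc.createElement('a');
    link.href = quotes[i].getAttribute('cite');
    link.textContent = 'Source';
    quotes[i].appendChild(link);
    added++;
  }
  return added;
}
```

In a user script it would be called as addCiteLinks(document) once the page has loaded.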