No escape() from JavaScript

A couple of days ago, we got caught out by a few encoding issues in a site at $WORK. The Perl related ones were fairly self explanatory and I’d seen before (e.g. not calling decode_utf8() on the query string parameters). But the JavaScript part was new to me.

The problem was that we were using JavaScript to create an URL, but this wasn’t encoding some characters correctly. After a bit of investigation, the problem comes down to the difference between escape() and encodeURIComponent().

#escaper {
border: 1px dotted black;
text-align: center;
#escaper thead {
background: #eee;
#escaper thead th {
border-bottom: 1px solid black;
#escaper td, #escaper th {
padding: .25em;

input escape(…) encodeURIComponent(…)
a&b a%26b a%26b
1+2 1+2 1%2B2
café caf%E9 caf%C3%A9
Ādam %u0100dam %C4%80dam

The last is particularly troublesome, as no server I know of will support decoding that %u form.

The takeaway is that encodeURIComponent() always encodes as UTF-8 and doesn’t miss characters out. As far as I can see, this means you should simply never use escape(). Which is why I’ve asked Douglas Crockford to add it as a warning to JSLint.

Once we switched the site’s JavaScript from escape() to encodeURIComponent(), everything worked as expected.


Which whitespace was that again?

We recently saw this at $WORK. It appears corrupted in Internet Explorer only. Firefox and Safari show it normally.

Corrupted text in internet explorer

After much exploration in the debugger, we eventually found it was caused by using the innerText property in internet explorer. This has the mildly surprising property of turning multiple spaces into U+00A0 (NO-BREAK SPACE) characters (  to you and me). This behaviour doesn’t appear to be documented. And before you ask, this was all being done by a third-party — I know to not use proprietary extensions where possible.

Anyway, I nailed it down to a small test. Given this markup.

Then this script demonstrates the problem.

var foo = document.getElementById('foo');

// OK
foo.innerText = ['A', ' ', 'B'].join('');
alert(foo.innerHTML);   // "A B"

foo.innerText = ['A', ' ', ' ', 'B'].join('');
alert(foo.innerHTML);   // "A  B"

If you want to try it yourself, check out (jsbin is awesome! Thanks, rem!)

The quick solution is simple: normalize whitespace before insertion before using innerText.

var text = 'some       where        over      the      rainbow';

foo.innerText = text.split(/s+/).join(' ');

Of course, you should really be using appendChild() and createTextNode().

Uncategorized on OSX

Just a quick note… I was looking at RT#48699 when I noticed that MacPorts didn’t have in it’s collection. I needed to install it by hand. Unfortunately, the latest version (1.12) doesn’t install cleanly.

So I’ve forked it and fixed it (along with a couple of other minor nits).

Claes said he’ll apply the patch at some point. So hopefully when 1.13 comes out, this won’t be necessary.

Of course, really I should get to grips with MacPorts and submit a Portfile

Uncategorized mirror

Whilst I ensure I have the latest version of JSLint for each release of jslint4java, it’s often difficult to know what’s actually changed between versions. Unfortunately, Douglas Crockford doesn’t maintain a public version control system1 (which is entirely up to him).

Nonetheless, it’s kind of useful to be able to look at a version of JSLint and say “this changed since the last version”. So, I’ve started mirroring into a git repository. I check for updates every hour.

This means you can now see how JSLint changes over time.

Now, it’s not as useful as it could be. It would be nice to tie these up to the changes that Douglas posts to the mailing list (e.g. announcing the new .data() support). But that’s quite a bit more work.

I hope that this is beneficial to somebody.

1 It’s amazing how normal public version control has become over the years. This used to be a lot less common, which made projects a lot harder to understand.


Google Analytics in XHTML

I’e been attempting to get Google Analytics to work correctly in both FireFox and IE6 for a site at $WORK. This is not normally a problem, apart from the fact that we’re serving up pages to firefox as application/xhtml+xml in order to get MathML support.

Now, the sample code from Google is pretty gnarly.

var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "' type='text/javascript'%3E%3C/script%3E"));

var pageTracker = _gat._getTracker("UA-xxxxxx-x");
} catch(err) {}

This fails in XHTML as document.write() isn’t there.

I tried a number of ways to get this to work.

  • Replace document.write() with some jQuery code to insert a script tag.
    • This didn’t work in IE6 — as the second script block ended up getting called before newly inserted script tag had loaded.
    • But I did find out that jQuery will replace script tags with Ajax calls for you. Which means you don’t end up with a script tag in the DOM tree, which is highly confusing when you’re looking for it in firebug.
  • Replace document.write() with native DOM calls to insert a script tag.
    • I did find the neat idea of adding an id to the script tag you’re currently in, so you know where to insert new DOM elements.
    • But it still failed, and for the same reason as above.

I was just about to start implementing something evil involving setInterval(), when I realised…

… this site will never use SSL!

So I replaced the code to generate a script tag with the script tag.

var pageTracker = _gat._getTracker("UA-xxxxxx-x");
} catch(err) {}

Tada! If only I’d thought of this a few hours earlier… The moral is to be more aware of the context in which you’re doing something. Keep an eye on the “big picture” to use a particularly horrible metaphor.


Time Zones, done better

A few days ago, I signed up to HuffDuffer (love the logo, BTW). However, once I’d registered, my profile page said Huffduffing since November 1st, 2008. Which was a little weird, as when I did this it was October 31st.

So I filed a bug report and Jeremy quickly spotted the problem: the server was running in Australia, so was using an Australian timezone. That’s about 11 hours ahead, which would put my signup in to the next day. His fix was simple: force huffduffer to use GMT (see datetime configuration in the PHP manual). This isn’t a 100% correct solution, but it’s definitely the least bad approach.

But it set me wondering: can we do better? The key thing is that something somewhere has to know what time zone you’re in. That thing happens to be your browser. JavaScript Date objects have a getTimezoneOffset() method which return the minutes difference between you and GMT.

  alert(new Date().getTimezoneOffset());

Right now, when I try it, it returns zero because I’m in british winter time, which just happens to be the same as GMT. If I was in Australia, it’d return 660 (11 hours).

So we know the time on the server, and we know how far away from GMT we are. So how do we get JavaScript to format the date correctly?

Obviously, the first step is to pass the time to the browser as GMT. The normal approach to this is to use the Unix time format (number of seconds since midnight 1970). In PHP, look at gmmktime(). Perl uses gmtime(). And JavaScript has a constructor which takes milliseconds.

  new Date(0); // Thu Jan 01 1970 01:00:00 GMT+0100 (BST)

Ooops. That’s an hour ahead. But it’s actually misleading: the internal representation (the milliseconds we are passing in) doesn’t have a timezone, it’s only applied when you turn it into a human readable form. You can demonstrate this by using rhino on the command line to fool Javascript in to thinking it’s in different time zones1.

First, let’s try it in my time zone:

  % java -jar js.jar -e "print(new Date(0))"
  Thu Jan 01 1970 00:00:00 GMT-0000 (GMT)

Now let’s pretend we’re in Australia:

  % TZ=Australia/Melbourne java -jar js.jar -e 'print(new Date(0))'
  Thu Jan 01 1970 10:00:00 GMT+1000 (EST)

Which makes things really easy for us: get the server to pass the unix time, create a new Date object from it and then rewrite the textual object based on the contents.

So, if you start out with some HTML like this:


At the third stroke, it will be 2008-10-31 14:00:00.

Note that we include the default (GMT) time in order to degrade gracefully.

Then you can use some JavaScript like this to automatically adjust it to the time zone of the browser.

    $("span.datetime").each(function () {
        if (!this.hasAttribute('unixtime')) {
        var unixtime = this.getAttribute('unixtime');
        // Convert seconds to milliseconds
        var date = new Date(unixtime * 1000);

I’m relying on jquery a little bit in there in order to focus on the problem, not the DOM. You can download the full source if needed (

This is such a teeny-tiny thing overall, but it’s part of the polish to help make your site great.

1 You can change the timezone of any command this way in Unix. I have alias nydate='TZ=America/New_York date' in my shell profile so I can quickly see what the time in New York is.


jslint4java 1.2

I’ve finally gotten around to finishing off the code that I’ve had sitting around in google code for over a year, and released jslint4java 1.2. The changes are actually pretty small:

  • Update to jslint 2008-09-04. This adds several new options.
  • Several updates to the ant task:
    • Move the antlib definition to “antlib:net.happygiraffe.jslint” (incompatible change).
    • Default to failing the build if validation fails (incompatible change).
    • Allow use of the fileset element to be more flexible about specifying where your javascript files are. This replaces several attributes on the jslint task (incompatible change).
    • Add a formatter sub-element, which can output either XML or plain text, either to the console or to a file.
    • See the documentation for some examples of the new usage.
  • Allow access to the HTML report produced by JSLint through an API.

I’ve also uploaded the javadoc directly into subversion on google code, so it’s permanently online. The google-collections guys seem to do this, so it can’t be a bad idea, right?

Typical. Five minutes after I release, I notice that it’s been compiled with 1.6 instead of 1.5. So, I’ve just released v1.2.1…


JavaScript Scope

This is an issue that popped up with a colleague yesterday. Roughly, there was some code like this.

  function setUpStuff() {
    var items = cssQuery("#stuff p");
    for (var i = 0; i < items.length; i++) {
      var item = items[i];
      item.onclick = function() {
        alert("item is " + item);

Unfortunately, even if cssQuery() returns a list of three items, every call to alert() will show the last item. Why?

It’s all down to the fact that JavaScript doesn’t have block-scoped variables. It only has function-scope.

In the context of the code above, it means that item exists from the point of entry to setUpStuff(). It’s just undefined until the we get into the loop. The consequence of this is that each time we assign onclick, we’re referring to the same variable.

The solution is to create another function scope, of course.

  function setUpStuff() {
    var items = cssQuery("#stuff p");
    for (var i = 0; i < items.length; i++) {

  functon setUpAStuff(item) {
    item.onclick = function() {
      alert("item is " + item);

This now works as expected, because inside setUpAStuff(), we’re referring to a different variable each time (you could also use an anonymous function). This is definitely one of the more confusing areas of JavaScript (though JSLint does pick up on this).

* Block-Structured JavaScript
* Core JavaScript 1.5 Guide:Block Statement (on devmo)



I spent a little while looking at mootools yesterday for a colleague. It’s yet another JavaScript library. My colleague was wondering how to restrict the Accordion effect so it applied once to each area of content on the page, rather than once for the whole page (each content area has multiple bits of content where only one should be visible).

As usual, I just headed straight for the source code (Accordion.js). Inside, I found the best abuse of JavaScript I’ve seen in a while.

  initialize: function(){
    var options, togglers, elements, container;
    $each(arguments, function(argument, i){
                    case 'object': options = argument; break;
                    case 'element': container = $(argument); break;
                            var temp = $$(argument);
                            if (!togglers) togglers = temp;
                            else elements = temp;
    // …

Now what exactly does this do? It’s pretty different from the usual JavaScript function usage:

  var initialize = function (options, togglers, elements, container) { … }

The first thing of interest is the first parameter to $each: arguments. A little known feature of JavaScript is that every function has an array-like object of all its arguments available. You can find get a reference to the function that you’re in, which is sometimes useful.

Here, it’s being used to accept an arbitrary number of arguments, in any order. To be quite frank, it’s a bit of a mess. This is my understanding of the above code:

  • You can pass in up to four arguments.
    • You can pass in more, but we only get 4 parameters out of it.
  • The last JavaScript object will get treated as a set of options.
  • The last element passed in will be used as “container”.
  • The first non object|element will be passed to $$ and treated as the set of “togglers”.
  • The last non object|element arguments will be passed through $$ and treated as the elements to be shown/hidden.

Is that right? I’ve been looking at it for a bit and still not 100% sure.

To me, this is a fine example of taking the flexibility of JavaScript just a little too far.


Clichés are hard

So yesterday, Jeremy asked:

Wondering if accents are valid in class names (so I can mark up some text as being of the class “cliché”)

It’s a damned good question. And you have to consider: character encodings; CSS; HTML; XHTML; JavaScript; HTTP. Needless to say, it’s more complicated than it looks at first.

My first thought was that of CSS files. Is it valid to say:

  p.cliché { color: #f00; }

To answer this question, you have to visit the CSS 2.1 spec. Near the end is §G.1 Grammar. It contains a BNF grammar describing the syntax of CSS. They’re not that difficult to read when you get the hang of them. In this case, I start by finding something I can recognise: a selector. Then, I work down through the grammar to find what I’m interested in.

  • selector : simple_selector [ combinator simple_selector ]* ;
    • a selector is composed of one or more simple_selectors.
  • simple_selector : element_name [ HASH | class | attrib | pseudo ]* | [ HASH | class | attrib | pseudo ]+ ;
    • A simple_selector is composed of an element_name followed by zero or more an ID, class, attribute or pseudo selector. Alternatively, it’s composed of one or more ID/class/attribute/pseudo selector (without an element name).
  • class : '.' IDENT ;
    • A class name is just ”.” followed by an identifier. That’s what we’re interested in here.
  • ident -?{nmstart}{nmchar}*
    • This is now in §G.2. But you can see that an identifier has an optional leading minus, followed by an nmstart and zero or more nmchar. It’s those nmchar that we care about.
  • nmchar [_a-z0-9-]|{nonascii}|{escape}
    • nmchar allows letters, numbers, underscores and minuses, as well as non-ascii characters and escapes. Oooh! Getting closer!
  • nonascii [200-377]
    • This is a horrid notation. It’s an octal character range. Octal stopped being in general usage in the early 80s, although Unix and C perpetuate them. Anyway, it says that any character whose code is between 128 and 255 is allowed.

So we get an answer: Because é is U+00E9 (or, 233 decimal), it’s allowed as part of an identifier in a CSS file.

But it’s worth noting the arbitrary limit of 255 here. That means that you don’t get to use any unicode character above that (e.g. Ā [U+0100]) verbatim in a CSS file. Instead, you have to escape it by saying (according to the escape declaration in that grammar) h100. Which is quite nasty.

There’s one other wrinkle to consider before this will work. You also need to ensure that the CSS file is served over HTTP using the correct character set. If you’ve saved it as Latin-1, you need to ensure that it’s served up with this header:

  Content-Type: text/css; charset="iso-8859-1"

This is the default, so it could be left off, but it’s usually better to be explicit. Likewise, if the file is saved as UTF-8, you need this header to be added.

  Content-Type: text/css; charset="UTF-8"

If you’re using Apache, check out the AddDefaultCharset and AddCharset directives.

So that’s CSS. But what about HTML?

HTML is defined in the HTML 4.01 specification. It’s defined using SGML, which means more complication in order to work out what the heck’s going on. Thankfully, everybody knows that there are four ways to get an é into HTML:

  • A literal é.
  • A character entity: &eacute;
  • A decimal character reference: &233;
  • A hex character reference: &xE9;

In order to figure out what characters are allowed in a class attribute, though, you have to go and start looking at the DTD:

  • The coreattrs entity is the first mention. It defines a class as being some CDATA.
  • The definition of CDATA is an intrinsic part of SGML. The details of which can be altered by the SGML Declaration for HTML 4. There’s a section at the beginning which lists which characters are allowed. It includes a large number of unicode characters all above 160 decimal.

That means that it’s safe to include a character via any of the above methods.

But there are a few more wrinkles. Firstly, whilst the two characters references above are intrinsic to HTML (via SGML), where does the character entity come from? Well, they are defined as part of the HTML spec: Character entity references in HTML 4.

There’s also the problem of the character encoding in case you use the literal é. Like the CSS above, you need to ensure that your web server is telling everybody what character encoding the file is served as. Actually, for HTML, it’s less of a problem, as the browser will generally auto-detect character encodings. But that’s not necessarily reliable, so it’s better to be explicit. And in HTML, you can put the character encoding in the file itself:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Yes, this a little bit like having a french dictionary containing the words “ecrit en Français” in the front. But it’s a good idea to have both this and the HTTP declaration (and they must match).

I’m not going to talk about XHTML/XML because it’s not in widespread use (i.e. serving it up as application/xhtml+xml).

Finally, what about JavaScript? Well, it’s defined as ECMA-262 (3rd edition). That spec explicitly defines everything in terms of Unicode, so it’s mostly OK. You can still access characters you can’t type via an escape mechanism: u00E9 (see the definition of UnicodeEscapeSequence on page 19). Additionally, JavaScript can get at the unicode characters in the DOM quite easily:

  <p id='a1' class="cliché">Lessons will be learned.</p>
  <script type="text/javascript">

As always, JavaScript files served over HTTP need to be supplied with the correct character-encoding through the Content-Type header. Just like CSS and HTML.

So what’s the take-away from all this?

  • Use literal characters and UTF-8 everywhere. It’s consistent and extensible.
  • Know how to look in the specs when something’s going wrong – you’ll know whether it’s you, or the browser that’s getting it wrong.
  • Characters are hard, let’s go shopping!

Jeremy worked it all out in far less time than I did.

Figuring I should be okay as long as I use a character entity.

Looking at that link, I notice that CDATA is handled specially within STYLE and SCRIPT tags. Yet more exceptions to the rules!