A couple of days ago, we got caught out by a few encoding issues in a site at $WORK. The Perl-related ones were fairly self-explanatory and ones I’d seen before (e.g. not calling `decode_utf8()` on the query string parameters). But the JavaScript part was new to me.
The problem was that we were using JavaScript to create a URL, but it wasn’t encoding some characters correctly. After a bit of investigation, the problem came down to the difference between `escape()` and `encodeURIComponent()`.
input | escape(…) | encodeURIComponent(…) |
---|---|---|
a&b | a%26b | a%26b |
1+2 | 1+2 | 1%2B2 |
café | caf%E9 | caf%C3%A9 |
Ādam | %u0100dam | %C4%80dam |
The last is particularly troublesome, as no server I know of will support decoding that `%u` form.
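To see the difference first hand, here is a minimal sketch that runs the same four inputs through both functions and then tries to decode each result as percent-encoded UTF-8. `tryDecode` is a hypothetical helper written for this illustration, and JavaScript’s own `decodeURIComponent()` is only standing in for a strict UTF-8 decoder; a real server may behave differently.

```javascript
// The four inputs from the table, written with \u escapes so the result
// does not depend on how this source file is saved.
const inputs = ["a&b", "1+2", "caf\u00e9", "\u0100dam"];

// Hypothetical helper: returns the decoded string, or null when the
// input is not valid percent-encoded UTF-8.
function tryDecode(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    if (err instanceof URIError) {
      return null;
    }
    throw err;
  }
}

// Encode each input both ways and record whether it decodes again.
const rows = inputs.map((input) => ({
  input,
  escaped: escape(input),
  escapedDecodes: tryDecode(escape(input)) === input,
  encoded: encodeURIComponent(input),
  encodedDecodes: tryDecode(encodeURIComponent(input)) === input,
}));

for (const row of rows) {
  console.log(row);
}
```

In this sketch the two non-ASCII inputs fail to decode after `escape()`, while every `encodeURIComponent()` result decodes back to the original string.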
The takeaway is that `encodeURIComponent()` always encodes as UTF-8 and doesn’t miss out characters such as `+`. As far as I can see, this means you should simply never use `escape()`, which is why I’ve asked Douglas Crockford to add it as a warning to JSLint.
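As a rough illustration of that advice, this is one way to build a query string with every name and value passed through `encodeURIComponent()`. The `buildUrl` helper, the `/search` path and the parameter names are all made up for this example (they are not the code from our site), and the sketch assumes a reasonably modern JavaScript engine.

```javascript
// Hypothetical helper: joins a base URL and an object of parameters,
// encoding every key and value with encodeURIComponent().
function buildUrl(base, params) {
  const query = Object.entries(params)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return query ? `${base}?${query}` : base;
}

// Made-up path and parameters, using the awkward inputs from the table.
const url = buildUrl("/search", { q: "caf\u00e9 1+2", name: "\u0100dam" });
console.log(url); // /search?q=caf%C3%A9%201%2B2&name=%C4%80dam
```

Encoding each piece separately matters: the `&` and `=` that separate parameters are added afterwards, so any `&`, `=` or `+` inside a value arrives at the server as data rather than as syntax.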
Once we switched the site’s JavaScript from `escape()` to `encodeURIComponent()`, everything worked as expected.