The Internet, text encodings and making an ass out of you and me

HTML, by its design, is incredibly liberal in what it accepts. That's kind of cool, right up until it bites you. And it can bite you in many ways, including the character encoding used on a page (i.e. ASCII vs. Unicode vs. whatever; it's worth reading up on encodings if you want to learn more, and you should).

There are a few ways to declare the encoding of a web page, as specified by the W3C in the HTML 4.01 spec:

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding which they apply in the absence of other indicators.

That last paragraph is the fun part. If no encoding is specified in any of the three ways listed, the user agent (Chrome, IE, etc.) can attempt to figure it out itself.
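Here's a rough sketch of that priority order in Python, purely for illustration. The function name, the parameter names and the windows-1252 fallback are all made up for this example, and real browsers do a lot more than this (heuristics, user settings and so on):

```python
from typing import Optional


def pick_encoding(
    http_charset: Optional[str] = None,
    meta_charset: Optional[str] = None,
    element_charset: Optional[str] = None,
    local_default: str = "windows-1252",  # assumed user-agent default
) -> str:
    """Return the first declared encoding, highest priority first.

    Hypothetical helper: it only models the three declared sources from
    the list above plus a local default, not any heuristics.
    """
    for declared in (http_charset, meta_charset, element_charset):
        if declared:
            return declared.strip().lower()
    return local_default


# The HTTP header wins over a conflicting META declaration.
print(pick_encoding(http_charset="UTF-8", meta_charset="iso-8859-1"))
# With no header, the META declaration is used.
print(pick_encoding(meta_charset="ISO-8859-1"))
# With nothing declared, we fall back to the local default.
print(pick_encoding())
```

The last call is the interesting one: with nothing declared, whatever the user agent decides is what you get.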

Wait, so how can this bite you?

I ran into an interesting situation earlier. PowerShell uses UTF-16 internally (it's based on .NET, you know, and .NET uses UTF-16 for its strings), and when you redirect output to a file, it defaults to that encoding:

"<html><body><div>Here's a div</div></body></html>" > index.html #I'm UTF-16!

Hey look, we created a webpage. When you open that file in the browser, the browser doesn't see a charset header or a META declaration, so it determines the encoding itself and figures out that, hey, this is UTF-16 (most likely from the byte order mark at the start of the file). You can see this by opening the browser developer tools and running:

> document.characterSet
    "UTF-16LE"

Neat. And the page displays correctly and all is well.
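If you're curious what the browser had to go on, here's a small Python sketch. It assumes the redirect wrote a UTF-16LE byte order mark followed by the text as UTF-16LE, and `sniff_bom` is a made-up helper, not how any particular browser actually implements detection:

```python
import codecs
from typing import Optional

html = "<html><body><div>Here's a div</div></body></html>"

# Assumption: the file starts with a UTF-16LE byte order mark (FF FE),
# followed by the text encoded as UTF-16LE.
file_bytes = codecs.BOM_UTF16_LE + html.encode("utf-16-le")


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading byte order mark, if there is one.

    Hypothetical helper; UTF-32 marks are ignored to keep it short.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None  # no mark: the caller has to fall back to something else


print(file_bytes[:6])         # the mark, then "<" and "h" as two bytes each
print(sniff_bom(file_bytes))
```

Every ASCII character takes two bytes in this file, the second of which is zero. That detail matters in a minute.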

But let's say we want to add an external script, because we like JavaScript and we're good programmers, dammit, so we separate our concerns and don't add inline scripts. So we edit our index.html file and add a script tag, specifically getting RequireJS from a CDN:

<html>
<head>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.1.16/require.min.js"></script>
</head>
<body>
    <div>
        Here's a div
    </div>
</body>
</html>

And now we have RequireJS! Wait. What? We don't?

If you open the index file now, you'll see "Here's a div", but if you open the dev tools, you'll get an

Uncaught SyntaxError: Unexpected token ILLEGAL

error. And if you open the RequireJS file in the dev tools, you'll see a bunch of Chinese characters.

Whyyyyyyyy

The RequireJS file from the CDN is encoded in windows-1252. But since we didn't specify a charset for the script, the browser assumed it used the same encoding as the HTML file and read it as UTF-16. Every pair of bytes got interpreted as a single UTF-16 code unit, which is where all of those crazy characters come from. Specifying a charset fixes the problem, because the browser then knows how to interpret the incoming file:

<script charset="windows-1252" src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.1.16/require.min.js"></script>
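You can reproduce the garbling without a browser. This is a minimal Python sketch in which the bytes of the word "function" stand in for the start of the script (an assumption for illustration, not the actual contents of the RequireJS file):

```python
# Stand-in for the first bytes of an ASCII-only script file.
script_bytes = b"function"

# Decoded as windows-1252, which agrees with ASCII for these bytes.
as_cp1252 = script_bytes.decode("cp1252")

# Decoded as UTF-16LE: every two bytes are glued into one code unit.
as_utf16 = script_bytes.decode("utf-16-le")

print(as_cp1252)                          # function
print([hex(ord(ch)) for ch in as_utf16])  # ['0x7566', '0x636e', '0x6974', '0x6e6f']
```

All four resulting code points land in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which is why the script looks like Chinese and why the JavaScript parser chokes on it.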

Do I care?

In most cases, this won't be a problem. Almost everything uses UTF-8 now, and in the HTML5 spec, the W3C says:

Authors should use UTF-8.

Authoring tools should default to using UTF-8 for newly-created documents.

In other cases (which should be few and far between), set a charset and be on your way.
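For what it's worth, here is one way to avoid the whole mess when generating a page: encode it as UTF-8 and say so in the markup. This is a small Python sketch, and `render_page` is a hypothetical helper for this example only:

```python
def render_page(body: str) -> bytes:
    """Build a tiny HTML page that declares UTF-8 and encode it as UTF-8."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        '<head><meta charset="utf-8"></head>\n'
        f"<body>{body}</body>\n"
        "</html>\n"
    )
    return page.encode("utf-8")


page_bytes = render_page("<div>Here's a div</div>")
print(page_bytes.decode("utf-8"))

# To write it out as-is: open("index.html", "wb").write(page_bytes)
```

Writing the bytes in binary mode means nothing along the way gets a chance to pick a different encoding for you.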