UTF-8 is perhaps the best hack, the best single thing that's used that can be written down on the back of a napkin, and that's how it was put together. The first draft of UTF-8 was written on the back of a napkin in a diner, and it's just such an elegant hack that solved so many problems and I
absolutely love it. Back in the 1960s, we had teleprinters, we had simple
devices where you type a key and it sends some numbers and the same letter comes out on the other side, but there needs to be a standard, so in
the mid-1960s America, at least, settled on ASCII, which is the American Standard Code for Information Interchange, and it's a 7-bit binary system, so each letter you type in gets converted into 7 binary numbers and sent over the wire. Now that means you can have numbers from 0 to 127. They set aside the first 32 for control codes and stuff less important for writing, things like "go down a line" or backspace. And then they made the rest characters. They added some numbers, some punctuation marks. They did a really clever thing, which is that they made 'A' 65, which in binary (place values 1, 2, 4, 8, 16, 32, 64) is 1000001. That means 'B' is 66, which gives you 2 in binary in those last digits; C, 67, 3 in binary. So you can look at a 7-bit binary character, just knock off the first two digits, and know what its position in the alphabet is. Even cleverer than that, they started lowercase 32 later, which means that lowercase 'a' is 97, or 1100001. Anything that doesn't fit into that is probably a space, which conveniently will be all zeroes, or some kind of punctuation mark. Brilliant, clever, wonderful, great way of doing things, and that became the standard, at least in the English-speaking world.

As for the rest of the world, a few of them did versions of that, but you start getting into other alphabets, into languages that don't really use alphabets at all. They all came up with their own encoding, which is fine. And then along come computers, and, over time, things change. We move to 8-bit computers, so we now have a whole extra number at the start just to confuse matters, which means we can go to 256! We can have twice as many characters! And, of course, everyone settled on the same standard for this, because that would make perfect s... No. None of them did. All the Nordic countries start putting Norwegian characters and Finnish characters in there. Japan just doesn't use ASCII at all. Japan goes and creates its own multibyte encoding with more letters and more characters and more binary numbers going to each individual character. All of these things are massively incompatible. Japan actually has three or four different encodings, all of which are completely incompatible with each other. So if you send a document from one old-school Japanese computer to another, it will come out so garbled that there is even a word in Japanese for "garbled characters," which is (I'm probably mispronouncing this) "mojibake." It's a bit of a nightmare, but it's not that bad, because how often does someone in London have to send a document to a completely incompatible and unknown computer at another company in Japan? In those days, it's rare. You printed it off and you faxed it.

And then the World Wide Web hit, and we have a problem, because suddenly documents are being sent from all around the world all the time. So a thing is set up called the Unicode Consortium. In what I can only describe as a miracle, over the last couple of decades, they have hammered out a standard. Unicode now has a list of more than a hundred thousand characters that covers everything you could possibly want to write in any language: English alphabet, Cyrillic alphabet, Arabic alphabet, Japanese, Chinese, and Korean characters. What you have at the end is the Unicode Consortium assigning 100,000+ characters to 100,000 numbers. They have not chosen binary digits. They have not chosen what they should be represented as.
All they have said is that THAT Arabic character there, that is number 5,700-something, and this linguistic symbol here, that's 10,000-something. I have to simplify massively here because there are, of course, about five or six incompatible ways to do this, but what the web has more or less settled on is something called "UTF-8." There are a couple of problems with doing the obvious thing, which is saying, "OK. We're going to 100,000. That's gonna need, what... to be safe, that's gonna need 32 binary digits to encode it." They encoded the English alphabet in exactly the same way as ASCII did. 'A' is still 65. So if you have just a string of English text, and you're encoding it at 32 bits per character, you're gonna have about 20-something... 26? Yeah. 26, 27 zeroes and then a few ones for every single character. That is incredibly wasteful. Suddenly every English language text file takes four times the space on disk. So problem 1: you have to get rid of all the zeroes in the English text. Problem 2: there are lots of old computer systems that interpret 8 zeroes in a row, a NULL, as "this is the end of the string of characters," so if you ever send 8 zeroes in a row, they just stop listening. They assume the string has ended there, and it gets cut off, so you can't have 8 zeroes in a row anywhere. OK. Problem number 3: it has to be backwards-compatible. You have to be able to take this Unicode text and chuck it into something that only understands basic ASCII, and have it more or less work for English text.

UTF-8 solves all of these problems, and it's just a wonderful hack. It starts by just taking ASCII. If you have something under 128, that can just be expressed as 7 digits, you put down a zero, and then you put the same numbers that you would otherwise, so let's have that 'A' again: there we go! That's still 'A.' That's still 65. That's still UTF-8-valid, and that's still ASCII-valid. Brilliant. OK. Now let's say we're going above that. Now you need something that's gonna work more or less for ASCII, or at least not break things, but still be understood. So what you do is you start by writing down "110." This means this is the start of a new character, and this character is going to be 2 bytes long. Two ones, two bytes, a byte being 8 bits. And you say on this one, we're gonna start it with "10," which means this is a continuation, and at all these blank spaces, of which you have 5 here and 6 here, you fill in the other numbers, and then when you calculate it, you just take off those headers, and it's understood as just being whatever number that turns out to be. That's probably somewhere in the hundreds. That'll do you for the first 2,048. What about above that? Well, above that you go "1110," meaning there are three bytes in this (three ones, three bytes), with two continuation bytes. So now you have 4, plus 6, plus 6: 16 spaces. You want to go above that? You can. This specification goes all the way to "1111110x" with five continuation bytes after it.

It's a neat hack that you can explain on the back of a napkin or a bit of paper. It's backwards-compatible. It avoids waste. At no point will it ever, ever, ever send 8 zeroes in a row, and, really, really crucially, the one that made it win over every other system is that you can move backwards and forwards really easily. You do not have to have an index of where the character starts. If you are halfway through a string and you wanna go back one character, you just look for the previous header.
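To make that scheme concrete, here is a minimal Python sketch (an illustration only, not how any real encoder is written) that checks the ASCII trick and then hand-encodes a single code point into UTF-8 using exactly the headers described: 0 for ASCII, 110/1110/11110 for multibyte starts, 10 for continuation bytes. Python's built-in str.encode("utf-8") is used only to cross-check the results.

# The ASCII trick: knock off the top two bits of a letter's 7-bit code
# and you get its position in the alphabet; lowercase sits 32 higher.
assert ord("A") == 65 and ord("A") & 0b11111 == 1
assert ord("a") == ord("A") + 32

def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoding of one Unicode code point."""
    if cp < 0x80:
        # Fits in 7 bits: header bit 0, then the same bits ASCII would use.
        return bytes([cp])
    if cp < 0x800:
        # Up to 11 payload bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:
        # Up to 16 payload bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # Up to 21 payload bits: 11110xxx plus three continuation bytes
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])

assert utf8_encode(ord("A")) == b"A"                   # still 65, still valid ASCII
assert utf8_encode(ord("é")) == "é".encode()           # two bytes: 110xxxxx 10xxxxxx
assert utf8_encode(ord("猫")) == "猫".encode()          # three bytes
assert utf8_encode(0x1F600) == "\U0001F600".encode()   # four bytes (an emoji)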
And that's it, and that works, and, as of a few years ago, UTF-8 beat out ASCII and everything else as, for the first time, the dominant character encoding on the web. We don't have that mojibake that Japanese has. We have something that nearly works, and that is why it's the most beautiful hack that I can think of that is used around the world every second of every day. (BRADY HARAN)
-We'd like to thank Audible.com for their support of this Computerphile video, and, if you register with Audible and go to audible.com/computerphile, you can download a free audiobook. They've got a huge range of books at Audible. I'd like to recommend "The Last Man on the Moon," which is by Eugene Cernan, who was the eleventh of twelve men to step onto the Moon, but he was the last man to step off the Moon, so I'm not sure whether he is "the last man on the Moon" or not. Sort of depends how you define it. But his book is really good, and what I really like about it is it's read by Cernan himself, which I think is pretty cool. Again, thanks to Audible. Go to audible.com/computerphile and get a free audiobook. (TOM SCOTT)
-"... an old system that hasn't been programmed well will take those nice curly quotes that Microsoft Word has put into Unicode, and it will look at that and say, 'That is three separate characters...' "
He didn't explain why the continuation bytes all have to begin with 10. After all, when you read the first byte, you know how many continuation bytes will follow, and you could have them all begin with 1 to avoid having null bytes, and that's it. But then I thought about it for 5 seconds: random access. UTF-8 as it is lets you know whether a given byte is an ASCII byte, a multibyte starting byte, or a continuation byte, without looking at anything else on either side! So:

0xxxxxxx: ASCII byte
10xxxxxx: continuation byte
11xxxxxx: multibyte start

It's quite trivial to get to the closest starting (or ASCII) byte.

There's something I still don't get, though: why stop at 1111110x? We could get 6 continuation bytes with 11111110, and even 7 with 11111111. Which suggests 1111111x has some special purpose. Which is it?

Another gift from Rob Pike and Ken Thompson.
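The random-access property described in the comment above is easy to demonstrate. A minimal Python sketch (illustrative only, with helper names made up for the example): classify a byte from its top two bits, then walk backwards over 10xxxxxx continuation bytes to find where the previous character starts.

def is_continuation(byte: int) -> bool:
    # Continuation bytes are exactly the ones shaped like 10xxxxxx.
    return byte & 0b11000000 == 0b10000000

def previous_char_start(data: bytes, i: int) -> int:
    # Step back from position i to the first byte of the preceding character.
    i -= 1
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i

encoded = "naïve 猫".encode("utf-8")
i = len(encoded)
while i > 0:
    i = previous_char_start(encoded, i)
    # Decoding from a start byte always yields a whole character,
    # so this prints the string one character at a time, in reverse.
    print(i, encoded[i:].decode("utf-8")[0])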
I didn't hear him drop Ken Thompson's name. Might have been cool, considering he's a guy that's alive right now.
This guy would make a fantastic teacher or professor.
What is the name of the camera shooting style?
It makes me want to tell them to get a tripod and stop with the "artistic" zooms. It's making me sea sick.
And yet Windows still doesn't use UTF-8 for any Windows APIs. It defaults to locale-specific (i.e. totally incompatible) encodings and even when you force it to use Unicode, it requires UTF-16. Sigh.
And in an alternate universe, "128-bit IPv8 The most beautiful hack"