When you put English text into a computer, it saves every individual character as eight bits of data. Eight ones and zeros. I mean, they're not actually ones and zeros, they're particular positions of atoms on a disk or a few electrons going over a wire, but as far as we're concerned in the computer science world, they are ones and zeros. Bits. And yes, a modern phone or computer might store a few quadrillion of them! But every time you've had to wait a long time for a download, or your phone's complained that its storage is full, you're running up against the same question that computer scientists have been working on since before these things here were state of the art: can we please not use so many bits?
So let's run through how computers compress text. Images and video are different: that's lossy compression, where it doesn't matter if you lose a little bit of detail. But text has to be losslessly compressed: you can't just lose a bit of detail, otherwise you end up sending the wrong words. So the first thing we need to know is how text is stored on disk before it's compressed. To make this easier, I'm just going to talk about English text here: I know only too well that it gets more complicated than this, but hey, the series is called "The Basics", so deal with it.
On a modern computer, each English character takes up exactly eight ones and zeros on disk: eight bits, or one byte. And there are 256 possible combinations of those ones and zeros, so you can have 256 possible characters. That's enough for the English alphabet, numbers, some punctuation, and... well, then it gets complicated depending on which country you're in, and oh my god, that's a whole separate video we do not want to get into right now.

Why eight bits? Because it was big enough to store all the characters that American computers usually needed, but no bigger, so the text doesn't take up any more space than it absolutely has to. Eight is also a power of two, which means it's easy to deal with at the really low levels of programming. It's also helpful to have a fixed number of bits per character because it makes searching through text really fast. If you want to go to the 100,000th character in some text, you know exactly where the ones and zeros are going to be without having to count the number of bits in every single character before it. If you want to know how long a string of text is, you just count the bits and divide by eight. Computers don't have the luxury of spaces, remember: it's just one long string of 1s and 0s.
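Here's a quick sketch of that fixed-width layout in Python (the sample string is just an example): every ASCII character is one byte, so jumping to a character or measuring a length is plain arithmetic.

```python
# Fixed-width storage: every ASCII character is exactly one byte (eight bits).
text = "wiki-wiki-wild-wild-West"
data = text.encode("ascii")            # one byte per character

# Show the first few characters as their eight ones and zeros.
for ch, byte in zip(text[:4], data[:4]):
    print(ch, format(byte, "08b"))     # e.g. w -> 01110111

# Character n starts at bit n * 8, so lookup needs no counting...
n = 5
print(text[n], format(data[n], "08b"))

# ...and the length in characters is just the bit count divided by eight.
assert (len(data) * 8) // 8 == len(text)
```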
But let's say we're not worried about speed right now; we just want to fit as much text as possible into as small a number of 1s and 0s as possible. Compression. A good plan would be to assign the most common characters to smaller arrangements of bits. So those researchers, way way back, could have said: well, the space bar is generally used most often. Just give that the code "0". Then it's the lowercase e. Give that the code "1". Then it's lowercase t: give that two zeros... and immediately, we've hit a problem. Because the computer running through that text later has no way of knowing whether "00" is a t, or the space bar twice. And if we keep on assigning letters like that, the problems keep getting worse: is three zeros a lowercase n? Or a t followed by a space? Or three spaces? Remember, there are no gaps here, no actual space separators; all the computer can see is a constant stream of 1s and 0s. There's no way to know which is meant.
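You can watch that ambiguity happen in a few lines of Python; this toy sketch just uses those three made-up codes and finds every valid reading of the same bit stream.

```python
# Why naive variable-length codes fail: with space = "0", "e" = "1"
# and "t" = "00", the same bits can be read in several valid ways.
codes = {" ": "0", "e": "1", "t": "00"}

def all_decodings(bits, codes):
    """Return every way the bit string could be split into characters."""
    if not bits:
        return [""]
    readings = []
    for ch, code in codes.items():
        if bits.startswith(code):
            readings += [ch + rest for rest in all_decodings(bits[len(code):], codes)]
    return readings

print(all_decodings("000", codes))   # ['   ', ' t', 't '] -- three readings!
```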
Except. In 1952, a very clever mathematician called David Huffman invented Huffman coding. You invent something like that, your name gets put on it. Yes, there are much better, more modern, more complicated mathematical ways to do this, but Huffman coding is the basic foundation of modern text compression. And here is how it works.
Will Smith's "Wild Wild West". Uncompressed, that is 3,684 characters
taking up nearly 30,000 bits. First up: you count how many times
each character is used, and you put that in a list in order. Now that'll be different for each block of
text you're compressing: this has way more Ws than usual. Now take the two least used characters. Those two are going to be the bottom branches
on your “Huffman tree”. Write them down, with how often they're used
next to them. That’s called their frequency. Then connect them together, one level up,
with the sum of their frequencies. Now add that new sum back into your list, wherever it sits, higher up. And repeat! Take the bottom two off your list,
connect them up, add the result back in your list. Keep that going. And when one of the sums reaches the bottom
two of the list, you connect it up like that. And eventually, you're going to be down to
one thing in your list, right at the top. You now have a Huffman tree,
and it looks like this, and it tells you how to convert your text
into ones and zeros. Our first letter is the first uppercase W
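If you'd rather see that loop as code, here's a sketch in Python. It uses a priority queue to grab the two least-used entries each time; leaves are single-character strings, internal nodes are (left, right) pairs, and the counter value just breaks ties when frequencies are equal.

```python
import heapq
from collections import Counter

def build_tree(text):
    """Build a Huffman tree: repeatedly merge the two least-used entries."""
    counts = Counter(text)                     # how often each character is used
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(counts.items())]
    heapq.heapify(heap)                        # the "list in order"
    i = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)      # the two least-used entries...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (left, right)))  # ...connected one level up
        i += 1
    return heap[0][2]                          # the one thing left is the root
```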
Our first letter is the first uppercase W in 'wiki-wiki-wild-wild-West'. Uppercase W is here; there are 141 of them. So go to the top and follow the path down. Each time you take the left-hand side, write a 0; each time you take the right-hand side, write a 1. So W is this: only 5 bits instead of 8. Next, lowercase i. Follow the code down; only four bits this time, because it's more common. Then the k, which is less common: that actually takes up seven bits. And you keep going. Some of the letters will take up more than 8 bits, but that's fine, because they're not used very often.
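That path-following can be automated too. Here's a sketch building on the build_tree function above; lyrics is a stand-in for whatever text you're compressing.

```python
def make_codes(node, prefix="", table=None):
    """Walk the tree: add "0" for each left fork, "1" for each right fork."""
    if table is None:
        table = {}
    if isinstance(node, str):          # a leaf, i.e. a single character
        table[node] = prefix or "0"    # edge case: text with only one character
    else:
        left, right = node
        make_codes(left, prefix + "0", table)
        make_codes(right, prefix + "1", table)
    return table

def encode(text, table):
    return "".join(table[ch] for ch in text)

# Hypothetical usage:
# tree = build_tree(lyrics)
# table = make_codes(tree)
# bits = encode(lyrics, table)
# print(len(lyrics) * 8, "bits before,", len(bits), "bits after")
```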
Now, you do also have to store this tree, to provide a translation table between your new symbols and the uncompressed ones, so this is not efficient for short bits of text. If you're ever asked to do this in a computer science exam, they'll ignore all that and just ask you to do one easy word from a tree by hand. But for Will Smith's magnum opus, we have compressed it down to just over 20,000 bits: about a 30% saving.
To uncompress the resulting stream of bits, it works the other way: just read across, take the left fork every time you see a 0 and the right fork every time you see a 1. When you reach a letter, that's it: you know that's the end, and you know that there is no other path you could possibly have taken. You start again with the next bit. Now, in practice, computers might sometimes do this working backwards, from the bottom up, but this is a pretty good way to at least understand what's going on.
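The decoder is even shorter than the encoder; a sketch, assuming the same tree shape as the functions above:

```python
def decode(bits, tree):
    """Read across: left fork on 0, right fork on 1, until you reach a letter."""
    out, node = [], tree
    for bit in bits:
        left, right = node
        node = left if bit == "0" else right
        if isinstance(node, str):      # reached a leaf: that's a whole character
            out.append(node)
            node = tree                # start again with the next bit
    return "".join(out)

# Round trip, with the sketches above:
# assert decode(encode(lyrics, table), tree) == lyrics
```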
And here's the really clever part: Huffman proved that this is the most efficient way to assign 0s and 1s to single characters. It is mathematically impossible to beat this. Unless, as you might already have figured out, you start working on blocks bigger than one character. Clever tricks like that are basically how zip files work... but then, this is just the basics.

Thank you to all my proofreaders who helped with that script, thank you to my graphics team, and also thank you to the Centre for Computing History in Cambridge for letting me film with their old kit.
I've implemented Huffman coding before, but I don't think I've ever seen such a simple explanation of how to build a Huffman tree.
All the pseudocode in the world can't compete with a simple animation.
Hah, "sending the wrong worms"
Very cool
BBC Micros, fuck me, they still exist somewhere?
Ah, of course, The Center for Computing History in Cambridge.
And here is a handy link to Will Smith's magnum opus.
Wouldn't the other end need the tree to decode the text? Can you fit the tree in the 33% saved space?
BBC Micros! <3