Elegant Compression in Text (The LZ 77 Method) - Computerphile

Captions
We thought it might be a good idea to revisit the topic of text compression, which was visited for the first time in the original compression video on the Computerphile channel. Some of you in the comments made the point -- to which I must plead guilty -- that in trying to explain things I perhaps oversimplified a little. So let's take the discussion on and make some more general points about text: how compressible it is, what the good and bad ways to do it are, and how it has all become really quite big business over the past thirty-odd years since it became commonly available.

Well, the two names we ought to write down to start with on text compression are the two gentlemen who wrote a classic paper on it: Ziv and Lempel. But most people, certainly in the English-speaking world, seem to find it easier to say Lempel-Ziv rather than Ziv-Lempel, and so it is now very widely known as the LZ77 method.

Let's say we've got something like this: "The computerphile channel handles computer topics". In my original explanation, I put to you the idea of having a dictionary of well-known words and buzz phrases up at the top of the file, with pointers to where those words occurred -- and, if in doubt, use the pointers for repeats of certain words. Well, the way it's actually done in practice is not dissimilar in principle to that, but the details are even more clever in terms of achieving maximum compression, cutting down file size, and so on.

What a typical LZ compressor will do is work its way through all of the text you need to compress, looking for sequences of characters that recur over and over again, and it will attempt to reuse them as much as it possibly can. It doesn't overtly make a dictionary entry by putting them somewhere special; just by leaving these strings of text in place, it can reuse stuff in place. Here we've got the word 'computer': eight characters.
A compressor could remember that it has seen that string of characters already, as a subset of the string 'computerphile'. The thought immediately occurs: one, two, three, four, five, six, seven, eight characters are identical here to what they are at the beginning of that word. Can't we use that in some way and cut down the length of this? What the Lempel-Ziv method does is use a composite pointer: it points back to where the phrase first occurred, and it also notes down how long the phrase was at that point.

Now, I'm going to denote these pointers with a notation that looks like this: <j, l>, 'j' being the jump. It's a relative jump: how many characters back would I need to go to encounter this word 'computer' somewhere else? And when I do get back to that character position in the file, how long will that string of characters be? I must emphasize that if you look inside a Lempel-Ziv file, you will not see angle brackets, let alone decimal numbers or characters like these; this is just my notation to try and illustrate what is going on. In actual fact, this pointer might well be, at its very simplest, 16 bits -- two 8-bit bytes, if you like. (In reality the details are a little more complicated than this, even.) So in two bytes I might be able to do some magic that will replace the 8 bytes of the word 'computer' with a two-byte composite pointer of this sort.

Now what I have to do is to say, "Right, where did 'computer' occur relative to where it is now?" One, two, three, four, five, six, seven, eight -- we'll assume there isn't a newline here, but there is a space character -- nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen (...) thirty. A relative backward jump of 30 characters. And there was, let me put a little partition here, an 8-byte word. So there we go.
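The backward search described above can be sketched in a few lines of Python. This is my own illustration, not code from the video: `find_back_reference` scans earlier text for the longest match at a given position and returns the <jump, length> pair. The minimum match length of 3 is an assumption (a two-byte pointer only pays off for matches of three or more characters).

```python
text = "The computerphile channel handles computer topics"

def find_back_reference(text, pos, max_jump=4096, max_len=16):
    """Return the longest earlier match for the text starting at pos,
    as a (jump, length) pair, or None if no worthwhile match exists.
    A simplified sketch of the LZ77 idea, not a real implementation."""
    best = None
    start = max(0, pos - max_jump)
    for candidate in range(start, pos):
        length = 0
        while (length < max_len and pos + length < len(text)
               and text[candidate + length] == text[pos + length]):
            length += 1
        # A 2-byte pointer only pays off for matches of 3+ characters.
        if length >= 3 and (best is None or length > best[1]):
            best = (pos - candidate, length)
    return best

# The second 'computer' starts at index 34; the first is 30 back.
print(find_back_reference(text, 34))  # → (30, 8)
```

Running this on the example sentence reproduces the pointer worked out by hand in the video: a jump of 30 and a length of 8.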
Those numbers, those integers, are cheerfully representable within 8 bits each. So I could take out 'computer' here and replace it with, in my notation, <30, 8>. Thinking about it a little, it should be pretty clear you've got a trade-off. If we allow a total of 16 bits to hold both the jump and the length of the string you're pointing at, you can trade off that internal boundary. With an even split, since 2 to the power 8 is 256, on either side of this notional divide you've got the ability to hold integers from 0 to 255.

Now, in actual fact, there are not many situations -- in Western European languages at least -- where you get massively long words repeated again, and again, and again. So the attitude taken in many practical implementations tends to be: let's keep the pointed-at string fairly short. If it is a long word we might have to do a few of these pointers, but better to have a long backward-jump possibility, because that means you can keep reusing words from much further back. So let's say we do a partition like this: four bits here, twelve bits here. 2 to the power 12 is 4096.

A note to those of you who keep writing in the comments saying, "Hey Professor, what should I do? I'm at high school and I want to do computer science at university. What's the best preparation?" My answer would be: learn your powers of 2. Backwards, upside down, inside out, learn your powers of 2. And while you're at it, learn your 16 times table, so you'll be very quick at doing hexadecimal. I was forced to learn my 16 times table at school -- not because we did hexadecimal, but because in those days there were 16 ounces in a pound of sugar, or whatever. So one 16 is 16, two 16s is 32, three 16s is 48, and so on. It's worth it.

Anyway, what we're saying there is that 2 to the power 12 is 4096, and 2 to the power 4 is 16. So you can encode here a string that is anything -- if you like -- from 1 to 16 bytes long, but you can have a relative backward jump of up to 4096 characters.
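The 12/4-bit split can be made concrete with a little bit-packing. This is a sketch under assumed conventions (which bits hold which field, and storing each value minus one so that 1..4096 and 1..16 fit exactly); real formats lay their fields out differently.

```python
def pack_pointer(jump, length):
    """Pack a back-reference into one 16-bit word: the top 12 bits
    hold the jump (1..4096), the bottom 4 bits the length (1..16).
    Values are stored minus one so the full range fits.
    The field layout here is an illustrative assumption."""
    assert 1 <= jump <= 4096 and 1 <= length <= 16
    return ((jump - 1) << 4) | (length - 1)

def unpack_pointer(word):
    """Recover the (jump, length) pair from a packed 16-bit word."""
    return (word >> 4) + 1, (word & 0xF) + 1

word = pack_pointer(30, 8)          # the <30, 8> pointer for 'computer'
print(f"{word:#06x}")               # fits in two bytes
print(unpack_pointer(word))         # → (30, 8)
```

Note that the extremes <4096, 16> pack to 0xFFFF, so the whole trade-off range really does fit in two bytes.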
That is a pretty good trade-off. What will happen, of course, is that if your pointer back to the substring 'computer' would have to reach more than 4096 characters away, then your encoder will have to remember, and put in its mental dictionary, a new instance of the word 'computer' that can be referred back to. So imagine this happening not just for the word 'computer' but for words like 'channel', for 'handles', for substrings of 'handles' such as 'hand', and so on. All the time, the thing is preparing, in a sense, to make a dictionary entry of anything that seems suitable, so it can be referred back to again and again and again. So what one is saying here is that even with a simple model, you are able to replace a multi-byte entity with a two-byte pointer.

Probabilities, probabilities, probabilities: the three Ps. Getting your probability model right for the material you are coping with is the heart of getting really successful compression.

And the web browser looks at that and goes: "THAT IS JAVASCRIPT CODE! I'M GONNA RUN THAT!"
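The decoding side of the scheme described above is even simpler than the encoding side, which is part of LZ77's appeal. Here is a minimal sketch of my own (not from the video) that expands a stream of literal characters and (jump, length) pairs; a real decoder would read the packed binary format rather than Python tuples.

```python
def decompress(tokens):
    """Expand a token stream of literal characters and (jump, length)
    back-references into the original text. An illustrative sketch."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            jump, length = tok
            start = len(out) - jump
            for i in range(length):
                # Copy one character at a time so a reference may
                # overlap the text it is still producing.
                out.append(out[start + i])
        else:
            out.append(tok)
    return "".join(out)

# The example sentence with 'computer' replaced by the <30, 8> pointer:
tokens = (list("The computerphile channel handles ")
          + [(30, 8)] + list(" topics"))
print(decompress(tokens))  # → The computerphile channel handles computer topics
```

The character-by-character copy matters: it lets a pointer refer into text it is itself in the middle of generating, which is how LZ77 compresses long runs of a repeated pattern.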
Info
Channel: Computerphile
Views: 439,011
Keywords: computers, computerphile, lz77, text compression, Professor Brailsford, university of Nottingham
Id: goOa3DGezUA
Length: 8min 43sec (523 seconds)
Published: Fri Nov 01 2013