Compression - Computerphile

Captions
Compression is reducing the amount of information but without, in a sense, losing any of it: compressing and shortening it in such a way that you could, in principle, recover the full information you started with. In terms of transmitting it over a network, it's going to cost you less, because you pay per bit, or per byte, or whatever. And if you can get down the number of bytes you need to transmit to get something across the wire, then that benefits everybody, generally.

You do sometimes not compress, because the following drawback happens: if you compress something, then you have to decompress it to get back the full story, and that decompression takes time at the other end. You have to have a computer on it saying: "this was transmitted compressed, because it saved BT bandwidth costs, but now I'm at this end and I want to see the real picture, so I've got to uncompress it." So the nice thing about not compressing, if you don't mind the amount of data, is that it's there at the other end and it's instantly processable.

You compress data essentially by looking for repeated patterns and predictability. Take an example, shall we say, of a text file: something that repeats itself over and over again, with the stock buzzphrases used over and over and over again, is going to be pretty compressible. If somebody writes a report that's got the phrase "meaningful and relevant scenarios at this point in time going forward", and it's used a thousand times and it drives you mad, then you take that out and make one copy of it at the top of the file. That's maybe, I don't know, 50 characters, 50 bytes, and you say: "I'm not going to spend 50 bytes repeating that phrase every time in this report; I'm going to compress it." You take one copy and take a note of where it occurs in the file: it occurs at start positions 20, 800, 1500 and so on. And you can save a huge amount, because those position numbers take three or four bytes each, not 50 bytes every time. (A small code sketch of this idea appears at the end of this section.)

Pictures are harder. Coming back to this idea that what makes things compressible is predictability: if you think about a photograph, some photographs, you could say, do have predictable areas. You could take, shall we say, a photograph like the one on Microsoft Windows: a yellow desert island in the middle of the Pacific Ocean with one green tree. Now that's easy, because the Pacific is totally smooth. You've got thousands of blue pixels, you've got hundreds of yellow pixels for the desert island, and a few hundred more for the tree. So you can compress it, not by sending every single pixel individually, but by saying: "There's the sea, there's the colour blue, there's a thousand of those, just replicate them", and similarly for the desert island and the tree. So that's what I mean about the element of predictability, or common areas, inside a picture.

Now, the average photo you take may be fairly simple if it's an indoor photo with flat surfaces, but imagine taking a photograph, for example, of a sports car race, or of an impressionist picture with blobs of paint all over the place. That's much more random, and there isn't a pattern that you can pick up on to make it easily compressible. You just have to take a deep breath and say: "I'm going to have to transmit the whole thing, because I can't see any patterns in here."
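As a concrete illustration of the buzzphrase idea above, here is a minimal Python sketch: it keeps one copy of the repeated phrase, records the positions where it occurred, and stores only the leftover literal characters. This is a simplification for illustration only (real dictionary coders such as LZ77 find arbitrary repeated substrings, not one hand-picked phrase), and the function names are mine, not from the video:

```python
# Toy version of the buzzphrase idea: one copy of the phrase, a list of
# positions (in the original text), and the remaining literal characters.
# Lossless: the round trip reproduces the input exactly.

PHRASE = "meaningful and relevant scenarios at this point in time going forward"

def compress(text, phrase=PHRASE):
    positions, remainder = [], []
    i = 0
    while i < len(text):
        if text.startswith(phrase, i):
            positions.append(i)        # a few bytes each, not 50
            i += len(phrase)
        else:
            remainder.append(text[i])
            i += 1
    return "".join(remainder), positions

def decompress(remainder, positions, phrase=PHRASE):
    out, ri, cursor = [], 0, 0
    for pos in positions:
        literal = pos - cursor         # literal characters before this occurrence
        out.append(remainder[ri:ri + literal])
        out.append(phrase)
        ri += literal
        cursor = pos + len(phrase)
    out.append(remainder[ri:])         # trailing literals
    return "".join(out)

report = f"We need {PHRASE}. Indeed, {PHRASE}!"
packed, where = compress(report)
assert decompress(packed, where) == report   # nothing lost
```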
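And the "a thousand blue pixels, just replicate them" idea is essentially run-length encoding. A minimal sketch, with pixels represented as colour names purely for readability:

```python
# Run-length encoding in miniature: send (colour, count) pairs instead of
# sending every pixel individually.

def rle_encode(pixels):
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1           # extend the current run
        else:
            runs.append([p, 1])        # start a new run
    return runs

def rle_decode(runs):
    out = []
    for colour, count in runs:
        out.extend([colour] * count)   # replicate the stored spec
    return out

island = ["blue"] * 1000 + ["yellow"] * 200 + ["green"] * 50
runs = rle_encode(island)
print(runs)                            # [['blue', 1000], ['yellow', 200], ['green', 50]]
assert rle_decode(runs) == island      # lossless round trip
```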
You were talking about these blue pixels. There may be a thousand blue pixels, but the computer still has to know where to put every single one, so how are we saving time?

You still have to transmit some minimal information as to where to start laying these pixels out on the screen. You say: "start at such-and-such a position", and since you know how wide the picture is in X and Y, "just spew out ten thousand blue pixels and see how they fit in." What you don't have to say is: "here is the spec of the blue pixel for position 1000, here is the same spec for the blue pixel at position 1001", and so on. You don't need to do that. You just send the spec once and say "replicate it a hundred times, or however many."

Well, I'm making it simpler than it is. Perhaps one good example would be JPEG compression. It's very common, and it has to be done to get photographs down in size for uploading to sites like Picasa and Flickr; they're not going to be happy having every single one of your photos taking up 20 MB and being utterly perfect. They want to compress it down to kilobytes. And the problem is that much of what you photograph, particularly scenery and so on, is innately random-looking enough that it's not easy to compress. So JPEG is an example of a compression that goes by the name of 'lossy'. When we discussed compression earlier, in the text case, I put it to you that you could always recover the exact file, because you had your buzzphrase and all the places it was replicated; there's nothing to stop you at the far end undoing all that and building it back exactly as it was, with no loss. In the photographic case, you could try to do that, but you wouldn't get much compression, because there's no predictability. So what JPEG does is basically say: "if you allow me to do some really clever things, I will try to get more compression out of it, but I may make the edges of your picture look a bit blurry and fuzzy as you magnify them." And it's doing it by all sorts of tricks: dividing the picture up into blocks and squares, and saying "well, this is almost the same green, I'll use this as an approximation" (a toy sketch of this follows at the end). It's never recoverable exactly the way you started, but you get a lot more compression that way, if you don't mind an approximation. And for the human visual system, looking at something, that can often be pretty good. Particularly on a screen.

What are the compression geniuses in their compression HQ trying to do at the moment? Or have they reached their limit? Surely this has a limit?

There is a limit. This is something that's covered by the fascinating topic of information theory. Information theory will tell you what is called the entropy limit. Now, many viewers of this video will have heard of entropy, and will freeze solid, because they were taught it in physics and chemistry, right? As a computer scientist, and an ex-chemist, let me tell you: learning about entropy is a lot easier from a computer science standpoint. No question about it. You can work out, for example, the minimum number of bits I need to transmit a piece of information in a lossless way. Let's concentrate on that. What's the minimum number of bits I can get away with? And I could... Let's get you a piece of paper.
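For the entropy limit being described, Shannon's formula H = -Σ p·log₂(p) gives the average minimum number of bits per symbol that any lossless code can achieve. A small sketch (the example strings are my own, not from the video):

```python
# Shannon entropy: on average, no lossless code can use fewer bits per
# symbol than the entropy of the source.
import math
from collections import Counter

def entropy_bits_per_symbol(data):
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy_bits_per_symbol("aaaaaaaab"))  # ~0.503: very predictable, very compressible
print(entropy_bits_per_symbol("abcdefgh"))   # 3.0: eight equally likely symbols need 3 bits each
```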
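Finally, the toy sketch promised above for JPEG's block-and-approximate trick. This deliberately crude stand-in (my own simplification, not the real algorithm) replaces each 2×2 block of a grayscale image with its average; real JPEG works on 8×8 blocks with a discrete cosine transform and quantisation, but the sketch shows why the output is lossy:

```python
# Replace each 2x2 block with its average: "this is almost the same
# green, I'll use this as an approximation". The original pixel values
# are not recoverable afterwards.

def block_average(img, block=2):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            cells = [(y, x)
                     for y in range(by, min(by + block, h))
                     for x in range(bx, min(bx + block, w))]
            avg = sum(img[y][x] for y, x in cells) // len(cells)
            for y, x in cells:
                out[y][x] = avg        # one value stands in for the whole block
    return out

img = [[10, 12, 200, 202],
       [11, 13, 201, 199],
       [90, 92,  50,  52],
       [91, 93,  51,  49]]
print(block_average(img))  # each block collapses to one value; the exact original is gone
```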
Info
Channel: Computerphile
Views: 406,188
Keywords: computers, compression, computer science, Data Compression (Algorithm Family)
Id: Lto-ajuqW3w
Length: 7min 37sec (457 seconds)
Published: Fri Jun 21 2013