From nothin’ to gzip - HTTP 203

Captions
We're going to start simple, Jake, because I didn't know anything about text compression. Or, I mean, it's not even text compression, we're just going straight in. By the way, I just realized I started, but no, no clap. No, we should do that, just get in the zone. Okay. Three, two, one. [Music]

So, Jake, we have spent a considerable amount of time in the last years getting our heads into image compression, and how images compress, and what is interesting about compressing images: that the result of the decompression is looked at with our eyes and interpreted...

Hang on, that sentence was just a load of words. What was that? Give me that again.

The result of the decompression is just looked at with our eyes and interpreted by our brains.

It didn't make sense. That's fine, I can do this job. You know, you sometimes parse a sentence incorrectly at the start and it just messes, it just doesn't make sense, it doesn't compute.

Yeah, no, I get it, it was unnecessarily fluffy. What I'm saying is, when we look at images, our brain does a lot of inference in trying to help us, but interference nonetheless, which is why image compression is so interesting: you can actually change the image to make it more compressible in a non-recoverable way and get away with it, because our brains do the work for us. When it comes to text...

Lossy compression, right? So your JPEGs, your AVIFs, stuff like that.

Yeah. With text it's different. You can't really lossy-compress text; it will most likely end up not being intelligible anymore.

It would be fun to try, like those optical illusions where they've rearranged the middle of all the words but it still reads. That would be interesting, let's do lossy text compression.

I'm talking about text compression, but really what I'm talking about is just lossless compression in general, because the same goes for all our HTML, CSS and JavaScript that goes over the wire on the web. We want that exact same code to appear in the user's browser, so that the right program executes; we don't want that to be lossy.

And, for reasons that would break the scope of this episode, I wanted to compress a piece of user input with gzip. And yes, Chrome has CompressionStream, but nobody else does, so I had to do it in userland. I looked up libraries, and they were all big, and I couldn't read them, and I got frustrated, and so I was like, you know what, I'm going to write my own. But to write my own, I have to understand how gzip works, and I've never, not in university or anywhere else, ever looked at how these work. I thought it's black magic, it's going to be really complicated, but I'm going to use this opportunity to get a basic understanding. And it turns out it's both easy and complicated, and I thought I would walk you through it.

Yes, please do, because I have no idea. It's one of those things where, with a couple of libraries that I maintain, I try to get the file size as small as possible, because it looks good on your npm page, like "this is only 800 bytes" or whatever. And the amount of times that I go into the code, do a bit of code golf, make it smaller, and then look at the gzip result and it's now bigger, and I don't know why, because it's magic. But hopefully now you're going to tell me.

Yeah, I think at least you will have a bit more of an understanding of why certain changes make gzip perform worse.
So we are going to start from nothing, and the reason I'm looking at gzip is that on the web it is one of the most prevalently supported formats. Whenever you send a request to a server, you, well, not you but your browser, sends along which compression formats it understands. It's not called compression, it's called encoding, I'm guessing because technically other encodings exist whose primary goal is not necessarily compression. But I think most of the time nowadays these three are what a browser will send. I will not talk about Brotli, because I have not looked up how Brotli works, but we will talk about deflate and we will talk about gzip.

After starting my research, I actually realized that pretty much all modern text compression goes back to two basic compression algorithms from the 70s, one of them being LZ77 and the other one being LZ78, and the rest is just a smart combination of pre-encoding and post-encoding transforms to make things even smaller.

Yeah, I remember these formats coming up when I was compressing things on my Amiga 600.

So yes, they've been around a long time. Gzip and deflate both depend on these. But as I said, these two basic encodings, LZ77 and LZ78, are not specific formats, just algorithms, principles, and they have a whole lot of offspring. All the ones I wrote down that descend from LZ77 and LZ78 start with LZ because they're based on the same approach, but they add Huffman encodings, or use other optimizations, or use Markov chains. Basically all the formats you have at the bottom rely on those or on variants of those algorithms. So for example GIF uses LZW, PNG uses LZMA if I'm not mistaken... no, PNG also uses deflate, gzip also depends on it, it's super interesting... 7-Zip is LZMA, now I remember. So yeah, it's really interesting. And so looking at LZ77 and LZ78 is pretty good bang for the buck, because they are the foundation of pretty much all modern lossless compression that's happening. But we are only going to look at LZ77 today, because it is actually decently easy to understand.

So this is a piece of service worker code that I ripped out of Squoosh. You know, we have an install handler and an activate handler, and the basic idea behind LZ77 is to say: you know what, there is a piece of text that we already had before, and by the time we would send it a second time, why send it a second time? Instead we could say: don't send those bytes, but instead send a tuple saying "go 421 bytes backwards and copy 23 bytes from there on", and if you paste them here you will end up with the exact same result as if we had just sent the entire file byte by byte.

This is really good, because compared to some other compression formats that have like a dictionary at the front, or a tree at the front like with Huffman, this format streams: straight away you can get the text, and all of the bits of compression there are back references, so it doesn't damage streaming at all.

Exactly. Now, if you're a close observer, you might be asking yourself: well, to a computer all of these things are just bytes and numbers, so how do we distinguish whether the numbers we're getting are one of these back references or just a literal byte? There are multiple ways to solve this, and what LZ77 did sounds a bit stupid, but that's what it does: they say, you know what, we are not going to have literals at all. We're just going to make everything not a tuple but a three-tuple, where we have the distance, the length of the back reference, and the next character. And so if you don't actually want to have a back reference, you just put zero for the first two numbers. So an entire file ends up looking like this: there's (0, 0, 's'), (0, 0, 'e'), (0, 0, 'l'), and you spell out "self.add", and the second 'd' can actually become a back reference, because we just had a 'd' a byte earlier.
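To make the three-tuple idea concrete, here is a minimal TypeScript sketch of a decoder for that scheme, assuming the input has already been parsed into (distance, length, next character) triples; the `Triple` type and `decodeLZ77` name are illustrative, not part of any real library.

```ts
// One LZ77 unit as described above: how far back to copy from,
// how many characters to copy, and the literal character that follows.
// A distance/length of 0 means "no back reference, just the literal".
type Triple = { distance: number; length: number; char: string };

// Decoding is just copy-pasting from what we've already produced.
function decodeLZ77(triples: Triple[]): string {
  let out = "";
  for (const { distance, length, char } of triples) {
    if (length > 0) {
      const start = out.length - distance;
      for (let i = 0; i < length; i++) {
        // Copy one character at a time, so a back reference may overlap
        // the bytes it is currently producing (a run-length style copy).
        out += out[start + i];
      }
    }
    out += char;
  }
  return out;
}

// "self.add" spelled out: the second 'd' refers one character back.
// (The empty trailing literal keeps the example short; original LZ77
// always carries a following character in every triple.)
const example: Triple[] = [
  { distance: 0, length: 0, char: "s" },
  { distance: 0, length: 0, char: "e" },
  { distance: 0, length: 0, char: "l" },
  { distance: 0, length: 0, char: "f" },
  { distance: 0, length: 0, char: "." },
  { distance: 0, length: 0, char: "a" },
  { distance: 0, length: 0, char: "d" },
  { distance: 1, length: 1, char: "" },
];
console.log(decodeLZ77(example)); // "self.add"
```

Encoding is the harder half: the encoder has to search everything it has already emitted for the longest match, which is where most of the engineering in real implementations goes.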
Now, that sounds interesting to me.

It is, because obviously what used to be one byte for a literal is now, well, it actually depends on the sizes, because you can't just have arbitrary number lengths. You limit how far back a reference can go and how long a back reference can be. So for example you could use 12 bits of back reference distance and 4 bits of length: you can at most go four kilobytes backwards and you can at most copy-paste 16 bytes. Most commonly you have 16 bits of distance, so 65K that you can go backwards, and up to 256 characters that you can copy-paste over. That means what used to be one byte for just a character is now four bytes, which actually makes your file a lot bigger. And yet, from a certain size onwards, at least when it's human-readable text and not already-compressed binary files, this works quite well. So if we take a full JavaScript file that is, let's say, one kilobyte or above, as a random number, you will actually end up with something smaller on the other side, and I thought that was quite impressive. I didn't expect that.

That is quite impressive. I didn't know that. I wouldn't expect that.

But obviously for smaller files, if you think about a small 20-liner, you will often end up quite a lot bigger. It took a whole, not seven, five years it seems until, well, I'm sure there were other intermediate formats, but this is the next one I'm looking at. Five years later, LZSS was introduced, which made a couple of observations that they wanted to change. They said: you know what, instead of making everything a tuple, a back reference, let's do something else. Let's just emit a single bit, where a 0 bit means the next byte is a literal, and a 1 bit means the next three bytes, in this example, are one of those back reference tuples that we want to copy-paste from what we've already decoded earlier. So where it used to look like this in its binary representation, you would now have these individual bits in between the bytes, so to speak, to tell you how to interpret the bytes that come afterwards.

A downside is, of course, that now you're not byte-aligned anymore. You have to work at the bit level and shift around, and it works, but it gets kind of messy, and it also gets slower, because reading bits is slower than reading byte-aligned bytes. So they did something quite clever, I thought, which for some reason I didn't realize when I looked at it, but somebody else mentioned it in a blog post that I found: they just put eight of those flag bits together at the start, and now you're back to being byte-aligned. So you look at the first byte and you say, okay, I have seven literals coming up and one tuple, so you read the next seven bytes, and afterwards you have three bytes to read for the tuple, the back reference distance and the back reference length, and then you start over: the next byte will have all your flags, and so on and so forth.

The other optimization that LZSS did is to say: you know what, a back reference of length one turns what is one byte into three bytes, and that's not good. Let's only do that for any back reference that has length three or above. So now we're at a point where for every eight bytes we add one extra byte for flags, but we also add the capability of copy-pasting back references, so that obviously seems a lot more efficient, and we have gotten a lot closer to a good compression ratio even on smaller files.

So when did this compression come in? Is this the thing that PNG does, is that the format it uses?

PNG uses deflate, and deflate is based on LZSS, which is why I explained it.
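Here is a rough sketch of that flag-byte decoding loop, assuming the 3-byte back reference layout used in the example above (16-bit distance, then 8-bit length, flags read lowest bit first); it illustrates the idea rather than the exact wire format of any particular LZSS implementation.

```ts
// Decode an LZSS-style stream: each flag byte describes the next 8 items.
// A 0 bit means "one literal byte follows"; a 1 bit means "a 3-byte back
// reference follows" (16-bit distance, then 8-bit length).
function decodeLZSS(data: Uint8Array): Uint8Array {
  const out: number[] = [];
  let pos = 0;
  while (pos < data.length) {
    const flags = data[pos++];
    for (let bit = 0; bit < 8 && pos < data.length; bit++) {
      if ((flags >> bit) & 1) {
        // Back reference: copy `length` bytes starting `distance` bytes back.
        const distance = data[pos] | (data[pos + 1] << 8);
        const length = data[pos + 2];
        pos += 3;
        const start = out.length - distance;
        for (let i = 0; i < length; i++) out.push(out[start + i]);
      } else {
        // Literal: copy the byte through unchanged.
        out.push(data[pos++]);
      }
    }
  }
  return new Uint8Array(out);
}
```

An encoder does the opposite: it searches the previously emitted bytes for the longest match of at least three bytes, and falls back to emitting a literal otherwise.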
Because basically, what I'm going to talk about next is... The LZSS, or all the LZ algorithms at the start, the 77 and the 78, kind of want to avoid repetition, and they either use a dictionary or this kind of back referencing to avoid sending what's already been sent. And that's only one piece of the puzzle, because if we look at our JavaScript file, you can see that not all characters are created equal. The 'e' appears quite a bit in our service worker file, while the 'x' and the 'z' don't appear at all, or something like a 'y' only appears, I guess, once or twice, I don't know. The whole point is: why are we using eight bits to describe the letter 'e' when it's so frequent, but also eight bits for something that's so infrequent, like the 'y' or the 'k'? The reason is that this is very easy to code: we only ever read eight bits at a time, and that is our next character. If we were to give each individual character a different length, a different amount of bits, how do you know how many bits you have to read for the next character? And if you send the length of the next character beforehand, you are probably undoing all the savings you're trying to make.

So now you're going to talk about the thing I mentioned earlier, which is Huffman or arithmetic encoding, right? Those are the sort of things that deal with that entropy coding side of things.

Exactly, these are entropy codings, which basically just means... entropy is a statistical term, and they incorporate the statistical frequency of the symbols, as they are called, that are used in the thing you want to send. One really interesting thing about trees in general is that if you assign bits to the edges of the tree, you now have a way to describe each leaf in the tree with a unique code that is self-terminating, as it's called, because by just traversing the tree you know when you have reached a leaf: you have just decoded a letter, and you can go back to the root and start all over again. You don't need to encode the length of the symbol, of the code itself, because it is self-terminating, it tells you when you are done decoding a letter. And this principle is what Huffman trees, now we're finally talking about them, are based on, and they were actually already a thing way before these compression algorithms.

Wow, I didn't realize they were that old. Wow.

Yeah. So, the basic principle: I'm going to give a quick algorithm for how to construct a Huffman tree.

That's not how Scrabble works. That's not how...

Oh, it would be nice. As you said, for entropy you need to know the frequency of each letter that appears, and for that you basically have to have the entire message. So for compression you need to have everything in memory and count how often each individual letter, byte, symbol, whatever you want to compress, appears. In this example I'm just limiting it to four, and we have a lot of E's, a couple of S's, a couple of K's and almost no N's. Now, to create a Huffman tree, you sort your array by frequency, so you put the lowest one at the right and the most frequent one at the left, and you group the lowest two into a common node and add their frequencies together. So basically you start constructing a little tree, and now you have an array of two literals and a tree, and then you start over. You keep doing this over and over: you sort, and you group the two least frequent ones, you sort and you group, and at some point you end up with only one single element left, which is a tree, and that is how we arrive at the tree.
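A minimal sketch of that build-and-merge loop, with made-up frequencies for the four letters in the example; the node shape and function names are my own.

```ts
// A leaf carries a symbol; an internal node carries its two children.
// Every node carries the summed frequency of everything below it.
type HuffNode =
  | { freq: number; symbol: string }
  | { freq: number; left: HuffNode; right: HuffNode };

function buildHuffmanTree(freqs: Record<string, number>): HuffNode {
  // Start with one leaf per symbol.
  let nodes: HuffNode[] = Object.entries(freqs).map(([symbol, freq]) => ({ symbol, freq }));
  while (nodes.length > 1) {
    // Sort most frequent first, then merge the two least frequent nodes.
    nodes.sort((a, b) => b.freq - a.freq);
    const right = nodes.pop()!;
    const left = nodes.pop()!;
    nodes.push({ freq: left.freq + right.freq, left, right });
  }
  return nodes[0];
}

// Walk the tree: going left appends a 0, going right appends a 1,
// and reaching a leaf gives that symbol its (self-terminating) code.
function assignCodes(node: HuffNode, prefix = "", codes: Record<string, string> = {}) {
  if ("symbol" in node) {
    codes[node.symbol] = prefix || "0"; // single-symbol edge case
  } else {
    assignCodes(node.left, prefix + "0", codes);
    assignCodes(node.right, prefix + "1", codes);
  }
  return codes;
}

// Example frequencies like the ones in the episode: lots of e's, almost no n's.
const tree = buildHuffmanTree({ e: 12, s: 4, k: 3, n: 1 });
console.log(assignCodes(tree)); // 'e' gets the shortest code, 'n' the longest
```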
What's interesting is that this is actually proven to be optimal: Huffman codes are an optimal encoding for the frequencies given. Now, optimal in the sense that you know the frequencies perfectly beforehand and that each frequency, or probability as they call it, is independent. We know that in the English language it is very unlikely for a 'j' to come after the letter 'x', so the probabilities usually depend on the text you're encoding. So in real life the Huffman code isn't necessarily optimal, but it is still very good. And as you mentioned, arithmetic coding is an alternative; it's harder to implement, but it can get you closer to an optimal encoding in more of the cases.

Yes, and Huffman is what powers JPEG; the lossless part of JPEG encoding is Huffman.

Huffman or arithmetic?

It's Huffman. Arithmetic is in there as well, but at the time that browsers came to implement JPEG, or while everyone was implementing JPEG, there were patent issues around arithmetic coding. So although JPEG does support arithmetic encoding, which would produce smaller sizes, no browser supports it, historically for that reason, and really we've got better formats now, so there's no point in adding it.

Interesting, I didn't know that. So now we can basically take a file, count which byte values appear more and less frequently, build a Huffman tree, encode it, and just emit these bit sequences. And again, this by itself can already do quite a lot of damage. For example, I think I applied it to the service worker file and did only Huffman encoding, and it compressed the data down to 70% of its original size, which is pretty good. However, that is without sending the tree, and you obviously need the tree to decode the bit sequence, because otherwise it's just meaningless bits, it actually looks like gibberish. Now, the problem with that is that these dictionaries are quite extensive. In this case I only have 26 for the alphabet, but if you imagine we have a bit sequence for every possible byte value, that means 256 bit sequences of different lengths. That will take up quite a bit of space to encode and can again undo a lot of the compression, so only on bigger files can you probably break even or even come out ahead.

And this is where I learned something that I also didn't know before, namely that in the deflate spec itself they give you an algorithm where the only thing you need to send is the length of the bit sequence for each symbol. That is enough to actually reconstruct the entire Huffman tree and, as a result, the entire dictionary, and that is obviously a lot easier to put into a little file than, you know, an explicit mapping of symbol to bit sequence and bit sequence length.

That's very smart, I like that.

It's an RFC, and it looks intimidating, but trust me, it is actually fairly approachable. The algorithm is in the bottom half here, with three steps, and it has some pseudo-C code. I think for most people it will be quite easy to read.
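That algorithm from the deflate spec (RFC 1951, section 3.2.2) translates almost directly into code. Here is a sketch of it in TypeScript, using the example code lengths given in the RFC itself; the function name is mine.

```ts
// Reconstruct canonical Huffman codes from code lengths alone, following
// the three steps in RFC 1951 section 3.2.2.
// lengths[symbol] is the bit length of that symbol's code (0 = unused).
// Returns codes[symbol] as a binary string of that length.
function codesFromLengths(lengths: number[]): (string | null)[] {
  const maxBits = Math.max(...lengths);

  // Step 1: count how many codes there are of each length.
  const blCount = new Array(maxBits + 1).fill(0);
  for (const len of lengths) if (len > 0) blCount[len]++;

  // Step 2: find the smallest code value for each length.
  const nextCode = new Array(maxBits + 1).fill(0);
  let code = 0;
  for (let bits = 1; bits <= maxBits; bits++) {
    code = (code + blCount[bits - 1]) << 1;
    nextCode[bits] = code;
  }

  // Step 3: hand out consecutive code values, symbol by symbol.
  return lengths.map((len) => {
    if (len === 0) return null;
    return (nextCode[len]++).toString(2).padStart(len, "0");
  });
}

// The example from RFC 1951: lengths (3, 3, 3, 3, 3, 2, 4, 4) for symbols A..H
console.log(codesFromLengths([3, 3, 3, 3, 3, 2, 4, 4]));
// -> ["010", "011", "100", "101", "110", "00", "1110", "1111"]
```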
Alright, and now to the big reveal: deflate. Deflate is pretty much the combination of LZSS and then running Huffman on it. So basically, what we do is our LZSS: we try to find byte sequences that are reoccurring and replace them with tuples, with back references. But here's what is interesting, because we are now doing the Huffman encoding as a second step: in our previous LZSS encoding, every value we decoded was just a literal. So if we decoded value 255, we knew, okay, we have to emit byte 255. If we decoded value 2, we knew we had to emit the literal for the value 2. But we are not bound to the range between 0 and 255 anymore, because we're going to Huffman-compress it anyway. So what they're doing in deflate, and I think it's really clever, is that they're using the values above 255 as well, basically as instructions for the decoder. For example, 256 signals that we have reached the end of the stream, or the end of a block; we're not going to talk about blocks, but it's the same thing effectively. And the even higher values stand for the back references: if you decode a value of 257, that means you just found a back reference of length three, and so on and so forth.

Now obviously, as I said, our back references can have lengths up to 256 themselves, so they're not going to add another 256 symbols. Instead, they realized that the bigger back references are commonly very unlikely, so they started to create groups, saying, okay, for example, 265 means you have found a back reference of either length 11 or length 12, and we're going to put one extra bit in the bit sequence to let you know which one of the two it is. Later on they make the groups even bigger, and so on and so forth. But this, I thought, was a really interesting revelation, in that the Huffman encoding allows you to use more values than just eight bits per instruction, so to speak.

That's very smart, I like that. I didn't know that's how it worked. I figured it would be like, yeah, one byte to say "I'm now going to give you multiple bytes' worth of data". It's so similar to Unicode, I guess, but this is a much smaller, neater way of doing it. I like it.

Yeah. So this table, again, can be found in the spec, what the exact mappings are, in case you are interested in that.
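For reference, here are a few rows of that table from RFC 1951 (section 3.2.5), wrapped in a small illustrative lookup of my own; the full table runs from code 257 up to 285.

```ts
// A few rows of the length-code table from RFC 1951 section 3.2.5.
// Each decoded symbol >= 257 maps to a base match length plus a number of
// extra bits that are read straight from the bit stream afterwards.
const LENGTH_CODES: Record<number, { base: number; extraBits: number }> = {
  257: { base: 3, extraBits: 0 },   // length 3, no extra bits
  258: { base: 4, extraBits: 0 },
  264: { base: 10, extraBits: 0 },
  265: { base: 11, extraBits: 1 },  // 1 extra bit: length 11 or 12
  269: { base: 19, extraBits: 2 },  // 2 extra bits: lengths 19..22
  285: { base: 258, extraBits: 0 }, // the maximum match length
  // ...the full table covers every code from 257 to 285.
};

// Turn a decoded symbol plus its extra bits into the actual match length.
function matchLength(symbol: number, extra: number): number {
  const entry = LENGTH_CODES[symbol];
  if (!entry) throw new Error(`not in this partial table: ${symbol}`);
  return entry.base + extra;
}

console.log(matchLength(257, 0)); // 3
console.log(matchLength(265, 1)); // 12 (the "11 or 12" group from the episode)
```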
So this is where we've gotten so far: we have a raw byte stream from our file, we run it through LZSS, and then that resulting stream of literals and back reference tuples is Huffman-encoded, and we create a Huffman tree specifically for the frequencies as they appear in the file. So if a file contains lots of zero literals, or a lot of back references, those will get a smaller bit sequence than something that barely ever appears, which is why it is a compression that is very adaptive to the individual contents. As I said, though, the back references are not only a length but also a distance, and for that they do the exact same thing: they create a separate Huffman encoding just for the distances, because they have a very different distribution, so they found it worth creating its own Huffman tree, and again they use the same grouping scheme as with the lengths. So in the end, you do your LZSS, you do two separate Huffman encodings, you get two different Huffman trees, and you kind of interleave these codes in the bitstream, from one tree and the other tree.

That's cool. So each of the Huffman trees will, I guess, presumably be at the start of the file? Or is it...

That's what we're going to talk about next, because obviously, as I said, you need to send them to actually be able to decode these bit streams. And this is what you could do: you could just put the Huffman tree for the literals and lengths at the start, and we know that just using the lengths of the bit codes is enough to reconstruct the entire Huffman tree. Then you put the tree for the distances next, and then you put the bitstream. And again, that works.

That's what I said, and you're going to tell me why that's not good enough.

Well, it turns out the Huffman tree for literals and lengths has 286 possible values, and the Huffman tree for distances has 30 possible values. That's over 300 bytes just to declare how to decode, before the content even starts, and that is quite big, especially in the age when this compression was specified. I think 300 bytes was a quite considerable amount of space to not waste but to use. And so what they're doing is Huffman-encoding the Huffman trees.

Of course they are. Of course they are.

So you have these sequences of numbers that just define the lengths of the codes, which is enough to create the tree, but these numbers are a lot smaller, and there's very little variance between them, you know, it's a length 4, length 5, length 4, length 0, length 1. And even this Huffman tree for the Huffman trees has some instructions: it has a special code for "repeat the last encoded length three times" or "fill the rest with zeros", so those instructions are also in there. Again, that's all in the spec, but this actually does reduce everything quite a bit, and that is how deflate works.

Deflate has a bit more capability: it has a version where a predefined Huffman tree pair is used, so you don't have to encode any tree at all. It will often not work as well, because it is not adapted to your specific content, but yeah, that is basically how deflate works.

Does deflate give you the option of providing your own tree, or is it always just the tree it has built into the format?

It has three modes: no compression at all, which basically just concatenates a couple of streams in there; use the predefined Huffman trees as they are in the spec; or, and I think that's the default mode, build bespoke Huffman trees for the content you're giving it.

Gotcha, yeah, okay, that makes sense.

What I'm showing here is the bespoke version, because I think that's the interesting one; that's the one that really squeezes as many bytes as possible out of your content without actually introducing any loss or leaving many bits on the table. So yeah, we made it halfway, Jake: you now know deflate. Let's talk about gzip. Gzip is deflate. We're done.

Excellent, brilliant. Yes, now, I think I knew this bit: it is just a wrapper format around it, right?

It pretty much is. They add a bit of smarts, as far as I know: a block scheme so you can actually change compression for different parts of the stream, a couple more headers, and a checksum at the end for integrity checks. But in the end it is deflate, which is interesting, because the fact that browsers advertise support for both basically means they understand the output of two different libraries: "deflate" stands for the zlib library's format, while "gzip" stands for the actual gzip-specified format.
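Since the wrapper is so thin, it is easy to peek at. Here is a small sketch (the function name is mine) that reads the fixed gzip header fields and the trailer defined in RFC 1952; everything in between is the raw deflate data discussed above.

```ts
// Peek at the gzip wrapper around a deflate stream (RFC 1952).
function inspectGzip(bytes: Uint8Array) {
  if (bytes[0] !== 0x1f || bytes[1] !== 0x8b) throw new Error("not gzip");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    method: bytes[2],                                // 8 means "deflate"
    flags: bytes[3],                                 // e.g. "original file name present"
    mtime: new Date(view.getUint32(4, true) * 1000), // modification time, little-endian
    os: bytes[9],                                    // which OS produced the file
    // Trailer: CRC-32 of the uncompressed data, then its size mod 2^32.
    crc32: view.getUint32(bytes.length - 8, true),
    uncompressedSize: view.getUint32(bytes.length - 4, true),
  };
}
```

The "deflate" content encoding uses the zlib wrapper instead: a two-byte header and an Adler-32 checksum rather than gzip's CRC-32.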
Now, Jake, I hope you have a slightly better understanding of why there are sometimes cases with JavaScript, but most importantly with CSS, where I see this happening a lot: using a minifier can make your file bigger after gzip than without the minifier.

Oh, is it because the minifiers on CSS don't do a lot, and I guess that's just because it's removing white space, right?

Yeah, I think so, maybe sometimes some grouping or stuff like that, but in the end that sometimes removes certain repetition that was there before and makes it less compressible, which I think is really interesting.

And I think that's what was happening in cases where I've tried to manually compress JavaScript. I mean, even though I know these compression formats work on repetition, I'll feel like, well, I've definitely made that smaller, or I've turned a function into an arrow function and that seems smaller, but it hits these paths where it's not able to use the repetition as much, or it means that a function for whatever reason appears less frequently, and that changes the whole makeup of the Huffman tree and changes how the rest of the file is compressing, in some way that I didn't expect. And now I know.

Yeah. And if you want to know more about compression, our colleague Colt McAnlis made a whole series called Compressor Head back in 2013-2015 where he explains it. And if you think that sounds like a long time ago and it's not going to be relevant: remember, some of these compression formats are from the 50s, so it doesn't matter that much.

And I think that's exactly it. It seems like most modern compression formats are using those inventions and recombining them in very clever ways, transforming your input before it goes through one of these old compression algorithms, and that's really how they make it better. So it's definitely worth watching; check that link out in the description.

We do the clap, and then professionalism descends upon us.

Is that what you call our show, a show of professionalism?

It's something, isn't it.

It's not professionalism, it's not that. It's many things, but it's not that.
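The "minified but bigger after gzip" effect is easy to check for yourself with the CompressionStream API mentioned at the top of the episode. The helper below and the comparison are illustrative; which version ends up smaller depends entirely on the content, as discussed.

```ts
// Measure how big a string is once gzipped, using the CompressionStream API
// (available in browsers that support it).
async function gzippedSize(text: string): Promise<number> {
  const stream = new Blob([text]).stream().pipeThrough(new CompressionStream("gzip"));
  const compressed = await new Response(stream).arrayBuffer();
  return compressed.byteLength;
}

// Compare an original file against its minified version; repetition that the
// minifier removed can sometimes matter more than the characters it saved.
async function compare(original: string, minified: string) {
  console.log("original:", original.length, "->", await gzippedSize(original), "gzipped");
  console.log("minified:", minified.length, "->", await gzippedSize(minified), "gzipped");
}
```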
Info
Channel: Google Chrome Developers
Views: 17,661
Rating: 4.9947505 out of 5
Keywords: GDS: Yes, gzip, what is gzip, how to use gzip, intro to gzip, introduction to gzip, gzip tutorial, http 203, http 203 series, chrome http 203, http, web dev, web, dev, chrome developer, google chrome developer, google developer, chrome developers, google developers, google chrome, chrome, google, Jake Archibald, Surma
Id: PZryHH8roIY
Length: 29min 18sec (1758 seconds)
Published: Tue Jun 08 2021