How PNG Works: Compromising Speed for Quality

Captions
What we'll dive into in this video is an understanding of how PNG, one of the most commonly used image file formats, compresses images in a rather clever way such that the original image quality is perfectly preserved.

Across all websites, the two most commonly used image file formats are PNG and JPEG. One pretty common problem with JPEG is that some images can look noticeably off after compression, especially when they contain vector graphics or text. These artifacts are a result of JPEG purposely discarding information to save more space. PNG files have no such problem, and one of the themes in today's video will be understanding PNG from the perspective of lossless compression. PNG is a really clever extension of core ideas used in a variety of text compression algorithms, specifically those found in ZIP files. But as ingenious as these ideas are, as we go through the algorithms you'll notice how complex and slow they can be, especially when used on images. Towards the end of the video, I'll also share with you an alternative to PNG that I'm almost certain most of you have never heard of. A few months ago, a new image format called QOI made the rounds in the developer community; not only is it comparable to PNG, but it's also up to 50 times faster. There's a lot of exciting topics to cover, so let's begin.

Imagine I gave you an image composed of red, green, and blue pixel values and asked you to compress it into the fewest number of bits possible. A key constraint is that whatever compression method you use, there must be a way to reverse it and get back the exact same image: same number of pixels, with each pixel corresponding one to one. How might you approach this problem?

One big idea that's common in lossless compression is to exploit redundancy. Let's focus on a single channel of this image to keep things simple. In this particular channel, some pixel values are more common than others, and maybe we can take advantage of that by representing them with fewer bits than less frequent pixels. The most common scheme for generating a set of codes where fewer bits are allocated to more frequent data is Huffman coding. For those of you who aren't familiar with Huffman codes: we take each pixel independently and find the frequency of each pixel value within the image, then sort these frequencies in increasing order. We build what's called a Huffman tree by taking the two least frequent pixels and connecting them to a node with their sum; we put this node back in our set and repeat the process until we get through all the values. At the end, this tree gives us the codes for each pixel value: every time we go left in the tree, we add a zero to the code, and every time we go right, we add a one. The key idea here is that the most frequently used pixels end up higher in the tree, leading to a smaller bit representation. Our original image had four pixel values, so a standard bit representation would have taken 2 bits per pixel; by using Huffman codes, we've compressed our image to 1.89 bits per pixel.

But note, these savings are slightly misleading, because for us to actually get back the original image, a Huffman decoder has to know the codes we used. We need to store some version of this tree, which makes Huffman coding generally wasteful for smaller images; for larger sets of data, though, it can be a good choice.
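To make the tree construction concrete, here's a minimal sketch of the procedure just described, using a priority queue of subtrees. The function name and the sample pixel list are my own for illustration, not anything taken from PNG itself:

```python
import heapq
from collections import Counter

def huffman_codes(pixels):
    """Build Huffman codes from pixel frequencies (sketch of the idea above)."""
    freq = Counter(pixels)
    # Each heap entry is (frequency, tie-breaker, tree), where a tree is
    # either a leaf value or a (left, right) pair of subtrees.
    heap = [(f, i, value) for i, (value, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(tree, code):
        if isinstance(tree, tuple):          # internal node
            walk(tree[0], code + "0")        # left edge appends a 0
            walk(tree[1], code + "1")        # right edge appends a 1
        else:
            codes[tree] = code or "0"        # single-symbol edge case
    _, _, root = heap[0]
    walk(root, "")
    return codes

pixels = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3]
codes = huffman_codes(pixels)
bits = sum(len(codes[p]) for p in pixels)
print(codes, f"{bits / len(pixels):.2f} bits per pixel")
```

Running this on the toy pixel list shows the frequent value 0 getting the shortest code, which is exactly where the savings come from.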
PNG, by the way, does use Huffman coding, but not exactly the way I described; we'll come back to this point a little later, so keep Huffman coding in the back of your mind, and let's explore some other options.

One thing that's not so appealing about Huffman codes is that they essentially treat the pixels of an image as independent of each other. But in real images, redundancy also happens across sequences of pixels; for example, neighboring pixels are likely to be similar in value. One simple compression scheme that takes advantage of repetitive neighboring pixels is run-length encoding. The basic premise of run-length encoding involves compressing a sequence of similar pixel values as the pixel value plus the number of continuous occurrences of that value. In images where a lot of the pixels along a row have the same color, run-length encoding can be quite effective. But it's not too hard to define an image with a lot of repetitive data where run-length encoding doesn't perform well: the aspect that makes such an image trickier to compress is that the redundancy happens across a sequence of two pixels, and run-length encoding doesn't handle that well.
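A quick sketch of run-length encoding, including the two-pixel pattern that defeats it (function names and sample rows are my own):

```python
def rle_encode(row):
    """Compress a row into (value, run_length) pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

row = [7, 7, 7, 7, 3, 3, 9]
print(rle_encode(row))                 # [(7, 4), (3, 2), (9, 1)]
assert rle_decode(rle_encode(row)) == row

# The failure mode described above: a repeating 2-pixel pattern.
print(rle_encode([1, 2, 1, 2, 1, 2]))  # every run has length 1 -- no savings
```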
When we start looking at repetition across sequences, there are a lot of powerful tools in the world of text compression that handle it better. If you think about text snippets in general, the redundancy will often occur as a result of using the same words or the same phrases. A class of algorithms better tailored to handling repeated sequences is the Lempel-Ziv schemes, named after the two authors of the original paper introducing the basic version of the idea. The algorithm used frequently in ZIP files and PNG is a more efficient variant called LZSS, which is what we're going to focus on today.

The basic idea of LZSS involves back references. On a high level, if I have some text and I see a sequence of characters that appeared earlier in the text, I can possibly save some space by referencing that earlier sequence instead of outputting the same characters again. It's best understood with an example, so suppose we had the following text that we want to compress. There are a few important terms to be aware of when understanding LZSS. We must define a sliding window and a lookahead buffer, both of which are given a specific size. The lookahead buffer can be thought of as the characters LZSS tries to create a back reference for. A natural question is: where does it look for a match of characters that it can reference? That's where the search buffer comes in. The search buffer's size is the difference between the sliding window size and the lookahead buffer size, and this is where LZSS stores processed characters to find a match.

At the start of the algorithm, the search buffer is empty, and the sliding window encompasses both the search buffer and the lookahead buffer. We consider just the lookahead characters; since we haven't seen the first character before, we encode it as the raw character 'r'. Now that we've processed the character, we move it into the search buffer, which contains previously encoded characters. We go through the first few characters in a similar manner, since we've never seen them before. Where this gets interesting is when we reach the second instance of the character 'e'. The compressor looks for the longest string in the lookahead buffer that matches text in the current sliding window, and it finds a match for the character 'e'. The big idea in LZSS is that we can encode this character as an offset and a length: the original 'e' was found two characters back, so that's our offset, and the length of the match is one. We move the second instance of 'e' into the search buffer and proceed.

We encode the letter 't', the letter 'i', and then we hit our next interesting case. In our lookahead buffer we have the character 't', but the key aspect of this class of algorithms is that it tries to find the longest string it can match with anything in the sliding window. Notice that the string 'ti' has been seen before, and that's the longest match, so we can output an offset of 2 and a length of 2. We pass these two characters into the search buffer and continue encoding. We encode the letter 'v' as is, then use a back reference for the letter 'e'. The space is processed normally, and then we hit another 'r'. What's the longest match in the lookahead buffer? In this case we can actually match 4 characters, which allows us to output a back reference with offset 11 and length 4. We then move these four characters into the search buffer, and one thing I want you to notice is that we actually drop four characters from the beginning to keep the sliding window length constant. The last couple of steps involve encoding the character 'a' as is and, finally, the character 't' as a back reference. The final encoding is a sequence of raw characters and offset-length pairs.

Decoding this is not too tricky: we simply output characters as we see them, and any time we encounter an offset-length pair, we go back into our currently decoded string by the offset and output characters as dictated by the length. We continue until we've processed all the characters and offset-length pairs. Seeing this in action confirms that the scheme does not lose any information; the decoder can completely reconstruct the original data. This is the core of LZSS.
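Here's a minimal greedy LZSS encoder in the same spirit as the walkthrough. The tiny window sizes, the minimum match length of 1, and the sample string are my own choices for readability; as noted below, real deflate-style encoders use much larger parameters:

```python
def lzss_encode(data, search_size=11, lookahead_size=4):
    """Greedy LZSS: emit raw characters or (offset, length) back references.
    Toy window sizes; deflate-style LZSS uses a 32 KB window, a 258-byte
    maximum match, and only emits references of length >= 3."""
    out, pos = [], 0
    while pos < len(data):
        start = max(0, pos - search_size)
        best_len, best_off = 0, 0
        for candidate in range(start, pos):
            length = 0
            # The match may run past `pos` into the lookahead (overlap allowed).
            while (length < lookahead_size and pos + length < len(data)
                   and data[candidate + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - candidate
        if best_len >= 1:                 # real encoders require length >= 3
            out.append((best_off, best_len))
            pos += best_len
        else:
            out.append(data[pos])
            pos += 1
    return out

# Raw 'a', 'b', 'c' followed by back references; note (3, 4) overlaps itself.
print(lzss_encode("abcabcabcabc"))
```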
There are a few important points worth emphasizing, because in the world of compression, the devil really is in the details. The search buffer size and lookahead buffer size we choose have a direct impact on compression. A good way to see that is by looking at one step in our encoding: because of how we defined the sliding window, we could output a really nice back reference of length 4. But suppose we decreased the search buffer size by 1. We're now a little unlucky and can't make a back reference to that sequence, because the character 'r' has just moved out of the search buffer. In practice, LZSS as implemented by PNG and other ZIP-style tools defines a sliding window of 32 kilobytes and a lookahead buffer of 258 bytes.

Another subtle but important detail of LZSS involves the way it allows offset-length pairs to be defined. Suppose we had the task of encoding the following set of characters with these parameters; how do you think LZSS will encode it? The actual answer may surprise you, and it's important enough to show. We start by outputting the first five characters normally, and then in our lookahead buffer we have the characters d-e-d-e. We identify matches in LZSS by looking for the starting character in our search buffer, but a somewhat counterintuitive feature is that the end of the match can actually extend into the lookahead buffer. So in this scenario, we can output an offset of 2 and a length of 4.

It looks weird to have a length greater than the offset, but seeing how LZSS decodes it should convince you that this works perfectly fine. After we output our individual characters, we decode the back reference by going back two characters and then copying over the next four characters. The first two characters copy over as we've seen before. Now we still need two more characters, but hey, we have two characters right here, so we reference them and fill out the rest of the string. What basically happened is that we took the last four characters and generated a back reference using characters we were still in the middle of decoding, and there's nothing in LZSS that prevents you from doing that.

In fact, there's a pretty cool byproduct of allowing such representations. Take, for example, a sequence of text where we just repeat a single character. LZSS will compress it as one raw character followed by a single offset-length pair, which is essentially just run-length encoding. I think what's remarkably clever about LZSS is that by designing the algorithm in this manner, we get run-length encoding for free.

A final caveat of LZSS is that, in practice, the algorithm only emits back references of length three or more. It turns out that, more often than not, it takes more space to store an offset-length pair than it does to store two characters individually. So in this earlier sequence, the actual LZSS encoding will only have one offset-length pair; all the other length-one or length-two pairs end up being overkill.
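A tiny decoder makes both of these points concrete: copying one character at a time is what lets a reference read output it has itself just written, and an offset of 1 is exactly run-length encoding. The token format matches the sketch above, and the sample inputs are mine:

```python
def lzss_decode(tokens):
    """Decode raw characters and (offset, length) pairs, byte by byte."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                # Copy one character at a time, so a reference may read
                # characters it has itself just written (length > offset).
                out.append(out[-offset])
        else:
            out.append(token)
    return "".join(out)

# Overlapping reference: offset 2, length 4 reproduces "dede" from "de".
print(lzss_decode(["a", "b", "c", "d", "e", (2, 4)]))   # abcdedede
# Run-length encoding for free: offset 1 repeats a single character.
print(lzss_decode(["x", (1, 7)]))                        # xxxxxxxx
```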
Transitioning back into the world of images: PNG uses this same LZSS algorithm to take the pixels of each RGB channel and represent them as a sequence of raw pixels and offset-length pairs. But there's actually more to it. In the world of text, we feel pretty comfortable with the idea that phrases and blocks of characters get repeated. But suppose I gave you the following gradient image, where each pixel value is just slightly greater than the previous value. LZSS will do terribly on this image, and Huffman coding won't do much better. That's pretty depressing: we just spent all this time going through our two main weapons, and it only took a second for a simple example to break everything. Such is the life of engineers, right? But at the same time, I think we intuitively feel that this image should be compressible; there is a predictable pattern here.

The way PNG solves this problem is a pre-processing step called filtering. When you have a problem that doesn't translate well to an existing approach, you have a choice: either create a new approach, or transform the problem into a version that can be solved with existing solutions. PNG chooses to transform the problem. The filtering step introduces redundancy. In this particular image the pattern is not hard to see: the difference between every pair of adjacent pixels is four. If we transform the row to store differences rather than raw pixels, LZSS can perform significantly better.

This is one example of a filter applied to the image; PNG defines five filter types in total. The most basic filter is simply no filter, which is fairly self-explanatory. The filter we just saw, taking differences between row-wise adjacent elements, is the Sub filter. If we take this row and apply the Sub filter, we get the following output, and you might immediately notice some weird values coming out of the operation; for example, 134 minus 139 comes out to 251. An important point here is that every filter actually operates on individual bytes, not necessarily pixels. In this example, since a pixel is 8 bits, they correspond nicely, but in images where a pixel is not 8 bits, this can lead to quite a bit of confusion. PNG filters always take the value from a mathematical operation on individual bytes and make sure it's in the range of representable 8-bit numbers before proceeding, usually by applying a mod-256 operation: 134 minus 139 is negative 5, but taken mod 256 it's 251.

The next filter is the Up filter. It's the same concept as the Sub filter, but applied against the pixel immediately above the current pixel; Up filters work well when pixels are correlated along the columns of an image. The fourth filter, the Average filter, is a combination of the Up and Sub filters: we subtract from our current pixel the average of the pixel immediately to the left and the pixel immediately above. Again, be aware that all operations are on single bytes, so anything negative wraps around via the implicit mod operation. The final filter PNG defines is the most intricate, and it's called the Paeth filter. First we calculate a base value equal to the sum of the corresponding bytes to the left and above, minus the byte to the upper left. Then we calculate the difference between this base value and each of the three neighbors we just dealt with: left, up, and upper left. Whichever neighbor gives the smallest difference is the winner, in a sense, and is the one we subtract from our original pixel. The intuition is that this process works out which of the three neighbors is the best predictor; once we've chosen it, we perform the same operation as the other methods, subtracting it from our original pixel and storing the result.

In this example we showed filtering on a grayscale image; with RGB images, these operations are done on each channel individually, which makes sure we're not mixing up red, green, and blue values, which are presumably less correlated. Also, because we're dealing with left, up, and upper-left pixels, edge cases in a very literal sense do come up; PNG filters handle this gracefully by defining any out-of-bounds pixel as zero.
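Here's a sketch of all five filters applied to one grayscale row of bytes. The Paeth predictor follows the selection rule from the PNG specification; the function names and sample rows are mine:

```python
def paeth_predictor(a, b, c):
    """PNG's Paeth predictor: pick left (a), up (b), or upper-left (c),
    whichever is closest to the base value a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c

def filter_row(row, prev_row, ftype):
    """Apply one PNG filter type to a row of bytes (grayscale sketch).
    Out-of-bounds neighbours are defined as zero; all math wraps mod 256."""
    out = []
    for i, x in enumerate(row):
        a = row[i - 1] if i > 0 else 0           # left
        b = prev_row[i]                          # up
        c = prev_row[i - 1] if i > 0 else 0      # upper left
        if ftype == 0:   pred = 0                # None
        elif ftype == 1: pred = a                # Sub
        elif ftype == 2: pred = b                # Up
        elif ftype == 3: pred = (a + b) // 2     # Average
        else:            pred = paeth_predictor(a, b, c)  # Paeth
        out.append((x - pred) % 256)             # wraparound: 134-139 -> 251
    return out

print(filter_row([139, 134, 140], [0, 0, 0], 1))  # Sub filter: [139, 251, 6]
```

Note how the 134 - 139 = 251 example from above falls straight out of the mod-256 wraparound.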
An important question that may have crossed your mind is how PNG even figures out which filter to use. There were a couple of early findings that provided some guidance. In general, if you have a palette-based image that uses a small set of colors to save data, filtering usually won't create any additional redundancy. Also, remember that filtering operates on individual bytes, so if the image has fewer than 8 bits per pixel, PNG found that only the None filter should be used, since the overlap between pixels within a byte makes it difficult to create any correlation. But on most other types of images, some combination of filters is usually a good idea and leads to better compression, taking into account how LZSS works. Simply trying every combination is a ridiculous solution, though: on even a simple 240p-resolution image, the number of possible filter combinations is 5 to the power of 240. That number is more than the number of atoms in the universe, and unfortunately you can't wait until the end of the universe to save a damn image, so we have to be smarter.

The way PNG solves this problem is by defining a heuristic to determine which filter is best for a particular row, called the minimum sum of absolute differences. The actual calculation is not too difficult, but the steps involved can be confusing, so let's break it down. Suppose we take this particular row of our image and apply a Sub filter where we don't mod the final values by 256; in comparison, this other row is a true Sub filter after modding by 256. To calculate the sum-of-absolute-differences metric, we treat all values between 0 and 127, inclusive, as is, and remap any value greater than or equal to 128 to a corresponding negative value, where 255 maps to negative 1 and 128 maps to negative 128. We then perform the calculation on the remapped filtered data: after summing the absolute value of each element, we get a score for the filter. To pick the best filter, PNG performs this calculation for all five filters on each row and picks the filter with the minimum score. The overall intuition is that a filter mapping data such that the sum of absolute differences is close to zero is more likely to produce data that is actually repetitive. Truth be told, there's no real mathematical proof that this is the best heuristic to use; it just turns out that no one has found something as simple that performs better empirically, so this is still the heuristic used to this day. Who knows, maybe someone down the line will find a better method. The advantage of this heuristic is that PNG generally only has to evaluate five filters on every row, which is far more tractable than the naive approach of trying all combinations.
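A sketch of the heuristic, reusing `filter_row` from the earlier snippet. The signed remapping and the pick-the-minimum rule follow the description above; the tie-breaking behavior (first filter wins) is my own simplification:

```python
def msad_score(filtered_row):
    """Minimum-sum-of-absolute-differences heuristic: treat each filtered
    byte as a signed value (128..255 map to -128..-1) and sum magnitudes."""
    return sum(v if v < 128 else 256 - v for v in filtered_row)

def choose_filter(row, prev_row):
    """Score all five filters on a row; pick the one whose output is
    closest to zero overall."""
    scores = {f: msad_score(filter_row(row, prev_row, f)) for f in range(5)}
    return min(scores, key=scores.get)

row, prev = [10, 14, 18, 22], [0, 0, 0, 0]
print(choose_filter(row, prev))   # Sub (filter 1) wins on a gradient row
```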
On the decoding side, one nice property of these filters is that each operation is entirely reversible; no part of the filtering step loses any information.

So, putting this all together: PNG takes an uncompressed RGB image and filters each channel of the image row-wise, picking a filter per row via the heuristic. Then PNG takes this filtered image and, on each channel, runs the LZSS algorithm to compress it further; LZSS replaces previously seen sequences of filtered pixels with back references. But there's one final step. After LZSS, we have a sequence of raw values and offset-length pairs, and since LZSS only emits back references of length three or more, there's still some individual-value redundancy in the sequence that we can exploit. The last step of PNG takes this encoded LZSS sequence and performs a modified Huffman coding scheme to exploit it. There are a lot of painstaking details in exactly how this is done on a sequence that contains both individual values and offset-length pairs; it is surprisingly complicated. The combination of LZSS and Huffman coding that PNG uses is often called the deflate compression scheme. These steps essentially encompass the entire PNG file format: every part of this approach either creates some redundancy or exploits it in a way that is perfectly reconstructable. Given that these methods perform better than other lossless image compression schemes, PNG is often the go-to file format when you need to retain all information.
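Python's standard `zlib` module implements this same deflate codec (LZSS plus Huffman coding), so we can see the whole filter-then-deflate idea in two lines. The gradient data and the "after a Sub-style filter" row are my own illustration, and the exact compressed sizes will vary:

```python
import zlib

# One long gradient "row": every byte is 1 more than the previous one.
raw = bytes(range(256))
# The same data after a Sub-style filter: a 0 followed by 255 ones.
filtered = bytes([0] + [1] * 255)

print(len(zlib.compress(raw)))        # the gradient barely compresses
print(len(zlib.compress(filtered)))   # the filtered version compresses tightly
```

The gradient defeats deflate on its own, but after filtering creates the redundancy, deflate shrinks the row to a handful of bytes, which is exactly why PNG filters first.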
At the same time, if we look at the entire process from start to finish, it does feel rather complicated, right? I mean, we have all this filtering, which we have to evaluate at least five times on each row; then we have to perform LZSS, which is also computationally expensive; and that's not even mentioning that the whole thing has to go through Huffman coding, which, trust me, gets really complicated within the context of PNG. So even though PNG is widely used, it's by far one of the slowest compression methods in practice. This usually isn't a big deal, since decoding a PNG image is not too inefficient, but it does raise an interesting question: is there a compression scheme that's simpler and faster, yet still somewhat comparable to PNG? Until a few months ago, the answer would have been no. But then an image format called QOI was developed by an independent software engineer, Dominic Szablewski, and it turns out to be surprisingly close to PNG in compression ratio, but way faster. What I think is most notable about this scheme is how a simple set of encoding rules ends up making it extraordinarily fast, yet still quite effective.

The best way to understand QOI is through an example image. QOI considers each channel of an image jointly in its encoding, so let's split the image into its individual RGB values and align them together. The speed of QOI comes from the fact that it processes every pixel exactly once. Throughout encoding, QOI keeps track of a reference to the previous pixel and the current pixel, which are continuously updated, and there are a handful of rules for encoding.

The first rule: if the current pixel has the same value as the previous pixel, QOI encodes it as a run, and it will keep incrementing the run length until it sees a different pixel. It encodes runs in a single byte, with two bits reserved for the run-length tag; QOI uses special tags for all its encoding options, to help the decoder losslessly reconstruct the original image. The other six bits encode the actual run length.

The second option involves storing the RGB value of the current pixel as a difference from the previous pixel. If the difference between the previously seen pixel value and the current pixel value is within a predefined constraint, we can store the current RGB pixel in a single byte of memory: two bits for the difference tag, and the other six bits encoding the difference on each channel. QOI also has an option to encode larger differences in two bytes of memory: if the differences between the previous and current pixel meet a second set of criteria, the current pixel can be encoded in two bytes. These rules may seem a bit arbitrary, but one key constraint QOI maintains is keeping all encoded output byte-aligned. Part of what makes QOI so simple is that the decoder can read bytes one at a time without ever worrying about a pixel being encoded partially in one byte and partially in another.

The fourth option encodes a pixel as an index to a previously seen pixel. As QOI processes pixels one by one, it stores each pixel in an array, determining where to place the processed pixel with a rather simple hash of the R, G, and B values. When we encounter a new pixel, we check the array to determine whether it has been seen before, and if so, we store its index position. QOI uses an array of size 64, which allows the index location to be stored in 6 bits. And lastly, if run-length encoding is not possible, the pixel has never been encountered in our array, and the difference from the previous pixel is too large, we just store the RGB value as is. Notice that storing the RGB value directly actually takes a little extra space, because we need to notify the decoder of this choice; the hope is that QOI won't have to emit too many of these cases.

And that's really the core logic. There are some minor details around bit representation and handling transparency, but it really doesn't get much simpler than this for a compression algorithm. I mean, the entire specification is just one page long; the PNG specification, for comparison, is a true monster of a document. When you run QOI on an image, the main task of the encoder is to determine the proper tag and data to generate for the decoder; this involves correctly updating the previous and current pixels, and also keeping track of the index array and updating it appropriately. The logic is not too complex. I really do believe you can code both an encoder and a decoder without too much difficulty after reading through the spec; in fact, I highly recommend taking the time to do it as a fun exercise. It's a good way to understand how to represent compressed information.
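In that spirit, here's a sketch of QOI's core encoding loop over RGB tuples, following the tag layout and index hash from the one-page spec; the two-byte "luma" difference op and alpha handling are omitted for brevity, and the sample image is my own:

```python
def qoi_encode_rgb(pixels):
    """Sketch of QOI's core ops on a list of (r, g, b) tuples: RUN, INDEX,
    DIFF, and the raw-RGB fallback. Not a complete encoder."""
    out = bytearray()
    index = [(0, 0, 0)] * 64               # array of recently seen pixels
    prev = (0, 0, 0)                       # the spec starts from black
    run = 0
    for px in pixels:
        if px == prev:
            run += 1
            if run == 62:                  # longest run that fits the 6 bits
                out.append(0b11000000 | (run - 1))
                run = 0
            continue
        if run > 0:                        # flush a pending run
            out.append(0b11000000 | (run - 1))
            run = 0
        r, g, b = px
        h = (r * 3 + g * 5 + b * 7 + 255 * 11) % 64   # spec's hash, alpha 255
        if index[h] == px:
            out.append(0b00000000 | h)     # INDEX: one byte
        else:
            index[h] = px
            dr = (r - prev[0] + 128) % 256 - 128      # wraparound differences
            dg = (g - prev[1] + 128) % 256 - 128
            db = (b - prev[2] + 128) % 256 - 128
            if -2 <= dr <= 1 and -2 <= dg <= 1 and -2 <= db <= 1:
                # DIFF: tag plus three 2-bit differences, one byte total
                out.append(0b01000000 | (dr + 2) << 4 | (dg + 2) << 2 | (db + 2))
            else:
                out.extend((0xFE, r, g, b))           # raw RGB: four bytes
        prev = px
    if run > 0:
        out.append(0b11000000 | (run - 1))
    return bytes(out)

img = [(10, 10, 10)] * 5 + [(11, 9, 10), (200, 0, 0), (10, 10, 10)]
print(len(qoi_encode_rgb(img)), "bytes for", len(img), "pixels")
```

Even this cut-down version shows where the speed comes from: one pass, a handful of comparisons per pixel, and every output unit is a whole byte.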
On this particular image, after running QOI, we end up with a stream of 50 bytes. Considering the original image had 64 pixels at 3 bytes each, for a total of 192 bytes, the compressed version is only 26% of the original image.

So we know for sure that QOI is simpler than PNG, but how does it actually hold up in terms of compression performance? For the most part PNG is better, but the difference between QOI and PNG is not as large as you might expect. There's a reference test suite of images with a variety of interesting features, and we ran the algorithm on the entire suite to see the results. As you can tell, QOI is pretty close to PNG on a lot of these images, which is truly impressive. Where QOI really shines is in its speed: QOI encoding, in particular, is on the order of 20 to 50 times faster than PNG, mainly because QOI only has to process each pixel once, while PNG may process pixels several times across its various stages. Decoding performance is also better than PNG, but the differences are not as drastic, since PNG decoding is generally a lot simpler than encoding.

Rather than debate which format is better, though, I think what's cool about QOI is simply that this relatively simple and understandable idea was discovered so recently. The full PNG spec was developed in the mid-1990s, so the fact that something almost 30 years later came around and is even comparable to it is incredible. Dominic, the engineer who invented it, also claims that he is no expert in image formats and kind of just played around with ideas that were much simpler for the average person to understand. I think there's a big lesson to take away from this. A lot of times on this channel we like to go over algorithms and ideas that are really clever and complex, but sometimes solutions that are simple, or, as Dominic describes them, "stupidly simple," should also be appreciated. After all, the biggest job of software engineers and computer scientists is to manage complexity. A lot of problems are really difficult and simply need elaborate solutions, but along the way it's easy to get lost in the complexity and design something that's perhaps a little too difficult and slow. As engineers, if we can get away with it, we should absolutely consider simpler, more understandable solutions, and I believe the story of QOI compared to PNG is a great example of that.

In this video we took our time diving into hands-on examples of both PNG and QOI to get a feel for how they work. One of the best ways to learn new and complex ideas is by playing around with them on your own, and a great tool for learning interactively is Brilliant, the sponsor of this video. From the basics of mathematics and algorithmic thinking to more complex ideas in deep learning and probability, Brilliant offers a variety of courses and learning paths for those interested in getting hands-on practice. Lately I've been dabbling in their computer architecture course, which is an area that intersects fundamentally with the byte representations we talked about in this video; visually seeing how the lower layers of computers represent and store information in memory gave me some new perspectives on concepts I'd learned in college. All the courses come with elegant, interactive visualizations and a great set of practice problems to encourage mastery of the content. You can get started for free by going to brilliant.org/reducible, which is linked in the description below. Brilliant is providing a special offer through this channel where the first 200 members to sign up get 20% off the annual subscription; it's a great way to learn more about the topics in these videos and also a good way to support this channel. Big thanks to Brilliant for sponsoring this video. Thanks for watching, and I'll see you all in the next one.
Info
Channel: Reducible
Views: 220,082
Id: EFUYNoFRHQI
Length: 32min 0sec (1920 seconds)
Published: Mon Mar 28 2022