LZW Compression Algorithm Explained | An introduction to data compression

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
so imagine you are walking in the street minding your own business and happened to run into an old friend so you are both catching up but he's talking like this he's saying i was in the park yesterday i met john in the park john was walking his dog john adopted his dog last week then i met mary and i told mary that i have just met john in the park and that john adopted a dog last week you see he kept mentioning john's name a lot we don't do that right that's redundant usually you would mention someone or something once and then refer to the thing or the person using words like he or she or this or that so instead of saying john every time he could have just said he and the second time he mentioned mary he could have just said she because we already got it and we know what we are talking about right and instead of telling a story twice he could have just said i told her that because we don't need to be redundant we don't need to say things all over again and again and that's something we do naturally when we talk and that is some sort of a compression you see we do the same with computers too we try to avoid redundancy to give you an example consider the following area of bytes we have this array of bytes 8 6 2 blah blah blah if you look closely you will see that 6 2 4 has been mentioned three times here here and here so instead of mentioning three times we can just mentioning once and refer to it later how let's create a list of all the words that has been repeated which we can call patterns so here's a list of all the things that has been repeated under 0 we have got 6 2 and 4 and we can substitute here instead of 6 2 4 again and again we can just say we can just refer to the item number zero in the list so under the index of one we have got three and five and we can also substitute it do we see any other patterns um well we also see another pattern we have the index 0 which is 6 2 4 is also a part of a bigger pattern sometimes which is 6 2 4 9 or the index 0 and 9. so we can simply add this to the list 2 under the index of 0 it's it's substituted with the index of 0 sorry under the index of 2 we substitute it with zero and nine okay do we still see any other patterns maybe not okay let's just say that this is the compressed output so the compressed output would be the list and this uh this compressed string so you of course you need the list to decode that one more thing that i pretended not to notice until now which is uh in a binary form how do we distinguish between the indices and data so for example how do we know if that zero is a data that meant to be zero or an index referencing to something in the list we can think of many ways to do so one of them is actually not to start from 0. let's start from 10 because i see that the maximum here in the data is 9 so we'll have to mention that the maximum is 9 and anything bigger than 9 is an index so in this case it will look like this we have the list instead of 0 we'll say 10 referring to 6 2 4 11 referring to 3 and 5 and 12 referring to 10 and 9 and of course you have to mention the list we have to mention the maximum our range so in this case our range would be from 0 to 9 is data anything bigger than 9 is an index okay so that is some sort of a compression now before we talk about a real compression algorithm i'm gonna state something that's pretty much obvious but i think it still needs to be said we need to pack our bits we should not approximate them to bytes so if we have something that is only occupying four bits or nine bits or eight or ten bits we will just have to pack everything together so we are going to be using some bitwise operations so instead of occupying an entire byte we can just fit two or maybe we can fit one and a half the rest will go to the next byte i mean it all depends on the size but we'll have to use some bitwise operations like shifting and oring and masking and all of this so we can fit them together without twisting bits so from now on just assume that everything is packed and everything is read with bitwise operations let's go back a second to our first example so you met your friend and you're catching up now you want to be less redundant so do we start with a list of things that you are going to use many times in your small talk are you going to say hey friend before we talk let me give you this list so when i say he i mean john and when i say she i mean mary no we don't do that that will still be weird okay so you're going to be talking normally naturally and the more you talk the more context you are building up in your in your head gradually so you will mention john normally at first and then once you mention him again i mean once you mention me with the first time you will build the dictionary in your head oh we are talking about john the next time you say he he's referring to john this dictionary is only in your head it's being built through the whole chit chat it's not going to be given to you before we talk so our algorithm in this video is going to be kind of the same it's called the ljw algorithm and i can say that it's kind of a lazy quick algorithm it's not going through the whole text over and over again to find patterns no it will just go through it once and only once and then we compress it as we go and it never looks back so basically it's doing the same as we do naturally we just mentioned something once and the next time we mention it we just refer to it with an index okay let's just start with an example so in this example the maximum here is 15 so we can have indices starting from 16 okay so anything between 0 and 15 is going to be data anything starting from 16 and up is going to be an index okay so this is the row if the row information uncompressed so we start from three have we seen anything starting with three in the dictionary oh what dictionary we just started right now the dictionary is still empty okay let's insert three into the output stream and let's see if there are any other patterns to insert so um we have three and five now the question is do you think we should add three five to the dictionary i mean in case we see it later in case it got repeated do you think it would be a pattern well i i don't know i haven't seen the future yet and we said before that we are going through the text only one time we never look back so uh we don't know that's going to be a pattern or not so in this case we're just going to keep it in the dictionary just in case we don't know if three file is going to be repeated but in case it is it has an index in our dictionary so under the entry of 16 under the index 16 we have three and five okay moving on to the next now we have five have we seen any patterns starting with five no so we insert five into the output and we always every single time we have to make an entry to the dictionary so with our next letter we have five and four should we should we insert it just in case should we insert five and four just in case it's a pattern well the answer is always yes so we we insert our entry to the dictionary so under the index 17 we have got five and four okay the next is four have we seen any pattern starting with four in the dictionary no okay so we output four to the stream and we have four and fourteen under the index of 18 in the dictionary the next is 14 and we do the same and along with with 11 we enter 4 and 11 under 19. then the next would be 11 we do the same and along with 5 we have 11 5 under the index of 20. then we have got 5 and wait i have seen a 5 before in the dictionary we have a pattern starting with the five yes so its second letter also is matching our second letter so that's a pattern so yeah we have five and four and we have an index that is already in the dictionary under 17. so in this case we will not insert 5 in the output stream but we will just insert 17. so here the pattern 17 has proven itself worthy i mean it's we are just inserting patterns all the way as we go but uh nothing has been repeated before but 17 has actually made it so instead of uh considering two letter patterns we might as well consider three letter patterns and in this case let's check the next next letter it's nine so in this case seventeen nine or as we can say five four nine could also be a pattern so we're going to be inserting this under 21 and next we have got 9 and we'll do the same thing we haven't seen any pattern starting with 9 so we'll output 9 in the stream and we can insert 9 11 as a new pattern of index 22. the same goes for 11 12 under 23 12 14 under 24 and we find another match we have seen patterns starting with 14 before does it all to match our next letter yes it's 14 11 as well so we'll insert 19 in the output and make a pattern of 14 11 5 14 11 15 or we can also say 19 15 under 25 and then we'll have 15 5 never seen anything starting with 15 before so we'll just add 15 5 under 26 and 15 in the output and then we have got five so uh the question is do we have patterns starting with five yes actually we do have two entries in the dictionary starting with five okay let's just take the longer one the longer one also matches our next and next next so we have a pattern of length three under 21 so we'll insert it along with the letter next to that pattern which is 11 so under 27 we have an entry of 5 4 9 11. okay i think we are about to finish and notice this this is really important and really tricky we have several patterns starting with 11 but none of them are 11 11 so we'll just insert 11 in the output stream and every single time we have to insert a dictionary entry so we have got 11 11 under 28 seems normal right now we start from the next 11 the next letter and do we have any pattern starting with with 11 actually yes there is a match with one that we just entered right now it's 28 representing 11 11. and in this case we are done okay now can we decompress this output and see if we can get the original again let's do it okay so of course we don't start with just no information at all the only information we have is that we have a range from 0 to 15. so anything from 0 to 15 is data that's uncompressed and anything higher than 15 it's a pattern okay uh so since we are going from 0 to 15 uh this means we we use four bits but actually since we uh since we have indices that are higher than 15 we'll start with five bits right so because it could be 16 so we need five bits instead of four bits i mean the original data had four bits but the compressed will have five bits per letter sounds wasteful doesn't it so at the cost of removing repetitions we are actually increasing the length of each letter ok let's get going first we have a 3 and 3 is data because it's between 0 and 15. it's not an index uh we will just have to put it in the in the output stream which is the original uncompressed data okay so output 3 and the next is five oh that's also data let's put five to the output stream but first let's recreate the dictionary so we combine the previous pattern plus the for the the first unit or the first letter of the current pattern so in this case 3 5 goes to dictionary under 16 and now we have got four and that is still uncompressed data so we have to put an entry to the table in this case uh four i mean the previous pattern which is five plus the first character of the current pattern which is four five four and we enter into the dictionary under 17 and then the next is 14 so it goes to the output as well and we have an entry to the dictionary which is the previous pattern plus the first letter in this pattern which is 4 14 and we put it in the dictionary under 18. we see we are constructing the very same dictionary ok let's continue so the same for 11 we have 14 11 is added under 19 uh and then we get to 17 and since we know that our range is from 0 to 15 17 seems to be an index okay uh we go to the dictionary we do have an entry that's actually under 17 so we'll output this entire pattern which is 5 4 and now we need to add something to the dictionary right what would it be in this case like we said it's the previous pattern which was 11 and the first letter of the current pattern the first letter of the current pattern the current pattern is five four we only pick the five the previous pattern was eleven so in this case it's eleven it's sorry it's yeah it's eleven five in this case yeah this uh eleven five goes under twenty now we have the nine okay that's that's not an index that is uncompressed data so we insert we we insert it as it is and then we have also to insert something in the dictionary which is the previous pattern which is under the index 17 which is 5 4 plus the first letter of the current which is 9 so we have a pattern of 4 5 9 under 21. okay we are really reconstructing the same dictionary and we keep going so 11 goes as 11 and we enter 9 11 under 22 and the same for 12 and 12 11 12 goes under 23 in the table and then we get to the in next letter which is 19 that's an index do we have it in the dictionary yes we substitute it with 14 11 and we enter what okay what entry goes in the dictionary in this case okay it's the previous pattern which is 11 and the first letter of the current pattern which is 14 and so 12 14 goes under 24 and then we have 15 that's that is uncompressed data so um we put it as it is in the output stream and then uh what entry we put in the dictionary well it's the previous pattern which is 14 11 and the first letter of the current which is 15 so in this case uh 14 11 and 15 will go in the dictionary next is 21 which is an index referring to 1789 which is broken down again into five four nine and the next letter would be 28 and really pay attention to this one do you have a 28 in our table actually no and that means one thing it means that the previous pattern should be duplicated okay what where did this come from you know you want to know how this happened how did we reach at a certain point where we have an index that's not in our table in this case well uh you have to rewind the video and check how did we compress the last 11 11 but for now i will just give you the final answer whenever there's an entry that's exactly equal to the last index in the dictionary plus one i mean if it's the last index plus one that means duplication it means you should put the previous pattern plus the first letter of the previous pattern so the previous is 11 so the pattern is 11 11 and we are done okay one thing you might have noticed is that the index size is increasing so in the last example it went from 16 to 28 and if we kept going it would have kept going to until it's 32 or bigger and at that value we'll need to start using six bits instead of five bits so what do we do should we have started with six bits in the first place no we said from the beginning we will never look back we just go through the data one time so how do we know if you are going to use five bits or six bits or maybe seven or eight well in this case uh yeah at the fir at first we'll start using five bits per letter and then we upgrade to six bits and the longer the input stream gets the bigger length of each index gets you might ask also okay so what if the index gets really ridiculously big like uh what if the index kept going maybe beyond uh 32 bits or something what do you do in this case we have a very lengthy stream and we actually want to compress it but at the expense of having a very ridiculous long length well in this case yeah i agree with you it's really it's nonsense to actually use 32 bits when we actually could have used four bits from the beginning you see i mean we are using 32 bits for the data that was originally four bits and we are calling it compression so in this case we might also put a limit to the index size so if the index keeps growing and uh it keeps growing until for example let's say let's give it an arbitrary value let's say if it kept growing until 12 bits well in this case we have to stop we have to say okay that's enough we we stop growing like that and we start from the beginning and so in this case if we are compressing and the index grows beyond 16 sorry beyond uh 12 bits we just stop and reset the table and start again from the very beginning uh so in this case we reset also if you are decoding if you are decompressing if your dictionary which is the same as the which is going to be the same dictionary as the other dictionary during compression if you reach 12 bits you will have to stop and reset and lose all the patterns lose everything from the last dictionary that you have constructed and start over again i'm not saying it's the best algorithm ever it's just how it is it's a quick one okay so of course as we have seen since both the encoder and decoder are constructing and reconstructing the same dictionary the decoder should know exactly when will the dictionary become full which is when the last index reaches uh 4095. still the encoder has to drop a hint to reset the dictionary so uh we need like a code an index that when you see it means please reset the dictionary right now and start over again so we have to agree in a code on an index on something that when you see in the output stream it's a command for the for the decoder or for the decompressor to stop and reset the table and start over again from the initial size which was four bits or five bits in our case and this index uh is neither like an index in the table nor a data it's some arbitrary number that we should not use for data and we should not use as an index so um okay let's give it an arbitrary number we said our data is from 0 to 15 okay before we have okay so in this case we will not use 16 uh okay let's use 16 as a reset code so we start from 17 so the first index will be 17 right well we are going to use another code for ending the compression so if you see this code it means when i finish this compression the stream has ended please uh i don't know go do something else make other sense of the rest of the data it's not it does not belong to the same stream so we will need a stop code like an ending code and a reset code reset for the dictionary in case it gets ridiculously huge so we'll assign uh 16 in our case as reset code and 17 as ending code and then we start encoding from 18 not from 16 as we did in the last example so actually not exactly 16 and 17 it's simply it depends on your size so if you are using 8 bits data then you start with 9 bits and then we will use 256 as reset and 257 as end if you're having four bits data you'll use 16 and 17 and so on it could be 32 and 33 it could be 64 and 65. it depends on your length on your initial length okay so that's how the algorithm works that's the lzw algorithm maybe you have gotten some ideas about compression uh maybe you can take a look at other algorithm and get more ideas don't forget to like and subscribe and i will see you next time
Info
Channel: DevPal
Views: 1,697
Rating: undefined out of 5
Keywords: lzw, algorithm, algorithms, lempel-ziv-welch, compression, decompression, decompress, compress, patterns, programming, development, coding, c++
Id: KJBZyPPTwo0
Channel Id: undefined
Length: 25min 8sec (1508 seconds)
Published: Wed Feb 17 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.