Huffman Codes: An Information Theory Perspective

Video Statistics and Information

Captions
In 1951, David Huffman, a graduate student at MIT, was in an information theory class taught by Professor Robert Fano. Fano gave his students a choice between writing a final exam or writing a term paper. Anyone who's ever taken an engineering class knows that the exams are no joke, so you can't blame Huffman when he decided early in the semester to write a term paper. The topic of the term paper was as follows: given a set of letters, numbers, or other symbols, find the most efficient method to represent them using a binary code.

At a fundamental level, this was a problem about data compression. The text, images, and videos you see on a computer are just a set of letters, numbers, or symbols; boiled down to the simplest representation, they are all just a set of zeros and ones. Each data type has a standard representation in bits, and the goal in this problem is to compress the information they contain into the fewest number of bits possible. It's a problem that has existed since the start of computing. What Fano neglected to tell his students, however, was that at the time this was an unsolved problem in the field of information theory and data compression. Huffman actually solved this problem as a graduate student, and his solution was the Huffman encoding algorithm.

Now, I could just tell you the Huffman encoding algorithm and call it a day, but where's the fun in that? As we like to do on this channel, we're going to take a little bit of a journey through the process of how we might rediscover this algorithm. Along the way, this gives me a good excuse to dive into the fundamentals of information theory and some of the key ideas that led to modern techniques in data compression. I think that by the end of the video, no matter what your familiarity is with Huffman encoding, you'll get something meaningful out of this journey.

When we're faced with an open-ended problem, it helps to have some tools in our arsenal to break down the core problem. To formalize the problem, let's introduce a common setup. We're going to define a communication network where we have a sender of information and a receiver of information. We often refer to the sender as the source. In our network, the source wants to transmit some original message. The only means of communication in the network is through bits, so the source has to find some way to encode the message according to some mapping between symbols and binary codes. Before the message gets to the receiver, the encoded message must be decoded so that the receiver can get back the original message.

There are a lot of moving parts in this setup that I want to make sure we don't take lightly. We need to be aware of all parts of the setup, but compression specifically focuses on the encoding step. There are tons of ways we can encode a message, some ridiculous and some super clever, but the fundamental goal of an optimal compression algorithm is to find an encoding that uses the least number of bits possible.

There are some inherent requirements here. The first is that each symbol must be mapped to a unique binary code. This is a constraint of the problem Huffman was trying to solve, meaning that an encoding scheme that maps a combination of symbols to the same binary code is not valid in this context. The next constraint is that the receiver must be able to decode the exact original message; we can't lose any information in this process. Notice that in this particular encoding, if we try to reduce the number of bits by randomly removing a bit from the message, we end up losing the original message. This constraint is often referred to as lossless data compression. The third major constraint is that, given an encoding scheme, the decoder must be able to unambiguously decode the original message. This constraint is often referred to as unique decodability. Take this slight modification to the current encoding scheme, where we try to reduce the length of our message by one bit: the problem here is that it's impossible for the decoder to figure out which message was sent. This is an example of an encoding that is not uniquely decodable.
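As a quick illustration of what losing unique decodability looks like, here is a tiny toy code of my own (not the encoding shown in the video) where two different messages produce the same bit string:

    # A toy code that is NOT uniquely decodable (my example, not the one in the video).
    codes = {"A": "0", "B": "1", "C": "01"}

    def encode(message):
        return "".join(codes[symbol] for symbol in message)

    print(encode("AB"))  # -> "01"
    print(encode("C"))   # -> "01" as well; the decoder cannot tell "AB" and "C" apart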
The solution that Huffman created satisfies all of these requirements, and he also proved that there is no better encoding scheme given these constraints. By the end of this video we'll see why this is true, but to get there we're going to need some information theory.

To introduce that, let's focus a little bit on the lossless compression constraint and some of its deeper implications. If we think about what we need in an optimal lossless compression algorithm, it ends up being fundamentally a balancing act. On one side, we have to be careful about not using too many unnecessary bits; but if we use too few bits, we end up losing information from the original message. This brings up an interesting question: it seems like there must be a limit on the amount of compression possible that still retains the original message. In this situation, it turns out that if we try to compress our encoded message to fewer than 15 bits, we are guaranteed to lose some information. But what is that limit in the general case? That, my friends, is the exact question Claude Shannon asked in this exact context, and the answer is the foundation of information theory.

In a sense, asking about the limit of how much we can compress a message is fundamentally also asking how much information is contained in that message. The question now is: how do we objectively measure information? This is pretty vague, but let's play around with some ideas. Suppose we wanted to create some metric for measuring information. What properties would we want? One thing Shannon realized while thinking about this problem is that information is somehow related to uncertainty. For example, suppose I gave you the following two statements: "it's snowing on Mount Everest" and "it's snowing in the Sahara Desert." Which one provides more information? I think we can agree that number two is the expected answer here. Finding out it's snowing in the Sahara Desert is way more surprising than finding out it's snowing on Mount Everest, and that, in a sense, should give us more information. Thought experiments such as this one were a good indicator that the probability of an event occurring is related to the amount of information tied to that event.

Specifically, a key property Shannon realized was essential to appropriately defining a measure of information is that information is inversely related to probability. This was one of four key axioms for Shannon's definition of information. Let's denote this information metric as I(x), where x is some event that occurs with probability P(x). Any metric for information should be non-negative: observing an event should never cause you to lose information or increase uncertainty. The third key property is that an event that occurs with 100% certainty yields no information. And lastly, if two independent events are observed separately, the total information from these observations should equal the sum of the information of each event individually.

So the question now is significantly more precise: is there a set of functions that satisfies these particular properties? In fact, it can be proven that the only functions that satisfy all four properties must be of the following form. This was the definition of the quantity Shannon called self-information, and it is exactly the type of metric we want.
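For reference, the formula shown on screen at this point is the standard definition of self-information; written out in my notation:

    I(x) = \log \frac{1}{P(x)} = -\log P(x)

Since P(x) is at most 1, this quantity is never negative; a certain event with P(x) = 1 gives I(x) = 0; and for independent events the logarithm turns P(x)P(y) into I(x) + I(y), which is exactly the additivity axiom.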
The unit of this quantity depends on the base of the logarithm; for our purposes we'll use base 2, which gives self-information in units of bits. Self-information gives us a nice measure of information when we observe a particular event x that has a defined probability associated with it. For example, if you observed that a fair die roll came up as a 6, we could say the observation yielded about 2.6 bits of information; a coin toss landing on heads gives one bit of information.

But self-information is only the start of understanding how to measure information. One of Shannon's biggest contributions was extending this notion of self-information to general probability distributions. Suppose I gave you the following distributions for weather in Seattle versus weather in Phoenix. A natural extension of self-information to a general probability distribution is to just take the weighted average of the self-information of each event. Shannon called this quantity the information entropy of a distribution. Here are the entropies for our weather distributions. A nice intuition for entropy is that the more unpredictable a distribution is, the higher its information entropy. You would expect the average person in Seattle to check the weather more often, since it's harder to predict than the weather in Phoenix, so this matches the intuition.

The best way to see this in action is by looking at the entropy of a distribution describing a biased coin. Among all possible probabilities of getting heads on a biased coin flip, the entropy is maximized when each outcome is equally likely, which is exactly when it's hardest to predict what the outcome of a flip will be.
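Here is a minimal sketch (my own code, not from the video) that computes the entropy of a biased coin for a few biases and confirms the maximum at a fair coin:

    import math

    def entropy(probabilities):
        """Shannon entropy in bits: sum of p * log2(1/p) over outcomes with p > 0."""
        return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

    # Entropy of a biased coin for several illustrative values of P(heads).
    for p_heads in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(p_heads, round(entropy((p_heads, 1 - p_heads)), 3))
    # The output peaks at 1.0 bit when p_heads = 0.5, the hardest coin to predict.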
Some of you may be wondering at this point why Shannon called this measure "entropy"; it seems a little random. It turns out that Shannon actually had some trouble naming it. The origin of the name comes with some help from John von Neumann, another huge figure in computing. When Shannon asked him what he should name this quantity, von Neumann mentioned that the word entropy had been used in a similar context in statistical mechanics, but that the more important reason to use it was that no one really knows what entropy is, so it helps in winning debates. Moral of the story: sometimes even some of the greatest engineers don't think too hard about naming conventions, so you shouldn't either.

The cool aspect of entropy is that it has a lot of implications for compression and encoding algorithms. Let's go back to the original source-and-receiver setup. A large focus of Shannon's work in this domain was thinking not necessarily about individual messages, as we've done so far, but about a general set of symbols sent according to some distribution. One simple example is the following set of symbols, where the source is equally likely to send A, B, C, or D. Suppose we use the following encoding scheme. We want to have some metric to compare encoding schemes with each other, and one common metric used in information theory is the average length per symbol. It's calculated by weighting the length of each symbol's encoding by the probability that the symbol is sent by the source. So with this particular encoding and the initial distribution, the average length per symbol is 2.25 bits, while with this other distribution the average length per symbol is slightly smaller.

Now here's the really interesting discovery. Let's say the source sends a message with symbols from this distribution. Shannon discovered that as the length of the message approaches infinity, for any lossless compression method, the fundamental lower bound on the average length per symbol is the entropy of the source distribution. No matter what encoding scheme you try, you cannot reduce the average length per symbol below the entropy without guaranteeing that you lose some information. This was a key part of Shannon's source coding theorem. There is a formal proof of this theorem that relies on some more advanced concepts in information theory, and I'll link it in the description, but let's get an intuitive sense for why it's true.

The entropy in this particular situation, where we are equally likely to send each symbol, is two bits. We have some theoretical understanding of entropy, but if you're like me, this number doesn't really mean anything unless we have some notion of where it comes from. For that, let's put ourselves in the shoes of a receiver. Say a source sends a symbol according to this distribution, and suppose the receiver needs to ask yes-or-no questions to figure out which symbol it is. What's the best set of questions to ask? The goal here is to ask the fewest number of questions needed to determine the exact symbol. Here we have four possible symbols, all equally likely, so the first question should ideally narrow our search to two symbols, and from there we can pinpoint the exact symbol. Notice that the best the receiver can do is two questions, and if we treat each yes-or-no response as a bit, this defines the best possible encoding scheme for a compression algorithm. The essence of Shannon's source coding theorem is that you cannot do better than an average length of two bits with any encoding scheme for this particular distribution, and this diagram provides a good intuition for why that's true.

When the symbols aren't equally likely, it gets more interesting. We could use the same encoding scheme as before and once again get an average length of 2 bits, but is this really the best we can do? The entropy in this situation is 1.75 bits. Can we get closer to the lower bound? In some situations we can't, but in this case there are some inefficiencies in our current encoding. Take a second to see if you can come up with a better encoding.

If you think about what we did in the situation of equally likely symbols, it all boiled down to splitting our distribution into equally likely groups. Let's see if we can try the same thing here. With this distribution, it makes sense to start off by sorting our probabilities in ascending order, where we can then find our first split. On the right side we only have one element, so we're done there; the remaining work is all about splitting the left group of symbols. It's not too difficult to see the rest of the appropriate splits, and we continue splitting until every symbol is no longer part of a group. We can translate this easily to our earlier question diagram, and from that we can get an encoding. Notice that the average length per symbol is now exactly the lower bound provided by the entropy. The key idea here is that we gave more likely symbols shorter encodings. That's a big theme in data compression: by giving more likely symbols shorter encodings, we reduce the average length per symbol and get better overall compression.

One interesting coincidence here is that in our examples the entropy ended up being exactly equal to the average length per symbol.
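To make the numbers concrete: the source coding theorem's bound is \bar{L} \ge H, and the improved code described here meets it with equality. The exact symbols and codes are my reconstruction, assuming the distribution is P(A, B, C, D) = (1/2, 1/4, 1/8, 1/8) with the code A -> 0, B -> 10, C -> 110, D -> 111, which is consistent with the 1.75-bit figure quoted above:

    H = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8 = 0.5 + 0.5 + 0.375 + 0.375 = 1.75 \text{ bits}

    \bar{L} = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 = 1.75 \text{ bits}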
The astute among you may have noticed that the distributions I picked are nicely defined such that each symbol's encoding length is exactly equal to the number of bits in the self-information of that symbol. That's why, in these situations, the average length per symbol will always be exactly equal to the entropy of the distribution.

So we just saw two examples of distributions where we came up with the optimal encoding, which is the goal of the original problem given to Huffman. A natural question from here is: can we extend this to any general distribution? This is exactly what both Shannon and Fano tried to do. In fact, what we did here basically inspired one of the first widely used compression algorithms, proposed by Shannon and Fano, which they aptly named Shannon-Fano coding. When you invent fields like information theory, you can basically name everything after yourself.

Shannon-Fano coding attempts to approximate the idea of splitting our symbols into equally likely groups, and proposes an algorithm to do so on general distributions. I find it easiest to understand with an example. Given these probabilities, the first step in Shannon-Fano encoding is to sort the probabilities in increasing order. We then find the best split of the probabilities into two groups such that the sums of the two groups are as close to each other as possible. The algorithm then repeats the same process recursively on each group until every group has one symbol. The encoding is simple once we have the splits: every time we go to the left group we add a zero, and every time we go to the right group we add a one.

One nice aspect of this algorithm is that, by using trees, every code has a unique path in the tree, so when we put this together in an encoding for a message, there is no ambiguity for the decoder. Decoding boils down to just traversing down the tree until we find a symbol, going left if we encounter a zero and going right if we encounter a one. The types of codes that Shannon-Fano encoding produces are often called prefix-free codes, since no code is a prefix of another code. They can always be represented by a full binary tree such as this one, where every node has either zero or two children. This is a key property we'll come back to, so keep it in mind.
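Here is a rough sketch of that recursive splitting procedure (my own rendering of the description above, not code from the video): sort the symbols by probability, find the split point that best balances the two groups' totals, assign 0 to one group and 1 to the other, and recurse.

    # Sketch of Shannon-Fano splitting as described above (not the video's code).
    def shannon_fano(symbols):
        """symbols: dict mapping symbol -> probability. Returns dict symbol -> code string."""
        codes = {s: "" for s in symbols}

        def split(group):
            # group: list of (symbol, probability) pairs, kept sorted by probability.
            if len(group) <= 1:
                return
            total = sum(p for _, p in group)
            running, best_index, best_diff = 0.0, 1, float("inf")
            for i in range(1, len(group)):
                running += group[i - 1][1]
                diff = abs(running - (total - running))
                if diff < best_diff:
                    best_diff, best_index = diff, i
            left, right = group[:best_index], group[best_index:]
            for symbol, _ in left:
                codes[symbol] += "0"
            for symbol, _ in right:
                codes[symbol] += "1"
            split(left)
            split(right)

        split(sorted(symbols.items(), key=lambda item: item[1]))
        return codes

    # Hypothetical distribution, just to exercise the function:
    print(shannon_fano({"A": 0.35, "B": 0.25, "C": 0.2, "D": 0.15, "E": 0.05}))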
For a couple of years, Shannon-Fano coding was the best scheme people had for performing lossless compression, and it produces optimal codes for a variety of distributions. However, Shannon-Fano coding does not always produce an optimal encoding, meaning there sometimes exists another encoding scheme with a better average length per symbol. For example, on this particular distribution, Shannon-Fano coding produces an average length per symbol of 2.31 bits, but the scheme that Huffman came up with, which is always optimal, produces the following encoding, which just barely beats Shannon-Fano. The question now is: how in the world did Huffman find a better algorithm that both Shannon and Fano missed?

Let's take a step back and revisit the requirements of our original problem. We had three central requirements, and we see that Shannon-Fano coding satisfies them through its use of prefix-free codes and full binary trees to represent the encoding. The additional constraint we want when dealing with optimal compression is that our encoding has the minimum possible average length per symbol. We just saw an example of how Shannon-Fano fails here, so let's dive deeper into this requirement and see how we can come up with something better.

Since Shannon-Fano encoding solved a lot of our requirements, let's use the framework it provides, full binary trees representing prefix-free codes, to figure out how we can improve upon the average length per symbol. One nice way to calculate the average length per symbol is by using the depths of the leaves in the tree, since a symbol's depth corresponds exactly to its encoding length. What does this formulation immediately imply? One immediate fact is that the optimal full binary tree must have the two least likely symbols at the bottom of the tree. In other words, it can never be the case that some other symbol in the optimal tree has a longer encoding than the two least likely symbols. This makes sense intuitively, since we want less likely symbols to generally have longer encodings, but it is powerful. What it implies in general is that, given any set of symbols, we can take the two least likely symbols, create an internal node whose probability is the sum of theirs, and this subtree is guaranteed to be part of the optimal tree. Now the question is: how do we construct the rest of the optimal tree?

This is where something pretty clever, and by no means obvious, comes into play. The formulation of the average length per symbol using leaf depths is useful and fairly clear, but there is actually one more method that works. One nice aspect of how these trees are built is that the average length per symbol is also the sum of the probabilities of every node in the tree except the root. Why is this true? Let's compare this with our earlier calculation. The best way to see it is by expanding each internal node as the sum of the individual symbol probabilities it represents. Notice that when we sum over all the non-root nodes this way, each symbol's probability gets included exactly as many times as its depth in the tree, so the two formulations will always be equal. There are formal proofs of this fact that aren't too difficult to construct based on this idea, but this is a nice way to get the intuition for why it's true.

Alright, so how does this formulation help us? Let's try to be as general as possible and formalize things a bit. In both of our formulations, notice that the average length per symbol is entirely a function of the probabilities, since the probabilities of the symbols determine the optimal encoding and, as a result, the optimal tree. Our goal is to minimize this function for a general set of symbols with a defined probability distribution; I'll refer to this function as the cost of our tree. We know from earlier that the optimal tree will for sure have the two symbols with the smallest probabilities at the bottom, which ends up creating a new node. Let's add this new node to the set of nodes that we need to process. Now here's the cool part: using our new formulation, where the cost of the tree is the sum of all nodes in the tree except the root, the cost can be reformulated as the sum of the two smallest probabilities plus the cost of the optimal tree that contains this new internal node as a leaf along with all the other symbols and their respective probabilities. In a sense, you can think of this as separating the two bottom nodes from the rest of the tree. For an optimal encoding, the rest of the tree is the optimal tree for a set of n minus one nodes with this probability distribution, and the beautiful thing here is that this problem is really just a smaller version of our original problem. The way we minimize this cost function is the exact same process: take the two smallest probabilities from the set, make a new internal node from them, and continue building the tree in this fashion. This is the essence of the Huffman encoding algorithm.
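In symbols (my notation, summarizing the argument above): if symbol i has probability p_i and sits at depth d_i, and p(n) denotes the sum of the probabilities of the leaves beneath node n, then the two equivalent cost formulations are

    \bar{L} \;=\; \sum_{i} p_i \, d_i \;=\; \sum_{\text{nodes } n \neq \text{root}} p(n)

and merging the two smallest probabilities p_1 and p_2 into a single node gives the recursion

    \text{cost}(p_1, p_2, p_3, \ldots, p_n) \;=\; (p_1 + p_2) \;+\; \text{cost}(p_1 + p_2,\; p_3, \ldots, p_n)

which is the smaller subproblem the algorithm solves over and over.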
Let's put this all together and show how it works on the example we've been looking at so far. We start off with the two lowest-probability symbols, making a new internal node with the sum of their probabilities. Now, the key point is that we repeat the same idea, but on a set of three symbols plus this new node with probability 0.31. We once again pick the two lowest-probability elements and make a new internal node, as we did before. The key idea here is that these two internal nodes are guaranteed to be part of the optimal tree. Now we have two internal nodes left along with one symbol; the two elements with the lowest probabilities are the internal nodes, so we make a new node from the sum of those two. We now have two remaining elements, one a node and the other a symbol, which means we can take them and create the root of our tree. And that right there is the Huffman tree, with the guaranteed optimal encoding for this distribution.

One question I'm sure has been on your mind is how this applies to real-world compression of text. So far we've been looking at this algorithm from a fairly theoretical perspective, but in practice we'll just get some text with no associated probability distribution. Fortunately, we can still use Huffman encoding with a small modification: instead of thinking of symbols with associated probabilities, we use the frequencies of characters to determine the encoding, with higher-frequency characters generally getting the shortest encodings. Once you have the counts of each character in a text file, Huffman encoding is the exact same process as before: the two lowest-frequency nodes are repeatedly combined into new nodes until we have the entire Huffman tree.

Let's talk about the implementation of Huffman encoding. The specific part of the implementation we'll discuss is how to build the Huffman encoding tree given some text. The first piece of information we need is the frequency of each character in the text. Then, for our tree, we want to define a node class that stores each character with its relevant frequency. We also want to implement a way to compare nodes, since comparing frequencies is one of the core parts of finding the optimal encoding. Now, the key implementation detail is that one of the simplest ways to implement Huffman encoding is with a priority queue, a handy data structure that efficiently keeps track of the minimum element in a collection. Using a priority queue makes it easy to repeatedly remove the two smallest elements, form a new node from them, and then insert the new node back into the queue. At the end, the last element remaining in the queue is the root of the Huffman encoding tree.
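As a rough sketch of the construction just described, here is one way it might look in Python, using the standard library's heapq module as the priority queue. This is my own code following the description above, not the code shown in the video; the Node class layout, the codes_from_tree helper, and the example string are my additions.

    import heapq
    from collections import Counter
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Node:
        # Nodes compare by frequency only, which is what the priority queue needs.
        freq: int
        char: "str | None" = field(compare=False, default=None)
        left: "Node | None" = field(compare=False, default=None)
        right: "Node | None" = field(compare=False, default=None)

    def build_huffman_tree(text):
        """Build the Huffman tree for `text` and return its root node."""
        frequencies = Counter(text)                 # character counts
        heap = [Node(freq, char) for char, freq in frequencies.items()]
        heapq.heapify(heap)                         # min-priority queue keyed on freq
        while len(heap) > 1:
            first = heapq.heappop(heap)             # the two lowest-frequency nodes
            second = heapq.heappop(heap)
            merged = Node(first.freq + second.freq, left=first, right=second)
            heapq.heappush(heap, merged)            # put the combined node back
        return heap[0]                              # the last remaining node is the root

    def codes_from_tree(node, prefix=""):
        """Helper (not described in the video): left edges add '0', right edges add '1'."""
        if node.char is not None:                   # leaf: a single character
            return {node.char: prefix or "0"}
        table = codes_from_tree(node.left, prefix + "0")
        table.update(codes_from_tree(node.right, prefix + "1"))
        return table

    # Example text (any string works); more frequent characters get shorter codes.
    print(codes_from_tree(build_huffman_tree("abracadabra")))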
So there you have it: that's Huffman encoding. Not only is it the optimal encoding algorithm under the given constraints, it's surprisingly simple and elegant; so much so that, in retrospect, it's easy to look at the Huffman encoding algorithm and think, "hey, that's kind of obvious, right? Of course we should give the least likely symbols the longest encodings and build the tree up from there." And I think, to an extent, that's a common thought when you first encounter this algorithm. But remember, some of the world's best minds in information theory were stumped by this very problem. The predominant compression method before Huffman encoding was Shannon-Fano coding, which looked at this problem from a top-down perspective, trying to optimally split symbols into groups. A lot of the time this works quite well and produces the optimal encoding, but not always. The cool shift in perspective that Huffman came up with, which guarantees optimality, was to build the tree from the bottom up. The smallest shift in perspective can sometimes make all the difference.

I'll freely admit that the way we covered Huffman codes in this video is definitely not standard. In fact, when I first learned Huffman codes, it was in an algorithms class that mentioned them as an example of a greedy algorithm. It wasn't until I saw them in an information theory context that I gained a true appreciation for them. If we look back on the story here, it all started with the fundamentals of information theory, which directly motivated Shannon-Fano coding; Huffman then provided the key improvement that guaranteed optimality. Seeing how this all comes together is just immensely satisfying to me. I hope you all enjoyed this journey into what I truly think is one of the more underappreciated stories in the world of computer science. Thanks for watching, and I'll see you all in the next one.
Info
Channel: Reducible
Views: 75,286
Rating: 4.9650826 out of 5
Keywords: Huffman_event
Id: B3y0RsVCyrw
Length: 29min 10sec (1750 seconds)
Published: Fri Jul 30 2021