My Favorite Algorithm Cheats using Probability

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
if I gave you three cubes and asked how many colors there are well it's not that hard of a question but what about a million cubes how many unique colors are in this pile and how would you even go about figuring that out today we're talking about hyper log log an algorithm which does just that now before you click away you don't need to know anything about programming to understand how this works it simply takes a little observation about flipping coins and extrapolates that out to its logical conclusion this is such an important algorithm that despite being relatively new it's used by every major tech company and most modern scalable databases Google even spent time and resources refining this technique I think you'll find this really interesting even if you don't know anything about programming so the problem at hand is counting distinct or unique elements our example had three distinct colors we could add more Cubes but still only have three distinct colors red green and blue as humans we can just look at this and immediately know how many colors there are but how would a computer count those colors well one solution might be to keep a stack of Cubes that we've already seen and then reference that whenever we need to check a new Cube to see if it's a new color or not so we take our first Cube and place it on the stack but because this is the first one there's nothing to do except put it on the stack then we pick up the next Cube and check the stack to see if we've seen this color before we haven't so we've also put this one on top of the pile we repeat this process but this time we have a duplicate and we know it's a duplicate because when we scan through our little Tower of Cubes we can see that there's a matching Cube on the tower already so we can just discard it then we just continue this process adding new colors and discarding duplicates until we get done with our mountain of Cubes when we're done we're left with a stack representing all the colors that were in the original pile with all the duplicates removed and then we just count the number of colors in the stack to determine how many distinct elements there were this definitely works but I think you can appreciate that it will take a long time if you have a lot of items and many colors to check each time we pick up a new Cube we have to check all the items in the stack to determine if it's a new color instead of using a stack of Cubes we could instead use this magic chest when we throw a cube inside the chest it will place that Cube on a shelf depending on its color red goes to this spot on the shelf and blue goes to that spot on the Shelf it's a magic chest after all so it has spots for every imaginable color inside to count the number of distinct elements we just toss all of our cubes inside the chest it will Nom its way through all the cubes sorting them into their individual spots on the shelves and at the end we just ask it how many unique colors that it ate the problem with this approach is that while the chest is Magic physics is still physics and inside that chest it needs a huge amount of shelving for all the potential colors even if we didn't actually see those colors in our pile and this is where hyperlog log comes in it's faster than the stack of cubes and it takes up a whole lot less space than the magic chest to understand how this algorithm Works we're going to flip some coins specifically we're going to look for a run of heads or tails so if I flip the coin five times and get five heads we consider that a run of five mathematically the chance of this happening is one over two to the N so the probability of getting five heads in a row is 1 over 32. the probability of this isn't really that extreme so we could probably flip coins for I don't know like an hour and get five heads in a row but how about 20 heads in a row well that works out to one over one million forty eight thousand five hundred and seventy six you might be flipping that coin for weeks before you got 20 heads in a row it's not impossible it's just very uncommon and this is the key Insight which makes the whole algorithm work the length of your longest run is roughly equivalent to how long you've been flipping coins if you only have five heads in a row you probably just started flipping the coin and if you've seen a hundred heads in a row well you've probably been flipping that coin for a really really long time hyperlog log uses this concept but instead of flipping coins we're going to look at binary sequences and check for runs of zeros so we'll take our original Cube and pass it through this mechanical transmogrifier which spits out a binary sequence the details of this really aren't that important but what you do need to know is that when you put in a Red Cube you always get out the same binary sequence and when you put in a blue cube you get a different binary sequence but you always get that same sequence out so it's a consistent one-way transformation now we have everything we need to run the algorithm we transform our first Cube and get a binary sequence starting at the right hand side we count how many zeros there are in a row this is the Run of zeros that we're looking for in this case there are two zeros in a row so we record that and a scorecard off to the side and feed in the next Cube this binary sequence happens to have four zeros in a row which is larger than our current best run so we update our scorecard from two to four the next cube is a color that we've already seen and it transforms back to the same sequence that we had before with two zeros in a row this is less than our best score so we don't need to make any changes we just discard it because the transformation from color to Binary is consistent we can see that duplicates are essentially ignored because we've already accounted for that the first time that we saw that binary sequence and that is basically it we run through this process for the rest of the cubes updating the scorecard whenever we see a longer run of zeros in the end we're left with a number that represents the longest run of zeros that we've seen out of all the cubes we fed to the algorithm now here's the cool part remembering back to the coin flipping the length of a run is roughly proportional to how long we were flipping that coin so if the longest run of zeros we've seen is pretty small well that means we haven't really looked at that many colors just like we weren't flipping coins for all that long but if our final scorecard is really high chances are good that we've looked at millions of distinct colors to get there the chance of getting 20 zeros in a row in our binary sequence is the same probability as flipping 20 heads in a row so if we find a binary sequence with 20 zeros in a row we've probably seen on the order of a million different colors now this does mean it's a probabilistic algorithm it won't tell you exactly how many colors it's seen it's just guessing at how many colors it has probably seen but that guess is surprisingly good usually within two percent of the actual real distinct count and this algorithm can scale to trillions of items it's important to note that this only works with random events there's equal chances of getting heads or tails when flipping a coin and similarly our mechanical transmogrifier gives us a binary sequence that is completely random there's a 50 50 chance that any particular bit is a one or a zero but there is one notable Edge case that we need to talk about what happens if we just get unlucky maybe the first color has 20 zeros in a row it's unlikely but it's totally possible we've only seen this one item but the algorithm will estimate that we've already seen around a million distinct colors which is obviously pretty wrong the solution to this is just to keep multiple scorecards we use the first few bits of the binary sequence as an index value which picks the scorecard to use otherwise the algorithm is basically the same and when we're done we're left over with a list of numbers representing the longest runs rather than a single value and then we just take the harmonic mean of all these values to make our final guess by splitting it up into multiple scorecards we reduce the impact of getting unlucky and it also has some nice side benefits like being able to adjust the Precision of the algorithm if you want more accurate guesses you just increase the number of scorecards and if you want a more memory friendly algorithm you reduce the number these scorecards are relatively small numbers so you can actually compress them into five or six bits saving a fair amount of memory and the union of two sets is lossless and what that means is basically we can run this algorithm on two different computers and merge their scorecards together this gives us the total cardinality from both computers even though we ran the algorithm separately and didn't transfer any data between the two computers for example we might have three unique colors on both of our computers and so locally the number of distinct elements is three but there's a duplicate shared between the two computers so we can't simply add up the individual unique elements and get a total count of six because in this shared set there's really only five unique values for small data sets it's not a problem just to exchange all the data and count it in one spot but if you have large amounts of data this is just impossible to do and that's why hyperlog log is so helpful here we can calculate the local cardinalities and then merge all the results together to get a representation of the total cardinality of the whole system in practice this means we can run it on thousands of computers in parallel counting trillions of elements and then merge those results together to get a single estimate of the cardinality across all those computers the only real trade-off here is that you're getting an estimate of the cardinality rather than the True Value but that's often a perfectly acceptable trade-off since the alternative to calculate the true distinct count is an enormously expensive process now in an effort to keep this accessible to everyone I did gloss over some details and some of the finer nitty-gritty aspects of the algorithm there's a bunch of formal math and simulation data proving that this actually works so if you're interested in that there's links down in the description I know this video and the style is a little different than the sorts of videos I normally make but I've worked with these probabilistic algorithms a lot in the past and I just find them super interesting so I thought you know maybe you would too in some ways this is a pretty Advanced algorithm you won't find it in college textbooks or on a job interview but hopefully over the last 10 minutes or so I've convinced you that it's also not that complicated at its core it leverages a pretty simple idea I firmly believe that anyone can learn programming no matter how complicated it might seem if the concepts are presented in a visual and interactive manner if you've ever want to learn how to program but we're a little daunted on where to start I think brilliant.org is the best option like this video the lessons are all presented in a visual manner that helps develop a deep understanding of the concepts one of the things I love about the programming lessons on brilliant is the fast iterative nature you can make a change and see how it reacts immediately this short feedback loop means you can make a ton of progress really quickly which I find helpful to stay motivated when learning a new skill and Brilliant has a huge catalog of lessons ranging from programming to data structures and algorithms to foundational math when I was working as a full-time software engineer I often had to go out and learn new skills for a project or refresh old knowledge my calculus skills are pretty Rusty at this point but I have a moderate knowledge of statistics and so something like brilliant is great because as a selection of lessons ranging from beginner all the way up to expert for whatever topic you're working on if you've been thinking about taking up programming or have tried in the past and it just didn't stick I think you should give brilliant a try it's free for the first 30 days if you visit brilliant.org breaking taps or you can click the link down below the first 200 folks will also get 20 off brillian's annual premium subscription there's some other neat algorithms that do kind of related topics in a probabilistic manner so if there's interest maybe I'll make some videos about that in the future otherwise thanks for watching
Info
Channel: Breaking Taps
Views: 121,697
Rating: undefined out of 5
Keywords:
Id: lJYufx0bfpw
Channel Id: undefined
Length: 12min 5sec (725 seconds)
Published: Wed Jun 28 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.