CppCon 2017: Nicholas Ormrod “Fantastic Algorithms and Where To Find Them”

Captions
All right. Hello, everyone. And welcome to Fantastic Algorithms and Where to Find Them. An intern once asked me, "Nicolas, how often "do you write algorithms?" And the answer is rarely. Algorithms are tools. Hammers, and levels, and mops. And what makes a collection of tools impressive is not how many of them you made yourself, but rather, how many unique problems you can use your toolbox to solve. I hope that in today's talk, as we go through a couple fantastic algorithms, that there will be some tools here rare enough to be novel, but useful enough to be found in practice. So a little bit about me before we dive in, I am an Infrastructure Engineer at Facebook. I work in infra for two reasons. First of all, a lot of infra is services. Services are often written in C++, and I happen to like that language. The second reason that I like working in infra is that my clients are fellow engineers. If I've got a problem, something's going wrong, if I need to learn something about the use cases, I can just go and talk to another technical person. This leads straight into our first algorithm. Heavy Hitters. Despite my co-workers being engineers, people with no malicious intentions... I still get DoSed. Maybe a new feature gets turned on, maybe an upstream service gets DoSed which causes my service to get DoSed, or in this particular story, a command line interface used primarily by humans is called programmatically for the first time. This particular graph, those two spikes represent high latencies, specifically timeouts, as a result of too many queries coming into my service. Now, what I want to do, what I really would like is to guarantee to all the people who are using my service normally, that they will have a bare minimum amount of traffic. I do not want to starve any of my individual users in the event that I am overloaded. And so the ideal thing that I would like is I would like to have a map of the people who use my service to the number of times that they have made queries with me. The problem is that keeping that map is very expensive, it takes order n (O(n)) space. Even if I were to simplify my request, and instead just want to know the identity of the single user who was making the most requests, it would still take order n (O(n)) space to be done in a single pass. So fortunately, though, there is an algorithm that exists, Heavy Hitters, which does almost what I want. Heavy Hitters does not determine who the most frequent user of my table is, of my service is, however, what it does do is it does identify the users who single-handedly account for more than x percent of total traffic. For this special case, x equals 50%, some of you might have heard this algorithm under some different names, for x equals 50%, knowing which elements in an array occur strictly more than half the time, the algorithm's called Majority Element. Also known as Boyer-Moore's Majority Vote algorithm. And it works something like this. Imagine, for a moment, that we are all at a conference, and there's a lot of programmers who are attending the conference, and the programmers indicate by the color of their dress what their favorite programming language is. As the programmers enter the floor the passion of the moment overcomes them, and a furious fight breaks out. Programmers who represent different languages lock eyes and instantly do battle. They run at each other and try to knock the other out. 
Now, it happens to be that every single pair of programmers is perfectly evenly matched, and so when they run and punch each other, they both land a blow on the other's chin simultaneously, and both get knocked to the floor unconscious. The conference chair enters the room and notices that there is exactly one programming language left standing. Indeed, if there were multiple programming languages left standing, then one programmer from each of two distinct languages would lock eyes, run at each other, do battle, get knocked out. So there is exactly one language left. And so the conference chair smiles, 'cause he had a question that he wanted to have an answer to. He wanted to know whether or not there existed a programming language that enjoyed the support of a strict majority of all attendees. The conference chair's job is simpler now. Instead of counting the number of attendees who support every single programming language, he merely needs to count the number of attendees who support the programming language left standing. The reason this works... is because we see that pairs of people are knocked out where each member of the pair represents a different programming language. If there's n people, there's at most n/2 pairs of knocked out people. And if there's a programming language that is represented by more than n/2 people, then it cannot be present, then it cannot be completely knocked out in every single pair, thus will be left over. This conference room brawl translates very cleanly into an algorithm. The only difference is that you let the programmers into the room one at a time. Implement this as a counter-value pair, or sorry, a value-counter pair. You have a value, the current programming language which is winning, and some rather arbitrary integer representing the weight. So the first programmer walks into the room representing C++, there's no one else in the room, C++ is now the winner, the weight is one. A second programmer, a Java programmer, walks into the room, locks eyes with the C++ programmer, they do battle, they get knocked out, the counter goes back to zero. Third programmer walks into the room, C++. C++ is the winner, counter is one. Fourth programmer, again, C++ walks in, C++ now has a value of two, and the fifth and final programmer, a Perl programmer, walks in, locks eyes with one of the C++ programmers, they do battle, get knocked out. The algorithm's done. At the end of these five people, three of whom were C++, majority vote does indeed say that C++ is the winner, and there's a rather arbitrary value of one left in the counter. It is important after you do the first pass through to find your estimate of who the winner is, that you do a second pass through the array to verify that the returned element is indeed a majority element. We can imagine a case where this is not the case. Suppose there are a very large, odd number of pirates each of whom is voting for who the next captain of the ship is going to be. 1,001 pirates. And I've chosen pirates, because pirates are well known to always vote for themselves. And so the first pirate walks in and he's the winner. And the second pirate walks in, and they do battle, and there's no winner. Third pirate walks in, he's the winner. Fourth pirate walks in, oh they disagree, they do battle, no more winner. And the first 500 pairs of pirates go in and knock each other out until the last pirate, the 1,001st pirate, walks in and, "Hey, guys, I vote for Teddy." Teddy's the winner! 
No, he's not the winner, you have to do a second pass through the array. If a majority element exists, it will end up in the value, but whatever ends up in the value is not necessarily a majority element. This is the x equals 50% version of the algorithm. It extends very cleanly and very naturally to percentages other than 50%. Specifically to percentages of the form 1/k for positive integers k. The only trick that you have to change is that, inside the conference room, which could previously accommodate only one programming language, now allow there to be k minus one distinct programming languages existing in harmony. And it's only when, say for k equals five, a fifth programming language walks in that the room becomes too crowded, and that new programming language plus the other four distinct programming languages, one person from each group, do battle and leave. And so they do battle in groups of k, therefore, no one who exists more than 1/k of the time can be completely knocked out. So coming back to the service that I own that was getting DoSed. I was upset that this was happening. I went and I talked to that engineer, because I'm an adult. But I also put up some code based on Heavy Hitters to make sure that, if this happened again, I'd be able to throttle the people who were single-handedly accounting for large percentages of my resources. I had one small variant, though. Heavy Hitters does not have a concept of time, so I did Heavy Hitters in five-minute intervals. It was a long time ago, there was a little bit of trickery going on. I can't show you that code, though. But some code that I can show you is this majority vote implementation, found inside of Libgcc. I was talking to Nathan Sidwell yesterday. Nathan Sidwell, by the way, is going to give a talk tomorrow on modules, implementing modules inside of G++, which I would advise going to. But I was telling him about Heavy Hitters, and he was like, "Hmm, you know, this code sounds familiar to me." And sure enough, Nathan found this Heavy Hitters implementation inside of Gcov. What this code is doing is it is trying to find whether or not there is a majority argument for profiling purposes. During profiling, this extremely cheap, extremely simple counter-value pair is used to determine whether or not there's a specific argument that occurs the vast majority of the time. There's one slight hiccup, though, in the Gcov implementation. Typically, at the end of the Heavy Hitters implementation, you have to do a second pass to verify that the given argument does indeed occur more than 50% of the time. And Gcov can't do a second pass. So, instead, Gcov has this interesting trick where, at the very end, they check the counter and they see if the counter is higher than 50% of the total number of calls. If this is true, then we know for sure that that argument occurred at least 50% of the time. However, we don't know that an argument that occurred 51% of the time is guaranteed to make this cut. It is guaranteed, though, that an argument that occurs at least 75% of the time will, once the algorithm completes, have a value in the counter at least equal to 50% of the total. So this is a rather nice, nifty optimization that Gcov uses to profile your code so that your code can be optimized.
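As a rough illustration, here is a minimal sketch of the value-counter majority vote described above, including the second verification pass that Gcov has to skip. The function name and the use of std::optional are mine, not from the talk:

```cpp
#include <optional>
#include <vector>

// A sketch of the value-counter majority vote described above.
// Returns the element that occurs strictly more than half the time,
// or nothing if no such element exists (hence the second pass).
template <typename T>
std::optional<T> majority_element(const std::vector<T>& items) {
    if (items.empty()) return std::nullopt;

    // First pass: the "conference room brawl".
    T candidate = items.front();
    std::size_t count = 0;
    for (const T& item : items) {
        if (count == 0) {             // room is empty, a new language is standing
            candidate = item;
            count = 1;
        } else if (item == candidate) {
            ++count;                  // another supporter walks in
        } else {
            --count;                  // two programmers knock each other out
        }
    }

    // Second pass: verify the survivor really is a strict majority.
    std::size_t occurrences = 0;
    for (const T& item : items) {
        if (item == candidate) ++occurrences;
    }
    return occurrences * 2 > items.size() ? std::optional<T>(candidate)
                                          : std::nullopt;
}
```

The 1/k generalization works the same way, except that it keeps k minus one value-counter pairs instead of one.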
All right. I've got another algorithm for you. Who here was at CppCon last year? That's actually fewer people than I thought, that was like less than half. Who here, regardless of whether or not you've been at CppCon last year, listened to Herb Sutter's talk Leak-Freedom in C++? That looks like quite a few more hands. Not everyone's hands, but more. For those of you who haven't seen Herb's talk, I highly advise, go and listen to it, it's a great talk about how to avoid leaking memory inside of your programs. And one of the most important things that Herb hammers home in his talk is not to manage pointers yourself. Do not use new and delete. Instead, rely on unique_ptrs and shared_ptrs when you actually have data that has to outlive your stack frame. And one of the examples that Herb had in his talk was a tree implementation. And instead of having a tree* left and a tree* right, Herb instead had unique_ptrs. And this is very nice, because those unique_ptrs are going to be cleaned up successfully, regardless of how your tree gets destroyed. You could have an exception thrown in your constructor and those child trees are going to be cleaned up, and you're not going to have a leak. There's a problem, though, Herb remarked. The problem is that, in the wild, not all trees are balanced. And so if you happened to have an unbalanced tree, in the worst case, just a straight line like a linked list, what happens when you free the root node? The root node's going to be freed, but in order to do that, it first has to call the destructor on its left child unique_ptr. So six calls the destructor of two, two calls the destructor of one, and so a lot of destructors get pushed onto your stack, and you can smash your stack. This isn't good. And so, at the end of the presentation, during Q&A, Eric Niebler went to the microphone and said, "Herb, did you know that there exists a way to destroy a tree using order one space overhead?" Before I go on, I've got a small little story about Libgcc again for you. I went and I checked Libgcc's implementation to see whether or not they were doing a stack-smashing destruction call, or if they were doing some fancy-schmancy order one space destruction call. And the answer is that, inside of Libgcc, their red-black tree implementation uses this type of destruction. It just recursively calls the destructors, comes back up, pushes everything onto the stack. And the reason that makes a lot of sense is because the STL's red-black tree implementation is guaranteed to be balanced. So the height's going to be proportional to log(n), and you're not going to smash your stack with log(n) space. But back to the fact that Herb has said that there do exist cases when trees are imbalanced. If you implement some not-necessarily-balanced tree, you have to be careful about smashing your stack in recursive contexts. So what are we going to do? The first answer is I'm going to step it up a notch. When you want to destroy a tree in order one space, there's something really handy you can do. Which is that you can muck with all of your tree's internal invariants with reckless abandon. You can rotate whatever internal node you want, don't worry about it, the tree's getting destroyed. So I'm going to throw an extra challenge on top. I would like to iterate through my tree using order one space overhead, but I'd like my tree to be in the same state at the end as it was at the beginning. So what are we going to do? And the first thing we're going to do is we're going to remember a really cool implementation trick to save space inside of recursive languages. Languages like Scheme.
See, one thing in recursive languages like Scheme is there's very frequently what's called tail recursion. Which is when the value you return is just the answer to a smaller call of your own function. Now how does this relate to trees? Well, suppose, for a moment, that we don't have a left subtree. When we do an inorder traversal on node six, there's no left subtree to visit, so we don't visit it, we visit ourselves and then we tail-recurse on the right-hand side. And that right-hand side, that node seven, does not care that six was its parent. Node seven never has to come back up the stack to see that node six was the parent. And so why bother keeping all the state for the stack frame about node six on the stack? Instead of doing recursive call, recursive call, recursive call, what languages like Scheme mandate is that the compiler perform tail call optimization. So six is in the stack, seven is the child, and then instead of just pushing seven onto the stack, seven should replace the data for six in the stack frame, turning the entire program from a recursive function into a loop. The difference between Scheme and C++ is that C++ does not mandate that this optimization be performed; however, your major compilers will perform this optimization at optimization levels higher than -O0. Gcc, I did a quick experiment, seems to turn on tail call optimization at -O2, and Clang at -O1. Neither of them, though, turns on tail call optimization at -O0. Because with -O0 the compiler is trying to turn your code as faithfully as possible into assembly code, and that means having order n stack calls. Back, though, to Morris Traversal. If our tree does not happen to have a left subtree, we can do an inorder traversal in order one space. The problem now is that trees would be kind of boring if they didn't have left subtrees. What are we going to do when we do have a left subtree? Because we need some way of getting back to the root node. We need a way to visit the left subtree, and then come back. And the way that Morris Traversal's going to do this is Morris Traversal is going to find a node in its left subtree, and set that node's right pointer to point back to itself. And the specific node that Morris Traversal is going to pick will be the predecessor. So Morris Traversal's going to notice that it has a left subtree, it's going to go one step to the left and as many steps to the right as it can, and it's going to set that node's right pointer to be itself. At that point in time, it can tail-recurse on the left-hand side. And so now we've got a smaller version of the problem. And one thing that I really like about recursive problems is that, when I reason about the whole, I can assume that the smaller cases just work. It's proof by induction. Technically, you need a base case, but we already covered the base case of not having a left subtree a couple of slides earlier. So this, Morris Traversal, will successfully traverse this tree, it'll visit one, then it'll visit two, three, four, five, and then finally it visits six again. And when node six gets visited a second time, it's again going to notice that it has a left subtree, it's going to find its predecessor, and it's going to notice, hey, wait a minute, my predecessor already has a back edge. Quite clearly, I've done this before. Quite clearly, my left subtree has already been visited. So instead of setting a back edge on this second iteration, Morris Traversal's going to erase the back edge to clean up after itself.
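Here is a minimal sketch of the Morris inorder traversal loop described above, written as an iterative function. The Node struct and the visit callback are illustrative stand-ins rather than code from the talk, and raw pointers are used because the traversal temporarily rewires right pointers:

```cpp
// Illustrative node type; the talk's trees use unique_ptr children,
// but Morris traversal is easiest to show with raw pointers.
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Inorder traversal in O(1) extra space. The tree is temporarily rewired
// (a back edge from each predecessor to the current node) and restored
// before the function returns.
template <typename Visit>
void morris_inorder(Node* current, Visit visit) {
    while (current != nullptr) {
        if (current->left == nullptr) {
            visit(current);                // no left subtree: visit, go right
            current = current->right;      // (this may follow a back edge)
        } else {
            // Find the inorder predecessor: one step left, then right as far
            // as possible, stopping if we hit the back edge we made earlier.
            Node* pred = current->left;
            while (pred->right != nullptr && pred->right != current) {
                pred = pred->right;
            }
            if (pred->right == nullptr) {
                pred->right = current;     // first visit: plant the back edge
                current = current->left;   // "tail-recurse" into the left side
            } else {
                pred->right = nullptr;     // second visit: erase the back edge
                visit(current);            // left subtree done; visit ourselves
                current = current->right;
            }
        }
    }
}
```

For the destruction case that Eric raised, you can be even more relaxed, since the tree does not need to be restored afterwards.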
The node will visit itself, and then it will continue recursing on the right-hand side. This pattern of setting a back edge the first time and deleting the back edge the second time means that Morris Traversal will clean up after itself perfectly, leaving the tree in an unchanged state at the end of the algorithm. Certainly, it's going to mutate the state and make an invalid tree in the interim, but that's the price you pay for order one traversal of your tree. So, coming back to the fact that trees might be unbalanced. If you don't have an unbalanced tree, or if you have a tree that you know won't be too deep, just use unique_ptrs, it'll be fine, your stack will be fine most likely. But if you do run into this problem, then there exist algorithms both for regular inorder traversal and also for destruction that will allow you to visit every node in your tree without requiring extra stack space. So one thing about the last two algorithms that I shared with you is that they're both space-efficient algorithms. And I realized, after putting all my algorithms together, that all of the algorithms I have are space-efficient algorithms. Because I work on services, and when I have a service that uses a function, a method that uses less memory, then I can do more with an individual server that I have as my resource. So Reservoir Sampling is no different. It is a space-efficient streaming algorithm. So a while ago, an intern came to me and she had a problem. BigGrep wasn't doing what she wanted it to do, and she wanted BigGrep to do better. BigGrep, by the way, is Facebook's internal grep tool for the entire codebase. You can imagine it like grep -r, except it's really, really fast. And because we are grepping over our entire codebase, if you grep for something particularly common such as a string, you'll get thousands and thousands of results. Well, you actually won't get thousands and thousands of results, you'll get an error, too many results. That's unfortunate. And this was especially unfortunate for the intern because she, like many other new employees at Facebook, learned how Facebook's codebase operates by looking at other code. And the way you find code: you hear a magical keyword, what does that keyword mean, Scuba? I don't know what Scuba is, so BigGrep Scuba. Look at the BigGrep results, see what other people are doing, copy their style, figure out what they are doing. And if there's a particularly common name, a particularly common keyword, BigGrep's not going to return anything. So what this intern wanted is she wanted BigGrep to return a sample of the data. To return a subset of the 185,000 results instead of nothing. And so, on to Reservoir Sampling. Reservoir Sampling is a generic algorithm to select k elements from a stream of unknown length. I have seen many halfway solutions that kind of try to be Reservoir Sampling. I have seen people pick elements with probability k/n from their stream. This actually works, when n is known, and also when you don't particularly care if you have exactly k elements. It works particularly well when you want to sample log messages. I have seen services that just accumulate all elements, store them all, and then pick some random subset of them to choose. I have also seen services just pick the first k elements. This one's a little bit rarer, but if you, say, have a service that has a queue, and requests are coming in in a random order anyways, you might as well take the first k.
All of these, though, pale in comparison to Reservoir Sampling, which extremely efficiently will select k elements uniformly at random from a stream of unknown length. How does it work? The answer is that Reservoir Sampling maintains only a very small amount of state. The state that Reservoir Sampling maintains is the current selection of elements that it wants to return, as well as the number of elements currently processed. And so here's a little pictogram, and right before we processed element #60, we'd processed 59 elements from our stream, and k equals two, we've chosen red and blue to be our two random elements out of the 59 we've processed so far. The statistical invariant of this code, by the way, is that, for every element and every one of those two cells, each element has a 1/n chance of being in any given cell. So that red circle has a one in 59 chance of being in box zero. And then, hey, on comes element 60, the green element. If the algorithm were to terminate immediately after we processed element 60, then we would expect that green element to exist with probability 2/60. Therefore, we must take element 60 and place it into our currently selected elements with probability 2/60. So take a random number, take it mod 60, that is, mod n, and check if it's less than k; k is two, so check if the random number mod 60 is less than two. And if it is, boot out a random one of the two elements already selected, otherwise discard it. Thinking back to the statistical invariant, right after we'd processed element 59, that red element had a one in 59 chance of existing inside bucket one. And after we've processed element 60, it had a one in 60 chance of being kicked out. Green had a two in 60 chance of being selected, but there were two boxes that it could've been placed into, and so there's a one in 60 chance that green replaced red. Worded another way, red has a 59 over 60 chance of remaining. So a one in 59 chance of existing, times a 59 over 60 chance of remaining, means the red element, after having processed 60 elements, has a one in 60 chance of being there, maintaining the statistical invariant of Reservoir Sampling. There is one particularly cool implementation trick that I'd like to share with you for this code, which is a way to save a random number generation. We take our random number and we take it mod 60, and we get a uniform distribution zero to 60. Then we check to see if it's less than k. If it isn't less than k, whatever, move on, nothing else to do here. But if it is less than k, we then need to generate a random number in the range zero to two. But wait, r was a random number in the range zero to 60, and furthermore, we know that it's less than k. Therefore, r is actually a random number in the range zero to two, and therefore, we can just replace the rth element with green, no need to call rand a second time. Algorithm in hand, my intern protagonist went off and implemented Reservoir Sampling inside of our BigGrep codebase, and she added the --sample flag, which will successfully return a subset of results even if there are grossly too many results to be returned by the BigGrep server. --sample has stayed in production for the last four years, and is still in use today.
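Before moving on, here is a minimal sketch of the reservoir update described above, including the trick of reusing the same random draw as the replacement index. The class shape is mine, and <random> stands in for the rand-mod-n arithmetic in the talk:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// A sketch of Reservoir Sampling as described above: keep k elements chosen
// uniformly at random from a stream of unknown length, using only the current
// selection and a count of elements seen so far.
template <typename T>
class ReservoirSampler {
public:
    explicit ReservoirSampler(std::size_t k) : k_(k) {}

    void add(const T& item) {
        ++seen_;
        if (selection_.size() < k_) {
            selection_.push_back(item);    // the first k elements always fit
            return;
        }
        // r is uniform in [0, seen_). Keep the new item with probability
        // k / seen_; and when r < k, r is itself uniform in [0, k), so it can
        // double as the index of the element to evict -- no second draw needed.
        std::uniform_int_distribution<std::uint64_t> dist(0, seen_ - 1);
        std::uint64_t r = dist(gen_);
        if (r < k_) {
            selection_[r] = item;
        }
    }

    const std::vector<T>& selection() const { return selection_; }

private:
    std::size_t k_;
    std::uint64_t seen_ = 0;
    std::vector<T> selection_;
    std::mt19937_64 gen_{std::random_device{}()};
};
```

Feeding every result through add() and reading selection() at the end gives the kind of sampled subset the --sample flag described above returns.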
All right. I've got one last algorithm for you. HyperLogLog. Who here, by the way, has ever made a table in a database? Wow, I feel embarrassed now, I haven't. (chuckles) All of you know how to database way better than me. Because I've never had to. Facebook has a lot of teams, a lot of really good abstractions, a lot of good services that deal with the database abstraction for me. So all I really need to do is take my data, throw it over a wall, and I trust that other people, other teams, other services are going to do all the hard stuff. They're going to deal with retention, they're going to deal with sampling, and they're going to have these really nice web UIs where I can go when I want to look at my data, and they'll take P99 of my data for me, and they'll make me pretty charts, and they even have these dropdown buttons that I can press with my mouse, it's fantastic. Very, very pleased with the state of databases at Facebook. Until one day... The web UI did not have a button to do what I wanted to do. What I needed to do with my data is I needed to know how many distinct elements there were. Specifically, what I was doing is I had the big, huge database of all the different compilations that Facebook has run, and I wanted to know how many distinct filenames Facebook had compiled in the last 24 hours. And there wasn't a button for this. It was, it was awkward. I had to go to the Wiki. Wiki... How do I database? (tongue clicks) And the Wiki was actually pretty good: it told me what to do, it told me how to open the database, it told me how to find the specific table that I was looking for, and it even told me the name of the magical function that I had to call. Approx count distinct HLL. Okay, count distinct is what I want, I do want to know the number of distinct elements in my database, but why approximate? I'd be quite fine if you just gave me an exact answer, thanks. (audience laughs) And also, what does HLL stand for? It turns out, HLL stands for HyperLogLog, and it is a damn cool algorithm. The thing that HyperLogLog does is it counts the number of distinct elements, but for the exact same reason that it was impossible to find who the most frequent user of my service was in Heavy Hitters, it is also impossible to count the number of unique elements of my database in a single pass using less than order n space. And for databases, yeah, sometimes they're big. And so what are we going to do? How can we get around this problem of needing to have order n space? And the answer is that HyperLogLog trades away exactness in order to reduce the space from order n to a particularly tiny number. As the original author of HyperLogLog would've said, HyperLogLog can be used to estimate the number of words in the whole of Shakespeare's works using less memory than it takes to represent this slide. It's a very space-efficient algorithm. And the key observation for HyperLogLog is that if you have a random stream of bits, then the probability that the first k bits are all zero is one over two to the k. So we can imagine, maybe I have 128 distinct random numbers. I would expect, for those 128 random numbers, 64 of them to have no leading zeros, 32 of them to have one leading zero, 16 of them to have two, eight to have three, four to have four, two to have five, one to have six, and one of those 128 distinct random numbers to have seven or more leading zeros. So yeah, that's a nice observation. How do we even use it? I see three big problems with turning this observation into a practical approximation algorithm. And the first problem is that this is an observation about numbers of leading zeros, which is a property of random numbers. And my input is not random numbers. My input is arbitrary data.
Specifically, my example was filenames. Filenames being strings, not integers, and also paths, which are highly structured and certainly not random. Fortunately, there's a relatively straightforward solution to this. We're going to hash our input. Hash functions: turning arbitrary data into sufficiently random numbers since the 1960s. One of the other really great things about hash functions is that hash functions are deterministic. If I had hello world dot cpp occurring twice in my data set, then hello world dot cpp would hash to the exact same value. If I've got 1,000 strings, 500 of which are unique, then when I hash them, I'm going to get 1,000 numbers, half of which are unique. Technically, there are hash collisions, but HyperLogLog is already an approximation algorithm. And I'll show you guys later that it usually runs at a 2% error rate. And with 64-bit hashes, and a built-in 2% error rate, hash collisions aren't going to make a difference until you have about a quintillion elements. That's a billion billion. So, whenever you guys have a billion billion elements that you want to figure out how many of them are distinct, HyperLogLog is no longer for you. Either that, or get a bigger hash function. Okay, so we can take our input, we can hash it to get random numbers, and we're good on that score. The next problem with this observation is that this is an observation about random numbers. However, our input is arbitrary, and specifically, our input has arbitrary duplication. That's the whole point of HyperLogLog. And the problem with duplication is that our nice distribution of how many leading zeros we expect to see is totally whacked-out and bonkers. And so how, how are we going to solve the problem of duplication messing up our distribution of leading zeros? The answer is we need to find some property of our distribution that is not affected in the presence of duplications. And the answer for that is we're going to use the maximum. We are not going to track how many leading zeros there were for every different number, we are merely going to track the largest number of leading zeros that we have seen in the entire data set. So back to the example with 128 different elements, we see our distribution, and we see that there's one element that has seven leading zeros, and then without actually knowing that the data set contained 128 distinct elements to begin with, we would guess, because we see a maximum of seven leading zeros, that our data set contains two to the seven equals 128 distinct random numbers. Which brings me to the final problem, which is that that estimation sucks. It is a terrible, terrible estimation to just say, find the maximum number of leading zeros, and two to the power of that number is your cardinality. Doesn't work. What happens if, instead of seeing seven leading zeros, by random chance, you saw ten? Three extra leading zeros. That should happen about 12% of the time. So 12% of the time, instead of estimating that you had 128 distinct elements, you would estimate that you had 1,024 elements, exceeding the real number by a factor of eight. How are we going to solve this? The answer is that statistical randomized algorithms have two nice tricks up their sleeve. The first trick that randomized algorithms sometimes employ is to repeat the algorithm several times with different sources of randomness. And then to aggregate those different answers together to form a more statistically correct answer. We'll come back to the aggregation function in a bit.
Because in HyperLogLog, we are relying on the hash functions to provide us with our random bits, what we need to do is we need to run this function some number of times with different hash functions. The second trick that approximation algorithms like to employ is you can sample the data. You don't need to investigate every single thing, as long as you pick a random subset of the data, you can form a statistically accurate estimate of what your data set is doing. How might you pick k random elements from your stream, you might ask. There might be an algorithm for that. But for the moment, we can combine these two ideas in HyperLogLog. And what we will do is we will process elements one at a time, and we will look at an element and partition it. We'll say, hey, by the way, this element belongs to partition 10. We'll send it off to partition 10, we'll hash it using partition 10's unique hash function, and then we'll count the number of leading zeros, and we'll update the maximum number of leading zeros that partition 10 has seen with this new number. There is a really interesting implementation detail here, as well as a very subtle detail. The cool implementation trick is we can dispense with this have a billion hash functions problem, we can get away with only one hash function. As long as you're willing to have a number of buckets, a number of partitions that is a perfect power of two. Suppose you want to have eight different partitions. That's three bits. When you see a new element, take its hash, and use the first three bits of the hash to identify the partition to which your element belongs. The other 61 bits of your hash, or n minus three for whatever n happens to be, is your partition's specific hash function. And so in this example, we see an element of hashes to zero one zero zero zero one zero one, take the first three bits. Hey, those first three bits represent the number two, this element belongs to partition two. Following which, we take the other five digits, and we observe that those five digits have two leading zeros. Hey, we're going to update the maximum number of leading zeros for partition two with the value two. The subtle detail here... is that the hash function means that we are partitioning the input space, not the input size. If we have hello world dot cpp twice in our data set, it's going to hash to the same thing both times, and that means that it's going to get sent to the same partition. That means that all of the strings that make it into partition a are disjoint from all of the strings that make it into partition b. Something to keep in mind for a moment. And so what does this look like? What does this look like when we run it on 10,000 random integers? And the answer is that it looks something like this, this is literally a run that I ran on my computer. And we see, for example, that partition six, of all the different numbers that were assigned to partition six, six saw a maximum of 10 leading zeros. Okay, that would mean that we'd estimate that partition six has two to the power of 10, which is 1,024, distinct elements. We would also estimate that partition five has 2,048 distinct elements. Keeping in mind that the elements that go into each partition are distinct, it is tempting to think that we would get a good estimation by summing up the estimates for each individual cell to get our final guess. Problem is that that estimation really sucks. I said there were 10,000 elements that generated that HyperLogLog table. 
It turns out that if you sum up the expected number for all the individual boxes, you get an estimate of 50,000. Five times your original number. And the reason you get 50,000 is because of that seven box that has a 15 in it. Box seven, by random chance, saw 15 leading zeros and so estimates that your data set has at least 32,000 distinct elements. So, we're a little bit stuck there. What are we going to do instead? Fortunately for us, some very smart mathematicians have come up with an answer. Instead of taking the sum, use the harmonic mean. Couple of extra details here. Because all the different boxes are unique, you have to add an additional factor of m to account for the fact that all the boxes have different strings in them. And in the formula that they provide in their paper, you also need to throw in an additional constant factor to account for the fact that there is structural bias in how this number is computed. But at the end of the day, you have an implementable formula that you can apply to the various numbers of leading zeros for each of your partitions. And if we apply that formula to these observed numbers, we get an estimate of about 8,000. Keeping in mind there were 10,000 elements to begin with, so we've got an error of about 20%. That's not actually all that great. There's a reason for that, though. This estimate is 20% off because I chose a very small value of m, I chose m equals eight. The expected error of HyperLogLog is proportional to 1.04 divided by square root of m. When m is eight, that comes out to an expected 35% error margin. Most people, I have been led to believe, choose 2,048 distinct partitions. That requires the first 11 bits of your hash function. Because when m is 2,048, your expected error comes out to a mere 2%.
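Putting the pieces together, here is a minimal sketch of the HyperLogLog estimator described above: bucket by the top bits of the hash, keep a per-bucket maximum, and combine the buckets with the bias-corrected harmonic mean. This is a sketch, not the database's implementation: std::hash stands in for a real 64-bit hash, the registers store the leading-zero count plus one (the position of the first one-bit, which is what the published formula expects), and the small- and large-range corrections from the paper are omitted:

```cpp
#include <cmath>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class HyperLogLog {
public:
    // m = 2^index_bits buckets; the talk suggests m = 2048 (11 index bits)
    // for roughly 2% expected error (1.04 / sqrt(m)).
    explicit HyperLogLog(std::size_t index_bits = 11)
        : index_bits_(index_bits),
          registers_(std::size_t{1} << index_bits, 0) {}

    void add(const std::string& item) {
        // Assumes a 64-bit platform, so std::hash yields 64 usable bits.
        std::uint64_t h = std::hash<std::string>{}(item);
        std::uint64_t bucket = h >> (64 - index_bits_);  // top bits pick the bucket
        std::uint64_t rest = h << index_bits_;           // remaining bits
        // Rank = leading zeros of the remaining bits, plus one.
        // (__builtin_clzll is a GCC/Clang builtin; it is undefined for 0.)
        std::uint8_t rank = rest == 0
            ? static_cast<std::uint8_t>(64 - index_bits_ + 1)
            : static_cast<std::uint8_t>(__builtin_clzll(rest) + 1);
        if (rank > registers_[bucket]) registers_[bucket] = rank;
    }

    double estimate() const {
        double m = static_cast<double>(registers_.size());
        double alpha = 0.7213 / (1.0 + 1.079 / m);  // bias correction for large m
        double harmonic_sum = 0.0;
        for (std::uint8_t r : registers_) {
            harmonic_sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^-r
        }
        return alpha * m * m / harmonic_sum;
    }

private:
    std::size_t index_bits_;
    std::vector<std::uint8_t> registers_;
};
```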
So, I found this function. Database, please approx count distinct HyperLogLog my data. (clicks tongue) Hmm. That just worked. That's one of the glorious things about algorithms. They just work. You don't need to invent them, you don't need to implement them, you don't really even need to understand what they're doing. All you have to do is you have to know that they exist, and know what problem they solve. And that is what turns an algorithm into a useful tool that you can use. There are so many algorithms, and there's one place that I like to go to find them. #include <algorithm>. It's a very nice place, I like going there, seeing what algorithms are on offer. It's not the best. Do I wish there were more algorithms? Yes. Do I wish that the algorithms didn't have an iterator interface? Also, yes. But do they work? Are they fast, are they tested, are they rigorous? Yes! Every algorithm that you find inside of <algorithm> will be an extremely efficient algorithm to add to your toolbox. Some of my favorites. nth_element: given a list of data, find the n largest ones. accumulate: given a stream of input, sum it all up. And count_if, another one of my favorites. People have a tendency to corrupt the data in the services that I own, and sometimes I need to count how much of my data is still valid. Really, whenever I'm doing code review and anyone writes a for loop that does something relatively simple, I go to my favorite C++ reference website, cppreference.com, and I look through their algorithms description, and I try to find one that does what I want them to do. Literally, I managed to do this with nth_element. A co-worker of mine was taking his input, he was sorting it, and then he was taking the top k. And I was like, hey, hey bud, here's a cppreference link to nth_element, if you'd like to shave a factor of log(n) off your runtime. But I'm not going to talk much more about standard algorithms, and the reason I'm not going to talk much more about standard algorithms is because there's already a great talk on this subject available on YouTube. Two years ago, at CppCon in 2015, Michael VanLoon gave a talk, STL Algorithms in Action, and it was one of my favorite talks that year. It contained just a whirlwind tour of the STL Algorithms library, and for me, it was really useful because it broke down the barrier of mysteriousness and weird function call syntax that was present in the STL. Before that talk, I did not use the STL very much, I was not very familiar with it, but Michael's talk introduced me gently into those algorithms, and I use them all the time now. If you do not use <algorithm>, go onto YouTube, watch Michael VanLoon's talk. It'll be a good use of your time. I could keep talking for a long time. I could talk for days about algorithms. The thing is, though, if I were to talk for days about algorithms, instead of limiting myself to just a few, we would all get very bored very quickly. Because most of the algorithms that I know you also know. We all know binary search, we all know quick sort, we all know substring, and that's because algorithms are the tools of our trade. I hope that today's presentation of algorithms has contained some new algorithms for you. Tools, rare enough to be novel, but useful enough to be found in practice. Each, in their own special way, fantastic. My name's Nicholas. It's been a pleasure speaking with you today. (applause)
Info
Channel: CppCon
Views: 47,607
Rating: 4.7547741 out of 5
Id: YA-nB2wjVcI
Length: 46min 58sec (2818 seconds)
Published: Fri Oct 27 2017