All right. Hello, everyone. And welcome to Fantastic
Algorithms and Where to Find Them. An intern once asked
me, "Nicolas, how often "do you write algorithms?" And the answer is rarely. Algorithms are tools. Hammers, and levels, and mops. And what makes a collection
of tools impressive is not how many of them you made yourself, but rather, how many unique problems you can use your toolbox to solve. I hope that in today's
talk, as we go through a couple fantastic algorithms, that there will be some
tools here rare enough to be novel, but useful enough
to be found in practice. So a little bit about
me before we dive in, I am an Infrastructure
Engineer at Facebook. I work in infra for two reasons. First of all, a lot of infra is services. Services are often written in C++, and I happen to like that language. The second reason that
I like working in infra is that my clients are fellow engineers. If I've got a problem,
something's going wrong, if I need to learn something
about the use cases, I can just go and talk to
another technical person. This leads straight into
our first algorithm. Heavy Hitters. Despite my co-workers being engineers, people with no malicious intentions... I still get DoSed. Maybe a new feature gets turned on, maybe an upstream service
gets DoSed which causes my service to get DoSed, or
in this particular story, a command line interface
used primarily by humans is called programmatically
for the first time. This particular graph, those two spikes represent high latencies,
specifically timeouts, as a result of too many
queries coming into my service. Now, what I want to do,
what I really would like is to guarantee to all the people who are using my service normally, that they will have a bare
minimum amount of traffic. I do not want to starve
any of my individual users in the event that I am overloaded. And so the ideal thing that
I would like is I would like to have a map of the
people who use my service to the number of times that
they have made queries with me. The problem is that keeping
that map is very expensive, it takes order n (O(n)) space. Even if I were to simplify my request, and instead just want to know the identity of the single user who was
making the most requests, it would still take order n (O(n)) space to be done in a single pass. So fortunately, though,
there is an algorithm that exists, Heavy Hitters,
which does almost what I want. Heavy Hitters does not determine who the most frequent user of my table is, of my service is, however, what it does do is it does identify the
users who single-handedly account for more than x
percent of total traffic. For this special case, x
equals 50%, some of you might have heard this algorithm
under some different names, for x equals 50%, knowing
which elements in an array occur strictly more than half the time, the algorithm's called Majority Element. Also known as Boyer-Moore's
Majority Vote algorithm. And it works something like this. Imagine, for a moment, that
we are all at a conference, and there's a lot of programmers who are attending the conference, and the programmers indicate
by the color of their dress what their favorite
programming language is. As the programmers enter the floor the passion of the moment overcomes them, and a furious fight breaks out. Programmers who represent
different languages lock eyes and instantly do battle. They run at each other and
try to knock the other out. Now, it happens to be
that every single pair of programmers is
perfectly evenly matched, and so when they run and punch each other, they both land a blow on the
other's chin simultaneously, and both get knocked to
the floor unconscious. The conference chair enters the room and notices that there is exactly one programming language left standing. Indeed, if there were
multiple programming languages left standing, then one
programmer from each of two distinct languages would lock eyes, run at each other, do
battle, get knocked out. So there is exactly one language left. And so the conference chair smiles, 'cause he had a question that he wanted to have an answer to. He wanted to know whether
or not there existed a programming language
that enjoyed the support of a strict majority of all attendees. The conference chair's job is simpler now. Instead of counting
the number of attendees who support every single
programming language, he merely needs to count
the number of attendees who support the programming
language left standing. The reason this works... is because we see that pairs
of people are knocked out where each member of the pair represents a different programming language. If there's n people, there's at most n/2 pairs of knocked out people. And if there's a
programming language that is represented by more than n/2 people, then it cannot be completely knocked out by those pairings, and
thus will be left over. This conference room brawl translates very cleanly into an algorithm. The only difference is that
you let the programmers into the room one at a time. Implement this as a counter-value pair, or sorry, a value-counter pair. You have a value, the
current programming language which is winning, and some
rather arbitrary integer representing the weight. So the first programmer
walks into the room representing C++, there's
no one else in the room, C++ is now the winner, the weight is one. A second programmer, a Java programmer, walks into the room, locks
eyes with the C++ programmer, they do battle, they get knocked out, the counter goes back to zero. Third programmer walks into the room, C++. C++ is the winner, counter is one. Fourth programmer, again, C++ walks in, C++ now has a value of two, and the fifth and final
programmer, a Perl programmer, walks in, locks eyes with
one of the C++ programmers, they do battle, get knocked out. The algorithm's done. At the end of these five
people, three of whom were C++, majority vote does indeed
say that C++ is the winner, and there's a rather
arbitrary value of one left in the counter. It is important after you
do the first pass through to find your estimate
of who the winner is, that you do a second
pass through the array to verify that the returned element is indeed a majority element. We can imagine a case
where this is not the case. Suppose there are a very
large, odd number of pirates each of whom is voting for who the next captain of the ship is going to be. 1,001 pirates. And I've chosen pirates, because pirates are well known to always
vote for themselves. And so the first pirate
walks in and he's the winner. And the second pirate walks
in, and they do battle, and there's no winner. Third pirate walks in, he's the winner. Fourth pirate walks in, oh they disagree, they do battle, no more winner. And the first 500 pairs of pirates go in and knock each other out
until the last pirate, the 1,001st pirate, walks in and, "Hey, guys, I vote for Teddy." Teddy's the winner! No, he's not the winner,
you have to do a second pass through the array. It is sufficient for a majority element to end up in the value,
but it's not necessary.
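If I were to sketch that value-counter pair in code, it would look something like the following. This is just my own sketch, with my own names, not code pulled from any particular library:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// A sketch of the majority vote algorithm: one pass to find a candidate,
// a second pass to verify the candidate really is a strict majority.
std::optional<int> majority_element(const std::vector<int>& xs) {
  int candidate = 0;
  std::size_t count = 0;
  for (int x : xs) {
    if (count == 0) { candidate = x; count = 1; }  // empty room: x is winning
    else if (x == candidate) { ++count; }          // an ally walks in
    else { --count; }                              // a fight: both go down
  }
  // Second pass: the survivor is only a candidate until we count it.
  std::size_t occurrences = 0;
  for (int x : xs) {
    if (x == candidate) ++occurrences;
  }
  if (2 * occurrences > xs.size()) return candidate;
  return std::nullopt;  // nothing occurs strictly more than half the time
}
```

In the pirate example, the first pass leaves Teddy as the candidate with a counter of one, and the second pass throws him out.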
This is the x equals 50% version of the algorithm. It extends very cleanly and very naturally to percentages other than 50%. Specifically to
percentages of the form 1/k for positive integers k. The only trick that you have to change is that, inside the conference room, which could previously
accommodate only one programming language, now
allow there to be k minus one distinct programming
languages existing in harmony. And it's only when a
fifth programming language walks in that the room
becomes too crowded, and that new programming language plus the other four distinct programming-- sorry, if k equals five,
the other four distinct programming languages,
one person from each group does battle and leaves. And so they do battle in groups of k, therefore, no one who exists more than 1/k of the time can be
completely knocked out.
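Sketched in code, the 1/k generalization looks something like this. It's often called the Misra-Gries frequent-items algorithm; the names and the interface here are mine, not anything from a particular library:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// A sketch of the 1/k generalization (the Misra-Gries "frequent" algorithm):
// the room now holds up to k - 1 distinct languages at once.  Anything that
// accounts for more than 1/k of the stream is guaranteed to survive, but the
// survivors are only candidates until you verify them.  Assumes k >= 2.
std::unordered_map<std::string, long>
heavy_hitter_candidates(const std::vector<std::string>& stream, std::size_t k) {
  std::unordered_map<std::string, long> counters;  // at most k - 1 entries
  for (const std::string& x : stream) {
    auto it = counters.find(x);
    if (it != counters.end()) {
      ++it->second;                      // x already has a seat in the room
    } else if (counters.size() < k - 1) {
      counters.emplace(x, 1);            // the room has space for one more
    } else {
      // Room is full: x plus one member of each existing group do battle.
      for (auto jt = counters.begin(); jt != counters.end();) {
        if (--jt->second == 0) jt = counters.erase(jt);
        else ++jt;
      }
    }
  }
  return counters;
}
```

As with the majority vote, whatever survives is only a set of candidates, so you still verify the counts before acting on them.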
So coming back to the service that I own that was getting DoSed. I was upset that this was happening. I went and I talked to that engineer, because I'm an adult. But I also put up some
code based on Heavy Hitters to make sure that, if this happened again, I'd be able to throttle
the people who were single-handedly accounting
for large percentages of my resources. I had one small variant, though. Heavy Hitters does not
have a concept of time, so I did Heavy Hitters
in five minute intervals. It was a long time ago,
there was a little bit of trickery going on. I can't show you that code, though. But some code that I can show
you is this majority vote implementation, found inside of Libgcc. I was talking to Nathan Sidwell yesterday. Nathan Sidwell, by the way,
is going to give a talk tomorrow on modules, implementing
modules inside of G++, which I would advise going to. But I was telling him about Heavy Hitters, and he was like, "Hmm, you know, "this code sounds familiar to me." And sure enough, Nathan
found this Heavy Hitters implementation inside of Gcov. What this code is doing
is it is trying to find whether or not there
is a majority argument for profiling purposes. During profiling, this extremely cheap, extremely simple counter-value pair is used to determine whether or not there's a specific argument that occurs the vast majority of the time. There's one slight hiccup, though, in the Gcov implementation. Which is, typically, at the
end of the Heavy Hitters implementation, you
have to do a second pass to verify that the given
argument does indeed occur more than 50% of the time. And Gcov can't do a second pass. So, instead, Gcov has
this interesting trick where, at the very end,
they check the counter and they see if the counter is higher than 50% of the total number of calls. If this is true, then we know for sure that that argument occurred at least 50% of the time. However, we don't know that an argument that occurred 51% of the time is guaranteed to make this cut. However, it is guaranteed that an argument that occurs at least 75% of the time will, once the algorithm completes, have a value in the counter at least equal to 50%. So this is a rather nice, nifty optimization that Gcov uses to profile your code so that your code can be optimized.
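The shape of that trick, sketched with my own names rather than the actual gcov source, is roughly this:

```cpp
// A sketch of the idea (my own names, not the actual gcov code): keep the
// usual value-counter pair, plus a count of total observations.  At the end,
// the leftover counter alone can prove the value occurred at least half the
// time, because the value must have been seen at least `counter` times.
struct OneValueProfile {
  long long value = 0;    // the current "winner"
  long long counter = 0;  // net votes for the winner
  long long total = 0;    // total observations

  void observe(long long v) {
    ++total;
    if (counter == 0) { value = v; counter = 1; }
    else if (v == value) { ++counter; }
    else { --counter; }
  }

  // True only when, without a second pass, we can prove that `value`
  // accounted for at least 50% of the observations.
  bool provablyDominant() const { return 2 * counter >= total; }
};
```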
All right. I've got another algorithm for you. Who here was at CppCon last year? That's actually fewer
people than I thought, that was like less than half. Who here, regardless of whether or not you've been at CppCon last year, listened to Herb Sutter's
talk Leak-Freedom in C++? That looks like quite a few more hands. Not everyone's hands, but more. For those of you who
haven't seen Herb's talk, I highly advise, go and listen to it, it's a great talk about how to avoid leaking memory inside of your programs. And one of the most important things that Herb hammers home in his talk is not to manage pointers yourself. Do not use new and delete. Instead, rely on unique
ptrs and shared ptrs when you actually have data that has to outlive your stack frame. And one of the examples
that Herb had in his talk was he had a tree implementation. And instead of having a tree star left and a tree star right, Herb
instead had unique ptrs. And this is very nice,
because those unique ptrs are going to be cleaned up successfully, regardless of how your
tree gets destroyed. You could have an exception
thrown in your constructor and those child trees are
going to be cleaned up, and you're not going to have a leak. There's a problem, though, Herb remarked. The problem is that, in the wild, not all trees are balanced. And so if you happened to
have an unbalanced tree, in the worst case, just a
straight line like a linked list, what happens when you free the root node? The root node's going to be freed, but in order to do that,
it first has to call the destructor on its
left child unique ptr. So six calls the destructor of two, two calls the destructor of one, and so a lot of destructors
get pushed onto your stack, and you can smash your stack. This isn't good. And so, at the end of the presentation, during Q&A, Eric Niebler went
to the microphone and said, "Herb, did you know
that there exists a way "to destroy a tree using
order one space overhead?" Before I go on, I've
got a small little story about Libgcc again for you. I went and I checked
Libgcc's implementation to see whether or not they
were doing a stack-smashing destruction call, or if they
were doing some fancy-schmancy order one space destruction call. And the answer is that, inside of Libgcc, their red-black tree implementation uses this type of destruction. It just recursively calls the destructors, comes back up, pushes everything onto the stack. And the reason that makes a lot of sense is because the STL's
red-black tree implementation is guaranteed to be balanced. So the height's going to
be proportional to log(n), you're not going to smash
your stack with log(n) space. But back to the fact that Herb has said that there do exist cases
when trees are imbalanced. If you implement some not
necessarily balanced tree, you have to be careful
about smashing your stack in recursive contexts. So what are we going to do? The first answer is I'm
going to step it up a notch. When you want to destroy
a tree in order one space, there's something really handy you can do. Which is that you can muck
with all of your tree's internal invariants with reckless abandon. You can rotate whatever
internal node you want, don't worry about it, the
tree's getting destroyed.
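Sketched in code, that rotate-with-reckless-abandon destruction looks something like this. This is my own sketch, not Herb's code and not anything from the standard library:

```cpp
#include <memory>

struct Tree {
  int value = 0;
  std::unique_ptr<Tree> left;
  std::unique_ptr<Tree> right;
};

// A sketch of O(1)-space destruction: while the root has a left child,
// rotate that child up into the root position; once the root has no left
// child, step to the right child, which frees the old root while both of
// its child pointers are already null -- so no destructor ever recurses.
void destroy(std::unique_ptr<Tree> root) {
  while (root) {
    if (root->left) {
      // Right rotation at the root; nothing is destroyed by these moves.
      std::unique_ptr<Tree> child = std::move(root->left);
      root->left = std::move(child->right);
      child->right = std::move(root);
      root = std::move(child);
    } else {
      // Detaches the right child first, then deletes the old root.
      root = std::move(root->right);
    }
  }
}
```

Every iteration is constant work and constant space, and each node is only freed once it has no children left to free.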
So I'm going to throw an extra challenge on top. I would like to iterate through my tree using order one space overhead, but I'd like my tree
to be in the same state as the beginning at the end. So what are we going to do? And the first thing we're going to do is we're going to remember a really cool implementation trick to save space inside of recursive languages. Languages like Scheme. See, one thing in recursive
languages like Scheme is there's very frequently
what's called tail-recursion. Which is when the value you return is just the result of a recursive call to your own function. Now how does this relate to trees? Well, suppose, for a moment, that we don't have a left subtree. When we do in
order traversal on node six, there's no left subtree to
visit, we don't visit it, we visit ourself and then we tail-recurse in on the right-hand side. And that right-hand side, that node seven, does not care that six was its parent. Node seven never has to
come back up the stack to see that node six was the parent. And so why bother keeping all the state for the stack frame about
node six on the stack? Instead of doing recursive
call, recursive call, recursive call, what
languages like Scheme mandate is that the compiler perform
tail call optimization. And that six is in the
stack, seven is the child, and then instead of just
pushing seven onto the stack, seven should replace the data
for six in the stack frame, and turn the entire program
from a recursive function into a loop. The difference between
Scheme and C++ is that C++ does not mandate that this
optimization be performed, however, your major compilers
will perform this optimization at things higher than O0. Gcc, I did a quick experiment,
Gcc seems to turn on tail call optimization
at O2, and Clang on O1. Both of them, though, do not turn on tail call optimization with O0. Because with O0 the compiler
is trying to turn your code as faithfully as possible
into assembly code, and that means having
order n stack calls.
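Just to make that concrete, here's the no-left-subtree case written both ways, with my own names. Whether your compiler actually turns the first version into the second depends on those optimization levels:

```cpp
struct Node {
  int value;
  Node* right;  // no left pointer: this is the no-left-subtree case
};

// Tail-recursive form: the recursive call is the very last thing we do,
// so a compiler performing tail call optimization can reuse the current
// stack frame instead of pushing a new one...
void visit_spine(const Node* n, void (*visit)(int)) {
  if (!n) return;
  visit(n->value);
  visit_spine(n->right, visit);  // tail call: we never need to come back
}

// ...which makes it exactly this loop, O(1) stack no matter how deep we go.
void visit_spine_loop(const Node* n, void (*visit)(int)) {
  for (; n; n = n->right) visit(n->value);
}
```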
Back, though, to Morris Traversal. If our tree does not happen to have a left subtree, we can do an inorder
traversal in order one space. The problem now is that
trees would be kind of boring if they didn't have left subtrees. What are we going to do when
we do have a left subtree? Because we need some way of
getting back to the root node. We need a way to visit the left
subtree, and then come back. And the way that Morris
Traversal's going to do this is Morris Traversal is
going to find a node in its left subtree, and set its right subtree node to
point back to itself. And the specific node that
Morris Traversal is going to pick will be the predecessor. So Morris Traversal's
going to notice that it has a left subtree, it's going to go one step to the left as many steps
to the right as it can, and it's going to set that
node's right ptr to be itself. At that point in time, it can tail-recurse on the left-hand side. And so now we've got a smaller
version of the problem. And one thing that I really
like about recursive problems is that, when I reason about
the whole, I can assume that the smaller cases just work. It's proof by induction. Technically, you need a base case, but we already covered the base case of not having a left subtree
a couple of slides earlier. So this, Morris Traversal,
will successfully traverse this tree, it'll visit one, then it'll visit two, three, four, five, and then finally it visits six again. And when node six gets
visited a second time, it's again going to notice
that it has a left subtree, it's going to find its predecessor, and it's going to notice,
hey, wait a minute, my predecessor already has a back edge. Quite clearly, I've done this before. Quite clearly, my left subtree
has already been visited. So instead of setting a back
edge on this second iteration, Morris Traversal's going
to erase the back edge to clean up after itself. It will visit itself, and then
it will continue recursing on the right-hand side. This pattern of setting a
back edge the first time and deleting the back edge the second time means that Morris Traversal will clean up after itself perfectly, leaving the tree in an unchanged state at
the end of the algorithm. Certainly, it's going to mutate the state and make an invalid tree in the interim, but that's the price you pay for order one traversal of your tree.
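Put together, a sketch of Morris Traversal looks something like this. It's my own code, so treat it as an illustration of the idea rather than a reference implementation:

```cpp
struct Node {
  int value = 0;
  Node* left = nullptr;
  Node* right = nullptr;
};

// A sketch of Morris inorder traversal: O(1) extra space, no recursion.
// A back edge is planted in the predecessor's null right pointer the first
// time we pass a node, and erased the second time, so the tree ends up in
// its original shape once the traversal finishes.
template <typename Visit>
void morris_inorder(Node* root, Visit visit) {
  Node* cur = root;
  while (cur) {
    if (!cur->left) {
      visit(cur->value);   // base case: nothing to our left
      cur = cur->right;    // "tail-recurse" to the right
    } else {
      // Find the predecessor: one step left, then right as far as possible
      // (stopping early if we run into the back edge we planted before).
      Node* pred = cur->left;
      while (pred->right && pred->right != cur) pred = pred->right;
      if (!pred->right) {
        pred->right = cur;      // first visit: plant the back edge
        cur = cur->left;
      } else {
        pred->right = nullptr;  // second visit: erase the back edge
        visit(cur->value);      // the left subtree is done; visit ourselves
        cur = cur->right;
      }
    }
  }
}
```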
So, coming back to the fact that trees might be unbalanced. If you don't have an unbalanced tree, or if you have a tree that
you know won't be too deep, just use unique ptrs, it'll be fine, your stack will be fine most likely. But if you do run into this problem, then there exist algorithms both for regular inorder traversal
and also for destruction that will allow you to visit
every node in your tree without requiring extra stack space. So one thing about the last two algorithms that I shared with you is that they're both space-efficient algorithms. And I realized that after putting all my algorithms together, that
all of the algorithms I have are space-efficient algorithms. Because I work on services,
and when I have a service that uses a function, a
method that uses less memory, then I can do more with
an individual server that I have as my resource. So Reservoir Sampling is no different. It is a space-efficient
streaming algorithm. So a while ago, an intern came to me and she had a problem. BigGrep wasn't doing
what she wanted it to do, and she wanted BigGrep to do better. BigGrep, by the way, is
Facebook's internal grep tool for the entire codebase. You can imagine it like grep dash r, except it's really, really fast. And because we are grepping
over our entire codebase, if you grep for something
particularly common such as a string, you'll get thousands and thousands of results. Well, you actually won't
get thousands and thousands of results, you'll get an
error, too many results. That's unfortunate. And this was especially
unfortunate for the intern because she, like many other
new employees at Facebook, learns how Facebook's codebase operates by looking at other code. And the way you find code: you hear a magical keyword. What does that keyword mean? Scuba? I don't know what scuba is. BigGrep scuba. Look at the BigGrep results,
see what other people are doing, copy their style,
figure out what they are doing. And if there's a particularly common name, a particularly common keyword, BigGrep's not going to return anything. So what this intern wanted is she wanted BigGrep to return a sample of the data. To return a subset of the 185,000 results instead of nothing. And to Reservoir Sampling. Reservoir Sampling is a generic algorithm to select k elements from
a stream of unknown length. I have seen many halfway solutions that kind of try to be Reservoir Sampling. I have seen people pick elements with probability k/n from their stream. This actually works, when n is known, and also when you don't particularly care if you have exactly k elements. It works particularly well when you want to sample log messages. I have seen services that
just accumulate all elements, store them all, and then
pick some random subset of them to choose. I have also seen services
just pick the first k elements. This one's a little bit
rarer, but if you say, have a service that has a queue, and requests are coming in
in a random order anyways, you might as well take the first k. All of these, though, pale in comparison to Reservoir Sampling,
which extremely efficiently will select k elements perfectly with a random probability
from an unknown length stream. How does it work? And the state, the answer
is that Reservoir Sampling maintains only a very
small amount of state. The state that Reservoir
Sampling maintains is the current selection of elements that it wants to return, as well as the number of elements
currently processed. And so here's a little pictogram, and right before we processed element #60, we'd processed 59
elements from our stream, and k equals two, we've
chosen red and blue to be our two random elements out of
the 59 we've processed so far. The statistical invariant
of this code, by the way, is that for every element,
and every one of those two cells, each element has a 1/n chance of being in any given cell. So that red circle has a one in 59 chance of being in box zero. And then, hey, on comes
element 60, the green element. If the algorithm were
to terminate immediately after we processed element 60, then we would expect that green element to exist with probability 2/60. Therefore, we must take
element 60 and place it into our currently selected elements with probability 2/60. So take a random number, take
it mod 60, take it mod n, check if it's less than k, k is two, check if rand number
mod 60 is less than two. And if it is, boot out a
random one of the two elements already selected, otherwise discard it. Thinking back to the
statistical invariant, right after we'd processed element 59, that red element had a one
in 59 chance of existing inside bucket one. And after we've processed element 60, it had a one in 60 chance
of being kicked out. Green had a two in 60
chance of being selected, but there were two boxes that
it could've been placed into, and so there's a one in 60
chance that green replaced red. Worded another way, red has a 59 over 60 chance of remaining. So one in 59 chance of existing, times 59 over 60 chance of remaining is the red element, after
having processed 60 elements, has a one in 60 chance of being there, maintaining the statistical
invariant of Reservoir Sampling. There is one particularly
cool implementation trick that I'd like to share
with you for this code, which is a way to save on a
random numbered generation. We take our random number
and we take it mod 60, and we get a uniform
distribution zero to 60. Then we check to see if it's less than k. If it isn't less than
k, whatever, move on, nothing else to do here. But if it is less than k,
we then need to generate a random number in the range zero to two. But wait, r was a random
number in the range zero to 60, and furthermore, we know
that it's less than k. Therefore, r is actually a random number in the range zero to two, and therefore, we can just replace the
rth element with green, no need to call rand a second time.
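A sketch of Reservoir Sampling with that single-random-number trick might look like this. The class and the names are mine, and I'm using the standard <random> machinery instead of rand and mod, but the logic is the same:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// A sketch of Reservoir Sampling: keep the first k elements, then give
// element number n a k/n chance of evicting a uniformly chosen member of
// the reservoir.  One random draw does double duty: r is uniform over
// [0, n), and conditioned on r < k it is uniform over [0, k), so it also
// picks the slot to evict -- no second call to the generator needed.
template <typename T>
class ReservoirSampler {
 public:
  explicit ReservoirSampler(std::size_t k) : k_(k) {}

  void add(const T& item) {
    ++seen_;
    if (reservoir_.size() < k_) {
      reservoir_.push_back(item);  // still filling the reservoir
      return;
    }
    std::uniform_int_distribution<std::size_t> dist(0, seen_ - 1);
    std::size_t r = dist(gen_);
    if (r < k_) reservoir_[r] = item;  // keep with probability k/n
  }

  const std::vector<T>& sample() const { return reservoir_; }

 private:
  std::size_t k_;
  std::size_t seen_ = 0;
  std::vector<T> reservoir_;
  std::mt19937_64 gen_{std::random_device{}()};
};
```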
Algorithm in hand, my intern protagonist went off and implemented Reservoir Sampling inside of our BigGrep codebase, and she added the dash dash sample flag, which will successfully
return a subset of results even if there are grossly too many results to be returned by the BigGrep server. Dash dash sample has stayed in production for the last four years,
and is still in use today. All right. I've got one last algorithm for you. HyperLogLog. Who here, by the way, has ever made a table in a database? Wow, I feel embarrassed
now, I haven't. (chuckles) All of you know how to
database way better than me. Because I've never had to. Facebook has a lot of teams, a lot of really good abstractions,
a lot of good services that deal with the database
abstraction for me. So all I really need
to do is take my data, throw it over a wall, and I trust that other people, other teams, other services are going to do all the hard stuff. They're going to deal with retention, they're going to deal with sampling, and they're going to
have these really nice web UIs where I can go,
when I can look at my data, and they'll take P99 of my data for me, and they'll make me pretty charts, and they even have these dropdown buttons that I can press with my
mouse, it's fantastic. Very, very pleased with the
state of databases at Facebook. Until one day... The web UI did not have a button
to do what I wanted to do. What I needed to do with
my data is I needed to know how many distinct elements there were. Specifically, what I was doing is I had the big, huge database
of all the different compilations that Facebook has run, and I wanted to know how
many distinct filenames Facebook had compiled
in the last 24 hours. And there wasn't a button for this. It was, it was awkward. I had to go to the Wiki. Wiki... How do I database? (tongue clicks) And the Wiki was actually pretty good, but it told me what to do, it told me how to open the database,
it told me how to find the specific table that I was looking for, and it even told me the
name of the magical function that I had to call. Approx count distinct HLL. Okay, count distinct is what I want, I do want to know the
number of distinct elements in my database, but why approximate? I'd be quite fine if you just gave me an exact answer, thanks. (audience laughs) And also, what does HLL stand for? It turns out, HLL stands for HyperLogLog, and it is a damn cool algorithm. The thing that HyperLogLog does is it counts the number
of distinct elements, but for the exact same
reason that it was impossible to find who the most common
users of my service was in Heavy Hitters, it is
also impossible to count the number of unique
elements of my database in a single pass using
less than order n space. And for databases, yeah,
sometimes they're big. And so what are we going to do? How can we get around this problem of needing to have order n space? And the answer is that
HyperLogLog trades away exactness in order to reduce
the space from order n to a particularly tiny number. As the original author of
HyperLogLog would've said, HyperLogLog can be used to
estimate the number of words in the whole of Shakespeare's
works using less memory than it takes to represent this slide. It's a very space-efficient algorithm. And the key observation for HyperLogLog is that if you have a
random stream of bits, then the probability that
the first k bits are all zero is one over two to the k. So we can imagine, maybe I have
128 distinct random numbers. I would expect for those random numbers, those 128 random numbers,
I'd expect 64 of them to have no leading
zeros, 32 of them to have one leading zero, 16 of them to have two, eight to have three, four to have four, two to have five, one to have six, and one of those 128
distinct random numbers to have seven or more leading zeros. So yeah, that's a nice observation. How do we even use it? I see there's three big
problems with this observation from being turned into a
practical approximation algorithm. And the first problem is
that this is an observation about numbers of leading zeros, which is a property of random numbers. And my input is not random numbers. My input is arbitrary data. Specifically, my example was filenames. Filenames being strings, not integers, and also paths which are highly structured and certainly not random. Fortunately, there's a relatively straightforward solution to this. We're going to hash our input. Hash functions, turning arbitrary data to sufficiently random
numbers since the 1960s. One of the other really great
things about hash functions is that hash functions are deterministic. If I had hello world dot cpp
occurring twice in my data set, then hello world dot cpp would
hash to the exact same value. If I've got 1,000 strings,
500 of which are unique, then when I hash them, I'm
going to get 1,000 numbers, half of which are unique. Technically, there's hash collisions, but HyperLogLog is already
an approximation algorithm. And I'll show you guys later that it usually runs at 2% error rate. And with 64-bit hashes, and
a built-in 2% error rate, hash collisions aren't
going to make a difference until you have about a
quintillion elements. That's a billion billion. So, whenever you guys have
a billion billion elements that you want to figure out
how many of them are distinct, HyperLogLog is no longer for you. Either that, or get a
bigger hash function. Okay, so we can take our input, we can hash it to get random numbers, and we're good on that score. The next problem with this observation is that this is an observation
about random numbers. However, our input is arbitrary, and specifically, our input
has arbitrary duplication. That's the whole point of HyperLogLog. And the problem with duplication is that our nice distribution
of how many leading zeros we expect to see is totally
wacked off and bonkers. And so how, how are we
going to solve the problem of duplication messing up our
distribution of leading zeros? The answer is we need
to find some property of our distribution that is not affected in the presence of duplications. And the answer for that is
we're going to use the maximum. We are not going to track
how many leading zeros there were for every different number, we are merely going to
track the largest number of leading zeros that we have
seen in the entire data set. So back to the example with
128 different elements, we see our distribution,
and we see that there's one element that has seven leading zeros, and then without actually knowing that the data set contained
128 distinct elements to begin with, we would guess,
because we see a maximum of seven leading zeros
that our data set contains, two to the seven equals 128,
distinct random numbers. Which brings me to the final problem, which is that that estimation sucks. It is a terrible, terrible
estimation to just say, find the maximum number of leading zeros and to whatever the power of that number is your cardinality. Doesn't work. What happens if, instead of
seeing seven leading zeros, by random chance, you saw ten. Three extra leading zeros. That should happen about 12% of the time. So 12% of the time, instead of estimating that you had a 128 distinct elements, you would estimate that
you had 1,024 elements, exceeding the real
number by a factor of eight. How are we going to solve this? The answer is that statistical
randomized algorithms have two nice tricks up their sleeve. The first trick that randomized algorithms sometimes employ is to
repeat the algorithm several times with different
sources of randomness. And then to aggregate those
different answers together to form a more statistically
correct answer. We'll come back to the aggregation function in a bit. Because in HyperLogLog, we are
relying on the hash functions to provide us with our random bits, what we need to do is we
need to run this function some number of times with
different hash functions. The second trick that
approximation algorithms like to employ is you can sample the data. You don't need to investigate
every single thing, as long as you pick a
random subset of the data, you can form a statistically
accurate estimate of what your data set is doing. How might you pick k random elements from your stream, you might ask. There might be an algorithm for that. But for the moment, we can combine these two ideas in HyperLogLog. And what we will do is
we will process elements one at a time, and we
will look at an element and partition it. We'll say, hey, by the way, this element belongs to partition 10. We'll send it off to partition 10, we'll hash it using partition
10's unique hash function, and then we'll count the
number of leading zeros, and we'll update the maximum
number of leading zeros that partition 10 has
seen with this new number. There is a really interesting
implementation detail here, as well as a very subtle detail. The cool implementation
trick is we can dispense with this have a billion
hash functions problem, we can get away with
only one hash function. As long as you're willing
to have a number of buckets, a number of partitions that
is a perfect power of two. Suppose you want to have
eight different partitions. That's three bits. When you see a new element, take its hash, and use the first three bits of the hash to identify the partition to
which your element belongs. The other 61 bits of your hash, or n minus three for
whatever n happens to be, is your partition's
specific hash function. And so in this example, we
see an element that hashes to zero one zero zero zero one zero one, take the first three bits. Hey, those first three bits
represent the number two, this element belongs to partition two. Following which, we take
the other five digits, and we observe that those five digits have two leading zeros. Hey, we're going to
update the maximum number of leading zeros for partition
two with the value two.
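In code, that bit-splitting trick is only a few lines. Here's a sketch with b equals three to match the example; the names are mine, and the leading-zero builtin I use is a GCC and Clang extension:

```cpp
#include <cstdint>
#include <vector>

// A sketch of the update step, with b = 3 (so m = 8 partitions, matching
// the example).  The first b bits of the hash pick the partition; the
// remaining 64 - b bits are that partition's private hash, and each
// partition remembers only the largest number of leading zeros it has seen.
struct HllRegisters {
  static constexpr int kBucketBits = 3;  // b = 3  ->  m = 8 partitions
  std::vector<std::uint8_t> max_zeros =
      std::vector<std::uint8_t>(1u << kBucketBits, 0);

  void add_hash(std::uint64_t hash) {
    std::uint64_t bucket = hash >> (64 - kBucketBits);  // the first b bits
    std::uint64_t rest = hash << kBucketBits;           // the other 64 - b bits
    // __builtin_clzll is a GCC/Clang builtin; C++20's std::countl_zero
    // would do the same job.
    int zeros = rest == 0 ? 64 - kBucketBits : __builtin_clzll(rest);
    if (zeros > max_zeros[bucket]) {
      max_zeros[bucket] = static_cast<std::uint8_t>(zeros);
    }
  }
};
```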
The subtle detail here... is that the hash function means that we are partitioning the input space, not the input size. If we have hello world dot
cpp twice in our data set, it's going to hash to the
same thing both times, and that means that it's going to get sent to the same partition. That means that all of the
strings that make it into partition a are disjoint
from all of the strings that make it into partition b. Something to keep in mind for a moment. And so what does this look like? What does this look like when we run it on 10,000 random integers? And the answer is that it
looks something like this, this is literally a run
that I ran on my computer. And we see, for example,
that partition six, of all the different
numbers that were assigned to partition six, six saw a
maximum of 10 leading zeros. Okay, that would mean that we'd estimate that partition six has
two to the power of 10, which is 1,024, distinct elements. We would also estimate that partition five has 2,048 distinct elements. Keeping in mind that the elements that go into each partition are distinct, it is tempting to think that
we would get a good estimation by summing up the estimates
for each individual cell to get our final guess. Problem is that that
estimation really sucks. I said there were 10,000 elements that generated that HyperLogLog table. It turns out that if you sum up the expected number for
all the individual boxes, you get an estimate of 50,000. Five times your original number. And the reason you get 50,000
is because of that seven box that has a 15 in it. Box seven, by random
chance, saw 15 leading zeros and so estimates that your data set has at least 32,000 distinct elements. So, we're a little bit stuck there. What are we going to do instead? Fortunately for us, some
very smart mathematicians have come up with an answer. Instead of taking the sum,
use the harmonic mean. Couple of extra details here. Because all the different
boxes are unique, you have to add an additional factor of m to account for the fact that all the boxes have different strings in them. And in the formula that
they provide in their paper, you also need to throw in an
additional constant factor to account for the fact that
there is structural bias in how this number is computed. But at the end of the day, you
have an implementable formula that you can apply to the various numbers of leading zeros for each of your partitions.
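Here's a sketch of that combining step, following the formula from the HyperLogLog paper but leaving out the small-range and large-range corrections the paper also specifies. The function name is mine:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// A sketch of the combining step from the HyperLogLog paper, minus its
// small-range and large-range corrections.  The paper's registers store
// (leading zeros + 1), hence the "+ 1" below; alpha is the paper's
// bias-correction constant, approximated here with its large-m formula.
double hll_estimate(const std::vector<std::uint8_t>& max_zeros) {
  const double m = static_cast<double>(max_zeros.size());
  double harmonic_sum = 0.0;
  for (std::uint8_t z : max_zeros) {
    harmonic_sum += std::ldexp(1.0, -(static_cast<int>(z) + 1));  // 2^-(z+1)
  }
  const double alpha = 0.7213 / (1.0 + 1.079 / m);
  return alpha * m * m / harmonic_sum;  // harmonic mean, scaled up by m
}
```

Feed it the max_zeros table from the earlier sketch and out comes the estimate.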
And if we apply that formula to these observed numbers, we get an estimate of about 8,000. Keeping in mind there were
10,000 elements to begin with, so we've got an error of about 20%. That's not actually all that great. There's a reason for that, though. This estimate is 20% off because I chose a very small value of m,
I chose m equals eight. The expected error of
HyperLogLog is proportional to 1.04 divided by square root of m. When m is eight, that comes out to an expected 35% error margin. Most people, I have been led to believe, choose 2,048 distinct partitions. M requires the first 11
bits of your hash function. Because when m is 2,048,
your expected error comes out to a mere 2%. So, I found this function. Database, please approx count
distinct HyperLogLog my data. (clicks tongue) Hmm. That just worked. That's one of the glorious
things about algorithms. They just work. You don't need to invent
them, you don't need to implement them, you don't really even need to understand what they're doing. All you have to do is you
have to know that they exist, and know what problem they solve. And that is what turns an algorithm into a useful tool that you can use. There are so many algorithms and there's one place that
I like to go to find them. Include algorithm. It's a very nice place,
I like going there, seeing what algorithms are on offer. It's not the best. Do I wish there were more algorithms? Yes. Do I wish that the algorithms didn't have an iterator interface? Also, yes. But do they work? Are they fast, are they
tested, are they rigorous? Yes! Every algorithm that you find
inside of include algorithm will be an extremely efficient algorithm to add to your toolbox. Some of my favorites. Nth element, given a list of
data, find the n largest ones. Accumulate, given a stream
of input, sum it all up. And count if, another one of my favorites. People have a tendency to corrupt the data in the services that I own,
and sometimes I need to count how much of my data is still valid. Really, whenever anyone,
when I'm doing code review, and anyone writes a for loop that does something relatively simple,
I go to my favorite C++ reference website, cppreference.com, and I look through their
algorithms description, and I try to find one that
does what I want them to do. Literally, I managed to
do this with nth element. A co-worker of mine was taking his input, he was sorting it, and then
he was taking the top k. And I was like, hey, hey
bud, here's a cppreference link to nth element, if
you'd like to shave a factor of log(n) off your runtime. But I'm not going to talk much more about standard algorithms, and the reason I'm not going to talk much
more about standard algorithms is because there's already a great talk on this subject available on YouTube. Two years ago, at CppCon in 2015, Michael VanLoon gave a talk,
STL Algorithms in Action, and it was one of my
favorite talks that year. It contained just a whirlwind tour of the STL Algorithms library, and for me, it was really useful because
it broke down the barrier of mysteriousness and
weird function call syntax that was present in the STL. Before that talk, I did
not use the STL very much, I was not very familiar
with it, but Michael's talk introduced me gently
into those algorithms, and I use them all the time now. If you do not use std algorithms, go onto YouTube, watch
Michael VanLoon's talk. It'll be a good use of your time. I could keep talking for a long time. I could talk for days about algorithms. The thing is, though, if
I were to talk for days about algorithms, instead of
limiting myself to just a few, we would all get very bored very quickly. Because most of the algorithms
that I know you also know. We all know binary search,
we all know quick sort, we all know substring, and
that's because algorithms are the tools of our trade. I hope that today's
presentation of algorithms has contained some new algorithms for you. Tools, rare enough to be novel, but useful enough to be found in practice. Each, in their own special way, fantastic. My name's Nicolas. It's been a pleasure
speaking with you today. (applause)