- The Holy Grail, A Persistent
Hash-Array-Mapped-Trie for C++ and in the course of this talk, I will explain what each
of those words means, so if you're not quite sure yet, hopefully you will be by at
least the middle of this talk. Actually starting with the
first part, the holy grail. Why am I using that term? I said the holy grail exclamation mark, maybe that should be the
holy grail question mark because what I want to examine is whether we can actually
achieve a container that has properties that have the
minimum number of trade-offs, given what we actually
want to get out of it, so we'll have a look at
what those properties might be as we go through. We'll see whether we actually achieve it, so we'll all discover this together and I want to start with
that last word there, trie, because I've been pronouncing it tree and you can see some
trees in the background to try and emphasize that, it's also a good excuse
for me to get one of my holiday pictures in, it comes from the word
retrieve and that's why it's pronounced tree but some people may pronounce it try, that's fine as well. In fact, if we look at the
Wikipedia article on tries, it says they were first described in 1959. So it's not a new idea; many of you are probably already familiar with these sorts of tries, usually as some sort of string dictionary. We're gonna use them slightly differently today. It says the term was coined two years later by Edward Fredkin, who pronounced it as tree, as we said, because of the middle syllable of retrieval. However, other authors
pronounce it as try, as we said, in an attempt to distinguish
it verbally from tree. I'm gonna stick with
the tree pronunciation, partly because I just
prefer historical accuracy and partly because I'm already
quite well known for Catch, I don't wanna be known for Try as well. (audience laughs) So with that, we'll come
back to how this fits in in a moment but for a bit of background, we talk about sets and
actually, more generally, we wanna look at all the
associative containers currently in the standard. You'll see why shortly, hopefully. There's eight of them
currently, if I counted right and you can sort of group
them quite nicely like this and you can see that
there's some quite obvious pairings there. They divide into the single and multi versions. I'm sure you know what the differences are, but basically it comes down to multiple values for the same key: only the first one will be entered into a single map or set, whereas multimaps and multisets can take multiple values with the same key. But the underlying data structure is usually identical in both cases, so for our purposes we don't need to consider both pairs, and we're going to drop the multi versions for now. Let's keep it simple. So now we can see it divides nicely, top and bottom, between sets and maps, and again, the underlying data structures
are usually the same, so you can implement a
map in terms of a set just by having a set of key value pairs where the comparison is on the key, so again, pretty straightforward stuff. So, for our purposes, we
don't really need to consider both pairs, so, let's
drop the maps as well and that's how we end up
just talking about sets, so now we come to this division of properties between standard set and standard unordered_set. So before we look at what those properties are, just an honorable mention for one other associative container, the (sorted) vector: you load up a vector and you sort it according to your key, and then you can just do a
binary search for lookup, that's still a great choice
for many applications, so thought I'd mention it here but it's not strictly speaking an associative container. We may sort of come back
to that a little bit later. But between sets and unordered_sets, we first of all observe that the underlying data structure for a set is usually, though it's not required to be, a red-black tree. It's got to be some sort of balanced binary tree, and as far as I know all of the standard library implementations use a red-black tree. We will have a look at red-black trees a little bit more later, so I won't talk about them more now, but just to contrast that with unordered_set, which is based on a hash table. Then we have requirements on the types, and this may dictate what you can actually put in these things. So, standard set requires a strict weak ordering, which implies that we either have a less-than comparison operator or we can provide our own predicate for that, but we must have this ordering for it to work. Unordered_set notably doesn't have that requirement, which, as we'll see later, does become relevant, but it does require the type to be hashable, either with a specialization of standard hash or your own predicate, and equatable, so comparing equal, and again you can provide your own predicate. So they actually place quite different requirements on the type, though most types can supply both.
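As a minimal illustration of those two sets of requirements (the Employee type here is made up for the example), the same type can be made usable in both containers:

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <unordered_set>

// Hypothetical value type, just to illustrate the two sets of requirements.
struct Employee {
    int id;
    std::string name;
};

// std::set needs a strict weak ordering: operator< (or a custom comparator).
bool operator<(Employee const& lhs, Employee const& rhs) { return lhs.id < rhs.id; }

// std::unordered_set needs equality comparison...
bool operator==(Employee const& lhs, Employee const& rhs) { return lhs.id == rhs.id; }

// ...and a hash, e.g. by specializing std::hash (or passing your own hasher).
namespace std {
    template <>
    struct hash<Employee> {
        size_t operator()(Employee const& e) const noexcept {
            return hash<int>()(e.id);
        }
    };
}

std::set<Employee> ordered;          // uses operator<
std::unordered_set<Employee> hashed; // uses std::hash<Employee> and operator==
```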
Then there's the bit we're normally interested in: the complexity guarantees. Red-black trees, or the equivalent, have log n complexity; this is on lookup, we're primarily talking about lookup here. Whereas unordered_set is constant time on average. We usually forget that bit: it can actually tend towards linear in the worst case. We don't often get to that case, but it can become relevant, and we will see an example of that later, so it's not perfect. But in general this does translate to unordered_set being significantly better performing than a red-black tree or a standard set, so where that matters we'll usually reach for something like standard set, sorry, standard unordered_set or unordered_map for our associative container needs.
that I want to talk about that these don't have
and that's what's gonna lead us on to the main
topic of our discussion. So, we'll look at that next. So, this word is persistence and I put this picture up here, this is me at the end of the
London Marathon last year. It's an example of what we're
not gonna be talking about, not this type of persistence, but also it's not the type of persistence that involves file systems, databases, any sort of serialization. We're not talking about that either, that's often conflated here. What we're actually talking about is probably best
explained with an example. So, just for the moment, we're gonna talk about-- (audience laughs) Lists. So, we have standard list, but lists in general, particularly in functional programming languages, have an interesting property we'll just briefly look at. Again, I'm sure you're familiar with how a singly linked list works, but we're just gonna run through it very quickly for background purposes. We have a number of nodes, each with an element value and a pointer to the next node, or the previous one, depending on how you look at it. The first one we usually refer to as the head, and then the rest, or the first node in the rest, is the tail. Straightforward enough; there's the pointers. The way it gets interesting is when you want to introduce a new node at the head. To bring this new one in here, all we need to do is write the pointer to the previous head into the new node, so the original list doesn't need to know anything about this new node. So if, say, we're gonna be wrapping this up in some other type, we introduce those here, and we can now have two lists in memory at the same time, sharing all of their common state. This is what we mean by persistence: the old value persists. The old container is still persisting while we're using the new one. And we could do this the other way as well: if we remove items, we just point to an earlier head, and again the previous instances don't need to know anything about the new one. Now, this is not the way standard list works, but it is the way lists tend to work in functional programming languages, which is how you can treat them immutably and yet still produce new values. Obviously you can stack up all sorts of operations that way, and when you're finished, if you only want the latest one, you can just collapse that down as if you'd written it that way in the first place.
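A minimal sketch of that idea (not the talk's implementation, just an illustration using std::shared_ptr for the shared tails):

```cpp
#include <memory>

// A persistent singly linked list: nodes are immutable and shared.
template <typename T>
struct PList {
    struct Node {
        T value;
        std::shared_ptr<Node const> next;
    };
    std::shared_ptr<Node const> head;

    // "Push front" returns a new list; the old list is untouched
    // and shares all of its nodes with the new one.
    PList push_front(T value) const {
        return PList{ std::make_shared<Node const>(Node{ std::move(value), head }) };
    }

    // "Pop front" just points at the old tail; again, nothing is copied.
    PList pop_front() const { return PList{ head ? head->next : nullptr }; }
};

// Usage: both lists coexist, sharing the common tail.
// PList<int> a = PList<int>{}.push_front(3).push_front(2).push_front(1); // 1,2,3
// PList<int> b = a.push_front(0);                                        // 0,1,2,3 shares 1,2,3
```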
Now obviously, in C++, we don't tend to use linked lists very often; they're not particularly efficient, and usually something like standard vector's gonna serve us better. But if we want this property of persistence, it could be a really useful data structure. This was just to introduce you gently to the idea. Obviously the problem here is
we've only got one pointer, so, if you do want to do
any sort of modifications further down the list, then you actually start to get
into some extensive copying but if we put two pointers in there, then we can have a tree structure, we just get a simple binary tree, which has some other problems
we'll talk about in a moment but it is particularly
good for persistence. So let's introduce a new value. Now this four here, I think that's right, we can put in a couple of places: we can hang it off the three node or the five node. But the five node, because it's a leaf node already, is much easier; we don't need to change anything to put it down there, so we can add that in as a new leaf node. If we're just using a mutable container, then we'd be done here, nice and easy. If we want to make this persistent, though, then we can't write a new pointer into the previous five node, because we'd effectively be invalidating it. So what we do is take a copy of that node and write the new pointer into the copy. But now we have a new five node, so obviously we need to write that into its parent node, and so on all the way up to the root. So we end up rewriting that whole path up to the root every time we make a mutation. There is some copying, but particularly with a large tree it's actually quite a small proportion of the total number of nodes that we need to copy. We can then share a significant amount of the common state, so that's quite a nice property.
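Here's a rough sketch of that path copying idea for a persistent (unbalanced) binary search tree; again this is just illustrative, using shared_ptr for structural sharing, not the implementation from the talk:

```cpp
#include <memory>

template <typename T>
struct PTree {
    struct Node {
        T value;
        std::shared_ptr<Node const> left, right;
    };
    using NodePtr = std::shared_ptr<Node const>;
    NodePtr root;

    // Insert returns a new tree. Only the nodes on the path from the root
    // to the insertion point are copied; everything else is shared.
    PTree insert(T const& value) const { return PTree{ insert(root, value) }; }

private:
    static NodePtr insert(NodePtr const& node, T const& value) {
        if (!node)
            return std::make_shared<Node const>(Node{ value, nullptr, nullptr });
        if (value < node->value)  // copy this node, pointing at a new left subtree
            return std::make_shared<Node const>(Node{ node->value, insert(node->left, value), node->right });
        if (node->value < value)  // copy this node, pointing at a new right subtree
            return std::make_shared<Node const>(Node{ node->value, node->left, insert(node->right, value) });
        return node;              // already present: share the whole subtree
    }
};
```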
I said there are problems with simple binary trees, though, in that if you load them up this way they will tend to get very unbalanced, with very long branches on one side and not so much on the other side, which completely breaks that log n complexity guarantee. So we have strategies for balancing them. We mentioned red-black trees earlier; that's just one such strategy. I'm not going to look in
detail at red-black trees 'cause that's not the
topic of this talk but these are the rules, if you're interested you can fairly easily create your own. So I just want to apply
those rules to this tree, just for the sake of this example because there's another
side effect of this that I want to examine. So, same example but now each
node is colored red or black, that can usually be done with no overhead, it's just a convenience thing. We introduce our new node at the bottom. That breaks one of those
rules we just talked about, the red node invariant, red nodes cannot have red children, so we need to start applying
these rebalancing rules which usually apply to
the parent, grandparent and uncle nodes, so you
could do it fairly locally. In this case, it just means
recoloring the nodes in the immediate vicinity, and having done that, we've restored our invariant. But now we've invalidated the next level up; we've now got another violation of the red node invariant, so we have to keep applying that, again, all the way up to the root. When we get to the root, the root node must be black, so we just recolor that as well. So, fairly straightforward, and that gives us this nice property of being, on average, fairly balanced. That's what we wanted; that's how our red-black trees work. But the reason I've
gone through all of this is because on the way up, if we're making this persistent, then we will have touched
these additional nodes as well, on top of all the nodes up to the root, so there's a little bit of collateral damage. If we're making these data structures persistent, then there's a little bit more overhead to the copying, as well as the nodes up to the root, but it's still useful. In fact, if you go to Wikipedia, the source of all knowledge, and look at the page on red-black trees, there's this paragraph at the top. It says, "Red-black trees are
also particularly valuable in functional programming, where they are one of the most common persistent data structures, used to construct associative arrays and sets that can retain previous versions after mutations." That's what we've been talking about. So, they are very useful, and I've written a couple
of these in the past. They work pretty well. But rewinding a bit: earlier we were talking about standard set compared to unordered_set, and the performance and complexity guarantees. Generally we want to reach for the unordered versions, the hash tables, for performance purposes. So if we want these to be performant, then we're actually adding
overhead to what we already had but it's very difficult,
if not impossible, to make a hash table persistent. You end up either having
to copy the whole thing or large parts of it and you can make some changes
to make it more suitable for persistence, we're not gonna go there but you tend to end up where we're
going to be going anyway, with something more like the
hash trie we're gonna present. But I'm gonna talk about
it from the perspective of red-black trees and
work our way from there. So, what was the problem
with red-black trees? The problem all comes from the fact that we only have two pointers in each node, because of that, then as our overall tree grows, then the depth for the
tree is gonna get bigger at quite a fast pace, so for example, a tree of 15 million values
will have a tree depth of about 24 nodes, so for any given lookup
that's a lot of pointer hops that you gotta go through
to get down to the value that you want and that's
where the log n complexity will get you. It all comes from the fact that we've got up to two pointers in each node. So that's the problem; the solution must be to add more pointers, and it is. In fact, we could stop there, but that raises a couple more questions, and in answering those questions, that's gonna lead us to our hash array mapped trie, and that's gonna actually
have a few interesting side effects as well. So, the first question is we've got all these
pointers but how do we know which one to follow as we're
traversing down the tree. With the binary tree, it's
simple because you just do the less than comparison,
takes you left or right, but what do we do here? So, the answer is, well, let's just say we have a node with an array of pointers; we're gonna use 32, and 64 is another common number, but it's usually one of those two, in fact. So, imagine you've got
an array of pointers and some of them are
gonna contain pointers to other branch nodes,
some of them to leaf nodes. How do we know which one to follow? What we do is take the hash of the value, or of the key, rather, and then we peel off the first, in this case, five bits from the hash. That gives us a value, 27 in this example. That's our index into the array, so we take out that pointer and follow it down; we get a branch node in this case, and there we take the next five bits from the hash, peel that off, and that gives us another value, six in this case. That's our next index into the next array, and that may take us to another branch node; in this case we're at a leaf, so we're done. But now, because we're using hashes, we'll also have hash collisions, so those leaf nodes will be containers of values, much like the buckets in hash tables, and we may need to do a linear search, or some other type of search, at the end to get our value. But usually, once we've got to the leaf, we're almost done. So that sort of answers the first question, and that's where the hash part of our title comes in.
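As a rough sketch of that traversal (names and node layout here are invented for illustration, not the talk's actual code), a lookup peels off five bits of the hash per level to pick the child to follow:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical node layout, just to illustrate the traversal.
struct Node {
    bool is_leaf;
    Node* children[32];   // branch: up to 32 children (conceptually; see the sparse version later)
};

constexpr unsigned bits_per_level = 5;                    // 2^5 == 32 children
constexpr std::uint32_t level_mask = (1u << bits_per_level) - 1;

// Walk down the trie, using five bits of the hash at each level as the index.
Node* find_leaf(Node* root, std::uint32_t hash) {
    Node* node = root;
    for (unsigned shift = 0; node && !node->is_leaf; shift += bits_per_level) {
        std::uint32_t index = (hash >> shift) & level_mask;  // peel off the next 5 bits
        node = node->children[index];
    }
    return node;  // a leaf (whose values we'd then search for an exact match) or nullptr
}
```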
The other question is, well, how do we make this memory efficient? We've got 32 or even 64 pointers in each node, and that's gonna get
very big, very quickly. So, what we note here is that
most of those slots are empty most of the time, we've
only got a few pointers actually in there, we're
not using them all. So, using some sort of
sparse array is the answer, so what we're gonna do is
just have a physical array of only the number of
pointers that we need, three in this case, and then an additional
integer that we're gonna use as a bit map to tell us
what that actually maps to in our virtual array. So all of the set bits in our bit map correspond to the actual positions in the array, and the unset bits to null pointers. Then we just need a way of mapping those, and all we need to do is, for any given index, go to that bit and then count the number of set bits to the right of it, and that'll give us our index. So if we look at that bit there, on the left hand side, there are two set bits to the right of it, so an index of two into our sparse array gives us the result we wanted. Simple. But then how do we count the bits? We want to do that efficiently as well. That's another interesting question; we'll come back to that in a little while as well.
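Here's a small sketch of that mapping (the SparseBranch type is an assumption for illustration; the real node layout in the talk's code differs): a 32-bit bitmap says which of the 32 logical slots are occupied, and the physical index is the number of set bits below the logical one.

```cpp
#include <cstdint>
#include <vector>

struct SparseBranch {
    std::uint32_t bitmap = 0;       // bit i set => logical slot i is occupied
    std::vector<void*> children;    // only the occupied slots, in order
};

// Count set bits strictly below `sparse_index`: that's the physical index.
inline unsigned to_compact(std::uint32_t bitmap, unsigned sparse_index) {
    std::uint32_t below = bitmap & ((1u << sparse_index) - 1u);
    return static_cast<unsigned>(__builtin_popcount(below));  // GCC/Clang builtin
}

inline void* child_at(SparseBranch const& node, unsigned sparse_index) {
    if (!(node.bitmap & (1u << sparse_index)))
        return nullptr;                                  // unset bit: no child there
    return node.children[to_compact(node.bitmap, sparse_index)];
}
```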
Where we've ended up now is something like this. It's a tree structure in the sense of T-R-E-E, but it's also a trie, T-R-I-E, much like you may have seen with a string trie, a string dictionary trie, where you peel off each character of a string to give you the index to the next level. In this case we're using a
hash as we just talked about and that will give us our
path down through the tree to the leaf node and
then hopefully a value. So, what are the properties
of this data structure? So, we looked at the example
in the red-black tree a moment ago. There were 15 million nodes, had a tree depth of 24. Well the maximum depth
of one of these trees, with a 32-bit hash, is only gonna be six or seven deep, maximum. I say six or seven because
you've got those last two bits at the end, you may not want
to actually follow that, so let's say six and the
average depth for those 15 million values in the
example is only five, that's a lot better than 24, except the leaves hold arrays of values, we'll look at that a little
bit more in a moment. They're actually more space
efficient than hash tables because of this compacting
that we have to do as part of the storage. The complexity is technically log base-32 of n, which sounds a bit awkward, harder to reason about. In practice, obviously there is the logarithmic aspect to it, but we'll tend to be much closer to that sort of constant-time average lookup of a hash table than the log n of a red-black tree. But we will look at the
practical aspects in a moment and this additional thing,
no need to rebalance. There's a caveat, we
do need to have a good hash distribution because
the distribution of the hash will actually impact the
distribution of the tree itself but given that, the tree
should be fairly well balanced to start with, so,
consider what that means in terms of persistence. We've still got a tree structure, so we still have to do that rewriting up to the root, but now we've got far fewer nodes to rewrite, and we don't have any collateral damage due to the rebalancing, so this is already much better than where we were with the red-black tree. Okay, you read the abstract on sched.com that I put up, oh, by the
way, before we get to that, the photo in the background there, anyone recognize that? Yeah, see, the waterfall from Twin Peaks, turns out to be about half an
hour down the road from here, so another one of my photos. Now in the abstract I made this claim. I said that, talking of hash tries, they're close but not quite
as fast as pure hash tables, that's the claim. We sort of looked at the background of why that might be the case but how does it actually weigh up in practice? Let's look at some figures. So, some performance metrics,
I did some benchmarks. My methodology, I used a
micro-benchmarking framework called nonius.io and the way that works is you'll first sort of
run a few tests against the system clock to gauge the precision so it knows how many iterations to run of your benchmark and it'll
then run that enough times for x number of samples, defaults to a hundred,
I've done most of these at a thousand to give us plenty of figures and it'll then do statistical
analysis on those samples to look at the average and the variance and exclude any outliers, so it's doing a lot of the work for us, so that's pretty good. I'm using libc++ for this, I should mention, the only one I've measured so far, and that is relevant, particularly because we're gonna be comparing against standard set and standard unordered_set, so it's that particular implementation; other implementations may vary, of course. At least for our first test, what I've done is I've loaded up each of those containers with 100 thousand integers, just monotonically increasing, and then we will compare them: that's our hash trie, or hamt as I've abbreviated it here, standard set, and standard unordered_set. Let's look at some figures.
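The talk used nonius for this; as a rough stand-in (not the speaker's actual harness, and omitting nonius's sampling and statistical analysis), a bare-bones timing loop over the three containers might look like the following, with hamt_set standing in for the hypothetical hash trie type:

```cpp
#include <chrono>
#include <cstdio>
#include <set>
#include <unordered_set>

// Crude timing helper: how long does it take to insert n monotonically
// increasing integers into a default-constructed container?
template <typename Container>
double time_insert(int n) {
    auto start = std::chrono::steady_clock::now();
    Container c;
    for (int i = 0; i < n; ++i)
        c.insert(i);
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    constexpr int n = 100'000;
    std::printf("std::set:           %.2f ms\n", time_insert<std::set<int>>(n));
    std::printf("std::unordered_set: %.2f ms\n", time_insert<std::unordered_set<int>>(n));
    // std::printf("hamt:               %.2f ms\n", time_insert<hamt_set<int>>(n)); // hypothetical type
}
```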
So here's one of the graphs. This is inserting 100 thousand integers. On the right hand side, in green, we've got our hash trie; on the left hand side is standard set; and in the middle, that sort of orange bar at the bottom, is unordered_set. Now the y axis is time, so
that means that the bigger the bar, the worse we're performing, so we're not doing
particularly well on this one, in fact, we're about, I
think, four times slower than unordered_set, 20 times, sorry, four times slower than standard set, 20 times slower than unordered_set, nowhere near my claim,
so what's going on here? Well, this is insertion. My claim was primarily about lookup but insertion's important. When I run these metrics, there's basically no
optimization on the insert; we're paying the full cost of persistence, copying every single time we add something. There's a copy-on-write optimization that we can apply, because we do reference count the nodes, and I've done this with red-black trees in the past. It's very effective: if you're loading up an initial tree, you pretty much do the whole thing mutably, because we haven't actually started using it yet. I haven't done that here yet, so we are seeing the full cost of persistence, so yeah, it can be expensive. So let's set that aside for the moment, it's work in progress, and have a look at, oh, before we get to lookup, I should also point out that if we look at the underlying data that nonius has given us, you can see, if I can, there we go, that's the hash trie at the top. On the right hand side there's a lot of extra variance over there. Something's obviously going on
in my machine at that point that kicked in, so nonius has excluded those outliers, but this whole second half has really gone out, so, to be fair, we should probably just consider the first half of that. So what I've done for
the next charts is to just take that data directly
and produce my own graphs. Let's have a look at lookup then. So, the same 100 thousand integers, this time with a find. Now we've got, right at the top, the orange is standard set; the green just below that, that's just confusing. The green just below that is hash trie, and at the bottom is unordered_set. So we're beating standard set now. This is much better. We're starting to get there, but we are still closer to standard set than unordered_set, so we're
not meeting my claim yet, but we're getting there; it's promising. So what's going on here? Remember I said what we're doing is just loading up with 100 thousand integers, monotonically increasing, and we're using libc++. For integers in libc++, the standard hash specialization is just the identity function, so our hash is also a
monotonically increasing series of integers. Remember I said that if
the hash distribution is not very good, then we're
gonna be very unbalanced and that's what we're
paying the cost of here. There are ways to address that but they haven't been applied yet so, we are susceptible to that. So let's move on and have a look, oh, before we do that, that was 100 thousand. I actually did all of
the orders of magnitude between 100 and a million, just to see what that graph looked like, so again, the orders are the same here but now you can see the
profile of the different data structures. So the orange line for standard set is slightly diverging from our hash trie in the first orders of magnitude and then sort of
re-converges towards the top but it's actually
tracking, the hash trie is tracking unordered_set fairly closely. They've got a very similar
scalability profile, so that's interesting to observe. What I did next was I
loaded up all the same orders of magnitude, but
this time using strings, so now we're hashing strings, we're getting a much
better hash distribution so we're getting closer to
what we want to end up with. Now let's have a look at what we've got. So the orange one at the
top is standard set, again, the blue one, unordered_set,
and at the bottom is hash trie. In this one, we're actually
beating unordered_set, so we're definitely in the right ballpark. Now, so this is definitely a quality of implementation issue. Another implementation, I
would expect to beat hash trie, that's fine but, I think this demonstrates
that we're in the ballpark of my claim, with some caveats, hopefully they can be
addressed, insertion time, hash distribution, but this
is what we want to aim for. I was still a bit uncomfortable
about the integers, so I did one more test for this. I took the hashes from these strings and then used those as integers in another set of tests, so
this is another set of integers using those string hashes. What was interesting is everything's actually much
closer together in this one. I should point out, by the way, that the y-axis in these
examples is exponential, so where things are actually
tracking each other, they're actually diverging in practice. We start off with hash
trie winning, again, but then they sort of
converge in the middle and towards the top, actually
this one goes to 100 million, no, 10 million, sorry, towards the top from
around the million mark, hash trie starts to become
the worst performing and standard set is actually winning, actually beating unordered_set. What's going on there? Well, remember we said earlier that because of hash collisions, when we get to a leaf node in a hash table and the same in the hash trie, we're gonna have to do
perhaps a linear search, so at this point, we're
starting to get significant enough hash collisions that
that final linear search is starting to, not dominate but
become a significant factor but what's interesting is that's because of the
requirement on the types. We don't have the strict weak ordering requirement on unordered_set, so when we get to that last linear search, it has to be linear,
really, that's the only way you can do it. If we're not constrained
by that requirement, then potentially we could
do a sorted array at the end in our hash trie and then
do a binary search instead. If we do that, I would expect
our hash trie implementation to be winning at the end as well, so that's another opportunity for optimization that we haven't explored yet. But we're looking pretty good. Let's say we're meeting
the claim in some areas, maybe not in others, hope to
explore those opportunities for optimization, but it shows the potential. I gave this talk in Aspen at C++Now earlier this year. It was a longer talk and went into a lot more depth on the actual implementation, so I've still got these slides
here because I didn't know exactly how long I was gonna have left, so we're not gonna go
through the whole thing, just wanted to pull out a
little bit that was interesting. So, first of all, we remember this one. How we map our sparse array with a bit map and I said that it was
an interesting problem, how we count the set bits to solve that. Here's one possible implementation. This is taken from, in
fact, there's the link, there's a Stack Overflow answer that takes you to a page online somewhere that has, I don't know, dozens of
possible ways that you can do bit shifting to solve this problem. I don't follow this but
I know that it works. There's no branching in here at all, it's just pure bit twiddling,
it's pretty performant, but we can do better. As the comment in the
middle says, actually, most modern architectures
will have a single instruction that can do this, usually called popcount, so if you're using GCC or Clang, there's a builtin, __builtin_popcount, that will do this. You do need to compile with a certain flag to actually get that to map onto the instruction, but when you do that, it performs much better. All the examples I gave are with this. If you use the previous example, the bit shifting, then it doesn't perform as well; everything sort of shifts up slightly, but the basic ordering remains the same, so it's not significantly different.
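For reference, here's the gist of both options; the bit-twiddling version is the classic branch-free popcount from the usual bit hacks collections, and the builtin path is GCC/Clang specific (the exact compiler flag, such as -mpopcnt on x86, is an assumption about your target):

```cpp
#include <cstdint>

// Branch-free "SWAR" population count: sums bits in parallel within the word.
inline unsigned popcount_twiddle(std::uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

// GCC/Clang builtin; with the right target flags this compiles down to a
// single POPCNT instruction on x86.
inline unsigned popcount_builtin(std::uint32_t v) {
    return static_cast<unsigned>(__builtin_popcount(v));
}
```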
But in use, what I'm doing is using the count-set-bits operation to map the bit map to a compact index, 'cause I have all these indexes flying around, compact index, sparse index and so on. I'm using strong typedefs internally to keep track of that, and there's some convenience functions on there. We just need to create a mask of the bits to the right of the one that we're interested in, so that's what I'm doing at the top there, and then that will give us this 'to compact' method, which is just a method on our sparse index strong typedef. So I'm not going to go
into all of this in depth but again, visually, in
practice here's how we use it with a bit map, create a
sparse index and then we can get the compact index from that and that turned out to
be really useful because in overloading, if we
already have a compact index we can go straight there,
if we have a sparse index, that gets mapped along the way. It worked out pretty nicely. Then we have the thing that
peels off the bits from the hash, five at a time in this case; we won't look at that in too much detail.
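A sketch of what such a helper might look like (the name and interface are invented here, not the code from the talk):

```cpp
#include <cstdint>

// Walks a 32-bit hash five bits at a time, yielding the child index for
// each successive level of the trie.
class HashChunks {
public:
    explicit HashChunks(std::uint32_t hash) : hash_(hash) {}

    bool has_next() const { return shift_ < 32; }

    unsigned next() {                        // 0..31 (the last chunk only carries 2 bits)
        unsigned chunk = (hash_ >> shift_) & 0x1Fu;
        shift_ += 5;
        return chunk;
    }

private:
    std::uint32_t hash_;
    unsigned shift_ = 0;
};
```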
Our node structure, I'm gonna skip most of that. Then we get down to the really nasty low-level stuff. There's actually an
issue with alignment here that I need to fix but it is nasty because, first of all, we need to create our
nodes big enough to contain all of the pointers, and our values as well, without an additional allocation, and then placement-new that into our allocation, but do that such that, if this is gonna be a copy of an existing node, we'll need to copy those in as well. So, we're not gonna go
into too much detail there but you get an idea of
what that code's like. Branch nodes, similar, and then, again, actually if
you go to the C++Now videos, you can watch that talk there if you're interested in the detailed walkthrough of the code. So, I'm just gonna get to, okay, right, here, so, the example we looked at earlier. This is our tree structure, so these are nodes; we have a root node and then, in addition
to the root node itself we just need a size, so we can look up the number of elements in constant time. That really is the entirety of our hash trie; all of the hard work's going on in the nodes. So, because we just have those two values, we can collect them together in its own data structure; we're using it in the hash trie, but we can use it elsewhere as well. This is actually small enough to treat atomically on most architectures, and there's an implication to that. The hash trie itself will have a hash trie data member, at the top there, and then just basically provides an interface onto that. Again, we won't go into the details there.
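In spirit it's something like this (the names are illustrative rather than the talk's actual declarations); the point is that the whole container boils down to two trivially copyable words:

```cpp
#include <cstddef>
#include <type_traits>

struct node;  // all the real work lives in the node types

// The entire state of the container: a root pointer and an element count.
struct hamt_data {
    node* root = nullptr;
    std::size_t size = 0;
};

// Being trivially copyable (and small) is what lets us treat it atomically.
static_assert(std::is_trivially_copyable<hamt_data>::value, "needed for std::atomic");
```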
Right, here we go. Then we can have another type. I haven't really discussed this part yet, so that's why I wanted to get on to this slide. This is just the test case, we'll look at the code in a second, but this is shared hash trie,
and the whole purpose of shared hash trie is
just to hold, atomically, that hash trie data and because we can treat that atomically, then if we want to read or
mutate the underlying data, we'll just take a copy of it into our hash trie, we'll
just do an atomic copy or atomic exchange and then when we're working
with our hash trie, we're completely shielded
from any other mutations going on with that hash trie, so it's just a shared
location that we're gonna get the hash trie from that we can then work with and if we mutate it, we can
then write it back, again with a single atomic exchange, and we get very effective concurrency with no locking, on the condition that we are mostly bound on reads rather than writes. If we're doing lots of writes then we'll keep looping around and redoing our work, so it's only appropriate for some conditions, but they're very common conditions. So in this test I'm not really, obviously, using the concurrency; I'm just demonstrating how you can use this transaction to do that. In practice, there's an update helper that will do all the atomic interaction with the hash trie, passing it to you; you can work on it, and then when it goes back it will atomically commit it, and do this in a loop. So if anyone else has mutated it in the meantime, you redo the work. So here's the shared hash trie itself. We're just asserting that hash trie data is trivially copyable, it's a requirement for it to be atomic, and then we've got these atomic operations. Don't pay attention to
the memory order stuff, that needs to be fixed, it's more of a placeholder, and here we go, the memory order stuff again is all wrong. Yeah, so this is the loop that
it's doing for the update; it's pretty simple stuff, but you do have to get it right. This is not necessarily right, but we've got it all in one place. As far as atomic data structures go, it's one of the simplest out there, and that's reassuring, I think.
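The shape of that update loop, sketched against the hamt_data struct from earlier (again illustrative rather than the talk's code, and with the memory orderings left at the sequentially consistent defaults rather than anything tuned):

```cpp
#include <atomic>
#include <cstddef>

// Matching the hamt_data sketch from earlier: a root pointer plus a size.
struct node;
struct hamt_data {
    node* root = nullptr;
    std::size_t size = 0;
};

class shared_hamt {
public:
    hamt_data load() const { return data_.load(); }

    // Apply a mutation to a snapshot and try to publish the result.
    // If another thread published first, retry against the fresh snapshot.
    template <typename Mutation>
    void update(Mutation mutate) {
        hamt_data expected = data_.load();
        for (;;) {
            hamt_data desired = mutate(expected);  // build a new trie from the snapshot
            if (data_.compare_exchange_weak(expected, desired))
                return;                            // published; concurrent readers are unaffected
            // on failure, expected now holds the latest value: loop and redo the work
        }
    }

private:
    // Two words wide, so this may or may not be lock-free depending on the
    // platform and compiler flags (e.g. 16-byte compare-exchange on x86-64).
    std::atomic<hamt_data> data_{ hamt_data{} };
};
```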
So I'm gonna stop with that now, just to sum up, and then we'll go to questions. The claim was that we can approach the performance of hash maps without too many trade-offs. I think I demonstrated that we can achieve that, but there's some work to be done to overcome some of the other cases. They're more space efficient than hash tables, so that's good. Copies are cheap and memory efficient, and we didn't really
go into this too much, touched on the concurrency
implications but think about having a large
container, an associative container, that you can copy with just the cost of two integers, and share large amounts of common state held in memory at once. When I was first working on this, it was in the context of financial data, and what you could do is load up one of these things with an object graph of market data, and then you could make small changes and have entire scenarios, sort of matrices of these object graphs, in memory at once, very, very efficiently, and very, very performant as well. So there are many implications like that and, as we mentioned, it can be made to work very
nicely with concurrency. So, that's the summary. There's a number of references, including some papers that
this is all derived from, I put all the references up on my website, levelofindirection.com/storage/hamtrefs.html. If you can't remember
levelofindirection.com, I also have extralevelofindirection.com, that redirects there. (audience laughs) Or you can reach me on Twitter, phil_nash or you can find me later during the day. So thank you very much for listening. (audience applauds) And do we have any questions? - [Audience Member] Have you considered, given how sensitive you are to the hash function being decent, have you considered
throwing it through a couple of iterations of something like MurmurHash just to improve the default case? - So, the question was, because we are very sensitive to the hash distribution, have I considered running it through some sort of rehashing or remixing algorithm? Yes, that's almost certainly
what we'd need to do to solve the problem and I
did start looking at that and it was actually making
things worse than better so I need to explore that a bit better but there are definitely approaches
that can be taken there. Thank you. Yeah?
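For reference, one such remixing step, applied on top of the identity hash, could be a standard integer finalizer such as the 32-bit one from MurmurHash3 (whether this particular mixer actually helps here is exactly the open question from the answer above):

```cpp
#include <cstdint>

// MurmurHash3's 32-bit finalizer: spreads entropy across all bits, so
// consecutive integers no longer produce consecutive hashes.
inline std::uint32_t mix32(std::uint32_t h) {
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}
```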
- [Audience Member] One thing I didn't understand about your sparse array,
the zeros to find which one to look up, is it from there you would get an offset entity, like you
have a 16 bit thing there and you have three or four
bits (muffled speech). - So, to summarize the
question, if I got it right. You just wanted to clarify
how we're doing the mapping of the sparse array. I'll go back to the,
ah, too much code here. Right, this one. - [Audience Member] So
is that yellow thing, that's only one bit, right? - Yeah, so, the top is an integer, so that's all the bits of the integer, and each set bit in the integer represents a position in your virtual array where you have an actual pointer value, and then all the unset bits are where you would have a null pointer. In the actual array at the bottom, we're only storing the
pointers themselves, so the number of set
bits we have on the top is the number of actual
items in the bottom. - [Audience Member] Okay, just a sec, so you got three pointers on the bottom and the top array is-- - So the top array is
actually just an integer. It's a bit map. - [Audience Member] Okay and so it's 32 to store (muffled speech). - Yeah, so you can do this
with 32 pointers, or 64 pointers if you
have a 64 bit integer. One of the references that
I showed at the end was a games developer who
implemented one of these and he experimented with both
32 bit and 64 bit arrays and in his case, he found
that the 32 bit arrays performed better. He wasn't 100% sure
why, you can speculate, he did suggest that maybe it was a quality of implementation issue
and you can get a better trade-off with 64 bits but there's not a lot in it, so that's why I stuck with 32 bits. Were you clear on how that maps now or? Thank you. Yes, in the middle there. - [Audience Member] How
do you handle freeing or deleting the nodes as the tries go out of scope? - Okay, question was, how do you handle nodes
that have been deleted going out of scope, so in this implementation
I'm using reference counting, it's actually intrusive reference counting, so I'm using a shared pointer. If you look at other languages that implement this, particularly Scala and Clojure, they have garbage collection, and that's usually considered the way you have to do this, but I've got it working with ref counting, performant enough I think. In fact, the reference counting is also what allows us to do the copy-on-write, which I'm not doing in this implementation but have used very effectively before, because as you traverse down the trie to find your insertion point, you check the share count at each level; as soon as you've got a share count of more than one, the rest of that branch is shared. If you get to the leaf and you're unshared, you can do all the mutations in place, and that can save a lot on copying. So we get that benefit from reference counting as well, of being able to do the copy-on-write.
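A sketch of that check during the descent (assuming a hypothetical intrusive node with a ref_count member and a hypothetical next_child helper; this is not the talk's code):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical intrusively ref-counted node (not the talk's actual layout).
struct Node {
    std::atomic<std::uint32_t> ref_count{1};
    // ... children / values ...
};

// Uniquely owned means nobody else shares this node, so it may be mutated in place.
inline bool uniquely_owned(Node const& n) {
    return n.ref_count.load(std::memory_order_acquire) == 1;
}

// During an insert, track whether every node on the path so far is uniquely
// owned; only then is it safe to mutate the leaf in place instead of copying:
//
//   bool in_place = true;
//   for (Node* n = root; n != leaf; n = next_child(n))
//       in_place = in_place && uniquely_owned(*n);
//   if (in_place) { /* mutate in place */ } else { /* path-copy as usual */ }
```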
I'm not sure it'll be so easy with garbage collection, but that's how we're doing it, just reference counting. Uh, yeah? - [Audience Member] In the implementation where you have a hash array and have a multitrie,
would it be possible to do an efficient merge operation
on two such data structures? - So, the question, if
I got that right is, we have, sir, what term did you use again? - [Audience Member] A hash
array mapped multitrie. - Hash array mapped multitrie, multiple values for the same key. So the difference between
the multi and the single version is really just a
constraint on the insertion. Possibly that could have implications for how you're storing it as well; I haven't really considered that yet. That would only really affect the search at the end though, I think. You would have hash collisions either way, so you still need to deal with that, so I'm not sure that would give any benefit, but maybe I'm misunderstanding
your question. - [Audience Member] I'm talking about whether you could construct a merge operation with well-defined semantics, where values would show up in the resulting trie if they were present in both, or in either, of the original tries. - Right, still not quite sure
I'm fully following you, maybe we should talk about it afterwards, but yeah, thank you. Yes, towards the back. (muffled speech from audience member) Struggling to hear you,
let me come a bit closer. - [Audience Member] Are delete operations for leaves supported,
if so (muffled speech)? - Are delete operations
for leaves supported, was the question. - [Audience Member] For keys. - For keys? Delete operations, oh okay, so, looking up the key and then deleting it. In my reference implementation, I haven't done delete yet. My experience from doing
this with red-black trees is that delete by far is the hardest part. I'm expecting that to be
the case here as well, although hopefully not quite as much, because here we don't have to worry about the rebalancing. So I think it just boils down to creating the copy with the item removed, but I haven't done it yet so
I don't know if there's gonna be any additional complications, so I'm not sure if I can
fully answer your question but certainly you should be able to do it. Yeah? - [Audience Member] (muffled speech) - Sorry, I'm gonna come
a bit closer again. - [Audience Member] Can
you please elaborate on how you manage the allocation, because the languages you mentioned like Scala and Clojure (muffled speech) which have a garbage collector (muffled speech) you need to use techniques like hazard pointers and stuff, what do you use?
have a compacting GC, is that a fair summary? In languages like Scala and Clojure, when you're allocating, everything is generally much closer together in memory, so you don't have to worry too much about it. So far, all I've done
is just straightforward new and delete, but that's definitely one of the opportunities for optimization: alternative allocators and being able to reuse blocks of memory, 'cause as you're building these things up you're creating and then discarding copies, so you're leaving slots open that you could reuse and keep things much more closely located, which can have implications for cache friendliness. I haven't done that yet, but
that's definitely something that could be done, sounds
like you have a followup. (audience member asks muffled question) That's a fair point. So, because I'm using
malloc under the hood, technically the implementation
is not lock free because there's a lock
in malloc, but that's at the point of insertion, creating a new instance of the hash trie. Where we're just doing the atomic swap on the root node and the size, that operation is lock free. It's that exchange between
threads that are communicating on a single instance but
there will be locks involved during the insertion at the moment, that could be mitigated by
using an alternative allocator. Does that answer your question? So, yeah? - [Audience Member] Hi,
I just wanted to confirm because everything's on a hashing theme, the elements in here are unordered like a regular hash table? - The question is, the
elements are unordered as a statement to verify, yes. They're effectively ordered by the hash but it's the same
principle as unordered_set and unordered_map. Okay, if any of you have any more questions, catch me afterwards. Thank you very much. (audience applauds)