CppCon 2017: Phil Nash “The Holy Grail! A Hash Array Mapped Trie for C++”

Video Statistics and Information

Reddit Comments

I don't think that "someone could potentially do this other stuff in theory" while being competitive with only std::unordered_set can be called "The Holy Grail".

— u/BCosbyDidNothinWrong, Oct 26 2017 (2 points)
Captions
- The Holy Grail: A Persistent Hash Array Mapped Trie for C++. In the course of this talk I will explain what each of those words means, so if you're not quite sure yet, hopefully you will be by at least the middle of this talk. Actually, starting with the first part, the holy grail: why am I using that term? I said "The Holy Grail" with an exclamation mark; maybe that should be "The Holy Grail" with a question mark, because what I want to examine is whether we can actually achieve a container whose properties involve the minimum number of trade-offs, given what we actually want to get out of it. We'll have a look at what those properties might be as we go through, and we'll see whether we actually achieve it, so we'll all discover this together. I want to start with that last word there, trie, because I've been pronouncing it "tree", and you can see some trees in the background to try and emphasize that; it's also a good excuse for me to get one of my holiday pictures in. It comes from the word retrieve, and that's why it's pronounced "tree", but some people pronounce it "try", and that's fine as well. In fact, if we look at the Wikipedia article on tries, it says they were first described in 1959, so it's not a new idea; many of you are probably already familiar with these sorts of tries, usually as some sort of string dictionary. We're gonna use them slightly differently today. It says the term was coined two years later by Edward Fredkin, who pronounced it "tree", as we said, after the middle syllable of retrieval. However, other authors pronounce it "try" in an attempt to distinguish it verbally from tree. I'm gonna stick with the "tree" pronunciation, partly because I just prefer historical accuracy and partly because I'm already quite well known for Catch; I don't wanna be known for "try" as well. (audience laughs) So with that, we'll come back to how this fits in in a moment, but for a bit of background: we talked about sets, but actually, more generally, we wanna look at all the associative containers currently in the standard. You'll see why shortly, hopefully. There are eight of them currently, if I counted right, and you can group them quite nicely like this, and you can see that there are some quite obvious pairings there. They divide into the single and multi versions. I'm sure you know what the differences are, but basically, with multiple values for the same identity, only the first one will be entered into a single map or set, whereas multimaps or multisets can take multiple values with the same identity. The underlying data structure is usually identical in both cases, so for our purposes we don't need to consider both of each pair, and we're going to drop the multi versions for now. Let's keep it simple. So now we can see it divides nicely, top and bottom, between maps and sets, or sets and maps, and again, the underlying data structures are usually the same, so you can implement a map in terms of a set just by having a set of key-value pairs where the comparison is on the key. So again, pretty straightforward stuff.
So, for our purposes, we don't really need to consider both pairs, so let's drop the maps as well, and that's how we end up just talking about sets. So now we come to the set of properties that divide standard set and standard unordered_set. Before we look at what those properties are, just an honorable mention for one more associative container: the (sorted) vector. You load up a vector, you sort it according to your identity, and then you can just do a binary search for lookup; a sketch of that follows below. That's still a great choice for many applications, so I thought I'd mention it here, but it's not, strictly speaking, an associative container. We may come back to that a little bit later. Between sets and unordered_sets, we first of all observe that the underlying data structure of a set is usually, though it's not required to be, a red-black tree. It's got to be some sort of balanced binary tree, and as far as I know all of the standard library implementations use a red-black tree. We'll have a look a little bit more at red-black trees later, so I won't talk about them more now, but just to contrast that with unordered_set, which is based on a hash table. Then we have requirements on the types, and this may dictate what you can actually put in these things. For standard set, we have the strict weak ordering, which implies that we either have a less-than comparison operator or we can provide a predicate for that, but we must have this ordering for it to work. Notably, unordered_set doesn't have that requirement, which, as we'll see later, does become relevant, but it does require the type to be, obviously, hashable, either with a specialization of std::hash or your own predicate, and equatable, so comparing equal; again, you can provide your own predicate. So they're actually quite different requirements on the type; most types may supply both. But then the bit we're normally interested in: the complexity guarantees. Red-black trees, or the equivalent, have log n complexity; this is on lookup, we're primarily talking about lookup here. Whereas unordered_set is constant time, on average; we usually forget that bit. Actually it can tend towards linear in the worst case. We don't often get to that case, but it can become relevant; we will see an example of that later, so it's not perfect. But in general this translates to unordered_set generally performing significantly better than a red-black tree, or a standard set, so where that matters we'll usually reach for something like standard set, sorry, standard unordered_set or unordered map for our associative container needs. So we have these available to us, but there's one property I want to talk about that these don't have, and that's what's gonna lead us on to the main topic of our discussion. So we'll look at that next. That word is persistence, and I put this picture up here, this is me at the end of the London Marathon last year, as an example of what we're not gonna be talking about: not that type of persistence. It's also not the type of persistence that involves file systems, databases, or any sort of serialization; we're not talking about that either, though it's often conflated here. What we're actually talking about is probably best explained with an example.
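Coming back to the sorted-vector aside for a second: that lookup pattern is easy to sketch. This is a minimal illustration (not from the talk) using std::lower_bound for the binary search:

```cpp
#include <algorithm>
#include <vector>

// Sorted-vector "associative container": load, sort once, then binary search.
bool contains(const std::vector<int>& sorted, int value) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
    return it != sorted.end() && *it == value;
}

// Usage: sort once after loading, then look up in O(log n).
//   std::vector<int> v = {5, 1, 3};
//   std::sort(v.begin(), v.end());
//   contains(v, 3);  // true
```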
So, just for the moment, we're gonna talk about... (audience laughs) lists. We have standard list, but lists in general, particularly in functional programming languages, have an interesting property that we'll briefly look at. Again, I'm sure you're familiar with how a singly linked list works, but we're just gonna run through it very quickly for background purposes. We'll have a number of nodes, each with an element value and a pointer to the next node, or previous, depending on how you look at it. The first one we usually refer to as the head, and then the rest, or the first node in the rest, the tail. Straightforward enough; there are the pointers. The way it gets interesting is when you want to introduce a new node at the head. To bring this new one in here, all we need to do is write the pointer to the previous head into the new node, so the original list doesn't need to know anything about this new node. So, assuming we're gonna be wrapping this up in some other type, let's introduce those in here: we can now have two lists in memory at the same time, sharing all of their common state. This is what we mean by persistence: the old value persists. The old container is still persisting while we're using the new one. And we can do this the other way as well: if we remove items, we just point to an earlier head, and again, the previous instances don't need to know anything about the new one. Now, this is not the way standard list works, but it is the way lists tend to work in functional programming languages, which is how you can treat them immutably and yet still produce new values. Obviously you can stack up all sorts of operations that way, and when you're finished, if you only want the latest one, you can just collapse that down as if you'd written it that way in the first place. Now, obviously, in C++ we don't tend to use linked lists very often; they're not particularly efficient, and usually something like standard vector is gonna serve us better, but if we want this property of persistence, it can be a really useful data structure. This was just to introduce you gently to the idea. Obviously the problem here is that we've only got one pointer, so if you do want to do any sort of modification further down the list, then you actually start to get into some extensive copying. But if we put two pointers in there, then we can have a tree structure; we just get a simple binary tree, which has some other problems we'll talk about in a moment, but it is particularly good for persistence. So let's introduce a new value. Now, this four here, I think that's right, we can put in a couple of places; we can hang it off the three node or the five node, but the five node, because it's a leaf node already, is much easier: we don't need to change anything to put it down there. So we can add that in as a new leaf node, and if we were just using a mutable container, then we'd be done here. Nice and easy.
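As an illustration of that persistent-list idea, here's a minimal sketch (not the talk's code; names are illustrative), using shared_ptr so common tails are shared between versions:

```cpp
#include <memory>

// A persistent singly linked list: pushing a new head shares the old
// list's nodes, so both versions coexist and share common state.
template <typename T>
struct PList {
    struct Node {
        T value;
        std::shared_ptr<const Node> next;
    };
    std::shared_ptr<const Node> head;

    // O(1): the old list is untouched; the new list just points at it.
    PList push_front(T value) const {
        return {std::make_shared<const Node>(Node{std::move(value), head})};
    }
    // O(1): "removal" is just pointing at an earlier head.
    PList pop_front() const { return {head ? head->next : nullptr}; }
};

// PList<int> a;              // []
// auto b = a.push_front(1);  // [1]
// auto c = b.push_front(2);  // [2, 1] -- a and b still valid, tail shared
```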
If we want to make this persistent, though, then we can't write a new pointer into the previous five node, because that would effectively invalidate it. So what we do is take a copy of that and write the new pointer into the copy, but now we have a new five node, so obviously we need to write that into its parent node, and so on, all the way up to the root. So what we tend to get is that we rewrite the whole path up to the root every time we make a mutation. There is some copying, but particularly with a large tree, it's actually quite a small proportion of the total number of nodes that we need to copy. We can then share a significant amount of the common state, so that's quite a nice property. I said there are problems with simple binary trees, in that if you load them up this way they will tend to get very unbalanced, with very long branches on one side and not so much on the other side, which completely breaks that log n complexity guarantee. So we have strategies for balancing them. We mentioned red-black trees earlier; that's just one such strategy. I'm not going to look in detail at red-black trees, 'cause that's not the topic of this talk, but these are the rules; if you're interested, you can fairly easily create your own. I just want to apply those rules to this tree, for the sake of this example, because there's another side effect of this that I want to examine. So, same example, but now each node is colored red or black. That can usually be done with no overhead; it's just a convenience thing. We introduce our new node at the bottom. That breaks one of those rules we just talked about, the red node invariant: red nodes cannot have red children. So we need to start applying these rebalancing rules, which usually apply to the parent, grandparent, and uncle nodes, so you can do it fairly locally. In this case it just means recoloring those nodes in the immediate vicinity, and having done that, we've restored our invariant, but now we've invalidated the next level up. We've now got another violation of the red node invariant, so we have to keep applying the rules, again, all the way up to the root. When we get to the root, the root node must be black, so we just recolor that as well. So, fairly straightforward, and that gives us this nice property of being, on average, fairly balanced. That's what we wanted; that's how red-black trees work. But the reason I've gone through all of this is because, on the way up, if we're making this persistent, then we will have touched these additional nodes as well, on top of all the nodes up to the root, so there's a little bit of collateral damage. If we're making these data structures persistent, there's a little bit more overhead to the copying, beyond the nodes up to the root, but it's still useful. In fact, if you go to Wikipedia, the source of all knowledge, and look at the page on red-black trees, there's this paragraph at the top. It says, "Red-black trees are also particularly valuable in functional programming, where they are one of the most common persistent data structures, used to construct associative arrays and sets that can retain previous versions after mutations." That's what we've been talking about. So they are very useful, and I've written a couple of these in the past; they work pretty well. But rewinding a bit: earlier we were talking about standard set compared to unordered_set, and the performance and complexity guarantees.
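The path-copying scheme described above, minus the red-black rebalancing, might look something like this sketch (illustrative, not the talk's code):

```cpp
#include <memory>

// Path copying in a persistent (unbalanced) binary search tree: only the
// nodes on the root-to-insertion path are copied; everything else is shared.
struct TNode {
    int value;
    std::shared_ptr<const TNode> left, right;
};
using TPtr = std::shared_ptr<const TNode>;

TPtr insert(const TPtr& node, int value) {
    if (!node)
        return std::make_shared<const TNode>(TNode{value, nullptr, nullptr});
    if (value < node->value)   // copy this node with a new left subtree
        return std::make_shared<const TNode>(
            TNode{node->value, insert(node->left, value), node->right});
    if (node->value < value)   // copy this node with a new right subtree
        return std::make_shared<const TNode>(
            TNode{node->value, node->left, insert(node->right, value)});
    return node;               // already present: share the whole subtree
}
```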
Generally we want to reach for the unordered versions, the hash tables, for performance purposes, so if we want these persistent structures to be performant, we're adding overhead to what we already had. But it's very difficult, if not impossible, to make a hash table persistent. You end up either having to copy the whole thing, or large parts of it, and while you can make some changes to make it more suitable for persistence, we're not gonna go there, because you tend to end up where we're going anyway: with something more like the hash trie we're gonna present. But I'm gonna talk about it from the perspective of red-black trees and work our way from there. So, what was the problem with red-black trees? The problem all comes from the fact that we only have two pointers in each node. Because of that, as our overall tree grows, the depth of the tree gets bigger at quite a fast pace. For example, a tree of 15 million values will have a tree depth of about 24 nodes, so for any given lookup, that's a lot of pointer hops you've gotta go through to get down to the value that you want, and that's where the log n complexity will get you. It all comes from the fact that we've got up to two pointers in each node, so that's the problem; the solution must be to add more pointers, and it is. In fact, we could stop there, but that raises a couple more questions, and answering those questions is gonna lead us to our hash array mapped trie, which is gonna have a few interesting side effects as well. The first question is: we've got all these pointers, but how do we know which one to follow as we're traversing down the tree? With the binary tree it's simple, because you just do the less-than comparison, which takes you left or right, but what do we do here? The answer is, well, let's just say we have a node with an array of pointers; we're gonna use 32, and 64 is another common number, but it's usually one of those two, in fact. So, imagine you've got an array of pointers, and some of them are gonna contain pointers to other branch nodes, some of them to leaf nodes. How do we know which one to follow? What we do is take the hash of the value, or the key, rather, and then we'll peel off the first, in this case, five bits from the hash. That gives us a value, 27 in this example. That's our index into the array, so we'll take out that pointer and follow it down; we get a branch node in this case. There we take the next five bits from the hash, peel those off, and that gives us another value, six in this case. That's our index into the next array, and that may take us to another branch node; in this case we're at a leaf, so we're done. But now, because we're using hashes, we'll also have hash collisions, so those leaf nodes will be containers of values, much like the buckets in hash tables. We may need to do a linear search, or some other type of search, at the end to get our value, but usually, once we've got to the leaf, we're almost done. So that sort of answers the first question, and that's where the hash part of our title comes in. The other question is, well, how do we make this memory efficient? We've got 32 or even 64 pointers in each node, and that's gonna get very big, very quickly. What we note here is that most of those slots are empty most of the time; we've only got a few pointers actually in there, we're not using them all.
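The bit-peeling step is simple to sketch. Assuming 32-way branching and taking the low bits first (whether you start from the low or high end of the hash is an implementation choice), something like:

```cpp
#include <cstdint>

// Peel five bits at a time off a 32-bit hash to index each level's
// 32-way node. Level 0 uses one group of five bits, level 1 the next, etc.
constexpr unsigned kBits = 5;                   // 2^5 == 32-way branching
constexpr std::uint32_t kMask = (1u << kBits) - 1;

std::uint32_t indexAtLevel(std::uint32_t hash, unsigned level) {
    return (hash >> (level * kBits)) & kMask;   // value in [0, 31]
}

// e.g. indexAtLevel(h, 0) might be 27, taking us to a child branch node,
// then indexAtLevel(h, 1) might be 6, indexing into the next level.
```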
So, using some sort of sparse array is the answer. What we're gonna do is have a physical array of only the number of pointers that we need, three in this case, and then an additional integer that we're gonna use as a bitmap to tell us what that actually maps to in our virtual array. All of the set bits in our bitmap correspond to the actual positions in the array, and the unset bits to null pointers. Then we just need a way of mapping those, and all we need to do, for any given index, is go to that bit and count the number of set bits to the right of it, and that gives us our index. So we look at that bit towards the left-hand side: there are two set bits to the right, so an index of two into our sparse array gives us the result we wanted. Simple. But then, how do we count the bits? We want to do that efficiently as well. That's another interesting question; we'll come back to it in a little while. But where we've ended up now is something like this. It's a tree structure in the sense of T-R-E-E, but it's also a trie, T-R-I-E, much like you may have seen with a string trie, a string dictionary trie, where you peel off each character of a string to give you the index into the next level. In this case we're using a hash, as we just talked about, and that gives us our path down through the tree to the leaf node, and then, hopefully, a value. So, what are the properties of this data structure? We looked at the example of the red-black tree a moment ago: 15 million nodes had a tree depth of 24. Well, the maximum depth of one of these trees, with a 32-bit hash, is only gonna be six or seven, maximum. I say six or seven because you've got those last two bits at the end that you may not want to actually follow, so let's say six, and the average depth for those 15 million values in the example is only five. That's a lot better than 24, except that the leaves hold arrays of values; we'll look at that a little bit more in a moment. They're actually more space efficient than hash tables, because of this compacting that we have to do as part of the storage. The complexity is technically log₃₂ n, which sounds a bit awkward and is harder to reason about. In practice, obviously there is a logarithmic aspect to it, but we'll tend to be much closer to the sort of constant-time average lookup of a hash table than to the log n of a red-black tree. We'll look at the practical aspects in a moment. And there's this additional thing: no need to rebalance. There's a caveat: we do need a good hash distribution, because the distribution of the hash will actually dictate the distribution of the tree itself, but given that, the tree should be fairly well balanced to start with. So, consider what that means in terms of persistence. We've still got a tree structure, so we still have to do that rewriting up to the root, but now we've got far fewer nodes to rewrite, and we don't have any collateral damage due to rebalancing, so this is already much better than where we were with the red-black tree. Okay, you may have read the abstract on sched.com that I put up. Oh, by the way, before we get to that, the photo in the background there, anyone recognize that? Yeah, the waterfall from Twin Peaks; it turns out to be about half an hour down the road from here, so that's another one of my photos. Now, in the abstract I made this claim. I said, talking of hash tries, that they're close to, but not quite as fast as, pure hash tables. That's the claim.
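Putting the bitmap and the bit counting together, a compressed branch-node lookup might look like this sketch (names are illustrative; the popcount options are discussed further below):

```cpp
#include <cstdint>

struct Node;  // opaque trie node (branch or leaf)

// A compressed 32-way branch node: a 32-bit bitmap plus a compact array
// holding pointers only for the occupied slots, in slot order.
struct Branch {
    std::uint32_t bitmap;  // bit i set => virtual slot i is occupied
    Node** children;       // popcount(bitmap) entries

    Node* child(std::uint32_t sparseIndex) const {
        const std::uint32_t bit = 1u << sparseIndex;
        if (!(bitmap & bit))
            return nullptr;  // empty virtual slot: no child here
        // Compact index = how many occupied slots sit to the right of ours
        // (std::popcount in C++20; __builtin_popcount on GCC/Clang).
        return children[__builtin_popcount(bitmap & (bit - 1))];
    }
};
```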
We've sort of looked at the background of why that might be the case, but how does it actually weigh up in practice? Let's look at some figures. So, some performance metrics: I did some benchmarks. My methodology: I used a micro-benchmarking framework called Nonius (nonius.io), and the way that works is it'll first run a few tests against the system clock to gauge the precision, so it knows how many iterations of your benchmark to run. It'll then run that enough times for x number of samples, which defaults to a hundred; I've done most of these at a thousand, to give us plenty of figures. It'll then do statistical analysis on those samples to look at the average and the variance and exclude any outliers, so it's doing a lot of the work for us, which is pretty good. I'm using libc++ for all of these, and that's relevant, particularly because we're gonna be comparing against standard set and standard unordered_set, so it's that particular implementation; other implementations may vary, of course. At least for our first test, what I've done is loaded up each of those containers with 100 thousand integers, just monotonically increasing, and then we compare them: our hash trie, or HAMT, as I've abbreviated it here, standard set, and standard unordered_set. Let's look at some figures. So, here's one of the graphs. This is inserting 100 thousand integers. On the right-hand side, in green, we've got our hash trie; on the left-hand side is standard set; and in the middle, that sort of orange bar at the bottom, is unordered_set. Now, the y-axis is time, so the bigger the bar, the worse we're performing, and we're not doing particularly well on this one. In fact, we're about, I think, four times slower than standard set and 20 times slower than unordered_set, nowhere near my claim. So what's going on here? Well, this is insertion. My claim was primarily about lookup, but insertion's important. When I ran these metrics, there was basically no optimization on the insert; we're paying the full cost of persistence, copying every single time we add something. There's a copy-on-write optimization that we can apply, because we do reference count the nodes, and I've done this with red-black trees in the past; it's very effective. If you're loading up an initial tree, you can pretty much do the whole thing mutably, because you haven't actually started sharing it yet. I haven't done that here yet, so we're seeing the full cost of persistence. So yeah, it can be expensive. Let's set that aside for the moment, it's a work in progress, and have a look at... oh, before we get to lookup, I should also point out that if we look at the underlying data that Nonius has given us, you can see, there we go, that's the hash trie at the top. On the right-hand side there's a lot of extra variance over there; something was obviously going on in my machine at that point. Nonius has excluded those outliers, but this whole second half has really gone out, so, to be fair, we should probably just consider the first half of that. What I've done for the next charts is take that data directly and produce my own graphs. Let's have a look at lookup, then. The same 100 thousand integers, this time with a find. Now we've got, right at the top, the orange, which is standard set, and the green just below that... that's just confusing.
The green just below that is the hash trie, and at the bottom is unordered_set. So we're beating standard set now. This is much better; we're starting to get there. But we are still closer to standard set than to unordered_set, so we're not meeting my claim yet, but we're getting there, which is promising. So what's going on here? Remember I said that what we're doing is loading up with 100 thousand integers, monotonically increasing, and we're using libc++. For integers in libc++, the std::hash specialization is just the identity function, so our hash is also a monotonically increasing series of integers. Remember I said that if the hash distribution is not very good, then the tree is gonna be very unbalanced, and that's the cost we're paying here. There are ways to address that, but they haven't been applied yet, so we are susceptible to it. So let's move on and have a look at... oh, before we do that: that was 100 thousand. I actually did all of the orders of magnitude between 100 and a million, just to see what that graph looked like. The ordering is the same here, but now you can see the profiles of the different data structures. The orange line for standard set slightly diverges from our hash trie in the first orders of magnitude and then sort of re-converges towards the top, but the hash trie is actually tracking unordered_set fairly closely. They've got a very similar scalability profile, so that's interesting to observe. What I did next was load up all the same orders of magnitude, but this time using strings, so now we're hashing strings and getting a much better hash distribution, so we're getting closer to what we want to end up with. Let's have a look at what we've got. The orange one at the top is standard set, again; the blue one, unordered_set; and at the bottom is the hash trie. In this one, we're actually beating unordered_set, so we're definitely in the right ballpark. Now, this is definitely a quality-of-implementation issue; with another implementation I would expect unordered_set to beat the hash trie, and that's fine, but I think this demonstrates that we're in the ballpark of my claim, with some caveats, which hopefully can be addressed: insertion time, hash distribution. But this is what we want to aim for. I was still a bit uncomfortable about the integers, so I did one more test for this. I took the hashes from those strings and then used them as integers in another set of tests, so this is another set of integers, using those string hashes. What was interesting is that everything's actually much closer together in this one. I should point out, by the way, that the y-axis in these examples is exponential, so where things appear to be tracking each other, they're actually diverging in practice. We start off with the hash trie winning, again, but then they sort of converge in the middle, and towards the top, and this one actually goes to 100 million, no, 10 million, sorry, towards the top, from around the million mark, the hash trie starts to become the worst performing, and standard set is actually winning, actually beating unordered_set. What's going on there?
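One common way to fix an identity hash like this, foreshadowing the remixing idea that comes up in the Q&A later, is to run it through an avalanche step such as the well-known MurmurHash3 32-bit finalizer. This is shown purely as an illustration; the talk doesn't specify a mixer:

```cpp
#include <cstdint>

// The MurmurHash3 "fmix32" finalizer: a cheap avalanche step that spreads
// entropy across all bits, so an identity hash like libc++'s std::hash<int>
// yields well-distributed bits before they're peeled off as trie indices.
std::uint32_t mix32(std::uint32_t h) {
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}
```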
Well, remember we said earlier that because of hash collisions, when we get to a leaf node in a hash table, and the same in the hash trie, we're gonna have to do perhaps a linear search. At this point, we're starting to get significant enough hash collisions that that final linear search is starting to, not dominate, but become a significant factor. But what's interesting is that this comes back to the requirements on the types. We don't have the strict weak ordering requirement on unordered_set, so when it gets to that last linear search, it has to be linear, really; that's the only way you can do it. If we're not constrained by that requirement, then potentially we could use a sorted array at the end in our hash trie and do a binary search instead. If we did that, I would expect our hash trie implementation to be winning at the end as well, so that's another opportunity for optimization that we haven't explored yet. But we're looking pretty good. Let's say we're meeting the claim in some areas, maybe not in others, and we hope to explore those opportunities for optimization, but it shows the potential. I gave this talk in Aspen at C++Now earlier this year. It was a longer talk and went into a lot more depth on the actual implementation, and I've still got those slides here, because I didn't know exactly how long I was gonna have left. We're not gonna go through the whole thing; I just wanted to pull out a little bit that was interesting. So, first of all, remember this one: how we map our sparse array with a bitmap. I said it was an interesting problem how we count the set bits to solve that. Here's one possible implementation. This is taken from, in fact, there's the link, a Stack Overflow answer that takes you to a page online somewhere that has, I don't know, dozens of possible ways you can do bit shifting to solve this problem. I don't fully follow this, but I know that it works. There's no branching in here at all; it's just pure bit twiddling, and it's pretty performant, but we can do better. As the comment in the middle says, most modern architectures have a single instruction that can do this, usually called popcount, so if you're using GCC or Clang, there's a built-in, __builtin_popcount, that will do this. You do need to compile with a certain flag to actually get that to map onto the instruction, but when you do, it performs much better. All the examples I gave are with this. If you use the previous example, the bit shifting, then it doesn't perform as well; everything shifts up slightly, but the basic ordering remains the same, so it's not significantly different. In use, what I'm doing is using the count of set bits to map the bitmap to a compact index, 'cause I have all these indexes flying around, compact index, sparse index and so on; I'm using strong typedefs internally to keep track of that, and there are some convenience functions on there. We just need to create a mask of the bits to the right of the one that we're interested in, so that's what I'm doing at the top there, and that gives us this "to compact" method, which is just a method on our sparse index strong typedef.
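For reference, the branch-free bit-twiddling approach referred to above is typically some variant of the classic SWAR population count; here it is alongside the GCC/Clang built-in (a sketch, not the slide's exact code):

```cpp
#include <cstdint>

// Classic branch-free SWAR population count (the kind of "pure bit
// twiddling" mentioned above): sums bits in parallel within the word.
std::uint32_t popcountSwar(std::uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

// With GCC/Clang this maps to a single POPCNT instruction when the target
// supports it (e.g. compiled with -mpopcnt); otherwise it calls a helper.
std::uint32_t popcountBuiltin(std::uint32_t v) {
    return __builtin_popcount(v);
}
```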
So I'm not going to go into all of this in depth, but again, visually, in practice, here's how we use it: with a bitmap, create a sparse index, and then we can get the compact index from that. That turned out to be really useful, because with overloading, if we already have a compact index we can go straight there, and if we have a sparse index, it gets mapped along the way. It worked out pretty nicely. Then we have the thing that peels the bits off the hash, five at a time in this case; we won't look at that in too much detail. Our node structure, we're gonna skip most of that. Then we get down to the really nasty low-level stuff. There's actually an issue with alignment here that I need to fix, but it is nasty because, first of all, we need to create our nodes big enough to contain all of the pointers, and our values as well, without an additional allocation, and then placement-new that into our allocation, but do it so that, if this is gonna be a copy of an existing node, we copy those in as well. We're not gonna go into too much detail there, but you get an idea of what that code's like. Branch nodes are similar. And again, if you go to the C++Now videos, you can watch that talk if you're interested in the detailed walkthrough of the code. So, I'm just gonna get to, okay, right, here: the example we looked at earlier. This is our tree structure, so these are nodes; we have a root node, and then, in addition to the root node itself, we just need a size, so we can look up the number of elements in constant time. That really is the entirety of our hash trie; all of the hard work's going on in the nodes. Because we just have those two values, we can collect them together in their own data structure; we're using it in the hash trie, but we can use it elsewhere as well. This is actually small enough to treat atomically on most architectures, and there's an implication to that. The hash trie itself will have a hash trie data, at the top there, and then basically provides an interface onto that. Again, we won't go into the details there. Right, here we go, and then we can have another type. I haven't really discussed this part yet, which is why I wanted to get on to this slide. This is just the test case, and we'll look at the code in a second, but this is the shared hash trie, and the whole purpose of the shared hash trie is just to hold, atomically, that hash trie data. Because we can treat it atomically, if we want to read or mutate the underlying data, we just take a copy of it into our hash trie, with a single atomic copy or atomic exchange, and then when we're working with our hash trie, we're completely shielded from any other mutations going on. So it's just a shared location that we get the hash trie from, which we can then work with, and if we mutate it, we can write it back, again with a single atomic exchange, and we get very effective concurrency with no locking, on the condition that we are mostly bound on reads rather than writes. If we're doing lots of writes, then we'll keep looping around and redoing our work, so it's only appropriate for some conditions, but they're very common conditions. So in this test, I'm not really using the concurrency, obviously; I'm just demonstrating how you can use this transaction to do that.
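A minimal sketch of that shared-root idea follows. HamtData and SharedHamt are illustrative names, not the talk's actual code, and this ignores both the reference counting and the memory-order details flagged as unfinished:

```cpp
#include <atomic>
#include <cstddef>
#include <type_traits>

struct Node;  // trie node type, elided

// The entire trie handle: a root pointer plus a size. Trivially copyable,
// so it can be exchanged as one atomic unit where the platform allows.
struct HamtData {
    const Node* root = nullptr;
    std::size_t size = 0;
};
static_assert(std::is_trivially_copyable<HamtData>::value,
              "required for std::atomic");

class SharedHamt {
    // A 16-byte struct is only lock-free where a double-width CAS exists
    // (e.g. cmpxchg16b on x86-64); otherwise std::atomic falls back to a lock.
    std::atomic<HamtData> data_;
public:
    SharedHamt() : data_(HamtData{}) {}

    HamtData load() const { return data_.load(); }

    // Optimistic update: build a new version (sharing structure with the
    // old one), then publish it; if another writer got in first, redo the
    // work against their value -- the retry loop described in the talk.
    template <typename F>
    void update(F&& makeNewVersion) {
        HamtData expected = data_.load();
        while (!data_.compare_exchange_weak(expected,
                                            makeNewVersion(expected))) {
            // On failure, expected is refreshed with the winner's value;
            // the loop recomputes the (cheap, persistent) new version.
        }
    }
};
```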
In practice, there's a helper that you update with, which does all the atomic handling of the hash trie: it passes the hash trie to you, you work on it, and when it goes back, it atomically commits it, and it does this in a loop. So, if anyone else has mutated it in the meantime, you just redo the work. So here's the shared hash trie itself. We're just asserting that the hash trie data is trivially copyable; that's a requirement for it to be atomic. And then we've got these atomic operations. Don't pay attention to the memory order stuff; that needs to be fixed, it's more of a placeholder. And here we go, the memory order stuff again is all wrong. Yeah, so this is the loop that it's doing for the update. It's pretty simple stuff, but you do have to get it right; this is not necessarily right, but we've got it all in one place. As far as atomic data structures go, it's one of the simplest out there, and that's reassuring, I think. So I'm gonna stop with that now, just to sum up, and then we'll go to questions. The claim was that we can approach the performance of hash maps without too many trade-offs. I think I demonstrated that we can achieve that, but there's some work to be done to cover some of the other cases. They're more space efficient than hash tables, so that's good. Copies are cheap and memory efficient, and we didn't really go into this too much, just touched on the concurrency implications, but think about having a large associative container that you can copy for just the cost of two integers, and share large amounts of common state between the versions you hold in memory at once. When I was first working on this, it was in the context of financial data, and what you could do is load up one of these things with an object graph of market data, and then you could make small changes and have entire scenarios, sort of matrices of these object graphs, in memory at once, very, very efficiently, and very, very performant as well. So there are many implications like that, and, as we mentioned, it can be made to work very nicely with concurrency. So, that's the summary. There are a number of references, including some papers that this is all derived from; I put all the references up on my website, levelofindirection.com/storage/hamtrefs.html. If you can't remember levelofindirection.com, I also have extralevelofindirection.com, which redirects there. (audience laughs) Or you can reach me on Twitter, phil_nash, or you can find me later during the day. So thank you very much for listening. (audience applauds) And do we have any questions? - [Audience Member] Given how sensitive you are to the hash function being decent, have you considered throwing it through a couple of iterations of, like, Murmur, just to improve the default case? - So, the question was, because we are very sensitive to the hash distribution, have I considered running it through some sort of rehashing or remixing algorithm? Yes, that's almost certainly what we'd need to do to solve the problem, and I did start looking at that, but it was actually making things worse rather than better, so I need to explore it a bit more. There are definitely approaches that can be taken there. Thank you. Yeah? - [Audience Member] One thing I didn't understand about your sparse array: you were counting all of the bits to find which one to look up; is it from there you would get an offset into it, like you have a 16 bit thing there and you have three or four bits (muffled speech). - So, to summarize the question, if I got it right:
You just wanted clarification on how we're doing the mapping of the sparse array. I'll go back to the, ah, too much code here. Right, this one. - [Audience Member] So is that yellow thing, that's only one bit, right? - Yeah, so, the top is an integer, so that's all the bits of the integer, and each set bit in the integer represents a position in your virtual array where you have an actual pointer value, and then all the unset bits are where you would have a null pointer. In the actual array at the bottom, we're only storing the pointers themselves, so the number of set bits we have on the top is the number of actual items in the bottom. - [Audience Member] Okay, just a sec, so you got three pointers on the bottom and the top array is... - So the top array is actually just an integer. It's a bitmap. - [Audience Member] Okay, and so it's 32 to store (muffled speech). - Yeah, so you can do this with 32 pointers, or 64 pointers if you have a 64 bit integer. One of the references that I showed at the end was a games developer who implemented one of these, and he experimented with both 32-bit and 64-bit arrays, and in his case he found that the 32-bit arrays performed better. He wasn't 100% sure why, you can speculate; he did suggest that maybe it was a quality-of-implementation issue and you could get a better trade-off with 64 bits, but there's not a lot in it, so that's why I stuck with 32 bits. Are you clear on how that maps now? Thank you. Yes, in the middle there. - [Audience Member] How do you handle freeing or deleting the nodes as the tries go out of scope? - Okay, the question was, how do you handle nodes that have been deleted going out of scope. In this implementation I'm using ref counting, actually intrusive ref counting, so I'm using a shared pointer. If you look at other languages that implement this, particularly Scala and Clojure, they have garbage collection, and that's usually considered the way you have to do this, but I've got it working with ref counting, and it's performant enough, I think. In fact, the reference counting is also what allows us to do the copy-on-write, which I'm not doing in this implementation, but I've used it very effectively before, because as you traverse down the trie to find your insertion point, you check the ref count at each level; as soon as you've got a ref count of more than one, the rest of that branch is shared, and if you get to the leaf and you're unshared, you can do all the mutations in place, and that can save a lot of copying. So we get that benefit from reference counting as well: being able to do copy-on-write. I'm not sure it would be so easy with garbage collection, but that's how we're doing it, just reference counting. Uh, yeah? - [Audience Member] In the implementation where you have a hash array and have a multitrie, would it be possible to do an efficient merge operation on two such data structures? - So, the question, if I got that right, is... sir, what term did you use again? - [Audience Member] A hash array mapped multitrie. - Hash array mapped multitrie: multiple values for the same key. So the difference between the multi and the single version is really just a constraint on the insertion. Possibly that could have implications for how you're storing it as well; I haven't really considered that yet. That would only really affect the search at the end, though, I think.
You would have hash collisions either way, so you still need to deal with that, so I'm not sure that would give any benefit, but maybe I'm misunderstanding your question. - [Audience Member] I'm talking about whether you could construct a merge operation with well-defined semantics, where values would show up in the resulting trie if they were present in both, or in either, of the original tries. - Right, I'm still not quite sure I'm fully following you; maybe we should talk about it afterwards, but yeah, thank you. Yes, towards the back. (muffled speech from audience member) Struggling to hear you, let me come a bit closer. - [Audience Member] Are delete operations for leaves supported, and if so (muffled speech)? - Are delete operations for leaves supported, was the question. - [Audience Member] For keys. - For keys? Delete operations, oh okay, so, looking up a key and then deleting it. In my reference implementation, I haven't done delete yet. My experience from doing this with red-black trees is that delete is by far the hardest part. I'm expecting that to be the case here as well, although hopefully not quite as much, because here we don't have to worry about the rebalancing. So I think it just boils down to creating the copy with the item removed, but I haven't done it yet, so I don't know if there will be any additional complications. I'm not sure I can fully answer your question, but certainly you should be able to do it. Yeah? - [Audience Member] (muffled speech) - Sorry, I'm gonna come a bit closer again. - [Audience Member] Can you please elaborate on how you manage the memory allocation? Because the languages you mentioned, like Scala and Clojure (muffled speech), which have a garbage collector (muffled speech), you need to use techniques like hazard pointers and stuff. What do you use? - So, I think to summarize: how do I handle memory allocation in a world where we don't have a compacting GC? Is that a fair summary? In languages like Scala and Clojure, when you're allocating, everything is generally much closer together in memory, so you don't have to worry too much about it. So far, all I've done is straightforward new and delete, but this is definitely one of the opportunities for optimization: alternative allocators, and being able to reuse blocks of memory, 'cause as you're building these things up you're creating and then discarding copies, so you're leaving slots open that you could reuse, keeping things much more closely located, which can have implications for cache friendliness. I haven't done that yet, but that's definitely something that could be done. Sounds like you have a follow-up. (audience member asks muffled question) That's a fair point. So, because I'm using malloc under the hood, technically the implementation is not lock free, because there's a lock in malloc, but that's at the point of insertion, when creating a new instance of the hash trie. Where we're just doing the atomic swap on the root node and the size, that operation is lock free. It's that exchange between threads communicating on a single instance, but there will be locks involved during the insertion at the moment; that could be mitigated by using an alternative allocator. Does that answer your question? So, yeah? - [Audience Member] Hi, I just wanted to confirm, because everything's on a hashing theme: the elements in here are unordered, like a regular hash table? - The question, as a statement to verify, is whether the elements are unordered. Yes.
They're effectively ordered by the hash, but it's the same principle as unordered_set and unordered map. Okay, if you have any more questions, especially Ansel, catch me afterwards. Thank you very much. (audience applauds)
Info
Channel: CppCon
Views: 10,173
Keywords: Phil Nash, CppCon 2017, Computer Science (Field), C++ (Programming Language), Bash Films, conference video recording services, conference recording services, nationwide conference recording services, conference videography services, conference video recording, conference filming services, conference services, conference recording, conference live streaming, event videographers, capture presentation slides, record presentation slides, event video recording
Id: imrSQ82dYns
Length: 51min 23sec (3083 seconds)
Published: Wed Oct 25 2017