CppCon 2016: Chandler Carruth “High Performance Code 201: Hybrid Data Structures”

Video Statistics and Information

Reddit Comments

I found it interesting to see how the LLVM team works in relation to high-performance data structures, and where C++ helps vs. where it makes things more difficult (he does not like the current allocator syntax/semantics).

👍︎︎ 7 👤︎︎ u/Wolfspaw 📅︎︎ Oct 09 2016 🗫︎ replies
👍︎︎ 4 👤︎︎ u/[deleted] 📅︎︎ Oct 10 2016 🗫︎ replies

I wonder if the pointer bit packing could be used by the compiler to optimize

enum MyPointerEnum {
    Boxed(Box<Something>),
    Vecced(Vec<SomeOtherThing>),
    AsText(String),
    Nothing
}

to the size of a single Vec. Or an enum Test { RefCounted(Rc<Struct>), Unique(Box<Struct>) } to a single pointer.

👍︎︎ 4 👤︎︎ u/CrystalGamma 📅︎︎ Oct 10 2016 🗫︎ replies

I remember when I saw his talk in Russia about LLVM in general, compilation, and their use of their own "assembly"; I was really stunned by it.

👍︎︎ 3 👤︎︎ u/impsty 📅︎︎ Oct 10 2016 🗫︎ replies

Really interesting video; wondering if crates like smallvec have the possibility to switch from stack-based storage to heap-based storage when their capacity is exhausted.

👍︎︎ 3 👤︎︎ u/vbarrielle 📅︎︎ Oct 10 2016 🗫︎ replies
Captions
all right so my name is Chandler Carruth I'm up here to talk to you about high-performance code how many folks are here at CB peak on for their very first time whoo oh my goodness has a lot of people okay so how many folks have seen the the talk from two years ago on YouTube okay okay that's that's better because I'm not going to repeat that talk so this is kind of the second of a series right the idea is that the first time I talked a lot about kind of things that really limit high performance code kind of patterns you want to develop fundamental things you want to do over and over again but I felt bad about that top for one reason uh one reason was it was very negative right like it wasn't actually a very constructive talk I didn't tell you guys a lot of things you can do and you should do or kind of new exciting tools to try and help you out writing really fast code so this talk is going to try and be a little bit more a little bit more upbeat it's going to try and help you guys out a little bit more we're going to try and actually go through and describe some hybrid data structures that are going to hopefully be very useful for you in writing really fast code I just as a little bit of background if folks don't know I work at Google I work on the LLVM compiler and in particular I'm actually going to be talking today about lVN's internal data structures I turns out that LVM is a great example of kind of like really high performance code because you have a bunch of compiler nerds and I mean a bunch of really really kind of bad level of compiler nerds all of them performance nuts I might be a good example of this and and and they're all working on this compiler and we actually care a lot that the compiler itself is very fast right and and you know we're all thinking about kind of modern computer architectures and the challenges that face that we face when we try and make compilers run really fast and so we end up writing a bunch of great data structures writing a bunch of algorithms and spending a lot of time really tuning examining things to try and get the absolute best performance out of this so I actually think this is a great place to kind of figure out about high performance data structures so this talk is going I'm going to give away all the secrets of this talk before we even get into it just so you know there's no there's there's nothing nothing up my sleeve here all right so Ector sets and maps that's it that's all we're going to talk about whole hour but it's going to be awesome okay so the very first data structure is of course a vector okay so this is what L viens vector kind of looks like now I could have shown you the real code from LLVM but if I had like you would all have to be standing up here in front of the screen and squinting really hard and it would be really confusing so I've kind of distilled all of this down to the bare essentials to kind of communicate the essence of how LLVM is making these data structures fast so this is a pretty pretty straightforward small sized optimized vector that's that's the key idea it's a small size optimized vector now that doesn't mean that we expect all of the vectors in the entire compiler to be small that's actually not the point the point is to save on the early small allocations because it turns out those end up being a really large component of the cost of vectors and by doing a small size optimization we don't have to allocate those on the FreeStore we can very quickly use an internal buffer for those and we only actually 
reach out to the heap for allocations who were growing when we're growing to some some substantial size this is this is easily the most fundamental data structure that we use in the entire compiler when I when I get up and I talk about how no no vectors all you need like really I live and breathe that advice okay we use this thing for cues for work lists for stacks every single kind of basic operation you would imagine everything down to and including binary heaps and priority queues all of them are built on top of the back of a small vector in LLVM now I get a lot of kind of kind of immediate responses of why on earth don't you just use two vector you said that data structure at least wasn't completely broken in the standard I'm sorry it is so so even stood vector has a really bad problem in the standard because in the standard we specified that the iterators are not invalidated if you move the vector from one location to another now this can be convenient at times but it has a real problem if you can't invalidate the iterators when you move from one location to another you can't do a small-size optimization you can't do it at all it's completely off the table I think this is this is terrible right small size optimizations are a huge win they're huge enough win that the standard even uses them in the other vector in the standard that's spelled in a very awkward way of stood bass extreme I don't encourage you to write you know a custom care traits specialization just so that you can use stood basic string for your vector instead I think we actually need to have a small vector and that's what LLVM uses but we can do more interesting things than just small size optimizations so elegans vector has this kind of weird trick that it uses so so we kind of take on the left-hand side a lot of the guts of the small vector out okay and we just leave that small size optimized buffer okay then on the right hand side we have a base class and this base class is interesting because it looks exactly like a vector nothing fancy here this is just a vector it has all of the operations that you would expect to have on a vector it just doesn't have this the buffer and we pass the buffer in dynamically through the constructor right this is actually a really simple pattern but this is powerful because now in our interface boundaries whenever we want to have an output parameter or an input and output parameter that's a small vector we can actually type erase the size we can remove the size from the type completely and we don't pay any overhead we don't pay any abstraction boundaries right we don't have any cost to it we actually just have this nice slicing right across the type boundary this is really really powerful so all of our interfaces into carrying small vectors without baking in a particular size this even lets us do crazy things like have different callers to the same routine select different small buffer sizes based on their use case based on on their expected workload which again is a really powerful way to have hybrid data structures that adapt that fit the problem you actually have rather than being you know baked in stone in one particular design okay so that was vector now we have to talk about sets okay so the set it's actually the same thing all right this is a really unoriginal talk I'm sorry for that so the set is also just a small size optimized but but we aren't doing the set the same way that stood unordered set works for example okay we're instead using open addressing here we just have an 
array of buckets flat in memory we use quadratic probing I didn't want to write quadratic probing up here on the slide because it's really boring and it's really gross right but it's a very straightforward approach right you you just probe a flat array of buckets nothing to it and when you have a small size optimization you can even save on the allocation you have an inline buffer that you get to start off your bucket array with okay really really straightforward thing now the last thing you're going to see on this bottom is is that we have a traits object we actually put all of our hashing hash table traits into a single traits object instead of having to wait on order map does and I know reset does and it has you know the kind of expected things you can get a hash right you can you actually hash the value you can also compare things because we want to have the ability to compare things in a customized way but we also have two other routines you can get an empty value and you can get a tombstone value and this is actually what allows a very simple approach to to a flat quadratically probed hash table to work we need to be able to create this bucket array right and populate it with empty keys and we need to know that they're empty so that we know we can put new entries into them and if we start probing and then start erasing we're going to leave we need to leave markers behind so we understand that there are actually are values along a probing path right and you have to keep probing it's not just an empty slot these tombstones for that this does require your types to expose to kind of you know fake values that you can identify it's really important but most of our types we can do this with that's it this is all the the logic that goes into us having a set and you can imagine then what the map looks like the mat just generalizes this to a key and of value right we still use the same traits right but it's focused on the key we actually have to re-implement some things because we want to be we want to be fairly careful about the value here one thing that LLVM tries very hard to do is that for empty buckets and for tombstone buckets we don't put a we pair into those buckets we actually just put the key into them so that the value in the pair doesn't actually have to have these these special States in it and that requires some special code it's not - not too fancy and of course we layer a small size optimization on top of all of this all right that's so so that's the talk ten minutes what do you think right okay so so there are a few other things we should probably discuss before we call it quits on this though okay that but this I do want to emphasize that is the core foundations of all the data structures that we use throughout the entirety of LLVM compiler okay almost everything else you see that's really interesting is a very domain-specific data structure it's one that we've crafted to solve a particular problem with a very particular set of constraints other data structures you see tend to be abstractions around domain-specific problems they don't tend to be fully general they tend to have really particular use cases the the general purpose tools we use like 99% of the time are just these three tools okay also if you're like looking the ldm codebase live by the way they're all spelled differently and like they are they suffer from over a decade of kind of organic growth I've cleaned a lot of this up to try and just present the core ideas here in a way you can get but we should at least 
talk about one thing before we get out of here um some people get really grumpy about these these small size optimized containers and when I say some people need at least one person in the audience because they tell me over and over again I should be using an alligator for this I don't need to build yet another container so I want to actually talk about this so we understand why LLVM continues to actually use custom containers rather than alligators uh it's it's interesting right so alligators seem like they should work so this is this is lifted directly from Howard Henin's suggestion to the LLVM community that they should stop writing custom containers and use alligators and he proposed this version of small vector which relies on a short alec allocator that has a lot of code in it because alligators require a lot of boilerplate it's actually really simple though it just looks like it has a small buffer in it and it allows you to use that small buffer for the allocations and the vector and you construct things in this way and it works great okay this solution absolutely works but there are two limitations to it that really get a way of LLVM is usage patterns the first one comes when you when you start putting these on interface boundaries alright and this happens a lot we end up with a lot of small vectors on interface boundaries because they end up being a vocabulary container for us but I've got a bug in this code and I've got a bug in the code because I didn't remember to update the size parameter and I have to keep these things in sync across all the different callers they've also lost the ability to have kind of a customized size for different callers without the the function I'm calling knowing how to do that now I I don't know how to solve this with alligators today I mean Howard Howard may know how to solve this someone else may come up with a way to solve this but looking at it it seems really annoying to actually get this to work to get the kind of very low cost interface design that we have with a custom container here the second problem I have I think is more fundamental so one thing that we actually do again very often is we have small vectors in the return type and this actually works really well for us because people using auto now right throughout all teams code base we use auto and so we don't actually write the return types for these functions very often we don't have a lot of the actual small sizes scattered about the code base of LLVM right they just have auto you can update the the small size where we return it no problem at all it goes you know every single call site picks that up but if you do this with an alligator right we have a bug here right we're returning a vector that's allocated memory potentially on the stack okay and so this this can actually lead to subtle bugs and again maybe there is a way to to embed the small size alligator inside of the vector object but again it seems really challenging to me and having a custom container makes all these problems go away we get our value semantics back for this container and it's fairly easy for us to you know pin down the value semantics we want even when it's in a small size optimized State and that's important that's really really important to the usability of the API making sense okay silence so I've talked a whole bunch about small size optimization right and that's really the only kind of critical thing here but how do you actually get it to be effective right small size optimization works way better when 
your values are really small right when you have very high density to the values and so the bigger challenge in a lot of ways than coming up with a bunch of fancy data structures that use small size optimization that optimize for small values it's like how do we actually make things small enough to benefit from this so a bunch of LOV ends time is actually spent making its data its values especially the ones we put into the data structures as small as compact and as efficient as possible and and I'm going to try and show you some of the techniques for that because it turns out that's more important than having the containers the containers are easy using the containers and using them effectively tends to take a lot of work okay so the first key idea here is to give large objects address identity so what do I mean by address identity the first thing to think about is does your object have some intrinsic identity right is your object like fundamentally unique if you have two objects are they going to compare equal right it turns out for a lot of large objects especially that we work with in the compiler and that I've worked with at Google most of the large ones have identity there's something unique about that object okay and and that means that if we allocate it carefully we allocate it at a stable location we can actually borrow its address to represent that identity okay so we could just use the address the pointer as a stand-in for the identity of the object and all of a sudden we've taken the object which is quite large and we've made it a pointer which is very small I mean in the simplest tribulus case you have a little vector of unique pointers for a big object this seems like such a trivial use case this actually comes up all the time in real code okay you actually we do this in tons of places in ll geum's code and this gives us a few advantages right it gives us stable pointers to big object if we need to create a set or something else out of this we have stable addresses here they're not going to move around if big object is very expensive to move right resizing this vector is going to be a lot faster than if we put the big objects in place in the vector okay this can actually make things faster directly if big object isn't movable at all if it contains for example a mutex right then this can be essential because there's there's literally no other choice available to you now a lot of people look at this and they're like why don't you use a list this is what list was built for and you're not saving any in directions here right you've got a pointer already just use a list so there are two problems with that one though the first problem with the list is that a list doesn't have a pointer it has two pointers right and I don't want to pay for anything I don't have to I want I want this to be as lean as it can be now I could use for word list okay but fourth list comes with a lot of constraints all right it's very hard to mutate for word list but it is just one pointer so it would be the equivalent space here an interesting thing is that this will often be faster than for word list and for somewhat surprising reason so imagine you're walking this container if you're not iterating the container you don't care about the difference between list and vector but if you are iterating this container you're going to see an interesting difference between for word lists and a vector of unique pointers because as you're iterating you're going to be looking at one object and then in the next object 
and then the next object okay and you're probably doing this on a fairly modern processor okay modern CPU and that modern processor it's going to try and and speculate a lot of the operations you're doing ahead of time and so it's going to be doing things out of order one of the things that the processor is going to try very hard to do is to figure out what memory you're going to reference next okay that's really what it wants to know now if you write a loop over a small vector of unique pointers right you're going to be able to make that very clear to the processor because you have a loop index that's just an offset into this array essentially as far as the processor is concerned and each pointer in that array you lookyou load memory from and so each pointer in that array is memory that the processor needs to populate into its cache and actually make available what happens when you iterate a forward list though for a four it lists the next memory you're going to access you can't find without reading the current element okay so if you you know get a cache miss on the first element of yours forward list you also get a cache miss on finding the address of the next element okay and that's true the whole way down and and this kind of thing it makes a big difference when you add it all up and you do this all over the place in your code base so you want to think about these kinds of trade-offs when you're looking at data structures at the high end of performance but this isn't enough okay even even this still isn't enough in a lot of cases we're paying a pointers worth of overhead here we may not want to do that we don't want to keep this vector around we may accidentally end up with a very large vector that's only about you know 60% full that may be problematic for some reason we have some other techniques that we use as well but probably the biggest reason to not like this technique is that it doesn't give you any locality for these large objects right there just allocated wherever your memory allocator sees fit there's no locality at all so LVM uses this really horrible piece of code which I've actually simplified and made less horrible but it's still pretty bad um so this is it's like hacked up variant of allocators which is a terrible API I'm not endorsing the API here at all but the idea of it is really useful what this does is it allocates chunks of memory okay and it allocates a large chunk at a time in this case a whole page at a time and it keeps those pages around and it just hands out regions of those pages to to users this is essentially the world's worst Malak implementation right seriously most valve consultations start something like this the nice thing about this is it's local so I can create one of these for my particular code path and I can carefully allocate large objects on one of these and force them to have strong locality okay and I will waste a lot of memory because I'm going to allocate an entire page ahead of time unless I allocate a large number of these this could be very wasteful but I can adopt a very simplistic model if I know that I'm going to allocate a lot of these and I know that I care about the locality I can kind of adopt this model eagerly whereas a malloc implementation has to make a lot of very hard trade-offs the big simplification here is that there are no trade-offs to make we're essentially letting you the programmer make the trade-offs instead this does slab allocation every time we run out of space we get a new page of memory we then carve it up 
it's really fast it's incredibly fast okay the amount of things that this saves that a malloc implementation has to do even though they look essentially the same under the hood is astonishing and this lets us get stable addresses and reasonably high locality for large objects okay and once we have that we're off to the races we can put those addresses into data structures and containers and we can use small size optimization and other very cache friendly techniques on those containers making sense hopefully okay we will keep going so sometimes even pointers are too large right sometimes point you just have too many pieces of data you actually need to pack things more densely when that's the case just use an index right we've just talked about having a vector of unique pointers if you if you have a small vector of unique pointers you can use a very small index into that and you can save a lot of memory that way there are a lot of ways to kind of force the size equation down and down and down until you actually have small enough pieces of data to work with okay so the next big ticket item here is to aggressively pack bits together and this is probably the most complicated thing to ldm does to make its object small well the first realization when you're figuring out how to pack bits together is we just spend a bunch of time turning everything into a pointer right I mean a whole bunch of effort turning everything into pointers you know gotten a few of them but fortunately our pointers are a bit different from the xkcd pointers in xkcd pointers have these really unfortunate last characters a C and E here I don't know why you use those our pointers look something a bit more like this and it's a consequence of looking a bit more like this there's an opportunity there are these four zeros in the pointer I love these four zeros there they're essential okay these four zeros are like the most valuable four zeros in my entire program LVM does everything in the low bits of a pointer so this is where we're going to get most of the wins for packing data into smaller and smaller sizes all right now now you see what I'm talking about when I say I have a lot of code all right so let's actually step through what this is doing the very first thing we need to do is we need to classify things that are pointer like what does it mean to be pointer like well it means that you can be converted to and from an integer with the same number of bits as a pointer that's that's what this you wait put RT get pointer thing is about I've only provided one direction because the slide is already too big okay it also means that we can look at you and figure out you know how many free bits there are inside of the object and this this is the kind of generic template it works for for pointers right we just look at the alignment of the object we you know take the log two of it I understand this doesn't compile right but you get the idea we take the alignment of the object take the log two we know how many zeros there are in the low bits right we can convert back and forth very trivially now we can build a data structure on top of this so this is this is one of the most hilarious data structures in LTM it's a pointer in pair and it's this and and it is the size of a pointer okay and so it contains it takes a pointer type right it takes the number of integer bits that you would like it takes an integer type umm and it takes this traits object that describes that pointer type you know we need to make sure that the integer bits like 
fit into the free bit to the pointer type but if everything fits right we can shift things around we can mask things in and out and we get to reuse these low bits it's pretty nice making sense everybody like this little little clever data structure there's so much more we can do with this the next thing we realized is right some of the time we we have indices instead of pointers but the indices they look kind of like pointers right they they're smaller usually they don't need all 64 bits we want a way to store integers like they were pointers with a bunch of zero bits at the bottom where we aren't using the data and so we built this pointer embedded int type which takes an integer type and the number of bits that integer type needs and it stores that integer in the high bits of a pointer and leaves the rest of it zero and then we have the pointer of the traits class where we surface to other other users by the way we've got free bits in our object the low bits are all zeros and we tell them how to convert to and from the object right and this allows us to wire in integers as well as pointers into this pointer integer pair making some sense we can go a lot further though turns out a pointer integer pair is a pointer like type it has zeros that are free in the low bits we can surface this it's a fairly simple computation you take the free bits of the original pointer take off the integer bits you're going to use you have that many free bits left and so we can build nested pairs like this and we actually use this to kind of build up data structures all of which reside within a single pointers memory okay then we get to my favorite one of these there's a lot of code missing here I'm sorry but this is this is this is this is what kind of ties it all together we now have a pointer some type okay so I'll step through we start off by defining this kind of nonce class template all this does is bundle a particular in value a particular Poynter to like type and a traits class for that pointer like type okay now the N value has to be a particular value to discriminate one pointer type from another pointer type so you can't you can't have these things overlap we then build up this pointer some type class template okay and we we parameterize it with some number of member types here but they're actually member templates they're actually instantiation zuv this pointer some type member okay and so each one of them Associates a particular value with a particular pointer like type we also give it a tag type which has to be some integer like type usually an enum and what we do is we store the tag type value in the low bits of the pointer and we then use the tag values to make a nice discriminated union between these different things right it's it's variant but with a lot of time spent packing bits into smaller and smaller sizes and not a lot of time spent on really fancy api's or exception safety um didn't need them they're just pointers right they're all they're all very small little objects and we have nice little helper methods right we can get the tag value out of this we can test whether a pointer subtype is a particular pointer type if we can we can do a get which will try to get a particular pointer type out and it'll give us a nice mel representation if it's if the the tag is set to a different value so we can write little you know pattern matching like libraries on top of this and we have just gobs of this all over LLVM is code base okay everybody following this pointer like this pointer some type 
craziness okay so then we can use this to build one of my favorite all-time hybrid data structures this is called a tiny pointer vector and this too is a pointer like type okay so how does this work what we do is we have an enum which describes two states for the vector to be in one of them is an inline state this actually has small size optimization built in one of them is an inline state the other one is a vector state okay and tiny reporter vector right it has a type so so we know that and then we build up this some type here okay and the some type uses this enum to discriminate between either a pointer type which is T in this case which is stored in line or a vector a pointer to a vector right a unique pointer to a whole vector we've allocated somewhere on the heap and then we have just that some type as the value and and we can write I've written out just you know operator square bracket to kind of give you a feel for what this ends up looking like right we go in we check whether we are in the inline State right if we are in the inline state we don't really need to know what the index is there's exactly one index available and we just give you that pointer value directly no heap allocations nothing at all and if we're not in the inline State then we for down to an actual vector allocated on the heap the rest of the methods look exactly the same as this so the most useful place for this is when you need a multi map alright because it turns out we don't like multi maps we like we like those three things I presented earlier with like vectors sets and maps but it's really annoying to put an enormous small size optimized vector into the value of a small size optimized map you end up with something that is not very small or fast or optimized but what we can do is we can if we happen to have pointer like value types we can put this tiny pointer vector into the value of agents map and it's only the size of a pointer and so this remains very compact very very small and in a lot of cases if you have only one right this will actually be just as fast as a pointer it won't use any extra memory but it will gracefully degrade allocate memory as necessary to handle larger and larger sizes and this turns out to be incredibly useful in LLVM there are a lot of cases where you have a multi map because you have to have the multi map you know the overwhelming majority of cases you have one value or zero values we can even address that here and I think this one in particular I would love to see someone implement this with an alligator and these would be really challenging to get with an alligator okay the last tip of course is use bit fields I have to tell you guys about this this is the classic way you do bit packing but I couldn't I couldn't not mention the other fundamental thing we do to reduce the size for objects in LLVM and clang okay so we've we've come up with this elaborate system of making things small and once you make everything very small we put them into these three data structures vector a set and a map the set and the map are these open addressing hash tables that we probe quadratically we're often using pointers as our as our identity so we have a real problem when we need ordering okay like a real problem I mean you shouldn't be relying on the ordering of some kind of hash table anyways it's a really bad idea we're trying to make it on non portable with the change the standard because it's such a bad idea but if you if you want to get worse you really really don't want to rely on the 
ordering of a hash table peed on a pointer because if you were on him with address space layout randomization you are going to have an extremely bad time it does not work well at all this is actually a common bug we have an LLVM because people people make this mistake so it's important to think about how you get ordering back when you need it of course if you have any possibility to sort a vector do so this is fantastic if you have some some fundamental ordering property you can simply assert over your data set you're home free you've got you've got a great solution but you don't always have this sometimes you do not have an intrinsic ordering to your data you have to find some other way of putting your data together and you still need it to be ordered this comes up a lot in compilers where we need to be deterministic in our behavior and we really have to be deterministic in our behavior we don't even care in many cases which ordering we use but we have to use an ordering and and this is this is a real challenge when you've built up all of your data structures around on containers okay and so we need some techniques here this gave rise to another data structure of course this is our set vector and it does exactly what we think this is this it's completely done I've actually like I've actually copied all of the important code here because all it does is it takes a vector and a set okay and when you insert into this thing it checks the set first and then puts it onto the vector and when you want to iterate it it just walks the vector there is no rocket science here at all and this thing is incredibly efficient and incredibly fast we use it all over the place in the compiler and it works wonderfully okay and we can of course generalize this to a map okay and with the math we have to be a little bit more careful right but we just have a vector and a map and we even get to do something really nice because we don't want to store the values twice we can rely on keys being small I mean we really do rely on keys being small but the value type might be larger it might not be something we can duplicate everywhere right we might want copies of the value all over the place but we can we can solve that with with this because we have a vector and so we actually just have the map be a map to it index into the vector right and so even to build up this data structure we're literally reusing the concepts from the rest of the talk okay and so this is how we kind of piece together the the kind of fundamental things we need in order to have this now there's one more cool trick that I wanted to mention here the LLVM doesn't implement this yet because we haven't we haven't gotten to it yet for whatever reason the cool trick here is to is to think about what we're doing with a small vector and a small map or or or a small vector and small set they're the same difference okay so when we're doing these things we're allocating these small buffers but we're allocating two of them one for the vector and one for the set that's a little unfortunate what's more unfortunate is that it doesn't always make sense to use a you know hashed probed hash table data structure if you have very small number of values if you have a set of like four things that's overkill it's actually slow and it would be really nice if instead when we had really small containers if we actually just use kind of a linear scan right rather than doing any kind of fancy probing and hashing if you think about it if we do a linear scan and we have this kind 
of combined set vector or map vector and we do a linear scan here we're going to have two different small buffers both of them with a linear ordering of values and we're going to do a linear scan over well that's wasteful so we can actually change these data structures to share a single small buffer okay and then we can do the linear scan there and that's the vector when it's small and then as it grows we can allocate ourselves a set and we can allocate ourselves a big vector and we can fill it up and we can do the proper hash based set operations and so this stuff scales really really nicely we haven't gotten to that because this implementation is actually fast enough that no one's complained right no one's actually run into this but there's even more we can do to kind of you know tweak in tune the small size optimization to really deliver the last mile of performance oh right so I just presented two and a half small size optimized structures and an incredibly crazy amount of code doing kind of pointer bit packing and I've gotten absolutely no questions yes there's one question a commission about your tagged pointers how do they play with CPU pretty page a question about the packed pointers how do they play with sorry can you repeat that last part CPU bridge how do they play with CPU prefetching uh I mean they're not the most friendly things to CPU prefetching sadly at the same time if they're in the cache and we can like load them and put them into pointer into registers in the pointer forum the prefecture will still do it uh looking at the actual load operation right the operating to the load operation is actually the Pointer kind of restored it means that there's less time you can see it ahead of time but we can also insert explicit prefetching that does that arithmetic to get rid of the integer bits and to produce a pristine pointer for prefetching if you need it by the way if you do have questions please come up to the microphones and I'm already I'm already nervous because Howard has headed up to a microphone yes with small set vector how do you remove items how do you know which element of a vector portion the items in so the question is with a small set vector how do you remove items slowly specifically linearly we walk the entire thing to find them I it's it's not great we've we looked at doing a small set vector the exact same way we do the the map vector so that we actually have an index but it is a lot of overhead and it turns out almost all the users we have for set vector don't ever remove things until they're finished and then they delete the whole thing and so it's somewhat tune tailored to the use cases we happen to have okay Howard go ahead oh my name is Howard in it I noticed that you're completely ignoring the issue of time zones how could I have missed that nice talk thank you very much thanks by the way in case anyone's wondering uh Howard has a great talk about time zones yes John okay my name is John Laika's and I have something to do with allocators but that's not my question the first the first question is can you tell me when you when you're ignoring the low order 4-bit four bits why is it four zeros and not two or three I what the question is why oh my goodness this is way back why was this four zeros rather than two or three zeros uh I have I have many answers for you unfortunately number one answer is because I wrote the code ran it in gdb and that's the number I got uh the number two answer is because I'm running on a Linux 64-bit system this probably 
reaches out to Malik and Malik on my Linux 64-bit systems and most people's Linux 60 wrote systems just happens to always give you 16 byte aligned memory okay that works I couldn't figure it out yeah no I have one more question it's a very complex space when you talk about the size locality and doing bit operations and obviously if you have and the access if you have sort of arbitrary access in a list versus if you have sequential access but my question is I guess can you give us some insight into where the higher cost and I think we'd agree that doing bit operations can be a higher cost than not doing the bit operations having something that's slightly bigger but still small enough to fit inside cache or cache lines etc so the question really is at what point do we start seeing the cost of doing the bit packing outweigh the the benefit of compacting our data this is a great question so this is this is fundamentally the the cost we're paying right this is this is the thing that does most of the bit packing so the key insight here are a couple of things one well there's a lot of code here there's not a lot of dynamically executed code right so so one thing that helps us is that things like the the computing the shift values computing the masks all that happens at compile time it's all con sexpert so we're actually paying for in terms of overhead our shifts and masks and ORS and processors those are the instructions processors are fastest at in the entire world so the overhead here is remarkably low we have I don't know of any point which we have measured an overhead of doing bit packing into pointers I mean that doesn't mean it doesn't exist it almost certainly exists but it's very very low the other thing to think about is if you have a one or two or three bit integer and you have a 64 bit pointer anything you do other than pack those bits in will waste you know eight times the number of bits at least as you had in your integer and so the savings are relatively large to the problem at hand so I've seen in all of your map examples that you store both the keys and the values in knepper is there any look any situation where super beneficial to store all the keys together all the values together maybe when you have a heavy lookup situation so the question is in in all of my map examples which are further back here uh eventually we we store the key in the value in a single bucket you're like in a pair in a bucket um this is this is true there's a lot of interesting ideas around having separate buckets for keys and values I we don't do that I think that there probably is a win there especially for fairly sparse hash tables but we don't see a lot of win for for dense hash tables the real problem for us is that we're going to have to actually examine the key if we could cheat and not look at the key that would be a huge win if we were walking large portions of the hash table then it would also be problematic to be skipping over a bunch of value entries but if we're walking large portions of the hash table where we're probing and we shouldn't be probing that much we should be growing the hash table or fixing our hash function and so so I don't think that's really the biggest thing the biggest wins we've seen and these aren't in LLVM actually this is more in Google's code are around doing a better job of hashing and probing to look at fewer keys right rather than trying to separate the data structures you're already very random access and so I worry that you're just random access into two arrays and 
it doesn't help you as much as it does in kind of sequential patterns a lot of that those stems from the fact that we work very hard to not iterate the actual map and if you're iterating it I think that'd be a different a different two questions in this regard do you have any relative performance measurements that you can show us to compare these the structure access with their counterparts and the other you mentioned using bit fields are you talking about struct defined bit fields or bit masks in the traditional seed bit mask sense so two questions we deal with the first one so the first question is do I have relative performance numbers and I'm embarrassed to say that I don't really I anecdotally when we have made changes to the compiler to to introduce these kinds of data structures we have seen very very dramatic performance swings usually though that's because there was a very bad problem to start with a lot of what LVM is doing is systematically following these patterns so that we kind of take off the the broad creep of performance at the bottom of our stack if you write benchmarks you can measure very dramatic performance differences though I always hesitate to do that because the benchmarks actually misleading here write real code doesn't look like a benchmark that does nothing other than like the access a small hash table that's in the working set I the the second question was about bit fields when I say that fields I mean bit fields as in you know bit field members of a struct but they're present in both C and C++ they're not the sequel specificity construct I certainly would not advocate anyone manually pack bits when they don't need to write like the the code required to to pack the bits in here is horrible right you don't want to write this code you want to actually use a bit field to destruct it you can write I guess I was I was referring to the traditional performance difference between using a bit field and a struct versus using bit masks with the math so if you if you see a performance difference in your implementation between using a bit field and a struct and writing out the math manually you should file a bug with your vendor because there's no reason for that if anything the bit field should be substantially faster so you mentioned a small vector imple type racing the size of the small buffer being advantage I would expect you can go one step further new something like GSL span to even generalize across different kinds of containers so the observation is that you can use something like geocell span to generalize across different kinds of containers and in fact the array ref proposal that predates the span proposal was based on an array ref type in Ella Williams code base in Google's code base we use it very heavily the key thing is that you can't modify the actual structure though the real the really amazing thing about the small vector in pole is that you can actually grow the container from that reduced data type not just inspect it not just mutate the values within it so you talked about using slab a locator to get a better locality have you tried or measured using structure like a deck and pre allocating stuff like say P Nike what you need and using that instead so so the question is can we instead pre-allocate what we need rather than using slab allocators the answer to this is is yes and we do this a fair amount so we'll often reserve vectors to kind of very carefully chosen sizes and then like them those kinds of things but they tend to form a pretty subtle 
pitfalls okay because if you ever if you ever get that reservation wrong you then have a very surprising cliff in your performance where suddenly you have to go and move a large you know pile of data to new out locations the other problem is it can be a correctness bug instead of just a performance bug because you don't actually have that address stability guarantee from any of those containers now we could use something like deck to get this right this standard deck but the the standard deck data type is really quite bad III wouldn't recommend anyone use it for anything if you look at the implementation constraints the the constraints placed upon its its iterators and its invalidation constraints it's painted into a very unpleasant corner and it has very few opportunities to be an efficient data structure yeah I have a question about the for word list yes so um usually whenever people iterate over objects they tend to use them and when they use them the for word list prefetches them if it's carefully designed where the pointer is stored so why would that be not not good compared to say a vector of pointers to objects so the question is why can't this the for word list do the prefetching for you by pointing out where the next memory location is and it certainly can the problem is that that next memory location is still dependent on actually loading the cache line with the list entry in it and so you have this dependency chain every next pointer sits on the next cache line right and so you can't look more than one ahead exactly one and that's a real limitation of modern processors right as if you have an array of pointers right every single one of those pointers is visible immediately there's no dependency there's no uncertainty like the processor knows exactly where they're going follow-up question absolute you are saying that the modern architecture can actually do these loads in parallel somewhat in some cases it doesn't even matter if it can do the loads in parallel though merely having the sequential chaining between cache misses is going to cause stalls to become visible the aren't visible if you can actually addy couple the two the two operations so even if you can just just alive one of the cache line stalls loading the next one and overlap some of the operations are going to see wins immediately and even with prefetching you you don't see this happen with with linked lists thank you absolutely well we're running out of time but I have a little bit more time if there's any more questions hi and do you use some containers like try and how they are implemented do I use some containers like try G are i.e yes I do use containers like a try I I think there may be one place in all the compiler where we use try it's very very unusual they mostly show up in things like simple tables and other kind of text processing operations and we have relatively few of those in the compiler I think that's just that's just my personal bias working on LLVM the Tri data structures are fabulous things I'm just not an expert in the middle before answering one of the questions so I mentioned that if there is a performance difference between bit masks integers versus bit filled structures then you should contact your compiler vendor because there shouldn't be a big difference I was I was wondering are due to ABI standards if you return in a bit past integer it can be returned in registers and and register you later and do you know if beat field structure can be returned the same way and would that be a reason 
for performance difference I I'm not an avi expert I know that the ABI allows us to return some structures and registers though I would hope that allows us to return structures with bit fields in them in registers I also hope that you don't have a function call return in the hot path I hear that that that will like even if it gets the Cullen convention right even if you gets it in registers that's still going to be somewhat unfortunate but I would need to you need to ask an API expert rather than myself Chandler do you guys reclaim it's on the upper end as well so like on 1x you have both your x86 Linux systems your pointers are 16-bit aligned but you also are only using 48 bits of address space or can you not reclaim them because of address space randomization so the question is do we reclaim these the high bits that on some platforms and some pointer sizes are actually known to be the same value whether they're ones or zeros we actually leave those alone we leave them alone for a bunch of reasons some of it is because they're not as many bits that we can touch there as we might like because of address space layout randomization is as Bryce mentioned there also other reasons to leave them alone so Intel and other people keep warning us that they want to take those bits back and we listen and so we're not we're not kind of baking in an assumption about that we also do want to run LLVM on 32-bit architectures and all of the pointer alignment tricks here still work on a 32-bit architecture but we started stealing from the high bits we'd have to turn those optimizations off which would add a lot of complexity the final interesting observation is we can we can grow the alignment of our types however much we need to in order to get the number of bits we want and we do that and we do that more on 64-bit architectures where we know we have the address space to do it and so we can just we can just push the bits we need into the bottom by taking that space in the address space rather than abusing the the high bits we if we need to so we can take the remaining two questions fairly quickly and then I want to let the next presenter get up here way back to the beginning one of the arguments for using small vector instead of state vector had to do with iterator invalidation but I didn't catch what that argument was sorry so the question is what was this iterator in validation that causes a problem with standard vector having a small size optimization the iterators are not invalidated on move of the standard vector object itself and if you are in a small size and you have an iterator it will be invalidated if you move it from one location to another as you're actually going to copy from one in-line buffer to the other on move in a small vector but not it is okay thank you so this may be a 10 second question I might be misinformed but the piece of information from this is going to have the most impact on my code is that you said bit sets are should be faster or the same as bit wise math on integers that we've all done at some point so the reason why I use integers directly my code instead bit sets is that someone who's read pieces of the standard but not cover to cover I see a number of times I see I'm reading through something and I say if T is it says if T is not a bit set and then lists a whole bunch of other rules are there things I should be worried about there or I'm at just being neurotic though they're kind of terrifying I still think you should use them I think it's less terrifying than writing 
complex math yourself and hoping you get it ripe and then debugging it when you get it wrong because I hate debugging my code they're not the rules aren't going to bite you too badly they mostly are caught by compiler error messages these aren't like Oh your compiler your code is going to crash randomly without warning on the rules come in two places you can't form references to bit fields two bit field members which is a little annoying but you can work around quick quite easily with something like a reference wrapper and the second one is is I you you you have data race constraints with bit fields all the bit fields are essentially merged into one large integer for you and and that's what the you know any access is to then so it acts is the entire memory location the merge set of bit fields these don't really come up in kind of day-to-day stuff LLVM we have had far more bugs with people writing math than with the bit fields themselves so I would I would generally lean in that direction all right well you all very much
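Code Sketches

The captions above describe LLVM's hybrid data structures in prose only, so the following are small, self-contained C++ sketches of the main ideas. They are illustrations written for this page, not LLVM's actual code; the class names are made up, and each sketch notes what it simplifies.

First, the small-size optimization itself: elements live in an inline buffer and the heap is only touched once the container outgrows it. This toy restricts itself to trivially copyable, trivially destructible element types so the growth path stays short (LLVM's real SmallVector in llvm/ADT/SmallVector.h handles arbitrary types).

#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

template <typename T, unsigned N>
class SmallSizeVector {
  static_assert(std::is_trivially_copyable<T>::value &&
                std::is_trivially_destructible<T>::value,
                "sketch only supports trivial element types");
  alignas(T) unsigned char Inline[N * sizeof(T)];    // inline storage, no heap
  T *Begin = reinterpret_cast<T *>(Inline);
  std::size_t Size = 0, Capacity = N;

  bool isSmall() const { return Begin == reinterpret_cast<const T *>(Inline); }

  void grow() {
    std::size_t NewCap = Capacity * 2;
    T *NewBegin = static_cast<T *>(::operator new(NewCap * sizeof(T)));
    std::memcpy(NewBegin, Begin, Size * sizeof(T));  // relocate the elements
    if (!isSmall())
      ::operator delete(Begin);                      // only free heap storage
    Begin = NewBegin;
    Capacity = NewCap;
  }

public:
  SmallSizeVector() = default;
  SmallSizeVector(const SmallSizeVector &) = delete; // copying omitted in the sketch
  SmallSizeVector &operator=(const SmallSizeVector &) = delete;
  ~SmallSizeVector() {
    if (!isSmall())
      ::operator delete(Begin);
  }

  void push_back(const T &V) {
    if (Size == Capacity)
      grow();                                        // the first N pushes never allocate
    ::new (Begin + Size++) T(V);
  }

  T &operator[](std::size_t I) { return Begin[I]; }
  std::size_t size() const { return Size; }
};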
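Next, the size-erasure trick described for interface boundaries: a base class that looks like a vector but receives its inline buffer through the constructor, so callees take a reference to the size-erased type and different callers can pick different small sizes. Growth into heap storage is omitted here; only the shape of the pattern (compare LLVM's SmallVectorImpl/SmallVector split) is shown.

#include <cassert>
#include <cstddef>
#include <new>

template <typename T>
class SmallVecImpl {                       // size-erased: N is not part of this type
protected:
  T *Begin;
  std::size_t Size = 0, Capacity;
  SmallVecImpl(T *Buffer, std::size_t Cap) : Begin(Buffer), Capacity(Cap) {}

public:
  void push_back(const T &V) {
    assert(Size < Capacity && "growth path omitted in this sketch");
    ::new (Begin + Size++) T(V);
  }
  T *begin() { return Begin; }
  T *end() { return Begin + Size; }
};

template <typename T, unsigned N>
class SmallVec : public SmallVecImpl<T> {  // supplies the buffer and picks N
  alignas(T) unsigned char Inline[N * sizeof(T)];

public:
  SmallVec() : SmallVecImpl<T>(reinterpret_cast<T *>(Inline), N) {}
};

// The callee never mentions a small size; each caller chooses its own.
void collectUsers(SmallVecImpl<int> &Out) {
  for (int I = 0; I < 3; ++I)
    Out.push_back(I);
}

int main() {
  SmallVec<int, 4> Hot;    // caller expecting a handful of elements
  SmallVec<int, 64> Cold;  // caller expecting many
  collectUsers(Hot);
  collectUsers(Cold);
}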
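The set and map are described as flat, open-addressed tables driven by a traits object that supplies hashing, equality, and two reserved sentinel keys. The sketch below follows that shape (compare LLVM's DenseMapInfo) but simplifies it: probing is linear rather than quadratic, the table never grows or rehashes, and tombstone slots are skipped rather than reused.

#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
struct PtrTableTraits {
  static T *getEmptyKey()     { return reinterpret_cast<T *>(std::uintptr_t(-1)); }
  static T *getTombstoneKey() { return reinterpret_cast<T *>(std::uintptr_t(-2)); }
  static unsigned getHashValue(const T *P) {
    return unsigned(reinterpret_cast<std::uintptr_t>(P) >> 4);  // skip the alignment zeros
  }
  static bool isEqual(const T *A, const T *B) { return A == B; }
};

template <typename T, typename Traits = PtrTableTraits<T>>
class FlatPtrSet {
  std::vector<T *> Buckets;   // bucket count kept a power of two so masking works

public:
  explicit FlatPtrSet(std::size_t NumBuckets = 64)
      : Buckets(NumBuckets, Traits::getEmptyKey()) {}

  bool insert(T *V) {
    std::size_t Mask = Buckets.size() - 1;
    for (std::size_t I = Traits::getHashValue(V) & Mask;; I = (I + 1) & Mask) {
      T *&B = Buckets[I];
      if (Traits::isEqual(B, V))
        return false;                       // already present
      if (Traits::isEqual(B, Traits::getEmptyKey())) {
        B = V;                              // claim the first never-used bucket
        return true;
      }
      // Live entries and tombstones both keep the probe walking.
    }
  }

  bool erase(T *V) {
    std::size_t Mask = Buckets.size() - 1;
    for (std::size_t I = Traits::getHashValue(V) & Mask;; I = (I + 1) & Mask) {
      T *&B = Buckets[I];
      if (Traits::isEqual(B, V)) {
        B = Traits::getTombstoneKey();      // leave a marker so later probes keep going
        return true;
      }
      if (Traits::isEqual(B, Traits::getEmptyKey()))
        return false;                       // hit a never-used bucket: not present
    }
  }
};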
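"Give large objects address identity": own them behind unique_ptr in a contiguous vector, so growing the container only shuffles pointers, the objects never move (important when they hold a mutex or are expensive to move), and their stable addresses can stand in for them in other containers. BigObject is a made-up stand-in.

#include <memory>
#include <mutex>
#include <unordered_set>
#include <vector>

struct BigObject {          // some large, identity-bearing object
  std::mutex Lock;          // also makes it non-movable
  char Payload[256];
};

int main() {
  std::vector<std::unique_ptr<BigObject>> Objects;   // owning, contiguous storage
  for (int I = 0; I < 100; ++I)
    Objects.push_back(std::make_unique<BigObject>());

  // Addresses are stable, so a pointer is a cheap stand-in for the identity.
  std::unordered_set<BigObject *> Interesting;
  Interesting.insert(Objects.front().get());
  Interesting.insert(Objects.back().get());
}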
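The slab allocation described for getting locality can be sketched as a bump allocator: grab a page-sized slab, hand out regions by bumping a pointer, and free everything at once when the allocator is destroyed. LLVM's real version is BumpPtrAllocator in llvm/Support/Allocator.h; this illustration ignores requests larger than a slab and never runs destructors for objects created in it.

#include <cstddef>
#include <cstdint>
#include <new>
#include <utility>
#include <vector>

class BumpAllocator {
  static const std::size_t SlabSize = 4096;      // roughly one page at a time
  std::vector<char *> Slabs;
  char *Cur = nullptr;
  char *End = nullptr;

public:
  BumpAllocator() = default;
  BumpAllocator(const BumpAllocator &) = delete;
  BumpAllocator &operator=(const BumpAllocator &) = delete;
  ~BumpAllocator() {
    for (char *S : Slabs)
      delete[] S;                                // everything is reclaimed together
  }

  void *allocate(std::size_t Size, std::size_t Align) {
    // Round the bump pointer up to the requested alignment.
    std::uintptr_t P = reinterpret_cast<std::uintptr_t>(Cur);
    std::size_t Adjust = (Align - P % Align) % Align;
    if (Cur == nullptr || Size + Adjust > std::size_t(End - Cur)) {
      Slabs.push_back(new char[SlabSize]);       // start a fresh slab
      Cur = Slabs.back();
      End = Cur + SlabSize;
      Adjust = 0;
    }
    char *Result = Cur + Adjust;
    Cur = Result + Size;                         // "the world's worst malloc": just bump
    return Result;
  }

  // Placement-construct an object in the current slab; its destructor never runs,
  // so this suits trivially destructible payloads.
  template <typename T, typename... Args>
  T *create(Args &&...A) {
    return ::new (allocate(sizeof(T), alignof(T))) T(std::forward<Args>(A)...);
  }
};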
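The bit-packing machinery starts from a traits class that says how a "pointer-like" value converts to and from a raw pointer-sized integer and how many low bits are guaranteed to be zero (the log2 of the alignment). LLVM spells this PointerLikeTypeTraits; the names below are illustrative stand-ins.

#include <cstddef>
#include <cstdint>

constexpr unsigned logTwo(std::size_t N) {       // exact for powers of two
  return N <= 1 ? 0 : 1 + logTwo(N / 2);
}

template <typename T>
struct PtrLikeTraits {
  static std::uintptr_t toRaw(T *P) { return reinterpret_cast<std::uintptr_t>(P); }
  static T *fromRaw(std::uintptr_t V) { return reinterpret_cast<T *>(V); }
  // A T* is aligned to alignof(T), so this many low bits are always zero.
  static constexpr unsigned NumFreeLowBits = logTwo(alignof(T));
};

struct alignas(16) Instr { char Data[16]; };     // e.g. a 16-byte-aligned IR node

static_assert(PtrLikeTraits<Instr>::NumFreeLowBits == 4,
              "a 16-byte-aligned pointer carries four free low bits");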
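On top of those traits sits the pointer-and-integer-in-one-word pair. The sketch below packs the integer into the free low bits directly; LLVM's PointerIntPair is more general (it composes with the traits machinery so pairs can nest), but the shifting and masking is the essence.

#include <cassert>
#include <cstdint>

template <typename T, unsigned IntBits>
class PtrIntPair {
  static_assert((std::uintptr_t(1) << IntBits) <= alignof(T),
                "a T* does not have IntBits free low bits");
  static constexpr std::uintptr_t IntMask = (std::uintptr_t(1) << IntBits) - 1;
  std::uintptr_t Value = 0;            // pointer and integer share one word

public:
  PtrIntPair(T *P, unsigned I) {
    assert((I & ~IntMask) == 0 && "integer does not fit in the free bits");
    Value = reinterpret_cast<std::uintptr_t>(P) | I;   // low bits of P are known zero
  }
  T *getPointer() const { return reinterpret_cast<T *>(Value & ~IntMask); }
  unsigned getInt() const { return unsigned(Value & IntMask); }
};

struct alignas(8) Node { int Data; };

int main() {
  Node N{42};
  PtrIntPair<Node, 2> P(&N, 3);        // a Node* plus a 2-bit flag, still one word
  assert(P.getPointer()->Data == 42 && P.getInt() == 3);
}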
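The TinyPtrVector idea: a container that is itself only one word. It stores either a single T* inline or, once a second element arrives, a tagged pointer to a heap-allocated vector. This sketch uses the lowest bit as the discriminator and assumes non-null elements with at least 2-byte alignment; the real thing is built roughly from PointerUnion/PointerSumType plus a SmallVector.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
class TinyPtrVec {
  std::uintptr_t Val = 0;                        // 0 means empty
  static const std::uintptr_t VecTag = 1;        // low bit discriminates the two states

  bool isVec() const { return Val & VecTag; }
  std::vector<T *> *vec() const {
    return reinterpret_cast<std::vector<T *> *>(Val & ~VecTag);
  }
  T *inlinePtr() const { return reinterpret_cast<T *>(Val); }

public:
  TinyPtrVec() = default;
  TinyPtrVec(const TinyPtrVec &) = delete;       // copying omitted in the sketch
  TinyPtrVec &operator=(const TinyPtrVec &) = delete;
  ~TinyPtrVec() {
    if (isVec())
      delete vec();
  }

  std::size_t size() const { return isVec() ? vec()->size() : (Val ? 1 : 0); }

  void push_back(T *P) {
    assert(P && "sketch assumes non-null elements");
    if (!Val) {                                  // first element: store it inline
      Val = reinterpret_cast<std::uintptr_t>(P);
      return;
    }
    if (!isVec()) {                              // second element: spill to a heap vector
      auto *V = new std::vector<T *>{inlinePtr()};
      Val = reinterpret_cast<std::uintptr_t>(V) | VecTag;
    }
    vec()->push_back(P);
  }

  T *operator[](std::size_t I) const {
    if (!isVec()) {
      assert(I == 0 && Val);
      return inlinePtr();                        // common case: no heap, no indirection
    }
    return (*vec())[I];
  }
};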
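When iteration order must be deterministic (pointer-keyed hash tables plus address space layout randomization are not), the talk's answer is SetVector: pair a vector that remembers insertion order with a set that answers "seen already?". The sketch uses standard containers where LLVM would use its own; the MapVector variant is the same shape except the map stores an index into the vector so values are not duplicated.

#include <string>
#include <unordered_set>
#include <vector>

template <typename T>
class OrderedSet {
  std::vector<T> Order;            // iteration order == insertion order
  std::unordered_set<T> Seen;      // fast membership test on insert

public:
  bool insert(const T &V) {
    if (!Seen.insert(V).second)
      return false;                // duplicate: vector stays untouched
    Order.push_back(V);
    return true;
  }

  typename std::vector<T>::const_iterator begin() const { return Order.begin(); }
  typename std::vector<T>::const_iterator end() const { return Order.end(); }
};

int main() {
  OrderedSet<std::string> Worklist;
  Worklist.insert("phi");
  Worklist.insert("add");
  Worklist.insert("phi");                  // ignored
  for (const std::string &S : Worklist)    // always visits "phi" then "add"
    (void)S;
}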
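The "linear scan while small, real hash table once big" idea mentioned near the end: for a handful of elements a flat scan beats hashing, so keep them in a small buffer and only build the hash set after outgrowing it (LLVM's SmallSet takes a similar approach). A plain std::vector stands in for the inline buffer here.

#include <algorithm>
#include <unordered_set>
#include <vector>

template <typename T, unsigned N>
class HybridSet {
  std::vector<T> Small;            // stands in for an inline buffer of N slots
  std::unordered_set<T> Big;       // only populated after we outgrow Small

public:
  bool insert(const T &V) {
    if (Big.empty()) {
      if (std::find(Small.begin(), Small.end(), V) != Small.end())
        return false;              // found by linear scan
      if (Small.size() < N) {
        Small.push_back(V);
        return true;
      }
      Big.insert(Small.begin(), Small.end());   // spill everything into the hash set
      Small.clear();
    }
    return Big.insert(V).second;
  }

  bool contains(const T &V) const {
    if (!Big.empty())
      return Big.count(V) != 0;
    return std::find(Small.begin(), Small.end(), V) != Small.end();
  }
};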
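Finally, the bit-fields tip: the question-and-answer portion stresses that a struct with bit-field members should compile to the same shifts and masks you would write by hand, so prefer the declarative spelling. Both forms below occupy a single 32-bit word on mainstream ABIs; the field widths are just an example.

#include <cstdint>

struct FlagsBitfield {             // the recommended spelling
  std::uint32_t Opcode   : 7;
  std::uint32_t HasValue : 1;
  std::uint32_t NumUses  : 24;
};

struct FlagsManual {               // the hand-written equivalent
  std::uint32_t Bits;
  std::uint32_t opcode() const    { return Bits & 0x7F; }
  std::uint32_t hasValue() const  { return (Bits >> 7) & 0x1; }
  std::uint32_t numUses() const   { return Bits >> 8; }
  void setOpcode(std::uint32_t V) { Bits = (Bits & ~0x7Fu) | (V & 0x7Fu); }
};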
Info
Channel: CppCon
Views: 89,489
Rating: 4.9039702 out of 5
Keywords: Chandler Carruth, CppCon 2016, Computer Science (Field), C++ (Programming Language), Bash Films, conference video recording services, conference recording services, nationwide conference recording services, conference videography services, conference video recording, conference filming services, conference services, conference recording, conference live streaming, event videographers, capture presentation slides, record presentation slides, event video recording
Id: vElZc6zSIXM
Length: 55min 49sec (3349 seconds)
Published: Sat Oct 01 2016