Faster than Rust and C++: the PERFECT hash table

Captions
Hash tables are one of the most important data structures in software. They're built into many programming languages under different names: dictionaries, associative arrays, hashes, objects, or just hash tables. Because hash tables are used all over the place, they're carefully optimized and tuned to make your programs fast. But sometimes the built-in hash table isn't enough, so I set off to make the perfect hash table. In this video I'll share my journey while also dropping some performance nuggets so you can make your code fast too.

Let's start with a simple use case: the days of the week, where we want to map numbers to names. We map the number 1 to the word Monday, 2 to Tuesday, and so on. One way to implement this is to pack all the strings into an array and index into it by subtracting one from the day number. Arrays are very good at this. But what if we want to go the opposite direction, from the strings to the numbers? We can't just use an array index anymore. What we can do is take the first character of each string and map it to a number: M is the 13th letter of the alphabet, T is the 20th, and so on, and use that number as the index into an array. We end up with a bunch of empty slots, but we can access the array quickly. This "take the first character" step is a hash function. It's not necessarily a good one (we'll discuss better ones later), but this is the essence of a hash table.

There's a problem, though: what happens if two days start with the same letter? Thursday also starts with T, so it would take up the same array slot. Different hash table schemes handle this differently, but all general-purpose hash tables need to deal with collisions. Collisions slow down the hash table, but they're a fact of life for most hash tables. What if we don't want to deal with collisions at all? We can carefully construct our hash table so that it avoids any collisions. Here's an example hash function: take the second character, double it, and add the first character. That gives us a unique number for every day, and from those unique numbers we can index the table with no collisions in our data. This is called a perfect hash function, and that's the subject of this video. Pop quiz: what happens if the hash function always returns the same index rather than unique indexes? How many keys would need to be checked on lookup?

Now for my real-world problem. I'm writing a TypeScript parser, and I need to recognize keywords; in particular, I need to map each keyword to a number. My most naive solution is a linear search: do a bunch of string comparisons one after another, and if we find a match, return the number associated with it. That performs okay: about 6 million lookups per second. You may have heard in your computer science class that binary search cuts down the number of comparisons if the input data is sorted, and our list of keywords is known ahead of time, so we can pre-sort it. So how does binary search do? Ah, it turns out binary search was a little slower than linear search. This talk isn't about binary search, so I'm not going to go into why it might be slower. Let's try something even better than binary search: a hash table. The standard C++ hash table is called std::unordered_map, and it's twice as fast as linear search.
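To make the perfect-hash idea concrete, here's a minimal C++ sketch of the day-of-week example, using the "first character plus double the second character" hash described above. The letter-position scheme, function names, and table layout are my own illustration, not code from the video.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Letter positions: 'a'/'A' -> 1, ..., 'z'/'Z' -> 26.
int letter_pos(char c) { return std::tolower((unsigned char)c) - 'a' + 1; }

// The perfect hash from the example: first char + 2 * second char.
int day_hash(const char* s) {
    return letter_pos(s[0]) + 2 * letter_pos(s[1]);
}

// Look up a day name; returns its 1-based day number, or 0 if not a day.
int day_number(const std::string& name) {
    static int table[64] = {};
    static const char* names[64] = {};
    static bool init = false;
    if (!init) {
        const char* days[7] = {"Monday", "Tuesday", "Wednesday", "Thursday",
                               "Friday", "Saturday", "Sunday"};
        for (int i = 0; i < 7; ++i) {
            int h = day_hash(days[i]);  // unique for these 7 keys
            table[h] = i + 1;
            names[h] = days[i];
        }
        init = true;
    }
    if (name.size() < 2) return 0;
    int h = day_hash(name.c_str());
    if (h < 0 || h >= 64 || !names[h]) return 0;
    // The hash alone isn't proof of a match; verify the full key.
    return name == names[h] ? table[h] : 0;
}
```

Note the final string comparison: a perfect hash guarantees no collisions among the known keys, but arbitrary input can still land on an occupied slot, so we verify before returning.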
So just by switching a basic data structure, from a vector to an unordered_map, we get a 2× speedup. That's great: very little work for a massive gain. But a lot of people say C++'s unordered_map is kind of slow, stuck in the 1990s in terms of design, so let's look at a more modern hash map implementation, this time from the Rust programming language. It turns out Rust's HashMap is not as well optimized as C++'s unordered_map, at least for our use case.

Performance tip #1: try it. Try the first thing that comes to mind and enter the performance mindset. The first thing I had in mind was a linear scan; I tried it, and it worked, of course. Then I tried a binary search and noticed it was a little slower. Then I kept trying other things until I got something satisfactory.

Off-the-shelf hash maps use a general-purpose hash function. It doesn't know anything about our data set, so it has to look through every character in our strings without knowing ahead of time how many characters to expect. CPUs like working with a fixed amount of data, but with a general-purpose hash function a key could be one character, or five, or ten: a variable number. CPUs don't really like that kind of data. What we want is a hash function that picks a specific number of characters rather than a variable number, so the CPU doesn't need to do any loops. The fastest possible hash function for strings is just taking the length: it's information that's readily available, it's already a number, and it can easily be used as the index into the table. The problem is that lengths aren't unique; a lot of keywords are four or five characters long, so we'd get a lot of collisions with this hash function. We need something a little more sophisticated. Something more unique is taking the first character, but as we saw with the day-of-the-week example, collisions can still happen.

So what if we pick a set of characters that leads to no collisions, say the first two or maybe the first three? Looking at just the first character, we have collisions: null and new both start with n; true and typeof both start with t. With the first two characters, this and throw share a common prefix. Let's try the last character: taking the first and last characters is still not unique, because case and continue both start with c and end with e. What about the first two characters and the last one? Unfortunately declare and delete both start with "de" and end with the letter e. The first character and the last two? never and number collide. All right, what if we take four characters instead of three? Hey, with the right four characters we get no collisions.

Let's look at this in more detail with the keyword debugger. To select the first two characters we do one 16-bit load (16 bits meaning two bytes), and for the last two characters we do another 16-bit load. Our index needs to be a single number, not two, so we pack all of these characters into one number: shift one of the pairs up by 16 bits, leaving the remaining bits zero, then use a bitwise or to bring in the other 16 bits, ending up with a 32-bit number. Another example, the let keyword: we do a 16-bit load of the first two characters and a 16-bit load of the last two. Notice the middle character, e, gets loaded twice; that's fine, there's no real penalty for loading the same character twice. Then the same shift and the same bitwise or give us another 32-bit integer. Here's what that looks like in assembly.
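A C++ sketch of that character-selection step (the function name and memcpy-based loads are my own; compilers turn the memcpys into plain 16-bit loads):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pack the first two and last two bytes of the key into one 32-bit value.
// The caller must guarantee size >= 2 (the bounds check described below).
// For short keys like "let", the two loads overlap in the middle; that's fine.
uint32_t select_chars(const char* s, size_t size) {
    uint16_t first_two, last_two;
    std::memcpy(&first_two, s, 2);            // 16-bit load of s[0..1]
    std::memcpy(&last_two, s + size - 2, 2);  // 16-bit load of the tail
    return (uint32_t)first_two | ((uint32_t)last_two << 16);
}
```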
We have a 16-bit load, another 16-bit load, a shift instruction, and an or instruction. There's no loop involved: to compute the hash, all we do is load two things from memory and mix them together, four simple instructions. To make this work safely, we need a bounds check first to make sure the key has at least two characters; with only one character, the 16-bit load would read out of bounds of the array. So how does this perform if we plug it into the standard std::unordered_map as a custom hash function? We get a significant performance improvement just by changing the hash function: a small tweak to something that already exists gives a massive gain.

Performance tip #2: make assumptions about your data. In our case, the first two and last two characters are unique across the keyword set, so we could feed them to the hash function and end up with few collisions. You should make assumptions about your data because your data is never random; there's always some pattern to it. Here, the pattern is uniqueness in the first and last characters, and that all of our keys are at least two characters. Pop quiz: why do we want to pack the characters into a single 32-bit integer, and what else could we assume about the key length?

So far we've discussed off-the-shelf, general-purpose hash tables. But because our data set has uniqueness properties and we know it ahead of time, we can use tools that generate code for the hash table and bake it into the program. Let's look at a few third-party libraries that do this.

First up is Rust's phf crate. As the name implies, it generates perfect hash functions and perfect hash tables. Say we have the key if. phf hashes it using a normal hash function, in this case SipHash-1-3, and gets back a number. It uses the bottom bits of that number to look up in a table; this isn't the table with the data, just a helper table. It mixes the value from the helper table with the hash to generate a second hash, and that second hash is the index into the real table. The middle table exists to prevent collisions: if there is a collision, the generator can just tweak the middle table until there are none. How does this perform against the off-the-shelf hash tables? We get a smaller improvement compared to the standard Rust HashMap, and we're still much slower than C++'s unordered_map with our custom hash function.

Next, the C++ frozen library. It takes our key, hashes it with an off-the-shelf hash function (similar to phf), uses the hash to index into a helper table, then directly indexes into the result table. What's the purpose of the middle table here? In the case of collisions, frozen adds a marker indicating that the helper-table slot isn't an index but a seed that needs to be mixed with the hash; it rehashes with that seed to generate a new index into the real table. Compared to Rust's phf, frozen is much faster, almost twice as fast, and it's also significantly faster than our custom hash function inside unordered_map, so we're definitely on to something with this library. Now, I said frozen uses an off-the-shelf hash function; what if we give it our own hash function instead? Wow: it looks like frozen's hash function has some improvements to be made, because with ours we get double the performance. This is looking pretty promising.

A third library: gperf. gperf takes our key, splits it into characters, looks the characters up in a helper table, and adds the results together to produce the index into the real hash table. The intermediate table is part of the hash function itself, not separate from it, so we can't use our custom hash function with this approach; we have to rely on gperf's hashing. On its own, though, gperf is beating the pants off everything so far. It's no wonder many compilers use gperf for keyword recognition and other tasks; before I made my own custom hash table, I used gperf because it's pretty fast.

Now let's talk about the new hash table I made. Similar to phf and frozen, I apply a hash function to the key, then directly look up in the table. It's that simple. So how do we avoid collisions? How is this a perfect hash table? Aren't we kind of cheating? What happens is that when we're generating the table, if there are any collisions, we tweak the hash function, and the way we tweak it is by giving it a different seed. The hash function is seeded: we start with a seed of 1, then try a seed of 2, then 3, and keep trying until we get no collisions. If we still can't avoid collisions, we make a bigger table and try again. And we don't give the hash function the full string; we give it those four selected characters we talked about.

So how does this perform compared to gperf? I tried a bunch of different hash functions because I didn't know which would perform best. I tried FNV-1a first, because it's public domain and I've used it before; it doesn't perform as well as gperf. I also tried xxHash3 because I heard it was pretty fast, but it turned out to be slower than FNV-1a. I also tried the Intel CRC32 instruction, which is just one instruction on x86 processors, and it was as fast as FNV-1a. That's good; unfortunately it's not portable, so that was kind of a downside.
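The seed-searching generation loop can be sketched like this. This is my own illustrative reconstruction: the 50,000-attempt limit per size and the grow-and-retry behavior follow the video's description, but the multiply-based mixing function and power-of-two sizing are assumptions, not the video's exact code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Illustrative seeded hash over the four selected characters.
uint32_t mix(uint32_t selected_chars, uint32_t seed) {
    uint64_t x = (uint64_t)(selected_chars ^ seed) * 0x9E3779B97F4A7C15ull;
    return (uint32_t)(x >> 32);
}

// First two + last two bytes packed into 32 bits (keys must be >= 2 chars).
uint32_t select_chars32(const std::string& s) {
    uint16_t a, b;
    std::memcpy(&a, s.data(), 2);
    std::memcpy(&b, s.data() + s.size() - 2, 2);
    return (uint32_t)a | ((uint32_t)b << 16);
}

// Search for a {table_size, seed} pair that maps every key to a distinct
// slot. If 50,000 seeds fail at one size, grow the table and retry.
std::pair<uint32_t, uint32_t> find_perfect(const std::vector<std::string>& keys) {
    for (uint32_t size = 1; ; size *= 2) {  // power-of-two sizes
        if (size < keys.size()) continue;
        for (uint32_t seed = 1; seed <= 50000; ++seed) {
            std::vector<bool> used(size, false);
            bool ok = true;
            for (const std::string& k : keys) {
                uint32_t slot = mix(select_chars32(k), seed) & (size - 1);
                if (used[slot]) { ok = false; break; }
                used[slot] = true;
            }
            if (ok) return {size, seed};
        }
    }
}
```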
I also tried a basic linear congruential generator (LCG) as my hash function, and it performed about the same as the CRC32 instruction, which is pretty good. My suspicion is that CRC32 performed worse than it could have because I couldn't use a power-of-two table size, and non-power-of-two tables are a bit slower to look up in. So let's take the best of these hash functions and try some more. I tried the Intel AES instruction, which should mix better than CRC32; unfortunately it performed worse, I guess because it's a bit slower. I also tried a clever hash called the Pearson hash, which is kind of like what gperf uses: it does its lookups through a separate table. Unfortunately it was slower than the LCG. So I wasn't getting the performance I wanted. We're not matching gperf, and matching gperf is my goal.

Let's look at the pseudocode for my algorithm. First we do a bounds check, which I think is pretty fast; it seemed to be only two instructions or so. Next we load those four characters, which is also fast, just four instructions as we discussed. Then we do the hash; I tried a bunch of hash functions, and I'm not certain we got the fastest, so maybe that's something to improve later. But then there's the matches-key function, which compares the strings of the keys, and it turns out that's pretty slow. How do I know? I profiled it. I used Intel VTune, which is a profiler for Linux (and I think it also works on Windows). It told me that __memcmp_avx2_movbe is slow, taking about 60% of my CPU. memcmp is the function I'm using to compare the strings, and the rest of our work takes only 25% of the benchmark time, so clearly memcmp is the slow part. I also profiled gperf and noticed its string compares weren't much of a bottleneck, so I decided to look at gperf's code.

gperf does a string compare, which is basically the same as a memcmp, and it also does a length check, which I'm also doing. But it also has this piece of code: *str == *s. What's that doing? It checks the first character of the string, and only if the first character matches does it do the full string comparison and the length check. This seemed a bit weird, because the string comparison is supposed to do the same thing anyway, right? I thought, well, what's the harm, I'll try it and see how my hash table performs with this trick. And we beat gperf, just like that. That one little thing was all we needed to get a huge 50% speedup and beat the pants off gperf. Pop quiz: we saw that checking the first character made it faster, but why? I also decided, hey, if checking the first character is faster, why not check the first two characters? We already loaded them as part of our hash function. Well, it turns out that's a little slower than just checking the first character. I didn't really investigate why, but there you go.

Performance tip #3: see how other people did it. In my case, I looked at gperf to see what trick it was using to make its string comparisons faster, and I was able to use the same trick to get a massive speedup. But make sure to measure: a technique you copy from somebody else might not actually improve performance, or might make things worse. Speaking of seeing how other people did it, I also looked at general-purpose hash tables to see what tricks I could learn from them. One thing I saw is that, for collision detection, some hash tables store the hash inside the table in addition to the string and the value. This means that before checking the string in the table, they can check the stored hash for a match. In our case, we'd check the stored hash for if and see a match, and check the stored hash for fetch and see no match.
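Here's a minimal sketch of the gperf-style early-out compare (the function name and signature are my own):

```cpp
#include <cassert>
#include <cstring>

// Compare the first byte yourself before paying for a full memcmp plus
// length check. On mostly-miss workloads, the cheap first-byte test
// filters out most non-matching candidates.
bool matches_key(const char* input, size_t input_size,
                 const char* table_key, size_t table_key_size) {
    return input[0] == table_key[0]  // cheap early reject
        && input_size == table_key_size
        && std::memcmp(input, table_key, input_size) == 0;
}
```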
So how does storing the hash perform compared to checking the first character? About the same, it turns out, and unfortunately this approach needs a little more memory, so I decided to stick with checking the first character.

So I beat the performance of gperf. What else can we do with our perfect hash table? Let's look at our keyword set again, and focus on the keyword constructor. This is the longest keyword in the set, taking up 11 characters, which means all of our keys are at most 11 characters. We're using memcmp, but maybe there's a faster way to compare this key with the input key. Let's do a web search: what's the fastest way to compare 11 bytes in C++? Our search engine gives us a thing about std::byte, but then points us at memcmp, which is the function we're already using, so that's not really helpful.

Let's think about this again. What we want is to compare the input string with the constructor in the table all in one go. And if we have a small input like let, we want to treat it the same as if it were constructor: add some zeros at the end and compare against what's in the table, which also has zeros at the end. So now the question isn't how to compare 11 characters, but how to load 11 characters and then do the comparison. Another web search seems more promising, but there are some buzzwords here like xmm and __m256. What are those? On x86-64 there are a bunch of different registers; the xmm registers we saw in that search are 16-byte registers holding integers or floats, xmm0 through xmm15, and they can be accessed from C++ using the __m128i type. So say we're able to load both strings into these registers; how do we compare them? Another web search, and it takes a while to find, but we end up with the _mm_cmpestri intrinsic. What is this? It stands for something like "compare explicit-length strings", and I couldn't figure out what every letter stands for. It's specific to Intel, but I'm on an Intel processor, so let's see how it performs compared to our fastest approach.

Oh, we got a crash. What's going on here? This is how I was thinking about it: we have the word let, and then garbage after the let in the rest of the string; we zero out the stuff after the string, then compare with the thing in the table. The problem is that if we're near the end of the input, accessing the remaining bytes is out of bounds, and that can cause a crash. What we want instead is to make sure the input string has a bunch of trailing zeros, so that even though we read "out of bounds" of the string, it doesn't crash. If we make sure, after we load the input file, to add some null bytes at the end, then we can run our benchmark: better than our inline-hash approach from before.

Performance tip #4: improve your data. In our case, we were able to use cmpestri without any extra bounds checking by changing the input data to have null bytes at the end. In general, the more assumptions you can make about your data, the faster the code you can write. And how do you get more assumptions? Massage your data so those assumptions are valid. Performance tip #5: keep asking questions. Simple web searches can expand your toolbox; think about what your hardware has built in, and do some research.

While developing the cmpestri version, I noticed that it requires SSE4.2, but I want my software to run on older processors where SSE4.2 is not available, so I developed a version that works on all Intel 64-bit processors. And it's actually faster than cmpestri, even though cmpestri is dedicated to string comparisons. While developing that portable movemask approach, I learned about the SSE4.1 instruction ptest, so let's try using that to solve the problem instead. ptest is even faster than either cmpestri or movemask.
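Here's a rough SSE2 sketch of the padded-compare idea (my own code, not the video's). It assumes the input buffer has at least 16 readable bytes after the key starts, thanks to the null padding, and that the table entry is null-padded to 16 bytes; a separate length check is assumed to have already matched the sizes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2

// Compare up to 16 bytes in one go. Bytes past `size` are masked off, so
// garbage after the input identifier doesn't matter.
bool equals16(const char* input, size_t size, const char* table_entry) {
    __m128i a = _mm_loadu_si128((const __m128i*)input);        // padded load
    __m128i b = _mm_loadu_si128((const __m128i*)table_entry);  // null-padded
    __m128i eq = _mm_cmpeq_epi8(a, b);
    uint32_t mask = (uint32_t)_mm_movemask_epi8(eq);  // 1 bit per equal byte
    uint32_t want = (size < 16) ? ((1u << size) - 1) : 0xFFFF;
    return (mask & want) == want;
}
```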
So let's stick with ptest. Performance tip #6: smaller isn't always better. The cmpestri solution was the smallest, using only 31 total instructions for the entire algorithm, but it ended up slower than both movemask and ptest, which had more instructions. Fewer instructions doesn't necessarily mean faster, even if an instruction seems dedicated to the specific task you're doing. Quiz: can we optimize our string comparison without using special SIMD instructions, for example just using normal integers?

Now our string comparison isn't showing up in Intel VTune. The VTune profiling we used earlier tells us about functions, but we need to drill down deeper than functions, to the assembly instructions. For that we use perf record on the benchmark, then perf report, and we get an assembly printout: the left column shows the percent of time taken, the right column the assembly instructions, color-coded by how hot they are. Let's zoom in to the hot part. Reading carefully, we can see this is the string comparison code: we're loading things, doing a ptest, doing some comparisons, and returning. I don't see a way to optimize this; there don't seem to be any wasted instructions here.

So let's try a different approach to profiling: perf stat, which gives us performance-counter data rather than per-instruction data. It gives us a lot of data, but let's focus on two numbers: branch misses and L1 d-cache load misses. Our branch misses are almost 4% of all branches. That doesn't sound like a lot, but each branch miss can cost a lot of time. So let's talk about branches for a bit. Say you have this code: if strong, health is 4, otherwise health is 1, and we return the health. This might get compiled down to a test and a jump. The jump instruction, jne, is what we call a branch: sometimes jne goes to another instruction, and sometimes it just falls through to the next one. Branches are something CPUs predict, and if they predict incorrectly, it can be disastrous for performance.

One technique to get rid of branch misses is to get rid of the branches entirely. In this particular program we can avoid branches with a little bit of math: a formula that gives us the same answer with no branches, and in fact it compiles down to a single machine instruction. But we can't always use this technique, right? So let's use something a bit more general. If we rewrite our if a certain way, and the compiler is nice to us, it generates code where the test-and-jump becomes a test and a cmovne (a conditional move). Now there are no branches, which means no branch misses from this code. Unfortunately, C++ compilers are very unreliable when it comes to cmov, so we often need to resort to inline assembly to get the machine code we want. In our particular example, we convert those jumps into a few cmovs.

So how did cmov perform compared to the branching versions? Across the three SIMD approaches we made, the branchless cmpestri version is flying: we're at almost 10 times the speed of the standard hash table. We can also apply this technique to the other SIMD approaches, movemask and ptest. Interestingly, in the branchy versions ptest was a bit faster than movemask, but branchless movemask is faster than branchless ptest, and even cmpestri beats ptest. So it's a good idea to try all combinations of the different approaches rather than assuming the best one at one time will remain the best in the future. Performance tip #7: use multiple profiling tools. I wasn't able to notice the branch-miss issue until I used perf stat rather than perf record or VTune; sometimes your profiler can mislead you into thinking something is slow for one reason even though the real reason is totally different.
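The branchy-versus-branchless example from above, sketched in C++ (my own minimal reconstruction of the "health" example):

```cpp
#include <cassert>

// Branchy version: compilers emit a test and a conditional jump (jne).
int health_branchy(bool strong) {
    if (strong) return 4;
    return 1;
}

// Branchless via arithmetic: same answer, no jump. `strong` is 0 or 1,
// so 1 + 3*strong is 1 or 4; this typically compiles to a single lea.
int health_branchless(bool strong) {
    return 1 + 3 * (int)strong;
}
```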
Performance tip #8: benchmark on real data. In my case, I was benchmarking against keywords and variable names from jQuery, a real-world JavaScript code base. If I had instead benchmarked on, say, just keywords or just variable names, I wouldn't have noticed the branch-miss issue, because the comparison would either always pass or always fail, and data that always passes or always fails is pretty easy for a CPU to predict, so you don't end up with branch misses. But because I used real-world data, where only about 20% of identifiers are actually keywords, the CPU wasn't able to predict as well, there were a lot of branch misses, and I was able to optimize for that use case. If you use fake data in your benchmarks, you might be optimizing for fake workloads rather than real workloads. Pop quiz: could we make the length check at the top of our function branchless?

Let's talk about more ideas. After we do the hashing, we need to compute the array index. At a low level, you take the hash, modulo it by the table size, then multiply by the entry size in bytes to get the byte offset. If our table size is a power of two, such as 512, we can use a bitwise AND instead of the modulo. A bitwise AND is much faster than a modulo, and it's something the compiler might do for you, but you should probably do it by hand if you really care. In the same vein, instead of a multiplication we can use a bitwise shift if the entry size is a power of two; in our case we'll try an entry size of 16, which lets us shift left by 4. But my thinking is: what if we can get rid of the shift? We're already doing a bitwise AND; what if the AND could also clear the bottom bits, which kind of acts like a shift? If we do our address calculation a little differently, we can get rid of the shift and do just one bitwise AND, in addition to the add with the table's base address.
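A sketch of the power-of-two index math (the constants and the "fused" variant are my own illustration of the idea, not the video's exact code):

```cpp
#include <cassert>
#include <cstdint>

const uint32_t kTableSize = 512;  // must be a power of two
const uint32_t kEntrySize = 16;   // bytes per entry, power of two

// AND replaces modulo, shift replaces multiply.
uint32_t byte_offset(uint32_t hash) {
    uint32_t slot = hash & (kTableSize - 1);  // hash % 512
    return slot << 4;                         // slot * 16
}

// Fused variant: a single AND picks bits 4..12 of the hash as an
// already-scaled offset, so no shift is needed. It uses different hash
// bits for the slot, which is fine for a hash.
uint32_t byte_offset_fused(uint32_t hash) {
    return hash & ((kTableSize - 1) << 4);
}
```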
At the assembly level, the original code on the left does the AND we talked about plus some adds and multiplies to get the right address; the new version on the right has only an AND instruction. So how does this perform? Originally we had a 13-byte entry, so let's try a 16-byte entry with this technique. It performs just a smidge faster, or maybe the same; it's hard to tell. One theory is that we made the table bigger, because entries went from 13 bytes to 16 bytes each. So let's try shrinking the table and see if that gives us an improvement. How do we make the table smaller? As I mentioned earlier, my generator tries inserting all the keys with a certain seeded hash function; if that doesn't work, it changes the seed and tries again, and if it can't fit them after 50,000 attempts, it makes the table a little bigger and repeats the process. But what if we just try harder, doing more attempts before giving up? It'll make generation slower, but maybe we get lucky and find a collision-free table at a smaller size. With the default attempt count, the table ended up at 262 entries; cranked up to about a million attempts, it ends up at 194 entries, and performs the same. So it looks like a smaller table doesn't mean better. It might be better for memory usage, but it doesn't mean better lookup performance.

In addition to those failed attempts, I've had other missteps. I had trouble making compilers generate the two 16-bit loads that get merged together for the character selection. I also couldn't make compilers generate cmov instructions; that was a real pain, and I had to resort to inline assembly. The problem with inline assembly is that it's kind of a black box for compilers, so they can't optimize much around it. Also, for the several SIMD approaches I talked about, I had to write something like 20 versions until I got something that performed well and didn't have a bunch of crashes; in particular, the mask that takes the input string and removes the extra bytes at the end was kind of tough. Fortunately, while developing it I learned about the ptest instruction, because clang generated ptest for one of my mask-generation approaches; even though I struggled with the implementation, I learned something by virtue of clang optimizing my code. And after doing all these optimizations, I realized my length check wasn't actually working properly, so I had to go back and rewrite a bunch of the approaches to do the length check correctly, and slightly differently for each one because of the nature of the string comparison algorithms. Performance tip #9: keep trying new ideas. You will hit a lot of dead ends, but you will learn things as you go along. Winners don't give up. Pop quiz: what ideas do you have for my perfect hash table? I have more ideas to try myself, but performance is good enough and I've been hacking at this for two weeks, so if anybody wants to try these ideas, be my guest.

So let's take a look at those pop quizzes and see if we can answer them. First: what happens if the hash function always returns the same index? How many keys need to be checked? The answer: bad things. All values are put in the same bucket, and the lookup function needs to check every single one to see if there's a match. Lookup devolves into a linear scan for both hits and misses, and it's probably slower than blasting through an array. Next: why do we want to pack the characters into a single 32-bit integer? The answer is that some hash functions, such as the Intel CRC32 instruction, work efficiently on 32-bit registers rather than being fed one byte at a time. And what else could we assume about the key length beyond the minimum?
could we assume about the key length than the minimum size well the longest keyword has 11 characters and that means we can exit early just like we exit early if the length is less than two it also means that all the keywords fit in a 16-bit CMD register and we tried doing this and it gave us some pretty good speedups now why did checking the first character ourselves make a look of 50 faster I'm not 100 sure on this but my guess is because of the early check on lookup failures which are 80 of the time we could avoid calling memcom and mem comp is pretty cheap but it's not as cheap as an if statement next question can we optimize our string comparison without using special symdy instructions well the answer is yes if what we do is do a mem comp with exactly 12 all compilers will generate a 64-bit load followed by a 12-bit load and we could do the same technique and do some masking so let's try it here's our fast branchless 70 approaches and without using special 70 instructions we almost got the same performance as our p-test approach that's pretty good could we make the length bounce check branchless well the answer is yes but we got to be careful if we're looking up a single character identifier such as the letter X well our hash function checks the first two characters which means it'll check out of bounds one character but also it checks the last two characters which means it checks before the string and our string comparison function loads either 12 or 16 bytes depending on the algorithm so we have to make sure that one byte before and 15 bytes after are allocated and not out of bounds if we can make this assumption we could just delete the bounce check so let's see how this compares to our fast assembly approach and it's even better now this is surprising to me because we're doing all this hashing and all this lookup and all the string comparison stuff for one character variable names now I imagine there's a lot of one character variable names but it seems 
like it's not predictable by the CPU so it's faster to do all this work than to have the CPU mispredict so perf tip number 10 is to talk about your optimizations I didn't have this idea of deleting the balance check until I made this video and when I made the video I Decided well I should probably check the performance of deleting the balance check because I know it's possible let's try it and I was pleasantly surprised that it made things even faster so if you talk about your optimizations with people even if you just rubber Dock and talk to nobody going through that thinking process can unlock optimization opportunities and the last Pop Quiz question was about what ideas you have for my perfect hash table and of course you should leave a comment with your answer below so that's my journey of making a hash table that is 10 times faster than C plus Plus's built-in hash table preliminary results say my compiler is about five percent faster overall even though detecting keywords is actually a pretty small part of the compiler we covered 10 performance tips you can do in your own programs to make things faster now none of these are specialized for hash tables or data structures they're just general performance tips that you can use in your own programs so I hope you learned something have fun optimizing your code I want to thank Dave Churchill and other Chatters in my twitch stream who helped me make my perfect hash table even better
Info
Channel: strager
Views: 523,156
Keywords: hash table, hash map, data structures, optimization, performance, c++, rust, step by step, educational, teaching, hash tables, perfect hash tables, perfect hash function, hash function, perfect hash table, hashtable, hashmap, dsa, code, assembly, x86, x86_64, amd64, sse, simd, array, speed up, standard library, unordered map, hash, key value, dictionary, blazing fast, blazingly fast, avx, 10x, tips
Id: DMQ_HcNSOAI
Length: 33min 52sec (2032 seconds)
Published: Sat Mar 11 2023