Hash Tables, Associative Arrays, and Dictionaries (Data Structures and Optimization)

Captions
Hash tables are kind of the next logical step after learning about arrays and linked lists. If you're unfamiliar with hash tables, I find it's really helpful to start by talking about the problem being solved here. Basically, hash tables are a way of storing a bunch of crap and then being able to access it later, quickly, using a key.

The first thing you might be thinking is, hey, wait, that sounds exactly like an array, and that's a good observation. Let's look at inserting a bunch of crap into an array. I dump some items into this array here, and we can access the items later using the index of the array. The important thing to note is that the key is the index. The problem is that this is entirely dependent on the insertion order: if I had inserted them in a different order, their indices, meaning their keys, would be different. The other problem is that if you don't know the index, the key, ahead of time, you have to search the array to find what you're looking for.

Instead, let's try to compute our own array index using something called a hash function. This can be literally anything, but let's do something super stupid: we'll take the name, and I'll just use the length of the name, because I said super stupid. You can easily see now where they go in the array, and it doesn't matter what order I insert them in, because we computed our own array index. We can also look up things really quickly. If I want to see if Hermes is in the table, well, the length is 5, so I check that spot in the array and, boom, nothing there. Super fast and easy.

But there are some obvious problems here, like what happens when I try to insert Kif? That has three letters, and that spot is already taken by Fry, so we need to figure out how to work around this. In a hash table, when two separate keys end up hashing to the same index, you have what's called a hash collision, and that's perfectly normal. Collisions can be mitigated by using a better hash function (our example used an extra stupid one that you'd never use in reality, and there are plenty of great ones out there), but no matter how random your hash function is, you're still screwed because of basic probability, and you're likely to get collisions anyway. So just embrace it and learn to deal with it.

There are a couple of major approaches used in this case, the first of which is chaining, or separate chaining, and the concept is simple. Instead of storing just one item at each bucket in the array, the hash table, you make each bucket hold a whole bunch of things. Crazy, I know. For example, you can make each of these buckets its own linked list, capable of holding a whole bunch of crap. It doesn't have to be a linked list either; there are other options. You could make it a dynamic array, that'd work too, or really whatever fancy-pants data structure you want. The important thing is that each bucket can contain a whole bunch of items instead of just one; there's a small sketch of this below. The most obvious downside to this approach, then, is that when you're doing a lookup and you hash to a bucket, you have to sift through all the crap in that bucket to find what you're looking for, which probably won't be a lot, but who knows.
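To make the chaining idea concrete, here's a minimal sketch of a separately chained hash table in JavaScript. It deliberately uses the video's stupid length-of-the-name hash; the class name, table size, and test keys are just illustrative choices, not anything from the video's actual code.

```javascript
// A toy hash table using separate chaining. The hash function is
// intentionally terrible: any two keys of equal length collide.
class ChainedHashTable {
  constructor(numBuckets = 8) {
    // Each bucket is a plain array of [key, value] pairs.
    this.buckets = Array.from({ length: numBuckets }, () => []);
  }

  hash(key) {
    return key.length % this.buckets.length;
  }

  set(key, value) {
    const bucket = this.buckets[this.hash(key)];
    for (const entry of bucket) {
      if (entry[0] === key) {
        entry[1] = value; // Key already present: reassign.
        return;
      }
    }
    bucket.push([key, value]); // New key: chain it onto the bucket.
  }

  get(key) {
    // Hash to the bucket, then sift through whatever chained there.
    for (const [k, v] of this.buckets[this.hash(key)]) {
      if (k === key) return v;
    }
    return undefined;
  }
}

const table = new ChainedHashTable();
table.set('fry', 1);
table.set('kif', 2);              // Also 3 letters: collides with 'fry'.
console.log(table.get('kif'));    // 2, found after scanning the chain.
console.log(table.get('hermes')); // undefined: bucket 6 is empty.
```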
The other generally used approach is something called open addressing, which is a fancy way of saying that you keep trying new buckets until you find a free one. So in our hash table, if a bucket is used, say two keys hash to the same bucket again, you just move on to another bucket using some predetermined approach. As for the predetermined approaches, there are a few basic ones, like linear probing, which literally just means incrementing the index by some fixed amount; that's really, really easy. There are a few other well-known scanning sequences, like quadratic probing and double hashing, which are somewhat self-explanatory: in quadratic probing the probe interval grows quadratically rather than linearly, and in double hashing your spacing is determined by a second hash function. There's a whole bunch of others with kind of silly names which we're not going to go into; let's just cover the basics of how this works.

So, yeah, you've got your collision resolution schemes, and they have various upsides and downsides. For example, with open addressing and linear probing, lookups are often slightly faster than most other approaches due to cache effects; watch the video on memory, cache locality, and why arrays are fast, and it should be pretty self-evident. One thing you can probably guess, though, is that as the table fills up, performance starts to degrade pretty badly, since you end up searching larger and larger chunks of the table. For our purposes, it's sufficient to understand that there are two main schemes for resolving collisions: either a bucket holds a single key-value pair and collisions are resolved by moving to another bucket, or each bucket holds a whole bunch of key-value pairs. Shocking.

What you've probably noticed at this point is that there's another problem lurking in the shadows: the hash table can't be infinitely big. You have to set some fixed size, and as you fill up the table, no matter what collision resolution scheme you use, you'll get collisions, and performance will degrade depending on your resolution scheme. There's a more proper name for how full the table is, which is the load factor, and it's something you have to keep your eye on, because as the load factor increases, performance decreases, sometimes dramatically after reaching certain thresholds.

Basically, your approach at this point, assuming you want to keep allowing new items, is to increase the max size of the table by allocating a larger underlying array and then re-inserting every single item into that new array from the old one. You can do this all at once (say, once you've passed a certain threshold, you decide, screw it, we're rebuilding the whole thing from scratch), or you can do it incrementally so there isn't a sudden slowdown. The gist of it is that at some point you've got to make the table bigger to accommodate more items, and ideally you want to do this before the table gets too full. If your question is when we decide to do that, well, a common load factor threshold is about 0.75, so when the table is about three-quarters full, it expands. You can choose any threshold you want; it boils down to a trade-off between not using excessive memory and not tanking performance. Sparser hash tables waste memory but operations are snappy, whereas much fuller ones don't waste much memory but operations get slower and slower. There's a sketch of probing and resizing below.
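And here's a minimal sketch of the open-addressing side, assuming linear probing with a step of one and a resize once the load factor passes the 0.75 threshold mentioned above. The polynomial string hash is just a stand-in for a real hash function, and deletion is omitted, since open addressing needs tombstones for that, which is beyond this sketch.

```javascript
// A toy open-addressing hash table with linear probing. It grows (and
// rehashes every entry) once the load factor would exceed 0.75.
class ProbingHashTable {
  constructor(capacity = 8) {
    this.keys = new Array(capacity).fill(undefined);
    this.values = new Array(capacity).fill(undefined);
    this.count = 0;
  }

  hash(key) {
    let h = 0;
    for (let i = 0; i < key.length; i++) {
      h = (h * 31 + key.charCodeAt(i)) >>> 0; // Simple polynomial hash.
    }
    return h % this.keys.length;
  }

  set(key, value) {
    // Expand *before* the table gets too full: probing degrades badly
    // as the load factor climbs.
    if ((this.count + 1) / this.keys.length > 0.75) this.resize();
    let i = this.hash(key);
    // Linear probing: step forward one bucket at a time until we find
    // this key or a free slot.
    while (this.keys[i] !== undefined && this.keys[i] !== key) {
      i = (i + 1) % this.keys.length;
    }
    if (this.keys[i] === undefined) this.count++;
    this.keys[i] = key;
    this.values[i] = value;
  }

  get(key) {
    let i = this.hash(key);
    while (this.keys[i] !== undefined) {
      if (this.keys[i] === key) return this.values[i];
      i = (i + 1) % this.keys.length;
    }
    return undefined; // Hit a free slot: the key isn't in the table.
  }

  resize() {
    // Allocate a larger underlying array and re-insert every item; keys
    // usually land somewhere new because the modulus changed.
    const oldKeys = this.keys;
    const oldValues = this.values;
    this.keys = new Array(oldKeys.length * 2).fill(undefined);
    this.values = new Array(oldKeys.length * 2).fill(undefined);
    this.count = 0;
    for (let i = 0; i < oldKeys.length; i++) {
      if (oldKeys[i] !== undefined) this.set(oldKeys[i], oldValues[i]);
    }
  }
}
```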
So now that we've talked about hash tables, what exactly are associative arrays, dictionaries, and all that jazz? You'll often see these referenced, especially in higher-level languages. Are they hash tables, or what? How do they relate to each other? The short answer is that some people group them all together as mostly the same thing, and it's probably harmless to do so (the practical applications are generally the same), but that'll rile up the pedantic people, so let's dive in a bit.

There's something called an abstract data type, which is essentially an abstract description of how something is supposed to look and act. If I look up an associative array, it's an abstract data type that defines precisely how they're supposed to work. You can see the operations listed: you can add, reassign, remove, and look up items using keys or key-value pairs. But it tells me nothing about how it's supposed to be implemented, like, for realsies, using actual code; it only roughly tells me what operations should be available and the restrictions on them. If I were to describe a vehicle abstractly, here's my abstract vehicle: it's supposed to have some sort of engine, two or more wheels, and seat two or more passengers. Pretty vague. A car might be the most common implementation of my abstract vehicle, but that doesn't necessarily rule out other implementations; they just might have other trade-offs, or just outright be kind of stupid, but they still meet the vehicle checklist. So associative arrays and dictionaries are the abstract description of a kind of data structure, and hash tables are a specific data structure, often used as an implementation of associative arrays or dictionaries.

So let's talk a little about the advantages and disadvantages of hash tables. We'll mostly compare them with arrays and linked lists, since we covered those previously, but hash tables are kind of just a great all-purpose data structure. There's a reason languages like JavaScript, Python, Go, and many, many others have built-in, language-level support for associative arrays: they're just useful as hell. Most operations are, in the average case, pretty fast: insertions, deletions, and lookups are O(1), constant time. Contrast that with arrays and linked lists, which are fast at some of these operations and slow at others, depending on exactly where in the array or linked list you're doing things.

Now the disadvantages. First off, hash tables are fast, all of the operations are O(1), but if you recall from the video on Big O notation, the Big O running time of an algorithm isn't everything. For example, in the blog post "Should you be using something instead of what you should use instead?" by Scott Meyers, the author of Effective C++, Scott gives some great examples of straight-up linear searches on arrays beating hash table lookups for small numbers of entries. And remember when we talked about rehashing? Although the average time for insertion is O(1), all your insertions are chugging along at a nice speed, it's of course possible to hit the path where you need to expand and rehash the whole table. So the cost of individual operations can be inconsistent, if that's important to you. There are some other downsides too, like there's no inherent ordering in a hash table, so you can't iterate things in any given order; if you require things to be in a consistent and reproducible order, hash tables aren't a great fit. I'll avoid babbling about theoretical concerns that are super unlikely in practice, especially in game development, like uneven hash distributions, which is, say, your hash function hashing a bunch of crap to the same spot over and over again. It's possible, so I'll mention it, but let's not lose sleep over it.
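To illustrate that abstract-versus-concrete split: JavaScript's built-in Map is one of those language-level associative arrays, and it exposes exactly the add, reassign, remove, and look-up operations the abstract data type calls for, without you ever touching buckets or hash functions. The keys and values below are made up for the example.

```javascript
const components = new Map();

components.set('physics', { gravity: -9.8 }); // Add.
components.set('physics', { gravity: -1.6 }); // Reassign the same key.
components.set('render', { visible: true });

console.log(components.get('physics')); // Look up: { gravity: -1.6 }
console.log(components.has('hermes'));  // false: no such key.
components.delete('render');            // Remove.

// One caveat from the ordering discussion above: Map guarantees
// insertion-order iteration, but that's a promise of the JavaScript
// spec, not of hash tables as a data structure in general.
console.log([...components.keys()]); // ['physics']
```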
Finally, let's talk about how to use them, and what better way than to actually go through and show you. If you've been following the channel for a while, I've built a few games at this point, mostly showcasing how you can get really far with really simple approaches, and we can poke around the code a bit on GitHub, looking at some of the use cases for associative arrays.

Here's a good first example. In a lot of these projects I have this entity class that holds a bunch of components: stuff like rendering, physics, that kind of thing. If you've used Unity, it's basically a poor man's game object system. Anyway, I use an associative array here for the components; there are often quite a few lookups for components by name, and although we iterate the components every frame to update them, the order doesn't matter.

Here in the entity manager, I keep a list of all the entities in existence in the game. Those are all just in a big array, but I keep a separate associative array for quick lookups, since there may be hundreds or thousands of entities and I often want to look up entities by name. There's a sketch of this pattern at the end of these captions.

Here's another quick and easy one: the finite state machine has an associative array of states, mostly because I refer to states by name. There aren't that many, so you could just do it with an array, but the syntax for lookups is easy, so I go with an associative array instead.

And finally, here's the spatial hash grid code from a while back. This is a great example because originally I used an associative array (here, in the code on GitHub, we're specifically using a Map) as the basis for the spatial hash grid, and that was an enormous improvement over doing it the stupid, naive way. But at the same time, it's a fantastic example of how these kinds of data structures are great as a general-purpose, pretty-good solution, yet can often be beaten by a custom data structure, which is exactly what we did in this case to squeeze out even more performance.

So, in general: hash tables, associative arrays, dictionaries, whatever, I just kind of mash them all together, sue me. These, along with dynamic arrays, make up like 99% of the data structures in the games I've been implementing. I tend to default to an array if I'm going to be iterating things a lot but not searching for specific items often; an associative array is my other go-to if I need to do a bunch of lookups but not iterate. Otherwise, at that point, I need to start thinking seriously about being smarter about either my architecture or the data structure I'm using. I hope this was helpful. Until next time, cheers.
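As promised above, here's a small sketch of that entity-manager pattern: a plain array for the every-frame iteration, plus an associative array (a Map) kept in sync with it for quick lookups by name. The class and method names are hypothetical, not taken from the actual GitHub code.

```javascript
// Keep entities in two structures: an array for cheap iteration every
// frame (order doesn't matter), and a Map for O(1) lookups by name.
class EntityManager {
  constructor() {
    this.entities = [];
    this.entitiesByName = new Map();
  }

  add(name, entity) {
    this.entities.push(entity);
    this.entitiesByName.set(name, entity);
  }

  get(name) {
    // One hash lookup instead of scanning hundreds or thousands
    // of entities.
    return this.entitiesByName.get(name);
  }

  update(timeElapsed) {
    // Iterate the plain array: the "iterate a lot, search rarely" case.
    for (const entity of this.entities) {
      entity.update(timeElapsed);
    }
  }
}

// Usage:
const manager = new EntityManager();
manager.add('player', { update: (timeElapsed) => {} });
console.log(manager.get('player') !== undefined); // true
```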
Info
Channel: SimonDev
Views: 10,314
Rating: 4.9863944 out of 5
Keywords: simondev, game development, programming tutorial, data structures, hash table, associative array, dictionary, data structures and algorithms, optimization
Id: S5NY1fqisSY
Length: 12min 44sec (764 seconds)
Published: Tue Sep 14 2021