Hashing, Hashable, and Hash Values

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

all right so let's talk about hashing today I do the Google image search for the word hash which you should not do at work and this was the most wholesome image that came up so this is it so I want to talk about all the things about hashing and hashable and all that good stuff so let's get right to an example let's say I have a struct and it's got some properties I'm sure in some methods we don't really care about that that's not important right now and so we have a struct it's got some properties and we want it to be hashable so I'm going to make an extension on my struct we needed to conform to hashable and then if we look at the requirements for hat what hashable is it needs a single property read-only as an integer called hash values that's the requirements of the protocol as if swift 4.1 and we want to write our implementation and we'll do something like this so that is like valid code this will certainly compile because we are returning the name matches it's an integer and 0 is indeed an integer this will all compile but then the question is is this correct so we haven't talked about what hashes and hash values are but maybe you are a hashing expert already you work on the standard library or something like that or you have some general intuition about what these things are used for and so if I have an implication like this and I return 0 all the time then is this a correct implementation so let's take a poll who says yes this is for whatever hashes are used for this is a correct implementation a couple of people ok and everybody else says no this is not a correct implementation you're doing it wrong most people all right so we'll come back to it at the end some people will be like yes I told you so and then everyone else will be like mind blown emoji so what do we want to talk about today let's talk about what even is a hash we'll go over that we can talk about what our hash values used for well look at that a little bit but you know I'm also concerned about what they're used for but we might need to know that to understand how to do them properly and finally how do I calculate good hash values if you do want to do it yourself so it's the white board coding part of the talk there's no white board here unfortunately but there was going to be a quiz afterwards so let's talk about everybody's favorite thing linear search or array search so I haven't Raye of stuff numbers in this case and I want to say is some number in the array or not so this looks like it could be a sorted array but you know we'll just ignore that it's just general-purpose so we want to search through it so the way we'd have to do it is start at the beginning and go through the whole thing and then worst case we reach the end and it's not there or it's the last element and we'll get a yes or no but this is not so great because it will become slow if I have a million integers it'll take some time if I have two million integers they'll take twice as long to search through and so on and so on so how can we make this faster and if you're a data structure nerd you'll say you should use a set in which case you are hired and sets can magically for some reason if you put a million integers into a set and you say is the number 42 in there it'll say yes or no like pretty quickly and if you have two million integers it'll still say yes or no pretty quickly so like what kind of magic is it doing so we can kind of puzzle through this and say how is it being able to search more quickly so we can say well we'll keep our array here and what if we split the array and we said you know all the odd numbers go over here all the even numbers go over here or in this case we say numbers greater than 5 go over here less than or equal to 5 go over there and now if we say is the number 4 in the collection we can say well very quickly we can say 4 is less than or equal to 5 it must be over there and now we only have to search through half of it or if you know if you say is the number 100 there we can say that's greater than 5 it must be over there and then we can search much faster than we could before and so if we kept doing this and we split it up and we split up and we're just basically putting the data in two buckets and we can say gets a piece of data decide what bucket to put it into and then put it there and then when we search and we say hey is this value in the collection or not we can very quickly say what bucket it's in look in the bucket and ideally it'll be like zero one or two things in the bucket and we can very quickly say yes or no so it's basically how sets work how it decides to do the buckets how many buckets like that's sort of beyond the scope of what we want to talk about today but we want to do that and you can imagine how to do it with integers again we could do ranges like one to a hundred goes over here 101 to 200 go over here and so on so we can imagine how to do that but if you have some arbitrary piece of data like a string or a puppy and you're thinking I and how am I supposed to know what bucket to put this thing into and you say if only I had a way to map an arbitrary piece of data into an integer and then I could based on that integer decide what bucket to put it into and you say we have such a thing right it's our old friend the hash value which takes a piece of data arbitrary data and maps it to an integer and then that's how a set can use it and put it into buckets and so on and so on so this is sort of the magic piece of like what things might use a hash value so I think this is the way to think about it we get some data and there's some kind of hash function and it turns it into an integer that's the basics of hashes but there have to be some rules because you know scientists here we need some rules around this so what are the rules what are the invariants we need to maintain when we're talking about hashes and hash values so the first one we'll talk about is equal values I have two values and they're equal that means that their hash values must underline bold non-negotiable be equal so this is really it you can all take a picture of the slide this is like all you need to know if you have equal values then the hashes must be equal but that I'll note that arrow only goes in one direction so if I have some value and I ask for the hash and it says the hash and then you know later on I ask for the hash again I ask for the hash a million times it should always return the same value sort of with an asterisk which is the best kind of thing so let's just look at the quick example here so I've got a struct with a single field here I'm going to instantiate it once and then I'm going to print out the hash value twice which is like oh my god it's gonna be the same thing we sure hope so so I'll run this and it will return the same hash and I could copy and paste that code you know million times it'll always be the same hash so you might think you know I'm gonna do I'm gonna get the hash well I'm going to store it to disk and I'm gonna treat it like as a checksum and then tomorrow I'm gonna hash it again and make sure it's the same but there's a note in the documentation saying like not to do that because hash flowers aren't stable between launches of your program and you think that's crazy I launched the program a million times and it's always the same but that's sort of the you know documentation versus implementation there's like don't do this please don't rely on this but this is the kind of leads to like subtle bugs because it's like you're relying on this behavior so if I switch up here no I lost my mouse pointer here we go switch to the so if beta which I have up here that's swift 4.2 and I run this then you'll see I get some hash value which is nice negative seven five one three I'll run it again and they're still consistent but you'll know this is different so in Swift 4.2 there there's a fixed me and swift 4.1 it's like this should really be a random number but it's like a constant so and so 4.2 they fixed it and there is indeed a random seed so at launch the hashing mechanism gets seated with a random value and so they will be different across launches but within the lifetime of your program it'll stay consistent so the lesson there is hash values must be consistent between launches is sort of a little bit of an asterisk so you shouldn't get your hash values and store them to disk and do other things like that because that's not what they're for all right what else can we conclude so we can say the values are equal the hash values must be equal that's great if the values are different then that must mean that the hash values are different not enough people are shaking their heads because this is not true if the values are different then it's totally possible for the hash values to be the same and you're thinking how is this possible how can I live in a universe were unequal non equal values hash to the same thing so to kind of puzzle that we'll have to return to this and understand that we're mapping a piece of arbitrary data to an integer but if you think about how many pieces of arbitrary data are there there's like how many possible strings are there in the universe I don't like memory limitations maybe if you ignore those there's like an infinite number of strings and if you think about all of the types that you've created just in your application how many instances of that type could you possibly make it's like an infinite amount of data out there and integers are 64 bits which is you know very big there's like 18 quintillion 64-bit integers but compared to infinity that's like nothing so we should have this intuition taterhead that yes there is a lot of data out there in the universe and there's no possible way that everybody could get their own integer although there are a lot of integers out there that aren't enough there aren't an infinite well there are an infinite number but for 64 bits there's only 64 bits worth so with that in mind we should think that collisions two values hashing to the same thing as I totally normal everyday occurrence although with an asterisk it's not actually an everyday occurrence but you have to think about it as if it could be and so that's why this is something we should keep in our head too if the values are not equal then that does not necessarily mean that the hashes are also not equal all right what else correct or not correct if we have two values and they're hashes are equal what does that tell us does that mean that the values are also equal and then we would say no that's also not correct cuz what we just said just because two values have the same hash that tells you nothing about the values themselves they could be equal or they could not be equal so what we're left with is our original rule the one big rule that you have to remember equal values means equal hashes we can also sort of flip this around and take the inverse if you want to have two rules and you can say if the hashes are not equal that must mean that the values cannot be equal because if the values were equal they have the same hash therefore if the hashes are different the values must be different but Rule two is really just a variation of rule one just to make more room on the slide you won't really only have to remember I would say the first one because that's the simpler one I think equal values equal hashes and that's it so we can go all the way back to the beginning or our poor puppy has been waiting with this implementation where we have a hash and we return zero everybody gets the same hash so the question was is this a correct like you know technically correct the best kind of correct some might say technically correct this is a technically correct solution yes it is correct because the only invariant we need to maintain is that equal values get equal hashes and everybody gets the same hash here so by definition we've fulfilled the requirement that equal values get equal hashes so this is a correct implementation but it's not a great implementation so my job application to the standard library team will still be will still stand up why is this not good it's because what we're effectively doing is we're returning zero for everybody's hash so when we have our set implementation it's going to put everything into one bucket which is like fine because the bucket can hold lots of stuff but then when we ask the set hey does this thing exist or not it's go to do linear search again just like an array and it's gonna make everybody sad because the performance will be bad so that's sir the the downside I guess to this very very poor hash implementation is that you could get slow performance but it's like technically correct so that's the implementation again technically correct but I always like to know well what is what is the wrong implementation because that's kind of what I'm interested in how can I totally screw this up so this would be one way to do it where you return a random number for the hash so we can think this through and say here's a value I'd like to store in the set it like what's the hash it's like it's you know 57 great I put it in Bucky 57 and then you give me the same value so they put this in the hash and a property of sets is that they only store the value once and so you say here's the value again and you say what's the hash it's like oh it's 85 great I'm gonna put it over here in bucket 85 now you have two of them and then when I say hey here's this value is it in this set yes or no it's like what's the hash it's negative 105 and then you look in buck at negative 105 it's empty and you say no it's not and you totally mess it up so this is an example of what not to do maybe obviously because then it will make people very angry and then the puppies will get all mixed in together and you know that's just that's no good so what's a way to do maybe a good I'm gonna call it a better hash implementation so if you're into cryptography you know that cryptography is all just XOR in numbers together as they say and so a classic way to do it is to XOR the hashes inside there so our puppy puppy struct have two properties age and name age is an int name is a string those are both hashable themselves so I can take the hashes of the two components and XOR them together and get a hash so this is one common way to do it maybe not great but this is one way to do it but there has some peculiarities maybe on its own should I advance yes let's advance the slide so let's say I have a puppy it's gonna be hash but let me just go back one and say that starting in Swift I think 4.1 we got the automatic conformance so if your properties in the struct are all hashable you don't have to write this the compiler will do it for you so this this complete code this will just work on its own so I have a puppy it has a name and it has an age and then I have another struct for a red panda also has a name but they're endangered animals and so we keep track of them by sequence we have all of them we know about all of them we keep them by sequence and that's just by chance than they the properties are like the same if you squint your eyes there's a string and then there's an int so then the question is if I instantiate the puppy with this name who is eight years old and I instantiate a red panda that happens to have the same name and the sequence number eight and I take their hashes then what's gonna happen so you can think about that and let's give it a try unless the mouse pointer again do a little trick here we go this is the code puppy red panda instantiate them print out the hash values I will go ahead and run this and they're the same so that gives us a hint about how the implementation is working it must just be maybe going in order and getting the two values and hashing them together somehow maybe with an X or maybe with something else that doesn't matter but they are colliding and you think oh my god they're colliding but then you have to also remember that that's okay hash collisions are totally fine that doesn't mean that they're equal it just means they have the same integer representation they'll be in the same bucket but that's okay just as a second curious exercise let's look at this so I have a struct here that is a string and then a number and I have another struct that's a number and then a string and then if I instantiate one of them with the string and the number and I instantiate the other one with the same values but they're in like the reverse order then what will happen so if you again remember your interview questions about bitwise operations and you know XOR is I think the word is associative I don't know community associated I think that's the thing where one plus two is the same as two plus one so a XOR B is the same as BX or a so if this is a simple XOR these will be the same because it's the same values if it's not a simple XOR then they should be different so you can think about how would Swift do it Chris Lautner do it and we can run this and they are indeed different so there's actually a little addition there's a little bit shifting going on and that's why they're they're different all right so the same again totally fine because they'll just be in the same bucket which is fine they'll still look it up and they'll still be unique to properly because there are different types so they won't be equals equals each other so equality and hashing are two separate I'll just say separate two separate things if this is a problem though you have a lot of you know you have like dog and cat and they're like a million every dog in the planet like I'm sure there was a lot of duplication between pet names I don't know if there's a machine learning data sample about names of pets in the world but I'm sure there's a lot of overlap between dogs and cats for example and there are a lot of values that would have the same name and the same age and you think they're all gonna sort to the same bucket and from my implementation that's bad I don't want that so I could write my own custom implementation and say well I'm going to in my hash I'm gonna mix together the name and the age and the name of the animal and then that way a dog and a cat with the same name in the same age with hash differently if it were important to you but you know the compiler will do it for you unless it's like a big deal you've done some performance measurements and you're set lookups are you know horrible or something I probably not probably not worry about it so new in Swift 4.2 is that it adds another level of abstraction on top of the idea of a hash combining function so over here I just use X or if you look at the implementation in like Swift 4.1 you'll see it's like bit shift and some additions and some and some X ORS of course because you have to have those so in Swift 4.2 there's this idea of a hash er which is the second algorithm up here and that abstracts the hashing algorithm of which there are many in the world and it says I will worry about exactly how this works you just do the work of like passing me the values so you all you have to do in this hash function is called hash or combined and you give it each important bit of data so you passing the age you pass in the name it'll mix those together and run the hashing algorithm and output the hash value for you so this is like the new Swift 4.2 way to do things so if the question is how do I write a good hashing algorithm I would say as they like to say the best code is the code that you don't write and so the compiler will do it for you so in like I don't know 99% of cases I'll be bold here you should probably just not you don't have to worry about it you can just use the default implementation if for whatever reason you found that performance is bad you have a lot of hash collisions for whatever reason then you can use hopefully the new Swift 4.2 stuff with the nice hash combining functions still support Swift 4.1 you have to do the hash value and then you have to go to Wikipedia and look up the sip hash algorithm or something and then figured that out yourself so let's wrap up what even is a hash it's an integer 64-bit integer value that sort of represents your piece of data but then maybe I'll add a side note of like what isn't a hash so again the idea of hash values hashing and equality I would say sometimes you mix them up in your head you say all the hashes are the same that must mean they're the same or you think Oh a hash is like a checksum or it's like a unique identifier but I think that's like you know that way leads danger so hashing equality just separate them in your minds and the hash is just a convenient integer representation of your data what our hash values used for we see them a lot in data structures probably you don't call hash value yourself a lot I would guess but if you use sets or use dictionaries you've noticed that items and sets have to conform to any hashable dictionary keys have to be hashable for this reason for fast look ups and how do I calculate good hash values so again number one choice don't do it let the compiler do it for you number two use Swift 4.2 which has hash combining functions you can define your own hash combining functions if you're like a super nerd or use the hash value property and write it yourself like that so those are those three if you like puppy well ever likes puppies if you like reading and you want to sort of read what I've just said for the last twenty minutes or whatever and I have a blog post that is not published but I will like as soon as I'm done I will push the button and publish it and it's sort of the article version of what I just said so you can read it here and review it on your own time you know on your commute or something like that if you're curious about how the automatic conformance works where you have a struct that has hashable properties or an enumeration where all the Associated values are hashable and the compiler magically comes up with it for you and you're curious about how that works I also have an article about how that works for equatable where if all of your types new structs are equatable you get the equals equals for free so the way that works is very very similar to how it works for hashable so if you're curious about how that works you can go through the same very fine websites and find the article on automatic informants for equatable as well and there's a closing brace so that means that's the last slide and that's all I've got so thank you [Applause]

Info

Channel: Swift Language User Group

Views: 2,552

Rating: 5 out of 5

Keywords: swift, hashing

Id: MKIk8rH1OUQ

Channel Id: undefined

Length: 20min 8sec (1208 seconds)

Published: Sat Aug 18 2018