Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)

Captions
Hi, welcome to the video. Today we're going to be covering another technique in similarity search called locality sensitive hashing, or LSH. LSH is a hugely popular technique used in efficient similarity search. There are a huge number of companies that use similarity search: you have big names like Google, which is built on similarity search, and then Netflix, Amazon, Spotify, all of them constantly recommending you different products, films, and music. They do that by comparing you to other customers, so they are performing a similarity search between you and other customers and identifying the most similar ones.

Now, you have two approaches. You have exhaustive search, which compares all of the data points (I'm just going to call them vectors from now on, because that's what we'll be using), and comparing all of those vectors is obviously slow. Approximate search allows us to restrict our scope to a more relevant range of vectors, and it includes a lot of different techniques, not just one. The one we're going to be covering today is locality sensitive hashing.

At its core, LSH is a hashing algorithm that attempts to maximize hash collisions. What we see on screen right now is a typical Python dictionary and the way it hashes different items: we have our keys, which are the items we're hashing, and we process them through a hashing function. That hashing function attempts to minimize hashing collisions, i.e. to not put different keys in the same bucket; it wants every key to go to a separate bucket. The buckets don't contain the values, but they are connected to the values that relate back to our keys. That's what a Python dictionary does, but we don't want to minimize collisions, we want to maximize them. So what we see here is a hashing function that maximizes collisions, and this is essentially what LSH is doing. For any similar keys, like these here and these here, which are all similar enough for us to want to put them into the same buckets, we've put two of them into this bucket and the other three into this one.

Now, there are quite a few different ways of doing this, and there are a lot of different LSH methods; in fact, LSH is a very generic term that applies to a lot of different algorithms. The one we'll be covering is what I see as the traditional version, the original version of LSH. What we'll cover in this video is shingling, MinHashing, and the LSH function itself.

Here is an overview of the process we're going to walk through. At the very start we have this text, "flying fish flew by the space station". That's just a string, and what we want to do is extract all of the unique pairs of characters. When we say shingling, it's really k-shingling, and in this case our k value is 2 because we're taking two characters at a time; if we took k = 4, for example, we'd take "pace", then move on to "ace " with a trailing space, and so on. From the shingles we create a set, so if we have duplicate shingles we remove them and keep just one. Say down here we had "in" again, and we also had it up here: we'd end up with just a single "in" in the set, not two.

Then we want to encode those, which means we take a vocabulary from all of our text (not just this one sentence; we'll obviously have more than one sentence that we're comparing) and use it to build a one-hot vector from the vocab and our shingle set. Then we process that through something called a MinHash function, which produces this dense vector, or signature, this thing down here. And this final bit here is our actual LSH process: we band that vector into multiple sub-vectors and hash them. Where we find that any two sub-vectors go to the same hash bucket, the two full vectors they come from are considered a candidate pair, and we take those and calculate the actual similarity between them.

So, the first step in our process, like we discussed, is the shingling operation. Shingling is simply where we take a window of k characters and move it down through our text, like you can see here, and from that we create the shingle set. In Python, to shingle the three sentences we have here, we'll create a shingle function. It takes some text, which is a string, and a k value, the number of characters in each window, which is obviously an integer. We initialize our shingle list, and then for i in range up to the length of the text minus k plus 1 (so the window goes right up to the end), we append text[i:i+k]. Finally we return a set, which removes any duplicates. That's our shingle function, and we just process each of our sentences through it with k = 2. Looking at the result, we see it's shuffled; there's no order to a set. And we see all of the character pairs in there: "st", which could come from "space" or "station", and from the very start of "flying" we have "fl", "ly", "in", and so on. So that's our shingle set, and with this we have all of our shingles.
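Here's a minimal, runnable sketch of the shingling step just described. Only the first sentence is quoted in the video, so the b and c sentences below are illustrative stand-ins I've chosen so that b and c are similar to each other, matching the comparisons made later:

    # k-shingling: slide a window of k characters over the text,
    # collect every window, and dedupe with a set
    def shingle(text: str, k: int) -> set:
        shingle_list = []
        for i in range(len(text) - k + 1):
            shingle_list.append(text[i:i + k])
        return set(shingle_list)

    k = 2
    a = shingle("flying fish flew by the space station", k)
    # only sentence a is quoted in the video; b and c are stand-ins
    b = shingle("we will not allow you to bring your pet armadillo along", k)
    c = shingle("he will not allow you to bring your pet armadillo along", k)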
The next step is to create our vocabulary, which is just all of our shingle sets unioned together. To create it, all we do is a.union(b).union(c), like that, and you can see we have many more shingles in there now. That is our vocab. So now we have our shingle sets and we have our vocab, and we can tick both of those off.

What we need to do next is create our one-hot encoding over here, and the only other thing we need for that is a zero vector. There's more than one way to do this, but I think there are two ways of thinking about it. Normally the more efficient way would be to create a NumPy array full of zeros and then just add the ones in where we have matches between our vocabulary and shingle set, but I'm not going to do that; I'm going to keep things incredibly simple in the code that we're writing. The one thing we should do is make the vocab a list, because we want order in our vocab, not to have it shuffled. Then what we write is: 1 if x in a else 0, for x in vocab. What we're doing here is looping through the vocab, and for every single shingle in there we're saying: if it exists in our shingle set, make that position in our list a 1, otherwise make it a 0. That's simply our one-hot encoding. So if we do that for a, b, and c and then have a look at a's one-hot vector, we see this sparse, one-hot encoded array.
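Continuing the sketch, the vocabulary and one-hot encoding steps might look like this (assuming the a, b, and c shingle sets from above):

    # vocabulary: the union of all shingle sets, kept as a list so
    # every shingle has a fixed position
    vocab = list(a.union(b).union(c))

    # one-hot encode each shingle set against the vocab
    a_1hot = [1 if x in a else 0 for x in vocab]
    b_1hot = [1 if x in b else 0 for x in vocab]
    c_1hot = [1 if x in c else 0 for x in vocab]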
MinHashing is the next step in our process, and it allows us to convert our currently sparse vectors into dense vectors, which we call signatures. What you see here is a run-through of how we produce one value of the signature; we actually run through this process multiple times, once for each position. What we're doing is creating a randomly permuted array that counts from 1 to the length of our vocab: we're basically shuffling it, and then counting through it until we find the first value that aligns with a 1 in our sparse vector. (In reality, you'd just take all of the values that align to a 1 and find the minimum among them; that's what you'd do if you were using NumPy, which we'll see later on. I'll just show you that code rather than writing all of it here.)

In code, that looks something like this. We start with a list containing the range from 1 to the length of our vocab, and if we have a look at it we just see a count. Then we shuffle it: from random import shuffle, and calling shuffle modifies the list in place, so we don't need to assign anything. Now, if we loop i through the range 1 to 10 and print the index of each value i in this example hash, we see where each value sits: value 1 is at index 85, 2 is somewhere else, 3 is at index 53, and so on. Essentially, we loop through these values in order and ask: does this index in our one-hot vector align with a 1? Here we find the first match, and that means the signature value for this MinHash vector and this one-hot sparse vector will be 8. We repeat that for multiple MinHash vectors, which is what you can see here. Working through it: we start at 1 here, which does not align to a 1, so we work up to 2 and find that it does align, which is why we have a 2 here. Then for the next MinHash vector, 1 does not align, 2 still doesn't, 3 doesn't, and 4 does, so we assign a 4. We keep going like that to build our signature.

So, using these functions here (it's just what we wrote before, put into a cleaner format), I'm going to create 20 MinHash vectors, and then run each of our one-hot sparse vectors through our create_hash function, which converts them into signatures as we described. We have 20 MinHash vectors, which means we have a length of 20 for each signature.
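A sketch of that MinHash step; the helper names here (create_minhash_func, create_minhash_funcs) are my own stand-ins for the "cleaner format" functions mentioned but not shown in full in the video:

    from random import shuffle

    def create_minhash_func(vocab_size: int) -> list:
        # one MinHash function: a shuffled count from 1 to vocab_size
        hash_ex = list(range(1, vocab_size + 1))
        shuffle(hash_ex)
        return hash_ex

    def create_minhash_funcs(vocab_size: int, n: int) -> list:
        # build n independent MinHash functions
        return [create_minhash_func(vocab_size) for _ in range(n)]

    def create_hash(one_hot: list, minhash_funcs: list) -> list:
        # for each MinHash function, count up from 1 and take the first
        # value whose position aligns with a 1 in the sparse vector
        signature = []
        for func in minhash_funcs:
            for i in range(1, len(func) + 1):
                idx = func.index(i)
                if one_hot[idx] == 1:
                    signature.append(i)
                    break
        return signature

    minhash_funcs = create_minhash_funcs(len(vocab), 20)  # 20 values per signature
    a_sig = create_hash(a_1hot, minhash_funcs)
    b_sig = create_hash(b_1hot, minhash_funcs)
    c_sig = create_hash(c_1hot, minhash_funcs)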
So what we see here are our dense vectors, and these are just compressed versions of our sparse vectors. We can check that this is true by creating a Jaccard similarity function. It takes x and y, both sets, and we just return the length of the intersection between them divided by the length of their union; that is how you calculate Jaccard similarity. Then if we take the Jaccard similarity of a_sig and b_sig (these have to be converted into sets first, which I forgot), and also the Jaccard similarity of a and b themselves, we get two low values, roughly in the same ballpark. If we look up here, a and b are not supposed to be very similar, so that's fine. And b and c should be similar, so if we swap this for c, they should both get higher values, again roughly in the same ballpark. They're not perfect, because we're using a very low number here, only 20 values, where you'd typically use a lot more, but that's fine. You can see that they're both aligned: despite converting these into signature vectors, it still recognizes that they are reasonably similar. That's good; that's what we want.
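The Jaccard similarity check from this step, as a sketch using the signatures from above:

    def jaccard(x: set, y: set) -> float:
        # Jaccard similarity: size of the intersection over size of the union
        return len(x.intersection(y)) / len(x.union(y))

    # signature similarity should roughly track shingle-set similarity
    print(jaccard(set(a_sig), set(b_sig)), jaccard(a, b))  # both low
    print(jaccard(set(b_sig), set(c_sig)), jaccard(b, c))  # both higher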
Now, the final step in our whole LSH process is the LSH function itself. Essentially, what it does is this: we have our signature over here, which we built using the steps we just went through, and from that signature we take a certain number of equal-length sub-vectors. We define that with this value b; here b is 3, which means we split a signature into three sub-vectors, which we see over here. Ideally, we'd process each sub-vector through either a different hash function or the same hash function, as long as we use the same hash function for the equivalent sub-vector in the other signature, which you'll see in a moment. Once we have multiple signatures going through those hash functions, you can see they're equivalent on both sides: hash 1 here and hash 1 here (these can all just be a single hash function as well, which is what we're going to do; in fact, we're not really going to use a hash function at all). What we get is three opportunities to identify these signatures as potential candidate pairs, which is where we consider them for further similarity comparison. In this case, hash 3 collides on both sides, so we say that a and b are candidate pairs. This act of splitting our signatures into multiple sub-vectors gives us more opportunities to identify similarity: if we used the full vectors, they would have to be very similar overall to be put into the same hash bucket, whereas with banding we only need part of them to be very similar, which increases the chances of finding those similar signatures.

We're going to implement a very simple version of this, just splitting our signature vector. We pass in our signature and b, the number of bands. The first thing we do is make sure our signature can be split into b bands equally, so the remainder after the division must be equal to zero. Then we calculate the number of rows within each band, which is obviously just the length of the signature divided by b, initialize a sub-vector list, and loop through, appending sub-vectors. A really simple implementation.

Let's apply that to b and c. We've said that we want 10 bands, and we only have 20 numbers within our signature vectors, so obviously we only get bands of two rows at a time, and we should find that at least one of them matches. So we loop through and say: if the band from b equals the band from c, break. We find very quickly that there is a candidate pair there, which means that b and c's full vectors would be considered a candidate pair. Let's do the same for a: for both a and b, and a and c, it's not considered a candidate pair, because there's just no similarity there. That's exactly what we wanted to happen, and that is our implementation of the traditional LSH approach.
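And a sketch of that banding step plus the candidate-pair check, assuming the 20-value signatures and the 10 bands used in the video:

    def split_vector(signature: list, b: int) -> list:
        # the signature must split into b equal-sized bands
        assert len(signature) % b == 0
        r = len(signature) // b  # rows per band
        return [signature[i:i + r] for i in range(0, len(signature), r)]

    band_b = split_vector(b_sig, 10)
    band_c = split_vector(c_sig, 10)

    # a single matching band is enough to flag a candidate pair
    for rows_b, rows_c in zip(band_b, band_c):
        if rows_b == rows_c:
            print("candidate pair")
            break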
Now, a few other things we haven't covered but should touch on quickly. There's an article linked in the description that walks through all of this, and there will also be a notebook where I'm getting these results from in the first place, so you can look at that too. It includes the NumPy implementations of what we've just done, which are slightly more efficient, although not super efficient, because I wanted them to still be readable.

What we have here is a visualization showing the cosine similarity of our signature vectors and whether they were considered candidate pairs or not. These here are our candidate pairs. This is just a random sample; the full dataset is really big, and running this for all of it is super inefficient because we're also running everything else through, but if you run just a sample, it works fine. At the top there we have our candidates, at the bottom our non-candidates, and along the axis we have similarity. We can see that high similarity does correlate with pairs being classified as candidate pairs, which is good; that's obviously what we want.

There's also a formula, which I should have written down earlier: the probability P of two signatures being marked as a candidate pair is P = 1 - (1 - s^r)^b, where s is the similarity, r is the number of rows in each band, and b is the number of bands. That formula corresponds to this probability line here (it's a capital P, so that's where it's coming from). If we run it with different similarity values, this is the pattern we get, and you can see it correlates with whether something is classified as a candidate pair or not. What we can do is modify b to push the number of candidate-pair classifications either up or down. Here we have different b values at the side: black is 50, then 25, 20 (which is what we used before), and 5. Let's say we found that we're not identifying enough candidate pairs: we could push the threshold down a little bit, not too much, by changing b from 20 to 25. If we do that, we see this: in green we have our old results and our old probability line, and in blue and magenta we have the new ones. We've pushed the line down by changing b to 25, and now we're returning more results: over here we have these, for example, which we were not returning before, there are also more values in here, and there are fewer values down here. That's the result of modifying b.

We can visualize that like so: if we increase b, we move in this direction, which increases the number of candidate pairs, but also increases the number of false positives we're going to return. This line, by the way, is our similarity threshold: basically where we want the cutoff to be between things being identified as candidate pairs and non-candidate pairs. If we wanted to reduce the number of candidate pairs, maybe because we're getting too many false positives, we can push it the other way, which results in fewer candidate pairs but also more false negatives, i.e. non-candidate pairs where we should have candidate pairs. So it's a case of balancing those two.

But that's everything for this video. I hope it's been useful, and I will see you in the next one.
Info
Channel: James Briggs
Views: 539
Rating: 4.8947368 out of 5
Keywords: python, machine learning, data science, artificial intelligence, natural language processing, bert, nlp, nlproc, Huggingface, Tensorflow, pytorch, torch, programming, tutorials, tutorial, education, learning, code, coding
Id: e_SBq3s20M8
Length: 27min 6sec (1626 seconds)
Published: Fri Aug 20 2021