Best Indexes for Similarity Search in Faiss

Captions
hi, and welcome to this video on composite indexes, the ones you can see on the screen now: IVFADC, Multi-D-ADC, and IVF-HNSW. In the last video in the series we introduced composite indexes and the index factory in Faiss, and what we're going to do now is go through each of these and build them using what we learned there, the index factory. So let's move on to the first one, IVFADC. That stands for IVF, the inverted file index, followed by asymmetric distance computation. If you watched the videos or read the articles on product quantization, you might remember ADC; it is part of product quantization. During indexing, when we first take our vectors and add them to the index, there are two main steps. First, vectors are assigned to lists, or cells. That's the IVF part: we take our vector, the IVF assignment step places it into one of the IVF cells, the Voronoi cells that we usually visualize, and from that we get a cell ID. The cell ID is the IVF part of our index. The second step is that those assigned vectors go through a product quantization compression step, producing a PQ code, a quantized code, and the two get stored together: the PQ code is stored in the IVF list. That is an IVFADC index at its core. Now, search is where the ADC part comes in, the asymmetric distance computation. In product quantization, once you quantize your vectors, you have two alternative ways of searching. The first, symmetric distance computation, is not as popular: you take your query vector xq, quantize it, and then compare it to all of your pre-quantized, pre-indexed vectors.
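As a quick numerical sketch of those two options, here is a toy comparison that uses a crude rounding quantizer in place of real product quantization; this is my simplification for illustration, not Faiss code and not the video's script:

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.random((1000, 128)).astype("float32")  # database vectors
xq = rng.random(128).astype("float32")          # query vector

# Stand-in quantizer: snap each value to a coarse grid (real PQ
# would learn per-subvector centroids instead).
def quantize(v, levels=16):
    return np.round(v * (levels - 1)) / (levels - 1)

xb_q = quantize(xb)  # the index stores only quantized vectors

# Symmetric distance computation: quantize the query as well,
# so both sides of the comparison carry quantization error.
sdc = np.linalg.norm(xb_q - quantize(xq), axis=1)

# Asymmetric distance computation: compare the raw query
# against the quantized database vectors.
adc = np.linalg.norm(xb_q - xq, axis=1)

true = np.linalg.norm(xb - xq, axis=1)
print(int(true.argmin()), int(adc.argmin()), int(sdc.argmin()))
```

Because only one side of the ADC comparison is quantized, it tends to track the true distances more closely than SDC, which is the behavior described in the video.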
So essentially both sides of the comparison are quantized, which is why it's called symmetric distance computation. The alternative, which tends to work better, is where you don't quantize your query vector: you take xq as it is and compare it to your quantized, pre-indexed vectors. So that's where you get the IVF side of things and the ADC side of things, and in Faiss we can use this index factory string to put it all together: IVF on the left, product quantization on the right. So let's go into Python and put that together. I've already imported the data, again using the Sift1M dataset; there will be a link in the description which explains how you can get that dataset. From that we have xb, the index vectors, the vectors that we're searching through, and xq, a single vector that we're using to search. To put the index together we just write index = faiss.index_factory(...), as we have been doing: we pass our number of dimensions, 128 for this dataset, and then the index factory string. We have IVF as the first part, splitting the data into 256 cells, and we use PQ for the second part, the ADC part, here with an m value of 32 and an nbits value of 8.
Now, 8 is the default, so we can include the x8 there, but we don't need to. The only time you'd really need it is if you want to use a lower nbits value, say 4. If you try to go higher, say 12, you won't be able to, unless Faiss has updated that since; for now you're not allowed to go over 8 with the PQ or IVFPQ index, although that restriction should be removed in a future version. Then, as we usually do, we call index.train; we really do need to train here, because of both the IVF clustering and the quantization. Then we add our vectors, and from there we just search as we normally do: D, I = index.search(xq, k). I'll run that; we don't really need to see what it returns, but I'll show you the performance shortly. Meanwhile, I may as well explain: m here is the number of subvectors we split our full vectors into for the product quantization step, and the nbits value of 8 is the number of bits used by each subquantizer within the product quantization step. You increase that to increase recall performance, or accuracy. And then you can see we have our nearest indexes there. Now let me show you the actual performance metrics for this. There are a few things I wanted to compare. The blue line is what we just built, the IVFADC index. The magenta line is still IVFADC, but with an optimization step added; for me there was no real performance increase there. And I wanted to compare both against a simple flat index. If we use the flat index, its memory usage is something like half a gigabyte, whereas the memory usage for our two IVFPQ indexes is, I think, around 40 megabytes, so there's a huge difference there. That's why you'd use PQ in the first place, to
minimize the memory usage of the index, although in terms of recall, of course, you can't really beat a flat index. So let's move on to the next one, Multi-D-ADC, which is multi-index with asymmetric distance computation. The ADC part is the same, so we know there's product quantization in there; the multi-D part is what's different. This is based on IMI, the inverted multi-index. IMI is similar to IVF, but where IVF builds its Voronoi cells over the full vectors, IMI splits your vectors into subvectors, product-quantization style, and then clusters each subvector space using its own Voronoi cells. So the clustering is split across multiple vector subspaces, something like a tiered Voronoi cell structure, which is what I'm trying to visualize here. With IVF we just have one layer, whereas with IMI (and this is just IMI, not the full Multi-D-ADC) our vectors come in, they get split into two in this case, and each half gets clustered. Then when we introduce our query vector, we split that into two as well, find the cells each half belongs to, and look at the crossover between those cells to find our nearest neighbor candidates. And that's using an nprobe value of just one in this case. We can put this together quite easily using this index factory string: the IMI index on the left and, again, PQ on the right, because it's ADC. So let's go ahead and put that together. I'm going to come up here and use the same script we had before, swapping the IVF part for IMI2x8.
And let me remove the x8 there. Now, in this case it's actually better if we include OPQ; here it does improve the performance of our index, so we'll add that. What I also want is a recall function, so we can compare the matches between this index and a simple flat index and see how many of the returned results align. So I'm going to add that in here: all we're doing is creating a flat index and writing a function that compares how many of the results match. I'm also going to change k to 100; remember we were using 10, which is a little low, and 100 lets us compare them a little more easily. The reason I want to do that is to show you the performance difference we get when we start modifying another variable, the nprobe value. Let me wait for this to finish training, and we'll jump into that. Okay, that's finished now, and we've run the search, so I'll call recall on the results and see what we get: five percent. That's pretty terrible. The way we can improve it is by increasing the nprobe value, the number of cells in our index that we're searching. The only problem, and let me show you, is that what we'd normally do is write index.nprobe and set it to a higher number, but we can't actually do that anymore, because our index now consists of multiple indexes. If we were just using the IMI and PQ parts it would be fine, we could do that, but because we've added OPQ we can't access nprobe in the normal way. We have to use something else, and that is faiss.extract_index_ivf. I know we're not using IVF here, we're using IMI, but we use this function for both of those, IMI or IVF, whichever one you're using. We just pass in our index, and this has kind of extracted the IMI
part, or the IVF part, of our index, and that means we can now set the nprobe value. So we'll set that to 14, search again with D, I = index.search(xq, k), and check the recall: we get 44 percent, so it is increasing. Now what happens if we increase that to, say, 100? We get 69. So you see, as we keep increasing this number, the recall increases as well. 69 looks like the maximum we're going to get there, although in reality, having tested this before, you can definitely get slightly higher; you should be able to get up to about 75. And through this we do get very fast search times. So let's time it: I'll reduce nprobe to, say, 50 and see if we still get a decent recall. We get 66, so I'm going to settle for a 66 recall, and then time index.search, which loops through multiple times to get the average speed, and we get 379 microseconds, which is pretty fast. So that is the second of our composite indexes. Before we move on to the next one, let me show you the performance you can expect from this one; this is what I got when I was testing this index. In the middle, the magenta line is what we just built; you also have the non-optimized version, which is the blue line; and I wanted to include the flat index in there as well for comparison. The flat index is slower, though, and again the memory requirements of the flat index are through the roof, whereas this one is incredibly small, not as small as IVFPQ, or IVFADC, but still pretty small. Okay, so now on to our final composite index, and I think the best-performing one from what I've seen, as long as you don't mind increasing the memory usage quite a bit. Whereas before we were using product quantization, and we can use product quantization with this too and still get
pretty good results, actually, the memory usage when we're using this with flat vectors is of course higher. What we have here still has the IVF component; we're not using IMI this time, we're using IVF like we were in the first composite index, and on top of that we add an HNSW, or Hierarchical Navigable Small World, index, or graph. What it does is this: IVF splits our data into its different cells, each of which has a centroid used to identify which cells are closest to your query, and during search, rather than comparing all of those centroids to your query, an exhaustive search across all of the centroids, we use HNSW to approximate that step and speed it up. So let's just go into HNSW very quickly first. It's based on small-world graph theory, and in the diagram you can see on the screen right now there's a circle of all the nodes, or vertices, around the outside, and then all the edges, or links, between them. The theory is that in a very large network, something like Facebook where there are billions of users, you will find that the average degree of separation, the average number of links between any two people, is very small. With Facebook I think it's something incredibly small like four to six: any two random people on Facebook are connected by only four to six friendship links on average. And what we find is that these small-world graphs have both long-range and short-range links. Now, when we're traversing around the graph, from one node or vertex to another, we obviously move across it more quickly via a long-range link. So let's say you're American, and your friend has a friend over in India, and they have a friend who is someone important in one of the villages in India, for example. Between you and that important person in a village in India, even though you don't know anyone in India, you have
your friend, that friend over in India, and him, so there are three steps to get to him. Now, the step between you and your friend is short-range: you're pretty close, you have very high proximity. Between your friend and their friend over in India, that's a long-range link. HNSW takes this theory and applies it. What it does is take the long-range links and split them across different layers; in this case we define long-range as geometrically far apart. On the highest level of our HNSW graph we have all of these long-range links, and it's on that first level that we input our query vector and first look for its nearest neighbors. We take our first few traversals across the links, and with each traversal we go down a layer, so the links become shorter and shorter range, the search becomes finer, and at the bottom layer we finally get to what is our approximate nearest neighbor. That's what you can see here. We start at the entry point on layer 2, which is our entry layer and only has long-range links, so we can make a big jump from our entry point over here to over here. Then we're on to the next step, so we move down a layer and ask, where's the nearest neighbor now? It's over here, so we go there. Then on the next step we again move down a layer, so we're now on our bottom layer, and on this final step we ask, where's our nearest neighbor? It's over here, and that's what we assume: that this vertex, or node, is the nearest neighbor to our query vector. The reason we do this is that up here, where we only have long-range links, there are very few nodes, whereas down here we probably have loads of nodes. If we started down here, we'd be making loads of steps, whereas doing it like this we can make big steps early on and then smaller steps later, which minimizes the number of steps we need to take, or the number of
comparisons that we make. Now, how does this apply to IVF? We said this is an IVF and HNSW composite index, and that's what you can see now. Take that HNSW process we just went through, and imagine all of those points, all of those vertices, are cell centroids from IVF. So rather than exhaustively comparing all of those cell centroids, we're just comparing a few of them; we're finding the approximate closest, or nearest, centroid very quickly, and then we limit our exhaustive search scope to that single cell, if the nprobe value is one. Obviously, usually it's not going to be one, it's going to be greater, but that's pretty much what we're doing with this IVF-HNSW index. Now, because IVF-HNSW focuses on optimizing the centroid search portion, it makes more sense if we increase the number of centroids, because if we increase the number of centroids we reduce the number of vectors within each cell. Our optimized, approximate part of the search is the centroid comparison; once we're inside those cells we're comparing all of the vectors in there, performing an exhaustive search over the scope we've created. So we increase the number of centroids, because that's the optimized portion, and in doing so we reduce the number of vectors that end up in our final scope, because that's the unoptimized part of this search. To put all of that together, we're going to use this index factory string, and in terms of increasing speed, like I said, we'll just increase this number here, which I'll show you in another graph in a minute. So let's go up here; I'm going to modify this to IVF4096 followed, with an underscore, by HNSW32. What this means is we're creating that many cell centroids, or cells, and the 32 is how many connections, or links, each node or vertex has in the HNSW graph. And then after that
we're using a flat index here. Now, as I said before, we could use PQ32 here instead, for example. The accuracy, the recall, won't be as good, but you can use it if you need to reduce the memory usage, and it is very good, by the way; using PQ you do still get very good recall. So let's run that, do the search again, and check the recall to see what we get. Okay, that has just finished, and we get a recall of 25 percent, and again we're using the default nprobe value of one, so we probably want to increase it. We'll come down here (we don't need to change the variable name, I'm just doing it because it bothers me leaving it as it is) and let's go with 60 again. We do have a high number of cells here, so we'd probably want to increase it a bit more than that, but let's go with that and see what we get: 93 percent, which, okay, fair enough, that's pretty good; 93 straight away is nice. So let's time it and see how fast it is. Before, with the Multi-D-ADC index, we were getting I think 364 microseconds; now we're getting a little over 500, a bit more, but it's hardly anything, and before we were getting, what did we get, 66 percent recall. Of course, at the same time, the memory usage for this is something like half a gigabyte, but we can reduce that: instead of using a flat index we just change the string to end in PQ32, and that improves it a lot. We're not doing that here, though; we're just going to leave it flat. And, for example, we know that the maximum number of cells here is 4096.
Let's set nprobe to an insanely high number and see the maximum performance we're going to get out of it: 100 percent, right, because we're using flat vectors, so we can reach that 100 percent point. So yes, we can get very good performance, 89 there, and it's super easy. And look, 432 microseconds; for me, that's very good, and for that reason this is definitely one of my favorite indexes. It's just, again, that the memory usage is high. So let's have a look at the actual performance, search time against recall. I included a few different numbers of cells here for comparison. Like I said before, the optimized part is the IVF part, the centroid search, so we can increase the number of cells and that will decrease the search time; it will also decrease the recall a little, but not by much. So you can increase the number of IVF cells to improve the search time if you need to. Now, that's it, I don't want to cover anything else in this video. We've covered those three indexes, IVFADC, Multi-D-ADC, and IVF-HNSW, and with those I think you can build some pretty cool indexes. So thank you very much for watching, I hope it's been useful, and I will see you in the next one.
Info
Channel: James Briggs
Views: 114
Rating: 5 out of 5
Keywords: python, machine learning, data science, artificial intelligence, natural language processing, bert, nlp, nlproc, Huggingface, Tensorflow, pytorch, torch, programming, tutorials, tutorial, education, learning, code, coding
Id: 3Wqh4iUupbM
Length: 26min 22sec (1582 seconds)
Published: Fri Sep 24 2021