Composite Indexes and the Faiss Index Factory

Welcome to the next video in our similarity search series. So far we've worked through quite a few different indexes that we can use in Faiss for similarity search, and over the past few videos and articles we've tested a lot of them to see what works, what doesn't, and the pros and cons of each. That's all been very useful, and we've had some pretty interesting and cool results, but in reality the best, highest-performing indexes are all composite indexes. Composite simply means an index built from multiple parts, and all state-of-the-art performances are produced by these composite indexes.

So in this video we'll cover composite indexes at a high level, just to introduce them, and we'll also learn how to use the Faiss index_factory function. When building composite indexes this function is incredibly useful, I would say almost essential: you can build composite indexes without it, but it's very difficult, and the code is pretty horrific to work with too. We'll also explore how to build a few of the more popular composite indexes, namely IVFADC, Multi-D-ADC, and IVF-HNSW, and I'll explain what all of those are, because I know right now it's just a load of words and letters.

First, let's look at the composite components, the components we can have within a composite index. There are two main components that we're pretty much always going to have. In the middle there is typically a coarse quantizer: this is stuff like IVF, and HNSW can also be used here. Then we pretty much always have another component, which is our fine quantizer. In terms of fine quantization you're looking at things like product quantization (PQ), but in this space we would also include flat vectors. That's not really a form of quantization, it's just storing the full vectors, but flat vectors go into this box because they would be placed into that same position within the index structure, so the two are somewhat interchangeable.

The most common addition to those two is a pre-processing step for our vectors. Here you have a few options: PCA, which is dimensionality reduction, or OPQ, which is optimized product quantization. To integrate OPQ we would also need PQ as the fine quantizer, because all the OPQ step does is rotate the vectors being placed into our index in such a way that it optimizes them for product quantization later on.

Those three are probably the most common components, but there is one more block, and it comes at the end: a refinement step. Everything before it builds our approximate search, which returns, say, the top 10K, the 10,000 most relevant results according to that approximate search. We then perform an exhaustive search at the end, which re-ranks those results based on a more definitive measure.

Here's an example of how the vector transform, coarse quantizer, fine quantizer, and refinement steps fit together. As vectors come in, we transform them by rotating them with OPQ. Then we perform coarse quantization and assign those rotated vectors to different lists within an IVF index. Once they've been assigned to their clusters, or cells, we use product quantization to compress them and reduce the memory usage of our index. At search time we might want a refinement step that uses the full vectors, which would stop us from reducing the memory usage of the index significantly, or we could use a PQ refinement step instead, so the refinement vectors are stored in quantized, compressed form rather than as flat vectors, whichever works for your index. During search we go through each of these steps, return the top 10K items, and in the final refinement step, rather than using IVF, we simply take the Euclidean (L2) distance against all of them and find the ones that are most similar.

And here's an example of just the two middle parts of that: IVF and PQ, excluding the re-ranking and excluding the OPQ. We take our original vector and assign it to an IVF cell, and then in a separate stream we also take that original vector, quantize it by converting it into PQ codes, and store those codes under the IVF cell assignment. This IVF plus PQ combination is the IVFADC index mentioned earlier, and a minimal sketch of it follows below.
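As a quick illustration of that IVF plus PQ structure, here is a minimal sketch using the class-based Faiss API. The parameter values (d, nlist, m, nbits) are illustrative assumptions, not values taken from the video:

```python
import faiss

d = 128          # assumed dimensionality (Sift1M vectors are 128-d)
nlist = 256      # assumed number of IVF cells
m, nbits = 8, 8  # assumed PQ setup: 8 sub-vectors, 8 bits per code

# Flat vectors sitting in the quantizer slot
coarse = faiss.IndexFlatL2(d)
# IVF cell assignment plus PQ compression of the stored vectors
index = faiss.IndexIVFPQ(coarse, d, nlist, m, nbits)
```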
Now, I think that's plenty for an introduction to composite indexes. Let's move on to how we start building them using the index factory.

Here I'm just importing our data. We're using the Sift1M dataset as we usually do, and if you don't have it there's a link in the description to the script that downloads it. I've also imported faiss up at the top. The first thing I'm going to do is show you the difference between building an index with the normal method we've been using in every previous article and video, and building the same index using the index factory.

With the normal method, we write quantizer = faiss.IndexFlatL2(d). This is because we're using flat vectors again: like I said before, the flat vectors sit in the place where we would usually put a fine quantizer, but they themselves are not actually a fine quantizer. Then, to build our index from that, we write index = faiss.IndexIVFFlat(quantizer, d, nlist), where we first pass our quantizer, then our dimensionality, and then nlist, which is the number of Voronoi cells, or clusters, that we want to include in our IVF list. After that we train on our vectors and then add them, and once we've done that we can search like we usually do. So that's the usual approach for building an index and adding data to it.

The index factory approach is slightly different. Instead we write index_f = faiss.index_factory(d, "IVF256,Flat"). I'm calling it index_f just so we can compare the two indexes later on. We pass the dimensionality of our vectors like we usually do, but then we pass what's called an index factory string, where we specify the structure of the index we'd like to build: here, an IVF index with 256 cells which contains flat vectors. And that's it. Then we do the same thing again, index_f.train(xb) followed by index_f.add(xb), and that should work. A sketch of both approaches is below.
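Here is a minimal sketch of the two approaches side by side, assuming the Sift1M vectors are already loaded as xb (the vectors to index) and xq (the query vectors):

```python
import faiss

d = xb.shape[1]  # 128 for Sift1M
nlist = 256      # number of Voronoi cells in the IVF index

# Conventional approach: build each component explicitly
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)  # IVF indexes must be trained before adding vectors
index.add(xb)

# Index factory approach: the same structure from a single string
index_f = faiss.index_factory(d, "IVF256,Flat")
index_f.train(xb)
index_f.add(xb)

# Both should return identical nearest neighbors
k = 10
D, I = index.search(xq, k)
D_f, I_f = index_f.search(xq, k)
print(I)
print(I_f)
```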
Once we've done that, what I want to show you is that both versions do in fact output exactly the same results. So we do what we normally do, D, I = index.search(xq, k). I haven't defined k yet, so let's go with 10 rather than 100 so we can actually see the results. Then we do the same for index_f, our index factory index, just replacing index with index_f. And we can see the outputs are exactly the same; there's no difference between them. We're returning the same nearest neighbors using both, so they are the same index, which is the point I'm trying to make here. The search time is pretty much the same too. I always find it's very slightly faster when you use the index factory, but only very slightly, on the microsecond scale, maybe five microseconds faster, which is obviously not much. In terms of memory usage they're both exactly the same as well.

That's a very simple example, and obviously when we start building more complex indexes the index factory becomes a lot more useful. So let me show you a good example of why we'd use it: let's put together a more complex index and see the difference between the two approaches.

I've pulled some code in here, and as I said before, we have those four different components: the pre-processing step, the coarse and fine quantizers, and the step that comes after, where we re-rank the vectors. The fine quantizer I've renamed to vecs, because it's not really a quantizer, just flat vectors. The re-ranking step I still need to add in; that's index = faiss.IndexRefineFlat(q), where we have to pass in everything we've built so far, which I'm renaming to q.

So here we're building our index like we usually do. We've got our OPQ rotation, which is the pre-processing step. Then we go on to our IVF step, sorting the vectors into different cells, using the fine quantizer that, again, isn't really a quantizer but flat vectors. We merge all of those together into a sort of proto-index. We could use that as an index on its own, but we then add another post-processing step, the re-ranking, which is what IndexRefineFlat gives us. All of this code is quite complicated, and at least to me it's not that easy to read; there's a lot going on. It can also take a little while to build, so if it takes too long I'll reduce how much training data we use. A sketch of this manual construction is below.
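For reference, here is a sketch of what that manual construction looks like. The parameter values are assumptions based on the values mentioned in the walkthrough (m of 32, nlist of 256, and the usual 8 bits per PQ code):

```python
import faiss

d = 128           # Sift1M dimensionality
nlist = 256       # IVF cells
m, nbits = 32, 8  # assumed PQ parameters

# Pre-processing step: OPQ rotation that conditions vectors for PQ
opq = faiss.OPQMatrix(d, m)
# Flat vectors in the quantizer slot (not really a quantizer)
vecs = faiss.IndexFlatL2(d)
# Coarse quantization (IVF) plus fine quantization (PQ)
sub_index = faiss.IndexIVFPQ(vecs, d, nlist, m, nbits)
# Merge the OPQ transform and the IVFPQ index into a "proto-index"
q = faiss.IndexPreTransform(opq, sub_index)
# Final step: exact re-ranking over the full, flat vectors
index = faiss.IndexRefineFlat(q)
```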
If we wanted to rebuild this using the index factory, it's super simple: we can literally do it on one line, index = faiss.index_factory(d, "OPQ32,IVF256,PQ32,RFlat"). We have our dimensionality d, and then our string. First there's the OPQ step, the pre-processing step, with an m value of 32. That goes on to our IVF step, with an nlist value of 256. Note that we're not writing "Flat" here; we do still use flat vectors as the quantizer, but the fine quantization is now actually PQ, again with m equal to 32. Then the step after that is our refinement, or re-ranking, step, which is just "RFlat". And that's it. Everything we wrote before is compressed into this single line by the index factory, which is why, once you start using composite indexes, I think the index factory is practically essential. You can do without it, of course, like we showed here, but it's difficult, and it gets a lot more complicated than this as well, so I think it's probably the best approach, in my opinion at least.

Now we train and do the same again. I'm only training on the first 50K vectors, and it's still taking forever, maybe because we're recording at the same time. Then D, I = index.search(xq, k), and we just check that what we return matches: we print out I for this index, run the same search on the manually built version, and look at its I as well. Comparing those results, they are exactly the same, which is what we'd expect, because we're building the same index either way; we've simply gone from all of that code to a single line using the index factory. There's a reference sketch of this version at the end. So I think it's fair to say there's good reason to use the index factory when you're building these composite indexes.

In the next video we'll be covering the composite indexes themselves and how we can build them using the index factory. If you're watching this on YouTube, it's the next video in the playlist; if you're over on Pinecone reading the article, it's just below, so you can read about it or watch it there. Thank you very much for watching, and I'll see you in the next one.
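For reference, here is a sketch of that index factory version end to end, again assuming the Sift1M vectors are loaded as xb and xq, and using the same assumed parameters as the manual build above:

```python
import faiss

d = 128  # Sift1M dimensionality

# OPQ32 -> pre-processing, IVF256 -> coarse quantizer,
# PQ32 -> fine quantizer, RFlat -> exact re-ranking step
index = faiss.index_factory(d, "OPQ32,IVF256,PQ32,RFlat")

index.train(xb[:50_000])  # train on the first 50K vectors only
index.add(xb)

k = 10
D, I = index.search(xq, k)
print(I)  # matches the results from the manually constructed index
```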