Metadata Filtering for Vector Search + Latest Filter Tech

Captions
Hi, welcome to the video. We're going to explore two of the common methods we can use to filter indexes in vector similarity search, and then we'll look at Pinecone's new solution to filtering in vector search.

In vector similarity search we build representations of data (text, images, even cooking recipes) and convert them into vectors. We store those vectors in an index, and we typically want to perform some kind of search or comparison across all the vectors in that index. For example, if you found this video or article through Google or YouTube, you typed some sort of query into one of those search engines, maybe something like "how do I filter in vector similarity search". The search engine most likely converted your query into a vector representation and compared it to all of the other vectors in its index, which could represent pages, videos, anything really. Out of all those indexed vectors, this video or article was one of the most similar to your query vector, so it was served near the top of your results, you clicked on it, and here we are.

In search and recommender systems there is almost always a need to apply some sort of filter. On Google we can search by category such as news or shopping, by date, by language, or by region. Likewise, Netflix, Amazon, and Spotify might want to compare users only within specific regions. Restricting the search scope to relevant vectors is, in many cases, an absolute necessity, and despite that very clear need there hasn't been a particularly good approach for doing so.

Let's look at the different types of filters available to us. During the video we'll cover each of them: pre-filtering, post-filtering, and Pinecone's new single-stage filtering.

First, what is metadata filtering? In a vector index, each vector can be assigned some metadata: a number, a date, a piece of text, anything we can use to filter our search. We then want to search only where some condition on that metadata is true. For example, say a big corporation has many departments and loads of internal documents; some documents belong to the engineering department, some to HR, and so on. A user in that company might sometimes want to search across all departments, but at other times they will want to apply a filter: "give me the top-k documents where department equals engineering", or "the top-k documents where department is not HR". These filters can express almost anything. For recent documents we might ask for the top-k where the date is greater than or equal to 14 days ago, and we can mix and match the different conditions.
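For illustration, here is what such a combined condition could look like as a hypothetical filter expression, written with the MongoDB-style operators ($eq, $gte, ...) that Pinecone supports; the field names department and date are made up for this example:

```python
import time

# Hypothetical metadata filter using Pinecone's MongoDB-style operators.
# The field names ("department", "date") are invented for illustration;
# the date is assumed to be stored as an integer (e.g. a Unix timestamp)
# so that range operators like $gte can be applied.
fourteen_days_ago = int(time.time()) - 14 * 24 * 60 * 60

filter_condition = {
    "department": {"$eq": "engineering"},   # department must be engineering
    "date": {"$gte": fourteen_days_ago},    # created in the last 14 days
}
```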
To implement a metadata filter we really need two things: our vector index and a metadata index. These are paired one to one, so each vector has its own metadata record. We apply a condition to the metadata index, which removes some of the records, and based on what was removed there we also remove the equivalent vectors from the vector index. That is how the filter gets applied, but there are different orders and ways of doing it, as we'll see now.

The first option is a post-filter. A post-filter is nothing more than this: we take our query vector and our metadata query, we start by performing an approximate nearest neighbor search between the query vector and all of the indexed vectors, and we take the top-k matches, say 10 of them. Then we bring in the metadata query, which creates a filter, and we pass those remaining vectors through it, leaving us with a filtered top-k. In this case the result count is usually not the number we asked for: we might start with 10 top-k matches, filter some of them out, and end up with only four.

We also have pre-filtering, where we change the order slightly and apply the filter before we search. We take the metadata query, use it to build a filter, apply that to the full vector index, and are left with some subset of vectors, which we then search. But because we are no longer searching the full dataset, we can't use an approximate nearest neighbor search; we have to do an exhaustive search at this point.

Let's walk through the pre-filter process. As before, we start with the metadata index, apply our filter to identify which positions satisfy the filter condition, and then use this filtered metadata index to remove the vectors that do not satisfy the condition. This is exactly where the issue with pre-filtering comes in: because we have filtered out some, or many, of the vectors, we no longer have the same index we started with, and the approximate search structure was built on the full index. As soon as we filter, we change the structure of that index, so we can no longer perform an approximate nearest neighbor search and are left doing a brute-force, exhaustive k-nearest-neighbors search. If the index is very small, or the number of vectors that survive the filter is very small, that is probably fine, but as soon as we work with big datasets it becomes unmanageable. The only alternative is to build an index for every possible filter outcome, which is simply not realistic. So with pre-filtering we get good accuracy, but it is very slow.

Post-filtering is of course slightly different. We start with the vector index, and because we haven't filtered anything yet we can perform an approximate nearest neighbor search on the full index, returning the top-k vectors, say 10 of them. We then find all the records in the metadata index that satisfy our metadata condition and apply that filter to the 10 returned vectors. At this point we are reducing the number of vectors we get back: instead of 10 we might get four, and in the worst case the filter could rule out every vector we returned, so we return nothing even though the index may contain relevant vectors, which is obviously not ideal. We can try to work around this by increasing k a lot. With a low k the chance that all results are excluded by the post-search filter is reasonably high; if we push k up towards, say, one million, that chance becomes much lower, but the search becomes very slow, and the more we increase k to avoid the problem, the slower it gets. So with post-filtering we have unreliable accuracy, or poor performance, but it is fast as long as we don't inflate k.
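As a rough illustration of the two orderings (not Pinecone code), here is a toy sketch; a brute-force cosine search stands in for the approximate search, and the vectors and the engineering/hr metadata field are invented for the example:

```python
import numpy as np

# Toy illustration of post-filtering vs pre-filtering.
# In a real system the full-index search would use an approximate
# algorithm (HNSW, IVF, ...); brute force is used here for simplicity.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))                     # vector index
metadata = rng.choice(["engineering", "hr"], size=1000)   # paired metadata index
xq = rng.normal(size=64)                                  # query vector

def top_k(vecs, ids, q, k):
    # cosine similarity, highest k first
    scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:k]
    return [ids[i] for i in order]

# Post-filter: search the full index first, then drop non-matching results.
candidates = top_k(vectors, np.arange(len(vectors)), xq, k=10)
post_filtered = [i for i in candidates if metadata[i] == "engineering"]
print(len(post_filtered), "results survive the post-filter (often fewer than 10)")

# Pre-filter: restrict to matching vectors first, then search exhaustively.
keep = np.where(metadata == "engineering")[0]
pre_filtered = top_k(vectors[keep], keep, xq, k=10)
print(len(pre_filtered), "results from the pre-filtered exhaustive search")
```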
Now let's introduce single-stage filtering from Pinecone. We are going to go through some code and test it, but first I want to introduce what it is at a high level. It is a new filter built by Pinecone, and it works by merging the vector and metadata indexes, which allows us to filter and still run an approximate nearest neighbor search. What we get is the accuracy of pre-filtering, while the search speed is often even faster than post-filtering, so we really do get the best of both with this new filter. But let's go and actually try it out.

We're going to be using Pinecone here. All I've done so far is import pinecone, import json, and load my data, which I've already upserted to my Pinecone index. The data is the SQuAD dataset in both English and Italian. Each record has an ID, the text (which I've stored locally), the vector (which has been upserted to Pinecone), and the metadata (which is also stored with Pinecone). If we look at the metadata, we have the language, either English or Italian, and the topic. We'll be filtering based on language and topic, and there is one more metadata field that I don't have locally, a randomly generated date, so we can also try the greater-than-or-equal-to and less-than-or-equal-to filters available in Pinecone.

The first thing to do is initialize the connection to Pinecone, so I write pinecone.init and pass my API key, which I loaded above, and the environment I'm working in; this will of course differ depending on which environment you are using. With that initialized I can create a direct connection to a specific index within my Pinecone environment. I'm connecting to one I've already made, called "squad-test", and I'll use this index object to perform my queries.

We're performing a vector search, so we first need a query vector. I used the sentence-transformers library to encode the already-indexed vectors, so we'll use the same model to encode our query vector: the stsb-xlm-r-multilingual SentenceTransformer model, which I'll need to download.
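A minimal sketch of that setup, assuming the Pinecone Python client as it was around the time of the video (pinecone.init / pinecone.Index); the API key, environment name, and local data file are placeholders:

```python
import json
import pinecone
from sentence_transformers import SentenceTransformer

# Local copy of the SQuAD records (id, text, metadata); the file name is hypothetical.
with open("squad_data.json", "r") as f:
    data = json.load(f)

# Initialize the connection; key and environment are placeholders for your own values.
pinecone.init(api_key="<<YOUR_API_KEY>>", environment="<<YOUR_ENVIRONMENT>>")

# Connect directly to the already-created index.
index = pinecone.Index("squad-test")

# Same multilingual model used to encode the indexed vectors.
embedder = SentenceTransformer("stsb-xlm-r-multilingual")
```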
To create the query vector I assign embedder.encode(query) to xq, where the query asks for context in our dataset that mentions something along the lines of "early engineering courses provided by American universities in the 1870s". Note that we're using a multilingual model, so we should get back both English and Italian results, but all of them should be on a similar topic.

To return results we write index.query. Before that we convert xq into the list format the client expects, then pass xq in and set top_k to three. Remember, if we were using post-filtering here and set a top_k of three, we would probably get back fewer than three results, so we would have to set something absurdly high just to end up with maybe three samples if we're lucky. Because we're using single-stage filtering, we only need to set top_k equal to three. We execute that, return the results, and we get our three IDs back.

Now we want to map those IDs back to the data we stored locally. We build ids as a list comprehension over the response, reaching into the 'results' key, taking the first entry of that list, and then its 'matches'. Printing the ids shows our three IDs. Next we use the data imported earlier to print whatever those IDs refer to. Right now the data is one big list, which isn't that useful, so we reformat it into a dictionary, get_sample, mapping each ID to its text and metadata (we don't store the vector there because we can't read it anyway). After a quick fix, because the field is called text rather than context, we can loop for i in ids and print get_sample[i].

The first result is Italian, and the translation is something to do with "the college of engineering was instituted in 1920", so we have a college of engineering, that's good, and it also mentions the college of science from the 1870s, so this looks pretty relevant. Further down we have Italian again, but we also have its English translation here as well, and we can see straight away: a public school of engineering founded in 1891 that offered engineering degrees as early as 1873.
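Roughly, the query and ID-to-text mapping could look like the sketch below, assuming the response layout described above (res['results'][0]['matches']) and a local data list whose records carry id, text, and metadata keys:

```python
query = "early engineering courses provided by american universities in the 1870s"

# Encode the query with the same model used for the indexed vectors.
xq = embedder.encode(query).tolist()

# Single-stage filtering means top_k=3 really returns (up to) three matches.
res = index.query(queries=[xq], top_k=3)
ids = [match["id"] for match in res["results"][0]["matches"]]
print(ids)

# Reformat the locally stored list into an id -> {text, metadata} lookup.
get_sample = {
    x["id"]: {"text": x["text"], "metadata": x["metadata"]}
    for x in data
}
for i in ids:
    print(get_sample[i])
```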
That's again pretty relevant. Now, I don't understand Italian, so my first filter would be to return only the English results. Let's do that: I reuse the same index.query call as before, and all we need to add is a filter argument saying that the language metadata field must be equal ($eq) to English. We get our results, pull out the IDs, print them, and there we go: now we're only getting English results.

That was pretty fast, so it's worth comparing how fast the two searches were. We're getting relevant results, and we're still returning three results even with the filter applied, so it seems we're getting the accuracy of pre-filtering. As for the speed difference between the two approaches, we shouldn't see anything major because this is a very small index, only around 40,000 vectors, but at least we can check that nothing has become slow. In fact we get a slightly faster response when we filter, and this is typical of Pinecone's single-stage filtering: adding a filter usually makes the query faster, which is pretty remarkable. Not only do we get speed comparable to post-filtering, the filter actually makes the search faster, which neither post-filtering nor pre-filtering can do, and at the same time we keep the accuracy of pre-filtering. In my opinion that's pretty impressive.

We might also want to add another condition. At the moment we have a single filter, which works fine, but say I look at my results and see the topic "University of Kansas" and decide I'm not interested in it, and the same for "University of Notre Dame"; the Institute of Technology results we can keep. So I want everything that is, one, in English and, two, not from the University of Kansas and not from the University of Notre Dame. To do that I add another condition to the filter using $nin ("not in"), passing a list of the topics I don't want to see: the University of Notre Dame and the University of Kansas. Running that is again pretty fast, but we're still getting the University of Kansas back, which means I've written something wrong: the field in my Pinecone index is actually called title rather than topic. After correcting that we get something different; the results are all Institute of Technology, and we're no longer returning the University of Kansas or the University of Notre Dame, which is exactly what we wanted.
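A sketch of the combined filter, using Pinecone's $eq and $nin operators; the exact language value and title strings depend on how the metadata was upserted (SQuAD titles often use underscores), so treat them as assumptions:

```python
res = index.query(
    queries=[xq],
    top_k=3,
    filter={
        "language": {"$eq": "en"},  # English results only
        # exclude two titles; exact strings depend on how the metadata was stored
        "title": {"$nin": ["University_of_Notre_Dame", "University_of_Kansas"]},
    },
)
ids = [match["id"] for match in res["results"][0]["matches"]]
for i in ids:
    print(get_sample[i])
```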
There is also the date filter I wanted to show you. We don't only filter on strings; we can also filter on numeric values and datetimes. To show this it's best to include the metadata in our results so we can see it directly and confirm we're returning relevant text. I add include_metadata to the query, and now the metadata comes back with each result, which is also pretty cool. We have the date, which here is just a simple numeric value; it's randomly generated, so there is no actual relation between the date and the record. We can see dates from 2016, 2008, and 2020 in the results.

The first thing I might want to do is return only the more recent records, so we keep all of the other filters and add a condition that the date must be greater than or equal to, say, the first day of 2018 (the most recent record is from 2021, so 2018 onwards is a reasonable cut), i.e. 20180101. We search, and we can see it's definitely filtering correctly. Looking at the search time, even with quite a few filter conditions it's actually slightly faster again, which is pretty cool, but as I said this is a small dataset; on bigger datasets the difference can be huge.

We can also add another condition within the date filter itself. Say we want records only from 2018: we keep date greater than or equal to the first day of 2018, and we also require it to be less than or equal to the very last day of 2018, i.e. 20181231.
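That range query might look something like the sketch below, assuming the date metadata is stored as an integer in YYYYMMDD form (adjust the values if the index stores timestamps instead):

```python
res = index.query(
    queries=[xq],
    top_k=3,
    include_metadata=True,  # return each match's metadata alongside its id and score
    filter={
        "language": {"$eq": "en"},
        "title": {"$nin": ["University_of_Notre_Dame", "University_of_Kansas"]},
        "date": {"$gte": 20180101, "$lte": 20181231},  # only records dated in 2018
    },
)
for match in res["results"][0]["matches"]:
    print(match["id"], match["metadata"])
```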
We filter, and we see that now we're only returning records from 2018. Again, super cool, and I think an incredibly useful piece of functionality for vector similarity search.

We were using a very small dataset there, so I couldn't really show how impressive the speed-up from filtering can be, but I do have another index. I'm not going to walk through all of the code, because it's pretty straightforward: the index holds 1.2 million vectors and has a single metadata field, which I've called tag1, a randomly generated integer from 0 to 100. We initialize the connection to the index in the first cell and then create a random query vector. The unfiltered search takes 79.2 milliseconds; most of this is network latency rather than search time inside the index, but we will still see the search time drop noticeably. First we ask for tag1 greater than 30, which removes roughly 30 percent of the vectors from the search, and we shave off about eight milliseconds, which is impressive. Pushing further, with tag1 greater than 70 we exclude around 70 percent of the vectors and the search time drops to 56.6 milliseconds. Going further still, to about 90 percent excluded, we're down to 54 milliseconds, and finally, using an equality condition so we're searching only about one percent of the index, we're down to 51.6 milliseconds. That's an incredibly impressive speed-up. Plotting the tag1 greater-than threshold on the x-axis against search time in milliseconds, the line is a little bumpy, it goes up and down, but the trend is clearly downwards: the more we filter, the faster the search.

That's it for this video covering pre-filtering, post-filtering, and Pinecone's new single-stage filtering. I hope it has been useful and insightful. If you're interested in testing Pinecone yourself, there is a link to Pinecone's website in the description. We'll leave it there for now; thank you very much for watching, and I'll see you in the next one.
Info
Channel: James Briggs
Views: 155
Rating: 5 out of 5
Keywords: python, machine learning, data science, artificial intelligence, natural language processing, bert, nlp, nlproc, Huggingface, Tensorflow, pytorch, torch, programming, tutorials, tutorial, education, learning, code, coding
Id: H_kJDHvu-v8
Length: 34min 13sec (2053 seconds)
Published: Mon Sep 20 2021