Product Quantization for Vector Similarity Search (+ Python)

Captions
Welcome to the next video in our series on similarity search. We're going to be covering product quantization, which is a very effective method for reducing the memory usage of our vectors. I'll show you this very quickly: in the first column here we have the recall, speed, and memory for a dataset of one million vectors. We're using the Sift1M dataset, and the dimensionality of the vectors in there is 128, so quite small compared to a lot of vector sets out there. The memory usage when we store the vectors as they are is just over a quarter of a gigabyte, which is pretty big when you consider that this is a small dataset with small dimensionality. Once you start getting into higher dimensionality, or just larger datasets, it quickly becomes very unmanageable. Product quantization, or PQ, allows us to reduce that significantly: in the test we ran here, we reduced the memory usage by 97%, from a quarter of a gigabyte down to 6.5 megabytes, which is like nothing. The speed increase is pretty significant as well, just under a six-fold improvement. Obviously the recall also decreases a fair bit, so that's something we need to be careful with, but at the same time there are different parameters we can use to fine-tune that trade-off.

What we're going to cover in this video is the logic behind product quantization. There will also be another video following this one where we'll look at product quantization in Faiss, and we'll also look at a composite index of product quantization and an inverted file (IVF). With that we can actually make the search even faster: the memory goes up a little, to around nine megabytes, so pretty minimal, but the speed increase is very significant, roughly a 92-times speed increase over a flat index. We'll be covering that in the next video; for this one I'm just going to work through what product quantization is and how it works.

We'll start with the example of dimensionality reduction, which is what you can see here. We have this S value up here, which is the scope of possible vectors that we can have. In the middle I've highlighted a single vector, so this would be x, our vector, and D over here is, of course, the dimensionality of that vector. Typically, the Sift1M dataset for example uses a dimensionality of 128.
Dimensionality reduction is something like PCA, for example, where we reduce the dimensionality while trying to maintain the geometric or spatial properties of the vector even though it has fewer dimensions: vectors that were close to each other in high dimensionality should still be close to each other in low dimensionality. What we have here, though, is not dimensionality reduction; this is quantization. Quantization is a very generic term, and it covers any method where we're trying to reduce the scope of our vectors. Beforehand, that scope is typically either a very large number or technically infinite: if we're using floats, for example, the number of possible vectors within that space is usually so big that you may as well call it infinite. In this case, what we do is reduce that: we reduce the possible scope to a more finite set, so we might go from something with a practically infinite scope down to something like 256 possible vectors. That would be a very small number, but it's just an example. Notice that the dimensionality does not necessarily change. It can change with quantization, but that isn't the point; the point is to reduce the scope of our possible vectors.

At a very high level, let's cover what product quantization is doing. This big blue block at the top here is our initial vector x, and we'll say its dimensionality is 128, so D = 128. Now, the number of bits required to store this vector, if every value is a 32-bit float, is 32 bits multiplied by the dimensionality of 128, which gives us 4096 bits in total to store a single vector. That is pretty significant, so what we're going to do with PQ is try our best to reduce it.

The first thing we do with product quantization is split our vector x into sub-vectors, each one represented by u with a subscript j, where j is a value from zero up to seven in this case, representing the position of that sub-vector. The next step is the "quantization" part of product quantization: we process each of these sub-vectors through a clustering algorithm, and each clustering algorithm is specific to its subspace, so specific to j = 0, 1, 2, 3 and so on. Within each of those clustering sets we have a set of centroids. We set the number of centroids beforehand, when we're building our index, and the more centroids we have, the more memory we use, but also the more accurate the index becomes. Think about it this way: say you have two centroids in one subspace. Your sub-vector has to be assigned to one of those two centroids, and with only two centroids, neither of them may be particularly close to your sub-vector. That average distance between sub-vectors and the centroids they're assigned to is called the quantization error, and for good accuracy, good recall, good results, we want to minimize that quantization error as much as possible. We do that by adding more centroids: if we compare our cluster of two centroids to a set of, say, 200 centroids, the chances are we're going to find a centroid that fits our sub-vector more closely in
the 200-centroid case than in the two-centroid one. That's the logic behind the memory-versus-accuracy trade-off: obviously, the more centroids we have, the more data we store in our index to cover the full scope of all those centroids.

Once we've assigned a centroid to our sub-vector, we map that centroid to its unique ID. Every centroid within this index will have a unique ID, and that unique ID is typically an 8-bit integer. So we've now gone from a vector x of dimensionality 128, where each value was a 32-bit float, using 4096 bits, to an ID vector containing eight 8-bit integers, which is eight multiplied by eight, or 64 bits. We've compressed our 4096-bit vector into a 64-bit vector, and that's what gets stored in our index. There's also something called a codebook, which maps each of those IDs back to the reconstruction or reproduction values, which are those centroid vectors. But of course, because we've minimized the scope, we don't need to store that many of those mappings: if our scope is 256, we only store 256 mappings. This is how product quantization can be so effective.

We'll go through some more visuals, but I want to write this out in code as we go along, because for me at least, writing it out in code makes the process much easier to understand logically. There are a few variables we need to define. We have this vector up here: it's very small, and it's using integer values. The data types here don't matter; we just want to build out the process of product quantization. This is not an efficient implementation or anything like that; we just want to understand the process and the logic behind it. The first thing is how many sub-vectors we're going to split our full vector x into. In this case I want to create four sub-vectors, so m = 4. We also need to get the dimensionality of x, D, which is just its length. Now that we have those two, the first thing we need to do is make sure that D is divisible by m, because we need equally sized sub-vectors. So we write an assert that D is in fact divisible by m. If that is the case, we don't get an error; if I add a one in there to throw it off, we get an assertion error. We're just adding that check to say, okay, D is definitely divisible by m, so we can continue. Then we ask: what is the length of each sub-vector? It's going to be equal to D divided by m, converted to an int, and we see that each sub-vector will be of size three. Now let's build that. Our set of sub-vectors is assigned to u, and what we're going to do is take x from row to row plus d_ for each row in range(0, D), stepping by d_. When we run that, we see that we now have our sub-vectors.

What does that look like? Let's visualize it. Here we have that vector, where D is 12, and we're splitting that vector x of dimensionality 12 into four sub-vectors, so m = 4. D over m, which is what we calculated before, produces this d* value. In the code we can't write the star, so I've changed it to d_ (d underscore), and its value is three.
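To make the splitting step concrete, here's a minimal sketch of the code described above, written in plain Python. The vector values are arbitrary placeholders, not necessarily the exact toy vector used in the video:

```python
# a toy vector of dimensionality D = 12 (values are arbitrary placeholders)
x = [1, 8, 3, 9, 1, 2, 9, 4, 5, 4, 6, 2]

m = 4            # number of sub-vectors we want
D = len(x)       # full dimensionality, 12 here

# D must split evenly into m equally sized sub-vectors
assert D % m == 0

d_ = int(D / m)  # dimensionality of each sub-vector (d* in the notation), 3 here

# split x into m sub-vectors u_j
u = [x[row:row + d_] for row in range(0, D, d_)]
print(u)  # [[1, 8, 3], [9, 1, 2], [9, 4, 5], [4, 6, 2]]
```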
So each of our sub-vectors has a dimensionality of three: that d* dimensionality of three is what you see here, and these are our sub-vectors, four in total, equal to m.

The next thing I want to do is set k. k is the total number of possible centroids we're going to have across our index, and we're going to set it to 2 to the power of 5. If we print that out we get 32, so k = 32. The one thing we need to make sure of here is that this k value, the number of possible centroids shared across our entire set of subspaces or sub-vectors, is divisible by m. So we do the same thing again and assert that k is in fact divisible by m. If it is, we can go on to get k*, or k_ in the code, which is k divided by m, and we should also make sure that's an integer. What that means is that we're going to have eight centroids per subspace: each of these sub-vectors can be assigned to the nearest of eight centroids within its subspace, and in total, across all of our sub-vectors, we have 32 possible centroids.

That 32 means we will have a codebook with 32 entries. The codebook is simply a mapping from centroid IDs to centroid vectors: each of the centroids, which we haven't created yet but will, is assigned a unique ID, and the codebook maps us from that unique ID to the actual centroid sub-vector. That's why, later on, we call those centroids reproduction values: when we're searching through our index, the index is storing those 8-bit integers, but obviously we can't compare those 8-bit integers to the actual query vector we're searching with. So we map each of those 8-bit integers, which are the IDs, back to their original centroid values, not the original sub-vector values, but the centroid values, and that's what our codebook is for. This k value of 32 means we will have a total of 32 mappings in our codebook.

I know that can seem confusing, so let's look at what it actually looks like. Over here we have u, our sub-vectors, with j = 0, 1, 2, 3, and over here we have our reproduction values, those cluster centroids. In this diagram we show three centroids, but in reality we set k_ equal to eight, so we should actually see eight centroids, not three: one, two, three, four, five, six, seven, eight. Each of those centroids can be referred to as c with subscripts j and i. In this case j would be 0, and i indexes the centroids, so this centroid is i = 0, this one is i = 1, this one is i = 2, and so on, all the way up to 7, giving us the eight centroids. To refer to this centroid here in our codebook (C represents our codebook) we would write c_0,0, or this one over here, number two, would be c_0,2. Down here, this green centroid, let's say that is i = 4, so to get to it in our codebook we would say c_3,4. That's just the notation behind it.
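A short sketch of the k and k_ variables just described, continuing from the snippet above:

```python
k = 2 ** 5         # total number of centroids across all subspaces, 32 here
assert k % m == 0  # k must split evenly across the m subspaces
k_ = int(k / m)    # centroids per subspace, 8 here
print(k, k_)       # 32 8
```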
Our quantized sub-vector is one of these centroid sub-vectors, and each of our reproduction values, those centroid sub-vectors, is assigned a reproduction value ID, which is simply the c_j,i notation flattened into an integer. So this value here might be the integer 0, and if we start counting through our codebook, that relates to c_0,0. If we go down here to the integer value 10: i counts up to 7 within j = 0, and then j goes from 0 to 1, which resets our i counter. So after c_0,7, the next entry in our codebook is c_1,0, and the integer value at that position in our ID vector would be 8. We add one more to get 9, which is c_1,1, and 10 is c_1,2. That's how we represent those IDs and refer back to the original centroid sub-vectors within the reproduction values codebook.

For now we don't have those centroids, so we need to create some. What I'm going to do is import randint from random. C is going to be our overall codebook, or list of reproduction values. We loop for j in range(m), so we're looping through each of our subspaces, and we initialize c_j. In here we need to store eight different centroid vectors, or reproduction values, and of course we refer to those with i, so we loop for i in range(k_), which in our case means eight times. With m = 4 we loop through four values of j and eight values of i for each, so in total we get 32, which is our k value. For each c_j,i we set every dimension to a random integer from zero up to the maximum value in our vector space; we have a nine in our example vector, so we use that. Remember, these are the centroid values, not the IDs. We loop over the dimensionality of each sub-vector, append each value to c_j,i, then append c_j,i to c_j, and outside that loop we append c_j to C. We've now just created our clusters, which we can see here: each subspace, each j, has a total of eight possible centroids, and don't forget we have m of those subspaces, so we should see four of them. It does cut off here, but this is in fact the fourth one.

I think it's pretty cool to be able to visualize that, so I'm just going to copy and paste this plotting code across; you can copy it too, and it will also be in the notebook linked in the video description. We visualize each of the centroids within each of those subspaces: this is C where j is equal to 0, 1, 2, and 3, and these are our centroids. Typically, of course, we would train these, but we're not going to train them here; we just want a very simple example, so we won't go that far into it.
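Here's one way the random codebook creation described above could look. The 0-to-9 range mirrors the maximum value in the toy vector, and in a real index these centroids would be learned with a clustering algorithm such as k-means rather than drawn at random:

```python
from random import randint

c = []  # the codebook: c[j][i] is centroid i within subspace j
for j in range(m):
    c_j = []
    for i in range(k_):
        # one random centroid per (j, i); values span the toy vector's range
        c_ji = [randint(0, 9) for _ in range(d_)]
        c_j.append(c_ji)
    c.append(c_j)
```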
So, we've produced those reproduction values, all those centroids, and now what we need to do is actually take each sub-vector and assign it to the nearest reproduction value, or centroid. How do we do that? We simply calculate the distance between our sub-vector and each of its equivalent reproduction values or centroids. (It's going to get annoying if I keep saying "reproduction values or centroids", so I'm just going to call them centroids from now on; it's much easier.) We want to identify the nearest of our centroids to each specific sub-vector, so I'm just going to copy these functions in. At the top we're calculating the Euclidean distance; it's pretty straightforward, and if you need to, just Google "Euclidean distance", the formula is simple. Then here we're using that Euclidean distance and looping through each of the k_ centroids within a specific j subspace, so all we need to do is pass a specific j subspace and a specific sub-vector to that function. Note that what we're tracking here are the index positions, the i positions: we loop through, and if the new distance we calculate between our sub-vector and the current centroid is less than the previous best, we store that centroid's ID as the nearest. We don't care about the distance itself; we just want the nearest ID, and what we return are in fact the i values for those centroids.

First we initialize our IDs vector. Here we're actually going to build that vector of 8-bit integer IDs; in this case it won't hold eight values but four, because the length of this vector is always equal to m. So for j in range(m), i is equal to nearest(c[j], u[j]), and we append i to ids. Let's see what we have. That is how we quantize the sub-vectors: each of these IDs represents one of our centroid positions.

If we want to get those centroid positions back, which in this case I'll call the reproduction or reconstruction values, we do this: q is going to be our reconstructed vector. Again we go for each j in range(m), and the reconstruction value c_j,i is equal to c[j] indexed at whichever value we have in ids, so c[j][ids[j]], and then we extend q with c_j,i. Now let's look at what we get for q: this is our fully reconstructed vector. Earlier on I said that we get the quantization error, and we typically measure that using the mean squared error; you can see the function here. We can measure the mean squared error between our quantized vector and the original vector: we call mean_squared_error(x, q) and get the total mean squared error between our original vector and the quantized version. By increasing the m value or increasing k, we should be able to minimize this error and therefore improve the performance of our index.
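Continuing with the variables defined in the earlier snippets, a minimal sketch of the nearest-centroid assignment, reconstruction, and mean-squared-error steps described above might look like this. The function names are illustrative, not necessarily the exact ones used in the notebook:

```python
def euclidean(a, b):
    # Euclidean distance between two equal-length vectors
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)) ** 0.5

def nearest(c_j, u_j):
    # return the ID (i position) of the centroid in subspace j closest to sub-vector u_j
    distance = float('inf')
    nearest_idx = -1
    for i in range(k_):
        new_dist = euclidean(c_j[i], u_j)
        if new_dist < distance:
            nearest_idx = i
            distance = new_dist
    return nearest_idx

# quantize: one small integer ID per sub-vector, so len(ids) == m
ids = []
for j in range(m):
    i = nearest(c[j], u[j])
    ids.append(i)

# reconstruct q from the codebook (the "reproduction" of x)
q = []
for j in range(m):
    c_ji = c[j][ids[j]]
    q.extend(c_ji)

def mean_squared_error(a, b):
    # quantization error between the original vector and its reconstruction
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)) / len(a)

print(ids, q, mean_squared_error(x, q))
```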
I have one final thing to show you, which is how search works in PQ. On the right we have just a single ID vector. Using our codebook C, each ID maps back to its centroid, and from there we output our quantized sub-vectors; these are the reproduction values. On the left we have our query vector, and we split that just like we did before with our initial vectors. Then we calculate the Euclidean distance between each query sub-vector and its corresponding reproduction value, and we simply add all of those distances up; the total distance is all of them added together. (The "product" in product quantization refers to the subspaces forming a Cartesian product, not to multiplying the distances.) We would do that for all of our vectors; this is just a single quantized vector, a single set of quantized sub-vectors, and we need to do it for all of them, then find the one that produces the smallest total distance. It's also worth noting that because we're combining per-sub-vector distances, this isn't really the exact distance between the quantized version of u and xq, but it is almost like a proxy value for it, and that's what we use to identify the nearest vector.

I'm not going to go into the code behind the search because it's quite long, though a rough sketch of the distance-summing idea is shown below. I'm definitely going to record that and leave it as an extra, almost like a bonus video, if you do want to go through it; it's pretty long, so I wouldn't say you necessarily need to. In the next video we're going to look at how we implement all of this in Faiss, which is obviously much more efficient, and it's in that video that we'll also introduce the IVF plus PQ composite index, which, if you look at the speed here, really reduces the search time by an insane amount: it's around a 90-times increase in the speed of our search. So it's pretty cool, but that's it for this video. Thank you very much for watching, and I'll see you in the next one.
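The search code itself isn't covered in this video, but as a rough sketch of the idea just described, summing per-subspace distances between query sub-vectors and the centroids that the stored IDs point to, something like the following would work, reusing the toy variables from the snippets above. The query values are placeholders:

```python
# a toy query vector xq with the same dimensionality as x (placeholder values)
xq = [2, 7, 4, 8, 2, 1, 8, 5, 4, 3, 7, 1]

# split the query into sub-vectors, exactly as we did for x
uq = [xq[row:row + d_] for row in range(0, D, d_)]

def pq_distance(ids, uq):
    # approximate (proxy) distance between the query and one PQ-encoded vector:
    # sum the per-subspace squared distances between each query sub-vector and
    # the centroid that the stored ID points to in the codebook
    total = 0
    for j in range(m):
        c_ji = c[j][ids[j]]              # reproduction value for this sub-vector
        total += euclidean(uq[j], c_ji) ** 2
    return total

# with many encoded vectors we would compute this for every ID vector in the
# index and keep the ones with the smallest approximate distance
print(pq_distance(ids, uq))
```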
Info
Channel: James Briggs
Views: 372
Rating: 5 out of 5
Keywords: python, machine learning, data science, artificial intelligence, natural language processing, bert, nlp, nlproc, Huggingface, Tensorflow, pytorch, torch, programming, tutorials, tutorial, education, learning, code, coding
Id: t9mRf2S5vDI
Length: 29min 36sec (1776 seconds)
Published: Mon Aug 30 2021