Text Embeddings Reveal (Almost) As Much As Text

Captions
Hello there. Today we're going to look at another paper in the realm of privacy and language models; this time we turn to text embeddings. The paper is called "Text Embeddings Reveal (Almost) As Much As Text" and is by researchers at Cornell University. The general problem is this: you have a piece of text, you feed it through an embedding model, like any of the current text embedding models, and you get out a vector. This is done for various reasons. For example, in something like CLIP you embed pictures and text together so that you can match the two; in any text-to-image model you first want to embed the text somehow; and in other cases you send that vector to a vector database, which places different vectors in different locations so that you can retrieve with nearest-neighbor search: your query lands somewhere, and you retrieve everything around it. This powers the newer vector search databases, semantic search, and search across modalities.

The question is: how much information does that vector really contain about the original text? If you know anything about these search databases, you'll know that what they search over is more like the semantics behind the text, especially if you do cross-modality search, say searching images with text; they clearly embed concepts and the like. But the actual question is: if I just give you the vector, how much of the original text can you reconstruct from the embedding alone? The astonishing part of this paper is that you can get pretty far, especially for short texts: without the ability to backpropagate through the model, you are able to reconstruct on the order of 90% of inputs exactly or nearly exactly, which is quite surprising. So we'll dive in. The paper isn't too long or too complicated, but I still think it's cool to look at what they've done.

They take a look at the problem of embedding inversion, which means reconstructing the full text represented in dense text embeddings. They say a naive model, where you just train a model to take an embedding and output a piece of text, performs poorly, but a multi-step method that iteratively corrects and re-embeds the text is able to recover 92% of 32-token text inputs exactly. That multi-step method is the method they propose; they call it Vec2Text, and that's what we're going to look at today. They test on various things, including a dataset of clinical notes, and show they can recover things like full names from those notes, which is also fairly interesting: if we assume that these vectors represent concepts and things they were trained on, you would expect the grammar and the important nouns to be represented in the vector, but the fact that you can, to a degree, recover things like names is quite astonishing. As we'll see, the main enabling factors are that the text is short and that the embeddings are stored at full precision: the picture is not as clear for heavily quantized embeddings, which is often done in practice, or for noisy embeddings.
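To make the setup concrete, here is a minimal sketch of the pipeline described above: embed documents, keep only the vectors, and answer queries by nearest-neighbor search. I'm assuming the sentence-transformers GTR encoder (one of the models used in the paper), and a brute-force NumPy search stands in for a real vector database; the example texts are made up.

```python
# Embed documents, keep only the vectors, answer queries by nearest neighbor.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")

docs = [
    "The patient was discharged in stable condition.",
    "Quarterly revenue grew by twelve percent.",
    "The meeting is rescheduled to Thursday at noon.",
]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)   # shape (3, 768)

query_vec = encoder.encode(["when is the meeting?"], normalize_embeddings=True)[0]

# On normalized vectors, cosine similarity is just a dot product.
scores = doc_vecs @ query_vec
print(docs[int(np.argmax(scores))])   # nearest document by embedding
```

The point of the paper is that those stored vectors, which look like an opaque intermediate, still pin down the original documents to a surprising degree.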
Something else not covered in this paper is sparsification or dimensionality reduction; that is excluded, at least for now, from this type of work. All I mean to say is that there isn't yet evidence for every setting; this is an initial step, but it's still interesting. Their framing is mostly in terms of embedding databases. Something like Pinecone is a software-as-a-service offering: you create a Pinecone database, you send your vectors there, and then you can search over them. Notably, you only send the vectors to Pinecone, or to any of these vector databases; they don't necessarily need the rest. I'm sure they'd be happy if you sent them the rest, but they don't need it: you can just send the vectors, query with a vector, and they'll return the IDs of the documents you get back. So you never have to send the original text, and for a lot of people that gives at least a degree of privacy, if you will: sure, the service will know roughly what kinds of concepts are in there, but not the exact text. This paper is obviously questioning that: can the third-party service reproduce the initial text given only its embedding?

They say: we seek to generate text such that it is as close as possible to a given embedding. Our method, Vec2Text, uses the difference between a hypothesis embedding and a ground-truth embedding to make discrete updates to the text hypothesis. We'll look on the next page at what exactly that means, but the result is that their method can recover 32-token inputs with a near-perfect BLEU score of 97.3 and can recover 92% of the examples exactly.

So what is their method? First it's important to note what they're given: the embedding they want to invert. Imagine you are Pinecone: you get an embedding from some user who sends you their vector, and let's assume you also know which model was used to compute it. If not, it's probably one of a handful of common ones, one of the OpenAI models, Cohere, Anthropic, or one of the popular Hugging Face ones; most projects don't actually fine-tune their own embedding models, so you can reasonably assume it's something like text-embedding-ada-002. What you do with their method is start out with an initial hypothesis: an initial guess of text, which you then embed. Remember, you have access to the model that produced the embedding; you can't backpropagate through it, otherwise you could just run some sort of adversarial attack through the model, but you can use it to embed. So you start with a hypothesis, send it to OpenAI yourself, and get back an embedding, and that might land some distance away from the target. That first guess is going to be quite a crappy one; in fact, they show that simply training a model to take in an embedding and output a piece of text does not work too well, but it can serve as your initial guess, and then you have a procedure to update it. You start out with x̂₀, your initial hypothesis, and you embed it to get ê₀.
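Here is a minimal sketch of what that attacker interface looks like under these assumptions: query access to the same encoder the victim used, but treated as a black box with no gradients. The GTR encoder again stands in for whatever hosted API was actually used, and the strings are made-up examples.

```python
# The attacker's view: an embedding oracle plus the stored target vector.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")

def embed(text: str) -> np.ndarray:
    """Black-box oracle: text in, normalized embedding out. No backprop through this."""
    return encoder.encode([text], normalize_embeddings=True)[0]

# What the vector database stores; the attacker only ever sees this vector.
target = embed("patient John Smith was discharged on Tuesday in stable condition")

# An initial, crappy hypothesis and how close its embedding already is to the target.
hypothesis = "a note saying a patient went home"
print(float(embed(hypothesis) @ target))   # cosine similarity, since both are normalized
```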
Then you observe the difference, and I mean that literally: the vector difference between the embedding you got from your hypothesis and the actual embedding you want to reach, and you take that into account when generating the next hypothesis. The figure on the right shows what goes in: the target embedding, the previous hypothesis, and the embedding of the previous hypothesis; in fact, as they describe later, also the difference between the two embeddings. Sure, the network could compute that difference internally, but you make its job easier by computing it beforehand. From that information the network can update its guess, changing the previous hypothesis into a new one. So the previous hypothesis, its embedding, the target embedding, and the difference between the two all go into a neural network, a T5-based Transformer, which gives us the next hypothesis, x̂₁. We can then use the embedding model, to which we have query access, to produce the next embedding, and we just do the whole thing again: take the new embedding of our hypothesis, the hypothesis itself, the target embedding, and the difference, and from all of this create x̂₂, the new hypothesis, which by embedding gives us the new embedding. We can even do cheeky things such as only accepting the new hypothesis if its embedding is actually closer to the target than the previous one; if not, we go back and try to come up with another hypothesis. So it's an iterative procedure that gets closer and closer to the target.

You could probably achieve the same thing by editing text more or less at random, checking the embedding each time, and walking closer step by step; it would just take a lot longer. The underlying hypothesis is that if your distance in embedding space is small enough, you will have reconstructed the text; the hidden assumption being that there are essentially no collisions in embedding space, which is probably fair as long as your input is somewhat grammatical, somewhat in distribution for the embedding model, and not just random nonsense. So that's the procedure; a toy version of the loop is sketched below. Now, the easy part is that we can readily train a model that produces the initial guess: take any text dataset, take samples of text, use the embedding model, which you have access to, to compute their embeddings, do that a whole bunch of times, and you've got yourself a dataset of embeddings with corresponding text inputs; at that point you simply train a model to take in the embedding and give you back the piece of text. The hard part is obviously the editing model: the model that takes the previous hypothesis, the current embedding, and the target embedding, and comes up with a new hypothesis.
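Here is a toy version of that iterative loop, just to pin down the control flow. Instead of the trained T5 corrector, the "correction" is a random single-word edit, and an edit is accepted only if it moves the embedding closer to the target; the real Vec2Text corrector conditions on the target embedding, the hypothesis, its embedding, and their difference, and is far more directed. Names like `propose_edit`, the word list, and the example texts are mine, not the paper's.

```python
# Random-edit hill climbing toward a target embedding: a toy stand-in for Vec2Text's loop.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")

def embed(text: str) -> np.ndarray:
    return encoder.encode([text], normalize_embeddings=True)[0]

def propose_edit(hypothesis: str, rng: random.Random) -> str:
    """Toy corrector: swap one word at random (the paper uses a trained T5 here)."""
    vocab = ["patient", "discharged", "stable", "condition", "tuesday",
             "the", "was", "on", "in", "john", "smith"]
    words = hypothesis.split()
    words[rng.randrange(len(words))] = rng.choice(vocab)
    return " ".join(words)

def invert(target: np.ndarray, initial_guess: str, steps: int = 200) -> str:
    rng = random.Random(0)
    best, best_sim = initial_guess, float(embed(initial_guess) @ target)
    for _ in range(steps):
        candidate = propose_edit(best, rng)
        sim = float(embed(candidate) @ target)
        if sim > best_sim:                 # accept only if we moved closer to the target
            best, best_sim = candidate, sim
    return best

target = embed("the patient was discharged on tuesday in stable condition")
print(invert(target, "a note saying a patient went home"))
```

The accept-if-closer check is the part the transcript calls the "very accurate guiding criterion"; the trained corrector only has to propose better directions than random edits do.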
Ideally that new hypothesis is then better than the old one. They formulate this out, and I'm going to skip over a bunch of my notes here. (By the way, people ask about these notes; they're a Patreon perk, though I don't know who really wants my notes; the papers themselves are obviously all available and linked in the description.) They say: if we have a target embedding and want to generate text that matches it, this can be split up across the sequence of hypotheses: at every step we take the previous hypothesis and edit it into the new one, so the objective essentially marginalizes over the intermediate hypotheses. There are two components: one produces the initial hypothesis, and one is the editing model.

They say: we train the model by first generating initial hypotheses from the model in Section 3.1, which is just the simple base inversion model I described before, and then training a model on this generated data. So all they do is the following. Take a set of embeddings and use the base inversion model to create approximate texts from those embeddings; these are the initial guesses, the x̂₀'s (I'll call them t₀). Since you know what the original text was for each embedding, you now have everything you need: the original text, the initial hypothesis, and, by embedding the hypothesis, ê₀. With that you can train a model for the first step of the procedure: it takes the target embedding, t₀, ê₀, and possibly the difference between the two embeddings, and it is trained to output T, the original text. You could imagine continuing this: call that model M₁ (and the base model M₀). You can then take the original text and the original embedding, feed the hypothesis through M₁ to get t̂₁ and, from it, ê₁, and create a dataset to train M₂, a model that does the second step of the whole process; and you could do that for as many steps as you like, creating a separate model for each step of the inversion process.
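As a sketch, here is what the data generation for the base inversion model boils down to under these assumptions: embed a corpus with the same encoder and keep the (embedding, text) pairs. The actual seq2seq training, where a T5 conditions on the embedding by projecting it into its encoder input, is omitted; the corpus below is a made-up stand-in for the millions of passages (e.g. from MS MARCO) the paper uses.

```python
# Build (embedding, text) pairs for training a base inversion model.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")

corpus = [
    "the patient was discharged in stable condition",
    "quarterly revenue grew by twelve percent",
    "the meeting is rescheduled to thursday at noon",
]  # in practice: millions of passages from the retrieval corpus

embeddings = encoder.encode(corpus, normalize_embeddings=True)
pairs = list(zip(embeddings, corpus))   # training data: embedding -> text

# For the correction model you would additionally run the base inverter on each
# embedding to get an initial hypothesis t0, embed it to get e0, and train on
# (target e, t0, e0, e - e0) -> original text.
print(len(pairs), pairs[0][0].shape)
```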
We've already seen things like this in diffusion models, image diffusion models and so on, where people have helped themselves by saying: instead of training one model per step, how about we train a single model that takes all the inputs plus a parameter t indicating the time step, t = 0 for the first step, t = 1 for the second, and so on; generate one giant dataset and train shared parameters, with t just being a conditioning input. That works for diffusion models. In this paper they go even simpler and say: whatever, we just train that first step. They generate the initial hypotheses and train one model to go from the initial hypothesis to the original text directly, and that turns out to be enough; even for hypothesis number five, they reuse that same model. Obviously that doesn't really match the distribution you'll see later on in the multi-step procedure, but it seems to be good enough for them, and it works.

The sneaky advantage you have here is that you know the target, so you can always check whether the embedding of the new step is closer in embedding space than the old one and reject it if not. You have a very accurate guiding criterion helping you, and all the editing model has to do is provide variation: instead of walking around randomly in text space, it gives you an approximate direction, but you always have a solid criterion for rejecting a step or moving forward. So this model doesn't need to be super-duper good; it just needs to cull down the search space enough, and that seems to happen. That's what this paper does: it trains a one-step procedure to reconstruct the original text and then uses it to nudge the text iteratively in a good direction. According to their experiments, what's really important is that the embedding of the current hypothesis is part of the input: once you generate a hypothesis, you actually embed it, so you know where in embedding space you currently are and what the difference to the target embedding is, and that helps a lot in coming up with the next hypothesis. That is almost all of the method; there's a paragraph on exactly how they format the embeddings so they fit into a Transformer, but to me that's a technical detail.

Now, the results we've already touched on in the introduction: for datasets such as MS MARCO and Natural Questions, and for models such as GTR and the OpenAI API, they are able to reconstruct a ton of text. You can see 92% exact match and a 97 BLEU score; that's 50 steps of reconstruction plus beam search, and beam search tends to help a lot too. Again, it's a search problem with a very accurate estimator of whether you're on the correct path; all you need is something to guide you roughly in the correct direction, and when you're able to reject steps that move you away, and you have a beam search that lets you explore different paths to your goal, that is massively helpful.
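Just to illustrate that last point, here is a compact sketch of beam search guided by embedding similarity. The candidate generator is again a toy single-word edit (in the real system candidates come from the trained corrector's own decoding), and all the names and texts are mine.

```python
# Beam search over text hypotheses, scored by cosine similarity to the target embedding.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")
embed = lambda t: encoder.encode([t], normalize_embeddings=True)[0]

def expand(text: str, rng: random.Random, n: int = 8) -> list[str]:
    """Toy candidate generator: n variants with one word swapped (hypothetical)."""
    vocab = ["meeting", "rescheduled", "thursday", "noon", "the", "is", "to", "at"]
    out = []
    for _ in range(n):
        words = text.split()
        words[rng.randrange(len(words))] = rng.choice(vocab)
        out.append(" ".join(words))
    return out

def beam_invert(target: np.ndarray, seed: str, width: int = 4, rounds: int = 10) -> str:
    rng = random.Random(0)
    beam = [seed]
    for _ in range(rounds):
        candidates = set(beam)
        for hyp in beam:
            candidates.update(expand(hyp, rng))
        # keep the `width` candidates whose embeddings are closest to the target
        beam = sorted(candidates, key=lambda t: float(embed(t) @ target), reverse=True)[:width]
    return beam[0]

target = embed("the meeting is rescheduled to thursday at noon")
print(beam_invert(target, "a meeting about something happening soon"))
```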
So it is surprising how much information there is in these embeddings, but at the same time it's also not that surprising that, if the information is there, a method like this will be able to reconstruct it. Still, the numbers are pretty high, and that's astounding. It works, though a bit less well on the MS MARCO dataset and with the OpenAI API. I'm not exactly sure how many dimensions GTR has, and something this paper doesn't really investigate is how the dimensionality of the embeddings influences all of this, which would be interesting to know. What it does investigate a little bit is the length of the sequences. Here you always have 32-token sequences, and here, where the dataset has longer texts on average, you can see that reconstruction performance drops dramatically: you can still reconstruct things, but it drops, even though the cosine similarity between the reconstructed and target embeddings stays very high. That tells you that for longer sequences the exact text maybe isn't represented in these embeddings as much as it is for short text.

Then there's this table, which is out of distribution: for the previous table they trained on those datasets, or on their training sets as far as I'm aware, while this one is on data they haven't trained on, and the performance is still very good; where the base model achieves 36 BLEU, Vec2Text achieves 95 BLEU, and so on. As you go down the table and the sequences get longer, the performance drops, not monotonically, but degrading as the sequences get longer, which is expected, and it's still much, much higher than the simple base model trying to one-shot invert the embeddings; the gap between one-shot and multi-step is even more pronounced for long inputs. In domain they recover a BLEU of 77 in just five rounds of correction; out of domain they are still able to exactly recover 66% of examples. These are really high numbers, so don't get me wrong here. They also do an investigation into clinical notes, and what I found interesting is that they were able to recover 94% of first names, 95% of last names, and 89% of full names, while recovering 26% of the documents exactly. So even things like names, which you would expect not to be particularly well represented in these embedding vectors, can be recovered with quite high accuracy.

One thing that could shed a bit of light on what's going on is a really cool investigation they did: what if we add noise to these embeddings? Can we thereby destroy the ability of an attacker to reconstruct the exact text? Obviously the answer is yes, depending on the level of noise: turn the noise up enough and at some point the embeddings no longer carry the information needed to reconstruct the text. However, you have to select a level of noise at which the embeddings are still useful for the task you want to do with them, and in the framing of this paper that task is vector retrieval. So they ask: is there a level of noise where reconstruction potential is low but retrieval quality is still high? And the answer is yes.
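Here is a small sketch of that defence mechanism under my own simplifications: add Gaussian noise of increasing scale to an embedding, renormalize, and watch how the cosine similarity to the clean vector and the nearest-neighbor result behave. The paper measures actual retrieval and reconstruction metrics; this only shows the mechanics, again with the GTR encoder and made-up texts.

```python
# Add Gaussian noise to an embedding and check similarity plus nearest-neighbor behavior.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")

docs = [
    "the patient was discharged in stable condition",
    "quarterly revenue grew by twelve percent",
    "the meeting is rescheduled to thursday at noon",
]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)
clean = doc_vecs[2]

rng = np.random.default_rng(0)
for lam in [0.0, 0.01, 0.1, 1.0]:
    noisy = clean + lam * rng.standard_normal(clean.shape)
    noisy /= np.linalg.norm(noisy)
    cos = float(noisy @ clean)
    nearest = int(np.argmax(doc_vecs @ noisy))   # crude proxy for retrieval quality
    print(f"noise={lam:<4} cos(noisy, clean)={cos:.3f} nearest doc index={nearest}")
```

The interesting regime is the one where the nearest neighbor is still correct but the fine-grained detail of the vector has been washed out.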
You can see that at an intermediate level of noise the reconstruction ability drops drastically while retrieval quality stays high, and only later does retrieval suffer as well. What I imagine this means is that the lower-frequency components of these embedding vectors, their more meaningful components if you will, largely capture the retrieval-relevant semantic content, the things that actually make a difference in meaning, whereas the higher-frequency components may be used to store the minute details of the exact sentences. We've seen this with image models too, in work like "Adversarial Examples Are Not Bugs, They Are Features" and similar papers: many places have reported that models use different parts of the frequency spectrum, or of the parameter space, to encode different types of information, and when you start pruning, quantizing, or adding noise, you don't destroy information uniformly. That seems to be the case here: the high-frequency, minute details in the embeddings probably represent the exact text, things like which exact token was used where, and if you destroy the high-frequency part, which is roughly what adding Gaussian noise does (intuitively it's like convolving with a Gaussian kernel, smoothing everything out a bit), you lose those details first. I find that quite interesting. They make several other interesting investigations too.

The last one I'll mention is that they experiment with the initial hypothesis and find that even if they start from random tokens, or just the word "the", or some arbitrary sentence they found somewhere, it works essentially as well as generating the initial hypothesis with their base model. The iterative procedure escapes very quickly from very bad initializations, which again suggests they trained the correction model well, and that it's completely fine, at least in this case, to only train the one-step model. They also show a few examples, and they have a good section on limitations. One is adaptive attacks and defenses: adding noise is a well-explored topic in the adversarial-examples literature; if you run a service that classifies images, you can add a bit of noise, still classify correctly, and make adversarial examples less effective, but if an attacker knows you're doing that, they can take it into account when crafting their examples and again be effective. This paper doesn't investigate what happens if an attacker who wants to reconstruct text takes into account that there could be noise on the vectors. Beyond that, the dimensionality of the embeddings isn't really explored, and length only to a small degree. And if you think about it with a bit of back-of-envelope math: they use 32-token sequences, and a typical vocabulary has around 32,000 entries, which is about 2^15, so indexing a single token takes roughly 15 bits; a worked version of this counting argument is sketched below.
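As a rough sketch, here is that counting argument carried out in a few lines. The vocabulary size of 32k, the 1,536-dimensional embedding (as for ada-002), and the hypothetical 4-bit quantization are just illustrative round numbers, not anything the paper commits to.

```python
# Back-of-envelope: bits needed to store the exact text vs. rough embedding capacity.
import math

vocab_size = 32_000
emb_dim = 1_536
bits_per_token = math.log2(vocab_size)           # ~15 bits to index one token

for seq_len in [32, 64, 128, 1024]:
    bits_needed = seq_len * bits_per_token       # information in the exact token sequence
    dims_per_token = emb_dim / seq_len           # embedding capacity per token
    bits_at_4bit = emb_dim * 4                   # capacity if each dimension held 4 bits
    print(f"len={seq_len:>5}: need ~{bits_needed:,.0f} bits, "
          f"{dims_per_token:.1f} dims/token, "
          f"4-bit-quantized vector holds {bits_at_4bit:,} bits")
```

At 32 tokens the (even coarsely quantized) vector has far more capacity than the text needs; by 1,024 tokens the relationship flips.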
So a 32-token sequence carries roughly 32 × 15 ≈ 480 bits of information. The OpenAI ada-002 model has 1,536 dimensions, that's 64 × 24, and each of those is a float; but let's say, to be reasonable, we quantize each dimension to 4 bits, so 16 possible states per dimension. Quantization, by the way, is also something the paper doesn't explore, and it would be really interesting to look at. Even then, the vector can hold 1,536 × 4 = 6,144 bits, far more than the roughly 480 bits of the text. You can also look at it per token: 1,536 / 32 = 48 dimensions per token, so every token has 48 dimensions available to represent it, and all you need is to index one of about 32,000 vocabulary entries, about 15 bits, while you have 48 floating-point dimensions to do it with. Now what if the length is 64? Then this divides by two and you only get 24 dimensions per token. At 128 you're down to 12 dimensions to represent a 15-bit index, and that's where it starts to get critical; that's probably why reconstruction starts to suffer around those lengths, because if these dimensions were binary you could no longer represent everything. They're still floats, but as you go up in sequence length the per-token capacity keeps shrinking, and by 1,024 tokens you're left with only one or two dimensions per token, and with one or two floating-point numbers you can no longer cleanly represent the exact token identity. You have to start compressing, and that's probably where you lose the ability to reconstruct exactly what was there.

Sorry for the slightly shady math, but my point is essentially this: at 32 tokens you have plenty of dimensions per token to represent its index into the vocabulary, while the longer the sequence gets, the more inconceivable that becomes, and the more the model has to rely on actually learning grammar, frequencies, concepts, and so on to bring the loss down during training. All right, math rant over. That was it from me for this paper; there's lots more information in it, so give it a read, and I'll see you around. Bye-bye.
Info
Channel: Yannic Kilcher
Views: 39,426
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper
Id: FY5j3P9tCeA
Length: 37min 5sec (2225 seconds)
Published: Sat Dec 09 2023