Extracting Training Data from Large Language Models (Paper Explained)

Captions
Hi there. Today we're looking at "Extracting Training Data from Large Language Models", which appears to be a big collaboration between corporations and academic institutions; there are almost as many affiliations as there are authors. It's a pretty cool paper.

The high-level topic is that the authors take trained large language models, as the title says, and are able to extract training data from the trained model alone, in fact from nothing but black-box access to it. And not only can they extract training data, they can extract pieces of training data verbatim that appeared only very few times in the training set, which is what they call a form of memorization, and they do it with a pretty clever attack. If you look at the prime example, they query GPT-2, one of these large language models, and it outputs this piece of text; the blacked-out parts were added by the authors to protect the privacy of the individual, but this is a real piece of text that they actually got out, and you can verify that.

Needless to say, this has consequences for security and privacy: if you train one of these models on, say, internal or private user data, you have to worry that the model will just output that data again on the other end and potentially leak information. This hasn't been much of a problem as long as we were just training image classifiers, but here, especially with only black-box access, it seems to have real consequences.

So we'll go over the paper, over the attack the authors devise, which I think is pretty clever, over the results they get from using it on GPT-2, and then over my opinion of the paper. I can already tell you my ultimate opinion: the attack is cool and the concerns are valid, but the paper is probably written a little more scarily than it ultimately turns out to be. In fact I find the actual results fairly okay, fairly promising and straightforward, not that scary. The paper is also interesting from another perspective, namely what it tells us about these language models and how they work; it strengthens a number of hypotheses I put forward in my video about GPT-3, and that's also cool to see. So let's jump in, and as always, if you like content like this, don't hesitate to share it out, or subscribe if you haven't yet.

The abstract says it has become common to publish large, so billion-parameter, language models that have been trained on private datasets, and this paper demonstrates that in such settings an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. So we already have quite a bit of information. Large language models have of course been trending, especially since GPT-3, but at least since the advent of the transformers, BERT and so on, though BERT isn't exactly a language model.
Language models are models that, given a piece of text, predict the next word, or rather a probability distribution over the next word. So if the input is "a cat sat on", the language model gives you a probability distribution over the next word: the next word might be "the", it might be "a", it might be "next", and so on, and the model tells you how likely each of those is. You can sample from that distribution, choose one of the words and continue, and you can also evaluate the likelihood of entire sequences.

GPT-3 is one of those large language models, and since these models are large, we know they also need a lot of data to be trained on. So a large language model takes a giant database of training data, usually scraped from the internet; it's far too much to be curated by humans, they just let scrapers run over the internet. That data is used to train the model, GPT-2 in this case, and GPT-2 is then the trained model: you essentially throw the training data away and say, this is our model, we're going to publish it.

Now the problem is if there is a piece of data in there that is somehow secret. You might think it's just one piece of data, how much can go wrong? The problem is if I can inspect GPT-2 and recover that exact piece of training data, so that GPT-2 will output it verbatim. That is a problem. The authors make some good points here: this notion of a piece of training data, and what it means to memorize and to extract one, is fairly fuzzy, and they go quite a bit deeper in the paper with rather strict definitions.

They say: we demonstrate our attack on GPT-2, a language model trained on scrapes of the public internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include public personally identifiable information, so names, phone numbers and email addresses, as you saw on the right, as well as IRC conversations, code, 128-bit UUIDs, and so on. You can already see how this can become a problem. They further say: our attack is possible even though each of the above sequences is included in just one document in the training data.

On this notion of memorization and when it is dangerous, they correctly say it is only dangerous if the training example is contained in, let's say, only one piece of training data. If something is contained in thousands of pieces of training data, it's fine to memorize it: if the name of some famous person is memorized, or the fact that the president of the USA lives at the White House, that is not a secret, so it is okay if your language model remembers it, because it probably occurs in many training data points. However, if something is contained in just one document and the model remembers it, then that is true memorization: the model is probably not learning anything from that data point, it is simply memorizing it to make its training loss lower.
That's the case on the right here. Though, as I said, it's written a bit more scarily: they don't exactly say that this name and phone number are contained in just one document, and this is, after all, the public internet; GPT-2's training data was scraped from the public internet. So here is my first investigation into this: of course you can google it, and you will find it. Even the blacking-out here is, I think, a little gimmicky, because I don't see a problem with disclosing this particular piece of information, and I'll show you why. When you search for it, you find the NIST homepage, the Cryptographic Algorithm Validation Program, and you find that this is the description of a software implementation. The personally identifiable information is a corporate address, and the contact information is a corporate contact: a corporate email address, a corporate phone number, and so on. That is the exact text from the example.

And with respect to it being present only once in the training data: if you complete the name and search for it, you find many, many results. Now, I don't know how many of those results are actually in the GPT-2 training data; nobody knows that except OpenAI. There are two Google pages of results, but Google has deduplicated some of them, and if I click on "all" there are 9,000 results, and they are not all the same. If you look at a bunch of them, they are almost the same, but at the bottom the text changes, so depending on your scraper these all count as separate websites. Therefore I'm not so sure this particular piece of information is contained only once, plus it is a corporate contact. So, to my point: the paper might be written a bit more scarily than it ultimately turns out to be.

Though you do have to make two separate points here. Yes, this particular piece of information might be presented a bit more scarily and gimmicky with the blacked-out text. However, the paper has a point: if you, as a company, do this on internal data, it might very well be that something like this happens to you internally, and they do have examples where they reproduce data from just one document. Maybe in your internal document base you quasi-duplicate a document with the same information over and over, it never gets de-duplicated, and then your language model memorizes it. So the paper does have a point; that's what I'm trying to say, and I hope that's clear.

All right, we'll get to the results in a bit; I hope I've already given you a taste of what to expect. First they go into the definition of language models. A language model is framed here simply as a model that gives you the probability of a sequence of text in a stepwise fashion: always the probability of the next word given the previous words, and you can evaluate that. The access to the model that they assume is access to, let's say, the logits, or the output distribution, of the model.
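To make that framing concrete, here is a minimal sketch (my own, not from the paper) of querying GPT-2's next-token distribution and sampling a continuation, using the public Hugging Face transformers package; the prompt and token counts are arbitrary illustrative choices.

```python
# Minimal sketch: GPT-2 maps a prefix to a probability distribution over the
# next token, which you can inspect, sample from, and use to score sequences.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("A cat sat on", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                      # (1, seq_len, vocab_size)

# Distribution over the next word after the prompt.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}: {p.item():.3f}")

# Sampling a continuation token by token.
out = model.generate(ids, do_sample=True, max_new_tokens=20,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
```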
They use GPT-2 because it's trained on a large amount of text, it's practical to evaluate, not as slow as GPT-3, and it's publicly available. The training data of GPT-2, however, is not publicly available, but they do have someone from OpenAI on the paper, and they could query that person to check whether a given piece of text that they find is or isn't in the training data of GPT-2. So the OpenAI person essentially acts as an API for the training data.

They then define their attack, and they do quite a bit of work to set this up cleanly. The first point is this notion of memorization. They say there are many ways to define memorization in language modeling, and in this particular work it is okay to memorize some things: language models must, for example, memorize the correct spelling of individual words, because words are made of word pieces and the language model needs to output them. That's fine; indeed there is an entire area of research that analyzes neural networks as repositories of memorized knowledge. For example, when GPT-2 is prompted to complete the sentence "My address is 1 Main Street, San Francisco CA", it generates the next token "94107", a correct zip code for San Francisco, California. They say that while this is clearly memorization in some abstract form, they aim to formalize their definition of memorization in order to restrict it to cases that we might consider unintended.

So memorization as such isn't bad; what is bad is what they call eidetic memorization of text, which is when the model memorizes something that appears only very few times in the training data. They first define what it means for a model to have knowledge of a string: a model f knows a string s if s can be extracted by interacting with the model, so if you can input whatever you need to input and the model outputs s, you say the model knows s, and if s is a piece of training data, you say the model has memorized it. A string is extractable from a language model if there is a prefix, the prefix being the input to the model, such that if you feed it to the model, the output will be that string. Then they define k-eidetic memorization: a string s is k-eidetic memorized by a language model f if s is extractable from f, which is the memorization part, and s appears in at most k examples in the training data. So if the address of that person appeared only twice, but you could extract it verbatim from the language model, that would be an example of 2-eidetic memorization, because k would be two.

They are not entirely clear about what they mean by "examples" in the training data, because this training data is usually chunked to fit into the language model, but I think they do this on a document basis, so one document counts as one example and a different document counts as a different example.
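In symbols, the two definitions discussed above look roughly like this (my paraphrase of the paper's wording, written out for reference):

```latex
% Extractability: there is some prefix p from which the model generates s.
s \text{ is extractable from } f_\theta
  \iff \exists\, p:\; f_\theta \text{ generates } s \text{ from prefix } p

% k-eidetic memorization: s is extractable AND s occurs in at most k
% examples (documents) of the training set X.
s \text{ is } k\text{-eidetic memorized by } f_\theta
  \iff s \text{ is extractable from } f_\theta
  \;\wedge\;
  \bigl|\{\, x \in X : s \text{ appears in } x \,\}\bigr| \le k
```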
For example, take the IRC conversations they are able to extract, or rather the usernames from those IRC conversations. The usernames might appear hundreds or thousands of times, because the users chat with each other, and it will all be in one document, but the document may be so long that it actually gets chunked into different training data pieces; I don't know exactly what counts as one example here. But for sure that piece of text can appear more than once, even if it is only in one "example", and in fact they analyze this situation later.

All right, so we've defined k-eidetic memorization; that's what we're looking for, and the problematic regime is when k is very small. The extreme is k equals one: one piece of training data contains a string, and we can extract the string from the trained language model. They also say that for any given k, memorizing longer strings is intuitively more harmful than shorter ones, which makes sense. They even go into pathological corner cases: many language models, when prompted with "repeat the following sentence" followed by a sentence, will do so correctly, which technically allows any string to be "known" under their definition. Of course they don't do that: they assume they don't know the training data, so they can't just ask the model to repeat it. But you do see that it is actually fairly hard to even define the problem, even though we as humans have an intuition of what it means for a language model to do unintended memorization.

The adversary's objective here is to extract memorized training data from the model. The strength of the attack is measured by how private, so how k-eidetic, a particular example is: stronger attacks extract more examples in total and examples with lower values of k. They say: we do not aim to extract targeted pieces of training data, but rather indiscriminately extract training data; while targeted attacks have the potential to be more adversarially harmful, our goal is to study the ability of language models to memorize data generally, not to create an attack that can be operationalized by real adversaries to target specific users. So they simply want some training data; they don't really care what it is, and they're going to search for the training data that's easiest to get.

They frame it as not devising an attack that can target individual users, and there is a real difference there. If you had to guess the password of one particular user, that would be fairly hard. But if you only had to guess a password used by any user, that's fairly easy, even if you discard the fact that many people literally use "password" as their password: if people sampled words uniformly from the dictionary as their passwords, you'd still have a decent chance of recovering some password. Likewise for other not-super-high-entropy things like credit card numbers: you'd have a decent chance of figuring out some credit card number just by guessing. That's the regime we are in here, and it's an entirely different regime from trying to attack individual users.
Essentially, what they're going to do is this: there is training data, and from some of it these models can extract a pattern. That's what we do with machine learning: this data has some pattern, that data has some pattern, and the model learns to abstract from the training samples. But then there is a data point that doesn't really fall into any of these categories, so what the model does is say: well, this is its own little group, I can extract a pattern from here and from here, but not from this one; still, I need to get my loss down, so I'll just remember that individual piece of training data. And that is exactly what you can recover with this sort of attack: the individual pieces that don't have anything close to them, where there is no real pattern, so the best the model can do is remember them.

It does not mean that with this attack you can get out any particular piece of data. If your personally identifiable information falls into some kind of regular pattern, it is likely to be safer against an attack like this. That's also why they are able to extract things like UUIDs or URLs with random strings in them: random strings have no pattern, so they sit far away from the other training examples, and the best the model can do is actually remember them rather than extract a pattern. The other example, the personally identifiable contact information, I believe is memorized simply because it appears so many times, not because there is no pattern: if it appears that often, why should the model extract a pattern when it can just remember it, like a famous person's name? From the model's point of view, an address that appears that often seems important. So again, the attack extracts indiscriminately; it doesn't mean it can be leveraged to get any chosen training sample back. It's still worrisome, but you have to take that into account.

Another thing that really sticks out in this paper is the amount of hedging it does, almost in every paragraph, and certainly in every subsection: hedging about why it is okay to publish this research, and so on. For instance, they say: our attack target is GPT-2; we select GPT-2 as a nearly perfect target from an ethical standpoint: the model and the data are public, so any memorized data that we extract is already public. And they do this throughout the text. In my video about broader impact statements, this was exactly my point: these are large corporations, with many authors, and I think a fair amount of work went into framing this research such that it can't easily be attacked by people concerned with the ethical considerations of releasing research like this, and this is clearly research that could be leveraged for bad, if you will. But since these companies have a lot of resources and can put many people on it, they can devote a fair amount of work to framing the problem such that any criticism
can be mitigated, whereas if some lonely PhD student had done the exact same research, I very much doubt it would have been received as well as this paper. In my opinion, as I already said in that video, this just shifts a bit more power to the large institutions that can afford this kind of framing; they don't have to change anything about their research, but the rest of us do. All right, rant over.

So they do this in two steps, and they have a diagram. In step one they query the model: they have different queries, but essentially they just generate lots of data from the model. Then they select a subset that they think could be memorized training examples, they de-duplicate, select again, and then they check. It's a fairly simple workflow: step one, generate a bunch of data that you think could be memorized; step two, check whether you can find these samples on the internet, because all of GPT-2's training data comes from the internet. If you can find a sample on the internet verbatim, that almost certainly means GPT-2 has memorized it: the likelihood that it "remembers" a UUID verbatim that wasn't in its training data is essentially zero. This verification goes by manual internet search, so respect to these authors for doing that.

They start out with a fairly weak baseline: they generate a large quantity of data by unconditionally sampling, and then they predict which outputs contain memorized text by simply analyzing the likelihood. Whatever text the model finds highly likely, they suspect could be memorized, because if you give a model training data and ask it to reduce its loss on that data, it will assign the highest likelihood to the training data; that's just how these models work. So they assume that if a model assigns high likelihood, or equivalently low perplexity, to a sequence, the model is not very surprised by it and has assigned on average a high probability to each subsequent token, and if that happens, the sequence could be memorized.

This is obviously a very simple baseline, and they say so: this simple baseline extraction attack can find a wide variety of memorized content; for example, GPT-2 memorizes the entire text of the MIT public license, as well as the user guidelines of Vaughn Live, an online streaming site. While this is memorization, it is only k-eidetic memorization for a large value of k: these licenses occur thousands of times. The most interesting examples include the memorization of popular individuals' Twitter handles or email addresses; in fact, all memorized content identified in this baseline setting is likely to have appeared in the training data set many times. So it doesn't really work to just sample and look at what's most likely: yes, that will be memorized, but it is a non-problematic form of memorization, famous people's Twitter handles, famous people's names, that sort of thing.
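As a hedged sketch of that baseline (my own code, with illustrative sample counts, not the authors' implementation): sample unconditionally from GPT-2 and rank the generations by the model's own perplexity, keeping the least surprising ones as memorization candidates.

```python
# Sketch of the baseline attack: unconditional sampling + perplexity ranking.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss      # mean negative log-likelihood
    return math.exp(loss.item())

def sample_unconditionally(n: int, length: int = 256) -> list:
    # Start every sample from the end-of-text token, i.e. no prompt at all.
    start = torch.full((n, 1), tokenizer.eos_token_id, dtype=torch.long)
    out = model.generate(start, do_sample=True, max_new_tokens=length,
                         pad_token_id=tokenizer.eos_token_id)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in out]

samples = sample_unconditionally(100)           # the paper uses far more
candidates = sorted(samples, key=perplexity)    # lowest perplexity first
```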
So now they go about improving it, and they improve both steps. They improve step one by doing one of two things. The first is to let the temperature decay: when you sample from the model you sample with a temperature, and you can decrease it over time, so at the beginning you let the model explore a bit and then you narrow it down. The goal of changing step one is to create a more diverse set of generations: you sample with a high temperature at the beginning and then decrease it, so you still end up with high-likelihood sequences, but you get different ones, because you start off differently and then move into the high-likelihood regime. The second way is to go to the internet again: they go to the world wide web, grab pieces of text, take a tiny substring from a website, and use that as the input to the model. That also gives more diverse predictions: if you input a short prefix found somewhere on the internet and let the model continue, you get a wide variety of text. That's how they increase the number of different samples the model generates, because in the initial experiments they found that the model outputs the same things over and over again if you simply query it unconditionally. So: either a decaying temperature, or conditioning on internet text.
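Here's a rough sketch of those two step-one changes as I understand them; the decay schedule, token counts and the placeholder web prefix are my own illustrative choices, not the paper's exact settings.

```python
# Sketch: (a) sampling with a temperature that decays over the first tokens,
# (b) conditioning on a short prefix that, in the real attack, would come
# from scraped web text (here just a placeholder string).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sample_with_decaying_temperature(prefix: str, n_tokens: int = 64,
                                     t_start: float = 10.0, t_end: float = 1.0,
                                     decay_steps: int = 20) -> str:
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    for step in range(n_tokens):
        frac = min(step / decay_steps, 1.0)          # linear decay schedule
        temperature = t_start + frac * (t_end - t_start)
        with torch.no_grad():
            logits = model(ids).logits[0, -1] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0])

web_prefix = "some short snippet scraped from a random web page"  # placeholder
print(sample_with_decaying_temperature(web_prefix))
```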
The second step is what I find the clever step. Before, they simply said: whatever has high likelihood, that's what we think is memorized. But of course a lot of those samples will not be memorized with low k; a lot of them will simply have high likelihood because they are actually likely text. So when are we in the problematic situation? Let's say here is our dataset, and in it the MIT public license appears an enormous number of times, so that data point is ginormous, and over here is our outlier data point. The model will extract patterns from the regular data, it will assign essentially a single representation to the MIT public license because it appears so often, and it will also assign a single representation to the outlier, just because it is such an outlier. So how do we devise a scheme that finds the outlier reliably, but recognizes that the memorization of the MIT license is okay, and does so without access to the training data?

If a human looks at it, of course, the MIT public license seems common; we know it's common, we know it's highly likely text because it's a license that appears almost everywhere. And if a human sees the name and address of a person, or a credit card number, we know that's not highly likely text. That is sort of the answer here: but what is a human? Among other things, a human is just another language model, another thing that has an intuition of how likely text is. So the basis of their approach is the following: take a second dataset, sampled in the same fashion, also from the internet, but not in exactly the same way; in fact they use Common Crawl instead of the Reddit outbound links that GPT-2 used. In that second dataset you'll have other data points, maybe some duplicated from the first, and since you're also sampling broadly from the internet, you're going to have the MIT public license many times, and you're going to have your own outliers. The important part is that if you sample in the same fashion but a bit differently, you're probably not going to have that same outlier in your new dataset. So the new dataset yields the same general patterns and the same MIT-license cluster, because it's copied over and over even if it comes from other documents, but not that particular outlier.

So to differentiate the two cases, you can consider a second language model. You have two things that the first language model considers super likely: the MIT license and the outlier. You ask the second language model, and it says: yes, the MIT public license, I also consider that super likely; but this outlier over here, I've never seen that, it seems very unlikely. So by the ratio of the likelihoods under the two models, you can find samples that the first model finds super likely but the second model doesn't, and that is exactly the trick they use; in fact they use many instances of it. Here are the strategies: "Perplexity" is simply what they used before, whatever is likely is probably memorized, yes, but often memorized justifiably. Then there are the "Small" and "Medium" strategies, which are the ratio of the log-perplexities of the largest GPT-2 model, the one they attack, and the small or medium GPT-2 models. And this ties into the fact that you don't even need a model trained on different data: you can simply compare against a smaller version of the same model trained on the same training data.
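A hedged sketch of that filter (my code, using the public Hugging Face GPT-2 checkpoints; the exact way the paper combines and thresholds these scores may differ):

```python
# Sketch: score candidates by the ratio of log-perplexities between a small
# reference model and the large attacked model. A high ratio means the small
# model is "surprised" by text that the large model finds very likely.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
large = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()   # attacked model
small = GPT2LMHeadModel.from_pretrained("gpt2").eval()      # reference model

def log_perplexity(model, text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()           # mean NLL per token

def memorization_score(text: str) -> float:
    return log_perplexity(small, text) / log_perplexity(large, text)

# Rank generated candidates by this score and inspect the highest-scoring ones.
```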
The reason a smaller model works here is the following. On the Machine Learning Street Talk podcast, if you don't know it, it's a podcast where we talk to people from industry and from various research labs, we spoke with Sara Hooker. We talked about her paper "The Hardware Lottery", but she also has other research where she shows that if you have a neural network with its layers and weights, not all weights are equal. Some of the weights are allocated to these pattern-extraction things, so a pattern in the training data is represented by a group of weights within a layer, and then there are other weights that are allocated to remembering single or very few outliers. And these are disproportionate: there might be a thousand training examples covered by one piece of weight space, and only one piece of training data covered by another piece of weight space, simply because the model can extract a pattern from the former but not from the latter, so it has to memorize it. And the larger we make these models, the more parameters we give them, the more space they have to do this remembering.

What Sara Hooker noticed in her paper is that if you then distill these models, and distillation is the process of taking a model and putting its knowledge into a smaller model, you usually lose some performance, but you don't lose it equally across training data points. You lose performance mainly on the training data points that are these outliers, the ones that are not often represented in the training data, the ones the model has a harder time extracting a pattern from, so seldom patterns or just hard patterns; I would assume that patterns that are harder to extract also fall away, so the more complicated patterns get sacrificed too, but among the casualties are these outliers. So a smaller model has less capacity to remember these outliers, and therefore you don't even need a different training dataset: you can simply compare to a smaller version of the same model trained on the same data, because it will probably not remember the outliers as much. It would have been interesting if these authors had actually distilled GPT-2, though they don't have access to the original training data, so I get why they didn't.

That gives me an idea: maybe there is a way to look at the weights, and I get that these authors don't have access to the weights either, but maybe there is a way to look at the weights and spot which of them are associated with only single or very few training data points. Maybe during training you can count how many times a weight receives a substantial update, or maybe by looking at the attention matrices you can determine what kinds of patterns lead to a given weight being activated. If a weight is activated by lots of different patterns, it is probably useful for many forward-propagated signals; but if another weight is only activated by one specific pattern, maybe that's one of these memorization weights. So maybe there is a way to recognize these directly in the weights. Distillation, in any case, appears to be a kind of defense against this memorization, though that is not done in this particular paper.

They also have strategies that don't need a second neural model: you can compare the ratio of the perplexity that GPT-2 gives to the zlib entropy of the text, where zlib is simply a text compression method, and you can even compare the perplexities between the
original string and a lowercased version of it, and so on. For each of these configurations, they select 100 examples from among the top 1,000 samples, so they produce a thousand top-ranked samples and inspect a hundred of those; they sample, they de-duplicate, and then they investigate, which means they do Google searches, and if they can find the text, they say it is memorized.

They say: across all strategies, we identify 604 unique memorized training examples from among the 1,800 candidates; our best variant has a true positive rate of 67 percent. That is quite remarkable: 67 percent of the things this method delivers automatically are actually memorized. Though you have to qualify that: if you want more than a thousand examples, that rate is going to drop, since you select the top 1,000 examples, which are the most likely to be memorized. So if an attacker wants to scale this attack up, I'm going to assume the true positive rate will plummet fairly quickly; it would be interesting to see how it develops as you go down the ranking. And remember, they have to do Google searches, and then ask OpenAI, to figure out whether something really is a memorized training example.

On to the categories: they manually group the memorized samples into different categories, shown in Table 1. Most memorized content is fairly canonical text from news headlines, log files, entries from forums or wikis, or religious texts. However, they also identify a significant amount of unique data containing 128-bit UUIDs, correctly resolving URLs containing random strings, and contact information of individual people. As I said, this is fairly interesting, but also somewhat expected: if I give you the start of a UUID, there is no pattern to extract, except I guess the UUID structure itself, so all the model can really do is memorize it, especially if there aren't too many UUIDs in the training data, or if this particular UUID is one of these outlier situations. The same goes for URLs containing random strings: they are just not pattern-extractable, and therefore more easily remembered by the model than learned.

You can see the breakdown of what they extract: contact info, 32; named individuals from non-news sources, 46. That's a fair amount of things you can extract from GPT-2, though you have to say that this is over all of GPT-2: you get approximately a hundred things that are names or contact information. So, as I said, not too bad, specifically considering what I showed you before. That contact information, and they do say this in the paper, was obviously released in the context of that software project; the problem is only that the model might output it in a different context. The model might think: now I need to output some sort of name and address; what names and addresses do I know? Well, this one appears pretty often, I'm going to put it here. That is a failure mode these things can have.
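For completeness, here is a small sketch of those model-free comparison metrics mentioned above (zlib entropy and the lowercase ratio), plus a naive exact de-duplication step; the paper's actual selection and de-duplication are more involved, so treat this as illustrative only.

```python
# Sketch: alternative memorization scores that compare GPT-2's perplexity
# against a compression-based or case-normalized reference.
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

def log_perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def zlib_entropy(text: str) -> int:
    # Compressed length as a crude, model-free proxy for how "patterned" text is.
    return len(zlib.compress(text.encode("utf-8")))

def zlib_score(text: str) -> float:
    # High when zlib finds the text incompressible but GPT-2 finds it likely.
    return zlib_entropy(text) / log_perplexity(text)

def lowercase_score(text: str) -> float:
    # High when lowercasing the text makes it much less likely under GPT-2.
    return log_perplexity(text.lower()) / log_perplexity(text)

def top_candidates(samples, metric, k=1000):
    return sorted(set(samples), key=metric, reverse=True)[:k]  # exact dedup only
```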
Here is one of the graphs, and they have more of these later: on one axis is the GPT-2 perplexity and on the other the zlib entropy, and if you plot them against one another, most things fall on the diagonal, with a giant blob for most text from the internet. But there is a region where GPT-2 assigns fairly low perplexity while zlib considers the text relatively high entropy, and those are the candidates for memorization. The red and blue points are the ones the authors selected for checking, and the blue ones are those they actually found memorized from the internet, so a fairly high percentage, in fact the 67 percent of this method's selections were indeed memorized. Though, as I said, you can see there aren't that many more: this is all samples, and I don't know how many more they could generate, but it gets pretty sparse out in that region.

Now to examples of memorized content. Personally identifiable information: they say there are several examples of individual people's names, phone numbers, addresses and social media accounts, and some of this memorized content is exclusive to just a few documents; for example, they extract the usernames of six users participating in an IRC conversation that happened in exactly one document. I guess the question is how often the usernames appeared within that one document, and how distinct these usernames are from other usernames, because if they're very distinct and it's a long conversation, it's easy to see how the model would remember them. I'm not saying this isn't a problem, I'm saying the models don't just randomly remember stuff; there need to be fairly specific conditions for them to remember something.

They identify 50 examples of memorized URLs that correctly resolve to live web pages, and many of these URLs contain uncommon pieces of text such as random numbers or base64-encoded strings; again, this random element means there is no pattern to extract. They identify 31 generated samples that contain snippets of memorized source code, and they can actually extend that: they generate, I think, 256-token lengths, but they can extend these snippets to recover the source code verbatim, which is also fairly interesting. And unnatural text, these UUIDs: a Google search for one such string identifies just three documents containing it, and it is contained in just one GPT-2 training document, though again we are not told how often it appears within that document. Table 3 gives nine examples of k equals one memorized content, each of which is a random sequence between 10 and 87 characters long. You can see in the table that these are random strings that appear in the training data in exactly one document; however, one of the strings appears 10 times and another appears 311 times. So again, it's a random string, but 10 occurrences is fairly often for a piece of text, especially the same piece of text that isn't pattern-close to any other text, so it seems okay, even expected, that the model remembers it.

Then, data from two sources: they find samples that contain two or more snippets of memorized text that are unrelated to one another. In one example, GPT-2 generates a news article about the real murder of a woman in 2013, but then attributes the murder to one of the victims of a nightclub
shooting in Orlando in 2016. I found this very interesting, because it's exactly what I said GPT-3 does. With GPT-3 there was this example of it writing an entire news article about, I'm not even sure, some pastors, some split in the Mormon church or something like that, I don't remember it exactly; I was able to google that, and I did not find the verbatim sequence, but I found that the article GPT-3 "wrote" existed many times in slightly different words, written down in books and reported about and so on. So my guess is that GPT-3 simply interpolated between these things, and here they find the same: GPT-2 takes two pieces of text, finds that they are close, and sort of interpolates between them. I would call this memorization too, and they say yes, this part is memorized text and this part is not under their definition, but it kind of is: the model mixes different training data points together. I think this is very strong evidence for how these language models work: they take training data points and sort of mix them together, and they can do this in a grammatically well-founded fashion, they can also swap out individual words of a sentence and so on. By the way, that doesn't mean people necessarily do anything smarter: the best arguments I hear are that people are kind of doing the same thing, just recounting the training samples in a bit of their own words. But I found this extremely interesting, and also, what I found from GPT-3 with that Google example is that the problem of memorization may be even worse than what they analyze in this paper, because they look for direct overlap in text, so they wouldn't catch strings that have been reformulated.

Lastly, they say they can extend text, and this I find very interesting: if they put in the prompt "3.14159", GPT-2 will complete the first 25 digits of pi correctly; interestingly, when they input "pi is 3.14159...", it gives the first 799 digits; and if they prompt it with both "e is ..." and "pi is ...", it gets the first 824 digits correct. So they make the point that the memorization problem could actually be much worse if you only knew the right prefix to input, which strengthens my case for the future job description of the prompt engineer: it seems to be quite a magical power to know what to feed these language models to make them output what you want, both in this context and in contexts where you actually want them to do something useful.

And here is where they investigate this number k. You might have noticed, and this is a bit of my criticism of the paper up to this point, that they have the k equals one claims, and they sometimes say something is only found in very few examples, but essentially they investigate memorization pretty much in the absence of k, of the very thing they themselves define as problematic: they say it's problematic if it only appears in few training examples, but the analysis is very often done without reference to k. Here is where they do investigate it, and the experiments are fairly clever.
They find one document, a pastebin document, which is sort of a JSON document, and I found it, this giant JSON document with entries of the form: color, then link, and then the URL. It is, at least as these authors claim, the only document on the internet that contains these URLs, but many of the URLs are repeated many times within it; you can see the continuations of the URLs, and one of them, even though it's contained in a single document, is actually repeated 359 times, and so on. So this is a playground: this document was in the training data of GPT-2, and we know how often each of these strings appeared in that document, so they can directly run an experiment on how often a string needs to be present for the model to memorize it. They order the strings by the number of total occurrences and ask each of the model sizes whether it has memorized the string, by feeding in the prefix and sampling: if the model manages to output any of the URLs, they consider that memorized; if not, not. They have a second trick: a model can get half a point if, when they additionally input, I think, the first six tokens of the URL's random part, the model then completes it.

You can see that the large language model needs a string to appear roughly 20 times or more before it memorizes it, and you can also see the trend that the smaller models need many more occurrences, because they have fewer weights; they can't afford to memorize things easily, they need to extract patterns, so they'd rather forget the string, incur a bit of loss, and focus on the other training examples. So smaller models go in one direction, larger models in the other, which means something like GPT-3 will have this problem much more pronounced; that's the bad news about this result. The good news is that this is the case of fairly random sequences: the tokenization of these strings is not natural text, and these Reddit URLs have random prefixes, so this is very much the outlier case. It's a pretty clever case study to find this document, I have to say, but it is sort of good news that this is not the usual case; this is exactly the kind of data that is very prone to being memorized, because it's not patternable and it's very random.
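To make the memorization check from that URL experiment concrete, here is a rough sketch of how such a test could look; the prefix, target URLs and sample counts are placeholders standing in for the actual document's contents, and the paper's exact protocol may differ.

```python
# Sketch: does the model reproduce one of the target strings when prompted
# with the document's prefix? (Placeholder prefix/targets, not the real data.)
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

def model_emits(prefix: str, targets, n_samples: int = 32) -> bool:
    ids = tok(prefix, return_tensors="pt").input_ids
    for _ in range(n_samples):
        out = model.generate(ids, do_sample=True, max_new_tokens=64,
                             pad_token_id=tok.eos_token_id)
        completion = tok.decode(out[0][ids.shape[1]:])
        if any(t in completion for t in targets):
            return True
    return False

prefix = '{"color": "fuchsia", "link": "http://reddit.com/r/'   # placeholder
targets = ["<memorized url 1>", "<memorized url 2>"]            # placeholders
print(model_emits(prefix, targets))
```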
Okay, so that was that. As I said, the amount of hedging in this paper is really a lot. They also discuss what you can do about the problem: you can train with differential privacy, though that doesn't really help here, as we said, because some of these strings are included more than once; you can curate the training data, which doesn't really help either because the training data is too large; you can limit the impact of memorization on downstream applications, for example by fine-tuning, but we don't know exactly what fine-tuned models forget and what they retain; or you can audit, which is essentially what this paper does, and that seems like the best strategy we have so far.

I also wanted to quickly point to the appendix: it shows these graphs for the other methods, which is very cool if you want to check it out, and it has a categorization of what they find as memorized pieces of text. But my main point is this: the paper shows a problem with these large language models, namely that they memorize certain pieces of training data. While that sounds scary, I feel that the nature of the data they memorize is very particular: you cannot extract just any piece of training data; it's the outlier-ish training data points, and very often it isn't enough that the data is there just once: even when they say a piece of information is only in one document, it very often appears many times within that document. That, together with the non-patternability of the data that gets memorized, actually makes me fairly optimistic, more optimistic than I would have thought, honestly, about these language models. We'll see what the future brings; as I said, this will be more pronounced in larger models, and it is not the only problem with these models, as my GPT-3 Google search in that other video shows. All right, I hope this was enjoyable; let me know what you think, and maybe check out the paper. Bye!
Info
Channel: Yannic Kilcher
Views: 15,939
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, google, apple, openai, berkeley, stanford, carlini, dawn song, google ai, nlp, natural language processing, gpt, gpt2, gpt-2, gpt3, gpt-3, gpt 2, gpt 3, bert, transformers, attention, training data, security, leak, privacy, data protection, ethics, broader impact, likelihood, perplexity, entropy, url, uuid, personal information, address, private, user data, gdpr, adversarial, zlib
Id: plK2WVdLTOY
Length: 63min 18sec (3798 seconds)
Published: Sat Dec 26 2020