Glitch Tokens - Computerphile

Video Statistics and Information

Captions
Suppose we ask it to repeat the string: five question marks, a hyphen, then five more question marks and another hyphen. Oh my god, it worked! Oh, that was only four question marks, I screwed up. Do you see what I mean? It's just this very, very specific string.

What does it say when it doesn't screw up, if you give it something like that?

You can bleep me, right? It says "you're a idiot". Bizarre, utterly unhinged behaviour.

Is it glitch words or glitch prompts? What's the deal?

Glitch tokens, yeah. People call them different things: anomalous tokens, weird tokens. Basically, there are certain words that a lot of GPT models can't say. Okay, hang on, let's do a demo. ChatGPT is actually patched now, but davinci-instruct-beta still isn't. Here's a kind of task you might ask a language model to do when you're testing it: "Please repeat the string 'Hello computerphile' back to me." And if you do that, it says "Hello computerphile". This is a very, very easy task for a large language model. You can put whatever string in here you want; even if it's not real words, it's happy to repeat them. Easy, right? But suppose you ask it instead to say the specific word "SolidGoldMagikarp", and it says: "You say 'air', you say 'ah', you say 'e', you say 'air', you say 'ah', you say 'ear'", and it repeats this. That's very, very weird. Usually it can repeat strings just fine; this string it has a problem with.

There are a few of these. Another one is "PsyNetMessage". You ask it to say "PsyNetMessage" and it says: "The word 'volunt', v-o-l-u-n-t-e, is not in my vocabulary." Then: "Please repeat the string 'volunt' back to me", and it just repeats that. Weird. It's very, very strange behaviour. What else have we got? "rawdownload": it says "newcome".

You said "newcom", the computer said "release". It's like kind of secret ways into, you know, the engineering menus, as it were.

That would suggest this was deliberate, which it is not. So the question is: what the hell is going on here? It seems like just about any string you put in here will work fine, it doesn't even need to be real words; it's just these certain very specific strings that cause it to behave very, very strangely.

We've been talking about tokens as if they're words, but they're not really words. I think we already talked about byte pair encoding a little bit in an earlier video, about how you represent words to language models. At one extreme you could use just individual characters, which is cool because you can represent anything, but you waste a lot of the model's capacity just learning what valid words are; and the other thing is that it's better to be able to look back 50 words than 50 characters. At the other extreme, if you have a vocabulary of words, then your model can only represent words that are in that vocabulary. Byte pair encoding is the algorithm that gives you these tokens, and tokens can be individual characters but they can also be whole words. The algorithm is not very complicated: you basically take the most common pair of bytes in the data, call that a token, add it to your vocabulary, and then recurse on that, with the tokens already in your vocabulary counting as bytes, so you can compress things down. What you end up with is that all of the most common words get their own token to themselves, while rarer words end up made of word chunks.
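That merge-and-recurse loop is simple enough to sketch in a few lines of Python. This is a toy, character-level illustration of the idea, not OpenAI's actual tokeniser:

```python
from collections import Counter

def learn_bpe(corpus: str, num_merges: int):
    """Toy byte pair encoding: repeatedly merge the most frequent
    adjacent pair of tokens into a new token."""
    tokens = list(corpus)          # start from single characters ("bytes")
    vocab = set(tokens)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                  # nothing left worth merging
        merged = a + b
        vocab.add(merged)          # the merged pair becomes a new token
        merges.append((a, b))
        # Re-tokenise the corpus with the new merge applied.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return vocab, merges

vocab, merges = learn_bpe("the cat sat on the mat " * 50, num_merges=20)
print(merges[:5])  # very frequent pairs, e.g. ('a', 't'), get merged first
```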
So if I put "Please repeat the string 'Hello computerphile' back to me", what we end up with is: "please" is its own token, then "repeat", "the", "string", and the open quote gets its own token. But "computerphile" is not a common enough word to have its own token, so it gets divided up into "computer", which obviously has its own token, and then you might think "phile" would be its own, but it actually isn't: "ph" is one token and "ile" is another. This is a very neat combination, because if I put in here some complete keysmash nonsense, it can still represent it. In fact, my keysmash happened to have "of" in it, and that's a word, so that's a token, and "vil" is also a token, and so on. It doesn't care about how we would differentiate parts, with spaces for instance; a lot of these tokens end up with a space at the start of them. "please" is one token, but if I add a space at the beginning, " please" is a totally different token. If I go to the token IDs, " please" is token 422 whereas "please" is token 5492. As far as the model is concerned, all it sees are these numbers; the fact that "please" and " please" with a space in front are actually the same word is not given to the model, it has to learn all of that from the data. But if you give it one of our weird tokens, like "PsyNetMessage", that's all one token. There's a specific token ID, 2866, that just means "PsyNetMessage".
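You can reproduce this tokenisation yourself with OpenAI's open-source tiktoken library, which ships the GPT-2/GPT-3 vocabulary. A quick sketch; the exact splits and IDs assume the "gpt2" encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE vocabulary used by GPT-2/GPT-3

# A leading space changes the token entirely:
print(enc.encode("please"), enc.encode(" please"))  # two different IDs

# "computerphile" is too rare to get its own token, so it splits into chunks:
print([enc.decode([t]) for t in enc.encode(" computerphile")])

# The glitch string, by contrast, is a single token all of its own:
print(enc.encode(" SolidGoldMagikarp"))
```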
So, we can talk about how these were discovered, which is kind of interesting, because it was some safety researchers, some alignment researchers, who were trying to do some interpretability work. Interpretability, especially mechanistic interpretability, is the area of AI research that's about looking inside these models and seeing how they're actually working, because it is kind of bizarre how little we understand these things. They're the most powerful AI systems we have; they're arguably the most sophisticated, the most complex man-made objects that we have; they can do all kinds of things, and we don't know how. Nobody knows how they work. So there is this growing area of research just trying to get inside, and while they were doing this, they discovered this very strange behaviour.

Has anyone asked the language model itself why it glitches on those particular things? Is there a way of doing that?

Oh yeah. If you ask it, then it glitches. So, what was happening was these safety researchers were trying to do some interpretability work. Specifically, they were trying to do something called feature visualisation, which is a thing you see a lot with image models, like image classifiers. Basically, you're running gradient descent on the input space to find inputs that maximise a particular output. So if you have an image classifier and you're curious about what it's doing internally, you can take one of your classes, say it's classifying different animals, you can take the class "goldfish", and then run gradient descent on the input space, which is images, to find the input image which most strongly activates the goldfish output. You're effectively asking: "What is the goldfishiest possible image, according to you?" Let's have a closer look at this. You can see here an example of something doing this, and you can see the goldfishiest image does not look anything like a goldfish; but also, if you look at it, you can see a lot of goldfishiness going on.

Is it a collage of kind of bits of goldfish?

Yeah, it's very brightly coloured. And this is very useful for debugging these things. For example, here you can see the image for "monarch", which obviously has lots of things that look like parts of butterflies. I think this one is probably only doing animals, so it's not helpful here, but if you notice "our class for monarch is really, really heavily stuck on monarch butterflies; there's nothing here that looks like a head of state", that's useful. You remember before I was talking about the visualisation example of Arnold Schwarzenegger drinking coffee, where the model says it's a dumbbell? If you do this kind of feature visualisation on that kind of model and ask it to visualise the dumbbelliest image, you will notice: oh, that has big muscular arms in it, and that's not a dumbbell feature, that's an arm feature. So this is a useful technique.
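In code, that gradient-ascent loop looks roughly like this. A minimal sketch using a stock torchvision classifier; the model choice, learning rate, and step count are my assumptions, and serious feature-visualisation tools add regularisation and augmentations, without which you mostly get high-frequency noise:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
GOLDFISH = 1  # ImageNet class index for "goldfish"

# Start from random noise and optimise the *input*, not the weights.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    opt.zero_grad()
    logit = model(img)[0, GOLDFISH]
    (-logit).backward()   # minimising the negative = ascending the logit
    opt.step()

# `img` is now this crude procedure's "goldfishiest possible image".
```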
So they were trying to do this for language models. The equivalent thing is: what is the input string that maximises the probability of any given next word? You take a sentence like "One of Bruce Springsteen's most popular songs is titled 'Born in the ___'".

Yeah, and obviously the next token is "USA".

Right. The model predicts with 52% probability that it's "USA". Not great, but it could be "USA" in lowercase, that kind of thing. Whereas using this technique, they were able to find the sentence "profit usage dual creepy eating Yankees USA USA USA USA", for which the model says the next token is "USA" with 99.7% probability. And in the same way that the goldfish image did not look like a goldfish, this does not look like a real sentence. The reason this is new research is that it's hard to do: tokens are discrete, images are continuous. With an image you can do gradient descent, smoothly varying it until you find the one that most activates a particular output, whereas tokens have to actually be words; you can't smoothly vary tokens.

I'm sort of imagining the infinite monkeys with the infinite typewriters at this stage, typing in different things and seeing what the result is.

Right. You could do it by just sampling like crazy, but that's really inefficient and it's going to take forever to get anywhere. You really want to be able to do gradient descent, and you can't. But what you can do is this: the first thing the network does is embed the tokens. We've talked about embeddings before. You get these neural networks that take words and put them in a space, in the course of doing some other language-related task, and in order to do well at that task they have to put similar words close to each other in the space, so the geometry of that space becomes semantically meaningful. These transformers do the same thing as their first step. The embedding space is continuous, so you can do gradient descent in it. That was what they were doing, with some tricks, but basically that.

One thing they wanted to know was: what can we learn about the structure of this embedding space? The obvious thing is to run k-means clustering. In k-means, we've got some data and we say "split that into three clusters, please". Some tokens will be near each other, tokens sit at different distances from each other, and so there will be little clumps, which you would expect to be similar types of tokens.

Can you take a point in that space and sort of extract it, reverse the process, and bring it back to show what the word is?

Yeah, basically. For any given point in the space, there will be an actual token in the vocabulary which is the closest one to it, and that's what you use. It's just nearest neighbour: which token is most similar to this point in the space?
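Concretely, the clustering and the nearest-token lookup might look something like this. A sketch assuming the GPT-2 embedding matrix from HuggingFace transformers and scikit-learn; the cluster count here is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import KMeans
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
# The token embedding matrix: one row per vocabulary entry (50257 x 768).
emb = GPT2Model.from_pretrained("gpt2").wte.weight.detach().numpy()

# Cluster the whole vocabulary in embedding space.
km = KMeans(n_clusters=100, n_init=10).fit(emb)

def nearest_tokens(point: np.ndarray, k: int = 10) -> list[str]:
    """Nearest-neighbour lookup: which real tokens are closest to a point?"""
    ids = np.linalg.norm(emb - point, axis=1).argsort()[:k]
    return [tok.decode([int(i)]) for i in ids]

# Inspect the tokens nearest to one cluster centre:
print(nearest_tokens(km.cluster_centers_[0]))
```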
So yes, they ran k-means clustering, found a bunch of clusters, and a lot of the clusters make a lot of sense. There's a cluster here which is just all different two-digit numbers. There's a cluster with "sales", "models", "data", "model", "system", kind of engineering-type things. This one has "getting", "creating", "removing", "providing", "criticizing", so for some reason a clump of words all ending in -ing. But then they also found this cluster that contains things like "attRot" and "EStreamFrame" and "SolidGoldMagikarp" and "PsyNetMessage". Why are these even tokens? They found a bunch of them in the cluster and they were confused by this, so they tried Googling it and couldn't find very much. So they asked ChatGPT, "What does SolidGoldMagikarp refer to?", and ChatGPT said: "The word 'distribute' refers to the act of distributing or spreading something; for example, a teacher may distribute assignments to students." That's not what I said, right? I said "SolidGoldMagikarp" and you have hallucinated that I said "distribute". That was when they realised something very strange was going on.

Is there some piece of kind of base research that set out these tokens in the first place?

Yeah, I think that's what's going on. It's not totally known what is actually happening here, but the hypothesis that makes sense to me is this: you need a dataset to determine the BPEs, and they will have used a giant dump of data from the internet, but there was probably some junk in there.

Like typos, you think?

Well, the way byte pair encoding works, it's the most common combinations, so if it's a typo it has to be an incredibly common typo. Whatever these things are, they're things that happened a ton in the input data for the BPEs.

Could they be usernames or something?

That is what it is, or at least some of them. People have now done a bunch of sleuthing. It turns out "SolidGoldMagikarp", and there's also this one random redditor, "RandomRedditorWithNo" with no spaces, is its own token. So yes, these are usernames. But why these usernames? It turns out these particular Reddit users are big on this subreddit called r/counting, where people count. I'm guessing somebody posted a one, somebody replied to it with two, somebody replied to that with three, and people just went, "yeah, cool, let's keep doing this", for millions and millions. I don't know what they're up to now; we can go to it. The internet is bizarre. "Welcome to the most productive place on Reddit. Quickly find the latest comments yet to see what needs to be counted next." These people have obsessively committed to this so hard that they've broken language models, that their names are unspeakable by our most powerful AI systems. Bizarre, right? Completely bizarre.

But they're not all Reddit usernames. There are also things like "PsyNetMessage", which is a string that comes up very, very often in Rocket League debug logs; it's short for something like "Psyonix network message". Somehow a bunch of these debug logs ended up in the BPE data. But then, what we think happened is that when you're actually training your language model, you want to be careful what kind of data you give it; you want to filter. So at the beginning they were like, "give it all of Reddit", and then presumably at some point they looked through it and went, "I don't know, man, I think this counting subreddit is probably not that valuable to train our language model on. It's literally just this person says 4,967,006 and the next person says 4,967,007. This is not useful data." Likewise the debug logs, and there's stuff in there from badly scraped e-commerce sites, just random junk that ended up in the data used to pick the BPEs and was then thrown out. But the BPEs are fixed, right? You have to pin them down before you start training. So what you end up with is a language model which has tokens for these things that it basically never sees. During training, these particular usernames almost never come up, so the model has this stimulus that it's possible for it to experience but that it just never got during training.

If I say a word that you've never heard before, it's at least made of sounds that you've heard before. If you get a token that you've never seen during training, it's like a sound you've never heard before.

A sound you've never heard, or you could go even further and say it's like a sensation, a colour you've never seen. It's outside of your range of experience. Although it's probably not literally never: some of these things are in the training data a little bit, but relatively speaking the model just has no idea what to do with these tokens, because they happen so rarely, and that results in some really bizarre behaviour.

What I take away from this is that there's a lot of really interesting work to do poking around inside these models and seeing how they work. This has been around since GPT-2, right? GPT-2 will freak out at these tokens, but nobody noticed, because people have this perception of "oh, it's a black box, there's no point trying to figure anything out about this". But actually you can do analysis, you can learn things, you can discover things about these models that nobody has known before. This is pretty cool research, it's good fun, and of course it has tremendous safety applications, because how are we to make these things safe when we have so little idea how they work or what they're doing? We are going to have to get in there and poke around and figure out what they're doing, because trying to make a large language model safe with the level of understanding of them that we have right now is very, very hard. And they're just very weird, right? You kind of feel like you get them, because you speak to them in English and they speak back to you in English, but this behaviour is so strange and so unexpected.
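Since GPT-2's weights are public, you can poke at the same glitch yourself. A minimal sketch using the HuggingFace transformers pipeline; the prompt wording is just an example:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = 'Please repeat the string " SolidGoldMagikarp" back to me: '
out = generator(prompt, max_new_tokens=20, do_sample=False)
print(out[0]["generated_text"])
# GPT-2 typically fails to echo the token: it exists in the vocabulary,
# but it was seen so rarely in training that its embedding is undertrained.
```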
Info
Channel: Computerphile
Views: 316,564
Keywords: computers, computerphile, computer, science
Id: WO2X3oZEJOA
Length: 19min 28sec (1168 seconds)
Published: Tue Mar 07 2023