How the GPT-4 tokenizer (tiktoken) works and why it can't reverse strings

Captions
Hey, welcome back. Large language models such as GPT really struggle to reverse words, and today I'm going to explain why.

Just to prove this, I've gone onto Poe for a second and asked it to reverse the following word, and the word I'm going to use is "elephant". When I put that in, you can see it makes a complete hash of it: e-l-e-p-h-n-t. It just cannot get it right. The good news is that if I go into GPT-3.5 and say "reverse the following word" with "elephant" again, GPT-3.5 gets it completely right, so it doesn't struggle. GPT used to struggle with this, but it's got a lot better. GPT-3.5 isn't infallible, though: if I say "reverse the following word" and put in "appendChild", and I'm using this word for a specific reason that I'll explain a little later, you'll see GPT-3.5 makes a complete hash of it too. And if you think GPT-4 is any better, you can see it makes a hash of it as well; it's missing the d from appendChild. However, if I spell out the letters of the word individually, placing a space between each one, GPT-4 suddenly becomes competent and is able to reverse the word. That phenomenon isn't exclusive to GPT either: if I do the same thing and spell out the word "elephant", Llama 2, which couldn't reverse the word before, suddenly becomes a genius at it.

So what's going on? Large language models don't see words the way we see words; they see tokens. What do I mean by a token? If I take the word "fish", that is a single token. If I take the word "fisher", that is two tokens, and the two tokens are "fish" and "er". When the large language model is being trained, it's actually being trained on "fish" and "er", not "fisher", and that makes it really hard for the model to do things like reversing words, because it's not looking at things at a word level, it's looking at them at a token level.

So today I'm going to deep dive with you into the tokenization model of the GPT models, a library called tiktoken, and show you exactly how some of these words are tokenized and detokenized. That will give you a really good understanding of what the GPT model is actually seeing. Other models, such as the Mistral or Llama models, use a different tokenizer, but the fundamentals are exactly the same; today I'm going to focus on tiktoken.

To get started we need to install tiktoken on our local machine, so I'm just going to do a pip install tiktoken. As you can see it's already installed on my machine, but that will get it installed on yours. Next I'm going to create a very simple Python file, call it main.py, and use it to start encoding words. This might seem a little technical for folks not familiar with Python, but it's actually super easy to do. I'll open that Python file in Visual Studio Code, where it's completely empty, and import the tiktoken library. Then I need access to an encoding, so I'll type encoding = tiktoken.get_encoding("cl100k_base"). I could ask for a GPT-3.5- or GPT-4-specific encoding, and the encoders are slightly different for each of the GPT models, but for the purposes of this video I'm just going to stick with cl100k_base.
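If you want to follow along, here's a minimal sketch of that setup; the commented-out encoding_for_model call is an alternative I'm adding for illustration rather than something shown in the video.

```python
# Setup sketch: install the library first with `pip install tiktoken`.
import tiktoken

# Load the cl100k_base encoding used by the GPT-3.5 / GPT-4 family of models.
encoding = tiktoken.get_encoding("cl100k_base")

# Alternative (illustrative): ask tiktoken for the encoding tied to a specific model.
# encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

print(encoding.name)  # cl100k_base
```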
Now what I want to do is pass in a string to be encoded. In this case I'm just going to print out encoding.encode() and pass in the string "hello world!" with an exclamation point. If I save that and run python main.py, you'll see it is three tokens: 15339, 1917 and 0. You can probably take a guess at how that's been tokenized: 15339 is "hello", 1917 is "world", and 0 is the exclamation point. I can prove it: rather than calling encoding.encode, I can print encoding.decode, pass in the array of tokens, and when I run it again those tokens come back as "hello world!". What I want you to notice here is that the casing is maintained; "hello world" is all lowercase. If I split it up a little and pass 15339, 1917 and 0 through the decoder individually, it comes back as "hello", " world" and the exclamation point: three tokens. That's quite interesting, because in this particular case "hello" is a token, but it's actually " world" with a leading space that is a token, rather than just "world".

Now, if I change "hello world!" to "hello Chris!", clear the terminal and run it again, I get 15339, 523, 6091 and 0, and that's probably because the space is being folded in here. So let's decode 523 and 6091, keeping the 0, and run it one more time. The tokenization is even more different than you'd expect: "hello" is maintained, but now we've got " C" with a space, then "ris", then the exclamation point. What's actually happening is that the vocabulary contains common fragments from the training data sources, and they get mixed, matched and combined. When tiktoken composes a word, it looks at common elements, or sub-elements, and combines them. In this particular case, rather than having a dedicated token for "Chris", it uses " C" plus "ris" and combines them, because those fragments appear commonly in lots of different places.

Coming back to the hello world example, I hope you're starting to see that this is really just a dictionary-type operation: the integer 15339 represents "hello", and 1917 represents " world". When training is happening, or when inference is happening, these sentences and words are being converted into tokens, and that's what's being passed through, with the casing and spacing maintained. If I wanted a different variation, say "hello world" without the exclamation point, I'd just pass the tokens 15339 and 1917 without the 0. It's basically a dictionary.
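Here's a small sketch of that encode/decode round trip; the ids in the comments are the ones quoted in the video for cl100k_base.

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Encode a string into token ids.
tokens = encoding.encode("hello world!")
print(tokens)  # [15339, 1917, 0] as seen in the video

# Decode the whole list back; casing and spacing are preserved.
print(encoding.decode(tokens))  # hello world!

# Decode each token on its own to see which piece of text it maps to.
for token_id in tokens:
    print(token_id, repr(encoding.decode([token_id])))  # note " world" carries its leading space
```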
Now let's come back to the hello world example for a second. If I change "hello world!" to "Hello World!" with a capital H and W, and then even make it all caps, I get a completely different result. Running it again, my original lowercase hello world is still there, but "Hello" with a capital H is completely different from lowercase "hello", and the all-caps version is different again: 51812. " World" with a capital W is 4435, rather than 1917 for lowercase " world". And the all-caps string is completely different from everything else. If I take those values, 51812, 1623 and so on, and run each token through the decoder individually, we can see exactly how it decodes: the all-caps hello comes back as the fragments "HEL" and "LO", then there's " World", then the exclamation point. So it's an interesting vocabulary.

We can do the same with other words. If I take the word "fishmonger" and run it through, it splits into three tokens: 1868, 7225 and 261. You can probably imagine what they are: "fish", "mong" and "er", and we can prove that with a decode; 1868 matches "fish", 7225 matches "mong" and 261 matches "er". I can also do combinations. If I take the word "fisher", then, remembering the earlier encoding, it should be 1868 and 261, and if I run it now that's exactly what we see, because it's the combination of "fish" and "er". Running the decode on 1868 and 261 confirms it: it takes "fish" and "er" and combines them to make the word "fisher". That's how tokenization works, and as I said, this is the exact tokenization library the GPT models use.

If I wanted to, I could actually just loop through the dictionary and print out each token. Here I'm starting from 0 and going up to 50, doing a decode as we did before. If I run that in Python, lo and behold, and it shouldn't be a surprise, there's the exclamation point we saw earlier at 0, there's a quote, and you can see it's basically working its way through the character set at the beginning. If I go a bit further up the vocabulary and decode a higher range, you see a whole lot of other tokens: fragments like "element" and "include", and pieces that probably relate to things like "response". Anyone who's a little bit astute will notice that a lot of code has been run through this tokenizer. So it's not just restricted to the English language; it also works with programming languages, and in order to handle programming languages efficiently, some of their commonly used tokens have been put into the vocabulary as well, to maximize the efficiency of the tokenization.
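A sketch of that vocabulary walk; the exact upper range used in the video isn't clear from the captions, so the 1000 to 1050 window is just an example of "a bit further up".

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Decode the first 50 token ids one at a time: mostly punctuation and single characters.
for token_id in range(50):
    print(token_id, repr(encoding.decode([token_id])))

# Look a bit further up the vocabulary (example range) to find longer,
# often code-flavoured fragments.
for token_id in range(1000, 1050):
    print(token_id, repr(encoding.decode([token_id])))
```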
So far we've focused on English examples, but let's throw in some Japanese words. I'm going to encode the Japanese for "hello", the Japanese word for "elephant", the word for "fish" and the word for "world", which gives us four different encodings. If we run that, you can see the tokenization: the Japanese for "hello" actually gets its own individual token, so it's used enough in the data to justify one, and the word for "elephant" has its own token as well, 47523. But look at the next ones. "Fish", even though it's a single token in English, is made up of three tokens in Japanese, and the same goes for "world", which is also three tokens. That's because Japanese text hasn't been tokenized to the same level as English; it's probably not even been tokenized to the same level as programming languages. It handles some really common words such as "hello", and I'd argue those are probably words that also appear frequently inside English text, but outside of them I'm paying a penalty for the Japanese language. That penalty is that I'm using more tokens for training and more tokens for inference, and remember, you are charged on token usage.

This penalty isn't restricted to Japanese; it applies to other languages as well, such as Swahili. In Swahili, "jambo" means hello, "dunia" means world and "samaki" means fish. If I run this in my terminal, you'll see that although I'm passing through three words, "jambo" takes two tokens, probably because it's built from the fragments "jam" and "bo"; "dunia" is actually three tokens (6735, 9689 and one more); and "samaki" is two tokens, probably "sam" and "aki". So tokenization is highly biased towards the English language and English text. That's why, for things like models trained on Chinese text, or the folks in Norway training models on Norwegian text, there's a space for models natively trained on individual languages where there's enough text. The tokenizers for those models would be tuned much more towards that language, so the cost of training and the cost of inference would be much lower, whereas the cost of training and inference for something like Japanese or Swahili text here is much higher, because you're using far more tokens than you would for something focused on English. You can see the OpenAI tokenizer is highly tuned towards English; maybe in another video we'll explore some of the other tokenizers.
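To make that language-cost point concrete, here's a sketch comparing token counts. The Japanese strings are my own stand-ins for the words mentioned (こんにちは for hello, 魚 for fish, 世界 for world), so the counts you get may not match the video exactly.

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Same concepts, different languages: count how many tokens each costs.
samples = {
    "hello (English)":  "hello",
    "hello (Japanese)": "こんにちは",
    "fish (Japanese)":  "魚",
    "world (Japanese)": "世界",
    "hello (Swahili)":  "jambo",
    "world (Swahili)":  "dunia",
    "fish (Swahili)":   "samaki",
}

for label, text in samples.items():
    ids = encoding.encode(text)
    print(f"{label:18} -> {len(ids)} token(s): {ids}")
```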
Now, if I want to get really crazy, we can get out of the realm of spoken languages like Japanese, Swahili or English and move into something like a signalling language: Morse code. If I look at Morse code, that's how you say "hello", lots of dots and dashes, and the same with "world", and I'm printing out the final version just to remind everybody what it looks like. If I save that, clear the screen and run it, you'll see that if I were getting a large language model to work with Morse code as a language, I'd be paying a high tokenization cost: 1975, 6622, 497 and so on. In English, "hello" would just be one word and one token, whereas in Morse code "hello" is made up of nine tokens, and "world" is made up of nine tokens as well. In the case of Morse code, the tokenizer doesn't understand it as a language at all; it's really just working with the dots and the dashes. If we decode all of those tokens, just decoding "hello" here, you'll see it comes back as little groups of dots, spaces and dashes. It's handling the spaces fairly arbitrarily, just grouping up sequences it has commonly seen, and there's no correlation to "hello" or "world" at all. So when I'm dealing with something like Morse code, I'm paying a very expensive tokenization price.

I think this example really demonstrates that if I were building a large language model that handles Morse code, it might make more sense to have a completely different vocabulary, a different tokenizer, one focused on the tokens of Morse code as a language. Rather than having dot-dot-dot-dash translate into lots and lots of different tokens, I might want to tokenize whole sequences of dots and dashes into single tokens for "hello", "world" and so on, so the vocabulary is effectively made up of English Morse code, instead of paying the penalty of all these split-out tokens. The same goes for anything else, whether it's Japanese, Chinese or Swahili, and the same goes for programming languages: if you've got a programming language that isn't one of the main flavours, maybe an esoteric language, you're going to pay a high tokenization cost, because the vocabulary of that programming language isn't in the dictionary of your tokenizer.

Since I mentioned programming languages, just for fun, let's encode a whole bunch of different symbols. If you've got a bit of programming experience you might recognise some of these: there's the not operator, there's const, there's enum, there's a little bit of TypeScript with colon string, colon number and colon int, there are arrow functions, there's a Pascal-style begin statement, if you've written a bit of VBScript you might recognise the word wend, there's an elif, there's a .textbox, and there's .appendChild, which was in our examples at the beginning, for the JavaScript and DOM folks. If I run all of this, the interesting thing is that every single one of these is a single token. Think about that for a second: Norwegian, Swahili and Morse code were all being split up massively because they're not part of the vocabulary, but the OpenAI tokenizer, tiktoken, is highly optimized towards programming languages. All of these common operators from JavaScript, Python and so on are already built into the vocabulary, so when we pass through a piece of JavaScript, C or Python, it's minimizing the number of tokens we're sending and the number of tokens we're getting back.
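Here's a sketch of that programming-symbol check. The captions don't make every symbol in the video legible, so this list is an approximation, and the single-token claim is the one made in the video rather than something I've verified for every entry.

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Fragments that show up constantly in source code. The claim in the video is
# that each of these encodes to a single token in cl100k_base; print and check.
snippets = ["const", "enum", ": string", ": number", "=>", "begin", "wend",
            "elif", ".appendChild"]

for snippet in snippets:
    ids = encoding.encode(snippet)
    print(f"{snippet!r:16} -> {ids} ({len(ids)} token(s))")
```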
With some of the older or more esoteric languages, though, those keywords may not be part of the tokenizer, so they'll get split into more tokens and carry a higher cost. That's something to be aware of: if you're looking to train a model for your own custom programming language, you may want a vocabulary that includes its keywords. Out of the box, OpenAI's tiktoken tokenizer has programming languages and common words used in development as part of the vocabulary, so when you take your source code and tokenize it, those pieces get translated into individual tokens. It's not just optimized for spoken languages; it's also optimized for programming.

So let's come back to what we said originally: why can't LLMs reverse words? It's because they're not dealing with words, they're dealing with tokens, and with words that have been split up into sub-tokens. The model gets confused; it effectively doesn't know what the word is or what to do with it, and starts producing gibberish on the way back. However, as you've seen with the tokenizer, if I can break up the tokens, things change. Rather than giving it a single token like "appendChild", and that's exactly why I put appendChild in there, because even though it looks like a lot of letters to us it's a single token, which is why even GPT struggled with it, I can split it into individual characters with spaces: a, p, p, e, n, d and so on. Then the model is dealing with things at an individual letter level, it can do the split, and it can reverse the word.

You've now seen how that works for GPT. Other models, such as Mistral or Llama 2, use different tokenizers; I think the Llama models use the SentencePiece tokenizer, and Mistral has its own as well. You can use the same technique with those models, and maybe I'll do a shorter video at some point showing how to do that with their tokenizers, so you can have a look at what's in them and what they're doing under the hood. But they have the same underlying issue, and that's why the Mistral and Llama models all have the same problems trying to reverse a word. GPT used to have problems with this and has got a lot better at it. I suspect, and I haven't seen the data under the hood, that they've got some sort of few-shot prompt that says: when I ask you to reverse a word, split it into individual letters like this, then do the reverse and output that. I imagine the training takes that into account, and it could be a fun and interesting video to see whether we can train one of these existing models, like Llama, to reverse words properly, so maybe we'll handle that in the next one.

Anyway, I hope you've got a better understanding of how tokenizers work. Maybe in a future video I'll show you how to build a tokenizer yourself, but hopefully this was a little bit useful, and I'll catch you in the next video.
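As a final sketch tying the reversal trick back to the tokenizer: splitting a word into space-separated letters turns one opaque token into many single-character tokens, which is the form the model can actually work with. Treat the single-token comment as the video's claim.

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

word = "appendChild"

# As one blob, the video's claim is that this is a single token,
# so the model never "sees" the individual letters.
print(encoding.encode(word))

# Spelled out with spaces, every letter becomes its own tiny token,
# which is why the models suddenly manage to reverse it.
spelled_out = " ".join(word)
print(spelled_out)                  # a p p e n d C h i l d
print(encoding.encode(spelled_out))
```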
Info
Channel: Chris Hay
Views: 1,901
Keywords: chris hay, chrishayuk, chatgpt, tiktoken, llm, tokenizer, sentencepiece, llama-2, bpe, byte pair encoding, ai, how ai works, mistral ai, tokenizer explained, tokenizer llm
Id: NMoHHSWf1Mo
Length: 23min 59sec (1439 seconds)
Published: Wed Jan 17 2024