A Hackers' Guide to Language Models

Hi, I'm Jeremy Howard from fast.ai, and this is a hacker's guide to language models. When I say a hacker's guide, what we're going to be looking at is a code-first approach to understanding how to use language models in practice. Before we get started, we should probably talk about what a language model is. This will make more sense if you know the basics of deep learning; if you don't, I think you'll still get plenty out of it, but if you have a chance I'd recommend checking out course.fast.ai, a free course. At least the first five lessons will get you to a point where you understand the fundamentals of deep learning, and this tutorial will make even more sense. Maybe I shouldn't call it a tutorial; it's more of a quick run-through of the basic ideas of language models and how to use them, both open source ones and OpenAI-based ones, all using code as much as possible.

So let's start with what a language model is. A language model is something that knows how to predict the next word of a sentence, or how to fill in the missing words of a sentence. We can look at an example. OpenAI has a language model, text-davinci-003, and we can play with it by passing in some words and asking it to predict what the next words might be. So if we pass in "When I arrived back at the panda breeding facility after the extraordinary rain of live frogs, I couldn't believe what I saw": I just came up with that yesterday and thought, what might happen next? It's kind of fun for creative brainstorming. There's a nice site called nat.dev that lets us play with a variety of language models; here I've selected text-davinci-003, hit submit, and it starts printing stuff out: "The pandas were happily playing and eating the frogs that had fallen from the sky. It was an amazing sight to see these animals taking advantage of such a unique opportunity... after quick measures to ensure the safety of the pandas and the frogs." So there you go, that's what happened after the extraordinary rain of live frogs at the panda breeding facility.

You'll see here that I've enabled "show probabilities", a feature in nat.dev which shows, well, let's take a look: it's pretty likely the next word here is going to be "the", and since we're talking about a panda breeding facility, "pandas were". What were they doing? They could have been doing a few things: the pandas were happily, the pandas were having, the pandas were out, the pandas were playing. It picked the most likely one; it thought "happily" was about 20% likely. And what were they happily doing? Could have been playing, hopping, eating and so forth; so "eating the frogs that", and then "had", almost certainly. So what it's doing at each point is predicting the probability of a variety of possible next words, and depending on how you set it up, it will either pick the most likely one every time, or you can muck around with things like top-p values and temperature to change what comes up, so each time it gives a different result: "frogs perched on the heads of some of the pandas, it was an amazing sight", etc. OK, so that's what a language model does.

Now, you might notice it hasn't predicted "pandas" as a single word; it's predicted it in two pieces, and "unharmed" isn't one piece either. So it's not always predicting whole words. Specifically, what it's doing is predicting tokens. Tokens are either whole words or sub-word units (pieces of a word), or they can be punctuation or numbers and so forth. Let's have a look at how that works. The process is called tokenization, and we can use the same tokenizer that GPT uses via the tiktoken library, specifically asking for the tokenizer that text-davinci-003 uses. For example, when I tried this earlier it talked about the frogs splashing, so I'll encode "They are splashing", and the result is a bunch of numbers. Those numbers are basically just lookups into a vocabulary that OpenAI (in this case) created; if you train your own models, your code will create one. If I then decode those numbers, I get back the pieces "They", " are", " spl", "ashing"; put them all together and it's "They are splashing". You can see that the space at the start of a word is encoded as part of the token.
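Here's a minimal sketch of that encode/decode round trip using the tiktoken library. The exact token ids and splits depend on the tokenizer version, so treat the values shown in comments as illustrative rather than exact.

```python
import tiktoken

# Use the same tokenizer that text-davinci-003 uses.
enc = tiktoken.encoding_for_model("text-davinci-003")

toks = enc.encode("They are splashing")
print(toks)                              # a list of ints: indexes into the vocabulary
print([enc.decode([t]) for t in toks])   # each token on its own, e.g. 'They', ' are', ' spl', 'ashing'
print(enc.decode(toks))                  # 'They are splashing'
```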
So these language models are quite neat in that they can work at all, but they're not, of themselves, really designed to do anything. Let me explain. The basic idea of what ChatGPT, GPT-4, Bard etc. are doing comes from an algorithm I created back in 2017 called ULMFiT. Sebastian Ruder and I wrote a paper describing the ULMFiT approach, which laid out the three-step system everybody is now using. Step one is language model training, or, as the paper describes it, pre-training. Language model pre-training is the thing that predicts the next word of a sentence. In the original ULMFiT paper (the algorithm I developed in 2017, which Sebastian Ruder and I wrote up in early 2018), what I originally did was train this language model on Wikipedia. That meant taking a neural network, which is just a mathematical function that's extremely flexible, with lots and lots of parameters; initially it can't do anything, but using stochastic gradient descent (SGD) you can teach it to do almost anything if you give it examples. So I gave it lots of examples of sentences from Wikipedia. For example, from the Wikipedia article for The Birds: "The Birds is a 1963 American natural horror-thriller film produced and directed by Alfred..." and then it stops, and the model has to guess what the next word is. If it guessed "Hitchcock" it would be rewarded, and if it guessed something else it would be penalized. Effectively it's trying to maximize those rewards: trying to find a set of weights for this function that makes it more likely to predict "Hitchcock". Later on, that article's plot summary reads: "Annie previously dated Mitch but ended it due to Mitch's cold, overbearing mother Lydia, who dislikes any woman in Mitch's..." and filling that in actually requires being pretty thoughtful.
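To make that pre-training objective concrete, here's a toy sketch (not the actual ULMFiT or GPT training code) of next-token prediction with SGD in PyTorch: the model produces a score for every word in the vocabulary, and cross-entropy rewards it for putting probability on the word that actually came next.

```python
import torch
import torch.nn.functional as F
from torch import nn

vocab_size, emb_size = 100, 32   # tiny toy numbers; real models use vocabularies of tens of thousands

# A toy "language model": embed each token, then score every possible next token.
model = nn.Sequential(nn.Embedding(vocab_size, emb_size), nn.Linear(emb_size, vocab_size))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# A "document" as token ids; the target for each position is simply the next token.
tokens = torch.randint(0, vocab_size, (1, 20))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                    # (batch, seq, vocab): a score for each candidate next token
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
opt.zero_grad()
loss.backward()                           # SGD nudges the weights so the true next word becomes more likely
opt.step()
```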
There's a bunch of things that could logically go there: a woman could be in Mitch's closet, or in Mitch's house, but in the Wikipedia article describing the plot of The Birds it's actually "any woman in Mitch's life". To do a good job of guessing the next word of sentences, the neural network is going to have to learn a lot about the world. It's going to learn that there are things called objects, that there's a thing called time, that objects interact over time, that there are things called movies, that movies have directors, that people have names, that a movie director called Alfred Hitchcock directed horror films, and so on and so forth. It has to learn an extraordinary amount if it's going to do a really good job of predicting the next word of sentences. These are deep neural networks (this is deep learning); when I created this it had around 100 million parameters, and nowadays they have billions. They have the ability to create a rich hierarchy of abstractions and representations to build on. This is really the key idea behind neural networks as language models: if a model is going to do a good job of predicting the next word of any sentence in any situation, it has to know an awful lot about the world. It has to know how to solve math questions, figure out the next move in a chess game, recognize poetry, and so on. Now, nobody said it will do a good job of that; it's a lot of work to create and train a model that's good at it. But if you can create one, it will have a lot of capabilities internally that it has to draw on to predict so effectively. The key idea for me is that this is a form of compression, and the idea of a relationship between compression and intelligence goes back many decades: if you can guess what words are coming next, then effectively you're compressing all that information down into a neural network.

Now, I said this is not useful of itself, so why do we do it? We do it because we want to pull out those capabilities, and the way we pull them out is with two more steps. The second step is language model fine-tuning. In fine-tuning we are no longer giving it all of Wikipedia (or, nowadays, the large chunk of the internet that's fed into pre-training); we feed it a set of documents a lot closer to the final task that we want the model to do, but it's still the same basic idea: predicting the next word of a sentence. After that, we do a final classifier fine-tuning, which targets the end task we're actually trying to get it to do.

Nowadays, very specific approaches are taken for these two steps. For step two, the language model fine-tuning, people now do a particular kind called instruction tuning. The idea is that the task we want most of the time is to solve problems and answer questions, so in the instruction tuning phase we use datasets like this one, a great dataset called Open Orca.
It was created by a fantastic open-source group, and it's built on top of something called the FLAN collection. You can see there are all kinds of different questions in here, about four gigabytes of questions and context, and each one generally has a question, instruction, or request, plus a response. Here are some example instructions; I think these are from the FLAN dataset, if I remember correctly. For instance: does the sentence "in the Iron Age" answer the question "the period of time from 1200 to 1000 BCE is known as what?", choices 1) yes, 2) no, and the language model is meant to write 1 or 2 as appropriate. Or, I think this one is about a music video, "who is the girl in 'More Than You Know'?", and it has to write the correct name of the model or dancer or whoever it is from that video. So it's still doing language modeling; fine-tuning and pre-training are basically the same thing, but this is now targeted not just at filling in the missing parts of any document from the internet, but at filling in the words needed to answer questions and do useful things. OK, so that's instruction tuning.

Then step three, the classifier fine-tuning, nowadays generally uses approaches such as reinforcement learning from human feedback (RLHF) and others, which basically give humans (or sometimes more advanced models) multiple answers to a question. Here are some from an RLHF paper, I can't remember which one: "list five ideas for how to regain enthusiasm for my career". The model spits out two possible answers, or a less good model and a better model each produce one, and then a human (or a better model) picks which is best. That's used for the final fine-tuning stage. All of which is to say: although you can download pure language models from the internet, they're not generally that useful on their own until you've fine-tuned them. You don't necessarily need step three; people are actually discovering that maybe just step two might be enough, though it's still a bit controversial. So when we talk about a language model, we could be talking about something that's just been pre-trained, something that's been fine-tuned, or something that's been through RLHF; all of those are generally described as language models nowadays.

My view is that if you are going to be good at language modeling in any way, you need to start by being a really effective user of language models, and to be a really effective user you've got to use the best one there is. Currently (what are we up to, September 2023?) the best one is by far GPT-4. This might change sometime in the not-too-distant future, but right now GPT-4 is the strong, strong recommendation. You can use GPT-4 by paying 20 bucks a month to OpenAI, and then you can use it a whole lot; it's very hard to run out of credits, I find.

Now, what can GPT-4 do? It's interesting and instructive, in my opinion, to start with the very common views you see on the internet, or even in academia, about what it can't do. For example, there was a paper you might have seen, "GPT-4 Can't Reason", which describes an empirical analysis of 25 diverse reasoning problems, found that it was not able to solve them, and concluded that it's utterly incapable of reasoning.
I always find you've got to be a bit careful about reading stuff like this, so I took the first three problems I came across in that paper and gave them to GPT-4. By the way, something very useful in GPT-4 is that you can click on the share button and get a link that looks like this, which is really handy. So here's an example from the paper of something GPT-4 supposedly can't do: "Mabel's heart rate at 9am was 75 beats per minute. Her blood pressure at 7pm was 120/80. She died at 11pm. Was she alive at noon?" Of course, as humans we know she obviously must have been, and GPT-4 says: hmm, this appears to be a riddle, not a real inquiry into medical conditions; here's a summary of the information; yes, it sounds like Mabel was alive at noon. Correct. This was the second one I tried from the paper that says GPT-4 can't do this, and I found GPT-4 can do it, and the same for the others.

I mention this to say that GPT-4 is probably a lot better than you would expect if you've read all the stuff on the internet about the dumb things it does. Almost every time I see someone on the internet saying there's something GPT-4 can't do, I check it, and it turns out it does it. This one was just last week: "Sally (a girl) has three brothers. Each brother has two sisters. How many sisters does Sally have?" Have a think about it. GPT-4 says: OK, Sally is counted as one sister by each of her brothers; if each brother has two sisters, that means there's another sister in the picture apart from Sally; so Sally has one sister. Correct. And this one I got three or four days ago; it's a common view that language models can't track things like this. Here's the riddle: "I'm in my house. On top of the chair in the living room is a coffee cup. Inside the coffee cup is a thimble. Inside the thimble is a diamond. I move the chair to the bedroom. I put the coffee cup on the bed. I turn the cup upside down, then I turn it right side up and place the coffee cup on the counter in the kitchen. Where's my diamond?" GPT-4 says: you turned the cup upside down, so the diamond probably fell out; therefore the diamond is in the bedroom, where it fell out. Correct.

So why are people claiming that GPT-4 can't do these things? The reason, I think, is that on the whole they're not aware of how GPT-4 was trained. GPT-4 was not trained at any point to give correct answers. It was trained initially to give the most likely next words, and there's an awful lot of stuff on the internet where documents are not describing things that are true: there's fiction, there are jokes, there are just people saying dumb stuff. So the first stage does not necessarily give you correct answers. The second stage, instruction tuning, is also trying to give correct answers, but part of the problem is that in the stage where you ask people which answer they like better, people tended to prefer more confident answers, and they often were not trained well enough to recognize wrong ones. So there are lots of reasons that the SGD weight updates from this process don't entirely reward correct answers. But you can help it want to give you correct answers.
If you think about the language model pre-training, what are the kinds of things in a document that would suggest "this is going to be high-quality information"? You can actually prime GPT-4 to give you high-quality information by giving it custom instructions, which is basically text that is prepended to all of your queries. So you say things like "you are brilliant at reasoning" to prime it to give good answers, and then, to work against the fact that the RLHF raters preferred confidence, you tell it: tell me if there might not be a correct answer. Also, the way the text is generated is that it literally generates the next word, then puts the whole lot back into the model and generates the next next word, puts that all back into the model, and so forth. That means the more words it generates, the more computation it can do, so I literally tell it that: first spend a few sentences explaining background context, etc. This custom instruction allows it to solve more challenging problems, and you can see the difference. For example, if I say "how do I get a count of rows grouped by value in pandas", it first gives me a whole lot of information, which is actually it thinking, so I just skip over it, and then it gives me the answer. In my custom instructions I also say that if the request begins with "vv", make the answer as concise as possible, so it goes into brief mode: here's the same question with "vv" at the start, and it just spits the answer out. In this case it's a really simple question, so it didn't need time to think. Hopefully that gives you a sense of how to get language models to give good answers: you have to help them, and if it's not working, it might be user error, basically.

Having said that, there's plenty of stuff that language models like GPT-4 can't do. One thing to think carefully about is: does it know about itself? Can you ask it what its context length is, how it was trained, or what transformer architecture it's based on? At which of the training stages would it have had the opportunity to learn any of those things? Obviously not at the pre-training stage; nothing existed on the internet during GPT-4's training saying how GPT-4 was trained. Probably ditto for the instruction tuning, and ditto for the RLHF. So in general you can't ask a language model about itself. And again, because of the RLHF, it'll want to make you happy by giving you opinionated answers, so it'll just spit out the most likely thing it thinks, with great confidence. This is just a general kind of hallucination: the language model wants to complete the sentence, and it wants to do it in an opinionated way that's likely to make people happy. It also doesn't know anything about URLs; it really hasn't seen many at all (I think most if not all of them were pretty much stripped out), so if you ask it what's at some web page, it'll generally just make it up. And GPT-4 doesn't know anything after September 2021, because the information it was pre-trained on came from before then; that's called the knowledge cutoff.

So here are some things it can't do. Steve Newman sent me this good example, a logic puzzle it fails at:
"I need to carry a cabbage, a goat, and a wolf across a river. I can only carry one item at a time. I can't leave the goat with the cabbage. I can't leave the cabbage with the wolf. How do I get everything across to the other side?" Now, the problem is that this looks a lot like the classic river-crossing puzzle, so classic, in fact, that it has a whole Wikipedia page about it, and in the classic puzzle the wolf would eat the goat and the goat would eat the cabbage. In Steve's version he changed it: the goat would eat the cabbage and the wolf would eat the cabbage, but the wolf won't eat the goat. So what happens? Very interestingly, GPT-4 is entirely overwhelmed by its language model training here. It's seen this puzzle so many times that it "knows" what word comes next, so it says: take the goat across the river and leave it on the other side, leaving the wolf with the cabbage. But we were just told you can't leave the wolf with the cabbage, so it gets it wrong.

The thing is, though, you can encourage GPT-4 (or any of these language models) to try again. During instruction tuning and RLHF they're actually fine-tuned with multi-stage conversations, so you can have one: "repeat back to me the constraints I listed; what happened after step one; is a constraint violated?" Oh yes, I made a mistake; my new attempt: instead of taking the goat across the river and leaving it on the other side, I'll take the goat across the river and leave it on the other side. It's done exactly the same thing. Oh yes, I did do the same thing; OK, I'll take the wolf across. Well, now the goat's with the cabbage; that still doesn't work. Oh yes, that didn't work out either, sorry about that; instead of taking the goat across, I'll take the goat across. What's going on here? This is terrible. Well, one of the problems is not only that this particular goat puzzle is so common on the internet that the model is very confident it knows the next word, but also that on the internet, when you see something stupid on a web page, it's really likely to be followed by more stuff that is stupid. Once GPT-4 starts being wrong, it tends to be more and more wrong; it's very hard to turn it around and make it right. So what you actually want to do is go back: there's an edit button on these chats, and if it's made a mistake, don't say "here's more information to help you fix it"; instead, go back, click edit, and change the prompt so it doesn't get confused in the first place. In this case, fixing Steve's example takes quite a lot of effort, but I think I managed to get it to work eventually. I said: sometimes people read things too quickly, they don't notice things that can trip them up, then they apply a familiar pattern and get the wrong answer; you do the same thing, by the way, so I'm going to trick you, so before you get tricked, make sure you don't get tricked; here's the tricky puzzle. With that, plus my custom instructions, it takes time discussing it and this time it gets it correct: it takes the cabbage across first. It took a lot of effort to get to a point where it could actually solve this, because when it's been primed to answer a certain way again and again and again, it's very hard for it not to do that.

OK, now something else super helpful you can use is what they call Advanced Data Analysis, where you can ask it to basically write code for you. We're going to look at how to implement this sort of thing from scratch ourselves quite soon, but first let's learn how to use it.
I was trying to build something that splits a document on third-level markdown headings, that is, three hashes at the start of a line, and I was doing it on the whole of Wikipedia, so using regular expressions was really slow. I said "I want to speed this up", and it said "OK, here's some code", which is great, because then I can say "test it and include edge cases". So it writes the code, creates test cases, tests it, and says "yep, it's working". It's not: I notice it's actually removing the carriage return at the end of each line. So I say "fix that and update your tests"; it changes the test cases, runs them, and it's still not working. So it says "oh yes, fix the issue in the test cases"; nope, that didn't work either. You can see it's quite clever in the way it tries to fix things by looking at the results, but every one of these is another attempt, and another, and another, until eventually I gave up waiting. It's funny: each time it's like, "OK, this time I'm going to handle it properly", and I gave up at the point where it said "one more attempt". So it didn't solve it. Interestingly enough, there are some limits to the amount of logic it can do; this was really a very simple thing I asked it to do for me. So hopefully you can see that you can't expect even GPT-4 with Code Interpreter (or Advanced Data Analysis, as it's now called) to mean you don't have to write code anymore. It's not a substitute for having programmers. But, again, it can often do a lot, as I'll show you in a moment.

For example, here's something I thought was really cool: OCR. You can just paste, or rather upload, an image into GPT-4's Advanced Data Analysis. I wanted to grab some text out of an image: somebody had posted a screenshot saying "this language model can't do such-and-such", and I wanted to try it myself, so rather than retyping it, I just uploaded that screenshot and said "can you extract the text from this image?" It said "sure, I can use OCR", literally wrote an OCR script, and there it was a few seconds later. The difference here is that it didn't really require much logic; it could just use a very, very familiar pattern that it has seen many times. This is generally where I find language models excel: where they don't have to think too far outside the box. They're great on creativity tasks, but for reasoning and logic tasks outside the box I find them not great. They are, though, great at writing code for a whole wide variety of different libraries and languages. Having said that, by the way, Google also has a language model called Bard. It's way less good than GPT-4 most of the time, but there is a nice thing where you can literally paste an image straight into the prompt. I just typed "OCR this", and it didn't even have to go through a code interpreter; it just said "sure, done", gave me the result of the OCR, added a little comment, and even figured out where the OCR text came from and gave me a link to it, which I thought was pretty cool.
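For reference, the sort of OCR script it writes for a request like that is only a few lines. Here's a hedged sketch of what such a script might look like, assuming the Tesseract engine plus the pytesseract and Pillow packages are available; the filename is hypothetical.

```python
from PIL import Image
import pytesseract

image = Image.open("screenshot.png")        # the uploaded screenshot (hypothetical filename)
text = pytesseract.image_to_string(image)   # run OCR and return the extracted text
print(text)
```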
OK, so that's an example of it doing well. Here's one from preparing this talk that I found really helpful. I wanted to show you how much it costs to use the OpenAI API, but unfortunately when I went to the OpenAI web page the pricing information was all over the place, spread across separate tables, and it was a bit of a mess. I wanted to create one table with all of the information combined, like this, and here's how I did it: I went to the OpenAI pricing page, hit Cmd-A to select all, and then told ChatGPT "create a table with the pricing information rows; no summarization, no information not in this page; every row should appear as a separate row in your output", and hit paste. That paste was not very clean: it included the nav bar, lots of extra information at the bottom, the footer, etc. But it's really good at this stuff; it did it first time. There was the markdown table, I copied and pasted it into Jupyter, and now you can see at a glance the cost of GPT-4, GPT-3.5, etc. Then what I really wanted was to show you that as a picture, so I just said "chart the input row from this table" and pasted the table back, and it did. That's pretty amazing.

So let's talk about pricing. So far we've used ChatGPT, which costs 20 bucks a month with no per-token cost or anything, but if you want to use the API from Python or whatever, you have to pay per token, which is approximately per word; maybe about one and a third tokens per word on average. Unfortunately the chart did not include the headers, but these first two bars are GPT-4 and these two are GPT-3.5, and you can see GPT-3.5 is way, way cheaper: $0.03 versus $0.0015 per thousand tokens. It's so cheap you can really play around with it and not worry, and I want to give you a sense of what that looks like. Why would you use the OpenAI API rather than ChatGPT? Because you can do it programmatically: you can analyze datasets, you can do repetitive stuff; it's kind of like a different way of programming, things you can think of by describing them.

Let's look at the simplest example of what that looks like. If you pip install openai, you can call ChatCompletion.create using gpt-3.5-turbo and pass in a system message, which is basically the same idea as custom instructions: "you are an Aussie LLM that uses Aussie slang and analogies wherever possible". You can see I'm passing in an array of messages: first the system message, then the user message, which is "what is money?". GPT-3.5 returns a big nested dictionary, and the message content is: "money is like the oil that keeps the machinery of our economy running smoothly... just like a koala loves its eucalyptus leaves, we humans can't survive without this stuff." So there's the Aussie LLM's view of what money is. The main models I pretty much always use are gpt-4 and gpt-3.5-turbo. GPT-4 is just so, so much better at anything remotely challenging, but it's obviously much more expensive, so the rule of thumb is: maybe try 3.5-turbo first and see how it goes; if you're happy with the results, great; if you're not, pay up for the more expensive one.
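Here's a minimal sketch of that call, using the pre-1.0 openai Python library interface shown in the talk; it assumes your OPENAI_API_KEY is set in the environment.

```python
import openai  # pre-1.0 openai library; reads OPENAI_API_KEY from the environment

aussie_sys = "You are an Aussie LLM that uses Aussie slang and analogies whenever possible."

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": aussie_sys},
        {"role": "user", "content": "What is money?"},
    ],
)

print(completion.choices[0].message.content)  # the model's reply
print(completion.usage)                       # token counts, handy for tracking cost
```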
I created a little function here called response that prints out that nested structure. The other thing to point out is that the result also has a usage field, which tells you how many tokens it used: about 150 here. At $0.002 per thousand tokens, 150 tokens means we just paid $0.0003, that is, 0.03 cents, to get that done. So as you can see, the cost is insignificant. If we were using GPT-4 at $0.03 per thousand it would be about half a cent, so unless you're doing many thousands of GPT-4 calls you're not even up into the dollars, and with GPT-3.5 you can do even more than that. But keep an eye on it: OpenAI has a usage page, and you can track your usage.

Now, what happens when we have a follow-up in the same conversation? This is really important to understand. We just asked what "goat" means: for example, Michael Jordan is often referred to as the GOAT for his exceptional skills and accomplishments, and Elvis and The Beatles are referred to as GOATs due to their profound influence and achievements. So I could say "what profound influence and achievements are you referring to?", and it answers: well, I meant Elvis Presley and The Beatles, who did all these things. How does that work? How does this follow-up work? What happens is that the entire conversation is passed back, and we can actually do that ourselves. Here is the same system prompt, here is the same question, and then the answer comes back with role "assistant". And I'm going to do something pretty cheeky: I'm going to pretend it didn't say "money is like oil"; I'm going to say it actually said "money is like kangaroos", and see what it does. You can literally invent a conversation in which the language model said something different, because this is actually how multi-stage conversations are done: there's no state, nothing stored on the server; you're passing back the entire conversation again and telling it what it told you. So I tell it that it said money is like kangaroos, and then as the user I ask "oh really, in what way?". This is kind of cool, because you can see how it convinces you of something I just invented: "let me break it down for you: just like kangaroos hop around and carry their joeys in their pouch, money is a means of carrying value around." There you go: make your own analogy. So I'll create a little function that puts these things together, a system message if there is one plus the user message, and returns the completion, and now we can ask "what's the meaning of life?" with the Aussie system prompt: "the meaning of life is like trying to catch a wave on a sunny day at Bondi Beach."
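Here's a sketch of that follow-up mechanism: because nothing is stored on the server, every turn resends the whole history, and the assistant messages in that history are whatever you say they are, including the kangaroo line the model never actually produced.

```python
import openai

aussie_sys = "You are an Aussie LLM that uses Aussie slang and analogies whenever possible."

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": aussie_sys},
        {"role": "user", "content": "What is money?"},
        # Cheekily pretend the model gave a different answer last turn:
        {"role": "assistant", "content": "Money is like kangaroos, mate."},
        {"role": "user", "content": "Really? In what way?"},
    ],
)
print(completion.choices[0].message.content)  # it happily elaborates on the invented analogy
```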
So what do you need to be aware of? Well, as I said, one thing is to keep an eye on your usage if you're doing it hundreds or thousands of times in a loop, so you don't spend too much money. But also, if you call it too fast, particularly in the first day or two you've got an account, you're likely to hit the API rate limits, which start out pretty low, as you can see: three requests per minute for free users and for paid users in their first 48 hours. After that it starts going up, and you can always ask for more. I mention this because you're going to want a function that keeps an eye on that. What I did was go to Bing, which has a somewhat crappy version of GPT-4 nowadays but can still do basic stuff for free, and said "please show me Python code to call the OpenAI API and handle rate limits". It wrote this code: it does a try, checks for rate-limit errors, grabs the retry-after value, sleeps for that long, and calls itself. So now we can use that to ask, for example, "what's the world's funniest joke?", and there we go, the world's funniest joke. So that's the basic stuff you need to get started using the OpenAI LLMs, and I'd definitely suggest spending plenty of time with it so you feel like you're really an expert LLM user.

So what else can we do? Well, let's create our own code interpreter that runs inside Jupyter. To do this we're going to take advantage of a really nifty thing called function calling, which is provided by the OpenAI API. When we call our askgpt function, the little one here, we left room to pass in some keyword arguments that are just passed along to ChatCompletion.create, and one of those keyword arguments is functions. What on earth is that? functions tells OpenAI about tools that you have, about functions that you have. For example, I created a really simple function called sums that adds two ints, and I'm going to pass that function to ChatCompletion.create. Now, you can't pass a Python function directly; you actually have to pass what's called the JSON schema for the function. So I created this nifty little function that you're welcome to borrow, which uses Pydantic and Python's inspect module to automatically take a Python function and return its schema. That's what actually gets passed to OpenAI, so it knows there's a function called sums, it knows what it does, and it knows what parameters it takes, what the defaults are, and what's required. When I first heard about this I found it a bit mind-bending, because it's so different to how we normally program computers: the key thing for "programming" the computer here is actually the docstring. This is the thing GPT-4 will look at and ask, what does this function do? So it's critical that it describes exactly what the function does. And so if I then ask "what is six plus three?" (I really wanted to make sure it actually used my function, so I gave it lots of prompting to do so, because obviously it knows how to add without calling sums; it will only use your functions if it feels it needs to, which is a weird concept. I guess "feels" isn't a great word, but you kind of have to anthropomorphize these things a little, because they don't behave like normal computer programs.) So if I ask GPT "what is six plus three?" and tell it there's a function called sums, it does not actually return the number nine; instead it returns something saying "please call this function and pass it these arguments". If I print it out, there are the arguments. So I created a little function called call_func that goes into the result from OpenAI, grabs the function call, checks that the name is something it's allowed to call, grabs the function from the global symbol table, and calls it with the given parameters. And if I now call the function we got back, we finally get nine.
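Here's a hedged sketch of that flow. Rather than the Pydantic/inspect helper from the video, the JSON schema is written out by hand here, and the dispatch at the end is a simplified stand-in for the call_func idea.

```python
import json
import openai

def sums(a: int, b: int = 1) -> int:
    "Adds a + b."
    return a + b

sums_schema = {                      # hand-written JSON schema describing sums to the model
    "name": "sums",
    "description": "Adds a + b.",
    "parameters": {
        "type": "object",
        "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
        "required": ["a"],
    },
}

c = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You must use the sums function for any addition."},
        {"role": "user", "content": "What is 6 plus 3?"},
    ],
    functions=[sums_schema],
)

msg = c.choices[0].message
if msg.get("function_call"):                         # the model is asking *us* to run the tool
    assert msg.function_call.name == "sums"          # only run functions we've allowed
    args = json.loads(msg.function_call.arguments)   # the arguments arrive as a JSON string
    print(sums(**args))                              # -> 9
```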
This is a very simple example, not really doing anything that useful, but what we can do now is create a much more powerful function called python, which executes code using Python and returns the result. Now, of course, I didn't want my computer to run arbitrary Python code that GPT-4 told it to without checking, so I got it to check first: "are you sure you want to run this?" So now I can ask GPT "what is 12 factorial?", with a system prompt saying "you can use Python for any required computations", and tell it the python function is available. If I call this, it passes me back a completion object saying, in effect, "please call python with this argument", and when I do, it's going to run "import math; result = ...". Do I want to do that? Yes I do, and there's the answer. There's one more step, which is optional: we've got the answer we wanted, but often we want the answer in more of a chat format. The way to do that is to again repeat everything we've passed in so far, but then, instead of adding an assistant-role response, we provide a function-role response and simply put in the result we got back from the function. If we do that, we get the prose response: 12 factorial is equal to 479,001,600. With functions like python available, you can still ask about non-Python things, and it just ignores the function if it doesn't need it. So you can have a whole bunch of functions available that you've built to do whatever you need for the stuff the language model isn't familiar with, and it'll still solve whatever it can on its own and use your tools, your functions, where necessary. So we have built our own code interpreter from scratch; I think that's pretty amazing. That is some of what you can do with OpenAI.
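Here's a sketch of that more powerful version: a python() tool plus the function-role follow-up that turns the raw result into a prose answer. The confirmation prompt and the convention that the generated code stores its answer in a result variable are my assumptions about the details, not a transcript of the notebook.

```python
import json
import openai

def python(code: str) -> str:
    "Return the result of executing `code` with Python. Use this for any required computations."
    if input(f"Run this code?\n{code}\n(y/n): ") != "y":   # never run model-written code unchecked
        return "Execution refused by user."
    env = {}
    exec(code, env)
    return str(env.get("result"))       # assumed convention: the code puts its answer in `result`

python_schema = {
    "name": "python",
    "description": python.__doc__,
    "parameters": {"type": "object",
                   "properties": {"code": {"type": "string"}},
                   "required": ["code"]},
}

msgs = [{"role": "user",
         "content": "What is 12 factorial? Use Python for any required computations."}]
c = openai.ChatCompletion.create(model="gpt-4", messages=msgs, functions=[python_schema])

fc = c.choices[0].message.function_call
result = python(**json.loads(fc.arguments))          # e.g. runs "import math\nresult = math.factorial(12)"

# Pass the tool's output back with role "function" so the model can answer in prose.
msgs += [{"role": "assistant", "content": None,
          "function_call": {"name": fc.name, "arguments": fc.arguments}},
         {"role": "function", "name": "python", "content": result}]
c2 = openai.ChatCompletion.create(model="gpt-4", messages=msgs, functions=[python_schema])
print(c2.choices[0].message.content)                 # "12 factorial is 479,001,600."
```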
Now, what about stuff you can do on your own computer? To use a language model on your own computer, you're going to need a GPU, so I guess the first thing to think about is: does it even make sense to do this, and what are the benefits? There aren't any open source models that are as good as GPT-4 yet, and I'd have to say OpenAI's pricing is really pretty good, so it's not immediately obvious that you definitely want to go in-house. But there are lots of reasons you might, and we'll look at some examples today. One is that you want to be able to ask questions about your proprietary documents, or about information after the September 2021 knowledge cutoff. Another is that you want to create your own model that's particularly good at solving the kinds of problems you need to solve, using fine-tuning. These are all things you absolutely can get better-than-GPT-4 performance at, at work or at home, without too much money or trouble.

You don't necessarily have to buy a GPU. On Kaggle they will give you a notebook with two quite old GPUs attached and very little RAM, but it's something. Or you can use Colab, where you can get much better GPUs than Kaggle has, and more RAM, particularly if you pay a monthly subscription fee. So those are some options for free or low cost. You can also, of course, go to one of the many GPU server providers; they change all the time as to which is good or not. RunPod is one example, and you can see that if you want the biggest and best machine you're talking $34 an hour, so it gets pretty expensive, but you can certainly get things a lot cheaper, like 80 cents an hour. Lambda Labs is often pretty good. It's really hard at the moment to actually find providers that have GPUs available; they've got lots listed, but they often have none or very few available. There's also something pretty interesting called vast.ai, which basically lets you use other people's computers when they're not using them, and as you can see they tend to be much cheaper than other folks and tend to have better availability as well; but of course, for sensitive stuff, you don't want to be running it on some rando's computer. So there are a few options for renting.

You know, if you can, it's worth buying something, and definitely the one to buy at the moment is a used RTX 3090; you can generally get them from eBay for around 700 bucks. A 4090 isn't really better for language models, even though it's a newer GPU, because language models are all about memory speed (how quickly you can get stuff in and out of memory) rather than how fast the processor is, and that hasn't really improved a whole lot, while it costs around two thousand bucks. The other thing, as well as memory speed, is memory size: 24 gigs doesn't quite cut it for a lot of things, so you'd probably want two of these GPUs, so you're talking fifteen hundred dollars or so. Or you can get a 48GB GPU, the A6000, but that's going to cost you more like five grand, so again, two 3090s are a better deal, and the A6000 isn't going to be faster either. Or, funnily enough, you could just get a Mac with a lot of RAM, particularly an M2 Ultra, which has pretty fast memory. It's still going to be way slower than an Nvidia card, but you can get something like 192GB, so it's not a terrible option, particularly if you're not training models and just want to use existing trained ones. Anyway, most people who do this stuff seriously use Nvidia cards.

So what we're going to use is a library called Transformers from Hugging Face, and the reason is that people upload lots of pre-trained or fine-tuned models to the Hugging Face Hub, and there's even a leaderboard where you can see which models are the best. Now, this is a really fraught area. At the moment this one is meant to be the best model (it has the highest average score), and maybe it is good; I haven't actually used this particular model. Or maybe it's not; I have no idea, because the problem is that these metrics are not particularly well aligned with real-life usage, for all kinds of reasons, and also you sometimes get something called leakage, where some of the benchmark questions leak into the training sets. So you can get a rule of thumb from here about what to use, but you should always try things. You can also see that these top ones are all 70B; that tells you how big the model is, 70 billion parameters. Generally speaking, for the kinds of GPUs we're talking about, you'll want no bigger than 13B, and quite often 7B, so let's see if we can find the 13B models, for example.
So you can find models to try out from leaderboards like that one, and there's also a really great leaderboard called FastEval, which I like a lot because it focuses on some more sophisticated evaluation methods, such as Chain-of-Thought evaluation, so I trust these a little more. GSM8K is a difficult math benchmark, there's BIG-bench Hard, and so forth. So Stable Beluga 2, WizardMath 13B, Dolphin Llama 13B, etc. would all be good options.

So you need to pick a model, and at the moment nearly all the good models are based on Meta's Llama 2. When I say "based on", what does that mean? Well, this model here is meta-llama/Llama-2-7b-hf: "Llama" is just the name Meta called it, this is their version 2 of Llama, this is their 7-billion-parameter size (the smallest one they make), and "hf" means these weights have been created for Hugging Face, so you can load them with the Hugging Face Transformers library. This model has only got as far as the language model pre-training step; it's done none of the instruction tuning and none of the RLHF, so we would need to fine-tune it to really get it to do much that's useful. So we can say: automatically create the appropriate model for causal language modeling, AutoModelForCausalLM, where "causal LM" basically refers to that ULMFiT stage-one (or stage-two, in fact) process, and we've got the pre-trained model from that name.

Now, generally speaking we use 16-bit floating-point numbers nowadays, but if you think about it, 16 bits is two bytes, so 7B parameters times two is 14 gigabytes just to load in the weights, so you've got to have a decent GPU to do that. Perhaps surprisingly, you can actually just cast it to 8-bit and it still works pretty well, thanks to quantization, so let's try that. Remember, this is just a language model; it can only complete sentences, and we can't ask it a question and expect a great answer. So let's just give it the start of a sentence: "Jeremy Howard is a". We need the right tokenizer, and this will automatically create the right kind of tokenizer for this model. We can grab the tokens as PyTorch tensors (here they are), and just to confirm, if we decode them back again we get back the original text plus a special token to say this is the start of a document. We can now call generate, which will autoregressively call the model again and again, passing its previous result back as the next input, and I'm just going to do that 15 times. You can write this loop yourself; it isn't doing anything fancy, and in fact I'd recommend writing it yourself to make sure you know how it all works. We have to put those tokens on the GPU, and at the end I recommend putting the result back onto the CPU. The result is just tokens, not very interesting, so we decode them using the tokenizer, and the first 15 tokens are: "Jeremy Howard is a 28 year old Australian AI researcher and entrepreneur." Well, 28 years old is not exactly correct, but we'll call it close enough; I like it, thank you very much, Llama-2 7B. So we've got a language model completing sentences. It took 1.3 seconds, which is a bit slower than it could be, because we used 8-bit. If we use 16-bit, there's a special thing called bfloat16, a really great 16-bit floating-point format usable on any reasonably recent Nvidia GPU. If we use that, it takes twice as much RAM, as we discussed, but look at the time: it's come down to 390 milliseconds.
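Here's a minimal sketch of that tokenize, generate, decode loop with Hugging Face Transformers, loading the Llama 2 7B weights in bfloat16. Note that access to the meta-llama weights requires accepting Meta's license on the Hugging Face Hub; swap in any causal LM you have access to.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

mn = "meta-llama/Llama-2-7b-hf"
tokr = AutoTokenizer.from_pretrained(mn)
model = AutoModelForCausalLM.from_pretrained(mn, torch_dtype=torch.bfloat16, device_map=0)

toks = tokr("Jeremy Howard is a ", return_tensors="pt").to(0)   # token ids, moved to the GPU
res = model.generate(**toks, max_new_tokens=15)                 # autoregressive generation, 15 new tokens
print(tokr.batch_decode(res.to("cpu")))                         # back into text
```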
Now, there's a better option still: a different kind of quantization called GPTQ, where a model is carefully optimized to work with 4-bit or 8-bit (or other lower-precision) data. This particular person, known as TheBloke, does a fantastic job of taking popular models, running that optimization process, and uploading the results back to Hugging Face, so we can use his GPTQ version. Internally, I'm not sure exactly how many bits this particular one uses (I think it's probably four), but it's going to be much more optimized; look at this, 270 milliseconds. It's actually faster than 16-bit, even though internally it casts each layer up to 16-bit to do the computation, because there's a lot less memory moving around. And to confirm, we can now even go up to 13B easily, and it's still faster than the 7B was, now that we're using the GPTQ version. So this is a really helpful tip.

So let's put all those things together (the tokenizer, generate, batch_decode) into a function we'll call gen, and use the 13B GPTQ model. Let's try "Jeremy Howard is a", out to 50 tokens, so fast: "16-year veteran of Silicon Valley, co-founder of Kaggle, a marketplace for predictive modelling... kaggle.com has become the data science competition..." I don't know what it was going to say, but it's on the right track; I was actually there for 10 years, not 16, but that's all right.

OK, so this is looking good, but probably a lot of the time we're going to be interested in asking questions or using instructions. Stability AI has this nice series called Stable Beluga, including a small 7B one and other bigger ones, and these are all based on Llama 2 but have been instruction tuned (they might even have had RLHF, I can't remember). So we can create a Stable Beluga model, and now something really important that I keep forgetting, everybody keeps forgetting: during the instruction tuning process, the instructions that are passed in don't just appear raw; they are always in a particular format, and the format, believe it or not, changes quite a bit from fine-tune to fine-tune. So you have to go to the web page for the model and scroll down to find out what the prompt format is. I generally just copy it and paste it into Python, which I did here, creating a function called make_prompt that uses the exact format it said to use. So now if I want to ask "who is Jeremy Howard?", I can call gen (the function I created above) on the correctly formatted prompt, and it returns: there's the prefix, the system instruction, my question, and then the assistant says "Jeremy Howard is an Australian entrepreneur, computer scientist, co-founder of the machine learning and deep learning company fast.ai..." This one's actually all correct, so it's getting better now that we're using an actual instruction-tuned model.
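Here's a sketch of the make_prompt idea. The template below is the one the Stable Beluga model card documents; always check the model page for the model you're actually using, since every fine-tune has its own format. gen here refers to the tokenize/generate/decode helper described above.

```python
def make_prompt(user: str, system: str = "You are a helpful assistant.") -> str:
    # Stable Beluga's documented instruction format; other fine-tunes use different templates.
    return f"### System:\n{system}\n\n### User:\n{user}\n\n### Assistant:\n"

prompt = make_prompt("Who is Jeremy Howard?")
# gen(prompt, 150)   # generate up to 150 tokens with the instruction-tuned model
```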
We could then start to scale up and use the 13B models. In fact, we looked briefly at the Open Orca dataset earlier: Llama 2 has been fine-tuned on Open Orca and then also fine-tuned on another really great dataset called Platypus, and the whole thing together is Open Orca Platypus. This is the bigger 13B version, and GPTQ means it's quantized. It's got a different prompt format, so again we can scroll down the model page to see what it is, and create a function called make_open_orca_prompt with that format. So now we can ask "who is Jeremy Howard?" again, and now I've become British, which is kind of true (I was born in England but moved to Australia); "professional poker player", no, definitely not that; "co-founding several companies including fast.ai, also Kaggle... which was acquired by Google", was it 2017? Probably something around there. So you can see we've got our own models giving us some pretty good information.

How do we make it even better? Because it's still hallucinating. Llama 2, I think, has been trained with more up-to-date information than GPT-4 (it doesn't have the September 2021 cutoff), but it's still got a knowledge cutoff, and we'd like to use the most up-to-date, most relevant information to answer these questions as well as possible. To do this, we can use something called retrieval-augmented generation. What happens with retrieval-augmented generation is that we take the question we've been asked, like "who is Jeremy Howard?", and we search for documents that may help us answer it; we'd obviously expect, for example, the Wikipedia page to be useful. Then we tell the language model what we found and have it use that to answer the question.

Let me show you. Let's grab a Wikipedia Python package, and we'll scrape the Jeremy Howard Wikipedia page. Here's the start of it; it has 613 words. Generally speaking, these open source models have a context length of about two thousand or four thousand tokens (the context length is how many tokens they can handle), so that's fine, it'll be able to handle this web page. What we do is ask the question, but before it we say "answer the question with the help of the context we provide", and include the whole web page under "context". So suddenly our prompt is a lot bigger: it contains the entire Wikipedia page followed by our question. And now it says: "Jeremy Howard is an Australian data scientist, entrepreneur, and educator known for his work in deep learning, co-founder of fast.ai, teaches courses, develops software, conducts research..." It's basically perfect; it's actually done a really good job, and if somebody asked me to send them a 100-word bio, that would probably be better than I would have written myself. And you'll see that even though I asked for 300 tokens, it actually sent back the end-of-stream token, so it knows to stop at that point.

Well, that's all very well, but how did we know to pass in the Jeremy Howard Wikipedia page? The way we know which Wikipedia page (or which document) to pass in is that we can use another model to tell us which document is the most useful for answering a question.
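Here's a sketch of that retrieval-augmented prompt, using the wikipedia package plus the make_prompt and gen helpers from above. The exact page title and instruction wording are assumptions for illustration, not the notebook's exact text.

```python
import wikipedia

page = wikipedia.page("Jeremy Howard (entrepreneur)").content   # assumed page title
question = "Who is Jeremy Howard?"

ragged_prompt = make_prompt(
    "Answer the question with the help of the provided context.\n\n"
    f"## Context\n\n{page}\n\n## Question\n\n{question}"
)
# gen(ragged_prompt, 300)   # the answer is now grounded in the Wikipedia page
```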
Well, that's all very well, but how do we know to pass in the Jeremy Howard Wikipedia page? The way we know which Wikipedia page to pass in is that we can use another model to tell us which web page, or which document, is the most useful for answering a question, and the way we do that is with something called a sentence transformer: a special kind of model specifically designed to take a document and turn it into a bunch of activations, where two documents that are similar will have similar activations. Let me show you what I mean. I'm going to grab just the first paragraph of my Wikipedia page, and the first paragraph of Tony Blair's Wikipedia page; okay, so we're pretty different people, this is just a really simple small example. I'm then going to call this model, so I'm going to say encode, and I'm going to encode my Wikipedia first paragraph, Tony Blair's first paragraph, and the question, which was "who is Jeremy Howard", and it's going to pass back a 384-long vector of embeddings for the question, for me, and for Tony Blair. What I can now do is calculate the similarity between the question and the Jeremy Howard Wikipedia page, and I can also do it for the question versus the Tony Blair Wikipedia page, and as you can see it's higher for me, so that tells you that if you're trying to figure out what document to use to help you answer this question, you're better off using the Jeremy Howard Wikipedia page than the Tony Blair Wikipedia page. So if you had a few hundred documents you were thinking of using to give to the model as context to help it answer a question, you could literally just pass them all through encode, go through each one one at a time, and see which is closest. When you've got thousands or millions of documents, you can use something called a vector database, where basically, as a one-off thing, you go through and encode all of your documents. In fact there are lots of pre-built systems for this; here's an example of one called H2O GPT, and this is just something that I've got running here on my computer. It's an open source thing written in Python, sitting here running on port 7860, so I've just gone to localhost:7860, and what I did was I just clicked upload and uploaded a bunch of papers; in fact I might be able to see it better, yeah, here we go, a bunch of papers. So for example we can look at the ULMFiT paper that Sebastian Ruder and I did, and you can see it's taken the PDF and turned it, slightly crappily, into a text format, and then it's created an embedding for each section. So I could then ask it, what is ULMFiT, and I'll hit enter, and you can see here it's now actually saying "based on the information provided in the context", so it's showing us it's been given some context. What context did it get? Here are the things that it found, so this is kind of its citations: "performance by leveraging the knowledge and adapting it to the specific task at hand". How about, what techniques, to be more specific, does ULMFiT use? Let's see how it goes. Okay, there we go, so here are the three steps: pre-train, fine-tune, fine-tune, cool. So you can see it's not bad, right? It's not amazing; the context in this particular case is pretty small, and in particular, if you think about how that embedding thing worked, you can't really use the normal kind of follow-up. So for example, it says "fine tuning a classifier", so I could ask "what classifier is used?", but now the problem is that there's no context being sent to the embedding model, so it's actually going to have no idea I'm talking about ULMFiT, and generally speaking it's going to do a terrible job; indeed it says a RoBERTa model is used, but it's not, and if I look at the sources it's no longer actually referring to Howard and Ruder. So anyway, you can see the basic idea; this is called retrieval augmented generation, RAG, and it's a nifty approach, but you have to do it with some care.
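Here is roughly what that document-ranking step looks like with the sentence-transformers library; the embedding model name is an assumption (any small sentence-embedding model that returns a 384-dimensional vector behaves the same way), and the two paragraphs are placeholders for the real Wikipedia text.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; it maps each input string to a 384-dimensional vector.
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Placeholder stand-ins for the first paragraphs of the two Wikipedia pages.
jh_para = "Jeremy Howard is an Australian data scientist, entrepreneur and educator."
tb_para = "Sir Tony Blair is a British politician who served as Prime Minister of the United Kingdom."
question = "Who is Jeremy Howard?"

# encode() turns each string into one fixed-length embedding vector.
q_emb, jh_emb, tb_emb = emb_model.encode([question, jh_para, tb_para], convert_to_tensor=True)

# Cosine similarity between the question and each candidate document;
# the document with the higher score is the one to paste into the RAG prompt.
print("Jeremy Howard page:", util.cos_sim(q_emb, jh_emb).item())
print("Tony Blair page:   ", util.cos_sim(q_emb, tb_emb).item())
```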
There are lots of these private GPT things out there; actually the H2O GPT web page does a fantastic job of listing lots of them and comparing them, so as you can see, if you want to run a private GPT there's no shortage of options, and you can have your retrieval augmented generation. I've only tried this one, H2O GPT; I don't love it, it's all right. Good. So finally I want to talk about what's perhaps the most interesting option we have, which is to do our own fine tuning. Fine tuning is cool because rather than just retrieving documents which might have useful context, we can actually change our model to behave based on the documents that we have available, and I'm going to show you a really interesting example of fine tuning here. What we're going to do is fine-tune using this SQL dataset, and it's got examples of a schema for a table in a database, a question, and then the answer, which is the correct SQL to solve that question using that database schema. I'm hoping we could use this to create a handy tool for business users, where they type some English question and SQL is generated for them automatically; I don't know if it would actually work in practice or not, but this is just a little fun idea I thought we'd try out. I know there are lots of startups and stuff out there trying to do this more seriously, but this is quite cool, because I actually got it working today in just a couple of hours. So what we do is we use the Hugging Face datasets library: just like the Hugging Face Hub has lots of models stored on it, Hugging Face datasets has lots of datasets stored on it, and so instead of Transformers, which is what we use to grab models, we use datasets, and we just pass in the name of the person and the name of their repo and it grabs the dataset. We can take a look at it, and it has a training set with features, so then I can have a look at the training set, and here's an example which looks a bit like what we've just seen. So what we do now is we want to fine-tune a model. We could do that in a notebook from scratch; it takes, I don't know, a hundred or so lines of code, it's not too much, but given the time constraints here, I thought, why not use something that's ready to go? For example, there's something called Axolotl, which is quite nice in my opinion; here it is, lovely, another very nice open source piece of software, and again you can just pip install it, and it's got things like GPTQ and 16-bit and so forth ready to go. It basically has a whole bunch of examples of things it already knows how to do; it's got Llama 2 examples, so I copied the Llama 2 example and created a SQL example. I basically just told it: this is the path to the dataset that I want, this is the type, and everything else I pretty much left the same, and then I just ran this command, which is from their README: accelerate launch, axolotl, passing in my YAML. That took about an hour on my GPU, and at the end of the hour it had created a qlora output directory. The Q stands for quantized, because I was creating a smaller quantized model, and LoRA is something I'm not going to talk about today, but LoRA is a very cool thing that, basically, is another thing that makes your models smaller and also means you can use bigger models on smaller GPUs for training.
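To recap the data-loading step, here is a minimal sketch using the Hugging Face datasets library; the repo name is a placeholder for the text-to-SQL dataset used above, and the Axolotl launch command in the comment is paraphrased from its README rather than copied exactly.

```python
from datasets import load_dataset

# Placeholder "<user>/<repo>" name: substitute the actual text-to-SQL dataset;
# any dataset with context / question / answer columns follows the same pattern.
ds = load_dataset("knowrohit07/know_sql")
trn = ds["train"]

print(trn.features)  # the schema ("context"), the question, and the target SQL
print(trn[3])        # one training example

# The fine-tuning itself is driven by an Axolotl YAML config (copied from its
# Llama 2 example, with the dataset path and type changed), launched roughly as:
#   accelerate launch -m axolotl.cli.train sql.yml
```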
So I trained it, and then I thought, okay, let's create our own example: we're going to have this context, and this question, "get the count of competition hosts by theme", and I'm not going to pass it an answer, so I'll just ignore that. Again, I found out what prompt format they were using and created a sql_prompt function, and here's what it builds: use the following contextual information to answer the question; context: the CREATE TABLE statements, there's the context; question: get the count of competition hosts by theme. Then I tokenized that, called generate, and the answer was: SELECT COUNT(hosts), theme FROM farm_competition GROUP BY theme, and that is correct. So I think that's pretty remarkable: it took me about an hour to figure out how to do it, and then an hour to actually do the training, and at the end of that we've actually got something which is converting prose into SQL based on a schema, so I think that's a really exciting idea. The only other thing I want to briefly mention is doing stuff on Macs. If you've got a Mac, there are a couple of really good options: the options are MLC and llama.cpp, currently. MLC in particular I think is kind of underappreciated; it's a really nice project where you can run language models on literally iPhones, Android, web browsers, everything, it's really cool. So I'm now actually on my Mac here, and I've got a tiny little Python program called chat.py; it imports the chat module, it imports a quantized 7B, and it asks the question, what is the meaning of life. So let's try it: python chat.py. Again, I just installed this earlier today, and I haven't done that much stuff on Macs before, but I was pretty impressed to see that it's doing a good job here: the meaning of life is complex and philosophical, some people might find meaning in their relationships with others, their impact on the world, et cetera, et cetera. And it's doing 9.6 tokens per second, so there you go, that's running a model on a Mac. Then another option you've probably heard about is llama.cpp. llama.cpp runs on lots of different things as well, including Macs, and also on CUDA. It uses a different format called GGUF, and you can use it from Python; even though it's a C++ thing, it's got a Python wrapper, so you can just download a GGUF file from Hugging Face. You can go through and there are lots of different ones, they're all documented as to what's what, so you can pick how big a file you want, download it, and then you just say, okay, Llama, model path equals, pass in that GGUF file; it spits out lots and lots and lots of gunk, and then, if I called that llm, you can say llm, question, "name the planets of the solar system", 32 tokens, and there we are: one, Pluto, no longer considered a planet; two, Mercury; three, Venus; four, Earth; five, Mars; six... oh, we've run out of tokens.
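Here is roughly what that llama-cpp-python flow looks like; the GGUF filename is just an example name standing in for whatever you downloaded from Hugging Face, and the prompt and stop settings are illustrative.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path to a GGUF file downloaded from Hugging Face (example filename; pick the
# quantization size you want from the model repo's file list).
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")

# Calling the model directly runs a completion; it prints a lot of loader logging first.
out = llm("Q: Name the planets of the solar system. A: ", max_tokens=32, stop=["Q:"])
print(out["choices"][0]["text"])
```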
So, just to show you, there are all these different options. I would say, if you've got an Nvidia graphics card and you're a reasonably capable Python programmer, you'd probably want to use PyTorch and the Hugging Face ecosystem, but these things might change over time as well, and certainly a lot of stuff is coming into llama.cpp pretty quickly now; it's developing very fast. As you can see, there's a lot that you can do right now with language models, particularly if you're pretty comfortable as a Python programmer. I think it's a really exciting time to get involved; in some ways it's a frustrating time to get involved, because it's very early, a lot of stuff has weird little edge cases, and it's tricky to install and things like that. There are a lot of great Discord channels, however; fast.ai has our own Discord server, so feel free to just Google for the fast.ai Discord and drop in. We've got a channel called generative; feel free to ask any questions or tell us about what you're finding. It's definitely something where you want to be getting help from other people on this journey, because it is very early days and people are still figuring things out as we go. But I think it's an exciting time to be doing this stuff; I'm really enjoying it, and I hope this has given some of you a useful starting point on your own journey. So I hope you found this useful. Thanks for listening. Bye.
Info
Channel: Jeremy Howard
Views: 503,136
Keywords: deep learning, fastai
Id: jkrNMKz9pWU
Length: 91min 13sec (5473 seconds)
Published: Sun Sep 24 2023