How GPT3 Works - Easily Explained with Animations

Video Statistics and Information

Captions

The tech world is abuzz with hype about GPT-3, a new AI model that is doing impressive things. Massive language models like GPT-3 keep impressing us with their abilities. One demo that I really enjoyed is this one of a website where you can type a description of a web application, and that description is sent to GPT-3. GPT-3 then generates the code and builds that website for you. It's a simple, tiny React application, but I thought this was impressive as far as language model and machine learning applications go.

Another interesting application is this one, which uses GPT-3 inside of a spreadsheet. It was able to figure out the relationship, but notice that this is not a calculation. It's actually going and getting that piece of information, and that piece of information is encoded inside of GPT-3, so I thought this was really impressive. You can find multiple examples and demos, and more will keep rolling in over the next days and weeks, of things that GPT-3 and massive language models like it are able to do.

Now, GPT-3 is actually two things. It's an API served by OpenAI, a company and foundation that does AI research. That API is in invite-only beta now, so you apply, they give you access, and then you're able to send input prompts and interact with the AI. And then, inside that API, on the servers of OpenAI, is a machine learning model called GPT-3, and that is the model we will be talking about. I want to remove the aura of mystery and explain some of the main concepts behind how a model such as this is built and trained, and how it works. That's what we'll be talking about in this video.

[Music]

We can think of OpenAI's GPT-3 as a model, as a black box that takes an input, a sequence of words, and generates an output, which is another sequence of words. This is an example from the Foundation and the work of Isaac Asimov, the Three Laws of Robotics. This is not an actual example; I just use it for illustration. So, at the simplest, you can think of it as a machine learning model that takes, let's say, a sentence or some text, and outputs some more text. The output is generated using what the model has learned during its training phase.

In the training phase, we start with an untrained model, and in this case the model was trained against a lot of data: 300 billion tokens. You can think of a token as a word. So this was a lot of text collected from the internet, and the model was trained on a specific task against all of this data. What is the task that the model was trained on? It is predicting the next word. We present a few words to the model, a sequence of words, and we say, "Okay, predict the next word," and the model has to do that. After millions of steps it learns something, and it captures something from, let's say, the probability distributions and the statistical relationships inside of those examples of text. We'll look at that next.

So how do we generate training examples from the text? Let's say we have this sentence at the top: "Second Law of Robotics: A robot must obey the orders given it by human beings." We can generate multiple examples to train the model against. We can take four or five of these words as input and say, "Okay, generate the sixth word," which is "a" in this case. Then we can generate another example by saying, "Okay, we'll give you six words and you generate the seventh word," or predict it, and so on and so forth. If you crawl the internet and get text from forums and websites and newspapers, you'll end up with a lot of text, and you can break it down like this and generate millions or even billions of examples for the machine learning model to learn from.

This is one step of the training process that we just discussed. We have the example that we generated, so we know that "a robot must" is followed by, let's say, "obey." We present the model with only the three words in green, "a robot must," and we say, "Okay, predict the next word." We don't show it the actual answer, but we know what that is. So we present these examples to the model, and the model makes a calculation (we'll talk later, at just a high level, about how that calculation is made) and it generates an output. That output is not going to be a good output, because this model is not yet fully trained. It's untrained, or it's just starting to go through the training process. So we say, "You generated an output of a word that is 'troll,' which is incorrect. The correct answer we were expecting is 'obey.'" We have a way of calculating the error: we can quantify how wrong this prediction was, or how far off it was. Then that error calculation is fed back into the model, where we update the weights or parameters of the model, we update how the model works, so that the next time it comes across this example it's able to generate a better prediction, a prediction that is much closer to "obey" than it is to "troll."

This is the basic step for training machine learning models. It is not novel. A lot of what we'll discuss here is an explanation of how GPT-3 works, not what it invented or what is novel in this model. If you've come across machine learning examples before, this is the loop that is usually used in supervised training of machine learning models.

Now let's look at these steps in a little more detail. The model actually takes its input one token, or one word, at a time, and it also produces its output one token at a time. I just mentioned that in this video we're discussing how the model works and not what is novel or new in it. The main thing that is actually new in GPT-3 is the size. The model is massive. What is the size of a model? The model contains a lot of numbers called parameters, or weights, and GPT-3 contains 175 billion of these. You can look at the first video in my intro to AI, where we look at a simple machine learning model with one weight, one parameter, that we can use to make predictions. This, by contrast, is one of the latest high-tech models, and it's using 175 billion parameters. These are numbers that the model uses to encode what it learns from being exposed to all of this text. So I'd refer you to my intro to AI to get a little bit of an understanding of what a parameter of a model is. These parameters are arranged into various matrices inside the model, and the process of generating a prediction is mostly multiplying these different matrices by the inputs that the model gets with each token.

Another way of looking more closely at GPT-3 is to say that each word, each token, flows through a track, and the model has a context window of 2048 tokens, so the input and output have to fit within that number of tokens. There are ways of going beyond that; you can adjust the model to handle more than that number of tokens. But one good way to start to understand how a transformer model like GPT-3 works is to think about the number of tokens it can process. Each of these tokens is processed in its own track, and once we've processed all of the inputs, the model starts to generate tokens that we can use as, or think of as, the output of the model.

One way of thinking about how the model works is this. You have the words, and with every word you have a vector representing that word. I would like to refer you to my Illustrated Word2vec post on my blog if you want to understand a little bit more about word embeddings. In this case, each word has a vector, a list of numbers, that captures some of the meaning and represents that word. Those are the boxes here in yellow and green, all the way to blue. When we process a word, we actually process the vector, and that vector goes through various layers of transformer decoders. GPT-3 has 96 of these. You see how these are stacked one on top of the other. This is the depth. When you hear "deep learning," deep learning is these models that are a little bit more complicated. They're able to make predictions that are a little bit more sophisticated by using a high number of layers, with the computation flowing between them, and we see that earlier layers process different things than later layers do.

So this is the processing of the first token. Then the second token goes in and is processed through every layer, and then every token in the input sequence goes in. When we are processing the last token in the input, the output starts to be generated. This is an example where we're giving a command to this AI model, and then this is its response to us. We tell it, "A robot must obey the orders given it," and it would respond to us, hopefully, "Okay, human," and it would do it in this way. So this is just an X-ray into how this model is structured and how it processes its input and output.

In the React code generation example that we've seen before, my assumption is that the model works like this. The description is given as an input to the model, but we also have to give it a number of examples to prime the model to generate the kind of output we want, to let it know that we're expecting React code when we give it this description. To do that, we have to give the model 2 or 3 or 10 or more examples of description, code, description, code, and between these examples we have special tokens. This is my assumption of how this works, given how GPT-2 works and how previous transformer language models have worked. We don't have an implementation to look at yet for GPT-3, but this is my best assumption. That input goes in, it's processed token by token, and then the model is able to generate its outputs like this.

Now, we have not seen the best demos of what GPT-3 is able to do. These will start rolling in over the coming weeks and months. That's because the model is going to be able to do more amazing things once OpenAI releases the ability to fine-tune the model. This is one of the tools in large machine learning models, let's say language models and other models, that has enabled some of the really impressive results so far. GPT-3, as we've discussed, uses the same model, the same weights that were trained, and it cost something like five million dollars, or 4.6 million dollars, to train, and 355 years of GPU time if it were processed on one GPU. So that training process has been done, and every demo that we've seen so far uses that one model with no updates to the weights, just changes in the prompt and the input to get the model to do more interesting things.

Fine-tuning is something that's going to be rolled out soon, I believe; we've heard that from OpenAI. In that case, you give OpenAI, or the API, more examples, and the model is actually trained a little bit more and the weights are updated, so the model is able to create better websites or do better translation from one language to another. We'll start to see some really impressive demos once this is rolled out.

So this is it. This is a high-level overview of how GPT-3 works. Hope you've enjoyed it. Please let me know if you have any ideas or comments, please subscribe and like, and see you in the next video.
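Editor's sketch: the captions describe turning one sentence into many next-word training examples by showing the model a growing prefix and asking for the following word. The snippet below is a minimal illustration of that idea, assuming simple whitespace splitting in place of GPT-3's real tokenizer; the function name `make_training_examples` is hypothetical and does not come from the video.

```python
def make_training_examples(text, min_context=1):
    """Split text into (context_words, next_word) pairs for next-word prediction.

    Whitespace splitting is a stand-in for a real tokenizer (an assumption).
    """
    words = text.split()
    examples = []
    for i in range(min_context, len(words)):
        # Everything before position i is the input; the word at i is the label.
        examples.append((words[:i], words[i]))
    return examples


sentence = "a robot must obey the orders given it by human beings"
examples = make_training_examples(sentence)

for context, target in examples[:3]:
    print(" ".join(context), "->", target)
```

The third pair printed is the one used in the captions: the input "a robot must" with the expected next word "obey".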
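Editor's sketch: the training step in the captions (predict, measure the error, feed it back, update the weights) can be shown on a toy scale. This is not GPT-3's architecture. As an assumption for illustration, the "model" here is just three adjustable scores for the context "a robot must", one per candidate word, trained with a softmax and cross-entropy loss. The vocabulary, learning rate, and function names are all hypothetical.

```python
import math

# Toy vocabulary of candidate next words (hypothetical).
vocab = ["obey", "troll", "sleep"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(scores, target_index, learning_rate=0.5):
    """One step: predict, quantify the error, and nudge the weights."""
    probs = softmax(scores)
    loss = -math.log(probs[target_index])  # cross-entropy error
    new_scores = []
    for i, (score, prob) in enumerate(zip(scores, probs)):
        expected = 1.0 if i == target_index else 0.0
        # Gradient of the loss with respect to each score is (prob - expected).
        new_scores.append(score - learning_rate * (prob - expected))
    return new_scores, loss


# The untrained toy model happens to prefer "troll".
scores = [0.0, 2.0, 0.0]
target = vocab.index("obey")

losses = []
for _ in range(20):
    scores, loss = train_step(scores, target)
    losses.append(loss)

prediction = vocab[max(range(len(vocab)), key=lambda i: scores[i])]
print(prediction, losses[0], losses[-1])
```

After repeated steps the error shrinks and the highest-scoring word moves from "troll" to "obey", which is the behavior the captions describe. A real model updates billions of shared weights across many examples, not one score per word.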
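Editor's sketch: the captions note that input and output must fit in a 2048-token context window and that output is produced one token at a time once the input has been processed. The loop below illustrates only that control flow. The `next_token_fn` argument and the canned lookup table are hypothetical stand-ins for the real model, which is not available to inspect.

```python
CONTEXT_WINDOW = 2048  # context size quoted in the captions


def generate(next_token_fn, prompt_tokens, max_new_tokens,
             context_window=CONTEXT_WINDOW):
    """Append one predicted token at a time until a limit is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        if len(tokens) >= context_window:
            break  # prompt plus output must fit in the window
        tokens.append(next_token_fn(tokens))
    return tokens[len(prompt_tokens):]


# Hypothetical stand-in for a trained model: looks only at the last token.
canned = {"it": "okay", "okay": "human"}


def toy_next_token(tokens):
    return canned.get(tokens[-1], "<end>")


prompt = "a robot must obey the orders given it".split()
output = generate(toy_next_token, prompt, max_new_tokens=2)
print(" ".join(output))
```

Passing a smaller `context_window` (for example, one equal to the prompt length) leaves no room for output, which is why long prompts reduce how much the model can generate.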
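Editor's sketch: the captions give the speaker's assumption that the React demo primes the model with several description and code pairs, separated by special tokens, followed by the new description. The snippet below shows one way such a prompt could be assembled as plain text. The labels `description:` and `code:`, the `###` separator, and the function name `build_prompt` are all hypothetical; the demo's actual prompt format is not known from the video.

```python
def build_prompt(examples, new_description, separator="\n###\n"):
    """Join (description, code) example pairs, then leave the last code slot open."""
    parts = []
    for description, code in examples:
        parts.append(f"description: {description}\ncode: {code}")
    # The final entry has no code, so the model is expected to continue from here.
    parts.append(f"description: {new_description}\ncode:")
    return separator.join(parts)


few_shot_examples = [
    ("a button that says hello", "<button>hello</button>"),
    ("a heading that says welcome", "<h1>welcome</h1>"),
]

prompt = build_prompt(few_shot_examples, "a link that says home")
print(prompt)
```

No weights change in this setup. As the captions say of the demos so far, only the prompt differs from one task to the next.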

Info

Channel: Jay Alammar

Views: 58,097

Rating: 4.875195 out of 5

Keywords: #machinelearning, ai, nlp, gpt3, ML

Id: MQnJZuBGmSQ

Length: 13min 41sec (821 seconds)

Published: Thu Aug 13 2020