What are Transformer Models and how do they work?


Video
Captions Word Cloud
Reddit Comments
Captions
Hello, my name is Luis Serrano and this is Serrano Academy, and today I'm going to tell you about transformer models. As you may have seen, transformers are absolutely amazing: they can chat with you, answer questions, follow commands, write stories and poems, and even write code, which is pretty amazing. But now the question is: how do these transformers work? Despite the mystery behind them, you'll be happily surprised to know that they're actually not that complicated. They do require huge datasets and a lot of computing power to be trained properly, but the architecture itself is pretty simple. It is formed by several blocks, such as attention, a feed-forward neural network, embeddings, etc. In this video I'm going to tell you how each of these blocks works separately, and then how to put them together to get the transformer architecture. Are you ready? Let's begin.

Transformers were introduced in the paper called "Attention Is All You Need". This is the third of a series of three videos: the first one was about attention mechanisms, mostly the idea behind them; the second one was also about attention mechanisms, but with more math involved; and in this third one I'm going to tell you about the whole architecture of a transformer model. More specifically, in this video I will cover the following. First, the architecture of a transformer model, which includes several steps such as tokenization and embeddings; in particular, I will tell you about one type of embedding called word2vec. Then we'll go a little bit into attention mechanisms, which are a crucial part of these models; I've already covered them in the previous two videos, so here I'll just give you the idea. And finally, I will tell you about feed-forward neural networks, which are the main engine of a transformer model.

So let's start with the first question: what is a transformer? This is how I imagine a transformer, and as you've seen before, it generates text. Now, how does it generate text? You'll be surprised: it actually does it one word at a time. You may have seen that it generates really long and elaborate responses, but it builds them one word at a time. For example, if the prompt or question is "Hello, how are you", it doesn't generate a whole answer; it only generates the next word that would come after. What would the next word be here? It could be many things, but perhaps the word "doing", so that now you have the sentence "Hello, how are you doing". Now you can come up with more complicated prompts, like "Write a story". The transformer is not going to write the story right away; it's just going to generate the next word. What's the next word after "Write a story"? Perhaps it's "Once", because it's starting the story. Then it restarts: the prompt is now "Write a story. Once", and the next word is perhaps "upon". It puts that back into the prompt and continues: "Write a story. Once upon", and the next word is "a"; then "Write a story. Once upon a" generates the word "time", and so on. If it does this many times, it will actually generate a story. You may have been as surprised as I was: I first thought it would generate the entire response right away, or maybe come up with an idea and start elaborating on it, but no, it really just goes one word at a time. Now, in order to generate one word at a time and still come up with coherent language, it needs a lot of parts.
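Here is a minimal sketch of that "one word at a time" loop. The `next_word` function below is a made-up stand-in for the whole transformer: it just looks up a canned continuation, so the data and names are purely illustrative.

```python
# Toy sketch of autoregressive generation: keep asking for the next word
# and feeding the growing text back in as the new prompt.

continuations = {
    "Write a story .": "Once",
    "Write a story . Once": "upon",
    "Write a story . Once upon": "a",
    "Write a story . Once upon a": "time",
}

def next_word(prompt: str) -> str:
    """Stand-in for the model: return the next word for the prompt."""
    return continuations.get(prompt, "<end>")

def generate(prompt: str, max_words: int = 10) -> str:
    text = prompt
    for _ in range(max_words):
        word = next_word(text)
        if word == "<end>":          # stop when the model signals the end
            break
        text = text + " " + word     # append the word and feed the text back in
    return text

print(generate("Write a story ."))   # -> "Write a story . Once upon a time"
```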
So this is the architecture of a transformer. It looks complicated, with a bunch of things called input embedding, multi-head attention, feed-forward, etc., but in this video I will break all of these blocks down one by one. Actually, the way I like to see it is more like this: you have an input and an output. Let's say the input is "Write a story", and the output should probably be the word "Once". What happens in between? First the input goes through a step called tokenization, then a step called embedding, then a step called positional encoding; this is more of a preprocessing stage. Then it goes through the main engine, which is a series of transformer blocks, and these all contain something called attention, which you saw in the first two videos, and a feed-forward component, which is a neural network. There can be many blocks here. At the end there is a block called a softmax, and out of the softmax comes the answer, which in this case would be the word "Once". So fasten your seatbelts, because in this video I'm going to tell you about all of these parts.

Now, before we get to the architecture of a transformer, let's take a look at this concept of generating sentences one word at a time. Recall that in the transformer architecture the main engine in the middle is a series of transformer blocks. In particular, the feed-forward part is a neural network that tries to find the next word in a sentence, and all the other blocks are there to help this neural network work as well as possible. But this concept of generating the next word in a sentence has been around for a while. Can you think of a time when you have used it? If you have texted anybody, you may have noticed that texting apps show suggestions. You type the sentence "Hello, how are" and the app shows three suggestions: "you", "your", and "things". Let's say you pick the word "your"; that means your sentence is now "Hello, how are your", and after you type "your" (or press the middle button), three more suggestions appear based on the sentence so far.

Now let's think about the following: you have 10 minutes and you are given the task of creating a model that suggests the next word in a sentence. What would you do with only 10 minutes? Here's a simple solution: pick the last word, the word "are", and ask what the most common word after "are" is. If you have a large dataset of sentences, you can just look at all the instances of "are" and pick the most common word that follows; let's say it's "happy", so you suggest the word "happy". Obviously that doesn't work very well, but at least it suggests words that could be sensible. This has a name: it's called a 1-gram. Can you think of how to make it better? Let's look at a 3-gram. A 3-gram is when you take the last three words, "hello how are", and ask what the most common word is that comes after them. You look at all the instances of "hello how are" in your dataset of sentences, and let's say the most common next word is "things", so you suggest the word "things". How can we make this even more powerful? Let's do a 10-gram: you take the last 10 words, for example "hello today I would like to inform you that", find that sequence in your dataset, and let's say it appears only two times, but both times it's followed by the word "the", so you suggest "the". Now, what's the small problem here? The more words we pick, the more educated the guess is, but also the more words we pick, the less likely it is that that exact sequence has ever appeared before. If I take a sentence like "The other day I was walking and five unicorns said", it may never appear in the dataset, because it could be the first time somebody says it. So you can't rely only on looking up word sequences that have appeared before. Still, this method is sometimes used, and it's called the n-gram method, where n is the number of words you're looking at; the last example would be a 10-gram.
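Here is a small sketch of that n-gram idea: count which word most often follows each sequence of n words in a corpus, then suggest it. The tiny corpus below is made up; a real system would count over a huge dataset.

```python
# Minimal n-gram "suggest the next word" model.

from collections import Counter, defaultdict

corpus = (
    "hello how are you doing today . "
    "hello how are your parents . "
    "hello how are you feeling . "
).split()

def build_ngram_model(words, n):
    counts = defaultdict(Counter)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n])      # the last n words
        counts[context][words[i + n]] += 1   # count the word that follows them
    return counts

def suggest(model, context, n):
    context = tuple(context[-n:])
    if context not in model:
        return None                          # never seen this sequence: no suggestion
    return model[context].most_common(1)[0][0]

model = build_ngram_model(corpus, n=3)
print(suggest(model, ["hello", "how", "are"], n=3))  # most common follower: "you"
```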
But it looks like we need something more advanced, so what can we do? Something that has been used for many years to suggest the next word is a neural network. In a neural network, you input the sentence "Hello, how are" and it outputs the word "your". I'm not going to go into full detail on what a neural network is (I will go into some detail), but if you really want to learn what neural networks are and how they work, check out the video on my channel called Neural Networks; the link is in the comments.

However, up until transformers, these things didn't work that well. I don't know if you've ever tried to write a text by repeatedly pressing the middle suggestion. I tried it, and it gave me the sentence "Hello how are you feeling this week and your kind wishes for a wonderful weekend with your kindness and happiness for your family love you and your kindness." That makes absolutely no sense, but if you look at every three or four consecutive words, it does make sense: "hello how are you feeling", "and happiness for your", and so on. That's because this neural network is trained to remember a few words at a time and build short phrases, not long text. To build long text we need a transformer. But first, let's take a little look at how these neural networks work, because they are the key to understanding a transformer model.

Before we get into how neural networks generate the next word in a sentence, let's look at a simpler example called sentiment analysis. Sentiment analysis has been around for a while, and it's one of the quintessential problems in natural language processing. It basically consists of building a model that can tell whether a sentence is happy or sad. For example, the sentence "I am very happy" is a happy sentence, and the sentence "Today was awful" is a sad sentence. The idea is that you can use a labeled dataset: you have a lot of sentences, and for each one you know whether it is happy or sad, and based on that data you can build a model that can tell whether a new sentence is happy or sad. Now the question is: how would you train a sentiment analysis model? I'd love for you to pause the video and give it a thought if you haven't seen this before.

Here is one way to do it: assign a score to each word. Happy words get positive points, sad words get negative points, very happy words like "magnificent" get more points than merely happy words like "happy", and the rest of the words, the neutral ones, get zero points. If I have such scores, then I can easily tell whether a sentence is happy or sad. For example, take the sentence "I am very happy": what's its total score? The words "I", "am", and "very" are neutral, so they contribute zero points, and the word "happy" has three points, so when I add them up I get three points, and since the score is positive, I conclude that the sentence is happy.
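Here is a minimal sketch of that word-score idea. The scores for "happy" and "bad" follow the numbers in the example; the others are made up for illustration, since a real model learns them from data.

```python
# Word-score sentiment: sum the scores of the words; positive total = happy.

word_scores = {"happy": 3.0, "magnificent": 4.5, "bad": -4.0, "awful": -4.5}

def sentiment(sentence: str) -> str:
    # Words not in the dictionary are neutral and contribute 0 points.
    total = sum(word_scores.get(w, 0.0) for w in sentence.lower().split())
    return "happy" if total > 0 else "sad"

print(sentiment("I am very happy"))   # 0 + 0 + 0 + 3.0 = 3.0  -> happy
print(sentiment("Today was awful"))   # -4.5                   -> sad
```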
Now take the sentence "I had a very bad day": the words "I", "had", "a", "very", and "day" are all neutral, so they contribute zero points, and the word "bad" has minus four points because it's a sad word. When I add them up I get minus four, which means the sentence is sad. The idea is that a good sentiment analysis model will have points for every single word, and if you are able to score every word (I haven't told you yet how to find the scores), then you have a sentiment analysis model. It looks like this: I put all the words on the left, each one with a node, and one node on the right, and I connect them all with edges. On the edges I put the scores we found, so "happy" is 3.2, for example, and "sad" is -3.7, and so on for all the other words. Now, when I have a sentence like "I am happy", I simply locate the words "I", "am", and "happy", put ones on them, and multiply those ones by the scores on the edges. Actually, I multiply every node by the score on its edge, but most of them are zero except for the words in the sentence. Then I add them up: I get 3.2 + 0.1 - 0.2, which is 3.1, so I conclude that this sentence is happy.

This is called a perceptron. If you've seen perceptrons before, you may have noticed that I didn't talk about the bias unit. There is also a bias unit: a constant that we add to every sentence, and it accounts for empty sentences; whether an empty sentence counts as happy or sad is something only the bias can tell us. But as I said, this video is not about neural networks per se. If you want to know more about perceptrons, the bias unit, and these scores, and in particular how they work for sentiment analysis, check out the video on my channel called Logistic Regression. That video will also show you how to train such a model, which means finding very good scores for each of the words.

Now, the problem is that this is not going to work all the time, because language is a lot more complex than that. If we could understand language just by adding a number to each word and then knowing whether sentences are happy or sad, or what topic they're about, then language would be very simple. But it's not, so we need something more complicated, and that's when we jump to neural networks.

Let me show you what a neural network is and how it can help us with the sentiment analysis problem. Here's the graphical representation I showed you before of a very simple sentiment analysis model, and notice that it looks a lot like a neuron, like a brain cell, because it has a lot of inputs (each word is a potential input) and one output, which is whether the sentence is happy or sad. This is called a perceptron, and it's the basis of neural networks; the output is one if the sentence is happy and zero if it's sad. Now, just like brains don't have just one neuron but lots and lots of neurons, a neural network is formed by many, many of these neurons connected in a very strategic way.

Let me tell you the mental picture I have of neural networks. When I think of a concept like happiness, it's pretty complex; it can't just be determined by having or not having certain words in a sentence. But maybe there are simpler concepts, and maybe one of them is whether you're talking about family. Let's say, for the sake of argument, that that is a concept which is slightly easier to model with a perceptron like this, and maybe there are some other simple concepts that are also easy to model like this.
Let's say one of them is whether you're talking about free time or not; that's a simple concept that can, let's say, be modeled by a perceptron. Another one is whether you are hungry or not. So we have three perceptrons, and each of them doesn't tell us whether we're happy or sad, but whether we're talking about free time, about family, and about being hungry. Let's just say, again for the sake of argument, that these three determine whether we're happy or sad: if we have free time, if we're happy with our family, and if we're not hungry, then that determines our happiness. We're going to put weights here, and the weights can be positive or negative. For example, we have a two for free time, because the more free time you have, the happier you are; a three for family, because if you have your family, then you're happy; and a minus two for hungry, because if you're hungry, maybe that affects your happiness. So what we have here is a small neural network: a small set of perceptrons, or neurons, attached to each other in a way that they work together to determine whether the sentence is happy or sad.

Obviously this is still too simple. Maybe we need more nodes, like rest, family, friends, food, money, and maybe those don't determine happiness directly but determine other, slightly more complex things, such as emotional stability, health, and purpose, and maybe those three actually determine whether you're happy or sad. Or maybe there are hundreds of these concepts, and they may not be easy for us to label, but the computer knows them. In the first layer you have some simple concepts, in the next one more complicated concepts, and so on, and you could have tens or even hundreds of these layers. As you may imagine, if the neural network is big enough, we may have a much more accurate way to tell whether a sentence is happy or sad. So this is the mental image I have of a neural network.

Now, this can have many applications. Can you help me think of some applications of neural networks? One can be customer support, because you can take a chat with a representative and determine whether it was happy or sad. You can also use it for emails, for example spam detection: you train a neural network to tell you whether an email is spam or not spam. However, notice that these two examples have two outputs: happy or sad, spam or not spam. You could actually have a neural network that works with more than two outputs, for example email classification: you can classify your email into three categories (personal, work, and invitations) or even more. So this opens a new door. Now let's think of an extreme case where there aren't two or three outputs but lots of them. How many? As many as there are words. So I can think of a neural network where the input is a sentence, given by a bunch of words, and the output is one word out of all the words that exist, and that is a neural network that can predict the next word in a sentence.

Before we get to that, let's look at how the architecture would look. Consider email classification: here's a neural network that could tell you whether your email is personal, work-related, or an invitation. Notice that it doesn't have one output, it has three. And I can imagine many more outputs, as many as the words that can be input, so in this neural network the first layer and the last layer have the same size. When you input a sentence like "how are", it will output a word, for example "you".
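Here is a sketch of that architecture: a network whose input layer and output layer both have one unit per word in the vocabulary, so it produces a score for every possible next word. The weights are random (untrained), purely to show the shape of the computation; the vocabulary and sizes are made up.

```python
# A vocabulary-sized input layer, a hidden layer, and a vocabulary-sized output layer.

import numpy as np

vocab = ["hello", "how", "are", "you", "your", "things", "happy", "family"]
word_to_index = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
hidden_size = 4
W1 = rng.normal(size=(len(vocab), hidden_size))   # input words -> hidden layer
W2 = rng.normal(size=(hidden_size, len(vocab)))   # hidden layer -> one score per word

def next_word_scores(sentence: str) -> np.ndarray:
    x = np.zeros(len(vocab))
    for w in sentence.lower().split():
        if w in word_to_index:
            x[word_to_index[w]] = 1.0             # mark the words that appear
    hidden = np.maximum(0, x @ W1)                # simple ReLU hidden layer
    return hidden @ W2                            # one score per vocabulary word

scores = next_word_scores("how are")
print(vocab[int(np.argmax(scores))])              # the word with the highest score
```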
So this is a neural network that can predict the next word in a sentence, and it is very similar to the neural network that appears in the transformer. Of course, the neural network in the transformer is bigger, has more structure, and is more complicated, but at the end of the day it's one big, huge neural network that predicts the next word of a sentence. However, as I said before, a neural network alone doesn't do the job; a transformer needs all these other blocks to help it work so well. So now we're ready to go into all of these blocks.

The first of the blocks in a transformer is tokenization, and it comes right after you input the text. Tokenization pretty much breaks everything into units; for the most part, each unit is a word. For example, the sentence "Write a story." gets broken into a bunch of tokens: one for the word "write", one for "a", one for "story", and one for the period, because punctuation marks also get tokens. Sometimes you will also have tokens for the beginning and the end of a sentence. However, some words, like "doesn't", get broken into two tokens: the token for "does" and the token for "n't". There's a whole science to building good tokens, and I'm not going to elaborate here, but if you want to learn all about tokens, I recommend the video from Jay Alammar called "Did the LLM even see what you typed?", which is also a chapter in his great book Hands-On Large Language Models.

The next step is embeddings. I've talked about embeddings in the previous two videos on attention, but I'll give you a brief summary here. Embeddings are the most important part of a large language model, because they are really where the translation happens from humans, who speak in words, to computers, which speak in numbers. The stronger this bridge is, the better large language models are, and the bridge is exactly an embedding. If you recall from the previous video, we had a quiz asking where you would put the word "apple". Here we have the coordinates of all the words, where the first number is the horizontal coordinate and the second one is the vertical coordinate, and the question is whether to put "apple" in position A, B, or C. The answer was C, because it should go near the fruits, so a good place for "apple" would be the coordinates (5, 5). That's as much as we said about embeddings, and we also mentioned that you don't have to associate each word with just two numbers; you can associate it with thousands of numbers. The point is that similar words get sent to similar numbers, and in some way the embedding is a bit of a description of the word, so every coordinate actually means something. But here we can go a little further: how do you think we actually build embeddings? I encourage you to pause this video and think about it.
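Here is a toy illustration of that picture: each word is a point, similar words get nearby coordinates, and "apple" sits near the fruits at roughly (5, 5) as in the quiz. The other coordinates are made up for illustration.

```python
# Words as points: similar words end up close together, different words far apart.

import math

embedding = {
    "strawberry": (5.2, 4.6),
    "orange":     (4.7, 5.3),
    "apple":      (5.0, 5.0),
    "castle":     (1.0, 8.0),
}

def distance(a, b):
    return math.dist(embedding[a], embedding[b])

print(distance("apple", "strawberry"))  # small: apple is close to the fruits
print(distance("apple", "castle"))      # large: castle lives somewhere else
```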
There are many ways, and I'm going to show you one that actually has to do with neural networks, so bear with me. I like to think of the following thought experiment. Imagine that you want to associate numbers with every single word: how do you do it? Let's say we can take ourselves, or any of our friends, and put them into this experiment. So here's a human, and a human has a brain, and what we're going to do is put two sensors somewhere on the head. Now, I don't know any neuroscience, so I'm going to make up a lot of stuff; as I said, please bear with me, but this is the thought experiment I like to use to imagine embeddings. We place two sensors somewhere on the head of our subject and plug them into a machine, and each sensor measures some activity in that part of the brain. If we could get it to measure the activity of a single brain cell, that would be wonderful, but let's just say it measures something. Now, in front of the subject we put a strawberry, and we measure what each of the sensors records. Let's say the first one records a five and the second one records a four. That's our embedding: for the word "strawberry" we record the numbers five and four. What do you think happens if we swap the strawberry for an apple and measure the same thing? Some numbers will come out, maybe five and five, so we record that the embedding for "apple" is (5, 5). If we put a castle instead, some other numbers will come out, let's say two and one, so we record (2, 1) as the embedding for "castle". As I said, I don't know any neuroscience, but I imagine that a strawberry and an apple, being similar, must fire similar places in the brain, whereas a castle would fire different places. Therefore the numbers associated with strawberry and apple are likely to be similar, and the numbers associated with castle are likely to be different from those two. So we pretty much have an embedding: I like to imagine that somewhere in the brain the words are located, and strawberry and apple are close by while castle is far away. This is how I imagine an embedding.

Of course, we don't have subjects with brains to experiment on and build embeddings like that, but what's the closest thing we have to a brain in machine learning? A neural network, which loosely emulates the brain. So let's try this experiment on a neural network. Let's show it a strawberry. What does it mean to show it a strawberry? Well, we had a neural network that takes a sentence and predicts the next word, so let's just feed the word "strawberry" to it and see what comes out. Maybe the next word comes out as "is", or "delicious", or something else; it doesn't really matter what comes out, so the output (pink) layer doesn't really matter. Let's instead look at the green layer, the penultimate one, and put our sensors there: let's measure what numbers came out of it when making that prediction, and let's say the numbers 1, 3, and 2 came out. That's going to be the embedding for "strawberry": 1, 3, and 2. Now let's do something similar with "apple", and let's say the numbers 1.1, 2.9, and 2.2 come out, which are similar to the numbers for strawberry. Why? Because the reaction the neural network has to "strawberry" is similar to the one it has to "apple": the same words, or very similar words, would come out, so one would imagine that in the previous layer, whatever numbers passed through must be pretty similar. And remember that the neural network has layers, and each layer understands deeper and deeper properties of the word, so the penultimate layer has to understand the word pretty well. Therefore the three numbers coming out of the penultimate layer must be similar for strawberry and apple, and different for castle; maybe for castle they are 7, -5.4, and 0.4. In other words, when we train a neural network to guess the next word, it kind of has to understand words pretty well, and understanding words well means that these layers capture properties of the word, with the penultimate layer capturing some pretty deep properties. Therefore, if we train a neural network well, we get a pretty good embedding for free, by simply looking at the penultimate layer.
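Here is a sketch of that idea: run a word through a next-word network and read off the activations of the penultimate layer as its embedding. The network below is tiny and untrained (random weights), so the numbers only illustrate the mechanics, not real embeddings.

```python
# Reading the penultimate layer of a next-word network as a word embedding.

import numpy as np

vocab = ["strawberry", "apple", "castle", "is", "delicious", "red"]
word_to_index = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(1)
W1 = rng.normal(size=(len(vocab), 3))   # input layer -> penultimate layer (3 nodes)
W2 = rng.normal(size=(3, len(vocab)))   # penultimate layer -> next-word scores

def embedding(word: str) -> np.ndarray:
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0        # one-hot input for the word
    penultimate = np.maximum(0, x @ W1) # the layer we put our "sensors" on
    _scores = penultimate @ W2          # the next-word prediction (ignored here)
    return penultimate                  # this vector is the word's embedding

print(embedding("strawberry"))          # three numbers, like the 1, 3, 2 of the story
print(embedding("castle"))
```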
Now, in this case the layer has three nodes, but it can have hundreds or thousands, so we can create embeddings of thousands of numbers this way. This is a pretty common method for finding embeddings, and it's called word2vec. Why word2vec? Because it sends words to vectors. As I said, there are many ways to create embeddings, but the ones created with neural networks are some of the most powerful.

The next step is positional encoding. Positional encoding is very important because it takes care of word order. Now, to be honest, positional encoding looks a little strange to me, because it doesn't really take care of order in a way that captures the semantics; it just differentiates sentences whose words are in different orders. But at the end of the day it works well, so we use it. Let's say you have the sentence "Write a story.", which has four tokens. That's not the same as saying "story. I write", which makes no sense; and you could imagine reorganizing the words into something that does make sense, where the answer would be different. So the order of the words matters, and we would not want to feed the same thing into the neural network for the sentence on the left and the sentence on the right. What do we do? We just learned about embeddings, so let's look at what the embedding looks like for these two sentences. Say this is where the words are located in the embedding. What we're going to do is perturb each of the words slightly, in such a way that the first word always gets perturbed in one particular way, the second word in a different way, and so on. I'll symbolize that with arrows: the first word gets moved to the right, the second one up, the third one diagonally to the left, and the fourth one a little further to the left. The exact directions don't really matter; the point is that we move the words, and for the second sentence we do the same thing. When the words move, we get different embeddings for the two sentences. As I said, there's no semantics here; we're really just adding different numbers to the first word, the second word, the third word, and so on, in such a way that the embedding now captures order in some way, and the neural network then takes care of learning from it. I understand if this is not as satisfying as you'd want, but it works really well.

And these arrows, what is the rule for them? The reason I'm not being more specific is that they change a lot: I've seen them use sines and cosines, I've seen them use exponential functions, and better and better positional encoding functions keep coming out, so by the time you watch this video there may be a completely different one. All you need to know is that these numbers normally follow a sequence; for example, a sequence of arrows that rotate more and more slowly each time works as a positional encoding. If you want to remember one thing about positional encoding, it's this: you find some kind of sequence that you add to each of the words in the sentence, in order to move them around and record the order.
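Here is a sketch of one common choice, the classic sinusoidal positional encoding: add a sine/cosine pattern that depends only on the word's position, so the same word in position 1 and position 3 ends up perturbed differently. As noted above, many variants exist; this is just one of them, and the embedding values here are made up.

```python
# Sinusoidal positional encoding added to (made-up) word embeddings.

import numpy as np

def positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    positions = np.arange(num_positions)[:, None]           # 0, 1, 2, ...
    dims = np.arange(dim)[None, :]
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions: cosine
    return encoding

word_embeddings = np.ones((4, 8))            # 4 tokens, each with a made-up embedding
perturbed = word_embeddings + positional_encoding(4, 8)
print(perturbed[0])                           # same word, different position ->
print(perturbed[2])                           # a different perturbed embedding
```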
The next step is attention, which is crucial in transformer models. As a matter of fact, it's really the step that made them work so well, because it's the step that truly captures context. Attention can appear many times in the architecture, because it appears in each of the transformer blocks, followed by some feed-forward layers. I know I already talked about attention for two long videos, which I really recommend you watch if you haven't, but I can't help it, I love talking about attention, so I'm going to tell you a little bit more about it here.

To summarize: imagine you have an embedding where all the fruits are over here, strawberry, orange, banana, and cherry, and over there you have a bunch of computer brands and computer words: laptop, smartphone, Android, and Microsoft. The question is: where would you put the word "apple"? It kind of belongs to both, so it's hard to tell, and the best you can do is put it somewhere in the middle. That tells you that no matter how good an embedding is, there are some things it cannot capture, because it's not going to know where to put "apple". So we are forced to use the context, and for the context we have to look at the sentence. If you have the sentence "Please buy an apple and an orange", then "apple" means a fruit, and if you have the sentence "Apple unveiled a new phone", then you're talking about the technology brand. That means the word "apple" gets influenced by the word "orange" in the first sentence: "orange" has a gravitational pull over the word "apple" and pulls it closer to it. And in the second sentence, the word "phone" affects the word "apple" and pulls it the other way. Therefore we're no longer going to use the coordinates of the apple in the middle: for the first sentence we use the coordinates of the apple near the top right, pulled by the orange, and for the second sentence we use the coordinates of the apple pulled toward the word "phone". So I like to imagine attention as a gravitational pull between words that appear in the same sentence, or in the same context.
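Here is a sketch of that "gravitational pull" picture: each word is moved toward the other words in its sentence, weighted by how similar it is to them. This is a simplified attention step — the real mechanism also uses learned key, query, and value matrices and divides by the square root of the dimension — and the 2-D embeddings below are made up.

```python
# Context-dependent embeddings: a similarity-weighted average over the sentence.

import numpy as np

embedding = {
    "apple":  np.array([3.0, 3.0]),   # stuck between fruits and technology
    "orange": np.array([5.0, 5.0]),   # fruit corner
    "phone":  np.array([1.0, 1.0]),   # technology corner
    "buy":    np.array([3.0, 1.0]),
}

def attend(word, sentence):
    """Return the word's context-adjusted embedding within the sentence."""
    target = embedding[word]
    others = np.stack([embedding[w] for w in sentence])
    similarities = others @ target                 # dot product = similarity
    weights = np.exp(similarities / 10.0)          # soften, then normalize (softmax)
    weights = weights / weights.sum()              # these are the "pull" strengths
    return weights @ others                        # weighted average of the sentence

print(attend("apple", ["buy", "apple", "orange"]))  # pulled toward the fruit corner
print(attend("apple", ["apple", "phone"]))          # pulled toward the tech corner
```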
Now, speaking about context, there is an example I really like. Imagine the sentence "The bear ate the honey because it was ___", where we haven't finished the sentence. The question for you is: what does the word "it" refer to? Is it talking about the bear or about the honey? We don't know, because it depends on the next word; it could really be either one. If the blank word is "hungry" — "The bear ate the honey because it was hungry" — then we're definitely talking about the bear. But if the sentence says "The bear ate the honey because it was delicious", then we're definitely talking about the honey. So that last word is the one that makes the difference. Now, how do we explain this using attention and gravitational pull between words? Let's take all the words and put them in an embedding, like we know how to do, and let's forget about the last word for a moment, to see what happens before we reach the end of the sentence. In this embedding, words are similar if they appear in the same context many times, so two words that are likely to be interchangeable are similar. For example, "it" and "bear" are similar, because many sentences that could contain the word "it" could also contain the word "bear", and the same thing happens with "it" and "honey". Therefore these two words exert a strong gravitational pull on the word "it". All the other words also exert gravitational pull, but not as strongly, because they're not as similar, so let's forget about them and only think of the word "it" being pulled towards "bear" and towards "honey", but by equal amounts, because it's similar to both. So "it" kind of stays in between, in a game of tug of war between the two, and if we don't know the end of the sentence, it can't really gravitate towards "bear" or towards "honey". However, once we know the next word, something happens. If the next word is "hungry", well, "hungry" is more similar to "bear" than to "honey", because "hungry" will often be describing a bear. Now we have two words pulling in the bear direction and one pulling in the honey direction, so "it" will gravitate towards "bear" and "hungry", and in particular end up closer to "bear", so we know that "it" describes the bear. On the other hand, if the word at the end of the sentence is "delicious", then "delicious" is much closer to "honey" than to "bear", because "delicious" describes the honey, and "delicious" is going to pull the word "it". Now we have two words pulling towards "honey" and one towards "bear", so "it" moves in that direction, and after many iterations it ends up very close to the word "honey", so we know that "it" describes the honey. And this is how attention works.

Now I'm going to skip ahead a little bit and tell you about the softmax. I haven't told you the whole story about the series of transformer blocks, but for now assume that it's one big, huge neural network that predicts the next word. We saw that before, right? We have this big neural network that predicts the next word in a sentence: if I input "how are", it will output "you". Now, we don't exactly want that. We don't want a deterministic answer; we don't want it to respond "you" every single time I say "how are", because in reality you wouldn't want the exact same answer every time you ask a question. You want something probabilistic, something stochastic. So we're going to add a mini step that turns this into a probability. Instead of outputting a bunch of zeros and a single one for the next word, we make it output scores, and the scores can be anything: high numbers for words that are likely to appear, and low numbers, even negative numbers, for words that are very unlikely to appear. Now we would like to turn these scores into probabilities, because at the end of the day we want to draw a word out of a bag according to probabilities, where words with high scores have high probabilities and words with low scores have low probabilities. We still want the model to be accurate; we just want it to be a little different every time. What property do probabilities have? All these numbers must add to one. So how do we make them add to one? One way is to divide them all by their sum: the sum of these numbers is 10, so if we divide everything by 10 we get 1/10, 0/10, and so on, and those would be the probabilities for each word, and then we just draw a word based on them. The fourth word, for example, would have probability 4 out of 10, and therefore be more likely to appear than, say, the third word, which has probability 0 out of 10. The problem is that we have some negative scores, and we can't have negative probabilities: -1/10 is not a probability. So we need these numbers to be positive. What do we do? One very popular way is to raise e to the score, because e to the power of anything is a positive number. We could actually use 2 or 3 or any base we want, but e works pretty well, and this is the softmax function, if you've heard of it. So 1 becomes e^1, 0 becomes e^0, 4 becomes e^4, and -1 becomes e^-1, which is fine because e^-1 is a small number but it's positive. Now we have a bunch of positive numbers, and we divide them by their sum, and when we do that we get the following probabilities. Notice that the highest one is the one corresponding to the score of 4, which here is 0.54, and the lowest one corresponds to -1, which is not exactly zero but so close to zero that I rounded it to zero. The higher the score, the higher the probability, and the lower the score, the lower the probability, so we are being faithful to our scores. And if we draw a word from this probability distribution, most likely we're going to get the word "you", but sometimes we can also get the word "they" or the word "things". So this is a way to select words from the probability distribution formed by the neural network, in a way that you always get good words for the input sentence, but you don't always get the same ones.
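Here is a sketch of that softmax step, using the four scores from the example (1, 0, 4, -1); the candidate word labels are made up, and the full example in the video has more words, so the exact probabilities differ.

```python
# Softmax: exponentiate the scores, divide by the sum, then sample the next word.

import numpy as np

words  = ["they", "things", "you", "cat"]       # hypothetical candidate words
scores = np.array([1.0, 0.0, 4.0, -1.0])

probabilities = np.exp(scores) / np.exp(scores).sum()   # the softmax function
print(dict(zip(words, probabilities.round(3))))          # all positive, sum to 1

rng = np.random.default_rng()
next_word = rng.choice(words, p=probabilities)           # usually the top word,
print(next_word)                                          # but not always
```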
Okay, now that I've told you about the softmax, let's go back a little bit and look at the whole architecture. We now have all the parts to build a transformer, and the architecture is this. There is one part that is still not super clear, which is the series of transformer blocks formed by attention and feed-forward components: there can be multiple of them, 12 or more, and each block has a similar architecture. To see this clearly, let's walk through the whole thing. We start with the input text, let's say "Write a story". It goes through tokenization, which means everything becomes a bunch of tokens. Then it goes through an embedding, which means these tokens get placed in some big space with lots of coordinates. Then we go through positional encoding, which means each of these words gets perturbed in a different way, so that position gets recorded and two sentences with the same words in different orders get sent to different embeddings. Next, these numbers, these coordinates, are fed to a neural network, and that neural network is trained to find the next word in the sentence. However, that neural network on its own is not very good, because it still doesn't capture context. The neural network is made of a bunch of feed-forward layers, and what we're going to do is throw in an attention layer in between each of them. The feed-forward part doesn't need to be a single layer; it can be multiple layers at a time, but at the end of the day, in between the feed-forward layers we do one attention step, one gravity step. If you watch the second video on attention, there's actually something called multi-head attention; that's what happens between two feed-forward blocks. This part may sound a little abstract, but just imagine a big neural network with a bunch of attention layers thrown in between, so that as you train the neural network to guess the next word, you keep adding context along the way. This is going to give us, not the next word directly, but a score for every single word in the language — actually for every single token — and the softmax turns those scores into a probability for each word, or each token. For example, one token could be "end of sentence", the one that dictates when the transformer needs to stop talking. From here, the highest-probability word is the most likely one to appear, and let's say that when we draw from this distribution, the word "Once" comes out. So that's the output of the transformer.
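Here is a very small, untrained sketch that ties the whole pipeline together: tokenize, embed, add positional encoding, run a few blocks of attention plus feed-forward, score every token, softmax, and sample the next word. Everything is random and heavily simplified (one head, no layer normalization, no masking, no residual connections), so it only illustrates the flow of data, not a real transformer; all the names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["write", "a", "story", ".", "once", "upon", "time"]
word_to_index = {w: i for i, w in enumerate(vocab)}
dim, num_blocks = 8, 2

embed_matrix = rng.normal(size=(len(vocab), dim))        # embedding table
unembed = rng.normal(size=(dim, len(vocab)))             # final scores per token
ffn_weights = [rng.normal(size=(dim, dim)) for _ in range(num_blocks)]

def positional_encoding(n, d):
    pos = np.arange(n)[:, None] / np.power(10000, np.arange(d)[None, :] / d)
    return np.where(np.arange(d) % 2 == 0, np.sin(pos), np.cos(pos))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def next_word(prompt: str) -> str:
    tokens = prompt.lower().split()                                 # 1. tokenization
    x = np.stack([embed_matrix[word_to_index[t]] for t in tokens])  # 2. embedding
    x = x + positional_encoding(len(tokens), dim)                   # 3. positions
    for w in ffn_weights:                                           # 4. blocks
        attention = softmax(x @ x.T / np.sqrt(dim)) @ x             #    attention step
        x = np.maximum(0, attention @ w)                            #    feed-forward step
    scores = x[-1] @ unembed                                        # 5. token scores
    probs = softmax(scores)                                         # 6. softmax
    return rng.choice(vocab, p=probs)                               # 7. sample

print(next_word("write a story ."))                                 # some word from the vocab
```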
All this process is there to guess the next word, but because we've trained it this way, that word is pretty good: first of all, position is taken into account; second, context is taken into account; and the neural network that guesses the next word is pretty strong, because we're talking about a huge architecture with lots of nodes and lots of data. That's really what makes transformers so good. So as you can see, the architecture is not really that complicated: it's really just a big neural network with a bunch of padding and a bunch of boosters around it that make it work really well. And that's it, that's the transformer architecture. So, are we done yet? Well, let me show you some problems that transformers have, and also some solutions.

As I mentioned before, transformer models don't always work as intended if you simply train them to talk using the entire internet as a dataset, and to fix that we have a process called fine-tuning. Why wouldn't a model work if it's trained on the internet, which is a huge dataset? Maybe because the internet is not exactly the best source for many things; as a matter of fact, the internet is not a question-and-answer repository. For example, if I train the model to guess the next word and I ask it the question "What is the capital of Nigeria?", the answer is Abuja, but that may not be the most common word right after that sentence. The internet could contain quizzes where, after "What is the capital of Nigeria?", the next thing is another question, "What is the capital of Chad?" or "What is the capital of Lebanon?", in which case the next word is "what". It could contain stories with the sentence "What is the capital of Nigeria? she asked", in which case the next word is "she". It could be a chat where somebody replies "That's a good question", in which case the next word is "that". Or there could be a more elaborate answer that doesn't start with the answer itself but with "Since 1991...", in which case the next word is "since". So for the model it's not very clear what the next word should be, and remember that the transformer model is not trying to answer questions; it's trying to find the next word. How do we get the model to understand that it's not just the next word we want, but the answer? We have to train it on question-answer datasets. We can use experts to curate datasets that contain particular questions and their answers, so that the model starts learning that when it gets a question, it has to reply with the answer. That's called fine-tuning a model, and it's a very, very expensive part of training, because you actually need people to create custom datasets, and you have to be very careful and curate these datasets very well.

Now, this is not only for questions and answers. If you want the model to actually chat — you say "hello, how are you", it says "good, and you?", and then you continue talking and the model keeps talking with you — then you need to fine-tune the model by post-training it with tons and tons of chats. That way the model learns that if you're talking to it, it has to reply like a person chatting back at you, and if you continue, it should remember the previous steps of the conversation.
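Here is a sketch of what a tiny slice of such fine-tuning data might look like: curated prompt/response pairs (questions with their answers, turns of a chat). The examples and field names are made up; real datasets contain many thousands of carefully curated pairs.

```python
# Illustrative fine-tuning examples: the model is trained to continue each
# prompt with its curated response, one token at a time.

fine_tuning_examples = [
    {"prompt": "What is the capital of Nigeria?",
     "response": "The capital of Nigeria is Abuja."},
    {"prompt": "Hello, how are you?",
     "response": "I'm doing well, thank you! How can I help you today?"},
]

for example in fine_tuning_examples:
    print(example["prompt"], "->", example["response"])
```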
Another thing you should fine-tune models for is commands: if you say "do the following thing", the model must do it. For example, you can tell it to write a poem about elephants, to correct your code, to write code that does a particular thing, to write a particular essay about something, or to simply give you lists of things, and so on. So you need to build big datasets with commands and then the command being executed, in order for the model to learn that when it's given a command, it has to follow it. And as you can imagine, there are a lot of other use cases you would need to fine-tune your model for, so I'd love for you to think of other cases where you would have to fine-tune a model for it to work well, and feel free to put them in the comments; it would be lovely to have a discussion about it.

All right, that's all about transformers. Thank you very much for your attention in this video, which was a little long, but we went through the whole thing. Just like the previous two videos, this would not have happened without my friends Joao, Jay, and Omar, who actually explained transformers to me in great detail. In particular, there's a podcast I have with Omar that I've linked in the comments; it's in Spanish, but if you speak Spanish, I think you will like it. And if you like this content, we have a lot more at LLM University by Cohere, which I built with my two colleagues Meor and, again, Jay. This is a comprehensive course where we cover a lot about large language models, including an introduction and a very conceptual explanation, followed by a bunch of labs on many, many topics, including prompt engineering, search, deployment, and much more. And in particular, if you want more detail on transformers, I recommend The Illustrated Transformer, again by Jay, which is a really, really good video and blog post; this is where I learned what transformers are.

So thank you very much. Here is some information: if you liked this video, please subscribe to my channel, Serrano.Academy, hit like, add a comment, and share it with your friends. I love reading the comments you leave, in particular when you suggest good topics for future videos; many times I've made a video about one of them because they give me good ideas. You can also tweet at me at Serrano.Academy, or check out my page serrano.academy, where I have all these videos and also a bunch of blog posts, code, and other things. I also have the book Grokking Machine Learning, which I really enjoyed writing; it's basically about supervised learning, in detail, explained the way I like, with a lot of code examples. If you'd like to get it, there is a discount code for 40% off, and the information on how to get it is down in the comments. So thank you very much, and see you in the next video.
Info
Channel: Serrano.Academy
Views: 33,605
Id: qaWMOYf4ri8
Length: 44min 25sec (2665 seconds)
Published: Thu Nov 02 2023