The Attention Mechanism in Large Language Models

Video Statistics and Information

Captions
Hello, my name is Luis Serrano and this is Serrano Academy. This video is about attention mechanisms. Attention mechanisms are absolutely fascinating: they are what helped large language models take that extra step toward understanding and generating text. So if you've seen transformer models lately and have been amazed by the kind of text they can generate, that is in great part thanks to attention mechanisms. What attention does is help the large language model understand the whole context of the text, as opposed to just a few words at a time. At the end of the day, all attention does is a few mathematical operations that can be seen geometrically, and I like to see them as words gravitating towards each other in space, like planets. In this video I'm very excited to show you that visualization and how I understand attention mechanisms.

Attention was introduced in the paper called "Attention Is All You Need". That paper also introduced transformers, which were a huge step for large language models. This is the first of a series of three videos: this one covers attention mechanisms in a high-level, pictorial way; the next one covers the attention mechanism with math, where we'll work through a numerical example and go through all the formulas; and the third one puts it all together into a video on transformer models, where you'll learn the entire architecture of a transformer. First of all, I'm very happy to announce that with Cohere we have launched a course on large language models called LLM University. I'll tell you more at the end of this video, but check out this link: llm.university.

So let's get to attention, and first, let's start with embeddings. Embeddings are really important in large language models; I would say they're the most important part, and the reason is that they're the bridge between humans and computers. Humans are very good with words and computers are very good with numbers, so there needs to be a good bridge between them in order to communicate. That bridge is an embedding. An embedding is where the rubber meets the road, because it's where words become numbers, and the better the embedding, the better the model, because if you have really good numbers for each word, or for each piece of text, then the problems become much easier.

Let's see an example of an embedding, starting with a quiz. I have a bunch of words here in the plane, so each word has a horizontal and a vertical coordinate. For example, banana is six to the right and five to the top, so its coordinates are (6, 5). That means there's an embedding that sends words to pairs of numbers, the pairs of numbers given by the coordinates. Now the quiz is the following: where would you put the word "apple"? I'm going to give you three options, A, B, or C. Think about it, feel free to pause the video, and tell me where you would put the word "apple". Here's my answer: I would put it at C, and the reason is that you can see the fruits, the sports, the household items, and the vehicles each grouped in their own region of the plane, and since the fruits are at the top right, I would put the apple there. A spot like (5, 5) is pretty good for this apple; obviously there are other places where I could put it, but anywhere around the fruits is a good place.
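As a concrete sketch of this kind of two-dimensional embedding, here is a toy Python example. The words, the coordinates, and the nearest-word check are made up for illustration; real embeddings are learned, not hand-written.

```python
import math

# A toy 2-D embedding: each word is sent to a pair of coordinates.
# The numbers are made up for illustration.
toy_embedding = {
    "banana":     (6.0, 5.0),
    "strawberry": (5.0, 6.5),
    "orange":     (6.5, 6.0),
    "soccer":     (1.0, 1.0),
    "car":        (7.0, 1.0),
}

def closest_words(point, embedding, k=3):
    """Return the k words whose coordinates are closest to the given point."""
    def dist(coords):
        return math.dist(point, coords)
    return sorted(embedding, key=lambda w: dist(embedding[w]))[:k]

# Placing "apple" near (5, 5) puts it next to the other fruits.
print(closest_words((5.0, 5.0), toy_embedding))
# -> ['banana', 'strawberry', 'orange']
```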
That's the main thing about embeddings: similar words get sent to similar numbers. In reality we're not going to have just two numbers per word, we're going to have lots of them; some embeddings have up to 4,096 numbers per word. Some embeddings are even more general: they send entire sentences, or long pieces of text, to lists of numbers. These lists are called vectors, because a vector is a list of many numbers. You can actually go further and see that each coordinate captures some property of the word. Some of these are things we know, like size or color; some are more hidden properties, or combinations of other features, that the computer has noticed. In short, embeddings are really important, and the more work that is done on embeddings, the better the language models can become.

Now, embeddings have some difficulties. Let's look at quiz number two. We have, at the top right, a strawberry and an orange, and at the bottom left, a phone and the Windows logo. Now you're going to tell me whether each new word goes at the top right or at the bottom left. Where would you put a cherry? A cherry is a fruit, so it probably goes near the strawberry and the orange. Where would you put Android? It's a technology brand, so it goes close to the phone and the Windows logo. Where would you put a laptop? It's technology, so it goes at the bottom left. Where would you put a banana? At the top right, because it's a fruit. And where would you put an apple? Ha, got you! Apple could be the fruit, but it could also be the brand: the fruit would go at the top right and the brand at the bottom left. So the embedding, no matter how good it is, would not know where to put the word "apple"; let's say it puts it somewhere in the middle. That's a problem, because words can have different meanings, and the embedding only knows the word, not the meaning.

So how do we solve this problem? With attention, and here is where we get to the main topic of today. First I'm going to tell you about self-attention, and then I'll tell you about something called multi-head attention. Basically, what attention does is use the context of the sentence to help the embedding resolve this kind of ambiguity. In the paper I mentioned before, "Attention Is All You Need", attention is explained using some formulas, with a trio of matrices called the query, key, and value matrices, and then some diagrams. I don't know about you, but when I first looked at that, I was a little confused. Luckily, I have friends who explain things to me, so I went to my friends Jay and Joao, and after many hours of asking them lots and lots of questions, something made sense, and that is what I'm going to explain to you in this video. It has nothing to do with the key, query, and value matrices, or rather it does, but indirectly, because I don't like to see matrices as arrays of numbers. To me, thinking of a matrix as an array of numbers is like seeing a book as an array of letters: books have a lot more structure than an array of letters, and matrices have a lot more structure than an array of numbers. In particular, I love to see matrices as linear transformations of the plane, or of a higher-dimensional space, and that's how I managed to wrap my head around the attention mechanism.
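To illustrate the ambiguity problem, here is a small sketch with made-up two-dimensional vectors: a single static embedding has to place "apple" at one point, and that point ends up roughly the same distance from the fruit cluster and from the technology cluster, so the embedding alone cannot tell you which meaning is intended.

```python
import numpy as np

# Made-up 2-D vectors for illustration (real embeddings have thousands of dimensions).
fruit_words = {"strawberry": np.array([5.0, 6.0]), "orange": np.array([6.0, 6.5])}
tech_words  = {"phone": np.array([1.0, 2.0]), "windows": np.array([1.5, 1.0])}

# A static embedding has to pick ONE point for "apple", even though the word
# is ambiguous, so it ends up somewhere in the middle.
apple = np.array([3.5, 4.0])

def distance_to_cluster(vec, cluster):
    """Distance from a word vector to the average (centroid) of a cluster of words."""
    centroid = np.mean(list(cluster.values()), axis=0)
    return np.linalg.norm(vec - centroid)

print(distance_to_cluster(apple, fruit_words))  # roughly the same distance to both clusters,
print(distance_to_cluster(apple, tech_words))   # so the embedding alone can't disambiguate
```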
So what does attention do? Remember that the word "apple" can confuse an embedding, because the embedding doesn't know if we're talking about Apple the brand or apple the fruit. What do we need? We need context: we need to use the other words in the sentence to tell us what we're talking about. For example, if I say "please buy an apple and an orange", you know we're talking about a fruit, but if I say "Apple unveiled a new phone", you know we're talking about the brand. And what is the clue that tells us it's a fruit in the first sentence and the brand in the second? Normally a couple of words help us out. In this case it's the word "orange" that helps, because "please buy" doesn't mean you're talking about a fruit; you could be talking about buying an Apple product. It's the word "orange" that really helps, and in the second sentence it's the word "phone" that tells us we're talking about the brand. So we're going to use these words to help us out, and what they're going to do is pull the word in the embedding, quite literally; that's exactly what they do.

Let's go back to the embedding. Say here is the word "orange" and here is the word "phone", and the embedding, remember, couldn't figure out what to do with "apple", so it put it in the middle; it doesn't know if we're talking about the fruit or the brand. Now the word "orange" tells us that the apple in the first sentence is a fruit, so that apple is going to move towards the orange. We're not going to use the original coordinates; we're going to use coordinates that we've modified by moving the apple towards the orange. Quite literally, the orange pulls the apple. And for the second sentence, "Apple unveiled a new phone", the word "apple" is going to be moved towards the word "phone", so it ends up somewhere over there.

Let me throw in some numbers for explanation; they don't match the attention formulas exactly, but they give the idea. Say the orange is at coordinates (11, 11) and the phone is at coordinates (1, 2); those are the numbers we use when we pass those words to the model. However, "apple" is at coordinates (6, 7), and the numbers (6, 7) are not so useful, because they don't say whether we're talking about the apple as a brand or as a fruit. So we modify these numbers: in the first sentence we're not going to use (6, 7), we're going to use something like (8, 8.5), and for the second sentence we're going to use, let's say, (4, 5). If we use these new numbers, this new vector, as the embedding of the word in each sentence, then the model is going to have a much easier time. And, like everything in machine learning, this process is done many, many times. If you do it many times throughout your model, you end up with coordinates for the first apple that are very close to the orange, so it's definitely a fruit, and for the second sentence you end up with coordinates for "apple" that are very close to the technology sector of the embedding, so we know we're dealing with technology in that case. So imagine this process being done many times, with the actual position of the word being moved drastically towards the region of the context you're talking about.
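To see the "pulling" concretely, here is a tiny sketch using the coordinates above. The pull weight of 0.4 is a made-up number chosen so the results roughly match the (8, 8.5) and (4, 5) in the example; the real attention step computes these weights from the words themselves.

```python
import numpy as np

# Illustrative coordinates from the example above.
orange = np.array([11.0, 11.0])
phone  = np.array([1.0, 2.0])
apple  = np.array([6.0, 7.0])   # ambiguous: fruit or brand?

def pull_towards(word_vec, context_vec, weight):
    """Move word_vec a fraction of the way towards context_vec.
    weight = 0 leaves it unchanged, weight = 1 lands exactly on the context word."""
    return (1 - weight) * word_vec + weight * context_vec

# In "please buy an apple and an orange", the orange pulls the apple up and to the right.
print(pull_towards(apple, orange, 0.4))   # [8.  8.6], close to the (8, 8.5) in the example

# In "Apple unveiled a new phone", the phone pulls the apple the other way.
print(pull_towards(apple, phone, 0.4))    # [4.  5.], the (4, 5) in the example
```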
Now here's the question: what about the other words? As humans we know that "orange" is the word that helps here, but the computer can't just look around and say "oh yeah, I think orange is the reason this apple is a fruit". The computer has to look at all the words, because it's just running an algorithm, so it has to use all of them to modify the word "apple". How do we do that in a way that doesn't distort the meaning of "apple"? I like to see this as gravity. The words "apple" and "orange" are close, so the orange pulls the apple strongly, while the other words, like "please", "and", or "buy", are probably pretty far from "apple", so they also pull it, but not very much. It's just like gravity: all these words gravitate towards each other the way planets gravitate. For example, the Earth and the Moon pull each other with a strong gravitational pull because they're close by, whereas the other planets don't pull the Earth so much. (Imagine that everything has the same mass, to keep things simple.) If we apply gravitation here, the Earth and the Moon move a little closer to each other, the other planets move a bit too, and yes, Jupiter still pulls the Earth, but very little, not enough to change anything important. In the same way, the word "and" doesn't really affect the meaning of "apple" very much because it's very far away, whereas the word "orange" really does, and the far-away words are not so distracting that they change the meaning of the sentence.

I also like to see this as galaxies. Say I've been talking about fruits for a long time, because at the end of the day you don't just use one sentence, you use the entire context. If my context has a lot of fruits and all of a sudden I say the word "apple", there's a whole galaxy of fruit words at the top right, and it pulls this apple quite strongly, so in the end my coordinates for "apple" land really close to the coordinates of all the other fruits, because that's what I've been talking about. In general, if you've been talking about a particular topic for a while, there's a strong, heavy galaxy of words from that topic, and whatever you say next gets pulled towards that galaxy. That's how the model keeps track of the context, and that's pretty much what the attention step does; I just like to see it as words pulling on each other like gravity.
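Here is a minimal sketch of one simplified self-attention step in the spirit of this gravity picture: every word's vector is replaced by a similarity-weighted average of all the word vectors, so nearby words pull hard and distant words pull only a little. The vectors, the scaling factor, and the omission of the query, key, and value matrices are all simplifications of mine; the exact formulas come in the next video.

```python
import numpy as np

def self_attention_step(vectors):
    """One simplified attention step: every word is pulled towards every other word,
    with larger weights for words whose vectors are more similar (the 'gravity').
    No query/key/value matrices here."""
    scores = vectors @ vectors.T / 50.0          # pairwise similarities, scaled down so the
                                                 # pull is gradual (the paper divides by sqrt(d) instead)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
    return weights @ vectors                     # each word becomes a weighted average

# Made-up 2-D vectors for part of "please buy an apple and an orange".
words = ["buy", "apple", "orange"]
vectors = np.array([
    [1.0, 1.0],    # buy    (far from the fruits, weak pull)
    [6.0, 7.0],    # apple  (ambiguous)
    [11.0, 11.0],  # orange (close to apple, strong pull)
])

new_vectors = self_attention_step(vectors)
print(dict(zip(words, new_vectors.round(2))))   # "apple" moves strongly towards "orange"
```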
However, we can do more, and I'm going to show you that next. What I showed you so far is attention, and it works really well, but multi-head attention works even better. Let me tell you what it's about. First I have a question, a slightly loaded question given the way I ask it: is one embedding good enough? Do you think the one embedding we use for attention is enough to do really well? It does okay, but actually, no, we can do much better. Ideally we'd have lots of embeddings: why not use 10, or 20, or 100 embeddings and combine their results? That's what multi-head attention is.

Let's say, for example, that you have one embedding over here, another one over here, and another one over here, and they all contain the word "orange", the word "phone", and that ambiguous word "apple" that the embedding hasn't figured out yet. I have a quiz for you: which one of these three do you think is the best one? Use any criteria you'd like: which one would you rather use if you were going to apply the attention step? Feel free to pause the video and think about it. The answer is the first one, and the reason is that it separates the points really well: it takes that ambiguous apple in the middle and pulls one copy towards the orange and the other towards the phone, and since the orange and the phone are really far apart, it manages to resolve the ambiguity and send the apple in the first sentence and the apple in the second sentence to completely different coordinates. Now, of the other two embeddings, which do you think is the worst? Again, feel free to pause. It's the second one, because when you apply the attention step it still can't separate the two apples very well: the phone and the orange are not in a good position, they're too close to each other, so it's a bad embedding. The third one is so-so, because it only somewhat manages to separate the two apples. In general there are going to be very good embeddings and not-so-good embeddings, and for particular words some embeddings work better than others; for some topics you could have an amazing embedding that's not that great for other topics, and so on.
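The video scores embeddings later without spelling out a formula (those scores end up being learned, which is where the query and key matrices come in). Purely to make the "separates the two apples" criterion concrete, here is one hand-rolled illustration; the function and the criterion are my own assumptions, not something defined in the video.

```python
import numpy as np

def separation_score(apple_as_fruit, apple_as_brand):
    """Illustrative criterion only: an embedding is 'better' for the word 'apple'
    if, after the attention pull, the two senses land far apart."""
    return np.linalg.norm(apple_as_fruit - apple_as_brand)

# After the attention step in a good embedding, the two apples end up far apart...
print(separation_score(np.array([8.0, 8.5]), np.array([4.0, 5.0])))   # larger value
# ...while in a bad embedding they stay close together.
print(separation_score(np.array([5.5, 6.0]), np.array([5.0, 5.5])))   # smaller value
```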
So ideally we want to have lots of embeddings and be able to combine them. Now, there's a small problem: building embeddings is a lot of work; we can't just buy them at the corner store. Research groups and companies build these really large, amazing embeddings, but they may have different sizes, meaning different dimensions, so every word maps to a vector of a different length, and that causes confusion. In practice we can't just take a bunch of big embeddings and combine them; it's not easy. But there's something easy we can do: take one embedding, modify it a lot, and build other embeddings that way. Let me show you how, using linear transformations.

Linear transformations are one of my favorite topics, and they're actually the way I like to see matrices. As I said before, I don't like to see matrices as arrays of numbers; I like to see them as transformations of the plane. How do they work? Let's say I have this image of a dog, and I'm going to apply several linear transformations to it. One is a rotation: I can rotate it by some angle, and it doesn't matter how many degrees, anything between 0 and 360. Another thing I can do is stretch it horizontally, or stretch it vertically, or actually stretch it in any direction: pick your favorite direction in the plane and stretch along it. As you can imagine, these work in more dimensions as well; I'm doing it in two dimensions here because the screen is flat, but you could take something in three dimensions, or many more, and rotate it and stretch it as you like. There's another one called a shear, which is the following: imagine you have a book, keep the bottom attached to the table, and push the top sideways so it leans over; that's a shear transformation. As I mentioned, these are all represented by different matrices, and you can combine them in any way you want and end up with pretty much anything. The way I like to imagine it is to look at the two axes, the blue axis and the yellow axis, and send the blue axis to any other vector (it could be longer, shorter, or rotated), then do the same thing with the yellow one. That defines where the unit square goes, and the rest of the plane just follows the transformation: on the left you have a plane full of little squares, and on the right you have a plane full of these little parallelograms. Any linear transformation is a transformation of the whole plane, or of the whole space.

Why do we need these linear transformations? Because we're going to get new embeddings from existing ones by applying them. Let's say this is our original embedding; I apply some random linear transformation and get this embedding, and then some other random transformation and get this one. Now, here's a question for you: which one of these three is the best, and which is the worst? Just like before, the best one is the one on the right, because it separates the two apples really well; the worst is the one in the middle, because it doesn't really separate them; and the one on the left is so-so. So if you were to score them, giving high scores to good embeddings and low scores to bad ones, let's say we give the first one a score of 1, the second one a score of 0.1, and the third one a score of 4. These are just numbers I made up, but the idea is that good embeddings get high scores and bad embeddings get low scores. I haven't told you how to score them; the idea is that as you train the neural network, you train these embeddings and you train their scores, and that's where the key, query, and value matrices come in.

So now we have the big picture of multi-head attention. Here is our text, and here is our embedding. We're going to create a lot more embeddings: the K and Q matrices are going to help us build a bunch of transformations, and the V matrix is going to help us visualize them. This will be clearer in the video where I show you all the math, but for now, basically think of this as a weighted sum: I take each one of the embeddings, the good ones are scored with a high score and the bad ones with a low score, and we weight them by those scores, so we really take into account the good ones and not so much the bad ones, and then we add them all together. If we add them, we get a really good embedding, because it contains a lot of the good ones and not so much of the bad ones, so it has great context and keeps track of the context very well. And that's the big picture of multi-head attention.
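Putting the pieces together, here is a compact sketch of the whole multi-head idea as described in this video: build several embeddings from one base embedding via linear transformations, run the simplified attention step in each, and combine the results with scores. The transformation matrices, the scores, and the weighted combination are stand-ins for what the trained query, key, and value matrices would learn; this is not the formula from the paper.

```python
import numpy as np

def attention_step(vectors):
    """Simplified self-attention from the earlier sketch: similarity-weighted averaging."""
    scores = vectors @ vectors.T / 50.0
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ vectors

# Base 2-D embedding for a few words (made-up numbers).
base = np.array([
    [1.0, 1.0],    # buy
    [6.0, 7.0],    # apple
    [11.0, 11.0],  # orange
])

# "Heads": different linear transformations of the same base embedding.
# In a real transformer these transformations are learned, not hand-picked.
heads = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),    # identity
    np.array([[0.5, 0.0], [0.0, 2.0]]),    # stretch
    np.array([[0.0, -1.0], [1.0, 0.0]]),   # rotation by 90 degrees
]

# Made-up scores standing in for how useful each transformed embedding is.
scores = np.array([1.0, 0.1, 4.0])
weights = scores / scores.sum()

# Run attention in each transformed embedding, then combine with the scores.
outputs = [attention_step(base @ M.T) for M in heads]
combined = sum(w * out for w, out in zip(weights, outputs))
print(combined.round(2))
```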
So that's all, folks, and thank you very much for your attention. As I mentioned before, this is the first of a series of three videos. This was the high-level, pictorial look at attention; in the next one we're going to see attention with a lot more math, so we'll look at all the formulas and all the matrices, and we'll also go through a numerical example explaining similarity and all the linear transformations that are happening. The third video will be about the entire architecture of transformer models.

If you're enjoying this, please check out LLM University, a course we've created at Cohere with two great creators, Meor Amer and Jay Alammar. It's an entire course on LLMs with code labs, videos, and text, so I highly recommend it. In particular, one of my colleagues at LLM University, Jay Alammar, has a great video and blog post called "The Illustrated Transformer"; you should check out all his videos.

And here are my coordinates. This is my YouTube channel, so please subscribe if you enjoyed this video, leave comments or likes, or share it with your friends. My Twitter is @SerranoAcademy, so feel free to tweet at me, and my page is serrano.academy, where you can find this video and others, including blog posts, courses, etc. My book is called Grokking Machine Learning; if you'd like to buy it, the discount code is Serrano YT, and you can find all these links below in the description. Thank you very much for your attention, and see you in the next video!
Info
Channel: Serrano.Academy
Views: 41,117
Id: OxCpWwDCDFQ
Length: 21min 1sec (1261 seconds)
Published: Tue Jul 25 2023