Attention is all you need explained

Captions
Hi there, this is Richard Walker from Lucidate. Welcome to this fourth video on Transformers and GPT-3. The Transformer architecture was introduced in a 2017 paper by Google researchers, "Attention Is All You Need". The key innovation of the Transformer was the introduction of self-attention: a mechanism that allows the model to selectively choose which parts of the input to pay attention to, rather than using the entire input equally. In this video we will talk about why we need attention and what design choices have been made by Transformers to solve the attention problem. In the next video we'll delve more into the implementation of attention and see how Transformers accomplish this.

Before the Transformer, the standard architecture for NLP tasks was the recurrent neural network, or RNN. RNNs have the ability to maintain internal state, allowing them to remember information from previous inputs. But RNNs do have a few drawbacks. As discussed in the prior video, they're difficult to parallelize. Furthermore, they tend to suffer from the vanishing and exploding gradients problem, which makes it difficult to train models with very long input sequences. Please see the video on position and positional embeddings if you're unfamiliar with RNNs or either of these drawbacks. The Transformer addresses these limitations by using something called self-attention in place of recurrence. Self-attention allows the model to weigh the importance of different parts of the input without having to maintain an internal state. This makes the model much easier to parallelize and eliminates the vanishing and exploding gradient problem.

Look at these two sentences. They differ by just one word and have very similar meanings. To whom does the pronoun "she" belong in each sentence? In the first sentence we would say it's Alice; in the second sentence we would say that it's Barbara. Pause the video if you'd like to articulate why you know that this is the case. Well, in the first sentence we have the word "younger", which makes "she" attend to Alice. In the second sentence we have the word "older", which causes the "she" in this sentence to attend to Barbara. This attention itself is brought about by the phrase "more experienced" being attended to by the phrase "even though".

Now consider these two sentences, with very similar wording but very different meanings. This time, focus on the word "it". We effortlessly associate the "it" in the first sentence with the noun "swap", while in the second sentence we associate it with "AI". The first sentence is all about the swap being an effective hedge; the second sentence is all about the AI being clever. This is something that we humans are able to do effortlessly and instinctively. Now of course we've all been taught English and spent a whole bunch of time reading books, articles, websites and newspapers. But you can see that to have any chance at all of developing an effective language model, the model has to be able to understand all these nuanced and complex relationships. The semantics of each word and the order of the words in the sentence will only get us so far. We need to imbue our AI with these capabilities of focusing on the specific parts of a sentence that matter, as well as linking together the specific words that relate to one another. For one sentence we have to link "it" with "swap", and in the other sentence we have to link "it" with "AI". And we have to do this solely with numbers: all our AI understands are scalars, vectors, matrices and tensors.

Fortunately for us, modern computer systems are extremely efficient at mathematical operations on tensors and can deal effortlessly with far larger structures, and with many more dimensions, than the ones spinning on your screen. So let's spend the rest of this video describing what design the developers of Transformers came up with, and in the next video we'll take a deeper look at how this design works.

The solution was to come up with three matrices that operate on our word embeddings. Recall from the previous two videos that these embeddings contain a semantic representation of each word. If you recall, this semantic representation was learned based on the frequency and occurrence of other words around a specific word. The embeddings also contain positional information; this positional information was not learned but rather calculated using periodic sine and cosine waves.
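For readers who want to see that sine-and-cosine scheme concretely, here is a minimal Python sketch of the sinusoidal positional encoding from the original "Attention Is All You Need" paper; the dimension sizes are arbitrary placeholders, and the earlier Lucidate video may present the formula in a slightly different form.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed (not learned) positional encodings built from sine and cosine waves.

    Each position gets a d_model-dimensional vector; even dimensions use sine,
    odd dimensions use cosine, with wavelengths forming a geometric progression.
    """
    positions = np.arange(seq_len)[:, np.newaxis]          # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]         # shape (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)  # per-dimension frequencies
    angles = positions * angle_rates                       # shape (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

# The encodings are simply added to the semantic word embeddings, e.g.:
# embeddings_with_position = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Because these values come from a formula rather than from training, they can be computed for any position, however long the input sequence.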
The three matrices are called Q for query, K for key and V for value. Like the semantic embeddings, the weights in these matrices are learned. That is to say, during the training phase of a Transformer such as GPT-3 or ChatGPT, the network is shown a vast amount of text. If you ask ChatGPT just how much training information, it will explain that hundreds of billions to over a trillion training examples have been provided, with a total of 45 terabytes of text. We have only ChatGPT's word for this, as the information is not publicly disclosed, but ChatGPT asserts that it is not given to overstatement or hyperbole.

The method that GPT-3 uses for updating the weights is backpropagation. Lucidate has a whole series of videos given over to backpropagation, and there is a link in the description. But in summary, backpropagation is an algorithm for training neural networks, used to update their internal weights to minimize a loss. Firstly, the network makes a prediction on a batch of input data. Secondly, the loss is calculated between the predicted and actual output. Thirdly, the gradients of the loss with respect to the weights are calculated using the chain rule of differentiation. Fourthly, the gradients are used to update the weights. And finally, this process is repeated until convergence. Backpropagation helps neural networks like Transformers to learn by allowing them to iteratively adjust their weights to reduce the error in their predictions, improving accuracy over time.
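To make those five steps concrete, here is a minimal sketch of the same predict, loss, gradient, update loop on a toy one-weight linear model rather than a Transformer; the data, learning rate and stopping threshold are all invented for illustration, with the gradient worked out by hand via the chain rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a single-weight linear relationship y = 3x (purely illustrative).
x = rng.normal(size=(100, 1))
y = 3.0 * x

w = np.zeros((1, 1))        # the "internal weight" to be learned
learning_rate = 0.1

for step in range(200):
    # 1. The network makes a prediction on a batch of input data.
    y_pred = x @ w
    # 2. The loss is calculated between the predicted and actual output (mean squared error).
    loss = np.mean((y_pred - y) ** 2)
    # 3. The gradient of the loss with respect to the weight, via the chain rule:
    #    dL/dw = dL/dy_pred * dy_pred/dw
    grad_w = 2.0 * x.T @ (y_pred - y) / len(x)
    # 4. The gradient is used to update the weight (gradient descent).
    w -= learning_rate * grad_w
    # 5. Repeat until convergence.
    if loss < 1e-6:
        break

print(f"learned weight: {w.item():.3f}, final loss: {loss:.8f}")
```

In a real Transformer the same loop runs over billions of weights and the gradients are computed automatically, but the shape of the procedure is identical.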
So what are these mysterious query, key and value matrices whose weights are calculated while the network is being trained, and what role do they perform? Remember that these matrices will operate on our positional word embeddings from our input sequence. The query matrix can be thoughtked of as the particular word for which we are calculating attention, and the key matrix can be interpreted as the word to which we are paying attention. The eigenvalues and eigenvectors of these matrices typically tend to be quite similar. The product of these two matrices gives us our attention score: we want high scores when the words need to pay attention to one another, and low scores when the words are unrelated in a sentence. The value matrix then rates the relevance of the pairs of words that make up each attention score to the correct word that the network is shown during training.

Now, that's a lot to take in. Let's back up and use an analogy for what's going on in the attention block, then take a look at a schematic for how these Q, K and V matrices work together, before finally looking at the equations at the heart of the Transformer to complete our understanding of the design.

First, then, an analogy. Our Transformer is attempting to predict the next word in a sequence. This might be because it's translating from one language to another, it might be summarizing a lengthy piece of text, or it might be creating the text of an entire article simply from a title. But in all cases its singular goal is to create the best possible word, or series of words, in an output sequence. The attention mechanism that helps solve this is complex, and the linguistic concepts are abstract. To understand this mechanism better, let's imagine that you're a detective trying to solve a case. You have a lot of evidence, notes and clues to go through. To solve the case, you need to pay attention to certain pieces of evidence and ignore others. This is exactly what the attention mechanism does in a Transformer: it helps the Transformer to focus on the important parts of the text and ignore the rest.

The query (Q) matrix is like the list of questions you have in your head when you're trying to solve a case. It's the part of the program that's trying to understand the text; just as your list of questions helps you understand the case, the Q matrix helps the program understand the text. The key (K) matrix is like the evidence you have: it's all the information that you have to go through to solve the case, and you want to pay attention to the evidence that's most relevant to your questions. In the same way, the product of the Q and the K matrices gives us our attention score. The value (V) matrix is the relevance of this evidence to solving the case. Two words might attend to each other very strongly but, as a singular and non-exhaustive example, they might be an irrelevant pronoun and a noun that doesn't help us in determining the next predicted word in the sequence.

So we have an analogy, using questions, evidence and relevance for queries, keys and values, and I hope that analogy is helpful. But how do the matrices work together? In this schematic we can see that we first multiply the Q and K matrices together, then we scale them somehow, we pass them through a mask (we'll discuss this mask in detail in the next video), we then normalize the results, and finally we multiply that result by the V matrix. We can formally write this down with the following equation. We first multiply the query matrix with the transpose of the key matrix, and this gives us an unscaled attention score. We scale this by dividing by the square root of the dimensionality of the key matrix; this can be any number, but a standard is 64, which will mean dividing by 8. We then further normalize using a softmax function, which ensures that the weights assigned to all the attention scores will sum to one. Finally, we multiply these scaled and normalized attention scores by our value matrix.
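Putting the pieces together, the equation being described is the scaled dot-product attention from the paper: Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V. Below is a minimal NumPy sketch of that computation; the embedding and key dimensions (16 and 64) and the random weight matrices are placeholders, and the optional mask argument simply stands in for the masking step discussed in the next video.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v, mask=None):
    """X holds the positional word embeddings, one row per token."""
    Q = X @ W_q                      # queries: the words we are calculating attention for
    K = X @ W_k                      # keys: the words being attended to
    V = X @ W_v                      # values: how relevant each attended word is

    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # unscaled scores, divided by sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # each row now sums to one
    return weights @ V               # weighted combination of the values

# Toy usage with made-up sizes: 5 tokens, embedding size 16, key dimension 64
# (so the scores are divided by sqrt(64) = 8, as in the video's example).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 64)) for _ in range(3))
out = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(out.shape)   # (5, 64)
```

Note that the softmax is what guarantees each row of attention weights sums to one, and the division by sqrt(64) = 8 matches the standard key dimensionality mentioned above.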
So, to summarize: we use Transformer models like ChatGPT and GPT-3 to perform language processing. This might be translation from French to German, or translation from English to a computer program written in Python; alternatively, it might be summarizing a body of text or generating a whole article based just on a title. In all cases this involves predicting the next word in a sequence. Transformers use attention to dynamically weight the contribution of different input sequence elements in the computation of the output sequence. This allows the model to focus on the most relevant information at each step and better handle input sequences of varying lengths, making it well suited for the translation, summarization and creative tasks just outlined. The attention mechanism is captured using three huge and crazily abstract matrices. The values in these matrices are obtained using a technique called backpropagation over a huge number, perhaps hundreds of billions, of training examples. This attention mechanism, along with the semantic and positional encodings described in the previous videos, is what enables Transformer language models to deliver their impressive performance.

This is Richard Walker from Lucidate. Please join me in the next video, where we will take a deeper dive into the Transformer architecture and look at examples of training and inference of Transformer language models.
Info
Channel: Lucidate
Views: 61,489
Keywords: chatgpt, chatgpt examples, chatgpt explained, neural network, transformers, ai transformer, ai, artificial intelligence, deep learning, deep learning explained, gpt3, openai, nlp, natural language processing, machine learning, attention, attention is all you need, attention is all you need explained, attention is all you need paper explained, what is chatgpt, how to use chatgpt, chatgpt tutorial, artificial intelligence transformer, how transformers work machine learning
Id: sznZ78HquPc
Length: 13min 56sec (836 seconds)
Published: Tue Jan 31 2023