Live - Transformers In-Depth Architecture Understanding - Attention Is All You Need

Captions
Hello guys, can you hear me clearly? Can I get a quick yes? I hope everybody is able to hear me — perfect. We'll start in two minutes so that everybody joins; sorry for the two-minute delay.

So guys, today we will be discussing Transformers. We'll understand the full in-depth intuition and the architecture. And the second part of the Tomato project, don't worry, it will come — it will actually be in Google Colab, and I'll try to upload that. I hope everybody is super happy that we are finally doing this Transformers session.

It was somewhere around last year that I was planning to learn Transformers. If I talk about the "Attention Is All You Need" research paper — we'll go to that paper first — trust me, it has taken me many days to learn this, and you really have to go through it a number of times. After Transformers we will also be covering BERT in another live session. Will there be a practical implementation of the attention Transformer in this session? Not in this one; we'll just understand the whole architecture. Later on there is a library called Hugging Face, and we will probably discuss that in another live session. Let's take it one by one.

I really want to dedicate this whole session to the researchers. They have done some amazing work, they have a lot of patience, and they are continuously working and sharing knowledge. So in this session we'll be covering "Attention Is All You Need", and a big hats off to all these researchers for this amazing paper. I could not understand much of it initially, but again, hats off to them. This is the whole research paper; the link is given in the description. It was basically a collaboration between researchers from Google Brain and the University of Toronto — you can see Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin. I hope I'm pronouncing the names right, guys; if not, I'm extremely sorry. And I don't know where Illia Polosukhin actually works — I tried to find some information but I could not.

This is the whole architecture of the Transformer model, and trust me, just by reading the research paper the first time you will not be able to understand it — at least I am not that intelligent that I could, because there are a lot of complex equations. Initially when I saw it I said, okay, I'm just going to skip this because I'm really not able to understand it; I had completed around 10 to 20% and was trying to follow it, and it really messed up my head. But you will see that, with some help apart from the research paper itself, I was finally able to understand it.
Someone in the chat says something like, "If other YouTubers do a streaming by mistake, please warn them through LinkedIn" — okay, I've never done that.

Now, I really want to give a very good hats off to the people over here. Apart from the researchers, this whole video is basically credited to Jay Alammar. He is the guy who has written this amazing blog on Transformers, explaining it in a much more animated way. Trust me, guys, after reading his blog I went back to the research paper. Initially, when I tried to read the paper on my own, I was hardly able to get past the architecture diagram, and when I saw that architecture I was shocked at what was happening — because whatever we had discussed till now about RNNs, LSTMs and encoder-decoders was pretty simple, and from those papers I was able to understand the concepts, but this one was very, very difficult. Thank God we have Jay Alammar. He has written this amazing blog — the link is given in the description — and we are going to discuss Transformers through his blog.

Jay also has a YouTube channel. On that channel he has uploaded a video on GPT-3 — "How GPT-3 Works, Easily Explained with Animations" — and I really liked his teaching. I don't know why he has so few subscribers; he may have uploaded just three videos, but trust me, he has done a whole lot of work. There is also his site, jalammar.github.io, where he has put up a lot of amazing articles that I have been going through — one of them is "The Illustrated Transformer". Again, I'm telling you, the whole credit for this session goes to him; just think that I would not have been able to take this session without Jay's blog. So please make sure you subscribe to his channel — the link is in the description — because I think you'll be seeing some amazing sessions and blogs from him in the future. He has set up this whole github.io page, and recently he also updated "How GPT-3 Works" with visualizations and animations, which should be pretty amazing if you go through it.

So what we'll do is this: first we'll go through his blog, and then we will go through the research paper, which you'll then be able to understand in a much easier way. And remember, guys, the whole credit goes to Jay for this amazing work — follow him on github.io and subscribe to his YouTube channel; the link is given in the description.

Perfect. So are we good to go, everyone? Can I get a quick go-ahead so we can start? I really admire people who teach well, and I tell you to go and subscribe to their channels because you will learn a lot — not only from Jay but from many people. So shall we start, guys? We'll go through it line by line, step by step, and it'll be pretty amazing. Okay, many people are saying go ahead.
Okay guys. First of all: if you really want to understand this whole architecture, just follow Jay's blog. It's written in very simple words and you'll definitely be able to read it, but I will make it a bit easier by telling you how this actually works. After finishing this session, though, I'd request you to please go through the blog and read it from start to end.

So, guys — Transformers. What is the use of Transformers? Let's start with the simple thing. This is my Transformer; just consider the whole thing as a black box. Right now we are converting sentences from one language to another — the input is French and we are trying to convert it into English. We usually say that machine learning or deep learning models are like black boxes: we give some input and we get an output. Fine.

After this we go to the next section: what is inside this black box? And remember, guys, for each and every step there is a wonderful explanation given in the blog — all these links are in the description of this video, so please go there and follow along with me, and we'll learn together. You may have learned it some other way, but let's learn it live together here, and if I miss some points you can tell me.

So, this is our Transformer; we are considering it as a black box where we give an input and get an output — in this case, sentences in one language translated into another. Now, what is present inside this Transformer? It is simple: whatever we have learned about encoders and decoders. Inside, we have encoders and we have decoders. To the encoders we give some input, which in this case is the French sentence, some processing happens inside, and from the decoders we get the output — again, for now consider each part as a black box and then we'll go and understand what is present inside it. We have already discussed in a previous session how encoders and decoders work. So here you can see French on one side and English on the other: "Je suis étudiant" gets converted to "I am a student". Perfect.

Now let's go inside these encoders and decoders and see what is actually present there. Let's go to the next step: I'm going inside the encoder block and seeing what it is. The blog — and it is an amazing blog; it maps directly onto the research paper, which we'll also have a look at — says that inside this encoders block you have multiple encoders stacked like this, and you have to remember the count, guys.
One important thing here is the count. If you go and look, the number of encoders is six: 1, 2, 3, 4, 5, 6. Now the question comes: why six? According to the research paper on the right-hand side, "Attention Is All You Need", they tried different numbers of encoders, and selecting six gave good results. So inside this encoders block we have six encoders, and in the decoders block we likewise have six decoders. Again, guys, you can change this value — it is a kind of hyperparameter, and there are a lot of hyperparameters coming up in this session. Just like the number of hidden layers or the number of nodes in a hidden layer, you can consider it a hyperparameter, and the accuracy may change based on that value. In this case we have six encoders and six decoders.

Now it is very simple: the input goes into the encoder stack, something happens inside, the result is finally given to the decoders, and there I get the output. So you take this as the input and this as the output. Is everybody clear with this much? Just by looking at the blog I think you'll be able to understand. Any questions about this step? Can I get a quick thumbs up? I'll give you two minutes — till then I'll drink some water.

Okay, see, this is very simple: this is your black-box model. There are encoders and decoders; the encoders block is represented by these six individual encoders — I'll tell you how an encoder works and everything, we are going from the top level — and the output of the final encoder is given as an input to every decoder. How the words are encoded, we will try to understand next. So here is our input and here we get the output.
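Before we go deeper, here is a minimal sketch collecting the base-model sizes from the "Attention Is All You Need" paper that we'll keep referring to in this session. This is not code from the session or the blog — just my own container for the numbers; a couple of them (the number of heads and the feed-forward size) only come up later on.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Base-model values from "Attention Is All You Need"
    num_encoder_layers: int = 6   # the stack of 6 encoders discussed above
    num_decoder_layers: int = 6   # and 6 decoders
    d_model: int = 512            # embedding / hidden size carried through the model
    num_heads: int = 8            # attention heads (covered later in this session)
    d_k: int = 64                 # per-head query/key/value dimension (512 / 8)
    d_ff: int = 2048              # inner size of the position-wise feed-forward layer

cfg = TransformerConfig()
print(cfg)
```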
Now let's see what is present inside one encoder. Inside each encoder there are two separate layers: one is self-attention and one is a feed-forward neural network. The feed-forward network is just an artificial neural network, so we'll take that as given. The self-attention is completely derived from "Attention Is All You Need", and it plays a very, very important role — again, I'm telling you, self-attention plays a very important role.

A quick recap of what we've understood so far, because it's very simple: this is my Transformer, it has encoders and decoders; inside the encoders block there are multiple encoders, and inside the decoders block there are multiple decoders; the encoder stack takes an input, all the processing happens inside, the output of the final encoder is given as an input to all the decoders, and we finally get an output. And inside each encoder there are two layers: self-attention and a feed-forward neural network. What is the difference between attention and self-attention? I'll come to that.

And guys, as I go along, if you have time, please also read the corresponding statements in the blog — that is what I'm explaining from. Now let's take an example with one encoder and one decoder and see how things actually happen. Inside the encoder you have the self-attention layer and the feed-forward layer. What is the input that goes to this encoder? We have seen some input going in — "Je suis étudiant" — and that input is text. That text is first converted into some dimensions. How do you convert input text? By using an embedding layer — an embedding technique; it may be word2vec, it may be other techniques. In this particular case each word from the sentence is converted into a vector — word2vec-style embeddings, as described here — of size 512. Why 512? Again, it is a kind of hyperparameter chosen in the research paper. And this paper was not written in one or two days; they explored it for a long time. The paper came out somewhere around 2017, I think, and I only started reading it around July or August 2019 — I was quite late, because I was working, and after I started uploading videos on YouTube I began focusing more on advanced concepts. So I'd say YouTube has actually helped me learn more.

So each and every word is converted, using the embedding, into a vector of 512 dimensions; similarly the second word gets converted, and the third — whatever input we give first gets converted into these 512-dimensional vectors. Just understand it in this particular way.
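To make the 512-dimensional embedding idea concrete, here is a minimal NumPy sketch. Note that the paper itself learns the embeddings together with the model (the session loosely calls it word2vec); the toy vocabulary and the random matrix below are purely illustrative and not from the blog.

```python
import numpy as np

d_model = 512
vocab = {"je": 0, "suis": 1, "etudiant": 2}                # toy vocabulary, illustrative only

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))  # in practice these weights are learned

def embed(sentence):
    """Map each word to its 512-dimensional vector; all words are embedded at once."""
    token_ids = [vocab[w] for w in sentence.split()]
    return embedding_matrix[token_ids]                     # shape: (num_words, 512)

X = embed("je suis etudiant")
print(X.shape)   # (3, 512) -> one 512-d vector per word, fed to the encoder in parallel
```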
If you don't know about word2vec or embedding techniques, just go and see my deep learning playlist, guys — I have uploaded videos on that.

Now the next step: once the input is converted into vectors, it is passed to the self-attention layer. What exactly happens inside the self-attention layer we will understand in the upcoming steps, so don't worry about that; just imagine we are giving these vectors to the self-attention layer. And remember, the words will not be given one at a time, the way we do it in an RNN. In an RNN the x-axis is basically the time step, so words go in time step by time step. Here the inputs are not given per time step — all the inputs are given at the same time, in parallel. What do I mean by that? Suppose I give three words: each word first gets converted into its 512-dimensional vector, and then all of these inputs are given at once. That is the amazing thing, guys. And trust me, I was able to learn this whole thing — attention is all you need, Transformers, BERT — because of Jay; just by reading the research paper I was nowhere near satisfied, I had only half knowledge, but by reading his explanation I was able to understand it.

So the words get converted into 512-dimensional vectors and passed to self-attention — all the words in parallel — and then you get some outputs like z1, z2 and z3. How this self-attention works and how these vectors are created, we'll understand shortly. What Jay has done in his diagram is simplify it from three words to two, and you can see that after self-attention the result goes to the feed-forward neural network. He has drawn two separate feed-forward networks because each word goes through it in parallel, and he has taken just two words. And remember, the output of the feed-forward neural network goes to encoder 2 — it does not stay in encoder 1. So you'll see that the outputs r1 and r2 that are created go on to encoder 2.
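Since this feed-forward sublayer keeps coming up, here is a rough sketch of what it does to each position (each word) independently. The two linear layers with a ReLU in between and the inner size of 2048 come from the paper's base model; the random initialization is only for illustration, since these weights are really learned by backpropagation.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.01, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.01, np.zeros(d_model)

def feed_forward(Z):
    """Applied to every position (word) separately and identically: max(0, Z W1 + b1) W2 + b2."""
    return np.maximum(0, Z @ W1 + b1) @ W2 + b2

Z = rng.normal(size=(2, d_model))    # e.g. z1 and z2 coming out of self-attention
R = feed_forward(Z)                  # r1 and r2, which are passed up to encoder 2
print(R.shape)                       # (2, 512)
```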
Now let's understand self-attention at a higher level. Jay takes this example: "The animal didn't cross the street because it was too tired", and asks a simple question — what does "it" in the sentence refer to? Does it refer to the street, or does it refer to the animal? Remember, this "it" has to be translated — suppose I'm converting the sentence into French, this "it" gets passed as part of the input — so what should the output for it be? What self-attention basically says is that it will produce the output based on what this "it" is actually referring to; based on that, it creates the vectors. As the blog puts it, this is a simple question for a human, but the algorithm cannot understand it by itself. So how do we make the algorithm understand that when this "it" goes in, it is basically referring to the word "animal"? And this "it" can also relate to "tired" — it can refer to multiple words.

So guys, I hope everybody is understanding till here — can I get a quick go-ahead? Because now we are going to get into the maths of how self-attention actually works. Is it interesting? Perfect. Someone asks: are all six encoders the same? Yes, all six encoders are the same — inside each you have self-attention and a feed-forward neural network — but remember, the input to the second encoder is the output of the first encoder; the original input does not go there again. I'll talk about that as well. And yes, Jay has structured it amazingly — I'm telling you, the whole credit for this video goes to him; I understood Transformers because of Jay.

Now, the next thing: when I give this sentence as input, the self-attention layer should give me vectors that specify the importance of words like "animal" or "tired" for the word "it". In the visualization of one of the attention heads — and this is just a diagrammatic representation — when you give the sentence to the self-attention model, the word "it" assigns importance to the word "animal". How do you see that? By the darkness and thickness of the line: the line to "animal" is very thick. You'll also notice one problem: it is not picking up the word "tired" properly. We'll see how to fix that, but first let's understand how this self-attention model works — how we are able to give importance to other words like "animal" when the word "it" comes in. Now, self-attention in detail.
First of all we have the input — these are my words, given in parallel, and we have converted them into 512-dimensional vectors, so here I have vector x1 and vector x2. Now, the first step is that we create three weight matrices — WQ, WK and WV — and these weight values are randomly initialized. Remember, these weights will change through backpropagation: the loss is calculated at the end of the decoder, backpropagation happens, and the weights get updated. So we create these three weight matrices WQ, WK, WV — we'll discuss what queries, keys and values are in a moment.

What happens is that the vector x1 is multiplied by WQ to give q1, and x2 is multiplied by WQ to give q2. The inputs are 512-dimensional, and the result of this multiplication is 64-dimensional — that is what is written in the research paper. Why do we multiply? We know the usual formula for an ANN: take the input, multiply by the weights, add a bias, and apply an activation. In this case we just take the input and multiply it with WQ to get the query, and this query is 64-dimensional. Similarly for the keys: x1 multiplied by WK gives k1, and x2 multiplied by WK gives k2. And for the values: x1 multiplied by WV gives v1, and x2 multiplied by WV gives v2.

So that is what I'll call step one: two things have happened. First, the embedding converted the words into 512-dimensional vectors; then these multiplications created your queries, keys and values, each of 64 dimensions. How does WQ convert 512 dimensions into 64? Remember the shape of the weight matrix — if you multiply a 512-dimensional vector by a 512×64 matrix, you get a 64-dimensional vector. I hope everybody is clear with step one; then we'll go to step two. Why 64 dimensions? Again, this is considered a hyperparameter, just like 512. Can you change it? Yes, you can definitely change it.
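Here is a small NumPy sketch of step one: multiplying each 512-dimensional input vector by WQ, WK and WV to get 64-dimensional queries, keys and values. The random weights below are just stand-ins for values that would really be learned during backpropagation.

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(2)

X = rng.normal(size=(2, d_model))              # x1, x2: the two embedded words ("thinking", "machines")
W_Q = rng.normal(size=(d_model, d_k)) * 0.01   # randomly initialized, updated by backpropagation
W_K = rng.normal(size=(d_model, d_k)) * 0.01
W_V = rng.normal(size=(d_model, d_k)) * 0.01

Q = X @ W_Q    # q1, q2 -> shape (2, 64)
K = X @ W_K    # k1, k2 -> shape (2, 64)
V = X @ W_V    # v1, v2 -> shape (2, 64)
print(Q.shape, K.shape, V.shape)
```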
Okay, perfect — everybody is clear, so I'll rub this out and go to step two. And again, amazing diagrams, guys; any layperson, even a college student, will be able to understand this — it's just matrix multiplication, trust me.

Now, the next step. I have q1, k1 and v1 for the first word ("thinking"), and q2, k2 and v2 for the second word ("machines"). Why are queries, keys and values created? I'll tell you — everything will make sense as you go ahead through the blog. What are WQ, WK, WV? These are the weight matrices for generating queries, keys and values; their shape is fixed, but the weights inside are randomly initialized, because in backpropagation all of these weights will be updated — think of it in that specific way.

Okay guys, the next step. For the word "thinking" we have generated x1, q1, k1, v1; for "machines" we have x2, q2, k2, v2. Next, for each word, we multiply queries with keys. For the first word, q1 is multiplied with k1, and the same q1 is also multiplied with k2 — so for "thinking" you get two values: q1·k1 and q1·k2. For the second word, q2 is multiplied with k1 and then with k2 — so those two operations take place for "machines". The value you get is called the score. Why multiply q1 and k1? Just wait a second — remember the problem in the attention example: we were trying to find out whether "it" was referring to the animal or to the street. Here, in the same way, I'm trying to find out whether "thinking" is most related to "machines" or to some other word. For that, I first multiply queries with keys and get a score: with respect to "thinking", the score against itself comes out as 112, and against "machines" it comes out as 96. So this is basically step two, where I create a score for each word by multiplying its query with all the keys.

What if there were three words? Very simple, guys — you'd have q1, q2, q3 and so on. Suppose the third word is "go": "go" will be your x3, again a 512-dimensional vector.
For that third word you'll also generate q3, k3 and, finally, v3 — simple, right? And the scoring works the same way: q1 gets multiplied by k1, then by k2, then by k3. Now, someone is asking, "Sir, can you explain what a neural network is?" — please go and check my complete deep learning playlist; we are too far into the advanced stage now, so if you want the older material you have to go back.

So I hope everybody is clear. Now let's go to step three. In the previous step we computed the score; in this step I'm going to divide it by 8. Why 8? Because, as I told you, the queries and keys have 64 dimensions, and this 8 is nothing but the square root of d_k — d_k is the dimension of the keys, which is 64, and the root of 64 is 8. So we divide 112 by 8 and 96 by 8, which gives 14 and 12. That is step three.

In step four I apply a softmax activation function. If you don't know about softmax, guys: whenever you have multiple values like this — and 14 is definitely greater than 12 — softmax gives outputs whose total always equals one. Since 14 is greater than 12, the higher output, around 0.88, goes to it, and the lower output, 0.12, goes to the other; but the interesting thing is that when you add them, the result is always one. If you had three words, the softmax would give three values, but the summation would still equal one. Why softmax? Because we are trying to find the most important words based on probability: for "thinking", after passing the scores through the softmax, the highest importance goes to one word, the second-highest to the next, and so on — pretty simple.

Someone says, "Krish, we are not able to understand why we divide by the root of the dimension." In one of my interviews I actually got this question. This is based on the research paper, guys: they experimented with different values, and since the query size they were using was 64 dimensions, they settled on dividing by the square root of 64.
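To make steps two to four concrete, here is the small numeric example from the blog worked out in code: the scores 112 and 96 for "thinking", divided by 8 (the square root of d_k = 64), then pushed through a softmax.

```python
import numpy as np

d_k = 64
scores = np.array([112.0, 96.0])          # q1·k1 and q1·k2 for the word "thinking"
scaled = scores / np.sqrt(d_k)            # -> [14., 12.]

softmax = np.exp(scaled) / np.exp(scaled).sum()
print(scaled)    # [14. 12.]
print(softmax)   # approximately [0.88 0.12], and the two values always sum to 1
```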
That is what is mentioned in the blog as well — it is a kind of hyperparameter, guys: the 64 dimensions, and the 8 is simply the root of d_k, where d_k is the dimension. So is everybody clear with these two steps — dividing by 8, the root of d_k, and the rest? I hope you are able to understand it properly; just give me a quick go-ahead. Are you liking it, or do you feel you're not able to follow my explanation? I'll try to improve, but we are just going step by step.

Now, what is the next step? We have applied the softmax and we are getting these two values. And yes, the earlier sessions should be available in the deep learning playlist, guys — this is a continuation of the previous live sessions. I will also try to cover BERT after this, and probably using Jay's blog again, because he has written it in an amazing way — "A Visual Guide to Using BERT for the First Time", "The Illustrated GPT-2", amazing blogs. Also see the talk he gave — at QCon London, I believe — I just saw it today and it was very, very good. Someone says, "I don't understand what query, keys and values are and why we calculate them to provide the self-attention." As I told you: now I'm getting 0.88 and 0.12, so I know that for "thinking", the first importance has gone to that word itself and the next importance to "machines" — and the same idea extends when you have many words.

In short, what is the next step? Whatever softmax values we have — notice that we still have not used the value vectors. So the softmax value is multiplied by the value vector, and we get a different, weighted vector for each word: v1 and v2. Finally we add these two vectors to get z1. Now you may be wondering, "Krish, why are we multiplying by this value?" Understand, guys, researchers continuously try various things and various parameters, and you'll see that through this multiplication and addition it simply works well. Also, the value vectors we created should not just go to waste — they have to be used somewhere. So the softmax output is multiplied by the values, we get the weighted v1 and v2, these are added, and we finally get z1. With respect to "thinking", z1 is the output of the self-attention layer, and z2 will be the output for the word "machines" when it passes through self-attention. So that was this particular step; let me drop this and move on.
So, simply, one more multiplication and one more addition are done here: we multiply the values by the softmax outputs and then sum v1 and v2. And this sum also gives you information about the other words, because you are adding their weighted values — so finally you get the vector z1. How many words can be taken as input? Any number of words, guys; for language translation this works absolutely fine even for very big sentences — there is no problem with respect to that.

Now let's go to the next step. I have got my z1 and z2 — so what kind of multiplication is taking place overall? All the steps can be written compactly: you take the input X and multiply it by WQ to get Q; you take the same input and multiply it by WK to get K; similarly X multiplied by WV gives V. Mugesh Kanan asks, "Is there any window size, like in the attention model, sir?" No — there is no window size to give, because all the words go in parallel; understand that one thing.

So with these multiplications you get Q, K and V. After that you multiply Q by K — and obviously, when you multiply these two matrices you have to transpose one of them, so in this case it is K transpose. Then you divide by the root of d_k (d_k being the dimension, 64), you apply the softmax activation on top of that, and then you multiply by V — and finally you get the Z values. Here the first row is your first word and the second row is your second word. This is important to understand: we have combined all the words — the first row of X is nothing but "thinking" and the second row is "machines" — we have stacked all the words one after the other and we do a single matrix multiplication with WQ, WK and WV. If there are three words, we just add one more row vector, again 512-dimensional, and you can work out what dimension comes out — that would be the third word, say "go"; a fourth word gets stacked the same way. With this we can do the matrix multiplication very easily — that is the main gain: we don't have to wait, we give all the words at once. Are you getting it, everyone?
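Putting the matrix form together — softmax(Q·Kᵀ/√d_k)·V — here is a compact sketch of the whole self-attention computation for all words at once. Again, the random weights only stand in for learned ones.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Z = softmax(Q K^T / sqrt(d_k)) V, computed for all words in parallel."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # one score per (query word, key word) pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the value vectors

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 512))                                  # three words, stacked one after the other
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.01 for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                                                 # (3, 64): one z vector per word
```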
How can we create the weights for keys, values and queries? We just have to provide the size, guys; the weights are initialized randomly — remember this — and if you don't know how weights are initialized, go and check my deep learning playlist. In backpropagation these weights get updated: once the whole forward pass from encoder to decoder is done, we compute the loss function, and based on that loss an optimizer updates the weights to reduce it. So yes — the query, key and value weights are all learned during backpropagation. Again, it is very difficult to visualize every internal detail, which is why I'm following this diagram; a lot happens internally, and it's hard to spell out exactly how each value gets updated. How do we choose WQ and WK? They are initialized randomly and updated during backpropagation — always remember that; don't get confused, and if you read the research paper you'll be able to understand. Fine, let's go to the next step.

So I hope everybody has understood, and we have got the final Z value. Now, what is the one problem with this? Remember that we are using one WQ, one WK and one WV for all the words. Someone asks: are we passing words as a single input, or a sentence? See, guys, in a sentence there are many words; each word is converted into a vector of the same dimension, 512, the vectors are stacked one after the other, and the whole stack goes in. And no, I have not posted the coding yet — it will come in an upcoming session.

Now, what is the next step? Whatever inputs we have used, we have used the same WQ, WK and WV for all the words. So we can call this set of three weight matrices a single-head attention. But what is the problem with using just a single head? In this case, when I applied single-head attention, you can see that for "it" the importance is given to words like "animal". But consider another word like "tired" — "it" also relates to "tired", because "it" represents the animal that is tired — and in the visualization that line is very, very light. We should also get some importance for the word "tired", and right now we cannot, because we used a single set of weights, the same weights for all the words. So what if we use multiple sets of weights? Instead of single-head attention we will use multi-head attention.
Multi-head attention basically says: we will use head zero with its own weights W0Q, W0K, W0V, head one with W1Q, W1K, W1V, and so on — so if I'm using two heads, the first set is my attention-head-zero weights and the second set is my attention-head-one weights. This helps us find the importance of the other words — how related each word is. In our case, with a single head, "it" was getting related only to "animal", but it should also give some importance to "tired". Since we were using just one head — one set of WQ, WK, WV — the same weights were being used for every word. So what if we have different attention heads? Here you can see attention head zero, whose queries, keys and values are q0, k0, v0, and attention head one with q1, k1, v1. How many heads do we use? Based on the paper, eight heads. And when we use eight heads, that means we get eight Z outputs: Z0, Z1, Z2, Z3, Z4, Z5, Z6, Z7. With one head we got a single z value, but with eight heads we get Z0 through Z7.

That is still not the end, because I told you that when we pass the input through the attention sublayer we should get one z vector per word — but now we have eight of them. How do we combine them into one? For that, the next step is to combine all the Zs: we concatenate Z0 through Z7, and then, before passing the result to the feed-forward neural network, we initialize yet another matrix — this is specific to the multi-head case — and that matrix is multiplied by the concatenation of all the attention-head outputs. Finally we get one Z value, which we can forward to the feed-forward neural network. And again, this WO weight matrix is also updated during backpropagation.

I hope everybody is clear till here. Remember, we are still inside the encoder layer, guys; we are just understanding these two sublayers for now, because the same thing happens in every layer. We have calculated z1, and similarly z2 for the second word, and now we pass this to the feed-forward neural network — we'll see what happens after that, and we are still in the encoder only. Just imagine how much effort Jay must have put into this. Does it need a lot of computation power? Yes, definitely, if you are training it from scratch — but nowadays a lot of transfer-learning techniques are used; if you know about BERT, trust me, BERT is just one kind of Transformer where you'll mostly be using transfer-learning techniques for most NLP tasks.
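Here is a rough sketch of the multi-head version: eight independent sets of (WQ, WK, WV), eight z outputs of size 64, concatenated back to 512 and multiplied by the extra matrix WO before going to the feed-forward layer. It assumes the self_attention function from the earlier sketch is in scope; the weights are again random stand-ins for learned parameters.

```python
import numpy as np

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one per attention head."""
    Z_per_head = [self_attention(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    Z_concat = np.concatenate(Z_per_head, axis=-1)   # (num_words, 8 * 64) = (num_words, 512)
    return Z_concat @ W_O                            # back to (num_words, 512) for the FFN

rng = np.random.default_rng(4)
X = rng.normal(size=(2, 512))                                                    # "thinking", "machines"
heads = [tuple(rng.normal(size=(512, 64)) * 0.01 for _ in range(3)) for _ in range(8)]
W_O = rng.normal(size=(8 * 64, 512)) * 0.01          # also learned during backpropagation
Z = multi_head_attention(X, heads, W_O)
print(Z.shape)                                       # (2, 512)
```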
Someone asks how many weight matrices there are in total — you can count them up; there are many weight parameters. Another question: if there are ten words, which eight words will be chosen? No, guys — with ten words it is not that eight words get chosen; the eight heads are eight sets of weights, and each head attends over all the words. So this is the whole operation taking place: here is your input, here is attention head zero, attention head one, up to attention head seven; for each you get its queries, keys and values, you get z0, z1, up to z7, you combine all of them, multiply by the matrix WO, and finally you get the output Z. This is then passed to your feed-forward neural network, which I think is pretty simple for all of you to understand by now.

Now, why multiple heads — what is the importance, as I was saying? If I use two heads, you'll see that for "it" I now get the importance of the word "tired" as well as the importance of the word "animal" — both. That is just with two heads; with more heads, whichever words are most important get highlighted. As the blog puts it: as we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" — in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired". And if you add all eight attention heads to the picture, you get different colours and values for each of the eight. So, all good till here? I know it may sometimes be difficult, guys, but you really have to read it once or twice — an amazing blog has been given by Jay, just go through it — because implementation-wise it will be very easy; you'll be using libraries for all of this. Shall I move ahead? Perfect.

Fine, let's understand one more thing that we need to take care of. When we give words to the encoder after converting them into 512-dimensional vectors, there is one more concept applied before they enter the encoder, called positional encoding. What is positional encoding? It basically captures whether this word and that word are near each other — because the ordering of the words is also important.
The ordering of the words matters for any kind of sentence, so based on the ordering we provide a positional encoding. For example, in "Je suis étudiant", I should be able to tell that one word is much more related to a nearby word because of the distance between them, and that distance information is provided through the positional encoding. Here is how it looks: we give the input to the embedding and get a 512-dimensional vector, and then we apply a positional encoding on top. Just understand that for each position we assign a vector, and suppose the positional encodings for "je", "suis" and "étudiant" have been computed: if I measure the distance between the vectors of two neighbouring words, it will be smaller than the distance between the vectors of two far-apart words. It is basically telling you whether a particular word is the next word after this one or comes much later — it encodes how far apart the words are and whether they are in order. Nearby words end up with a small distance between their position vectors compared to words that are far apart. That is the reason we use positional encoding: the ordering of the words plays a very important role in figuring out what is most relevant.

And this happens when you provide the input to the encoders: before feeding the embeddings in, you compute the positional encoding and add it to the embedding vectors, and you get embeddings "with time signals" — the time signal simply means that whatever positional values you had have been integrated by adding them to the embeddings. So I hope everybody is able to understand this: positional encoding is what handles the ordering of the words.
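The session does not go into the exact formula, but in the paper the positional encoding is built from sines and cosines of different frequencies and simply added to the 512-dimensional embeddings. A minimal sketch of that idea:

```python
import numpy as np

def positional_encoding(num_positions, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...) — as in the paper."""
    positions = np.arange(num_positions)[:, None]             # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                  # even dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.random.default_rng(5).normal(size=(3, 512))   # embeddings for "je", "suis", "etudiant"
X_with_time_signal = X + positional_encoding(3)      # ordering information is now baked in
print(X_with_time_signal.shape)                      # (3, 512)
```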
Now let's move on, because there are a few more things to understand. Before the output is passed to the feed-forward neural network, guys, you should understand one more thing: after the self-attention we also apply a normalization step, and this step is doing two jobs. One detail in the architecture of the encoder that we need to mention before moving on is that each sublayer in each encoder has a residual connection around it, followed by a layer normalization.

Now what is this residual connection? Can anybody tell me? I hope everybody has heard of ResNet. If you go and see the ResNet architecture, suppose these are your hidden layers and the data passes through them one by one. Sometimes it may happen that two of those hidden layers are not playing a very useful role, so what we can do is take whatever input is coming in and pass it directly around them to a later layer. That is what we call a residual connection: we can skip some layers if they are not important and go directly ahead. Similarly, in our case, if the self-attention is not performing well, the residual connection lets the input skip it and go directly to the layer-normalization step. Yes, you can loosely compare it to dropout in the sense that the network gets a way to bypass that sublayer. The same thing happens around the feed-forward sublayer: if it is not important, the output of the previous Add and Normalize can flow directly to the next Add and Normalize, otherwise it goes through the feed-forward path.

So what all have we discussed so far? We understood encoders and how they operate with self-attention, we understood the keywords queries, keys and values, we understood positional encoding for handling the ordering of the words, and now we have understood residual connections. Honestly, if this blog were not there I would not have been able to understand all of this; I am not that smart, guys, I usually take time to understand things just from the equations, and if I cannot understand something I cannot teach it to you either, which is why some videos take time while I am still exploring.

I will take all the questions at the end, don't worry; I hope you are not getting bored even though the session has gone on for more than an hour. Now, what does this Add and Normalize actually do? It is nothing but this: whatever output Z I get from the sublayer, I take it, add it to my input, and then apply a layer normalization. That is what happens inside.
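Here is a minimal sketch of that Add and Norm step, assuming a plain layer normalization without the learnable gain and bias a real implementation would have, and with random placeholder tensors standing in for the sublayer input and output:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified layer normalization: zero mean / unit variance per token.
    # A real implementation also has learnable gain and bias parameters.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(sublayer_input, sublayer_output):
    # Residual (skip) connection followed by layer normalization:
    # LayerNorm(x + Sublayer(x)). Even if the sublayer contributes little,
    # the original input still flows through via the addition.
    return layer_norm(sublayer_input + sublayer_output)

x = np.random.normal(size=(4, 512))   # input to the self-attention sublayer
z = np.random.normal(size=(4, 512))   # pretend self-attention output Z
out = add_and_norm(x, z)              # this is what reaches the feed-forward layer
```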
Now, all together, this is the whole architecture, and you can see how the values are getting passed, but we still need to understand the decoder and what is inside it. Here you have encoder 1 and encoder 2; as I told you, in the research paper they have written that there are six encoders, but the example here is given with two, and all the encoders are almost the same, each one with its own residual connections. After this, the output of the encoder stack is passed to the decoders, but what is inside a decoder? Let's see.

Just compare the encoder and the decoder. In the encoder you have a self-attention layer and a feed-forward neural network. In the decoder you also have a self-attention layer and a feed-forward neural network, but in between you have something called encoder-decoder attention. What is this encoder-decoder attention? It is similar to the self-attention layer, but the difference is that the output of the encoder is passed into this layer of every decoder: it goes to decoder 1 and also to decoder 2, into their encoder-decoder attention. It is not passed in at the start, to the decoder's self-attention layer; instead it is fed to the encoder-decoder attention, and that layer does the same kind of work as the self-attention layer, except that its input comes from the encoder. Remember, the encoder output is passed to it as an input. That is pretty interesting, and it is very important to understand.
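Here is a rough sketch of that encoder-decoder attention, assuming a single head of scaled dot-product attention with randomly initialised projection matrices: the queries come from the decoder side, while the keys and values come from the encoder output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(decoder_hidden, encoder_output, W_q, W_k, W_v):
    # Queries come from the decoder side; Keys and Values come from the
    # encoder output. Otherwise it is the same scaled dot-product attention
    # used inside self-attention.
    Q = decoder_hidden @ W_q                    # (tgt_len, d_k)
    K = encoder_output @ W_k                    # (src_len, d_k)
    V = encoder_output @ W_v                    # (src_len, d_v)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (tgt_len, src_len)
    return weights @ V                          # (tgt_len, d_v)

rng = np.random.default_rng(1)
enc_out = rng.normal(size=(6, 512))   # 6 source tokens out of the encoder
dec_hid = rng.normal(size=(3, 512))   # 3 target tokens generated so far
W_q, W_k, W_v = (rng.normal(size=(512, 64)) for _ in range(3))
context = encoder_decoder_attention(dec_hid, enc_out, W_q, W_k, W_v)
```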
Now here you have to notice one more very important thing that I had not mentioned yet: what is the input to the decoder itself? Remember, the output of the encoder is given over here, all the processing is done, we get the decoder output, and then we apply a linear layer and a softmax. After the softmax we get an output word. Suppose I am converting English to Hindi and the word is "thinking"; thinking in Hindi is "sochna", so I'll write "sochna". Now this word "sochna" is passed back as an input to the self-attention layer of the decoder.

Remember one thing, guys: on the encoder side we provide all the inputs in parallel, but on the decoder side the output is generated one by one. So in the first step the encoder output goes in and the decoder produces "sochna". But after "sochna" we have to generate the next word, so "sochna" goes back in as an input, gets combined with the output of the encoder inside the encoder-decoder attention, all the processing happens again, we go through the linear and the softmax, and suppose "machine" gets converted into "yantra"; it is up to you which Hindi word you want to use, it will just get converted. In the next step both "sochna" and "yantra" are passed back in, because these two words, along with the encoder output, are now responsible for finding the third word. And remember, your original input does not go in again; the input went in once, at the start, but this operation keeps repeating until you get the end-of-sentence (EOS) token. Is it similar to a sequence-to-sequence model where we only use the final context vector? Yes, the idea is almost the same. Then the third word is passed back in, all these steps happen again, and so on. Have you understood this or not? I don't think I have ever taken such a long session, but trust me, this part is very, very important.

Sometimes it may not be completely clear, so I would suggest you have a look and read the blog along with the research paper. Now let's see how it works on the decoder side, step by step. At time step one the input is embedded, the positional encoding adds the time signal, the encoder produces its output, that output is passed to the decoders, and the first output word "I" comes out. At time step two that "I" is passed back in as the previous output, with positional encoding applied to it, and "am" is generated. Will the decoder also run in parallel? No, it is not parallel, it is one by one: at time step three "am" is passed in and "a" comes out, at time step four "a" is passed in and you get "student", and at time step five "student" is passed in and you get the end of the sentence.

I'm telling you, guys, this blog is amazing. I really want to thank Jay for the amazing blog he has written; I don't think anybody else would have invested so much time to write this, and I certainly would not have been able to. So much effort has gone into it.
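The generation loop just described can be written down as a small sketch. Note that `encode` and `decode_step` below are hypothetical stand-ins for a trained encoder stack and a decoder stack with its linear/softmax head; they are not real library calls. The point is the shape of the loop: the encoder runs once, and the decoder is called again and again on everything generated so far until the end-of-sentence token appears.

```python
# `encode` and `decode_step` are hypothetical stand-ins for a trained
# encoder stack and a decoder stack + linear/softmax head; they are not
# real library calls.

def greedy_decode(source_tokens, encode, decode_step,
                  bos_token="<s>", eos_token="</s>", max_len=50):
    # 1. The encoder sees the whole source sentence at once, in parallel.
    encoder_output = encode(source_tokens)

    # 2. The decoder generates one token at a time: everything generated so
    #    far is fed back in together with the (unchanged) encoder output,
    #    until the end-of-sentence token is predicted.
    generated = [bos_token]
    for _ in range(max_len):
        next_token = decode_step(generated, encoder_output)
        if next_token == eos_token:
            break
        generated.append(next_token)
    return generated[1:]   # drop the start-of-sentence token
```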
Jay's YouTube link is given in the description, guys; just follow him and subscribe to his channel, because we will also refer to his material when we discuss BERT. Now, the final linear and softmax layer: you can see, guys, this is how the log probabilities come out, and based on those probabilities you get the output word. You can read through all of this again; it really is amazing, this whole thing.

Someone asked why self-attention alone is not enough. No, it works well, but Transformers built on it work better; this is a state-of-the-art architecture. So a quick shout-out to Jay, subscribe to his channel. And you can see, guys, in "Attention Is All You Need" they show exactly this: scaled dot-product attention and multi-head attention. It is the same type of operation, softmax(QK^T / sqrt(d_k)) multiplied by V, and the multi-head version concatenates the heads. Self-attention is also compared there with recurrent and convolutional layers. So read the research paper now and you will be able to understand it.

In an upcoming session I think I will include some practical examples with the help of Hugging Face; there you will be able to understand things very nicely. But in the next session we will discuss BERT, and I will probably again refer you to Jay's lessons for that. When will the BERT session be? In a couple of days, most likely; I will schedule a session for that as well.

So I hope you have understood it. Thank you, please do subscribe, share with all your friends, and please do follow Jay Alammar; you will be able to see a whole lot of things from him. If you liked this video please hit like and share it with everyone, that would be very helpful. Tomorrow an interview video, maybe, let's see. And for the question about how exactly the attention model extracts relations by considering multiple heads: just read this blog once, I would request everyone to read it. Okay guys, thank you everyone for attending the session, I'll be seeing you in the next video. In a couple of days I'll also be coming up with the BERT session. Thank you guys, bye-bye!
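As a final illustration of that linear plus softmax step, here is a toy sketch: the decoder's output vector is projected onto a made-up vocabulary, softmax turns the scores into probabilities, and the highest-probability word is picked. The vocabulary, weights, and decoder output below are random placeholders, not anything from a trained model.

```python
import numpy as np

vocab = ["sochna", "yantra", "<eos>"]              # made-up toy vocabulary
d_model = 512

rng = np.random.default_rng(2)
W_vocab = rng.normal(size=(d_model, len(vocab)))   # final linear projection
decoder_output = rng.normal(size=(d_model,))       # one decoder time step

logits = decoder_output @ W_vocab                  # scores over the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # softmax probabilities
log_probs = np.log(probs)                          # the "log probabilities"

predicted_word = vocab[int(np.argmax(probs))]
print(predicted_word, log_probs)
```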
Info
Channel: Krish Naik
Views: 198,892
Keywords: yt:cc=on, transformers architecture, end to end trnasformers indepth intuition
Id: SMZQrJ_L1vo
Length: 79min 23sec (4763 seconds)
Published: Thu Sep 03 2020