LSTM explained simply | LSTM explained | LSTM explained with example.

Captions
Let's read this sentence together: "Aman stays in London. Aman has a friend whose name is Alex. Alex is a data scientist. Aman learns data science from ___." What should go in the blank? Based on our human intelligence we can simply say the right word for this place is "Alex", because Aman has a friend who is Alex, and Alex knows data science, so Aman must be learning data science from Alex. But notice something interesting here: you used two things together. One is called long-term memory and the other is called short-term memory. What is long-term memory? Long-term memory is "Aman and Alex are friends", and "Alex knows data science" is the short-term memory. When you combine these two things, you come to the right answer: Alex can teach data science to Aman. And when long-term memory and short-term memory combine together, we get something known as Long Short-Term Memory, LSTM.

Welcome to Unfold Data Science. My name is Aman and I am a data scientist. In this video I am going to make LSTM extremely simple for you; please make sure you watch till the end and you will have no need to watch any other LSTM video going forward.

Here are the topics for today's video. First we will see what the problems with a simple RNN are. Then we will look at LSTM at a high level: I will take the example sentence we just read, break it up, and show you how LSTM works. After that we will dig deeper into the various LSTM gates, for example what a forget gate is, what an input gate is, and what an output gate is. After that I will put all the mathematics in one place so that it is easier for you to revise anytime you want, and in the end I will summarize all the discussions and learnings.

So let's start with the problems with RNN. If you have watched my last video then it will be very easy to understand what I am explaining now, because that is a prerequisite for this video. Let's say I write a sentence here: "my name is Aman". As you know, in the world of a simple RNN this sentence goes into the network one word at a time. At the first time step I send "my" into the network and get some output, let's call it h1. In the next time step I send "name" into the network and get some output, then in the next time step I send "is", and in the next time step I send "Aman" as the input. That is how my network will look.

In this network there is a very basic problem that I explained in the last video as well, and its name is the vanishing gradient problem. What is this problem? When I reach the last time step, call it time step 4, I am trying to remember information from the outputs of all the previous time steps: what happened at this step, what happened at that step, and so on. Imagine it is a hundred-word-long sentence; then when I reach the end I am trying to remember what happened at all hundred previous time steps. Because of that, my network is not able to distinguish between important and unimportant words: some of these words may be important for my sentence and some may not be, but my network is not able to understand that. That is problem number one.
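To make this concrete, here is a minimal numpy sketch of a simple RNN reading "my name is Aman" one word at a time. It is not from the video: the embeddings, dimensions, and weights are toy values assumed purely for illustration. The point is that a single hidden state is pushed through the same recurrence at every step, which is exactly where long sentences run into the vanishing gradient problem.

```python
import numpy as np

np.random.seed(0)

# Toy vocabulary and random embeddings (assumed for illustration only)
sentence = ["my", "name", "is", "Aman"]
emb_dim, hidden_dim = 4, 3
embeddings = {w: np.random.randn(emb_dim) for w in sentence}

# Simple RNN parameters (randomly initialized here; normally learned by training)
W_x = np.random.randn(hidden_dim, emb_dim) * 0.1     # input-to-hidden weights
W_h = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)  # h0: the network starts with no memory
for t, word in enumerate(sentence, start=1):
    x_t = embeddings[word]
    # the single recurrence every time step passes through:
    h = np.tanh(W_x @ x_t + W_h @ h + b)
    print(f"time step {t}: word={word!r}, h{t}={np.round(h, 3)}")

# By the last step, whatever "my" contributed has been pushed through the same
# tanh + matrix multiply three more times; in a 100-word sentence it would be
# pushed through 99 more, which is why early words fade (vanishing gradients).
```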
Problem number two is that, because of this vanishing gradient problem, the model struggles to optimize its weights; this is something I explained in detail in the last video. So this is the basic problem with an RNN: you are not able to capture which word is more meaningful and which is less, which word you should focus on more and which less, and since there is a gradient problem, if the sequence is long the network will not perform well. To keep it very simple, the earlier words start losing importance in your model. This is the basic problem with the RNN architecture, and hence the LSTM architecture came into the picture.

Now what is the LSTM architecture? Let's try to understand it with the simple sentence I was speaking about. As I told you, in a simple RNN the problems are that we don't know which part of the sentence is important, and that the earlier words in a long sentence start losing their importance. Now look at the sentence, which I have written in three different colors: green, black, and blue. If you pay attention, whatever is written in green is irrelevant for my prediction, because if somebody wants to predict the next word here they don't need to know where Aman stays. Whatever is written in black is relevant for my prediction, and whatever is written in blue is also relevant for my prediction.

From this idea of relevant and irrelevant, there are five concepts in LSTM that I will write down here; just keep these five terms in mind. One is called long-term memory, the second is short-term memory, the third is the forget gate (I will explain all of these, don't worry), the fourth is the input gate, and the fifth is the output gate.

Now please pay a little attention to what I am explaining. Long-term memory is what is happening in your sentence from start to end, so long-term memory means all the words of my sentence. Imagine there is a conveyor belt, like you would see in a car plant or a power plant; on this belt all the words are flowing from start to end. That becomes your long-term memory. Then there is something called short-term memory: since I am interested in predicting this particular part, the short-term memory will only be, say, the five words just before it, the part I am highlighting in red. So you understand the difference between long-term and short-term memory: long-term memory means all the words in your sentence, flowing along a belt; that is your long-term memory.
Short-term memory is about what I want to predict: the part written in blue and highlighted in red, that is my short-term memory. Are these two things clear? Now, what is a forget gate? Remember I was telling you that some information is irrelevant; the forget gate helps you remove the irrelevant information. And what is an input gate? The input gate helps you keep the relevant information, with the help of this long-term belt; I will show you how. And what is the output gate? The output gate is what gives you the output, "Alex". I know it is getting a little heavy, but try to understand it from a human common-sense point of view. There are many things you did when you read this sentence, and the same intelligence we have to feed to the computer: hey, this part is not important, this part is important, remove irrelevant information, keep relevant information. How will you feed all this intelligence to the computer? By using long-term memory, short-term memory, the forget gate, the input gate, and the output gate.

Let's try to understand this in more detail now. The very first thing I want you to understand is that an LSTM cell (you can simply think of it as an LSTM unit, or the LSTM model, to keep it very simple) is basically made up of three parts. Inside this cell there are two passes going on: one pass is the long-term memory, which I am drawing in red, and the other pass is the short-term memory. So there are two passes going on in parallel in the LSTM model, and inside the model you have three gates: one is the forget gate, another is the input gate, and the third is the output gate. The mathematical way of representing the incoming long-term memory is C_{t-1}, and on the output side we write C_t, for time step t. For the short-term memory we write h_{t-1} on the way in and h_t on the way out; it is the output of the hidden state. So as I told you, these five things are important in the world of LSTM, and by using them the model will know what is important and relevant and what is not.

Now let's dive deep into these one by one. Apologies for the interruption, but if you are liking the video so far please share the Unfold Data Science channel with at least one friend today, hit the like button, and tell me in a comment that you are liking the content; all these things keep me motivated. Let's continue.
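Before going gate by gate, here is a minimal sketch of just this structure: the cell takes the current word x_t plus the two memories h_{t-1} and C_{t-1}, and returns the updated h_t and C_t. The cell body below is deliberately a placeholder, and the word vectors and dimensions are assumptions for illustration; the actual gate math is filled in step by step in the sketches that follow.

```python
import numpy as np

def lstm_cell(x_t, h_prev, c_prev):
    """Placeholder LSTM step: receives the current word vector plus the two
    memories and must return the updated (h_t, c_t).  The real gate math
    (forget, input, output) is developed in the later sketches."""
    h_t, c_t = h_prev, c_prev          # placeholder only: no real update yet
    return h_t, c_t

np.random.seed(0)
hidden_dim = 3
sentence = [np.random.randn(4) for _ in range(5)]   # assumed toy word vectors

h = np.zeros(hidden_dim)   # h_{t-1}: short-term memory (bottom track)
c = np.zeros(hidden_dim)   # C_{t-1}: long-term memory, the "conveyor belt" (top track)
for x_t in sentence:
    h, c = lstm_cell(x_t, h, c)        # both memories flow forward in parallel
print("h_t =", h)
print("C_t =", c)
```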
The first thing I want to explain is the forget gate and its process, so pay attention here. As I told you, whenever I discuss the LSTM model there are two parallel lines going on. One parallel line is your conveyor belt, which as I explained is nothing but a belt carrying all your words; let's assume this is C_{t-1}. There is also a short-term memory line, which is h_{t-1}. Along with these comes the current word. Let's take a simple sentence, "Aman loves cricket": "Aman" will be my word x1, "loves" will be x2, and "cricket" will be x3. Suppose I am processing the word "loves"; this x goes in as the input as well, so let me enter this word here.

Now pay attention: the output from the previous state is coming in and the new word is coming in, and both of these go into one neural-network activation function, basically a sigmoid activation function. Whatever comes out gets multiplied with the previous state of the conveyor belt, the previous long-term memory unit, and most importantly, since I am explaining the forget gate here, that output is f_t. Until now I am explaining just the forget gate, nothing else. Let me explain one more time what is happening: in an LSTM there are always two passes going on, one is a conveyor belt carrying all your words (the long-term memory) and the other is your short-term memory. When a new word comes, the new word plus the previous state's output both pass through a sigmoid activation function, and that gives f_t; f_t is nothing but your forget gate, and its output gets multiplied with C_{t-1}, where C_{t-1} is, you can say, the importance of the previous words.

Formula-wise, what will f_t be? Pay attention, it is very similar to what we did in the recurrent neural network: f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f), where b_f is the bias of the forget gate. If you understand just this, it is very easy to understand what is happening here. As I told you, a weight gets multiplied with h_{t-1} and x_t; this is the point where both these inputs merge, and both of them get multiplied by the weight of the forget gate. This weight is exactly what needs to be optimized for your model. So the weight gets multiplied with the merged input, it passes through a sigmoid with a bias, and whatever you get as output gets multiplied with C_{t-1}. Now you may ask what C_{t-1} is: C_{t-1} is your long-term memory state.

So this is the entire concept of the forget gate in LSTM. What does the forget gate do? To keep it very simple, suppose the word "Aman" is not relevant. The job of the forget gate is to tell the cell state: hey, in the next state, give less importance to "Aman". Once the forget gate completes, it outputs something, and that output says: in C_t, give less importance to the word "Aman", because from my analysis I have found that "Aman" is not relevant here. That is the use of the forget gate. And what needs to be optimized? Only one thing, W_f, which is nothing but the weight. So that is the forget gate.
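As a minimal sketch of just the forget gate (not from the video; toy dimensions and random weights are assumed purely for illustration): W_f multiplies the concatenated [h_{t-1}, x_t], a bias is added, the sigmoid squashes the result into (0, 1), and that vector scales the old cell state element-wise, so components near 0 are forgotten and components near 1 are kept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(1)
emb_dim, hidden_dim = 4, 3

# Assumed previous states and current word ("loves") for illustration
h_prev = np.random.randn(hidden_dim)   # h_{t-1}: short-term memory
c_prev = np.random.randn(hidden_dim)   # C_{t-1}: long-term memory (conveyor belt)
x_t = np.random.randn(emb_dim)         # x_t: embedding of the current word

# Forget gate parameters: these are what training optimizes
W_f = np.random.randn(hidden_dim, hidden_dim + emb_dim) * 0.1
b_f = np.zeros(hidden_dim)

# f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Each entry of f_t lies between 0 and 1: it scales the old cell state,
# deciding how much of each long-term memory component survives.
c_after_forget = f_t * c_prev
print("f_t =", np.round(f_t, 3))
print("C_{t-1} after forgetting =", np.round(c_after_forget, 3))
```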
Now it will become a little more mathematical, because we will include something known as the input gate and put it into the same architecture. Remember the functionality of the forget gate; we will now add the input gate on top of it. Again, on the top there will be C_{t-1}, on the bottom there will be h_{t-1}, and from the bottom comes x_t, which means the recent word. Both of these come in and pass through a sigmoid activation function; I will keep drawing this again and again because I want you to remember it. It goes through the sigmoid, what gets generated is the forget gate output, one multiplication happens with the conveyor belt, and that output moves forward. Up to here is what we saw in the forget gate.

But what happens in the input gate? The same h_{t-1} and x_t combination moves forward again, and this time it passes once through a sigmoid and once through a tanh. Why is this happening? That I want you to answer in the comments: what is the range of a sigmoid and what is the range of a tanh, and why are we passing through a sigmoid in one place and a tanh in the other? Based on your knowledge of deep learning, please comment.

Once it passes through the sigmoid, it outputs i_t, the input gate at time t, and this gets multiplied with something known as the candidate, C̃_t; I will explain what this is, first let me draw it. The result of this multiplication then goes and gets added to the conveyor belt. You see this line, let me make it red so it is simpler for you: this red line is your conveyor belt, your long-term memory. On this long-term memory, at one point a multiplication is happening and at another point a plus is happening. What is the multiplication saying? The multiplication (the forget gate) is saying: I want to forget this element, remove its importance. What is the input gate saying? The input gate is saying: I want to add importance to this new word, or this new couple of words.

And what becomes the formula for your input gate in that case? You will write i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i); it is the same formula, just the weight changes. Very simple: only these two inputs, h_{t-1} and x_t, are going everywhere; if you see, the same black line goes to every gate. Remember, W_i is the weight of the input gate and b_i is the bias of the input gate, just like the weight and bias of the forget gate in the previous formula. This is what gets optimized in the neural network; this is what the training tries to optimize.

Now, what is C̃_t here? C̃_t is the candidate. Candidate means: suppose you put a new word into this network, say "loves"; the candidate can be "Aman loves", or maybe three words together, or maybe "loves" plus one more word from earlier. The candidate means the new word, or new combination of words, that I want to give more importance to. And how is it calculated? C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c), where W_c is the candidate weight and b_c is the candidate bias. So now you have passed your inputs through the forget gate and through the input gate.
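Here is the same idea as a minimal sketch (toy values assumed, as before). As a hint to the question above: the sigmoid output lies between 0 and 1, so i_t behaves like a gate deciding how much to let in, while tanh lies between -1 and 1, so the candidate C̃_t carries the actual signed content.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(2)
emb_dim, hidden_dim = 4, 3
h_prev = np.random.randn(hidden_dim)   # h_{t-1}
x_t = np.random.randn(emb_dim)         # current word, e.g. "loves" (toy vector)
hx = np.concatenate([h_prev, x_t])     # the single [h_{t-1}, x_t] input reused everywhere

# Input gate: how much of the new information to let in (each entry in 0..1)
W_i = np.random.randn(hidden_dim, hidden_dim + emb_dim) * 0.1
b_i = np.zeros(hidden_dim)
i_t = sigmoid(W_i @ hx + b_i)          # i_t = sigmoid(W_i . [h_{t-1}, x_t] + b_i)

# Candidate cell state: the new content proposed for the conveyor belt (-1..1)
W_c = np.random.randn(hidden_dim, hidden_dim + emb_dim) * 0.1
b_c = np.zeros(hidden_dim)
c_tilde = np.tanh(W_c @ hx + b_c)      # C~_t = tanh(W_c . [h_{t-1}, x_t] + b_c)

# What actually gets added to the cell state is the gated candidate:
new_information = i_t * c_tilde
print("i_t        =", np.round(i_t, 3))
print("C~_t       =", np.round(c_tilde, 3))
print("i_t * C~_t =", np.round(new_information, 3))
```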
Now what you need to do is just find out what your new C_t will be; C_{t-1} is done, and then you will have a C_t. How will this C_t be calculated? Let me write it in a different color. If you look at the conveyor belt, at one place a multiplication is happening and at another place an addition is happening, and the same thing is reflected in the formula: C_t = f_t · C_{t-1} + i_t · C̃_t. The first part is the forget gate times the previous state, and to it we add i_t multiplied by the candidate C̃_t. This is what happens to your conveyor belt; this will be your new cell state, and this value goes and sits here.

Now, if you remember, you have done two things here. I told you what the important pieces of LSTM are: the long-term memory you always pass along the top, the short-term memory you always pass along the bottom, with the forget gate you forgot irrelevant information, and with the input gate you added relevant information. Only one step is left, which is called the output gate; let's see what happens in the output gate.

Guys, if you understand this, it may look a little complex but it is not; if you understand RNN it will be a very easy thing for you. At every time step you take the current word and the output from the previous step; that becomes your one set of inputs. The same set of inputs you pass through the forget gate and through the input gate, and then you update your conveyor belt, your previous cell state; that is all you do. And how do you know what kind of update to make? How do you know, for example, that "Aman stays in London" is irrelevant for me, and that "Aman has a friend whose name is Alex" is relevant for me? The answer is: that is your training. W_f is the weight matrix of the forget gate, and when you train your model this weight matrix gets optimized; W_i is the weight matrix of your input gate, and it also gets optimized during training. That is how the model knows what to keep, what to forget, and what to pass to the next state.

Having said that, let's see how the output gate will look. As I told you, I will keep drawing this because I want you to remember it, so let me draw it one more time. C_{t-1} comes from here (and I should not end the arrow here, because it does not end here), h_{t-1} comes from here, and x_t, the current word, comes from here. Both of these pass through a sigmoid, which outputs my f_t; this gets multiplied with the cell state, and that becomes the forget gate. I take the same input and pass it to one sigmoid and to one tanh; from the sigmoid I get i_t (the input gate, I mean to say) and from the tanh I get the candidate, C̃_t. I multiply both of these, which tells me, in the current context, what I should put in, and I go and add this to my conveyor belt. Once I add it to my conveyor belt, I get C_t, the new status of my conveyor belt.
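The conveyor-belt update on its own, as a tiny sketch with assumed toy vectors: the forget gate scales down old memory, and the gated candidate is added on top.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(3)
hidden_dim = 3

# Assume the forget gate, input gate and candidate were already computed as in
# the earlier sketches; here they are just toy vectors for illustration.
f_t     = sigmoid(np.random.randn(hidden_dim))   # forget gate output, in (0, 1)
i_t     = sigmoid(np.random.randn(hidden_dim))   # input gate output, in (0, 1)
c_tilde = np.tanh(np.random.randn(hidden_dim))   # candidate content, in (-1, 1)
c_prev  = np.random.randn(hidden_dim)            # old conveyor-belt state C_{t-1}

# C_t = f_t * C_{t-1}  +  i_t * C~_t
#       keep/forget old memory + add gated new memory
c_t = f_t * c_prev + i_t * c_tilde
print("C_{t-1} =", np.round(c_prev, 3))
print("C_t     =", np.round(c_t, 3))
```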
But for prediction, what do I need to do? If I want to predict the next word in the sentence, I take the same current word and the previous time step's output and pass them one more time through another sigmoid function, and I call this o_t; this is your output gate. Then I take C_t and pass it through a tanh (this part sits a little lower in the drawing), so C_t comes here through a tanh and gets multiplied with o_t, and that is where you get your output. From here you get your h_t and C_t; you can take C_t forward from here, and if you want to take h_t out you can simply draw a line above and take h_t as well.

So the part I am drawing now is your output gate. What is the output gate doing? It is taking the same input I am highlighting, passing it through a sigmoid again (this is the output gate, so I write o_t), multiplying with tanh of C_t (C_t has already been updated before), and giving you the output. Two parts were done in advance to generate this output: the forget gate and the input gate. Once C_t has been updated, the inputs flow directly to the output gate, pass through the sigmoid, get multiplied with the latest C_t through a tanh, and give the output.

I hope this is clear now, because there are three main steps that I always keep mentioning: step 1, step 2, and step 3, which are forgetting, inputting, and outputting, with the long-term memory on the top track and the short-term memory on the bottom track. In terms of mathematical formulas you might be thinking it is complex, but everything is very similar to what you have learned till now in RNN; everywhere only weight and parameter optimization is happening.

So just to conclude, let me write all the equations once. The equation for the forget gate: f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f); the forget weight W_f is what gets optimized, and in all these steps you will write the same [h_{t-1}, x_t] term, because everywhere only h_{t-1} and x_t are flowing; just the bias here is the forget bias. The input gate at time t: i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i). The candidate at time t: C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c). Once these three things are done, then comes the update of the conveyor belt, the current long-term memory: C_t = f_t · C_{t-1} + i_t · C̃_t, where C_{t-1} is the previous cell state. Then if somebody asks you how to write the output gate of the LSTM model, you will simply say o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o), because those two inputs remain the same. And if somebody asks how h_t is computed in the end in an LSTM model, it is basically h_t = o_t · tanh(C_t). Remember, this is the final formula for the LSTM model: h_t equals o_t multiplied by tanh of C_t.
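Putting all of the above together, here is a minimal sketch of one complete LSTM forward step in numpy. The weights are random toy values, purely for illustration; in practice these matrices (W_f, W_i, W_c, W_o and the biases) are what training optimizes, and you would normally use a library implementation rather than writing the step by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step, following the equations summarized above."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    hx = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ hx + b_f)               # forget gate
    i_t = sigmoid(W_i @ hx + b_i)               # input gate
    c_tilde = np.tanh(W_c @ hx + b_c)           # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde          # update the conveyor belt
    o_t = sigmoid(W_o @ hx + b_o)               # output gate
    h_t = o_t * np.tanh(c_t)                    # new short-term memory / output
    return h_t, c_t

# Toy setup: random weights, random word vectors (illustration only)
np.random.seed(4)
emb_dim, hidden_dim = 4, 3
shape = (hidden_dim, hidden_dim + emb_dim)
params = (np.random.randn(*shape) * 0.1, np.zeros(hidden_dim),   # W_f, b_f
          np.random.randn(*shape) * 0.1, np.zeros(hidden_dim),   # W_i, b_i
          np.random.randn(*shape) * 0.1, np.zeros(hidden_dim),   # W_c, b_c
          np.random.randn(*shape) * 0.1, np.zeros(hidden_dim))   # W_o, b_o

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for word_vec in [np.random.randn(emb_dim) for _ in range(3)]:    # e.g. "Aman loves cricket"
    h, c = lstm_step(word_vec, h, c, params)
print("final h_t =", np.round(h, 3))
print("final C_t =", np.round(c, 3))
```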
So this is the complete mathematics of LSTM. I know it may be a little too much for you, but if you pay attention, only three things are happening, which I have explained many times. On one side is the simple architecture: long-term memory, short-term memory, forget, input, output. On the other side is the detailed architecture: forget gate, input gate, output gate, and the complete cell, and there you will find all the formulas written. So we covered everything in LSTM, and in the next video I will do a full use case on LSTM in Python and explain how this model is very useful for long sentences. I will see you all in the next video; wherever you are, stay safe and take care.
Info
Channel: Unfold Data Science
Views: 4,312
Keywords: LSTM explained simply, LSTM explained, LSTM explanation, LSTM explained with example, LSTM explained medium, lstm for stock prediction, lstm time series prediction, lstm pytorch, lstm tensorflow, lstm vs transformer, lstm nueral network tensorflow, lstm model for time series prediction, long short term memory, lstm for anomaly detection, lstm for chatbot, unfold data science deep learning, unfold data science nueral network, lstm stock price prediction
Id: HUlN5mHNqhw
Length: 33min 17sec (1997 seconds)
Published: Wed Jul 19 2023