Illustrated Guide to LSTM's and GRU's: A step by step explanation

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments

Looking forward to seeing the transformer!

👍 10 · u/__me_again__ · May 09 2020

Thank you. I was just looking for something like this. Great set of videos.

👍 5 · u/jliendo · May 09 2020

I really liked this video. I also found Understanding LSTM Networks really helpful when I was first learning about LSTMs. I also really like a paper, "LSTM: A Search Space Odyssey." Looking forward to seeing more videos!

👍 4 · u/DMLearn · May 09 2020

Nice illustrative video. Btw, how do you make this type of video? I mean, what tools do you use?

👍 3 · u/wierd_dev · May 09 2020

I really appreciate your videos! The way you present these concepts is very intuitive and accessible. Transformers are about where I left off as a machine learning hobbyist/enthusiast. I found the attention/BERT papers to be a steep barrier for learning these concepts until I found your videos.

Could you comment on time series forecasting or classification with these architectures?

Thank you and please keep up the great work!

👍 3 · u/AmusingEgo · May 09 2020

I have seen it before and it is a great video.

👍 2 · u/abhishek-shrm · May 09 2020

Thanks for the video!

Nicely done.

👍 2 · u/justanaccname · May 10 2020

Very nice, very intuitive. I like your transformers video as well!

👍 2 · u/_tbrunner · May 11 2020
Captions
Hi, and welcome to an Illustrated Guide to LSTMs and GRUs. I'm Michael (LearnedVector), a machine learning engineer in the AI voice assistant space. In this video we'll start with the intuition behind LSTMs and GRUs, then I'll explain the internal mechanisms that allow LSTMs and GRUs to perform so well, and I'll follow up with an example. If you want to understand what's happening under the hood for these two networks, then this video is for you.

LSTMs and GRUs are the more evolved versions of the vanilla recurrent neural network. During back-propagation, recurrent neural networks suffer from the vanishing gradient problem. The gradient is the value used to update a neural network's weights. The vanishing gradient problem is when a gradient shrinks as it back-propagates through time; if a gradient value becomes extremely small, it doesn't contribute much learning. So in recurrent neural networks, layers that get a small gradient update don't learn, and those are usually the earlier layers. Because these layers don't learn, RNNs can forget what they have seen in longer sequences, thus having short-term memory. If you want to know more about the mechanics of recurrent neural networks in general, you can watch my previous video, an Illustrated Guide to Recurrent Neural Networks.

LSTMs and GRUs were created as the solution to short-term memory. They have internal mechanisms called gates that can regulate the flow of information. These gates can learn which data in a sequence is important to keep or throw away, and by doing that, they learn to use relevant information to make predictions. Almost all state-of-the-art results based on recurrent neural networks are achieved with these two networks. You can find LSTMs and GRUs in speech recognition models, speech synthesis, and text generation; you can even use them to generate captions for videos. By the end of this video you should have a solid understanding of why LSTMs and GRUs are good at processing long sequences. I'm going to approach this with an intuitive explanation and illustrations, and avoid as much math as possible.

Let's start with a thought experiment. Say you're reading reviews online to determine if you want to buy a product. When you read a review, your brain subconsciously remembers only the important keywords. If your goal is to judge whether a certain review is good or bad, you pick up words like "amazing" and "will definitely buy again"; you don't care much for words like "this", "gave", "all", "should", etc. If a friend asked you the next day what the review said, you wouldn't remember it word for word. You might remember the main points, though, like "it was a perfectly balanced breakfast"; the other words would just fade away from memory, unless of course you're one of those people with perfect memory. That is essentially what an LSTM and a GRU do: they learn to keep only relevant information to make predictions. In this case, the words you remembered made you judge that it was a good review.

To understand how LSTMs and GRUs achieve this, let's review the recurrent neural network. An RNN works like this: first, words get transformed into machine-readable vectors; then the RNN processes the sequence of vectors one by one. While processing, it passes the previous hidden state to the next step of the sequence. The hidden state acts as the neural network's memory: it holds information on previous data the network has seen. Let's zoom into a cell of an RNN to see how the hidden state is calculated. First, the input and the previous hidden state are combined to form a vector; that vector holds information on the current input and the previous inputs. The vector goes through a tanh activation, and the output is the new hidden state, or the memory of the network.
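To make this concrete, here is a minimal sketch of the RNN step just described, written in NumPy. The function name, weight shapes, and random initialization are illustrative assumptions for this sketch, not something shown in the video.

import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One RNN step: combine the input with the previous hidden state, then squash with tanh."""
    combined = np.concatenate([x_t, h_prev])  # holds current and previous information
    return np.tanh(W @ combined + b)          # tanh keeps values between -1 and 1

# Illustrative sizes: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = rng.normal(size=(hidden_size, input_size + hidden_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h, W, b)
print(h)  # every entry stays within (-1, 1) thanks to tanh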
The tanh activation helps regulate the values flowing through the network: the tanh function squishes values so they always stay between -1 and 1. When vectors flow through a neural network, they undergo many transformations due to the various math operations, so imagine a value that keeps getting multiplied by, say, 3; you can see how some values can explode and become astronomical, causing other values to seem insignificant. The tanh function ensures that values stay between -1 and 1, thus regulating the network's output. So that's an RNN. It has very few operations internally, but it works pretty well, and it uses a lot less computational resources than its evolved variants.

Now let's take a look at the LSTM. An LSTM has the same control flow as a recurrent neural network: it processes data sequentially, passing information along as it propagates forward. The difference is the set of operations within the LSTM's cells; these operations allow the LSTM to forget or keep information. Looking at these operations can get a little overwhelming, so we'll go over them one by one. I want to thank Chris Olah (sorry if I butchered that name); he has an excellent blog post on LSTMs, and the following information is inspired by his exceptionally well-written post. I'll include the links in the description.

The core concepts of LSTMs are the cell state and its various gates. The cell state acts as a transport highway that carries relevant information all the way down the sequence chain; you can think of it as the memory of the network. Because the cell state can carry information throughout the processing of the sequence, in theory even information from earlier time steps can make it all the way to the last time step, thus reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from the cell state via gates. The gates are just different neural networks that decide which information is allowed on the cell state; they learn what information is relevant to keep or forget during training.

Gates contain sigmoid activations. A sigmoid activation is similar to the tanh, but instead of squishing values between -1 and 1, it squishes values between 0 and 1. That is helpful for updating or forgetting data, because any number multiplied by 0 is 0, causing values to disappear or be "forgotten", while any number multiplied by 1 keeps the same value, so it is "kept". The network can learn which data should be forgotten and which data is important to keep.

Let's dig a little deeper into what the various gates are doing. There are three gates that regulate information flow in an LSTM cell: a forget gate, an input gate, and an output gate. First, the forget gate. This gate decides what information should be thrown away or kept: information from the previous hidden state and from the current input is passed through the sigmoid function, and values come out between 0 and 1; closer to 0 means forget, closer to 1 means keep. The small snippet below illustrates this forgetting-by-multiplication idea numerically.
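As a small illustration (not from the video), here is how a sigmoid output acts as a soft mask over the cell state; the gate scores and cell-state values are made-up numbers.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pretend these are pre-activation scores produced by a forget gate.
scores = np.array([-6.0, -0.5, 0.5, 6.0])
forget_gate = sigmoid(scores)                # near 0 means "forget", near 1 means "keep"

cell_state = np.array([2.0, -3.0, 1.5, 4.0])
new_cell_state = forget_gate * cell_state    # elementwise gating

print(forget_gate)     # ~[0.002, 0.378, 0.622, 0.998]
print(new_cell_state)  # entries gated toward 0 are effectively forgotten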
To update the cell state, we have the input gate. First we pass the previous hidden state and the current input into a sigmoid function, which decides which values will be updated by transforming them to be between 0 and 1: 0 means not important, 1 means important. You also pass the hidden state and the current input into the tanh function to squish values between -1 and 1, which helps regulate the network. Then you multiply the tanh output with the sigmoid output: the sigmoid output decides which information is important to keep from the tanh output.

Now we have enough information to calculate the cell state. First, the cell state is multiplied by the forget vector; this can drop values in the cell state wherever it is multiplied by values near zero. Then we take the output from the input gate and do a pointwise addition, which updates the cell state to new values. This gives us our new cell state.

Last, we have the output gate. The output gate decides what the next hidden state should be. Remember that the hidden state contains information on previous inputs, and the hidden state is also used for predictions. First we pass the previous hidden state and the current input into a sigmoid function; then we pass the newly modified cell state through the tanh function. We multiply the tanh output with the sigmoid output to decide what information the hidden state should carry. The output is the new hidden state. The new cell state and the new hidden state are then carried over to the next time step.

For those of you who understand better by seeing code, here is the control flow described as Python pseudocode. First, the previous hidden state and the current input get concatenated; we'll call it "combined". Combined gets fed into the forget layer, which removes non-relevant data. Then the candidate layer is created using combined; its output holds possible values to add to the cell state. Combined also gets fed into the input layer, which decides what data from the candidate layer should be added to the new cell state. After computing the forget layer, the candidate layer, and the input layer, the cell state is computed using those vectors and the previous cell state. The output gate is then computed, and pointwise multiplying its output with the new cell state gives us the new hidden state. That's it: the control flow of an LSTM is simply a for loop, and the hidden state output from the LSTM cell can be used for predictions. Using all of these mechanisms, an LSTM is able to choose which information is relevant to remember or forget during processing. A runnable sketch of this pseudocode follows below.
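Here is one way the pseudocode described above might be fleshed out in NumPy. The layer names follow the walkthrough in the transcript, while the function signatures, weight matrices, and initialization are illustrative assumptions (biases are omitted for brevity).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o):
    """One LSTM step following the control flow described above."""
    combined = np.concatenate([h_prev, x_t])  # previous hidden state + current input

    f_t = sigmoid(W_f @ combined)             # forget layer: what to drop from the cell state
    i_t = sigmoid(W_i @ combined)             # input layer: which candidate values to add
    c_hat = np.tanh(W_c @ combined)           # candidate layer: possible values to add
    c_t = f_t * c_prev + i_t * c_hat          # new cell state

    o_t = sigmoid(W_o @ combined)             # output gate
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

# Illustrative sizes and random weights.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_size, input_size + hidden_size)) for _ in range(4))

h = c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # "the control flow is simply a for loop"
    h, c = lstm_cell(x_t, h, c, W_f, W_i, W_c, W_o)
print(h)  # the final hidden state, which could be used for a prediction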
Now that we know how an LSTM works, let's look at the GRU. The GRU is a newer generation of recurrent neural network and is pretty similar to an LSTM. GRUs got rid of the cell state and use the hidden state to transfer information instead. A GRU has only two gates: a reset gate and an update gate. The update gate acts similarly to the forget and input gates of an LSTM: it decides what information to throw away and what new information to add. The reset gate decides how much past information to forget. So that's the GRU. GRUs have fewer tensor operations, so they are a little speedier to train than LSTMs; researchers and engineers usually try both to determine which one works better for their use case. (A sketch of a GRU cell, in the same style as the LSTM sketch above, follows after this transcript.)

To sum this up: RNNs are good for processing sequence data for predictions but suffer from short-term memory. LSTMs and GRUs were created as a way to mitigate short-term memory, using mechanisms called gates. Gates are just neural networks that regulate the flow of information being passed from one time step to the next. LSTMs and GRUs are used in state-of-the-art deep learning applications like speech recognition, speech synthesis, natural language understanding, etc. If you're interested in going deeper, I've added links in the description to some amazing resources that can give you a different perspective on LSTMs and GRUs. I had a lot of fun making this video, so let me know in the comments if this was helpful or what you would like to see in the next one. If you liked this video, please subscribe for more AI content. Thanks for watching.
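For symmetry with the LSTM sketch, here is a minimal GRU cell in the same style. The gate equations follow one common formulation of the GRU; all names, shapes, and weights are again illustrative assumptions rather than anything shown in the video.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step: no separate cell state, just two gates and a hidden state."""
    combined = np.concatenate([h_prev, x_t])

    z_t = sigmoid(W_z @ combined)     # update gate: how much to keep vs. replace
    r_t = sigmoid(W_r @ combined)     # reset gate: how much past information to forget
    h_hat = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_hat                   # blend old and candidate states

# Illustrative sizes and random weights; biases omitted for brevity.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_z, W_r, W_h = (rng.normal(size=(hidden_size, input_size + hidden_size)) for _ in range(3))

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = gru_cell(x_t, h, W_z, W_r, W_h)
print(h)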
Info
Channel: The A.I. Hacker - Michael Phi
Views: 281,355
Rating: 4.9457936 out of 5
Keywords: long short-term memory, gated recurrent units, recurrent neural network, lstm, gru, rnn
Id: 8HyCNIVRbSU
Length: 11min 18sec (678 seconds)
Published: Wed Sep 19 2018
Related Videos