Recurrent Neural Networks : Data Science Concepts

Captions
Hey everyone, welcome back. Today we're going to be talking about a special type of neural network called the recurrent neural network. Let's get into it by talking about a specific problem, just so we start from something real.

Say we're designing a website, and that website has a search bar at the top where users can come in and search for whatever they want. To make the user experience as easy as possible, we want to implement some kind of autocomplete, just like on Google: you start typing your query and it automatically suggests things that make sense given the words you've already typed. For example, say the person has typed three words so far, "how to make". We want to design a model that intelligently and automatically knows that the next word is more likely to be "pizza" than "yoga".

We've just finished learning about typical neural networks, so our bright idea is to ask: can we apply a neural network to this problem? That's a really good starting point, because in data science, or anywhere else, you typically want to start from the things you know and only move to more complex or intricate tools if there's a reason to do so. Which brings us to the next question: what are the drawbacks of the typical neural networks we've learned so far that make applying them to this problem tricky? There are two big ones.

The first is that typical neural networks cannot handle variable sequence or input length. Here I explicitly said the person has typed three words and we're using those three words to predict what comes next, but a good autocomplete shouldn't be limited to three words. It should be able to use one word, two words, three, four, five, however many words the user has given us, and the model should dynamically use all of them to make a good recommendation for the next word. With typical neural networks it's not easy to force this kind of problem into that setup.

The other big drawback is that typical neural networks don't have a built-in notion of the directionality of the inputs. You can see the issue here, which is a natural language processing problem, but more generally it shows up in any problem where the order of the input sequence matters. It would make the most sense to process the first word, "how", use that information while processing the next word, "to", and use all of that information while processing "make", in that specific order, to get the best prediction of the next word. We definitely wouldn't get results as good if we processed these words in a random order, and typical neural networks have no notion of the order of the inputs you pass in.

Those two drawbacks are why we bring in recurrent neural networks. I'm about to start talking about the diagram behind me, but the main thing I want you to remember throughout this discussion is that we're not learning anything fundamentally new. It's always a little comforting when the pieces of a "new" topic are basically things you already know from previous lessons; it's a lot less intimidating that way.
So we'll be using a lot of the same terminology we've used in previous videos, inputs, outputs, activation functions, hidden layers; we're just repackaging it in a slightly different way so it can solve these sorts of problems. Now let's go into how a recurrent neural network actually works: how it processes its inputs into outputs, and how we use all of that to solve problems like this one.

The diagram may look a little intimidating at first, lots of symbols and steps, but let's tell it as a story. That works out really well for recurrent neural networks, because we usually use them for sequential modeling, and that lends itself to telling a sequential story.

There's actually a step zero we should address first. Notice that the inputs right now are English words, and to do any kind of data science we need vectors, we need numbers. How do we turn these words into vectors? There isn't one answer, so here are a couple of options. One fairly out-of-the-box option is to use embeddings, vectors for words, that already exist on the internet, pre-trained on very large corpora such as Wikipedia or books, and to say: those seem smart enough, I'll use them for my problem too. This is in the realm of transfer learning, where you reuse information learned on a different but related task and hope it works well for your task as well. A more targeted approach is to learn the word vectors along with the model itself; I consider that a bit more of an advanced topic and won't say more about it here, but it is an option that may fit your specific problem better. What we'll do today is even simpler than both: we'll assume that each of the vectors x1, x2, x3, x4, however many words we're processing, is a one-hot encoded vector. All that means is that each vector is a big vector of mostly zeros with a single one in it, and the position of that one tells you which word we're looking at. For example, if "how" were the first word in our vocabulary, the one-hot vector x1 would be a one followed by a ton of zeros; if "to" were the second word in our vocabulary, its one-hot vector would be a zero, then a one, then all zeros. This is definitely not the most efficient way to encode your data, because each vector is as long as your entire vocabulary, but it will serve us well, since encoding isn't the main focus of this video.
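To make the one-hot idea concrete, here's a tiny sketch; the five-word vocabulary and the one_hot helper are made up for illustration, not something from the video.

    import numpy as np

    # Hypothetical tiny vocabulary; a real one would have tens of thousands of words.
    vocab = ["how", "to", "make", "pizza", "yoga"]

    def one_hot(word, vocab):
        """Vector of zeros with a single 1 at the word's position in the vocabulary."""
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    print(one_hot("how", vocab))  # [1. 0. 0. 0. 0.]
    print(one_hot("to", vocab))   # [0. 1. 0. 0. 0.]

Each vector is as long as the whole vocabulary, which is exactly the inefficiency mentioned above.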
Now that step zero is out of the way, we can get into how the recurrent neural network processes these inputs. The first thing that happens is we take the vector x1, which represents the word "how", and pass it through a linear transformation U. U is a linear transformation, or a matrix, whichever is easier for you to think about. Applying it to x1 gives us h1, the hidden state at the first time step; the same terminology keeps popping up: the h says it's a hidden state, the 1 says it's at the first time step, the first word we're currently processing. The next thing that happens is we apply a different linear transformation W to h1 to get our first output vector, y1. I'm being a little loose with the math here, and I'll be more concrete when we get to the equations, but whenever I say "linear transformation" please take it to mean a linear transformation followed by an activation function sigma, just as with typical neural networks.

So what is y1 in the context of this problem? y1 is the recurrent neural network's answer, once all these weights have been learned, about what the next word is likely to be if the only word the user has typed so far is "how". Explicitly, y1 is a vector the size of our vocabulary, and the index of its largest entry tells us which word to predict. The reason that works is that, fast-forwarding for a moment, y1 is produced by applying some activation function, maybe a sigmoid, so we can interpret its entries roughly as probabilities, and the index of the biggest one is the most likely next word. Maybe I'm spending too much time on this, but I want to be explicit: if all we've seen so far is "how", the model's top autocomplete suggestion to the user is read straight off y1.

Of course, as we said before, we've seen more than just the word "how"; we've seen a couple more words. How does the story progress? That brings us to the next part of the diagram and the hidden state at the next time step, h2. How do we get h2? Notice there are two arrows coming into it. First, we apply a different linear transformation V to the hidden state at the previous time step, h1. We add that to a linear transformation of the next input word: U applied to x2. So we take U times x2, add V times h1, and that is exactly how we get h2, with, of course, the activation function applied on top. We can talk about the equation and the diagram together, since I think that makes more sense: U times x2 is this piece right here, and V times h1 is this piece right here. Notice the t minus 1 subscript: to get the hidden state at the current time step, we consider the hidden state at the previous time step as well as the current input; that's why the two t's match up but the hidden state on the right carries a t minus 1. There's also a bias term b_h, and we run all of that through an activation function, and that's how we get h2.
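Written out compactly, the update just described, using the same symbols as the diagram, is

    h_t = \sigma_h\left( U x_t + V h_{t-1} + b_h \right)

with the convention (my phrasing, not the video's) that there is no previous hidden state at the very first step, or equivalently that h_0 is a vector of zeros, which is why h1 came from U x1 alone.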
But forget the math for a second: what is h2, and why have we created this hidden state at the next time step? If you follow the arrows, h2 is encoding information about the sequence so far, which at this point is just one word, so not very exciting, plus the next thing in the sequence we just saw. So you can think of the vector h2 as a representation of the entire sequence so far. Using that representation, we apply the same linear transformation W as before and get y2, the model's best prediction of what the next word should be if the only words we've seen so far are "how to". If we've trained the model well, "make" should be quite likely, because "how to make" is a common phrase, and "pizza" should be much less likely, because "how to pizza" doesn't really make sense.

And we just keep going. Hopefully you're starting to see the point; nothing really new comes up in the next step, but let's continue the story to make sure we've got it. How do we compute the hidden state at the third time step, h3? Follow the arrows: we apply the same linear transformation V to h2, and we add the same linear transformation U applied to the third input vector, x3; those two things together give us h3. In equation form, you take x3 and apply U (that's this part right here), you take h2 and apply V (that's this part right here), you add the bias term, you run it through an activation function, and you have the hidden state at the third time step, h3. That's the math, but what's the intuition? Why are we doing this? h3, following all the arrows, captures the whole sequence so far, "how to". Why? Because of the horizontal arrow from h2: h2 has itself already encoded the sequence up to that point, "how to", and we add to it the new information we're seeing at this time step, "make". So h3 can be thought of as an embedding, or representation, of the entire sequence "how to make". Just like before, we run it through W (that's this part down here), add its own bias term, and apply an activation function. A quick note: I've called the activation functions sigma_h and sigma_y because they don't have to be the same; one could be a tanh and the other a sigmoid, for example. They could be the same or different; that's up to you. And we get y_t, in this case y3, which, just like before, is the recurrent neural network's best guess about the next word in the sequence. Going back to our intro example, it should now be much more likely that the network says the next word is "pizza" rather than "yoga", if everything has been trained well.
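For completeness, the output step in the same notation is

    y_t = \sigma_y\left( W h_t + b_y \right)

where b_y is the name I'm giving the output's own bias term; the video only calls it "its own bias term".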
Hopefully this story is starting to make sense, and hopefully you can extend it to even more words. For example, what if the person typed "how to make pepperoni"? All we would do is add another one of these blocks: we'd have an h4, the exact same things would happen, and after that we'd have even more evidence that the next word should be "pizza", because "pepperoni pizza" makes even more sense than some other food like pepperoni popsicles. So in a nutshell, that is literally what a recurrent neural network is doing: we're just sequentially processing our inputs, and the hidden state at each time step is collecting, or encoding, information about the sequence so far while also taking into account the current word, the current data point being seen at this time step.

So here's the diagram, and here's the math down here; let's just state the math explicitly, although I think it's been pretty much covered. To get the hidden state at any time step, you take the linear transformation U (all of these U's down here) applied to the current input x_t; you add the linear transformation V (all of these horizontal arrows) applied to the value of the hidden state at the previous time step (that's why there's a t minus 1 right there); you add your bias term; you run that through your activation function, and boom, you have your encoding of the entire sequence thus far. Then, if you want the most likely next word, you apply the other linear transformation W to the hidden state at the current time step, add its own bias term, run it through an activation function, and boom, you've got y_t, the output at that time step, which tells you what the most likely next word should be.
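To make the forward pass concrete, here is a minimal NumPy sketch of those two equations. Everything in it, the rnn_forward name, the tanh and softmax choices for sigma_h and sigma_y, the random untrained weights, the tiny vocabulary, is my own illustrative assumption, not code from the video.

    import numpy as np

    def rnn_forward(xs, U, V, W, b_h, b_y):
        """Unrolled RNN: h_t = tanh(U x_t + V h_{t-1} + b_h), y_t = softmax(W h_t + b_y)."""
        h = np.zeros(len(b_h))              # h_0: no history before the first word
        ys = []
        for x in xs:                        # the SAME U, V, W are reused at every time step
            h = np.tanh(U @ x + V @ h + b_h)
            z = W @ h + b_y
            y = np.exp(z - z.max())
            y = y / y.sum()                 # interpret the output as probabilities over the vocabulary
            ys.append(y)
        return h, ys                        # final hidden state plus one prediction per time step

    # Toy setup: 5-word vocabulary, hidden state of size 4, random (untrained) weights.
    vocab = ["how", "to", "make", "pizza", "yoga"]
    def one_hot(word, vocab):               # same helper as in the earlier sketch
        v = np.zeros(len(vocab)); v[vocab.index(word)] = 1.0; return v

    rng = np.random.default_rng(0)
    U, V, W = rng.normal(size=(4, 5)), rng.normal(size=(4, 4)), rng.normal(size=(5, 4))
    b_h, b_y = np.zeros(4), np.zeros(5)

    xs = [one_hot(w, vocab) for w in ["how", "to", "make"]]
    h3, (y1, y2, y3) = rnn_forward(xs, U, V, W, b_h, b_y)
    print(vocab[int(np.argmax(y3))])        # with *trained* weights this should be "pizza"

With the random weights above the prediction is meaningless; the point is only the mechanics: one shared set of weights, one hidden state carried left to right, and one output at every time step.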
Now, the version of the recurrent neural network we've drawn here is often called the unrolled or unfolded version. It gives you a pretty good look at the mechanics of everything going on and lets us talk step by step, but in the literature you'll also often see a folded, compact version, which looks like this. You might have noticed there's nothing really new going on from block to block, from time step to time step, so to be more concise people often draw the entire network as this one diagram, which captures all the relevant information. It says: given some input x_t at any time step, apply the linear transformation U; then (and this recurrence arrow is the key part of the recurrent neural network diagram) apply V to the value of the hidden state at the previous time step, h_{t-1}; those two things together give you h_t, the value of the hidden state at this time step t. Finally, if you want the output at this time step, apply the linear transformation W, go through the activation function, and you get y_t. This tells the exact same story as the unrolled version, just in a much more compact form.

There's one other thing I want to talk about. The unrolled or unfolded form is nice for understanding exactly what's going on, but it has the unfortunate property that people often think of h1, h2, and h3 as separate layers of the network, and they are not. It's not people's fault: the diagrams we drew for typical neural networks look an awful lot like this, drawn horizontally from left to right, with each step from left to right being a different layer. So it's tempting to say h1 is the first hidden layer, h2 is the second hidden layer, h3 is the third hidden layer, but that's not right, and it's not just a technicality. I think it's important enough that I've actually written it on the board (I couldn't find much room, but it says right here): the h_t are not layers. Each of these is not a hidden layer; there is only one hidden layer, and we have to read this diagram from bottom to top. Think about what the purpose of a hidden layer was when you first learned neural networks: it was something between your inputs and your outputs, some kind of intermediate step, and you can have many of those intermediate steps if you want. But look at where the inputs are: they're down here. Where are the outputs? Up here. Count the layers between the inputs and the outputs: there's only one layer here, h. You could add more if you wanted to make the network more complicated, but the way we've drawn it, there is only one hidden layer between the inputs and the outputs. So the correct way to say it (I've been trying to choose my words carefully, and I apologize if I slipped anywhere in the video) is that h1, h2, h3 are not hidden layers themselves; they are the values of the single hidden layer at different time steps. Explicitly, h1 is the value of the hidden layer at the first time step, h2 is its value at the second time step, h3 is its value at the third time step, and so on. The folded version does a much better job of conveying this, because you can literally point to inputs, hidden layer, outputs; it's just that the hidden layer takes on different values depending on t, which time step or position in the sequence we're at.

You've probably noticed the applications of recurrent neural networks already; the whole video has been one particular application, autocomplete. In general, any time you have sequential or time-based data, recurrent neural networks, and all the more complicated models that grow out of them, are very powerful tools. I'll give two classes of examples, each with lots of sub-examples underneath. Natural language processing uses recurrent neural networks a lot, both in the vanilla form we've talked about here and in all the variations we'll hopefully talk about in the future. Why is that the case? Because language, any language really, is sequential: the thing you say next depends a lot on the things you've said before, it's not as if every word is random, and the order matters a lot. So the two drawbacks from the start of the video have hopefully been addressed. Variable sequence length, as you get in language: if the sequence is now two terms longer, you just add two more of these blocks. And why are you able to do that? It comes down to the fact that the linear transformations U, V, and W are the same at every time step. A natural thought is: wait, if you want to tune this network really well and have more flexibility, shouldn't there be a unique U, V, and W per time step, so U1, V1, W1 here and U2, V2, W2 there? That would give you more flexibility, but it would not let you handle variable sequence lengths: if your training data only contained sequences of length five or less and someone in the real world types eight words, it's unclear what U, V, and W at steps six, seven, and eight should be. If instead they are the same, as the folded version implies, then for an input sequence length you've never seen before, all you need to do is append a couple more of these blocks, a couple more time steps, and you're using the same U, V, and W as before. That's why they're shared, as the little snippet below illustrates.
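Because U, V, W and the two bias terms never change from time step to time step, the hypothetical rnn_forward sketch from earlier handles a three-word input and an eight-word input with the exact same parameters; this snippet assumes the definitions from that sketch are already in scope.

    # Same U, V, W, b_h, b_y for 3 words or 8 words: the loop just runs longer.
    short  = [one_hot(w, vocab) for w in ["how", "to", "make"]]
    longer = [one_hot(w, vocab) for w in
              ["how", "to", "make", "pizza", "how", "to", "make", "yoga"]]
    _, ys_short = rnn_forward(short,  U, V, W, b_h, b_y)
    _, ys_long  = rnn_forward(longer, U, V, W, b_h, b_y)
    print(len(ys_short), len(ys_long))   # 3 8: one next-word prediction per time step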
Anyway, I got a bit off topic, but we were talking about applications to natural language processing. With any kind of language, the sequence length is variable, from short sentences to large paragraphs, and the directionality of course matters: people talk about things in a logical progression. The other big class of examples is time series. We talk about time series a lot on this channel, but we've mostly stuck to the ARIMA family of models, autoregressive, moving average, all those sorts of things, which are interesting; there's also the option of applying recurrent neural networks, which, as we just learned today, are able to, and designed to, deal with time-based sequential data and to take into account the history of what has happened in the past. So any kind of time series data is also a great candidate for recurrent neural networks.

Now let's close the video out by talking about the issues, the drawbacks, of recurrent neural networks; we have to cover all sides on this channel. One of the biggest is computation. If you think about typical neural networks at a rough level, a lot of the work can be parallelized, and the reason it can be parallelized is that there's no notion of directionality among the inputs, which intuitively means there are fewer dependencies of one data point on another. Here, by contrast, we have deliberately built the model to have dependencies, which is a double-edged sword. On the one hand, those dependencies let us capture the more interesting contextual, historical representations of sentences and time series; for example, h2 takes into account information from h1, which is exactly what we want. But from a computational point of view it also means we can't have the value of h2 until we've computed the value of h1; those two things cannot happen in parallel, because one requires the output of the other. That doesn't mean people haven't tried, or even succeeded by this point, in parallelizing many parts of recurrent neural networks; it's just that it's much trickier to do than with typical neural networks, where there are far fewer dependencies between inputs.

Another way to see this computational issue at a high level is to think about how we actually learn the weights in the network. We have three sets of weights, U, V, and W, and as we learned in the neural networks video, we typically use backpropagation. If I have a loss function L and I want to know how to update the weights U, I compute something like the derivative of the loss function with respect to U. What does computing that entail? I need to look at everywhere in the recurrent neural network where U gets used, and it's used here, here, here; in fact it's used at every single time step. So I need to think about how a change in U at every time step affects the final loss, which means taking all of the time steps into account, and that can be a real computational burden. There is a specialized form of backpropagation called backpropagation through time (BPTT), which of course won't be covered in this video, designed to deal with exactly this.
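To see why updating U drags in every time step, note that the same U appears in every hidden-state update, so the gradient of the loss with respect to U decomposes, schematically (the notation here is my shorthand, not the video's), into one contribution per time step:

    \frac{\partial L}{\partial U} = \sum_{t=1}^{T} \left. \frac{\partial L}{\partial U} \right|_{\text{use of } U \text{ at time step } t}

Each of those contributions has to account for how a change in U at step t ripples forward through h_t, h_{t+1}, and so on into the loss; organizing that bookkeeping is what backpropagation through time does.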
But it does make our lives a little trickier, because updating these weights requires you to consider all the time steps in the network at once, which can be a computational burden.

Another issue is that a vanilla recurrent neural network, like the one we've looked at here ("vanilla" is a slightly odd term people use in data science for the most basic version of an algorithm or method), is one-directional: notice that we have arrows going from left to right but no arrows going from right to left. That's not unfixable, and people have of course addressed it, because when you think about language models especially, language doesn't just flow from left to right. There are many use cases where you want information to flow from right to left, or in whatever the natural direction of your language is, because you want to consider not just how earlier things in the sequence affect later things, but also how later things in the sequence reference earlier things. So a lot of variants of recurrent neural networks also throw in backward arrows, which makes the computation problem worse and more difficult to solve, but gives us more flexibility: bidirectional modeling, basically.

The final issue I want to talk about is a very interesting one: the vanishing gradient problem. You might be thinking, wait, I've heard about this for regular neural networks too, and yes, it's not unique to recurrent neural networks, but it has a particularly interesting interpretation here, especially for language models. If you remember back to our vanishing gradient video (linked in the description below), the gradient of the loss function with respect to the very early weights in the network tends to zero, it vanishes, as the network gets bigger and bigger, if nothing is done to address it. The counterpart here is this: suppose we're not processing three words but a massive paragraph, or a book, something very large with hundreds or thousands of words in it, and suppose we're again computing something like the derivative of the loss function with respect to U. That requires considering how U, at every single time step, eventually affects the loss, and part of that involves answering how a change in h1 affects the final loss. For the piece of the loss that depends on the prediction y1, that's not much of a problem: there are only a couple of sigmoids or tanh's, a couple of activation functions, in the way. But for the pieces of the loss much later down the road, maybe hundreds or thousands of tokens or words later, the gradient has to pass through so many sigmoids, tanh's, and other activation functions to get back to h1 that it is likely to vanish. In intuitive terms, the recurrent neural network tends to forget, or disregard, information from earlier in the sequence, because the signal has to travel through so many steps.
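Here's a toy numerical illustration of that effect; it's my own example, not the video's, and the per-step factor of 0.5 is just a stand-in for an activation-function derivative times a small recurrent weight. If every backward step through the recurrence scales the gradient by a factor whose magnitude is below one, the gradient reaching the early time steps shrinks geometrically.

    # Pretend each backward step through the recurrence scales the gradient by 0.5.
    factor = 0.5
    for steps in (3, 30, 300):
        print(steps, factor ** steps)
    # 3 -> 0.125, 30 -> ~9.3e-10, 300 -> ~4.9e-91: words far in the past barely move the loss.

If the per-step factor were instead larger than one, the same product would blow up, which is the exploding-gradient side of the same coin.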
In terms of language modeling, for example, if something much later in the sequence should be referencing something very early in the sequence, there is a risk that a recurrent neural network will run into the vanishing or exploding gradient problem for those connections if it isn't dealt with properly.

So anyway, that was recurrent neural networks in a nutshell. There are many variations, branches, and offshoots of recurrent neural networks that address these issues and try to do fancier things, and hopefully we'll talk about them if there's enough interest. If you have any questions at all, please leave them in the comments below. I hope you liked this video; please like and subscribe if you did, and I'll catch you next time.
Info
Channel: ritvikmath
Views: 23,851
Id: DFZ1UA7-fxY
Length: 27min 16sec (1636 seconds)
Published: Mon Mar 14 2022