MIT 6.S094: Recurrent Neural Networks for Steering Through Time

All right. So, we have talked about regular neural networks, fully connected neural networks; we have talked about convolutional neural networks that work with images; and we have talked about deep reinforcement learning, where we plug a neural network into a reinforcement learning algorithm, when a system has to not only perceive the world but also act in it and collect a reward. And today we will talk about perhaps the least understood but the most exciting flavor of neural networks out there: recurrent neural networks. But first, some administrative stuff. There’s a website, I don’t know if you've heard, cars.mit.edu, where you should create an account if you’re a registered student; that’s one of the requirements. You need to have an account if you want to get credit for this. You need to submit code for DeepTrafficJS and DeepTeslaJS, and for DeepTraffic you have to have a neural network that drives faster than 65 mph. If you need help to achieve that speed, please e-mail us. We can give you some hints. For those of you who are old-school SNL fans, there’s now a Deep Thoughts section in the profile page, where we encourage you to talk about the kinds of things that you tried in DeepTraffic or DeepTesla, or any of the work you've done as part of this class for deep learning. Okay, we have talked about the vanilla neural networks, on the left. The vanilla neural network is the one that approximates a function mapping from one input to one output. An example is mapping an image to the number that is shown in the image. For ImageNet, it's mapping an image to the object that's in the image. It can be anything. In fact, convolutional neural networks can operate on audio: you can give it a chunk of audio, a five-second audio clip, and that still counts as one input because it’s fixed-size.
As long as the size of the input is fixed, that's one chunk of input, and as long as you have ground truth that maps that chunk of input to some output, that’s the vanilla neural network, whether that's a fully connected neural network or a convolutional neural network. Today we’ll talk about the amazing, the mysterious recurrent neural networks. They compute functions from one to many, from many to one, from many to many. Also bidirectional. What does that mean? They take sequences as input: time series, audio, video. They're for whenever there's a sequence of data and the temporal dynamics that connect the data are more important than the spatial content of each individual frame. So whenever there's a lot of information being conveyed in a sequence, in the temporal change of whatever that type of data is, like speech, natural language or audio, that's when you want to use recurrent neural networks. And where they really shine is when the size of the input is variable, so you don’t have a fixed chunk of data that you're putting in. And the same goes for the output. You can give it a sequence of speech, several seconds of speech, and the output is a single label of whether the speaker is male or female. That’s many to one. You can also do many to many: translation. You can have natural language put into the network in Spanish, and the output is in English. Machine translation. That's many to many. And that many to many doesn't have to map between same-sized sequences. For video, the sequence sizes might be the same, because you're labeling every single frame: you put in a five-second clip of somebody playing basketball, and you can label every single frame, counting the number of people in every single frame. That's many to many where the size of the input and the size of the output are the same. Yes, question?
The question was, are there any models where there's feedback from the output to the input? That's exactly what recurrent neural networks are. It produces output, and it copies that output and loops it back in. That's almost the definition of a recurrent neural network: there's a loop in there that produces the output and also takes that output as input once again. There's also many to many where the sequences don't align. In machine translation, the size of the output sequence might be totally different from that of the input sequence. We will look at a lot of cool applications. You can start a song, learn the audio of a particular song, and have the recurrent neural network continue that song after a certain period of time. So it can learn to generate sequences of audio, of natural language, of video. Okay. I know I promised not many equations, but this is so beautifully simple that we have to cover backpropagation. It's also the thing that, if you're a little bit lazy and you go to the internet and start using the basic tutorials of TensorFlow, you ignore. You ignore how backpropagation works, at your peril. You kind of assume it just works: I give it some inputs, some outputs, and it's like Lego pieces I can assemble, like you might have done with DeepTraffic. A bunch of layers put together, and then just press Train. Backpropagation is the best mechanism we currently know of for training neural networks. So you need to understand the simple power of backpropagation, but also the dangers. The summary I put on the top of the slide: there's an input to the network, say an image; there's a bunch of neurons, each with a smooth, differentiable activation function; and as you take an input and pass it through this net of differentiable compute nodes, you produce an output. For that output you also have a ground truth, the correct output that you hope or expect the network to produce.
And you can look at the difference between what the network actually produced and what you hoped it would produce, and that's an error. And then you propagate that error backward, punishing or rewarding the parameters of the network that resulted in that output. Let's start with a really simple example. There's a function, up on top, that takes three variables as input: x, y and z. The function does two things: it adds x and y, and then it multiplies that sum by z. We can formulate that as a circuit, a circuit of gates, where there's a plus gate and a multiplication gate. Let's take some inputs, shown in blue. Let's say x is negative two, y is five and z is negative four. And let's do a forward pass through the circuit to produce the output. Negative two plus five equals three; q is that intermediate value, three. This is so simple, and so important to understand, that I just want to take my time with it, because everything else about neural networks just builds on these concepts. The add gate produces q, which in this case is three, and three times negative four is negative twelve. That's the output. The output of this circuit, of this network if you think of it as such, is negative twelve. The forward pass is shown in blue; the backward pass will be shown in red in a second here. What we want, what would make us happy, what would make f happy, is for the output to be as high as possible. Negative twelve is so-so; it could be better. How do we teach it? How do we adjust x, y and z to ensure it produces a higher f, to make f happier? Let's start the backward pass. We'll make the gradient on the output one, meaning we want this to increase. We want f to increase. That's how we encode our happiness: we want it to go up by one. In order to then propagate the fact that we want f to go up by one, we have to look at the gradient on each one of the gates. And what's a gradient? It's a partial derivative with respect to its inputs.
The partial derivative of the output of the gate with respect to its inputs, if you don't know what that means, is just how much the output changes when I change the inputs a little bit. What is the slope of that change? Take the first function, addition: f of x, y equals x plus y. If I increase x by a little bit, what happens to f? If I increase y by a little bit, what happens to f? Taking the partial derivatives with respect to x and y, you just get a slope of one. When you increase x, f increases linearly. Same with y. Multiplication is a little trickier. For f equals x times y, when you increase x, f increases by y. So the partial derivative of f with respect to x is y, and the partial derivative of f with respect to y is x. If you think about it, when you change x, the gradient of the change doesn't care about x. It cares about y. It's flipped. So we can backpropagate that one, the indication of what makes f happy, backward. And that's done by computing the local gradient. For q, that intermediate value, the partial derivative of f with respect to q would be negative four. As I said, it's the multiplication gate, so it takes the value of z and assigns it to the gradient on q. And the same for the partial derivative of f with respect to z: it takes the value of q from the forward pass. There's a three and a negative four on the forward pass, in blue, and that's flipped: negative four and three on the backward pass. That's the gradient. And then we continue with the exact same process. But wait. What makes all of this work is the chain rule. It's magical. What it allows us to do is compute the gradient of f with respect to the inputs x, y, z without constructing the giant function that is the partial derivative of f with respect to x, y and z analytically. We can do it step by step, backpropagating the gradients.
We can multiply the gradients together, as opposed to directly taking the partial derivative of f with respect to x. We have just the local gradients, of f with respect to q and of q with respect to x, and we multiply them together. So instead of computing the gradient of that giant function, x plus y, times z (in this case it's not that giant, but it gets pretty giant with neural networks), we just go step by step. Look at the first function, simple addition, q equals x plus y, and the second function, multiplication, f equals q times z. The gradient on x and y, the partial derivative of f with respect to x and y, is computed by multiplying the gradient on the output of the add gate, negative four, by the local gradient on its inputs, which, as we talked about, is just one when the operation is addition. It's negative four times one. What does that mean? Let's interpret those numbers. You now have gradients on x, y and z, the partial derivatives of f with respect to x, y, z. For x and y it's negative four; for z it's three. That means, in order to make f happy, we have to decrease the inputs that have a negative gradient and increase the inputs that have a positive gradient. The negative ones are x and y; the positive one is z. Hopefully I don't say the word “beautiful” too many times in this presentation, but this is very simple. Beautifully simple. Because each gate is a local worker, it propagates the gradient for you; it has no knowledge of the broader happiness of f. It computes the gradient between its output and its inputs. And it can propagate a gradient of one, as in this case for f, but also an error. Instead of one, we can have the error on the output as the measure of happiness, and then we can propagate that error backwards. These gates are important because we can break down almost every operation we work with in neural networks into one or several gates like these.
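As a minimal sketch of the worked example above, the forward and backward passes of the circuit can be written out by hand in Python. The input values are the ones from the lecture; the variable names and the step size of 0.01 are assumptions made only for illustration.

```python
# Forward pass through the circuit f = (x + y) * z, using the lecture's inputs.
x, y, z = -2.0, 5.0, -4.0
q = x + y            # add gate: q = 3
f = q * z            # multiply gate: f = -12

# Backward pass: start with a gradient of 1 on the output ("we want f to go up").
df_df = 1.0
df_dq = z * df_df    # multiply gate swaps its inputs: the gradient on q is z = -4
df_dz = q * df_df    # ... and the gradient on z is q = 3
df_dx = 1.0 * df_dq  # add gate has local gradient 1, so x gets -4 (chain rule)
df_dy = 1.0 * df_dq  # same for y: -4

# Nudge each input a small step along its gradient and recompute f.
step = 0.01          # hypothetical step size, chosen only for illustration
x_new = x + step * df_dx
y_new = y + step * df_dy
z_new = z + step * df_dz
f_new = (x_new + y_new) * z_new

print(f, (df_dx, df_dy, df_dz), f_new)
```

Decreasing x and y (negative gradients) and increasing z (positive gradient) raises the output slightly above negative twelve, which is the "make f happier" step described above.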
The three most popular are addition, multiplication and the max operation. The process is: you take a forward pass through the network, so we have a value on every single gate, and then you take the backward pass, and through the backward pass you compute those gradients. For an add gate, you distribute the gradient on the output equally to the inputs: when the gradient on the output is negative four, each input gets negative four. And you ignore the forward pass value. That three is ignored when you backpropagate. On the multiply gate, it's trickier. You switch the forward pass values. If you look at f, that's a multiply gate: the forward pass values are switched and multiplied by the value of the gradient on the output. If it's confusing, go through the slides slowly. It'll make a lot more sense. Hopefully. One more gate. There's the max gate, which takes the inputs and produces as output the value that is larger. When computing the gradient of the max gate, it distributes the gradient similarly to the add gate, but to only one of the inputs: the largest one. So, unlike the add gate, it pays attention to the input values on the forward pass. All right. Lots of numbers, but the whole point here is that it's really simple: a neural network is just a collection of these gates. You take a forward pass, you calculate some kind of function at the very end, you take the gradient at the very end, and you propagate that back. Usually, for neural networks, that's an error function. A loss function, an objective function, a cost function. All the same thing. That's the sigmoid function there. When you have three weights, w0, w1, w2, and two inputs, x0, x1, you take the weighted sum of the inputs, with the third weight acting as the bias, and pass it through the sigmoid function. That's how you compute the output of the neuron.
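The three gate rules described above can be sketched as small Python functions. The function names are hypothetical, and sending ties in the max gate to the first input is an arbitrary choice made for this sketch.

```python
def add_backward(a, b, grad_out):
    """Add gate: pass the output gradient unchanged to both inputs."""
    return grad_out, grad_out


def mul_backward(a, b, grad_out):
    """Multiply gate: each input receives the OTHER input's forward value
    times the output gradient."""
    return b * grad_out, a * grad_out


def max_backward(a, b, grad_out):
    """Max gate: route the whole gradient to the larger input, zero to the other.
    Ties go to the first input (an arbitrary choice for this sketch)."""
    return (grad_out, 0.0) if a >= b else (0.0, grad_out)


# The circuit from the lecture: q = x + y, f = q * z, with x=-2, y=5, z=-4.
grad_q, grad_z = mul_backward(3.0, -4.0, 1.0)   # gradients on q and z
grad_x, grad_y = add_backward(-2.0, 5.0, grad_q)  # gradients on x and y
print(grad_x, grad_y, grad_z)
```

Note that only the add gate ignores its forward-pass values; the multiply and max gates both need them.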
But then you can decompose that neuron; you can separate it all into just a set of gates like this. Addition, multiplication, and there's an exponential and a division in there, but it's all very similar. And you repeat the exact same process. There are five inputs to the circuit: three weights and the two inputs x0 and x1. You take a forward pass through this circuit. In this case, again, you want the output to increase, so the gradient on the output is one, and you backpropagate that gradient of one to the inputs. Now, in neural networks, there's a bunch of parameters that you're trying to modify through this process. And you don't get to modify the inputs. You get to modify the weights along the way, and the biases. The inputs are fixed, and the outputs are fixed, the outputs that you hope the network will produce. What you're modifying is the weights. So I get to try to adjust those weights in the direction of the gradient. That's the task of backpropagation, the main way that neural networks learn: we update the weights and the biases to decrease the loss function. The lower the loss function, the better. In this case, you have three inputs on the top left. A simple network: three inputs, with a weight on each of the inputs. There's a bias on the node, b, and it produces an output a, and that little symbol is indicating a sigmoid function. And the loss is computed as y minus a, squared, divided by two, where y is the ground truth, the output that you want the network to produce. And that loss is backpropagated in exactly the same way that we described before. The subtasks involved in this update of weights and biases are these: the forward pass computes the network output at every neuron; the output layer then computes the error, the difference between a and y; and then the gradients are propagated backward. Instead of one on the output, it will be the error on the output, and you backpropagate it.
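To make the single-neuron case above concrete, here is a hedged sketch of a sigmoid neuron with three inputs, a bias b, output a, and loss (y - a)² / 2. The specific weights, inputs and target are made-up values, not taken from the slide, and the finite-difference check at the end is an addition for verification only.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(w, b, x):
    """Neuron output a = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(a, y):
    """Loss from the lecture: (y - a)^2 / 2."""
    return 0.5 * (y - a) ** 2


def gradients(w, b, x, y):
    """Backpropagate the error through the sigmoid to the weights and bias."""
    a = forward(w, b, x)
    dloss_da = a - y                 # derivative of (y - a)^2 / 2 w.r.t. a
    da_ds = a * (1.0 - a)            # derivative of the sigmoid
    delta = dloss_da * da_ds         # chain rule: gradient on the weighted sum
    grad_w = [delta * xi for xi in x]  # multiply gate: each weight sees its input
    grad_b = delta                   # add gate: the bias gets the gradient as is
    return grad_w, grad_b


# Hypothetical values chosen only for illustration.
w, b = [0.5, -0.3, 0.8], 0.1
x, y = [1.0, 2.0, -1.0], 1.0
grad_w, grad_b = gradients(w, b, x, y)

# Sanity check of the first weight's gradient with a finite difference.
eps = 1e-6
w_plus = [w[0] + eps] + w[1:]
numeric = (loss(forward(w_plus, b, x), y) - loss(forward(w, b, x), y)) / eps
print(grad_w[0], numeric)
```

The inputs x and the target y are held fixed; only w and b receive gradients that would be used for an update.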
And then, once you know the gradient, you adjust the weights and the biases in the direction of the gradient. Actually, in the opposite direction of the gradient, because you want the loss to decrease. And the amount by which you make that adjustment is set by the learning rate. The learning rate can be the same across the entire network, or it can be individual to every weight. And the process of adjusting the weights and biases is just optimization. Learning is an optimization problem. You have an objective function, and you're trying to minimize it. And your variables are the parameters, the weights and biases. Neural networks just happen to have tens or hundreds of thousands, or millions, of those parameters. So the function that you're trying to minimize is highly non-linear. But it boils down to something like this: you have two weights, or in this plot actually one weight, and you adjust it in such a way that it minimizes the output cost. And there's a bunch of optimization methods for doing this. This one is a convex function. If you know this kind of terminology, the local minimum is the same as the global minimum; it's not a weirdly hilly terrain where you can get stuck. Your goal is to get to the bottom of this thing, and if it's a really complex terrain, it will be hard to get to the bottom of it. This general approach is gradient descent, and there are a lot of different ways to do gradient descent, various ways of adding randomness into the process so you don't get stuck in the weird crevices of the terrain. But it's messy. You have to be really careful. This is the part you have to be aware of: when you design a network for DeepTraffic and nothing is happening, this might be what's happening: vanishing gradients or exploding gradients.
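The one-weight convex picture above can be sketched as plain gradient descent. The cost function (w - 3)², the starting point and both learning rates are assumptions picked for illustration; they are not from the lecture.

```python
def cost(w):
    """A hypothetical convex cost with its single minimum at w = 3."""
    return (w - 3.0) ** 2


def cost_grad(w):
    """Derivative of the cost with respect to the weight."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate, steps):
    """Repeatedly step AGAINST the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * cost_grad(w)
    return w


w_good = gradient_descent(0.0, learning_rate=0.1, steps=100)  # settles near 3
w_bad = gradient_descent(0.0, learning_rate=1.1, steps=100)   # overshoots and diverges

print(w_good, cost(w_good))
print(w_bad, cost(w_bad))
```

Even on a convex bowl, a learning rate that is too large makes every step overshoot the minimum by more than the previous one, which is one reason the choice of step size matters.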
Vanishing gradients happen when the partial derivatives are small. Take the sigmoid function, for a while the most popular activation function: its derivative is essentially zero at the tails. When the input to the sigmoid function is really high or really low, that derivative is going to be close to zero. The gradient tells me how much I want to adjust the weights. The gradient might be close to zero, so you backpropagate that very low number, and it gets smaller and smaller as you backpropagate, and the result is that you think you don't need to adjust the weights at all. And when a large fraction of the network's weights appear not to need adjusting, they don't get adjusted, and you are not doing any learning, or the learning is very slow. There are some fixes to this; there are different types of activation functions. There's the piecewise-linear ReLU function, which is the most popular activation function. But again, if the neurons are initialized poorly, this function might not fire. It might have zero gradient for the entire data set. Whatever you give it as input, you run all your thousands of images of cats, and none of them makes the neuron fire at all. That's the danger here. So you have to pick both the optimization engine, the solver that you use, and the activation functions carefully. You can't just plug and play like they're Legos. You have to be aware of the function. SGD, stochastic gradient descent, is the vanilla optimization algorithm for gradient descent, for optimizing the loss function using the gradients. And what's visualized here, again, if you have done any numerical optimization, non-linear optimization, is the famous saddle point, which is tricky for these algorithms to deal with. What happens is that it's easy for them to oscillate, to get stuck in that saddle, oscillating back and forth, as opposed to doing what they want to do, which is to go down. You get so happy that you found this low point that you forget there's a much lower point. So you get stuck with the gradient.
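A small sketch of the sigmoid saturation described above: the derivative peaks at 0.25 and collapses toward zero at the tails. The chained product below is a deliberate simplification that multiplies only the activation derivatives and ignores the weights, so it illustrates the trend rather than any real network.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_grad(s):
    """Derivative of the sigmoid: a * (1 - a), at most 0.25 (at s = 0)."""
    a = sigmoid(s)
    return a * (1.0 - a)


# The derivative is largest in the middle and nearly zero at the tails.
for s in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(s, sigmoid_grad(s))


def chained_gradient(pre_activations):
    """Product of sigmoid derivatives along a chain of layers
    (weights are ignored: a simplifying assumption for this sketch)."""
    grad = 1.0
    for s in pre_activations:
        grad *= sigmoid_grad(s)
    return grad


best_case = chained_gradient([0.0] * 10)   # every layer at its steepest point
saturated = chained_gradient([6.0] * 10)   # every layer out in the tail
print(best_case, saturated)
```

Even in the best case, ten sigmoid layers shrink the gradient by about a factor of a million, and saturated layers shrink it far more.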
The momentum of the gradient keeps rocking it back and forth without you going to a much lower global minimum. And there's a lot of clever ways of solving that; the Adam optimizer is one of those. But in this case, as long as the gradients don't vanish, SGD, stochastic gradient descent, or one of these algorithms will get you there. It might take a little while, but it will get you there. Yes, question. The question was: you're dealing with a function that is not convex, so how do we ensure anything about converging to something that's reasonably good, that the local optimum it converges to is any good? The answer is, you can't. This isn't only a non-linear function, it's a highly non-linear function. The power and the beauty of neural networks is that they can represent these arbitrarily complex functions. It's incredible. And they can learn these functions from data. But the reason people refer to neural network training as an art is that you're trying to play with parameters so that it doesn't get stuck in these local optima, for stupid reasons and for clever reasons. Yes, question. The question continues on the same thread. The thing is, we're dealing with functions where we don't know what the global optimum is. That's the crux of it. Everything we talked about, interpreting text, interpreting video, even driving. What's the optimum for driving? Never crashing? It sounds easy to say that, but you actually have to formulate the world in which all of those things are defined, and it becomes a really non-linear objective function for which you don't know what the optimum is. That's why you keep trying and get impressed every time it gets better. That is essentially the process. And you can also compare with human-level performance. For ImageNet, humans can tell the difference between cats and dogs, and pick the right category in the top five, about 96% of the time, and then you get impressed when a machine can do better than that. But you don't know what the best is.
These videos can be watched for hours. I won't play it until I explain this slide. Let's pause to reflect on backpropagation before I go on to recurrent neural networks. Yes, question. In a practical manner, how can you tell, when you're actually creating a net, whether you're facing the vanishing gradient problem, or you need to change your optimizer, or you've reached a local minimum? The question was, how do you practically know when you've hit the vanishing gradient problem? The derivative being zero on the gradient happens when the activations are extreme, really high values or really low values. Really high values are easy to spot: your network has just gone crazy. It produces very large values. And you can fix a lot of those things by just capping the activations. The values being really low, resulting in a vanishing gradient, are really hard to detect. There's a lot of research into figuring out how to detect these things. If you're not careful, oftentimes you can find, and this isn't hard to do, that like 40 or 50 percent of the neurons in the network are dead. For ReLU, we call them dead ReLUs. They're not firing at all. How do you detect that? That's part of the learning. If they never fire, you can detect that by running the entire training set through the network. There are a lot of tricks. But that's the problem. You try to learn, and then you look at the loss function and it's not converging to anything reasonable. It's going all over the place, or just converging very slowly. And that's an indication that something is wrong. That something could be that the loss function is bad, that something could be that you've already found the optimum, or that something could be the vanishing gradient. And again, that's why it's an art. Certainly, at least some fraction of the neurons needs to be firing. Otherwise, the initialization was really poorly done.
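The dead-ReLU check mentioned above, running the whole training set through the network and seeing which neurons never fire, can be sketched for a single layer as follows. The layer, its weights and the random data set are all hypothetical; the second neuron is given a large negative bias on purpose so that it is dead.

```python
import random


def relu(s):
    return max(0.0, s)


def dead_fraction(weights, biases, dataset):
    """Fraction of ReLU neurons in one layer that never fire on any example."""
    fired = [False] * len(weights)
    for x in dataset:
        for j, (w, b) in enumerate(zip(weights, biases)):
            pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if relu(pre_activation) > 0.0:
                fired[j] = True
    return fired.count(False) / len(fired)


# Hypothetical data: 1000 two-dimensional inputs in [0, 1).
rng = random.Random(0)
dataset = [[rng.random(), rng.random()] for _ in range(1000)]

# Two neurons with the same weights; the second has a badly chosen bias.
weights = [[1.0, 1.0], [1.0, 1.0]]
biases = [0.1, -10.0]

frac = dead_fraction(weights, biases, dataset)
print(frac)
```

A dead neuron's output and gradient are both zero for every example, so its weights are never updated; a high dead fraction is one concrete symptom to look for when the loss refuses to move.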
Okay, to reflect on the simplicity of backpropagation and the power of it: this step of backpropagating the loss through the local gradients is the way neural networks learn. It's really the only way that we have effectively been able to train a neural network to learn a function, to adjust the weights and biases, the huge number of weights and biases, the parameters. It's just through this optimization. It's backpropagating the error, where you have the supervised ground truth. The question is whether this process of fitting, of adjusting the parameters of a highly non-linear function to minimize a single objective, is the way you achieve intelligence. Human-level intelligence. That's something to think about. You have to think about, for driving purposes, what is the limitation of this approach? What's not happening? The neural network design, the architecture, is not being adjusted. None of the edges, the layers, nothing is being evolved. There are other optimization approaches that I think are more interesting and inspiring than effective. For example, this is using soft cubes. This comes out of the field of evolutionary robotics, where you evolve the dynamics of a robot using genetic algorithms. These robots have been taught, in simulation, obviously, to walk and to swim. That one is swimming. The nice thing here is that the space that controls the dynamics of this weird-shaped robot with a lot of degrees of freedom is a highly non-linear space as well; it's the same kind of thing as the neural network. In fact, people have applied genetic algorithms, ant colony optimization, all kinds of nature-inspired algorithms to optimizing the weights and the biases, but they don't seem to currently work that well. It's a cool idea to be using nature-type evolutionary algorithms to evolve something that's already nature-inspired, which is neural networks.
But something to think about: backpropagation, while really simple, is kind of dumb, and the question is whether general intelligence, reasoning, can be achieved with this process. All right, recurrent neural networks. On the left, there's an input x with weights on the input, U; there's a hidden state, a hidden layer s, with weights W on the edge connecting the hidden states to each other; and then more weights, V, on the output o. It's a really simple network: there are inputs, there are hidden states, which are the memory of this network, and there are outputs. But the fact that there's this loop, where the hidden states are connected to each other, means that, as opposed to taking a single input, the network takes an arbitrary number of inputs. It just keeps taking x, one at a time, and produces a sequence of outputs through time. Depending on the duration of the sequence you're interested in, you can think of this network in its unrolled state. You can unroll this neural network so that the inputs are on the bottom, x(t-1), x(t), x(t+1), and the same with the outputs, o(t-1), o(t), o(t+1), and it becomes like a regular neural network, unrolled some arbitrary number of times. The parameters, again, are weights and biases, similar to CNNs, convolutional neural networks. And just like convolutional neural networks make certain spatial consistency assumptions, the recurrent neural network assumes temporal consistency among the parameters; it shares the parameters. That W, that U, that V is the same for every single time step. You're learning the same parameters no matter the duration of the sequence, and that allows you to look at arbitrarily long sequences without having an explosion of parameters. This same exact process is repeated for the different variants that we talked about before in terms of inputs and outputs: one to many, many to one, many to many. The backpropagation process is exactly the same as for regular neural networks.
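The unrolled network described above can be sketched as a loop that reuses the same U, W and V at every time step. This is a minimal illustration, assuming a tanh hidden layer and a plain linear output; the lecture's slide may use a different output function, and all the numbers below are made up.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(xs, U, W, V, s0):
    """Run a vanilla RNN over a sequence of any length.

    s(t) = tanh(U x(t) + W s(t-1)),  o(t) = V s(t)
    The SAME U, W, V are used at every time step.
    """
    s = s0
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]   # new hidden state: the network's memory
        outputs.append(matvec(V, s))      # one output per input
    return outputs, s


# Hypothetical parameters: 2 inputs, 2 hidden units, 1 output.
U = [[0.5, -0.2], [0.1, 0.3]]
W = [[0.4, 0.0], [-0.3, 0.2]]
V = [[1.0, -1.0]]
s0 = [0.0, 0.0]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.0, 0.0]]
out3, _ = rnn_forward(xs[:3], U, W, V, s0)   # a 3-step sequence
out5, _ = rnn_forward(xs, U, W, V, s0)       # a 5-step sequence, same parameters
print(len(out3), len(out5))
```

The parameter count does not change between the three-step and five-step runs; only the number of loop iterations does.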
It has the fancy name of backpropagation through time, BPTT, but it's just backpropagation through an unrolled recurrent neural network, where the errors are computed on the outputs and the gradients are computed and backpropagated to the inputs, again suffering from the same exact problem of vanishing gradients. The problem is that the depth of these networks can be arbitrarily large. If at any point the gradient hits a low number, zero, that neuron becomes saturated. That gradient, let's call it saturated, drives the gradients on all the earlier layers to zero, so it's easy to run into a problem where you're really ignoring the majority of the sequence. This is just another way, a Python pseudocode way, to look at it. You have the same W; remember, you're sharing the weights and all the parameters from time step to time step. So if the weights Whh are such that they produce [unintelligible], they have a negative value that results in a gradient that goes to zero, that propagates through the rest. That's the pseudocode for the backpropagation pass through the RNN, where Whh propagates back. You get these things with exploding and vanishing gradients. For example, here are error surfaces for a single-hidden-unit RNN. This is visualizing the value of the weight, the value of the bias and the error. The error surface could be really flat or it could explode, and both are going to lead to you making steps that are either too gradual or too big. It's the geometric interpretation. Okay. What other variants do we look at, a little bit? Are they [unintelligible 00:41:13]? It doesn't have to go only one way; it can be bidirectional, so there could be edges going forward and edges going back. What that's needed for is things like filling in missing elements of the data, whatever the data is, whether that's images, or words, or audio. Generally, as is always the case in neural networks, the deeper it goes, the better.
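Why sharing the recurrent weight makes gradients vanish or explode can be seen in a very reduced sketch. It assumes a single hidden unit and ignores the activation function's derivative, so the backpropagated gradient is simply multiplied by the same recurrent weight once per time step. The name `w_hh` and the values 0.5 and 1.5 are assumptions for illustration, not the slide's pseudocode.

```python
def backprop_factor(w_hh, steps):
    """Factor by which a gradient is scaled after flowing back through
    `steps` time steps of a single-unit linear RNN with recurrent weight w_hh."""
    grad = 1.0
    for _ in range(steps):
        grad *= w_hh   # the SAME shared weight is applied at every time step
    return grad


vanishing = backprop_factor(0.5, 50)   # |w_hh| < 1: shrinks toward zero
exploding = backprop_factor(1.5, 50)   # |w_hh| > 1: blows up

print(vanishing, exploding)
```

With a saturating activation in the loop, each step contributes an additional factor of at most one, which pushes the product further toward the vanishing case.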
That "deep" refers to the number of layers in a single temporal instance. On the right of the slide, we're stacking nodes in the temporal domain. Each of those layers has its own set of weights, its own set of biases. These things are awesome, but they need a lot of data when you add extra layers in this way. The problem is, while a recurrent neural network is, in theory, supposed to be able to learn any kind of sequence, the reality is they're not really good at remembering what happened a while ago, the long-term dependency. Here's a silly example. Let's think of a story about Bob: "Bob is eating an apple." The "apple" part is generated by the recurrent neural network. A recurrent neural network can learn to generate "apple" because it has seen a lot of sentences with "Bob" and "eating", and so it can generate the word "apple". For a longer sentence, like "Bob likes apples, he's hungry and decided to have a snack, so now he's eating an apple", you have to maintain the state that we're talking about Bob and we're talking about apples through several discrete semantic sentences. That kind of long-term memory is hard. Because of different effects, vanishing gradients among them, it's difficult to propagate the important stuff that happened a while ago in order to maintain that context when generating "apple", or when classifying some concept that happens way down the line. When people talk about recurrent neural networks these days, they're talking about LSTMs, long short-term memory networks. All the impressive results on time series and audio and video and all that, those require LSTMs. Again, vanilla RNNs are on top of the slide. Each cell is simple: there are some hidden units, there's an input, and there's an output. Here, we use tanh as the activation function; it's just another popular sigmoid-type activation function. LSTMs are more complicated, or they look more complicated, but in some ways they're more intuitive for us to understand.
There's a bunch of gates in each cell; we'll go through those. In yellow are different neural network layers. Sigmoid and tanh are just different types of activation functions. Tanh is an activation function that squishes the input into the range of negative one to one. The sigmoid function squishes it between zero and one, and they serve different purposes. There are some pointwise operations, addition and multiplication, and there are connections, data being passed from layer to layer, shown by the arrows. There's concatenation, and there's a copy operation on the output: the output of each cell is copied to the next cell and to the output. Let me try to clarify this a little bit. There's this conveyor belt going through the inside of each individual cell, and there are really three steps along the conveyor belt. The first is a sigmoid function that's responsible for deciding what to forget and what to ignore. It's responsible for taking in the new input, x(t), taking in the output of the previous cell, the previous time step, and deciding "do I want to keep that in my memory or not?" and "do I want to integrate the new input into my memory or not?" This allows you to be selective about the information you learn. For example, there's the sentence "Bob and Alice are having lunch, Bob likes apples, Alice likes oranges, she is eating an orange". If you have a hidden state keeping track of the gender of the person we're talking about, you might say that there are both genders in the first sentence, male in the second sentence, female in the third sentence, and that way, when you have to generate a sentence about who's eating what, you keep the gender information in order to generate text that corresponds to the proper person.
You have to forget certain things, like forget that Bob existed at that moment. You have to forget that Bob likes apples, but you have to remember that Alice likes oranges, so you have to selectively remember and forget certain things. That's the LSTM in a nutshell: you decide what to forget, decide what to remember, and decide what to output in that cell. Let's zoom in a little bit, because this is pretty cool. There's a state running through the cell, this conveyor belt, the previous state, like the gender that we're currently talking about. That's the state that you're keeping track of, and that's running through the cell. Then there are three sigmoid layers, each outputting a number between zero and one: one when you want that information to go through, and zero when you don't want it to go through the conveyor belt that maintains the state. The first sigmoid function is where we decide what to forget and what to ignore. We take what comes in from the previous time step and the input to the network on the current time step, and decide: do I want to forget, or do I want to ignore, those? Then we decide which part of the state to update, what part of our memory we have to update with this information, and what values to insert in that update. The third step is where we perform the actual update and the actual forgetting. That's why you have the sigmoid function: you just multiply by it. When it's zero, it's forgetting; when it's one, that information passes through. Finally, we produce an output from the cell. If it's translation, it's producing an output in the English language where the input was in the Spanish language, and then that same output is copied to the next cell. What can we get done with this kind of approach? We can look at machine translation. I guess what I'm trying to-- question. [Audience question] What is your representation of this state? Is it a floating-point number, or a vector, or what is it, exactly?
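The steps described above (forget, decide what to write, update, output) can be sketched as a single LSTM time step. This is a minimal sketch that assumes the standard textbook LSTM equations, since the slide itself is not reproduced here; the parameter names (`Wf`, `Wi`, `Wg`, `Wo` and their biases) and the tiny one-unit example are hypothetical. Note that the cell state `c` and the output `h` are plain lists of floating-point numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Compute W @ vec + b for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new output h and the new cell state c."""
    z = h_prev + x  # concatenate previous output and current input
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]    # input gate
    g = [math.tanh(a) for a in affine(params["Wg"], params["bg"], z)]  # candidate values
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]    # output gate
    # The "conveyor belt": forget part of the old state, then add the gated new values.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def make_params(bf, bi):
    """Hypothetical one-unit cell with zero weights, so only the gate biases matter."""
    zeros = [[0.0, 0.0]]
    return {"Wf": zeros, "bf": [bf], "Wi": zeros, "bi": [bi],
            "Wg": zeros, "bg": [0.0], "Wo": zeros, "bo": [0.0]}

# Forget gate wide open, input gate closed: the old state passes through.
_, c_keep = lstm_step([1.0], [0.0], [0.8], make_params(bf=20.0, bi=-20.0))
# Forget gate closed, input gate closed: the old state is wiped.
_, c_forget = lstm_step([1.0], [0.0], [0.8], make_params(bf=-20.0, bi=-20.0))
print(c_keep, c_forget)
```

In a trained network the gate values come from learned weights applied to the input and the previous output, rather than from hand-set biases as in this toy example.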
The state is the set of activations, the outputs of the sigmoid or the tanh layers, so it's a vector of numbers rather than a single value. There's a bunch of neurons, and they're each firing a number between negative one and one, or between zero and one; that whole thing is the state. Calling it "a state" is sort of simplifying, but the point is that there's a bunch of numbers being constantly modified by the weights and the biases. Those numbers hold the state, and the modification of those numbers is controlled by the weights. Once all of that is done, the resulting output of the recurrent neural network is compared to the desired output, and the errors are backpropagated to the weights. Hopefully that makes sense. So, machine translation is one popular application, and all of it is the same. All of these networks that I've talked about are really similar constructs. You have some inputs in whatever language that is, German maybe, I think everything is German, and the output. The inputs are in one language, a set of characters composing a word in one language. There's a state being propagated, and once that sentence is over, as opposed to collecting inputs you start producing outputs, and you can output in the English language. There's a ton of great work on machine translation. It's what Google is supposedly using for their translator, same thing. I've shown this previously, but now you all know how it works. It's the exact same thing: LSTMs generating handwritten characters, handwriting in arbitrary styles, controlling the drawing, where the input is text and the output is handwriting. It's again the same kind of network, with some depth here; the input is the text, the output is the control of the writing. Character-level text generation: this is the thing that taught us about life, the meaning of life, literary recognition, and the tradition of ancient human reproduction.
That's, again, the same process: you input one character at a time. What we see there is the encoding of the characters on the input layer. There's a hidden state, a hidden layer that is keeping track of those activations, the outputs of the activation functions, and at every single time step it's outputting its best prediction of the character that follows. Now, in a lot of these applications you want to ignore the output until the input sentence is over, and then you start listening to the output. But the point is that it just keeps generating text, whether it's given an input or not, so providing input is just steering the recurrent neural network. You can answer questions about an image. You can almost arbitrarily stack things together: you take an image as your input, bottom left there, put it into your convolutional neural network, and take the question. There's something called word embeddings, which give a broader representation of the meaning of the words. "How many books?" is the question. You want to take the word embeddings and the image and produce your best estimate of the answer. For the question "what color is the cat?" it could be gray or black; it's the different LSTM flavors producing that answer. Same with counting chairs: you can give it an image of chairs and ask the question "how many chairs are there?", and it can produce an answer of "three". I should say this is really hard: an arbitrary question asked about an arbitrary image. You are doing natural language processing and you're doing computer vision, all in one network. Same thing with image caption generation: you can detect the different objects in the scene, generate those words, stitch them together into syntactically correct sentences, and rearrange the sentences.
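Going back to the character-level generation loop described at the start of this passage (one-hot input, hidden state, a prediction at every step, the prompt merely steering the network), here is a minimal sketch. It uses an untrained vanilla RNN with random weights, so the text it produces is gibberish; the class name `CharRNN`, the layer sizes, and the greedy "pick the most likely character" rule are all assumptions made for illustration.

```python
import math
import random

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class CharRNN:
    """Untrained vanilla RNN with random weights, to illustrate the loop only."""

    def __init__(self, vocab, hidden_size=8, seed=0):
        rng = random.Random(seed)
        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        self.vocab = vocab
        self.w_xh = rand(hidden_size, len(vocab))
        self.w_hh = rand(hidden_size, hidden_size)
        self.w_hy = rand(len(vocab), hidden_size)
        self.h = [0.0] * hidden_size

    def step(self, char):
        """Consume one character, update the hidden state, return next-char probabilities."""
        x = one_hot(self.vocab.index(char), len(self.vocab))
        pre = [a + b for a, b in zip(matvec(self.w_xh, x), matvec(self.w_hh, self.h))]
        self.h = [math.tanh(p) for p in pre]
        return softmax(matvec(self.w_hy, self.h))

def generate(model, prompt, length):
    probs = [1.0 / len(model.vocab)] * len(model.vocab)
    # Steering phase: feed the prompt and ignore every output except the last.
    for char in prompt:
        probs = model.step(char)
    # Generation phase: feed each predicted character back in as the next input.
    out = []
    for _ in range(length):
        char = model.vocab[probs.index(max(probs))]
        out.append(char)
        probs = model.step(char)
    return "".join(out)

vocab = list("abcdefgh ")
print(generate(CharRNN(vocab), "abc", 10))
```

A trained model would typically sample from the predicted distribution instead of always taking the most likely character, which gives more varied text.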
All of those are LSTMs, the second and the third steps. The first is computer vision, segmenting the image and detecting the objects. That way you can generate a caption that says "a man is sitting in a chair with a dog in his lap". Again, LSTMs for video: caption generation for video. Every frame is an image that goes into the LSTM; the input is an image and the output is a set of characters. First, you load in the video. In this case the output is on top. You encode the video into a representation inside the network, and then you start generating words about that video. First comes the input, the encoding stage, then the decoding stage. Take in the video, say "a man is talking", whatever. Because the input and the output are of arbitrary length, there also have to be indicators of the beginnings and the ends of a sentence, in this case the ends of sentences. You want to know when to stop in order to generate syntactically correct sentences, so you also want to be able to generate a period that indicates the end of a sentence. You can also, again with recurrent neural networks, LSTMs here, control the steering of a sliding window on an image that is used to classify what is contained in that image. Here, a CNN is being steered by a recurrent neural network in order to convert this image into the number that's associated with a house number. It's called visual attention. That visual attention can be used to steer on the perception side, and it can be used to steer a network for generation. On the right, we can generate an image. It's an LSTM where the output on every time step is visual, and this way you can draw numbers. Here, I mentioned this before, it's taking silent video, a sequence of images, as input and producing audio. This is an LSTM that has convolutional layers for every single frame; it takes images as input and produces a spectrogram, audio, as output.
The training set is a person hitting an object with a drumstick, and your task is, given a silent video, to generate the sound that the drumstick will make when in contact with that object. Medical diagnosis: I've listed some places where it has been really successful and pretty cool, but it's also beginning to be applied in places where it can actually really help civilization, in medical applications. For medical diagnosis there's the highly sparse and variable-length sequence of information in the form of, for example, patient electronic health records. Every time you visit a doctor there's a test being done; that information is there, and you can look at it as a sequence over a period of time. Given that data as the input, the output is a medical diagnosis. In this case we can look at predicting diabetes, scoliosis, asthma, and so on, with pretty good accuracy. There's something that all of us wish we could do, which is stock market prediction. You can input, first of all, the raw stock data, [unintelligible 01:00:30] books and so on, financial data, but you can also look at news articles from all over the web and take those as input. As shown here, on the x-axis is time, articles from different days, an LSTM once again, and you produce an output of your prediction, a binary prediction of whether the stock will go up or down. Nobody has been able to really successfully do this, but there are a bunch of results trying to perform above random, which is how you make money: significantly above random on the prediction of whether it's going up or down, so you could buy or sell. Especially in the cases when there were crashes it's easier to predict, so you can predict an encroaching crash. These are shown in the table, the error rates for different stocks, automotive stocks. You can also generate audio. It's the exact same process as generating language: you generate audio.
Here it's trained on a single speaker, a few hours of them speaking, and you just learn. That's raw audio of the speaker, and it's learning slowly to generate [audio]. Obviously, they were reading numbers. This is incredible. This is trained on a compressed spectrogram of the raw audio, and over just a few epochs it's producing something that sounds like words. It could do this lecture for me, I wish. This is amazing: this is raw input, raw output, all again LSTMs. And there's a lot of work in voice recognition, audio recognition. You're mapping-- let me turn it up. You are mapping any kind of audio to a classification. You can take the audio of the road, that's the spectrogram being shown on the bottom there, and you could detect whether the road is wet or the road is dry. You could do the same thing for recognizing the gender of the speaker, or for recognizing a many-to-many map of the actual words being spoken, speech recognition. This class is about driving, so let's see where recurrent neural networks apply in driving. We talked about the NVIDIA approach, the thing that actually powers DeepTeslaJS. It is a simple convolutional neural network; there are five convolutional layers in their approach and three fully connected layers. You can add as many layers as you want in DeepTesla. That's a quarter of a million parameters to optimize. All you are taking is a single image, no temporal information, and producing the steering angle. That's the approach, that's the DeepTesla way: taking a single image and learning a regression of the steering angle.
One of the prizes for the competition is the Udacity self-driving car engineer nanodegree for free. This thing is awesome, I encourage everyone to check it out. They did a competition that's very similar to ours, but with a very large group of obsessed people. They were very clever; they went beyond convolutional neural networks for predicting steering, to taking a sequence of images and predicting steering. What the winners did, at least the first, and I'll talk about the second-place winner tomorrow, on 3D convolutional neural networks: the first- and the third-place winners used RNNs, used LSTMs, recurrent neural networks, and mapped a sequence of images to a sequence of steering angles. Statistically speaking, for anybody here who is not a computer vision person, most likely what you'd want to use for whatever application you're interested in is RNNs. The world is full of time series data; very few of us are working on data that is not time series data. In fact, whenever it's just snapshots, you're really just reducing the problem to a size that you can handle, but most data in the world is time series data. This is the approach you end up using; if you want to apply it in your own research, RNNs are the way to go. Again, what are they doing? How do you put images into a recurrent neural network?
It's the same thing. You have to convert an image into numbers in some kind of way, and a powerful way of doing that is convolutional neural networks. You can take either 3D convolutional neural networks or 2D convolutional neural networks, one takes time into consideration and one does not, and process that image to extract a representation of that image. That becomes the input to the LSTM, and the output at every single cell, at every single time step, is a predicted steering angle, the speed of the vehicle, and the torque. That's what the first-place winner did: they didn't just do the steering angle, they also did the speed and torque. And the sequence length that they were using for training and for testing, for the input and the output, is a sequence length of 10. [Audience question] Did they use supervised learning, or did they use reinforcement learning? The question was, did they use supervised learning? Yes. They were given the same thing as in DeepTesla, a sequence of frames where they have a sequence of steering angles, speed, and torque. I think there's other information available too. There's no reinforcement learning here. Question. [Audience question] Do you have a sense of how much information is being passed, how many LSTM gates there are in this problem? The question was, how many LSTM gates are in this problem? It's true that these diagrams kind of hide the number of parameters here, but it's arbitrary, just like the sizes of convolutional neural networks are arbitrary. The size of the input is arbitrary, the size of the sigmoid and tanh layers is arbitrary, so you can make it as large as you want, as deep as you want, and the deeper and larger, the better.
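The pipeline described above (per-frame features fed into a recurrent cell, with steering, speed, and torque predicted at every time step, on sequences of 10 frames) can be sketched roughly as follows. Everything here is a stand-in: `frame_features` is a hypothetical replacement for a CNN, `recurrent_cell` is a toy update rather than a real LSTM, and the way the three outputs are read off the state is invented for illustration. It is not the winning team's code.

```python
import math

SEQ_LEN = 10  # sequence length quoted in the lecture for the first-place entry

def frame_features(frame):
    """Hypothetical stand-in for a CNN: summarize a frame (list of pixels) as 2 numbers."""
    half = len(frame) // 2
    mean = sum(frame) / len(frame)
    left_mean = sum(frame[:half]) / half
    return [mean, left_mean - mean]

def recurrent_cell(features, state):
    """Toy recurrent update (not a real LSTM): mix new features into the running state."""
    new_state = [math.tanh(0.5 * s + f) for s, f in zip(state, features)]
    # Invented output heads, for illustration only.
    steering, speed, torque = new_state[1], new_state[0], new_state[0] * new_state[1]
    return new_state, (steering, speed, torque)

def make_windows(frames, seq_len=SEQ_LEN):
    """Cut a long recording into consecutive fixed-length sequences."""
    return [frames[i:i + seq_len] for i in range(0, len(frames) - seq_len + 1, seq_len)]

def run_sequence(frames):
    """Apply the same cell at every time step, producing one prediction per frame."""
    state = [0.0, 0.0]
    outputs = []
    for frame in frames:
        state, prediction = recurrent_cell(frame_features(frame), state)
        outputs.append(prediction)
    return outputs

# Synthetic "video": 25 frames of 8 pixel values each.
video = [[((t + p) % 7) / 7.0 for p in range(8)] for t in range(25)]
windows = make_windows(video)
print(len(windows), [len(w) for w in windows])
print(run_sequence(windows[0])[-1])  # (steering, speed, torque) for the last frame
```

In a supervised setup like the one described, each predicted triple would be compared against the recorded steering angle, speed, and torque for that frame, and the error would be backpropagated through the sequence.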
What these folks actually used-- the way these competitions work, and I encourage you, if you're interested in machine learning, to participate in Kaggle, I don't know how to pronounce it, competitions, is that basically everyone is doing the same thing. You're using LSTMs or, if it's a one-to-one mapping, using convolutional neural networks or fully connected networks with some clever pre-processing, and the whole job takes months. If you're a researcher, that's probably what you'd be doing in your own research: playing with parameters, playing with pre-processing of the data, playing with the different parameters that control the size of the network, the learning rate I've mentioned, the type of optimizer, all these kinds of things. That's what you're playing with, using your own human intuition and whatever probing you can do in monitoring the performance of the network through time. Yes? The question was, you said that there's a memory of 10 in this LSTM, and I thought RNNs are supposed to be arbitrary. It has to do with the training, how the network is trained. It's trained with sequences of 10. The structure is still the same: you only have one cell that's looping onto itself. But the question is, in what chunks, what is the size of the sequence that we should use in the training and then the testing? It can be of arbitrary length. It's just usually better to be consistent and have a fixed length. You're not stacking 10 cells together. It's just a single cell still. The third-place winner, Team Chauffeur, used something called transfer learning. It's something I don't think I mentioned, but it's kind of implied: the amazing power of neural networks. First, you need a lot of data to do anything. That's the cost, that's the limitation of neural networks. But what you can do is this: there are neural networks that have been trained on very large data sets.
With ImageNet, there's VGGNet, AlexNet, ResNet: all these networks are trained on a huge amount of data. Those networks are trained to tell the difference between a cat and a dog, specific object recognition in single images. How do I then take that network and apply it to my problem, say driving, or lane detection, or medical diagnosis, cancer or not? The beauty of neural networks, the promise of transfer learning, is that you can just take that network, chop off the final layer, the fully connected layer that maps from all those cool high-dimensional features that you have learned about visual space, and as opposed to predicting cat vs. dog, you teach it to predict cancer or no cancer. You teach it to predict lane or no lane, truck or no truck. As long as the visual space in which that network operates is similar, or the data, if it's audio or whatever, is similar, and the features are useful, then in studying the problem of cat vs. dog deeply you have actually learned how to see the world. You can apply that visual knowledge; you can transfer that learning to another domain. That's the beautiful power of neural networks: they're transferable. What they did here is-- I didn't spend enough time looking through the code, so I'm not sure which of the giant networks they took, but they took a giant convolutional neural network, they chopped off the end layer, which produced 3,000 features, and they took those 3,000 features for every single image frame, and that's the x(t). They gave that as the input to the LSTM. And the sequence length, in that case, was 50. This process is pretty similar across domains. That's the beauty of it. The art of neural networks is in the-- Well, that's a good sign [chuckles], I guess I should wrap it up-- The art of neural networks is in the proper parameter tuning. That's the tricky part, and that's the part you can't be taught. That's experience, sadly enough.
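Returning to the transfer-learning recipe described above (keep the pretrained network, chop off its final layer, train a new final layer for your own task), here is a minimal sketch of the idea. The "pretrained" extractor is a hypothetical two-number stand-in rather than a real CNN, the new head is a plain logistic regression, and the tiny "lane / no lane" dataset is invented for illustration.

```python
import math

def pretrained_features(image):
    """Hypothetical frozen feature extractor: stands in for a large pretrained CNN
    with its final classification layer chopped off."""
    mean = sum(image) / len(image)
    contrast = max(image) - min(image)
    return [mean, contrast]

def train_new_head(images, labels, epochs=500, lr=0.5):
    """Train only a new logistic-regression layer on top of the frozen features."""
    feats = [pretrained_features(img) for img in images]  # the extractor is never updated
    weights, bias = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            logit = sum(w * v for w, v in zip(weights, f)) + bias
            p = 1.0 / (1.0 + math.exp(-logit))
            error = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * error * v for w, v in zip(weights, f)]
            bias -= lr * error
    return weights, bias

def predict(image, weights, bias):
    f = pretrained_features(image)
    return 1 if sum(w * v for w, v in zip(weights, f)) + bias > 0 else 0

# Invented toy data: bright patches are labeled "lane" (1), dark patches "no lane" (0).
images = [[0.9, 0.8, 1.0, 0.9], [0.1, 0.2, 0.0, 0.1],
          [0.8, 0.9, 0.7, 0.9], [0.2, 0.1, 0.1, 0.0]]
labels = [1, 0, 1, 0]
weights, bias = train_new_head(images, labels)
print([predict(img, weights, bias) for img in images])
```

Only the small new layer is fitted, which is why this approach can work with far less task-specific data than training the whole network from scratch.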
That's why they talk about stochastic gradient descent, SGD. That's what Geoffrey Hinton refers to as "stochastic graduate student descent", meaning you just keep hiring graduate students to play with the hyperparameters until the problem is solved [laughter]. I have over 100 slides on driver state, which is the thing that I'm most passionate about, and I think we'll save the best for last. I'll talk about that tomorrow. We have a guest speaker from the White House who will talk about the future of artificial intelligence from the perspective of policy. What I would like you to do, first off, registered students, is submit the two tutorial assignments, and pick up-- can we just set the boxes right here or something? Just stop by and pick up a shirt. And give us a card on the way. Thanks, guys. [Applause]
Info
Channel: Lex Fridman
Views: 101,791
Rating: 4.8606629 out of 5
Keywords: mit, deep learning, recurrent neural networks, introduction, rnn, steering, end-to-end driving
Id: nFTQ7kHQWtc
Length: 75min 59sec (4559 seconds)
Published: Wed Feb 01 2017