MIT Introduction to Deep Learning | 6.S191

Video Statistics and Information

Captions
Good afternoon, everyone! Thank you all for joining today. My name is Alexander Amini, and I'll be one of your course organizers this year, along with Ava. Together we're super excited to introduce you all to Introduction to Deep Learning. MIT Intro to Deep Learning is a really fun, exciting, and fast-paced program here at MIT, so let me start by giving you a little bit of background on what we do and what you're going to learn about this year.

This week of Intro to Deep Learning we're going to cover a ton of material in just one week. You'll learn the foundations of this really fascinating and exciting field of deep learning and artificial intelligence, and, more importantly, you're going to get hands-on experience reinforcing what you learn in the lectures as part of hands-on software labs.

Over the past decade, AI and deep learning have had a huge resurgence and many incredible successes, and a lot of problems that even just a decade ago we thought were not solvable in the near future we're now solving with deep learning with incredible ease. This past year in particular, 2022, has been an incredible year for deep learning progress. I like to say that this past year has been the year of generative deep learning: using deep learning to generate brand new types of data that have never been seen before and never existed in reality. In fact, I want to start this class by showing you how we started it several years ago, which was by playing this video. The video was an introductory video for the class, and it exemplifies the idea I'm talking about. So let me stop there and play the video first of all.

"Hi everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT. Deep learning is revolutionizing so many fields: from robotics to medicine and everything in between. You'll learn the fundamentals of this field and how you can build some of these incredible algorithms. In fact, this entire speech and video are not real and were created using deep learning and artificial intelligence. And in this class, you'll learn how. It has been an honor to speak with you today, and I hope you enjoy the course."
So, in case you couldn't tell, that video and its entire audio were not real; they were synthetically generated by a deep learning algorithm. When we introduced this class a few years ago and put that video on YouTube, it went somewhat viral. People really loved it; they were intrigued by how real the video and audio felt and looked while being entirely generated by an algorithm, by a computer, and people were shocked by the power and realism of these types of approaches. That was a few years ago. Fast forward to today, and we have seen deep learning accelerating at a rate faster than we've ever seen before. We can now use deep learning to generate not just images of faces but full synthetic environments, where we can train autonomous vehicles entirely in simulation and deploy them on full-scale vehicles in the real world, seamlessly. The videos you see here are from Vista, a data-driven simulator built from neural networks that we developed here at MIT and have open sourced to the public, so all of you can train and build the future of autonomy and self-driving cars.

And of course it goes far beyond this as well. Deep learning can be used to generate content directly from the language we convey to it. From prompts that we give, deep learning can reason about natural language, English for example, and then guide and control what is generated according to what we specify. We've seen examples where we can generate things that, again, have never existed in reality: we can ask a neural network to generate a photo of an astronaut riding a horse, and it can imagine, hallucinate, what this might look like, even though not only has this particular photo never occurred before, I don't think any photo of an astronaut riding a horse has ever occurred before, so there isn't really even training data to go off of in this case. My personal favorite is that we can build not only software that can generate images and videos, but software that can generate software as well. We can have algorithms that take a language prompt, for example "write code in TensorFlow to train a neural network," and not only will they write the code to create that neural network, they can also reason about the code they've generated and walk you through it step by step, explaining the process and procedure from the ground up, so that you can learn how to do this yourself.

I think these examples highlight just how far deep learning and these methods have come in the six years since we started this course. You saw that example from the introductory video just a few years ago, and now we're seeing such incredible advances. The most amazing part of this course, in my opinion, is that within this one week we're going to take you, from the ground up, starting today, through all of the foundational building blocks that will allow you to understand and make all of these amazing advances possible. So with that, hopefully you're all super excited about what this class will teach.
I want to start by taking a step back and introducing some of the terminology that I've been throwing around so far: deep learning, artificial intelligence. What do these things actually mean? First, let me take a second to speak about intelligence and what intelligence means at its core. To me, intelligence is simply the ability to process information such that we can use it to inform some future decision or action that we take. The field of artificial intelligence is simply the ability to build artificial algorithms that can do exactly this: process information to inform some future decision. Machine learning is a subset of AI that focuses specifically on how we can teach a machine to do this from experience, that is, from data. Deep learning goes one step beyond this: it is a subset of machine learning that focuses explicitly on neural networks and how we can build neural networks that can extract features in the data, which you can think of as patterns that occur within the data, so that the machine can learn to complete these tasks. And that's exactly what this class is all about at its core: we're going to give you the foundational understanding of how we can build and teach computers to learn many different types of tasks directly from raw data. That's really what this class boils down to in its simplest form.

We'll provide a very solid foundation both on the technical side, through the lectures, which happen in two parts each day, the first lecture and the second lecture, each about one hour long, followed by a software lab that immediately follows the lectures and reinforces a lot of what we cover in the technical part of the class, giving you hands-on experience implementing those ideas. So this program is split between these two pieces: the technical lectures and the software labs. We have several new updates this year, especially in many of the later lectures. The first lecture covers the foundations of deep learning, which is what we're doing right now, and we'll conclude the course with some very exciting guest lectures from both academia and industry, from people who are really leading and driving forward the state of AI and deep learning. And of course we have many awesome prizes that go with all of the software labs and the project competition at the end of the course.

To go through these quickly: each day, like I said, we'll have dedicated software labs that couple with the lectures. Starting today with Lab 1, and keeping with this theme of generative AI, you'll build a neural network that can listen to a lot of music and learn how to generate brand new songs in that genre. At the next level of the class, on Friday, we'll host a project pitch competition, where you, either individually or as part of a group, can participate and present a novel deep learning idea to all of us. It'll be roughly three minutes in length, and because this is a one-week program we're not going to focus so much on the results of your pitch, but rather on the innovation, the idea, and the novelty of what you're trying to propose.
The prizes here are quite significant: first prize is an NVIDIA GPU, which is a key piece of hardware that is instrumental if you want to actually build a deep learning project and train these neural networks, which can be very large and require a lot of compute. These prizes give you the compute to do so. Finally, this year we'll be awarding a grand prize for Labs 2 and 3 combined, which will run on Tuesday and Wednesday and focus on what I believe are some of the most exciting problems in this field of deep learning: specifically, how we can build models that are not only accurate but also robust, trustworthy, and safe when they're deployed. You'll get experience developing the types of solutions that can actually advance the state of the art in AI. All of the labs and competitions I mentioned are due on Thursday night at 11 PM, right before the last day of class, and we'll be helping you all along the way. This competition in particular has very significant prizes, so I encourage all of you to enter and try to win. And like I said, there are many resources available throughout this class to help you: please post to Piazza if you have any questions, and this program has an incredible team that you can reach out to at any point in case you have any issues or questions about the materials. Ava and I will be your two main lecturers for the first part of the class, and in the later part of the class we'll also hear from guest lecturers who will share some really cutting-edge, state-of-the-art developments in deep learning. I also want to give a huge shout-out and thanks to all of our sponsors, without whose support this program would not have been possible, yet again for another year. So thank you all.

Okay, so now with that, let's really dive into the fun stuff of today's lecture, the technical part. I want to start by asking all of you to ask yourselves: why are you here? Why do you care about this topic in the first place? I think to answer this question we have to take a step back and think about the history of machine learning, what machine learning is, and what deep learning brings to the table on top of machine learning. Traditional machine learning algorithms typically define a set of features in the data, which you can think of as certain patterns in the data, and usually these features are hand engineered: a human with a lot of domain knowledge and experience looks at the dataset and tries to uncover what those features might be. The key idea of deep learning, and this is really central to this class, is that instead of having a human define these features, we have a machine look at all of this data and try to extract and uncover the core patterns in the data itself, so that it can use them to make decisions when it sees new data. For example, if we wanted to detect faces in an image, a deep neural network might actually learn that, in order to detect a face, it first has to detect low-level features like lines and edges.
When you combine those lines and edges, you can create compositions of features like corners and curves, and when you combine those, you can create higher-level features, for example eyes, noses, and ears, and those are the features that ultimately allow you to detect the thing you care about detecting: the face. All of this comes from what's called a hierarchical learning of features, and you can see some examples here: these are real features learned by a neural network, and how they're combined defines this progression of information.

In fact, the underlying and fundamental building blocks of neural networks and deep learning that I just described have existed for decades. So why are we studying all of this now, today, in this class, with all of this great enthusiasm? Well, several key advances have occurred in the past decade. Number one, data is so much more pervasive than it has ever been before in our lifetimes. These models are hungry for data, we're living in the age of big data, and more data is available to these models than ever before, and they thrive off of it. Secondly, these algorithms are massively parallelizable and require a lot of compute, and we're at a unique time in history where we have the ability to train these extremely large-scale models: techniques that have existed for a very long time, but that we can now train thanks to the hardware advances that have been made. And finally, thanks to open-source toolboxes and software platforms like TensorFlow, which all of you will get a lot of experience with in this class, training and building the code for these neural networks has never been easier; from the software point of view as well, there have been incredible advances in open sourcing the fundamentals of what you're going to learn.

So let me start by building up, from the ground up, the fundamental building block of every single neural network that you're going to learn about in this class, and that's a single neuron. In neural network language, a single neuron is called a perceptron. So what is a perceptron? A perceptron is, like I said, a single neuron, and it's a very simple idea, so I want to make sure that everyone in the audience understands exactly what a perceptron is and how it works. Let's start by defining a perceptron as taking as input a set of inputs. On the left-hand side, you can see this perceptron takes m different inputs, 1 through m, the blue circles, which we denote as x's. Each of these inputs is multiplied by a corresponding weight, which we call w: x1 is multiplied by w1, and so on, and we add the results of all of these multiplications together. We then take that single number and pass it through a nonlinear activation function, which produces the final output of the perceptron, which we call y.
Now, that picture of a perceptron isn't entirely complete; there's one step I haven't mentioned yet. In addition to multiplying all of these inputs by their corresponding weights, we also add what's called a bias term, denoted here as w0, which is just a scalar weight that you can think of as coming with an input of one. The bias allows the network to shift its nonlinear activation function as it sees its inputs. On the right-hand side you can see this diagram formulated mathematically as a single equation, and we can rewrite it in linear algebra terms using vectors and dot products: we define the entire set of inputs x1 through xm as a vector X, and take its dot product with the vector of weights w1 through wm, which both multiplies the corresponding entries and adds the resulting terms together; then we add the bias and apply the nonlinearity.

Now, you might be wondering what this nonlinear function is; I've mentioned it a few times already. It's a function that we pass the output of the neuron through before we hand it to the next neuron in the pipeline. One common example of a nonlinear function that's very popular in deep neural networks is the sigmoid function. You can think of it as a continuous version of a threshold function: it can take as input any real number on the real number line and squashes it to between zero and one; you can see it illustrated on the bottom right. In fact, there are many types of nonlinear activation functions that are popular in deep neural networks, and here are some common ones. Throughout this presentation you'll also see code snippets at the bottom of the slides, where we try to tie what you're learning in the lectures to actual software and how you can implement these pieces, which will help you a lot in the software labs. The sigmoid activation on the left is very popular since it outputs values between zero and one, which is especially important when you want to deal with probability distributions, because probabilities live between 0 and 1.
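Written out as an equation, the perceptron just described (dot product, bias, nonlinearity) takes this form, where g is the nonlinear activation function:

    \hat{y} = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right) = g\left(w_0 + X^{T} W\right)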
In modern deep neural networks, though, the ReLU function, which you can see on the far right, is a very popular activation function because it's piecewise linear and extremely efficient to compute, especially when computing its derivatives: its derivative is constant everywhere except at the kink at zero. Now, I hope many of you are asking yourselves why we even need this nonlinear activation function; it seems like it just complicates the whole picture when we didn't really need it in the first place. I want to spend a moment answering this, because the point of a nonlinear activation function is, number one, to introduce nonlinearity into our model. If we think about our data, almost all real-world data that we care about is highly nonlinear. This matters because if we want to deal with those kinds of datasets, we need models that are also nonlinear, so they can capture those same kinds of patterns. Imagine I gave you this dataset of red points and green points and asked you to separate the two. You might think this is easy, but what if I told you that you could only use a single line to do so? Now it becomes a very complicated problem; in fact, you can't really solve it effectively with a single line. Introducing nonlinear activation functions into your solution is exactly what allows you to deal with these kinds of nonlinear data, and that's exactly what makes neural networks so powerful at their core.
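To make those activation functions concrete, here is a minimal sketch of how they can be evaluated in TensorFlow; the input values are just for illustration and are not from the lecture.

    import tensorflow as tf

    z = tf.constant([-2.0, 0.0, 2.0])  # example pre-activation values

    print(tf.math.sigmoid(z))  # squashes any real number into (0, 1)
    print(tf.math.tanh(z))     # squashes into (-1, 1)
    print(tf.nn.relu(z))       # zero for negative inputs, identity for positive inputs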
So let's understand this with a very simple example, walking through the diagram of a perceptron one more time. Imagine I give you a trained neural network with concrete numbers in place of the weights: the trained bias w0 is 1, and W is the vector (3, -2). This neural network has two inputs, x1 and x2, and if we want to get its output, there are three steps we need to take; this is also the main thing I want all of you to take away from this lecture today. We first compute the multiplication of our inputs with our weights, then add the results together along with the bias, and then compute a nonlinearity. These three steps define the forward propagation of information through a perceptron.

So let's take a look at how that works. If we plug these numbers into those equations, everything inside our nonlinearity (the nonlinearity here is g, which could be the sigmoid we saw on a previous slide) is just a two-dimensional line in the inputs. If we consider the space of all possible inputs that this neural network could see, we can plot that line as a decision boundary: a line separating the two halves of the input space. In fact, there's also a directionality component, depending on which side of the boundary an input lies on. If we see the input (-1, 2), for example, we know it lives on one particular side of the boundary and will produce a certain type of output: in this case, plugging those components into our equation gives a negative number, so the output that passes through the nonlinearity will be below one half. If you're on the other side of the space, you get the opposite result, and the thresholding behavior of the sigmoid function essentially lives at this decision boundary, controlling which side you fall on.

In this particular example this is very convenient, because I can draw the full input space for you on a slide: it's only two-dimensional, so it's very easy to visualize. But for almost all problems we care about, our data points are not going to be two-dimensional. If you think about an image, its dimensionality is the number of pixels it has, so we're talking about thousands of dimensions, millions of dimensions, or even more, and drawing these kinds of plots is simply not feasible. So we can't always do this, but hopefully this gives you some intuition as we build up to more complex models.
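As a sanity check on that example, here is a small sketch of the perceptron's forward pass with the weights stated in the lecture, w0 = 1 and W = (3, -2), and the example input (-1, 2); the three steps are exactly the dot product, the bias, and the nonlinearity.

    import tensorflow as tf

    w0 = tf.constant(1.0)             # bias
    W = tf.constant([[3.0], [-2.0]])  # weight vector, shape (2, 1)
    x = tf.constant([[-1.0, 2.0]])    # example input, shape (1, 2)

    # Steps 1 and 2: dot product of inputs and weights, plus the bias.
    z = tf.matmul(x, W) + w0          # 1 + 3*(-1) + (-2)*2 = -6

    # Step 3: nonlinear activation.
    y = tf.math.sigmoid(z)            # sigmoid(-6) is roughly 0.0025
    print(z.numpy(), y.numpy())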
So now that we have an idea of the perceptron, let's see how we can take this single neuron and start to build it up into something more complicated: a full neural network. Let's revisit the diagram of the perceptron one more time, just to reiterate the core piece of information I want all of you to take away from this class: how a perceptron works and how it propagates information to its decision. There are three steps: first the dot product, second the bias, and third the nonlinearity, and you keep repeating this process for every single perceptron in your neural network. Let's simplify the diagram a little bit. I'll get rid of the weight labels, and you can assume that every line now has an associated scalar weight corresponding to the input coming in along that line. I've also removed the bias, just for the sake of simplicity, but it's still there. The result, z, which is the dot product plus the bias, is what we pass into our nonlinear activation function, and the final output is simply g(z). You can think of z as the state of this neuron: the result of that dot product plus the bias, before the activation.

Now, if we want to define a multi-output neural network, say with two outputs, it's a very simple procedure: we just have two neurons, two perceptrons, and each perceptron controls the output for its associated piece. Both take all of the same inputs. And amazingly, with this mathematical understanding, we can now start to build our first neural network entirely from scratch. What does that look like? We can start by initializing the two components of a layer: the first component is the weight matrix, and the second component is the bias vector. The only remaining step after we've defined these parameters of our layer is to define how forward propagation of information works, and that's exactly the three main components I've been stressing: we can create a call function that matrix-multiplies our inputs with our weights, adds the bias, applies a nonlinearity, and returns the result. That code, and it isn't much code, defines a full neural network layer that you can then use. And luckily for all of you, all of that code has been abstracted away by libraries like TensorFlow: you can simply call a built-in layer that replicates exactly that piece of code, so you don't need to copy it all down yourself; you can just call it.
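For reference, here is a minimal sketch of the kind of from-scratch layer just described, together with the built-in TensorFlow layer that abstracts it away; the layer sizes, initializers, and choice of sigmoid here are illustrative assumptions rather than anything prescribed by the lecture.

    import tensorflow as tf

    class MyDenseLayer(tf.keras.layers.Layer):
        def __init__(self, input_dim, output_dim):
            super().__init__()
            # Initialize the layer's parameters: a weight matrix and a bias vector.
            self.W = self.add_weight(shape=(input_dim, output_dim), initializer="random_normal")
            self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

        def call(self, inputs):
            # Forward propagation: matrix multiply, add the bias, apply the nonlinearity.
            z = tf.matmul(inputs, self.W) + self.b
            return tf.math.sigmoid(z)

    # The built-in equivalent, which implements the same three steps:
    layer = tf.keras.layers.Dense(units=2, activation="sigmoid")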
And with that understanding, you've just seen how to build a single layer, but of course now you can start to think about how to stack these layers as well. Since we now have this transformation from our inputs to a hidden output, you can think of it as a way of transforming those inputs into some new dimensional space, perhaps one closer to the value that we want to predict; that transformation is eventually going to be learned, so the network knows how to transform the inputs into the desired outputs, and we'll get to that later. For now, the piece I want to focus on is that these more complex neural networks are nothing more complex than what we've already seen. If we focus on just one neuron in this diagram, take z2 for example, the neuron highlighted in the middle layer, it's just the same perceptron we've been seeing so far in this class: its output is obtained by taking a dot product over all of its inputs, adding a bias, and applying the nonlinearity. If we look at a different node, for example z3, the one right below it, it's the exact same story: it sees all of the same inputs, but it applies a different set of weights to those inputs, so it will produce a different output, but the mathematical equations are exactly the same. From now on I'm going to simplify the diagrams and just show these icons in the middle to indicate that everything is fully connected to everything and defined by the equations we've been covering; there's no extra complexity in these models beyond what you've already seen.

Now, if you want to stack these layers on top of each other, you can not only define one layer very easily but also create what are called sequential models, where you define one layer after another. They describe the forward propagation of information not just at the neuron level but at the layer level: every layer is fully connected to the next layer, and the inputs of each subsequent layer are all of the outputs of the prior layer. And if you want to create a very deep neural network, all a deep neural network is is stacking these layers on top of each other; there's really nothing more to the story than that. You just keep stacking them until you get to the last layer, which is your output layer, the final prediction you want to output. We can create a deep neural network that does all of this by stacking layers and creating the more hierarchical models we saw very early at the beginning of today's lecture, where the final output is computed by going deeper and deeper through this progression of layers.
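A minimal sketch of such a stacked, sequential model in TensorFlow follows; the number of layers, their sizes, and the activations are illustrative choices, not values given in the lecture.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units=32, activation="relu"),    # hidden layer 1
        tf.keras.layers.Dense(units=32, activation="relu"),    # hidden layer 2
        tf.keras.layers.Dense(units=1, activation="sigmoid"),  # output layer
    ])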
Okay, so we've now seen how to go from a single neuron to a layer all the way to a deep neural network, building off of these foundational principles. Let's take a look at how we can use the principles we've just discussed to solve a very real problem that I think all of you were probably very concerned about when you woke up this morning: will I pass this class? To answer this question, let's see if we can train a neural network to solve this problem. We'll start with a very simple neural network with just two inputs: one input is the number of lectures that you attend over the course of this one week, and the second input is how many hours you spend on your final project or competition.

So what we're going to do is first go out and collect a lot of data from all of the past years that we've taught this course, and because it's only a two-dimensional input space, we can plot this data on a two-dimensional feature space. We can look at all of the students before you who passed the class and who failed the class, and see where they fall in this space of hours spent and lectures attended: green points are the people who passed, red are those who failed. And here you are: your coordinates are four and five, so you fall right there, having attended four lectures and spent five hours on your final project, and we want to build a neural network to answer the question of whether you will pass the class.

So let's do it. We have two inputs, one is four and one is five; these are two numbers we can feed through a neural network like the one we've just seen how to build. We feed them into a single-layer neural network with three hidden units in this example, though we could make it larger if we wanted it to be more expressive and more powerful, and we see that the predicted probability of you passing this class is 0.1, which is pretty dismal. So why would this be the case? What did we do wrong? It doesn't seem correct: when we looked at the feature space, you actually looked like a good candidate to pass the class, so why is the neural network saying there's only a 10 percent likelihood that you will pass? Does anyone have any ideas?

Exactly, exactly. This neural network was, in effect, just born: it has no information about the world or about this class; it doesn't know what four and five mean, or what the notion of passing or failing means. This neural network has not been trained. You can think of it kind of like a baby: it hasn't learned anything yet. So our job is first to train it, and part of that is that we first need to tell the neural network when it makes mistakes. Mathematically, we should now think about how to answer this question: did my neural network make a mistake, and if it made a mistake, how big was it, so that the next time it sees this data point it can do better and minimize that mistake? In neural network language, those mistakes are called losses. Specifically, we want to define what's called a loss function, which takes as input our prediction and the true answer; how far apart they are tells us how big the loss is. Before going further, let me give you some terminology: there are multiple different ways of saying the same thing in neural networks and deep learning, and what I just described as a loss function is also commonly referred to as an objective function, empirical risk, or cost function. These are all exactly the same thing: a way for us to train the neural network by telling it when it makes mistakes. And ultimately, we don't just care about the mistake on one data point: over the course of an entire dataset, we want to minimize, on average, all of the mistakes that this neural network makes.
So if we look at our problem of binary classification, will I pass this class or will I not, there's a yes-or-no answer, and that means binary classification. Here we can use what's called the softmax cross-entropy loss. For those of you who aren't familiar, the notion of entropy on which cross-entropy is built was developed here at MIT by Claude Shannon, a visionary who did his master's here decades ago, and it has been pivotal to our ability to train these types of neural networks, even now and into the future. Now, instead of predicting a binary output, what if we wanted to predict the final grade of your class project, say a score out of 100 points? That's no longer a binary yes-or-no output; it's a continuous variable. For this type of problem we can use what's called a mean squared error loss: you can think of it as literally subtracting your predicted grade from the true grade, squaring the difference, and minimizing that distance.
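As a rough sketch of how these two kinds of losses look in code, using TensorFlow's built-in loss objects; the label and prediction tensors here are made-up examples, not data from the lecture.

    import tensorflow as tf

    # Binary classification: cross-entropy between true labels and predicted probabilities.
    y_true = tf.constant([[1.0], [0.0]])  # pass / fail labels (illustrative)
    y_prob = tf.constant([[0.1], [0.3]])  # predicted probability of passing
    bce = tf.keras.losses.BinaryCrossentropy()
    print(bce(y_true, y_prob).numpy())

    # Regression, e.g. predicting a final grade: mean squared error.
    grade_true = tf.constant([[85.0], [70.0]])
    grade_pred = tf.constant([[80.0], [75.0]])
    mse = tf.keras.losses.MeanSquaredError()
    print(mse(grade_true, grade_pred).numpy())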
So I think now we're ready to put all of this information together and tackle the problem of training a neural network: not just identifying how erroneous it is, how large its loss is, but, more importantly, minimizing that loss as a function of all the training data it observes. We want to find the neural network that minimizes this empirical loss, averaged across our entire dataset. Mathematically, this means we want to find the weights W that minimize J(W), where J(W) is our loss function averaged over the entire dataset and W are our weights: the set of weights that, on average, gives us the smallest possible loss. Remember that W here is just a collection of all of the weights in our neural network. You may have hundreds of weights in a very small neural network, or billions or trillions of weights in today's networks, and you want to find the value of every single one of these weights that results in the smallest loss possible.

How can we do this? Remember that our loss function J(W) is just a function of our weights: for any instantiation of the weights, we can compute a scalar value of how erroneous our neural network would be with that instantiation. Let's try to visualize this in a very simple example: a two-dimensional space where we have only two weights, an extremely small two-weight neural network, and we want to find the optimal weights to train it. We can plot how erroneous the network is for every single instantiation of these two weights. It's a huge, in fact infinite, space, but we have a function we can evaluate at every point in it, and what we ultimately want to find is the set of weights that gives us the smallest possible loss: the lowest point on this landscape.

The way we do this is by first starting at a random place. We have no idea where to start, so we pick a random point in this space and start there. We evaluate our neural network at that location by computing the loss, and on top of that we compute how the loss is changing: the gradient of the loss. Because our loss function is a continuous function, we can compute derivatives across the space of weights, and the gradient tells us the direction of steepest ascent, that is, from where we stand, which way to go to increase the loss. Of course, we don't want to increase the loss, we want to decrease it, so we negate the gradient and take a step in the opposite direction, which brings us one step closer to the bottom of the landscape. We just keep repeating this process over and over again: evaluate the neural network at the new location, compute its gradient, and step in that new direction, traversing this landscape until we converge to a minimum.

We can summarize this algorithm, which is known formally as gradient descent, like this: initialize all of the weights (this can be two weights, like in the previous example, or billions of weights in real neural networks), compute the gradient, the partial derivative of the loss with respect to the weights, and then update the weights in the opposite direction of that gradient, taking a small step, denoted here by eta. This small step is commonly referred to as the learning rate: it reflects how much we trust that gradient and how far we step in its direction, and we'll talk more about it later. Just to give you some sense of the code, this algorithm translates very directly into real code: for every line of the pseudocode on the left, you can see corresponding real code on the right that is runnable and directly implementable by all of you in your labs.
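As a rough sketch of how that pseudocode maps onto runnable TensorFlow: the loss function below is a stand-in quadratic just to make the loop executable; in the labs it would be one of the losses above, evaluated on your data.

    import tensorflow as tf

    lr = 0.01                                      # learning rate (eta)
    weights = tf.Variable(tf.random.normal([2]))   # randomly initialize the weights

    def compute_loss(w):
        # Placeholder loss: a simple quadratic bowl with its minimum at (3, -2).
        return tf.reduce_sum((w - tf.constant([3.0, -2.0])) ** 2)

    for _ in range(1000):
        with tf.GradientTape() as tape:
            loss = compute_loss(weights)
        grad = tape.gradient(loss, weights)        # dJ/dW via automatic differentiation
        weights.assign_sub(lr * grad)              # step in the opposite direction of the gradient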
But now let's take a look specifically at this gradient term, which we touched on very briefly in the visual example. It tells us how the loss is changing as a function of the weights: as the weights move around, will my loss increase or decrease? That tells the neural network whether it needs to move its weights in a certain direction or not. But I never actually told you how to compute this, and that's an extremely important part, because if you don't know it, you can't train your neural network. The process of computing this gradient is known as backpropagation, so let's do a very quick intro to backpropagation and how it works.

Again, let's start with the simplest neural network in existence: one input, one output, and only one neuron. We want to compute the gradient of our loss with respect to a weight, in this case W2, the second weight. This derivative tells us how much a small change in this weight will affect our loss: if we change the weight a little bit in one direction, will the loss increase or decrease? To compute it, we can write out this derivative by applying the chain rule backwards from the loss function through the output. Specifically, we can decompose the derivative into two components: the derivative of our loss with respect to our output, multiplied by the derivative of our output with respect to W2. This is just a standard instantiation of the chain rule applied to the original derivative on the left-hand side. Now suppose we wanted to compute the gradient for the weight before that, which in this case is not W2 but W1. All we do is replace W2 with W1, and the chain rule still holds, but now, for that last component of the chain rule, we have to recursively apply one more chain rule, because that's again a derivative we can't directly evaluate: we expand it once more with another instantiation of the chain rule. With all of these components, we can directly propagate these gradients through the hidden units of our neural network, all the way back to the weight we're interested in. We first computed the derivative with respect to W2, then we backpropagated that and used that information for W1 as well; that's why we call it backpropagation, because this process occurs from the output all the way back to the input.

We repeat this process many, many times over the course of training, propagating these gradients over and over again through the network, from the output to the inputs, to determine, for every single weight, how much a small change in that weight affects the loss function, whether it increases it or decreases it, and how we can use that to improve the loss, because that is ultimately our final goal. So that's the backpropagation algorithm. It's the core of training neural networks, and in theory it's very simple: it's really just an instantiation of the chain rule.
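Written out, the two chain-rule decompositions just described look like this, using \hat{y} for the network's output, z_1 for the hidden state, and J(W) for the loss:

    \frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2},
    \qquad
    \frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}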
But let's touch on some insights that make training neural networks extremely complicated in practice, even though the backpropagation algorithm is simple and many decades old. In practice, optimization of neural networks looks nothing like the picture I showed you before. There are ways to visualize the loss landscapes of very large, deep neural networks; this is an illustration from a paper that came out several years ago in which the authors visualized the landscape of a very deep neural network, and that's what it actually looks like. That's what you have to deal with when trying to find the minimum in this space, and you can imagine the challenges that come with it.

To cover those challenges, let's first recall the update equation defined in gradient descent. I didn't talk much about the parameter eta, but now let's spend a bit of time thinking about it. This is called the learning rate, like we saw before: it determines how big a step we take in the direction of our gradient on every single iteration of backpropagation. In practice, even setting the learning rate can be very challenging. You, as the designer of the neural network, have to set this value, and how do you pick it? That can actually be quite difficult, and it has very large consequences when building a neural network. For example, if we set the learning rate too low, then we learn very slowly: assuming we start on the right-hand side at that initial guess, with a learning rate that's not large enough we not only converge slowly, we may not even converge to the global minimum, because we get stuck in a local minimum. And what if we set the learning rate too high? We can overshoot and actually start to diverge from the solution; the gradients can explode, very bad things happen, and the neural network doesn't train, so that's not good either. In reality, there's a happy medium between too small and too large, where the learning rate is just large enough to overshoot some of these local minima, putting you in a reasonable part of the search space where you can then converge on the solutions you care most about.

But how do you actually set the learning rate in practice? One option, and this is actually very common in practice, is simply to try out a bunch of learning rates and see what works best: try a whole grid of different learning rates, train all of those neural networks, and see which one works best. But I think we can do something a lot smarter. Instead of exhaustively trying out a whole bunch of different learning rates, can we design a learning-rate algorithm that adapts to our neural network and its landscape, something a bit more intelligent than that previous idea? This ultimately means that the learning rate, the speed at which the algorithm trusts the gradients it sees, is going to depend on how large the gradient is in that location, how fast we're learning, and many other factors that come up during training. These adaptive learning rate algorithms have been very widely studied in practice, and there is a thriving community in deep learning research that focuses on developing and designing new algorithms for learning-rate adaptation and faster optimization of large neural networks. During your labs, you'll get the opportunity not only to try out a lot of these different adaptive algorithms, which you can see here, but also to try to uncover the patterns and the benefits of one versus another, and I think you'll find that very insightful.
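In TensorFlow, these optimizers are available as drop-in objects; here is a brief sketch. The specific learning rates are illustrative defaults, and the particular set of optimizers shown is an assumption rather than a list given in the lecture.

    import tensorflow as tf

    # Plain (stochastic) gradient descent with a fixed learning rate.
    sgd = tf.keras.optimizers.SGD(learning_rate=0.01)

    # Adaptive learning-rate optimizers.
    adam = tf.keras.optimizers.Adam(learning_rate=0.001)
    rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
    adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)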
Another key component of your labs is how you can put all of the information we've covered today into a single picture, which looks roughly like this. At the top you define your model, which we talked about in the beginning part of the lecture. Alongside your model you need to define the optimizer, which we've just talked about; the optimizer is defined together with a learning rate, which controls how quickly you move across the loss landscape. Then, over many loops, you pass over all of the examples in your dataset and observe how to improve your network, that's the gradient, and then actually improve the network in those directions, and you keep doing that over and over again until eventually your neural network converges to some sort of solution.
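Putting the pieces together, that single picture looks roughly like this sketch; the model architecture, the synthetic dataset, the loss, and the learning rate are all illustrative assumptions.

    import tensorflow as tf

    # A small model, optimizer, and loss, as described above.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    loss_fn = tf.keras.losses.BinaryCrossentropy()

    # A synthetic dataset, just so the loop below is runnable.
    xs = tf.random.normal([256, 2])
    ys = tf.cast(tf.reduce_sum(xs, axis=1, keepdims=True) > 0, tf.float32)
    dataset = tf.data.Dataset.from_tensor_slices((xs, ys)).batch(32)

    # The training loop: forward pass, loss, gradients, weight update.
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            predictions = model(x_batch)
            loss = loss_fn(y_batch, predictions)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))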
In the remaining time, I want to briefly continue with tips for training these neural networks in practice, focusing on the very powerful idea of batching your data into what are called mini-batches of smaller pieces of data. To do this, let's revisit the gradient descent algorithm. The gradient we talked about before is actually extraordinarily expensive to compute, because it's computed as a summation across all of the data points in your dataset, and in most real-world problems it's simply not feasible to compute a gradient over your entire dataset; datasets are just too large these days. So what are the alternatives? Instead of computing the gradient across your entire dataset, what if you computed it over just a single example? Well, this estimate of the gradient is going to be exactly that: an estimate, and a very noisy one. It may roughly reflect the trends of your entire dataset, but because it comes from only one example it can be very noisy. The advantage, though, is that it's obviously much faster to compute, because it's a single example, so computationally this has huge advantages, but the downside is that it's extremely stochastic. That's why this algorithm is not called gradient descent; it's called stochastic gradient descent.

Now, what's the middle ground? Instead of computing the gradient with respect to one example, what if we computed it over what's called a mini-batch of examples: a small batch that we compute the gradients over? These gradients are still computationally efficient to compute, because a mini-batch isn't too large, maybe on the order of tens or hundreds of examples, but more importantly, because we've expanded from a single example to, say, 100 examples, the stochasticity is significantly reduced and the accuracy of our gradient is much improved. Normally we're thinking of mini-batch sizes roughly on the order of tens or hundreds of data points. This is much faster to compute than full gradient descent and much more accurate than stochastic gradient descent with its single-example estimate. This increase in gradient accuracy allows us to converge to our solution much more quickly than would have been possible with plain gradient descent, given its practical limitations. It also means we can increase our learning rate, because we can trust each of those gradients more: we're now averaging over a batch, which is going to be much more accurate than the stochastic version, so we can increase the learning rate and actually learn faster as well. And it allows us to massively parallelize the computation: we can split batches up onto separate workers and achieve even more significant speedups of this entire problem using GPUs.
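A minimal sketch of what mini-batching looks like in TensorFlow, using a batch size of 100 examples as in the discussion above; the dataset tensors are placeholders for your own data.

    import tensorflow as tf

    xs = tf.random.normal([1000, 2])   # stand-in inputs
    ys = tf.random.uniform([1000, 1])  # stand-in labels

    # Shuffle the data and group it into mini-batches of 100 examples;
    # each iteration of the training loop then sees one mini-batch.
    dataset = (tf.data.Dataset.from_tensor_slices((xs, ys))
               .shuffle(buffer_size=1000)
               .batch(100))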
The last topic I want to very briefly cover in today's lecture is overfitting. When we're optimizing a neural network with stochastic gradient descent, we face this challenge of overfitting, which roughly looks like this. We want to build a machine learning model that can accurately describe the patterns in our data, but remember, ultimately we don't just want to describe the patterns in our training data; ideally we want to capture the patterns in our test data. Of course, we don't observe the test data, we only observe the training data, so we have this challenge of extracting patterns from the training data and hoping that they generalize to the test data. Said another way, we want to build models that learn representations from the training data that still generalize even when we show them brand new, unseen test data. Suppose you want to fit a line that describes the patterns in the points you can see on the slide. If you use a very simple model, just a single straight line, you describe this data suboptimally: the data here is nonlinear, so you won't accurately capture all of the nuances and subtleties in the dataset; that's the left-hand side. If you move to the right-hand side, you see a much more complicated model, but here you're too expressive: you're capturing the spurious nuances in your training data that are not actually representative of your test data. Ideally you want to end up with the model in the middle, the middle ground: not too complex and not too simple, which still performs well even when you give it brand new data.

To address this problem, let's briefly talk about regularization. Regularization is a technique you can introduce into your training pipeline to discourage complex models from being learned. As we've seen, this is really critical because neural networks are extremely large models that are extremely prone to overfitting, so regularization techniques have huge implications for the success of neural networks and for having them generalize beyond the training data, far into the testing domain. The most popular regularization technique in deep learning is called dropout, and the idea of dropout is actually very simple. Let's revisit the picture of a deep neural network we drew earlier in today's lecture. In dropout, during training, we randomly select some subset of the neurons in the network and prune them out with some probability. For example, we can randomly select this subset of neurons with a probability of 50 percent and, with that probability, randomly turn them off or on on different iterations of training. This essentially forces the neural network to learn what you can think of as an ensemble of different models: on every iteration it is exposed to a different internal model than the one it had on the last iteration, so it has to learn how to build internal pathways to process the same information, and it can't rely on information it learned on previous iterations. This forces it to capture some deeper meaning within the pathways of the neural network, and it can be extremely powerful: number one, it lowers the capacity of the neural network significantly, by roughly 50 percent in this example, and it also makes the network easier to train, because the number of weights that receive gradients on each iteration is reduced, so training is faster as well. And like I mentioned, on every iteration we randomly drop out a different set of neurons, and that helps the model generalize better.

The second regularization technique, which is actually a very broad technique that extends far beyond neural networks, is simply called early stopping. We know that the definition of overfitting is, at its core, when our model starts to represent the training data more than the testing data. If we set aside some of the training data and don't train on it, we can use it as a kind of synthetic test set and monitor how our network performs on this unseen portion of data. Over the course of training, we can plot the performance of our network on both the training set and this held-out set. As the network trains, we'll see that both losses decrease at first, but then there comes a point where the held-out test loss plateaus and starts to increase. This is exactly the point where you start to overfit, because you're now starting to fit too closely to your training data, and this pattern continues for the rest of training. This middle point is where we need to stop training, because after this point, assuming the held-out set is a valid representation of the true test set, the accuracy of the model on unseen data will only get worse. This is where we want to early-stop our model and regularize its performance. Stopping any earlier than this point is also not good: we'd produce an underfit model when we could have had a better model on the test data. So it's a trade-off: you can't stop too late, and you can't stop too early either.
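In TensorFlow, both of these regularization techniques can be added with a few lines; this is a sketch in which the layer sizes, the 0.5 dropout rate (matching the example above), and the early-stopping patience are illustrative, and the commented-out calls assume you have training data named x_train and y_train.

    import tensorflow as tf

    # Dropout: on each training iteration, randomly zero out 50 percent of the
    # activations coming out of the previous layer.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

    # Early stopping: monitor the loss on held-out validation data and stop
    # training once it stops improving.
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
    # model.compile(optimizer="adam", loss="binary_crossentropy")
    # model.fit(x_train, y_train, validation_split=0.2, callbacks=[early_stop])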
So I'll conclude this lecture by summarizing the three key points we've covered so far. First, the fundamental building block of all neural networks: the single neuron, the perceptron. Second, how we build these up into larger neural layers and, from there, into neural networks and deep neural networks. And third, how we can train these models, apply them to datasets, backpropagate through them, and use some of the tips and tricks we've seen for optimizing these systems end to end. In the next lecture we'll hear from Ava on deep sequence modeling using RNNs, and specifically the very exciting Transformer architecture and attention mechanisms. Let's resume the class in about five minutes, after we have a chance to swap speakers. Thank you so much for your attention.
Info
Channel: Alexander Amini
Views: 1,892,610
Keywords: deep learning, mit, artificial intelligence, neural networks, machine learning, introduction to deep learning, intro to deep learning, 6s191, 6.s191, mit 6.s191, mit 6s191, mit deep learning, alexander amini, amini, lecture 1, ava soleimany, tensorflow, computer vision, deepmind, openai, basics, introduction, deeplearning, tensorflow tutorial, what is deep learning, deep learning basics, deep learning python, andrew ng
Id: QDX-1M5Nj7s
Length: 58min 12sec (3492 seconds)
Published: Fri Mar 10 2023