MIT Introduction to Deep Learning | 6.S191

Good afternoon everyone! Thank you all for joining today. My name is Alexander Amini and I'll be one of your course organizers this year, along with Ava, and together we're super excited to introduce you all to Introduction to Deep Learning. Now, MIT Intro to Deep Learning is a really fun, exciting, and fast-paced program here at MIT, and let me start by giving you a little bit of background on what we do and what you're going to learn about this year.

So in this week of Intro to Deep Learning we're going to cover a ton of material in just one week. You'll learn the foundations of this really fascinating and exciting field of deep learning and artificial intelligence and, more importantly, you're going to get hands-on experience actually reinforcing what you learn in the lectures as part of hands-on software labs.

Now, over the past decade AI and deep learning have really had a huge resurgence and many incredible successes, and a lot of problems that even just a decade ago we thought were not really solvable in the near future we're now solving with deep learning with incredible ease. This past year in particular, 2022, has been an incredible year for deep learning progress, and I like to say that this past year has been the year of generative deep learning: using deep learning to generate brand new types of data that have never been seen before and never existed in reality. In fact, I want to start this class by showing you how we started this class several years ago, which was by playing this video that I'll play in a second. This video was an introductory video for the class, and it kind of exemplifies this idea that I'm talking about. So let me just stop there and play this video first of all.

Hi everybody and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT.
Deep learning is revolutionizing so many fields: from robotics to medicine and everything in between. You'll learn the fundamentals of this field and how you can build some of these incredible algorithms. In fact, this entire speech and video are not real and were created using deep learning and artificial intelligence. And in this class you'll learn how. It has been an honor to speak with you today and I hope you enjoy the course.

So in case you couldn't tell, this video and its entire audio were actually not real; they were synthetically generated by a deep learning algorithm. This video was created several years ago, and even then, when we introduced it and put it on YouTube, it went somewhat viral. People really loved this video. They were intrigued by how real the video and audio felt and looked, entirely generated by an algorithm, by a computer, and people were shocked by the power and the realism of these types of approaches. And this was a few years ago.

Now fast forward to today. We have seen deep learning accelerating at a rate faster than we've ever seen before. In fact, we can use deep learning now to generate not just images of faces but full synthetic environments where we can train autonomous vehicles entirely in simulation and deploy them on full-scale vehicles in the real world seamlessly. The videos you see here are actually from a data-driven simulator, generated from neural networks, called VISTA, that we built here at MIT and have open-sourced to the public, so all of you can actually train and build the future of autonomy and self-driving cars.

And of course it goes far beyond this as well. Deep learning can be used to generate content directly from how we speak and the language that we convey to it. From prompts that we say, deep learning can reason about the prompts in
natural language, English for example, and then guide and control what is generated according to what we specify. We've seen examples where we can generate things that, again, have never existed in reality. We can ask a neural network to generate a photo of an astronaut riding a horse, and it can actually imagine, hallucinate, what this might look like, even though not only has this photo never occurred before, but I don't think any photo of an astronaut riding a horse has ever occurred before, so there's not really even training data that you could go off of in this case.

And my personal favorite is how we can not only build software that can generate images and videos, but build software that can generate software as well. We can have algorithms that take language prompts, for example a prompt like this: "write code in TensorFlow to train a neural network". Not only will it write the code and create that neural network, but it will have the ability to reason about the code that it's generated and walk you through it step by step, explaining the process and procedure all the way from the ground up so that you can actually learn how to do this process as well.

Now I think some of these examples really just highlight how far deep learning and these methods have come in the past six years since we started this course. You saw that example from just a few years ago in that introductory video, but now we're seeing such incredible advances, and the most amazing part of this course, in my opinion, is that within this one week we're going to take you through, from the ground up, starting from today, all of the foundational building blocks that will allow you to understand and make all of these amazing advances possible.

So with that, hopefully now you're all super excited about what this class will teach, and I want to now start by taking a
step back and introducing some of the terminology that I've been throwing around so far. Deep learning, artificial intelligence: what do these things actually mean?

First of all, I want to take a second to speak a little bit about intelligence and what intelligence means at its core. To me, intelligence is simply the ability to process information such that we can use it to inform some future decision or action that we take. The field of artificial intelligence is simply the ability for us to build algorithms, artificial algorithms, that can do exactly this: process information to inform some future decision. Machine learning is a subset of AI which focuses specifically on how we can teach a machine to do this from some experiences, or data for example. Deep learning goes one step beyond this and is a subset of machine learning which focuses explicitly on what are called neural networks, and on how we can build neural networks that can extract features in the data. You can think of these features as patterns that occur within the data, which the network uses to learn to complete these tasks.

That's exactly what this class is all about at its core. We're going to try to teach you and give you the foundational understanding of how we can build and teach computers to learn tasks, many different types of tasks, directly from raw data. That's really what this class boils down to in its most simple form. We'll provide a very solid foundation for you on the technical side through the lectures, which will happen in two parts throughout the class, the first lecture and the second lecture, each about one hour long, followed immediately by a software lab, which will try to reinforce a lot of what we cover in the technical part of the class and give you hands-on experience
implementing those ideas. So this program is split between these two pieces: the technical lectures and the software labs. We have several new updates this year, especially in many of the later lectures. The first lecture, which is going to be right now, will cover the foundations of deep learning, and we'll conclude the course with some very exciting guest lectures from both academia and industry, from people who are really leading and driving forward the state of AI and deep learning. And of course we have many awesome prizes that go with all of the software labs and the project competition at the end of the course.

So, to quickly go through these: each day, like I said, we'll have dedicated software labs that couple with the lectures. Starting today with lab one, keeping with this theme of generative AI, you'll build a neural network that can listen to a lot of music and actually learn how to generate brand new songs in that genre of music.

At the end of the class, on Friday, we'll host a project pitch competition where you, either individually or as part of a group, can participate and present a novel deep learning idea to all of us. It'll be roughly three minutes in length, and because this is a one-week program, we're not going to focus so much on the results of your pitch but rather on the innovation, the idea, and the novelty of what you're trying to propose. The prizes here are quite significant: first prize is going to get an NVIDIA GPU, which is a key piece of hardware that is instrumental if you want to actually build a deep learning project and train these neural networks, which can be very large and require a lot of compute. These prizes will give you the compute to do so.

And finally, this year we'll be awarding a grand prize for labs two and three combined, which will occur on Tuesday and
Wednesday, focused on what I believe are some of the most exciting problems in this field of deep learning: specifically, how we can build models that are not only accurate but also robust, trustworthy, and safe when they're deployed. You'll actually get experience developing those types of solutions that can advance the state of the art in AI.

Now, all of these labs and competitions that I mentioned are going to be due on Thursday night at 11 PM, right before the last day of class, and we'll be helping you all along the way. This competition in particular has very significant prizes, so I encourage all of you to enter and try to get a chance to win the prize.

Like I said, we're going to be helping you all along the way. There are many resources available throughout this class to help you achieve this. Please post to Piazza if you have any questions, and of course this program has an incredible team that you can reach out to at any point in case you have any issues or questions on the materials. Ava and I will be your two main lecturers for the first part of the class. In the later part of the class we'll also be hearing, like I said, from some guest lecturers who will share some really cutting-edge, state-of-the-art developments in deep learning. And of course I want to give a huge shout-out and thanks to all of our sponsors, without whose support this program wouldn't have been possible for yet another year. So thank you all.

Okay, so now with that, let's dive into the really fun stuff of today's lecture, which is the technical part. I want to start this part by having all of you ask yourselves this question: why are all of you here? Why do you care about this topic in the first place? Now, I
think to answer this question we have to take a step back and think about the history of machine learning, what machine learning is, and what deep learning brings to the table on top of machine learning. Traditional machine learning algorithms typically define a set of features in the data. You can think of these as certain patterns in the data, and usually these features are hand-engineered: a human will come into the data set and, with a lot of domain knowledge and experience, try to uncover what these features might be.

The key idea of deep learning, and this is really central to this class, is that instead of having a human define these features, what if we could have a machine look at all of this data and try to extract and uncover the core patterns in the data, so that it can use those when it sees new data to make some decisions? For example, if we wanted to detect faces in an image, a deep neural network might learn that in order to detect a face it first has to detect things like lines and edges in the image. When you combine those lines and edges you can create compositions of features like corners and curves, which, when you combine them, create more high-level features like eyes, noses, and ears. Those are the features that allow you to ultimately detect what you care about detecting, which is the face. All of this comes from what is called a hierarchical learning of features, and you can actually see some examples here: these are real features learned by a neural network, and how they're combined defines this progression of information.

But in fact, what I just described, this underlying and fundamental building block of neural networks and deep learning, has actually existed for decades. Now, why are we studying all of this now and today
in this class, with all of this great enthusiasm to learn this? Well, for one, there have been several key advances in the past decade. Number one is that data is so much more pervasive than it has ever been before in our lifetimes. These models are hungry for more data, and we're living in the age of big data: more data is available to these models than ever before, and they thrive off of that. Secondly, these algorithms are massively parallelizable and require a lot of compute, and we're at a unique time in history where we have the ability to train these extremely large-scale algorithms. The techniques have existed for a very long time, but we can now train them due to the hardware advances that have been made. And finally, due to open-source toolboxes and software platforms like TensorFlow, which all of you will get a lot of experience with in this class, training and building the code for these neural networks has never been easier. So from the software point of view as well, there have been incredible advances to open-source the underlying fundamentals of what you're going to learn.

So let me start now with building, from the ground up, the fundamental building block of every single neural network that you're going to learn in this class, and that's going to be just a single neuron. In neural network language, a single neuron is called a perceptron.

So what is the perceptron? A perceptron is, like I said, a single neuron, and it's actually a very, very simple idea, so I want to make sure that everyone in the audience understands exactly what a perceptron is and how it works. Let's start by defining a perceptron as taking as input a set of inputs. On the left-hand side you can see this perceptron takes m different inputs, 1 to m; these are the blue circles. We're denoting these inputs as x's. Each of these
numbers, each of these inputs, is then multiplied by a corresponding weight, which we can call w. So x1 will be multiplied by w1, and we'll add the results of all of these multiplications together. Now we take that single number after the addition and pass it through what we call a nonlinear activation function, and that produces the final output of the perceptron, which we can call y.

Now, this is actually not an entirely accurate picture of a perceptron. There's one step that I forgot to mention here. In addition to multiplying all of these inputs with their corresponding weights, we're also going to add what's called a bias term, here denoted as w0, which is just a scalar weight; you can think of it as coming with an input of just one. That's going to allow the network to shift its nonlinear activation function as it sees its inputs.

On the right-hand side you can see this diagram mathematically formulated as a single equation. We can now rewrite this equation in linear algebra terms, with vectors and dot products. For example, we can define our entire set of inputs x1 to xm as a large vector X. That large vector X can be multiplied, taking a dot product, with our weights W, which is another vector of our weights w1 to wm. Taking their dot product not only multiplies them but also adds the resulting terms together. Then we add a bias, like we said before, and apply this nonlinearity.

Now you might be wondering what this nonlinear function is; I've mentioned it a few times already. Well, it is a function that we pass the outputs of the neural network through before we return them to the next neuron in the pipeline. One common example of a nonlinear function that's very popular in deep neural networks is called the sigmoid function. You can
think of this as kind of a continuous version of a threshold function. It goes from zero to one, it can take as input any real number on the real number line, and you can see an example of it illustrated on the bottom right-hand side. In fact, there are many types of nonlinear activation functions that are popular in deep neural networks, and here are some common ones. Throughout this presentation you'll also see examples of code snippets on the bottom of the slides, where we'll try to tie some of what you're learning in the lectures to actual software and how you can implement these pieces, which will help you a lot for your software labs.

The sigmoid activation on the left is very popular since it's a function that outputs between zero and one. This is very important when you want to deal with probability distributions, for example, because probabilities live between 0 and 1. In modern deep neural networks, though, the ReLU function, which you can see on the far right-hand side, is a very popular activation function. Because it's piecewise linear, it's extremely efficient to compute, especially when computing its derivatives: its derivatives are constants, except for one nonlinearity at zero.

Now, I hope all of you are probably asking yourselves this question: why do we even need this nonlinear activation function? It seems like it just complicates this whole picture when we didn't really need it in the first place. I want to spend a moment answering this, because the point of a nonlinear activation function is, of course, to introduce nonlinearities into our network. If we think about our data, almost all data that we care about, all real-world data, is highly nonlinear. This is important because if we want to be able to deal with those types of data sets, we need models that are also nonlinear
so they can capture those same types of patterns. Imagine I gave you this data set of red points and green points and asked you to separate those two types of data points. You might think that this is easy, but what if I told you that you could only use a single line to do so? Well, now it becomes a very complicated problem. In fact, you can't really solve it effectively with a single line. If you introduce nonlinear activation functions to your solution, that's exactly what allows you to deal with these types of problems. Nonlinear activation functions allow you to deal with nonlinear types of data, and that's exactly what makes neural networks so powerful at their core.

So let's understand this with a very simple example, walking through this diagram of a perceptron one more time. Imagine I give you this trained neural network with weights. Now, instead of w1 and w2, I'm going to actually give you numbers at these locations. The trained bias w0 will be 1, and W will be a vector of 3 and negative 2.
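The perceptron described so far, with the trained numbers just given (bias w0 = 1, weights 3 and -2), can be sketched in a few lines of plain Python. This is a minimal, library-free sketch and not the code shown on the lecture slides; the function names `sigmoid` and `perceptron` are assumptions made for illustration, and the sigmoid is just one possible choice for the nonlinearity g.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, w0, g=sigmoid):
    """Forward pass of a single perceptron (hypothetical helper).

    Multiply each input by its weight, sum the products,
    add the bias w0, then apply the nonlinearity g.
    """
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    return g(z)

# Trained parameters from the example: bias w0 = 1, weights (3, -2).
w0 = 1.0
w = [3.0, -2.0]

# A sample two-dimensional input point.
x = [-1.0, 2.0]

# Value inside the nonlinearity: 1 + 3*(-1) + (-2)*2 = -6
z = w0 + sum(xi * wi for xi, wi in zip(x, w))
y = perceptron(x, w, w0)
print(z, y)
```

Because the value inside the nonlinearity is negative for this input, the sigmoid output falls below 0.5; an input on the other side of the line 1 + 3·x1 - 2·x2 = 0 would give an output above 0.5.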
So this neural network has two inputs, like we said before: it has input x1 and it has input x2. If we want to get the output of it (and this is the main thing I want all of you to take away from this lecture today), there are three steps we need to take: we multiply our inputs with our weights, add the results, and compute a nonlinearity. It's these three steps that define the forward propagation of information through a perceptron.

So let's take a look at how exactly that works. If we plug these numbers into those equations, we can see that everything inside of our nonlinearity (here the nonlinearity is g, that function g which could be a sigmoid, as we saw on a previous slide) is in fact just a two-dimensional line: it has two inputs. If we consider the space of all the possible inputs that this neural network could see, we can actually plot this as a decision boundary. We can plot this two-dimensional line as a decision boundary, as a plane separating these two components of our space. In fact, not only is it a single plane, there's a directionality component depending on which side of the plane we live on. If we see an input, for example here (-1, 2), we know that it lives on one side of the plane and it will have a certain type of output. In this case, plugging those components into our equation gives 1 + 3(-1) - 2(2) = -6, a negative number, which passes through the nonlinear component and gets propagated through as an output close to zero. Of course, if you're on the other side of the space you're going to have the opposite result, and that thresholding function is going to essentially live at this decision boundary. So depending on which side of the space you live on, that
thresholding function, that sigmoid function, is going to control how you move to one side or the other.

Now, in this particular example this is very convenient, because we can actually visualize and draw this exact full space for you on this slide. It's only a two-dimensional space, so it's very easy for us to visualize. But of course, for almost all problems that we care about, our data points are not going to be two-dimensional. If you think about an image, the dimensionality of an image is going to be the number of pixels that you have in the image, so these are going to be thousands of dimensions, millions of dimensions, or even more, and then drawing these types of plots like you see here is simply not feasible. So we can't always do this, but hopefully this gives you some intuition as we build up into more complex models.

So now that we have an idea of the perceptron, let's see how we can take this single neuron and start to build it up into something more complicated, a full neural network, and build a model from that. Let's revisit this previous diagram of the perceptron. Again, just to reiterate one more time, the core piece of information that I want all of you to take away from this class is how a perceptron works and how it propagates information to its decision. There are three steps: first is the dot product, second is the bias, and third is the nonlinearity. And you keep repeating this process for every single perceptron in your neural network.

Let's simplify the diagram a little bit. I'll get rid of the weights, and you can assume that every line here now has an associated weight scalar: every line corresponds to an input that's coming in, and it has a weight on the line itself. I've also removed the bias just for the sake of simplicity, but it's still there. So now the
result is z, which let's call the result of our dot product plus the bias, and that's what we pass into our nonlinear function; the activation function is going to be applied to it. The final output here is simply going to be g, our activation function, of z. You can think of z as the state of this neuron: it's the result of that dot product plus bias.

Now, if we want to define and build up a multi-output neural network, if we want two outputs from this function for example, it's a very simple procedure. We just have two neurons, two perceptrons, and each perceptron will control the output for its associated piece. So now we have two outputs. Each one is a normal perceptron, and they both take the same inputs.

Amazingly, with this mathematical understanding we can start to build our first neural network layer entirely from scratch. So what does that look like? We can start by initializing two components. The first component that we saw was the weight matrix (excuse me, the weight vector; it's a vector of weights in this case), and the second component is the bias vector that we're going to add to the dot product of our inputs with our weights. The only remaining step, after we've defined these parameters of our layer, is to define how forward propagation of information works, and that's exactly those three main components that I've been stressing. We can create this call function to do exactly that, to define this forward propagation of information, and the story here is exactly the same as we've been seeing: matrix multiply our inputs with our weights, add a bias, and then apply a nonlinearity and return the result. This code will literally run; it defines a full neural network layer, like this.
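The slide code itself is not captured in this transcript, so what follows is only a rough sketch of the layer just described: initialize a weight matrix and a bias vector, then define a call function that multiplies, adds the bias, and applies a nonlinearity. It is written in plain Python rather than TensorFlow, and the class name `DenseLayer`, the `seed` argument, the Gaussian initialization, and the choice of sigmoid are all assumptions made for illustration.

```python
import math
import random

class DenseLayer:
    """A fully connected layer: every output neuron sees every input.

    Hypothetical, library-free sketch; not the lecture's TensorFlow code.
    """

    def __init__(self, input_dim, output_dim, seed=0):
        rng = random.Random(seed)
        # Weight matrix with input_dim rows and output_dim columns:
        # column j holds the weights of output neuron j.
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(output_dim)]
                  for _ in range(input_dim)]
        # Bias vector: one scalar bias per output neuron.
        self.b = [rng.gauss(0.0, 1.0) for _ in range(output_dim)]

    def call(self, inputs):
        """Forward propagation for one input vector."""
        outputs = []
        for j in range(len(self.b)):
            # Dot product of the inputs with column j, plus the bias.
            z = self.b[j] + sum(x * row[j] for x, row in zip(inputs, self.W))
            # Nonlinearity (sigmoid in this sketch).
            outputs.append(1.0 / (1.0 + math.exp(-z)))
        return outputs

# Three inputs feeding two output neurons.
layer = DenseLayer(input_dim=3, output_dim=2)
print(layer.call([0.5, -1.0, 2.0]))
```

Each output neuron here is one perceptron; the two neurons share the same inputs and differ only in their column of weights and their bias.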
And of course, luckily for all of you, all of that code, which wasn't much code, has been abstracted away by libraries like TensorFlow. You can simply call functions like this, which will replicate exactly that piece of code, so you don't necessarily need to copy all of that code down; you can just call it.

With that understanding, we just saw how you could build a single layer, but of course now you can start to think about how you can stack these layers as well. Since we now have this transformation from our inputs to a hidden output, you can think of this as a way of transforming those inputs into some new dimensional space, perhaps closer to the value that we want to predict. That transformation is eventually going to be learned, to know how to transform those inputs into our desired outputs, and we'll get to that later. For now, the piece that I really want to focus on is that even with these more complex neural networks, this is nothing more complex than what we've already seen. If we focus on just one neuron in this diagram, take z2 for example, the neuron that's highlighted in the middle layer, it's just the same perceptron that we've been seeing so far in this class. Its output is obtained by taking a dot product with all of its inputs, adding a bias, and then applying that nonlinearity. If we look at a different node, for example z3, which is the one right below it, it's the exact same story again. It sees all of the same inputs, but it has a different set of weights that it's going to apply to those inputs, so it will have a different output, but the mathematical equations are exactly the same. So from now on I'm going to simplify all of these lines and diagrams to show just these icons in the middle, just to
demonstrate that this means everything is fully connected to everything, defined by those mathematical equations that we've been covering. There's no extra complexity in these models beyond what you've already seen.

Now, if you want to stack these layers on top of each other, you can not only define one layer very easily, but you can also create what are called sequential models. In a sequential model you define one layer after another, and together they define the forward propagation of information, not just at the neuron level but now at the layer level. Every layer will be fully connected to the next layer, and the inputs of the second layer will be all of the outputs of the prior layer.

Of course, if you want to create a very deep neural network, all a deep neural network is, is that we keep stacking these layers on top of each other. There's nothing else to this story; it's really as simple as that. The final output is computed by going deeper and deeper into this progression of layers, and you just keep stacking them until you get to the last layer, which is your output layer: the final prediction that you want to output. We can create a deep neural network by stacking these layers and creating these more hierarchical models, like we saw at the very beginning of today's lecture.

Okay, so that's awesome. We've now seen how we can go from a single neuron, to a layer, all the way to a deep neural network, building off of these foundational principles. Let's take a look at how exactly we can use these principles that we've just discussed to solve a very real problem that I think all of you are
probably very concerned about this morning when you woke up. That problem is how we can build a neural network to answer this question: will I pass this class or will I not? To answer this question, let's see if we can train a neural network to solve this problem.

To do this, let's start with a very simple neural network. We'll train this model with just two inputs: one input is going to be the number of lectures that you attend over the course of this one week, and the second input is going to be how many hours you spend on your final project or your competition. What we're going to do is first go out and collect a lot of data from all of the past years that we've taught this course. Because it's only a two-input space, we can plot this data on a two-dimensional feature space. We can look at all of the students before you who have passed the class and failed the class and see where they lived in this space, for the number of hours that they've spent and the number of lectures that they've attended. Green points are the people who have passed; red are those who have failed.

And here's you: (4, 5) is your coordinate. You've attended four lectures and you've spent five hours on your final project. We want to build a neural network to answer the question: will you pass the class or will you fail the class? So let's do it. We have two inputs, one is four and one is five. These are two numbers that we can feed through a neural network of the kind we've just seen how to build. We feed them into a single-layered neural network, with three hidden units in this example, though we could make it larger if we wanted it to be more expressive and more powerful. And we see here that the probability of you passing this class is 0.1. It's pretty dismal.
So why would this be the case? What did we do wrong? Because I don't think it's correct: when we looked at the space, it actually looked like you were a good candidate to pass the class. Why is the neural network saying that there's only a 10% likelihood that you will pass? Does anyone have any ideas? Exactly, exactly. This neural network was just born. It has no information about the world or this class. It doesn't know what four and five mean, or what the notion of passing or failing means. So exactly right, this neural network has not been trained. You can think of it kind of as a baby: it hasn't learned anything yet. So our job first is to train it, and part of that is telling the neural network when it makes mistakes. Mathematically, we should now think about how we can answer this question: did my neural network make a mistake, and if it did, how can I tell it how big of a mistake it was, so that the next time it sees this data point it can do better and minimize that mistake? In neural network language, those mistakes are called losses. Specifically, you want to define what's called a loss function, which takes as input your prediction and the true value, and how far away your prediction is from the true value tells you how big of a loss there is. Actually, before going further, I want to give you some terminology. There are multiple different ways of saying the same thing in neural networks and deep learning. What I just described as a loss function is also commonly referred to as an objective function, an empirical risk, or a cost function. These all refer to the same thing: they're all a way for us to train the neural network, to teach the neural network when it makes mistakes, and
what we really ultimately want to do is, over the course of an entire dataset, not just one data point, minimize all of the mistakes on average that this neural network makes. So if we look at the problem, like I said, of binary classification, will I pass this class or will I not, there's a yes or no answer, and that means binary classification. Here we can use what's called the softmax cross-entropy loss. For those of you who aren't familiar, this notion of cross-entropy was actually developed here at MIT by Claude Shannon, who was a visionary. He did his master's here over 50 years ago, he introduced this notion of cross-entropy, and it was pivotal in our ability to train these types of neural networks, even now and into the future. Now, instead of predicting a binary output, what if we wanted to predict a final grade, your class score for example? That's no longer a binary yes-or-no output; it's a continuous variable. It's the grade: let's say, out of 100 points, what is the value of your score on the class project? For this type of output we can use what's called a mean squared error loss. You can think of this literally as subtracting your predicted grade from the true grade, squaring that difference, and minimizing that distance. So I think now we're ready to really put all of this information together and tackle this problem of training a neural network, not just to identify how erroneous it is, how large its loss is, but more importantly to minimize that loss as a function of all of the training data that it observes. We know that we want to find the neural network, like we mentioned before, that minimizes this empirical risk, this empirical loss averaged across our entire dataset. Mathematically, this means that we want to find the W's that
minimize J(W). J(W) is our loss function averaged over our entire dataset, and W is our weights. So we want to find the set of weights that, on average, is going to give us the smallest loss possible. Now remember that W here is just a list; it's just a group of all of the weights in our neural network. You may have hundreds of weights in a very, very small neural network, or in today's neural networks you may have billions or trillions of weights, and you want to find the value of every single one of these weights that's going to result in the smallest loss possible. Now how can you do this? Remember that our loss function J(W) is just a function of our weights. So for any instantiation of our weights, we can compute a scalar value of how erroneous our neural network would be for that instantiation. Let's try to visualize this in a very simple example of a two-dimensional space, where we have only two weights: an extremely simple, very small, two-weight neural network, and we want to find the optimal weights for it. We can plot the loss, how erroneous the neural network is, for every single instantiation of these two weights. This is a huge space, an infinite space, but still we can have a function that evaluates at every point in this space. What we ultimately want to do is, again, find which set of W's will give us the smallest loss possible. That means the lowest point on this landscape that you can see here: where are the W's that bring us to that lowest point? The way that we do this is by first starting at a random place. We have no idea where to start, so we pick a random place in this space and start there. At this location, let's evaluate our neural network. We can compute the loss at this specific location, and
on top of that we can actually compute how the loss is changing. We can compute the gradient of the loss, because our loss function is a continuous, differentiable function, so we can compute derivatives of our function across the space of our weights. The gradient tells us the direction of steepest ascent: from where we stand, the gradient tells us where we should go to increase our loss. Now of course we don't want to increase our loss, we want to decrease it, so we negate our gradient and take a step in the opposite direction. That brings us one step closer to the bottom of the landscape. And we just keep repeating this process over and over again: we evaluate the neural network at this new location, compute its gradient, and step in that new direction. We keep traversing this landscape until we converge to a minimum. We can summarize this algorithm, which is known formally as gradient descent, like this. We initialize all of our weights; this can be two weights, like you saw in the previous example, or billions of weights, like in real neural networks. We compute the gradient, the partial derivative of our loss with respect to the weights, and then we update our weights in the opposite direction of this gradient. Essentially we just take a small step, which here is denoted as eta (η), and this is commonly referred to as the learning rate. It's how much we want to trust that gradient and step in the direction of that gradient. We'll talk more about this later. Just to give you some sense of code, this algorithm is very easily translatable to real code as well. For every line of the pseudocode you can see on the left, you can see corresponding real code on the right that is runnable and directly implementable by
all of you in your labs. But now let's take a look specifically at this term here: this is the gradient. We touched very briefly on this in the visual example. It explains, like I said, how the loss is changing as a function of the weights: as the weights move around, will my loss increase or decrease? That tells the neural network whether it needs to move the weights in a certain direction or not. But I never actually told you how to compute this, and I think that's an extremely important part, because if you don't know that, then you can't train your neural network. This is a critical part of training neural networks, and the process of computing this gradient is known as backpropagation. So let's do a very quick intro to backpropagation and how it works. Again, let's start with the simplest neural network in existence. This neural network has one input, one output, and only one hidden neuron; this is as simple as it gets. We want to compute the gradient of our loss with respect to our weights; in this case, let's compute it with respect to W2, the second weight. This derivative is going to tell us how much a small change in this weight will affect our loss: if we change our weight a little bit in one direction, will it increase our loss or decrease our loss? To compute that, we can write out this derivative and apply the chain rule backwards from the loss function through the output. Specifically, we can decompose this derivative into two components: the derivative of our loss with respect to our output, multiplied by the derivative of our output with respect to W2. This is just a standard instantiation of the chain rule applied to the original derivative that we had on the left-hand side. Now let's suppose we wanted to compute the gradient of the weight before that, which in
this case is not W2 but W1. Well, all we do is replace W2 with W1, and that chain rule still holds; that same equation holds. But now you can see in the red component, the last component of the chain rule, that we have to once again recursively apply one more chain rule, because that's again another derivative that we can't directly evaluate. We can expand that once more with another instantiation of the chain rule, and now we can directly evaluate all of these components, propagating these gradients through the hidden units in our neural network all the way back to the weight that we're interested in. So in this example we first computed the derivative with respect to W2, then we can backpropagate that and use that information for W1 as well. That's why we call it backpropagation: because this process occurs from the output all the way back to the input. We repeat this process many, many times over the course of training, propagating these gradients over and over again through the network, all the way from the output to the inputs, to determine for every single weight the answer to this question: how much does a small change in this weight affect our loss function, does it increase it or decrease it, and how can we use that to reduce the loss? Because ultimately that's our final goal in this class. So that's the backpropagation algorithm; that's the core of training neural networks. In theory it's very simple: it's really just an instantiation of the chain rule. But let's touch on some insights that make training neural networks actually extremely complicated in practice, even though the algorithm of backpropagation is simple and many decades old. In practice, optimization of neural networks looks something like this. It looks nothing like that picture that I showed you before. There are ways that we can visualize very large deep neural
networks, and you can think of the landscape of these models as looking something like this. This is an illustration from a paper that came out several years ago, where they tried to actually visualize the landscape of very, very deep neural networks. That's what this landscape actually looks like; that's what you're trying to deal with when you look for the minimum in this space, and you can imagine the challenges that come with that. To cover the challenges, let's first recall the update equation defined in gradient descent. I didn't talk too much about this parameter eta (η), but now let's spend a bit of time thinking about it. This is called the learning rate, like we saw before. It determines how big of a step we take in the direction of our gradient on every single iteration of backpropagation. In practice, even setting the learning rate can be very challenging. You, as the designer of the neural network, have to set this value, this learning rate, and how do you pick it? That can actually be quite difficult, and it has really large consequences when building a neural network. For example, if we set the learning rate too low, then we learn very slowly. Let's assume we start on the right-hand side here, at that initial guess. If our learning rate is not large enough, not only do we converge slowly, we actually don't even converge to the global minimum, because we get stuck in a local minimum. Now what if we set our learning rate too high? What can happen is that we overshoot and actually start to diverge from the solution. The gradients can explode, very bad things happen, and then the neural network doesn't train, so that's also not good. In reality there's a happy medium between setting it too small and setting it too large, where you set it just large enough to overshoot some of these local minima and put you into a
reasonable part of the search space, where you can then actually converge on the solutions that you care most about. But how do you set these learning rates in practice? How do you pick the ideal learning rate? One option, and this is actually a very common option in practice, is to simply try out a bunch of learning rates and see what works best. Try out, let's say, a whole grid of different learning rates, train all of these neural networks, and see which one works the best. But I think we can do something a lot smarter. What are some more intelligent ways that we could do this? Instead of exhaustively trying out a whole bunch of different learning rates, can we design a learning rate algorithm that actually adapts to our neural network and to its landscape, so that it's a bit more intelligent than that previous idea? This ultimately means that the learning rate, the speed at which the algorithm is trusting the gradients that it sees, is going to depend on how large the gradient is at that location, how fast we're learning, and many other options that we might have as part of training neural networks. So it's not only how quickly we're learning; you may judge it on many different factors in the learning landscape. In fact, these adaptive learning rate algorithms that I'm talking about have been very widely studied in practice. There is a thriving community in deep learning research that focuses on developing and designing new algorithms for learning rate adaptation and faster optimization of large neural networks like these. During your labs you'll actually get the opportunity not only to try out a lot of these different adaptive algorithms, which you can see here, but also to try to uncover the patterns and benefits of one versus the
other, and that's going to be something that I think you'll find very insightful as part of your labs. Another key component of your labs is how you can put all of the information that we've covered today into a single picture that looks roughly like this. At the top here is where you define your model; we talked about this in the beginning part of the lecture. For your model, you're now going to need to define this optimizer, which we've just talked about. The optimizer is defined together with a learning rate: how quickly you want to optimize over your loss landscape. Then, over many loops, you're going to pass over all of the examples in your dataset, observe how to improve your network (that's the gradient), and then actually improve the network in those directions. You keep doing that over and over again until eventually your neural network converges to some sort of solution. In the remaining time that we have, I want to very briefly continue to talk about tips for training these neural networks in practice, and focus on this very powerful idea of batching your data into what are called mini-batches: smaller pieces of data. To do this, let's revisit the gradient descent algorithm. This gradient that we talked about before is actually extraordinarily expensive to compute, because it's computed as a summation across all of the pieces in your dataset. In most real-world problems it's simply not feasible to compute a gradient over your entire dataset; datasets are just too large these days. So what are the alternatives? Instead of computing the gradient across your entire dataset, what if you instead computed the gradient over just a single example in your
dataset, just one example? Well, of course, this estimate of your gradient is going to be exactly that: an estimate. It's going to be very noisy. It may roughly reflect the trends of your entire dataset, but because it's only one example out of your entire dataset, it may be very noisy. The advantage of this, though, is that it's obviously much faster to compute the gradient over a single example, so computationally this has huge advantages. The downside is that it's extremely stochastic. That's the reason why this algorithm is not called gradient descent; it's called stochastic gradient descent. Now what's the middle ground? Instead of computing the gradient with respect to one example in your dataset, what if we computed it over what's called a mini-batch of examples, a small batch of examples? These gradients are still computationally efficient to compute, because a mini-batch is not too large; maybe we're talking on the order of tens or hundreds of examples from our dataset. But more importantly, because we've expanded from a single example to maybe 100 examples, the stochasticity is significantly reduced and the accuracy of our gradient is much improved. So normally we're thinking of mini-batch sizes roughly on the order of tens or hundreds of data points. This is obviously much faster to compute than full gradient descent, and much more accurate than stochastic gradient descent with that single-point example. This increase in gradient accuracy allows us to converge to our solution much more quickly than would be possible in practice given the limitations of gradient descent. It also means that we can increase our learning rate, because we can trust each of those gradients much more: we're now averaging over a batch,
it's going to be much more accurate than the stochastic version, so we can increase that learning rate and actually learn faster as well. This also allows us to massively parallelize this entire computation: we can split up batches onto separate workers and achieve even more significant speedups using GPUs. The last topic that I very, very briefly want to cover in today's lecture is overfitting. When we're optimizing a neural network with stochastic gradient descent, we have this challenge of what's called overfitting. Overfitting looks roughly like this. We want to build a neural network, or let's say in general a machine learning model, that can accurately describe some patterns in our data. But remember, ultimately we don't want to describe the patterns in our training data; ideally we want to describe the patterns in our test data. Of course we don't observe test data, we only observe training data, so we have this challenge of extracting patterns from training data and hoping that they generalize to our test data. Said a different way, we want to build models that can learn representations from our training data that still generalize even when we show them brand new, unseen pieces of test data. So assume that you want to fit a line that can describe the patterns in these points that you can see on the slide. If you have a very simple neural network, which is just a single straight line, you can describe this data only sub-optimally, because the data here is nonlinear: you're not going to accurately capture all of the nuances and subtleties in this dataset. That's on the left-hand side. If you move to the right-hand side, you can see a much more complicated model, but here you're too expressive, and you're capturing the spurious
nuances in your training data that are actually not representative of your test data. Ideally you want to end up with the model in the middle, which is the middle ground: it's not too complex and it's not too simple, and it still performs well even when you give it brand new data. To address this problem, let's briefly talk about what's called regularization. Regularization is a technique that you can introduce into your training pipeline to discourage complex models from being learned. As we've seen before, this is really critical: because neural networks are extremely large models, they are extremely prone to overfitting. So having techniques for regularization has major implications for the success of neural networks and for having them generalize beyond training data, far into our testing domain. The most popular technique for regularization in deep learning is called dropout, and the idea of dropout is actually very simple. Let's revisit this picture of deep neural networks that we saw earlier in today's lecture. In dropout, during training, we randomly select some subset of the neurons in this neural network and prune them out with some probability. For example, we can randomly select this subset of neurons with a probability of 50 percent, and with that probability we randomly turn them off or on in different iterations of our training. This is essentially forcing the neural network to learn what you can think of as an ensemble of different models. On every iteration it's going to be exposed to a different model internally than the one it had on the last iteration, so it has to learn how to build internal pathways to process the same information, and it can't rely on information that it learned on previous iterations. So it forces it to capture some
deeper meaning within the pathways of the neural network. This can be extremely powerful because, number one, it lowers the capacity of the neural network significantly (you're lowering it by roughly 50 percent in this example), but also because it makes the network easier to train: the number of weights that have gradients in this case is also reduced, so it's actually much faster to train as well. Like I mentioned, on every iteration we randomly drop out a different set of neurons, and that helps the model generalize better. The second regularization technique, which is actually a very broad regularization technique that goes far beyond neural networks, is simply called early stopping. We know that the definition of overfitting is simply when our model starts to represent the training data more than the testing data; that's really what overfitting comes down to at its core. If we set aside some of the training data and don't train on it, we can use it as a kind of testing dataset, a synthetic testing dataset in some ways, and monitor how our network is learning on this unseen portion of data. For example, over the course of training we can plot the performance of our network on both the training set and our held-out test set. As the network is trained, we're going to see that, first of all, both losses decrease, but there's going to be a point where the test loss plateaus and starts to increase. This is exactly the point where you start to overfit, because the test loss starts to increase as you overfit on your training data. This pattern continues for the rest of training, and this is the point that I want you to focus on. This middle point is where we need to stop training, because after this
point, assuming that this held-out test set is a valid representation of the true test set, the accuracy of the model will only get worse. So this is where we would want to stop our model early and regularize the performance. We can see that stopping any time before this point is also not good: we're going to produce an underfit model, where we could have had a better model on the test data. So it's this trade-off: you can't stop too late, and you can't stop too early either. I'll conclude this lecture by summarizing these three key points that we've covered today. We've first covered the fundamental building block of all neural networks, which is the single neuron, the perceptron. We've built these up into larger neural layers, and from there into neural networks and deep neural networks. We've learned how we can train these, apply them to datasets, and backpropagate through them, and we've seen some tips and tricks for optimizing these systems end to end. In the next lecture we'll hear from Ava on deep sequence modeling using RNNs, and specifically this very exciting new type of model called the Transformer architecture, and attention mechanisms. So let's resume the class in about five minutes, after we have a chance to swap speakers. Thank you so much for all of your attention.
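As a rough recap of the training ideas in this lecture, the sketch below combines mini-batch gradient descent, a learning rate, and early stopping on a held-out set. Several things here are assumptions rather than anything from the lecture or its labs: the data is synthetic (a made-up rule decides who "passes"), the model is a single sigmoid neuron rather than a network with hidden units, the loss is binary cross-entropy on that sigmoid output rather than a softmax formulation, and the names `make_synthetic_students`, `train`, and `patience` are hypothetical. Dropout is left out to keep the example short.

```python
import math
import random

def sigmoid(z):
    # Clamp z so math.exp cannot overflow for extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def mean_loss(w, b, data):
    """Binary cross-entropy averaged over a dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def make_synthetic_students(n, seed=1):
    """Made-up data: features scaled to [0, 1]; the label rule is an assumption."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        lectures, hours = rng.uniform(0, 5), rng.uniform(0, 10)
        x = (lectures / 5.0, hours / 10.0)
        data.append((x, 1 if x[0] + x[1] > 1.0 else 0))
    return data

def train(train_set, val_set, lr=0.5, batch_size=16,
          max_epochs=200, patience=10, seed=0):
    rng = random.Random(seed)
    w, b = [rng.gauss(0, 0.1) for _ in range(2)], 0.0
    best = (mean_loss(w, b, val_set), list(w), b)
    stale = 0
    data = list(train_set)  # copy so the caller's list is not reordered
    for _ in range(max_epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w, grad_b = [0.0, 0.0], 0.0
            for x, y in batch:
                # For sigmoid + cross-entropy, dLoss/dz = prediction - label.
                err = predict(w, b, x) - y
                grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
                grad_b += err
            # Step against the gradient averaged over the mini-batch.
            w = [wi - lr * g / len(batch) for wi, g in zip(w, grad_w)]
            b -= lr * grad_b / len(batch)
        val = mean_loss(w, b, val_set)
        if val < best[0]:
            best, stale = (val, list(w), b), 0
        else:
            stale += 1
            if stale >= patience:  # early stopping: held-out loss stopped improving
                break
    return best

students = make_synthetic_students(200)
train_set, val_set = students[:160], students[160:]
best_loss, w, b = train(train_set, val_set)
print(f"best held-out loss: {best_loss:.3f}")
```

The function returns the weights from the epoch with the lowest held-out loss rather than the last epoch, which is one common way to apply early stopping. On this clean synthetic data the held-out loss may simply keep improving, so the stopping rule may never trigger.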
Channel: Alexander Amini
Views: 1,892,610
Keywords: deep learning, mit, artificial intelligence, neural networks, machine learning, introduction to deep learning, intro to deep learning, 6s191, 6.s191, mit 6.s191, mit 6s191, mit deep learning, alexander amini, amini, lecture 1, ava soleimany, tensorflow, computer vision, deepmind, openai, basics, introduction, deeplearning, tensorflow tutorial, what is deep learning, deep learning basics, deep learning python, andrew ng
Id: QDX-1M5Nj7s
Length: 58min 12sec (3492 seconds)
Published: Fri Mar 10 2023