Building a neural network FROM SCRATCH (no Tensorflow/Pytorch, just numpy & math)

Captions
Hi everyone, my name is Samson. Today I'm going to be building a neural network from scratch: not using TensorFlow, not using Keras, just NumPy, the equations, and the linear algebra, from the ground up. Anyone interested in artificial intelligence or machine learning is probably familiar with neural networks at a high level: you have lots of layers and lots of nodes, you connect them all together, and you get some really complex models that make cool predictions. I find that kind of high-level learning a bit wishy-washy, and even if you go into TensorFlow and implement these networks, it's still a little unclear how they actually work, at least for me. I learn a lot better when I can get to the equations and really build from the ground up, and you can't get much closer than implementing a neural network from scratch. So that's what we're doing today. Let's dive right in.

The problem we'll be tackling is digit classification, using the famous MNIST dataset. MNIST is tens of thousands of 28 by 28, so fairly low resolution, grayscale images of handwritten digits. We're going to build a neural network that takes an image of a handwritten digit and tells you which digit is written in it.

Here's an overview of what we're going to implement. We start with 28 by 28 pixel training images, which is 784 pixels overall, and each pixel holds a value between 0 and 255, where 255 is completely white and 0 is completely black. We have m of these training images, so we can represent them as a matrix where each row is an example and each row is 784 columns long, one column per pixel. What we're actually going to do is transpose this matrix, so that instead of each row being an example, each column is an example: the first column is our first example, with 784 rows corresponding to its pixels, and we have m columns corresponding to our m training examples. Our goal is to take an image, do a bunch of processing, and output a prediction for what digit that image represents, and we're going to do that with a neural network.

The network we're building is quite simple; it only has two layers. The first layer is just 784 nodes, our input layer: each of the 784 pixels maps to a node. The second layer is our hidden layer with 10 units, and the third layer is our output layer, again with 10 units, each corresponding to one digit that can be predicted. In that intro I called these the first, second, and third layers, but going forward I'm actually going to call the input layer the zeroth layer, because there are no parameters there; it's not really a layer of the network, it's just the input. The next one is the first hidden layer, and the last one is the second layer, which is also our output layer. So going forward I'll call them the input layer, the first hidden layer, and the second layer, otherwise known as the output layer.
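To make that data layout concrete, here is a tiny, hypothetical NumPy sketch of the transposed matrix described above (the sizes are made up for illustration; the real training set has tens of thousands of examples):

```python
import numpy as np

# Hypothetical miniature example of the data layout described above:
# m examples, each a 28x28 image flattened into 784 pixel values in [0, 255].
m = 5
X_rows = np.random.randint(0, 256, size=(m, 784))  # one example per ROW
X = X_rows.T                                       # transpose: one example per COLUMN
print(X.shape)                                     # (784, 5): 784 rows, m columns
```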
There are three parts to training this network. The first part is called forward propagation, which is just taking an image and running it through the network to compute what the output is going to be. To start, we have a variable A0, which is just our input layer: A0 is equal to X, no processing going on there. Z1 is the unactivated first layer. To get Z1 we apply a weight and a bias: the weight is a matrix, we take the dot product between that matrix and A0, our input layer matrix, and then we add a bias term. What we're really doing is multiplying by a bunch of weights, one for each of the 7,840 connections between the layers, and then adding a constant bias term to each of the nodes.

What we do after that is kind of interesting: we apply an activation function. If we didn't, each node would just be a linear combination of the nodes before it, plus a bias term. The second layer would be a linear combination of the nodes in the first layer, but the first layer is just a linear combination of the input layer, so the second layer would really just be a linear combination of the input layer; it's as if you don't have a hidden layer at all. If you only have linear combinations, only weights and biases, you're never going to get an interesting function out of the neural network; you're really just doing a fancy linear regression. To solve that, we apply an activation function. There are a bunch of common ones, like tanh and sigmoid, which I'm sure you've heard of, with their nice S-shaped curves. If you apply a tanh or sigmoid function to all of the nodes in a layer, the next layer is no longer linear; that adds a layer of complexity, a non-linear combination rather than just a linear model, and it makes your model able to be a lot more complex and more powerful.

We're going to use another really commonly used activation function called ReLU, the rectified linear unit. ReLU is really simple: ReLU(x) is equal to x if x is greater than zero, so it's literally linear there, and it is equal to zero if x is less than or equal to zero. It's just this really simple function, but it adds the non-linearity that we need. So A1 is equal to the ReLU function applied to every value in Z1. Now we do a similar thing to get from layer 1 to layer 2.
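Written out, the first-layer forward pass described above is (a summary of the narration, using bracketed superscripts for layer indices):

\[ A^{[0]} = X \quad (784 \times m) \]
\[ Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}, \qquad W^{[1]}: 10 \times 784, \quad b^{[1]}: 10 \times 1 \]
\[ A^{[1]} = \mathrm{ReLU}(Z^{[1]}), \qquad \mathrm{ReLU}(x) = \max(0, x) \]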
So Z2, our unactivated second-layer values, is equal to a second weight parameter W2, the weights on each of the connections between the first and second layers, times A1, the activated first layer, plus another constant bias term. This time we apply another activation function, but it's not ReLU or sigmoid or tanh; it's something quite different, called softmax. Because this is the output layer, each of the 10 nodes corresponds to one of the 10 digits that could be recognized, and we want each of them to be a probability: a value between 0 and 1, where 1 is absolute certainty and 0 is no chance at all. To get that we use the softmax function (I've just copy-pasted an image of it here, I didn't feel like writing it out myself). It takes each of the nodes in the layer you feed into it and computes e to that node divided by the sum over all the nodes of e to each node. Each of your outputs after the softmax activation is going to be between 0 and 1, because e to a single node is always just a part of the whole, which is the sum of e to all of the nodes.

Forward propagation is how you take an image and get a prediction out, but that's not enough: we need good weights and biases to make those predictions, and the whole point of machine learning is that we learn these weights and biases. We run an algorithm to optimize them over and over again, and we do that with something called backprop, backwards propagation. What we're doing here is basically going the opposite way: we start with our prediction, find out how much the prediction deviated from the actual label, which gives us a sort of error, then see how much each of the previous weights and biases contributed to that error, and then adjust those accordingly.

So dZ2 represents the error of the second layer: how much the output layer is off by. This is really simple: we take our predictions and subtract the actual labels from them, where the correct label is one-hot encoded. For example, if y equals 4, we're not going to subtract 4 from the prediction; we encode y = 4 into an array where the fourth index, representing the fourth class, is a one and everything else is a zero. From there we do some fancier math to figure out how much W and b contributed to that error, which gives us the variable dW2, the derivative of the loss function with respect to the weights in layer 2.
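In the same notation, the second-layer forward pass and the first backprop quantities described above are:

\[ Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \qquad W^{[2]}: 10 \times 10, \quad b^{[2]}: 10 \times 1 \]
\[ A^{[2]} = \mathrm{softmax}(Z^{[2]}), \qquad \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \]
\[ dZ^{[2]} = A^{[2]} - Y, \qquad dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]\top} \]

where Y here is the matrix of one-hot encoded labels.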
db2 is really easy to understand: it's literally an average of the error, how much the output was off by. That's all for the second layer: we find how much the second layer was off in the prediction, and then correspondingly how much we should nudge the weights and biases for that layer. Now we do the same thing for the first hidden layer, finding dZ1, how much that hidden layer was off by. There's a fancy bit of math here, but the intuition is that it's just doing propagation in reverse. You see W2 transpose times dZ2: we're taking the error from the second layer and applying the weights to it in reverse to get the error for the first layer. We also multiply by g prime of Z1, the derivative of the activation function, because we have to undo the activation to get the proper error for the first layer. Then we do the same thing as before to calculate how much W1 and b1 contributed to the error in the first layer, and that's how much we're going to nudge them by.

After all this calculation, once we've figured out how much each weight term and each bias term contributed to the error, we update our parameters accordingly. These are a set of pretty simple equations: W1 is equal to W1 minus alpha, some learning rate, times dW1; similarly, b1 is b1 minus alpha times db1, W2 is W2 minus alpha times dW2, and b2 is b2 minus alpha times db2. Alpha is what's called a hyperparameter: it's not trained by the model. When you run this cycle, when you run gradient descent, the learning rate is a parameter that you set, not something gradient descent sets. Once we've updated the parameters, we run through the whole thing again: we go through forward prop, make a new round of predictions, and keep tweaking the parameters so that the predictions get ever closer to what the correct answer should be.

That's the math; now let's get to coding it up. I'm going to attempt to do it in the next 30 minutes. We're doing this on a site called Kaggle, which is a really great site that makes it easy to access datasets and work in Python (Jupyter) notebooks. Our digit recognizer dataset is already nicely configured here, and all we have to do is hit New Notebook and we're taken into a kernel. Here we are, we have a notebook; let's rename it real quick.

The first thing we do is import our packages: NumPy for linear algebra, for working with matrices, and pandas just for reading the data; we're not going to use it much. Next let's import the data with pandas: data is equal to pandas.read_csv, and we specify the file path; you can see how Kaggle makes it super easy to access the data. Because it's pandas, it loads as a pandas DataFrame, so we can call head on it and get a little preview. Each row corresponds to one training example: there's a label, and then all the pixel values for that example, pixel 0 all the way to pixel 783 inclusive, which is 784 pixels. We don't actually want to work with a pandas DataFrame, though; we want NumPy arrays so we can manipulate them and do linear algebra with them, so we set data equal to np.array(data).
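A minimal sketch of that setup, assuming the Kaggle Digit Recognizer file path (adjust the path for your own environment):

```python
import numpy as np
import pandas as pd

# Load the training CSV: one row per example, columns = label, pixel0 ... pixel783
data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
print(data.head())        # quick preview of the DataFrame

data = np.array(data)     # convert to a NumPy array for matrix operations
```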
Then we split our data into dev and training sets. There's always a risk of overfitting: a risk that your model figures out exactly how to make the right predictions for your training data but doesn't generalize to the data you actually want it to generalize to. So you set aside a chunk of data that you don't train on, your dev or cross-validation data, and you test your hyperparameters and your performance on that data; that way you eliminate the risk of overfitting to your actual training data.

We're going to shuffle our data; hopefully it's just np.random.shuffle, hopefully I'm not going crazy here. One thing before that: we'll grab the dimensions of the data, m, n = data.shape. m is the number of rows, which is the number of examples we have, and n is not quite the number of features, it's the number of features plus one because of the label column, but it'll be helpful to have. Now data_dev is data from 0 to 1000, the first thousand examples, all columns, and we transpose it. Remember, that's the step where we flip it so that each column is an example rather than each row, which just makes things easier here. Y_dev is now the first row, which is super convenient: Y_dev = data_dev[0], and X_dev = data_dev[1:n]; that's our n coming in handy. Then data_train, the data we actually train on, is rows 1000 to m, so all the rest of it, transposed again; Y_train = data_train[0] and X_train = data_train[1:n]. That should be all of our data loaded in.

Let's run that and take a look at, for example, Y_train. There it is: an array of all of our labels. Then let's take a look at X_train's first column. Looking at the first row isn't what I want; I want the first column. There we go: that's our first column, and it has 784 pixels in it, exactly what we want. OK, now we have our data and we can start bashing out code for the actual neural network part. Let's see how much time we've spent; OK, seven minutes, we'll move fast.
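Here's a sketch of that shuffle and split, following the variable names in the narration:

```python
np.random.shuffle(data)       # shuffle before splitting into dev/train sets

m, n = data.shape             # m examples, n = 784 pixel columns + 1 label column

data_dev = data[0:1000].T     # first 1000 examples, transposed: one example per column
Y_dev = data_dev[0]           # labels are the first row after transposing
X_dev = data_dev[1:n]         # remaining rows are the 784 pixel values

data_train = data[1000:m].T   # everything else is training data
Y_train = data_train[0]
X_train = data_train[1:n]
```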
The first thing we do now is initialize all of our parameters: we need a starting W1, b1, W2, b2, and we have all their dimensions from before. So def init_params, and it takes no arguments, because it's creating everything completely from scratch. W1 is np.random.rand with dimensions 10 by 784; that generates random values between 0 and 1 for each element of the array, so we subtract 0.5 to get values between negative 0.5 and 0.5. b1 is np.random.rand with dimensions 10 by 1, and again we subtract 0.5. Then the same exact thing for W2, which is 10 by 10, and b2, which is 10 by 1, and we return all of these values.

After we initialize the params we're basically ready to go into forward propagation. So def forward_prop, and it takes W1, b1, W2, b2 as arguments, and also X; we'll need X too. First we calculate Z1. Remember W1 is a NumPy array, so we can do matrix operations with dot: Z1 = W1.dot(X) + b1. Then A1 is going to be ReLU of Z1, but ReLU isn't defined yet, so let's define it. ReLU of Z: remember, ReLU is just that piecewise linear function, x if x is greater than zero and zero if x is less than or equal to zero. There's a pretty elegant way to do this: np.maximum(Z, 0). It's element-wise, so it goes through each element of Z; if the element is greater than zero it returns that element, and if it's less than zero it returns zero, which is exactly what we want, so we return that. So A1 is equal to ReLU(Z1), exactly what we want. Now Z2 = W2.dot(A1) + b2, and A2 is softmax of A1 (this turns out to be a mistake; it should be Z2, as I explain later in editing). We need to define softmax too: softmax of Z, referencing the formula from before. We can use np.exp, which is just e to the x applied to each element of the array, and divide that by np.sum of np.exp(Z). First it applies e to the z to every single value, and then np.sum sums down each column, preserving the number of columns and collapsing the rows to one. That's exactly what we want: the sum for each column across all of the rows in that column, and dividing each element by that sum gives us the probabilities we want. And there we go, forward prop is done.
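Here is a sketch of those functions, assuming the variable names from the narration and applying softmax to Z2 (the correction mentioned later in the video); the sum inside softmax is taken column-wise, as described:

```python
def init_params():
    W1 = np.random.rand(10, 784) - 0.5   # random values in (-0.5, 0.5)
    b1 = np.random.rand(10, 1) - 0.5
    W2 = np.random.rand(10, 10) - 0.5
    b2 = np.random.rand(10, 1) - 0.5
    return W1, b1, W2, b2

def ReLU(Z):
    return np.maximum(Z, 0)              # element-wise: x if x > 0, else 0

def softmax(Z):
    # e^z for every element, divided by the column-wise sum of e^z
    return np.exp(Z) / np.sum(np.exp(Z), axis=0, keepdims=True)

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1   # unactivated first layer
    A1 = ReLU(Z1)         # activated first layer
    Z2 = W2.dot(A1) + b2  # unactivated output layer
    A2 = softmax(Z2)      # probabilities for the 10 digit classes
    return Z1, A1, Z2, A2
```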
Now for backprop. We're going to take in, well, let's just take in all of these values. The first thing we actually need to do is one-hot encode Y: take the labels and turn them into that matrix of zeros and ones from before. So we define a function one_hot that takes in Y. I've cheated a little here and taken this from when I did this previously. What it does is first create a new matrix one_hot_Y, which is np.zeros of a tuple of sizes: Y.size, which is just the length of Y, so that's m, how many examples there are, and Y.max() + 1. It assumes the classes are 0 through 9, so the max is 9, and adding one gives 10, which is exactly how many output classes we want, so this creates the correctly sized matrix for us. Then there's a really cool thing: indexing into one_hot_Y using arrays. np.arange(Y.size) creates an array that's just a range from 0 to m, the number of training examples, and that specifies which row we're accessing; Y, remember, is our whole list of labels, something like 0, 1, 4, 1, 7, and that specifies which column we're accessing. So this goes through and says: for each row, go to the column specified by the label in Y and set it to one. That's beautiful, that's exactly what we want; it does the one-hot encoding for us. The one last thing we want to do is flip it, one_hot_Y.T, and return it. We want to flip it because right now each row is an example and we want it the other way around, each column an example, so we transpose it and return it back.

So now, in backprop, one_hot_Y is equal to one_hot(Y), and dZ2, a new variable, is A2, our predictions, minus one_hot_Y. dW2 is 1/m, where we define m = Y.size just like we did before, times dZ2.dot(A1.T); I believe that's right. Next, db2 is just 1/m times np.sum(dZ2). Then we want dZ1, which is our fancy formula: W2.T dotted with dZ2, which is kind of applying the weights in reverse, times the derivative of the activation function in layer one. The activation function in layer one, remember, was ReLU, so we have to implement the derivative of ReLU. That seems kind of fancy at first, and I was freaked out when I first had to do it, like, how on earth am I supposed to do this? But it's actually really easy; just think about your calculus. What's the slope of the positive part? It's one, it's just a line. What's the slope of the flat part? Zero. So a really elegant way to do this is to return Z > 0. That works because when booleans are converted to numbers, True converts to 1 and False converts to 0: if an element in Z is greater than zero we get 1, otherwise we get 0, which is exactly the derivative that we want. Then the same thing as before for the first layer; we'll actually need X as an argument as well, because dW1 uses X.T here. From there we return dW1, db1, dW2, db2.

Now def update_params; we're all the way down to update_params. It takes in dW1, db1, dW2, db2, and actually before those, W1, b1, W2, b2, and then alpha. This part is really straightforward: W1 is equal to W1 minus alpha times dW1, b1 equals b1 minus alpha times db1, and then we do the same thing for layer two, and we return our new W1, b1, W2, b2.
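A sketch of those pieces, again following the narration's variable names (the exact argument list for the backprop function isn't spelled out in the transcript, so the signature below is an assumption):

```python
def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))  # m rows x 10 classes, all zeros
    one_hot_Y[np.arange(Y.size), Y] = 1          # row i gets a 1 in column Y[i]
    return one_hot_Y.T                           # flip so each column is an example

def deriv_ReLU(Z):
    return Z > 0                                 # slope 1 where Z > 0, else 0

def backward_prop(Z1, A1, Z2, A2, W2, X, Y):
    m = Y.size
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y                         # output-layer error
    dW2 = 1 / m * dZ2.dot(A1.T)
    db2 = 1 / m * np.sum(dZ2)
    dZ1 = W2.T.dot(dZ2) * deriv_ReLU(Z1)         # propagate the error backwards
    dW1 = 1 / m * dZ1.dot(X.T)
    db1 = 1 / m * np.sum(dZ1)
    return dW1, db1, dW2, db2

def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2
```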
So, in theory, if that's all correctly implemented, those are all the functions we need to run gradient descent on our network. So def gradient_descent; that's what we do now. It takes X and Y first, because we need both of those, then iterations and alpha, and I believe that's all we need. First, W1, b1, W2, b2 = init_params(); those are our initial values. Now we run the loop we were talking about: for i in range(iterations), first do forward prop, so Z1, A1, Z2, A2 = forward_prop, passing in W1, b1, W2, b2. Next we get our dW1, db1, dW2, db2 from backprop, and lastly we update our weights: W1, b1, W2, b2 is now equal to update_params with W1, b1, W2, b2, dW1, db1, dW2, db2, and alpha. That runs a bunch of times, and by the end of it, it returns the params that we want.

We'll also add a few things so we can actually see our progress. I'm going to cheat a bit and check the code from the last time I did this, and just steal these two functions, get_predictions and get_accuracy. What we do here is: if i mod 10 equals 0, so every tenth iteration, we print the iteration number i, and we print the accuracy (it's going to be a little ugly): get_predictions(A2), so we take A2, our predictions from forward prop, get the predicted digits from those, and then get_accuracy on that and Y.

Let's run that, so all those functions are defined, and now let's run gradient descent and see what happens. I'll take these parameters: W1, b1, W2, b2 = gradient_descent(X_train, Y_train), now using the actual variables, and we'll go for, let's say, 100 iterations with a learning rate of 0.1.

Editing Samson here. Because of those two errors earlier in the video, namely initializing the weights and biases to all negative numbers and putting in an A1 where there should have been a Z2, the model of course did not run well. It ended up with 10% accuracy, which is just random guessing, and I spent about an hour debugging in the end just to change those two variables; all the rest of the code was fine, I didn't change any of it. I'm going to skip over all of that in editing and go straight to the end, straight to an hour later, when I figured out what was wrong and finally ran the model.
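A sketch of the training loop and the helper functions mentioned above. The bodies of get_predictions and get_accuracy aren't shown in the transcript (they were reused from an earlier notebook), so these are reasonable reconstructions:

```python
def get_predictions(A2):
    return np.argmax(A2, 0)                      # most probable digit per column

def get_accuracy(predictions, Y):
    return np.sum(predictions == Y) / Y.size     # fraction of correct labels

def gradient_descent(X, Y, iterations, alpha):
    W1, b1, W2, b2 = init_params()
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:                          # report progress every 10 iterations
            print("Iteration:", i)
            print("Accuracy:", get_accuracy(get_predictions(A2), Y))
    return W1, b1, W2, b2

# 100 iterations as narrated; the results quoted later mention iteration 250,
# so the final run presumably used more.
W1, b1, W2, b2 = gradient_descent(X_train, Y_train, 100, 0.1)
```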
But look at that: 75% accuracy, then 78% at iteration 250, and 84% accuracy on the training data is where we ended up. There's definitely a lot of room left for improvement, things like adding more layers or more units per layer, but 84% for something we threw together in less than 30 minutes is not bad.

Let's try out a few things. I'm just going to rip code straight from my old notebook: this is a function that makes a prediction, prints out the prediction and the label, and then displays the image. Once we've defined that, let's do test_prediction on some index, it doesn't really matter which. Bang, it's a zero: our model predicts a zero and the label is zero. Let's do the second one; there we go. And this one, so yeah, this is a tricky one, it's a weird one, you could see maybe it's a four or maybe it's a five, I don't know, something weird, but no, the model did it: it's an eight, and the label is an eight. I want to find one that it mislabels, because the mislabeled ones are pretty interesting as well. OK, we're going sequentially and we're not finding one that's mislabeled; it's doing really well. Here we go, here's one. Yeah, so this is a janky 3 and it labeled it a 5, and you can definitely see why: shift one stroke over and it's basically a perfect five. This is the kind of thing that a model with more layers and more units would be able to recognize better; this model wasn't quite complex enough to capture what's basically a 45 degree rotation here. But we went through a good number of examples without finding one that was mislabeled, so this is definitely a model that actually worked.

Just as the last step, I want to check the cross-validation accuracy: dev_predictions on X_dev with W1, b1, W2, b2, and then get the accuracy. Let's run that through and, look at that, 85.5% accuracy on the dev set. That's not the training data; we didn't train on this data, so it's effectively testing data, we haven't done any optimization for it, and 85.5% accuracy is pretty good.
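A sketch of those evaluation helpers. The display code isn't shown in the transcript, so the matplotlib usage here is an assumption:

```python
from matplotlib import pyplot as plt

def make_predictions(X, W1, b1, W2, b2):
    _, _, _, A2 = forward_prop(W1, b1, W2, b2, X)
    return get_predictions(A2)

def test_prediction(index, W1, b1, W2, b2):
    prediction = make_predictions(X_train[:, index, None], W1, b1, W2, b2)
    label = Y_train[index]
    print("Prediction:", prediction)
    print("Label:", label)
    image = X_train[:, index].reshape((28, 28))  # back to a 28x28 image for display
    plt.gray()
    plt.imshow(image, interpolation='nearest')
    plt.show()

test_prediction(0, W1, b1, W2, b2)               # inspect individual examples

# Accuracy on the held-out dev set (data the model never trained on):
dev_predictions = make_predictions(X_dev, W1, b1, W2, b2)
print(get_accuracy(dev_predictions, Y_dev))
```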
There you go, we've done it: we built a neural network from scratch, and hopefully watching me code through it and explain those equations helped a bit. I'll have a link in the description to a blog post with all of these notes, and a link to this notebook, so you can look through all the code and all the equations and figure everything out for yourself if you so desire. There's a lot of other stuff to explore implementing manually: things like regularization, or different optimization methods. Instead of plain gradient descent there's gradient descent with momentum, RMSProp, and Adam optimization, which are all just variants of gradient descent; they're fairly simple to implement, and it could be interesting to implement them yourself and see what the impact is. This was a fun exercise. It's always satisfying to see the accuracy go up like that; it's satisfying on a model you train using Keras or TensorFlow, and it's all the more satisfying on a model you built yourself. That'll be it for this video, just a simple demo of how you can build out all the math that's on this screen. I hope this gave you a more concrete understanding of how neural networks work, and maybe piqued your interest to dive into the math and do some more yourself. Thanks for watching, and I'll see you in future videos if you decide to stick around.
Info
Channel: Samson Zhang
Views: 767,514
Id: w8yWXqWQYmU
Length: 31min 28sec (1888 seconds)
Published: Tue Nov 24 2020