Joel Grus - Livecoding Madness - Let's Build a Deep Learning Library

Video Statistics and Information

Captions
Hi, I'm Joel, and this is Livecoding Madness: Let's Build a Deep Learning Library. So who am I? I'm a research engineer at the Allen Institute for Artificial Intelligence in Seattle. I wrote a book called Data Science from Scratch; it's good, you should check it out. I have a Twitter account; it's also good, you should also check it out. And I'm the co-host of a podcast called Adversarial Learning; please listen to it, I think you'll like it.

Today we're going to be live coding, and I live code the way that I live code, which is to say I use type hints on everything, so get used to them, because I'm using them. I use Python 3.6, so I use Python 3.6-only features; if you don't use Python 3.6, get used to that as well. I type fast, that's how I type, and I'm going to talk fast too. It's possible I'll swear at my text editor; sometimes it autocompletes in a bad way that makes me really angry.

We're going to be building a deep learning library: not just some code that does deep learning, but a library that's extensible and can solve lots of different kinds of problems. To that end we're going to use good abstractions, good variable names, we're going to document what we're doing, and we're going to make heavy use of mypy, which is a Python static type checker, and pylint, which checks for style. They will hopefully find all of our mistakes before we ever run our code and let us code very quickly. If this were live I would say "you tell me when I screw up," but this is recorded, so you can't do that. You can still yell at the screen and pretend I hear you, and hopefully I will fix the things you see.

Here's the one-slide version of what deep learning is: we represent our input data as multi-dimensional arrays (an image might have a width, a height, a number of channels, and an intensity for each of those); we predict outputs using a parametrized deep neural network; we have a loss function that depends on the inputs and the parameters and tells us how good our predictions are; and then we use some combination of calculus, greediness, and cleverness to find parameters that minimize the loss and therefore result in good predictions.

I have a plan, an outline, because without it I can't get this done: tensors, loss functions, layers for the neural nets, neural nets, optimizers, data, training, and end-to-end examples: XOR and fizz buzz. So now it's time for some live coding madness.

I made this directory called joelnet; I'm calling it joelnet because that's me, Joel. Right now it has a README and nothing else; it also has a .gitignore and a pylintrc, and it's got our plan in it, because I need the plan as a cheat sheet. I'll make a directory called joelnet to put our library in, and because it's a Python library I need to add an __init__.py to that folder. Finally I'll start a VS Code window here and use that as our editor. It's got the README and our plan; perfect.

The first step in our plan is tensors, so let's add a file. I'll call it tensor.py, and we're going to use docstrings to document everything. A tensor is just an n-dimensional array. If we had a lot of time we would create a really good, full-featured tensor class; we don't have a lot of time and we're in a hurry, so I'm going to cheat and say from numpy import ndarray as Tensor, which gives us numpy's n-dimensional array class as our Tensor, and we're done. That's our tensor class.

We can move on to the next best thing, which is loss functions, so let's add a file to put loss functions in, and get rid of this Explorer pane. We'll say: a loss function measures how good our predictions are; we can use this to adjust the parameters of our network. What does a loss function look like? To start with, from joelnet.tensor we import the Tensor class, and I happen to know I'm going to need numpy, so I'll import numpy as np, as is traditional.

Our loss function will have two methods. One I'll just call loss; it takes some predictions, which are a Tensor, and some actual values, which are also a Tensor, and it returns a float, some number. This is our abstract base Loss class; we'll be able to implement a lot of different losses, so we just raise NotImplementedError so that if someone tries to use this base class directly it won't work. It also has a grad method. This grad is the gradient: the vector (or matrix, or tensor) of partial derivatives of the loss function with respect to each of the predictions going into it. So it also takes predicted as a Tensor and actual as a Tensor, and this time it returns a Tensor, the same shape as predicted, because it's the tensor of partial derivatives of the loss with respect to each of those predictions. Again, this is a base class, so raise NotImplementedError.

Now we want to implement an actual loss function, so let's call it MSE. MSE is mean squared error, although we're really going to do total squared error, so I didn't name it right, but I like calling things MSE, so don't tell anyone. How do we compute the loss? The error is just predicted minus actual, the squared error squares it, and to take the total I just do np.sum, so I sum up (predicted - actual) squared. For the gradient: when you have a function x squared, its derivative is just 2x, and the same thing holds here, so the gradient of this loss is just 2 * (predicted - actual). We've set this up so that we could implement a lot more loss functions, but for today this is the only one we'll have. And I should probably put return in front of these, because otherwise it won't work. That's all we need for loss functions.
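Putting that together, the loss file might look roughly like this. This is a sketch reconstructed from the narration above, not a verbatim copy of what's on screen, and the exact filename loss.py is my assumption (the audio only says "a file to put loss functions in"):

```python
# joelnet/loss.py  (filename assumed)
"""
A loss function measures how good our predictions are.
We can use this to adjust the parameters of our network.
"""
import numpy as np

from joelnet.tensor import Tensor


class Loss:
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        raise NotImplementedError

    def grad(self, predicted: Tensor, actual: Tensor) -> Tensor:
        raise NotImplementedError


class MSE(Loss):
    """
    MSE is mean squared error, although we're
    really computing total squared error.
    """
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        return np.sum((predicted - actual) ** 2)

    def grad(self, predicted: Tensor, actual: Tensor) -> Tensor:
        return 2 * (predicted - actual)
```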
Now our next step, part 3, is layers, so let's add layers.py. What are layers? Our neural nets will be made up of layers; each layer needs to pass its inputs forward and propagate gradients backward, and we'll talk more about what that means when we implement it. For example, our neural net might look like this: we have some inputs, we send them through a linear layer, then through a hyperbolic tangent activation layer, then through another linear layer, and those are our outputs. Here's where we build the layer machinery to make that happen. From joelnet.tensor we need the Tensor class, we're probably also going to need numpy, import numpy as np, and I think that's enough to get us started.

Let's define a class Layer. I'm going to give it a constructor, because I happen to know I want one, but it's not going to do anything right now. It has two methods. One is forward, which takes some inputs, a Tensor, and returns a Tensor: produce the outputs corresponding to these inputs. Since this is the base class, raise NotImplementedError. It also has backward, which takes self and a grad, a Tensor, and returns a Tensor; it back-propagates this gradient through the layer. That is, if we have a gradient, which is the partial derivatives of some function (in practice our loss function) with respect to the outputs of this layer, backward takes it and produces the gradient of that same function with respect to the inputs of this layer. Along the way it also saves the gradients with respect to the weights; you'll see that in a second. Raise NotImplementedError here too, and that's our Layer.

I haven't done this yet, so let's run mypy. mypy is the command-line static type checker; I'll run it on joelnet, and I need to add the flag --ignore-missing-imports, because we're using numpy and numpy does not itself have type hints. If I don't use that flag it throws a bunch of errors and warnings that numpy doesn't have type stubs, and I'm not going to write type stubs for numpy today. mypy is happy; so far we haven't made any egregious type errors.

Our first layer is going to be a linear layer. A linear layer just computes output = inputs @ w + b: the inputs matrix-multiplied by some weights (that @ is the matrix-multiplication operator in Python 3) plus some bias term. That's all this layer does; it's a linear regression, or a linear function, or whatever. It needs a constructor, and in particular we need to tell it an input_size, which is an int, and an output_size, which is also an int, and it returns None. Incidentally, these are the type hints I've been talking about: we say this input_size is an int, this output_size needs to be an int, and mypy wants all of our constructors to return None. This way, if at some point we try to feed in, say, a string here, mypy will get angry.

What do I want to do? I'm going to use the input_size and output_size to create the parameters, starting with this "w" parameter, which is what I want to multiply our inputs by. I've put a comment here: inputs will be (batch_size, input_size) and outputs will be (batch_size, output_size), because we're going to make all of our layers take batches. So we want to multiply (batch_size, input_size) by something and get (batch_size, output_size), which means w is np.random.randn(input_size, output_size), initialized with a standard random normal. Similarly I need self.params["b"] = np.random.randn(output_size); if I give it just output_size it will broadcast correctly and we'll be fine.

If I run mypy again, it's going to complain: "Linear has no attribute params". Right: I tried to assign into this dictionary, but that dictionary isn't defined in the base class. So in Layer, self.params is an empty dict, and I need to make sure to call the superclass constructor to get that dict initialized. Run mypy again and it complains again: "need type annotation for variable", layers.py line 16. I defined self.params without telling it what type it is, so it can't do anything with it. It's going to be a dictionary whose keys are strings and whose values are Tensors. Dict isn't built into Python for annotations, so from typing, which is where a lot of the type hint stuff comes from, we import Dict. Now we say these params are a Dict[str, Tensor], and now I think it will pass. Okay, good.

Now we get to the interesting part, which is pushing inputs forward through this linear layer. We have some inputs, which are a Tensor, and we want to return a Tensor, and again let's just repeat: outputs = inputs @ w + b. The only really interesting thing here is that we want to save a copy of the inputs, self.inputs = inputs, so that when we do backpropagation we can use them, because we're going to need them. Then all we do is return inputs @ self.params["w"] + self.params["b"]. That's all forward has to do; nothing that exciting, although maybe it is interesting.

Backpropagation is where something really interesting happens. I have some gradient, the gradient of some function with respect to the outputs of this linear layer, and I want to back-propagate it to get the gradient of that function with respect to the inputs of this linear layer. Along the way I also want to compute the gradients of that function with respect to the parameters inside this layer, and you'll see what I do with those in a second.

Let's do a little bit of explanation. If y = f(x) and x = a * b + c, then dy/da = f'(x) * b, dy/db = f'(x) * a, and dy/dc = f'(x). That's easy enough; it's just single-variable calculus, if you did calculus. If instead x = a @ b + c, with a and b matrices and c a vector, then dy/da is f'(x) matrix-multiplied by b transpose, dy/db is a transpose matrix-multiplied by f'(x), and dy/dc is just f'(x), except we need to sum it across the batch dimension. This is not obvious, but it's not hard either: if you sat down and wrote out what this multiplication means in terms of the elements of a and b and worked out the math, you'd find that this is the case. It's sort of tedious and we're not going to do it right now, but this is what we're going to use.

We want to apply this here: the output is inputs @ w + b, so think of a as the inputs, b as w, and c as b, which were maybe bad names, but anyway. So we'll fill in self.grads, with respect to b first, actually. All we need to do there is sum up the grad along axis=0: we have this batch dimension, the outputs are (batch_size, output_size), and we want the gradient for b to just have size output_size, so we sum along the batch dimension, because those contributions just add up. Then self.grads["w"]: since the output is basically inputs matrix-multiplied by w, when we take the derivative with respect to w we get inputs.T @ grad. Finally we return the gradient with respect to the inputs, which is grad @ self.params["w"].T.
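Written out cleanly, the derivative facts I just talked through are the following (a restatement of standard matrix-calculus results, with f'(x) playing the role of the incoming grad):

```latex
\text{Scalars: if } y = f(x),\; x = ab + c:\quad
\frac{\partial y}{\partial a} = f'(x)\,b,\quad
\frac{\partial y}{\partial b} = f'(x)\,a,\quad
\frac{\partial y}{\partial c} = f'(x).

\text{Matrices: if } y = f(x),\; x = a \mathbin{@} b + c,\;
a \in \mathbb{R}^{\text{batch}\times\text{in}},\;
b \in \mathbb{R}^{\text{in}\times\text{out}},\;
c \in \mathbb{R}^{\text{out}}:\quad
\frac{\partial y}{\partial a} = f'(x) \mathbin{@} b^{\top},\quad
\frac{\partial y}{\partial b} = a^{\top} \mathbin{@} f'(x),\quad
\frac{\partial y}{\partial c} = \textstyle\sum_{\text{batch}} f'(x).
```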
We don't have grads defined either, so let's go back and define it up in Layer as well: self.grads, again a dictionary whose keys are strings and whose values are Tensors. That should be pretty good. I'm going to shrink this because it keeps going off the screen, and mypy seems pretty happy with it. Perfect.

So we did the linear layer. We also want a tanh layer, but tanh is an activation layer, so let's do activation layers more generally. What is an activation layer? class Activation is a subclass of Layer, and an activation layer just applies a function elementwise to its inputs. Normally this function will be some kind of nonlinear function, but you could use a linear function if you really wanted to; that would be a little strange, but I won't judge you. What does its constructor take? We need some function f, whose type I'll call F (capital F) and come back to in a second, and we also need its derivative f_prime, so that when we do backward with the chain rule it will work. I'll just save these: self.f = f, self.f_prime = f_prime. So what is this F type? It's a function that takes a Tensor and returns a Tensor, so I can say F = Callable[[Tensor], Tensor]. Callable is the typing type for functions: you give it a list of the input types and the single output type, so F is a callable from Tensor to Tensor. Now I need to import Callable from typing, and let's remember to call the superclass constructor as well. Run mypy and see what it says: perfect.

Again, we could create all kinds of activation layers, but today we're going to create just one, the hyperbolic tangent activation layer, which means we need to define the f and f_prime that go into it. So def tanh(x: Tensor) -> Tensor, and this is real easy: I just return np.tanh(x), except autocomplete is going to try to erase it. We also need the derivative, and I happen to know the derivative of tanh is the following: y = tanh(x); return 1 - y ** 2. That's not obvious, but if you work through the math it works out. Now I can define Tanh, a subclass of Activation, and its constructor is real simple because it doesn't take any parameters: I just call the superclass constructor, the superclass being Activation, and give it tanh and tanh_prime.

Oh, one thing I forgot to do was implement forward and backward for Activation, so we should probably do that too; that's an important part of making this thing work, and hopefully some of you were yelling at me about it. Forward takes some inputs, which are a Tensor. This is real easy again: we save the inputs, because we'll need them during backpropagation, and then we just return self.f(inputs), so in our Tanh example we save the inputs and return tanh of the inputs. Then def backward(self, grad): we have a gradient with respect to the output, and we want the gradient with respect to the input. Again, a bit of math: if y = f(x) and x = g(z), then dy/dz = f'(x) * g'(z); that's just the chain rule applied elementwise. So that's what we do here: self.f_prime(self.inputs), multiplied by the gradient, which is the rest of the derivative. Actually, I've got that a little backwards, so let me see if I can explain it better, because I think I explained it poorly: think of f as being the rest of the neural net and g as the part being done by this layer. Then g'(z) is actually self.f_prime(self.inputs), and f'(x) is the gradient with respect to the outputs of this layer, which is exactly the grad we were handed. Anyway, the moral of the story is that our layers are now finished, and I think we're pretty much done with layers and losses.
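Putting the whole layers discussion together, the file might look roughly like this. Again, this is a sketch reconstructed from the narration; details like the exact comments are mine:

```python
# joelnet/layers.py
"""
Our neural nets will be made up of layers. Each layer needs
to pass its inputs forward and propagate gradients backward.
"""
from typing import Callable, Dict

import numpy as np

from joelnet.tensor import Tensor


class Layer:
    def __init__(self) -> None:
        self.params: Dict[str, Tensor] = {}
        self.grads: Dict[str, Tensor] = {}

    def forward(self, inputs: Tensor) -> Tensor:
        """Produce the outputs corresponding to these inputs."""
        raise NotImplementedError

    def backward(self, grad: Tensor) -> Tensor:
        """Backpropagate this gradient through the layer."""
        raise NotImplementedError


class Linear(Layer):
    """Computes output = inputs @ w + b."""
    def __init__(self, input_size: int, output_size: int) -> None:
        # inputs will be (batch_size, input_size)
        # outputs will be (batch_size, output_size)
        super().__init__()
        self.params["w"] = np.random.randn(input_size, output_size)
        self.params["b"] = np.random.randn(output_size)

    def forward(self, inputs: Tensor) -> Tensor:
        # save the inputs for use during backpropagation
        self.inputs = inputs
        return inputs @ self.params["w"] + self.params["b"]

    def backward(self, grad: Tensor) -> Tensor:
        # if y = f(x) and x = inputs @ w + b, then
        #   dy/db = f'(x) summed over the batch dimension
        #   dy/dw = inputs.T @ f'(x)
        #   dy/dinputs = f'(x) @ w.T
        self.grads["b"] = np.sum(grad, axis=0)
        self.grads["w"] = self.inputs.T @ grad
        return grad @ self.params["w"].T


F = Callable[[Tensor], Tensor]


class Activation(Layer):
    """Applies a function elementwise to its inputs."""
    def __init__(self, f: F, f_prime: F) -> None:
        super().__init__()
        self.f = f
        self.f_prime = f_prime

    def forward(self, inputs: Tensor) -> Tensor:
        self.inputs = inputs
        return self.f(inputs)

    def backward(self, grad: Tensor) -> Tensor:
        # chain rule: if y = f(x) and x = g(z), then dy/dz = f'(x) * g'(z)
        return self.f_prime(self.inputs) * grad


def tanh(x: Tensor) -> Tensor:
    return np.tanh(x)


def tanh_prime(x: Tensor) -> Tensor:
    y = tanh(x)
    return 1 - y ** 2


class Tanh(Activation):
    def __init__(self) -> None:
        super().__init__(tanh, tanh_prime)
```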
So next is neural nets. Let's make a file nn.py, where nn is for neural nets, and we'll say: a NeuralNet is just a collection of layers; it behaves a lot like a layer itself, although we're not going to make it one. We're definitely going to need from joelnet.tensor import Tensor and from joelnet.layers import Layer, and I think that might be all we need for the moment.

So we have a class NeuralNet, and to initialize it we just give it some layers, which will be a Sequence[Layer]. Sequence is again a Python typing thing: it can be either a list of layers or a tuple of layers. Probably you'd give it a list, but maybe you want to give it a tuple, so I'll just use Sequence. I'll save the layers, self.layers = layers, which means I need to go back and from typing import Sequence. At this point I hope you appreciate the type hints... pylint is complaining about "no name 'layers'" or something; let's run mypy and see why it's complaining, and it's complaining for no reason. Okay, good; sometimes pylint gets unhappy, so you can't always trust it. Anyway, what I was going to say is that I hope you'll agree with me that in addition to enabling the mypy checking, these type hints actually make the code a lot more readable, so you can understand what's going on much better.

We give the neural net a forward method that looks a lot like our layers' forward method, and all we do is: for layer in self.layers, inputs = layer.forward(inputs), then return inputs. Real easy. To operate on the network you just push the inputs through one layer at a time. Obviously there are more complicated neural nets that can't be thought of as stacks of layers, and our library won't handle those, but for the kinds of things we're doing it will work just fine. We also give it a backward method, done the same way: we have some gradient with respect to the output of the network, and we want to go through the layers backwards, so for layer in reversed(self.layers), grad = layer.backward(grad), then return grad. We take the gradient and push it backwards one layer at a time until it comes out the front of the network, and then we return it. That should be good; let's run mypy one more time, and it seems to work. Okay, that's it for neural nets for now.
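Here's roughly what nn.py might look like, sketched from the description above. The params_and_grads method is added a bit later in the video, when the optimizer needs it, but I've included it here so the class is complete:

```python
# joelnet/nn.py
"""
A NeuralNet is just a collection of layers. It behaves a lot
like a layer itself, although we're not going to make it one.
"""
from typing import Iterator, Sequence, Tuple

from joelnet.layers import Layer
from joelnet.tensor import Tensor


class NeuralNet:
    def __init__(self, layers: Sequence[Layer]) -> None:
        self.layers = layers

    def forward(self, inputs: Tensor) -> Tensor:
        # push the inputs through one layer at a time
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs

    def backward(self, grad: Tensor) -> Tensor:
        # push the gradient back through the layers in reverse order
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad

    def params_and_grads(self) -> Iterator[Tuple[Tensor, Tensor]]:
        # yield (parameter, gradient) pairs, layer by layer
        for layer in self.layers:
            for name, param in layer.params.items():
                grad = layer.grads[name]
                yield param, grad
```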
Okay, so the next step is optimizers. We have a neural net, we have a loss function, we have layers, and now we need some way of adjusting things: we use an optimizer to adjust the parameters of our network based on the gradients computed during backpropagation. Let's spell that out: class Optimizer. An optimizer actually has only one method, step, which takes a net, a NeuralNet, and returns nothing. This is the base class again, so we raise NotImplementedError, and from joelnet.nn we import NeuralNet.

We're again going to implement only one specific optimizer, and that will be our stochastic gradient descent optimizer; again, I misspelled it. Our stochastic gradient descent optimizer takes one parameter, a learning rate, which is a floating-point number, and I'll default it to 0.01, not for any good reason, just because I chose that number. We say self.lr = lr, and now we give it a step method as well. If we have some neural net and we want this optimizer to take an optimization step, how does stochastic gradient descent work? Pretty simply: for each param and grad in net.params_and_grads() — and here I'm calling a function that doesn't exist yet, so we'll need to go write that in a minute — all we do is take the param and subtract off the learning rate times the grad. If I have a function of some tensor and I compute the gradient of the function with respect to that tensor input, the gradient gives me the direction in which the function increases fastest. So if I adjust the parameter in the direction opposite the gradient, that's the direction in which the function decreases fastest, and, assuming the function is reasonably well behaved, this makes the function we're taking the gradient of — the loss function — smaller. The learning rate is a small factor to make sure our steps aren't too big, since derivatives are only valid for small changes. So that will work.

But now we need this net.params_and_grads(), which we don't have, so let's write it. Let's visit NeuralNet again and give it another method, params_and_grads. It takes self, and what does it return? Something we can iterate over, an Iterator, of pairs of tensors: parameters and gradients. In typing world we do that with tuples, Iterator[Tuple[Tensor, Tensor]], so let me go back up to my typing imports and add Iterator and Tuple. This is also pretty easy to write: we go layer by layer, so for layer in self.layers, for name, param in layer.params.items(), we get the gradient out of layer.grads — the grad is layer.grads[name] — and we yield the param and the grad. This is really a generator, but I annotated it as an Iterator, and that works too; it's fine. Ask mypy again, and it says it's good, so mypy is happy. That doesn't mean our code works, but it means it doesn't have any egregious type errors. I think we're done with the neural net for now, and we did the optimizer.
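The optimizer file might look roughly like this; the filename optim.py is my guess, since the audio doesn't spell it out:

```python
# joelnet/optim.py  (filename assumed)
from joelnet.nn import NeuralNet


class Optimizer:
    def step(self, net: NeuralNet) -> None:
        raise NotImplementedError


class SGD(Optimizer):
    """Stochastic gradient descent."""
    def __init__(self, lr: float = 0.01) -> None:
        self.lr = lr

    def step(self, net: NeuralNet) -> None:
        # move each parameter a small step against its gradient,
        # the direction in which the loss decreases fastest
        for param, grad in net.params_and_grads():
            param -= self.lr * grad
```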
So the optimizer is done; next is data. We want to feed our inputs through the network in batches, so we need some tools for that: we'll feed inputs into our network in batches, so here are some tools for iterating over data in batches. From joelnet.tensor we definitely need Tensor, we definitely need numpy as np, and from typing we need Iterator and NamedTuple. If you don't know NamedTuple, a named tuple is sort of a tuple where the entries have names, and in this capitalized version they have types as well.

So here are our tools for working with batches. First let's define Batch; this Batch is a NamedTuple. There's a new-style NamedTuple format in 3.6 that I haven't used, and if someone wants to ask me why I don't use that: I didn't know it well. Well, now I know it, but if I used it I'd probably screw it up, so I'm not going to use it right now. Batch has some inputs, which are a Tensor, and some targets, which are also a Tensor. So Batch has these two fields, and if I have a batch I can do batch.inputs or batch.targets and I know they'll be Tensors.

Next I'll define a class DataIterator. In an ideal world I'd probably call it Iterator, but there's already something called Iterator, as you can see right here, and I don't want a name collision. It has just one method, __call__, which means I can call it as if it were a function. I give it some inputs, which are a Tensor, and some targets, again a Tensor, and it returns an Iterator over Batches. It raises NotImplementedError, and I'm going to implement only one of these as well. I'll call it BatchIterator; really it's more of a batch generator, but that's what I'll call it. Its constructor has a couple of parameters: a batch_size, which is an int defaulting to 32 (I think that's a good size), and a shuffle param, which is a bool defaulting to True. That just means: do we want to shuffle the order every time we pass over the dataset? And we probably do. I just save these: self.batch_size = batch_size and self.shuffle = shuffle. Easy enough.

What happens when I call this on inputs and targets? In an ideal world I would shuffle everything together, but I'm cheating a little: I'm going to find the batch start indexes and shuffle just those, so I shuffle the batches but not within the batches. starts is np.arange: we start at 0, go all the way up to the number of inputs, and our step size is self.batch_size, so if we leave it at 32 this will be 0, 32, 64, and so on. Then if self.shuffle, I just call np.random.shuffle(starts). Now I want to iterate over batches, so I use these starts to create them: for start in starts, the end is just start + self.batch_size, the batch inputs are inputs[start:end], the batch targets are targets[start:end], and I yield a Batch with the batch inputs and batch targets. I think that's all I need to do here. Again, we should check it with mypy to make sure we didn't get anything obviously wrong — mypy is taking longer and longer — all right, this looks good, and I think this BatchIterator will do what we want.
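Here's a sketch of the batching tools just described; the filename data.py is my assumption:

```python
# joelnet/data.py  (filename assumed)
"""
We'll feed inputs into our network in batches, so here
are some tools for iterating over data in batches.
"""
from typing import Iterator, NamedTuple

import numpy as np

from joelnet.tensor import Tensor

Batch = NamedTuple("Batch", [("inputs", Tensor), ("targets", Tensor)])


class DataIterator:
    def __call__(self, inputs: Tensor, targets: Tensor) -> Iterator[Batch]:
        raise NotImplementedError


class BatchIterator(DataIterator):
    def __init__(self, batch_size: int = 32, shuffle: bool = True) -> None:
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __call__(self, inputs: Tensor, targets: Tensor) -> Iterator[Batch]:
        # shuffle the batch start indexes, but not within batches
        starts = np.arange(0, len(inputs), self.batch_size)
        if self.shuffle:
            np.random.shuffle(starts)

        for start in starts:
            end = start + self.batch_size
            batch_inputs = inputs[start:end]
            batch_targets = targets[start:end]
            yield Batch(batch_inputs, batch_targets)
```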
Let's close these files and see what's next: finally, training. Let's add a file called train.py; here's a function that can train a neural net. We're going to need a lot of imports: from joelnet.tensor import Tensor; from joelnet.nn we definitely need NeuralNet; from joelnet.loss we definitely need Loss as well as our specific MSE loss; from the optimizer module we definitely need Optimizer as well as SGD; and from joelnet.data we definitely need DataIterator and BatchIterator. That's a lot of imports, and we need them all.

Now we define a train function. What does a train function do? It takes a neural net, which is what we want to train; some inputs, which are a Tensor; some targets, also a Tensor; and a number of epochs, an int, and I'll say 5,000, that's a number I like: how many passes over our dataset do we want to make when training? You could do something more sophisticated, like stopping when some metric stops improving, but we're just going to do a fixed number of epochs. Then we need an iterator, which is a DataIterator defaulting to BatchIterator with its defaults; a loss, which is a Loss and by default will be MSE (which, as I told you, is actually total squared error); and an optimizer, which is an Optimizer and by default will be stochastic gradient descent. I think that's everything we need; if we need something else, we'll add it.

So what do we do? We iterate over the epochs: for epoch in range(num_epochs). We accumulate the loss within each epoch, so epoch_loss = 0.0, and we get batches out of our iterator: for batch in iterator(inputs, targets) — if you recall, the iterator is callable, so we give it the inputs and targets and it gives us one batch at a time. For each batch we make some predictions: predicted = net.forward(batch.inputs) — we take the inputs out of the batch, run them forward through our network, and use that as a prediction. Then we use our loss function to compute the loss and add it to the epoch loss: epoch_loss += loss.loss(predicted, batch.targets). We also compute a gradient, loss.grad(predicted, batch.targets); that gives us the derivative of the loss function with respect to every one of our predictions, and we propagate that gradient backwards through our network, which computes the gradient with respect to each of the parameters and passes it back to the previous layer to do the same thing again. Finally, we use the optimizer to take a step on the neural net, which adjusts the weights based on the gradients we just computed. At the end of each epoch we print the epoch and the epoch loss. That's enough to train a neural net, and at this point our library is done.

Well — mypy says batch_targets is not defined, because it should be batch.targets. Good, thank you mypy for catching that. It also says "incompatible default for argument iterator: default has type BatchIterator, argument has type DataIterator", which tells me we probably forgot to do the subclassing properly, and indeed we did: BatchIterator should be a subclass of DataIterator. Try it one more time: perfect. So, as I was saying, our library is actually done. There are a lot of features we could add, and we'll talk about that in a little bit, but for the most part we have everything we need in order to train neural nets.
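Sketching that training loop out (again reconstructed from the narration, with the fixes mypy demanded already applied):

```python
# joelnet/train.py
from joelnet.data import BatchIterator, DataIterator
from joelnet.loss import Loss, MSE
from joelnet.nn import NeuralNet
from joelnet.optim import Optimizer, SGD
from joelnet.tensor import Tensor


def train(net: NeuralNet,
          inputs: Tensor,
          targets: Tensor,
          num_epochs: int = 5000,
          iterator: DataIterator = BatchIterator(),
          loss: Loss = MSE(),
          optimizer: Optimizer = SGD()) -> None:
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for batch in iterator(inputs, targets):
            # forward pass, accumulate the loss,
            # then push the loss gradient backward
            predicted = net.forward(batch.inputs)
            epoch_loss += loss.loss(predicted, batch.targets)
            grad = loss.grad(predicted, batch.targets)
            net.backward(grad)
            # adjust the parameters using the gradients just computed
            optimizer.step(net)
        print(epoch, epoch_loss)
```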
So the canonical example of a function that can't be learned with a simple linear model is exclusive or, because it's just not linearly separable. What do we need in this script? from joelnet.train import train, from joelnet.nn import NeuralNet, from joelnet.layers import Linear and also Tanh, and we're going to need numpy as well: import numpy as np.

For XOR: if I have 0 and 0, that's not XOR; if I have 0 and 1, it is; if I have 1 and 0, it is; and if I have 1 and 1, it's not. So let's define some inputs, which will just be a numpy array: [0, 0], [1, 0], [0, 1], [1, 1] — there are only four possible inputs, so this isn't super exciting. Now for the targets, again a numpy array: the output is either 0 or 1, but I'll represent it as a list of length 2 — one-hot — so [1, 0] means "not XOR" and [0, 1] means "XOR", which gives us [1, 0], [0, 1], [0, 1], [1, 0]. That's our data, inputs and targets.

Now we define a net. The net is going to be a NeuralNet, and to start with, since we said this cannot be learned by a linear layer, let's show that we can't learn it with only a Linear layer: input_size is 2, because our inputs all have size 2, and output_size is 2 as well. Then we train the net on the inputs and targets, and then, real simple: for x, y in zip(inputs, targets): predicted = net.forward(x) — normally I'd do this in a batch, but it turns out that the way we've written things it all works even if I don't batch it, so I'm cheating — and we print x, predicted, y. If I go here and run python xor.py, I can see that as it runs through these 5,000 epochs the loss never gets anywhere near zero, and it's predicting 0.5 and 0.5, which is not doing a good job at all, because a linear layer can't learn this function.

Now, the beauty of having a library like this is that we can just change our neural net: after the Linear layer let's add a Tanh activation layer, and then another Linear layer — the input size here is 2, because that's the output size of the previous layer, and we want the final output size to be 2 as well. Now we actually have a neural net with a hidden layer — only two hidden units, but that should be enough for this problem — and we can go back and train it. You can see that over 5,000 epochs the loss gets very small, and for [0, 0] it predicts about 1 and then almost 0, for [1, 0] it predicts almost 0 and then about 1, and so on, which is what we wanted. If we ran it longer the loss would get even smaller and the predictions would get even closer to 0 and 1, but anyway, our neural net library has succeeded in solving this XOR problem.
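The finished XOR script might look roughly like this (a sketch of the final version, with the hidden layer already added):

```python
# xor.py
import numpy as np

from joelnet.layers import Linear, Tanh
from joelnet.nn import NeuralNet
from joelnet.train import train

inputs = np.array([
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 1]
])

# one-hot targets: [1, 0] means "not xor", [0, 1] means "xor"
targets = np.array([
    [1, 0],
    [0, 1],
    [0, 1],
    [1, 0]
])

# a single Linear layer can't learn XOR; the Tanh plus a
# second Linear layer gives the net a hidden layer that can
net = NeuralNet([
    Linear(input_size=2, output_size=2),
    Tanh(),
    Linear(input_size=2, output_size=2)
])

train(net, inputs, targets)

for x, y in zip(inputs, targets):
    predicted = net.forward(x)
    print(x, predicted, y)
```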
Now, XOR is not that exciting a problem, but it worked. So what can we do that is exciting? If you follow my blog — and who doesn't follow my blog — you'll know that I wrote a very good blog post about solving fizz buzz using TensorFlow in an interview situation. So let's solve fizz buzz using the neural net library we just created. Maybe you don't read the blog post and maybe you're not familiar with fizz buzz; fizz buzz is the following problem — sometimes people ask it as a weed-out question in software interviews, and sometimes people joke that it's the worst question you could ever ask anybody: for each of the numbers 1 to 100, if the number is divisible by 3, print "fizz"; if the number is divisible by 5, print "buzz"; if the number is divisible by 15, print "fizzbuzz"; and otherwise just print the number. If you're an accomplished programmer you can probably think of some easy ways to solve this. I'm just going to copy these imports so I don't have to type them again.

So you might ask yourself: how can I use a neural net to solve this problem? Well, one part is sort of obvious: there are four different classes we want to predict — the "fizz" class, the "buzz" class, the "fizzbuzz" class, and the "just the number" class — so from the output side we can treat it as a four-class classification problem. Let's write that part first. I'll call it fizzbuzz_encode; it takes an int and returns a List[int], so probably from typing import List. If x % 15 == 0, I put it in the fourth class; otherwise, if x % 5 == 0, the third class; otherwise, if x % 3 == 0, the second class; and finally, if none of those is true, the first class. That's the obvious way to deal with the output.

The next question is what to do with the input. We have these numbers 1 to 100, and obviously if we really want a model we don't want to train on the numbers 1 to 100 and then predict on the numbers 1 to 100, so let's train on numbers bigger than 100 and then predict on 1 to 100. But we also don't want the raw number as a single input — we don't want 2 to be "twice as much" as 1 and 10 to be "ten times as much" as 1, because that doesn't really capture the structure of the problem. So one thing we can do — it's probably not obvious that it will work, but I've tried other things and this is the one that works — is binary encoding. Since we want to predict on numbers 1 to 100, we'll train on numbers going up to 1024, which is 2 to the 10th, and use a 10-digit binary encoding of x. If your Python bit-shifting arithmetic is not good you might need to study up, because this looks like voodoo, but basically for each i from 0 up to 9, I right-shift x by i bits and bitwise-and it with 1, and that gives me the binary encoding. If you don't understand that, it may be worth studying; maybe it will come up in your programming interviews.
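The two encodings just described might look like this; the exact one-hot lists are my rendering of "put it in the fourth class" and so on, chosen to match the four-output network below:

```python
from typing import List


def fizzbuzz_encode(x: int) -> List[int]:
    # four classes: "just the number", "fizz", "buzz", "fizzbuzz"
    if x % 15 == 0:
        return [0, 0, 0, 1]
    elif x % 5 == 0:
        return [0, 0, 1, 0]
    elif x % 3 == 0:
        return [0, 1, 0, 0]
    else:
        return [1, 0, 0, 0]


def binary_encode(x: int) -> List[int]:
    # 10-digit binary encoding of x, least significant bit first
    return [x >> i & 1 for i in range(10)]
```

For example, binary_encode(5) gives [1, 0, 1, 0, 0, 0, 0, 0, 0, 0], since 5 is 101 in binary.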
So my inputs will be a numpy array — and like I said, we do this for x in range(101, 1024), training on numbers bigger than 100 so we can predict on the numbers up to 100 — the inputs are just binary_encode(x) for x in range(101, 1024), and the targets, again a numpy array, are fizzbuzz_encode(x) for x in range(101, 1024). That's what we want to learn.

Again we need to define a net: net is a NeuralNet and we have to give it its layers. We start with a Linear layer with input_size=10, because our binary encoding has size 10, and I happen to know I want an output size of 50, so 50 hidden units. How do I know that? Because I've done this many times, and that's a nice tradeoff between "mostly works" and "doesn't take forever to train". Then another Linear layer, input_size=50, because that's what came out of the previous layer, and output_size=4, because that's the number of classes we're trying to predict and that's what our targets look like. Finally I can train the net on the inputs and targets, and let's just do num_epochs=50 to make sure this works.

Then, for x in range(1, 101) — this is what we actually want to do — predicted = net.forward(binary_encode(x)); that predicted will be a vector of 4 numbers, so our predicted index is the np.argmax of those numbers. Again I'm cheating by not using batches here; it works, but you shouldn't copy me. The actual index is just the np.argmax of fizzbuzz_encode(x) — that's the right answer. Then we need some labels: labels will be [str(x), "fizz", "buzz", "fizzbuzz"], and we print x, labels[predicted_idx], and labels[actual_idx]. I think that should let us evaluate on 1 to 100. The only thing I did here that was out of the ordinary was only doing 50 epochs, because I want to see where it breaks — because I'm sure it will break.

python fizzbuzz.py, and sure enough: one, you can see that my predictions are terrible, and two, you can see that the loss per epoch starts off as a huge number, gets even huger, gets even huger, and then overflows and becomes NaN. Because I've done this before, I know that the problem is the default step size of 0.01 in my SGD optimizer: it's too big. So let's import SGD and give our train call a different parameter: optimizer=SGD(lr=0.001). I think that will fix it, and you can see it's going more slowly now, so let's give it the full 5,000 epochs and let it rip.
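Pulled together, the full script might look roughly like this, with the smaller learning rate already in place. Note that I've put a Tanh between the two Linear layers, matching the XOR example; the narration doesn't call that layer out explicitly, but without a nonlinearity the net would be purely linear:

```python
# fizzbuzz.py  (sketch)
from typing import List

import numpy as np

from joelnet.layers import Linear, Tanh
from joelnet.nn import NeuralNet
from joelnet.optim import SGD
from joelnet.train import train


def fizzbuzz_encode(x: int) -> List[int]:
    if x % 15 == 0:
        return [0, 0, 0, 1]
    elif x % 5 == 0:
        return [0, 0, 1, 0]
    elif x % 3 == 0:
        return [0, 1, 0, 0]
    else:
        return [1, 0, 0, 0]


def binary_encode(x: int) -> List[int]:
    return [x >> i & 1 for i in range(10)]


# train on 101..1023 so we can evaluate on 1..100
inputs = np.array([binary_encode(x) for x in range(101, 1024)])
targets = np.array([fizzbuzz_encode(x) for x in range(101, 1024)])

net = NeuralNet([
    Linear(input_size=10, output_size=50),
    Tanh(),                                  # assumed, see note above
    Linear(input_size=50, output_size=4)
])

# the default SGD learning rate of 0.01 made the loss blow up; 0.001 works
train(net, inputs, targets, num_epochs=5000, optimizer=SGD(lr=0.001))

for x in range(1, 101):
    predicted = net.forward(binary_encode(x))   # again, cheating: no batching
    predicted_idx = np.argmax(predicted)
    actual_idx = np.argmax(fizzbuzz_encode(x))
    labels = [str(x), "fizz", "buzz", "fizzbuzz"]
    print(x, labels[predicted_idx], labels[actual_idx])
```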
What I've discovered is that normally these 5,000 epochs don't take that long, but it turns out that when you're running screen recording software at the same time, doing video encoding and using up most of your CPU, the 5,000 epochs take a really long time. So let's take this opportunity to review what we did and talk about what else we could do to make this a more full-featured deep learning library.

The first step was tensors, and that was cheating. I don't feel too bad about cheating, but if you were building a real deep learning library you might want to write your own tensor class, make it work on GPUs, and do a bunch of stuff like that. Then we did loss functions: mean squared error is a good loss function and works in a lot of places, but if you were doing a lot of multi-class classification you might want cross entropy, and you might want to add some kind of regularization term that penalizes large weights. Actually, I don't think regularization would fit into this API entirely well — we'd have to change the API a bit — but that's okay, we could do that. In terms of layers, the Linear layer is pretty good as is. You could obviously add a few more activation layers: a sigmoid activation layer, a relu activation layer, a leaky relu — there are a million different activation layers, and there's a lot of voodoo around how to choose one versus another. I'm not good at that, so I picked tanh and stuck with it, and it seems to work most of the time. If you want to get really fancy you could start adding recurrent layers, LSTMs, or convolutional layers if you want to do vision problems; you'd probably need to get into that territory. In terms of optimizers, SGD is a good, basic, solid optimizer, but you might want to add one with momentum, or use RMSprop or one of the other optimizers of the week; most of them work better than SGD, but SGD works well enough for us and it's really simple to implement, as you can see.

And then we also had this training piece. I could have made train a method on the neural net; I could have not — it doesn't really matter. What else would I do differently here? Probably not a lot; it's pretty straightforward. As you can see, this thing is taking absolutely forever to train; it went a lot faster when I wasn't recording and screencasting at the same time.

So why did I do this? Well, I was scheduled to give a talk at a data science conference and I needed something to talk about, and I thought this would be fun and sort of an exhibition of live coding madness and skill and things like that. But I also find it interesting to take things apart and see how they work, and one true test of whether you really understand neural networks is: can you sit down and build them starting from nothing? I don't want to claim that I really understand them — probably in many senses I don't — but I understand them well enough that I can do this, and that's the sort of thing I enjoy doing.

I'm going to keep rambling here for another couple of minutes as this thing finishes training, or I can sit here and cheer for it — just imagine that I'm singing or dancing or something. It will finish real, real soon, and the sad thing is I could probably run it a lot longer and get even better results. Anyway: for 100 it predicted "buzz", actual "buzz", that's good; these are right, these are right... pretty good, pretty good. I don't specify a random seed here, so you never know what's going to happen, but occasionally it gets all of them right, and if this is one of those times I'll be very excited. Fizz... buzz... fizzbuzz... oh man, ten left — eight, seven, six, five, four, three, two — look at that, it learned every single one correctly. That means our little neural network and our little neural network library are qualified for any number of programming jobs, so that's great.

I think I have one more slide at the end. Thank you for watching; this was fun to do, and I'm super excited that it works. I will push the code up to GitHub, at github.com/joelgrus/joelnet; that's where I'll put it, and you can look at it, download it, and run it yourself rather than trying to recreate it. If you want to check out my blog, that's at joelgrus.com; you can read my post on fizz buzz and TensorFlow, which makes it seem like I'm much better at TensorFlow than I actually am. Check out my Twitter, that's @joelgrus; check out my podcast, that's adversariallearning.com; and if you're interested in the Allen Institute for Artificial Intelligence, check out our website, allenai.org — we do a lot of really, really interesting work. Depending on when you're watching this we might or might not be hiring, but check out the jobs page if you think it's someplace you might like to work; if you like to do cool stuff, it's a cool place to work. Yeah, this has been fun — thanks for watching.
Info
Channel: Joel Grus
Views: 71,236
Rating: 4.9627099 out of 5
Keywords: deep learning, python, live coding, coding, fizz buzz
Id: o64FV-ez6Gw
Length: 56min 43sec (3403 seconds)
Published: Tue Nov 28 2017