How to Create a Neural Network (and Train it to Identify Doodles)

Captions
Hello everyone! Today I'd like to try teaching my computer to recognize various doodles and images. There are all sorts of techniques we could use to tackle this problem, but the approach I'm interested in at the moment is neural networks. I first heard about these mysterious things about ten years ago, and I soon set about trying to program one to get these little stick creatures walking around on their own. My code for handling the physics must have been a little buggy, though, because this was the most successful result I ever managed to achieve, although that's not counting the times where the creature fell over and then somehow kept stretching its legs out further and further, which tricked my program into thinking it was doing extremely well. I eventually gave up on that and decided to try something a bit simpler: a little two-dimensional car with some sensors that detect the distance to the edge of the road. These readings get fed into a neural network, which then tells the car which way to steer. Of course, it's completely hopeless at first, but if we let a bunch of them compete, select the top few that drove the furthest, clone them with some random mutations to their networks, let those compete, and so on, we eventually get a car that's able to drive around happily on its own. The last thing I tried, a couple of years ago now, was training a network to identify handwritten digits, and once we've built our little neural network today, that's probably the first task we'll test it on.

After that, we can see if the same code can be trained to recognize these tiny images of various articles of clothing and fashion accessories instead, and finally I'd like to try training it to recognize doodles of ten different objects, such as helicopters, umbrellas, octopuses, and windmills. Actually, we could even try one last thing after that: attempting to identify these little colour images, again of ten different things, this time from cars to cats and birds to boats. This is obviously quite a leap up in complexity, though, so if it proves too baffling for our simple network, we'll have to return in the future to upgrade it to something like a convolutional neural network, which is supposed to be much better at these kinds of problems.

Anyway, to help figure out how we're going to build our neural network, let's imagine a simple example. We've discovered a peculiar new fruit which is purple and spiky with orange spots, and extremely delicious. Strangely, though, some of them seem to be poisonous and will give you a terrible stomach ache. So let's examine a few of these fictitious fruits. We can see that their appearance varies in two main ways: the size of their spots and the length of their spikes. We could try drawing a graph with the size of the spots on one axis and the length of the spikes on the other; then we could collect a bunch of fruit and plot them all on this graph based on those attributes, and, with the help of some volunteer fruit eaters, label which were safe and which turned out to be poisonous. Now, if the result ended up looking kind of random, like this, then the spots and spikes probably don't have much relationship with whether the fruit is poisonous or not, and we'd have to think of something else. But if the data turned out looking more like this, then we're in business: we could draw a little line here, called a decision boundary, and say that any fruit we find that falls on this side of the boundary is probably poisonous, while on the other side it's more likely to be safe.
So our very first step is to create a simple network capable of determining exactly that. We have two inputs in our problem, the size of the spots and the length of the spikes, and there are also two possible outcomes, safe or poisonous. The way we'll interpret these is: if the first output has the highest value, then we're predicting that it's safe, but if the second output has the highest value, our prediction is that it's poisonous. Now, you might be thinking two outputs is quite extravagant; why not just have one, and say a positive value means safe and a negative value means poisonous? We could definitely do that, but in future problems there'll be more than just two possible outcomes, so it's helpful to have a separate output for each of them.

All right, so these outputs obviously depend in some way on the inputs, but we don't know how much of an effect each input should have. So let's connect both inputs up to the first output, and these connections each represent a weight, essentially how important that input is to that particular output. So the actual value of this first output will be equal to the first input multiplied by the weight of its connection, plus the second input multiplied by the weight of its connection, and it'll be the same story for the second output.

I have quickly written up some code to perform this calculation, which you can see in the classify function over here. Then, to visualize what's going on, there's this visualize function, which gets run for every pixel in the graph display; it asks the network to predict whether a fruit at that point would be considered safe or poisonous, and then colours the graph accordingly. So let's see what this does. At the moment it's very paranoid and seems to think everything is poisonous, but up in the corner here we have our network weights, so I'll play around with these a bit to see if I can get that decision boundary how we want it. Unfortunately, no matter what I do, I can only get the boundary to rotate around the origin of the graph, when what we need is to be able to shift it vertically as well. So let's go back to the code and make a tiny upgrade to our network by adding in two new values, called bias1 and bias2, and these will simply be added onto our weighted inputs over here, allowing us to move those values up or down.

Okay, let's give it another shot. I'll fiddle with the weights again to try to get the angle of the boundary correct, and then I'll move on to the biases. It's a bit finicky to control, but with some patient tweaking, we should eventually be able to hand-train our little network to correctly classify the fruit in our training sample here.
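Sebastian's actual implementation is written in C# in Unity, and we never see it in full here, but a rough Python sketch of the calculation just described, with the bias upgrade included and all names guessed for illustration, might look something like this:

```python
# A minimal sketch (assumed names, not the video's actual C# code) of the
# hand-tuned two-input, two-output network, biases included.

def classify(spot_size, spike_length, weights, biases):
    # weights[i][j] is the weight of the connection from input i to output j.
    output_safe = (spot_size * weights[0][0]
                   + spike_length * weights[1][0]
                   + biases[0])
    output_poisonous = (spot_size * weights[0][1]
                        + spike_length * weights[1][1]
                        + biases[1])
    # Whichever output is larger wins: 0 means safe, 1 means poisonous.
    return 0 if output_safe > output_poisonous else 1
```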
This was a pretty straightforward task, though, so let's imagine that when we collected this fruit data, it actually ended up looking something like this instead. This is trickier, of course, because we can no longer separate safe from poisonous with a straight line, so we'll need to make some more upgrades to our network.

One way we could try to improve the network is simply by making it bigger. It doesn't make sense to change the number of inputs or outputs, because those are determined by the problem we're trying to solve, but we can create a new layer that sits in between. These in-between layers are known as hidden layers, for whatever reason, and they can have as many nodes as we like; we could even have multiple hidden layers, but let's keep things simple for now. So, as our input values get fed forward to the next layer, they'll be multiplied by their weights, as we've seen before, and added up with a bias value, together creating a weighted input for that node. Once that's been computed for all the nodes in the middle layer, those values can then be fed forward in the same way to form the weighted inputs of the next layer.

Even for this extremely tiny network, with only 12 weights and 5 biases, it would be quite tedious to write out all the calculations by hand like I did over here, so I'm going to throw all of this code out and start working on a more sensible solution. I've begun by making a layer script, which stores the weight values for all the incoming connections, along with bias values for each node in the layer, and these get set up over here based on the number of incoming and outgoing nodes. Just to clarify: in the code, I'm thinking of this layer, for example, as having three incoming nodes and two outgoing nodes; then this layer would have two incoming nodes and three outgoing nodes; and finally, this layer really just represents the values given to our network, so it doesn't do anything and won't be an actual layer in the code.

All right, back to the layer script. All that's left to look at in here is this calculate outputs function, which takes in some input values and computes the weighted inputs, simply by looping over each outgoing node, setting the corresponding weighted input to that node's bias value, and then looping over all the incoming values, multiplying each of those by the weight of its connection and adding the result onto the current weighted input. For now, at least, these weighted inputs are then returned as the output of the layer. Next, we have the actual neural network class, which contains an array of these layers. When a network is created, it needs to be told how many nodes there should be in each layer, and it uses that information to set them all up, like so. Over here is the function for calculating the output of the entire network, and all this does is loop through all the layers, calculating the output of each layer and using that as the input to the next, so once these inputs have been fed through the entire network, they've become the outputs. Finally, the classify function works essentially the same as before: it just calculates the output values and returns the index of whichever value is largest.
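Again, the real code is C#, but the layer and network classes just described could be sketched in Python roughly like this (names and structure are my guesses from the description):

```python
class Layer:
    def __init__(self, num_nodes_in, num_nodes_out):
        self.num_nodes_in = num_nodes_in
        self.num_nodes_out = num_nodes_out
        # One weight per incoming connection, one bias per outgoing node.
        # (They start at zero here; in the video they're tweaked by hand
        # with sliders, and random initialization comes later.)
        self.weights = [[0.0] * num_nodes_out for _ in range(num_nodes_in)]
        self.biases = [0.0] * num_nodes_out

    def calculate_outputs(self, inputs):
        weighted_inputs = []
        for node_out in range(self.num_nodes_out):
            weighted_input = self.biases[node_out]
            for node_in in range(self.num_nodes_in):
                weighted_input += inputs[node_in] * self.weights[node_in][node_out]
            weighted_inputs.append(weighted_input)
        # For now, the weighted inputs are returned directly as the output.
        return weighted_inputs


class NeuralNetwork:
    def __init__(self, layer_sizes):
        # e.g. [2, 3, 2] -> 2 inputs, one 3-node hidden layer, 2 outputs.
        self.layers = [Layer(layer_sizes[i], layer_sizes[i + 1])
                       for i in range(len(layer_sizes) - 1)]

    def calculate_outputs(self, inputs):
        # Feed the values forward through every layer in turn.
        for layer in self.layers:
            inputs = layer.calculate_outputs(inputs)
        return inputs

    def classify(self, inputs):
        # The prediction is the index of the largest output value.
        outputs = self.calculate_outputs(inputs)
        return outputs.index(max(outputs))
```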
So I'll create a network with three layers now, which will give us a lot more parameters to play with, and let's see what new and exciting things we can do. Well, it turns out adding an extra layer doesn't help at all, at least not on its own. To make this boundary bend, we're going to need to allow the layers to have a non-linear effect on the output.

So let's go back to our design and zoom in on a single node. In a loose analogy to biological neural networks, we can think of this node as a neuron, and the weighted input over here as some sort of stimulus. If the stimulus is sufficiently stimulating, that should cause the neuron to fire, which in our model could mean outputting a value of one, whereas if the stimulus is small, the neuron wouldn't fire, so it would just output zero. I'll refer to this output as the activation value, so we just need to write a little function that takes in the weighted input and computes that activation value, something like this. Let's quickly visualize that with a graph, where the x-axis shows the weighted input and the y-axis shows the corresponding activation value. Using an activation function like this should allow us to have more complex decision boundaries. Before we can try it out, though, we'll need to quickly jump into the layer script, and instead of outputting the weighted inputs over here, we'll pass them through the activation function and output those activation values instead.

So, returning to the joyful process of randomly tweaking sliders until something happens, we can at last see that we're able to bend the decision boundary into more interesting shapes, and of course increasing the size of the network now would allow us to make increasingly fancy shapes, but this tiny network is already sufficient to correctly classify the made-up data we have at the moment. By the way, you might have noticed that the biases no longer simply shift the whole boundary up or down like they did before; instead, they're just a way of easily shifting the value that goes into the activation function. Thinking about the biological analogy, a bias would be like a threshold that the stimulus needs to exceed in order for a particular neuron to fire.

Now, one thing I don't like about our current setup is how abruptly and dramatically the output can change in response to just a tiny tweak to some of these sliders. So let's go back to our choice of activation function and maybe replace it with something like this instead, which simply smooths things out. This is called a sigmoid function, by the way, and it's just one of many different functions that people have experimented with for neural networks; here are a few others, just as a matter of interest. Anyway, let's go with this one for now and see how that affects things. I'll mess around with these sliders once again, and we're now able to make nice smooth boundaries, and, more importantly, making small changes to any of the sliders will no longer result in a drastic change to the output.

Anyway, as fun as it is trying endlessly to train this network by hand, the goal of course is for the computer to do it all by itself, and for that to work, we're going to need a way to measure how well it's doing. One approach would be to simply count the number of known data points that are being classified correctly, but the trouble with that is that often, making a small adjustment to one of the sliders won't actually change that number, meaning it's impossible for the computer to know if the adjustment was beneficial or not. So we should try to find a more precise way of measuring progress. Let's think about the outputs of our network again, which are now being squished to somewhere between 0 and 1 by our sigmoid activation function. If we give the network the inputs of a fruit that's safe to eat, we'd hope to see a one at the first output and a zero at the second output, representing total confidence that it's safe, and for a poisonous fruit it should be the other way around. Since we know what the outputs should be for our training data, I've added a little function to the layer script called the node cost, which takes in the output activation value of a single node, along with the value we want it to be. All it does is calculate the difference between the two and square the result, to make it positive, and, I guess, to emphasize large differences as being much more urgent to correct than small ones. Then, in the neural network script, I've added a function simply called cost, which takes in a single data point, such as one of our fruit, and runs its inputs through the network to get the output values; using those, it then adds up the individual costs of all the nodes in the output layer, to get an overall cost telling us how badly the network is doing for the given data point.
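Continuing the earlier Python sketch, the two activation functions and the cost pieces just described might look like this (again, names are assumptions, not the video's code):

```python
import math

def step_activation(weighted_input):
    # The first "does the neuron fire or not" idea: output 1 or 0.
    return 1.0 if weighted_input > 0 else 0.0

def sigmoid_activation(weighted_input):
    # The smoother replacement: squashes any input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-weighted_input))

def node_cost(output_activation, expected_output):
    # Squared error: always positive, and it punishes big mistakes
    # much more heavily than small ones.
    error = output_activation - expected_output
    return error * error

def cost(network, inputs, expected_outputs):
    # Overall cost for a single data point: run the inputs through the
    # network, then sum the node costs across all the output nodes.
    outputs = network.calculate_outputs(inputs)
    return sum(node_cost(out, expected)
               for out, expected in zip(outputs, expected_outputs))
```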
We're most interested, though, in how the network is doing across all our data points, so here's another version of the cost function which takes in multiple data points this time, adds up each of their costs, and returns the average. You can see that value at the bottom of the screen, and the goal of the network is now simply to find values for its 17 weights and biases which result in the smallest average cost.

For the tiny network we have here, we could probably even get away with a brute-force trial-and-error approach, but our networks are going to get a lot bigger as we tackle trickier problems, so we should definitely try to find a better solution. Let's simplify things by thinking first about this little example function, where, similar to the task of finding the weights and biases that minimize the cost function, we want the computer to find the input value that results in the smallest output, or at least some reasonable alternative. And, of course, we'd like it to do this in as few steps as possible, not just calling the function millions of times until it finds a good answer.

Our solution is going to rely on calculating the slope of the function, so let's try writing some code to visualize it. We can approximate the slope very easily by first defining some tiny value, which I'll just call h, and then calculating how the output of the function changes in response to this tiny positive nudge to the input. The steepness of the slope will just be this change to the output divided by the change to the input that caused it. As I said, this is just an approximation, because we're technically calculating the slope between two points on the graph, but the smaller we make h, the closer our approximation will be to what the true slope would be precisely at the given input value. Anyway, we can then visualize the slope with a bit of code like this, so let's see how that looks. The slope value is pretty intuitive: we can see, for example, that it's negative over here, because the function is decreasing as the input increases, but it's getting closer to zero as things level out, and once the function starts increasing faster and faster, we can see the slope value increase as well to reflect that.

Now, for a function like this one, we could actually do some maths to directly calculate the points where the slope is zero, which would be super efficient and wonderful, but unfortunately that maths just isn't viable when it comes to our actual neural network. Instead, we're going to be relying on a technique called gradient descent, and the idea here is that we'll pick a random starting value and then just slide down the slope into the valley. Here's some code for doing exactly that: it initializes the input to a random value, and then in this learn function it approximates the slope like we did a moment ago, and simply subtracts the slope value from the input value. This learn rate parameter here just allows us some control over how much the input changes with each iteration. Let's try it out. I'll give us a random starting value, and then I'll just press this learn button a bunch of times to repeatedly run the gradient descent algorithm, and we can see it's very slowly making its way down the slope. Let me restart this and try again with a much higher learn rate. But now it's clearly trying to learn too quickly, because it's just bouncing around all over the place and not making any kind of consistent progress, so we need to try to strike some sort of balance with the learn rate.
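A minimal Python sketch of the slope approximation and the learn step just described might look like this (the example function and learn rate are made up for illustration):

```python
def slope_approximation(f, x, h=0.0001):
    # Nudge the input by a tiny amount h and see how the output responds.
    return (f(x + h) - f(x)) / h

def learn(f, x, learn_rate):
    # One iteration of gradient descent: step downhill along the slope.
    return x - slope_approximation(f, x) * learn_rate

# Pressing the "learn" button a bunch of times:
f = lambda x: x ** 4 - 2 * x ** 2 + 0.5 * x   # an arbitrary bumpy function
x = 1.7                                       # random starting value
for _ in range(200):
    x = learn(f, x, learn_rate=0.02)
print(x)  # should have slid down into one of the function's valleys
```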
If we manage to find a good value for our specific problem, we can see it's able to learn pretty efficiently. Of course, there's no guarantee that we'll fall into the optimal solution, known as the global minimum; we could very easily fall into a local minimum instead. But this process tends to give good results in practice, and there are all sorts of things we could experiment with in the future to try to improve it.

To get a better picture of how this will apply to our actual network, let's pretend that the network has just two weights, and these will start out with random values, so let's imagine them over here, for example. Now, if we want to know how good these weights are, we could run all our data through the network to calculate the average cost, and let's represent that as a height in the third dimension. We can then imagine the average cost that would result from any configuration of weights as a kind of landscape, like this. In the previous example, we approximated the slope by looking at how a tiny change to the input variable affected the output of the function, and we can do the same thing here: we'd look at how a tiny change to the first weight affects the cost, as well as how a tiny change to the second weight affects the cost, and these two together tell us the slope, or gradient rather, of the cost function with respect to the weights. So, if we subtract the current gradient from the weights, again using the learn rate to control things a bit, that'll cause the weights to move downhill, and with each iteration of gradient descent we can imagine our weights rolling further down the slopes of the cost function before finally settling into one of the valleys. Obviously, in our actual network we'll have loads of weights and biases affecting the cost, so this is all happening in more dimensions than we can hope to imagine, but the idea remains exactly the same.

All right, I'm going to go ahead and implement all the stuff we've been talking about, and I'll see you then. So, over in the layer script, I've added these two arrays for holding the cost gradients with respect to the weights and biases. Then we have this new apply gradients function, which takes in the learn rate, loops over all the weights and biases, and just subtracts the corresponding value from the gradient from each of them. I've also added a function for giving random starting values to all of the weights, which looks like this. Now, in the neural network script, we have this new learn function, which takes in all our training data, and it starts off by using that to calculate the current cost value. Then, for each weight in the network, it makes a tiny nudge to that weight, measures how much that causes the cost to change, and resets the weight afterwards, so as not to throw off the calculations for the rest of the weights. It then calculates this value here, telling us essentially how sensitive the cost is to the current weight, and stores that in the gradient array. This here is just the same process for the biases, and at the end of all of this, the gradients are applied on all the layers. So, as long as the learn rate isn't set too high, this process should cause the overall cost of the network to decrease each time we run it.

Let's return to our potentially poisonous fruit to see if it actually works. It has quickly found this linear boundary, which does an okay job of separating safe from poisonous, and the cost is still going down, even though not much seems to be happening, so let's give it a little bit longer to contemplate matters.
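Here is a rough Python sketch of that slow, finite-difference learn function, assuming the network has gained a cost(data_points) method and the layers have gained cost_gradient_w and cost_gradient_b arrays, as described (all of these names are guesses):

```python
H = 0.0001  # size of the tiny nudge

def learn(network, training_data, learn_rate):
    # The slow approach: finite-difference the cost for every single
    # weight and bias parameter in the network.
    original_cost = network.cost(training_data)
    for layer in network.layers:
        for node_in in range(layer.num_nodes_in):
            for node_out in range(layer.num_nodes_out):
                # Nudge the weight, measure the cost change, undo the nudge.
                layer.weights[node_in][node_out] += H
                delta_cost = network.cost(training_data) - original_cost
                layer.weights[node_in][node_out] -= H
                layer.cost_gradient_w[node_in][node_out] = delta_cost / H
        for node_out in range(layer.num_nodes_out):
            layer.biases[node_out] += H
            delta_cost = network.cost(training_data) - original_cost
            layer.biases[node_out] -= H
            layer.cost_gradient_b[node_out] = delta_cost / H
    # Gradient descent step: each layer subtracts gradient * learn rate.
    for layer in network.layers:
        layer.apply_gradients(learn_rate)

# The corresponding method on the Layer sketch from earlier:
def apply_gradients(self, learn_rate):
    for node_out in range(self.num_nodes_out):
        self.biases[node_out] -= self.cost_gradient_b[node_out] * learn_rate
        for node_in in range(self.num_nodes_in):
            self.weights[node_in][node_out] -= (
                self.cost_gradient_w[node_in][node_out] * learn_rate)
```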
And at last, it's managed to perfectly classify the training data, so our neural network is learning, which is exciting. The problem we have now, though, is that it's excruciatingly slow. This is mainly because we're having to run the cost function for every single weight and bias parameter, and remember, the cost function needs to feed all of the data it's given through the entire network.

One thing we could try is simply giving it less data. When we give it all the data we have, we can see the cost goes down nice and steadily as it learns, but if we had hundreds of thousands of training samples, it would take a really long time just to complete a single learning iteration. Instead, if we just give it a tiny portion of the data each time, we can complete learning iterations much faster. This does make the learning process noisy, because we can imagine our cost landscape will look a bit different for each mini-batch, so they won't all agree exactly on which way downhill is, but it does speed things up immensely, and the noisiness can apparently even be beneficial in a number of ways, for example in helping to escape the dreaded saddle points, which are regions like this that can slow the learning process down dramatically.
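The mini-batch idea itself is simple enough to sketch in a few lines of Python (a guess at the shape of it, not the video's code):

```python
import random

def create_mini_batches(training_data, batch_size):
    # Shuffle a copy of the data, then chop it into small batches, so that
    # each learning iteration only has to look at a fraction of the data.
    data = list(training_data)
    random.shuffle(data)
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
```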
So this mini-batch technique is a big improvement, but by itself it's still not enough. If we want to scale our network up to, say, tens of thousands of weights, which is actually still very tiny, then that would mean running each data point through the network tens of thousands of times as well, and so our network is going to take pretty much forever to learn anything interesting. Happily for us, there is another way to calculate these gradients, where for each iteration we'll only have to run our data points through the network once, and all it's going to cost us is a bit of calculus.

In case you haven't studied calculus, or just need a refresher, I'd like to quickly go over the essential ideas we'll be using today. As an example, let's consider the function f(x) = x² − 3x + 4. Here's our code from earlier for approximating and drawing the slope of a function, and we just want to figure out a more efficient way to calculate the slope, where we don't need to call the function multiple times. So let's consider this line first, which I'll write out in more math-y notation over here, and let's see where it takes us if we patiently work our way through the calculation. First of all, we can write out f(x + h) in full, by just looking at our equation up here and, wherever there's an x, replacing it with (x + h), like so. Then from that we want to subtract f(x), so let's write that out in full as well. Now, to simplify this we'll first need to expand it all: (x + h)² comes out as x² + 2xh + h², then we subtract 3x and 3h, we add 4, and finally subtract the rest of this stuff. Now we can see that this x² and this −x² will gobble each other up, as will these two terms, and these two terms. So once the dust settles, we've managed to simplify things quite a lot, and we can also see that h is a common factor in all the remaining terms, so we can neaten things up a little. That's as far as we can really take it, though; referring back to our code, we then calculate the slope by dividing this change in the output by the tiny change to the input that caused it, so let's write that out over here. Right away, we can see that these two h's cancel each other out, and we're left with just 2x + h − 3.

Here's where things get a little weird, though. We know that the closer h is to zero, the more accurate our approximation of the slope will be. We also know it can't actually be zero, because then not only would the change in the output just be zero, but we'd also be trying to divide everything by zero. Nevertheless, if our answer over here gets more accurate the closer h is to zero, it makes sense, intuitively at least, that we should just remove h entirely. But we don't want to be put in math jail, of course, so let's be a bit careful about this. First of all, we should switch to proper calculus notation, which looks like this: this notation up here represents an approximation, whereas this is going to be our exact answer. We can then add this bit of mathematical legalese, which just says that as h gets closer and closer to zero, the thing we calculated, 2x + h − 3, gets closer and closer to being just 2x − 3. What we've calculated here is called the derivative of f with respect to x, and to see what it means, let's go back to the graph of our function and draw in its derivative as well. We can see that where the derivative is zero corresponds to where our function has zero slope, and where the derivative is negative, that's where the function is sloping downwards, and so on. So the derivative is this super helpful thing that tells us the exact slope of our function at any input value.

Let's quickly go into our code and create a little function which returns the derivative we just figured out, and then we can replace all the stuff here with a single call to that derivative function, like so. To test it, let's try drawing the slope again, and it works perfectly. So we've managed to make the slope calculation more accurate and more efficient, with the only downside being that we do have to actually figure out the derivative of whatever function we're using. Just as another quick example, here's that function we used earlier when we were thinking about gradient descent, and here's its derivative, and we can see again how the derivative tells us exactly what the slope will be at any input value, or, put another way, how sensitive the output of the function is to a change in the input.

Okay, so to figure out how all of this is actually going to help us, let's consider a ridiculously simplified network that has just three nodes connected by two weights, like this, and let's quickly write out how it works. The input node receives some input value; let's call that activation zero. We then calculate the weighted input, z1 for short, which is just the input multiplied by the weight, plus some bias value. Next, we calculate activation one, simply by passing the weighted input into our activation function. We then do the same thing to calculate z2, and finally a2. We can now evaluate the network by calculating the cost, so we just use the cost function and pass in the output activation value along with the expected output for the current training sample, or y for short.
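That whole forward pass of the three-node network, ending in the cost, fits in a few lines of Python (a sketch with assumed names, following the z/a notation above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tiny_network_cost(a0, w1, b1, w2, b2, y):
    # The ridiculously simplified network: one input node, one middle
    # node, and one output node, connected by two weights.
    z1 = a0 * w1 + b1     # weighted input of the middle node
    a1 = sigmoid(z1)      # its activation
    z2 = a1 * w2 + b2     # weighted input of the output node
    a2 = sigmoid(z2)      # the network's output activation
    return (a2 - y) ** 2  # squared-error cost against the expected output y
```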
Now, if we think back to this horrifyingly inefficient gradient descent code that I wrote, remember that what we're trying to speed up is this calculation here, of how sensitive the cost is to a change in any particular weight or bias parameter. So let's say we want to calculate that value for weight number two over here. Instead of approximating it, we've seen recently that we can get a more efficient answer by calculating the derivative, and I'll write this with these fancy curly d's, since we're dealing with functions with multiple variables now. Anyway, this might look a bit confusing to calculate, because we're trying to figure out how w2 affects the cost, but it isn't one of the cost function's inputs. So, to unravel this mystery, there's one final calculus concept for us to contemplate today: the chain rule. The chain rule tells us to simply look at how w2 affects z2, which we can write out like this, then look at how z2 affects a2, which I'll write out over here as well, and finally look at how a2 affects the cost, which once again I'll write out over here. All we need to do now, according to the rule, is multiply these partial derivatives together, and that will give us the result we're looking for. We can even kind of see that this is true, because if we think about how fractions cancel out when you multiply them, this gives us the correct result.

Our task now is figuring out what each of these partial derivatives actually is, and let's start with this one on the end. Here's our code again for calculating the cost of a single node, and we need to figure out the partial derivative of this with respect to the output activation value. There are shortcuts for doing this sort of thing, thankfully, but I'd just like to quickly show how the approach we used earlier works exactly the same way for a multi-variable function like this one. Over here, we're looking at how the cost changes in response to a tiny change to this output activation value, and we're then dividing by the size of that change, like before. Let's write this out in full and expand everything, which is pretty tedious, but we then get to cancel a bunch of stuff out, which is always satisfying. I'll tidy that up, and we can see that once again we end up being able to cancel out the division by h, which is crucial to the final step of being able to say that, as h approaches zero, this is going to approach simply two times the output activation minus the expected output. So that's our answer, and in code it would just look like this.

Okay, so we've figured that one out. Next, let's look at how the activation changes in response to the weighted input. Here's that sigmoid activation function we're currently using, and calculating its derivative is quite a bit more involved than the others, so I'm just going to skip right to the answer, which turns out to be simply the activation value multiplied by one minus the activation value. Let's quickly graph it, at least, just for interest's sake, and here's what it looks like.

All right, for this last one, we want to know how the weighted input changes in response to the weight. This one's very easy, because if we just look at our equation here, we can see that the amount of effect a change to the weight will have on the weighted input depends entirely on the input, which is this a value here. For example, if a is zero, then changing the weight would have zero effect, whereas if a is ten, then changing the weight would have a ten-to-one effect, and so on. So that's our answer: we now know how the cost is affected by the second weight.

We just need to figure out the same thing for the first weight. Here's the partial derivative we want to calculate, and if we look at how the first weight ends up affecting the cost, we can again use the chain rule to figure it out. Taking a closer look, we can see that these two partial derivatives on the end are exactly the same as these two, so we know those already. Then, over here, this is the same thing as this, just taking place in a different layer, so we already know how to calculate it, and it's the same story with this one, being the same thing as this, again just in a different layer.
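In Python, the three partial derivatives worked out above, and the chain-rule products they combine into, could be sketched like this (building on the sigmoid function from the earlier sketch):

```python
def node_cost_derivative(activation, expected_output):
    # d(cost)/d(activation) for the squared error (a - y)^2 is 2(a - y).
    return 2 * (activation - expected_output)

def sigmoid_derivative(z):
    # If a = sigmoid(z), then da/dz = a * (1 - a).
    a = sigmoid(z)
    return a * (1 - a)

# d(z)/d(w) is just the incoming activation, so by the chain rule:
#   dC/dw2 = a1 * sigmoid_derivative(z2) * node_cost_derivative(a2, y)
# and, reusing those same pieces, with dz2/da1 = w2 and dz1/dw1 = a0:
#   dC/dw1 = a0 * sigmoid_derivative(z1) * w2 \
#            * sigmoid_derivative(z2) * node_cost_derivative(a2, y)
```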
So all we need to worry about, then, is how the weighted input changes in response to the input. Again, if we just look at this equation here, we can see that the amount of effect a change to the input will have on the weighted input is determined entirely by the value of the weight. So there's our answer, and we now know how the cost is affected by both of the weights in our tiny network.

Okay, I'm going to tidy these notes up a bit, because it's time for our little network to grow up. This does mean the write-ups here don't make total sense anymore, but they do translate fairly intuitively, I think, so they're going to guide us as we work through this. Let's begin with this calculation over here, where we're taking the partial derivative of the cost with respect to the activation of our output node and multiplying it by the derivative of the activation with respect to the weighted input. Of course, we now have two output nodes, and that just means we'll do the same calculation again, but using the activation of this other output node and its weighted input instead. So we end up calculating two separate values, which I'm just going to call our node values, because I'm bad at naming things. Over in the neural network script, I've started working on this little function, which takes in a single data point and runs it through the network, and I've modified the layer code slightly so that, when that happens, each layer stores all the information we're going to need, like the weighted inputs and so on. After that, it asks the output layer to calculate the node values, which it does just like we talked about.

All right, so we're now ready to figure out how each weight in the final layer is affecting the cost. That means we'll actually be calculating six different values in this example, one for the weight of each of our connections here. To do this, we need to multiply our node values by this partial derivative here, which we figured out earlier is just the activation from the previous layer. How that will work for this connection, for example, is that we'll take the activation from the previous layer that's being fed in along the connection and multiply it by the node value it connects to on the other end, and we'll be doing that for all of them. Just as another example, for this connection we'd be taking the activation coming out of this node and multiplying it by this node value on the other end. So, over in the layer script, I've made this update gradients function, which takes in the node values, and then for each connection it calculates the partial derivative of the cost with respect to the weight of that connection, and uses that to update the gradient.

While we're here, we should actually also be updating the bias gradient, so to figure that out, let's quickly tweak our maths to be with respect to the bias instead. We can see that we just need to multiply the node values by the partial derivative of the weighted input with respect to the bias, so let's once again have a look at this equation. Here we can see that there's nothing affecting the bias value, which means that however much the bias changes, the weighted input will change by the same amount, so that partial derivative is just one. I'll go ahead and add that to the code quickly, and then, back in the neural network script, we'll of course tell the output layer to update its weight and bias gradients using our new function.
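A Python sketch of those two layer methods might look like this, assuming calculate_outputs was modified to store self.inputs, self.weighted_inputs, and self.activations during the forward pass, as just described (all names guessed):

```python
class Layer:
    # ...continuing the earlier Layer sketch.

    def calculate_output_layer_node_values(self, expected_outputs):
        # One "node value" per output node:
        # d(cost)/d(activation) * d(activation)/d(weighted input)
        node_values = []
        for i in range(self.num_nodes_out):
            cost_derivative = node_cost_derivative(self.activations[i],
                                                   expected_outputs[i])
            activation_derivative = sigmoid_derivative(self.weighted_inputs[i])
            node_values.append(cost_derivative * activation_derivative)
        return node_values

    def update_gradients(self, node_values):
        for node_out in range(self.num_nodes_out):
            for node_in in range(self.num_nodes_in):
                # d(weighted input)/d(weight) is the incoming activation:
                derivative_cost_wrt_weight = (self.inputs[node_in]
                                              * node_values[node_out])
                self.cost_gradient_w[node_in][node_out] += derivative_cost_wrt_weight
            # d(weighted input)/d(bias) is just 1:
            self.cost_gradient_b[node_out] += node_values[node_out]
```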
Okay, so we can now move on to our second equation, and focus on calculating this new set of node values for our hidden layer. As we noticed before, this uses the old node values in its calculation, so we first just need to take those and multiply them by this partial derivative here, which we figured out earlier is just the weights between the two layers. How this works is that the first of the new node values here will be equal to the weight of this connection multiplied by the old node value it connects to, plus the weight of this connection multiplied by the old node value that it connects to. Then, as you can probably guess, the second new node value will be equal to this weight multiplied by this old node value, plus this weight multiplied by this old node value, and it's the same thing for our third new node value. Our calculation for these new node values isn't quite done yet, though; we still need to do this bit, and so each of those new values will just be multiplied by the derivative of the activation function with respect to its weighted input.

So here's our final addition to the layer script: this function takes in the old node values and uses them to calculate the new set of node values, like we just talked about. Then let's head back to the neural network script to finish things up over there as well; we can create our new node values here and then just use those to update the gradients of the hidden layer. And at long last, we've now completely implemented our two equations. Of course, this implementation only works if we have just a single hidden layer, but we can very easily modify it with a little loop so it can handle any number of layers. By the way, this approach of starting with the output layer and going backwards through the network, so that we can keep reusing these node value calculations, is known as the backpropagation algorithm, and as we can see, it's only running each data point through the network once (or I guess we could call it twice, since it does have to go backwards through the network as well), but that's a huge improvement over having to do it for every weight and bias parameter like before.

Anyway, now that we're able to update the gradients for a single data point, I've made a new learn function to replace the old slow one. This takes in all the data points in the current training batch and adds up the gradients for each of them. After that, it just performs our gradient descent step by telling all the layers to apply their gradients, and remember, they multiply the gradient by the learn rate when doing that, so if we just divide the learn rate by the size of the training batch, that'll average out all the gradients we added together. With that, our new learning code is complete, so let's go test it on some fruit. And it looks like it's working! Of course, there are many things we could still do to improve it; for example, one low-hanging fruit would be making it process all the data points in parallel, which I might actually quickly do behind the scenes.
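Here is a Python sketch of the hidden-layer node values and the new learn function described above; update_all_gradients and clear_gradients are assumed helper names standing in for the forward-then-backward pass and the gradient reset:

```python
class Layer:
    # ...one more method for the earlier Layer sketch.

    def calculate_hidden_layer_node_values(self, next_layer, next_node_values):
        # New node value = (sum over the next layer of connection weight
        # times old node value) * activation derivative at this node.
        node_values = []
        for node in range(self.num_nodes_out):
            value = 0.0
            for next_node in range(next_layer.num_nodes_out):
                value += (next_layer.weights[node][next_node]
                          * next_node_values[next_node])
            value *= sigmoid_derivative(self.weighted_inputs[node])
            node_values.append(value)
        return node_values


def learn(network, training_batch, learn_rate):
    # Backpropagate each data point once, accumulating gradients...
    for data_point in training_batch:
        network.update_all_gradients(data_point)  # forward + backward pass
    # ...then apply them, dividing the learn rate by the batch size, so the
    # summed gradients are effectively averaged.
    for layer in network.layers:
        layer.apply_gradients(learn_rate / len(training_batch))
        layer.clear_gradients()  # reset for the next batch
```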
Okay, let's move on from the fruit at last, and challenge our network with this classic data set of handwritten digits. We have 70,000 images on our hands here, all labelled with the correct answer, and each of these images is a minuscule 28 by 28 pixels, so that's 784 values in total, each ranging from zero, meaning black, to one, meaning white. Now, I've set up a simple interface for creating our neural network here, so I'll make one with 784 inputs, to take in all those pixel values, and 10 outputs, for telling us which of the 10 digits it thinks it's looking at.

Let's take a moment to wrap our heads around this. Our fruit data had just two input values, meaning we could draw each data point in two-dimensional space. For the digits, though, we now have a 784-dimensional input space, which is a bit trickier to imagine, so let's just pretend it's three dimensions instead, with each axis representing the brightness of a single pixel. Then perhaps images of a zero would tend to have values for those three pixels somewhere in this region, whereas images of a one might be more in this region over here, and so on. I'm completely making this up, of course, but hopefully the idea makes sense: we can think of each of the images as a point somewhere in 784-dimensional space, and our network just needs to figure out which regions of that space each digit tends to hang out in, just like it figured out which regions of two-dimensional space the safe and poisonous fruits tend to inhabit.

Okay, let's get back to setting up our network. I'll try giving this a learn rate of one, and a mini-batch size of maybe 100, which means it'll take 700 batches to go through all the data we have, known as one epoch. We don't actually want to let the network train on all the data, though; instead, we'll set some of it aside, so that we can test how the network performs on data it hasn't seen during its training, since that's what we really care about.

After training for a few seconds, it seems to have plateaued at around 90% accuracy, so let's try expanding the network with a hidden layer. I'll give it 100 nodes, maybe, and we'll see if it can put all those extra connections to good use. It's taking a lot longer now to crunch the numbers, so I'll fast-forward through the training, but in the end we've managed to get around 95% accuracy, which is not great, but not terrible either. To try to improve it a bit, I've been doing some research and implementing a few small things, like some different cost and activation functions, and also adding momentum to the gradient descent algorithm, which allows the weights and biases to essentially accelerate as they slide down those high-dimensional slopes of the cost function. Hopefully all of this is going to help a little, so I'll set it up quickly and let's see how it goes. Well, that wasn't a resounding success, but before I go hunting for bugs, let's just try turning the learn rate down, because that could also be the issue. Okay, that's looking much better: we're now getting around 97% of the test images correct, which I'm pretty happy with for now.

Let's take a closer look at some of these images. Here we have a five, and up here, by the way, we can see what the network thinks it's looking at. I'll flip through a few of these, and we can see it's getting them all correct so far. I'm more interested in seeing which ones it's getting wrong, though, so I might need to add a way to find those. Okay, I've added a little button here, so let's try pressing it. Here's the first mistake: it thinks this is an eight, but it's clearly a zero, though I can't quite see where that's coming from. Next, we have a four, apparently, and I think we can all forgive it for thinking that's a nine. After that, we have a three which it thinks is a seven, and I can half sympathize, it's maybe a bit unclear. All right, let's look at one more: this time we have a seven which it thinks is a two, for some reason. So some of its mistakes are kind of understandable, but it's also clearly making some very blatant errors as well. We might be able to get a better idea of its shortcomings if we're able to draw our own digits.
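Pulling the pieces of the sketch together, the digit experiment's setup might look roughly like this in Python; the variable names, the train/test split point, and the epoch count are all invented for illustration:

```python
# Hypothetical setup mirroring the digit experiment described above.
network = NeuralNetwork([784, 100, 10])  # 28x28 pixel inputs, 10 digit outputs
learn_rate = 1.0   # later turned down, once momentum was added
batch_size = 100   # 70,000 images / 100 = 700 batches over everything

# Hold some data back, so we can test on images the network never trained on.
# (all_images is an assumed list of labelled data points.)
train_data, test_data = all_images[:60000], all_images[60000:]

for epoch in range(10):  # arbitrary number of epochs for this sketch
    for batch in create_mini_batches(train_data, batch_size):
        learn(network, batch, learn_rate)
```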
So I've been working on a super simple little drawing program over here, which I'll hook up to the neural network, and let's try it out. Starting off, it thinks this blank canvas is a five, with a confidence of 53 percent, and we can see the rankings of all the other digits below that. Let me start by drawing a one, and it's managed to get that correct. I'll try changing it a little bit, and all of a sudden it's saying that it's actually a three, which is a bit concerning. Well, let me try drawing maybe a six. No, it thinks that's a five. Okay, how about an eight? That looks like a three, apparently. So that's a bit strange: it's very good at recognizing images from the data set, but when I draw my own digits, it's like it's never seen a number before. I think what's going on is that all the digits in the data set have been sized and centred in a specific way, and I guess the network has come to rely on that being the case. We could obviously try processing our own images in the same way, but I'm actually more interested in trying sort of the opposite approach. I have written some code that can rotate... well, that was more sensitive than I was expecting... it can rotate the digits in the data set, as well as scale them, move them around, and add some noise. These settings can be applied randomly, so I'll tell it to do that for all the images, and then we can train a new network on this modified data. Let's maybe increase the size of the network first, though, since that's presumably made things a bit more complicated.

It's definitely struggling a lot more now: we can see it's just scraping past 90% on the test accuracy. It's over 99% accuracy on the training data, though, which means it's learning lots of overly specific patterns, essentially memorizing the answers, which is not very helpful. So I've been experimenting with various techniques for reducing this, and what I'm trying at the moment is simply adding some random noise to all the network's inputs, with the hope that, if it can't rely on specific patterns remaining exactly the same, it will end up learning more general patterns instead. From the results, we can see that it's somewhat worked: the test accuracy is up about one and a half percent, which might not be wildly exciting, but it's nothing to sneeze at either.

So let's go see if this new network can decipher my handwritten digits a bit better. Zero, one, and two: all correct so far, so it's already looking a lot more encouraging than last time. Five is correct as well, and six, and seven, and eight (this is exciting commentary, I know), and that's all 10 digits correct, on this first run-through at least. I'll give it another try, to make sure we didn't just get completely lucky the first time, and it has managed to get all of them correct again. I'll keep testing it for a while, though, and let you know how it goes.

Okay, so it definitely does have some big weaknesses. For example, if I draw a one as just a straight line, it's perfectly happy, but if I add a little stroke at the top here, it can quite quickly start thinking it's a seven instead. We can steer the network back on course by adding a little line to the base here; now it thinks it's a one again, but if that line extends just a tiny bit too far, the network will suddenly think it's looking at a two instead. It seems to be a bit more reliable with other digits, like eights for example, but even there it doesn't take too many attempts to run into a case that bamboozles the network. Despite its flaws, though, I'm happy for now that it at least seems to work most of the time.
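As a rough Python sketch of the two ideas just described, the random transformations and the input-noise trick might look like this; rotate, scale, and translate are hypothetical stand-ins for real image-warping routines, and every parameter range here is invented:

```python
import random

def augment(pixels):
    # Randomly transform a training image, roughly as described above.
    # rotate/scale/translate are hypothetical image-warping helpers,
    # and the parameter ranges are made up for illustration.
    pixels = rotate(pixels, angle_degrees=random.uniform(-20, 20))
    pixels = scale(pixels, factor=random.uniform(0.8, 1.2))
    pixels = translate(pixels, dx=random.randint(-3, 3), dy=random.randint(-3, 3))
    # Finally add some noise, clamped back into the 0..1 brightness range.
    return [min(1.0, max(0.0, p + random.gauss(0, 0.05))) for p in pixels]

def add_input_noise(inputs, strength=0.05):
    # The overfitting counter-measure: jitter the network's inputs each
    # time, so it can't rely on exact pixel patterns staying the same.
    return [p + random.gauss(0, strength) for p in inputs]
```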
Just for fun, let's quickly try training our network on this fashion data set, which was created as a more challenging drop-in replacement for the digits. After a few minutes of training, it's reached around 89% test accuracy, which is not amazing, but it is apparently better than human performance, at 83%, which I thought was interesting. Anyway, let's take a brief look at some of these. Here's a t-shirt, which it's getting correct; then here's a shirt and a pullover, also both correct. Let's see how long it can keep the streak going: we then have a sneaker, a dress, and another dress, then a coat, and finally it's been caught out by what it thought was another coat, but is actually a pullover. Okay, I thought that was interesting to try quickly, but let's move on to our goal from the beginning, of recognizing various doodles.

I've downloaded a doodle data set, drawn by people from all around the world, which comes from the very cool Quick, Draw! project. I see... a moustache? Or a peanut? Or a swimming pool, or a cannon? And I see... a pillow? Or a horse, or a tiger... oh, I know, it's a zebra! To keep things simple for us today, I have picked just 10 categories from the 345 that exist in total, so our full lineup is: helicopters, house plants, cats, cruise ships, windmills, popsicles, tractors, umbrellas, bicycles, and octopuses. I've increased the size of the network again, and also added some extra hidden layers, not for any deep reason, I've just been messing around with different settings out of curiosity. So I'll leave this to train for a few minutes, and by the way, I did make the same random transformations to the doodles as we did with the digits, since that seemed to be very helpful.

All right, we've ended up with about 87% accuracy here, but for the true test, let's see if it can recognize some doodles of our own. I'll begin with a popsicle. 100% popsicle, it says. Very nice. I wonder if I could give this some squiggly legs and turn it into an octopus... okay, it's very confident about this being an octopus now, with just a sliver of doubt that it may in fact be a tractor. Let's try something else. At the moment, it thinks I'm drawing either another popsicle or an umbrella, although now it's just realized that I'm actually trying to draw a windmill. I see that house plant is its second prediction, so let's try planting the windmill in a little pot, and it is saying that it's a house plant now, although it's definitely a bit suspicious of it. Anyway, I'm just going to play around with this for a little while and see how it goes.

Okay, so it's been working pretty well, I think, although, like with the digits earlier, it does still make some very obvious mistakes. I've printed out its accuracy on the individual categories here, and it looks like popsicles, umbrellas, and bicycles are its specialty, whereas tractors are what baffles it the most. For me, though, it seemed to struggle the most with helicopters. Here you can see it's in a lot of doubt over whether this is a tractor, a helicopter, or a cruise ship, and in cases like this it seems to be very sensitive to small changes: if I add a random little line here, now it's an umbrella out of the blue; now it's back to being a tractor; and now it's a cruise ship. But if we can really convince the network that this is a helicopter, and it seems like just emphasizing the rotor here usually does the trick, then it's not nearly as indecisive. So clearly it's a very long way from perfect, but overall it does actually work reasonably well, I think, so I'm happy with it for now. To end things off today, let's try giving our network by far its greatest challenge yet.
...which is not supposed to look like this. What's going on? Okay, I forgot that these images are a whopping 32 pixels wide now; let's try that again. So these images are a bit bigger than before, and in colour, which means we have 3,072 inputs instead of just 784. Let's take a closer look at some of them. Right away, we can see that this is going to be trickier for the network, because the object or creature it needs to recognize hasn't been nicely cut out for it or anything; there's the whole environment around it to confuse matters. We can also see how much variety there is, even just among the pictures of birds, for example: we've got little birds on branches, big birds jogging around, birds swimming, birds flying, birds staring straight into your soul... you name it.

So, with all this complexity in mind, let's see how a simple network fares in its training. Not terribly well is, unfortunately, the answer: it ended up with an accuracy of about 53%, but let's take a look anyway. To begin with, we have a deer, which it thinks is an airplane, and a dog, which it thinks is a horse, so we're not off to a good start. It has recognized this bird, though, as well as this little dog in a box, but then this airplane it's unfortunately misidentified as a ship. Let's see what's next: this time, ship was correct, and then we have an automobile and a truck, both correct, but that's certainly not a horse. So, as expected, it's getting around half of them correct, which is of course a lot better than random, but still a pretty underwhelming result. I'm sure we could coax a bit more accuracy out of our simple network with some tweaks here and there, but if we wanted to really take on this challenge, it's going to need some significant upgrades, so sometime in the future I hope to return to this project to do exactly that. That's all for now, though, so I hope you've enjoyed this ridiculously long video, and until next time, cheers!
Info
Channel: Sebastian Lague
Views: 1,855,847
Id: hfMk-kjRv4c
Length: 54min 50sec (3290 seconds)
Published: Fri Aug 12 2022