Tensorflow and deep learning, without a PhD, Martin Gorner, Google

Captions
What I wanted to do with you today, on this one example, is run through a full working session of what someone who trains neural networks for a living actually does: how to start analyzing a problem, and then all the optimizations he has to do until he can put this in production with a 99-point-something percent accuracy that is actually useful in the real world. The message I would like you to carry home is that it is not actually that complicated; we will see if I can manage that. What we are going to do here, with about a hundred lines of Python, is actually solve this problem with very good accuracy, all the way from the beginning to the end. I am not going to reference some Google paper and tell you the solution is in there; no, we are going to see everything from A to Z and get a 99-point-something accuracy out of it.

All right, so let's run. And yes, by the way, we just had a great introduction to Theano; I am going to use TensorFlow, which has very similar concepts. Since you will be asking what the main difference is: I think the main difference between the two is that TensorFlow introduces an additional complexity, which is that it is meant for distributed computing. That means it has a two-phase execution model: you first define your computation graph in memory, and then you run it. That is how you can run it distributed, and how you can get a lot of goodness out of it, but it does introduce a little bit of complexity.

Okay, so let's start. You know the MNIST data set already, we spent half an hour on it, so let's jump straight ahead and design the simplest possible neural network that can do something on this data set. That is what Ekaterina called logistic regression, but this time I will not spare you the matrix multiply, and I want you to understand exactly all the computations in it.

So this is a neuron. What is a neuron? A neuron is something that does a weighted sum of all of its inputs; that is about it, actually. After the weighted sum we will also add a constant and feed the result through some activation function. We will see several of those activation functions; their only common point is that they are non-linear. At the top we have all the pixels of our images, flattened as one big vector: we take the 28 by 28 images and just put all the pixels in one big vector. Then each of those neurons does a weighted sum: it adds up all of the pixels with different weights. What those weights are is exactly what the training will determine. We add, as I said, a constant, and then we feed this through an activation function.

The activation function I am suggesting here for this layer is called softmax, and that is simply taking the exponential of the inputs: we do the weighted sum, add a constant, take the exponential, and at the end we normalize, so we divide this vector of 10 neurons by its own norm, whichever norm you want, L2 for example. What we get as an output are values between 0 and 1, which we can interpret as probabilities: if this neuron has a high output, it is very likely that the picture we have put in is a 2.

Okay, so now we need to formalize this in matrices, as a matrix multiply, and actually I am going to do a little bit more. It is a very common practice in neural networks to process not one image at a time with your model but a batch of images, so I am going to process 100 images at a time. Here I have one image per line in this matrix.
All the pixels are flattened on each line, and I said I was going to do weighted sums. Using my first column of weights, I do a weighted sum of all the pixels in the first image; there is my first weighted sum. Then I use the second column of weights for the second weighted sum, and so on. So here I have the inputs to my ten neurons: ten weighted sums. I said we were also adding a constant, which is just an additional degree of freedom; do not worry too much about it, we call it a bias. We add ten constants, one to each sum, and this is the real input to my ten neurons. On the next slide I will run this through my softmax activation function.

Now I want to process all hundred images, so I repeat this operation with the second image, the third image, and so on, and that is actually a matrix multiply: I obtain the neuron inputs for the second image, the third, and so on, for all of the images. I want to write this simply as this matrix multiplied by that matrix, plus my ten constants. I usually have to explain that adding a ten-element vector to a 100 by 10 matrix does not work, and that you need to tweak the plus operation a little bit, but I guess at a Python conference you all know what broadcasting is. Who knows what broadcasting is in NumPy? Okay, not everyone, so let's explain it; it is not complicated. We would like to write this as a plus, but we cannot, because you cannot add a vector of ten elements to a matrix of 100 by 10 elements. So we still write it as a plus, we just redefine what we mean by plus, and that is completely standard practice; in NumPy it is called plus with broadcasting. It means: if the dimensions do not fit, try to replicate the small one on all the lines, and if that fits, that is what you do. With this little agreement between us, we can write this operation as X multiplied by W plus b, and this becomes the central formula for one layer of a neural network.

Let's recap. I have 100 images in X, one per line, all pixels flattened on that line. I have ten output neurons, which means that when I multiply by W, my matrix of weights, I am actually performing ten different weighted sums of all of those pixels times the weights. I have a vector of biases, which is just ten additional constants, ten additional degrees of freedom for my system, so I add those, and then I feed this, line by line, into my softmax activation function: line by line I take the exponential of each value, and once I have my ten exponentials I normalize the line, then go to the next line. And the beauty here is that, with one layer of my neural network written like this, it becomes possible to chain layers; we will see that in a minute. Here X is all the inputs, but it could just as well be the output of a previous layer.
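To make the shapes and the broadcasting concrete, here is a minimal NumPy sketch of that one-layer formula; this is my illustration, not the talk's code, and the random data just stands in for MNIST:

```python
import numpy as np

# A minimal NumPy sketch of the single softmax layer above (illustration only).
# With zero weights, every class ends up with probability 0.1.
X = np.random.rand(100, 28 * 28)   # 100 images, one per line, 784 flattened pixels each
W = np.zeros((28 * 28, 10))        # one column of weights per output neuron
b = np.zeros(10)                   # ten bias constants, one per neuron

L = X @ W + b                      # (100, 784) @ (784, 10) -> (100, 10); b broadcasts onto every line

# softmax, line by line: exponentiate, then normalise each line so it sums to 1
expL = np.exp(L - L.max(axis=1, keepdims=True))   # subtracting the max is just numerical hygiene
Y = expL / expL.sum(axis=1, keepdims=True)        # 100 lines of ten "probabilities"
print(Y.shape)                     # (100, 10)
```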
So now in Y I will have 100 predictions, and each prediction is those 10 neurons; if one of them lights up, say neuron number 2, it means "I think this is a 2". By the way, this is how you write it in Python with TensorFlow; not very surprising. Now I need to measure whether those are good or bad predictions. For each of my images in the data set I know what the image is: I have the labels, and they come formatted like this, ten numbers, all zeros except for a single 1, and the position of that 1 means this digit is a 6. And here I have my computed probabilities; again, one of them is big, and that is what the computer thinks this digit is. Now, any distance between those two vectors would work. You want the L2 distance, the sum of squared differences? That works, no problem. It is just from experience that we know that for classification problems, putting things into categories, one distance works slightly better, and it is called the cross-entropy. The cross-entropy is the numbers on the top multiplied by the logarithm of the numbers at the bottom, and then you add all of those products across the vector.
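As an aside, here is a small standalone NumPy sketch of that cross-entropy, with made-up labels and predictions just to show the arithmetic; the minus sign, not mentioned above, makes the loss positive, since the log of a probability is negative:

```python
import numpy as np

# A standalone sketch of the cross-entropy computation (made-up data, illustration only).
def cross_entropy(labels, predictions, eps=1e-12):
    # label * log(prediction), summed across each ten-element vector, averaged over the batch
    return -np.mean(np.sum(labels * np.log(predictions + eps), axis=1))

labels = np.eye(10)[np.random.randint(0, 10, size=100)]   # 100 one-hot vectors like the one above
predictions = np.full((100, 10), 0.1)                     # clueless predictions: 10% for every class
print(cross_entropy(labels, predictions))                 # log(10) ~ 2.3, the "maximum confusion" loss
```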
All right, so that is the theory, and before we actually jump into the code, let me show you a demo. What do we see here? First of all, these are the digits I am feeding in to train the system. What do I mean by training the system? Those weights and biases we have seen: we do not know which values they are supposed to take for the system to be a good classifier, so we are going to use an algorithm, which I will describe in a second, to determine the weights and biases by feeding in training digits with their known labels, modifying the weights and biases and trying to converge towards something that is a good classifier. Here you see in white all the digits that have been correctly recognized on my training set, and on a red background a couple of digits that are still being missed. That was so quick that I will restart it; well, I have to stand in front of my computer, unfortunately.

The middle graph over there is our cross-entropy. This is the error function we are trying to minimize, to produce a recognizer with the smallest possible error: in blue, the error function on my training examples, and in red, the error function on my test examples. This is something very important: you train the system on training examples, but when you want to know whether the system is actually performing well, you have to test it using examples it has never seen during training, otherwise it would be cheating. The last graph is simply what we call the accuracy, the percentage of recognized images: in blue on the training set, and in red on the test set. Actually, let's look at the test set. These are my test images; I have 10,000 test images and I am only showing 1,000 here, so imagine ten more screens like this at the bottom. Again, in white the ones that have been recognized, on a red background those that have been missed, but I have sorted them so that the missed ones are all at the top: you are seeing all of the pictures that the system has not recognized. And you see there is a percentage here; we are somewhere around 91 or 92 percent recognition with this very, very simple model.

All right, so now let's dive into the code and see how you actually write this in TensorFlow. First of all, you have to define placeholders and variables. What are variables? Variables are going to be our weights and biases: what we call variables are the things in the system that we do not know, that we want the system to determine for us using the training algorithm. And for the training to work, we need to give pictures and known labels to the system, so for that we define a placeholder: X will be a placeholder for the images that we feed in during training. So: a placeholder is something you are going to feed in during training; a variable is a degree of freedom of the system that is going to evolve by itself. Here, our weights and biases are variables, and our images are a placeholder, something we feed in during training.

Now I have my model, which you will recognize: the matrix multiply of X by the weights, plus the biases, through my softmax activation function. The only thing I have added is the reshape. You remember we had square images, 28 by 28, but our model needs all the pixels on one line, and that is what reshape does: I want 28 by 28, which is 784, on one dimension, and the minus one for the other dimension means "figure it out, there is only one solution". I add a second placeholder, which will be my labels, the known labels of all my training images, and now I can compute my cross-entropy, which is the known labels multiplied by the logarithm of the predictions, summed across the vector. And this bit is just for computing the percentage of correctly recognized images and putting it on the screen; that is really just for display.

And now we come to the core of what TensorFlow actually does for you. We have this loss function, the cross-entropy, which we have defined as depending on the weights, the biases, and also on the images. We are going to ask TensorFlow to compute the gradient of this. What is the gradient? The gradient is the vector of partial derivatives of this function with respect to all the weights and all the biases, and remember, there are several thousand weights in there and ten biases, so thank God TensorFlow does all these partial derivations formally for you, otherwise you would have to do them and actually code them. There is a way, maybe you have heard about it, called the backpropagation algorithm; who has heard about backpropagation? Okay, a couple. That is the algorithm that makes it possible to compute this gradient; it does not mean that it is easy, it just makes it possible. Here you do not have to worry about it: TensorFlow will formally derive your function and compute all these partial derivatives.

Mathematically speaking, this vector of partial derivatives is called a gradient, and a gradient is a fantastic thing: a gradient is an arrow that points uphill. Here we want to go down, we want to minimize the cross-entropy, which means we simply reverse the gradient by putting a minus sign in front and take a small step along it, towards where the cross-entropy is smaller. That will be our optimization algorithm: at each point, compute the gradient, take a small step in the direction it shows, where the cross-entropy is smaller, and repeat with a new batch of images. That is the core of the training.

Now, there is something there called the learning rate. Why do we need a learning rate? If we changed our weights and biases by the full amount of the gradient, we would be making very big steps. Imagine you are in the mountains in Europe, looking for the bottom of the valley: if you make very big steps, one or two kilometers per step, you will be jumping from one side of the valley to the other and you will never reach the bottom. You have to make small steps to get to the bottom of the valley. Remember, the space in which we are evolving is the space of all the weights and all the biases: when you compute a gradient, which is a big change to be made to your weights and biases to minimize the cross-entropy, you do not want to make that big change all at once; you only want to make a small change in that direction. So instead of jumping by a whole gradient, you jump by a small fraction of the gradient, and that is what the learning rate is for.
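Pulling the pieces just described together, here is a sketch of the graph definition along the lines of what the slides show, assuming the TensorFlow 1.x API of the time; the variable names are mine:

```python
import tensorflow as tf  # assumes the TensorFlow 1.x API the talk is based on

# Placeholders: values fed in at run time. Variables: the weights and biases that training will adjust.
X  = tf.placeholder(tf.float32, [None, 28, 28, 1])   # batch of 28x28 grayscale images
Y_ = tf.placeholder(tf.float32, [None, 10])          # the known one-hot labels
W  = tf.Variable(tf.zeros([784, 10]))
b  = tf.Variable(tf.zeros([10]))

# The model: flatten the images, then a single softmax layer.
XX = tf.reshape(X, [-1, 784])                        # -1 means "figure this dimension out yourself"
Y  = tf.nn.softmax(tf.matmul(XX, W) + b)

# Loss, plus accuracy for display only.
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))
is_correct = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

# Ask TensorFlow for the training step: compute the gradient, take a small step (learning rate 0.003).
train_step = tf.train.GradientDescentOptimizer(0.003).minimize(cross_entropy)
```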
By the way, we talked about gradient descent, but the library has many other optimization techniques; gradient descent is the simplest one, and there are others that work in the same way but converge faster or are more stable numerically, and so on. Okay, so that is what TensorFlow does for you, and now let's finish: we need to train the system.

This is the thing I want you to remember about TensorFlow: it has a deferred execution model. Everything we have written up to now was not executed as we wrote it; all those tf.something statements were only generating a computation graph in memory. When we want to actually run those computations, we need to start a TensorFlow session, and then all the calculations we do will be inside session.run of some piece of our computation graph. Why is this distinction between "I define my graph" and "I run it" important? Because, as I said, TensorFlow was meant for production as well as research, and it was meant for distributed computing. For that purpose it is very important for TensorFlow to have your full computation graph in memory: with that it can do a couple of logistical chores that you do not have to do. We are not going to go into the details, but you can assign a compute node to each part of the graph, and since TensorFlow knows what the full graph is, it will automatically figure out all the data transfers that need to happen between your nodes. That is one of the logistical challenges TensorFlow solves for you, because it knows your computation graph before starting execution. When writing code, the deferred execution model introduces a bit of complexity, but now you know why it is there.

So how does training look? You remember that train_step is what I obtained from my optimizer when I asked it to please minimize my cross-entropy. This train_step is a piece of the computation graph that computes the gradients and applies them to my weights and biases to obtain new weights and biases. So in order to train my system, I start a session, and then, in a loop, I feed in one batch of images and run my train_step, then feed in another batch of images and run my train_step again, and so on and so forth. Everything you see at the bottom is just for display, just stuff I want to put on the screen so that I can see how the system is evolving; it has nothing to do with the training. At the bottom I am computing my accuracy and cross-entropy so that I can display them, and I am recomputing them a second time, this time feeding in my test images, to compute the percentage of recognized images on the test set.

By the way, you see here how my placeholders are filled: each time I run session.run I need to provide a feed dictionary, and that dictionary, like train_data over there, has X and Y_ as keys. Those are my placeholders, the things I defined as TensorFlow placeholders, values that will be filled in during training; this feed dictionary is how I fill in those values.
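Here is a sketch of that training loop, reusing the graph nodes from the previous snippet; the MNIST loader below is the TF 1.x tutorial helper, used here only as a convenient stand-in for any batch loader with the same shapes:

```python
# A sketch of the training loop, continuing the graph defined above.
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("data", one_hot=True, reshape=False)  # images kept as 28x28x1

init = tf.global_variables_initializer()
with tf.Session() as sess:                       # nothing above ran until now: deferred execution
    sess.run(init)
    for i in range(1000):
        batch_X, batch_Y = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={X: batch_X, Y_: batch_Y})   # feed the placeholders
        if i % 100 == 0:                         # display only: accuracy and loss on the test set
            a, c = sess.run([accuracy, cross_entropy],
                            feed_dict={X: mnist.test.images, Y_: mnist.test.labels})
            print(i, "test accuracy:", a, "test loss:", c)
```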
So that is it, that is the full TensorFlow code for running this simple, so-called logistic regression, which is a neural network with just one layer. We have defined placeholders and variables: the weights and biases are the variables, and X, the images, will be fed in during training, so that is a placeholder. I have my model, just one line, because it is just one layer of neurons; a second placeholder, Y_, for the known labels of the images in the training set; my cross-entropy, which is my error function; two additional lines to compute the accuracy, the percentage of recognized images; and then I define my training step: I take the gradient descent optimizer and ask it to minimize my cross-entropy, and that is how I get my training step. Then I start a session, which means I start computing, and in a loop I load 100 images and run my training step, which computes the gradient and takes a small step along it to obtain new weights and new biases, which I then test on my training data and test data to see whether, with those new weights and biases, my system classifies better. That is all there is to it for this very simple model.

How are we faring? This is exactly the code I have shown you; the visualization is done with matplotlib, which is the standard, well, you know what matplotlib is. And that is another nice thing about TensorFlow: I told you that when you write tf.something you are only defining a graph in memory, but once you do session.run, the matrices that come back are normal NumPy matrices, which means you can feed them into matplotlib or whatever you prefer for visualization; that is what I did here. And with this we get to, let's see, 92 percent accuracy: we recognize 92 percent of the images, which is not so great yet, so we have to do better.

This is called a deep learning session, so we will have to go deep: add more layers. And since we want to go really deep, we will add tons of layers; let's try with five. How do we add layers? Simply, as I explained: if you look at the second layer, instead of each neuron doing a weighted sum of all of the pixels, it does a weighted sum of all of the outputs of the previous layer's neurons. That is it. I will keep my softmax activation function for the last layer, because I want to output values between zero and one that I can interpret as something like probabilities, but for the other layers I am going to change my activation function; I am going to use the sigmoid, which is simply a function that goes from zero to one. It is non-linear, and that is all we want from it.

So how do we write this in TensorFlow? Instead of one matrix of weights and one vector of biases, since I have five layers I will need five matrices of weights and five vectors of biases, and this is my new model. It should not be very surprising after what we have seen: the first line is still almost the same formula as before, my images multiplied by my first weight matrix, plus the biases, fed this time through my sigmoid activation function; and on the other layers I simply feed the output of the previous layer in as the input. I could have 100 layers like that if I wanted; that is actually the only change.
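Here is a sketch of that five-layer model, assuming the X placeholder from the earlier snippet; the intermediate layer sizes (200, 100, 60, 30) are illustrative choices of mine, not values quoted in the talk:

```python
# A sketch of the five-layer model that replaces the single softmax layer above.
L1, L2, L3, L4 = 200, 100, 60, 30   # illustrative sizes; only 784 in and 10 out are fixed

W1 = tf.Variable(tf.truncated_normal([784, L1], stddev=0.1)); b1 = tf.Variable(tf.zeros([L1]))
W2 = tf.Variable(tf.truncated_normal([L1, L2], stddev=0.1));  b2 = tf.Variable(tf.zeros([L2]))
W3 = tf.Variable(tf.truncated_normal([L2, L3], stddev=0.1));  b3 = tf.Variable(tf.zeros([L3]))
W4 = tf.Variable(tf.truncated_normal([L3, L4], stddev=0.1));  b4 = tf.Variable(tf.zeros([L4]))
W5 = tf.Variable(tf.truncated_normal([L4, 10], stddev=0.1));  b5 = tf.Variable(tf.zeros([10]))

XX = tf.reshape(X, [-1, 784])
Y1 = tf.nn.sigmoid(tf.matmul(XX, W1) + b1)   # each layer feeds the next one's weighted sums
Y2 = tf.nn.sigmoid(tf.matmul(Y1, W2) + b2)
Y3 = tf.nn.sigmoid(tf.matmul(Y2, W3) + b3)
Y4 = tf.nn.sigmoid(tf.matmul(Y3, W4) + b4)
Y  = tf.nn.softmax(tf.matmul(Y4, W5) + b5)   # softmax kept on the last layer only
```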
But now I want to show you a couple of tricks we use to make these neural networks converge better. The first thing: I showed you the sigmoid function mainly for historical reasons, because as people started stacking layers, the sigmoid, which compresses all values between 0 and 1, meant that with many layers the values got compressed a lot and you ran into problems. So forget about the sigmoid: someone had a bright idea and invented a new activation function that is actually even simpler than the sigmoid. It is called the rectified linear unit, the ReLU, and it is just this: zero for inputs below zero, and the identity for inputs above zero. How simple can it get? The little story is that both of these functions were inspired by biology, by what people thought the neurons in our heads were doing, and the consensus in biology has also shifted, from neurons represented like the sigmoid to neurons represented like this: now we think the neurons in our heads are more likely to follow this kind of activation function. Well, we are in computer science, so thanks to the biologists for the good ideas, but really: we tried it, and it works better. Let me show you very quickly: this is my accuracy over just the first 300 iterations, the very beginning. This is the kind of convergence you get with sigmoids, and this is with ReLUs: it goes up much faster. So forget about sigmoids, just remember ReLUs.

Okay, now let's push this: we added our five layers, we replaced the sigmoids with ReLUs, and we push this to 10,000 iterations. First of all, look at the accuracy: 98 percent. That is a big improvement, and we are very happy. But when I look at these curves, they are not very pretty. Look at the test accuracy, the red curve: it is jumping up and down by a full percent. This is a clear sign that we are trying to converge too fast; we are making big steps from one side of the valley to the other, and we need to go slower. What is a good way of going slower? You could lower the learning rate, but then you would go very slowly from the start. A better technique is to decay the learning rate exponentially: you start with a reasonable learning rate and then you go slower and slower as the iterations progress. This is actually quite spectacular. This is what we had with a learning rate of 0.003; as a reminder, that is the fraction of the gradient by which you modify your weights and biases at each step. If I decay this from 0.003 down to 0.0001: wow, much nicer. And not just nicer: my accuracy has actually stabilized at what was previously a peak value. And look at this, there is something even more marvelous here: look at the blue accuracy, the percentage of images recognized in my training set; it is stuck at 100 percent for thousands of iterations. For the first time I have a system that has learned to recognize my training images completely; all of the training images are recognized. It is not the case for the test images, but at least the training images are.
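Here is a sketch of that learning-rate decay, again assuming the TF 1.x API; the 0.003 and 0.0001 endpoints are the values quoted above, while the decay speed (2000 steps) is my assumption:

```python
import math

# A sketch of exponentially decaying the learning rate from 0.003 down towards 0.0001.
step = tf.placeholder(tf.int32)   # the current iteration number, fed in just like the images
lr = 0.0001 + tf.train.exponential_decay(0.003, step, 2000, 1 / math.e)
train_step = tf.train.GradientDescentOptimizer(lr).minimize(cross_entropy)

# in the training loop, feed the iteration number along with the batch:
#   sess.run(train_step, feed_dict={X: batch_X, Y_: batch_Y, step: i})
```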
Okay, let me show you something else now, a new demo. Just a second, do not look at what I am doing, I am cheating here, just changing my parameters a little bit. Sorry, I launched the wrong one; now it is the correct one. This time I am showing you, no longer the weights and biases, but the inputs and the outputs of my neurons, and the bands are percentiles: I have seven bands, so each band holds roughly one seventh of the values. What I wanted to show you, and I have to zoom in a little bit: look at the outputs of the neurons. These bands represent all the values that the outputs of the neurons have taken in my whole system, on all layers. You see there is one band here that is gone: all those neurons are outputting zero. Another one is gone: these are outputting zero too. The middle band, my median, is crashing towards zero as well: more and more of my neurons are starting to output zero. And if you look at the inputs, well, remember we are using ReLUs: the ReLU outputs zero for anything less than zero, it is flat there, and then it is the identity. So as my weights and biases get modified, you see that the inputs, the weighted sums that go into the neurons, are diving down as well: zero is here, and more and more of them are going below zero. So what is going on? My neurons, for some reason, are dying: more and more of them are outputting zero, which is not a very useful value for a neuron.

What can I do about it? Well, for the problem of dying neurons a very good technique has been invented, which is to shoot the neurons. It makes sense, just wait a second. It is called dropout. What is dropout? Again, it is mind-bogglingly simple; you wonder why it works. During training, I am going to actually remove a certain fraction of my neurons from the network: I remove the neurons, I remove their weights. The only extra thing I do is that when I remove a neuron, I slightly boost the output of the remaining neurons on that layer, to make sure the activations of the next layer are not shifted. Apart from that, I am just removing neurons. I take a probability: here 0.75 means I have a 75 percent chance of keeping a neuron and a 25 percent chance of dropping it, and during training, at each batch of images, at each turn of the training loop, I shoot 25 percent of my neurons, and on the next batch I shoot a different, random set of them. Of course, when I evaluate the performance of my network, I put them all back, so I need a way of putting them all back. This is how you write it in TensorFlow: there is a helpful dropout function that sets some of the values in your layer outputs to zero, according to the probability you give it, and boosts the remaining values a little bit.

So what is the result? This is what we had without dropout, and this is what we have with dropout: my neurons are dying a lot faster. Hmm, what is this guy talking about? Okay, so why is this actually a good thing? You remember, a couple of slides back, with those five layers, I told you: let's go big, plenty of layers and plenty of neurons in each layer. I have actually designed a neural network that is way too big for my problem, and the learning algorithm is correct in shutting down all of those unused, not-useful neurons; what dropout gives me is simply that they get shut down faster. So it is faster convergence, and more freedom for me, as the network designer, to make a mistake in evaluating how many neurons I need: with dropout I can be wrong, maybe by a factor of ten, and it is still going to be okay. And with dropout I get to slightly above 98 percent.
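Here is a sketch of what that looks like on one intermediate layer, reusing the names from the five-layer snippet above; pkeep, the probability of keeping a neuron, is fed through a placeholder so it can be 0.75 during training and 1.0 during evaluation:

```python
# A sketch of dropout on one intermediate layer (TF 1.x API), reusing W1, b1, W2, b2, XX from above.
pkeep = tf.placeholder(tf.float32)           # probability of KEEPING a neuron

Y1  = tf.nn.relu(tf.matmul(XX, W1) + b1)     # sigmoids already swapped for ReLUs at this point
Y1d = tf.nn.dropout(Y1, pkeep)               # zeroes a random fraction of outputs, boosts the rest
Y2  = tf.nn.relu(tf.matmul(Y1d, W2) + b2)    # the next layer reads the dropped-out version
# ... the same pattern on the other intermediate layers, never on the softmax output ...

# training:   sess.run(train_step, feed_dict={X: batch_X, Y_: batch_Y, pkeep: 0.75, step: i})
# evaluating: sess.run(accuracy,   feed_dict={X: test_X,  Y_: test_Y,  pkeep: 1.0})
```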
Just to summarize what we got: initially we had our five layers with sigmoids, slightly below 98 percent accuracy. We replaced the sigmoids with ReLUs: you see it starts a bit faster, and that also had a positive effect on the final performance of the system, 98.2 percent; we jumped above the 98 bar. Then we added the learning rate decay, which cleaned up those curves a lot; now our 98.2 is sustained. And then we added dropout. In this case dropout naturally adds a bit of noise, you see how harsh the technique is, so you have to expect some more noise; it is not actually adding much to the final accuracy here, but you see it is doing something quite funky over there, and we will come back to that. So with all this we get slightly above 98 percent accuracy.

Is that good enough? No, come on: in production, if this is actually used in production, 98 percent means that out of 100 digits, two are crap. You cannot use this in production; we have got to do better. So let's go back to our curves and look at what we have. You see that whatever we did, we always had a disconnect between the loss function on our test images and the loss function on our training images: on the training images the loss was optimized all the way down to zero, but on the test images not so much; there is always a gap. It starts going down, but then a gap opens, and there will always be a gap: we are optimizing by looking at the training images, and the optimization algorithm never sees the test images, so it cannot do anything about them specifically, which is what we want; we do not want something specifically optimized for our test images. The question is how big this gap is going to be between my loss on the training set and on the test set. When people see this on the loss curves, they usually call it overfitting, and if you ask the specialists what overfitting is, they will give you several possible answers.

The first and most common answer is: you have too many neurons. Why is that a bad thing? Imagine I have so many neurons that the system can actually store 100 percent of my training set in its neurons: it could store all the images in those weights and biases and just do some kind of lookup from there. That will work, it can learn all of the training examples by heart, but it will have no impact whatsoever on real-world performance, on the digit I am going to draw in the future; the fact that it has learned a certain number of digits by heart does not help with that. You have to force the system to generalize, and in a neural network you do this by restricting the degrees of freedom: if it has too many degrees of freedom, it can do weird things just to remember the training set and minimize the cross-entropy. Remember, that is its only goal, the only goal of the training algorithm is to minimize the cross-entropy, so that is what it does. So that is the first answer: too many neurons; you have to constrain them to force the system to generalize and actually get good performance on real-world data. The mathematical opposite of that is not enough data: even if you do not have that many neurons, if you only have a very small data set, you can still store the full data set in your weights and biases and end up with a system that has just learned the training set by heart and still cannot do anything useful in the real world.
Now, in our case here, we know we have a bit too many neurons, but we added dropout to regularize that, so we are pretty sure that what we are calling overfitting here is not because there are too many neurons and the network cannot generalize; the dropout forces it to generalize. And we have enough data: the MNIST data set is 60,000 digits, so we do have enough data. So there is only a third explanation: the shape of our network is not capable of extracting more information from this data set, and we have to invent something else.

How do we do that? We can go back to the beginning and try to identify what we did that was completely stupid. You remember, at the beginning we had nice images of digits, and we said: let's put all the pixels on one line. That sounds pretty stupid, doesn't it? A digit has shapes, lines and curves, and when you just put all the pixels in a big bag, you are destroying those shapes; you are destroying the locality information that says "these pixels form a small line", and so on. It turns out that information is actually useful, and people have devised another type of neural network, called convolutional networks, that can take advantage of this locality information.

So let's jump into convolutional networks. I know it is exactly the time of the afternoon when it is hot and stuffy and sleepy, and I warn you, this is where people usually fall off and tell me: "I understood everything up to this point, and after that it was a bit mushy." So the warning is simply that there is nothing complicated in what I am about to explain; just hang on to the explanation. Still no complicated mathematics; by the way, there is not that much complicated math in deep learning, I have seen domains with much more complex mathematics than this.

So what is a convolutional layer? A neuron still does the same thing: it does a weighted sum of its inputs. But now, instead of summing all of the pixels in the image, I am only summing a small patch of pixels. Here I have taken the general example of a color image, so my image has three color channels. Now, and this is the important thing to understand, using the same weights I am going to sweep across the image, producing a new weighted sum of that little patch at each position: I do this over the whole image, in this direction and in the other direction, and with proper padding on the sides I obtain as many weighted sums as I had pixels in my original image, using one set of weights. I insist on this, because previously each neuron had a different set of weights for its weighted sums; here it is the same set. Well, that is great, but it is also a problem: how many weights do I have now? How big is that little highlighted cube? Four by four by three, that is 48. Previously I had thousands of weights, degrees of freedom in my system, and now it is supposed to work with only 48? No, that is not going to work; I need more degrees of freedom. So I simply do the same thing a second time with a different set of weights: I choose a different weight matrix and do the same sweep a second time, and I obtain a second channel of weighted sums. And now just a bit of mathematical trickery: I am using tensors, so instead of representing these as two separate tensors, I put them both in one by adding a dimension. And lo and behold, I have a convolutional layer, which does a four-by-four convolution on an image with three input channels and produces two output channels of results.
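Here is a standalone shape check of that layer, assuming the TF 1.x conv2d API: a 4 by 4 patch read over 3 input channels, producing 2 output channels:

```python
import tensorflow as tf  # TF 1.x API

# One convolutional layer as just described: same weights swept over the whole image.
images = tf.placeholder(tf.float32, [None, 28, 28, 3])               # batch of 28x28 color images
patch  = tf.Variable(tf.truncated_normal([4, 4, 3, 2], stddev=0.1))  # [height, width, in ch, out ch]
conv   = tf.nn.conv2d(images, patch, strides=[1, 1, 1, 1], padding='SAME')
print(conv.shape)   # (?, 28, 28, 2): one weighted sum per pixel position, times two output channels
```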
Now that I have it formalized like this, I can stack convolutional layers. There is one more little problem to solve: as I stack those convolutional layers, I also need to boil my information down, because remember, at the very end I want only ten neurons as an output, telling me this is a 1 or this is a 9. Traditionally, the way this was done, based on intuition, was to add a subsampling layer after each convolutional layer. The intuitive reasoning was that these convolutions, these weights, are going to evolve into some kind of shape recognizers, little patches recognizing straight lines or little patches recognizing circles and so on, and once you have recognized a small circle somewhere in the image, you want to hold on to that information, you want to pass down the fact that "here I have seen a small circle", whereas for all the pixels where you have not seen anything, you can forget about them. So the subsampling was done by taking squares of two by two pixels and keeping the maximum value, the maximum value being "here I have seen a small circle"; you pass that down. But then people realized that you can also simply do the convolution with a stride: instead of moving one pixel at a time, you jump two pixels at a time, and instead of obtaining 28 values across the image you obtain only 14. That is also a very good way of boiling the information down, by playing with the stride, the step of the convolution. So I am mentioning those subsampling layers, but they too are history today: when you build convolutional networks now, they are all convolutional, only convolutional layers.

And here is the network with which we are going to try to break the 99 percent barrier today. First convolutional layer: a patch of 5 by 5, reading one channel, because we have a grayscale image, just one value per pixel, and outputting four channels, so I use four different patches of weights to convolve my image. Next layer: this time I use a stride of two, so I do my convolution every two pixels, which means I obtain a smaller, 14 by 14 set of output values; I use a 4 by 4 patch, four input channels, and eight output channels. Then a third convolutional layer, again with a stride of two, which means I go from 14 by 14 down to 7 by 7 in size, and I bump it up to 12 channels. You are going to ask me why these values and not different ones; we will come back to that. Then I add a fully connected layer, a layer as we have seen previously, where each neuron does a weighted sum of all the values of the previous layer, and finally our softmax output layer.

So how do we write this in TensorFlow? I skip the initialization, but this is the model. TensorFlow has a conv2d function which does the convolution: you give it X, the images, and W1, the first patch of weights, and it does the full convolution of that patch of weights over the whole image. Then the second convolutional layer, the third convolutional layer; after the third layer I need to reshape my values onto one line so that I can do a weighted sum of all of them, then a normal fully connected layer, the model we saw at the beginning, and finally my softmax layer. So in the code we had previously, you replace the model with this, with proper initialization of the weights and biases, and we can try what this gives us.
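Here is a sketch of that model, assuming the TF 1.x API and the X placeholder of shape [None, 28, 28, 1] from earlier. The 5x5 and 4x4 patches and the 4, 8, 12 channels follow the description above; the third layer's patch size and the 200-neuron fully connected layer are my assumptions:

```python
# A sketch of the convolutional model described above.
K, L, M, N = 4, 8, 12, 200   # channels per conv layer, then the fully connected layer size

W1 = tf.Variable(tf.truncated_normal([5, 5, 1, K], stddev=0.1));   b1 = tf.Variable(tf.zeros([K]))
W2 = tf.Variable(tf.truncated_normal([4, 4, K, L], stddev=0.1));   b2 = tf.Variable(tf.zeros([L]))
W3 = tf.Variable(tf.truncated_normal([4, 4, L, M], stddev=0.1));   b3 = tf.Variable(tf.zeros([M]))
W4 = tf.Variable(tf.truncated_normal([7 * 7 * M, N], stddev=0.1)); b4 = tf.Variable(tf.zeros([N]))
W5 = tf.Variable(tf.truncated_normal([N, 10], stddev=0.1));        b5 = tf.Variable(tf.zeros([10]))

Y1 = tf.nn.relu(tf.nn.conv2d(X,  W1, strides=[1, 1, 1, 1], padding='SAME') + b1)  # 28x28, K channels
Y2 = tf.nn.relu(tf.nn.conv2d(Y1, W2, strides=[1, 2, 2, 1], padding='SAME') + b2)  # stride 2: 14x14, L channels
Y3 = tf.nn.relu(tf.nn.conv2d(Y2, W3, strides=[1, 2, 2, 1], padding='SAME') + b3)  # stride 2: 7x7, M channels
YY = tf.reshape(Y3, [-1, 7 * 7 * M])         # flatten for the fully connected layer
Y4 = tf.nn.relu(tf.matmul(YY, W4) + b4)      # fully connected layer
Y  = tf.nn.softmax(tf.matmul(Y4, W5) + b5)   # softmax readout, ten classes
```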
So let's run it. Run, run... why is it not there? Okay, I do not know why it does not want to... okay, I got it, sorry about that. [Moderator: Martin, actually we have no more time.] Just one minute, it is done, don't worry. The first thing you see is that it is slightly slower, we are asking it to do more computations, but the second thing: look at the accuracy, it is shooting right up, and we are only at two hundred iterations here. So let this run; fortunately I have a small video, and we can jump to the very end, where we see with some disappointment that even with those tricks we are only at 98.9 percent accuracy. So we have to do something else to fix that error curve going up on the test data. But we have seen a technique for that, called dropout, and actually a very general technique in neural networks is to restrict the network until it does not perform that well, then give it a little bit more breathing room, a little bit more freedom, and add dropout. So here I am extending my channels from 4, 8, 12 to 6, 12, 24, I am adding dropout on this layer, and lo and behold: 99.3 percent. Thank you.

And just so that you see what the dropout did: this was without dropout, and this was with dropout. You see the regularization on the loss, on the cross-entropy, and you see we won a couple of fractions of a percent.

Just before we finish: this was TensorFlow. If in the future you want to run your TensorFlow code in the cloud, taking advantage of Google's infrastructure, and by the way, in Google's infrastructure we now have specialized hardware for running neural networks that is many times faster than the fastest GPUs you can get on the market today, we are pretty soon releasing a system for that, called Cloud ML: you write your neural network in TensorFlow, you ship it to Google, and you train it there with very, very good performance. Thank you.
Info
Channel: Видео с конференций IT-People
Views: 41,608
Rating: 4.9641256 out of 5
Keywords:
Id: sEciSlAClL8
Length: 56min 29sec (3389 seconds)
Published: Wed Jul 13 2016