TensorFlow, Deep Learning, and Modern Convolutional Neural Nets, Without a PhD (Cloud Next '18)

Captions
[MUSIC PLAYING] MARTIN GORNER: Please settle in. I see people are still coming, I see empty spaces. Make yourself comfortable. While you settle in, can you tell me: has anyone built a neural network in this room? Raise your hand. OK, quite a few. And among the others, if I say things like, let's say, convolutional layer or cross entropy loss, if it rings a bell, if you know what I'm talking about, raise your hand. OK, a few as well. All the others, you will not be able to follow. No, I'm joking. This talk is specifically designed to take all of you developers through the learning curve, and today we will build a neural network together, from scratch to something that I believe is a good quality neural network. And if some of you have seen other talks from the TensorFlow Without a PhD series, today we will be focusing on the latest advances in visual processing neural networks and the architecture ideas that go into a good neural network today.

So let's start, and for that we need a dataset. What can we do? Let's head to Kaggle, the data science community. I went there, and I found this dataset of little 20 by 20 tiles with airplanes and non-airplanes. Actually, plane spotting was quite a nice activity. Why don't we build something that can tell airplanes from non-airplanes, and then continue building an airplane detection neural network?

So how do we start? Well, let's take the neural network 101 manual, and on the first page it says: this is a fully connected, or dense, neural network. The image comes in as a set of pixels, and we flatten all the pixels into one big vector. And we will be processing that big vector. The white circles you see are neurons. A neuron in a neural network always does the same thing: it does a weighted sum of all of its inputs and then feeds the sum through what is called an activation function. That's just a function, number in, number out. But in neural networks, this activation function is always non-linear. This is the key thanks to which neural networks can solve non-linear problems. So you can stack those layers of neurons. The second layer, instead of doing weighted sums of pixels, does weighted sums of the outputs of the previous layer. You can have as many layers as you want. And then at the very end, I added a layer with just two neurons. And here, since I'm building a classifier, something that will classify those images into airplane or non-airplane, my hope is that with the correct weights, one of these two neurons will have a very strong output when this is a plane, and the other one will have a very strong output when this is not a plane.

So we need to choose the activation functions correctly. And again, looking at the neural network 101 manual, most of the time the activation function you use is called a ReLU. And the manual says that if you're building a classifier, there is one exception on the very last layer: there you will use a different activation function, which is called softmax. I will dive into those two on the next slide, but first let's write the code for this. So this is what it looks like in TensorFlow. The first line just flattens all the pixels into one big vector of pixels. And then in TensorFlow, you have this high-level API which can instantiate an entire layer in one line. So here I have instantiated my three layers: the first one with 200 neurons, then 20, and the last one with two neurons.
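For reference, a minimal sketch of those three layers in the TensorFlow 1.x layers API could look like this (the variable names and input shape are assumptions, not the exact code from the slides):

    import tensorflow as tf

    def dense_model(images):
        # images: a batch of 20x20 RGB tiles, shape [batch, 20, 20, 3]
        x = tf.reshape(images, [-1, 20 * 20 * 3])            # flatten all pixels into one big vector
        x = tf.layers.dense(x, 200, activation=tf.nn.relu)   # first layer: 200 neurons, ReLU
        x = tf.layers.dense(x, 20, activation=tf.nn.relu)    # second layer: 20 neurons, ReLU
        logits = tf.layers.dense(x, 2)                       # last layer: 2 neurons (plane / not plane)
        return logits                                        # softmax is applied together with the loss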
As you see, the intermediate layers are activated with this ReLU activation function. And at the end, I need to do something different. So if you are building a classifier, let's follow the recipe. The recipe says that if you're building a classifier, the last layer must have the softmax activation function, and you will compute some distance between what your network is predicting and the correct answer. We are doing supervised training here. So we don't know initially what those weights are. We are doing weighted sums, but we don't know what those weights are. And we will be feeding examples into this model where we know the answers and comparing what the model is predicting against our answers. So we need a distance function. And in a typical classifier, the distance you will use is called cross entropy. I don't even necessarily need to know at this point what that is. TensorFlow actually has a combined function that applies this last softmax activation and computes the cross entropy distance between what the network predicts and the correct answer. This is called a loss, or error, function, and that's what we want to minimize.

So as soon as you have built your layers and computed your loss or error function, TensorFlow can take over. You pick any of the optimizers that are available (here I chose the Adam optimizer) and you kindly ask it to minimize this loss. This gives you a training operation, which TensorFlow will then repeat in a loop, and what happens in this training operation is all the training magic. TensorFlow will look at your loss, differentiate it relative to all the weights in the system, obtain something that is mathematically called a gradient, and apply an algorithm called gradient descent that figures out how to change the weights in your network so as to make this loss smaller. And so you will be feeding in images and correct answers in batches. And at each batch, TensorFlow slightly adjusts the weights in your network so as to make the loss, the distance between what your network is predicting and the correct answer, smaller and smaller and smaller. That is how you train a neural network.

So just a little look at those two activation functions. Here is again just one neuron, doing a weighted sum of all of its inputs. Usually you add something called a bias. That's another degree of freedom, something that will be determined by training. And you feed this through an activation function. So this ReLU function that we have used on the intermediate layers, this is what it looks like. It's a very, very simple function: 0 for all negative values, identity for all positive values. The softmax activation function that you use on the last layer (here I have represented the last layer with 10 neurons, so that's if you're doing a classification into 10 categories) is slightly more complex. Well, actually it's just an exponential: you compute a weighted sum and take its exponential. And then you normalize across those 10 neurons. I put in a little animation. Here is what it does. Boom. It pulls the winner apart. But without destroying the rest: it's not max. It doesn't completely nullify all the bad answers. So you still have signal if your network is mis-recognizing something. That's why it's called softmax. It's max, but in a soft way.
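As a hedged illustration of that recipe, here is a tiny numeric softmax example plus the loss and training operation wired up in TensorFlow 1.x (the placeholder shapes and the learning rate are assumptions, not the exact values from the slides):

    import numpy as np
    import tensorflow as tf

    # Softmax pulls the winner apart without zeroing the losers:
    scores = np.array([2.0, 1.0, 0.1])
    print(np.exp(scores) / np.sum(np.exp(scores)))            # ~[0.66, 0.24, 0.10]

    # Loss and training operation, wired the way the recipe above describes:
    images = tf.placeholder(tf.float32, [None, 20, 20, 3])
    labels = tf.placeholder(tf.float32, [None, 2])            # one-hot correct answers
    flat = tf.reshape(images, [-1, 20 * 20 * 3])
    logits = tf.layers.dense(flat, 2)                         # stand-in for the network's last layer

    # Combined op: applies the last softmax activation and computes the cross-entropy distance.
    loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)

    # Any available optimizer works; Adam as in the talk. Running train_op once per batch
    # nudges every weight a little in the direction that makes the loss smaller.
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)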
It pulls the winner apart, but it doesn't discard all the negative information, which is still useful for training. So we have all that. We have built our network. Now we can train it. We have our dataset. Here is what goes into the witch's brew: ReLU and softmax, the two activation functions, and cross entropy as the distance function. And it's ready. Let's test.

Actually, before we go there, a word about the tooling we're using here. The code is in TensorFlow. The tool I'm using to actually run the training is ML Engine on Google's Cloud. It has one super useful feature that I love: whatever infrastructure I run my training on, when the job is done, it shuts it down. That's not rocket science, but I just can't be bothered shutting down my machines and figuring out which of my jobs are done. ML Engine gives me a job-based view. I can show you here that I can see my jobs, running or finished. And when they are finished, the machine or the cluster goes down. And finally TensorBoard, that's a visualization tool that you have in TensorFlow where you see all your curves and you can see what's going on.

Oh, no, I can't skip this. When handling images, I will have to introduce another piece of technology here. Those dense neural networks that you have seen work well for a lot of things, but for images you need something else. And those are called convolutional neural networks. So bear with me here. In a convolutional neural network, what is called a neuron behaves a little bit differently. A neuron (this little cube here, that's the output) sees only a little fraction of the image right above it. It doesn't do weighted sums of all the pixels of the image, just a little portion. And then the next neuron does a weighted sum of a little portion just above it, but using the same weights. So it's actually a filtering operation. You pick a set of weights, as many weights as there are highlighted cubes in the image over there. My image has red, green, and blue channels because it's a color image. And I'm doing weighted sums of all these pixels, but using the same weights at each position. It's a filtering operation. And once I have moved this filter across my entire image, with correct padding on the sides, I have as many outputs as I had pixels initially.

So how many weights did I use? Well, as many as the highlighted pixels. That's 4 by 4 by 3: 4 by 4 is 16, times 3 is 48 weights. Typical neural networks have something in the tens or hundreds of thousands of weights. So we need a way of giving this more freedom, more weights to play with. And a good way is to pick another set of weights and repeat the operation. So you pick just another set of weights, repeat the operation, and you obtain a new channel of data in the output. And you can do that again and again, which means that a convolutional layer in a neural network will transform a data cube into another data cube. And this is the shape of this convolutional filter, the matrix of weights. The first two numbers, 4 by 4, that's the size of the filter in pixels. The third number is how many channels of information you are reading in the input image, so three channels: red, green, blue. That's this number here. And then you repeat this operation four times with four different sets of weights, and as a result, you obtain four different channels of outputs.
And the four is this last number here. So this is a convolutional layer. And since it has a number of input channels and a number of output channels, you can chain them. A convolutional network will be a sequence of those layers, transforming the data cube into another cube, then filtering that data cube again, transforming it into a new data cube, and so on.

So this data cube can grow in both directions, the vertical direction or the horizontal direction. We have seen that in the vertical direction, that's just the number of times you repeat the filtering operation with a different set of weights. But how do you adjust the size in the horizontal direction? Usually the idea is to go from an image and boil the information down into something smaller, like recognizing what is in the image. There are two main choices for that. The first one is to play with the step of the convolution. Instead of doing those weighted sums pixel by pixel, you jump every second pixel. Mechanically, you obtain half as many results in the output. That's the stride parameter that you see here, stride 2 or stride 1. And there is a second option, which is actually more used, and that's called max pooling. And here the idea is interesting. These are filtering operations, so as the network trains, those filters will train to pattern match, to recognize certain features in the image. Let's say there is one that trains to recognize little horizontal lines, another one that specializes in vertical lines, and so on. So the output of the filter is basically something like: here I have seen a little horizontal line, here I have seen nothing, here I have seen nothing, and so on. The max pooling operation takes four of those in a square and just keeps the max, which makes sense because you are interested in the value of the filter that was maximum at that point, because that is where the filter has actually seen something. The other ones, which you discard, say "I've seen nothing," and that's not really interesting. So this is a basic subsampling operation. You take your data cube, you take the data points in little 2 by 2 squares, four at a time, and just keep the maximum. And again, that reduces the size of the data cube horizontally.

And the little guy here points something out: something called a 1 by 1 convolution. If you're a mathematician, that doesn't make much sense; a 1 by 1 filter is just multiplying by a constant. That's not very useful. But again, we are doing this filtering multiple times with different constants. So a 1 by 1 convolution actually makes sense. It's a weighted sum of this little column of data points, and that weighted sum might be interesting. So we will see later that 1 by 1 convolutions actually make sense in convolutional networks.

So this is what we will build. There are three convolutional layers. We have our image here, three convolutional layers, and then at the end we need to connect this to our softmax layer, which will do the airplane versus non-airplane classification. So we take the last data cube, reshape it, and flatten it out into one big vector here. We apply normal dense layers to this vector and end up with our softmax activation and softmax cross entropy loss, because this is a classifier. After that, a lot of experimentation and a lot of regularization. So I will not go into this.
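A minimal sketch of that small convolutional classifier, with illustrative filter counts and kernel sizes rather than the exact ones from the slides, might look like this:

    import tensorflow as tf

    def small_convnet(images):
        # images: [batch, 20, 20, 3]. Filter counts and kernel sizes are guesses.
        y = tf.layers.conv2d(images, filters=16, kernel_size=4, padding='same',
                             activation=tf.nn.relu)                     # 20x20x3  -> 20x20x16
        y = tf.layers.conv2d(y, filters=32, kernel_size=3, strides=2,
                             padding='same', activation=tf.nn.relu)     # stride 2 -> 10x10x32
        y = tf.layers.conv2d(y, filters=64, kernel_size=3, padding='same',
                             activation=tf.nn.relu)                     #          -> 10x10x64
        y = tf.layers.max_pooling2d(y, pool_size=2, strides=2)          # max pool -> 5x5x64
        y = tf.reshape(y, [-1, 5 * 5 * 64])                             # flatten the last data cube
        y = tf.layers.dense(y, 64, activation=tf.nn.relu)               # normal dense layer
        return tf.layers.dense(y, 2)                                    # logits for plane / not plane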
You have many other talks in the TensorFlow Without a PhD series that focus on what regularization is. For now, all you need to know is that these are standard techniques that can improve convergence. And I am quite good at those techniques. Plus another one, which is hyperparameter tuning. As you have seen, there are many parameters here: the size of the filters, the number of layers, the strides, and so on and so forth. If you know what you're doing, you know the acceptable ranges for those parameters, but it's still a lot of work to explore this parameter space. That's why ML Engine gives you this hyperparameter tuning module where you just define your space and say: go ahead, try all the combinations.

But there are a couple of ways of trying out all the combinations. The basic one is just grid search, and that's where you would start: map out all the possible combinations of parameters and search through the whole grid. It's actually slightly faster to do a random search than a grid search. That's a bit counter-intuitive. We are rational people, I like the grid, but it turns out random is slightly better. But ML Engine does a third algorithm that is called Bayesian optimization, and I won't go into this one. It's something where, from one set of runs, it can mathematically determine which part of the parameter space has been mapped and where it's still missing information, and focus on that other part of the parameter space in an optimal way. So that is the best way of doing hyperparameter tuning, and ML Engine does that.

Now, this is how you package your files to go to ML Engine: just your Python code in a folder and a config file, which usually has just this in it, scale tier BASIC_GPU, which means one machine with a GPU, and then you run. You use the gcloud command line, gcloud ml-engine jobs submit training, and you train. If you add these lines here to the config file, then instead of starting a normal training job, you start a hyperparameter tuning job. So what did I do here? I said I want to maximize some metric. There is a way of specifying what your metrics are in your TensorFlow code; here it's my accuracy. I want 50 trials, 10 trials in parallel, and again it's Bayesian optimization, so it derives useful information from finished trials for the next ones. That's why it's better not to run all the trials at the same time, even if you have the necessary hardware. And then you say which parameters you want. I have one parameter called LR2, which is an integer, with min and max values and a scale. This other one is a categorical parameter. And it will try all those parameters in some optimal way.

So using all this, the network you have seen here, plus my best knowledge of regularization techniques, plus hyperparameter tuning, I was able to bring this network to an accuracy of 99.6%, and I was very, very proud of myself until I tried this network in real life. Let me show you a demo. First of all, just a little trick: how do you transform a classifier into a detector? It's actually fairly easy. If you have a big image, just cut it into 20 by 20 tiles, slightly overlapping and maybe at different resolutions, which you then resize to 20 by 20, and run the classifier. Wherever you see a plane, you put a box there, and you have a detector. So here we have San Francisco, and this was my very first model. This one is not yet very good: just the data I had, not much regularization, not much hyperparameter tuning.
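(As a reference aside before the demo: the config file with the extra hyperparameter tuning section might look roughly like the sketch below. The parameter ranges and the second, categorical parameter are illustrative guesses, not the exact values from the slides; the job itself is submitted with gcloud ml-engine jobs submit training as described above.)

    trainingInput:
      scaleTier: BASIC_GPU                  # one machine with a GPU
      hyperparameters:
        goal: MAXIMIZE
        hyperparameterMetricTag: accuracy   # the metric reported from the TensorFlow code
        maxTrials: 50
        maxParallelTrials: 10
        params:
        - parameterName: LR2                # his integer parameter; the range is a guess
          type: INTEGER
          minValue: 10
          maxValue: 30
          scaleType: UNIT_LINEAR_SCALE
        - parameterName: decay_type         # hypothetical categorical parameter
          type: CATEGORICAL
          categoricalValues: ["linear", "exponential"]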
I guess it looks like it's doing something. It's not completely horrible. So it's encouraging. Then I used my best skills to hyperparameter tune and regularize the hell out of this. And I obtained my 99.6% accuracy model, which I will now run for you. Wait for it. It's absolutely horrible. You see here, let me show you: lots and lots of false positives everywhere. It's noisy. It's not clean. So here is the first lesson of neural networks. There is your training data. There is your evaluation data, on which you compute your accuracy; of course, computing your accuracy on your training data would be cheating. And then there is real life. And real life has nothing to do with either your training or your evaluation data. Real life is hard. So with a lot of effort, actually augmenting the dataset, adding tiles of non-planes, hoping that this would make things better, I was able to increase the accuracy of this model a little bit. But as you will see, it's still not great. And that was the end of my neural network 101 handbook. So from now on, you've got to read papers. That's scary, so before we go there, let me give you a couple of TensorFlow tips.

To use ML Engine to the best of its capabilities, I advise you to wrap your model in what is called the Estimator API. That's because in an Estimator, we have written for you a ton of boilerplate code that is not interesting to write: things like checkpoints (regularly outputting checkpoints so that if your training crashes after 24 hours, you can restart from where you were), exporting the model at the end so that you have something that is ready to deploy to a serving infrastructure, or distributed training. The distribution algorithms of distributed training are also baked into the Estimator. And to wrap your model in an Estimator, you need those four things here. Then you can run train_and_evaluate, which will alternate training and evaluation phases, so that in your output you get nice curves with your training metrics and evaluation metrics, you can compare the two, and so on. Mostly what you need to provide are those four functions here. So let's go quickly through them. It's really nothing fancy.

The model function: it's your layers, that's your model. And then it returns whatever a model is supposed to return: the predictions, the loss. You put the loss into an optimizer, and you've got this training operation, which the Estimator will run in a loop, and whatever evaluation metrics you care about. So that's your model.

Then the training input function. I'm putting code on the slides here; don't try to read all the code, I will give you the highlights of what is in there. You will not have the time to see all the syntax. The training input function is the function that defines how your data goes into the model. And I use the Dataset API. That's really good, because the Dataset API is designed for out-of-memory datasets. You define what your dataset is, and then, as your model is training, the data is loaded, and that triggers the loading of additional files from disk if the dataset does not fit in memory. And by the way, no dataset ever fits in memory, no real dataset. So here, for example, I'm reading images; focus on this here. The dataset is initialized from files: get matching files in a directory, all the files in a directory. And I like this syntax. It's really the workflow I'm used to.
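A compact, hedged sketch of those pieces (a model function, an input function using the Dataset API, a pass-through serving input function, and train_and_evaluate), with hypothetical paths and sizes, could look like this:

    import tensorflow as tf

    def model_fn(features, labels, mode):
        # The layers: reusing the small_convnet sketch from earlier.
        logits = small_convnet(features['image'])
        predictions = {'probabilities': tf.nn.softmax(logits)}
        if mode == tf.estimator.ModeKeys.PREDICT:
            return tf.estimator.EstimatorSpec(mode, predictions=predictions)
        loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
        train_op = tf.train.AdamOptimizer(0.001).minimize(
            loss, global_step=tf.train.get_global_step())
        accuracy = tf.metrics.accuracy(tf.argmax(labels, 1), tf.argmax(logits, 1))
        return tf.estimator.EstimatorSpec(mode, predictions=predictions, loss=loss,
                                          train_op=train_op,
                                          eval_metric_ops={'accuracy': accuracy})

    def load_example(filename):
        # Load and decompress one image; real label parsing is elided in this sketch.
        pixels = tf.image.decode_jpeg(tf.read_file(filename), channels=3)
        image = tf.image.convert_image_dtype(pixels, tf.float32)
        image = tf.image.resize_images(image, [20, 20])
        label = tf.one_hot(0, 2)           # placeholder standing in for the real ground truth
        return {'image': image}, label

    def train_input_fn():
        files = tf.data.Dataset.list_files('gs://my-bucket/train/*.jpg')   # hypothetical path
        dataset = files.map(load_example)              # files are loaded as training proceeds
        return dataset.shuffle(1000).batch(100).repeat()   # shuffle, batch, repeat indefinitely

    def serving_input_fn():
        # Do-nothing pass-through version; his real one also decodes JPEG and cuts tiles.
        images = tf.placeholder(tf.float32, [None, 20, 20, 3])
        return tf.estimator.export.ServingInputReceiver({'image': images}, {'image': images})

    estimator = tf.estimator.Estimator(model_fn, model_dir='gs://my-bucket/model')  # hypothetical dir
    tf.estimator.train_and_evaluate(
        estimator,
        tf.estimator.TrainSpec(train_input_fn, max_steps=10000),
        tf.estimator.EvalSpec(train_input_fn))         # a real job would use a separate eval input fn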
I apply some loading operations, which will load those files and decompress them and so on. I usually need to shuffle my data and batch it, because the training always proceeds in batches, and usually I repeat it indefinitely. And my dataset is done.

And finally, the serving input function. The Estimator also periodically saves a snapshot of your model which is ready to be deployed, again, on ML Engine. ML Engine has those two parts: one is for training, and the second one allows you, with one click, to put your model behind a REST API. But when your model is listening behind this REST API, it will be receiving data in a certain format, and usually you want to do stuff with that data before feeding it into your model. If you don't, this is the do-nothing, pass-through serving input function. But if you do, this is the serving input function which I used to first decompress the incoming images from JPEG to pixels. And then I also implemented, right in this function, the scanning operation. You remember, to transform a classifier into a detector, you need to cut your image into little 20 by 20 tiles, overlapping, at different resolutions. I was actually able to do this on the fly in the deployed model. This is the code, about 10 lines. Don't read it, but I find it interesting that I was able to do this on the fly in the deployed model. And the demo I was showing here is a JavaScript UI which is calling into ML Engine, sending a part of the image to ML Engine and then getting the results back. So this is actually live.

So, we need to read papers. For image work and detection, there are many papers. I want to talk about these two, and a little bit about the others, but mostly focus on the big ideas. Oh, yes, and I need a new dataset, unfortunately, because now I will be doing real detection. Those 20 by 20 tiles of planes and non-planes are not good any longer. Now I want to handle big images and directly output a square box around each airplane. So I had to build my own dataset. In this case, it was actually possible. I built myself a little JavaScript UI, and then I went clicking on airplanes. It was about a day of work, so not so bad.

So let's start with Inception. This is the paper that really brought to the table some of the big ideas in this space. On this side, you see what the Inception model looks like. All those little squares are convolutional layers. And what you see is that it's weird. I told you before that convolutional layers should be sequenced, just piled up one layer after the next. And here we see branches, and then things coming back. What is that? The big idea here is that you are somewhere in your convolutional neural network, and then you ask yourself the question: what is best here? Should I now add a 1 by 1 convolutional layer, or maybe a 1 by 1 followed by a 3 by 3, or maybe something else? What is best? They had the idea that you could actually do all of these things in parallel and simply concatenate the results in the vertical direction. And they call this a module. Basically, during training, the network will decide which is the best path to use for a specific image on a specific task. So this is the module-based approach. That was one of the big ideas.

The second big idea is called filter factorization. What you see here is a sequence of two 3 by 3 filters. Let's look at the bottom one.
So this piece of data is produced by some weighted sum of this 3 by 3 square here. And if you look at where the data points in this 3 by 3 square are coming from, they actually come from combinations of these white data points in this 5 by 5 piece of the previous data. So it looks like two consecutive 3 by 3 filters combine data points from a 5 by 5 zone, in the same way as a 5 by 5 filter would combine data points from a 5 by 5 zone. They're not the same combinations. But let's count the weights. A 5 by 5 filter has 5 by 5 by 1 by 1 weights (the ones being the number of channels), so that's 25 weights. And if we count the weights for two consecutive 3 by 3 filters, it's 3 by 3 plus 3 by 3, which is 18. Two 3 by 3 filters are about 30% cheaper in terms of number of weights than one big 5 by 5 filter. They're not doing the same thing, but it's worth checking whether, for our purpose, it wouldn't be enough. So that's the second big idea.

And the third big idea is those 1 by 1 convolutions, which again could sound funny to a mathematician. But once you realize that you are applying many of them, and that this 1 by 1 convolution is actually a weighted sum of all of the pieces of data in this little column, it's like saying: well, I have many different filtering results. I have filtered my image for horizontal lines and then vertical lines and so on. And maybe the feature I'm looking for is some combination of those filters. Sometimes I'm looking for only the horizontal lines and a little bit of the vertical ones. The 1 by 1 convolution can give me the right combination for that purpose. Of course, the weights defining the combination will be trained.

And one more. This one is a bit of cheating. You do your convolutional layers, and at the very end you want to classify, if you're building a classifier. Let's say you classify into five classes. Typically, you would take the last data cube here, reshape it into a vector, apply a dense layer and then a softmax activation, and you've got your five classes. But if all you're trying to do is obtain five numbers, there is an easier way. Those dense layers connect everything with everything; they tend to be heavy in terms of weights. There is an easier way. You need five numbers: take this cube of data, slice it up like a piece of bread into five slices, average those five slices, and you've got five numbers. You can apply a softmax activation on them if you want. And you get the same thing with zero weights, so it's a lot cheaper. But warning: it only works if you care about global information and only global information. If you want to classify an image as dog versus cat versus sunset, yes, this will work. If you actually care about the localization of stuff in your image, like we do, you are averaging your filtering output across the entire image, so you will lose that information. So it's not good for our purpose here, but still an interesting idea.

To put this all together into a simple architecture: the Inception architecture is a bit complicated for my taste, but this SqueezeNet paper actually put all of this into a very elegant architecture. They said: let's build everything out of these kinds of modules. We start with a 1 by 1 convolution, which we use to reduce the depth, the number of channels, of our data cube. And then we have a parallel module, which is a 1 by 1 convolution and a 3 by 3 convolution in parallel. We stack the results, and the stacking usually increases the number of channels in the data cube again.
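A minimal sketch of this squeeze-and-expand pattern, with the filter counts left as parameters (the actual paper prescribes specific ones), might be:

    import tensorflow as tf

    def fire_module(x, squeeze_filters, expand_filters):
        # Squeeze: a 1x1 convolution that reduces the number of channels of the data cube.
        s = tf.layers.conv2d(x, squeeze_filters, 1, padding='same', activation=tf.nn.relu)
        # Expand: a 1x1 and a 3x3 convolution in parallel on the squeezed cube...
        e1 = tf.layers.conv2d(s, expand_filters, 1, padding='same', activation=tf.nn.relu)
        e3 = tf.layers.conv2d(s, expand_filters, 3, padding='same', activation=tf.nn.relu)
        # ...stacked along the channel axis, which grows the cube again.
        return tf.concat([e1, e3], axis=3)

    # A full model is then a succession of such modules and max pooling operations, e.g.:
    #   y = fire_module(y, 16, 64)
    #   y = tf.layers.max_pooling2d(y, pool_size=2, strides=2)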
So they call it a squeeze and expand layer, and when you put those two together, they call it a fire module. And your full model becomes a nice succession of fire modules and max pooling operations, which decrease the size of your data cube horizontally; then again a sequence of fire modules, then you decrease the size again horizontally, and so on. It's simple, and I find it elegant. So let's build it. But I wanted to test this SqueezeNet idea against a more traditional layer, layer, layer, layer architecture. And we'll go into the YOLO paper, You Only Look Once. That's the detection paper which we'll use to actually produce our squares around airplanes. In that paper, they had this very simple architecture that they called Darknet. And I thought, well, maybe we can compare the two. So you see, on the SqueezeNet side, we have these squeeze modules here: a 1 by 1 convolution followed by a 3 by 3 and a 1 by 1 convolution in parallel, then the max pooling operation, and then it continues. So you see the data cube is squeezed, then expanded, then squeezed, then expanded, and so on. On the other side, it's just a sequence, and actually the depth of the cube is roughly constant until the end. We just have these max pooling operations, which reduce the size of the image horizontally. So we'll see which one does better.

Now, how do we actually detect airplanes? This is what the YOLO algorithm does, and it's You Only Look Once, a one-shot detector. It's not You Only Live Once. It divides the image into a grid and says: each grid cell will now produce a certain number of boxes. So each grid cell is designed to output four values: x and y, the center of the box (the yellow box here). The center can be anywhere in the grid cell, so x and y are between minus 1 and 1 relative to the center of the grid cell. It will also compute the size of this box. The size can be anything, so the box can grow to be as big as the image itself; it's not constrained to the grid cell. And also some confidence factor, which tells us whether there is a plane here or not. And you can adjust how many of those boxes each grid cell is able to generate: 1, 2, 3. We will see what's best.

This is the loss they have in their paper. Please allow me to simplify it a little bit. The first line is actually kind of OK. It's the error on the position of the box: you compare the box you have against your ground truth, and you take the square of the differences of the centers, so that's the error on the position. I like that. The second one is the error on the size of the box. Well, they had rectangles; I have only squares, so let's drop the height and keep only the width. And in the paper they argue why they put in the square root of the size and not the size itself. I didn't understand that, so I removed it. The next term, for detected airplanes, is the error on the confidence factor where you had an airplane. Here I just replaced the ideal value with 1: when you have an airplane, the ideal confidence is 1. The next line is the same thing for boxes where you do not have an airplane, so there the ideal is 0. And the last line: they were detecting many categories, cars and dogs and so on. I have only airplanes. The last line for them was the misclassification error, detecting a car as a dog.
I have only airplanes or not airplanes, so I don't care about this. Also, this is a composite, multi-part loss, and they had this idea that maybe the different parts of this loss should be weighted differently. Again, I didn't understand why, so I removed the weights. And that's about it. So the only thing that is still left, and that is a hard problem, is this operator they call "one", which is actually the assignment between the boxes you generate and the ground truth. And that's not an easy problem. Of course, per grid cell, you know that if you have a ground truth box that is centered in one of those grid cells, you will want to pair it with some box generated by that grid cell. But those grid cells are allowed to generate more than one box, and you're allowed to have more than one airplane in a grid cell, so you might have more than one ground truth box. How do you do the pairings? Well, there is a lot of choice. And when I looked in the paper, they said: well, we did something simple, and actually our algorithm is not very good for swarm-type things, when you have a flock of birds or an airport full of airplanes. It's not very good for that. Thank you, guys. Well, we'll see; we'll try to play with these parameters.

How do we build this? Very simply: you've got the end of your convolutional network, you've got a data cube. Split it horizontally into your N by N grid, and then split it vertically into four, or maybe eight or 12, depending on how many boxes you want to generate per grid cell. Then the red slices will become x, the yellow ones y, the green ones the size of the box, and the blue ones will become the confidence factor. You just average all the values in them. For all of those that you need to put between minus 1 and 1, you feed them through a hyperbolic tangent, which puts them between minus 1 and 1. And all those that you need to put between 0 and 1, you feed them through a sigmoid, and that puts them between 0 and 1. And you have generated, from an image, 1, 2, 3, 4 boxes per grid cell, which detect airplanes. That's it. We're done.

The only thing now is the work of the data scientist. Now starts a slow but steady process of grinding through all the hyperparameters of this model and trying to improve the accuracy. Initially, this is what my first training run looked like. Actually, let me show you my real numbers, because I have them here. So this is initially what I had. I wasn't sure it was actually doing something at all, but why not. Initially I tried a 4 by 4 grid, generating just one box per grid cell. Not very good. I quickly realized that shuffling the data is actually super important: big progress in IOU. That's the intersection over union, a measure of how accurate those boxes are relative to the ground truth. Then I tried with more boxes per cell, like 4 by 4 but generating four boxes per grid cell. That doesn't work so well. Again, it's a hard problem to assign them to their ground truth counterparts: you've got four boxes per cell and four ground truths, which one is which? It's hard. So this was not very good. Then I went to 8 by 8, generating only one box per cell. That's better, because there is no assignment problem there; it's just one box per cell. I bumped it up to 16 by 16 by 2, the YOLO grid. That is even better, and this time I tried to think very hard about this assignment problem and devise an algorithm that would rationally do the pairings.
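Before going on with the tuning log, here is a minimal sketch of that detection head; the grid size, box count, slice-averaging and the choice of a sigmoid for the size are my reading of the description, not the exact code:

    import tensorflow as tf

    def detection_head(features, grid_n=16, boxes_per_cell=2):
        # features: the last convolutional data cube, assumed to already have
        # grid_n x grid_n spatial positions and a channel count divisible by
        # 4 * boxes_per_cell.
        channels = features.get_shape().as_list()[3]
        slice_size = channels // (4 * boxes_per_cell)
        x = tf.reshape(features, [-1, grid_n, grid_n, boxes_per_cell, 4, slice_size])
        values = tf.reduce_mean(x, axis=5)             # average each slice: 4 numbers per box

        box_xy = tf.nn.tanh(values[..., 0:2])          # box center, between -1 and 1 in the cell
        box_size = tf.nn.sigmoid(values[..., 2:3])     # box size as a fraction of the image (a choice)
        confidence = tf.nn.sigmoid(values[..., 3:4])   # "is there a plane here?"
        return box_xy, box_size, confidence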
I think I went by the distance from the center, so the first box is more likely to go with airplanes that are closer to the center of the grid cell, and the second one with those more on the periphery. Then it turned out that data augmentation, randomly varying the hue and the orientation of the images, actually helps a lot. And finally, that loss weighting idea, the multi-part loss, figuring out that some parts should weigh more, was actually a good idea. That gave me a little bit more accuracy as well. And here you can see, in reverse order, the loss, the error function, going down as we do this. And finally, the best one was with more layers: these were with 12 layers, the last one has 17 layers. And then we got something.

So you want to see this in a demo, right? Let's go. First model: this is what we had, the best model obtained from the classifier. Now let's try our first 16 by 16 by 2 model. That's the YOLO grid: you cut the image into a 16 by 16 grid, and you generate two boxes per grid cell. Let's go. Analyze. It's much cleaner, but it's missing a lot of planes here. This is something you can solve with data augmentation; actually, I did not have any data augmentation here. By randomly varying the orientation and the hue of the tiles, you get a big boost in accuracy. Now we're talking. Now it looks like a detector. We still have some false positives, there is one here, and I saw a couple of other ones, but it is not so bad actually. And if we jump to a bigger model, from 12 to 17 layers, I think we are pretty good. Let's see if we are perfect. We'll see. Almost perfect. Well, it missed this blue airplane here, but that's because I need more data with blue airplanes. It's not so bad. And I was able to piece this together from LEGO bricks which appear in the literature, without being a super specialist in machine learning.

And my message to you here is that it's kind of hard, but it's not fluid-dynamics hard. If you are a good developer, this is a learning curve through which you can go. I went through this learning curve, and look: my airplane detection model now works. It will always have an accuracy of 99-point-something percent; you can never go to 100. That's something you have to build into your use case, your product. But all of you here are capable of building machine learning models.

Then, just to finish, I want to show you the tooling I was using. I'm using ML Engine. ML Engine has basically two great features. One is My Jobs: I see all my jobs running. And the other one is My Models: with one click, I can deploy a model to production, and it's served behind a REST API with autoscaling, and I don't have to manage anything. I really love that. I love my job of tuning those models; I do not like the job of chasing VMs and figuring out which one is still running and why, blah, blah, blah. Once you wrap your model in this Estimator API, ML Engine gives you one-line config access to many different hardware architectures, including TPUs actually. But let's start with something else. This is what I had initially, training on just one GPU, and this is the config file that you use for that: scale tier BASIC_GPU, that's one machine with a GPU. This model, the 17-layer model, the bigger one, trains in 23 hours for a cost of $28. We have faster GPUs; you can bring this down to 10 hours, just like that, for the same price.
But with just a config change, no change in your code, you can actually deploy this to a full cluster of five machines with five GPUs each. And it's a cluster, so you can go to 100 if your model scales. With this one, I'm down to 4 and a half hours, which makes me much more productive. And with the better GPUs, that's actually below two hours. I wish I had had that at the very beginning.

But then I thought, well, let's try those TPU things. First of all, a little warning: TPUs are a new architecture, and there is a little bit of a porting effort to adapt your model to them. We are working on that, but right now expect to spend some time tweaking the code. It will still work on GPUs once you have tweaked it, but there are some things that need to be done for this new architecture. It's a completely new chip. Oh, sorry, before that: new in TensorFlow are those distribution strategies. Very soon, as soon as they ship (it's in beta now), you will be able to reserve one box with multiple GPUs in it, and that's a different trade-off. You don't have the network communication between the GPUs, but on the other hand, you cannot put 100 of them in one box. So it's a different trade-off. And again, it will be available with just a config change.

So now on to TPUs. This is available. I just ported it, so I don't want to show the numbers yet, because it's not really benchmarked with a proper benchmarking setup. But those numbers are not secret. Cloud TPUs are available on ML Engine today, and I published the code for this yesterday. So all you have to do is run it, and you will see the numbers. And, of course, if I am putting them here, it's because they are faster and cheaper than the second best option you have on this slide. And very soon, you will be able to access not just one of those TPU boards but 64 boards, a full rack of them connected with a high-speed interconnect, and use all of that as one big supercomputer, straight from ML Engine, with just a config change. If you want to know more about this, there is a session exactly on TPUs right after this one, but you'll have to cross the street. It's a very good session.

So that's it. Thank you for your attention. We have seen Cloud ML Engine and Cloud TPUs as products. But my takeaway is not about the products. What I want you to remember is that if you are a good developer, you can build a machine learning model. There are best practices, there are LEGO blocks that exist; you have to follow where the state of the art is, but then you can piece those blocks together. And if you don't want to build your own models, well, we have a set of pre-trained models. And now we also have this Cloud AutoML product; actually, this morning it was announced that it does not just do Vision but also more things, so I need to update my slide. With that, you can just throw your data at it and let the system figure out the architecture of your model. That's quite advanced. I'm amazed that we can build this, and I find it really, really marvelous. And finally, if you want to learn more about how to build models, you can check out the other chapters of the TensorFlow Without a PhD series. All the code and code labs and all that is available on GitHub; you've got the URL over there. Thank you very much. [MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 10,066
Rating: 5 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: KC4201o83W0
Length: 53min 4sec (3184 seconds)
Published: Wed Jul 25 2018