TensorFlow, deep learning and modern convolutional neural nets, without a PhD, by Martin Görner

Captions
Hello and welcome to this last session at Devoxx. Well, there is still the movie after that, so we still get to share something nice. I was here last year and I spoke about TensorFlow; I want to do some more of that today. That video was recorded, there are a couple more of these videos out, and the spooky metric is that people have spent 100,000 hours watching those videos. Thanks for the attention, but no thanks for the pressure. This means that a lot and a lot of people are learning machine learning and TensorFlow, and those are not people in labs; those are people like you and me, normal developers, who need these technologies because they solve real problems that even five years ago we had no idea how to solve. So I find myself in a nice position: as a software developer I still spend a lot of time on GitHub, and I spend even more time on arXiv now, having fun with stuff like that. Although I find this highly entertaining, I do not have the time to look up everything that they say in these papers. You know, I could look up what a Kullback–Leibler divergence is, I could look up what a Borel set is; in these equations they don't even define all the operators they use, because obviously either you know, or you have no business reading the paper. So this doesn't help ship projects. And on the other end of the spectrum you have people who tell you deep learning is easy, just like that, so it looks easier, and maybe we will try to do something more along those lines. The reality is that deep learning today is maturing from the labs into the real world of software engineering. It's a new field of computer science; lots of people are learning it, not as a field of mathematics but as a field of computer engineering, and even though a lot of the research out there is really very high-end, a lot of it is trickling down and being packaged up into ready-to-use tools like, for instance, TensorFlow. And so that's what
I would like to do with you today: I would like to build a neural network together that uses some of those reusable blocks, so that you can see how these architectures work and how we computer engineers can piece this together and actually solve a problem. I like plane spotting, so I decided to build a neural network that can spot airplanes in aerial imagery. First, a trip to Kaggle. Who knows Kaggle? Yeah, shout out for Kaggle. Kaggle is an online community of data scientists where they share data sets and also their approaches to those data sets. So when I was in search of a data set, I headed directly to Kaggle, and I found this data set which has little 20 by 20 tiles with airplanes, and I can start classifying them into "plane" or "not plane". That is not quite the plane detector I want, but it's a good first step. So let's build this first neural network. This one, for those who have seen the session from last year, is extremely similar; we are in known territory, and we will do this by the book: for every decision, just look up the neural network engineering handbook and do as everyone else does. What is a neural network? In a neural network you have neurons, those white circles that I've shown here, and a neuron always does the same thing: it does a weighted sum of all of its inputs. Here the inputs are the pixels of the picture; I have spread all the pixels of my 20 by 20 picture into a long vector, so the neurons do a weighted sum of all of these pixels, they add an additional degree of freedom called the bias, and then they feed this through what is called an activation function. That's just a function: a number comes in, a number comes out. What is specific about neural networks is that the activation function is usually non-linear; that's what makes them powerful, they can solve non-linear problems. And then you can stack those layers, because in the second layer the neurons, instead of doing weighted sums of pixels,
they do weighted sums of outputs from the previous layer. So I stacked a couple of layers here, and at the end I end up with two neurons, because my goal is to classify my little 20 by 20 tiles into "this is a plane" or "this is not a plane", and I'm hoping that one of those two neurons will light up and tell me plane or not plane. Let's write the TensorFlow code for this. I will be showing you quite a few code snippets; I'm not expecting you to read through the entire code, but I usually highlight in red the stuff that I want you to see. You see there is a layers API in TensorFlow here which represents an entire layer of neurons at once. This is a high-level API, so you have to bear in mind that all those weights and biases which parameterize this layer will be created automatically in the background. It's automatic, but they still take memory, and it will still take CPU time for the system to find what those weights and biases should be; we will do that through training. So as a computer engineer, I want you to have in mind how many weights you are creating when you call these high-level functions. For a dense layer, a layer where everything is connected to everything, the number of weights is the number of inputs multiplied by the number of outputs; I put them here. So now we have those weighted sums computed, and I said we need an activation function. You can do the research, but for intermediate layers almost everyone today uses the ReLU activation function, so we will do the same; the only difference is on the last layer. TensorFlow will do a lot of automatic things for you during training, and you need to provide one thing: it's called a loss, and it's an error function. TensorFlow will be predicting, initially badly of course: you put an image in and it gives you a prediction for this tile being a plane or not a plane.
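The weighted-sum arithmetic just described can be sketched in plain Python (a toy illustration with made-up layer sizes, not the tf.layers code on the slides):

```python
import random

def dense_layer(inputs, weights, biases):
    """One dense layer: for each neuron, a weighted sum of all
    inputs plus a bias, fed through the ReLU activation."""
    outputs = []
    for neuron_weights, b in zip(weights, biases):
        s = sum(x * w for x, w in zip(inputs, neuron_weights)) + b
        outputs.append(max(0.0, s))  # ReLU: identity if positive, else 0
    return outputs

# A 20x20 image flattened into a vector of 400 pixels,
# feeding a (hypothetical) dense layer of 200 neurons:
n_inputs, n_neurons = 400, 200
random.seed(0)
pixels = [random.random() for _ in range(n_inputs)]
weights = [[random.gauss(0, 0.1) for _ in range(n_inputs)]
           for _ in range(n_neurons)]
biases = [0.0] * n_neurons

activations = dense_layer(pixels, weights, biases)

# Number of weights in a dense layer = inputs x outputs:
print(len(activations))      # 200
print(n_inputs * n_neurons)  # 80000
```

This is exactly the "everything connected to everything" cost the talk warns about: 400 inputs times 200 neurons is already 80,000 weights for a single layer.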
What you have to provide is a distance function between what was predicted and what you know to be true, because when you are training you put in known images, so you know whether each one was a plane or not a plane. Again, this is a classification problem, so I look it up in my little handbook. It says: for a classification problem, on the last layer you use an activation function called softmax, and you use a distance between what was predicted and what you wanted called cross-entropy. Fine, I'll just call this function; in TensorFlow there is a function that does both. You will have noticed that this dense layer here does not have an activation function, and I call this function which does the softmax activation and the cross-entropy distance, and of course I have to provide the output from the layer and the correct answer. So this will compute the distance between what the network predicted and what I know to be the correct answer. From there on we can train, and TensorFlow takes over: I can pick an optimizer and ask it to optimize this loss function, and all this magic will happen automatically. So what is the magic? If you want the details: the optimizer will compute the partial derivatives of your loss relative to all the weights and biases in the system. This, in technical terms, is known as a gradient, and this gradient gives you a direction in which to change your weights and biases to obtain a smaller loss, and that's exactly what you want. So at each batch of input images (usually you don't do this for one image, you do it for a little batch of images, 100 images for instance) you compute this gradient, the gradient gives you little deltas to add to your weights and biases, you move one step towards somewhere where the loss is smaller, and you continue training like that. This is the ReLU function; I wanted to show it to you. It's a really very simple function: just the identity for all positive values and zero for all negative values, and it's non-linear.
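A minimal sketch of those two pieces, the ReLU function and the gradient step, in plain Python (a toy one-weight loss for illustration, not the real network):

```python
def relu(x):
    """ReLU: identity for positive values, zero for negative values."""
    return max(0.0, x)

# Gradient descent on a toy loss(w) = (w - 3)^2.
# Its gradient, d(loss)/dw = 2*(w - 3), points towards larger loss,
# so each step adds a small delta in the opposite direction.
w = 0.0
learning_rate = 0.1
for batch in range(100):                  # one step per batch of images
    delta = -learning_rate * 2.0 * (w - 3.0)
    w += delta                            # the loss shrinks at each step

print(relu(-2.0), relu(5.0))  # 0.0 5.0
print(round(w, 3))            # converges towards 3.0
```

In the real model the loss is the cross-entropy and the gradient is taken with respect to every weight and bias at once, but each step is this same "add a small delta against the gradient" move.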
Being non-linear is the only requirement for this to be used in a neural network. On the last layer, if you want to predict continuous values, you can use those two functions as well: one is between 0 and 1, the other one is between -1 and 1. And to help you understand what softmax is, I made a little animation; sorry, whoops, here, let me replay. Softmax is applied to all your output neurons; here we have two, but I have represented 10 of them. It's actually just an exponential: you take your weighted sums and elevate them to the exponential, but then you normalize this entire vector, you divide it by its norm. That gives you numbers which are normalized, so between 0 and 1, and you can interpret them as probabilities. That's why we use softmax for classification problems: we are looking for the probability of this tile being a plane versus the probability of it being not a plane. And the nice property of softmax is that, since the exponential is a very steeply increasing function, it will pull the winner apart, but without completely destroying the information about those other neurons, who might be getting it a little bit wrong, and that's very important for training. All right, so we have all the ingredients: we have layers, we have ReLU activation on the intermediate layers, we have softmax activation on the last layer, the book tells us to use cross-entropy as our distance function because this is a classification problem, and we've been told to do this by batches on our images. We have our model and we have some tools to train it. The model is written in TensorFlow; I will be using ML Engine to train it and also to deploy it, and I will be using TensorBoard, a tool for visualizing the loss and all the rest during training. ML Engine is a service on Google's cloud that allows you to run these training jobs, and what is nice about it is that you can launch as many jobs as you want.
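Going back to softmax for a second, the exponentiate-then-normalize behavior described above can be checked numerically (the logits here are made up for illustration):

```python
import math

def softmax(logits):
    """Exponentiate each weighted sum, then normalize so the results
    sum to 1 and can be read as probabilities."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two output neurons: "plane" vs "not plane"
probs = softmax([2.0, 0.5])
print([round(p, 3) for p in probs])  # [0.818, 0.182]
```

Note how a gap of only 1.5 in the weighted sums becomes roughly 82% vs 18%: the exponential pulls the winner apart, but the loser keeps a non-zero probability, which is the property the talk says matters for training.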
According to the quota you allow yourself: let's say you allow yourself 10 GPUs and each job uses one GPU; it will either run the jobs in parallel or start queuing them, but you don't have to remember at the end of the day to shut anything down because otherwise you would be paying, or anything like that; you just send jobs and see results. Yes? Oh no, no, not yet. So I'm ready to ship, I have everything, let's see the results. Here I am in TensorBoard, and I trained this, and I see my loss curve. Well, it's going down, so the training did something; I'm quite happy about this, and now I'm ready to detect planes in aerial imagery. Well, almost: I need a little trick to transform my classifier of 20 by 20 tiles into a plane detector, and it's a super easy trick. I will be sending the detector 256 by 256 tiles, big tiles, and I simply cut them up into 20 by 20 tiles and apply the detection to each little piece. And of course I can do this at various resolutions: if I take a big 60 by 60 tile, resize it to 20 by 20 and run it through my classifier, I will know if there is a plane there. So let's see if this can see airplanes; this is San Francisco. While it's computing, let me tell you that ML Engine has a second feature: here I see my training jobs in the Google Cloud console, but you can also use it to deploy a model. Once the training is finished, this training ends up as just a file on disk where all the computed weights and biases, the optimized weights and biases, have been saved. So if I want to create a new model, well actually if I want to create a new version of a model, I have a button here, "create version", and I can browse to the file on disk, and that's it. When I hit "create" here, I will have this model deployed online behind a REST API which sits on a fully managed, serverless infrastructure with auto-scaling, so the only thing I care about is sending traffic to it. And that's what I did; let me do this again.
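The cutting-up trick can be sketched like this (a simplified sketch with non-overlapping 20 by 20 boxes only; the real demo also takes bigger boxes, such as 60 by 60, and resizes them down to 20 by 20):

```python
def crop_boxes(image_size=256, tile_size=20):
    """Cut a big square tile into tile_size x tile_size boxes,
    each given as (x, y, x + tile_size, y + tile_size).
    Each box is then classified 'plane' / 'not plane' separately."""
    boxes = []
    for y in range(0, image_size - tile_size + 1, tile_size):
        for x in range(0, image_size - tile_size + 1, tile_size):
            boxes.append((x, y, x + tile_size, y + tile_size))
    return boxes

boxes = crop_boxes()
print(len(boxes))  # 12 x 12 = 144 non-overlapping 20x20 tiles
```

Even this simplest variant already produces 144 classifier calls per 256 by 256 tile; adding overlapping boxes and multiple resolutions is what later blows the count up to thousands.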
Analyze... and is it seeing airplanes? Each shaded tile here is one request I make to my deployed service, and well, it's not perfect, but it's seeing airplanes. I mean, it's missing this one, and it went completely berserk here, but it was my very first try, so maybe that's not so bad. What can I do to improve it? Well, the first thing I need to do is tell you something about convolutional neural networks, because my little handbook tells me that for a vision problem you will not get anywhere without convolutional neural networks. What we have seen are dense neural networks, where everything is connected to everything. What is a convolutional one? The main difference is that we do our weighted sums slightly differently: we take a little patch, a little filter of weights, and slide it across the picture, and as we slide it we do the weighted sums. This filter has weights; we do the weighted sums, and if we do this with a little padding on the sides, and in both directions, we obtain as many output values as we had pixels in the image. For this we used a little cube of weights, 4 by 4 by 3. Maybe that's not enough, maybe you want to give more degrees of freedom to your system, so you do it again with a second set of weights, which produces a second plane of output values, and you can do this as many times as you want; it just depends on how many weights, how many degrees of freedom, you want to give to this system. So convolutional layers are these transformations between data cubes: you have a data cube transformed into another data cube. And here again, as computer engineers, I want you to know exactly how many weights are created for each convolutional layer. So this is it: 4 by 4 is the filter size, 3 is the number of input channels (here I have an RGB picture, a color picture, so I have three channels, three pieces of information per pixel), and the last number is how many times you do this, which also means how many planes of data you obtain
as an output. So this is how we write it in TensorFlow. Again, I have chained two convolutional layers here, and I transform my data cube into another data cube as I go. TensorFlow has this layers conv2d function: the number of filters is the last number here, the kernel size is the first two, and you don't need to give the number of input channels because that will come from the shape of your input values. And of course you need to specify an activation function, because when we do our weighted sums we also apply the activation function. If you want to squeeze the size of your data cube horizontally, one simple technique is to use a stride: instead of filtering your image pixel by pixel, you do it every two pixels, and mechanically you obtain half as many values. Another popular option is to just resample; that's not a neural network, that's just a resampling operation, and one of the most popular resampling operations is called max pooling, where you take your data points in little 2 by 2 squares and just keep the maximum. That downsamples your entire data cube by a factor of 2 in the horizontal dimensions. And one thing that I wanted to point out, that's what the little guy is pointing at, is that something we can call a 1 by 1 convolution, which would surprise a mathematician, actually makes sense. A 1 by 1 convolution is a weighted sum of all the data points in this little column, this 1 by 1 column; there are still many data points there, so you can do a weighted sum. OK, a 1 by 1 convolution here makes total sense, unless you're a mathematician; we are computer engineers. So this is the network I will be building: three convolutional layers, then at the end, since I still want to classify my little airplanes, I want to get down to just two neurons. The way I do that is that I take this very last data cube, spread all of its values into one vector, apply a dense layer, and obtain two values at the end. So: one convolution filter, a second one, a third one.
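The weight count for a convolutional layer, as a quick sanity check (plain Python, illustrative numbers):

```python
def conv_weights(filter_h, filter_w, in_channels, out_channels):
    """Weights in a convolutional layer:
    filter size x input channels x number of output planes
    (biases left out for simplicity)."""
    return filter_h * filter_w * in_channels * out_channels

# A 4x4 filter over an RGB image (3 channels), applied 2 times:
print(conv_weights(4, 4, 3, 2))  # 96 weights

# Compare with a dense layer connecting the same 20x20 RGB input
# to 2 outputs:
print(20 * 20 * 3 * 2)  # 2400 weights
```

The comparison shows why convolutions scale to images: the weight count depends on the filter, not on the image size, whereas a dense layer pays for every pixel.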
Then my reshape operation to get one big vector; actually I have two dense layers, and the last one ends with my softmax and cross-entropy operation, because this is a classifier. This was actually the model that was running when I showed you the demo; it's not so bad for a first try. We have plenty of options for making it better, but that was the topic of my session last year, where I showed you how to use all these different regularization techniques. For today, the only thing you have to know is that these things do not change the architecture of the network; they just help it converge. So we will use all of those techniques to help it converge, and actually I can show you on the graphs that it helps, it really helps. This is my loss curve; it's being optimized to be as small as possible, it's going down, that's great; but the same loss curve with all those regularization techniques is here, much, much lower. And there is a third thing that I want to do, which is called hyperparameter tuning. You realize that in this model there are plenty of parameters: the sizes of the kernels, the number of neurons in these dense layers, and all the other parameters. How do I find which are the best values? Well, I can use my engineering know-how, which means: try to guess. If you have done this a couple of times you might have good guesses, and that's actually a good thing, but I can also do hyperparameter tuning. How does this work? This is a feature implemented in ML Engine. The way you send your model for training on ML Engine is that you package your Python code as a Python package, which is just a folder, a folder with this config file, and usually this config file has just a scale tier, the standard one; I actually use the scale tier BASIC_GPU, which says which machine this is going to run on: BASIC_GPU means one machine with one GPU. But you can add these additional things, which enable hyperparameter tuning. So here I say I want
to maximize. Maximize what? My accuracy. Accuracy is a metric I have defined in my model; I didn't show you that line in TensorFlow, but that's something I do in my model. And I want to do 50 trials, so train this network 50 times, with 10 trials in parallel, and here I list my hyperparameters to optimize. How do I make those public? That's very simple: you make them command-line parameters in your package, and then you say, this parameter is an integer, please try values between 800 and 30,000 using a linear scale. There are logarithmic scales, and there are categorical values if you have discrete values, and ML Engine will run all those trials for you and tell you which combination of parameters is the best. It does this quite intelligently; the way it's implemented is called Bayesian optimization, you can look it up, but it's not as simple as just trying them out at random. Let's look, maybe I have one here in my jobs; do I have a hyperparameter tuning job somewhere here? No, not this one; oops, here, I have a hyperparameter tuning job. The output is this: it says that it completed 45 trials, and trial 33 is the best, and it gives me the hyperparameters that optimized it best. So now it must be super good, right? On the curves it is actually quite a bit better; you can't even see it on the loss, but if we go to the accuracy curves here, let me remove those, boom, you see those accuracy curves: the regularized and hyperparameter-tuned one is here on top, and that's almost 99%. So I am actually really, really happy with this. Let's run it and see how well this classifies images. Mm-hmm... oh my, what? This is horrible! Well, yeah, it's seeing more planes than before, I don't think it's missing any, but what... So you see here a great truth in statistics, which says that there are three data sets: you have your training data set; you have your test data set, on which you compute your accuracy, a piece of your data you set aside to test your algorithm; and then you have real life.
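Exposing hyperparameters as command-line parameters might look like this (the flag names, defaults and values here are made up for illustration; the tuning service launches each trial with the values it picked, passed as flags like these):

```python
import argparse

parser = argparse.ArgumentParser()
# Hyperparameters the tuner is allowed to vary; each trial is
# launched with a different combination of values.
parser.add_argument("--hidden-units", type=int, default=800)
parser.add_argument("--learning-rate", type=float, default=0.01)
parser.add_argument("--batch-size", type=int, default=100)

# One hypothetical trial's command line:
args = parser.parse_args(["--hidden-units", "30000",
                          "--learning-rate", "0.002"])
print(args.hidden_units, args.learning_rate, args.batch_size)
```

Because the parameters arrive as plain flags, the training code itself needs no knowledge of the tuning service; it just reads its arguments as usual.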
I was 99.6% accurate on my test data set, but real life is real life. And that's why your test data set is sometimes not so good: it has to represent real life, and it's always good to test in real life. Actually, I managed to get slightly better performance by augmenting my data. Here, you see, it's detecting lots of stuff that is not a plane, and it's very easy to grab additional tiles of non-planes: you just take a huge aerial picture, eyeball it to make sure there are no planes in it, and then you can cut it up into a hundred thousand little 20 by 20 tiles, and you have plenty of data. If I do that, it's much better, but still not that great; you will see there are many fewer false positives, but it's still not perfect, and at this point I'm kind of stuck with this approach. You see here, oops, yeah, here it's still seeing stuff that is not a plane. What else can I do? Well, before I get to my idea, let me talk to you a little bit about TensorFlow, because I told you about the model, but we haven't spoken about the glue code that goes around it to actually train it, so let's go quickly through that. We have in TensorFlow 1.4, which has just been published, an API called Estimator which is used to wrap your model and then train it and run it, and it is actually very nice. It had been there in previous versions of TensorFlow in contrib, but it has been revamped in 1.4 and now I'm really happy with it. The way you wrap your model in an Estimator is that you write this model function, the function which has all the layers, layers, layers; you wrap it in an Estimator, and your end goal is to call train_and_evaluate, which will run the training. What is nice about this API is that you get a ton of stuff for free: you get checkpoints, so when something crashes, you relaunch and it restarts from the latest checkpoint automatically. You
get your model saved to disk at regular intervals so that you can deploy it and then use it; it even handles cluster training automatically: if you want to send this to multiple machines, all you have to do is change the ML Engine spec and, instead of one machine, specify multiple machines, and so on. So what else do you have to do? There are these things, but I want you to look at the parameters, what you need to implement, and you see those are three data input functions. You need to write one function which will load data during training, one function which will load data during testing, and one function which will load data when you deploy the model and want to use it; it's behind the REST API, so you will be sending JSON to it, and what does it do when it receives the JSON? We'll see that. Let's see first the model function. That's just layers, layers, layers, and the API asks you to output a dictionary called predictions, which is something you define; whatever your model predicts, it's freeform. It needs the loss; it needs this training operation, which is what you obtain when you ask an optimizer to minimize your loss; and then you can return a set of metrics, which again is something you decide, and those will appear automatically in TensorBoard, where you will be able to track them. What else? Yeah, one thing that might have surprised you is that when you call this optimizer, it returns a training operation. What is that? Well, one thing you have to know is that TensorFlow has a deferred execution model. Everything you write in TensorFlow doesn't execute immediately; it builds a graph in memory, and it's only when you start feeding data and executing this graph that you obtain actual values out of it. So this train operation is one of the operations in this computation graph, and that operation, when executed, will compute the gradient, compute the little deltas to add to your weights and biases, and actually change the weights and biases of your neural network during training.
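The deferred execution model can be mimicked in a few lines of plain Python (a toy computation graph, not TensorFlow's actual implementation):

```python
class Op:
    """A node in a tiny computation graph: building it computes
    nothing; values only appear when run() is called with data."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self, feed):
        vals = [i.run(feed) if isinstance(i, Op) else feed[i]
                for i in self.inputs]
        return self.fn(*vals)

# Describe the graph (x * w) + b; nothing is executed yet.
mul = Op(lambda a, b: a * b, "x", "w")
add = Op(lambda a, b: a + b, mul, "b")

# Execution happens only now, when data is fed in:
print(add.run({"x": 3.0, "w": 2.0, "b": 1.0}))  # 7.0
```

The train operation works the same way: writing `optimizer.minimize(loss)` only adds nodes to the graph; the weight updates happen when that node is actually executed with a batch of data.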
Now, the function for getting data when you have a deployed model: here again you might be surprised, because I thought initially, OK, how complicated can this be? You get a piece of JSON, you do some processing and you return a value. But this function doesn't accept any parameters; hmm, how does that work? Again, think graph. This function is executed only once, when the model is instantiated, and when it is executed, its goal is to produce a little piece of graph that goes from the JSON to whatever your model needs as an input, and that piece of graph will then be grafted on top of your model. The way it works is that first you have to define the shape of your JSON; you do that by defining a Python dictionary with placeholders for whatever the data will be. Then you can do any transformations you want, and what you return is this shape dictionary and whatever you need as an input for your model. Here I gave you the pass-through function, where I don't do anything, but this is actually how I implemented my transformation from a plane classifier into a plane detector. You see, in this function, when I receive a 256 by 256 tile, I have one call here that decodes the JPEG, and a second call here that does crop and resize. In TensorFlow, crop_and_resize takes a picture, but it doesn't take just one crop box; you can give it a collection of crop boxes. So that's what I did: I generated all the possible crop boxes for my big tile, all the 20 by 20 tiles I can cut out from it at various resolutions, and I just give it the whole collection, and it gives me a whole collection of 20 by 20 tiles as an output. It's a bit brutal, I obtain roughly 5,000 small tiles, so it's not going to be super efficient, but that's what I did here. And finally, how do I load data? For that we have a new Dataset API, which is new, and I really
love it: what you write here is exactly what I wanted to write when loading data. Here there is some load operation, but then I need to shuffle my data to make the training efficient, so I have a shuffle, with some buffer to adjust how many things I load into memory before starting to shuffle. I do mini-batching, so here I batch it into batches of 100, and of course when I train I need my data to be repeated multiple times, so I just say repeat, and that will repeat indefinitely. Again, you don't need to read through the whole code; the rest of the code here implements one very, very standard solution to a very standard problem, which is that usually your data doesn't fit in memory. So here, with the Dataset API, in a couple of lines I have been able to write something that loads my data from files gradually, as it needs it, and I can train on a data set that is much bigger than what my memory can contain. And what I like about it is that the instructions for actually managing my training data look very natural with this Dataset API. All right, now let's actually build a detector, and here our little handbook of best practices is basically out of good things to say; now we have to go and read papers. There are many papers that describe how you can build vision networks, they are on the right there, and many other papers that describe how you do detection, the specific task in vision where you generate a bounding box around something. Since it is a lot of work to read all of them, I read them for you, and these are my two favorite ones, the ones that I find most elegant. I will be using these two: SqueezeNet for my stack of convolutional layers, that's my architecture, and YOLO for the way to turn that into a detector. So first of all, let's have a look at these other papers; there are a couple of interesting ideas there. Oh, data, yeah: my 20 by 20 tiles are not good enough now, I need real images.
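Before moving on, the shuffle, batch and repeat pipeline just described can be mimicked in plain Python (an illustration of the buffered-shuffle idea, not the tf.data API itself):

```python
import random

def shuffled_batches(records, buffer_size, batch_size, seed=0):
    """Mimics dataset.shuffle(buffer_size).batch(batch_size):
    keeps only a small buffer in memory and yields shuffled batches,
    so the full data set never has to fit in memory."""
    rng = random.Random(seed)
    buffer, batch = [], []
    for r in records:
        buffer.append(r)
        if len(buffer) == buffer_size:
            rng.shuffle(buffer)
            while buffer:
                batch.append(buffer.pop())
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    # flush whatever is left at the end of the data
    rng.shuffle(buffer)
    for r in buffer:
        batch.append(r)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(shuffled_batches(range(10), buffer_size=5, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Wrapping this in an endless loop would give the `repeat` behavior; the point is that shuffling happens over a bounded buffer, never over the whole data set.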
I couldn't find a data set, but I was kind of lucky: I made myself a little GUI and I started clicking on planes, and I clicked on 2,000 planes by hand, but it's not so bad, 2,000 is OK. What I'm lucky with here is that now, to generate my training data set, I will simply be stamping out 256 by 256 tiles at random from those big aerial pictures, and I know where the boxes are, so I will be able to recompute that, and I have a potentially very big data set by doing that. So sometimes you're lucky and you can generate your data set with relatively little initial information. So, the Inception paper. This one was published by Google; its initial goal was to recognize that this is a cat versus a dog, and so on, and the first thing you see is that it's very bizarre: all of those are convolutional layers, but I told you that you chain them up, one layer after another, and this does branches; it's kind of weird. The idea here is actually very simple. The researchers said: what is best, a 1 by 1 convolution followed by a 3 by 3 convolution, or is it better to do a max pooling operation and then a 1 by 1? Hmm, we don't know; why don't we let the network decide? So they implemented all of them in parallel, they take the little data cubes that all of those generate as outputs, they just stack them up together, concatenate them, and that's the new output. That's an interesting idea. The second interesting idea is that you see most of those convolutions are very cheap, you know, 1 by 1 convolutions or 3 by 3 convolutions. What happened to bigger filters? Why aren't we using bigger filters? Well, let's look at two 3 by 3 convolutions in sequence. Let's start from the bottom here: this one little piece of data is a weighted sum of this 3 by 3 patch, and if you look at where these data points come from, they come from a 5 by 5 patch right above it. So you see, a sequence of two
3 by 3 convolutions is some weighted sum of a 5 by 5 patch, the same as a 5 by 5 convolution. It's not mathematically exactly the same, because, don't forget, after our weighted sums we always have a non-linearity, a ReLU or something, so it's not exactly the same, but it's worth benchmarking one against the other. And it's worth doing because two 3 by 3 convolutions is just 18 weights, while a 5 by 5 convolution is 25 weights, so it's much cheaper to do two 3 by 3 convolutions, and since we have those two non-linearities, maybe it's even better. So that's an interesting idea. And then those 1 by 1 convolutions, which we have seen do make sense: their other advantage is that they're super cheap. Look, this is just 1 by 1 by 10 by 5, which is 50 weights; 50 weights for a whole convolutional layer, that's really super cheap. So we will be using those a lot as well. They also introduced something called global average pooling. Usually, when you want to classify something at the end of your network, you take your last data cube, reshape it into a vector, apply a dense layer, use softmax and the cross-entropy loss, and you have a classifier. Well, they said, that's a lot of weights; why don't we instead do something in our convolutional layers so that the last cube is exactly as deep as our number of classes, and then simply slice it up like a loaf of bread and average out all the values in each plane? That gives us five values here, we do softmax, and we have five categories, and that cost us zero weights. Suspiciously cheap, this one, if you ask me. So we'll try to use all of those ideas, but since I don't want to write all of these things, that's a bit too heavy, I like this other paper called SqueezeNet, which applies these modern ideas in a slightly simpler package; I have found it a lot more elegant. They are based on modules; they still have this idea of doing two things at once and letting the network decide which one is best.
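As a quick sanity check on the weight counts above (plain Python):

```python
# Two chained 3x3 convolutions see a 5x5 patch of the input,
# like a single 5x5 convolution would, but with fewer weights
# (per input/output channel pair, biases ignored):
two_3x3 = 3 * 3 + 3 * 3   # 18 weights
one_5x5 = 5 * 5           # 25 weights
print(two_3x3, one_5x5)   # 18 25

# And a whole 1x1 convolutional layer, 10 channels in, 5 out:
one_by_one = 1 * 1 * 10 * 5
print(one_by_one)         # 50 weights
```

So stacking small filters buys the same receptive field for fewer weights, plus an extra non-linearity in between.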
They still have this idea of doing two things at once and letting the network decide which one is best, but the module is much simpler. It's just one 1x1 convolution, which they call a squeeze operation because most of the time it is used to reduce the depth of the data cube, and then two parallel operations, either a 1x1 or a 3x3 convolution. They call these modules fire modules. What I like about this architecture is that when you build the full stack, it's very simple: you have fire modules, you have a max pooling operation to reduce the size of your data cube horizontally, then fire modules again, and so on.

So let's use this, but I also want to compare it against something else. I told you I would be using this YOLO paper. YOLO means You Only Look Once; it's a detection paper, and in there they roll out their own convolutional stack, which they call Darknet. So I will benchmark Darknet against SqueezeNet to see which is best. On the Darknet side, you see, it's a much more traditional architecture, just layers, layers, layers, and we try to bring the size of our data cube down horizontally as we condense the information, because at the end we want to end up with just a little bit of information. You see the depth of the data cube here, I have it vary between 64 and 32, so not too deep. On the other side, SqueezeNet: you recognize those modules, a 1x1 convolution followed by the 3x3 and 1x1 in parallel. You can see it on the data cube here: the 1x1 does the squeeze operation, you see a thin data cube as a result, and then this pair concatenates its outputs, so it's an expand operation, and you see a fat data cube again. And again, this time I am using max pooling layers to bring the size down horizontally as well.

To benchmark them, I wanted to give them a simple goal, which is to count planes. So at the end, using global average pooling, I appended a softmax layer.
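A fire module can be sketched in plain numpy, with untrained random filters just to show the shapes; a real SqueezeNet learns these weights, and the helper names are mine:

```python
import numpy as np

def conv(x, w):
    """'Same'-padded convolution: x is (H, W, Cin), w is (k, k, Cin, Cout)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros(x.shape[:2] + (w.shape[3],))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k, :], w, axes=3)
    return out

def fire_module(x, squeeze, expand, rng=np.random.default_rng(0)):
    """Squeeze the depth with a 1x1 convolution, then expand with a 1x1
    and a 3x3 in parallel and concatenate their outputs depth-wise."""
    relu = lambda t: np.maximum(t, 0)
    s = relu(conv(x, rng.normal(0, 0.1, (1, 1, x.shape[2], squeeze))))   # thin cube
    e1 = relu(conv(s, rng.normal(0, 0.1, (1, 1, squeeze, expand))))
    e3 = relu(conv(s, rng.normal(0, 0.1, (3, 3, squeeze, expand))))
    return np.concatenate([e1, e3], axis=-1)                             # fat cube again
```

The squeeze keeps the two expand branches cheap, since both the 1x1 and the 3x3 only ever see the reduced depth.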
It categorizes the tiles into tiles with zero, one, two or many planes. So what kind of results do I have? Initially, let's click here: oh, is this converging at all? Not quite sure. But if you look at the images, here is the actual number of planes and here is what was computed, and it is doing something, actually. It's correct here: no planes, no planes; there are three here; no planes, no planes; I have one plane, it got it right here. So it's seeing something. Maybe not super good, but it's seeing something. Still, this looked like too hard a problem for this little classification network, so I started classifying only into zero planes and one plane, and started comparing. Here Darknet and SqueezeNet are about the same.

Then I thought: hmm, I'm wondering about this global average pooling operation. Here I am trying to spot planes, to count them; I need to know where they are. A convolutional network is actually very good at spotting where they are, scanning with filters; the filter says: hello, here, I've seen something interesting. And when I do global average pooling, I average this information across the whole image. That sounded like a bizarre thing to do if you are interested in local information. So I tried to revert to my previous reshape plus dense layer operation, and lo and behold: a much better loss. So the learning here was: if you are interested in local information, don't do global average pooling. And the other learning is that my SqueezeNet and Darknet perform roughly the same, but one is a lot more expensive than the other. Here I put the number of weights: 320,000 for one, 60,000 for the other. So I will be using SqueezeNet, and I'm ready to build my detector.

How does this work in the YOLO paper? It's actually very simple and elegant. You take your image and you divide it into a grid of cells.
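Before the detector: the reshape-plus-dense head that beat global average pooling above can be sketched the same way, again with untrained random weights and hypothetical names:

```python
import numpy as np

def dense_head(cube, n_classes, rng=np.random.default_rng(0)):
    """Reshape the last data cube into a vector, then one dense layer and
    softmax. Unlike global average pooling this costs weights, but it keeps
    the local information about WHERE in the cube a filter fired."""
    flat = cube.reshape(-1)                             # flatten the cube
    w = rng.normal(0.0, 0.1, (flat.size, n_classes))    # the weights GAP avoided
    logits = flat @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = dense_head(np.random.randn(4, 4, 8), 2)   # two classes: 0 or 1 plane
```

The cost is `H * W * depth * n_classes` weights, which is exactly the bill that global average pooling was dodging, and exactly why it pays off here: those weights can attend to individual positions in the cube.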
Each grid cell is allowed to generate a certain number of bounding boxes, let's say two bounding boxes. For each grid cell, you manipulate your network so that it produces four additional numbers: x and y, the position of the bounding box; w, its size; and c, some confidence, which is 1 if there is a plane and tends to 0 when there is no plane. The trick is that even though those bounding boxes are tied to a grid cell, only their center must stay in the grid cell; the width can be as big as the full image, so they can grow across the grid cell. You see here I have a big plane, so I need one of those bounding boxes to grow.

What loss will I be using? Well, they use this loss. I hacked it up a little because it was too complicated, let's remove all that; they had some tricks which sounded weird, and I tested, and for my use case I didn't need the tricks. So you end up with a first line up there, which is the distance between where the bounding boxes were detected and where the real bounding boxes are; a second line, which is the error on the size of the boxes; and a third line, which is the error on the confidence factor. The only slightly bizarre things are these ones: those are the assignments. When you are computing this distance between what was predicted and what you know to be true, you need to assign a generated bounding box to one of those ground truth bounding boxes. Yes, you have the grid and you assign by grid cell, but if each grid cell is allowed to generate multiple bounding boxes and there is only one plane, you have to make some decisions. I didn't know initially, so I flipped a coin and implemented what was easiest, but that gives us something already.

So how do I do my last layer? I take my last cube, I cut it up along my grid, and each column I divide in four. Why four? Because a bounding box is x, y, w and c, that's four numbers. If I were generating multiple bounding boxes per grid cell, I would divide it in 8 or 12 and so on. From there I average out those values and apply activation functions.
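The three lines of that hacked-down loss might look like this in numpy; this is my reading of the simplified version described in the talk, with one box per grid cell and my own names, not the exact code:

```python
import numpy as np

def detection_loss(pred, truth, assigned):
    """Simplified YOLO-style loss: position error + size error + confidence
    error. pred and truth have shape (n_cells, 4), rows of [x, y, w, c];
    assigned is 1.0 where a ground-truth box is assigned to that cell, else 0."""
    position = np.sum(assigned * ((pred[:, 0] - truth[:, 0]) ** 2
                                  + (pred[:, 1] - truth[:, 1]) ** 2))
    size = np.sum(assigned * (pred[:, 2] - truth[:, 2]) ** 2)
    confidence = np.sum((pred[:, 3] - truth[:, 3]) ** 2)
    return position + size + confidence
```

The `assigned` mask is where the coin-flipping lives: position and size errors only count for the box chosen to match a ground-truth plane, while the confidence term is trained everywhere, including toward 0 in empty cells.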
The activations are a hyperbolic tangent for x and y, because they are between -1 and 1, relative to the grid cell's center, while w and c are between 0 and 1.

Let's see the results. I first tried a 4x4x1 YOLO grid, and again: is this converging at all? I'm showing you one of the components of the loss, the one where the variations are the most visible. But then I went to the images, and look: the grey boxes are the ground truth, the yellow boxes are what the network sees, and it looks like it's starting to see something, so I was on a good track. Then I tried to have more of these bounding boxes generated: a grid of 4x4 but allowing four boxes per grid cell, and it's much better. But then I thought: now I have this box assignment problem. I'm generating four bounding boxes per grid cell, and if there is only one plane, it's a bit weird, because if I assign just one of them, then one of those boxes is trained to see the plane and the three others are trained not to see it, but it's the same pixels, so how does the network make sense of that? I didn't know. So I tried an 8x8 grid with just one bounding box per grid cell to avoid this problem, and yep, it works better, the losses are again much lower. So I went to 16x16x1, and that is even better.

If you want to see how this performs, here on my little demo I have an 8x8x1. Let's run this and... well, this is too good. This was supposed to be a little bit less good than that, but without cherry-picking, I just found an example where it was too good. So let's try 16x16x1. Well, you see, it was a bit of luck, because here I have a false detection and a couple of planes that were missed. So the last thing I did was work a little bit on those grid assignments, and I did more grid arithmetic to make sure that when I had two boxes allowed and one plane, they would both be trained to see the plane, and it worked.
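A sketch of those last-layer activations, assuming the cube has already been reduced to four numbers per grid cell, and assuming that "between 0 and 1" means a sigmoid, which the talk does not spell out:

```python
import numpy as np

def yolo_head(raw):
    """Turn the raw grid cube (G, G, 4) into bounding-box numbers.
    x, y go through tanh: in [-1, 1], relative to the grid cell's center.
    w, c go through a sigmoid (my assumption): in [0, 1]."""
    x = np.tanh(raw[..., 0])
    y = np.tanh(raw[..., 1])
    w = 1.0 / (1.0 + np.exp(-raw[..., 2]))
    c = 1.0 / (1.0 + np.exp(-raw[..., 3]))
    return np.stack([x, y, w, c], axis=-1)
```

For multiple boxes per cell, the last axis would hold 8 or 12 raw values instead of 4, with the same activations applied to each group of four.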
There are a couple of edge cases, but whatever, I haven't actually finished optimizing that. With that, it works. And yes, I first looked at the paper, hoping the paper would solve this, and in the limitations section they said: our model is not very good at swarm-type things. Well, thank you, I am using your model specifically on a swarm-type problem. But it was just this box assignment issue, and with this last fix it actually works quite well, and what I'm quite happy about is that it sees the little plane here as well.

So that's what I wanted to share with you today. Since it's the last session of the day, we might extend a little bit to talk about something interesting called GANs, but first, the conclusion. It is my conviction that as a software developer it is possible to jump into machine learning. It's, let's say, as hard or as easy as learning a new language: you learn new concepts, new things that you can piece together. And you see on this example, once you know which Lego bricks work, and usually it's quite easy, if it has a function in TensorFlow it's a Lego brick that usually works, and once you learn how to assemble them, I think we can all get there. So thank you, and for those who care to stay for five additional minutes, I have something funny to show you. Those who want to go, please go.

Piecing those Lego pieces together in a specific way, someone built this. You recognize here a convolutional neural network that classifies images. What does it classify? It classifies real images versus fake images produced by this funny thing here. So that's a classifier, we know how that works: we end up on a softmax layer with two neurons, and it says fake or real. This generator: you will look up for yourself how you can do convolutional layers which upscale images. They are not exactly the same convolutional layers we have seen here, but it's possible, and they also have weights and biases and so on, and you can train them.
You start with a little vector here which is random, just noise, you run it through this upscaling convolutional network, and you generate an image. And then what you are trying to do is fool the discriminator: generate images that it will think are real. To train the discriminator, it's normal: you put in a real image or a fake image, and depending on what it says, you backpropagate, you train it. Now, once you generate an image with the generator and run it through the discriminator, the training is a little bit different for the generator. Whatever the output, you say: the result I want is "real". I want to fool you, I want "real". And you backpropagate that. Well, actually, you just compute the gradients; you don't touch the weights of the discriminator, you apply the gradients only here, in the generator. What is going to happen is that the discriminator, here I have anime characters, will learn what are good features of anime characters: anime characters have colorful hair, large eyes. And in return the generator, since it is optimized through the discriminator, will learn that to fool it, it needs to produce large eyes, colorful hair, and so on. In the end you get generated anime characters. That's kind of fun, but that's not why I'm sharing this with you.

Then people did this. They said: well, it's learning the concept of large eyes, maybe; let's try to test this. So they took three generated images of men with glasses; each of those images is generated by one random vector somewhere in this latent space, one vector gives one of those images. They averaged them. Then they took three men without glasses and averaged them, subtracted those two, and said: now this difference is the concept of glasses. How can we test this? Well, they took three women, averaged them, and added the glasses vector to that average.
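Assuming nothing from the slides, here is a deliberately tiny one-dimensional version of that training loop in numpy, with the gradients written out by hand instead of TensorFlow; the "images" are just numbers drawn around 4, and every name is made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real data lives around 4; the generator starts by producing noise around 0.
a, b = 1.0, 0.0     # generator g(z) = a*z + b
u, v = 0.1, 0.0     # discriminator d(x) = sigmoid(u*x + v)
lr = 0.01

for step in range(2000):
    real = rng.normal(4.0, 1.0)
    z = rng.normal()
    fake = a * z + b

    # Train the discriminator with the true labels: real -> 1, fake -> 0.
    for x, label in ((real, 1.0), (fake, 0.0)):
        err = sigmoid(u * x + v) - label   # gradient of cross-entropy wrt the logit
        u -= lr * err * x
        v -= lr * err

    # Train the generator: claim the fake is real (label 1), push the
    # gradient THROUGH the discriminator, but update only a and b.
    err = sigmoid(u * fake + v) - 1.0
    dfake = err * u                        # gradient reaching the fake sample
    a -= lr * dfake * z
    b -= lr * dfake
```

In TensorFlow you get the same separation by handing the generator's optimizer only the generator's variables (the `var_list` argument of `minimize`), so gradients flow through the discriminator's weights without ever updating them during the generator step.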
And it started generating a woman with glasses. So this is completely unsupervised: we just threw a bunch of faces at this network, totally unsupervised, and the network has been able to learn the concept of glasses totally by itself. It's a first step in unsupervised learning, and I find this really exciting. Another exciting thing you can do: you can do this with a smile, you can actually do this with any facial expression. So I think in a not so distant future we will have fake news videos that look completely realistic, where, I don't know, Donald Trump will read a love letter to Hillary Clinton. And if you think that you will not be fooled by these little low-resolution images with lots of artifacts: this just came out of Nvidia last week, and these are generated images. All of it is generated. You see interpolations, because in this latent space, once you have one face and another face, they are just vectors, so you can interpolate between the two, but they are all generated. This is not morphing between real faces, it's all generated, and this time it's in high resolution, and I find it beautiful. Thank you. [Applause] [Music]
Info
Channel: Devoxx
Views: 12,073
Rating: 5 out of 5
Keywords: DV17, Devoxx
Id: vaL1I2BD_xY
Length: 56min 39sec (3399 seconds)
Published: Fri Nov 10 2017