TensorFlow and deep learning - without a PhD, by Martin Görner

Reddit Comments

I have been playing around with TensorFlow and I absolutely love it. Thanks for posting!

👍 6 · u/Nightsd01 · Feb 19 2017

Great Video!

👍 2 · u/heap42 · Feb 19 2017

I thought I was way too dumb for this stuff but I actually understood some of it.

👍 2 · u/blacwidonsfw · Feb 20 2017

Nice presentation.

👍 1 · u/LgDog · Feb 20 2017
Captions
Hello everyone, and thank you for coming to this university session on TensorFlow and deep learning. You are brave to spend three hours here with me talking about neural networks, TensorFlow and all the nice things we can do with them.

We won't cover everything there is to cover in neural networks and deep learning, but I would like to walk you through two canonical examples: in this first session, handwritten digit recognition, and in the second part, recurrent neural networks. My hope is that after this dive into, let's say, three different types of neural networks, you will have enough information to go on on your own, read the neural network books and the scientific papers, parse through the not-always-useful mathematics and formalism, and understand the architectures being explained.

So let's dive right in. This is a very classical dataset; it has been around for about 20 years. If you go to the MNIST website you will see 20 years of scientific papers published on it, and we will summarize all the progress made in this field in about 45 minutes.

Let's start by building the simplest possible neural network for recognizing those handwritten digits. (Quick show of hands: who was in the TensorFlow code lab yesterday? A few people. I hope you don't get bored in the first 45 minutes, because you have seen this already, but we'll kick it into overdrive right after that. And who has some experience with neural networks and deep learning? A couple of people; please feel free to intervene and add information. If you don't have any experience, I will do my best to start the explanations from the very bottom and build all the way to the top.)

So let's build it. This is the simplest neural network you can imagine for recognizing handwritten digits: just 10 neurons. Our images come in as 28 by 28 pixels. The first thing we do is flatten all those pixels into one long vector of 784 values, and that is what we use as our input. Our ten neurons, the ten circles here, all do the same thing: a neuron computes a weighted sum of all of its inputs, adds a constant called the bias, and feeds this sum through an activation function. Several activation functions will be mentioned today; the one thing they share is that they are non-linear. That's all there is to it: a weighted sum of inputs, plus a constant, fed through a non-linear activation function.

We have ten neurons because we are classifying digits from 0 to 9, and our hope is that if we choose the weights and the biases correctly, one of those ten neurons will have a very strong output, telling us that what it saw was, say, an 8.

So which activation function do we use? We'll see several, but for classification problems the wise minds that came before us tell us that softmax is a very good choice.
Softmax is also extremely easy to compute: it is simply the exponential of the weighted sums, after which you normalize the resulting vector of 10 elements. So the full process is: each neuron does a weighted sum of its inputs plus a bias, you take the exponential, and once you have done this ten times you normalize the vector of 10 elements. Normalized using which norm? Whichever you want, it really doesn't matter: L1, L2, the Euclidean norm. The usual norm used in softmax is L1, the sum of absolute values, but it doesn't matter.

Now we need to formalize this as a formula using a matrix multiply, and actually I'll do something slightly more: let me formalize it not just for one image but immediately for a batch of 100 images. At the bottom I have a matrix with my 100 images, one image per line, all the pixels flattened into that line. On top I have my weights matrix. Using the first column of weights, I do a weighted sum of all the pixels of my first image: that is the first weighted sum. Using the second column of weights, I do a second weighted sum of the same pixels, then the third column, and so on: those are my ten weighted sums for my ten neurons. The last thing is to add the biases. These are just constants, additional degrees of freedom of the system, one bias per weighted sum. And if I continue this matrix multiply, the exact same operation is repeated for the second image, then the third image, and so on until the last one.

Now I would like to write this as the simple formula you see on top, but I still have a problem with the plus. X times W, the operation I just did, works fine. But the plus doesn't quite work, does it? Look at the size of the resulting matrix X times W: it is 100 lines by 10 columns. Can I add a vector of ten elements to this? Obviously not, the sizes don't match. Never mind, let's redefine what plus means. That is actually a very standard solution: it's the way plus is defined in Python and numpy, the scientific computation library. This is called a broadcasting plus, and the rule is: if you are trying to add two things and the sizes don't match, don't give up; replicate the small thing as many times as needed until the sizes match. It just so happens that this is exactly what we want here: we have only ten neurons, one bias per neuron, so only ten biases, and we want to add this vector of ten biases on each line. So broadcasting plus is exactly what we want, we can write our formula like this, and we obtain the basic formula for one layer of a neural network.

Let's recap: in X we have a batch of images, 100 of them, one per line, all the pixels flattened into a vector of 784 values. In W we have our weights, and X times W gives the weighted sums for each neuron and each image. We add the biases, then feed this through the softmax activation function, line by line: take the sums of one line, take their exponentials, normalize that line of 10 elements, then the next line, and so on. In the end we obtain, for each image, ten values: the outputs of our ten neurons.
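For reference, here is that formula written out, with the shapes used in the batch of 100 MNIST images just described:

Y = softmax(X W + b), where softmax(L)_i = e^{L_i} / \sum_j e^{L_j}

X: [100, 784] (one flattened image per row), W: [784, 10] (one column of weights per neuron), b: [10] (one bias per neuron, broadcast across the 100 rows), Y: [100, 10] (ten outputs per image).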
Those ten values are normalized, between 0 and 1, so we can interpret them as probabilities, and they will be the predictions of our system. We are hoping that one of them will be big, telling us that the digit seen in the input was a 3, or a 5, or a 7. This will only work if the weights and biases are good. But what is "good"? We have to define it, and once we have, we will be ready to train this system, that is, to determine its weights and biases. By the way, this is how you would write the model in Python using TensorFlow; I hope it is not too surprising. TensorFlow has this quite useful tf.nn neural network library which contains lots of functions commonly used in neural networks.

So how do we assess the quality of our predictions? We will be training the system, so we have a batch of handwritten digits for which we know the correct labels, and we have a system which, given an image, produces predictions. What we actually want is to compute the distance between what the system predicts and what we know to be true. Our predictions look like ten numbers between zero and one, hopefully one of them quite strong. So all we have to do is encode our known labels in a similar format: all zeros, with just one 1, here in the sixth position because this digit was a six. In this format it is trivial to compute a distance between the two vectors.

Which distance? Again, any distance you like works: L2, the Euclidean distance, works, L1 works. But for classification problems there is one distance that works just a tiny bit better, for good reasons that you can look up on the Internet, and that distance is the cross-entropy. How is it computed? Very simply: the values on top multiplied by the logarithms of the values on the bottom, element by element, then summed across the whole vector, with a minus sign. Why the minus sign? Because the values are between 0 and 1, so their logarithms are always negative. And now we have a distance between what our network predicts and what we know to be true.

Now that we have this distance, we can try to minimize it, and that will be our goal during training: find the weights and biases in the system that minimize the distance between what the system predicts and what we know to be true.
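The cross-entropy just described, as a formula (Y_ denotes the known labels encoded as zeros and a single one, Y the predictions):

H(Y_, Y) = - \sum_i Y\__i \log(Y_i)

The minus sign is there because each Y_i is between 0 and 1, so its logarithm is negative, and we want a positive distance.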
So let's go and train the system, demo time. (The demo gods are not with me today; let me solve this issue really quick. All right, here we go.) Here you have the 100 training digits being fed into the system, 100 images at a time. I put on a white background those that have already been correctly recognized by the system, and on a red background, at the top, those that are still missed. What you see at the bottom are test digits: to assess the quality of the final result on real-world data, you have to use data that the system has never seen during training; it would be cheating otherwise. Here I have 10,000 test digits; you only see 1,000 of them, so you have to imagine nine additional screens of digits below. Again, I sorted to the top, on a red background, all the digits that are incorrectly recognized by the system, and on a white background at the bottom all the correctly recognized ones, with a little scale on the side which tells you the percentage of correctly recognized images out of those 10,000 test digits.

The first thing you see is that even with this very simplistic model, just 10 neurons, we are already correctly categorizing 92% of our images. That's not so bad. At the top we have our loss function, computed both on the test digits and on the training digits, and you see it going down: something is working. The accuracy is simply the percentage of correctly recognized digits, and it goes up to 0.92. And here you see the weights and biases. In these diagrams the bands are percentiles; don't worry about the details, just remember that the full range of all the bands is where 100% of the values are. What you see is that the weights and biases started at zero and changed during training; the weights ended up roughly between -1 and 1, and the biases spread out as well. These two diagrams are mainly useful to see that things are moving, and it's good to keep an eye on them: if you ever see the weights or biases shooting off into the hundreds or thousands, it means your system is not converging.

So that's what the training process is all about: you throw known digits with known labels at the system, you compute the difference between what the system predicts and what you know to be true, and you adjust the weights and biases in such a way as to make this distance smaller and smaller. Let's see how to do that in practice, in TensorFlow.

In TensorFlow you first define placeholders and variables. Variables are all the degrees of freedom of your system, everything you want TensorFlow to determine for you: in our case, the weights and the biases. Then, when you train the system, you will be feeding data into it, so you also need something to put that data in: for that, in TensorFlow, you define a placeholder. Here I have a placeholder for my images. Let's look at its shape. We are using tensors, which are basically multi-dimensional matrices, and they have shapes which indicate how many values you can store along each dimension. Starting from the left: None will be the number of images in the batch, which is not known at this point, so it is just None, but when you provide a batch there will be a certain number of images in it. Then our images are 28 by 28 pixels, and the last dimension, 1, is the number of values per pixel. We use grayscale images here, one value per pixel, so the 1 is not really useful, but I put it in just in case you want to use color images: that is where the 3 would go, for RGB.
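A minimal sketch of these definitions in the TensorFlow 1.x API used throughout the talk (variable names are illustrative):

import tensorflow as tf

# placeholder for a batch of grayscale images; None = batch size not known yet
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
# placeholder for the known labels, encoded as all zeros with a single one
Y_ = tf.placeholder(tf.float32, [None, 10])

# variables: the degrees of freedom TensorFlow will determine through training
W = tf.Variable(tf.zeros([784, 10]))   # one column of 784 weights per neuron
b = tf.Variable(tf.zeros([10]))        # one bias per neuron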
So, placeholders, variables, and then we're ready to go. The first line on top is our model, the one-line model we determined previously. The only difference is the reshape statement in the middle: remember, our images come in as 28 by 28 pixels, but we want them as one long vector, all the pixels flattened into a line; that is what reshape does. The minus one in the shape simply means "there is only one possible value here, figure it out": it will end up being the number of images in the batch. Apart from that, it is what we have seen: a matrix multiply of X by the weights, plus the biases, fed through the softmax activation function.

Then I define a second placeholder for my known labels. Now I have the known labels and I have my predictions, so I am ready to compute my loss function, the cross-entropy. That's the middle line: an element-wise multiply of the known labels by the logarithms of my predictions, and then reduce_sum sums that across the whole vector. So I have my loss. The last two lines at the bottom compute the percentage of correctly recognized images; you can parse that on your own, it's just for display.

And now we get to the heart of what TensorFlow provides as a tool. We pick an optimizer (there is a full library of them, and GradientDescentOptimizer is the simplest one in the library) and we tell this optimizer: please minimize my cross-entropy.
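A sketch of the model, loss and optimizer lines being discussed, again in TF 1.x style; the learning-rate value below is illustrative:

# the one-line model: flatten, weighted sums, biases, softmax
XX = tf.reshape(X, [-1, 784])                 # -1: "figure this dimension out" (the batch size)
Y = tf.nn.softmax(tf.matmul(XX, W) + b)

# loss: cross-entropy between known labels Y_ and predictions Y
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))

# percentage of correctly recognized images, for display only
is_correct = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

# pick an optimizer and ask it to minimize the loss
optimizer = tf.train.GradientDescentOptimizer(0.003)   # learning rate: illustrative value
train_step = optimizer.minimize(cross_entropy)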
So what is going to happen here? Look at this cross-entropy: what does it depend on? Apart from depending, obviously, on the training labels and training images, it also depends on the weights and the biases. To minimize it, the first step is to compute the gradient of this function: the vector of all the partial derivatives of the loss with respect to all the weights and all the biases in the system. A little technicality: how many weights and biases do we have? Just the weights are 784 by 10, so we have roughly 8,000 weights and biases, and our gradient vector will have roughly 8,000 partial-derivative components. So thank you, TensorFlow, for computing this for us; by hand it's a bit cumbersome. And when TensorFlow does this, it is a formal differentiation, not a numerical one: TensorFlow knows exactly which computation steps you have defined and differentiates them formally to compute the gradients. We can skip over this because we have a tool that does it; usually, in a neural network class, you would spend the next hour or two computing that gradient.

(A question from the audience: are the weights the same in some of those columns? No, all the weights are different. Each of our ten neurons does a weighted sum of all of its inputs using its own weights; each column of weights corresponds to one neuron. There is one weight per input value, basically one weight per pixel of the input image, and they are all potentially different: all the W's in this 784 by 10 matrix, roughly 8,000 weights, are different. It's a very good question, because we will get to convolutional networks in a moment, and there the weights are shared in some ways.)

So, who knows why a gradient is useful? In which direction does it point? Almost: it points up, but we add a minus sign, and then yes, it points down. Which is fantastic, because that's where we want to go: where the loss is minimal. We now have an arrow telling us where to go; we just have to follow it.

Let's recap: in which space are we? We are in the space of weights and biases, and we have this arrow telling us: if you go in this direction (and going in this direction in this space means modifying the weights and biases by some fraction of this vector), your loss will be smaller. That's exactly what we wanted. So this will be our training loop: compute the gradient on the current batch of images and labels, add a small fraction of this gradient to our weights and biases to make them evolve, and start over with a new batch of images and labels.

What is this learning rate? That's the small fraction of the gradient that you add. Why not add the full gradient? Simply because that would be going too fast. To take a metaphor: imagine you are in the mountains, on a peak, and you want to reach the bottom of the valley. We are quite good at computing gradients, we have sensors for that, we know which way is down. But if you have seven-league boots and take jumps of seven leagues, you will just jump from one side of the valley to the other and never reach the bottom, even though you know where the bottom is. You have to take small steps to actually reach it. That's the learning rate: once you compute the gradient, you multiply it by this small factor, and those are the deltas you add to your weights and biases.

All right, let's write this training loop. But first I have one more thing to explain: TensorFlow has a deferred execution model. Everything we have done so far using tf.something statements does not produce results when executed in Python; it produces a computation graph in memory, it just defines that graph. Why is that useful? First, I told you that TensorFlow does a formal differentiation to compute the gradient, so it needs to know the complete compute graph of your loss to be able to differentiate it. Second, TensorFlow was built primarily for distributed computing, and having the graph in memory helps a lot when distributing the computations to multiple servers. But for us it means we have to do something more to actually get those graph nodes executed and get values out of them: we define a TensorFlow session, and then any time we want to compute something we call session.run on a node of our computation graph, plus a feed dictionary in which we feed all the missing data for which, up to now, we have only defined placeholders.

Look at the syntax for filling the placeholders: the feed dictionary, here called train_data, maps X to a batch of images and Y_ to a batch of labels. I use a little utility that gives me 100 images and labels at a time, and the keys X and Y_ in the feed dictionary are exactly the placeholders I defined before; that is the link to the placeholders. I feed them and execute my train step. What is the train step? That's what we had on the previous slide: train_step is what you obtain when you ask the optimizer to minimize your loss function. It is the operation that computes the gradient, derives the deltas to apply to your weights and biases, and updates them; the train step is what makes your weights and biases move in the right direction.
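A sketch of the session and training loop just described, in the same TF 1.x style; load_batch is a hypothetical name standing in for whatever utility delivers 100 images (shaped [100, 28, 28, 1]) and their labels at a time:

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

for i in range(10000):
    # load a batch of 100 training images and their known labels
    batch_X, batch_Y = load_batch(100)        # assumed helper, not a TensorFlow function
    train_data = {X: batch_X, Y_: batch_Y}

    # the train step computes the gradient and updates the weights and biases
    sess.run(train_step, feed_dict=train_data)

    # display only: evaluate accuracy and loss from time to time (the test set is expensive)
    if i % 100 == 0:
        a, c = sess.run([accuracy, cross_entropy], feed_dict=train_data)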
And so this is the full training loop: load 100 images and labels, execute the train step, which computes the gradient and updates your weights and biases, and start over again. That's it. All the rest of what you see on the screen is just for display, it has nothing to do with the training: we compute the accuracy (the percentage of correctly recognized images) and the cross-entropy, both on our 100 training images and on our 10,000 test images, so that we can draw our red and blue curves. Obviously, don't do the test computation on every iteration; put a test in there so it runs only every 100 or every 1,000 iterations, because processing those 10,000 images is quite expensive.

OK, so that's all there is to it; here is the full code for training this initial neural network. Let's recap. Placeholders and variables: variables are all the degrees of freedom of the system, everything you want TensorFlow to determine through training, here the weights and biases; placeholders are for the training data, our images and our known labels. Then we compute the model, that's the one-liner which produces the predictions.

(Question: how are the weights initialized? Initially you randomize them, yes, but only at the very beginning, at iteration zero. Another question, about the gradient: your loss function is a function that depends on the training data, the 100 images and labels of the current batch, and on the weights and the biases. It is a complicated function, but it has those four kinds of parameters. Since it depends on the weights and biases, you can differentiate it with respect to the weights and biases, and you obtain a gradient. This gradient, for the current batch of images and labels, is an arrow in the space of weights and biases pointing in a direction where the loss is known to be smaller; you take a little step in that direction, and that's it. The gradient is the vector in which each element is the partial derivative of your loss with respect to one weight, the next element with respect to the next weight, and so on; we have roughly 8,000 of those. Thank you for the question: if it wasn't clear for you, I'm pretty sure it wasn't clear for everyone.)

So, we were at the model, and the model gives us predictions. Then we compute the distance between those predictions and the known labels: that is the cross-entropy. The two lines at the bottom are just the computation of the percentage of correctly recognized images. At the top we pick an optimizer and ask it to minimize our loss function, which gives us a training step. We can now launch a training loop in which we load 100 images and labels and execute the training step, which makes our weights and biases evolve. And that's how we train the system.

All right, a little recap of what we have in our neural-network toolbox so far: softmax, a good activation function for the last layer (in our case the only layer) in a classification problem; cross-entropy, a good loss function for classification problems; and this mini-batching technique of training on 100 images at a time. Actually, nobody asked why we are doing that; we could be training one image at a time, and that works as well. There are two reasons why we do it 100 images at a time.
The first, very down-to-earth reason is that we want to run this on GPUs. I actually have a GPU here running this; it's not really needed for this little example, but it will be nice for the later ones. With mini-batching you work with bigger matrices, and bigger matrices are easier to optimize on GPUs. The second reason is that if you compute the gradient on just one image, there can be quite a bit of variation from one image to the next in where that gradient points; you will still reach a minimum, but you might get there by a winding path. If you compute it on 100 images at a time, you get a consensus over a hundred images about the best overall direction, so it's a little more stable.

OK, so using this ten-neuron, one-layer network we get 92% accuracy. Is that good? Who thinks this is good? Of course not: imagine you are the post office and you want to use this to recognize post codes; missing eight digits out of every batch of 100 is terrible. So we have to do better, and of course deep learning is fashionable, so let's go deep and add layers.

It is actually fairly easy to add layers to a neural network. Look at the first layer of neurons: each neuron does a weighted sum of all the pixels of the image. Just as easily, you can add a second layer where each neuron does a weighted sum of all the outputs of the previous layer. That's it; that's how you stack them high. Just one thing: we keep softmax as the activation function on the last layer, because it produces nicely normalized values between zero and one which we can interpret as probabilities, and we like that in a classification problem. But for the intermediate layers you want a different activation function, and traditionally the most popular one has been the sigmoid, which is simply a continuous function going from zero to one.

So what do we have to change? First of all, you need one weights matrix and one bias vector per layer; we had just one before, now we have five, or however many layers you use. It is also good practice, as someone pointed out, to initialize those weights to small random values at the beginning; that's what I do here, and truncated_normal is just a complicated way of saying random. Now that I have my weights and biases for all the layers, this is what my model becomes. Hopefully you recognize the first line: it is the formula for one layer of a neural network that we have seen before, the only difference being that this time it uses sigmoid as the activation function. The second line, instead of using X, the images, as its input, uses Y1, the output of the first line; Y2, the output of the second line, goes into the third line, and so on, until the last line, where the only difference is that the activation function is softmax, because we want predictions. Those are all the changes I need to make: replace the one-line model with this, and define the additional weights and biases that the additional layers require.
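A sketch of the per-layer weights, biases and stacked model just described; the sizes of the intermediate layers are illustrative, since the transcript does not give them:

L1, L2, L3, L4 = 200, 100, 60, 30   # intermediate layer sizes: assumed values

# one weights matrix and one bias vector per layer, weights initialised to small random values
W1 = tf.Variable(tf.truncated_normal([784, L1], stddev=0.1)); B1 = tf.Variable(tf.zeros([L1]))
W2 = tf.Variable(tf.truncated_normal([L1, L2], stddev=0.1));  B2 = tf.Variable(tf.zeros([L2]))
W3 = tf.Variable(tf.truncated_normal([L2, L3], stddev=0.1));  B3 = tf.Variable(tf.zeros([L3]))
W4 = tf.Variable(tf.truncated_normal([L3, L4], stddev=0.1));  B4 = tf.Variable(tf.zeros([L4]))
W5 = tf.Variable(tf.truncated_normal([L4, 10], stddev=0.1));  B5 = tf.Variable(tf.zeros([10]))

XX = tf.reshape(X, [-1, 784])
Y1 = tf.nn.sigmoid(tf.matmul(XX, W1) + B1)   # sigmoid on the intermediate layers (for now)
Y2 = tf.nn.sigmoid(tf.matmul(Y1, W2) + B2)
Y3 = tf.nn.sigmoid(tf.matmul(Y2, W3) + B3)
Y4 = tf.nn.sigmoid(tf.matmul(Y3, W4) + B4)
Y  = tf.nn.softmax(tf.matmul(Y4, W5) + B5)   # softmax on the last layer, as before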
Let's run this, but before we do, one more little change. I lied to you when I said the sigmoid was the most popular activation function. It used to be; it isn't anymore. People have invented the ReLU. What is that? It's even simpler: just zero for all negative values and the identity for all positive values. And surprisingly, it works better, especially in very deep networks, where this activation function is a lot easier to work with. Why, really? I'm honest: the real reason is that we don't know; people tried it, it works better, success. You will find plenty of papers full of equations, but they are after-the-fact explanations.

Still, there is a reason of sorts. The problem with the sigmoid is that on the sides it is flat, and flat means a gradient of zero. You use the gradient to make progress, so having a function with zero gradient in some regions is not ideal, and this bites really hard when you stack many layers: those zero-gradient regions can combine and get you into what is called the vanishing gradient problem, where the gradient computed on your network has very, very small values, and since that is what you use to move forward, you stop moving. The ReLU doesn't have this problem, at least not on the right side, which is not flat; so, fewer vanishing-gradient problems. The second explanation is that the people who invented it were inspired by biology: it used to be the consensus among biologists that the biological neurons in our heads behave like a sigmoid, but the latest research seems to indicate that they behave more like this: when they are not stimulated they output zero, and above a certain threshold they output something proportional to the amount of stimulation. But again: once inspired, people tried it, it worked better, success.

Here I'm showing you only the first 300 iterations, so the very beginning of training: this is what you get with sigmoids, and this is what you get with ReLUs. You see it starts faster, and we will see that it ends up higher as well. In this case the model is so simple that training with sigmoids works perfectly well; it just works slightly better with ReLUs. So let's use ReLUs, push the training to ten thousand iterations, and this is what we get: 98% accuracy. That is good; we just jumped from 92% to 98% accuracy.

But I don't really like the appearance of these curves. Look how noisy this is: the test accuracy, the red curve over there, is jumping up and down by a full percent. That's not right. When you see this, it clearly means you are going too fast; this is how it manifests itself when you are jumping from one side of the valley to the other without ever reaching the bottom. You have to go slower. But just lowering the learning rate by a factor of 10 would slow down your learning by a factor of 10, which is not desirable. The right approach is to start fast and then slowly decay the learning rate towards the end. That sounds so trivial I almost shouldn't mention it, but look at the results, it's spectacular: this is what you get with a fixed learning rate of 0.003, and this is what you get with the same learning rate at the beginning, decayed gradually towards 0.0001 at the end. All the noise is gone.
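A sketch of the two changes just described, ReLU on the intermediate layers and a decaying learning rate; feeding the rate through a placeholder is one simple way to decay it, and the exact decay speed below is an illustrative choice, not a value given in the talk:

import math

# relu instead of sigmoid on the intermediate layers; the final softmax layer stays as it is
Y1 = tf.nn.relu(tf.matmul(XX, W1) + B1)
# ... same change for Y2, Y3 and Y4

# learning rate fed through a placeholder so it can change during training
lr = tf.placeholder(tf.float32)
train_step = tf.train.GradientDescentOptimizer(lr).minimize(cross_entropy)

# in the training loop, feed a value that decays from about 0.003 towards 0.0001, for example:
# learning_rate = 0.0001 + 0.003 * math.exp(-i / 2000.0)
# sess.run(train_step, feed_dict={X: batch_X, Y_: batch_Y, lr: learning_rate})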
And look at the test accuracy: we are now stable above 98%, not just peaking up there occasionally. Look also at the training accuracy, the blue curve: over the last 2,000 iterations it is stuck at 100% (you can't quite see it on the graph because the 100% line is not drawn, but it is stuck there). So across at least two thousand iterations, on the training data, this neural network is, for the first time, not making a single mistake identifying the training digits. Remember how many training digits we have: 60,000. How many do we process per iteration? 100. So how many iterations to process them all? 600. That is what we call an epoch: an epoch is when you have seen all of your training data once, so here an epoch is 600 iterations. Across at least two epochs we have, for the first time, identified all of our training images perfectly. It doesn't mean we are perfect on the test images, of course not, but at least on the training images we are.

So I'm quite happy with the noise level and the accuracy, but look at this curve: it's really weird, isn't it? That's the loss computed on training data and on test data. On training data it goes down; that is to be expected, we have an algorithm actively minimizing the training loss on our training images, so it works and the training loss goes down. But look at the test loss, the same loss computed on the test images: at first it goes down, and then there is a disconnect. This is not completely unexpected. Remember, our optimization algorithm never sees the test images; all the work it does is on the training images, it doesn't even know the test images exist. So the fact that the work done on the training images has a positive effect on real-world performance, on the test images, is a kind of byproduct: it works when it works, and sometimes it stops working. That is what we see here: at some point the work we do to minimize the loss on the training digits no longer has a positive effect on real-world performance. This is to be expected, but what can be worrying is the size of the gap. When you look it up in the literature, this gap is usually labeled overfitting, and we will have to dig deeper into what overfitting is, but as a first approximation you look it up in the manual, you see "overfitting", and the listed solution is regularization. OK, let's add some regularization.

There is one regularization technique that I like very much, which intervenes when your neurons misbehave: it's called dropout, and it involves shooting the neurons to make them behave. I like that. So let's see how it works. You take your neural network, and at each iteration of your training loop you pick a probability, let's say p_keep = 75%. This means that each neuron has only a 75% chance of remaining in the network. So at each iteration of the training loop you roll the dice and shoot 25% of your neurons, boom, boom, boom, you remove them; on the next iteration you roll the dice again and shoot a different 25% of the neurons, and that is how you train.
The effect this has is that during that one iteration, since a neuron is removed along with all its weights and biases, those weights will not change on that iteration; they remain frozen. Of course, when you evaluate the performance of your neural network on test data, you don't do it on a brain-damaged network: you put all the neurons back, so the probability of keeping them becomes 100%; that is how you disable dropout when testing.

I put the little bit of TensorFlow code for dropout here; it's just a one-liner applied after each layer. What the dropout function does is very simple: it rolls the dice and, in the vector representing the output of the previous layer, it replaces 25% of the values by zeros. That's it. A small technicality: the remaining values are slightly boosted so as not to change the average of the whole vector; you need that, otherwise you would shift the activations of the next layer a little too much. But in first approximation, the dropout function takes the output of a layer, replaces 25% of the values by zeros, and you continue from there.

(Question: where do you apply it? You typically apply dropout after each layer, that is, after the activation function: you go through one layer of neurons, then you apply dropout. Obviously you wouldn't do this on the inputs, and the last layer, the softmax layer, may have dropout before it but not after it; it goes between layers. Another question: so the activation of that neuron is zero? Yes, exactly: the output of that neuron becomes a constant zero which no longer depends on its weights and biases, which means its weights and biases are no longer present in the loss-function computation, which means the gradient no longer has terms for them, and they will not be updated on that iteration. So "shooting the neuron" is, mathematically, a way of freezing all the weights and biases belonging to one given neuron for one iteration.)
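A sketch of dropout applied after each intermediate layer, with pkeep as the probability of keeping a neuron (0.75 during training, 1.0 for evaluation):

pkeep = tf.placeholder(tf.float32)   # probability of keeping a neuron

Y1  = tf.nn.relu(tf.matmul(XX, W1) + B1)
Y1d = tf.nn.dropout(Y1, pkeep)       # zeroes ~25% of the values, slightly boosts the rest
Y2  = tf.nn.relu(tf.matmul(Y1d, W2) + B2)
Y2d = tf.nn.dropout(Y2, pkeep)
# ... and so on for the other intermediate layers; no dropout after the final softmax layer

# training: feed pkeep = 0.75 in the feed dictionary; testing: feed pkeep = 1.0 (all neurons back)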
All right, let's see if this is useful, and let's recap everything we have done to our neural network. We started with a five-layer network using sigmoids on the intermediate layers, and this is what we had. We replaced the sigmoids with ReLUs: slightly faster to start, and we end up a little higher in accuracy as well, but still very noisy. We used a decaying learning rate, which cleans up all the noise, and now we are sustained above 98% accuracy, but we have this overfitting problem, this disconnect between the test loss and the training loss. So let's start shooting our neurons and add dropout. This is what we get. Some noise comes back; that was to be expected, it's a brutal technique, you are shooting stuff. But the test loss is largely brought back under control: it is still disconnected by some amount from the training loss, but it is not shooting up anymore, so we're happy, it had the effect we were hoping for. On the accuracy, first of all I'm amazed that dropout didn't destroy our accuracy, given how brutal the technique is, but then I'm a little disappointed that it didn't improve it either. Not in this case; you can't win every time.

We'll have to revisit a little more deeply what overfitting means. At its core, overfitting happens in a neural network when you have too many degrees of freedom, too many weights and biases. Imagine you have so many weights and biases that the network can essentially store all of the training examples in them, creating some kind of internal representation for each specific example in your training data set. That would work on the training data, but it would fail miserably as soon as the network was shown an example it had never seen before. That is what overfitting is at its core. You need to constrain the degrees of freedom to force the neural network to build categories for your data, categories that then generalize to data that has never been seen. The other side of the coin is that even with a small network, if you have very little data, it can still more or less store all of that data in its internal representation. So that's a constant takeaway for training neural networks: you always need lots and lots and lots of data.

In our case, yes, five layers is a bit of overkill for this problem, but I tried four, three and two layers and I'm not getting above 98% accuracy with those either, even though I have plenty of data: 60,000 training digits for only 10 categories. I even tried the textbook solution of adding a regularization function, and I can confirm on the curves that the regularization works: the disconnect between test loss and training loss is largely fixed. But I'm still stuck at 98%; it doesn't seem possible to improve the accuracy, I'm hitting some kind of ceiling. The conclusion is that this network is, by some faulty design, not capable of extracting all the information we need to extract from this data.

Does anyone know why? We did something completely stupid at the beginning; did you spot it? Remember those images, 28 by 28 pixels: exactly, we flattened them all into one line. We destroyed the shape information. And what are digits if not shapes? They are made of little circles and curves and straight lines, and we just dumped all of that. I'm amazed we got to 98% accuracy with that kind of approach.

But researchers have been working towards our rescue, and they invented convolutional networks, specifically for two-dimensional data where shape information, that is, locality information, is important. So let's see how convolutional networks work. Here I am coming back to the example of a color image, which is why I have three channels of information: red, green, blue. Now let's take a neuron again. This neuron will still be doing a weighted sum of inputs, but this time, in a convolutional network, it does a weighted sum of only a small patch of pixels right above it, say a 4 by 4 pixel patch. The next neuron again does a weighted sum of the small patch of pixels right above it, but, big difference, using the same weights as before. We are reusing the same weights: in effect we have a little matrix of 4 by 4 by 3 weights, and we are scanning the picture in both directions using that little patch of weights.
If we do this in both directions, with proper padding on the sides, we obtain as many output values as we had pixels in the original image, and to do it we used one weight per data item in the highlighted cube: 4 by 4 by 3 weights, so 48 of them. How many weights did we have before, in the simplest possible ten-neuron network? Something like 8,000. Here, 48. That is not going to work, not enough degrees of freedom. How do we add degrees of freedom? Just do it again: take another matrix of 48 weights and rescan the image with this different set of weights, obtaining a second channel of outputs, again one output value per pixel of the original image. And since we are using tensors, we can write those two weight tensors as one by adding a dimension with the value 2, because we have two of them, and this becomes the weights matrix for a convolutional layer in a neural network.

Let's recap: 4 by 4, that is the size of the patches we use; 3, that is how many input channels we have (here a color image, three values per pixel); 2, that is how many patches we pass over the image, and of course also how many output channels we obtain. Written in this form, number of input channels and number of output channels, I think it is easy to see that you can now stack these layers, simply by aligning the number of outputs of one layer with the number of inputs of the next.

We still have to solve another problem, which is boiling the information down: at the very end, at the bottom of our network, we still want only ten neurons. One traditional way of doing this was to subsample the outputs. It is useful to understand how this works, because it gives you a good feeling for what convolutional networks do. As the weights evolve during training, they evolve towards small shape recognizers: one patch of weights will evolve to react strongly to a little circle, another to a little oblique line. What you are saying when you subsample is: take groups of 2 by 2 output values, four values, keep only the maximum and throw away the rest. It is a way of saying, in this group of four outputs one is big, which means something was clearly seen here, a little shape, and the three others saw nothing, so you can throw them away. That is how the information has traditionally been boiled down. But there is a simpler way; I'm not claiming it is more efficient, just simpler. When you pass the patch over the image, instead of moving it pixel by pixel, you can move it every two pixels, and then, mechanically, instead of one output value per pixel you obtain one value per group of four pixels. That's another way of boiling the information down, and in the most recent convolutional networks you tend to see only convolutional layers, with people playing with strides to boil the information down.
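To gather the counting above in one place (the numbers are the ones from the example just given):

weights in a convolutional layer = patch_height x patch_width x input_channels x output_channels (plus one bias per output channel)

One scan of a 4 x 4 patch over a 3-channel image uses 4 x 4 x 3 = 48 shared weights; with two independent sets of patches the weights tensor has shape [4, 4, 3, 2], i.e. 96 weights. Compare that with the roughly 784 x 10 = 7,840 weights of the single fully connected layer we started with. With a stride of 2, the output planes shrink by a factor of two in each direction, for example from 28 x 28 to 14 x 14.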
So this is the neural network that I want to build with you. At the top, our image. First convolutional layer: look at the shape of its weights matrix. 5 by 5, that is the size of my patches; reading in one input channel, because the image is grayscale and has only one channel of information; and the last number is 4, so I apply four of those patches, producing four channels of outputs. Second convolutional layer: this time smaller patches of 4 by 4, reading in four channels of information (because I output four channels previously), and I apply eight of those patches, producing eight channels, this time with a stride of two, so my planes of output values are no longer 28 by 28 but 14 by 14. Third convolutional layer: patches of 4 by 4 again, reading in eight channels, applying twelve patches and producing twelve channels, again with a stride of two, so the information planes are now 7 by 7 instead of 14 by 14.

Then I apply a fully connected layer. A fully connected layer is a layer as we have seen until now in this session: each neuron does a weighted sum, not of a tiny patch, but of all the values in the 7 by 7 by 12 cube of data, and the next neuron again does a weighted sum of all the values using its own weights. And lastly, our layer of just 10 softmax neurons to produce the final categories.

Let's see how to write this in TensorFlow. Again, you need one weights matrix and one bias vector per layer; the only difference is that the convolutional layers have a weights matrix with the specific shape we have seen: patch size, number of input channels, number of output channels. For the fully connected layers the shape of the weights matrix is, as before, number of inputs by number of neurons. And this is what the model becomes. TensorFlow has this conv2d function, which doesn't do anything fancy: it's just a double loop. You give it an image and a weights matrix, and it applies those weights in a double loop over the image, producing a plane of outputs. Don't mind the complicated syntax for the strides: I put in red the numbers that need to be either 1 or 2 to get a stride of 1 or 2; you can read the documentation to understand the rest. As for the padding strategy: our digits sit on a plain background, so padding with more of that background at the borders is a fine strategy, and that is what the padding option used here does.

So conv2d produces those 28 by 28 planes of values, we add the bias and feed the result through the ReLU activation function. Next layer, the same thing, but using the output of the previous layer as its input, and so on. Then, to go into the fully connected layer, we need to reshape the little cube of 7 by 7 by 12 values into one vector so that we can feed it into a fully connected layer; that is what the reshape does. And then our two last layers: one fully connected layer with ReLU and one with softmax. I hope by now you are used to these one-liners, each representing one fully connected layer of a neural network.
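A sketch of the convolutional model just described, with the patch sizes and channel counts from the slide (5x5x1x4, then 4x4x4x8 with stride 2, then 4x4x8x12 with stride 2, a fully connected layer, then 10 softmax neurons); the size of the fully connected layer (200) is an assumption, since the transcript does not give it:

# weights: [patch_height, patch_width, input_channels, output_channels]
W1 = tf.Variable(tf.truncated_normal([5, 5, 1, 4],  stddev=0.1)); B1 = tf.Variable(tf.zeros([4]))
W2 = tf.Variable(tf.truncated_normal([4, 4, 4, 8],  stddev=0.1)); B2 = tf.Variable(tf.zeros([8]))
W3 = tf.Variable(tf.truncated_normal([4, 4, 8, 12], stddev=0.1)); B3 = tf.Variable(tf.zeros([12]))
W4 = tf.Variable(tf.truncated_normal([7 * 7 * 12, 200], stddev=0.1)); B4 = tf.Variable(tf.zeros([200]))
W5 = tf.Variable(tf.truncated_normal([200, 10], stddev=0.1)); B5 = tf.Variable(tf.zeros([10]))

# convolutional layers: stride 1, then 2, then 2; 'SAME' padding continues the image at its borders
Y1 = tf.nn.relu(tf.nn.conv2d(X,  W1, strides=[1, 1, 1, 1], padding='SAME') + B1)  # 28x28x4
Y2 = tf.nn.relu(tf.nn.conv2d(Y1, W2, strides=[1, 2, 2, 1], padding='SAME') + B2)  # 14x14x8
Y3 = tf.nn.relu(tf.nn.conv2d(Y2, W3, strides=[1, 2, 2, 1], padding='SAME') + B3)  # 7x7x12

# flatten the 7x7x12 cube of values and finish with the fully connected layers
YY = tf.reshape(Y3, [-1, 7 * 7 * 12])
Y4 = tf.nn.relu(tf.matmul(YY, W4) + B4)
Y  = tf.nn.softmax(tf.matmul(Y4, W5) + B5)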
So let's see how this works. I can show you the real demo, but then we will switch to the video because it is slightly slow. We are asking the system to do a lot more, so it is slower than before, but still, you see the accuracy shooting up quite sharply: we are not even at 200 iterations and we are already at 96% accuracy, and it is still climbing very sharply. Let's play the video to see how this ends. We'll have to zoom to see something: you see the 99% line there, and we are getting there, getting there, getting there... and we are not quite getting there: 98.9% accuracy. I really wanted to get to 99 with you here today. So what else can we do? If you look at the curves, we still have this problem; you remember this one, and you remember what the solution is: dropout, exactly.

What I'm going to give you now is a trick, or let's call it a methodology, for coming up with a good neural network for a given problem. You first constrain your neural network a little too much, until it hurts: here I know I can get higher than this, but I have not given the network enough degrees of freedom to fully get there. You can see that in the parameters: in the first layer I use only four different patches, which means I will be recognizing only four basic shapes, and I think that is a bit limiting; our digits are not really made of just four basic shapes. Then, once I know I'm hurting it a little, I give it more degrees of freedom and add dropout to make sure those additional degrees of freedom will not result in overfitting.

So that's what we do: slightly bigger patches, and most importantly, instead of using 4, 8 and 12 patches on the three convolutional layers, I now use 6, 12 and 24 of them, more degrees of freedom, and I add dropout on the fully connected layer. I'm a little hesitant about shooting neurons in the convolutional layers, because as you have seen we have significantly fewer degrees of freedom there; I'm much happier shooting neurons in the fully connected layer, where there are plenty of neurons to shoot.

Let's see how this performs. You see the accuracy shooting up very fast; we have to zoom to see what is going on. Look where the 99% line is, and where we end up: with this approach you get to 99.3% accuracy, which is not bad considering that 100% is not so far away. If you go to the MNIST website, you will see that the current world record is 99.7%, so here, together, in an hour, we got within a couple of tenths of a percent of the state of the art on this problem.

I just want to visualize what dropout did for us. This is the bigger network, with more degrees of freedom, learning rate decay, everything done by the book, but without dropout: it is already above 99% accuracy but still exhibits this bad overfitting problem. Now I add dropout on the fully connected layer, and boom, the test loss is brought back under control, and this time I also gained two tenths of a percent of accuracy: just with dropout I went from 99.1% to 99.3%, which is huge. We are fighting for the last percent here, and that last percent only has ten tenths in it, so gaining two tenths with a little regularization technique is fantastic. I love it.

All right, so that's MNIST, almost. I have one last trick to show you on MNIST, actually the last superpower, but let's check the time: 9:13, isn't this the time for the break?
I think it is; it's the perfect time for a break, so I will welcome you back here in half an hour sharp to talk about that last superpower, which you don't know anything about yet. Please come back, and right after that we'll dive into recurrent neural networks, which are a lot of fun as well. Thank you.

Welcome back. I'm glad I didn't scare too many people; we still have most of the audience with us. Let's kick this MNIST dataset out, because I'm a bit fed up with it, but I still want to show you one last trick. We have seen dropout, but there is a better regularization technique that has appeared recently, and if you do any work in neural networks it is something you can't ignore: it's called batch normalization, and it is super powerful. So let's see how it works.

But first, imagine we have this kind of data to work with, a hypothetical example: two values that evolve like this, and we are tasked with predicting something from them. First of all, you see this is very bad data: the A value is between 0 and 20, the B value is between 0 and 2, so they are not on the same scale. This is problematic in a neural network, because when you put data in, it is multiplied by weights, and those weights are basically on a fixed scale; something in the hundreds will produce much bigger activations than something on the order of magnitude of 1, and the network will have to work to adapt its weights to compensate. So let's clean this data a little: first, let's rescale it and re-shift it to be well centered around zero. As we do that, we realize our data is correlated: those two curves seem to represent roughly the same thing, they evolve in tandem, and you see it on the scatter plot as a strong correlation along the 45-degree line. When you are a data scientist and you see this, you still see signal in the differences between the lines, so let's decorrelate: let's use one signal which is the average of the two, and another which is their difference. And now, lo and behold, I have two signals that seem different enough not to be telling me the same thing, so they might be useful inputs for a neural network, and on the scatter plot they are now well centered, well scaled, and the correlation along the 45-degree axis is gone.

What we just did is called principal component analysis. It was done by hand here, but basically it is a matrix multiply plus an offset: we took our data, multiplied it by a matrix that scales and rotates the data, and added a shift to recenter the data around zero. This is useful, and if you read the traditional neural-network literature you will always see this kind of advice: please whiten your data. Data whitening is exactly this, applying principal component analysis, basically rescaling and decorrelating your data before you use it in a neural network. But look, it's just a matrix multiply. Doesn't that matrix look like a weights matrix, and the offset like a bias vector? Couldn't a neural network do this itself? Just add an additional layer and let it learn those parameters; why should I have to look at the data and figure out the correlation axes myself? That's just a neural network layer: let's add a layer and let the network figure it out.
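In formula form, the whitening just described is only an affine transform (the symbols here are generic, not from the talk):

x_whitened = A (x - \mu) = A x - A \mu

where \mu recenters the data around zero and A rescales and rotates (decorrelates) it. This has exactly the shape of one neural-network layer, a weights matrix times the input plus a bias, which is why a layer can in principle learn it.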
And actually, today, that is mostly what we do: you will see fewer and fewer people actually whitening their data, because compute is relatively cheap, adding an additional layer is relatively cheap, and you can work with slightly dirty data; above all, you can let the network determine what the right cleaning of the data is. But then, what happens at the output of that first layer: is it clean, that is, decorrelated, centered and so on, before it goes into the second layer? Well, maybe not. OK, no problem, I add another layer that cleans it up. That could work, but in the end it means doubling all your layers, and that starts to be expensive. There must be a better way, and that way is called batch normalization. This is exactly what batch normalization does: between layers it re-normalizes the activations, but in a smart way, to make sure that the inputs of the next layer are properly centered, scaled and so on. So how does it work? First, you can see the problem on the curves. If I go back to the 5-layer network we had previously, on my percentile histograms I display what is usually called the logits, which are just the raw weighted sums plus bias, that is, the inputs into the activation functions of all the neurons in the 5-layer network. You see that what goes into my activation functions is kind of a Gaussian, but very flat on the sides; most of the values are in the middle, and if you look carefully they are actually not centered: the zero line sits above where most of the values are. So my values are distributed as a bell curve, but one that is not centered on zero. I put the sigmoid, the function I'm using as activation, in the background, and you see that with a distribution of inputs like this, my outputs will mostly end up on one side of the sigmoid, in its flat regions. And if I look at the activations, that is exactly what I see: a very skewed distribution. One percentile band, the intermediate shade of green here, takes up all the space on the graph, and there are just little bands on the sides. So after the activation function I have a distribution of values that is very, very skewed. What can I do about this? Batch normalization has three big ideas. We work on batches; we have 100 images and labels in each batch, and on a batch it is possible to compute statistics. So let's compute statistics for the logits, the weighted sums plus bias at the output of a neural network layer. Over the mini-batch we can compute their average, we can compute their standard deviation, and then we can use those to rescale the logits. Again, "logits" is a specialized word, but it just means the raw weighted sums plus bias, before the activation function. So on this mini-batch, looking at one layer, before the activation function, we compute our statistics, and we rescale and recenter the values by subtracting the average and dividing by the standard deviation. That's all we do: minus the average, divided by the standard deviation; the plus epsilon is there for numerical stability, to avoid divisions by zero. OK, but we forgot one thing: I also told you that it's good to decorrelate the data. Well, that would be too complicated and too costly, so we don't do that, we just rescale and recenter.
But now, didn't we destroy information? Maybe the fact that a neural network layer outputs values that are skewed to one side is significant; maybe that has some value in the neural network; maybe re-centering is destroying information, and maybe that's not good. So we need to preserve the full expressiveness of our model at the same time as we make these statistical modifications that help it train better. And the genius idea in batch normalization is that once you have rescaled your logits, you scale them back using two learnable parameters, alpha and beta. Just as you previously had a bias in each neuron, you simply add two additional degrees of freedom per neuron, just two, called alpha and beta: a scale factor and an offset factor. Your batch-normalized logits become the rescaled and recentered logits multiplied by alpha, plus beta. Now why does this work? Just to prove it: there are values of alpha and beta that give you your original x back, if that is the right thing to do. The neural network, during training, can decide to assign to alpha the standard deviation and to beta the average, which means the batch-normalized logits are exactly the initial logits and nothing has changed. That is just to prove that with some values of alpha and beta we are back to our original network, so we have not destroyed any of the expressiveness of the network: there are values for which it does exactly the same thing as before, if that is the right thing to do. But we let the system learn those two values, so if there are better things to do, it will learn them. And yes, per neuron they play the same role as biases did before: there was one bias per neuron, now there is one alpha and one beta per neuron. So we are centering and rescaling the logits, which are, precisely, in each layer, the weighted sums plus biases before the activation function, and we do it before the activation function precisely so as to feed the activation function values that take maximum advantage of its useful portion. In a way we are also doing the data whitening from before, because the first layer is a linear combination of the inputs and we are centering and rescaling that, so yes, in a way we are; but we officially skip the explicit data whitening step because, as I said, one layer of a neural network can do it, and one batch norm layer can do it even better.
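Putting those pieces together, the batch-norm transform on one layer's logits is small enough to sketch in a few lines of numpy. The talk calls the learned scale and offset alpha and beta (papers usually say gamma and beta); the shapes and example values below are mine:

    import numpy as np

    def batch_norm_logits(x, alpha, beta, eps=1e-5):
        """x: logits of one layer over a mini-batch, shape [batch, neurons].
        alpha, beta: learned per-neuron scale and offset, shape [neurons]."""
        mean = x.mean(axis=0)                    # statistics computed over the mini-batch
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)  # recenter and rescale; epsilon avoids division by zero
        return alpha * x_hat + beta              # learned scale and offset preserve expressiveness

    # If alpha learns the standard deviation and beta the mean, the original logits come back (up to eps):
    x = np.random.randn(100, 10) * 3.0 + 2.0
    restored = batch_norm_logits(x, alpha=x.std(axis=0), beta=x.mean(axis=0))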
All right, the last thing to check is: didn't we break something in the gradient computation? Are we still capable of computing gradients with this as our logits instead of the non-batch-norm version? Let's look at it. x is the weighted sum of inputs plus bias, so it obviously depends on our weights, biases and images. The average of x over my mini-batch: on that mini-batch all the images have been processed with the same weights and biases, so that average also depends on my current weights and biases, that's fine. Same for the standard deviation: it is computed on the current mini-batch, so it uses the current weights and biases. So x depends on the current weights, biases and images, and the transformed x, and of course alpha times the normalized x plus beta, are not constants either: they also depend on the current weights and biases. So yes, it is perfectly possible to compute the gradient through batchnorm(x) instead of the gradient through x as we did before; there is no difference. Batch normalization can be added simply as a layer between your weighted sums and your activation function, and backpropagation, the gradient computation, still works normally. That's the beauty of it. OK, let's see it in action. Remember the five-layer network with sigmoids as activation functions: we had those skewed activations because the distribution of the inputs to the activation function was skewed. Let's add batch normalization and train again. This is what you get: our activations form a perfectly beautiful Gaussian now, the bands are evenly distributed across the full useful range of the activation function, and, as I represented it here, the Gaussian of inputs is centered on the linear part of the activation function. So with batch normalization we can even train deep neural networks using the sigmoid as the activation function, as if that were a perfectly fine thing to do. Does it work with ReLUs? It does. This is what I get; the graphs show different things here, because I switched to the convolutional network we had previously, with three convolutional layers and two fully connected layers, and here I'm displaying the activations after the ReLU, on the convolutional layers and on the fully connected layers. One technicality: since the ReLU is zero on the whole negative side, a lot of values in a mini-batch will be exactly zero, and I don't want those in the graph because that's not the part of the distribution I want to display, so what I'm displaying is the maximum activation on one mini-batch, the maximum value taken by one neuron's output on a given mini-batch. And what I get is again a beautiful distribution across the useful portion of my ReLU instead of being completely skewed to one side. Looking at these percentiles you can see a nice bell curve centered, for the dense layers, around 2.5, squarely in the positive range of the ReLU where the ReLU is useful, where it's a linear function; and for the convolutional activations it is centered around 1.5, with a little bit of skew in the Gaussian, but not much. So it works beautifully with sigmoids and it is still useful with ReLUs. Now a couple of technicalities, things you will forget as soon as I switch off this slide, but still. The batch norm layer comes between the weighted sums plus bias and the activation function itself. Remember that the batch norm layer removes the average over the batch, so our biases are not useful anymore: adding a bias and then removing the average means the bias is gone. So when you use batch norm, no more biases; the beta factor in batch norm plays the role of the bias. Also, about ReLUs: a ReLU is two linear segments, so if you modify the scale of its inputs, the scale of its outputs changes proportionally, which we don't care about, since it doesn't change the shape of the output distribution. With the sigmoid, on the other hand, if you scale the inputs you end up in different regions of the sigmoid and the shape of the distribution changes completely.
So when you use a sigmoid it is important to also learn a scale parameter that moves your values into the useful portion of the sigmoid; with ReLUs you don't care, so with ReLUs you can skip the alpha scale factor and not use it at all. Here I give you a little chart of which of those, biases, batch norm scales, batch norm offsets, you should use, with and without batch norm, with ReLUs or with sigmoids. In TensorFlow you actually have to specify those parameters and say which ones you want and which ones you don't, so it's useful to have this little reminder. So what is the end result? The end result is that, the activations being cleaner, you can go a lot faster. That is actually, for me, even a little bit of a problem, because it means you can get much better results but you have to change your learning rate to get them: you need a faster learning rate. That's a good thing, of course, but it also means that when you apply batch norm, the learning rate you had previously doesn't work so well anymore and you have to find a new one that works better. Also, batch norm has the same kind of regularization effect as dropout, so it's possible to stop using dropout completely and use batch normalization instead; it's also possible to combine them with a lower amount of dropout, and that's what I did in this last network. One last little technicality: how do you apply batch norm to a convolutional layer? You have to compute the statistics on the logits of one neuron across all the values in a batch. But in a convolutional layer, one neuron has an output per image in the batch, and it is also scanning the images using the same weights, so the statistics are per image in the batch, per x position in the scan and per y position in the scan. That's the only change: the averages and standard deviations, which you previously computed over the whole batch, in a convolutional layer have to be computed over all the images in the batch and all the x and y positions of your neurons. Only the computation of the stats is different; you still have one scale factor and one offset factor per neuron, just as you had one bias per neuron previously. And one last thing: at test time, when you are classifying real images, batch normalization needs those statistics. What do you compute them on at that point? Do you use the last training batch? Why would that be relevant for your test images? Not really. Do you use all the images you have? That would give good statistics, but it's a bit expensive. So the practical trick is, during training, to keep a moving average and a moving standard deviation, which means you are using the stats computed over your last N iterations, and that is what you use as the batch norm statistics when testing. Basically, the assumption is that over those last iterations the weights and biases have not changed that much, so the statistics are still quite good. OK, so the code: TensorFlow has a very useful batch_normalization function, right here at the bottom, but if you look at it, it does almost nothing: you give it the logits, but you also have to give it the averages and the standard deviations, and then, very helpfully, it will do the subtraction and the division for you.
Thank you, I would have known how to do that myself. All the rest of the complexity, that is, how to compute the averages and standard deviations correctly in dense layers and in convolutional layers, and how to compute your moving averages so you can use them at test time, you have to do yourself. It's not that complicated. TensorFlow has this moments function which returns the average and the variance, so that's perfect, you can use it, and the only change between a convolutional layer and a non-convolutional layer is that I compute the moments over different dimensions in the two cases. TensorFlow also has an exponential moving average helper which you can use to compute your moving averages. So here my m and v, the mean and the variance, I choose conditionally on being in training or test mode: for testing I use the moving averages computed over a number of previous batches, and for training I just need the statistics on the current batch. Barring those two little complications, that for convolutional and for normal layers the stats are computed over slightly different dimensions, and that in testing and in training the stats come from different places, the core of it is just batch_normalization of the logits with the average, the variance, and then the offsets and scales, my alpha and beta factors, which I need to define as variables because they will be learned. And this gives me a layer which I can now apply between my weighted sums plus bias, if I still use one, and my activation function, wherever I want.
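The helper the talk describes could look roughly like this. It is a sketch of the idea, not the exact code on the slides; the decay value, the epsilon and the argument names are illustrative assumptions (TensorFlow 1.x graph API):

    import tensorflow as tf  # TensorFlow 1.x graph-style API

    def batchnorm_layer(logits, is_test, iteration, offset, scale=None, convolutional=False):
        """logits: weighted sums of one layer, before the activation function.
        is_test: boolean tensor; when True, use moving averages instead of batch statistics.
        offset, scale: the learnable beta/alpha variables (scale may be None with ReLUs)."""
        ema = tf.train.ExponentialMovingAverage(0.999, iteration)   # moving stats over recent batches
        axes = [0, 1, 2] if convolutional else [0]                  # conv: also average over x and y positions
        mean, variance = tf.nn.moments(logits, axes)
        update_ema = ema.apply([mean, variance])                    # run this op at every training step
        m = tf.cond(is_test, lambda: ema.average(mean), lambda: mean)
        v = tf.cond(is_test, lambda: ema.average(variance), lambda: variance)
        bn = tf.nn.batch_normalization(logits, m, v, offset, scale, variance_epsilon=1e-5)
        return bn, update_ema

The offset (and the scale, if you use sigmoids) would be tf.Variable vectors with one element per neuron, and you would run update_ema alongside the training step, for example sess.run([train_step, update_ema], ...).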
So, does batch normalization work? Let's try. This is the convolutional network from before; remember, I got to 99.3 percent accuracy with my best network so far. Now, applying batch normalization, 99.3 is already kind of far away: this is shooting above 99.5. With batch norm I'm still keeping a little bit of dropout, but in my fight for the last tenths of a percent I got two additional tenths, again just with another normalization, or rather regularization, technique. I love batch norm. All right, let's close on this superpower, and now let us change worlds and dive into recurrent neural networks. Just a very quick mention about TensorFlow first: TensorFlow has this tf.learn library, which today you will find in tf.contrib.learn and which moves into the main distribution as of TensorFlow 1.0, to be released soon. I'm mentioning this in passing because everything we have seen today was low-level TensorFlow, we have been manipulating matrices, but once you understand what you are doing and which layers you want, you want to manipulate layers directly, and there is an API in TensorFlow for that. To give you a little feel for it, this is how I could write my convolutional model: a convolutional layer, another convolutional layer, then a flatten, then layers.relu, which is a fully connected layer, then layers.linear, which is a fully connected layer without the activation function, and so on. In tf.learn, the contract is that your model function returns a dictionary of whatever you want, which is very flexible and why I'm very fond of this API, plus the loss and the training operation. Then what you do with this is create an estimator, passing your model as the model_fn function, and instead of writing a training loop yourself you just call estimator.fit, passing the images, and that runs for whatever number of iterations you specified. It is the same as what we did before, just written with less code, and the added value is that all those layers create their own weights and biases and whatever other degrees of freedom they need. I don't like to start with this, because there is a lot of hidden stuff going on in the background, and if you don't know what it is, that doesn't help with understanding; but once you do understand, this is obviously a much more user-friendly way of working. There are ways of seeing the learning statistics as well: by default, fit will report the loss, because it knows what the loss is, but if you need more statistics you define a metric. I have the code here, see, eval_metric: my model outputs a dictionary of whatever I want, so if I want to compute a metric on one of those values, I define it using the MetricSpec function, where I say on which key of that dictionary I am computing the metric and which function computes it. Then I put this eval metric into a new object called a monitor, and that monitor I pass as a parameter to the fit function. So you create a monitor, you tell it to follow this metric, the metric being whatever you want, just a function plus a key in what you return, you configure how frequently you want that value checked, and it outputs it. All right, recurrent neural networks. Who has worked with recurrent neural networks already? Fewer people, OK, we'll go slow. This is a neural network as we have seen previously, a couple of layers, but the trick in recurrent neural networks is that the activations of the intermediate layer are, at each step, re-injected as part of the inputs. A recurrent neural network is used in time steps: at one time step I take one input, the internal state may be zero at that point, and I compute one iteration through this network. That creates activations in the intermediate layer; those activations I concatenate to the next input vector, and I repeat. So you can represent this as a neural network cell: it has inputs X_t, which change with time, vectors representing something; it has this H internal state, which at each step is computed and re-injected into the inputs at the next step; and this H state is also used to compute the outputs via a softmax layer, an additional layer usually represented outside the cell as those yellow circles, which takes the H internal state, applies a softmax layer to it and computes something useful to us, again a vector. These are the transfer functions for the simplest possible cell of a recurrent neural network: you build the input by taking the real input and concatenating the previous internal state to it.
You just concatenate: a vector of n elements and another of p elements become one big vector of n plus p elements. Then you feed this through a normal neural network layer; the usual activation function here is the hyperbolic tangent, which is just a sigmoid shifted and scaled, a continuous function going from -1 to 1 instead of from 0 to 1 as the sigmoid does. The resulting intermediate state is what you re-inject at the next step, and if you want to compute outputs at this point, you add a new neural network layer, this time with the softmax activation function. All right, so this is nice: it is like a state machine, which will be very useful for sequences, and since it has internal state it can do the things state machines do, like remember that something happened at one point in a sequence and act on it at another point. For example, you can train this on a sequence of characters and teach it to correctly open and close parentheses; that is something it can represent as a state machine. But the problem is: how do you teach it? Imagine we put something on the inputs of this cell and it produces some outputs. We know what the desired outputs are, so we force those desired outputs and backpropagate to change the internal weights and biases to produce better outputs. That works if the problem was the weights and biases. But what if the problem was the input state? Here it is represented as zero, but obviously, if I'm somewhere in the middle of my sequence (thank you, I will update my software later, thank you very much), my input state is H at t minus 1, and what if it is that state that is not correct? My backpropagation step can fix my weights and biases, but as far as this transfer function is concerned, the input state is a constant: no amount of learning can force those values to something different, they were received from the outside. So even though these cells are capable of representing state machines and state changes, I don't seem to be able to teach them meaningful state changes. The solution is to unroll the cell multiple times and to consider the whole thing as one training step. Now, if my last output Y5 is not what I want, I can force my desired output there and do a backpropagation step, and if the problem was that H3, or H2, or H1 did not have the correct values for this to work, well, I am now backpropagating through all of that, so I can change the weights in such a way as to influence H0, H1 and so on, so that the H4 going into my last cell has the correct values and produces the desired output. That is how training works for recurrent neural networks: you unroll the network across a certain number of time steps, you put in your inputs, it produces outputs, you provide your desired outputs, and you do one training step that modifies the weights and biases to get closer to those desired outputs; and this time, because it is unrolled, there are ways to influence the values of H4, H3, H2, H1 and so on, if those were the culprits, simply by modifying the weights that produced them.
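As a minimal sketch of this simplest cell and of its unrolling (forward pass only, plain numpy; the sizes, names and the dummy 5-step sequence are all illustrative):

    import numpy as np

    n_in, n_state, n_out = 100, 128, 100     # input size, internal state size, output size (arbitrary here)
    W  = np.random.randn(n_in + n_state, n_state) * 0.01   # weights of the tanh layer
    b  = np.zeros(n_state)
    Wy = np.random.randn(n_state, n_out) * 0.01            # weights of the softmax readout layer
    by = np.zeros(n_out)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def cell_step(x, h_prev):
        xh = np.concatenate([x, h_prev])     # concatenate real input and previous internal state
        h = np.tanh(xh @ W + b)              # new internal state H_t
        y = softmax(h @ Wy + by)             # output Y_t through the softmax layer
        return h, y

    # "Unrolling": one training example is a whole sequence, and the SAME W, b, Wy, by are reused
    # at every step, so gradients flowing back through the unrolled steps can also fix earlier states.
    h = np.zeros(n_state)
    for x_t in np.eye(n_in)[:5]:             # a dummy 5-step sequence of one-hot inputs
        h, y_t = cell_step(x_t, h)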
And please remember: it is the same cell, so when I say "the weights and the biases" I am talking about the same weights and the same biases in each replica of the cell, and likewise in the softmax layers. That doesn't matter: this is just one big transfer function with a loss computed from the inputs and the desired outputs, which uses as parameters the weights and biases of the system; from this loss I can compute a gradient and change my weights and biases meaningfully. It doesn't matter that the same weights and biases appear in each cell, because it is one cell replicated multiple times. I can also stack cells: if I want deep learning, I want to go deep, so if I want a recurrent neural network that is not one layer but two or three or four layers deep, that is what I do, and it changes nothing conceptually. I still have an input, I still have an output, and my state vector is just a bit bigger, but it is still just a state vector. A stacked cell is simply a cell of a recurrent neural network, and to train it you do the same thing: replicate it multiple times. So, is this capable of representing state changes across large portions of text? Imagine we use this recurrent neural network to predict the next word in a sentence: we feed in all the initial words of the sentence and try to predict the next one. I'll skip over how you encode words as vectors, we'll talk about that later; for the time being, just imagine we have a solution for that. If my sentence is "his mother tongue is...", that's a short sentence, so even by unrolling my network just five times I can teach it that if the beginning is "his mother tongue is", what comes next is most likely the name of a language. That works. But here, the context saying that this person was born in France is far, far away, somewhere at the beginning: how can I teach the network that here it has to output "his mother tongue is French" and not just any language? The only way is to unroll the network as many times as there are words in the full sequence. The conclusion is that recurrent neural networks can be taught state changes across spans only as big as the number of times you unroll them, and no bigger. That is a limitation of training, not of representation: if you work with a character-level recurrent network, for instance, and you teach it to open and close parentheses on, say, twenty-character sequences, it can still work on fifty-character sequences, because internally it has what it takes to represent the fact that it has opened a parenthesis, and at some point it decides to close it, even if that point is beyond the twenty-character span. So it can represent those state changes; but if it gets them wrong across a span longer than the unroll length, there is nothing you can do to train it to get them right. So the practical conclusion is that recurrent neural networks, to be useful, are always deep, very deep. Previously my ordinary networks had five layers, and I told you that in the extreme cases people building big stacks for recognizing images
get up to a hundred layers, which is considered very big. Here, the simplest thing I just threw on the board already has an unroll size of maybe 30 or 40, just to start with. So these networks are always very deep, and you get all the problems of deep neural networks, vanishing gradients and all of that. For a long time these things did not work, until a cell architecture was found with good convergence properties; the solution was to work on the cell architecture itself. I'm not going to go into the math of why the convergence properties are better, but I want to show you the solution and how it works, and it is called an LSTM. Now, a quick survey: do you prefer the super complicated, incomprehensible diagram, or the super complicated, incomprehensible set of equations? Which is your favorite way of having things explained? The diagram, OK. Well, I'm a developer, I prefer code, and the equations look like code, so I'll actually skip the diagram; it's the usual one you see in textbooks and it doesn't make sense to me, sorry. Let's put it to the side and simply walk through the equations defining what this does. It's not that complicated once you realize that we will use a couple of network layers to define what are called gates. As before, our input is the concatenation of the real input and the internal state. We use a first network layer, where the sigma stands for a sigmoid, which computes an internal vector of values between 0 and 1; that's why we use the sigmoid, because its values are between 0 and 1. And why do we call it a gate? Because with values between 0 and 1 we can multiply another vector element-wise by this one and gate the values in that vector: where it is 0 those values are gone, where it is 1 they go through, and we have all the intermediate states. So we define one forget gate, one update gate and one result gate, each with its own weights. Now, the idea is that we will maintain a second internal state, called C, which is a kind of memory, and what we want to express is that at each step the new content of my memory is the previous content, minus what I choose to forget, plus what I choose to remember from my new inputs. That is, conceptually, what we want to write in equations. First, a technicality: I put the sizes of all these vectors on the side, and it's quite simple, all the vectors inside the cell have the same size, which I'll call n. That is a design choice, and it is the one hyperparameter of a recurrent neural network you have to remember: the internal size. There is just one problem: my input is of size p plus n, because I have concatenated my internal state (size n, like everything else) and my true input (size p), so I can't use it as it is. OK, let's use a neural network layer to bring it down in size; that's what I do with this hyperbolic tangent layer. By the way, this line is the equivalent of what I had in the simplest possible recurrent cell, the hyperbolic tangent line, and this is also where you can play with the activation function: if you don't like the hyperbolic tangent you can use a ReLU here, that works as well. The sigmas, however, you cannot replace with anything but the sigmoid, because the goal is to obtain values between zero and one and use them as gates.
So you have to use a function whose output stays between zero and one. Now that I have size-adapted my inputs, I can compute my new internal memory state C: it is the forget gate multiplied by the previous state, that is, what I had in memory minus what I choose to forget, plus the update gate multiplied by my size-adapted input, that is, what I choose to remember from the new input. That is my new internal memory. And the, let's call it external, state H (LSTMs have two states) is basically the result gate multiplied by the range-adapted internal memory C. The hyperbolic tangent there is just a range adaptation, not a network layer: if you look at it, C can grow. Each gated term is bounded, but C is a running sum of such terms, so after a few steps it can be 2, then 3, then 4; it can grow, so we use the hyperbolic tangent to bring it back between -1 and 1 when we use it. That's a technicality. So my output state is gated by the result gate: whatever I choose to show to the outside of my internal memory. And of course, if I need a comprehensible output, I add my softmax layer at this point. These are the equations of an LSTM. The general principle is still the same, I have an input, I have an output, I have internal state that is passed around, but the rules for updating the internal state are slightly more complicated. And these are not the only rules possible: you can imagine those arrows wired in plenty of different ways, and there was so much variation that researchers decided to test them systematically, writing a big program to try all the possible variations, and they came to the conclusion that most of them perform just fine. Today, the most popular of those recurrent cells is called the GRU, the gated recurrent unit: very similar equations, and just as powerful as an LSTM, but with only two gates instead of three. That's one less weight matrix and one less bias vector, so it is cheaper to compute; apart from slightly different mechanics inside, externally it is just a cheaper recurrent cell, and that is what we will use.
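Here is the LSTM step we just walked through as a minimal numpy sketch (forward pass only; the weight shapes and names are mine). The GRU used below is analogous, with two gates instead of three:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, Wf, bf, Wu, bu, Wr, br, Wc, bc):
        """One LSTM step. All internal vectors have size n; each W has shape [p + n, n],
        where p is the size of the true input x."""
        xh = np.concatenate([x, h_prev])     # concatenate input and previous external state
        f = sigmoid(xh @ Wf + bf)            # forget gate, values in (0, 1)
        u = sigmoid(xh @ Wu + bu)            # update gate
        r = sigmoid(xh @ Wr + br)            # result gate
        x_hat = np.tanh(xh @ Wc + bc)        # size-adapted input (the tanh line of the simple cell)
        c = f * c_prev + u * x_hat           # new memory: keep part of the old, add part of the new
        h = r * np.tanh(c)                   # external state: gated, range-adapted memory
        return h, c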
So let's use it, and let's do something fun: we will teach a neural network a character-based language model. We give it text and ask it to predict the next character in the sequence. Here is how it works: we take those cells, unroll them a certain number of times, and put characters on the inputs. How do we encode characters? We need them as vectors, so: one-hot encoding, which we have seen before. Our alphabet here is, say, roughly 100 characters, lower case plus upper case plus a bit of punctuation, so we work with 100-element vectors, and the way to encode an "s" is a vector of 100 zeros with a single 1 somewhere in it, the index of that 1 being the numerical code for "s". We put our text on the inputs, and to teach the network we force a sequence on the outputs which is the same text shifted by one character: the same characters, shifted by one, so the last one is new. That is what we will teach it. So let's write this in TensorFlow. There is a full API for working with recurrent neural networks in TensorFlow, and it starts with a cell, so let's use a GRU cell; I like the GRU cell. When you instantiate a GRUCell, that defines all the weights and biases we saw internally, the three weight matrices and three bias vectors, all defined internally just by saying GRUCell. The only parameter you need to specify is the internal size of the vectors in the cell: remember, internally all the vectors are of size n, which is not the same as the input size and not the same as the output size. You have a softmax layer on the output, so you can make the output size anything you want; the internal size is a free parameter. Now I want a deep network, three cells deep, so let's stack it: there is a function for that called MultiRNNCell. You pass it a cell, you say how many times you want it, and it creates a new cell for you, which is a stacked cell. The last thing we need to do is unroll it, and in TensorFlow you do this with the dynamic_rnn function. You give it the stacked cell you just produced; how many times it unrolls depends on the shape of X. X, my input, is a sequence of characters, so if I feed it sequences of five characters my network will be unrolled five times. And what is fantastic in this API is that, remember, all of this is TensorFlow code that builds a graph in memory, and dynamic_rnn does not physically duplicate this piece of the graph: it uses a graph node which is a for-loop (TensorFlow has that), and they managed to get the gradient computation and backpropagation to work across those for-loop nodes. I don't know how, it's magic. It means a for-loop node is added to your computation graph, so at training time it can unroll the cell five, six, seven or three times; it's very flexible. In our case that flexibility is not very useful because all our input sequences are the same length (I chose a sequence length of 30, so we train on chunks of 30 characters of text), but if you work with sequences of different lengths, there is nothing special to do: you pass the lengths of the sequences and dynamic_rnn unrolls the cell as many times as needed on a per-example basis. I also pass in my initial state, because of course I am passing the state around. What I get as outputs are, first, the results, all the H states at the bottom, on which I will need to apply a softmax layer to obtain my output characters, and second, the last state H, which is the state I need to feed back in as I continue training. Now, how do I write this softmax layer? I know how to write a softmax layer, but I have one problem to solve: all the weights of these cells are shared, and a softmax layer has a weight matrix and a bias vector, so I want to share that weight matrix and bias vector below each cell as well. There is a technical solution in TensorFlow: when you define a variable, you can either say Variable, which gets you a new one, or get_variable, which gets you an existing one by name; that would be a simple way of defining these eight softmax cells so they share the same weights and the same biases.
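A sketch of those calls in graph-style TensorFlow follows. Module paths moved around between versions (around 1.0 the cell classes live in tf.contrib.rnn, earlier in tf.nn.rnn_cell), and the hyperparameter values here are illustrative, not necessarily those on the slides:

    import tensorflow as tf  # TensorFlow ~1.0, graph-style API

    ALPHASIZE, CELLSIZE, NLAYERS = 98, 512, 3          # alphabet size, internal cell size, stack depth (illustrative)

    X = tf.placeholder(tf.uint8, [None, None])         # [batch, sequence length], both flexible
    Xo = tf.one_hot(X, ALPHASIZE, 1.0, 0.0)            # one-hot inputs: [batch, seqlen, ALPHASIZE]
    Hin = tf.placeholder(tf.float32, [None, CELLSIZE * NLAYERS])   # state passed from batch to batch

    cell = tf.contrib.rnn.GRUCell(CELLSIZE)            # defines its gate weights and biases internally
    mcell = tf.contrib.rnn.MultiRNNCell([cell] * NLAYERS, state_is_tuple=False)   # stack it 3 deep
    # dynamic_rnn adds a loop node to the graph, so the cell is unrolled as many times as the
    # sequences are long at run time (30 during training here, 1 when generating text later).
    Hr, H = tf.nn.dynamic_rnn(mcell, Xo, initial_state=Hin)
    # Hr: all intermediate states, [batch, seqlen, CELLSIZE]; H: the last state, to feed back in as Hin.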
But actually you can do something even simpler. Look at the output states H0 to H7, and let's call the last one Hf. Their shape starts with batch size (yes, we are still processing batches, that's a slight complication: every size here starts with batch size), then sequence length, because we have as many of them as we unrolled the sequence, then cell size, because these are state vectors and their size is the internal size of the cell. Now, as we push them through the softmax layer, whether a given H comes from one of the unroll iterations of the cell or from a different example in the batch makes no difference: we treat them exactly the same. So the only thing we do is reshape this tensor into one big bag of vectors, of size batch size multiplied by sequence length: all those H0 to H7, across all the examples in the batch, go into the same bag, and with this single reshape operation we can then add a perfectly normal softmax layer. Here I'm using the layers interface to define it: linear defines, behind the scenes, a weights matrix and a bias vector and does the weighted sums, and then I feed that through my softmax activation function. This treats all of the H-to-Y transitions as one little readout layer; it just processes more of them, some coming from different batch examples, some from the fact that the cell is iterated, but that makes no difference, and it produces all of my outputs. That was a long explanation to say that there is a simple trick for defining the softmax layer that reuses the same weight matrix and bias vector in each of those yellow circles. And now that I have my predictions I can compute my loss, because I know what I want on the outputs, which was the goal, and once I have a loss function I am ready to train. There is still a bit of shenanigans with the inputs and outputs, though. This is what we want to put on the inputs and what we want on the outputs, so let's look at the sizes. The input characters: I have a number of sequences in a batch, each of length sequence-length. My Xs are their one-hot encoded equivalents, so again a batch of sequences, where each element is now a vector of alphabet-size elements. My internal state: batch size is the first dimension, of course, and the H vectors are of size cell size, the internal size of the cell, and I have three of them because I stacked three cells, so it is cell size multiplied by the number of layers. With recurrent neural networks it is honestly as hard to feed the data in correctly, there are so many inputs and outputs, as it is to understand conceptually what they do, so now I need to line up all my inputs and outputs without mistakes. These are my placeholders: the input sequence, which I one-hot encode, so each element becomes a vector of size alphabet-size; the placeholder for my desired output sequences, encoded as, let's say, ASCII; and the placeholder for my internal state, which, as we have seen, has size cell size multiplied by the number of layers.
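Continuing the sketch from before (it assumes tf, Hr, ALPHASIZE and CELLSIZE defined there), the "one big bag" reshape, the shared softmax readout and the loss might look roughly like this; the optimizer and learning rate are illustrative:

    # Continues the previous sketch: assumes tf, Hr, ALPHASIZE, CELLSIZE are defined there.
    Hf = tf.reshape(Hr, [-1, CELLSIZE])                 # one big bag of state vectors: [batch*seqlen, CELLSIZE]
    Ylogits = tf.contrib.layers.linear(Hf, ALPHASIZE)   # shared readout: weighted sums, no activation
    Yo = tf.nn.softmax(Ylogits)                         # character probabilities

    Y_ = tf.placeholder(tf.uint8, [None, None])         # desired outputs: the same text shifted by one character
    Yflat_ = tf.one_hot(tf.reshape(Y_, [-1]), ALPHASIZE, 1.0, 0.0)
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Yflat_)
    train_step = tf.train.AdamOptimizer(1e-3).minimize(tf.reduce_mean(loss))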
Now I write my model. On the predictions of my model I apply argmax, which, given a vector of probabilities, tells me which element is the biggest and returns its index; that is what goes from one-hot encoding back to the ASCII, or numerical, encoding of the characters. And since on the outputs of my model I used this little trick of putting all the batch examples and all the unroll steps into the same bag, I have a batch-size-times-sequence-length dimension there, so I need to reshape it back to [batch size, sequence length] to get a nice matrix where each line is one sequence of predicted characters and the next line is the predicted sequence for the next example in the batch. And I can have my training step, where I ask the optimizer to minimize the loss. Am I done with all the problems? No, not yet: now I need to batch this data correctly. Consider the first batch of sequences. I have a text and I extract 30-character sequences from it; say the first sequence is "the quick". Those go on the inputs of my recurrent network, unrolled, in this drawing, eight times. In the next batch, can I just continue the sentence ("the quick" is probably followed by "brown fox") on the next line of the same batch? No, I can't, because I need to pass the output state I obtained when I fed in "the quick" into the place where the network processes the continuation of "the quick". So here is how the second batch has to look: the sentence has to continue on the first line of the second batch, and then on the first line of the third batch, and so on, because that is how the internal state is passed from iteration to iteration. Which means that the second line of every batch will be using sequences from much further down in my sample text. How much further down? It's a simple equation to write if you want to exhaust all your inputs, so I won't get into it, but that is what you have to remember: batch the examples so that the first sequence continues as the first example of the second batch, then as the first example of the third batch, and so on, because that is the direction in which the internal state is passed. By the way, there is a utility function somewhere in the examples that does this slightly complicated batching; someone already wrote it, just use it. So here is the full code of this language model. Let's recap: we have a placeholder for our input sequences, one for our desired output sequences, and a placeholder for the state we pass around; we have our model, which is our GRU cell stacked three deep and unrolled 30 times; we have the softmax output layer, which transforms the results our RNN cells give us into probabilities for characters; the loss; the training step; and the training loop, in which we load our sequences in batches using this slightly complicated way of batching. So we are ready to train, and I am going to train this on the complete works of Shakespeare.
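The "continue each row across successive batches" scheme is easy to get wrong, so here is a minimal numpy sketch of it, my own simplified version of the kind of utility the talk mentions, assuming the text has already been encoded as a flat array of character codes:

    import numpy as np

    def sequencer(coded_text, batch_size, seqlen):
        """Yield (x, y) batches of shape [batch_size, seqlen] where y is x shifted by one character,
        and row i of batch k+1 continues exactly where row i of batch k stopped, so that the
        RNN state returned for each row can be fed back in with the next batch."""
        data = np.asarray(coded_text)
        nb_batches = (data.size - 1) // (batch_size * seqlen)
        rounded = nb_batches * batch_size * seqlen
        xdata = data[:rounded].reshape([batch_size, nb_batches * seqlen])       # one long text stream per row
        ydata = data[1:rounded + 1].reshape([batch_size, nb_batches * seqlen])  # the same stream, shifted by one
        for b in range(nb_batches):
            yield (xdata[:, b * seqlen:(b + 1) * seqlen],
                   ydata[:, b * seqlen:(b + 1) * seqlen])

Each row's stream therefore starts roughly text-length divided by batch-size characters further down the text, which is exactly the "way further down in the sample text" point made above.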
Once it is trained, from the fully trained network I will take one cell, just one. It has been trained to receive a character and produce probabilities for the next character, so I seed it with anything, garbage really, it gives me a probability distribution for the outputs, I roll the dice and sample something from those probabilities, I put that back in as the next input, and by repeating this I get text. I actually did that, and here is what I'm going to show you: I saved checkpoints as I was training, so I will reload some of those checkpoints and show you, at different points during training, what this network is capable of generating as text. All right, this first one is barely trained: it's mostly "run and play" and "run and run". It picked up the fact that it should be using paragraphs and words, but that's not so great. Let's try one that is a bit more trained, checkpoint four, trained on, I think, about half the corpus. Oh look, it looks like a play. It's still not quite English, but it picked up the fact that there are supposed to be character names in all capitals, and that a character is supposed to talk for a while and then a new character should talk. This is much better, but not yet very good English, so let's train a little more; checkpoint six, for instance. Now we have a long monologue, and look at it: the character names are actually quite credible, Cleopatra, a Clown, and others, and it even picked up that when the speaker is not a name but a function, like a clown or a servant, it has to be capitalized differently. It also produces a queen whose name I can guarantee appears nowhere in Shakespeare, a hallucinated name, but come on, if I hadn't told you, would you have noticed? Those hallucinated names are very credible, and it is starting to produce English that is actually correct: character by character, it is learning how to spell English. So let's try the fully trained network now and see if we can generate a Shakespeare play. Well, this looks like a very credible play to me: I have stage directions, I have exits and entrances, "Enter Mark Antony", I have Martius, and the lines themselves are actual English: "what is the matter with me, sir?" So come on, who would like to play hallucinated Shakespeare with me? I need one reader. Yes, come here. You will be Juliet, why not, so please position yourself and play the role, and I'll stand here; a little staging for dramatic effect. Let's go: "Have with your hands, my lord, should break it; you, to break it to your lordship." "We all go out. The man is mute, my lord, and you shall not be made that; what your lordship show me not." "Enter a messenger." You play the messenger now. "Romeo, a most president of this, and seek out of the Senate." Thank you very much, you have been a wonderful Juliet, thank you so much.
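For completeness, the dice-rolling generation loop described at the start of this demo could look like the sketch below. It assumes a tf.Session with the trained weights restored as sess, the graph nodes X, Yo, H, Hin and the constants from the earlier sketches, and that character codes happen to be ASCII codes; all of these are assumptions of this sketch, not details taken from the slides:

    import numpy as np
    # Assumes: sess (tf.Session with trained weights restored), X, Yo, H, Hin, ALPHASIZE, CELLSIZE, NLAYERS.

    h = np.zeros([1, CELLSIZE * NLAYERS])        # start from a zero state
    c = np.array([[ord(' ')]])                   # seed character: anything works, even "garbage"
    generated = []
    for _ in range(1000):
        probs, h = sess.run([Yo, H], feed_dict={X: c, Hin: h})
        p = probs[-1] / probs[-1].sum()          # the predicted distribution for the next character
        nxt = np.random.choice(ALPHASIZE, p=p)   # roll the dice instead of always taking the argmax
        generated.append(chr(nxt))               # assumes the character encoding is plain ASCII
        c = np.array([[nxt]])                    # feed the sampled character back in, carrying the state h
    print(''.join(generated))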
All right, so I hope I have shown you how to build a recurrent neural network, and also that these things can be quite powerful, mostly for language processing. I also did the same thing on the TensorFlow Python code base. This is what I had at the beginning: it doesn't look like code yet, but it picked up the keywords already. Then it learned how to use the keywords correctly, def with a colon at the end and a function name in the middle, though it still gets the parentheses a bit wrong. And in the end it can recite the Apache license completely, it can open and close parentheses correctly, and it can give you TensorFlow tips: look, it figured out how to write comments, things like "create the string operation to apply gradient terms, that also batch is very important" and "we should always infer to the session case". Yes, please remember that we should always infer to the session case; makes sense, doesn't it? The person who originally built this kind of character language model also tried it on LaTeX: he took a mathematics book, trained on the LaTeX source, and it produced almost functional LaTeX. He had to fix a little bit to make it compile, but when it compiles it looks like a credible mathematics book. It even tried to make drawings, and it even tried to produce a proof; first line: "proof omitted". Yeah, that's a good proof. There are lots more things you can do with recurrent neural networks, character-based or word-based: once you know how to embed words into vectors you can feed words, you can predict sentiment, you can predict categories, and that is also how text translation works, where you input a sentence in one language and train the network to output the sentence in another language. And finally, have you heard about image labeling? That is a recurrent neural network too: you train it on the words of the desired caption, the desired output sequence, and you simply add more information as input, information extracted from the image, and you obtain an image-captioning network that can say things like "a herd of elephants" or "a person riding a motorcycle", which is fantastic, or like this one, "a yellow school bus"... or a refrigerator, whatever; there is still room for improvement there. Thank you very much. And by the way, if you want to use this stuff: Cloud ML. We just launched a whole infrastructure for running your TensorFlow neural networks on Google's cloud, so if you don't have 500 of those GPUs under your desk, you can use Cloud ML. The session is open for questions, or you can just come and see me afterwards. Yes? [question from the audience] I'm not sure I know which paper you're referring to; is it the paper that uses a recurrent neural network to produce neural network architectures? Maybe it's not that one, but the one I have seen is this: as you saw here, we have used different neural network architectures, three convolutional layers plus two fully connected layers, or five fully connected layers, and so on, and an architecture is just a sequence of symbols, like convolutional, convolutional, dense, dense. So they apply a recurrent neural network to generate that sequence of symbols, and they end up with a neural network that generates the ideal solution, the ideal neural network shape, for a given problem,
which is kind of neat. Other questions? Yes, of course, of course. We have seen both: in the low-level API there is the conv2d function, which will scan an image in two directions, and in the higher-level layers API of tf.learn you have a conv2d layer, a convolutional layer which internally defines its weights and biases correctly. So yes, convolutional networks are well covered, and they are fairly easy to cover; what I really like is the API they put together for recurrent neural networks, that one is really clever. All right, well, thank you for your three hours of attention, I really appreciate you staying here with me for so long. I'm still around, I will be around for a couple more days, so feel free to come and talk. Thank you.
Info
Channel: Devoxx
Views: 575,430
Rating: 4.9243531 out of 5
Keywords: DevoxxBE2016
Id: vq2nnJ4g6N0
Length: 155min 53sec (9353 seconds)
Published: Tue Nov 08 2016