Uncertainty Quantification and Deep Learning | Elise Jennings, Argonne National Laboratory

Video Statistics and Information

Captions
My name is Elise Jennings. I work at the ALCF in the data science group, and in the first of two talks today I'll be talking about uncertainty quantification and deep learning. I know you've just had a morning of machine learning and deep learning topics, so I wanted to get a sense of what people already know, so I can move fast or slowly. Who here is familiar with Bayesian statistics and Bayesian inference? OK, cool. Who here is currently doing anything in uncertainty quantification with deep learning or machine learning? A little bit; that's interesting. And that's really interesting, because how many people here are actually doing machine learning or deep learning? A good few people, and not concerned about UQ. Or maybe: who is concerned about UQ and wants to start doing it? OK. This is one of the real hot topics: if we're talking about doing scientific deep learning and scientific machine learning, we need error bars; that's what makes us scientists. So how do we start doing uncertainty quantification in deep learning and machine learning? That's the topic today. For those of you who already know the Bayesian material, please bear with me, because I'm going to go through some of the basics for people who haven't heard it before.

Neural networks are notoriously famous for being black boxes: we just hit return and we don't know what's going on inside them. So there's always the question of how we know whether the model is making sensible predictions or just guessing at random. People say, "well, it generalizes", but what happens when it fails to generalize? Sometimes you're left completely in the dark, and then you go back and say "I need to do hyperparameter searches", but ultimately you're just guessing. It doesn't feel very scientific to do this type of analysis. For this reason people criticize deep learning for its lack of transparency, its lack of interpretability, its lack of robustness, and its lack of reliability, and I think those are pretty fair criticisms. If we're doing any scientific analysis, then understanding what the model is doing and why it fails, putting some error bars on that model, or at least taking errors on the data and making sure they're passed through the model so we have some sense of the statistical errors on the output, is very, very important.

For those who may not have heard this this morning, a bit of background on what a neural network is. A neural network basically represents a function with many parameters, and it's just a recursive application of linear functions followed by nonlinear functions. We're basically curve fitting when we do deep learning; we just maybe don't want to admit it. I want to introduce the notion of the neural network model as something which can be probabilistic. If we write D for a data set with samples X and labels Y, and call the parameters we're fitting for theta, then our output, the model's estimate y-hat, is just the recursive application of those linear and nonlinear functions. Your network can be as complicated as you want, but ultimately what you're doing is making a point estimate from a probability distribution: the probability of your y given your x and the parameters you have assumed.
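To make that notation concrete, here is a minimal sketch (not from the talk; the layer sizes and names are made up for illustration) of a network as a recursive application of linear and nonlinear functions, producing a point estimate y-hat for one fixed set of parameters theta:

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(z):
        return np.maximum(z, 0.0)

    # theta: the parameters we fit for; here two weight matrices and two biases.
    theta = {
        "W1": rng.normal(size=(4, 16)), "b1": np.zeros(16),
        "W2": rng.normal(size=(16, 1)), "b2": np.zeros(1),
    }

    def f(x, theta):
        """y_hat = f(x; theta): linear, then nonlinear, then linear."""
        h = relu(x @ theta["W1"] + theta["b1"])   # linear followed by nonlinear
        return h @ theta["W2"] + theta["b2"]      # linear output layer

    x = rng.normal(size=(3, 4))   # three samples with four features each
    y_hat = f(x, theta)           # a single point estimate per sample,
                                  # i.e. one value drawn from p(y | x, theta)
    print(y_hat.shape)            # (3, 1)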
So now we are actually doing something where we can talk about Bayesian statistics. If you didn't think you were doing this before when you were doing deep learning, you are: you're just getting a point estimate from that full probability distribution.

So why should we consider using Bayesian methods in deep learning? As I mentioned, there are many drawbacks to deep learning. If you do hyperparameter fitting, which we all do, there are many hyperparameters; they require specific tuning, and it takes large data sets or a huge amount of time and compute to find them. It can feel a little unsatisfactory that we're just doing random or grid searches for parameters without really understanding why the best ones are the best ones. The second point is that a neural network trained with the best hyperparameters is still a point estimate out of all the possible models that are out there; you should account for uncertainty over all of those possible models, or you may be sitting in some pseudo-minimum of the parameter space. There is no uncertainty in these point estimates, and that matters when the outputs of these deep learning models are doing something serious: medical diagnosis, predictions in financial markets, operating self-driving cars. We definitely want error bars on what the model is predicting. It's also very common to just use very large neural networks, fit the data set, and use regularization to prevent overfitting, and in order to do this we need efficient search algorithms, or guesswork, to find the best network architecture. So in the traditional sense it can be very difficult to explain why your deep neural network fails.

Everyone says, "well, I do get probabilities out: I get my softmax." If there's one thing to take away from today, it's that softmax does not tell you about the uncertainty in your model. It is not a statistical and it is not a systematic measure of uncertainty. Your softmax just gives the probabilities for each class given your assumption of a fixed model; there's no uncertainty in the model, and you're not talking about uncertainty in the data either. There are a couple of very famous examples you can point to. Deep neural networks can be easily fooled: in one paper the authors generated garbage images for a network which had very high accuracy in training and generalized very well, and showed what the network guesses for each one. The network thinks one of these is a king penguin, another a starfish, another a baseball, with very high softmax certainties. Another example I really like uses the CIFAR-10 data set, which we all use; we get our students to use it. There are no green apples in CIFAR-10, just green frogs. Train on CIFAR-10, then run the model on the CIFAR-100 data set, and it will tell you with a softmax probability of 0.9 that all the green apple images are green frogs. It is really easy to fool these things, so it's always good to keep in mind what that softmax means.
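For reference, the softmax score being referred to here is just a deterministic function of the network's logits for one fixed set of weights; writing out the standard definition makes clear that nothing in it measures uncertainty in the model or in the data:

    % Softmax over logits z = f(x; \theta) for a single, fixed \theta:
    p(y = i \mid x, \theta) \;=\; \mathrm{softmax}_i\big(f(x;\theta)\big)
      \;=\; \frac{\exp\big(f_i(x;\theta)\big)}{\sum_{j} \exp\big(f_j(x;\theta)\big)}

The scores always sum to one, whatever the input, so a high value reflects only the fixed weights, not confidence about the model or the data.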
So what are Bayesian neural networks? You can think of training the network as an inference problem in which we solve Bayes' theorem. If you're not familiar with Bayes' theorem: it gives the probability of some parameters in a model given the data set, which is your posterior distribution for those parameters given the data. We evaluate it with a likelihood evaluation: the likelihood is the probability of the data given the parameters in the model (there is an assumed model in there when you see the theta), times some prior on those parameters theta, divided by the probability of the data itself. Looking at just the numerator for the moment, this is what we mean when we say we want to maximize the likelihood of the data given the parameters: we move around in parameter space trying to find the optimal model and parameters given some prior information, moving away from the prior to find the posterior distribution.

You can think about the network as doing something similar. A Bayesian neural network is just a neural network where there are now distributions over all of the weights. In a standard neural network you initialize all the weights and biases randomly and they're just point estimates from then on. In a Bayesian neural network, at each node in the network, instead of one fixed number you have a full distribution that you're modeling at each stage, and these distributions can evolve away from their priors over time. So what we get out is not just a point estimate of y-hat; we get a full distribution for y-hat, and we take a weighted average over all these parameters, which can be thought of as an infinite ensemble of neural networks. In theory you no longer even need to do hyperparameter searches over your architecture, because you're automatically marginalizing over those parameters. There's some really nice work by Neal in 1995, and some subsequent papers, showing mathematically that a single-layer, infinitely wide neural network with distributions over the weights is equal to a Gaussian process. So now we're back on very familiar, very solid statistical ground where we understand what's going on: people have worked with Gaussian processes for decades and they're a robust, trusted method. If people distrust neural networks, you can tell them that an infinitely wide neural network is a Gaussian process, and when we don't use an infinitely wide one, we're just making an approximation to a Gaussian process.

In pictures, this is what it looks like: at each step through the network your weights are being sampled from some distribution. [In answer to a question:] No, in that case they showed that a single layer, in the limit of infinite width, is a Gaussian process. With your Bayesian neural network, for each of these weights you have a prior distribution which is actually evolved; it's the distribution that is being fit at each stage. In the beginning you can assume whatever you like for those distributions. We pick a Gaussian because it's what we're familiar with, and we assume something for the mean and standard deviation; then, as we train the network, the distribution itself evolves away from that prior towards the optimal configuration. The prediction, the probability of our y-hat given our data set and our model, is then an integral over all of those parameters that we assumed.
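Written out, the two formulas being described, Bayes' theorem for the weights and the predictive distribution obtained by marginalizing over them, are:

    % Posterior over the network parameters \theta given data D = (X, Y):
    p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

    % Predictive distribution: an average over all parameter settings,
    % weighted by their posterior probability (an "infinite ensemble"):
    p(\hat{y} \mid \hat{x}, D) \;=\; \int p(\hat{y} \mid \hat{x}, \theta)\, p(\theta \mid D)\, d\theta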
This takes care of hyperparameter optimization for us, because we're marginalizing over all of those possible parameters we could have chosen, all of those possible network structures. So how do we do this? There are now many inference methods which allow us to approximately solve for this posterior distribution. These networks can be really big to start with: you're no longer just storing weights and biases, you're following full distributions for each of them and sampling from those distributions at every single stage, while still doing a forward and backward pass through the network to update everything. So you need a different method, and there are lots and lots of inference methods now. A seminal paper here, at a time when people thought you simply couldn't do this at scale computationally, was the paper by Graves, which showed a practical variational inference method that lets us do this at scale and find the posterior distribution. This slide just shows the different inference classes available in TensorFlow Probability; some are familiar, like MCMC methods or Gibbs sampling, but it's variational inference that most people doing Bayesian deep learning use now.

As I mentioned, in this Bayesian approach you automatically marginalize over your hyperparameters, so in theory you don't have to worry about them anymore; they're taken care of. This is laid out in the paper by MacKay, which I recommend, "A Practical Bayesian Framework for Backpropagation Networks"; it's a nice read because it was the first time anyone had pointed out that this could be done. The approach naturally accounts for uncertainty, because you end up with the full posterior distribution for your predictions: you can quote one- or two-sigma confidence intervals and make all those nice plots. It's more robust to overfitting, since you're getting a notion of an average with error bars rather than a point estimate from a fixed model which could be incorrect. It also gives a framework for L1 and L2 regularization: if you ask someone why they do L1 and L2 regularization they'll say "to prevent overfitting", but that's not really an answer; why the square, why that particular form? In the Bayesian framework, L1 and L2 are actually prior choices for the weights, and you can relate them back to continuous distributions. We can also do model comparison with the Bayesian evidence, if you want to play around with the network or try fundamentally different connections.

The big question people always had since MacKay's paper was: how well do they scale, can we do this at large scale? [In answer to a question:] Yes, L1 and L2 are basically regularization terms added onto the loss function. In this Bayesian sense, L1 corresponds to the choice of a Laplace prior on the weights, and L2 corresponds to the choice of a Gaussian prior, and you can see that mathematically if you write it out.
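Concretely (a standard identity, not from the slides): taking the negative log of each prior recovers the familiar penalty terms, which is why the particular mathematical forms of L1 and L2 are what they are:

    % Gaussian prior on a weight w  ->  L2 (squared) penalty:
    p(w) \propto \exp\left(-\frac{w^2}{2\sigma^2}\right)
      \;\Rightarrow\; -\log p(w) = \frac{1}{2\sigma^2}\, w^2 + \text{const}

    % Laplace prior on w  ->  L1 (absolute-value) penalty:
    p(w) \propto \exp\left(-\frac{|w|}{b}\right)
      \;\Rightarrow\; -\log p(w) = \frac{1}{b}\, |w| + \text{const}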
But previously, if you asked most people, they didn't really know why they were doing it; they'd just say "to prevent overfitting", and that's not really an answer. Why not the cube? Why exactly that mathematical form of L1 and L2?

So I'll go on; that is exactly my next point: how well do they scale? Right now you can do this on your laptop with these new variational inference methods; it's possible, so we should be doing it. As for how much longer it takes compared to a standard neural network: a forward and backward pass doesn't take that much longer with the variational inference techniques in the codes I'm going to mention, but the time to solution, the time taken to get to the same level of accuracy, does take a bit longer. I think it's definitely worth it, because now you have error bars on your model. The neural network is now the thing you're using to find the posterior distribution. In a traditional neural network you have all these choices of network architecture; you as the practitioner choose one and run it, and you could choose a bad one. In the Bayesian world, when you run a Bayesian neural network, the network is automatically sampling from that space; you are no longer making that choice, it's marginalizing over that space, so it will automatically learn that a choice was bad, because it's doing variational inference towards the posterior distribution.

There are actually lots of codes out there now for doing this Bayesian deep learning. There's PyMC3; there's Stan, whose drawback (although I think it has been updated recently) is that it can be slow in higher dimensions. The big one was Edward, first developed by Dustin Tran, who later joined Google, and it has since been incorporated into TensorFlow Probability: an upgrade called Edward2 is what's used inside TensorFlow Probability. There's also the skpro machine learning toolbox, Pomegranate, and several others. These codes are being actively developed, and you can run them on your laptop or on our supercomputers; you can run them on Theta today.

I also want to mention something else people may have heard about: using dropout to get uncertainty quantification. This is something you can definitely do as well, and in theory it's identical to a Bayesian neural network, but it is a cheap-and-dirty method, because unless you implement it exactly right it's not going to be the same as a Bayesian neural network. With dropout you have a fully connected network, but there's a probability p of dropping weights from the network at training time: you zero out some of the nodes. The idea is that you're preventing co-adaptation of the nodes (codependence between them), so that the network generalizes better. So how does the expression for your estimate of the model change? For the nodes you're dropping, you sample from a Bernoulli distribution, a 1 or a 0, to keep or drop each node on a given pass, and you can think of these as extra bias or weight terms which need to be tuned.
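One common recipe for turning dropout into an uncertainty estimate, often called Monte Carlo dropout and tied to the dropout-as-Gaussian-process interpretation mentioned below, is to keep the dropout masks active at prediction time and average many stochastic forward passes. A minimal sketch, assuming a toy regression problem (the model size, dropout rates, and variable names are illustrative, not from the talk):

    import numpy as np
    import tensorflow as tf

    # Hypothetical toy regression network with dropout after each hidden layer.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # ... model.fit(x_train, y_train, ...) would go here ...

    x_test = np.random.randn(16, 8).astype("float32")

    # Monte Carlo dropout: training=True keeps the Bernoulli masks switched on at
    # prediction time, so each call samples a different thinned network.
    T = 100
    samples = np.stack([model(x_test, training=True).numpy() for _ in range(T)])

    y_mean = samples.mean(axis=0)   # predictive mean
    y_std = samples.std(axis=0)     # a rough predictive uncertainty, the "error bar"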
But again, you need to choose that dropout probability, and only in very particular cases is this going to be equivalent to a Bayesian neural network. The dropout network is also a Gaussian process approximation, in the same way that the Bayesian neural network is, and there have been studies of what the optimal dropout rates are: one interesting paper found that a retention (keep) probability of around 0.8 makes sense for the input layer, versus 0.5 for the hidden layers. That makes sense, because for the very first layer you want to retain as much of the information in the samples going into the network as possible, and then reduce codependence of the hidden units as you move deeper into the network.

So how does dropout compare to Bayesian neural networks? Dropout can be interpreted as averaging many models with shared weights, but in the dropout case each of those models is equally weighted, which is not the correct assumption, whereas a Bayesian neural network does the proper averaging over all of those distributions. So with dropout your uncertainty quantification is still slightly unrealistic. Using dropout does have a big advantage, though: it's much, much faster than a Bayesian neural network, which is why a lot of people do it, and you barely have to change your existing setup. As we said, the Bayesian neural network can be slow to train and difficult to scale, but it marginalizes over your hyperparameters.

I want to show a quick example. Everyone is familiar with the MNIST data set of handwritten digits, so we can play around with this; for this example I'm using TensorFlow Probability. Take one of these digits: is this a three or a five? For me it isn't clear, and I would hope that my network would be uncertain as well. This is just a standard convolutional neural network where you can make each of the convolutional layers and each of the fully connected layers Bayesian, so now you're sampling from probability distributions, and what you get out as the network learns are no longer point estimates but full distributions. At an early iteration these are the posterior distributions for the probability of the image being each digit, given the input: you can see it's not quite certain, something is looking like it could be a six, it's not quite clear; this is early in training. Then, once it has converged, at iteration 6,000, you can see that each of the digits apart from 3 and 5 is peaked around zero, but for 3 and 5 you see this nice distribution of uncertainty, which definitely coincides with our own intuition about this image. Zooming in on those two: this is what you get out of a Bayesian neural network, the probability of the image being a 3 given the input and the assumed model you worked with, and similarly for 5. We can do this for other digits and play around; it's kind of fun.
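The talk doesn't show the model code itself, but a sketch of this kind of "make the layers Bayesian" MNIST classifier, along the lines of the TensorFlow Probability examples and using its Flipout layers (the architecture, layer sizes, and training-set size below are illustrative assumptions):

    import tensorflow as tf
    import tensorflow_probability as tfp

    tfd = tfp.distributions
    NUM_TRAIN = 60000  # assumed MNIST training-set size, used to scale the KL term

    # Scale each layer's KL(posterior || prior) by 1/N so the total loss,
    # summed over batches, approximates the negative evidence lower bound.
    kl_div_fn = (lambda q, p, _: tfd.kl_divergence(q, p) /
                 tf.cast(NUM_TRAIN, dtype=tf.float32))

    # A small CNN where the convolutional and dense layers are "made Bayesian":
    # Flipout layers learn a distribution over their kernels, not a point estimate.
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28, 1)),
        tfp.layers.Convolution2DFlipout(16, kernel_size=5, padding="SAME",
                                        activation=tf.nn.relu,
                                        kernel_divergence_fn=kl_div_fn),
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tfp.layers.Convolution2DFlipout(32, kernel_size=5, padding="SAME",
                                        activation=tf.nn.relu,
                                        kernel_divergence_fn=kl_div_fn),
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        tfp.layers.DenseFlipout(128, activation=tf.nn.relu,
                                kernel_divergence_fn=kl_div_fn),
        tfp.layers.DenseFlipout(10, kernel_divergence_fn=kl_div_fn),  # logits
    ])

    # Keras adds the layers' (already 1/N-scaled) KL terms from model.losses to
    # this cross-entropy, so the minimized quantity is log-likelihood plus KL.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    # model.fit(x_train, y_train, epochs=..., batch_size=...)

    # Because the weights are sampled on every call, repeated forward passes on
    # one image give a distribution of class probabilities rather than a point:
    # probs = tf.stack([tf.nn.softmax(model(one_image)) for _ in range(50)])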
One of the things you can output is what the actual distributions for the weights look like: the weight means and the standard deviations. It wouldn't be worthwhile to look at the distributions of all of the weights individually, so we summarize them as the distribution of the weight means and of the standard deviations, at an early iteration and then as training evolves. You can see in this example that the mean doesn't actually change that much away from the prior, but the standard deviation does. For the exact same image, if we were to use a standard neural network and look at the softmax outputs, it would tell us with probability one that this image is a 5; you don't see the uncertainty between it being a 3 or a 5. So I want to drill that home: softmax is not telling you anything about statistical or model uncertainty.

I do have a couple of notebooks that I was hoping people could work through if they're interested, so I'll end the presentation here, we'll go to the notebooks, and if people have questions while working on them, just let me know.

[Audience question about what a Bayesian network would do with out-of-distribution inputs, like the green apples.] You definitely wouldn't see that probability of 0.9 with a Bayesian network. If you suspect that the data set you're training on is not fully representative, one thing that's always a good idea for classification is to include a class of "unknown"; then, in the Bayesian sense, the network will put some appreciable probability on an out-of-distribution image belonging to the unknown class. That's always good practice. Otherwise you will basically just see very broad, low probability distributions; you will not see something like a delta function peaked at 0.9, and that will indicate to you that it doesn't know what those green apples are. But I think it's always a good idea, if you're doing classification, to include that unknown class, because it allows for your data set being biased, which it probably is in some way.

[Audience question about propagating errors.] There are a couple of ways. Are you talking about statistical or systematic errors here: errors on the samples put into the network, propagated through the network? OK. The reason I ask is that you could also think about propagation of uncertainty with multiple calls to the network and how that propagates through. With the Bayesian neural network you can actually measure what I'm calling the statistical error: if you have a delta-x, how do you propagate it through to a delta-y? You're using the network to fit for the posterior distribution, and there are parameters in that posterior distribution. Say it's a Gaussian, or we assume a Gaussian in the beginning; then we're fitting for its mean and standard deviation, and the standard deviation is our delta-y. If we have some understanding of the delta-x, some functional form we know or can assume, then we can have the network fit for that delta-y, and in that sense we have allowed the delta-x to propagate through the network and give us a delta-y. Does that make sense? But that delta-y is made up of both the uncertainty in the data set X and the uncertainty from the model; those are the two sources of uncertainty.
[Audience question about the Bayesian network being an infinite ensemble.] Right, in principle, even if you're just looking at multilayer perceptrons, you get an infinite number of models. Think about a single point in a normal neural network: it's just a point estimate of the weight there, and a layer with a certain number of nodes is just a set of point estimates of the weights. You can recast that: instead of those discrete point estimates at a single layer, I'm sampling from a distribution as an approximation to it, because now you have a continuous distribution; it's essentially the same picture. Say we have a single-layer network with ten neurons. At every point you randomly initialize those weights, and randomly initializing those weights actually means you're sampling from a distribution; you're sampling ten random variables. Maybe you use TensorFlow and TensorFlow is doing this for you; do you even know what distribution it's sampling from? It's probably sampling from a uniform distribution. With Bayesian learning we take control of that. You could say that's equivalent to a single node where I sample from a distribution, maybe ten times; the distribution I'm sampling from is still a continuous distribution, but now I get to choose. I get to say it's, for example, a Gaussian in the beginning, and then I allow the network to distort that distribution away from the prior towards an optimal distribution. Does that make sense? So in a standard neural network you have these discrete points, whereas in the Bayesian neural network you're actually sampling from continuous distributions. You can think about the forward and backward passes through the network: you're sampling from the distribution, but that distribution is still infinite, so you have far more resources available there than in the discrete network with just ten neurons at each stage. In the discrete network with ten neurons there are ten points being estimated; in the Bayesian neural network I'm drawing from an infinite, continuous distribution.

[On an earlier plot:] I think I put the distribution of softmax probabilities on the bottom there; it's a concatenation just for the prediction of that number, so those are probabilities.

I want to show a little bit of the notebooks for people who are interested. You can run them in Colab; that's probably the easiest way to do it. I added two notebooks just summarizing what we can do with the TensorFlow Probability package; in the time we have there wouldn't be enough to show you a full implementation, so it's up to you to play around with these notebooks if you're interested. They give you a very basic sense of what we can now do with TensorFlow Probability: basically, you can do all the usual tensor operations on probability distribution functions, and there are lots of distributions to play around with.
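As a flavour of what that looks like, a sketch along the lines of the first notebook (the particular means and standard deviations are just illustrative):

    import tensorflow_probability as tfp
    tfd = tfp.distributions

    # A single normal distribution: sample from it and evaluate log-probabilities.
    normal = tfd.Normal(loc=0.0, scale=1.0)
    draws = normal.sample(1000)            # 1000 samples, e.g. for a histogram
    print(normal.log_prob(0.5))

    # Distributions can have shapes: a batch of three independent normals.
    # The three locs are given explicitly; the single scale is broadcast.
    batch = tfd.Normal(loc=[-1.0, 0.0, 2.0], scale=1.0)
    print(batch.sample(5).shape)           # (5, 3): 5 draws from each of the 3 normals
    print(batch.log_prob([0.0, 0.0, 0.0])) # three log-probs, one per distribution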
They should be listed under the title of uncertainty quantification and Bayesian learning, and there are two notebooks. The first is the introduction to probabilistic learning. In this one, shown here, we've imported TensorFlow Probability as tfp and its distributions module as tfd, and you can do things like: I want a normal distribution with some mean and some standard deviation, then sample from it and make some plots. We can do that with lots of packages, but the nice thing here is that you can have distributions with different shapes; you have tensors. Here we create a batch of three normals: these are the means for each of them, and the single standard deviation is broadcast across all three. You can sample from each of them, so when you call sample you're actually taking samples from three independent normal distributions, and when you ask for the log probability of three numbers it broadcasts too. Then you can just play around and look at what they've provided; they also have multivariate normals. You can see how, when TensorFlow Probability is actually constructing the network, all of this is hidden from you, but this is what's going on underneath: this kind of sampling. In this notebook I also have an example using Gaussian processes with TensorFlow Probability, how you can sample from one and use it to fit data, for people who are interested in Gaussian process work.

If you want to see an example with a neural network, it's in the other notebook, called probabilistic layers with regression. In this example I walk you through different levels of uncertainty. We generate a data set, the blue points here, and this is the model. In the very first instance there's no uncertainty: I just have a dense layer, and then a distribution layer where I'm fitting only for the mean and drawing from a normal distribution with a fixed standard deviation. Because the dense layer is non-probabilistic, the mean is completely deterministic, so there's no uncertainty in this first cell; this is what we get out of a standard neural network, the data points and the mean of the model it predicts. The next thing we can do is make the standard deviation something we fit for as well. This is the notion of statistical uncertainty: I'm saying I think the delta-y of my model has some dependence on the delta-x, since there's clearly a scatter of the values around the model, so how do I incorporate that? Now I'm no longer fitting just for the mean, I'm also fitting for the standard deviation, so the second-to-last dense layer has to output two variables. What we get out of this semi-Bayesian model are standard-deviation error bars: we're getting a posterior distribution rather than a point estimate, and we have error bars on our data.
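A sketch of that "fit the mean and the standard deviation" step, in the style of the TensorFlow Probability probabilistic-layers regression tutorial (the scaling constants and optimizer settings are illustrative):

    import tensorflow as tf
    import tensorflow_probability as tfp
    tfd = tfp.distributions

    # The Dense layer outputs two numbers per sample: one feeds the mean of a
    # Normal, the other (through softplus, to keep it positive) its standard
    # deviation. The model's output is therefore a distribution, not a number.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1 + 1),
        tfp.layers.DistributionLambda(
            lambda t: tfd.Normal(loc=t[..., :1],
                                 scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
    ])

    # Train by maximizing the likelihood of the data under that output distribution.
    negloglik = lambda y, rv_y: -rv_y.log_prob(y)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=negloglik)
    # model.fit(x_train, y_train, epochs=1000, verbose=False)

    # After training, model(x_test) is a Normal whose .mean() is the fit and whose
    # .stddev() gives the "delta-y" error bars that track the scatter in the data.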
Then how do we incorporate model uncertainty? For that we need to actually change the model itself. This is the important part: we change the second-to-last layer to be a DenseVariational layer, and now each of the weights is a distribution we're drawing from, with some mean, some standard deviation, and some KL terms in the loss. I know this is a lot to throw at people, so I just wanted to give you these notebooks so you can look at them and play around with them if you want. Once we change the network this way, we get some sense of "what if my model changes?", which, like you asked earlier, is in effect "what if my network changes?", because now you're actually seeing different models, different predictions for what that y is.

[Audience question about which stage of training this happens at.] Are you familiar with variational inference? Variational inference is a technique where, in the beginning, you assume (you guess) some distribution, and then that distribution is evolved. It's still doing stochastic gradient descent, but now it's a distribution being evolved away from the initial guess in order to minimize an alternative loss function. The loss function is no longer something like a mean squared error; it goes back to the Bayesian framework, and is now a log-likelihood term plus a Kullback-Leibler term, the KL term. Mathematically it's a restructuring of the loss function minimized at each step, because we're now dealing with distributions rather than point estimates. As for how you parameterize a distribution, that depends on the distribution you choose: if you choose a normal distribution, it's completely described by a mean and a variance.

[Audience question about the role of the prior.] This whole technique, variational inference, rests on assuming a prior, but in the method you evolve away from your prior towards your posterior distribution. If it doesn't evolve away from the prior, you've done something wrong, but that's true of any method, like MCMC. The prior shouldn't have any impact on your results; it just introduces prior knowledge so that you start off at a reasonable point in parameter space. And the prior choice here has nothing to do with the number of samples: the prior choice is your choice of prior for the weights and biases, so it really shouldn't matter. You could think about introducing domain knowledge into the network this way: if you know that a certain part of the network is doing feature extraction, you can impose some kind of physical prior on that part of the network, or if you know that a particular node's output should obey some very physical constraint, then you can constrain that node with the prior information. But ultimately, when you're running these networks, you always need to check: output what the prior distribution was and what posterior distribution you get for each of those weights, and make sure they are not the same.
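Putting that last step together, a sketch of the DenseVariational version in the style of the TensorFlow Probability regression tutorial (the mean-field posterior and trainable prior below are the tutorial's standard choices; the data-set size and other constants are illustrative assumptions):

    import numpy as np
    import tensorflow as tf
    import tensorflow_probability as tfp
    tfd = tfp.distributions

    # Variational posterior over a layer's weights: an independent Normal per
    # weight, with trainable means and (softplus-transformed) standard deviations.
    def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
        n = kernel_size + bias_size
        c = np.log(np.expm1(1.0))
        return tf.keras.Sequential([
            tfp.layers.VariableLayer(2 * n, dtype=dtype),
            tfp.layers.DistributionLambda(lambda t: tfd.Independent(
                tfd.Normal(loc=t[..., :n],
                           scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
                reinterpreted_batch_ndims=1)),
        ])

    # Prior over the weights: here a unit-scale Normal with a trainable mean.
    def prior_trainable(kernel_size, bias_size=0, dtype=None):
        n = kernel_size + bias_size
        return tf.keras.Sequential([
            tfp.layers.VariableLayer(n, dtype=dtype),
            tfp.layers.DistributionLambda(lambda t: tfd.Independent(
                tfd.Normal(loc=t, scale=1.0),
                reinterpreted_batch_ndims=1)),
        ])

    num_points = 150  # assumed size of the toy training set, scales the KL term

    model = tf.keras.Sequential([
        # Weights are distributions now; the KL(posterior || prior) term is added
        # to the loss automatically, weighted by 1/num_points.
        tfp.layers.DenseVariational(1 + 1, posterior_mean_field, prior_trainable,
                                    kl_weight=1.0 / num_points),
        tfp.layers.DistributionLambda(
            lambda t: tfd.Normal(loc=t[..., :1],
                                 scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
    ])

    negloglik = lambda y, rv_y: -rv_y.log_prob(y)  # loss = -log likelihood (+ KL)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=negloglik)
    # model.fit(x_train, y_train, epochs=1000, verbose=False)

    # Each call to model(x_test) samples a different set of weights, i.e. a
    # different plausible model, so repeated calls trace out the model uncertainty.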
Now, I have seen examples where certain points in the network don't evolve away from the prior distribution at all, or where they evolve towards almost a delta function. These are very interesting points, because it means that that node in the network doesn't actually need to be probabilistic (it can be deterministic), and it's a sign of the network deciding what the most efficient way to minimize the loss is.

OK, so I'm going to move on to the next topic, if that's okay, Martha.
Info
Channel: Argonne National Laboratory Training
Views: 4,786
Rating: 5 out of 5
Keywords: Elise Jennings, Argonne, Uncertainty Quantification, Deep Learning, Supercomputing, ATPESC 2019
Id: Puc_ujh5QZs
Length: 39min 1sec (2341 seconds)
Published: Wed Nov 06 2019