Eric J. Ma - An Attempt At Demystifying Bayesian Deep Learning

Captions
Thank you everybody for coming. The title of the talk, I think, induces a lot of hype, but keep in mind that the second word in the title is the operative word: this is an attempt at demystifying Bayesian deep learning. My name is Eric, and I work at Novartis; I just joined recently, having just defended my PhD, so I'm very happy about that.

If you want to follow along, you can scan the QR code; it will bring you to mobile-friendly HTML slides. If you haven't scanned it, don't worry, there is a link to the slides as well, and it's also the last thing I tweeted on Twitter. If someone comes in late and needs to find the slides, the watermark in the bottom left-hand corner of every slide shows the GitHub repository where they can be found.

Like I was mentioning, this title is very hype-y. I recently saw a tweet about the hype cycle of deep learning: you write a post, or in this case give a talk, with "Bayesian" and "deep learning" in the title; you fill a room with people who want to see it, or get tens of thousands of clicks; and then point number four happens: you show all this math with crazy Greek symbols, and everyone leaves. That's what happens with a lot of deep learning talks. My mission is to solve point four. I want us to talk about what deep learning and Bayesian deep learning are without all the Greek, without all the math, and in fact with a lot of pretty pictures.

This is a talk about Bayesian stats, so there's always the obligatory neon Bayes' rule sign. Bayes' rule, from the Reverend Thomas Bayes, is a way of inferring parameters given the underlying data, and it lives as a neon sign in someone's office. That's as much as I'm going to say about Bayes' rule, because this is the obligatory Bayes' rule slide.

My modest goals today are as follows: I'm going to demystify what deep learning is. How many of you have actually coded up a neural network from scratch, with backpropagation and the like? Okay, so for the rest of you, this talk is for you. How many of you have coded up a Bayesian model of some sort, with, say, a probabilistic programming language? All right, this talk is also for you. So I do have a purpose for this talk after all; I gave it as a rehearsal at the Boston Bayesians meetup, everyone there raised their hand, and I wondered what exactly I had to offer. The goal here is to explain the intuition with minimal jargon.

My take-home points are twofold. Firstly, deep learning is nothing more than compositions of functions on matrices and their operations. If that sounds like jargon, don't worry, it will become clear. Secondly, Bayesian deep learning is grounded on learning probability distributions over our parameters rather than point estimates. That's it.

So how do we build towards an intuition of what deep learning is, and of what Bayesian deep learning is? To do that, I'm going to start with the simplest machine learning algorithm, one that everybody has already touched as long as you've done high school physics: linear regression.
How many of you were at the BuzzFeed Tasty talk? You know how Tasty puts out those videos where you have cheese puffs three ways or something? So: linear regression, I'm going to show it to you in three different forms. Then I'm going to show you logistic regression in three different forms, and if you follow the parallels, you will see deep nets in three different forms and you'll get exactly what deep learning is all about. Then I'm going to talk about what it means to go Bayesian, again following that parallel, and finally I'm going to show you some example code for how to build a probabilistic neural network in PyMC3. Ready? Let's go.

Linear regression, high school physics: y is some linear function of x. In math class it was probably taught as y = mx + c, or ax + b, or something like that. That's basically what linear regression is; we've all seen this equation before. In the event where you have not just a single x but, say, two, three, or nine columns of x's, then you'll have an appropriately dimensioned weights vector, and in matrix form it looks like this. To give you a quick guide through these visuals: I have my data with four columns, that's X; the ellipsis inside the boxes indicates that there are multiple rows, not just a single row. I have my weights vector, appropriately dimensioned; it's colored because it's a parameter that we have to learn. The bias term, the intercept, is also a parameter that we have to learn. When we do this linear combination, the output is a single scalar value. That's linear regression in matrix form. There's a third way to look at it: as a neural diagram. The first part is the matrix-vector multiplication, which gives a single-dimensioned output; then we add the bias term, which is also something we're learning, and the output is our y. So that's linear regression shown three ways. Do we get the Tasty feel over here? Good.

So what is logistic regression? Logistic regression is nothing more than taking the linear function and tacking one more functional transform onto the output: I take the intermediate output of the linear regression transformation and apply a squashing transformation to squish all of the output values into the range 0 to 1. In matrix form it looks like this: I do my linear regression thing and then apply a single transform across all of the elements of the output, and I get my output. As a neural diagram, all we've done is tack on this other functional transform in orange, which is g. So logistic regression, shown three ways, is nothing more than that.
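To make those matrix forms concrete, here is a minimal NumPy sketch of both models. This is my own illustration, not code from the slides; the array names, shapes, and the choice of the logistic sigmoid as the squashing function g are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # data: 100 rows, 4 columns of x's
w = rng.normal(size=(4,))          # weights vector, a parameter to be learned
b = 0.0                            # bias (intercept) term, also learned

# Linear regression: a linear combination of the columns plus a bias.
y_linear = X @ w + b               # shape (100,), one scalar output per row

# Logistic regression: the same thing with a squashing transform tacked on,
# so every output lands between 0 and 1.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_logistic = sigmoid(X @ w + b)    # shape (100,), values in (0, 1)
```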
So then what about deep neural nets? If you look at what a deep neural net looks like in its functional form, you'll think Greek: lots of symbols, lots of "oh no, what's going on?" Well, let's break it down. I have my input data, with lots of dimensions. I have weights and biases that are appropriately dimensioned, which I need to learn. I do a matrix-matrix operation, then I apply a non-linearity, and I get some intermediate output. Then I take that intermediate output, do the same thing to it, apply another non-linearity, and I get my final output. That's essentially what deep neural nets are doing. In this particular case I'm talking about the feed-forward net; I'm not talking about the special architectures like convolutional or recurrent neural networks. But this is basically what deep learning is. (How many of you work with investors? Keep this in mind the next time a startup comes to you asking for money about "deep learning." And sorry about the phone ringing somewhere in the room; it's not mine, and I don't care if it's Satya calling, let's just move on.)

In matrix form, that's what's happening: the exact same thing represented a different way. I have my data with four columns of x's, and my appropriately dimensioned weights, biases, and functional transforms. And when you look at it as a neural diagram, you get the weight-matrix multiplications and a computational graph pushing everything all the way to the output. Note that in the majority of cases people will not show the blue and the orange lines separately; they squish everything together, and for the longest time that really confused me, which is why I've elected to be super explicit here about every single transformation that's going on. So, shown three ways, that's deep neural nets.

So what exactly do we mean by going Bayesian? Like I was mentioning earlier, in vanilla neural nets, or vanilla linear and logistic regression, we learn a single estimate of each parameter, the one that best explains what the data look like. When we go Bayesian, we treat that parameter not as a single value to be estimated but as a random variable, with other parameters to be estimated, say a mean and a standard deviation. What we're really trying to do is learn the probability density over parameter space.

Here's the intuition for Bayesian linear regression. In the non-Bayesian case, I have the one slope and the one intercept that best explain the data. When we go Bayesian, all we're saying is: I don't want just that one slope; I want the family of slopes and the family of intercepts that help explain how my data came about. That's the intuition behind Bayesian linear regression: we go from single point estimates, indicated by shaded squares, to probability distributions over every parameter in the model. One thing that's really cool is that the moment you treat any of your parameters as random variables to be estimated, you get an expectation and a variance, and that uncertainty gets propagated down to the final output as well. In theory I could have treated just one of the parameters as a random variable, if, say, that was the thing I was most interested in, but it's usually most informative to treat everything as a random variable.
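As a concrete illustration of that intuition, here is a minimal Bayesian linear regression in PyMC3, where the slope and intercept get distributions rather than point estimates. This is my own sketch, not code from the talk; the placeholder data and the particular priors are assumptions.

```python
import numpy as np
import pymc3 as pm

# Placeholder data: a noisy straight line.
x = np.linspace(0, 1, 50)
y = 2.5 * x + 1.0 + np.random.normal(scale=0.3, size=50)

with pm.Model() as bayes_linreg:
    slope = pm.Normal("slope", mu=0, sd=10)          # a distribution over slopes
    intercept = pm.Normal("intercept", mu=0, sd=10)  # and over intercepts
    noise = pm.HalfNormal("noise", sd=1)

    mu = slope * x + intercept                       # the same linear function as before
    pm.Normal("obs", mu=mu, sd=noise, observed=y)

    # The posterior is a family of slopes and intercepts, not a single pair.
    trace = pm.sample(1000, tune=1000)
```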
With Bayesian logistic regression, it's the same thing: we go from this to that. And here's the intuition behind Bayesian deep nets: we go from this to that. How good are we with Bayesian deep nets now? Everyone can do it, right? Like I was saying, because we have uncertainty in the parameters, all of that uncertainty gets propagated through to the end. That's actually the power of a probabilistic programming language: we can intuitively specify our model with random variables as our parameters and get, essentially for free, an easy way to propagate the uncertainty to the end. I'm going to show you that in a bit, but in the meantime you can take a look at this cheat sheet; it exists in the GitHub repository, so you don't have to raise your cameras right now. It shows the parallels between the functional, matrix, and computational-graph forms of each of these three machine learning algorithms: linear regression, logistic regression, and deep nets.

One thing I've also shown there is what's happening with respect to the Gaussian (or other) distributions we take as our priors. This column is about what makes a rational, good prior for our parameters. Up here we might say a wide, uncertain distribution is a good prior for logistic and linear regression. Down here, for deep nets, we're already assuming that we've unit-scaled our data, that is, centered it on 0 and made sure its standard deviation is 1. And when we initialize our parameters in the non-Bayesian setting, that is, each of these colored boxes, we usually take a single number drawn from the unit Gaussian as the initial value, which we then train and optimize towards the best-performing value. So why not just use unit Gaussians everywhere as the priors on all of those parameters? It's a fairly rational choice, and pragmatically speaking it also works. On the other side, in the final column, the intuition is that we take our priors, which are basically flat or unit Gaussians, and shift them around so that the shifted Gaussians best explain the data. That's analogous to shifting a point estimate around, which is what we do in the non-Bayesian framework. That's really the intuition of what's going on when we train a Bayesian neural network.

Okay, so let me talk a little bit about an example coded up using PyMC3. It's a probabilistic programming language, and every probabilistic programming language has to provide three things. Firstly, it has to provide a family of statistical distributions you can work with, so at minimum the Normal, the Student's t, the Cauchy, the discrete distributions, and the like. Secondly, it should do all the fancy math for lazy programmers: it has the sampling algorithms built in, so that you and I don't have to code up our own MCMC sampler. The third thing it should provide is a really nice API, or syntax, for specifying these probabilistic models, and I can't emphasize enough how important that last point about syntax is.
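To give a flavor of those three provisions together, here is a tiny, made-up PyMC3 example; the model and the `data` array are placeholders of my own, not something from the talk.

```python
import numpy as np
import pymc3 as pm

data = np.random.normal(loc=1.0, scale=2.0, size=200)  # placeholder observations

with pm.Model() as model:               # (3) the syntax: a plain Python context manager
    mu = pm.Normal("mu", mu=0, sd=10)   # (1) statistical distributions used as priors
    sd = pm.HalfNormal("sd", sd=5)
    pm.Normal("obs", mu=mu, sd=sd, observed=data)

    trace = pm.sample(1000, tune=1000)  # (2) the fancy math: a built-in MCMC sampler
```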
If you were at Nicole Carlson's talk downstairs, she talked about wrapping PyMC3 inside a scikit-learn estimator. That's a good thing; be like Nicole. You do not want to go around inventing new APIs. Likewise, the API provided by PyMC3 is a really intuitive one for specifying Bayesian models, as you'll see in a moment. Don't be like some companies that open-source a new API that doesn't match what we know is the intuitive way to specify a probabilistic model; I'm not going to name that company.

So how are we going to do this? I'm going to show you an example from a multi-class classification problem. There's a data set in the UCI Machine Learning Repository called the covertype data set. Without showing you what the data look like, I'll tell you the task at hand: I have 66 cartographic variables, and I want to output one of seven different forest cover types. You can imagine setting up a multinomial logistic regression or the like to solve this, but one thing that was really cool about the original authors' intent behind this data set is the question: can we use a neural net to make these predictions? The answer is yes, otherwise I wouldn't be speaking here. The network architecture I've specified, again in this very explicit format, is one with weight matrices and functional transforms but without the bias terms, so you'll notice the blue lines are not present here. It has two hidden layers with 20 nodes each, and it predicts a probability distribution over the seven classes.

This is what the code looks like; I'm going to zoom in a little bit. PyMC3 is written on top of Theano. The unfortunate thing is that Theano has been discontinued, so the PyMC devs on the PyMC Discourse forum have been discussing which back-end to move to next. The back-end is not so important for the purposes of this talk, we can talk about it afterwards; what's more important is the API and the syntax. This is really cool: whenever you deal with a probabilistic model in PyMC3, you're only ever touching Python objects. That is super good because it's super Pythonic and follows Python idioms, so don't go around inventing new APIs that don't follow Python idioms. I can initialize my weights: weights 1, weights 2, weights output; notice there are three weight matrices that I need to estimate. These Python objects also behave like matrices, which is another really nice thing about the way the PyMC devs have set up the API. In addition, not only do I have Python objects as my distributions, I can do math with these Python objects, such as the matrix dot product and the tanh transformation, so I can specify a single layer with a very nice and simple syntax: one layer, two layers, three layers, and finally the output. The output is modeled as a multinomial, and there's a distribution for that, just like there's an app for that. Finally, all the fancy math gets done for us lazy programmers in the inference step, with this thing called automatic differentiation variational inference (ADVI). I'm not going to explain the underpinning mechanisms right here; we can talk about that later.
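His actual notebook lives in the Bayesian analysis recipes repository; the sketch below is only my reconstruction of the kind of model he describes, to make the description concrete. The variable names, the placeholder data, the use of a Categorical likelihood for the one-of-seven output, and the ADVI settings are all my assumptions.

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Placeholder arrays standing in for the unit-scaled covertype data:
# 66 cartographic features, integer labels in {0, ..., 6}.
X_train = np.random.randn(2000, 66)
y_train = np.random.randint(0, 7, size=2000)

n_hidden = 20

with pm.Model() as bayesian_nn:
    # Weight matrices with unit-Gaussian priors; no bias terms, as in the talk.
    w1 = pm.Normal("w1", mu=0, sd=1, shape=(66, n_hidden))
    w2 = pm.Normal("w2", mu=0, sd=1, shape=(n_hidden, n_hidden))
    w_out = pm.Normal("w_out", mu=0, sd=1, shape=(n_hidden, 7))

    # The distributions behave like matrices: dot products and tanh transforms
    # compose layer by layer -- the "compositions of functions" idea.
    a1 = pm.math.tanh(pm.math.dot(X_train, w1))
    a2 = pm.math.tanh(pm.math.dot(a1, w2))
    probs = tt.nnet.softmax(pm.math.dot(a2, w_out))

    # One of seven classes per row; Categorical is one way to express that output.
    pm.Categorical("obs", p=probs, observed=y_train)

    # The inference step: ADVI does the "shifting of Gaussians" for us.
    approx = pm.fit(n=30000, method="advi")
    trace = approx.sample(2000)
```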
What's really cool is that it does the shifting of Gaussians for you, and once it's done we can sample from the posterior distributions and inspect what our model has learned. Remember I said it shifts Gaussians around: when we initialize with unit Gaussians, everything is supposed to be centered around 0 with roughly standard deviation 1. Now we've got distributions that are shifted to the negative and to the positive side. That's just the first layer of weights; the second layer of weights looks like that, and the third layer of weights looks like that. That's a great sign that our model is learning something. I've actually done training tasks where my Gaussians never moved, which was kind of sad for those projects; thankfully for this one you get to see the results.

If we look at the point estimates, that is, the class with the highest probability mass attached to it, it could be 0.4 for class 1 and 0.1 for each of the other six classes, summing up to 1.0, and we just take the maximum-probability class. This is what we get. I've ordered my 20,000-odd samples in such a way that the plot is basically block diagonal: the first block is class 1, the second block is class 2, and so on. Where it's yellow in the right block on the diagonal, the predictions were right; where it's not yellow on the block diagonal, the predictions are off. So a neural network is capable of making good predictions. But keep in mind we're using a probabilistic programming language, which gives us more than just the point estimate; we can use the probabilistic estimate. Rather than plotting only the single highest value, we can discern what all the probabilities are and plot them, and this is what that looks like. But even this doesn't tell us how certain or uncertain we are about a particular prediction. That's where the sampling really comes in handy: I can take this (I'm happy to share the code with you later), figure out what the uncertainty in the prediction is, quantify it, and plot it.

So what's really cool is this: take a look at this probabilistic estimate and look closely at class seven. It has really high probability mass associated with it; we're pretty sure it's not class six, we're sure it's not class four, and there's some probability mass associated with class one. If you look at the uncertainties surrounding that, they tell a slightly different story: we're actually not so sure whether some samples are class two or class one, and we're less clear about our certainty in the classification of this block of class-seven samples. In addition, if you look at classes one and two, it looked like we were able to distinguish between them, but the uncertainties tell us we shouldn't be so confident about our class predictions when it comes to differentiating class one and class two. This slide is the probability mass associated with each class; this slide is the quantified uncertainty, for which I've essentially taken the standard deviation of the posterior samples. So that was a worked example, and I hope it's turned your world upside down slightly. The point here is that the uncertainty can really matter.
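As a sketch of that last step (again my own illustration, not his shared code, and matching the assumed variable names from the model sketch above), one way to get both the probabilistic estimate and its uncertainty is to run the network forward once per posterior draw and take the mean and standard deviation across draws; in PyMC3 one could also lean on pm.sample_ppc for the forward passes.

```python
import numpy as np

def tanh_forward(X, w1, w2, w_out):
    """Deterministic forward pass mirroring the (assumed) network above."""
    a1 = np.tanh(X @ w1)
    a2 = np.tanh(a1 @ w2)
    logits = a2 @ w_out
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # per-row class probabilities

def predict_with_uncertainty(trace, X):
    # One forward pass per posterior draw of the weights.
    probs = np.stack([
        tanh_forward(X, w1, w2, w_out)
        for w1, w2, w_out in zip(trace["w1"], trace["w2"], trace["w_out"])
    ])                                              # shape: (n_draws, n_rows, n_classes)
    mean_probs = probs.mean(axis=0)                 # the probabilistic estimate
    uncertainty = probs.std(axis=0)                 # spread across posterior draws
    point_estimate = mean_probs.argmax(axis=1)      # the single highest-probability class
    return point_estimate, mean_probs, uncertainty
```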
Now, there are some places where the uncertainty shouldn't matter, for instance real-time applications streaming in milliseconds at a time. But it should matter for life-and-death decisions like drug manufacturing; I'm part of a drug company now, so this sort of stuff is really important for us. So that's an example where the uncertainty can tell us a slightly different, more nuanced story than we were thinking of at the beginning.

Let's look back. This is the first take-home point: deep learning is nothing more than compositions of functions on matrices. And this is the second take-home point: Bayesian deep learning is nothing more than learning a probability distribution on each of those parameters. I've shown you how that happens with PyMC3 code. If you're interested in reading more, take a look at the PyMC3 docs, and listen to Nicole's talk; she has a really good talk on how to make your PyMC3 models reusable, so go watch it online if you missed it. This example exists in the Bayesian analysis recipes repository; you can click the link in the slides and check it out for yourself. I want to thank some people who taught me things in ways direct and indirect; their names are up there. And I want to recommend a few people to talk to, or, in the case of David MacKay, who has passed away, go look up the resources he left behind for us; they are a wonderful, valuable resource to learn from. With that, thank you for your time. I'm happy to take questions.

First question: have I looked at ways of harnessing the uncertainty to train a better model? Definitely. One way you can imagine doing this is by taking the data from class one and class two and training a separate classifier for that particular case, because this evidence bears out that when training across all seven classes we're not really able to distinguish them. Maybe we can find either a different training scheme or more data to help us distinguish between those two classes. Or, and this is cartography, I'm from biomed so I don't really know geography very well, it could be that cartographic variables are simply not good enough to discriminate between two really similar forest cover types, or two cover types that are distinct but happen to show up in the same cartographic regions.

Next, a question about how this compares to training the same network in Keras, whether it's twenty times more expensive. If you were to train the same neural network architecture in Keras, it would be slightly different because it has the bias terms built in, but it trains really fast. This thing took about forty-odd minutes to train on 20,000 data points; my guesstimate is that Keras would finish training in something on the order of five to ten minutes. That said, I wasn't taking advantage of what we call mini-batch ADVI, which allows much faster training and is one way to speed up training of Bayesian models. For the record, how to train Bayesian deep neural networks really efficiently is a super active area of research right now, so there's lots of uncertainty in the field surrounding that, and lots to learn as well.
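Mini-batch ADVI, the speed-up he mentions not having used, looks roughly like the following in PyMC3. This is a sketch under the same assumed names and placeholder data as the earlier model sketch, with the full arrays swapped for pm.Minibatch views.

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Placeholder data standing in for the covertype arrays.
X_train = np.random.randn(20000, 66)
y_train = np.random.randint(0, 7, size=20000)

# Mini-batch views: PyMC3 streams random batches through the graph.
X_batch = pm.Minibatch(X_train, batch_size=128)
y_batch = pm.Minibatch(y_train, batch_size=128)

with pm.Model() as minibatch_nn:
    w1 = pm.Normal("w1", mu=0, sd=1, shape=(66, 20))
    w2 = pm.Normal("w2", mu=0, sd=1, shape=(20, 20))
    w_out = pm.Normal("w_out", mu=0, sd=1, shape=(20, 7))

    a1 = pm.math.tanh(pm.math.dot(X_batch, w1))
    a2 = pm.math.tanh(pm.math.dot(a1, w2))
    probs = tt.nnet.softmax(pm.math.dot(a2, w_out))

    # total_size tells ADVI to rescale the batch likelihood to the full data set.
    pm.Categorical("obs", p=probs, observed=y_batch, total_size=len(y_train))

    approx = pm.fit(n=50000, method="advi")
```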
Yes. If you look at the Bayesian analysis recipes repository, you'll find the multinomial logistic regression and the Bayesian neural net as separate notebooks, and you'll see that the Bayesian neural net gets an AUROC on the posterior means of roughly 0.80-ish, versus an AUROC of 0.72 for the multinomial logistic regression, so there's a considerable performance gain going to a deep net. And then of course the advantage of going Bayesian is that we get the uncertainty.

How does Bayesian inference work when we go to higher dimensions, that is, when we increase the number of hidden nodes? The same exact thing, but it trains slower because we have more parameters to fit. Convergence at larger sizes is something I've not tried, though granted the number of data points is twenty-something thousand, and 70 by 20 is 1,400 parameters, plus another several hundred, so we're still under-parameterized with this model; we don't have more parameters than data points yet, so there's room to grow.

On correlation between the parameters: I've been asked that question in public talk settings and I've asked it of myself, and truth be told, whenever you look at a deep net and ask what the interpretation of the weights matrix is, that's essentially the heart of the question.

On how to determine how many parameters go inside: it's witchcraft. The number of parameters is essentially determined by how many nodes we conceptually want in there, and, though I might be uninformed about this, my best guess right now is that apart from trial and error, or Bayesian hyperparameter search, there's no principled way of determining how many parameters we need. There are good ways: just flood Amazon with compute and you'll get your answer. When uncertainty in regression estimates can be tied to company bottom lines, then we've got a good reason to do that. For unsupervised learning, I truthfully don't know; that's something I have to read up on, and maybe we can chat about that later.

As for why I dropped the bias terms: fewer parameters; laziness. I'm just joking. I did train it with the bias terms, and for some reason it trains significantly slower with them, and I don't mean it just added a linear amount of time to the training; it really trained much slower, and I couldn't figure out why, so I dropped the bias terms. In any case we're under-parameterizing the problem anyway, so we're forcing the network to learn and compress the data, which is sort of the intuition behind why neural networks do so well, so it worked out fine. There was no principled reason for it beyond using our priors intelligently.

On that note, something I actually learned last week reading Radford Neal's book (Radford Neal is basically the father of Bayesian neural nets; this was his seminal thesis work): you don't need to do train/test splits when you go Bayesian. That was mind-blowing. Go read Radford Neal's book if you can get access to it. You do not need to split your data
into training and testing sets; whatever data you have, you fit the posterior distributions on. This is super controversial, I know, and I'm facing a crowd that may not agree, but according to his book I do not need to do that, and I have a small blog post on my website quoting the relevant sections of that book. It's really neat; I was mind-blown.

Okay, next question. So you're asking whether, rather than fitting the full distribution, it just fits the mu and the SD? I can't speak intelligently to whether that's exactly what's happening behind the scenes, but I do know that when you train a Bayesian neural net, versus point parameters, you double the number of parameters: when we use Gaussians, or other two-parameter distributions, as priors on our parameters, you basically double the number of parameters that you need to learn.

One more question: what would happen if you changed the likelihood function for the data, and what would happen if you changed the priors for the parameters? If you get the likelihood function wrong, you'll be doing something I blogged about recently, which is fitting Poisson distributions to data that had negative values in them; that's wrong, because Poisson-distributed values are all non-negative. So make sure you do not get the likelihood function wrong; you always have to get the likelihood function right. On the other hand, if you change the priors for the parameters, what you're implicitly expressing is "I know a little bit more (or a little bit less) than my flat priors about what this parameter should be," and you should have a justified reason for changing them. When we talk about going Bayesian, we're, as Andrew Gelman says, getting rid of p-values, those detestable things, and making science more of a discourse. I come to you with my priors and I justify them; I give you good, strong reasons, not necessarily grounded only in math, for why these priors are good. It may be an empirical Bayes thing: I cheated and did the frequentist thing first and then used those estimates as the priors. Or it could be that I have absolutely no reason to believe anything, so I put in a completely uninformative prior, and I bring it to you, the model critiquer, and say: this is what I believe, and I can give you reasons why I believe we should not assume anything about that parameter. That's the nice thing about going Bayesian: we don't have cut-offs and things to help us shortcut science; we actually bring science back to a dialogue where I need to argue forcefully, with reason, about why my model is the right model and why my priors are the right priors. It's tougher work, but it's better work. Philosophy of science permeates this frequentist-versus-Bayesian thing as well. So, thank you very much.
Info
Channel: PyData
Views: 44,851
Rating: 4.9653678 out of 5
Keywords:
Id: s0S6HFdPtlA
Length: 36min 14sec (2174 seconds)
Published: Thu Dec 21 2017