Probabilistic Programming and Bayesian Modeling with PyMC3 - Christopher Fonnesbeck

Captions
All right. Thanks to the Montréal-Python organizing folks for letting me talk while I was here in Montreal. It's great to be here; I love coming to Montreal and I try to do it at least once a year. Thanks also to whoever opened the windows: I'm going to be talking about statistics, and combining statistics with a lack of oxygen is a recipe for disaster. I'm ostensibly a professor of biostatistics at Vanderbilt, which is in Nashville, but I'm really a data scientist. What that means is that I'm the worst programmer in a room full of programmers, the worst statistician in a room full of statisticians, and the worst basic scientist in a room full of basic scientists. PyMC3, and PyMC in general, is a project I started back when I was a postdoc, when I really didn't know anything about Python programming, and I'm going to talk a little bit about it here. Before I do: this is very different from the first talk, which I enjoyed a great deal. I'm certainly not doing any live programming, sorry. A couple of questions, though. How many folks in here would describe themselves as data scientists? A few; good. How many know what Bayesian statistics is? Okay, that's great. That's all I need to know.

About this time last year I spoke at a different meetup in London. There's a Bayes meetup over there which is really great, held at Imperial College, and much like this meetup, what you do afterwards is go to the pub, right? The pub you go to after the Bayes meetup in London is called the Artillery Arms, in East London, and right across the street there's a cemetery called Bunhill Fields. John Bunyan is buried there, and Daniel Defoe, and William Blake, all these writers; but if you go just inside the gate and to your left, the big tomb you see is Thomas Bayes's. So you can sit there, give your talk about Bayesian statistics, and pour a little bit of beer on Bayes. Great.

So first of all, what is probabilistic programming? It's really not a new concept, but I thought I should describe it before I go on and talk about software that does it. The easiest definition, or maybe the least useful one, is that it's any program that's partially dependent on random numbers. The outputs of these programs are not deterministic, and it can be expressed in any language that has a random number generator, so you can certainly do it in Python. What it really amounts to is adding another set of primitives to the programming language. With these primitives you can do things like draw random numbers and calculate probabilities of those random quantities. You can have distributions over values: you can say something is normally distributed, a bell curve, and draw random numbers from it. You can do more complicated things like distributions over functions, where rather than drawing single integer or floating-point values, each realization is an entire function; in a longer version of this talk I discuss Gaussian processes a little. But the important thing it allows you to do is condition random quantities on one another. If you define some quantity p, a probability, as a beta distribution (a distribution over values between 0 and 1, good for modeling probabilities), then you can take another random quantity, another stochastic primitive, and condition its value on it: z here is 0 or 1 based on that probability p. That's the important part: it lets you condition things on one another, giving you building blocks for more complicated models. And why do we do this? Most of the time, we're doing it to facilitate Bayesian inference, and there were a lot of hands here that recognized what Bayes is.
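The conditioning idea just described, a probability drawn from a beta distribution and a 0/1 outcome conditioned on it, can be sketched with nothing but the standard library. This is an illustration of the concept, not PyMC code:

```python
import random

# Probabilistic "primitives" from the standard library: a probability p drawn
# from a beta distribution, and a 0/1 variable z whose value depends on p.
rng = random.Random(42)

# Beta(2, 2) puts all of its mass on (0, 1), so it can model a probability.
p = rng.betavariate(2, 2)

# Condition z on p: z is 1 with probability p, 0 otherwise.
z = 1 if rng.random() < p else 0
```

Chaining draws like this, where one random quantity feeds the distribution of the next, is exactly the "building blocks" structure that a probabilistic programming language formalizes.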
So I'm not going to give more than a two-minute overview of Bayesian inference before moving on to the software implementation. The most important thing to know about Bayes is that it deals with a different sort of probability than the statistics most of you probably learned in school. Most importantly, we're doing something called inverse probability: reasoning from observed effects back to their causes. The only notation I'll show you is this: there are things we don't know, theta, which covers parameters, unknown values, and future predicted values; and there are things we know, y, the things we've observed, our data. Everything in Bayes can be classified into those two bins, things we haven't observed and things we have, and we can use these quantities and conditioning statements to work out what the causes might be: we've observed the effects y, and we're going to see what the probable causes are. Anything you don't know can be thrown into that theta and put in your model, and that's one of the powerful things about the approach.

So why do we need a whole different type of statistics? For me it's pragmatic. Some people have philosophical preferences here, but for me, way back when I was a graduate student, it helped me solve problems I couldn't solve with classical statistics. The important thing is that it's very useful: it lets you build really complicated models that you couldn't fit otherwise, and ironically that's because it lets you build them from simple building blocks. This is why Bayesian statistics is called Bayesian statistics; this is Bayes' formula, and the important parts are along the top. We have unknowns theta that we have some information about before we run our experiment, collect data, or observe the world: P(theta) is our prior probability, what we know before we've collected any data. What we want is the posterior, what we know after we've conducted our study and collected all that information. So it's a process of updating priors to posteriors, and the way we do that is through a likelihood function. Those are the three main components. And when you're done, you can take that posterior and make it the prior the next time: go collect more data and update again. You can turn the Bayesian crank that way.

The big advantage is that everything here is expressed as probabilities, so all outputs from probabilistic programs tend to be entire distributions. Rather than just getting a mean, or a median, or some statistic, you get an entire distribution, which lets you say things like: what's the probability this is greater than zero? I generally build models of infectious disease systems; this was some co-infection effect, and we can see that it's almost certain, in a probabilistic sense, to be greater than zero. You can pull arbitrary quantities from this; once you can get the posterior distribution, you get a lot of stuff for free. The stochastic program, then, is where the probabilistic programming comes into play: we specify priors and likelihoods and come up with a joint distribution of everything, including our data. The first step in Bayesian inference is to write down your model in whichever language you're using. So how do you do that? What constitutes a prior, and what constitutes a likelihood? A prior distribution quantifies the uncertainty in whatever variables you're interested in fitting. This one is a normal distribution with mean zero and standard deviation one, and it says we're reasonably sure that things are somewhere between minus three and three, with quite a bit of certainty. The flat line across the bottom is also a normal distribution, but with a standard deviation of one hundred, so a variance of 10,000; there we're saying we don't know, it could be essentially any real value. That one reflects a lack of information: the first was highly informative, the second is not very informative. It's best to impart any prior information you might have. If I'm modeling the prevalence of a rare disease, I might pick a beta prior with essentially all of its probability mass down near zero, because most people don't get the disease. Or say I'm a baseball fan and my favorite player gets a hit in his first three at-bats of the season: what's the probability he hits .400, or .300, for the season? You wouldn't put a flat prior on that, because we've been playing baseball for over a hundred years and there's lots of data for the prior to come from. Based on all of the batting data since the turn of the century, major leaguers get a hit 26.1 percent of the time on average, with a standard deviation of 0.034. That's prior information: there's no way he's going to hit .900, and there's no way he's going to hit zero, because he wouldn't be in the major leagues long enough. That's the idea: you put whatever you know about the problem, before you collect your data, into the problem itself. How about the likelihood? This is where the data come into play. What we're building is a data-generating mechanism: how did the data come to be? Here, too, it comes down to picking an appropriate distribution, and this is the knack, the art if you like, of probabilistic programming: seeing which distributions should be used in which cases.
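As a rough sketch of how that batting-average prior gets combined with data, here is the conjugate beta-binomial update in plain Python. The Beta parameters are back-of-envelope values matched to the quoted mean and standard deviation, not numbers from the talk's slides:

```python
# Toy conjugate update for the batting-average example: encode the historical
# prior (mean ~.261, sd ~.034) as a Beta distribution, then update it with
# 3 hits in 3 at-bats.
m, s = 0.261, 0.034

# Method of moments: choose Beta(a, b) matching the prior mean and sd.
nu = m * (1 - m) / s**2 - 1     # "prior sample size" a + b
a, b = m * nu, (1 - m) * nu

# Observing x hits in n at-bats gives the posterior Beta(a + x, b + n - x).
x, n = 3, 3
post_mean = (a + x) / (a + b + n)
```

The posterior mean barely moves from .261: three perfect at-bats cannot overcome a century of batting data, which is exactly why nobody projects a 1.000 average from them.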
For example, our data might be normally distributed if they're human heights and weights; those tend to be normally distributed, as do blood pressure measurements and things like that. If it's baseball, use a binomial distribution: in n chances, the number of at-bats, you get x hits, and the batting average is the p in there. If we're running a website and want to know how many unique visitors we get per month or per year, we might pick something like a Poisson distribution, which is for counts. Different distributions are good for different things. Then we combine all of these pieces to get a posterior distribution: there's the likelihood, there's the prior, and the little symbol I've written means "proportional to". It's not quite equal; it's equal up to a constant, and the constant is something I glossed over when I showed you Bayes' formula. It's the probability of y, the marginal probability of the data, which is just the numerator integrated over all the thetas: you integrate out all your variables. Now, I'm a really bad mathematician. My background is in biology and I'm not very good at calculus: I can do integration in one variable, or at least I could about twenty years ago; with two I have a lot of trouble; three, forget it, and most of you probably can't do three even if you're mathematicians. Most models will have hundreds, maybe thousands, of variables. So doing Bayes is really hard, and that's one of the reasons you didn't learn it in school; particularly if you went to school as long ago as I did, it was just impractical to use. With probabilistic programming we can use sophisticated numerical methods to approximate that integral, and the framework abstracts them away so we don't have to be experts in all of them.

And as I said, probabilistic programming is not new. When I was a graduate student there was really only one way to do this: a package called WinBUGS. The "Win", of course, means it only ran on Windows, but it was really great for the mid-90s: you had this dashboard, you could watch the samples come in live, and you got summaries of everything. It made it easy to describe, build, fit, and share Bayesian models, particularly for non-experts; again, I was a biologist, and most people using it were not statisticians or mathematicians. Even better, there was a really nice domain-specific language called BUGS, with a sort of R-like syntax; for any of you familiar with R, this looks very R-like. This is a complete hierarchical model in seven lines or so, and it allowed me and others to do lots and lots of things. But after a while you hit your head on the ceiling. There were a few things wrong with it: it was closed source, and it was a domain-specific language, so you had to get everything into BUGS and then get everything back out when it finished. It was coded in Object Pascal, which is now open source. How many Pascal programmers in the room? Wow, okay. I gave a talk in Denmark recently and asked that, and half the room raised their hands; I thought, that's a weird place. So there were a lot of problems with it, and back when I was a graduate student I had lots of time. I didn't think I had a lot of time then, but I really did, so I cobbled together a reimplementation using the language I liked and used for most of my other work. I started it around 2003 at the University of Georgia, and what it really is, is a proper probabilistic programming framework for fitting arbitrary probability models: not any one particular class of models, not
just regression models. Any model you can write down in math, you should be able to implement in PyMC. It's based on Theano, which is why I come to Montreal a fair bit. Theano is a package produced by the LISA lab (they've changed their name; it's the MILA lab now, or something like that) at the University of Montreal, and it enables what we might call next-generation Bayesian inference methods that use gradient information; I'll talk about that in a bit. We have over 100 contributors now, about a dozen core contributors, and it's used quite a lot in academia and industry, by companies like Quantopian, GrubHub, and Channel 4 over in England, among others. And of course it's on GitHub and freely available. So what does Theano do? It's the computational engine behind PyMC3. PyMC2 was largely a Fortran project: mostly Fortran with a crunchy chocolate coating of Python on the outside, so people didn't have to code in Fortran. Same idea here, with Theano as the engine. Theano is a sort of meta-language for specifying and evaluating mathematical expressions using tensors, which are just generalizations of vectors and matrices. It was really built for deep learning; that's why it came about, in much the same way as TensorFlow or Torch. In fact, Theano led the way. What it does is dynamically generate C code from your expressions. Here's what it looks like: we construct a matrix, populate it with values, and then take some gradients. I specify a matrix, call it X, and here's a function of that matrix: an inverse-logit transformation, which maps the values onto the (0, 1) interval. And then, this is the cool part, you just take the gradient of it automatically, just like that. That's the magic. Then you turn it into a function. Up to this point, no calculation has occurred whatsoever.
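The automatic-gradient magic can be illustrated in miniature with forward-mode automatic differentiation using "dual numbers". This is a plain-Python scalar sketch, not how Theano works internally (Theano builds and rewrites a symbolic graph), but the end result is the same: derivatives computed mechanically rather than by hand.

```python
import math

class Dual:
    """A value paired with its derivative; arithmetic propagates both."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def __add__(self, other):
        return Dual(self.val + other.val, self.der + other.der)

def inv_logit(d):
    # 1 / (1 + exp(-x)), with the derivative carried through each step.
    e = Dual(math.exp(-d.val), math.exp(-d.val) * -d.der)   # chain rule for exp(-x)
    denom = Dual(1.0) + e
    return Dual(1.0 / denom.val, -denom.der / denom.val**2)  # quotient rule

x = Dual(0.3, 1.0)          # seed derivative dx/dx = 1
y = inv_logit(x)

# Analytic check: if s = inv_logit(x), then d/dx inv_logit(x) = s * (1 - s).
s = 1.0 / (1.0 + math.exp(-0.3))
```

Overloading every operator this way gives you gradients of arbitrary compositions for free, which is the capability that gradient-based samplers depend on.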
All that's being done is building a graph: a static graph that Theano can use. It will optimize it, work out how to compute the gradients over the whole thing, and the only time it actually does anything is when you call the function for the first time, at which point it compiles to C and runs. So this is the gradient of the transformed matrix, and this powers everything that follows. It's always best to show real-world examples for this sort of thing, so I'm going to show you what a model looks like in PyMC3 using an example that's actually in our set of tutorial examples: a dataset of coal-mining disasters in Britain across the turn of the last century. Mine safety wasn't then what it is now, so there were a fair number of disasters in coal mines, and these are yearly counts from about the middle of the 19th century to the middle of the 20th. It's a good example because it's nice count data, but you can see the counts change: they're high, around three, near the beginning, and then somewhere around the turn of the century the rate drops. You still get some bad years, but on the whole the counts tend to be lower. So what we're going to do is model this count process. We'll hypothesize that there's an early mean that's high and a late mean that's low, but we don't know where the switch point is. The great thing about Bayes is that anything you don't know, you just make it a variable and estimate it, and that's what we'll do. So there will be three variables, the early mean, the late mean, and the switch point, and they'll all be random. First step: prior distributions, which we talked about before.
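The hypothesized process, a high early rate and a low late rate with an unknown switch year, can be simulated forward in plain Python. The structure here (uniform switch point, exponential rates, Poisson counts) mirrors the model described in this talk, though the year range and prior parameters are illustrative choices, not the tutorial's exact values:

```python
import math
import random

rng = random.Random(0)

def poisson(lam, rng):
    """Knuth's algorithm: multiply uniforms until the product drops below exp(-lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

years = list(range(1851, 1962))
switchpoint = rng.randint(years[0], years[-1])   # uniform prior over the series
early_rate = rng.expovariate(1.0)                # illustrative Exponential(1) priors
late_rate = rng.expovariate(1.0)

# Deterministic transformation: the rate changes at the switch point.
rates = [early_rate if t < switchpoint else late_rate for t in years]
disasters = [poisson(r, rng) for r in rates]
```

Running the model forward like this, before any data enter, is a useful sanity check that the priors generate plausible-looking data.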
For these rates I'm going to choose exponential distributions. Why? Because rates are positive continuous values; they can't be negative, obviously. You can choose other priors, and it's always good to test whether you picked good ones, but that's a different subject. For the switch point, I'm going to pretend I didn't look at the data. I know it's somewhere in the middle, but we'll just let it be uniform across the time series; it could be anywhere in there. The PyMC code looks like this. The cool thing is that it hijacks Python's context manager, which you usually use for things like opening files or sockets; we use it here to open a model and populate it with variables. That saves you from the sort of thing you see in Keras, where you need all these explicit add statements. This does it with a bit of magic: any time you declare a PyMC variable inside the context manager, it gets added to that model, which has a nice name, disaster_model. So there's my switch point, uniform between the lowest year and the highest year; the early rate, exponential; the late rate, exponential. PyMC has forty or fifty pre-specified probability distributions, the ones you'd probably ever need, but you can also implement custom distributions that nobody else uses. The point, and the motivation behind PyMC3, is a high-level language for specifying these models in almost the same number of lines of code as lines of math; there's very little extra going on.

So what happens now; what do we have? If we look at the type of the early rate, for example, it's a PyMC object called a TransformedRV: it's actually been transformed to the whole real line, rather than only positive values, because that makes sampling more efficient. And I can do things with it. This is the probabilistic-programming part: these are our primitives now, and they have attributes. Here's the log probability of the value 2.1; I can draw random values from the distribution; I can do anything I want with it. I can also transform variables arbitrarily. My rate is the early mean if t, the time, is less than the switch point, and the late mean otherwise, and I can use this switch function for that. Notice I don't have to write a loop: everything is vectorized, and this is really just Theano code in disguise. There's nothing random about these; they're deterministic transformations. Then, for the likelihood, we use a Poisson, the same distribution I used for website visit counts before, now for disasters: whatever the rate is for a given year, the count is a Poisson draw from it, and my data are the disasters. In PyMC, the only thing that distinguishes this from a prior is the observed flag: it includes the data, and it essentially says these are fixed, I've observed them, don't change them.

Okay, so the next step: how do you get posterior distributions? This is the obstacle; this was always the hard bit. It's analytically impossible most of the time, and even calculating them numerically is challenging, so over the years statisticians have come up with various approximations. There's the MAP estimate, which just does optimization and finds the peak, but it isn't really fully Bayesian because you don't get any distributions, just a value. There are weird things like rejection sampling, where you draw random values and check whether each looks like it came from the distribution. A whole slew of approaches. The de facto standard is something called Markov chain Monte Carlo, or MCMC. Even quicker than my two-minute description of Bayes, here's MCMC. The "Markov chain" part reflects the fact that we usually can't sample directly from the posterior distribution. With a really simple model we could, but usually we can't sample independently from it. We can, however, generate a dependent sample, and a Markov chain is a dependent sample where the next value depends on the current value but not on any of the past ones. If I can generate a Markov chain with a particular property called reversibility, that is, if it satisfies the detailed balance equation, then sampling from that chain long enough yields samples that are indistinguishable from draws from the true posterior distribution. That's the magic the math guarantees us. In practice, MCMC is a class of algorithms; there isn't a single algorithm called MCMC, there are lots of specific implementations. The most famous is Metropolis sampling; you may also have heard of Gibbs sampling. Metropolis sampling is the easiest to describe. You initialize your parameters to arbitrary values, and you have some distribution for proposing new values that's easy to sample from. You evaluate each proposed value against the log probability of the whole model you've specified, and you either accept or reject it: if you accept, you add the value to your bag of values; otherwise you keep the current value. Then you do it again, over and over. When you do that, you get something like this. This is a big but very simple model: a thousand-dimensional multivariate normal, a multivariate normal with a thousand values, so it's big, but it's very simple.
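The Metropolis recipe just described fits in a few lines of plain Python. This is a minimal random-walk sketch for a one-dimensional standard-normal target, not PyMC's implementation:

```python
import math
import random

def metropolis(logp, x0, n, scale, rng):
    """Random-walk Metropolis: propose nearby, accept with prob min(1, p'/p)."""
    samples, x = [], x0
    for _ in range(n):
        prop = x + rng.gauss(0.0, scale)            # random-walk proposal
        if math.log(rng.random()) < logp(prop) - logp(x):
            x = prop                                 # accept the proposal
        samples.append(x)                            # otherwise keep current value
    return samples

logp = lambda x: -0.5 * x * x    # standard normal, up to a constant
rng = random.Random(1)
samples = metropolis(logp, x0=0.0, n=20000, scale=1.0, rng=rng)
mean = sum(samples) / len(samples)
var = sum((v - mean) ** 2 for v in samples) / len(samples)
```

Note that only log-probability differences are needed, which is why the intractable normalizing constant in Bayes' formula never has to be computed.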
And we can see that Metropolis sampling is not doing very well here, because of the correlation: it keeps stopping and starting, and when it stops, that means proposals are getting rejected all the time; it's not really finding the meat of the distribution. This is the marginal of just two of the thousand dimensions (sorry, I forgot to say that), just to see what it looks like, since it's hard to visualize in higher dimensions. I arbitrarily picked the first two; it doesn't matter, they'll all look like this. The problem is the random walk: I'm randomly selecting a candidate value and then evaluating it, which works fine for small models but not for big ones. The whole idea of PyMC3 is to use newer, more sophisticated algorithms; in particular, to use gradient information from the posterior distribution to propose better values, so it's no longer a random walk. What we do is essentially simulate the posterior as a physical system. Think of the posterior distribution as a landscape, like a skateboard park, and the skateboarder as your point: you roll them along the surface, simulating the physics. We add an auxiliary variable, so we have a position and a velocity, and we move the point around accordingly. At the top of the hill you have lots of potential energy and not much kinetic energy; as you go down the hill you gain kinetic energy and lose potential energy. You're simulating that system, and that means derivatives: we need to see how the posterior changes as we move, and that's why we require Theano or something like it. Integrals are impossible to do automatically; derivatives you can do automatically, you just need the right technology, and that's what Theano provides. Hamiltonian Monte Carlo looks like this: you sample a new velocity from a Gaussian distribution, essentially giving the skateboarder a push in a random direction; then you simulate the continuous system using discrete, deterministic leapfrog steps; and once you reach the other side, you stop, take a point, and repeat the whole thing over and over. What you get now works much better: near-independent sampling that characterizes the distribution very quickly, with a very high acceptance rate, and it's applicable to much larger models than Metropolis-style sampling. The downside is that there's a lot of tuning to be done: you have to pick how many leapfrog steps to take and where to stop. What happens if you're on your skateboard and you reach the other side of the park? You start sliding back along the path you just took, and you don't want that. So Andrew Gelman and one of his graduate students came up with an automated, self-tuning version of Hamiltonian Monte Carlo called the No-U-Turn Sampler (NUTS), which, as the name says, tries to keep the trajectory from making a U-turn back on itself. And you don't have to know any of that when you use PyMC; again, we black-box it, we abstract it all away. All you do is call sample. It determined it should use NUTS for the late rate and the early rate and Metropolis for the switch point, and you get a few thousand samples in a few seconds. Then, as promised earlier, once you have that posterior distribution you get a bunch of stuff for free: here are the samples, and you can get means, standard deviations, credible intervals, and everything else you need, which is fantastic. So that's the primary way to do inference and machine learning with PyMC.
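The Hamiltonian dynamics described above can be sketched as a single toy transition in plain Python: a one-dimensional standard-normal target, with a hand-picked step size and step count, which is exactly the tuning that NUTS automates.

```python
import math
import random

def hmc_step(x, rng, eps=0.1, n_steps=20):
    """One HMC transition for U(x) = x^2/2 (standard normal), so dU/dx = x."""
    grad = lambda q: q
    v = rng.gauss(0.0, 1.0)                  # fresh random "push"
    x_new, v_new = x, v
    v_new -= 0.5 * eps * grad(x_new)         # leapfrog: initial half momentum step
    for _ in range(n_steps):
        x_new += eps * v_new                 # full position step
        v_new -= eps * grad(x_new)           # full momentum step
    v_new += 0.5 * eps * grad(x_new)         # convert the last full step to a half step
    # Metropolis correction on the total energy H = potential + kinetic.
    h_old = 0.5 * x * x + 0.5 * v * v
    h_new = 0.5 * x_new * x_new + 0.5 * v_new * v_new
    return x_new if math.log(rng.random()) < h_old - h_new else x

rng = random.Random(7)
x, samples = 0.0, []
for _ in range(5000):
    x = hmc_step(x, rng)
    samples.append(x)
mean = sum(samples) / len(samples)
var = sum((v - mean) ** 2 for v in samples) / len(samples)
```

Because the leapfrog integrator nearly conserves energy, almost every proposal is accepted, and each draw moves far across the distribution instead of shuffling locally.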
But even with more sophisticated algorithms, MCMC can be very slow, particularly for large datasets. It doesn't scale well, because the likelihood has to be evaluated for every data point at every step of the sampling algorithm. In those cases we can use a different type of algorithm, added to PyMC quite recently, called variational inference, and it's a very different approach from MCMC. The blue curve here is some posterior that we don't know. We take another distribution we're familiar with, say a normal distribution, something easy to work with, and we transform it and select parameter values for it so that it gets as close as possible to the posterior distribution. We're changing the problem from a sampling problem, which is great but can be slow, to a straight optimization: we're going to optimize the crap out of this approximation, as well as we can. And what do we mean by "as close as we can"? The measurement we use is essentially an information distance called the Kullback-Leibler divergence, which tells us how far one distribution is from another. Q is our approximation and P is the true posterior, and all you need to know about the math is that it gives us an expected value in terms of Q, the thing we know; if it's a normal distribution, we can work with it. We can't optimize this directly, because it contains the posterior, which we don't know; but with a little bit of math that's over my head, we can rearrange it into a quantity we can deal with, called the evidence lower bound (ELBO). Maximizing the evidence lower bound is the same as minimizing the Kullback-Leibler divergence. But again, as with NUTS, there are choices and tuning to be done. We have to pick Q, and how do we pick Q? It has to be a distribution that's useful, and we may not know what our posterior distribution looks like, things like that. In the last few years (2016), Alp Kucukelbir, also out of Columbia, the same place as Gelman and Hoffman, came up with an automated method called ADVI, automatic differentiation variational inference, which starts with normal distributions, transforms them into a real coordinate space, and standardizes them, so that it works across essentially any problem. What we get when we do variational inference looks like an MCMC trace, but these aren't parameter values; this is the ELBO. It has hit some sort of asymptote here, and whether that's a good place or a bad place depends on how good the approximation is. To give you an idea of what that looks like: here's a beta distribution I'm estimating, the dashed line, and each of these curves is an approximation based on between 100 and 10,000 optimization iterations. This is straight optimization, BFGS or Nelder-Mead or whatever optimizer you want to use; it's fast, and you can see that in this case it does a reasonable job. In PyMC, all you do is take your model and, rather than calling sample, you call fit, and fit will choose an appropriate approximation. What you get out the other end is not a bunch of samples but the fitted approximation, which, being a distribution, we can draw samples from. So we take that approximation and sample from it rather than from the true posterior, and we get what looks like MCMC samples, though they're still just approximations. As they say in the machine learning world, there is no free lunch: these approximations generally aren't as good, unless you get lucky or your problem is simple enough. The blue line is the ADVI approximation, and NUTS gives a better approximation to the posterior. What you see in general is that you tend to underestimate the posterior variance.
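That variance underestimation can be shown exactly in a toy case. For a correlated two-dimensional Gaussian, the optimal fully factorized (mean-field) Gaussian approximation has a closed form, and its variances are strictly smaller than the true marginals. This is a standard textbook result, not specific to PyMC:

```python
# Why mean-field approximations are overconfident: for p = N(0, Sigma) with
# Sigma = [[1, rho], [rho, 1]], the precision matrix Lambda = Sigma^{-1} has
# diagonal entries 1 / (1 - rho^2).  Minimizing KL(q || p) over factorized
# Gaussians q gives per-coordinate variance 1 / Lambda_ii.
rho = 0.9
marginal_var = 1.0               # true marginal variance Sigma_ii
mean_field_var = 1.0 - rho**2    # optimal mean-field variance, 1 / Lambda_ii
```

With rho = 0.9 the approximation reports a variance of 0.19 where the true marginal variance is 1.0: far too confident, which is the price paid for treating correlated variables as independent.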
variance, because ADVI does what's called a mean-field approximation: it assumes all of the variables in the model are independent of one another. But it works a lot faster, so if you have lots and lots of data you may be willing to make that trade-off. It's made faster still by the fact that we can mini-batch. What I mean by mini-batching is that rather than throwing all of the data at the problem at every iteration of my optimization, I throw just a random batch at it instead. This has two advantages: one, the computation time decreases, and two, it gives you what's called stochastic gradient descent, which tends to be more robust; you get noisy gradients rather than exact gradients, and they tend to converge faster. So MCMC and ADVI are the two main ways of using PyMC, and this fits very naturally into machine learning; you can combine it with probabilistic programming and call it Bayesian machine learning if you like. The thing about machine learning models is that they tend not to account as much for uncertainty; in particular, you tend to get a prediction or a point probability and not an entire distribution, so they can sometimes be easy to fool and harder to interpret. You can totally fit machine learning models in PyMC, and here's an example that Thomas Wiecki, who's one of our core developers, demonstrated a couple of years ago. This is a Bayesian deep learning model, just a neural network with two hidden layers; deep learning is anything with more than one hidden layer, so this is deep learning. All we do to Bayesianize it is take the weights of the neural network and put priors on them; here we're just putting Normal(0, 1) priors on everything. This is the whole program in PyMC: here are the weights from the input to the first hidden layer, from the first to the second hidden layer, and from the second hidden layer to the output, then a set of activation functions, and our output is a binary classification, so it's a Bernoulli random variable, and
what you see on the side here is more than you usually get. This is not a decision boundary, which is what you would get if you did a support vector machine or a deep neural network; what's being shown here is the posterior standard deviation of the estimated probability of classification. The darker it is, the more uncertain it is: everything close to the boundary is uncertain, and everything light and away from the boundary is very certain. So you can get an idea of how reliable your predictions are, which is important, right? If you're going to use this to make decisions, you want to know what the risk is. OK, I'm almost out of time. In terms of the future, we're lucky enough this year to have some Google Summer of Code slots. We've got a student in Argentina, Agustina Arroyuelo, implementing approximate Bayesian computation, so again, another algorithm for fitting these models; Bill Engels is going to continue working on Gaussian processes, which is really great; and then we have another student, Sharan Yalburgi, who's going to be working on a TensorFlow backend for PyMC3. Unfortunately, the Theano project is shutting down after many, many years. It's kind of served its purpose; it essentially prodded lots of companies to make very robust and powerful open-source deep learning engines, and so we're going to transition into what we would call PyMC4, hopefully using TensorFlow, that's what we're going to try first anyway, and Google Summer of Code is going to start that for us. For those of you interested in learning more about this, the PyMC repository has a whole bunch of Jupyter notebooks full of well-documented examples, everything from regressions to survival models to machine learning models. And of course I'm biased towards PyMC, but we live in a great time now, compared to back in the '90s when you just had WinBUGS. These are just the Python tools for doing probabilistic programming,
and that's not even counting other languages: there's Edward, which is also now on TensorFlow; Stan, whose team has a Python interface; and so on. If you want to learn a bit more about Bayes, a Montreal local, actually, Cam Davidson-Pilon, I think he works for Shopify, wrote a nice open-source textbook years ago based on PyMC2, and it was recently ported by Max Margenot and Thomas Wiecki to PyMC3 and Python 3, yay. So you can go on there and learn all about the basics of PyMC and probabilistic programming. With that, I'll close. This is our core team, and I'd like to thank all of them; this went from a three-person operation for PyMC2 to more than a dozen people now, and as a result we have a much, much better project. If I've done anything for the project, it's that I've been able to recruit people who are smarter and better at all this stuff than I am. So I thank them, and I thank you for hanging out and listening. [Applause] [Audience question about whether the gradients are computed symbolically] Yes, Theano does symbolic differentiation, and that's what PyMC3 is built with, but you need that somehow: Stan has built its own engine for doing that in C++, TensorFlow can do it, and we've essentially hijacked a deep learning engine to do probabilistic programming, is the idea. So you could do it with any of those. That's what the graph buys you: you build this big static graph, and then the library is able to reason over that graph and come up with the gradients. [Audience question about the switch-point example] Didn't I show you that? Yeah, it was around halfway through. I mean, that's the cool thing, is that you get a distribution; any one of those Jupyter notebooks has the whole worked example, but it was right about, I can't remember which year, but the 40th observation, I think, the 40th year, and you get a nice distribution, you don't just get one number. I would imagine so, yeah. I'm not sure, and that's a simplification; it probably wasn't just a single switch point. In my longer version
of this talk, I show how to fit a Gaussian process, which I think I mentioned earlier on; it's a distribution over functions. So we posit just a function of some kind, you know, maybe it wiggles a little bit, and you can come up with a function that's more complicated, one that allows the rate to vary within those intervals. So you can do something simple like that, or something more complicated, depending on what the mechanism could be. Yeah, it could just be some new technology or something like that, or a regulation; it could be regulations, turn of the century in England, you know. I'm watching Peaky Blinders right now, so that reference was real time. Yeah, so 111 data points and three parameters. The thing about Hamiltonian Monte Carlo is that each iteration is slower than you would get with Metropolis, but it's a far more efficient sampler, so you're going to keep all of the samples you get; if you use Metropolis sampling, you throw away 75% of them for the most efficient version of the algorithm. So, I don't know, it took four seconds; I'm happy with that. And it scales well with the size of the model: if you increase the number of parameters, it's still quite fast. When the data increase, it doesn't scale quite as well, and when the data get large, things can slow down quite a lot. Wait a minute, what was the number? I had three parameters, 100-odd data points, yeah. I mean, typically I fit, you know, biomechanical examples that have dozens to hundreds of parameters, and it'll take five minutes, seven minutes, something like that. So, yeah, it depends on what you need, and maybe you could code something that's faster, but it's not as general as this tool, right? You could have a super-optimized algorithm for your particular domain's problem, but then when you go and do your next problem, you've got to do that all over again. The idea with PyMC is that you can apply this to any probabilistic programming
application. Thanks. [Applause]
Info
Channel: Montreal-Python
Id: M-kBB2I4QlE
Length: 43min 39sec (2619 seconds)
Published: Mon Jun 18 2018