Probabilistic Machine Learning and AI: Zoubin Ghahramani

Captions
It's my great pleasure and great honor to introduce Zoubin Ghahramani. Zoubin is a joint invited speaker between the Innovative Applications of AI conference and AAAI, and we are really fortunate to have been able to attract him here. Zoubin is a professor of information engineering at the University of Cambridge and is also chief scientist at Uber, combining the two and doing very interesting things. He is a deputy director of the Leverhulme Centre for the Future of Intelligence, a fellow of St John's College, and a founding director of the Alan Turing Institute, the national institute for data science in the UK. He co-founded Geometric Intelligence, which is now Uber AI Labs, which he is running, and he is an elected Fellow of the Royal Society in the UK for his contributions to machine learning. He has worked in many, many areas of AI; I'll mention a couple of them, but there are many others I won't be able to touch upon. They include variational methods for approximate Bayesian inference, graphical models, computational neuroscience, and of course statistical machine learning. So I hope you can join me in welcoming Zoubin Ghahramani for this invited talk. [Applause]

Thank you very much. It's a great pleasure to be here to give this joint talk for the two conferences. I'm going to be talking about probabilistic machine learning and AI, and I'll just start by saying it is an incredibly exciting time for AI and machine learning; I'm sure you are all aware of this. We've had many recent breakthroughs in AI and games, for example Atari, Go, poker, and new breakthroughs in chess. Maybe even more importantly, we've had an era of really interesting applications of AI and machine learning. On this slide I have just the six things I could fit in, and we could have fit many more: areas like speech and language technologies, computer vision, many fields of scientific data analysis, all the recommender systems we interact with, the very exciting future of self-driving cars, applications in finance and automated trading, and many more. So it's a hugely exciting time.

When we think about this mission of AI, one of the questions that comes to mind is: where are we? How far are we from what we need? Do we have all the principles we need to solve the challenging problems, and do we have the practical tools we need? Clearly, to solve some of these challenging AI problems we need principles for perception, for learning, for reasoning, and for decision making, and probably a few other things I haven't put on this slide, but certainly those four are very important things to think about. I'm going to try to return to this at the end of my talk: are we there with the principles we need to build AI systems? Actually, who thinks we have the principles we need? Raise your hand if you think we have everything we need to solve AI. Who thinks we're missing something? Okay, we'll return to this.

Now, there is a huge gorilla in the room, and that is deep learning. One of the things that people often think these days, especially people who haven't been in the field of AI for very long, is that deep learning is going to solve everything: that deep learning is the tool that will allow us to create intelligent systems and solve all of our problems. So let's talk about that for a little bit, and just to set the stage (many of you I'm sure know this): what are deep learning systems? Fundamentally, deep learning systems are neural networks, and neural networks are tunable nonlinear functions with many parameters. That's basically all they are. The parameters of a neural network, which we can call theta, are the weights, and neural networks are particularly interesting functions because they are generally nonlinear in both the inputs x and the parameters theta. What I've shown here is the simplest possible neural network, which takes a linear combination of the inputs and passes it through a nonlinearity, a sigmoid nonlinearity like a logistic; that idea is probably 70 or 80 years old. What modern neural networks do, which is a little more interesting, is represent the function from the inputs x to the outputs y via a composition of functions; those are the layers of the network, and here is a two-layer version of that composition.

What do you do with these neural networks? You train them, generally by maximizing likelihood: almost every loss function used to train a neural network corresponds to a likelihood of some kind, not every one but almost every one, or to a penalized likelihood, and you generally train them with some stochastic gradient procedure. So at its core a neural network is just a nonlinear function with a little bit of basic statistics and a little bit of elementary optimization thrown together. It's incredible how powerful these things have been in the modern era, but let's not forget that that's all that's going on in most neural networks.

The field of neural networks, which really got me excited about being an AI researcher when I first started in the mid-80s, has basically been rebranded as deep learning. Modern deep learning systems are very similar to the things that were popular in the 80s and 90s, with a few very important changes. First, the field has had some new ideas and innovations: being able to learn systems with many layers, different kinds of nonlinearities like ReLUs, different forms of regularization like dropout, and architectures for recurrence like LSTMs and GRUs. But really importantly, we now have vastly larger, web-scale datasets than we had in the 1980s. As a personal anecdote, the first major thing that got me excited about machine learning was my undergraduate project, which was learning how to parse English using a recurrent neural network. This was in 1986. The problem was that our computers were incredibly slow (even the Connection Machine, a highly parallel computer I was trying to program, was very slow) and my dataset was tiny. In hindsight this was doomed to failure, and of course it failed: I didn't get a parser that was competitive with the existing symbolic parsers back then. So now we have vastly larger datasets, we have vastly larger compute resources, GPUs on the cloud and new hardware architectures, and we have much better software tools that make it much easier to actually build these neural networks.
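To make the "tunable nonlinear function plus basic statistics plus elementary optimization" description concrete, here is a minimal sketch (my own toy example, not code from the talk) of a two-layer network, a composition of functions, trained by maximizing a Bernoulli likelihood (equivalently, minimizing cross-entropy) with stochastic gradient descent; the data, sizes, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (illustrative only).
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)        # XOR-like labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two-layer network: p(y=1|x) = sigmoid(W2 @ tanh(W1 @ x + b1) + b2)
H = 16
W1, b1 = rng.normal(scale=0.5, size=(2, H)), np.zeros(H)
W2, b2 = rng.normal(scale=0.5, size=(H, 1)), np.zeros(1)

lr = 0.1
for step in range(2000):
    idx = rng.choice(len(X), size=32)             # stochastic minibatch
    xb, yb = X[idx], y[idx][:, None]

    # Forward pass: a composition of nonlinear functions.
    h = np.tanh(xb @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Negative log-likelihood: -mean(y log p + (1-y) log(1-p)).
    # Its gradient with respect to the pre-sigmoid activations is (p - y)/N.
    dlogit = (p - yb) / len(xb)
    dW2, db2 = h.T @ dlogit, dlogit.sum(0)
    dh = dlogit @ W2.T * (1 - h ** 2)             # backprop through tanh
    dW1, db1 = xb.T @ dh, dh.sum(0)

    # Stochastic gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

p_all = sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)
print("train accuracy:", ((p_all[:, 0] > 0.5) == (y > 0.5)).mean())
```

Everything a modern framework adds (automatic differentiation, GPUs, fancier architectures) is layered on top of essentially this loop.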
Because of all these things we also have vastly increased industry investment, and also media hype, and the good thing about that is that more people are coming in. When you get an order of magnitude more people coming into a field, progress speeds up. So there has been tremendous progress, and we've achieved a lot of really interesting breakthroughs with deep learning systems. I don't want to minimize that, because I think it is really important for the AI endeavor, but these systems also have limitations.

Although they have given amazing performance on many benchmark tasks, deep learning systems are generally very data hungry, often needing millions of examples. They are very compute intensive to train and deploy. They are poor at representing uncertainty, something I'm going to come back to because it's of great interest to me. It is non-trivial in neural networks and other deep learning architectures to incorporate good prior knowledge or to handle symbolic representations; of course, these are all things many people are working on, so I'm not saying they are insurmountable, but they are at the boundaries of what we can do with neural nets right now. Neural nets are famously easily fooled by adversarial examples: you can perturb a few pixels in an image and get the network to give you confidently wrong answers. They are finicky to optimize: the problem is non-convex, and the choice of architecture, how to initialize them, and so on is still a black art. And they are generally uninterpretable black boxes, lacking in transparency and difficult to trust, which makes them hard to deploy on some really important problems. These are the challenges in the field of deep learning right now, and I think this is where a lot of the interesting work is going on.

But what I want to talk about for most of my talk is what's next. Deep learning has clearly been a revolution in the field; we can all agree on that. Now we can move on, because we're researchers and we want to see what is next, and I'm going to give a very personal answer, one which comes from a particular worldview that I, and of course many of my colleagues, have about what machine learning is. The way I like to think of machine learning is in terms of probabilistic modeling. The point is that in many fields of science and engineering, what we're really interested in is building models. And what's a model? A model describes data that one could observe from a system; that's how you tell whether your model is good or not. It doesn't have to be a model of all the variables, it doesn't have to be a generative model of everything; it could be a model of the outputs given the inputs, but that's still a model. Almost everything out there in machine learning is a model of some kind.

Now we're going to use the mathematics of modeling, and there is a good language for dealing with models: the language of probability theory. Fundamentally, if you have a dataset and you want to model something about the world, if you want to predict something about data you haven't observed yet, any sensible model has to capture uncertainty, because how else are you going to make predictions? Probability theory is the mathematical language for dealing with uncertainty, just as calculus is the language for dealing with rates of change. So we're going to use probability theory for all forms of uncertainty in our model, all aspects of noise, all the things we don't know for sure, and then we're going to continue using probability theory in the form of what's called inverse probability, or Bayesian inference, which allows us to infer unknown quantities, adapt our models, make predictions, and learn from data. It's a very simple framework, and many, many machine learning problems fall into it, or at least should.

So, in equations, what is this all about? I know many of you know this, but let me explain Bayes rule the way I see it. Bayes rule tells us how to update our knowledge or beliefs about things we don't know, which I'm going to call hypotheses, from things we do know or measure, which I'm going to call data. In a Bayesian world there are only two kinds of things: the stuff you measure and everything else. The stuff you measure is the data, whether it's the perceptual data for a robot or some medical measurements, whatever it is. The stuff you don't know is uncertain, and we need to represent our uncertainty about it, which we do using probability distributions. Before we've observed the data we have the prior, P(hypothesis). Then for any hypothesis we need to be able to evaluate how probable the observed data are under that hypothesis; that's the familiar likelihood. When we multiply these two and renormalize, summing over all the plausible hypotheses we're willing to consider, we get the posterior: the distribution over the unknown stuff, the hypotheses, given the data. That is a simple, universal rule for updating beliefs in light of data; it tells us how to do learning. Learning is fundamentally the transfer from prior knowledge to posterior knowledge, and if I get more data I just apply the rule again to my current state of beliefs to get a new posterior; there's nothing special about the initial prior.

That's the basic framework, and if we apply it to machine learning, what we get is basically three equations defining learning, prediction, and model comparison. One thing I want to point out is that Bayes rule itself is not really an axiom; it's a corollary of two simpler rules of probability theory, the sum rule and the product rule, written at the top. Everything basically follows from the sum rule and the product rule, and any time you deviate from them you need to explain why: is it because you don't believe in probability theory, or because it is computationally infeasible? There must be a good reason, in the Bayesian framework. Learning, if we take Bayes rule and substitute in theta for the parameters of our model, is the process of updating from our prior over the parameters (our state of belief before observing the data) by multiplying in the likelihood and renormalizing, to get the posterior over the parameters, which represents the plausible settings of those parameters given the data. For prediction of any quantity x given the data D, Bayes rule together with the sum and product rules tells us that we need to average, or integrate, the predictions of all possible parameter values weighted by the posterior we just computed; it's a natural ensemble process of prediction. And if I want to compare two different models, I apply Bayes rule at the level of models m, where the m are model structures I might want to consider. This is a very powerful idea: we can use it for all sorts of model selection, model comparison, ensembling, and so on.
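In equations, using standard notation for the slide described above, with data D, model parameters theta, hypotheses h, and candidate models m:

$$
\begin{aligned}
\text{sum rule:} \quad & P(x) = \sum_{y} P(x, y) \\
\text{product rule:} \quad & P(x, y) = P(x \mid y)\, P(y) \\
\text{Bayes rule:} \quad & P(h \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid h)\, P(h)}{P(\mathcal{D})} \\
\text{learning:} \quad & P(\theta \mid \mathcal{D}, m) = \frac{P(\mathcal{D} \mid \theta, m)\, P(\theta \mid m)}{P(\mathcal{D} \mid m)} \\
\text{prediction:} \quad & P(x \mid \mathcal{D}, m) = \int P(x \mid \theta, \mathcal{D}, m)\, P(\theta \mid \mathcal{D}, m)\, d\theta \\
\text{model comparison:} \quad & P(m \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid m)\, P(m)}{P(\mathcal{D})}
\end{aligned}
$$

The first two lines are the sum and product rules; everything else follows from them.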
So that's it: that's Bayesian machine learning in one slide, and in some ways we don't need much more. But maybe we should ask ourselves why we need these probabilities in an era where deep learning has been so successful, or seemingly successful, without most deep learning papers starting out by saying "here's Bayes rule and everything follows from that". Should we throw away probabilities? Should we say that's old, that's from the 1700s, we're in the 21st century now, we don't need that? When do we need probabilities? I would argue that they really do matter for AI, and here are a few examples of places where I think they matter.

We'd like systems with calibrated model and prediction uncertainties, because we want systems that know when they don't know. If we're building a self-driving car with a pedestrian detector, then if the conditions are weird, the sensors fail, it's foggy, or the pedestrian looks unusual, we want it to say "I don't know", not to confidently give us the wrong answer like most current perceptual systems do. The probabilistic framework also allows very elegant model complexity control and structure learning; in fact we don't need regularization. Regularization and overfitting are not Bayesian problems; they are hacks for solving the thing that probability theory solves naturally. That might sound a bit harsh, but from a purist Bayesian point of view these are, let's call them, approximations to actually computing things with the sum and product rules. For AI we clearly want to build systems that make rational decisions, so if we think about the principles by which rational decision making comes in (I'll talk about this a little more in a couple of slides), we want systems that update their beliefs in light of data in a coherent manner, and that make decisions which maximize some utility or minimize some loss in an effective way, which is what you would expect from utility theory. We need systems where we can build prior knowledge into our learning, not just deploy black boxes, because sometimes we have prior knowledge and we want AI systems that can incorporate it, and not just that, they need to update that knowledge in light of data in a coherent and robust way. And we want our learning systems to work not just with big datasets but also with small datasets, because a lot of challenging AI problems involve making the right inferences from small amounts of data.

So let me talk about some of the probabilistic foundations of where probability theory fits into AI, starting with the problem of representing beliefs in an AI system. Imagine you have a robot. To behave intelligently, the robot should be able to represent beliefs about propositions in the world, like "my charging station is at this location" or "this stormtrooper is hostile", whatever the robot wants to believe. What we want to do, in the brain of this robot, is represent these beliefs numerically somehow, and we want to know what mathematical language we should use to manipulate them. This is an interesting thing to think about, and it has been thought about since at least the 1940s. Here's the mathematics we're going to use: B(x) will represent the strength of belief of the system, the plausibility, of some proposition x. B(x) is a number between 0 and 1: 0 means the system believes x is impossible, 1 means it is definitely true, so logic sits at the extremes of this interval. Furthermore, B(x given y) will represent the belief that x is true given that we know y is true.

In 1946 Cox wrote down some axioms about how to represent beliefs: we use real numbers between 0 and 1 to represent beliefs; we want them to correspond qualitatively with common sense (with a few sub-clauses about what common sense means); and we want systems that represent and manipulate beliefs to be consistent in a certain way. If a conclusion can be reasoned in several ways, each way should lead to the same belief state: if I have a couple of pieces of evidence and I incorporate them in different orders, the belief state should not depend on the arbitrary order in which I incorporated that evidence. The robot or system must take into account all relevant evidence; it can't just flip a coin and ignore some particular y it observed at random, because that's arbitrary, not rational. And equivalent states of knowledge should be represented by equivalent plausibility assignments, equivalent numbers. These are the Cox axioms, and their consequence is that these belief functions B must satisfy the rules of probability theory, including the sum rule and the product rule, and therefore Bayes rule. This is an alternative derivation of probability theory: not probability theory from the point of view of repeatable experiments and frequencies that you get in classical statistics (it doesn't even involve measure theory and so on), but probability theory as a representation of beliefs, which I think is very relevant for thinking about AI problems.

Here's another argument for why beliefs are important, and this one comes from decision theory, or from games. The question is how we use beliefs to make decisions about things we're willing to bet on. Assume you're willing to accept bets with odds proportional to the strength of your beliefs: that is, if B(x) = 0.1, you will accept a bet where, if x is true, you win $9 or more (say $10), and if x is false, you lose $1. If you're willing to bet in that way, then unless your beliefs satisfy the rules of probability theory, including Bayes rule, there exists a set of simultaneous bets, called a Dutch book, that you would accept but for which you are guaranteed to lose money no matter what the outcome.
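As a small worked example of "odds proportional to the strength of your beliefs" (my own numbers, just spelling out the bet described above): if B(x) = 0.1, then a bet that pays \$9 if x is true and costs \$1 if x is false is exactly break-even under your own beliefs,

$$ 0.1 \times 9 \;-\; 0.9 \times 1 \;=\; 0, $$

so you should be willing to accept it, or anything slightly better, such as winning \$10. The Dutch book result says that if your B values violate the rules of probability, several such individually acceptable bets can be combined so that you lose money under every possible outcome.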
That is not about whether your beliefs are correct; we're not saying your beliefs have to be right, they just have to be coherent, and the way beliefs are coherent is that they satisfy the rules of probability theory. So if we want to build rational agents that represent beliefs, and we don't want them to be fooled by nature or by other, more rational agents, then they had better have probability theory at the core of their beliefs. That's a bit of the philosophical background of probability theory.

Now, as was mentioned, I'm heading up Uber's AI research lab, and one of the interesting things about working on the problems we work on is that probabilistic modeling really is at the core of many of the most challenging things we do. Many aspects of learning and intelligence depend crucially on representing uncertainty: modeling and predicting the supply of drivers and the demand of riders over space and time; modeling cities, traffic, and their complex dynamics; predicting ETAs; rational experiment design; good pricing schemes; good causal inference; financial forecasting; self-driving cars. I mention that last one because people think it's the only thing we're working on; we do have lots of people working on it, but there is everything else that a big technology company out in the real world, dealing with people and cities, has to do, and believe me, probabilistic modeling is really central to what we do. We of course use deep learning: we have lots of functions we want to learn and lots of data to learn them from, so we use deep learning for that. But it is really important to have those probabilistic models to be able to build a rational decision-making agent. Our CEO, Dara Khosrowshahi, put it this way: Uber is a giant machine intelligence problem, and it's a particularly challenging problem because we're trying to optimize and navigate the real world, with all the uncertainty that brings.

So let me talk about what are, to me, the most exciting areas of probabilistic machine learning. This is of course a very personal view: these are areas that my colleagues and I have been investing in and working on. To do this I'll give you a map. Somebody asked me a couple of years ago to present my research program in one slide, and it was a good exercise, because it turned out that a lot of the things I work on are related to a central mission, which is automating machine learning. What I mean by that is that machine learning systems are not very rational right now; they're quite ad hoc, and in many places we can make them much better, much more efficient, much more automatic. I'm not going to talk about all of these things, but I'll try to give you three vignettes, three areas that I think are very promising.

The first one is Bayesian deep learning. Remember, I was almost juxtaposing deep learning and Bayesian methods as being in opposition, but they're not; they're about different things. Deep learning is really about types of models, and Bayesian inference is about methods for learning. So we can mix and match the two, and there's actually a deep and rich history of interesting work in this area. Let me motivate what I mean. In a neural network, or any other deep learning architecture, we have a function mapping from x to y, say, and we have some parameters, the weights. If I wanted to be really ambitious about my neural network, I would need to capture not just a single point estimate of the parameters but the uncertainty in all of those weights, and also the structural uncertainty: how many layers should I have, what should the layers look like? That is computationally quite beyond hope right now, I think, but formally that's what we would have to capture. So let's talk about parameter uncertainty, because that may be closer to tractable. Here is a neural network, the same one as before, same structure, same weights; it doesn't matter which one, I've shown a very simple one. Now we treat it with a Bayesian lens. The Bayesian lens says: there are parameters theta, the weights of the network. Before observing the data I have a prior, P(theta given alpha), where alpha might be some hyperparameters (you can ask me where to set those, but that's not as hard as people think). After observing the data I have a posterior, which represents the distribution over parameter settings of the neural network that are consistent with the data.

Now, something interesting that Radford Neal showed in 1994, in the previous heyday of neural networks, is that if you treat a neural network with one hidden layer in this Bayesian manner and take the limit as the number of hidden units goes to infinity, the model converges to something well known in statistics called a Gaussian process, a roughly hundred-year-old model. What happened in the mid-90s is that lots of people were excited about neural networks but were frustrated that they were so hard to train, had so many local optima, and weren't really producing great results back then. Part of the field got excited by support vector machines, because they solved problems elegantly using convex optimization, and the Bayesian part of the field said: why deal with these annoying finite neural networks when an infinite neural network can be handled with linear algebra using Gaussian processes? So they threw away neural networks and worked on Gaussian processes. Now, to quote David MacKay, another pioneer in this field, we may have thrown the baby out with the bathwater, if you've heard that expression, and it's probably good to revisit that. We have recently revisited this theoretical result, and in this ICLR paper we generalized it to wide and deep neural networks: Neal's result was for one hidden layer, but it turns out that even for deep networks, if they are wide, the limit is again a Gaussian process, and many of the models analyzed in the Bayesian neural network community turn out to be indistinguishable from Gaussian processes, which are much, much easier to deal with. So it's an interesting curiosity, if you like.
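To state Neal's result a little more precisely (standard notation, not the talk's slides): a one-hidden-layer network with H hidden units,

$$ f(x) = b + \sum_{j=1}^{H} v_j \,\tanh\!\left(a_j^\top x + c_j\right), \qquad v_j \sim \mathcal{N}\!\left(0, \sigma_v^2 / H\right), $$

with independent zero-mean priors on all the weights and the output weights scaled by 1/H, is a sum of H independent random functions, so by the central limit theorem the prior over f(x) converges, as H goes to infinity, to a Gaussian process whose covariance function is determined by the weight priors and the nonlinearity.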
Now, I haven't explained what Gaussian processes are; I could talk about that for an hour, but we don't have an hour, so I would recommend looking into them: they are very nice models for functions. As I mentioned, the field of Bayesian deep learning has actually been around for a long time, and it can be implemented in many ways. The key computational problem is doing inference over the parameters: computing the probabilities of the parameters given the data for a neural network. In the early 90s Yann LeCun had a paper which did some Bayesian deep learning, although more from a physics perspective, before it was even recognized as Bayesian deep learning. Then David MacKay, Radford Neal, and Geoff Hinton developed variational and MCMC methods. Then there was a bit of a lull, when almost everybody decided not to work on neural networks, and then, starting around 2011 and since, many people have revisited inference methods for neural networks, trying to make Bayesian inference fast and cheap so that it is competitive with optimization. This figure is meant to show why we might care: it shows a neural network fit to some data, trying to fit a function. That's all fine where you have the data, the black dots, but outside of where you have the data what you'd really like is big error bars; you'd like uncertainty to be well represented, and Bayesian neural networks will generally do that much better than a point estimate. We had a couple of very successful NIPS workshops on this in 2016 and 2017, and the videos are online if you're interested.

A very recent piece of work I want to talk about is how to combine deep learning with Gaussian processes. The idea itself is not necessarily new, but one thing that is interesting is that we can now do it in a very scalable way. You can take a neural network and slap a Gaussian process on top at the end (a Gaussian process deep neural network, or GP-DNN) and then train the whole thing end to end in TensorFlow using something my group and others developed called GPflow. What's the point? The point is that we're trying to have our cake and eat it too: we want the representational power of a deep neural network but the nice uncertainty properties of a Gaussian process, and this is one way we might get both. What we found is that we get more robustness to adversarial examples, and we get systems that are generally better at saying when they don't know: better extrapolation, better representations of uncertainty.
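To illustrate the "big error bars away from the data" behaviour that a Gaussian process gives you, here is a minimal exact GP regression sketch (my own toy example in NumPy, not code from the talk or from GPflow): the posterior standard deviation is small near the training points and grows back to the prior level far away from them.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between 1-D inputs a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

# Tiny 1-D training set (the "black dots").
X = np.array([-2.0, -1.0, 0.0, 1.0])
y = np.sin(X)
noise = 1e-2

# Exact GP posterior at test points, including far outside the data.
Xs = np.linspace(-8, 8, 200)
K = rbf(X, X) + noise * np.eye(len(X))
Ks = rbf(X, Xs)                      # cross-covariance, shape (4, 200)
Kss = rbf(Xs, Xs)

L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = Ks.T @ alpha                  # posterior mean at test points
v = np.linalg.solve(L, Ks)
var = np.diag(Kss) - np.sum(v ** 2, axis=0)   # posterior variance

# Uncertainty is small near the training inputs and large far away.
print("std near data (x=0): %.3f" % np.sqrt(var[np.argmin(np.abs(Xs - 0.0))]))
print("std far away  (x=8): %.3f" % np.sqrt(var[-1]))
```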
Here is a picture that is meant to illustrate that. Look at this example: it's a very toy example. There are some orange dots and some blue dots; that's the training data, a simple classification problem in 2D, and the neural network can beautifully classify this dataset. You get very good training and test accuracy, if you only test on this distribution. The question is what happens when you go far away, all the way over here, or over here. Here is what the neural network does: it confidently extrapolates to places where it has never seen data. It says the probability of being orange is virtually 1 here, and virtually 1 over here. That's bad; that's a system that doesn't know when it doesn't know. It should say "I don't know" way out here, because it has never observed any data way out here. This is a trivial example, but in high dimensions this is going to happen and you won't even know it. Here is what a Gaussian process does (you can run a Gaussian process exactly on this small, low-dimensional problem): it performs as you would expect, classifying the data really well, and outside the areas where it has seen anything it reverts to white, where white means "I don't know", a probability of about 0.5. That's the behavior you want. Here is the Gaussian process deep neural network, one particular run; the deep network has idiosyncrasies, so different runs produce slightly different shapes, but essentially it still classifies the data well and it has many more regions out here that are in the "I don't know", probability-0.5 area, and if you simply average over two or three runs you get something that looks basically like this. So that's an illustration of one way of combining Gaussian processes and deep learning.

I'm going to talk briefly about two other ideas, and then I want to leave time for questions. The second really exciting idea is probabilistic programming. Raise your hand if you have worked with a probabilistic programming system. Okay. Raise your hand if you know what it is. A larger fraction know what it is. Even so, I won't spend too many slides on this. One of the problems with the probabilistic modeling framework is that writing down probabilistic models and deriving inference algorithms is time consuming and error prone. The solution is to provide better tools, just as happened in the neural network community: once you have tools for automatic differentiation, it's a revolution; you just write down the model and the computer does the differentiation for you, so you don't have to take derivatives by hand for every model. We want to do the same thing for probabilistic modeling, and the solution comes in the form of probabilistic programming languages. These are languages for expressing probabilistic models as computer programs that generate data; in other words, simulators. Simulators are an idea that many fields have discovered and make use of all the time, and we are going to treat a simulator as a model. Again, to be clear, it doesn't have to simulate everything, it doesn't have to be a generative model of everything; you could just simulate y as a function of x, for whatever complicated y and x you have. But that's not the magic. The magic is that, given a general-purpose probabilistic programming language, we can develop universal inference engines: engines that do inference over program executions given the observed data. Intuitively, I have a simulator from which I can simulate some fake data; the process of learning is that I have some real data, and I want to figure out how to get my simulator to generate data that looks more like my real data. In the optimization framework you take derivatives through everything and try to optimize the simulator to fit the real data; that's what maximum likelihood is. In the probabilistic programming framework you say: my simulator has many parameters that are uncertain, and what I really want to do is apply Bayes rule, given the observed data, to my computer program, essentially running my computer program in reverse to infer what those random variables were, given the data.
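As a concrete sketch of "a model is a program that generates data, and inference runs it in reverse", here is a tiny example in Pyro, the language discussed below; the linear model, the data, and all the settings are my own illustrative choices, not material from the talk.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

# A probabilistic program: a simulator for noisy linear data.
def model(x, y=None):
    w = pyro.sample("w", dist.Normal(0.0, 1.0))          # unknown slope
    b = pyro.sample("b", dist.Normal(0.0, 1.0))          # unknown intercept
    sigma = pyro.sample("sigma", dist.HalfNormal(1.0))   # unknown noise level
    with pyro.plate("data", len(x)):
        # Run forward, this simulates data; conditioned on observed y,
        # the inference engine effectively runs the program in reverse.
        return pyro.sample("obs", dist.Normal(w * x + b, sigma), obs=y)

# Toy observed data.
x = torch.linspace(0, 1, 50)
y = 2.0 * x - 0.5 + 0.1 * torch.randn(50)

# A general-purpose inference engine: MCMC (NUTS) over the program's
# random choices, given the observations.
mcmc = MCMC(NUTS(model), num_samples=300, warmup_steps=200)
mcmc.run(x, y)
posterior = mcmc.get_samples()
print(posterior["w"].mean(), posterior["b"].mean())
```

The same model function is used both ways: called forward it simulates fake data, and handed to an inference engine together with observations it yields a posterior over its random choices.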
There are many examples of this: BUGS, Infer.NET, and so on. There are two projects I've been involved in: Turing, which is a probabilistic programming language based on Julia, and Pyro, which I'll talk about, a language developed at Uber AI Labs. Under the hood there are many different inference engines, based on Markov chain Monte Carlo, particle filtering, and other methods. Here is a typical model, a hidden Markov model, written in one of these probabilistic programming languages, Turing; it's just a few lines of code, which is exactly what you would like, and if I want to make it a Bayesian hidden Markov model I just have to uncomment one line and run it again, without re-deriving anything. The real attraction of probabilistic programming is that I think it could really revolutionize scientific modeling, because having models and trying to figure out what is a good model for data is fundamental to scientific modeling, but also to machine learning and AI. Here are some pointers to useful tools and resources. Pyro, the language we developed at Uber AI Labs, is a really interesting language because it tries to bring together the best of deep learning and the best of universal probabilistic programming. What we are trying to do, building on PyTorch as a framework for deep learning, is to layer on the ability to express any probabilistic model you can write down in Python. The idea of a universal probabilistic program is that anything you can write in a Turing-complete programming language that calls random number generators is a valid probabilistic model, and that's the class of models you want to be able to do inference in. So have a look if you're interested.

The last thing I want to talk about, very briefly, is a long-running project called the Automatic Statistician. This is a sort of AI for data science, where we are trying to develop tools that automate aspects of inferring models from data (and here by data I mean things like tables of numbers), and not just that, but that explain those models in a report, in an interpretable way, to the end user. I'm going to skip over some of this so we have time for questions, but: in goes the data, the system searches over possible models using the Bayesian Occam's razor to figure out good models for the data, it comes up with the best model it can find, and then it tries to explain it in a report. It involves some classical ideas like search, trying to find good models. The stopping criterion (this is for a time series example, where it finds better and better models and then stops) uses a concept called the marginal likelihood, which simply comes from the sum rule and the product rule. It's not a new concept, but it is a way of trading off model complexity against data fit, and criteria like MDL can be seen as approximations to the marginal likelihood.
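In symbols, the marginal likelihood (evidence) of a model m is the likelihood with the parameters integrated out,

$$ P(\mathcal{D} \mid m) = \int P(\mathcal{D} \mid \theta, m)\, P(\theta \mid m)\, d\theta, $$

which automatically trades off data fit against model complexity (the Bayesian Occam's razor): an over-flexible model spreads its prior probability over many possible datasets and so assigns less of it to the one actually observed.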
Then, at the end, when it finds a model, it comes up with an explanation. Here is an executive summary generated by the computer program, and here are some example reports; this is the first page of a 10-to-15 page report that the system can generate. If I have some time, maybe I'll show some videos at the end, but I want to leave time for questions.

So let me come back to how to build an AI system. It's very interesting that, as far as I could tell, nobody thought we have the principles we need for building AI systems, but I asked people to raise their hands because I must be weird: I actually think we do have the principles we need. I think we basically know how to do it; it's just really computationally intractable. We have the principles. I don't know if you want to call deep learning a principle, but differentiable nonlinear functions are a good idea, and optimization is clearly a useful tool. Bayes rule, Bayesian inference, is a general principle for doing perception. More data, if you want to call that a principle: more data always helps. For learning, there is one equation that tells you how to do learning from data, at least that's what I believe; we know how to do learning. (Sorry, I'll take questions later, I can see hands.) If you don't like Bayes rule, at the very least we can use the likelihood principle, and almost everybody agrees on the likelihood principle. Optimization is also useful for reasoning; we have logic and probability theory, and I'm not sure we need more than that; we have search; these are tools. Decision theory is well developed: we know how to make optimal rational decisions under uncertainty, we have things like Bellman's equation to tell us how to do sequential decision making, and we have game theory. I think we have everything we need, and it's strange that I seem to be the only one who thinks that. What we don't have is this: we write down the equation for rational behavior and then we say, oh no, all these integrals and all these sums. Optimization may be one way of approximating that, but we need better ways of approximating the intractable computations involved. All this work on amortized inference, Monte Carlo methods, variational methods, recognition models, value functions: these are just ways of approximating optimal behavior, along with more data and more compute. So I am optimistic about AI.

To conclude: probabilistic modeling offers a framework for building AI systems that reason about uncertainty and learn from data, and when you combine probability theory, which is about inference, with decision theory, which is about decision making, you have the foundations for rational decision-making systems. I've briefly reviewed some of the things I think are exciting, and you can think of them as tools on the way to AI, because the problem is computationally intractable: we need better tools for representing uncertainty in our deep learning methods, better tools for doing automated inference over computer programs, and better tools for searching for models and explaining them. The last things here: there is a review paper I wrote, now three years ago, on basically this topic, and of course we're hiring, so if you're interested in these things and in changing the world through probabilistic modeling, Uber AI Labs is hiring, not just in probabilistic modeling but in all other areas of AI as well. All the work I described was done in collaboration with lots and lots of really smart people. So thank you very much; I'll take some questions. [Applause]

Thanks for that really excellent talk. Going quickly back to Cox's axioms: one of the things they state is that the order in which a robot incorporates pieces of information shouldn't affect its final belief state. But this isn't really true for humans, right? What we see initially triggers confirmation bias, which affects how we incorporate future information. You could say that's a bad thing (humans are terrible decision makers), but you could also argue that it's part of how we make intuitive leaps without getting distracted by unlikely possibilities. So if you had a generative model for the emergence of intelligence in the universe, it might be true that very few randomly drawn intelligences would actually satisfy the Cox axioms. I just want you to refute that.

Okay. I'm not interested in biological intelligence, actually. I think we're just lucky apes; we're not some pinnacle of intelligence. We're deeply irrational, and we're lousy at very simple things like multiplying large numbers. In a few years I think we'll look back on humans and realize we clearly got the idea of intelligence wrong, because there will be all these systems that are more intelligent than us in many different ways, and the world doesn't revolve around these funny apes, just as the universe doesn't revolve around planet Earth, which Copernicus helped shed light on. Sorry if that sounds harsh to all the other human beings in this room, myself included: let's not get too full of ourselves. If we want a theory of rational behavior, humans might have some good hacks that we could approximate, but we shouldn't aim to mimic humans.

Thanks. I have a question which relates to this. The whole idea of Bayesian learning is based on having a good prior. How can you imagine that you have a good prior for, say, edge number 2017 in layer 307? Your answer earlier was that it's simpler than you think; I interpret that as "it doesn't really matter", and if it doesn't really matter, the whole Bayesian reasoning falls down.

There are several deep misconceptions in that question. First of all, the whole Bayesian framework doesn't depend on having a good prior; it's not that there is an optimal prior or anything like that. If you want to build a rational decision-making system, it needs to be able to represent the things that may be possible in its world, and then it needs to be rational under those assumptions. You don't need to get the right answer; you just need to cover the possibilities. That's why most Bayesians work on things like Bayesian nonparametrics, and even deep learning architectures can be thought of as nonparametric, because they have a lot of capacity and can model lots of different functions. And let me follow up with a second point: the whole problem with nuisance parameters in your model, like this weight in the neural network, is that that's not the right level at which you should think about the model. The model is a function from x to y, and the priors we should really think about are priors over sensible functions from x to y; the representation in terms of parameters is just a big distraction. That's why, when we build systems like the Automatic Statistician, we try to ensure that they have sensible priors over functions that we can talk about, which is ultimately the input-output behavior that matters. Priors should always be grounded in potential data that you could observe; the parameters of a neural network have nothing, or only very weak things, to do with potential data.

But I think you're missing the no-free-lunch theorem: the theorem means we cannot do anything without prior knowledge, and we can prove it.

Of course; that's the best argument for being Bayesian.

No, because where is the prior knowledge coming from, if the prior knowledge is arbitrary?

From a Bayesian view, the only thing you have is data. Unless you make assumptions you can't make any predictions, and the way you represent assumptions in the Bayesian framework is through probability distributions. You can represent them differently, but you have assumptions as well.

Let me move on. Unfortunately, and this sounds crazy, but I have to run to the airport, and it's not because I'm trying to run away from these great questions. I'm going to take two more, and then I'm literally going to run to the airport (actually, I'll take a car).

Just a quick question. I agree that the models and approaches are partly to blame for the lack of uncertainty modeling, but I would also say the validation methods are partly to blame, because we only ever validate how certain the models are; we don't really validate how uncertain they are.

Yes, I agree. Most metrics used for evaluating performance do not look at uncertainty, and I would like that to change, but it's a cultural change. You have to say: give me the likelihood on held-out data, don't just give me the error rate, because in some cases it's really bad to get things very confidently wrong.

And it's also helpful because humans are quite good at uncertainty, so we can have data points that the model should essentially classify as uncertain.

Yes, great point.

Thank you, very interesting talk. I wonder: suppose that in 50 years quantum computing has made MCMC tractable and PyTorch is taught at undergraduate level; at that point, do we have strong AI? The problem then might be what is rational, what are the ethical things we should build into a system. I was wondering if you had any thoughts on that.

That's a great question. Rational decision making depends on some representation of what you're trying to decide: loss functions, utility functions, and that is clearly in the realm of ethics. That's why we need to understand the ethical implications of the things we try to optimize, so hopefully at least somebody agrees. That's deeply important and we should be working on it, and it's something that, for example, the Leverhulme Centre for the Future of Intelligence is looking at: the impacts of AI, both positive and negative, and how to build good outcomes for the things we care about.

Okay, one more. Thanks for the wonderful talk. I just wanted to get your quick thoughts on approximate Bayesian computation methods, the likelihood-free methodology; it seems to be very popular among biologists but doesn't really get...

ABC is a very interesting method that you refer to. Interestingly, some of the core tools used in probabilistic programming are related to ABC; there is a deep connection there. I think it's one of many tools that can be used for approximate inference, and I like to make sure that people don't fall too much in love with one particular tool: they should empirically evaluate different tools and see what works best. Okay, I'm going to have to go. [Applause]
Info
Channel: AAAI
Views: 3,055
Rating: 5 out of 5
Id: SIa0701Sji0
Length: 56min 50sec (3410 seconds)
Published: Mon Jul 09 2018