Probabilistic Machine Learning - Prof. Zoubin Ghahramani

Video Statistics and Information

Captions
Okay, yep, that sounds like the microphone is on. I'm Phil Blunsom and I'm going to introduce our speaker for the Strachey Lecture this term. Firstly I'd like to say a big thank you to Oxford Asset Management, our sponsor, who makes this series of lectures possible. I've also been asked to draw your attention to our hashtag, prominently placed in case you would like to tweet, and anyone interested in our software engineering program can find brochures outside and people to talk to. But on to the main business: it's my pleasure to introduce Zoubin Ghahramani, who will give our Hilary term Strachey Lecture. Zoubin is Professor of Information Engineering at Cambridge and a Fellow of the Royal Society. I think it's fair to say that his machine learning group in Cambridge has been one of the most influential over the last decade; it's hard to go to any major machine learning academic group or industry lab without finding Zoubin's ex-students or postdocs. More recently Zoubin founded Geometric Intelligence, a startup, and after its acquisition he's now the co-director of Uber AI Labs; maybe if you ask nicely he might tell you a bit about that, we'll see. Zoubin has made contributions across machine learning, particularly probabilistic inference and even deep learning, which isn't surprising: he was Michael Jordan's student and did his postdoctoral work with Geoff Hinton, so a pretty amazing pedigree. Zoubin's seminal work is in nonparametrics, where he has been leading the idea that it's not enough in our machine learning research to aim for accurate predictions: we also need to be able to quantify uncertainty and talk about causation, and if we really want machine learning and AI to have an impact in industry we need to be able to tackle those things. I'm sure he will tell you all about them, so please welcome Zoubin. [Applause]
Thanks, Phil, for that great introduction, and a big thank you to the Department of Computer Science for inviting me. Can you all hear me? Good. I'm going to talk about probabilistic machine learning, which is my passion, the thing I'm really excited about. I'll start from basics, and as the talk goes on we'll get into more and more current research, more of what we're actually doing these days. That's why the subtitle is "foundations and frontiers": foundations is the motivation and background material, but if you're bored by that, don't worry, it will get more technical later on. So let's start from the basics. What is machine learning? It's just a term, and there are many other related terms: depending on the community you come from you might think about data mining, artificial intelligence, statistical modeling, neural networks, or pattern recognition (a somewhat more old-fashioned term). All these terms are related; I'll focus on the term machine learning, but keep the context in mind. In terms of academic disciplines this is a very interdisciplinary area: we draw on ideas from computer science, engineering, statistics and applied mathematics, and we get a lot of inspiration from cognitive science and economics, and even tools from physics and neuroscience. And why are people interested in machine learning these days? It used to be an interesting academic field where you played around trying to get computers to learn from data and most people didn't care much about it, but now suddenly lots of people care.
The reason lots of people care is that there are many, many applications of machine learning. I like to think of it as the invisible thing behind a lot of the more visible applications that involve computers learning from data. Let's go through some of those applications just as motivation. Speech and language technologies is an area that has been transformed by machine learning: automatic speech recognition, machine translation, question answering, dialogue systems, and every year we seem to get more advances in these sorts of tools. Computer vision is again a field that has been around for a very long time, but with the advent of large amounts of data and more powerful computational tools we are now able to do interesting things: not just object, face and handwriting recognition but image captioning, going from an image to a bit of text that's meant to describe the image. This is from a very famous paper, and you can pick it apart in the sense that these examples were hand-chosen to make the algorithm look good, but "man in black shirt is playing guitar" seems pretty amazing for a computer to produce from an image like this. It doesn't always work that brilliantly, but I would say most of us in the field were stunned when we saw this happen for the first time, that we could actually get a system producing reasonable descriptions from images. Of course we all have cameras in our pockets that put boxes around people's faces; if you ever ask yourself how that works, it's a bit of machine learning that runs on all of your camera devices. Moving into the sciences, a lot of the sciences have become very data-heavy: bioinformatics, genomics and the medical sciences, but also astronomy, areas where we are now able to collect much more data than any human being could sit down and analyze manually, so machine learning and AI tools have been very important in scientific data analysis; that's something I'll talk about a little later as well. Recommender systems: we all know what these are, the "customers who bought this item also bought" kind of thing, and that's driven by machine learning. Self-driving cars are something I'm now much more involved in. This is not a totally new thing: the self-driving car ALVINN was around about thirty years ago and used neural networks to drive at seventy miles per hour on highways. That's what it says on this slide that I took from about thirty years ago, and that's very scary; I would not want to be anywhere close to that truck driving at 70 miles an hour on a highway, driven by a neural network about this big. But things have moved on, and we now have pretty good self-driving systems that are getting better every year. Robotics: I just love the dogs playing football. This particular sort of RoboCup robot isn't necessarily driven by machine learning, but there are a lot of excellent uses of machine learning in robotics. Automated trading and financial prediction. Computer games: you're all familiar with the DeepMind landmark results, first learning to play Atari games at human or superhuman level, then more recently beating the world master at Go. And who knows what this is? This is Libratus, a system that recently won a poker championship against a whole bunch of humans.
The numbers in parentheses are how much money the humans lost to the computer, and the very interesting thing is that this is quite a complicated game: if you think about what poker involves, it involves things like trying to understand the state of mind of the other player, and bluffing, so to be a good poker player you have to be able to do those things. And now we have good machine poker players as well. So what is machine learning? If I had to define it I would use a sentence like this: it's an interdisciplinary field that develops both the mathematical foundations and practical applications of systems that learn from data. Here are some of the main conferences and journals associated with the field. That's the motivation from applications, but when you actually look at machine learning systems, most of the time they are trying to solve one of a few canonical problems, so I'll go through those canonical problems in this introductory part of the lecture. This is probably the most canonical problem, the classification problem: you have some data and you want to classify it into two or more classes, so the task is to predict discrete class labels from input data. That has lots and lots of applications, and there are a lot of buzzwords for the different methods that can be used for classification; these are just different ways of trying to do classification from data. Regression: trying to predict some continuous quantity y from some inputs x. Obviously this has lots of applications as well, and there are lots of methods, some of which you might say are not machine learning methods, like linear regression, which has been around for over a hundred years. But remember, this is all in the context of everything going on in all of these neighboring fields, and there's nothing that says this is a machine learning method and that is not: if it's making predictions and decisions from data, it is a machine learning method at some level. Clustering: the task here is to group data together so that similar points are put in the same group; many applications again, many different methods. Dimensionality reduction: when you have very high dimensional data, you might want to find a low dimensional representation of that data that preserves important information. Semi-supervised learning is another canonical machine learning problem: you might have a few labeled points, like these two labeled minuses and these three pluses, and you want to leverage the fact that you also have a lot of unlabeled data, so semi-supervised learning combines labeled and unlabeled data to get better predictions. And reinforcement learning, which is related to sequential decision making and adaptive control: the task there is to learn to interact with an environment, making sequential decisions so as to maximize future rewards. It's an interactive setting where an agent produces actions or decisions in an environment; there might be some hidden state in both the agent and the environment, you get some observed sensor inputs, and the agent has to act in the environment to maximize its rewards. So those are the canonical problems. It is actually quite bewildering if you start reading the machine learning literature and you're not an expert, because there are many, many different methods.
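As an editorial aside, here is a minimal sketch of three of the canonical problems just listed, each solved with one standard method. The talk names the problems but no library or dataset; scikit-learn, the synthetic data helpers, and the particular models below are illustrative assumptions, not the speaker's choices.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression, make_blobs
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Classification: predict discrete class labels from inputs
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print("classification accuracy:", clf.score(Xc, yc))

# Regression: predict a continuous quantity y from inputs x
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("regression R^2:", reg.score(Xr, yr))

# Clustering: group similar points together, using no labels at all
Xb, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xb)
print("cluster assignments for the first 10 points:", labels[:10])
```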
Every paper seems to present a new method, so here is a very crude way of organizing a bunch of machine learning methods; don't put too much weight on it. For the first few minutes I'm going to focus on one bubble, the neural networks and deep learning one, and the reason should be pretty obvious to anyone familiar with the field: these methods have been really revolutionary, involved in some of the most spectacular breakthroughs of the last few years. So what are they? A neural network (I'll focus on a feed-forward neural network for simplicity; there are other kinds, but the feed-forward network is the most standard) is essentially just a function approximator. It takes some inputs, call them x, and produces some outputs, call them y, and the way it produces them is through a sequence of transformations organized in layers. All of that is in a sense a detail: it's just a way of representing a function that maps from x to y via tunable parameters called weights, which I'll denote by theta. One of the important aspects of neural nets is that they are nonlinear functions, often nonlinear both in the input and in the parameters, so optimizing them to minimize some objective function tends to be slightly complicated. The other defining characteristic of neural networks is that they represent the function from x to y in layers, which is essentially just a composition of functions. Here is a multi-layer neural network with one hidden layer, represented as a function that maps from xs to ys through some parameters; the superscripts (1) and (2) denote the two layers of parameters. These neural networks are usually trained to maximize some likelihood, so they fall very squarely within the world of statistical models, using some variant of stochastic gradient descent optimization, which is where we start using tools from optimization theory. So that's one slide on neural networks. These things have been around for many decades; in fact they are what got me excited about AI back in the 80s, when I was an undergraduate thinking about what to do with my life. But something dramatic has happened between the 1980s and now. One thing that changed is the terminology: people now call these deep learning systems because they have many more layers. But there are other, more interesting, dramatic changes too. The deep learning systems involved in a lot of these very impressive benchmarks are very similar to the neural net architectures from the 80s and 90s, with some important architectural and algorithmic innovations: being able to use many layers, particular nonlinearities such as the ReLU, particular ways of regularizing them like dropout, and very useful tricks for dealing with time series like LSTMs. They are also trained using vastly larger, really web-scale, datasets, and to do that you need vastly larger compute resources: GPUs, GPUs on clouds, and so on. Importantly, there has been a major effort to democratize the software tools so that it's quite easy to actually train a neural network.
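Indeed, a bare-bones version of the layered function just described, trained by stochastic gradient descent on a squared-error objective (the negative Gaussian log-likelihood up to constants), fits in a few dozen lines of NumPy. This is a minimal sketch; the layer sizes, tanh nonlinearity, learning rate, and synthetic data are all illustrative assumptions, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = sin(x) + noise
X = rng.uniform(-3, 3, size=(200, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# Parameters theta = (W1, b1, W2, b2): two layers of weights
H = 20
W1, b1 = rng.normal(0, 0.5, (1, H)), np.zeros(H)
W2, b2 = rng.normal(0, 0.5, (H, 1)), np.zeros(1)

lr = 0.01
for step in range(5000):
    i = rng.integers(0, len(X))           # stochastic: one example at a time
    x, y = X[i:i+1], Y[i:i+1]

    h = np.tanh(x @ W1 + b1)              # layer 1: nonlinear transformation
    y_hat = h @ W2 + b2                   # layer 2: linear readout

    # Squared error corresponds to a Gaussian likelihood (up to constants)
    dy = y_hat - y
    dW2, db2 = h.T @ dy, dy.sum(0)
    dh = dy @ W2.T
    dpre = dh * (1 - h**2)                # backpropagate through tanh
    dW1, db1 = x.T @ dpre, dpre.sum(0)

    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g                       # gradient descent step

mse = float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2))
print("final training MSE:", mse)
```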
We have much better software tools, things like Torch and TensorFlow, and of course there has been vastly increased industry investment and media hype. What that has meant is a huge influx of people trying out different variations of neural networks on different problems. Stepping back, I think of this a little as the community of machine learning researchers running a bit of a genetic algorithm, trying out lots of different ideas and variations to improve on the performance of existing benchmarks. So that's deep learning in a nutshell; there is a huge amount more to say about it, and many people better placed than me to say it, but one thing I do want to talk about is the limitations of deep learning. Let's step back from the excitement, acknowledge the excitement, and ask where we go next and what we need to focus on. I would argue there are a few limitations we really need to think about. One is that neural nets are very data hungry: you often need millions of examples to train these large models. That should not be surprising if you know a bit of statistics; perhaps the surprising thing is that you don't need that many millions to train models with millions of parameters. People would have thought that was crazy, and it is surprising that you can get away with relatively small amounts of data, even though it's large by the standards of the 80s and 90s. They are also very compute-intensive to train and deploy. They are poor at representing uncertainty, which is something I'm particularly interested in. There are some great studies showing that neural nets and deep learning systems can be easily fooled by adversarial examples: you can construct examples that will make the neural network very confidently give the wrong answer, and that should be worrying. It relates to the uncertainty point: it's okay for a system to make mistakes, but it's not okay for it to make mistakes really confidently, because then you don't know when to trust the answers, and you really can't build mission-critical systems, say in the healthcare domain or in self-driving cars, if you can't trust the confidences of your model. They are finicky to optimize: the optimization is non-convex and there are many different parametric and architectural choices that need to be made. And they are generally uninterpretable black boxes, lacking in transparency and difficult to trust. Of course people are working on all of these things, but I wanted to put them on a slide to motivate us to move toward the interesting challenges we have.
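On the adversarial-examples point above: the talk doesn't name a specific attack, but one well-known construction of such confidently wrong inputs is the fast gradient sign method. Here is a minimal sketch in NumPy; a plain logistic-regression model stands in for a network, and the weights, input, and epsilon are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

# A trained linear classifier standing in for a network: p(y=1|x) = sigmoid(w.x + b)
w = rng.normal(size=20)
b = 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=20)          # an input the model classifies with some confidence
y = 1.0                          # its true label

# Gradient of the negative log-likelihood with respect to the *input* x
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# Fast gradient sign step: a small perturbation chosen to increase the loss the most
eps = 0.25
x_adv = x + eps * np.sign(grad_x)

print("confidence on clean input:      ", sigmoid(w @ x + b))
print("confidence on adversarial input:", sigmoid(w @ x_adv + b))
```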
A particular area I'm really interested in, which Phil mentioned in the introduction, is thinking about machine learning as probabilistic modeling. So let's go beyond deep learning (I'll come back to neural nets and deep learning in a minute, in the context of probabilistic modeling) and talk about a particular view of machine learning grounded in the idea that we want systems that will build models from data, probabilistic models from data. What do I mean by a model? The term model gets used by many people in different contexts; what I mean is that a model describes data that one could observe from a system. A model should be able to make predictions, should make statements about observable data; if it doesn't do that, it's very difficult to know whether you have a good model or not, whether you have a falsifiable model, for example. Now, if a model is making statements about possible data that could be observed, then what we're going to do is use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model. Think about a simple model, say a model that forecasts tomorrow's weather. That's not necessarily a simple model, but one could certainly build a simple version of it. You don't want models that make forecasts without telling you how uncertain they are, and you have to consider all the different sources of uncertainty you could have in predicting the weather tomorrow: uncertainty coming from the noise in the sensor data you collected, uncertainty coming from unpredictable effects your model did not consider, and, since your model has parameters, uncertainty about what the right parameters are. We need to deal with all of those sources of uncertainty somehow, and we're going to use the language of probability theory to express uncertainty. To me that is as fundamental as saying we use calculus as the language to express rates of change: probability theory is the language of uncertainty. The good news is that we don't have to invoke anything else; we can stay within this framework of probability theory to infer aspects of the model from data, to adapt our model to data, to make predictions, and so on. It all ends up being very, very simple. Here is Bayes' rule, which is the engine that drives learning from data, and I'm color-coding things into two classes, data and hypotheses. By data I mean anything that's actually measured, a measured quantity; by hypotheses I mean everything else. The world, from a Bayesian point of view, is divided into two kinds of things: stuff you're measuring and stuff you're not measuring. The stuff you're measuring you've measured, so you kind of know what it is; it could be noisy, but you've measured it. The stuff you're not measuring, you had better represent the fact that you're uncertain about it, because you didn't measure it; all of those things we call hypotheses. If we think of these hypotheses as expressing models of data, then for every potential configuration of our hypotheses we should be able to describe the probability of the observed data under that hypothesis. That term is called the likelihood, and maximizing the likelihood, or some penalized likelihood, is actually what drives most neural network learning. But forget about neural nets for now; we're talking much more generally. We have this term, the likelihood, which gives you the probability of the data given the hypothesis, and then we have this term called the prior, which is our representation of our uncertainty about everything we haven't observed, before we get our data. So the game goes like this: before we have our data, we have to place our bets on all the unobserved things, and we use the language of probability theory to do that, putting a probability distribution over our space of hypotheses. Then we observe the data, and that's the beautiful moment.
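Written out in standard notation, the rule being described here, in the talk's own data/hypothesis color-coding, is:

```latex
\[
P(\text{hypothesis}\mid\text{data})
  = \frac{P(\text{data}\mid\text{hypothesis})\,P(\text{hypothesis})}{P(\text{data})},
\qquad
P(\text{data}) = \sum_{\text{hypotheses}} P(\text{data}\mid\text{hypothesis})\,P(\text{hypothesis}).
\]
```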
Now we can compute the likelihood, the probability of the data given the hypothesis, and the simple rules of probability tell you that you multiply these two and renormalize over all the hypotheses you've been considering. What you get is your new state of knowledge, the posterior distribution over your hypotheses given the data, and that is the prior you would use if you got any more data. So there's nothing fundamentally different between the prior and the posterior; each is just the representation of your state of knowledge at some point in the process, given the data you've observed so far. Learning and prediction can be seen as forms of inference using this rule, and here is the one-slide description of Bayesian machine learning that I always use (apologies to those who've seen it). The point is that even the Bayes' rule I had on the previous slide is not a fundamental rule: the fundamental rules of probability theory are two simple rules, the sum rule and the product rule. The sum rule tells you that the probability of some unknown quantity x is the sum, over some other unknown quantity y, of the joint probability; this is sometimes also called the marginalization rule. The product rule says that the joint probability of x and y can be factored into the probability of x times the probability of y given x, or the other way around. From these two simple rules, if we substitute data and hypotheses for x and y, we get the Bayes' rule from the previous slide. If we use theta to represent the parameters of our model, D to represent the observed data, and M to represent the model class we've assumed, then we get Bayes' rule applied to the parameters of our model. What would the parameters be? In a neural net they would be the weights; in linear regression they would be the regression coefficients; every model in this world has parameters. This is the prior, that's the likelihood, and this term here is the normalizing constant, which is itself quite interesting: it's called the marginal likelihood. All of this follows from the sum and product rules. If you want to make predictions about any unknown quantity x given the data, then the sum and product rules tell you there is only one valid way under this framework: you consider the predictions made by every possible parameter value, and you weight them by the term in green, the posterior probability of the parameters given the data and the model class. So the act of forecasting or predicting any unknown quantity given the observed data is, by the sum and product rules, an averaging process: you average over all the hypotheses you've considered. You don't pick the best one, or your favorite one, or flip a coin; you're supposed to average over the space of hypotheses in this particular way. And if you want to compare different model classes, then you apply Bayes' rule at the level of model classes, which looks like this, where the term in red, the marginal likelihood, now appears in the numerator rather than the denominator. None of this is actually mysterious; it all follows from those two rules.
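The slide being described can be reconstructed as the following set of equations, in standard notation, where theta are the parameters, D the observed data, M the model class, and x any quantity to be predicted:

```latex
\[
\text{sum rule:}\quad P(x) = \sum_y P(x, y),
\qquad
\text{product rule:}\quad P(x, y) = P(x)\,P(y \mid x),
\]
\[
\text{parameter learning:}\quad
P(\theta \mid D, M) = \frac{P(D \mid \theta, M)\,P(\theta \mid M)}{P(D \mid M)},
\]
\[
\text{prediction:}\quad
P(x \mid D, M) = \int P(x \mid \theta, D, M)\,P(\theta \mid D, M)\,d\theta,
\]
\[
\text{model comparison:}\quad
P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}.
\]
```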
What do I mean by model comparison? The story might go like this: say I'm a biologist and I do an experiment. I have a colleague, and my colleague says, "I believe this transcription factor regulates these genes," and I say, "No, I have a different model: I believe it doesn't, and that this other one does," or something like that. So my colleague and I have two different models. We could argue about it in words, but if we follow this probabilistic framework, what we should do is both write down our models to the level of specification at which they make predictions about observable data, so that we can assign a probability to the observable data; then we observe the data D and we can settle the argument. We basically say: what marginal likelihood did your model give to the data, and what marginal likelihood did my model give to the data? Both of our models had some free parameters; maybe your model had 17 free parameters and mine had 3, so my model is somehow simpler, and now I get nervous and say that seems unfair, your model has more parameters, and if my colleague goes and optimizes those 17 parameters then sure enough she can fit the data much better than I can. But that's not the game: optimization doesn't follow from the sum rule and the product rule. It doesn't matter that my colleague has 17 parameters and I have three; if we can both compute the marginal likelihood, then we can settle the argument. I actually really strongly believe that in an ideal world science would be done like this. People wouldn't just publish their papers in open journals and share their data in an open manner; they would write down their models in a way that could be evaluated with future data, maybe write them as probabilistic programs, which I'll talk about later, and then we could do principled comparison of models (well, it's actually subjective, given different subjective opinions about what the hypotheses are, but principled). So that's one slide on Bayesian machine learning. Why should we care about all this? We've had a revolution in machine learning with wonderful, fantastic deep learning methods that never mention Bayes anywhere, so why should we care about all this Bayesian stuff? The reason I care is that I would really like models with calibrated senses of uncertainty. I want to be able to trust my system: if it says the probability of there being a pedestrian in front of my car is 0.1, I want that to mean 10%, so I can take actions that correspond to that calibrated probability. Getting systems that know when they don't know is, I feel, very important. There is also a very beautiful thing about all of this: the unease about 17 parameters versus three, or about different structures of models, is resolved because this framework gives you automatic tools to compare models of different complexity and to automate the learning of models from data. This is called the Bayesian Occam's razor, and it's something I will use in the latter part of my talk.
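The quantity the two colleagues would compare is the marginal likelihood, with all free parameters integrated out rather than optimized; the resulting ratio of posterior model probabilities is usually called the Bayes factor (a term the talk itself does not use):

```latex
\[
P(D \mid M) = \int P(D \mid \theta, M)\,P(\theta \mid M)\,d\theta,
\qquad
\frac{P(M_1 \mid D)}{P(M_2 \mid D)}
  = \frac{P(D \mid M_1)}{P(D \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}.
\]
```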
So let's go back to our neural networks, just to ground the discussion a little. Here's a neural network, written in math as a function from x to y, and there are different sources of uncertainty here. One of them is parameter uncertainty: we have weights in the neural network, and given any finite amount of data we're not sure what those weights should be, so we need to represent our uncertainty. But we also have structural uncertainty: we've made some structural choices, like the architecture, the number of hidden units, and the choice of activation functions, and that is also a source of uncertainty. It would be great if we could represent all of that, and that's not a new idea; none of this is really a new idea. In fact the idea of doing Bayesian analysis of neural networks has been around since the early 90s at least, actually the late 80s. Here's a bit of the history of a few different methods, and here is a depiction of what we'd really like: a system trained to do regression on some data should, outside the range of its training data, say "hmm, I don't really know." There are many ways of doing that; these are all different ways, and we had a nice workshop at NIPS on Bayesian deep learning where we brought that history together and looked at some of the current state of the art. This world of machine learning often has camps, and people think you have to be in one camp or another, but you don't; you have to understand what all the tools are in the different camps, and there's a lot of fertile ground at the intersection of those camps. This is one example of that.
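One simple approximate scheme from this line of work is Monte Carlo dropout, developed in the speaker's own group: dropout is deliberately left switched on at prediction time, and the spread of repeated stochastic forward passes is read as predictive uncertainty. The talk lists several methods without detailing any one, so treat this as an illustration; the weights below are random placeholders standing in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an already-trained one-hidden-layer regression network
# (the weights here are random placeholders, purely for illustration).
W1, b1 = rng.normal(0, 1.0, (1, 50)), np.zeros(50)
W2, b2 = rng.normal(0, 0.3, (50, 1)), np.zeros(1)

def stochastic_forward(x, keep_prob=0.8):
    """One forward pass with dropout deliberately left ON at prediction time."""
    h = np.tanh(x @ W1 + b1)
    mask = rng.binomial(1, keep_prob, size=h.shape) / keep_prob
    return (h * mask) @ W2 + b2

x = np.array([[0.5]])
samples = np.concatenate([stochastic_forward(x) for _ in range(200)])
print("predictive mean:", float(samples.mean()))
print("predictive std :", float(samples.std()))   # used as an uncertainty estimate
```

With a trained network, this spread tends to grow for inputs far from the training data, which is the "I don't really know" behavior described above.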
So when do we need probabilities? We need them when our learning and intelligence problem depends crucially on representing uncertainty. I've sort of said that already, but let me describe some examples. Any time we're doing forecasting, whether financial forecasting, weather forecasting, or forecasting demand at Uber or for Amazon products or whatever, we need to represent our uncertainty. Decision making: generally, when you make decisions you're thinking about the consequences of your actions into the future, and it's really useful to represent uncertainty there; it's hard to imagine not doing that at some level. Learning from limited, noisy and missing data: if you imagine doing machine learning on medical records, you have patients, and each of them has lots of things that are unobserved; maybe a few medical tests have been done on each patient, so most of the data is actually missing, if you look at it that way. Learning complex personalized models: whether in a medical domain or a retail domain, you might think you have a huge dataset, but for every patient or every customer you only have a little bit of data, so it's not really a big data problem, and you need to represent uncertainty about that individual. The whole field of data compression is based on probabilistic modeling, and a lot of my interest in automatic model discovery and experiment design is really based on uncertainty. Now, over the last three months I've been involved in setting up Uber's AI Labs, and I'll just mention that in one slide. Why would Uber care about any of this? If you look at many of the problems a large technology company has to solve, they are problems that deal with uncertainty, decision making, personalization and so on; there are huge numbers of problems and huge numbers of opportunities around any of the major technology companies for learning from data and for using uncertainty there. And fairly obviously, if you're trying to build a very complicated system that makes decisions in the real world, like a self-driving car, you'd really like to have calibrated uncertainties in that system. So here is the one-slide picture of my current passions, my current research interests, and in the next few minutes I'm going to touch on a few of these topics. It's fairly modular, so I can stop to give us time for questions at the end. I wanted to put this slide up partly because I had it: I was asked to give a talk about a year ago and told to summarize my work in one slide, which forced me to produce this, and it turned out to be a useful exercise because it crystallized in my mind the thing that really drives me. It's not that I'm a Bayesian and I just love probabilities or anything like that; it turns out the thing that really drives me is that I like stuff that's automated. I want things to be systematic and automated, and computer scientists are very good at that: if you put your computer science hat on and you do something three times, you think, I need to write a computer program to do that for me, because three times was two times too many. The sorry state of machine learning is that stuff is not really automated. There are still tremendous amounts of human labor, arbitrary decision making and tweaking involved in deploying machine learning systems, which is ironic: the whole field is about getting systems to learn from data, but then there are a lot of well-paid researchers and engineers tweaking those systems that learn from data. So let's think about automating these things; this is what drives me. Looking at some of these topics: the automatic statistician, which I'll talk about in a couple of minutes, is about automating the process of model discovery from data, searching for a good model from data. Probabilistic programming, something Frank Wood, who is here at Oxford, is a world expert in, is about automating the process of doing inference in a very general probabilistic model. We also want to automate optimization. Optimization is actually a sequential decision problem: an optimizer trying to optimize a function is making decisions about where to evaluate the function next, collecting some data, moving on to another point, and so on. People don't usually think about optimization that way; they just think, here's an algorithm and here's something I can prove about the algorithm. But actually optimization is very much like bandit problems and reinforcement learning problems: sequential decision making under uncertainty is what drives it. And we want to automate the allocation of computational resources, especially now that machine learning systems are very complex. These systems use a lot of memory and a lot of CPU, the datasets are very big, so we can't just tinker about and run a few experiments on a single computer, and when we run major experiments we have to worry about the fact that they run on a big cloud of computers, using energy; energy costs money, and it's not good for the world to use energy like that, so optimizing resource allocation matters. These are the things that drive me these days, and I'm going to talk about a couple of them very quickly. Probabilistic programming is one of them.
The problem here is that developing probabilistic models and deriving inference algorithms is generally a very time-consuming and error-prone process, and the solution is to develop probabilistic programming languages. These are a very beautiful marriage between the probabilistic modeling world and the programming languages world. The idea is that you have a probabilistic programming language, which is a way of expressing probabilistic models, and the modern ones, the ones that people like Frank Wood and myself are very interested in these days, are completely general, essentially Turing-complete, programming languages that can express any computable probability distribution. That's the expression part. How do you do that? You express your model as a simulator, a simulator that would generate data. That's one canonical way of doing it, and it's a very natural concept: a model for the weather is really a kind of simulator, and I write it as a computer program; a model for my gene expression network is a simulator that generates gene expression data. That's the modeling part. But then you have a simulator and you have some data, and what you're really interested in is inferring, or learning, the parameters of your simulator, of your model, given the data. The very incredible thing is that we can actually come up with universal inference engines: inference engines that, in principle, can compute the probability distribution over the hidden variables in our computer program given the data. It's basically running Bayes' rule on computer programs. We're all used to running computer programs in the forward direction, taking some inputs and producing some outputs, but this is doing it backwards: you have a computer program that takes some inputs, and some calls to random number generators, and produces some random outputs, and that's the data. Now we ask what the inputs, and the results of those calls to the random number generators, must have been in order to observe this output from the computer program. That's Bayes' rule on the program. There are many languages now. Anglican is one that Frank Wood's team has been developing, one of the state-of-the-art languages; our group in Cambridge has a language called Turing, which is much less developed but also exciting, and is based on Julia; and there are many different languages developed by different groups, with many different inference algorithms that can run quite generally on models in those languages. Here, for example, is a hidden Markov model written in Turing. It's fairly easy to read, and if you uncomment one line of this model you go from a regular hidden Markov model to a Bayesian hidden Markov model, so changing models around is as easy as adding and removing a few lines of your probabilistic program. I really think that if this vision plays out it could revolutionize scientific modeling: if people were actually willing to write probabilistic programs for all of their models and share them, then others could take somebody else's model, run it on their data, improve it, and so on. A few resources are listed here.
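To make "Bayes' rule on a computer program" concrete without committing to Anglican or Turing syntax, here is a language-agnostic toy in Python: the model is a simulator that draws a hidden rate and generates count data, and a generic importance-sampling routine inverts it given observations. The specific model, prior, sampler, and numbers are illustrative assumptions, not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(rate):
    """Forward program: given a hidden rate, generate a week of count data."""
    return rng.poisson(rate, size=7)

def log_likelihood(data, rate):
    # log p(data | rate) for i.i.d. Poisson counts, dropping the constant log(k!) terms
    return np.sum(data * np.log(rate) - rate)

observed = simulator(5.0)          # pretend these counts arrived from the world

# Generic inference: sample the program's hidden variable from its prior,
# then weight each sample by how well the program would explain the data.
prior_samples = rng.gamma(shape=2.0, scale=3.0, size=20000)   # prior on the rate
log_w = np.array([log_likelihood(observed, r) for r in prior_samples])
w = np.exp(log_w - log_w.max())
w /= w.sum()

posterior_mean = np.sum(w * prior_samples)
posterior_std = np.sqrt(np.sum(w * (prior_samples - posterior_mean) ** 2))
print(f"posterior over the hidden rate: {posterior_mean:.2f} +/- {posterior_std:.2f}")
```

A real universal inference engine applies this idea (or smarter samplers) to arbitrary programs rather than to one hand-written model.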
I'll just give you a few examples; these are slides from my postdoc. A little bit about Turing, which I'll skip through; that's our HMM example, but much bigger; and this is a Bayesian neural network, where most of the code is specifying the prior on the weights, and then there's the actual neural network, just the basic neural network function. Then you can run inference using, say, Hamiltonian Monte Carlo, and you don't even have to know what that is: the language abstracts the model specification away from the inference. Our language Turing is pretty competitive; it's in the same ballpark as Anglican, occasionally a bit faster, though I know the Anglican team keeps improving their language as well. Another topic I want to talk about is Bayesian optimization, and I have basically a couple of slides on that. The problem here is that you want to find, ideally, a global optimum (maybe that's too much to ask) of some black-box function that is expensive to evaluate, so you can't just evaluate it in lots and lots of places; you need to think about where you're going to evaluate your function next, and we don't want to do that manually, we want to automate the algorithm that does that thinking. The solution is to treat the problem as sequential decision-making under uncertainty, where what we're uncertain about is what the actual function is, and this has a huge number of applications.
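Here is a compact sketch of that "decide where to evaluate next" loop, using a Gaussian-process surrogate and the expected-improvement rule. These are standard choices for Bayesian optimization, but the talk does not commit to any particular surrogate or acquisition function, and the kernel, objective, and search grid below are all illustrative.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=0.3, var=1.0):
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    # Standard Gaussian-process regression equations via a Cholesky factorization
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_query)
    Kss = rbf(x_query, x_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    # Expected improvement for minimization: improvement = best - f
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def objective(x):                 # the expensive black box (a cheap stand-in here)
    return np.sin(3 * x) + 0.5 * x**2

grid = np.linspace(-2, 2, 400)
x_obs = np.array([-1.5, 0.0, 1.5])
y_obs = objective(x_obs)

for step in range(10):
    mu, sigma = gp_posterior(x_obs, y_obs, grid)
    ei = expected_improvement(mu, sigma, y_obs.min())
    x_next = grid[np.argmax(ei)]          # the sequential decision: evaluate here next
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

print("best x found:", x_obs[np.argmin(y_obs)], " f(x):", y_obs.min())
```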
I'll also say a couple of words about the automatic statistician, but I do want to leave some time for questions. The automatic statistician is about trying to automate model discovery. The idea is that we'd like a system where we can just give it data and it searches over a large space of models, evaluating them according to some principled metric that trades off model complexity against the amount of data you have; the marginal likelihood I described is one such metric. It produces a model and then, interestingly, translates that into a report that is then interpreted by a human being. So this is the opposite of a black box: we really want a transparent box, something the human will be able to understand. I'll skip over most of this because I do want to leave time for questions. We do a search over models; this is the automatic statistician applied to some time series; it finds a good model and then comes up with a description of that model, producing the text itself. This is the executive summary; the full text takes the form of documents that are 5 to 10 pages long. Here is the report-writing demo, and this is a slightly different version which actually does clustering: it tries to visualize things, tells you what it's found, and so on. It tends to perform well at prediction, because being systematic pays off. We've applied this to classification as well, to regression, to clustering and so on, and we're going to have a release of it; I keep saying very soon, but this time I really mean it, and very soon means in a couple of months. So I'm going to wrap up there. This probabilistic modeling framework isn't the only way to do machine learning, but it's a really useful organizing principle, and it's completely compatible with whatever choice of models you have: whether you like deep learning, or even logic and other frameworks, we really can hybridize a lot of these methods to produce interesting systems that reason about uncertainty and learn from data. I've briefly reviewed three topics; this is a review paper I wrote a couple of years ago that summarizes this line of work. And I wanted to end by thanking a whole bunch of collaborators I've had. [Applause]
Stick your hand up if you have a question. "How do you envision priors in this kind of modeling?" So the question is, basically, how do we come up with the priors. I think this is a great question, and the interesting thing is that there's actually no difference between the prior and the rest of the model. In my world there are only two things: there's data and there's the model. In a lot of statistics people say the likelihood is given, but who came up with the likelihood? The likelihood makes assumptions and the prior makes assumptions; in fact, if you don't make any assumptions, all you have is the data, and the only thing you can say is "that was my data." All modeling makes assumptions, and when we specify a model fully, including the likelihood and the prior and so on, the nice thing about a fully specified model is that from it we can actually evaluate the probability of data; if I don't specify the prior, I can't do that. Let me give you a very concrete example. The simplest model in the world, in statistics, is linear regression. If I tell you the relationship between X and Y is a linear regression, to me that's not a well-specified model, because there is a huge difference between a linear regression where I assume the slope can go between plus and minus one and a linear regression where I assume the slope can go between plus and minus a thousand. Those are two different models. The prior is just a way of describing the sensible range that parameters can have in your model. The same holds for any modeling problem you encounter: if you look closely, it makes some assumptions, there are some parameters, and you have to say what the sensible ranges for those parameters are; different choices correspond to different models.
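One way to write down the two "different models" in that answer, using the slope ranges from the talk, is shown below; the Gaussian noise term is an added assumption to make the likelihood explicit.

```latex
\[
y = w\,x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2),
\]
\[
M_1:\; w \sim \mathrm{Uniform}(-1, 1),
\qquad
M_2:\; w \sim \mathrm{Uniform}(-1000, 1000),
\]
\[
\text{and in general } P(D \mid M_1) \neq P(D \mid M_2) \text{ for the same data } D.
\]
```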
Next question: to what extent is causality important, and how do you model that? I haven't actually talked much about causality; I've worked a little bit on it, mostly with colleagues who know more about it than I do. Causality is hugely important in the sense that, if we want to understand our world and act in it, it's not good enough to figure out that things are correlated: we need to know what the consequences of manipulating one variable will be on other variables. Time is a great indicator for causality, because we know causality can't go backwards in time, but if you just have observational data it is quite difficult to figure out what the causal relationships are, whether X caused Y or Y caused X. And when I say quite difficult: people used to be too negative about it and thought it was impossible, and if you don't make any assumptions it is impossible, but if you start making some sensible assumptions, recent work by people like Bernhard Schölkopf, in a more classical setting, and others in a more Bayesian setting, has shown that you can actually tell with some certainty whether X caused Y, or Y caused X, or whether there was a hidden common cause. So there have been a lot of recent advances in causality; it's a very exciting and hugely, hugely important area. Yes, right behind you, take that question. "To me it seems a bit like a lot of brute engineering; do you think there's a unifying learning theory?" I'm surprised, because to me it seems like a very unified learning theory, so what's the engineering aspect of it? "The choice of prior. If you look at your slide, what ties all of this together and your interests?" Ah, right, my interests, this yellow side. Okay, let's come back to this slide. It's fair to say there isn't a unifying theory here, in the sense that this is a desire: our desire is to make stuff more automatic. But if you look at the way in which we solve particular problems, say the problem of discovering a model from data, or the problem of optimizing a function that is expensive to evaluate, or the problem of doing inference in a probabilistic program, then for each one of those problems there is a very clearly defined normative thing to do. There is an ideal you want to achieve, and you can write down what that ideal is: you have principles based on Bayes' rule, principles based on Bellman's equation, other basic principles you can invoke. The bad news is that the ideal thing is generally computationally intractable, so you need to make some approximations to be able to solve it. I think the biggest conceptual challenge in this whole framework is that we can't do the ideal thing, because it's too expensive computationally, so we need good approximations, and that is not actually fundamentally understood. What's understood is the ideal answer in a world where computation is free; in a world where you have limited, bounded resources, which is this bubble here and relates to ideas of bounded rationality, it becomes fiendishly difficult to figure out how to do the rational thing within constraints on computational time, memory and so on. The problem is that to figure out what the optimal thing is in that setting, you have to use resources: you're sort of living in a box where you have to figure out how to use resources to figure out how to use the rest of your resources. Yes, that is difficult. I think we might have time for maybe one last question here. "What's the most promising way in which we might be able to take the sorts of models we have right now and understand them?" There is a lot of fantastic work trying to understand deep learning models. I know, for example, that DeepMind has a great interest in developing visualization tools for what's happening inside deep learning models, so one approach is trying to visualize what's going on, but I actually think that's really difficult, because these systems have so many parameters that you might visualize a little bit of it and still not know what the system as a whole is doing. I'm more interested in understanding the input-output behavior of these systems, properties of the input-output behavior, and tools for figuring out the uncertainty in the input-output behavior are the level at which I think you can actually understand these things. "So those are tools for representing that; roughly, you analyze the model after you've built it, rather than building it in a different way originally?" Yes. There is a whole strand of work that we are involved in on transparency and interpretability of models, where there are essentially two approaches.
At a very high level, you either build a model that's interpretable in the first place, like a shallow decision tree or something like that, or you build a complicated model and then build an interpretation system that tries to explain what the complicated model is doing. Those are the two ways of doing things. [Applause]
Info
Channel: The Artificial Intelligence Channel
Views: 11,353
Rating: 4.8277512 out of 5
Keywords: singularity, ai, artificial intelligence, deep learning, machine learning, immortality, anti aging, deepmind, robots, robotics, self-driving cars, driverless cars
Id: 095Ee0rKC14
Length: 60min 53sec (3653 seconds)
Published: Sun Nov 12 2017