“The Automatic Statistician” – Professor Zoubin Ghahramani

Captions
So I'm going to talk about work that we've been doing on the Automatic Statistician, and this is really part of a larger research program. A few years ago I was asked to give a talk where I had one slide to describe my research interests, and being forced to produce that slide made me realize that the thing that really drives me is the idea that we could further automate machine learning; the interests haven't changed since then. Machine learning practice is, ironically, very labor intensive: even though machine learning is all about getting computers to learn from data, the way it's actually done is that people spend a lot of time applying their expertise and their little tricks to get learning algorithms to do something interesting with the data they have. I think we should go one step further, look at that process, and try to make it more intelligent, more rational, and more efficient. Many of the things I've been doing relate to this theme of automating machine learning.

The talk today is firmly focused on the Automatic Statistician, but I'm also passionate about a few other areas: probabilistic programming, which seeks to automate inference in probabilistic models; methods for more automatically scaling up to very large data sets; methods for allocating computational resources in a rational way (machine learning is a very resource-hungry practice and we are incredibly inefficient in the way we use computers, so applying some rationality to that is really important); Bayesian optimization, which is something we actually use quite a bit within Uber, for example as a tool for optimizing expensive functions; Bayesian deep learning, because deep learning has taken off as a hugely impactful area of machine learning but there are limitations to current practice that I think Bayesian approaches can help with; and Bayesian nonparametrics, which is an area that's always been in the background as a fundamental tool that is at least important to know about, if not to use on an everyday basis. Those are my current interests. I'm not going to talk much about what I do at Uber; it's not because it's secret, and I'd be happy to answer questions about that at the end if people are interested.

What I'm going to focus on is this problem of automating machine learning, something that many people have thought about over many years. This figure, which you can't really see well, relates to the data analysis process, and the takeaway message is that the process has many stages, many of which are incredibly important but often quite neglected, because they're not very exciting in terms of writing research papers, and yet they are absolutely critical for getting value out of your data: things like automating feature selection, transformation, and understanding (I put the word "automating" in front of all of them, although most of these things are done manually these days), the process of dealing with the raw data and turning it into a usable form, and the actual data collection and experiment design: where did the data come from, and how do we collect more data?
The thing that I'm going to focus on quite a bit during this talk is model discovery and explanation: once you have the data, there's a matchmaking process of deciding what model to use to answer the questions you're interested in, and then the explanation bit, which is, once you've done that, how do you explain what you've done to other people? I'll also touch on automating the allocation of computational resources, and on automating inference, by which I mean things like software frameworks such as probabilistic programming languages that let you sidestep deriving everything by hand and implementing the inference algorithms; I would put automatic differentiation tools in this bucket too. These are tools that have really revolutionized the practice of machine learning, because people no longer have to do derivatives by hand, code them up, and then debug them. So there are many stages of interest, and I think we need to handle all of them to really get to the next stage of extracting a huge amount of value from data.

The thing I'm going to really focus on is automating model discovery, which is the Automatic Statistician project. The subtitle for this is "an AI for data science". I'm not really interested in general AI, which I think is a bit of a bogus notion, but if you pick a limited domain like data science, we can do interesting things that we would call intelligent data science. The nice thing about this project is that it has a tremendous amount of overlap with, for example, the priorities of the Alan Turing Institute, which is the national institute for data science and artificial intelligence, so it's fun to be working on something that brings these two fields together. In fact, the work I'm going to present has a closely related sister project within the Alan Turing Institute called AI for Data Analysis, which focuses on some of the earlier stages of this data analytics pipeline, whereas I'm going to talk about the model discovery and explanation side of things. So that's just a pointer to another really exciting project at the Turing Institute.

So what's the problem that the Automatic Statistician is trying to solve? At a very basic level, we've got lots of data, there's a lot of value in that data, and we'd really love to be able to extract that value, but we can't afford, even if we're a hedge fund, to hire enough data scientists to get all that value out. There simply aren't enough data scientists and machine learning researchers out there to meet the demand that all of this data is presenting. There are two approaches to that. One of them, which is what professors do, is to train up more people to be data scientists so they can be hired by industry or other sectors. The other approach is to automate what they do. The negative way of thinking about that is that you're putting people out of a job; the positive way is that you're giving them superpowers, making them vastly more efficient than they were before, and that's the way I like to think about it. When I spoke to a bunch of statisticians, one of the things I cheekily did was change the title to "The Automatic Data Scientist", and they weren't as offended about automating data science as they would have been about automating statistics.
Statistics is such a deep and important field, surely it can't be automated; but data science, sure, that could be automated, right? I haven't yet given the talk about automating AI researchers to an AI audience; I think that would be fun as well. So that's the problem we're trying to solve, at a very high level, and the solution, if you can do the error-correcting decoding here, is to develop a system that automates model discovery from data. The real ambition is a system that does everything end to end, where basically you, the human, have a conversation with the system about the data: the system can ask you questions about the data, you can ask the system questions about the data, and through that conversation you figure out what's going on that's of interest. You might also do high-throughput predictions on the side if you want, but a lot of this centers around really trying to understand what's in the data. One of the key ingredients is that matchmaking process of discovering a good model for the data, which today is done in a very ad hoc way; it's actually quite a creative process. Most of the time, when people get a data analysis problem, they think of the few models they learned about when they were PhD students or master's students, and those are the models they apply, because those are the models they're comfortable with. But the space of possible models is absolutely huge, and if we were just a bit more systematic about searching that space, we might find much better models to account for the patterns in the data. Of course, with that come some risks.

So here are the ingredients of the Automatic Statistician. The way to picture it is a system where data goes in at one end and a bunch of things come out at the other, including a report that describes to you what's in the data; it could be an interactive report or a static report, and I might show you a video of this thing in action if we have time. When I talk about data, of course data could mean almost anything, but to keep my life simple, think of tables of numbers, something like you might find in a spreadsheet. They might not be purely numerical, and we should be able to allow for other kinds of data types, but it's a table with rows and columns, nothing much fancier than that. So what do we want in between the raw data and the report at the other end? We want to replicate some of that creative process of coming up with a good model for the data, and to do that we need more than testing two or three models or a handful of models. Ideally, what we'd like is an open-ended language of models, a way of expressing many different models as combinations of simple primitive operations. Everything we're doing is early days in this effort, but I'll describe a system that does that in the context of time series models. Now, of course, the downside of an open-ended language of models is that the number of models you might consider for your data set is absolutely huge, and that has two downsides. One is that you then have to search that space for a good model, so you want an efficient search procedure for finding good models.
But you also have the risk of overfitting, so you want a principled method for evaluating the models against the data, one that really trades off the complexity of the models with the fit to the data, because if you don't do that, then with enough time you'll always be able to find some model in this vast space that captures all sorts of spurious patterns in the data, and that's just classic overfitting in model space. The last thing we want is not to be a black box. One of the big criticisms of machine learning these days, and especially of deep learning methods, which I love and which are wonderful in many other ways, is that they are terrible black boxes, in the sense that they have millions of parameters, inputs go in, outputs come out, and it's hard to know what happens in the middle, hard to give any explanation for the decisions that come out of these systems, and hard to rely on them to robustly generalize to new situations. So explainability and transparency are things we care about, because a lot of people find them useful. It depends on the application domain, but I think even in finance, if you have a model that seems to perform very well but you can't really explain what it does at all, people might not trust allocating lots of money to it; it depends on where you work, but I've certainly experienced that.

So now I'm going to go a little into the details. What we're going to look at is not the space of all models, which is obviously too big, but, for now, regression models; at the end of the talk I'll mention some generalizations. Regression consists of learning a function, call it f, from some inputs X to some outputs y, where the outputs are generally thought of as continuous values, and the data is just input-output pairs: a very classical, simple machine learning problem, and of course also the problem that underlies over a century of statistics. What we want is a language of regression models that captures some of the simple, interpretable things statisticians have been doing for over a hundred years. We want to be able to fit linear models and explain them; we want to be able to fit polynomials or exponentials or other kinds of curves; but we also don't want to limit ourselves to those types of models. We want to be able to describe the functional relationship between x and y in simple, intuitive terms: maybe it's a smooth function, or a periodic function, or a monotonically increasing function, and so on, terms that people might find useful. We would also like inference to be tractable for all the models in the language; it doesn't strictly have to be, because if something isn't tractable or is computationally expensive we can always approximate it, but there is a class of nice models that have some of these properties. The models we're going to use for this part of the work, and one of the workhorses, though not the only workhorse, of the Automatic Statistician, are Gaussian processes. Raise your hand if you know about Gaussian processes; raise your hand if you don't know about Gaussian processes (that's to capture the shy people who didn't raise their hands for either question); raise your hand if you've used Gaussian processes. Okay, good. A Gaussian process is just a way of defining distributions over functions.
What we're trying to do is learn a function, and what we're going to do is start out with a distribution over functions and then condition on the data to get a posterior distribution over functions; so Bayesian inference is the other workhorse that we use. A Gaussian process is a distribution over a function such that any finite subset of function evaluations, f evaluated at x1, x2, through xn, is an n-dimensional vector with a multivariate Gaussian distribution; if that's true for any set of n inputs, and all those multivariate Gaussians are coherent with each other, then there is an object called f which is a draw from a Gaussian process. Just as a multivariate Gaussian is defined in terms of a mean vector and a covariance matrix, a Gaussian process is defined in terms of a mean function, mu(x), and a covariance or kernel function, call it k(x, x'). We're trying to express beliefs about functions from X to y: the mean tells us, on average, where we think the function is going to be, and the covariance k(x, x') tells us how similar, or correlated, the function value at x is with the function value at x'. Typically, covariance functions where the covariance drops with the distance between x and x' give us smooth functions, but there are other kinds of covariance functions out there; for example, periodic covariance functions will have f(x) highly correlated with f(x') if x' is one period away, like a week away or a day away, whatever the periodicity is. The covariance function used in a Gaussian process is identical to the kernel used in kernel machines, which were very popular, indeed the most exciting thing in machine learning, in the mid-to-late 90s.

Here's a Gaussian process in action. The process you're seeing in these four panels is what I call learning. Before I observe any data (ignore the pixelation), I have a prior over function values that's very broad; I don't know what the function is going to be, and that's this whole shaded area. Now I observe a single data point, a single pair of x and y. What happens then is that I've learned something about the function at that point, and under this kernel that tells me something about the function nearby, but far away I still don't know anything about the function. With two data points, or three data points, I learn more and more about the function, so this envelope of uncertainty shrinks as we get more data and as we get closer to the training data. This is learning when we assume the function values are noise free, but you can trivially generalize it to the case where you observe noisy values of the function. It's a simple, beautiful illustration of Bayes' rule in action: all that's going on here is successive applications of Bayes' rule, where I start from the prior, observe one data point, get a posterior over functions, which is now the prior when I observe the next data point, and so on.
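To make that concrete, here is a minimal sketch of the conditioning step just described, assuming a squared-exponential kernel, noise-free observations, and a tiny made-up data set; for noisy data you would add a noise variance to the diagonal of the training covariance.

```python
import numpy as np

def sq_exp_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# A few observed (x, y) pairs, as in the "learning" panels described above.
X_train = np.array([-2.0, 0.5, 1.5])
y_train = np.array([0.3, -0.6, 0.8])
X_test = np.linspace(-4, 4, 200)

# Prior covariances (a little jitter keeps the solve numerically stable).
K = sq_exp_kernel(X_train, X_train) + 1e-9 * np.eye(len(X_train))
K_s = sq_exp_kernel(X_train, X_test)
K_ss = sq_exp_kernel(X_test, X_test)

# Bayes' rule for Gaussians: posterior mean K_s^T K^{-1} y and
# posterior covariance K_ss - K_s^T K^{-1} K_s over the test inputs.
# (For noisy observations, replace K with K + noise_variance * I.)
post_mean = K_s.T @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

# The "envelope of uncertainty": small near the training points, large far away.
print(post_std[np.argmin(np.abs(X_test - 0.5))], post_std[0])
```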
Now, Gaussian processes are interesting because you can relate them to a lot of other models. One relationship that I wasn't planning to talk about, but let me get it out of the way, is that they're related to neural networks. Radford Neal proved, in the mid-1990s, that if you have a neural network, a multi-layer perceptron with one layer of infinitely many hidden units, then under some fairly weak conditions on the prior over the weights, it converges to a Gaussian process. What happened in the community back then was that neural networks were a complete headache: people hated them because optimization was hard, they had local optima, and so on. So when Radford Neal proved this, people said, great, we can throw away neural networks and just use Gaussian processes, where everything can be done analytically with a couple of lines of linear algebra. It was revolutionary, it was wonderful, but as David MacKay said many years later, maybe we threw away the baby with the bathwater; maybe neural networks were worth looking at. And of course they came back with a vengeance around 2012 with the deep learning revolution. Interestingly, we have a paper out just a few weeks ago where we extend Radford Neal's result from infinitely wide networks that are shallow to networks that are both deep and wide: in that limit, depth doesn't buy you anything, because if your network is really wide you still end up, in the limit, with a Gaussian process. It's interesting to think about that.

Here's another way of deriving Gaussian processes. You start with linear regression, the most basic model in all of statistics, and then you apply a few operations to it. One operation is that, instead of the output y being a real-valued variable as in regression, I think of it as a discrete or categorical variable, as in classification; these magenta arrows turn regression problems into classification problems, and an example of a linear model that does classification is logistic regression, which I think of as the classification version of linear regression. Another operation is that, instead of finding a point estimate of the parameters by least squares or something like that, I do Bayesian inference over the parameters: I put a prior on the parameters, multiply by the likelihood, and get the posterior. That gives Bayesian linear regression, a textbook model in Bayesian statistics; those are the blue arrows, the Bayesian versions of point-estimate or maximum-likelihood models. The orange arrows are applications of the kernel trick, again wildly popular in the mid-90s: the idea of transforming your raw inputs x into some high-dimensional feature space, call it phi(x), and then fitting a linear model in that feature space. Your model is nonlinear in x, because you've mapped it into this high-dimensional space, but it's still linear in the parameters, so all the computations are super easy. Kernel regression is an application of that trick, and if you take the kernelized version of a linear classification model, you get kernel classification, which is where support vector machines live, among other models. Gaussian processes live at one corner of this cube: they're the Bayesian version of kernel regression, or the kernelized version of Bayesian linear regression.
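That last equivalence can be checked numerically in a few lines. The sketch below, with an arbitrary cubic-polynomial feature map chosen purely for illustration, computes the predictive mean of Bayesian linear regression in weight space and the predictive mean of a GP whose kernel is the induced k(x, x') = prior_var * phi(x)^T phi(x'), and confirms that they agree.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """A hand-picked feature map (polynomial features up to degree 3)."""
    return np.stack([np.ones_like(x), x, x ** 2, x ** 3], axis=1)

X = rng.uniform(-2, 2, size=20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
X_star = np.array([-1.0, 0.0, 2.5])

prior_var, noise_var = 1.0, 0.01          # N(0, prior_var * I) prior on weights, Gaussian noise
Phi, Phi_star = phi(X), phi(X_star)

# Weight-space view: posterior mean of the weights, then predict.
A = Phi.T @ Phi / noise_var + np.eye(Phi.shape[1]) / prior_var
w_mean = np.linalg.solve(A, Phi.T @ y) / noise_var
pred_weight_space = Phi_star @ w_mean

# Function-space view: a GP with the induced kernel k(x, x') = prior_var * phi(x)^T phi(x').
K = prior_var * Phi @ Phi.T
K_star = prior_var * Phi @ Phi_star.T
pred_function_space = K_star.T @ np.linalg.solve(K + noise_var * np.eye(len(X)), y)

print(np.allclose(pred_weight_space, pred_function_space))   # True: same predictions
```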
If you apply all three of these operations, what you get is Gaussian process classification, which is like the Bayesian sister of support vector machines. I once gave a talk titled "Why I never use SVMs", and the reason was not that I don't like SVMs, but that as a Bayesian I can always use Gaussian process classification instead, and I'm much more comfortable doing that; the two are closely related. Anyway, that was a little deep dive into the relationship of Gaussian processes with other things.

So what can we do with the kernels in a Gaussian process? I talked about a language of models; what do I mean by that? Language has the property that you can take words and compose them to get complicated, interesting sentences, and we want that in the space of models as well. The words, or atoms, of our language are a few simple base kernels: squared exponential, periodic, linear, constant, and white noise kernels are some of the base kernels we can use, and they correspond to basic properties of functions we might be interested in. The squared exponential corresponds to smooth functions; periodic corresponds to periodic functions (here we have just two samples from functions with that kernel); linear corresponds to linear functions; constant corresponds to different constant functions; and the white noise kernel corresponds to functions that are just white noise. Those are the basic building blocks. From these we can compose other valid kernels with a few operations. The two main ones are addition and multiplication (there are actually three operations; the third, which we use for time series, I'll mention in a minute): adding or multiplying kernels gives me another kernel. For example, a linear kernel multiplied by a linear kernel gives me a distribution over quadratic functions, and by closure I can get all polynomials that way; squared exponential times periodic gives me locally periodic functions; linear plus periodic gives me a periodic function on top of a linear trend, so functions that look like this, with some linear trend and some periodicity on top of it; and so on. The other operation we use, which I think is also interesting from a finance perspective, is the changepoint operation. If I have a valid kernel, I can compose it with another valid kernel under the assumption that the function changed, thinking about time series, from being drawn from the first kernel before some point in time to being drawn from the second kernel after that point in time. That changepoint operation still gives a valid Gaussian process, and we use it because in a lot of the time series we model, that's what we're really interested in: where did the behavior of the time series change?
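As a sketch of how such a compositional kernel language can be written down in code, here are a few base kernels on one-dimensional inputs plus addition, multiplication, and a changepoint operator built from smooth sigmoid switches; this is one common construction, and the hyperparameter values and exact forms here are illustrative rather than the Automatic Statistician's own.

```python
import numpy as np

# Base kernels on 1-D inputs: each takes arrays x, z and returns the matrix k(x_i, z_j).
def se(x, z, ell=1.0, var=1.0):
    return var * np.exp(-0.5 * ((x[:, None] - z[None, :]) / ell) ** 2)

def periodic(x, z, period=1.0, ell=1.0, var=1.0):
    return var * np.exp(-2.0 * np.sin(np.pi * np.abs(x[:, None] - z[None, :]) / period) ** 2 / ell ** 2)

def linear(x, z, var=1.0):
    return var * x[:, None] * z[None, :]

# Composition: sums and products of kernels are kernels.
def add(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)

def mul(k1, k2):
    return lambda x, z: k1(x, z) * k2(x, z)

# Changepoint: switch smoothly from k1 (before t0) to k2 (after t0) using sigmoids.
def changepoint(k1, k2, t0=0.0, steepness=5.0):
    def k(x, z):
        sx = 1.0 / (1.0 + np.exp(-steepness * (x[:, None] - t0)))
        sz = 1.0 / (1.0 + np.exp(-steepness * (z[None, :] - t0)))
        return (1 - sx) * (1 - sz) * k1(x, z) + sx * sz * k2(x, z)
    return k

# Examples from the talk: locally periodic, and a periodic function on a linear trend.
locally_periodic = mul(se, periodic)
trend_plus_seasonal = add(linear, periodic)

x = np.linspace(0, 5, 300)
K = changepoint(trend_plus_seasonal, locally_periodic, t0=2.5)(x, x)
sample = np.random.default_rng(1).multivariate_normal(np.zeros_like(x), K + 1e-6 * np.eye(len(x)))
print(sample[:5])  # one draw from the composed prior over functions
```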
Now, with those ingredients, let's put them together and look at the search space and the evaluation of models. In terms of search, here's what we do, and I like this even just for practical purposes: when you do data analysis, you want to start from simple models and move towards more complicated models, and that's exactly what we implement in the Automatic Statistician. Here is a time series, the Mauna Loa (Keeling curve) data: carbon dioxide concentration measured at Mauna Loa, a volcano in Hawaii. It's a famous data set because it's been used in climate science in discussions of global warming, but for us it's just a bunch of data points over time; this axis is years. We start with a really simple Gaussian process, for example the constant function, and then apply the different operations to search over kernels. Here is the first kernel it found; in this particular language of kernels we were also including something called the rational quadratic kernel, though we got rid of it after a while because those models are not that useful. It finds a kernel, and you can see that it models the data very well; after the dashed line is extrapolation, and the extrapolations look a bit funny to us. There is no science to extrapolation, they could be right, but they seem counterintuitive: there seems to be a pattern here, and those extrapolations don't look great. You can also see the uncertainty it has in extrapolation, which I think is useful too. Then it expands on the current best kernel by applying those operations, multiplication, addition, and changepoints, to find another kernel that explains the data a bit better, periodic plus rational quadratic, then another kernel, squared exponential times periodic plus rational quadratic, that explains the data better still, and so on, and at some point it stops. The final model is squared exponential plus squared exponential times periodic plus rational quadratic, which is just shorthand for the kernel it found for modeling this data.

Where does it stop? It stops using a criterion called the marginal likelihood, which is the probability of the data under a model, integrating over all the parameters of that model. The nice thing about this quantity, which is also sometimes called the integrated likelihood or the evidence, and which is closely related to Bayes factors and the Occam factor, is that it very elegantly trades off the amount of data you have against the complexity of the model you're trying to fit, so that, given your assumptions, you get neither underfitting nor overfitting. In a subjective Bayesian framework you have to start out with some assumptions, some priors and so on, and then under those assumptions this quantity naturally makes the trade-off, so you don't have to do cross-validation or anything like that. So the system uses the marginal likelihood, stops at some point, and says: this is the model I most believe in among the models I've explored in my search. You can't see much difference on the training data, but the extrapolations of that model are much more sensible, and in fact this procedure often gives you very intuitive extrapolations, the kinds of things a human would draw. Just this past year I saw a research paper where people got humans to extrapolate time series and compared them with the outputs of the Automatic Statistician and with some other time series models, and they found that human behavior was more similar to what the Automatic Statistician did than to a lot of those other models. This wasn't my group; I could find the paper; I just thought it was quite amusing that somebody actually did that experiment.
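The search loop itself can be sketched in a few lines. The toy version below greedily grows a kernel expression and scores each candidate by the GP log marginal likelihood, using scikit-learn's kernel classes as stand-ins for the base kernels and optimizing hyperparameters rather than integrating over them; the real system also penalizes the number of kernel parameters, for example with a BIC-style correction, which this sketch omits.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, DotProduct, WhiteKernel

def score(kernel, X, y):
    """Log marginal likelihood of the data under a GP with this kernel (hyperparameters optimized)."""
    gp = GaussianProcessRegressor(kernel=kernel + WhiteKernel(), normalize_y=True)
    gp.fit(X, y)
    return gp.log_marginal_likelihood_value_

def greedy_kernel_search(X, y, depth=3):
    base = [RBF(), ExpSineSquared(), DotProduct()]        # SE, periodic, linear
    best, best_score = None, -np.inf
    for _ in range(depth):
        # Expand the current best expression by adding or multiplying in each base kernel.
        candidates = list(base) if best is None else [op(best, b) for b in base
                                                      for op in (lambda a, c: a + c, lambda a, c: a * c)]
        scored = [(score(k, X, y), k) for k in candidates]
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break                                          # no improvement: stop searching
        best, best_score = top, top_score
    return best, best_score

# Toy data: a trend plus a seasonal component, loosely in the spirit of the CO2 curve.
t = np.linspace(0, 10, 120)[:, None]
y = 0.5 * t.ravel() + np.sin(2 * np.pi * t.ravel()) + 0.1 * np.random.default_rng(0).standard_normal(120)
kernel, lml = greedy_kernel_search(t, y)
print(kernel, lml)
```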
The extrapolations are usually intuitive, and when they're not, one of two things has gone wrong. One failure mode is that your priors were really weird and you didn't realize it: they didn't actually capture your own intuitions. The other failure mode is that the search procedure got stuck in the wrong place, or the approximation to the integrals you had to do was poor. Either way something went wrong, and you can usually diagnose which.

Here's another data set; again it's quite hard to see, cryptographically obscured for you. This is monthly airline passenger counts from the late 1940s to the early 1960s, and you can see it better when you connect the dots. This is the raw data; after the dashed line is extrapolation. When you connect the dots, what you see is periodicity, but not strict periodicity, a kind of approximate periodicity, and the amplitude gets bigger over time; and the extrapolation for the next year or two looks really sensible, I would say. The text here is the model's own explanation, in words, of what it has discovered. This is computer-generated text, and it says that for this data set four additive components have been identified in the data, and then it orders them by the amount of signal explained by each component: a linearly increasing function, which you can sort of see; an approximately periodic function with a period of 1.0 years and with linearly increasing amplitude (the only unit it knows about is years, so that's what it reports; "approximately periodic" also makes sense, as does the linearly increasing amplitude, and I could write down for you the kernel expression that translates into this English phrase). The third and fourth components you can't see so easily by eye: it says there is also a smooth function underlying the data, and uncorrelated noise with linearly increasing standard deviation. These few lines are the executive summary; the full report, which is 10 to 15 pages long, has sections where it removes the first two components of the signal and shows you in the residuals that there is some smooth function, and explains why it thinks there is a smooth function, then removes that and shows that the noise increases linearly in amplitude, and so on. So the report tries to explain why it has come up with this summary.
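At its core, generating that kind of sentence is a translation step from the discovered kernel expression into canned natural-language fragments. A toy version might look like the sketch below, where a model is represented simply as a list of additive components, each a list of base-kernel names; the real report generator is considerably more careful, also describing fitted hyperparameters such as periods and handling many more combinations.

```python
# Toy translator from an additive kernel decomposition to English, in the spirit of the
# report text quoted above. Kernel expressions are just nested lists of base-kernel names.
ORDER = ["PER", "SE", "WN", "LIN"]

DESCRIPTIONS = {
    "LIN": "a linearly varying function",
    "SE": "a smooth function",
    "PER": "an approximately periodic function",
    "WN": "uncorrelated noise",
}

# A few product combinations that get their own phrasing, as in the airline report.
MODIFIERS = {
    ("PER", "LIN"): "an approximately periodic function with linearly varying amplitude",
    ("SE", "LIN"): "a smooth function with linearly varying amplitude",
    ("WN", "LIN"): "uncorrelated noise with linearly varying standard deviation",
}

def describe_component(product_terms):
    """Describe one additive component, given the base kernels multiplied together in it."""
    key = tuple(sorted(product_terms, key=ORDER.index))
    if len(key) == 1:
        return DESCRIPTIONS[key[0]]
    return MODIFIERS.get(key, DESCRIPTIONS[key[0]])   # crude fallback for unknown products

def describe(additive_components):
    lines = [f"{len(additive_components)} additive components have been identified in the data:"]
    for i, comp in enumerate(additive_components, start=1):
        lines.append(f"  {i}. {describe_component(comp)}")
    return "\n".join(lines)

# For example, a model like the airline one: LIN + PER*LIN + SE + WN*LIN.
print(describe([["LIN"], ["PER", "LIN"], ["SE"], ["WN", "LIN"]]))
```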
These are the first pages of a couple of those reports; maybe I can show you one, let's see if that works. Here's another data set, a fun one; let me put it up here, and if I move it you might be able to see where the real dots are. (That's actually a classic paradigm in psychology, in visual perception: you have random dots and then other dots that move coherently, and the monkey or human has to press a button to say which direction things are moving. That's a previous life I had.) So what is this data? It's sunspot activity from the 1600s to 2008; let me make it a little bigger so you can see it better. There are some interesting things in this data which I didn't know about, because I don't know anything about sunspots. There is this stretch that looks like measurement error or something like that, but it isn't: it's something called the Maunder minimum, a period in the 1600s when sunspot activity was very low and fairly constant, which is interesting. There's also something else in the data that you can't easily see here but can see over here: there is some periodicity in sunspot data, as many of you may know.

So what does the actual report say? This is the executive summary. Computer-generated text is never really fun to read, but it says that the structure search algorithm has identified eight additive components in the data, that the first four additive components explain 92.3% of the variation in the data, and so on, and then gives short summaries of the additive components. The first component is a constant. Why is that? It might not pop out, but if you look at the axis, these numbers aren't centered around zero; they all start from one thousand three hundred and something, so the most salient thing to the computer was that this constant is quite different from where it expected it to be. The second component is a constant that applies from 1643 until 1716. What is that? It's the Maunder minimum: the second most salient thing to the algorithm was the change that happened between here and here, so it used two changepoints to delineate the two ends of that minimum. The third most interesting thing was a smooth function that applies until 1643 and from 1716 onwards, so smooth variation outside of the Maunder minimum. The fourth most interesting component was an approximately periodic function with a period of 10.8 years, applying outside of the Maunder minimum: the roughly 11-year periodicity of sunspot activity. The fifth through eighth components are there too, but harder to see by eye. That's what one of these reports looks like; this is largely the work of James Lloyd when he was in my group, together with co-authors. The report then goes on with tables of R-squared values, explanations of each of the components, what the residuals look like, and what the extrapolations look like for the different components, and then there's a section I really like called model checking, which tries to falsify the assumptions of the model. It applies some classical statistical tests to see where the assumptions of the model don't match the actual data, which is a nice interface between classical and Bayesian statistics, and it produces a whole bunch of test statistics, highlights in bold anything that is significant, and then tries to explain it. Here the thing it found was a moderately statistically significant discrepancy in component eight. I like it when my algorithm can be self-critical.
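One simple way to implement that kind of self-criticism is a predictive check: simulate replicate data sets from the fitted model, compute a discrepancy statistic on each, and ask whether the statistic computed on the real data looks extreme. The sketch below is a generic version of that idea, with a made-up residual series and lag-1 autocorrelation as the statistic; the actual reports run a particular battery of tests on the residuals of each component rather than this exact check.

```python
import numpy as np

def predictive_check(y_observed, simulate, statistic, n_rep=1000, seed=0):
    """Generic predictive check: p-value of `statistic` on the observed data against
    its distribution under replicate data sets drawn from the fitted model."""
    rng = np.random.default_rng(seed)
    observed = statistic(y_observed)
    replicated = np.array([statistic(simulate(rng)) for _ in range(n_rep)])
    p = min(np.mean(replicated >= observed), np.mean(replicated <= observed)) * 2  # two-sided tail
    return observed, min(p, 1.0)

# Example discrepancy: lag-1 autocorrelation of the residuals (left-over structure
# suggests the model has missed a component).
def lag1_autocorr(r):
    r = r - r.mean()
    return np.sum(r[:-1] * r[1:]) / np.sum(r * r)

# Suppose the fitted model says the residuals should be iid Gaussian with this scale,
# but the "real" residuals below still contain a slow sinusoidal component.
sigma = 0.3
residuals = 0.3 * np.sin(np.linspace(0, 12, 200)) + sigma * np.random.default_rng(1).standard_normal(200)

obs, p = predictive_check(residuals,
                          simulate=lambda rng: sigma * rng.standard_normal(len(residuals)),
                          statistic=lag1_autocorr)
print(f"lag-1 autocorrelation = {obs:.3f}, predictive p-value ~ {p:.3f}")  # small p flags a discrepancy
```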
So that's the example of a report from the Automatic Statistician for time series. This is now a few years old; we're actually revamping all of the code to make another release. It's been a few years, and we've had changes of people, PhD students and postdocs coming and going, but the project lives on, we're still doing things with it, and we're going to have another release very soon. One of the nice things about being systematic when you search over a big model space is that sometimes you also get better predictive performance, so it's not just about being explainable and producing reports. In terms of the standardized root mean squared error of predictions, standardized so that 1.0 corresponds to the best method, the two versions of the Automatic Statistician are both better than the other things we compared to; this is a box plot over 13 data sets. In particular, if you compare them to just fitting a linear model or a vanilla, out-of-the-box Gaussian process, they're about three times better, so a systematic search over kernels can help quite a lot. Model checking I've already mentioned; I think it's a hugely important area. If we're going to do automated statistics or automated data science, then we should also bake critical self-doubt into our systems, so that they don't overconfidently tell us things that are silly and wrong. We've thought quite a lot about model criticism, and there's a paper we wrote about it a few years ago.

The time series system is what we initially published as the Automatic Statistician, but we've been doing a whole bunch of other things. One of the biggest applications of machine learning is classification; a large fraction of well-known machine learning applications are classification problems. So we also built an Automatic Statistician for classification. It's different from the time series model, but we're still using things like Gaussian processes. Here is a report produced by the Automatic Statistician for classification, about a six-page report, and here are some snippets of text: it uses Gaussian processes, but now it takes multiple input dimensions rather than a one-dimensional time series and tries to predict an output class label, and the nature of the reports you would write to explain that is different from what you'd do for plain 1-D time series modeling. One of the things we focused on in these reports (these are computer-generated text and tables from one of them) is concepts like additive interactions and two-way interactions. We also looked at things like monotonicity, whether the class probability is monotonically increasing or decreasing with a particular input variable, and we use words like "no evidence", "little evidence", or "moderate evidence", because we want the reports to be readable: if you pack them full of numbers, they're not easy to read, but if you say there is moderate evidence that y increases monotonically with x, that's something some people might find useful when analyzing their data. There are more examples of this; we've also looked at other problems, like transforming features.
Let me show you a couple of videos of the Automatic Statistician in action; let's see if this works. It's playing, but it's playing somewhere else, down here, cleverly, so let's start again. Here is the current version of the Automatic Statistician. It's not available to everybody yet, but we will put it online very soon. Sorry that you can't really see it because of the resolution, but what you do is pick a file, upload it, and click "regression" (we might even change that word to something simpler), and off it goes and produces a report. The video is just scrolling through the report; the report refreshes as the Automatic Statistician runs, so it's dynamically producing a report as it evaluates models. It gives you instant gratification, or not really instant, but gratification as soon as it can, and then as it processes more data and runs more cycles it updates the report over time. That's one version of the system. And here's the explanation demo. [Music] You upload the data set; the first part basically ingests the data and tells you what it thinks the data is: how many rows, how many columns, whether there are outliers, whether there is missing data. That's often a really useful sanity check, because one of the biggest failure points is that you think you've given it some data but actually your data is a mess and you didn't realize it, so it spits back any error messages it gets from the data. Then it starts producing the report, which again is dynamically updated. In this case (it's not very scalable yet, though we have plans for scaling it) the report is on a very classic data set, the iris data set; it does some clustering for you, finds some number of clusters, and then explains what's in those clusters with some visualizations, and so on.

All right, I can wrap up fairly soon. The last couple of things I wanted to say are that a lot of this takes compute, and it's sensible to be stingy with that compute; it's good to be rational about how you use your computation. The problem you're trying to solve there is to trade off statistical and computational efficiency, and we do this by treating the allocation of computational resources as a problem of sequential decision-making under uncertainty. A very simple case: if I'm exploring many possible models for a potentially large data set, it makes sense to take a subset of that data, run a few promising models on it for a little while, and then allocate more computational resources to the things that look more promising and fewer resources to the things that look less promising, where resources means more data and more compute. We automate all of that, and the end result is very nice; it's a very useful thing to be doing in the back end. When you're trying to assess whether something is promising, you can't just look at its current performance; you have to extrapolate. These are learning curves as you put more compute into something: this learning curve looks promising, because it's still shooting up, and if I give it just a bit more compute or more data it's going to do way better than this other method, which looks like it has already asymptoted.
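As a toy illustration of "extrapolate the learning curves, then allocate", the sketch below fits a crude power law to each candidate's partial learning curve (validation error against training chunks) and picks the candidate whose extrapolated curve looks best. This is only meant to convey the idea of allocating by predicted rather than current performance; the system described in the talk uses freeze-thaw Bayesian optimization, which models the curves probabilistically instead of with this point fit.

```python
import numpy as np

def predicted_final_error(steps, errors, horizon):
    """Fit error ~ b * steps^(-c) by least squares in log space and extrapolate to `horizon`."""
    c1, c0 = np.polyfit(np.log(steps), np.log(errors), deg=1)   # log err = c0 + c1 * log t
    return np.exp(c0 + c1 * np.log(horizon))

def most_promising(curves, horizon):
    """Pick the candidate whose extrapolated learning curve predicts the lowest final error."""
    preds = {name: predicted_final_error(np.arange(1, len(err) + 1), err, horizon)
             for name, err in curves.items()}
    return min(preds, key=preds.get), preds

# Two partial learning curves: B looks worse right now but is still improving quickly.
curves = {
    "A": np.array([0.40, 0.32, 0.30, 0.295, 0.293]),   # nearly asymptoted
    "B": np.array([0.70, 0.52, 0.43, 0.38, 0.345]),    # still dropping fast
}
best, preds = most_promising(curves, horizon=100)
print(best, preds)   # B is currently worse, but its extrapolated curve is more promising
```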
You have to make assumptions, and then you make decisions based on them. So we use an idea called freeze-thaw Bayesian optimization, we built on top of that, and we entered the result into the AutoML competition, where it did very well.

So, to conclude: the framework I've built all of this on is probabilistic modeling, which is a nice framework for automatically building systems that reason about uncertainty and learn from data. I feel we should be doing more and more automation, because it forces us to be more systematic, forces us to be more rational, and cleans up our thinking, and that's what I've been mostly focusing on. Thanks to the many people I've collaborated with over the years on this project, and there's a review paper I wrote about three years ago on probabilistic machine learning and AI which you might be interested in. Great. [Applause]
Info
Channel: The Alan Turing Institute
Views: 2,927
Rating: 5 out of 5
Keywords: data science, artificial intelligence, big data, machine learning, data ethics, computer science, turing, the alan turing institute
Id: aPDOZfu_Fyk
Length: 54min 55sec (3295 seconds)
Published: Fri May 25 2018