Bayesian or Frequentist, Which Are You? By Michael I. Jordan (Part 1 of 2)

Captions
All right, so I'm happy to be here, thanks. When I was invited I looked at the list of all the other speakers, and I noticed there were a lot of Bayesians on the list, so I thought I'd better give a frequentist counterpart. If there had been a lot of frequentists on the list, I would have given a Bayesian counterpart; I'm just sort of a contrarian at heart. But that's not entirely the goal of the talk. The goal of the talk, since this is a summer school, is to teach a little bit about the Bayesian-frequentist distinction, to emphasize the frequentist side a little more, because I think you've probably had more Bayesian material during the week, and to emphasize the essential unity of these two classes of ideas, which have been debated for a couple of hundred years now.

I wasn't really going to ask this question, but let me ask it anyway. You've been through two weeks of mainly Bayesian lecturers. How many of you think that you're a Bayesian? Raise your hand if you consider yourself a Bayesian. All right, I'd say about a third of the room. How many of you are frequentists? All right, about three of you; see, that's kind of weird. How many of you are both? Another third of the room. How many of you are neither? All right. I'll ask that same question maybe at the end of the lectures and see if things have shifted at all.

Okay. This topic that most of us are interested in, statistical inference, has been around for quite a long time, and there are two main perspectives that have stood the test of time: the Bayesian and the frequentist. In a couple of slides I'm going to give you a little argument, from a decision-theoretic point of view, for why these are really the only two real competitors. Essentially it comes down to the fact that loss functions have two arguments, so there are only two ways to go. And it's really important, throughout your whole career, to keep balancing these two things back and forth. Anybody who decided they were one or the other early in their life, and remained only that, is, I think, sort of missing the point. These are things to ponder and muse on and to understand the relationships between, and the relationships are deep.

So the Bayesian perspective is a conditional perspective: inferences should be made conditional on the current data. You just observed some data; hold it fixed and do everything conditional on it. Don't worry about other data you could have gotten. That's the conditional, Bayesian perspective. Now, I find myself often being a Bayesian in applied projects, when I'm working with a domain expert and we have a lot of time to work with each other. In lots of biology projects, for example, you'll have someone who knows a great deal, and you want to try to understand what they're thinking about the problem. What do they know? What's the prior? Also, what is the loss function? We often focus on the prior, but you really need to be concerned with the loss function as well: what do they care about? If you have time to elicit all of those things, it's often very appropriate to be a Bayesian. One way to think about the Bayesian perspective is that it's the optimist's. I'm approaching a problem as a statistician; I want to get knowledge, inferences, out of data. We have sophisticated tools, so let's be optimistic and get as much knowledge out as possible by working hard to get a good prior, a good loss function, and a good model, and by iterating. So that's the optimist.
The frequentist perspective is an unconditional perspective. You don't necessarily condition on the current data (you could), but the way you evaluate a procedure, from the frequentist point of view, is by considering unconditional averages: you should get good answers in repeated use. Repeated use means you look at multiple data sets and take averages unconditionally over them. You don't condition on one single data set; you look at multiple data sets and talk about unconditional performance over all those possible data sets. That's what it means to be a frequentist.

Now, I also think this is a very natural perspective for lots of situations, and I often find myself being a frequentist, in particular when I don't have a lot of time to sit down with a domain expert and we have a very quick project. I'm going to develop a very simple tool that aims at some inference directly, and hope that I can prove something about it, or that someone else has already proved something about it, so that I'm not going to go wrong by using it. In particular, if you're going to write a piece of software that will be used by many people with many data sets, you really ought to give a frequentist guarantee for it: you ought to be able to say that the piece of software will work well on multiple data sets. Inside, the software may be Bayesian; it may condition. But you want to be able to say that, whatever you condition on, across multiple different data sets you should do well. Now, someone has in some sense worked this out in advance: there are theorems that say Bayesian inference has good frequentist properties in various situations, so you don't have to worry about it that much in classical, parametric situations and so on; nonparametrics is a different story. But in general, if you're not going to start with a conditional Bayesian procedure, you'd like to use some other procedure; say I'm going to compute the median or something. It doesn't necessarily have a Bayesian justification, but I like it, it's a procedure, it's my software, and I want to prove that it has good frequentist properties, meaning that a large fraction of the time it will give the answer you expect, on all kinds of data sets. I find that a hard argument to find difficulties with: if you're writing software, you really should be a frequentist.

Now, the frequentist perspective is also that of a pessimist. Where the Bayesian is the optimist, the frequentist says: we're going to write down a model and develop a procedure, and almost certainly it will be a simplification of reality. Reality is really complicated; we're going to simplify it, often drastically, so we may get the wrong answer. Let's protect ourselves from getting the wrong answer too often. In medical domains a lot of people tend to be frequentists because they want to protect themselves from doing something wrong, something stupid. So it's the pessimist's perspective. Frequentism has definitely dominated statistics over the last hundred years. The Bayesian world is certainly present, but smaller. I think most statisticians were trained as frequentists and still tend to approach problems from a frequentist point of view, with that sort of pessimism ground in.
There are lots of good reasons for that: there are lots of situations where people have made bad inferences, and the pessimistic attitude is an important one to absorb.

Okay. I go back and forth between the machine learning world and the statistics world, and I'm often asked by statisticians: what is this thing called machine learning? I don't really believe it's a new field per se. I believe it's a contribution to the general problem of statistical inference and decision making, and it really is a set of themes; it's not one unified field but a loose confederation of themes. Reinforcement learning, clustering, classification, graphical models: what do they have to do with each other? Not that much, directly. They just happen to be things that people have found interesting and useful, and they usually connect to each other by some sort of statistical argument. So statisticians, when I talk to them about this, are a little relieved: machine learning people have not actually discovered a new field; it really is just statistical inference.

That said, it has a somewhat different flavor in some ways. One is that there's a lot of focus on prediction: if you can just make a good prediction, I don't care how you did it. That's often the machine learning spirit. And at the completely other end of things, there's work with no concern for prediction at all: just do exploratory data analysis, find cool features. That's also a typical machine learning thing to do; it's exploratory, you're trying to understand something about your data. Statisticians find both of those interesting too, and that's a reason machine learning is part of statistics. But a lot of statistics is about coverage. I want you to tell me something about your confidence in a prediction you've made. And if you've done some exploratory analysis and found some structure, what's the probability that it's real, and what's the probability that it's garbage? That goes under the terminology of coverage. It's typically a frequentist concept: a guarantee that if you run your procedure over and over again, you will do the right thing; that is, your error bars will cover the truth the right fraction of the time, or the things you've discovered will be noise only the right fraction of the time. So statisticians look at machine learning and say: well, okay, they haven't matured enough yet to understand that this is an important issue, to worry about error bars in a frequentist sense, a notion of coverage.

The other thing you'll hear about machine learning people is that there's a lot of focus on methodology: everyone develops a new model and then a new method to fit their model. That's the focus, not so much the classical inferential topics. And the evaluation is usually not theoretical but empirical. That's good, and statisticians, I think, appreciate it. And then there's a dollop of empirical process theory. It's often called statistical learning theory, but it's really empirical process theory, and when machine learning people turn to theory, they reach for this one big hammer and don't use all the other little hammers that are sitting around and available, mainly asymptotics. There's lots of nonparametric statistics in machine learning; in fact very few people do parametrics.
But, surprisingly to a statistician, there's hardly any asymptotics. Asymptotics is the main tool for a statistician, and here everyone is using these big empirical-process-type hammers instead; that's a bit of a surprise. The other somewhat surprising fact is that the field is sometimes frequentist and sometimes Bayesian, and it's not clear when one will come up and when the other will. It's a coexistence: you have a lot of people doing purely Bayesian work and a lot of people doing frequentist work, sometimes without even calling themselves frequentists, and there's no interplay between the two. There are just two parallel streams going forward. Occasionally someone will say, well, I can give a Bayesian interpretation of your procedure, here's a prior that matches it, or something like that, but there's very little real interplay in the usual statistical sense. Okay, you are mostly machine learning people, so this slide may not have meant that much to you, but I think it is important to understand how you're seen by the outside world.

Okay, so I promised a decision-theoretic perspective. Decision theory goes back to Wald and others in the 1940s, and although the number of papers with decision-theoretic content has dropped over the years (it was the thing to do in the '50s and '60s), it's still definitely present, and many people, including myself, view it as an extremely useful perspective to bring to bear in thinking about the fundamentals of statistical inference. The decision-theoretic perspective says: you have some data X; consider a family of probability models indexed by a parameter theta. I put "parameter" in quotes because it doesn't mean a finite-dimensional parameter; it just means an index on a class of probability distributions. If that class is infinite, as it often is, say a function space, then theta indexes all the functions in your function space, an uncountable number of things. People often pick up a decision theory book, see theta, and think: this is just parametric stuff, it isn't relevant to me. That's completely wrong. Theta can index a function class, or a measure space, or anything you want. So that's our family of probability models, with an index. Now, having gotten data, we define a procedure of some kind, maybe a decision tree, maybe a support vector machine, maybe a graphical model, that takes the data and produces some sort of decision: an estimate, a function, a decision of some sort. And given that decision, and given the underlying probability distribution indexed by theta, you define a loss function: how bad you're going to feel if the truth was one thing and you made the decision delta(X). I think almost any statistician would be happy at this point; this is all fine, this is what you want to do. You need to write down a loss function to evaluate how well a procedure is doing. And now we'll have multiple procedures, a delta one, which is your decision tree, and a delta two, which is my support vector machine, and so on, and we'd like to compare them and say which is best, either in this situation or, hopefully, in many situations. How can I do that? Well, the loss function gives me my measure of how well I'm doing.
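In symbols, this is the standard decision-theoretic setup the talk describes verbally (the particular symbols here are a conventional gloss, not taken from the slides):

    X \sim P_\theta, \quad \theta \in \Theta
        \qquad \text{(a family of models; $\Theta$ may index a function space or a measure space)}

    \delta(X)
        \qquad \text{(a procedure: decision tree, SVM, graphical model, \dots)}

    \ell\bigl(\theta, \delta(X)\bigr)
        \qquad \text{(the loss: how bad it is to decide $\delta(X)$ when $\theta$ is the truth)}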
But the problem is that the loss function isn't just a number, and I want a number so that I can compare two things: that procedure has loss 3.5 and that one has loss 4.2, so the first one is better. It's not just a number because there are two unknowns in it. X is random, and theta is unknown: I don't know the probability distribution underlying the data. So I've got two unknown things. How can I actually optimize over loss functions and choose the right delta? That's the core problem that decision theory aims to face. Well, there are two arguments to that function, so there are going to be two perspectives on how to get rid of the unknowns: you start with the X, or you start with the theta. Those two persuasions are called frequentist and Bayesian. That's my argument for why there are two of them: because there are only two arguments to loss functions.

All right, let's do the frequentist one first. The frequentist looks at the loss function and says: you've got two unknowns there; let's start with the X and try to turn that random X into a number somehow. Well, if it's random and we want to turn it into a number, we take its expectation. So let's take an expectation of this quantity, an expectation over the X part. An expectation with respect to what probability distribution? Let's use the same theta to take our expectation as appears in the loss function. I don't need to know the true theta; I just say, whatever the truth is, I'll take the expectation with respect to it, and I'll do this for all theta, so I look at all possible truths. So I get a risk function: the frequentist risk, which is a function of theta. I've picked which theta is, a priori, the truth; this E sub theta is the expectation under the distribution indexed by theta, and it's an expectation over X, so the X goes away and we just get theta on the left-hand side. And that's key. If you need a definition of "frequentist," this is probably as good as any. In fact, this is one message I want to give to machine learning people: a lot of people in machine learning are frequentists, or at least a lot of the work is frequentist, but you never see the word, and people ask what the definition of it even is. Well, here's one: it means you take an expectation, for fixed theta, with respect to X. In doing that you are being non-Bayesian, because you are taking an expectation over other X's, over the entire sample space: not the X that you saw, but other X's you might have seen. That's what it means to be a frequentist: you're looking at other possible data you could have gotten. It's the unconditional perspective, exactly. So whenever you write down an E in an equation, and it's an E averaging over X, you've veered away from the Bayesian route and gone toward the frequentist route. I hope that's clear.

Okay, so that's how the frequentist gets started. Now they have this function R of theta, and they say: well, I don't know what the truth is, it could be anything, and I still need to convert this thing into a single number. How? There are many possibilities, and there's been a huge literature, fifty years of work, on ways of turning this into a single number. The one you've probably heard the most about is minimax: you take the maximum of the frequentist risk over all theta, and then find the procedure that has the minimum of that maximum risk.
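In the notation above, the frequentist risk and the minimax principle just described are (again, a rendering of what is being said, not the slides):

    R(\theta, \delta) \;=\; \mathbb{E}_\theta\bigl[\ell(\theta, \delta(X))\bigr]
        \;=\; \int \ell(\theta, \delta(x))\, p(x \mid \theta)\, dx
        \qquad \text{(the $X$ is averaged out; a function of $\theta$ remains)}

    \delta^{\mathrm{minimax}} \;=\; \arg\min_\delta \, \max_{\theta \in \Theta} \, R(\theta, \delta)
        \qquad \text{(minimize the worst-case risk)}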
A lot of good work has come out of the minimax perspective. That's one way to get rid of the theta part: take the maximum over all theta and then the minimum over that. That's minimax. There are other ways to do it. You could take subclasses of procedures: delta(X) could be restricted to a subclass, maybe all unbiased estimators, or invariant estimators, and so on, and within that subclass R of theta might have a simple characterization, so that you can actually get out a single number over the subclass. There are many other approaches as well. Another thing you could try is to average over theta. But when you average over theta, you have to average over some distribution on theta, and at that point you've become a Bayesian: you're saying something about a distribution on theta. Frequentists are willing to do that in the mathematics, because it yields interesting mathematical results, but in practice they try to avoid it.

The Bayesian, on the other hand, welcomes the opportunity to integrate over theta, because they think it's fine to put a distribution on theta. They're going to do that anyway, and since they have one sitting around, they might as well average this quantity, not with respect to X but with respect to theta. So here the E is a different one: it's a conditional expectation given the data X (the conditional perspective), an expectation over the theta part of the loss function. The result is still a function of X, but that's less troublesome, because the X is conditioned on: it's known, it's fixed. So this is a single number, the posterior Bayesian risk. Thus we have the frequentist risk and the Bayesian risk. A Bayesian should be interested in this because, having defined it, they can optimize over delta and find the right procedure: should I report the posterior mean, or the posterior median, and so on? That depends on what loss function you have, and by optimizing over delta you can pick it out. If you haven't picked a loss function, you just report the whole posterior; but if you have a loss, then optimizing this equation tells you which procedure to use. You don't see enough of that kind of work in the machine learning literature: choosing the loss and working with this particular risk function. The frequentist risk, on the other hand, you see a lot, because that's what the statistical learning theory people, and other frequentists, work with.

So the frequentist goes to the left, the Bayesian goes to the right, and now you can ask what happens if you keep going. I could take the frequentist risk and act like a Bayesian, averaging it over theta; or I could take the Bayesian risk and act like a frequentist, averaging it over X. Neither camp would be very happy with you, but what would happen? You'd get a single number, because now both arguments have been averaged over. Would that number differ between the two branches? On this side, it's just the conditional expectation of a quantity that is random in theta; conditioned on x it's a constant: the expectation with respect to theta conditioned on x.
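In symbols: the posterior (conditional) risk, and the double average that both branches lead to, which, as comes up next, Fubini's theorem shows is the same number either way, namely the Bayes risk (notation again a gloss, with $\pi$ the prior):

    \rho(\delta \mid x) \;=\; \mathbb{E}\bigl[\ell(\theta, \delta(x)) \mid X = x\bigr]
        \qquad \text{(average over $\theta$ under the posterior; $x$ held fixed)}

    r(\delta) \;=\; \mathbb{E}_X\bigl[\rho(\delta \mid X)\bigr]
        \;=\; \mathbb{E}_\pi\bigl[R(\theta, \delta)\bigr]
        \;=\; \iint \ell(\theta, \delta(x))\, p(x \mid \theta)\, \pi(\theta)\, dx \, d\theta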
Over here, on the other branch, this is the usual frequentist notation: frequentists don't want to treat theta as a random variable, so it goes in as an index. It's just the expectation under that particular probability distribution; it's not a conditional expectation, it's unconditional. Anyway, you get the same number from the two calculations. [Audience: what theorem tells you that you get the same number?] It's called Fubini's theorem: just iterated expectations; you can switch the order of the two expectations. And what number do you get when you do those two expectations? It's called the Bayes risk. You may have heard of the Bayes risk; you might think you have to be a Bayesian to use it, and that's wrong, because the Bayes risk is arrived at by either path, the frequentist path or the Bayesian path. Okay, so that's a little bit of decision theory; I hope it was interesting.

Let's talk about the issue of coherence and calibration. To me this is a very helpful way to understand some of the relationships between Bayesian and frequentist ideas. Actually, before I do that: I said there were relationships, and if you go back to the previous slide, decision theory has really been the home of a lot of relationships between Bayesian and frequentist ideas. In particular, if you try to find optimal frequentist procedures, defined in various ways, there are results called complete class theorems which tell you that the optimal procedures are either Bayesian procedures or limits of Bayesian procedures. So from a frequentist point of view you often want to use Bayesian procedures, because you know they form the class of optimal procedures. Of course, the frequentist doesn't necessarily use that theorem in practice, because you don't know the prior with respect to which to be Bayesian, but mathematically it's a fact. That's one class of connections between Bayesian and frequentist ideas.

Okay: coherence and calibration. These two words are used a lot to describe perspectives on inference; David Draper has written a lot about them, and many other people have as well. They are two important goals for statistical inference. Coherence means something like: I give out consistent answers no matter what question you ask me, and you can't find any incoherence among my multiple answers. Calibration means something like: if I give you a number, that number means something; if you ask me to run a procedure many times, and I claim that 95 percent of the time I'll be about right, then 95 percent of the time I had better actually give out the right answer. Bayesian work has focused on coherence, while frequentist work hasn't been too worried about it; I think that's a fair statement. Bayesians get coherence essentially for free, because they have a joint probability distribution underlying everything, and that's the source of the coherence. And they love to bash frequentists, because they can find places where frequentist work is not coherent; lots and lots of papers have been written about that. The frequentists are not so worried. They are interested in a particular inference problem at one particular time, and in doing the best on that problem, finding the loss function that targets it. We can't be coherent all the time; that's just life. One might shrug one's shoulders and say:
"I'm sorry, I'm not coherent in the mornings." Well, "coherent" is maybe the wrong word; "consistent" is perhaps another word, though that has its own technical meaning. So: I'm not consistent either; I'll tell you one thing one day and something else six months later, and that's just life. Frequentist work, on the other hand, has tended to focus on calibration. Calibration is like the notion of coverage: the numerical values you attach to your procedure really do come out in practice. And Bayesians haven't been too worried about calibration. That's a bit of a problem with the Bayesian perspective. You write down a bunch of priors, you write out a procedure, you run it on your data, and that's it, you're done. What guarantees did you give me? Can you tell me that if you ran that procedure many times, your claimed guarantees would turn out to be true? Bayesians tend not to worry about that enough. Now, good Bayesians, most Bayesians in statistics, are actually a little bit frequentist too, and they will often do a bit of frequentist analysis of what they're doing, checking, for example, the coverage of their Bayesian procedure. But if you don't do that, if you're a pure Bayesian, then you certainly get coherence, but you never interrogate calibration. On the other hand, if you're a pure frequentist, worried only about calibration, you can be perfectly calibrated and completely useless. For instance: 95 percent of the time you give out error bars that cover everything, and 5 percent of the time you give out error bars that are a single point, so that on average it works out to a 95 percent confidence procedure; but on any given data set you've given a useless answer. So you can be completely calibrated and useless. And conversely, you can be completely coherent and completely wrong: it is perfectly coherent to give out the answer "one" to whatever question you ask me. So most statisticians find some kind of blend to be the natural way to proceed, because they want to achieve both coherence and calibration. In some sense I've now given you the answer to the question in my title: I think many statisticians, not every single one, but many, find that they are both a little bit Bayesian and a little bit frequentist. These things can be put into conflict; there are ways of focusing on calibration or on coherence and showing that one perspective fails to achieve it. But they really are complementary, and they aid each other. It's a little like wave-particle duality; that's one way to think about it. Waves and particles are both there, there's something right about both of them, but they don't quite fit together as well as they should. I think that's true of Bayesians and frequentists too: they're both right in some way, they'll both be around forever, one is not going to vanish, but they don't quite merge entirely. They do fight each other in various ways, in particular in testing and model selection problems, and I think there will eventually be more of a resolution there, but it's going to take a while.
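Here is a toy simulation of that "calibrated but useless" construction; the specific interval-reporting rule is just the hypothetical one described above, not anyone's real procedure:

    import random

    def useless_interval():
        """With probability 0.95 report the whole real line (always covers);
        with probability 0.05 report a single point (essentially never covers)."""
        if random.random() < 0.95:
            return (float("-inf"), float("inf"))
        return (0.0, 0.0)

    def covers(interval, truth):
        lo, hi = interval
        return lo <= truth <= hi

    random.seed(0)
    truth = 3.7  # the unknown quantity; any nonzero value makes the point interval miss
    n = 100_000
    hits = sum(covers(useless_interval(), truth) for _ in range(n))
    print(f"empirical coverage: {hits / n:.3f}")  # about 0.95, yet no interval is informative

The nominal coverage guarantee holds exactly, while no single reported interval ever tells you anything about the truth.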
All right, a few more comments, about the sociology really. The frequentist world is a hodgepodge: you can use any kind of technique, as long as you give me a frequentist analysis of it. That's frequentism, so it's a big, big field. The Bayesian world is a little smaller, and it really has two main subdivisions: subjective Bayes and objective Bayes. This is simplifying, but these are the two main schools. The subjective Bayesian believes that the prior comes from a person, or maybe a small group of people, and so the goal is to work with that person, a domain expert, and figure out what prior they have in their head, and also what loss function they have in their head. The model, too, has to come from the domain expert, so you've got to elicit the model as well. The subjective Bayesian argument is that if you got bad answers out of your Bayesian procedure, it's just that you didn't work hard enough at getting the prior, the loss function, and the model; you should have worked harder. Put that way, it's sort of hard to argue with: if I spent a million years and got exactly the right prior, I would get out the right inference. Once you have the prior, the loss, and the model, you're done; you use Bayes' rule, and there's not much else to talk about.

So what kind of research do you do as a subjective Bayesian? Well, you do a lot of the things you saw here over the last two weeks; I think a lot of the work you saw was effectively subjective Bayesian, even if those words were not used. What do you develop? Lots of new kinds of models. Why? Because as a subjective Bayesian facing a new problem, I'd like to have a library of models I can bring to bear on it. So a lot of people work on models, building up a big library. The other thing you can do: with Bayes' rule you have to integrate, you've got to get that denominator, so you'd better develop lots of procedures for integration, because it's going to be hard. So a lot of algorithmic work on integration goes into subjective Bayesian research. And then there's something you probably don't talk about as much: if you're going to be a real subjective Bayesian, you really ought to worry about how to get those priors. It's not easy, and you'd better work out techniques for eliciting and assessing priors from individuals. There's a whole literature on that, and a lot of Bayesian machine learning people don't focus on it nearly enough; if you're really going to be Bayesian, you'd better worry about how to do it. Anyway, those are some of the main areas of research. Notice that there's not a lot of focus on analysis, on whether my procedure worked and so on; that's really what frequentists do, because on this view, if you have the right inputs, the Bayesian output will be a good one. Again, you can't really argue with that philosophically; it's coherent, it's nice. But in practice there are lots of problems. The main one is that many of us work with really complicated models. There are hierarchies, there are multivariate quantities, there are matrices, and so on, and all of that brings new parameters into the problem. With a Wishart distribution, say, you've got a whole matrix of parameters sitting in there, and you've got to
put a distribution on that. All right, well, that's hard. The more complicated your model gets, the more parameters there are, and it's going to take a long, long time to get a domain expert to tell you their prior on that Wishart thing. Moreover, if you've got a long list of parameters, it's really the joint distribution of all the parameters that you should be assessing; that's what you're supposed to get, and that becomes truly hopeless. So what do you do? You start making independence assumptions. You throw them in, because if I say this is independent of that, the human domain expert can think separately about this and separately about that. But now you're leaving Bayes on the floor a little bit, because you're no longer assessing the right prior: purely for computational reasons you've started writing down a list of independence assumptions, and at that point you may have left optimality behind.

A more subtle issue, but just as important, is that it's really hard to get domain experts to assess tail behavior. You're always working with real-valued quantities; as you go up the hierarchy things become real numbers (you want the probability of some discrete thing, and that probability is a real number you have to put a prior on), and that prior has tail behavior that you have to worry about. I can get my mother to talk about the mean and the standard deviation of something, but she can't tell me whether it has Laplace tails or t tails or whatever. I don't think many of you could either; I don't think I could. It's really hard to assess those things, and it's also hard to learn them, quote-unquote. Does tail behavior matter? In some Bayesian models it doesn't matter that much, but in lots of Bayesian models it matters a lot; in some cases it determines the entire output of your procedure, and that is a really serious issue. You will often hear people talk about Bayes factors and marginal likelihoods. How do you solve model selection problems, being Bayesian? The knee-jerk answer: calculate the marginal likelihood. Well, the marginal likelihood is the integral of the likelihood under the prior. It is not the integral under the posterior, which tends to sharpen up so that tail behavior stops mattering; it's under the prior, so the tails are right there, and if you have very fat tails, the integral will be largely determined by them. That's in the prior; that's your assumption. So the marginal likelihood can be hugely affected by your particular assumptions, and Bayes factors similarly, since Bayes factors are ratios of marginal likelihoods. One way you might try to go is to use improper priors, trying to make them flat so that you're not putting in many assumptions. But, as I think you may know, if the marginal likelihood is an integral under the prior and the prior is improper, there's an arbitrary constant in it. Those constants divide out when you calculate things like posteriors, but in marginal likelihoods and Bayes factors they don't: you end up with a ratio of arbitrary constants, and the Bayes factor is meaningless in that case. So these are really serious issues, and there's a lot of statistical literature on them.
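In symbols, the objects just described, and the improper-prior problem (notation a gloss; $M_1, M_2$ are the two models being compared):

    m_i(x) \;=\; \int p(x \mid \theta_i, M_i)\, \pi_i(\theta_i)\, d\theta_i
        \qquad\qquad
    \mathrm{BF}_{12} \;=\; \frac{m_1(x)}{m_2(x)}

    \pi_i(\theta_i) \;=\; c_i\, h_i(\theta_i) \ \text{improper}
        \;\Longrightarrow\;
    \mathrm{BF}_{12} \ \text{scales with the arbitrary ratio } c_1 / c_2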
There are things like intrinsic Bayes factors and fractional Bayes factors and various other ways of approaching this, but if you don't at least think about these issues, the marginal likelihood is not the solution; it's not the hammer that solves the model selection problem. So tail behavior is a big issue.

Nonparametrics. When I wear a Bayesian hat as a researcher I'm very interested in nonparametrics, and have been for about a decade; I think it's great. But it's awkward for subjective Bayes, because nonparametric Bayesian models are really complicated and hard to think about: these stick-breaking constructions, infinite objects, and so on. What's my subjective prior on those things? A lot of subjective Bayesians, in fact, are not very happy with the nonparametric Bayes movement. That may eventually get worked out, but it is currently an issue. So those are some of the problems that arise, and I belabor them a little because I think there's a tendency to say that the Bayesian idea is so easy and systematic that no one could do anything else. Well, these are real issues that come up in real life.

The last point is more philosophical: a lot of frequentists don't like subjective Bayesians telling them they can't use a certain method. I like the support vector machine because it works. In lots of applied situations I roll it out, it works really well, it solves someone's problem, everyone's happy, I get paid, the company makes money. What's wrong with that? Well, it doesn't have a Bayesian interpretation, at least not an obvious one; you might work really hard and find one, but it doesn't seem to have one. Do I have to wait around for someone to show me it's Bayesian before I use it? No. I can use it because I can write it down, it works, and I can do theory showing it has a frequentist justification. And there are lots of simple nonparametric testing situations: I've got a column of numbers here and a column there, and I want to say whether the two columns are different. There are simple things where you sort them, see whether the entries of column one tend to sit higher in the list than those of column two, develop a statistic that measures that, and then prove it works under repeated use. That's a perfectly good approach to testing. And I'm not supposed to use it because it isn't Bayesian? That just doesn't feel right. So I hope I've bashed subjective Bayes enough that you'll be interested in some other things.

[Audience: but doesn't the posterior concentrate, so the tails stop mattering?] Yes, the posterior does sharpen up, but it's the rate at which things sharpen up that matters, and you've got to pick your tail behavior to have a certain rate to compete with that. You've effectively got two models, in the numerator and the denominator, and getting all those rates to line up is hard. Maybe I exaggerated a little; the posterior does sharpen up, you're right. But the point is important: a lot of people think the problems go away when you integrate against the posterior, which is true, but the whole point of the marginal likelihood is that you're integrating against the prior, so the tail behavior still has to
be taken into account. Okay: objective Bayes. I really like objective Bayes. There's a whole conference on it that I've gone to the last couple of years; it's called O'Bayes, and if you type that in you'll find the objective Bayes conference. This is a great perspective, and it really is a bridge between frequentist and Bayesian ideas. It tries to find ways of setting priors that aren't subjective. Maybe no human would have come up with them, but in some sense they're sensible: they protect you from making bad inferences, and in really complicated models they can give you a way of setting priors automatically, so you don't need a human to go looking at every parameter in your long list. There's been a lot of work on this. Probably the best existing class of techniques is called reference priors. That's a whole talk on its own, but what they do is set up a variational problem: they maximize some notion of divergence between the prior and the posterior, with respect to the prior. The distance between the prior and the posterior is, in some sense, the likelihood, so if you maximize that distance you're making the likelihood do most of the work and the prior do as little work as possible. That's a well-posed variational problem; you can solve it in many situations, and you get out a prior. This prior hasn't been obtained from a domain expert; it's been obtained from a piece of mathematics that tries to protect you from having an overly influential prior. In many situations you get improper priors out of this procedure, Jeffreys priors and so on, but in some situations you get proper priors as well. Anyway, this is an ongoing research project, and it's very interesting; I would have hoped there would be lectures on reference priors here.

And how does the objective Bayesian argue that the reference prior was a good idea? It sounds like a good idea, but how do you actually show that it is? Moreover, there are many other approaches to setting priors, so how do you choose between principles for choosing priors? If you're not being subjective, you use frequentist ideas: you show that your Bayesian procedure for choosing priors has good frequentist properties, and people generally agree that's not a bad way to get guidance. Consistency properties are sometimes used; admissibility, another frequentist idea, is widely used; all in the service of getting good principles for choosing priors. So I like this framework, and it's a great area to work in, but the downside is that it can be a challenge to work with in complex models. You have to do the mathematics to get out your prior, and often that's really hard to do; that too is an ongoing research project. For simple models this is often worked out and available off the shelf, but for a lot of the models many of you will be interested in, it's not off the shelf, and you'll have to do a lot of work to use objective Bayes ideas. But I want you to be aware that there is a counterpart to subjective Bayes.
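A compact statement of that variational problem, in symbols (a rendering, not the slides; the expected prior-to-posterior divergence is exactly a mutual information, which is how it resurfaces in the questions below):

    \pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_X\Bigl[\mathrm{KL}\bigl(\pi(\theta \mid X) \,\big\|\, \pi(\theta)\bigr)\Bigr]
        \;=\; \arg\max_{\pi}\; I(\Theta; X)

where the expectation over X is taken under the marginal $m(x) = \int p(x \mid \theta)\, \pi(\theta)\, d\theta$, i.e., over data sets you can imagine before seeing any.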
Okay, so lastly, the frequentist perspective. The frequentist perspective is very catholic: procedures can come from anywhere; they don't have to be derived from a probability model, and they don't have to be conditional. Nonparametric testing: just sensible test statistics, shown to work. The support vector machine and boosting are things that weren't derived from a probability model. While I'm at it, let me also mention methods based on first-order logic. Data comes in, you have a big first-order logic machine, and out comes some answer. Nothing wrong with that, and as a frequentist I'd want to sit down and ask: is that a good procedure? Just because it's logic doesn't mean it's necessarily good, but it might be good, and I can analyze it from a statistical point of view. That's what frequentism does: it asks whether the procedure, over repeated data sets, gives an answer that is good under some notion of loss. I often get into arguments with people in machine learning who say: well, there's statistical machine learning and then there's the rest of machine learning. And I ask: what is that other part of the partition, the logical sort of stuff? These are completely compatible perspectives: you can take your logical thing, and I can evaluate it statistically. And they say: okay, fine, but Bayesian, that's different from logical. I find these distinctions a little unhelpful at times. The frequentist perspective in particular is an analysis tool, and it can analyze all kinds of things. I think of machine learning, statistically viewed, as the inferential problem of taking in data and getting out knowledge, and the frequentist perspective is very much part of that.

Now, if your methods can come from anywhere, then I could write down, you know, Mike Jordan's silliest method ever, and you want to be able to rule that out. So what frequentists mostly do is develop techniques of analysis that rule out stupid methods and rank the reasonable ones. The focus tends to be more on analysis than on methods. But I did want to mention in passing one general frequentist method: the bootstrap. It's about as automatic as Bayesian procedures, it can be used on all kinds of problems, and it's a general methodology that is very frequentist. The bootstrap says: take your original data set and resample it multiple times. In doing so, you're looking at alternative data sets; you're being exactly a frequentist, from a methodological point of view now, not so much an analysis point of view. Of course, there is then analysis to show that the bootstrap procedure itself has good frequentist properties. It's a very interesting, broad class of techniques.
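Here is a minimal sketch of the basic nonparametric bootstrap, assuming we want error bars for, say, the median; the statistic and the numbers are purely illustrative:

    import random
    import statistics

    def bootstrap_ci(data, stat, n_boot=10_000, alpha=0.05):
        # Percentile bootstrap: resample the data with replacement,
        # recompute the statistic on each alternative data set, and
        # read off quantiles of the resulting distribution.
        reps = sorted(
            stat(random.choices(data, k=len(data)))  # resample with replacement
            for _ in range(n_boot)
        )
        lo = reps[int((alpha / 2) * n_boot)]
        hi = reps[int((1 - alpha / 2) * n_boot) - 1]
        return lo, hi

    random.seed(0)
    data = [random.gauss(10.0, 2.0) for _ in range(200)]  # a toy data set
    print(bootstrap_ci(data, statistics.median))  # roughly a 95% interval for the median

As the talk notes, the plain percentile bootstrap can fail to be consistent for some statistics, and corrected variants exist; this is only the simplest version.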
Okay, I think there's one more slide of introduction, and then I'll move on to more concrete stuff. So what do you do as a frequentist; what kinds of activities? You also write down models and develop procedures and all of that, but on the analysis side of the story there is a hierarchy of mathematical things you do. First, you might try to prove consistency: if there is a correct answer, you'll converge to it, no matter what the correct answer was. That's often fairly straightforward, and not that informative. A more informative thing is to get rates of convergence: two procedures may both be good, both consistent, but maybe one has a faster convergence rate in terms of the number of data points, and I might prefer that procedure. A lot of work is done on that. And then, harder but also very important, is to get sampling distributions: as the number of data points gets large, perhaps my statistic converges to some nice distribution, like a Cauchy or a normal, and I can use that distribution to give me error bars. So I get error bars by finding the sampling distribution. There is certainly work on consistency in the machine learning literature, there's some on rates, and there's very little on sampling distributions.

Classical frequentist statistics focused on parametric statistics in the '40s and '50s, but since then it has mainly been nonparametric. There's a lot of nonparametric testing, and tons of other nonparametric work: function estimation, and all these large-p, small-n problems where the number of parameters goes to infinity along with the number of data points, and so on. You'll hear people say that classical statistics is parametric, but that's just not right: the tools were developed to be general, and nonparametrics is perfectly part of the story. One of the most general tools is empirical process theory. Empirical process theory is about uniform convergence: consistency, rates, and convergence in distribution, uniformly over various spaces, function spaces, parameter spaces, measure spaces, and so forth. Statistical learning theory is really a part of it: a particular corner of empirical process theory that focuses on zero-one loss. But the tools there, Rademacher complexity and all that, were developed in empirical process theory. There are whole books on this, so if you're interested in theory, the tool is available. It's used to prove things about the bootstrap, about M-estimators, and so on. For a lot of frequentist analysis this is the big heavy hammer; there are lots of simpler tools too, but this one is always available.

Okay, I'm going to take a little pause, first to see if there are any questions, and then a stretching pause. The rest of my presentation today, and next time, will be little vignettes about research I've been involved in. It's all frequentist, and it should give you a flavor of what frequentist activity is really like: what the problems are, how to set them up, where the challenges lie, and how to overcome them. These will be machine learning methods, but analyzed from a frequentist point of view, carrying the analysis all the way through to the end. In the rest of today's talk I'll probably discuss experimental design, and the remaining vignettes will be for the next presentation. So, any questions on the philosophical material first?

[Audience asks for book recommendations.] One of my current favorite books on statistics in general is Aad van der Vaart's Asymptotic Statistics. Aad takes a catholic view: he has Bayesian and frequentist arguments throughout. It's probably more frequentist overall, but it has Bayesian theorems as well. And Jim Berger: if you haven't been introduced to Jim Berger yet, you should be. He has a great book on statistical decision theory; the first edition was frequentist and the second edition is Bayesian, and it's good to read both.
Anyway, in the second edition as well there's a lot of merging of frequentist and Bayesian ideas, and I think reading his book, and his papers in general, is a very good educational experience.

[Audience: is objective Bayes also partly subjective?] Well, sure, in the sense that I've written down a big complicated model, some of whose parameters I may be able to elicit subjectively, while there's a whole bunch of others, often called nuisance parameters, that I don't want to, or can't, elicit subjectively; for those I'll try to use objective Bayesian methods. Most Bayesians actually do this in real life. They'll sit down and say: this parameter I believe is in this range, for this or that reason; and this scale factor, I have no idea what it should be, so let's put a Jeffreys prior on it. Now, that's an ad hoc thing to do, and there's a lot of ad hockery in it, but you are in fact blending objective Bayes and subjective Bayes. So I think of objective Bayes as a big tent that incorporates subjective as well as objective ideas.

[Audience: are reference priors functions of the data?] No, the priors are not functions of the data. With reference priors, you're sitting there before you've seen any data at all, thinking about what prior you should use. You're free to envision data sets you could get; you haven't seen any data, but you're free to imagine possible data sets, and the divergence functional in the reference prior is an expectation over those possible data sets. So it has a somewhat frequentist mathematical character, but it's perfectly Bayesian: Bayesians have the freedom to dream about data sets, not just the one data set they saw. [Audience: but how do you get your posterior; don't you need the prior first?] That's the whole point. I wrote down a probability model at the beginning; everybody agrees you have to start there. I can imagine data sets under that probability model; that defines my probability measure, and I can take averages with respect to it, and that's what the reference prior construction does. You just need the likelihood. Given the likelihood, you do this divergence-maximization exercise and it gives you out a prior. Then you see a real data set, you combine your prior with the likelihood on the data you've observed, and you compute a posterior. It's perfectly Bayesian.

[Audience question about data collection.] Yes, it depends on that; it has an experimental design flavor, which is arguably a good thing: you want to think about how your data will be gathered. José Bernardo was the first person to talk about this in great detail, and he has a lot of papers on why experimental design should be taken into account in Bayesian arguments. The likelihood principle says it shouldn't be, and that is a somewhat misleading argument; if you want to read more, read some of José's papers.

[Audience: so, as with imagining potential data sets, you imagine the prior an expert might have and try marginalizing over it?]
In the objective framework, no: your data live in an X space; the data is an X, and we're trying to use it to get something out about theta. So imagining data sets in X only helps insofar as I get something out about theta. What you're doing is simply trying to find a prior that, under the data sets you're imagining, has, quote-unquote, as little impact on your posterior as possible. And it's pretty neat that you can just write that down: you write down the KL divergence between prior and posterior, and you average it over X. That's where the averaging comes in; you don't know the X yet, so you average over all the possible X's, you get the mutual information, you solve that variational problem, and out pops a prior. It's often a Jeffreys prior, and it has a lot of nice properties. I should go over to that side; there are a lot of questions here.

[Audience: with frequentist theory, isn't the only thing you can come up with bounds?] No, that's not true. A lot of statistical learning people use only bounds; I think that's more of a computer science perspective, that it has to be a bound. But most statisticians use asymptotics of all kinds: expansions that aren't bounds but are hopefully tight. And no analysis comes without grains of salt. You did an analysis and you got an answer, and there's a little bit of error in it; there's a Taylor third-order term you neglected, and so on. But that's what mathematics is about: getting an understanding, maybe not being exactly right, but getting a guide, and then exploring further. There's never an absolute "I always rank this procedure above that one"; there's always a little bit of art in understanding the setting and the consequences.

[Audience: shouldn't evaluation be about probabilistic predictions, not just point estimates?] That's exactly what frequentists are mainly focused on: coverage, not just the point estimate. Your loss function can, for example, be a log loss on a predictive distribution, and you can study the frequentist convergence of a procedure under that loss; it's commonly done.

Maybe one more question. [Audience asks about failures of the bootstrap.] There's all kinds of bootstrap literature. There's the simplest bootstrap, and there are issues that occasionally come up with it; more generally there's a family of techniques called resampling methods, and a lot of work on when they are valid. You can prove that there are situations where the simple bootstrap is not consistent, but there are corrections, better bootstraps, which are consistent. There's a whole literature; Joe Romano and others have written a very nice book on that whole technology.

All right, let's take a short break, I guess a two-minute break, so that I have time to get through the rest of my talk. Those were all great questions, and I was glad to have triggered them. I really wouldn't mind spending the rest of the time just talking about those things, but this is a summer school and I have material I want to cover; I'm around the rest of the day if people want to chat more, and I'm happy to talk with anyone about them. As I said, in the rest of my presentation today and tomorrow I'm going to go through some vignettes of particular problems, to give you a flavor of how you do frequentist analysis.
All right, let's take a short break — two minutes, say — so I have time to get through the rest of my talk. Those were all great questions, and I'm glad to have triggered them. I really wouldn't mind spending the rest of the time just talking about those things, but it's a summer school and I have material to cover — and I'm around the rest of the day if people want to chat more; I'm happy to talk to anyone about these topics. As I said, what I'm going to do in the rest of my presentation, today and tomorrow, is go through some vignettes of particular problems, to give you a little flavor of how you do a frequentist analysis.

Okay. This first vignette is about loss functions in classification, and about experimental design. The main paper this is based on came out last year, with my colleagues XuanLong Nguyen and Martin Wainwright, and there are two backup papers that also play some role in the development of these ideas. It has to do with things like boosting, the support vector machine, and similar classification algorithms. These all came out separately, and there was then a realization that there's a lot of unity in the ideas, and frequentist analyses came out — for example, to show that they are consistent. Boosting was eventually shown to be consistent (it was not clear at the beginning), and the support vector machine, and so on.

We're going to face a harder class of problems: not just classification, but experimental design simultaneously with classification. And now we ask: are things like the boosting loss, the SVM loss, and so on still consistent, even on this harder problem? I wouldn't know how to answer that question without doing this analysis — and having done the analysis, you learn that some classes of these loss functions remain consistent and some don't. That's a bit of a surprise; I'll leave you to guess which do and which don't once you understand the setup of the problem.

The way we originally got involved was a practical problem. Some people at the Intel lab in Berkeley had one of the early sensor networks, and they asked us to solve a classification problem. They had a bunch of sensors on a grid up on the ceiling, and a little robot moving around with a light source on it, and they wanted to know whether the robot had gone into a particular region of the room or not — some green region, which could be convex or not. These are highly noisy sensors, so it was a hard classification problem to solve. But the more interesting part was that these sensors have little batteries, and if they transmit data all the time the batteries run out immediately. So they can transmit only a little data — one or two bits per time slice. You can't transmit the real-valued voltage you're sensing from the light; you need to transmit a quantized version of it. The question then becomes: what is the optimal way to quantize, given that my problem is classification? If it were just data compression, I'd know how to quantize — but how do you quantize for the purposes of classification?

Here's the abstraction of the problem. There's a bunch of observables X1 through XS. These are often real-valued quantities — though since they've gone through an analog-to-digital converter they're technically already quantized, with a cardinality M that is really, really large. We're going to quantize them with quantizers Q1 through QS, and these are distributed — they sit at different spatial locations — so each quantizer gets to see only its own X; it's a local calculation. Quantizer Q1 spits out Z1, the quantized version of X1, and so on, and the cardinality of each Z is much smaller than M. The Z's are then transmitted over the radio to a central site, which fits a discriminant function to the Z values and tries to predict: yes or no, the robot is in the green region. And depending on which hypothesis is true, you get different distributions on your light sensors. That problem is called decentralized detection, and it existed in the literature before we got involved with it — it was a hot topic in electrical engineering and signal processing in the 80s.
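A minimal simulation of that pipeline, with invented numbers: S noisy sensors each see a reading whose mean depends on the label Y, each transmits one quantized bit Z, and a central site classifies from the bits. The thresholds and fusion rule here are naive placeholders, not the optimized design the vignette is about.

```python
# Decentralized detection toy: per-sensor 1-bit quantizers + central fusion.
import numpy as np

rng = np.random.default_rng(1)
S, n = 10, 5000
means = np.where(np.arange(S) < S // 2, 1.0, 0.2)   # per-sensor signal when Y=1

Y = rng.integers(0, 2, size=n)                      # true labels
X = rng.normal(loc=Y[:, None] * means[None, :], scale=1.0)  # noisy readings

thresholds = means / 2.0                            # naive 1-bit quantizers
Z = (X > thresholds[None, :]).astype(int)           # bits sent to the center

score = Z.mean(axis=1)                              # naive fusion: average bit
Yhat = (score > 0.5).astype(int)
print("error rate:", np.mean(Yhat != Y))
```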
So the general setup: we have (X, Y) pairs — let's assume they're iid, for simplicity — and the Y's are in {0, 1}. We have a quantizer Q that takes the original covariate vector X and turns it into a quantized version Z, lying in some quantization space. Q itself lies in a space of random mappings; for our analysis it needs to be a set of random mappings, though in practice we would often implement Q as a deterministic mapping.

Now we're going to do a statistical analysis of this object — and what kind of object is it? It's an experimental design. The everyday notion of an experimental design is something like an analysis-of-variance table: a person comes in and gets assigned to cell 3 or cell 4 — that's experimental design. But the broad mathematical problem is really just a mapping, possibly random, from a space X to a space Z. In analysis of variance it's a randomized experiment: a subject comes in and receives a random assignment; the Z space is the cell of the ANOVA table — a discrete variable — and X is the description of the person who gets put into one of the cells. That's just one example of the mapping Q; there are many, many others. Stated in full generality, you just allow Q to lie in some space, and that space can be characterized in various ways. I'll use the language of experimental design; quantization is a special case, analysis of variance is a special case, and if you prefer one of those special cases, think in those terms.

That's half the problem — the experimental design. The other half is the discriminant function, i.e., the classifier. We have a family of classifiers lying in some family Γ, and it's probably going to be a large family — a nonparametric family. So in our decision-theoretic framework, the decision now has two parts: choose the quantizer Q and choose the discriminant function γ; the output is the tuple (Q, γ). And what's our loss function? The risk is the probability of making an error: the probability that the discriminant function applied to the quantized data is not equal to the correct label. That's a 0-1 loss; taking its expectation gives the probability that they're not equal. So this is the risk as a function of Q and γ — it's different notation, but it sits inside the decision-theoretic framework we talked about earlier. And there are many applications of this.
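As a compact restatement of the setup just described (the notation follows the talk's decision-theoretic framing; the conditional form of Q is my gloss):

```latex
% The decentralized decision problem: choose a (possibly random) quantizer
% Q : X -> Z and a discriminant gamma in Gamma, to minimize the 0-1 risk
R(Q, \gamma) \;=\; \mathbb{P}\bigl(\gamma(Z) \neq Y\bigr)
            \;=\; \mathbb{E}\bigl[\,\mathbf{1}\{\gamma(Z) \neq Y\}\,\bigr],
\qquad Z \sim Q(\cdot \mid X).
```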
Now, if you look at the existing literature, there's help on each of the two sides of this problem, but not on both simultaneously. The classical signal-processing literature defined the decentralized detection problem, and it assumed that everything is known except the quantizer. "Everything" means all the probability distributions: the class-conditional probability distributions, the class prior probabilities, and so on. All that's unknown is Q. So how do you find Q? Here's the little drawing I'd do if the light over here were cooperating: in my original space X, class one and class zero overlap, and it's hard to fit a discriminant boundary between them. So I might want a map Q over to a space Z that pushes them as far apart as possible — that would be a good choice of Q; a bad choice of Q would smush them together even further. So all I have to do is measure, in some way, the divergence between probability distributions, and then optimize Q with respect to that divergence.

Which divergence should you use? You're trying to maximize divergence here — often we solve minimum-divergence problems; this is a maximum-divergence problem. So these researchers asked: what are some divergences you can maximize? They wrote down lots of functionals on probability distributions, found that some of them were easy to maximize and some were not, and picked the easy ones. Hellinger, Bhattacharyya, and a whole bunch of others came out of that literature and then became famous in other fields; they were set up because of this divergence-maximization problem. Lots of radar has been designed this way: you pick a divergence — Hellinger, Chernoff, or something — which is a function of the probability distributions; you assume those distributions are known, write down the expected divergence, maximize it with respect to Q, and hand that Q back to the user, who puts it into the radar, and the radar quantizes that way. It's called signal selection. I would view this as basically a heuristic literature — heuristic in the sense that it's a plug-in approach that doesn't then worry about how well it performs: you put the Q in, and you don't try to evaluate how well it does.

The machine-learning literature, on the other hand, has focused on problems where the whole game is to find the discriminant function, and not to worry at all about the experimental design — you assume that part is known, and you try to find γ. The way it's done is by defining a surrogate loss function: boosting, logistic regression, and support vector machines are all based on surrogate loss functions. This is more rigorous — there's a decision-theoretic flavor, there are consistency results, and so on — but it hasn't faced the whole problem, which is to find the Q and the γ simultaneously.
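Here is the classical recipe above as a minimal sketch, with illustrative choices throughout: the class-conditionals are assumed known (Gaussian), the candidate quantizers are 1-bit thresholds, and the divergence being maximized is Hellinger.

```python
# Signal-processing-style quantizer design: score each candidate quantizer
# by the divergence it induces between the two quantized class-conditionals,
# and keep the maximizer.
import numpy as np
from scipy.stats import norm

def quantized_dists(t):
    """P(Z | Y) for the threshold quantizer Z = 1{X > t}."""
    p0 = np.array([norm.cdf(t, 0, 1), 1 - norm.cdf(t, 0, 1)])  # X|Y=0 ~ N(0,1)
    p1 = np.array([norm.cdf(t, 1, 1), 1 - norm.cdf(t, 1, 1)])  # X|Y=1 ~ N(1,1)
    return p0, p1

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

ts = np.linspace(-2, 3, 501)
scores = [hellinger(*quantized_dists(t)) for t in ts]
print("divergence-maximizing threshold:", ts[int(np.argmax(scores))])  # ~0.5
```

Note the recipe never evaluates classification performance: the divergence stands in for the error probability, which is exactly the heuristic step criticized above.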
So let's build up a little more machinery and talk about these f-divergences — the quantities discussed in the signal-processing literature, which have come up in many other literatures as well. Part of this talk is going to be about unifying these things: it's not just a list to keep in your mind; there are relationships here.

Let's work with discrete random variables just for simplicity — sums instead of integrals — though you can do all of this with continuous variables as well. You define the f-divergence between two measures μ and π as f of the likelihood ratio, averaged with respect to π. If f is chosen to be u log u, you get the KL divergence. If f is chosen to be |u − 1|, you get the variational distance, which is just the L1 distance between measures. f can be any continuous convex function — these are particular examples; plug in another choice and you get the Hellinger distance, and this goes on for several pages. Any continuous convex f defines a new Ali–Silvey distance, or f-divergence.

So why did these researchers use f-divergences? They were somehow intuitively appealing, but there was also a little bit of underlying theory behind the choice. It's not entirely satisfactory — but it's a good starting place. The theorem is due to David Blackwell, in 1951 — a classical paper, well worth reading, which had a big impact on economics. His theorem states the following: if a procedure A (some kind of estimator) has a smaller f-divergence than a procedure B, for some particular choice of f, then there exists some set of prior probabilities — the class probabilities, that you're in one class or the other — such that procedure A has a smaller probability of error than procedure B. And the probability of error is what you care about; that's the risk. So we've just learned that procedure A has a smaller risk than procedure B, and we were told that by looking at f-divergences: the f-divergence gives us information about risk.

But this is just an existence statement. It says there exists some set of priors for which the f-divergence gives us this ranking — we don't know what those priors are, and we don't know, in our particular problem, which f-divergence to use. So it's not at all helpful in practice. But it does at least suggest that f-divergences are not unreasonable objects to look at if you're trying to minimize the probability of error. And that's a good thing, because minimizing the probability of error is of course a non-convex problem — the risk under 0-1 loss is non-convex — so you look for some other functional you can optimize, and by this theorem you might try these. That's what people did — there's a famous paper by Csiszár in 1967, and so on — choosing particular divergences and, in some sense, hoping that the priors were right for that f-divergence on the particular problem.

Now, there are some supporting asymptotic arguments: these divergences also arise in other ways. In fact, the original Kullback–Leibler divergence arose from an analysis of hypothesis testing — it characterizes the asymptotics of the power function in hypothesis testing, where your two classes stay a fixed distance apart as the number of data points gets large.
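The definition above, in code form — a minimal sketch with two arbitrary example distributions; note that conventions for the Hellinger distance differ by constant factors.

```python
# D_f(mu, pi) = sum_z pi(z) f(mu(z)/pi(z)) for a few classical choices of f.
import numpy as np

def f_divergence(mu, pi, f):
    u = mu / pi                          # likelihood ratio at each z
    return float(np.sum(pi * f(u)))      # average f(ratio) under pi

mu = np.array([0.5, 0.3, 0.2])
pi = np.array([0.2, 0.3, 0.5])

kl  = f_divergence(mu, pi, lambda u: u * np.log(u))          # f(u) = u log u
var = f_divergence(mu, pi, lambda u: np.abs(u - 1))          # f(u) = |u - 1|
hel = f_divergence(mu, pi, lambda u: (np.sqrt(u) - 1) ** 2)  # squared Hellinger

print(kl, var, hel)
print("check L1:", np.sum(np.abs(mu - pi)))   # equals the variational value
```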
Similarly, the Chernoff distance arises when you do that same analysis in the Bayesian setting, where you have priors on your classes. So these divergences were, in some ways, talking about the probability of error directly — but via an asymptotic argument in hypothesis testing. So it's still a heuristic literature.

All right, now let's turn to the other side of the coin: how to choose the discriminant function. You probably know this material — it's machine learning 101. You choose a loss function that measures the discrepancy between your class label and your discriminant. We'll start with the 0-1 loss; it's the real loss we're trying to optimize. In the binary case you can write it as the indicator that the labels disagree. I'm now using a Y that is +1 or −1, and the discriminant function outputs a real number in general; if they disagree in sign, that's bad, and you pay a loss of 1 in that case — otherwise you pay a loss of zero. So the main focus is on the discriminant function, and, as with the earlier argument, it's intractable to minimize the 0-1 loss directly.

So instead, what people have done is pick surrogate loss functions, which are convex upper bounds on the 0-1 loss. Hopefully you've all seen this picture: here is the 0-1 loss expressed in terms of the margin value, i.e., the product of y and γ(z). If you disagree, you're on the left side and you pay a loss of 1; otherwise you pay a loss of 0. It's intractable to optimize this, so people look instead at these upper bounds: the blue curve is the support vector machine's hinge loss; the red one is the logistic loss; the green one, I think, is the boosting loss — the exponential loss, which is what gives you boosting — and there's a whole bunch of others; this page could be littered with examples. All of those procedures aren't that different from an optimization point of view: you optimize over one of these particular curves, and then, hopefully, you try to prove something about the result.

What can you prove? Let's set up a little theory. We're doing optimization here; in statistics this is called M-estimation. In M-estimation you write down a function — often called a contrast, rather than a loss, to distinguish the thing you optimize (the procedure) from the thing you analyze — and you optimize it. In machine learning the same idea is called empirical risk minimization. So: we have iid training data, and we write down the M-estimation functional, which takes our contrast function — our loss, say the exponential loss or the hinge loss — and sums it up over the data set. That's the functional we'd like to optimize: the empirical φ-risk, if you will.
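The surrogates named above, as functions of the margin m = y·γ(z) with y in {−1, +1}, plus the empirical φ-risk they define; the data below are placeholders.

```python
# Surrogate losses and the M-estimation / empirical-risk functional.
import numpy as np

zero_one    = lambda m: (m <= 0).astype(float)    # the loss we actually care about
hinge       = lambda m: np.maximum(0.0, 1.0 - m)  # SVM
logistic    = lambda m: np.log1p(np.exp(-m))      # logistic regression
exponential = lambda m: np.exp(-m)                # boosting

def empirical_phi_risk(phi, y, scores):
    """(1/n) sum_i phi(y_i * gamma(z_i)) -- the functional to be minimized."""
    return float(np.mean(phi(y * scores)))

y = np.array([1, -1, 1, 1, -1], dtype=float)
scores = np.array([0.8, -0.3, -0.1, 2.0, 0.4])    # some discriminant's outputs
for name, phi in [("0-1", zero_one), ("hinge", hinge),
                  ("logistic", logistic), ("exp", exponential)]:
    print(name, empirical_phi_risk(phi, y, scores))
```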
Okay, so here's some theory for this object. This is a paper my colleagues and I worked on that gives necessary and sufficient conditions — there had been work on sufficiency; this gives a full treatment of these surrogate loss functions, with both necessary and sufficient conditions for consistency. We're trying to answer: if you use these surrogate losses, do you get the same answer in the end as if you had optimized the 0-1 loss? That would be a satisfying story — and all of those procedures do achieve it; here's the theory that tells you that's the case.

First of all, not any arbitrary φ can be used — it has to satisfy some properties. In particular, we use a very weak condition called classification calibration, which is essentially a form of Fisher consistency. There's an equation that defines it — let me not spend a lot of time on it — but basically it says that if you disagree with the right answer, you pay a strictly bigger loss than if you don't disagree with the right answer. It says the curve tilts up to the left: on the left side, where you're making an error, you have a bigger loss than on the right side, where you're not. This turns out to be necessary and sufficient for what's called Bayes consistency — not a Bayesian notion; as I hope you remember, I talked about the Bayes risk earlier in the lecture — it's consistency in the sense of the 0-1 loss. So we now define a surrogate loss function to be a φ that is classification calibrated; that's the definition, it satisfies this property.

And you can mostly forget the general definition, because in the convex case — when φ is a convex function — φ is classification calibrated if and only if it is differentiable at zero with a negative derivative there. All those curves tilt down through the origin with a strictly negative derivative, and that's all you need for classification calibration — only what happens around the origin matters. So that's the setup for the rest of this talk; that's what we mean by a surrogate loss function.

And it turns out that these surrogate loss functions, defined in this machine-learning literature, have a very nice relationship to f-divergences — surprising, but true. There is a constructive, many-to-one correspondence between surrogate loss functions and f-divergences; you can go back and forth in the two directions. Having done that, there are now two spaces you can work in: you can work with surrogate loss functions, or you can instead work in the space of divergences, or vice versa. And by doing that, we're able to define a notion of equivalence among loss functions: two loss functions will be equivalent if they map, roughly, into the same f-divergence — not quite; that's not exactly right; there's a range of f-divergences involved — but we will define a precise notion of equivalence. With that notion in hand, it becomes extremely easy to prove things like "this procedure is consistent and this one is not." It's a nice characterization: the space of loss functions has structure — it's not just a list of loss functions — and we have a theory that explicates that structure.
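A quick numeric illustration of the convex-case criterion just stated — a convex φ is classification calibrated iff it is differentiable at 0 with φ′(0) < 0 — checked by finite differences for the three surrogates (all of which are differentiable at 0):

```python
# Check the sign of phi'(0) for each convex surrogate.
import numpy as np

surrogates = {
    "hinge":       lambda m: np.maximum(0.0, 1.0 - m),  # kink is at m=1, not 0
    "logistic":    lambda m: np.log1p(np.exp(-m)),
    "exponential": lambda m: np.exp(-m),
}
h = 1e-6
for name, phi in surrogates.items():
    slope = (phi(h) - phi(-h)) / (2 * h)   # ~ phi'(0)
    print(f"{name}: phi'(0) ~ {slope:.3f} -> calibrated: {slope < 0}")
```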
So — I've got about ten minutes, right? I'll take five; I started a little late. Let me figure out how far to go. Okay, let's set up a bit of notation — dull but necessary. Remember, the risk function is the frequentist expectation of the loss; the data is the tuple (Y, Z); and the parameter, which was θ back in the original slides, has now become a tuple as well: it's both the discriminant γ and the quantizer Q. For simplicity it's nice to work with conditional distributions of Z given Y — unnormalized conditionals, i.e., the joints. So μ(z) and π(z) are the class-conditional densities, unnormalized: p and q = 1 − p are the class priors, and we integrate out X, the unobserved covariate, so these are functions of z alone. We take each class-conditional distribution, push it through the quantizer, and get a measure on Z.

Using this notation we can represent the φ-risk. The expectation is over the two values Y can take: doing the expectation over Y gives a μ term and a π term — a φ(−γ(z)) from y = −1 and a φ(γ(z)) from y = +1; add those up, then do the expectation over Z, where μ and π are the measures. That's just a representation of the φ-risk, but it already has a bit of a convex flavor: φ tilts in one direction, φ(−·) tilts in the other, and there's a convex-combination-like structure — roughly speaking, some convexity properties are already emerging.

Now, as a frequentist, you're free to do something called profiling. I have a function of two arguments, and I want to get rid of one so I can optimize over the other. If you're a Bayesian, the thing you know how to do is integrate out one of the arguments — but I don't know how to integrate out a discriminant function. I can, however, optimize it out. That's called profiling in statistics: you optimize out one argument so you can focus on the other — the profile likelihood, for example. So let's profile this risk function by optimizing out the discriminant function; we get a function of Q alone, and we can then use that function for choosing Q.

Let's do that for some examples. If we choose the 0-1 loss, you can do this optimization easily, and the answer is that the optimal discriminant is the sign of the difference of the two measures — which makes sense: under 0-1 loss, at each z you pick whichever class has the bigger measure. Plug that back in — it's a short calculation; if you've done some classification you'll know how to do it, and if not, grab me with a piece of paper afterwards, it's really easy — and you get this: you pay the loss of the smaller of the two classes, the minimum of μ(z) and π(z), summed over z, which makes intuitive sense. And since the absolute difference and the minimum are related by a simple identity, this quantity is, up to constants, 1 minus the variational distance — essentially by definition. So it turns out this profiled risk is the negative of a divergence. That's kind of interesting: if you used the variational divergence to pick Q, what you were effectively doing was working with the profiled 0-1 loss.
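In symbols, the calculation just described (the factor-of-2 conventions for the variational distance differ across references; the talk's "1 minus the variational distance" corresponds to one such normalization):

```latex
% Profiling out the discriminant under 0-1 loss: the optimal gamma picks the
% larger measure at each z, so
\min_{\gamma}\, R(Q,\gamma)
  \;=\; \sum_{z} \min\{\mu(z), \pi(z)\}
  \;=\; \tfrac{1}{2}\Bigl(1 \;-\; \underbrace{\textstyle\sum_{z}\,
        \bigl|\mu(z)-\pi(z)\bigr|}_{V(\mu,\pi)}\Bigr),
% using min(a,b) = (a + b - |a-b|)/2 and sum_z (mu(z) + pi(z)) = 1.
```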
Having done that calculation, we wondered whether the same thing holds for losses other than the 0-1 loss — and it turned out it does. It's a really fun exercise to do this for all kinds of losses. If you start with the hinge loss and profile out the discriminant function, you get 1 minus the variational distance. So two different losses map into the same f-divergence — interesting: we thought the correspondence might be one-to-one, and this immediately proves it is not. What if you start with the exponential loss, the boosting loss? It turns out you get the Hellinger distance — a nice exercise for you to do; it's somewhat surprising that the exponential function turns into a square root. If you start with the logistic loss, you get out something that looks like the KL divergence symmetrized — it's called the capacitory discrimination. And so on and so forth: all the losses we could write down turned into f-divergences. So we wondered: is there a general theory behind this? It turns out there is. (Explaining how would take a little diversion here — let me come back to that a bit later.)

So it turns out there is a general relationship: the class of surrogate loss functions maps onto the class of f-divergences. For every surrogate loss function there is a corresponding f-divergence, and for every f-divergence there is a class of loss functions; these classes partition and exhaust the space of surrogate loss functions. So it's a complete characterization of the space of loss functions in terms of f-divergences.

Let me give you a little flavor of how that's proved — it's a theorem in the Annals paper I mentioned. The key tool underlying it is my favorite tool of convex analysis: conjugate duality, which unifies lots and lots of things. Just to remind you what conjugate duality is: if f is a lower semicontinuous convex function, its conjugate dual is defined as the supremum of a linear functional minus the original function — and that is necessarily a convex function. The star means conjugate dual. For technical reasons we work with f*(−β); this ψ(β) is the thing to remember — it's the conjugate dual function, up to this flip of the sign.

On the next slide I have the theorem; let me take a minute to explain it. This is the theorem showing the relationship between divergences and losses. One direction is pretty straightforward: for any margin-based surrogate loss function φ, there is an f-divergence such that when you profile out the discriminant function, you get the negative of an f-divergence, for some convex f. Moreover, going in that direction, it turns out that for φ that is continuous you get some nice properties of the conjugate dual ψ — these are somewhat technical: it's decreasing and convex, it has a fixed-point-like property, and a kind of cascade property. Let's not worry about the technical details; the point is that even under this weak condition on φ — continuity, which all the ones used in practice satisfy — you get these convexity-type properties of the conjugate dual. Now, the other direction is really the interesting one.
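Before the converse direction, here is a numeric check of the correspondences just listed. We profile out the discriminant, R_φ(Q) = Σ_z inf_α [μ(z)φ(α) + π(z)φ(−α)], by one-dimensional optimization, and compare with the claimed divergences; the unnormalized measures are arbitrary examples with Σ(μ + π) = 1.

```python
# Profiled surrogate risks vs. the divergences they are supposed to induce:
# hinge -> 1 - V(mu, pi), exponential -> 1 - Hellinger^2(mu, pi).
import numpy as np
from scipy.optimize import minimize_scalar

mu = np.array([0.20, 0.15, 0.05, 0.10])
pi = np.array([0.05, 0.10, 0.25, 0.10])   # sum(mu) + sum(pi) == 1

def profiled_risk(phi):
    total = 0.0
    for m, p in zip(mu, pi):
        res = minimize_scalar(lambda a: m * phi(a) + p * phi(-a),
                              bounds=(-30, 30), method="bounded")
        total += res.fun
    return total

hinge = lambda a: max(0.0, 1.0 - a)
expo  = lambda a: np.exp(-a)

V = np.sum(np.abs(mu - pi))                        # variational (L1) distance
hellinger_sq = np.sum((np.sqrt(mu) - np.sqrt(pi)) ** 2)

print(profiled_risk(hinge), 1.0 - V)               # both ~0.600
print(profiled_risk(expo), 1.0 - hellinger_sq)     # both ~0.869
```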
It says: if f is a lower semicontinuous convex function satisfying those conditions when you take its conjugate dual, then there exists a loss function that induces that f-divergence. This goes in the backwards direction and shows that f-divergences also characterize loss functions. The forward direction is actually trivial to prove — it's one page — while the backwards direction requires a lot of convex analysis and is definitely non-trivial.

Here's the easy direction — it's simple enough that I can take you through it. Here's the φ-risk; remember, I wrote it down a little while ago. Let's profile it: optimize out the discriminant function. When you optimize over γ, you can do it separately for each z, so you move the infimum inside and replace γ(z) with just a number α, giving an inner expression at each z. Now pull out π(z) — just divide and multiply by π(z) — and you're left with an inner object that is a function of the likelihood ratio. Moreover, its negative is a convex function: the inner expression is an infimum over a family of linear functions of the ratio, which gives a concave function, and flipping the sign gives a convex function. So we've identified the f — it's just that function — and the whole profiled expression is the π-average of f applied to the likelihood ratio: an f-divergence. All the little examples we did are just instances of this. So that direction really is very easy.

The other direction is hard, but it has a constructive consequence: going through the proof, at some point you identify the loss function, and you can write it down. It has a certain form: for α = 0 it's the fixed point; for α > 0 it's ψ applied to a free function G — there's a degree of freedom here; you can choose any G you want that is increasing, continuous, and convex (I'll give some examples on the next slide); and for α < 0 you just get the G function itself. So the fact that ψ has this form comes out of the proof. This gives you the structure of the possible loss functions, and it also points out where the freedom sits: in this G function.

Now you can do this for some examples. Start with the Hellinger distance — remember, that's the f-divergence whose f is this square-root function. Take the conjugate dual — a little exercise, just a piece of calculus — take the negative of that, and that's the ψ function; I've calculated it. Now choose a bunch of G's and plug them into ψ, and out come loss functions. If I use G(u) = e^u − 1, a particular choice, I get the red curve: the exponential loss. So we've now gone in the opposite direction: we started with the Hellinger distance and recovered the boosting loss. But I could also choose other G's — G(u) = u² or G(u) = u, both continuous and convex — and I get the other loss curves on the slide, which are equivalent to the boosting loss in that they map into the same divergence.
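The one-page direction walked through above, in symbols (notation as in the earlier slides):

```latex
% Profile out gamma pointwise, then pull out pi(z):
\inf_{\gamma}\, R_{\phi}(Q,\gamma)
  \;=\; \sum_{z} \inf_{\alpha \in \mathbb{R}}
        \bigl[\, \mu(z)\,\phi(\alpha) + \pi(z)\,\phi(-\alpha) \,\bigr]
  \;=\; \sum_{z} \pi(z)\, \inf_{\alpha}
        \Bigl[\, \tfrac{\mu(z)}{\pi(z)}\,\phi(\alpha) + \phi(-\alpha) \Bigr]
  \;=\; -\sum_{z} \pi(z)\, f\!\Bigl(\tfrac{\mu(z)}{\pi(z)}\Bigr),
% where f(u) := - inf_alpha [ u phi(alpha) + phi(-alpha) ] is convex,
% being the negative of an infimum of linear functions of u.
```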
[Question.] Yes, correct — it defines the entire class exactly; it's a characterization. It just turns out not to be as strong as we need for the statistical theory. The statistical theory defines a slightly broader class: you take a divergence, broaden it out a little, and go backwards to get all the losses that map into that broadened set — and those turn out to be universally equivalent. You're headed in the right direction; it's just that, for reasons I'll get to, the right object is a little broader than a single f-divergence, and that defines the whole statistical story.

Here's the variational-distance case: if you liked the hinge loss, you would have started with the red curve. The underlying f-divergence is the variational distance, and as we saw on the earlier slide, you can write that using f equal to the min function. Take the conjugate dual of that and you get out a function with a hinge-like look to it; plug in different choices of G and you get out different curves, including the hinge loss and other curves that are equivalent to the hinge loss, in the sense of giving the variational distance as the induced f-divergence. With Kullback–Leibler you can play the same story, and so on.

I'm about ready to run out of time — let me page through a couple of things to see where I am. I think I'll set up Bayes consistency and the universal-equivalence story and stop there; there are three theorems here, and I won't get to Theorem 3, which is my favorite.

So: I've given you a theorem that relates f-divergences and losses, and now let me anticipate the use we'll make of it. The 0-1 loss is the goal of classification, so let's start with it. We map into divergence space and we get the variational distance, with f chosen as before. Now consider a broader class of f-divergences, defined by taking affine expansions of the original f = min(u, 1) — a bigger class of f-divergences. Collect all of those. If I draw a little picture — and we'll return to this next time — we started with the particular loss we care about, the 0-1 loss; we mapped over to the space of f-divergences and got the variational distance; now we broaden out a little and take the class of affine combinations of that one; and then we go backwards and get a broader class of loss functions — all the ones that map into this broadened set; it's composed of several subsets. And it will turn out that all of those losses have the same statistical properties as the 0-1 loss: you get consistency immediately for all of them. Anything outside that class does not — those losses are not universally equivalent, and we can prove lack of consistency. You can't see yet why I picked this particular affine family — you'll see that in Theorems 2 and 3 — but take it as reasonable for now.
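A small numeric illustration of the equivalence being set up here, with invented toy numbers: the hinge loss maps to (an affine transform of) the variational distance — the same divergence the 0-1 loss maps to — so the two profiled risks must rank any collection of quantizer designs identically. The designs below are random row-stochastic matrices P(Z | X).

```python
# Equivalent losses rank quantizer designs identically: 0-1 (closed form)
# vs. hinge (numerically profiled) over random 1-bit designs.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
p0 = np.array([0.6, 0.3, 0.1])    # P(X | Y = -1), toy 3-point covariate
p1 = np.array([0.1, 0.3, 0.6])    # P(X | Y = +1), equal class priors below

def profiled_risks(design):
    mu, pi = 0.5 * (p1 @ design), 0.5 * (p0 @ design)  # unnormalized P(z, y)
    r01 = np.minimum(mu, pi).sum()                     # profiled 0-1 risk
    rhinge = sum(minimize_scalar(
        lambda a, m=m, p=p: m * max(0.0, 1 - a) + p * max(0.0, 1 + a),
        bounds=(-5, 5), method="bounded").fun for m, p in zip(mu, pi))
    return r01, rhinge

designs = [rng.dirichlet(np.ones(2), size=3) for _ in range(30)]
r01, rh = map(np.array, zip(*[profiled_risks(d) for d in designs]))
print("identical rankings:", np.array_equal(np.argsort(r01), np.argsort(rh)))
```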
Doing that will immediately give us Bayes consistency, and it will also give us the converse: only these losses yield Bayes consistency. So I've set up this story. The slides will be available, and if you have nothing else to do with your time you can read the Annals paper, see these theorems, and see how they're proved. All right — let me stop there and take questions; this is half of a story.

[Question about other losses.] This talk is about classification, so we're centered on the 0-1 loss, and I'm developing a theory for that. There's a whole other story for other losses — if I were doing regression, I would develop a parallel story. When I get to the end of this sequence of theorems, I'll be able to say a little more about that; let's just focus on binary classification for now.

[Question about kernels.] Yes — we've worked on distances between distributions defined by kernels, and in fact one of the applications at the end of this talk will be exactly that: kernel-space characterizations. It essentially comes about by having a function space over which you optimize, and letting that function space be a kernel space. It's just one particular choice — one way to get a rich class to optimize over — so it's a convenient part of the story rather than a fundamental one.

[Question: what is universal equivalence?] Let me tell you in one sentence, and you'll see it next time: universal equivalence means that if two procedures are ranked one way by the f-divergence, then they are ranked the same way by the loss. It's the Blackwell theorem I mentioned earlier, extended to the whole class of divergences and loss functions, for any given probability distribution — "universal" in the sense of holding for every probability distribution. That's right: so we'll be able to say that if a loss is not universally equivalent to the 0-1 loss, there must exist problems on which it gives a different answer than the 0-1 loss would — which means it's not a satisfactory loss function to be using, because you can get the wrong answer. And no — this is not an asymptotic theory.

Let me see if someone else has a question; I'll come back to you. [Question about convexity.] Yes — the f is a convex function of the likelihood ratio; that's the key.

[Question: is there a best member of the class?] Great question — no, this theory is silent on that. It identifies the class and returns it to you, and now you choose among that class according to other principles: computational complexity in particular, or sparseness, or some other principle that goes beyond decision theory. And I think that's a good thing to do: narrow down your class, say "all of these are good," and now bring another principle to bear in choosing a member. I don't want to tell you at the end of the day that you have to use this one — that's not the frequentist spirit. Rather: if you want to carve the class down further, you'd better bring another principle to bear — computational complexity, sparsity, something else.

[Question about assumptions on Γ.] No, that's not an assumption — but you're right that it's nonparametric, in the sense that it's over all possible γ; that step wouldn't be true if Γ were restricted. It's over all measurable functions γ. So our eventual consistency theorem is going to have to be formulated with a sieve
of some kind, where we grow toward a rich nonparametric class. But that's the spirit of this whole story: to do this for things like support vector machines, where you're supposed to converge for all possible generating distributions. All right — since the popcorn has stopped popping, we can stop here. [Applause]
Info
Channel: Neuromantic
Views: 7,393
Rating: 4.972414 out of 5
Id: HUAE26lNDuE
Length: 88min 55sec (5335 seconds)
Published: Mon Feb 24 2020