ICML 2018 | Invited Talk by Max Welling Part 2

Captions
by a certain naivete, and I started to think about whether we could update MCMC in such a way that it gives you reasonable answers after a finite amount of time when you have a very large data set. We were inspired by gradient descent: you take your log prior plus your log likelihood term, which is proportional to your log posterior, and you compute a gradient with respect to your parameters, so it decomposes into these two terms. The likelihood term is the offending one, because it depends on the whole data set, and then you take a gradient step. People have long known how to deal with the big-data problem for gradient descent learning: they do stochastic mini-batching. Instead of looking at all the data for every update, you look at a random subset of the data and scale that term up by a factor N over n so that the updates are unbiased. There are some requirements on the step size: it has to decrease over time and satisfy the usual constraints. That is just stochastic gradient descent, which we all know and love and which works very well for deep learning. But there is this other algorithm called Langevin dynamics, which is really just gradient descent except that you add a noise term: a normal distribution with zero mean and variance epsilon, where epsilon is the step size (or two times the step size, depending on the convention). If you add this amount of noise to every step, then in the limit of very small epsilon you will sample from the posterior distribution; instead of converging to a point, you will be sampling from the correct posterior. It sounds like a miracle, but it works. Since we cannot actually use a vanishingly small epsilon, we add a Metropolis-Hastings accept-reject step to make sure we really sample from the correct distribution. So that is Langevin dynamics. Then we thought: there is a very close parallel between gradient descent and Langevin dynamics, so let's see if we can make the same stochastic approximation on the Langevin side. Instead of looking at all the data points, we draw a small mini-batch, scale the likelihood term by the factor N over n, and use a step size that decreases over time with the same requirements. And since the Metropolis-Hastings step is also very expensive (it looks at all the data points), we simply remove it, observing that as epsilon goes to zero the procedure samples from the correct distribution anyway. That is stochastic gradient Langevin dynamics, the first of a long list of algorithms that improved on it, and it lets you run MCMC on very large data sets and gives you approximate posterior samples.
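For concreteness, here is a minimal Python sketch of the SGLD update described above. It is not code from the talk: the gradient functions, the mini-batch, and the step-size schedule are placeholders that the caller has to supply.

```python
import numpy as np

def sgld_step(theta, X, y, grad_log_prior, grad_log_lik, N, eps, rng):
    """One stochastic gradient Langevin dynamics update (illustrative sketch).

    theta          : current parameter vector
    X, y           : a mini-batch of n examples from the full data set of size N
    grad_log_prior : function theta -> gradient of log p(theta)
    grad_log_lik   : function (theta, X, y) -> sum of per-example gradients of log p(y|x, theta)
    eps            : step size (should decay over iterations, e.g. eps_t = a / (b + t)**0.55)
    """
    n = X.shape[0]
    # Unbiased estimate of the full-data gradient: prior term plus the rescaled mini-batch likelihood term.
    g = grad_log_prior(theta) + (N / n) * grad_log_lik(theta, X, y)
    # Gradient step plus Gaussian noise whose variance equals the step size.
    noise = rng.normal(0.0, np.sqrt(eps), size=theta.shape)
    return theta + 0.5 * eps * g + noise
```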
So what have we really done? As I said, in variational Bayes the error shows up as a bias, and in MCMC the error shows up as variance, and we have played the same game here. At one end, if you run with a large step size, you are sampling from the wrong distribution but you are making a lot of progress, and because you are making a lot of progress you can draw a lot of samples. So you are drawing from the wrong distribution, but you can draw many samples, which means you can reduce the variance of your estimate while increasing its bias. In the other limit, when epsilon is very small, you will only draw very few independent samples, but you are drawing them from a distribution that is much closer to the true posterior. All this is doing is playing the bias-variance game: you decide whether to put the error in the bias or in the variance, and it turns out it is actually beneficial to trade off a little bit of the variance against some bias.

In variational Bayesian learning the situation is a little different. Again, we are searching within a family for the distribution that is closest to the true posterior. Let's look at the free energy: it is an integral over the approximate posterior q times the log of the likelihood and the prior, plus an entropy term. To maximize this, the best tool is the reparameterization trick, which Durk Kingma and I introduced in 2013. Instead of trying to do this complicated integral directly, which may be hard even when q is a simple distribution, you transform theta into another random variable omega through a function f(omega, phi) that maps omega to theta in such a way that q(theta) d theta = p0(omega) d omega. The important part is that p0 is a standard distribution that does not depend on any parameters, a standard normal for instance; the parameters phi now live in the function f. Once you apply that transformation, q(theta) d theta turns into p0(omega) d omega, and every occurrence of theta is replaced by f(omega, phi). Now you can pull the gradient with respect to phi through the expectation and apply it directly inside the integrand. That is the reparameterization trick, and it is a really important trick because it reduces the variance: if you were to sample from q directly and use the estimator known as REINFORCE, you would get very high variance, and this trick reduces that variance a lot. That is how we can train these variational Bayesian methods efficiently.
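As a small illustration of the reparameterization trick (a hedged sketch, not the talk's code), here is a PyTorch snippet that estimates the gradient of an expectation under a Gaussian q by pushing parameter-free noise through the mapping f(omega, phi). The toy objective and the shapes are made up for illustration; sampling theta directly and using REINFORCE would give a much noisier gradient estimate.

```python
import torch

# Toy objective: an expectation E_q[f(theta)] under q(theta) = N(mu, sigma^2).
def f(theta):
    return (theta ** 2).sum()

mu = torch.zeros(5, requires_grad=True)
log_sigma = torch.zeros(5, requires_grad=True)

# Reparameterization: theta = mu + sigma * omega, with omega drawn from a fixed
# standard normal, so gradients flow through the sample into (mu, log_sigma).
omega = torch.randn(1000, 5)                 # parameter-free base noise p0(omega)
theta = mu + log_sigma.exp() * omega         # f(omega, phi) with phi = (mu, log_sigma)
loss = f(theta) / omega.shape[0]             # Monte Carlo estimate of the expectation
loss.backward()                              # low-variance gradients w.r.t. mu and log_sigma
print(mu.grad, log_sigma.grad)
```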
OK, so that was the introductory part; now I will tell you a little bit about efficient deep learning. I want to share this plot, which I made and which you should take with many grains of salt, because I just looked up some models in the literature. I hypothesized that the first models in 1943, the first perceptron-style models, had on the order of ten parameters, because there was no computer to learn them; you had to set them by hand. Then in '88, during the previous neural-network hype, there was a model called NETtalk with about 20,000 parameters, which seemed like a lot at the time. In 2009, around the time of Geoff Hinton's deep belief net papers, there were about 10 million parameters in these models, which looked incredible then. I visited Google and Yahoo in 2013, and people were talking about models with 1 billion parameters; nowadays, when you talk to people from Google, they apologize if their model has only 1 billion parameters. And in 2017 there was a paper on outrageously large neural networks with 137 billion parameters. If you plot these numbers on a log-linear plot, where a straight line means exponential growth, you find that if we keep going, by around 2025 we reach a hundred trillion parameters, which is roughly the number of synapses in the human brain. That is not very far away, and we all hope to live to see it, but it also points to the fact that these models use more and more energy to run as they grow, and the question is whether this is a sustainable path forward. I will argue that there are multiple reasons why it is not, and that maybe we can be inspired by the human brain, which seems to do at least a hundred times better in terms of energy efficiency.

So why do I think it is important to start dealing with this energy question now? There are two really important reasons. The first is that the value created by AI must exceed the cost of running the service. If you rank posts on Facebook or other social media and you spend a tiny amount of money on each ranking, say on the order of a ten-thousandth of a cent, a micro-dollar, then the advertising revenue you get from that ranking should outstrip the money you invest in running the neural net. As we run these models in bigger and bigger settings, at some point the energy cost will start to outweigh the economic benefit. The second reason is that I believe a lot of AI will move from the cloud to the edge. If you have a smart hearing aid, you do not want it to use a lot of energy: there is no big battery in it, and you do not want it to warm up behind your ear. The same holds for a cell phone that should not get hot in your pocket. So when we apply these methods to edge computation, it is really important to keep the energy envelope under control. The workloads are very compute-intensive, there is a lot of complex concurrency with many tasks running at the same time, often in real time and always on, and at the same time the device is small, with little storage and limited battery life. That creates a lot of challenges, and energy consumption is a big one. So the central statement I want to make here is that maybe we should measure the success of AI algorithms not solely in terms of accuracy, but in terms of the amount of intelligence they provide per kilowatt-hour: how much intelligence can we extract from a neural net per joule, or per kilowatt-hour? OK, now I am going to go into a little bit more detail about some of these things.
So here is a neural net, and I should first explain the local reparameterization trick, which was introduced in this paper. What you do is convert the stochasticity on your parameters into stochasticity on your neurons. In equations, when you look at the energy term of the free energy, it is an expectation over theta of p(y given the non-linearity of theta times the previous activations). We convert that into a distribution over the activations z given the previous activations, and then p(y given z), because that is this term here. In the animation it looks like this: you can see these things wiggle, these are the parameters fluctuating under the prior or the posterior, and that is converted into fluctuations on the activations. That is an animation of what the local reparameterization trick really does, and now I will try to explain how that idea, and Bayesian statistics more generally, can be used to reduce the energy consumption of neural networks.

We start with a model that has not seen any data whatsoever, and we look at two parameters, w1 and w2, on these axes here. Now we wiggle the whole thing: the whole method is stochastic, we have a posterior distribution over our parameters, and as we wiggle, the parameters w1 and w2 take on different values. This only happens in a Bayesian perspective on deep learning. So this is under the prior. Now we use data, and what happens is that this prior distribution changes into a posterior distribution over possible values. What we see is that w1 gets constrained to a small range of values: we have really transferred information from the data into the value of parameter w1. But w2 was not constrained at all, it stayed equally uncertain, so it is a rather useless parameter. My claim is that we can either remove that parameter or quantize it, representing it at very low precision, and if we represent it at very low precision we can potentially run it at a much lower energy cost in hardware. A similar thing happens when you look at the values of the activations: some of them remain completely uncertain and quite useless, and those neurons can be pruned using the same ideas.

OK, so now I will go into more detail about recent work on using Bayesian statistics to do compression. In terms of the equations I already explained, we need to choose a prior, in this case a prior over the parameters, and we use an auxiliary variable z, which we integrate over, to define the prior over the weights. The thing to take away from this prior, without trying to parse the equation too much, is that it is very sparsity-inducing: it really likes things that are not useful to go away, and it induces sparsity at the level of neurons rather than individual parameters, so it tries to get rid of entire neurons if they do not contribute much to the prediction.
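A minimal sketch of the local reparameterization idea for a fully connected layer with a factorized Gaussian posterior over the weights; the shapes and the initialization of `w_mu` and `w_logvar` are hypothetical, not taken from the talk.

```python
import torch

def local_reparam_linear(x, w_mu, w_logvar):
    """Sample pre-activations of a Bayesian linear layer via the local
    reparameterization trick: instead of sampling a weight matrix
    W ~ N(w_mu, exp(w_logvar)) and computing x @ W, sample the
    pre-activations directly from their induced Gaussian, which has
    mean x @ w_mu and variance x^2 @ exp(w_logvar)."""
    act_mu = x @ w_mu                         # mean of the pre-activation
    act_var = (x ** 2) @ w_logvar.exp()       # variance of the pre-activation
    eps = torch.randn_like(act_mu)            # one noise draw per activation, per example
    return act_mu + act_var.sqrt() * eps

# Hypothetical shapes: a batch of 32 inputs with 100 features, a layer of 50 units.
x = torch.randn(32, 100)
w_mu = torch.randn(100, 50) * 0.1
w_logvar = torch.full((100, 50), -6.0)
z = local_reparam_linear(x, w_mu, w_logvar)
```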
We also have to choose the posterior distribution q(W, z), and this is the particular form we use; it is called a dropout posterior. A paper by Dmitry Vetrov and his students developed it, and we have closely related work as well. The important property of this posterior is that the mean and the variance are tied: bigger means also have larger noise. There is a very nice animation, made by Vetrov's students, showing what happens when you run inference and learning with this method: it sparsifies the model and gets rid of most of the parameters. Here you see the accuracy, which stays about the same, so you are not compromising the model, while the compression ratio goes up to about 270 times, which means you keep only one out of every 270 weights; the rest go away, and it completely sparsifies these filter maps. I find it really fascinating that we over-parameterize these neural networks by such a large degree. I have learned throughout this conference that this may have a lot to do with the fact that optimization is much easier in a much bigger space; but once you have found your model, you can prune it down very effectively, and once you have pruned it, many of these filter maps can simply be thrown out, which saves energy when you do the computation. Here is the same plot for the last fully connected layer, and you see that almost no weights survive while the accuracy stays the same. So I think this is a very interesting result: we can use Bayesian methods to compress these models by a huge factor.

In the last ten minutes or so I want to go into some of our more recent papers; they are not all published yet, but they are probably on arXiv. This is joint work between OpenAI and AMLab, where we again look at the energy term and write every parameter as theta-prime times a function g(s), which you can think of as a soft version of a binary gate: if it is 0 the parameter is pruned out, if it is 1 the parameter is kept, but we make it a soft, differentiable version. The distribution we use is called the hard concrete distribution: it has a little bit of probability mass between 0 and 1, but it has a spike at 0 and a spike at 1. We then add an extra term that says the probability of nonzero values of this gate should be restricted, because I want most of my mass to be at 0, so that as many parameters as possible are pruned out. The results show that you can keep the same accuracy while shrinking these models substantially: the MLP with 784, 300 and 100 neurons can be pruned down to architectures on the order of 219-214-100 or even 266-88-33 units. So a huge amount of compression happens if you apply this, and it is a fairly practical method, I would say.
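To make the gating idea concrete, here is a hedged sketch of a hard concrete gate and its expected-L0 penalty. The hyperparameter values (beta, gamma, zeta) are commonly used defaults rather than anything stated in the talk, and the wiring to a weight matrix is purely illustrative.

```python
import torch

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample a 'soft' binary gate z in [0, 1] from a hard concrete
    distribution: a stretched binary concrete clipped to [0, 1], which
    places point masses exactly at 0 and 1 but stays differentiable
    in between."""
    u = torch.rand_like(log_alpha)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma        # stretch to the interval (gamma, zeta)
    return s_bar.clamp(0.0, 1.0)              # clip: probability mass piles up at 0 and 1

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Expected number of non-zero gates; used as the penalty that pushes
    probability mass toward exactly zero, i.e. toward pruning the parameter."""
    return torch.sigmoid(log_alpha - beta * torch.log(torch.tensor(-gamma / zeta))).sum()

# Hypothetical usage: gate a weight matrix theta element-wise.
log_alpha = torch.zeros(300, 100, requires_grad=True)
theta = torch.randn(300, 100, requires_grad=True)
z = hard_concrete_gate(log_alpha)
effective_weights = theta * z                 # the theta' * g(s) factorization from the talk
penalty = expected_l0(log_alpha)
```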
This is other work, with the Delta Lab, which is a lab within the University of Amsterdam; Jorn Peters did this work. It is a binary neural net, and again we exploit stochasticity and probabilities in order to run it more efficiently. There is a distribution over the binary model parameters, and because of the central limit theorem, since you are summing a lot of these binary values, the pre-activations look approximately Gaussian. We then define stochastic batch normalization and stochastic max pooling, which are generalizations of batch normalization and pooling to stochastic random variables. Then we threshold: we take this distribution and threshold it at some point, converting it back to a Bernoulli distribution, sample from that, and we have the next layer of binary activations. It turns out you can back-propagate through this whole chain and train these networks quite efficiently, with fairly good results.

Here is another piece of work, between Qualcomm AI Research and the QUVA lab, where we think about training a neural net knowing that in the end the parameters will have to be quantized onto a grid, so we train it in a way that anticipates the quantization and does not lose much accuracy when it happens. Here are the values of the parameters, and we choose some grid points (in fact you can also learn the grid points), and we have a distribution over where each parameter can land. We then assign probability mass to each bucket: the mass that falls in this bucket becomes this probability, and since everything is expressed with probabilities, you can train and back-propagate through them. After training, you see that the model really likes to quantize itself: it puts spikes on the grid points, and the last layer even tries to binarize itself. So you can train a network to be quantizable, and the results look pretty good.
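A toy sketch of the kind of probabilistic assignment of weights to grid points described above; the grid, the Gaussian noise scale, and the use of the expected grid value are illustrative assumptions, not the exact procedure from the paper.

```python
import torch

def soft_quantize(w, grid, sigma=0.05):
    """Assign each weight a probability of landing on each grid point by
    integrating a Gaussian centred at the weight over the bucket around
    each grid point, then return the expected quantized value. The result
    is differentiable in w (and could be made differentiable in the grid)."""
    # Bucket edges: midpoints between neighbouring grid points, with far-out edges at the ends.
    mids = (grid[1:] + grid[:-1]) / 2
    edges = torch.cat([torch.tensor([-1e9]), mids, torch.tensor([1e9])])
    normal = torch.distributions.Normal(w.unsqueeze(-1), sigma)
    cdf = normal.cdf(edges)                   # shape: (..., num_grid_points + 1)
    probs = cdf[..., 1:] - cdf[..., :-1]      # probability mass falling into each bucket
    return (probs * grid).sum(-1)             # expected grid value (soft quantization)

grid = torch.linspace(-1.0, 1.0, 5)           # hypothetical 5-level grid
w = (torch.randn(10) * 0.5).requires_grad_()
w_q = soft_quantize(w, grid)
```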
OK, the last topic I want to talk about is spiking neural nets. This is another way in which we waste a lot of energy: if a surveillance camera is looking at a shop, 99% of the time nothing happens, so if you analyze every frame with a deep neural net you waste a huge amount of energy. There are spiking, event-based cameras that only process things that change, and that inspired us to come up with a spiking neural net. A normal neural net has an input, then a linearity, then a non-linearity, and so on. We replace that with an input, then an encoding step (I will explain what that is in a moment), then a quantization step, then the linearity, and then a decoding step. The encoder is now stateful: it is a linear combination of the actual incoming signal and the difference with the last signal, and because it looks at differences, this term goes to zero when nothing changes. The quantization is known as Sigma-Delta modulation, and it is similar to something I did a long time ago called herding; I knew it was good for something, and it turns out to be good for quantizing these input values. It is basically a little potential that rises, and when it reaches a certain threshold it sends out a spike and resets; the decoder is the inverse of that. If you only take differences, Sigma-Delta modulation in its pure form, the output approximates the signal by a piecewise constant function, and every spike changes the level. If we also mix in a little bit of the input, it has exponentially decaying parts as well, and the most important thing is that mixing this in allows us to do stable back-propagation on this spiking neural net, and it trains very well.

So let me quickly wrap up. We have discussed black holes and physics and how, underlying all of that, is information theory, and we have seen that gravity might actually be entropy maximization. From the Hitchhiker's Guide to the Galaxy we know that the Earth really is a big computer, built by mice, trying to compute the answer to the ultimate question of life, the universe and everything. If you think that is funny, let me remind you that there are people who genuinely think we live in a giant simulation, so maybe it is not such a strange idea; in some sense you could argue the universe is a huge computer trying to compute something. We looked at the free-energy cycle, going from physical free energy to model free energy and back. I have talked about Bayesian deep learning and how it can help us compress and quantize our neural nets in order to save energy. Of course, we all know that Bayesian methods also help you regularize and generalize better; there is interesting work here at ICML on using Bayesian deep learning for confidence estimation; and there is work showing that Bayesian methods can help you train more privacy-preserving models and that they seem to give you a little bit of robustness to adversarial examples, although they do not protect you completely. Let me end with a quote that surprised me, because it shows that Steve Jobs knew about free energies: "This revolution, the information revolution, is a revolution of free energy as well, but of another kind: free intellectual energy." So I would say: think deep, and think free. Two quick announcements. We have recently started Qualcomm AI Research, an environment where we can do fundamental research and write papers about all of these intriguing problems around compression and quantization, and of course we are hiring, like everybody else. We are also hiring at the University of Amsterdam, where we are looking for a chair in machine learning, a full professor, so if you are interested, let me know. And this is the final slide, which happens to be slide 42, written in 42-point Calibri, so if you have any questions, you know what the answer is going to be. Thank you very much.

Thank you, Max. Are there questions? Yes, can you go to the mic? There is somebody there too.

Thanks for the talk; I am from Berkeley. You mentioned compression using Bayesian inference; could you compare how this stacks up against state-of-the-art compression methods? You showed 240x compression, but that was for a very
simple problem, MNIST, which we know is not a very representative problem in deep learning.

Yes, so of course we did a lot of experiments on other data sets and models as well, and we compared to a number of other methods, including SVD-type models, where you constrain the weight matrix, and clustering methods. It typically looks like for large compression ratios it does a bit better than these other methods, and for smaller compression levels it does about the same. But you should read our papers to see the full story, because we have lots of experiments covering all of this.

I am from Los Alamos. You started by saying that physics is information, and then you discussed how you reduce the number of degrees of freedom in deep learning. But another approach would be to put physics into those models, to have them model the physical or engineered world. Can you comment on that as another way to reduce?

That is an interesting suggestion. One way I could see that happening, and it is something we are definitely doing, is that you can model the device on which you are doing your computation. I am not sure that is what you are heading for, but if you model your phone, or whatever you want to run your deep learning model on, then you can directly optimize against that simulation of the device, or against the real device. In other words, you use the physics to evaluate how well you are doing and give feedback to the learning algorithm to optimize against it. Is that where you are heading, or something more principled, more fundamental?

Well, I guess I am asking how we mix the physics of the phenomena we want to model with deep learning, a physics-informed kind of learning.

Yeah, we would have to think about that more; that is a difficult question to answer here on stage, but we should talk more.

Thank you for a very nice talk. When you compress these neural networks, you are of course making them simpler. Does that also make it easier to explain what they are doing, either for a particular decision or to extract insights into what the network has learned that can teach us something?

I wish that were true, but I do not think it really is. It is just removing bits and pieces of the model: removing filters, removing weights, quantizing weights. So when I look at the result, it does not look any more interpretable to me than it was before, unfortunately.

Thank you, Max, for the talk. One question I had: this whole philosophy is striving for intelligence per kilowatt-hour at prediction or consumption time, but the same question exists at training time. Right now we are absolutely ignoring efficiency during training, justifying it by the optimization gains, and focusing only on the downstream use. How do you see this paradox resolved?

Well, it is a good question. Fortunately, you often have to do training only once, or not very frequently, whereas execution you may have to do a trillion times a day or something like that, so there is a big difference between the two. Yet I think it is very interesting to think about whether you can also improve the efficiency of training. Typically, many
methods use full-precision representations of the model during training, but you could imagine developing methods that already use low-precision weights and activations during training as well. I think that is a wide-open area that looks very interesting to me.

Next question: you mentioned Maxwell's demon, so I am wondering whether you are familiar with the arguments for reversible computing and how you see that.

Yes. Interestingly, the very last part of the resolution of Maxwell's demon concerns reversible computation. It turns out you can do these measurements of where all the molecules are using reversible computation, so that does not cost you any energy. But if you are thinking of a cycle, at some point you will have to delete some of the information you have written down, and that deletion process is clearly not reversible; that is where you are going to spend your energy. So any reversible computation can in principle be done at zero energy cost, but typically there are parts that are non-reversible, and those parts are what cost you energy.

Next one. Thanks, great talk. I find the spiking neural net quite interesting, so I have a general question: have you thought about using spikes to encode the gradient? In other words, your gradient stays at zero most of the time, but when you are quite sure which direction to move, you give it a big spike, so the temporal average of the spikes encodes the temporal average of the true gradients. I was wondering whether that is related in concept.

Yeah, I would have to think a bit more about that, but in our back-propagation pass we also use the spiking architecture, so both the forward and the backward pass use spiking; we do, however, collect statistics before we actually make an update. You may be suggesting something that goes even further, which is to also encode the actual gradient in terms of spikes; we have not thought about that. We did think a little about whether, if you have multiple agents learning in a distributed fashion, the messages between the distributed learning agents could be represented using spikes, but that is a bit different from what you suggest, which is interesting. Okay, thank you.

So we will take the last two questions, and while we do that, can the next speakers start to set up?

Thanks for the talk, which I see as being all about how physics and computation are intertwined at a deep level. Your background is in physics, so we could think that that background enabled you to see these links as you moved from one field to the other; or you could see it as your background in physics causing you to really love the possibility of such connections. So I wonder, when you look at yourself and introspect, do you see both of these forces, and how do they interplay?

That is a personal question, very interesting. I do surely love the fundamental nature of those questions, and maybe I was speculating a bit in certain places, but I actually like to think that sticking out your neck and speculating a little is not a bad thing. Yes, I guess it has influenced me a little in the way I think about this. Thank you.

Hi Max,
thanks for the talk. I am wracking my brain to figure out whether there is a connection between the adversarial problems these networks face and entropy, and whether there is a connection between this compression and the battle against the adversarial example problem.

You mean whether quantizing or compressing a network will help you against adversarial examples? So in some way you are reducing the dimensionality of the network a little bit, getting rid of extraneous information. I do not want to make any claims here, because it seems like any time somebody comes up with an adversarial defense, it is easy to beat it again. There are some papers saying that following a Bayesian perspective will protect you a little bit, but not completely. As for massive compression of your neural net, I do not think we can say with any certainty that it will become more or less robust against adversarial examples. I do not think there is a clear connection, but who knows. Okay, thanks.

Thanks, Max. Let's give him a round of applause again. [Applause]

So our second talk for the plenary session is another best paper award: the best paper award for "Delayed Impact of Fair Machine Learning" by Lydia Liu, Sarah Dean, Esther Rolf, Max Simchowitz and Moritz Hardt. Let's give them a big round of applause.

Hi everyone, thank you for coming. I am going to speak to you about the delayed impact of fair machine learning; this is joint work with my co-authors, who have already been introduced. I want to first share with you a cartoon about fairness research in machine learning. There is a clear trend in our field, which is an increasing trend: the number of papers written about fairness has increased exponentially up to last year, and this is our projection for, say, 2018. It is very encouraging and exciting that fairness has become an active research area in our community, and a lot of this work has focused on different mathematical definitions of fairness and how we can achieve them algorithmically. Earlier this year, Arvind Narayanan gave a tutorial in which he confirmed that there exist at least 21 different definitions of fairness. So one question now is: what do all these criteria do? In fairness research there is a notion of a protected group, defined on the basis of a sensitive attribute, which could be gender or race, for example. In a lot of work on algorithmic fairness there tends to be an implicit assumption that fairness criteria will make the protected group better off, or at least protect it from some kind of harmful outcome, but the justification for this is often left to intuition. In our work we critically examine whether this implication actually holds.

Let's first explore this question with an example. Here I am showing you two groups, a blue group and an orange group, which could stand in for two different race, gender or age groups, and these are the histograms of credit scores in each group. Higher credit scores imply that a greater proportion of people repay their loans: the solid circles indicate the proportion of people who would repay, and the empty circles indicate the proportion of people who would not repay. A priori we do not know who would or would not repay their loans; we only know, for each credit score, the proportion of people who would
and would not repay. Now, given these scores, the bank would like to know how to give out loans to these two groups, and suppose they care about fairness: say we think of the blue group as the protected group, and we want to apply some fairness criterion to ensure the loan policy is fair in some sense. Let's try applying one of the common fairness criteria, called demographic parity. What does it mean? It means we have to approve the same fraction of loans in both groups, which translates into two group-specific threshold scores: we approve loans only if the score is above the respective black threshold. With this policy we are approving the same fraction of loans in both groups. Now, after an individual gets a loan, their score will change as a result of whether they repay or default. For example, in the orange group we can see that all the people who got loans actually repaid them, and their credit scores went up as a result, whereas in the blue group more people defaulted than repaid, and on average the score of the group actually got worse. Why is this a problem? If your average group score goes down, that is a harm, because on average it means more debt and less access to credit in the future; it is definitely an undesirable outcome. So what I have shown you with this example is that once we applied a fairness criterion, it did not seem to help the protected group, especially once we consider the impact that loans have on scores. This motivates us to study the delayed impact of classification more systematically. In our work we first introduce a tool called the outcome curve, which allows us to compare the delayed impact of fairness criteria, essentially what we did in that simple example but in a much more systematic manner, and we provide a complete characterization of the delayed impact of three different fairness criteria. In doing so, we discovered that fairness criteria may actually cause harm to the groups they are intended to protect.

Before I go into the results, let's set up the problem with some key definitions. Formally, what is a score? I already mentioned scores just now; formally, it is a scalar random variable that is a function of an individual's features. We can think of the score as coming out of a black-box machine learning algorithm that takes in features and spits out a scalar. In our running credit example, an example of a score is the credit score used in the U.S.
to assess creditworthiness. It is an integer-valued random variable, and it could be a function of features such as an individual's credit history or their demographic information. If an individual has a particular score, then a group of individuals has a distribution over scores, which might look something like the histogram I showed you just now. Intuitively, the score should correspond to an individual's success probability; in the credit example, this is the probability of successfully repaying the loan. To formalize this, we write down the monotonicity assumption, which states that higher scores mean a higher probability of repayment. Now that we have these scores, what can an institution such as a bank do with them? The bank will want to classify individuals using the score, which specifically means choosing an acceptance threshold score that maximizes its expected utility. What is this utility? For the bank it is the expected profit, which consists of two terms: the expected reward from people who repay, minus the expected loss from people who default, where the expectations are taken over the random events of repayment versus default and over the individuals in the population. For a group with a score distribution like the one shown here, choosing a threshold score T corresponds to accepting a beta fraction of that group; for any group distribution, a threshold T corresponds to a beta, the proportion of people accepted. When there are multiple groups, these thresholds can be group-dependent, they do not have to be the same, and in fact this is a necessity under certain fairness constraints we might apply. Now we are ready to define the delayed impact. As we saw in the example, the scores of accepted individuals change according to their success: if someone repays their loan, their score increases, and if they default, it decreases. So there is a change at the individual level, and at the group level the distribution of scores in that group changes as a result. A natural way to summarize this change in the distribution of scores is the mean change in score, and that is the quantity we focus on, the delayed impact: the mean change in score in each group, denoted Delta mu, which is simply the expected difference between the new score and the old score in a group. Our first result is that under very mild assumptions we can characterize the delayed impact Delta mu as a concave function of the acceptance rate. If we plot this curve, it looks something like this: it has a unique maximum. The x-axis tells us the fraction of people in a group to whom the bank gives out loans. The bank could use an acceptance rate of 1, giving loans to everyone, which does not sound like a good idea, or it could give loans to no one, which is also not great, because then there is effectively no bank, there is no effect on the average score in the group, and Delta mu is 0. The natural strategy for the bank is to maximize its own expected utility, which results in the acceptance rate beta MaxUtil.
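As a toy illustration of the outcome curve (a sketch with made-up numbers, not the paper's code or data): for a discrete score distribution with monotone repayment probabilities, we can sweep the threshold and compute the mean score change Delta mu at each acceptance rate. The `gain` and `loss` score changes and the bins are hypothetical, and rejected individuals are assumed to keep their score.

```python
import numpy as np

def outcome_curve(scores, pop, p_repay, gain, loss):
    """Delayed impact Delta-mu as a function of the acceptance rate, for a
    single group with a discrete score distribution.

    scores  : array of score values, sorted ascending
    pop     : fraction of the group at each score (sums to 1)
    p_repay : repayment probability at each score (monotone in the score)
    gain    : score increase for an accepted individual who repays
    loss    : score decrease for an accepted individual who defaults
    Returns (acceptance_rates, delta_mu) for thresholds placed at each score bin.
    """
    betas, delta_mus = [], []
    for t in range(len(scores), -1, -1):      # threshold index: accept bins scores[t:]
        accepted = pop[t:]
        p = p_repay[t:]
        beta = accepted.sum()
        # Mean score change over the whole group: only accepted individuals move.
        delta_mu = np.sum(accepted * (p * gain - (1 - p) * loss))
        betas.append(beta)
        delta_mus.append(delta_mu)
    return np.array(betas), np.array(delta_mus)

# Hypothetical toy group: 5 score bins with rising repayment probability.
scores = np.array([300, 400, 500, 600, 700])
pop = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
p_repay = np.array([0.2, 0.4, 0.6, 0.8, 0.95])
betas, dmu = outcome_curve(scores, pop, p_repay, gain=75.0, loss=150.0)
```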
Now, if the bank loans out at less than beta MaxUtil, that gives an average score change that is less positive than under maximum utility, and we call this regime relative harm, because the group's score is increasing less than it would under MaxUtil. Because defaults are so much more costly for the bank, the bank tends to be conservative, so under utility maximization it will actually not give loans to some people who are still likely to repay. By sacrificing some expected profit and giving out more loans to those people, the bank can increase the average score further: as we increase the acceptance rate, Delta mu increases further, up to the maximum of the curve, which is achieved at what we call the philanthropic optimal acceptance rate. In that regime we have relative improvement, because the group's score is increasing more than under pure utility maximization. After this point the bank has to start giving loans to people who are less likely to repay on average, which again brings down Delta mu, and if we continue past a certain point we are again in the relative harm regime, where the defaults outweigh the repayments enough that the average score change is worse than under utility maximization. Eventually, if we give out even more of these bad loans, the mean score change becomes negative, and we call this the active harm regime, because it actually brings down the group score on average. So there are three regimes of interest on this outcome curve: relative improvement, relative harm, and active harm. The question now is where fairness criteria put us on this outcome curve, and in which cases.

Before that, let's first review what these fairness constraints are. As an alternative to simply maximizing expected utility, we can maximize utility subject to certain constraints. For demographic parity, which we saw just now, the constraint is that all groups have equal acceptance rates. To illustrate with two groups: we look at all the dots in each group, the people who would repay and the people who would not, and we equalize the fraction of people we accept in both groups. Another common fairness criterion is equal opportunity, and the difference is that now we only look at the solid dots, the people who would repay, and we equalize the fraction of solid dots we accept. This is called equalizing true positive rates, because a solid dot that is also accepted counts as a true positive. A concrete sketch of both constraints follows below.

So where do fairness criteria put us on the outcome curve? The first result is that both of these fairness criteria can cause any of the three regimes of interest I described on the outcome curve, whereas unconstrained utility maximization does not cause active harm. What this says is that, depending on the setting, that is, on the distribution of scores and the relative group sizes, fairness criteria can put us anywhere on this x-axis, whereas because institutions are conservative, utility maximization will never put us in the active harm regime. Okay, so is there some way to compare the two fairness criteria and see whether they behave differently?
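To make the two constraints concrete, here is a hedged toy sketch that picks per-group thresholds maximizing the bank's utility subject to approximately equal acceptance rates (demographic parity) or approximately equal true positive rates (equal opportunity). The discrete bins, the tolerance, and the utility values `u_plus` and `u_minus` are illustrative assumptions; the paper's analysis uses continuous or randomized thresholds rather than this brute-force search.

```python
import numpy as np

def group_stats(pop, p_repay, t):
    """Acceptance rate and true positive rate when accepting score bins t and above."""
    beta = pop[t:].sum()
    tpr = (pop[t:] * p_repay[t:]).sum() / (pop * p_repay).sum()
    return beta, tpr

def utility(pop, p_repay, t, u_plus=1.0, u_minus=-4.0):
    """Bank's expected profit from one group: reward for repayments, loss for defaults."""
    return np.sum(pop[t:] * (p_repay[t:] * u_plus + (1 - p_repay[t:]) * u_minus))

def best_thresholds(groups, criterion):
    """Per-group thresholds maximizing total utility subject to (approximately)
    equal acceptance rates ('dp') or equal true positive rates ('eo')."""
    n_bins = len(groups[0][0])
    best, best_u = None, -np.inf
    for ta in range(n_bins + 1):                  # threshold for group A
        for tb in range(n_bins + 1):              # threshold for group B
            stats = [group_stats(p, r, t) for (p, r), t in zip(groups, (ta, tb))]
            key = 0 if criterion == "dp" else 1   # compare beta for DP, TPR for EO
            if abs(stats[0][key] - stats[1][key]) > 0.05:
                continue                          # constraint violated (up to tolerance)
            u = sum(utility(p, r, t) for (p, r), t in zip(groups, (ta, tb)))
            if u > best_u:
                best, best_u = (ta, tb), u
    return best

# Hypothetical two-group example with the same kind of toy score bins as above.
pop_a = np.array([0.1, 0.2, 0.3, 0.25, 0.15]); rep_a = np.array([0.2, 0.4, 0.6, 0.8, 0.95])
pop_b = np.array([0.3, 0.3, 0.2, 0.15, 0.05]); rep_b = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
groups = [(pop_a, rep_a), (pop_b, rep_b)]
print(best_thresholds(groups, "dp"), best_thresholds(groups, "eo"))
```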
The answer is also yes. In Theorem 2, we show that there exists a class of settings where demographic parity may cause active or relative harm by over-accepting, whereas equal opportunity does not. On the outcome curve, demographic parity puts us in the active harm regime because it over-accepts relative to beta_0, the threshold for active harm, while equal opportunity does not, and in this case gives relative improvement of the average score. On the other hand, there is another class of settings where equal opportunity may cause relative harm by under-accepting: on the outcome curve, equal opportunity under-accepts relative to beta MaxUtil, whereas demographic parity does not under-accept and in that case gives relative improvement. What we discover here is that the choice of fairness criterion really makes a difference to the kind of outcome we see in a group, and depending on which criterion we choose, we can observe completely different outcome regimes.

Now that I have shown you some theoretical results, let's look at how we would apply the outcome curve framework to real data. We looked at 300,000 credit scores, and as I mentioned, these scores predict default risk. We used these data to estimate the group score distributions, repayment probabilities and relative sizes for two different race groups, a white group and a black group. Recall that the main assumption we have to make about scores is that the repayment probability is monotone in the score, and this is approximately satisfied by the real data, as we can see in this graph. Beyond what we can estimate from the data, we also have to make some modeling assumptions, such as the bank's profit-versus-
loss ratio, which is not something we can estimate from this data set, and the same holds for the delayed impact itself, that is, how people's scores change depending on whether they default or repay. We pick these numbers to be in a reasonable regime where defaults are indeed more costly than repayments are profitable, and in particular more costly for the bank. With these we can compute the outcome curves and analyze the delayed impact under different fairness criteria. If we plot the empirical outcome curves of the two groups, they look something like this: they are approximately concave, because the monotonicity assumption is approximately satisfied, and you can already see that the two outcome curves are very different, because the score distributions are different. Under utility maximization the bank picks an acceptance rate for both groups that maximizes its own utility, and in this case it gives a positive score change in both groups, as we might expect. But if we apply a fairness criterion, say demographic parity, we find that it actually lands the black group in the active harm regime, so their score decreases on average. If we use a different fairness criterion, equal opportunity, the black group experiences improvement, and we can also see that in the white group, no matter which criterion we choose, the score change is always positive. This corroborates one of our theorems, Theorem 2: we are in the setting where demographic parity is over-accepting relative to equal opportunity. We can also use these outcome curves and utility curves to get an idea of why the minority group experiences such disproportionately different delayed impact under fairness criteria: the maxima of the black group's outcome curve and of the utility curves under the fairness criteria are very misaligned, so the group can experience a quantitatively very different delayed impact when we change the fairness criterion, whereas the majority group does not experience that nearly as much. So we can really see how much the minority group is affected by our choice of fairness criterion.

Now that I have shown you some experiments applying the outcome curve to real data, and shown that our theoretical insights are corroborated by these numerical experiments, let me recap. First, we have shared with you the outcome curve framework, which is a very general way to deviate from unconstrained maximum utility while improving outcomes, and we can use it to compare different fairness criteria and anticipate their potential negative impact. Secondly, this is strong motivation to develop domain-specific models of delayed impact: if we do not pay attention to the context of the application, blindly applying fairness criteria can have unintended effects, as we saw in the previous examples. There are many great questions for future work. In our work we considered binary decisions, where the bank either accepts or rejects identical loan applications; could we extend our results to richer decision spaces, such as different loan sizes and different interest rates, and how might we do that? A second direction is that we use the mean score to summarize the change in the distribution, but obviously
the mean score is an imperfect way to characterize welfare for a whole population, so another great question is what other notions of delayed impact we could and should be tracking. Finally, our work is just one of many recent works that try to study the dynamics of the impact that machine learning has on the distribution of variables of interest, in this case a welfare indicator. It is very exciting that our community has started to think more about the dynamics of this distributional impact and about how we can take feedback loops in this dynamic setting into account in order to design algorithms with positive societal impact, and we think there are many fascinating questions in this space. With that, I would like to thank you. Once again, this is joint work with my co-authors, and we have a poster today. Thank you.

Are there questions? Yes.

Thank you for a very nice talk. Just to make sure I take away the right message: it seems your results imply that the fairest thing we can do right now is utility maximization, which is of course something that banks, for example, would love to hear. Is that really what we should take away from this, or are there better things on the horizon that we should be doing?

Okay, thank you for the question. We are definitely not saying that utility maximization is the best you can do. If we look at the outcome curve, utility maximization for the bank does not coincide with the maximum of the delayed impact, so we are already saying there is a lot of room to improve the delayed impact beyond utility maximization. The problem is which part of the curve we land on when we try to intervene. We are certainly not saying that no intervention is the best intervention; we are saying there are many ways to intervene, and asking how to intervene in a smart way if we care about how people are impacted. Does that answer your question? Thank you.

Thank you very much for your work, and thank you for working on the dynamics of welfare here. What I would argue is that the dynamics are actually even more complex, because a higher acceptance rate might itself increase the probability of people paying their money back: you inject more money into the community, and thus more people might actually be able to repay their loans. So in that case it might be better to have a higher acceptance rate, right?

Yeah, thank you for the question; I think that is a great point, and it is also something we have considered when thinking about multi-step effects. There is definitely something more subtle going on when you consider how injecting more money into the community feeds back, so it is something we thought about, but it is not within the scope of this work. It would be great to talk about it more offline. Thank you. Next question.

Hi, thank you for the talk. I have a comment that is sort of a question: it seems slightly convoluted to me that in your example you define harm as a negative effect on the credit score, when the result of that negative credit score is simply that people can no longer get loans in the future, and so it seems strange to me that the solution would be to accept fewer people for loans in the first place. Does that make sense?

Yeah, sure. So, to make sure I understand
what you are saying: you are saying that if you do not give loans, you have no effect on people who have lower credit scores. Yeah, so clearly our model is a simplified model of the real world. In the real world there may be many drift effects in how people's credit scores change, and we obviously cannot model everything, so we are saying: if this is the only way scores can change in this model, then we can characterize some very clear effects of over-lending and over-accepting. Once you want to take into account the other ways of improving people's credit scores and their ability to repay, I am sure there are other dynamic effects that would have to go into the model. Does that make sense?

Yeah. I guess a slight follow-up: do you have examples of other metrics, other scenarios, where you have this delayed impact, other than the credit score?

Yes, I would definitely like to draw attention to other work in the community; these references here have different ways of looking at dynamic impact, and I would encourage you to talk to those authors and look at the other ways they are modeling the delayed impact. Okay, thank you.

So we will just take two more questions, to keep things moving.

Thanks for the exciting work and a fantastic talk. I was curious: when you start this estimate, you need an unbiased estimator of the score for a particular population, and I was wondering if you could comment on what happens if that initial assumption is violated. If your initial estimate of the probability of repayment is biased in some way, how would that affect the model and the dynamics you describe here?

Thanks, that is a great question. In the paper we actually also consider a case where there is some bias in the score, which we call measurement error: the case where a group's credit scores are systematically underestimated. We find that this actually expands the regime in which applying a fairness criterion such as demographic parity gives a positive impact. So I would definitely encourage you to look at the full version of the paper; it is a great question, because here we only covered the situation where bias is not such a big problem. Thank you.

Thank you. My question is this: if you look at the score distributions of the two demographic groups, there is a gap, and I think loan repayment is a primary driver of how these scores are developed. Do you have any comment on whether this gap should exist and why it exists at all?

So this is the gap you are talking about, right? Okay. I guess I am not able to comment on why exactly FICO credit scores have this gap, but if I understand correctly, because we assume the monotonicity assumption, if we transform the mapping we are able to make the gap disappear, at least for our empirical simulations. I am not sure I am answering your question, but maybe we can talk more offline if it is about how we did the simulation.

No, it is actually not about how you did the simulation; I am simply making the observation that if the score is supposed to denote anything, by and large a delinquency rate, then we really should not be seeing this difference between different demographic groups right
now. And I think your answer to the previous question indicates that the bigger this disparity between the original distributions, the bigger the impact these fairness criteria will have once you add the acceptance rate consideration. So the question really is: did you have any thoughts about this gap when you saw it in the data?

Yeah, I think part of the problem is that the score is a black box, so we do not really know where it comes from; all we observe is that there is this gap, which should not be there in an ideal world where we really knew everything about the scores. I am happy to take this discussion offline, because there is a rich literature about things like calibration, that is, whether the same score means the same thing for different groups, and I think that is definitely a question of active research. Thank you.

Thank you. So let's thank the speaker again, and you can ask your questions offline and at the poster session too. [Applause] [Music]
Info
Channel: ICML IJCAI ECAI 2018 Conference Videos
Views: 1,310
Rating: 5 out of 5
Id: vzoTh0ZAUAk
Length: 70min 10sec (4210 seconds)
Published: Sat Jul 14 2018