Large Scale Machine Learning

Captions
Thanks for coming, everyone, and thanks to Unbounce for hosting. How many people is this their first time at one of these meetups? Raise your hand. Okay, keep your hands up. How many people have a research background? Keep your hands up. How many people have a programming or software engineering background? Okay, keep your hands up. How many of you are experts at machine learning? Okay, so I'll explain an analogy just so the talk makes more sense. You hear that machine learning and AI are this magic that does stuff, right? But think of it like this: you have a function or a procedure, like when you're programming normally, and what machine learning does is basically fill in the code on the inside. The way it does that is you give it a bunch of examples of what you want: when I get this input I want this output, when I get this input I want that output, and it magically does that. Some of that magic is what Mark's going to talk about. Actually, before we get to that, I have announcements; give me a sec. The first thing I'll say is there are other events related to this: for example, there's the Kaggle competition stuff with this dude, and we also do a reading group. Bruce is going to tell you about the Kaggle competitions. Bruce: Hi folks. How many people here even know what a Kaggle competition is? Almost everybody, cool. So what you may not know is that there is a meetup group that meets every two weeks, and the name of this meetup group is Learn Data Science by Doing Kaggle Competitions. For those who don't know what a Kaggle competition is: these are online competitions posted by real companies, usually people who actually have some data, and they post a machine learning project that's open to everybody to participate in. They're really great for learning how to do machine learning and data science, because even though there are cash prizes for the winners, everyone is very sharing of their code and knowledge and so on.
So what we do in the meetup is pick a past competition that's closed. We do that because it avoids running into problems with the live competitions, where there are a lot of rules you have to follow, but it's still good because all the data is still available: you can still train things, test them, and see how well they do, and people have posted their solutions, so it's a great way to learn quickly. Every two weeks we pick a different competition; somebody volunteers to read about it and then walk the rest of the group through it, so someone has looked at it in detail but everyone kind of looks at it, and we have a great discussion. Great way to learn. So look for us on Meetup under Learn Data Science; Google for "Kaggle learn data science meetup" and you'll find us there, and I hope you can join us. Dave Campbell is going to tell you about the reading group. Dave: Hey folks, I feel like I need to do a survey too: how many of you know how to read? Great. There's a reading group that happens every other Wednesday out of Harbour Centre, SFU Venture Labs; we host it out there. Everybody can suggest a paper, we vote on them, and at the reading group we talk about the papers. Some weeks it's something you know a lot about, and you come with your own expertise to help everybody out; sometimes you maybe don't really know what's going on in that paper, and it's a great time to fill in some holes and ask questions. It's a very open community, and we tend to read a big variety of things. A week from Wednesday we're going to be reading about election polling and how that works; other times we've done things on machine learning algorithms or Bayesian modeling, just a big mix of stuff. Thanks. Host: Are you ready, Mark? Cool. So Mark Schmidt knows a bit about machine learning, I'd say, and he's going to talk about how you deal with it when you have big data and the problems that creates. Mark: Thank you. All right, so thanks to the organizers for organizing, thanks to everybody for coming out, and sorry to the people who are standing and the people who came here and waited a long time. I'm not using the title from the meetup listing; I'm doing this title instead. If you Google it, it's exactly the same thing said in a slightly more technical way. I have given this talk in a lot of places. The talk is about a paper from 2012 that sort of changed the way we look at how we train models in machine learning. I've given this talk all over Europe, all over North America; I've been to Asia a few times, Australia, and I went to Brazil in May. But I was told that the current version of my talk had maybe too many equations, so we've tried to make this a bit more accessible to everybody. Oh, I'm that guy who drops the mic. Okay, if I drop the mic again, I need the people at the back to start waving your hands, especially if you're standing. So I just recently snuck in some slides after hearing the result of that survey: we're going to do a little bit of intro to machine learning before we start on the more technical stuff. So why is machine learning so important right now? It's this big-data phenomenon: we're using technology to collect and store data at an absolutely unprecedented rate. Think of all the data being stored around you. We've got things like news articles; hours of video uploaded to YouTube every second; credit card transactions; we're mapping the human genome; mapping the sky; the Large Hadron Collider has a trillion experiments; phone calls are being recorded like crazy; we're recording every single click you do on Minecraft or Pokémon Go, why not. People call this "big data". It's kind of a buzzword, but we really are collecting tons and tons of data, and this is where machine learning comes in. We want to use this data for fun, for profit, or for the greater good, but there's just way too much data to search through it manually. The traditional ways we think of using data, putting it in a database and looking at tables and things like that, just don't work when you're looking at data that's so big you can't even look at it. Are we okay at the back on volume? Great. Okay, so machine learning is one of the ways we deal with these huge amounts of data. The idea is to use computers to automatically detect patterns in data, and to use those patterns to help make predictions or decisions. It's very similar to statistics, but there's a bit more emphasis on large data sets and computationally feasible methods, a bit more emphasis on making predictions instead of just descriptions, and more emphasis on flexible models that work on many problems rather than thinking very hard about one model for one problem. The typical machine learning framework, which was just mentioned, is supervised learning. As an example, say I want to build a program that distinguishes cats from dogs. The way I do that with the machine learning approach is I collect a whole bunch of images of dogs and a whole bunch of images of cats, and I use that data to learn a model that maps from the inputs to the outputs. The output of that is a program, or a model, where if I give it a new image, it spits out "cat" even if it's an image it's never seen before. If people want to ask questions, just raise your hand and we can talk about other things. I'm hoping most people have seen this before, but if you haven't, you can imagine you can do this for many different tasks. I really don't care about recognizing cat versus dog images, and I think most people probably don't, but this could be any input and any output, and you can probably think of something more useful to put in there.
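The supervised-learning recipe just described (labelled examples in, a predictive program out) can be sketched in a few lines of Python. This is a toy illustration only: the nearest-centroid classifier, the "ear pointiness" and "snout length" features, and all the numbers are made up, standing in for whatever real features and model you would use.

```python
# A minimal sketch of the supervised-learning loop: collect labelled
# examples, fit a model, then predict on an input the model never saw.
# The classifier (nearest centroid) and the features are made up.

def train_nearest_centroid(examples):
    """examples: list of (feature_vector, label). Returns a predict(x) function."""
    sums, counts = {}, {}
    for x, y in examples:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0.0] * len(x)), x)]
    # One "average example" (centroid) per label.
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        def dist2(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda y: dist2(centroids[y]))

    return predict

# Made-up training data: (ear_pointiness, snout_length) -> label.
data = [([0.9, 0.2], "cat"), ([0.8, 0.3], "cat"),
        ([0.2, 0.9], "dog"), ([0.3, 0.8], "dog")]
predict = train_nearest_centroid(data)
print(predict([0.85, 0.25]))  # an "image" the model has never seen
```

The point is only the shape of the pipeline: data goes in, a predict function comes out, exactly the "data writes the program" idea from the talk.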
So one way to think about this is that the input is data and the output is a program. This is a little different from the usual way we write a program, where we have input specifications and output specifications. Here we have examples of inputs and outputs, and since we have lots of data, we use the data to write the program for us. The typical cases where this is useful: the problem is too complicated, so we can't write the program ourselves; or maybe we have a human expert who can't explain why a certain input gives a certain output; or you just don't have a human expert, but you have labels somehow and you want to hope for the best. This is a very fast-growing field, as evidenced by the over 200 people in this room right now. There have been large investments from Google, Facebook, and Elon Musk; lots of local companies are applying these ideas; and in academia, the top machine learning conference, NIPS, increased from 2,500 to 4,000 attendees last year. I don't know what it's going to be this December; it might be less, because it's in Barcelona and lots of people have visa trouble, especially because there's no Spanish embassy on this side of Toronto, which matters if you are from Iran or India, where many of my students are from. Okay, so I shouldn't need to convince you of this, but machine learning is now affecting our daily lives. If Gmail is putting something in your spam filter, that's a system built with supervised learning. There are lots of websites doing product recommendations. If you're playing on your Xbox with the Microsoft Kinect, moving around while it tracks where your limbs are, that's a system called a random forest. Optical character recognition is a classic problem in ML. Translation is now being taken over by machine learning methods. Speech recognition on every single smartphone is a supervised learning system. Face recognition is not always supervised learning, but it's almost always some sort of method you would see in a machine learning course. Sports analytics has gotten into things like non-negative matrix factorization, so they're using machine learning for sports now. I watched the talk by the guy who made this chart: he was a geologist, no, sorry, a geographer, one of the people who make maps, at Harvard, and now he works for the San Antonio Spurs because he started making maps like this. Object recognition is coming pretty soon: your phone is going to recognize your kids, or where your remote control is. Self-driving cars are on the road, and it's a little bit terrifying. Machine learning is used for scientific discovery: this was a case where a machine learning visualization was used and they found there were new types of leukemia. And this is my old office mate: he wrote a program that's supposed to basically take money away from statistical consultants. You give it your regression data set and it writes you a paper about the data set: the plots, the captions, the numbers. It spits out a paper describing all the different trends in your data, does hypothesis testing and all that stuff, so you don't have to pay anyone. In fact, he just started at UVic last month. Okay, so I just want to convince you, if you're not already sold, that machine learning is doing some really cool things. It's not magic: if you have a problem you really want to solve, and lots of data, that's not enough for machine learning to be ready to solve it. It really has to be in one of these nice formats, like supervised learning. But if you can put your data in that format, and you've got nice data, you can do some really amazing things. Okay, how are we at the back? Everybody's happy? All right. Oh, and if you hate all this stuff and you just want to paint pictures, people are doing machine learning for that too. Here's a picture of Tübingen, home to one of the top ML centers in Europe. You give it your picture of Tübingen, you give it a van Gogh, and it spits out a van Gogh version of your picture. How cool is that?
Yeah, I'm going to start putting these on my wall. Actually, Google started selling these for way more money than Google needs to sell paintings for. Okay, so that was the "machine learning, why should you care" part, for those people who are here for the first time and haven't seen something like this before. Now I'm going to get into the more technical part of my talk. Most machine learning methods are written as some sort of numerical optimization problem: you want to minimize some cost function, and that cost function tells you how good your program is at getting from the input to the output. If the cost function is very high, you give it a picture of a cat and it says "dog", or something like that; if the cost function is very low, it means you're doing well. A classic example from statistics, from numerical linear algebra, and from machine learning is least squares. I want to minimize a function f of x; I make a prediction that's a linear combination of my features; y_i is my target, whether I have a cat or a dog; I square the difference, and then I sum it up over all my training points. If you've never seen a linear model before, you should probably just listen to what I say rather than stare at the equation, but many of you have seen linear models before, and you've probably seen this equation many, many times. Great. Okay, but we're not even going to need that notation. I'm going to talk about an even simpler problem: I want to minimize a function f of x that's the average of a set of individual functions f_i of x, and we're going to assume those f_i are differentiable, so I can take the derivative. It's a super simple problem. You would think everything that could be said about this problem would have been said in the 1800s, but it turns out we learned something kind of fascinating about it in 2012. You can think of this as measuring the average error: f averages the costs f_i on individual examples i. So this f_i of x is low if I've predicted image i as a cat and it actually is a cat, and it will be high if it's not a cat. That little framework, minimizing sums, basically includes almost all of machine learning: logistic regression, SVMs, PCA, CRFs, deep learning, regularization, all that sort of stuff. I haven't really made any restrictions yet, but I'm going to have to make one restriction to make my life easier: I'm going to focus on something called convex optimization. The convex functions are a class of functions such that if I go downhill, minimizing the cost function locally, I'll eventually find the global minimum. Whenever I find a local minimum of a convex function, it happens to be the global minimum. These are among the only efficiently solvable continuous problems we have, and you can do a lot with convex models: a whole bunch of the machine learning methods we know and love are in fact convex models, so they're included. We have lost some notable things like deep learning and PCA, though. But the empirically effective non-convex methods tend to be based on methods with good properties for convex functions, and if you have a good method for non-convex functions, it had better have good properties for convex functions, because near a minimizer every function looks like a convex function; so if you don't have good properties for convex functions, you don't have a good method. Finally, tools from convex analysis are now being extended to the non-convex case. We've been working on non-convex optimization for decades and decades; we finally have some nice methods with provably good properties, and many of those are coming from the convex world and transferring into the non-convex world. So the convex world has been very fruitful in the past few years. Okay, does anybody want to make a comment or throw out a question? Does anyone want to be brave at this point? Way at the back.
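The finite-sum least-squares objective described above, f(x) = (1/n) Σᵢ fᵢ(x) with fᵢ(x) = (aᵢ·x − yᵢ)², can be written out directly. This is a minimal sketch with made-up data; the feature vectors and targets are chosen only so the numbers are easy to check, and they happen to admit an exact fit.

```python
# Sketch of the finite-sum objective from the talk:
# f(x) = (1/n) * sum_i f_i(x), where f_i(x) = (a_i . x - y_i)^2.
# Data values are made up for illustration.

def f_i(x, a_i, y_i):
    """Error of the linear model on ONE training example."""
    pred = sum(w * v for w, v in zip(x, a_i))  # prediction = a_i . x
    return (pred - y_i) ** 2

def f(x, A, y):
    """Average error over the whole data set."""
    n = len(A)
    return sum(f_i(x, a, t) for a, t in zip(A, y)) / n

A = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]   # feature vectors a_i
y = [5.0, 2.0, 2.0]                         # targets y_i

print(f([1.0, 2.0], A, y))  # x = [1, 2] fits this made-up data exactly
```

Everything that follows in the talk (gradient descent, stochastic gradient) is about minimizing exactly this kind of average-of-fᵢ function.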
Audience: When you start with some data, how do you know whether the function you're fitting has the convexity property or not? Mark: Okay, so how do you know if a function is convex? I covered that in my class today: there are a bunch of rules you can go through to prove that your function is convex. The thing is, convexity is not a property of your data set; it's a property of the particular cost function f_i that you pick. SVMs, for example, are always convex, no matter whether I'm distinguishing cats versus dogs, or spam versus not spam, or tracking your limbs on the Kinect. So in some sense you usually know beforehand whether your function is convex. Audience: Then maybe it's better to ask it this way: when you want to start training a model, how do you know the model you're picking is the right model to reach the optimal point? I'm not sure how to phrase it; it's more of a practical question. If you have some data and you fit a model, how do you know that kind of model is the right selection? Mark: That's a great question, and it has a very unsatisfying answer. In machine learning there's a theorem called the no-free-lunch theorem. It says that if I look at data I've never seen before, truly new images of cats that were not used during training, all models are equivalent in some sense: if model A works better than model B on one task, model B is going to work better than model A on another task. The implication is that you have to try stuff out; you have to figure out which models work better for your types of problems. There's no way to say a priori that one model works better than another unless you've seen similar problems, so a lot of this comes from experience as opposed to theory. Audience: Is this mic working? How does regularization fit into the convexity? Mark: Great question. Almost all the usual regularizers, things like L2 regularization and L1 regularization and so on, are not only convex, but L2 regularization in particular makes your problem strongly convex. If you look at my papers, that's a crucial part of the theory (we've subsequently gotten rid of that assumption), but regularization makes your function more convex and easier to solve, just like it improves your test error. Okay, so the classic approach to convex optimization comes from 1847 or so, so it's older than everyone in the room. It's called gradient descent, and if you go to my web page I have the paper linked; it's in French, scanned by the French library. We have a paper copy at UBC too; the pages are completely yellow, but you can go and check it out. It's by a guy named Cauchy, and if you've taken math classes you'll recognize the name: he's a super famous real-analysis guy. Okay, so with gradient descent, we go through our entire data set to compute the gradient, and then we update our model based on that gradient. We start with some guess, call it x_0, and from x_0 I generate x_1; once I have x_1, I generate x_2, and so on. I generate a sequence of guesses, each one better than the last, and the way I do that is I subtract a small step size times the gradient. Conceptually: we have our function f of x, we're at some point, the function has a slope there, and we go in the downhill direction to decrease the function. We take the gradient, move along it, and hope we get to a minimizer.
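The update just described, x_{t+1} = x_t − α ∇f(x_t), where each gradient evaluation is a full pass over the data, can be sketched as follows. This is a minimal illustration on made-up least-squares data (the same toy objective as earlier in the talk); the step size 0.1 and the 200 iterations are arbitrary choices, not tuned recommendations.

```python
# A minimal sketch of gradient descent on least squares:
# each step computes the FULL gradient (a pass over all n examples),
# then moves x a small step downhill. Data and step size are made up.

def full_gradient(x, A, y):
    """Gradient of f(x) = (1/n) sum_i (a_i . x - y_i)^2."""
    n, d = len(A), len(x)
    g = [0.0] * d
    for a_i, y_i in zip(A, y):
        r = sum(w * v for w, v in zip(x, a_i)) - y_i  # residual on example i
        for j in range(d):
            g[j] += 2.0 * r * a_i[j] / n              # grad f_i / n, summed
    return g

A = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
y = [5.0, 2.0, 2.0]
x, alpha = [0.0, 0.0], 0.1

for t in range(200):                  # x_{t+1} = x_t - alpha * grad f(x_t)
    g = full_gradient(x, A, y)
    x = [w - alpha * gj for w, gj in zip(x, g)]

print(x)  # approaches the minimizer [1.0, 2.0]
```

Note the cost structure the talk emphasizes: every one of the 200 steps touches all n examples, which is exactly what becomes infeasible when n is the whole internet.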
When we're minimizing these huge sums, the gradient of the average just ends up being the average of the gradients, because derivatives are additive. As a picture: imagine this is some sort of bowl, and the minimum of the bowl is here, coming out of the whiteboard. Think of it like a temperature map: this is the hottest area, and it gets colder and colder as we go out. If we start here, we move inside the circle, then further inside, and you hope you eventually get to the minimum. The nice thing about this algorithm is that you can prove the number of steps is polynomial. If you're a computer scientist, you're happy: polynomial is what we want; this is good. But every step of this algorithm needs to go through the entire data set: I need the gradient on image 1, the gradient on image 2, the gradient on image 3, all the way up to the gradient on image n, and today I want to talk about the case where the number of training examples is very, very large. There are fancier methods you may have learned about, like quasi-Newton methods and so on, but they still go through the entire data set to compute a gradient, so they still have this problem. Okay, so if our data set is every web page on the internet, or every product on Amazon, or something like that, this is too slow. We're not going to go through the internet 50 times to try to find a minimizer; we're going to go through it one time, or two times, or five times. We have to update as we go; we can't just wait and compute these gradients, they're too expensive. If you've taken a machine learning course, they give you a standard answer: use something called stochastic gradient. Instead of looking at every single training example, you look at one random training example in your data set, compute the gradient of that one example, and update the model based on that individual one. Imagine that instead of going through the entire internet to see how well our classifier does, you just look at one random web page: I take the gradient of that one random web page, the gradient of f_{i_t} at x_t, and I do gradient descent as if that were the entire internet. As you can imagine, that's a crude approximation, representing the entire internet with one web page: not great. But I'm going to do this over and over and over again, and on average it's still going to point me in a good direction if I do it many, many times. So here's the same picture for stochastic gradient descent. Stochastic gradient descent is like gradient descent's drunk cousin or something: it's not always pointing straight down. It starts out taking big steps, sometimes goes up, outside the circle, back inside; slowly the steps get smaller, and eventually you can converge. If any of you have played with this algorithm, you know this picture is being very generous to stochastic gradient descent; it does not work even close to this well. But here's the advantage: every step only needs to look at one example. So if I have a billion data points, one step of this algorithm is a billion times faster. That's a speed-up that's hard to ignore; there are not many times in a computer science life where you get to make something a billion times faster. But it doesn't come for free: the number of steps is actually exponential, and if you're a computer scientist, that should make you uneasy; we don't want exponential-time algorithms. It could take you until the end of time to get high accuracy. You might quickly get one digit of accuracy, but to get that second digit you might need to multiply your runtime by a thousand. It's not cool. It's also a PITA, which stands for pain in the ass, to set the step size and to decide when to stop. Especially those of you who've been around for a long time and learned about neural networks in the '80s or '90s, you probably did some playing with this nonsense.
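The one-random-example update just described, x_{t+1} = x_t − α ∇f_{i_t}(x_t), looks like this on the same made-up least-squares data. One caveat, flagged in the comments: this toy data happens to be "consistent" (a perfect fit exists), which is exactly the special case mentioned later in the talk where stochastic gradient can get away with a constant step size. In general the step size has to shrink toward zero, as the talk explains.

```python
import random

# A sketch of stochastic gradient descent: each step looks at ONE random
# example i and follows grad f_i, not the full gradient. Data and step
# size are made up. This data admits an exact fit (a "consistent" least
# squares problem), so a constant step happens to converge here; on
# general data you must decrease the step size over time.

A = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
y = [5.0, 2.0, 2.0]                       # x* = [1.0, 2.0] fits exactly

random.seed(0)
x, alpha = [0.0, 0.0], 0.05
for t in range(5000):
    i = random.randrange(len(A))          # pick one random training example
    r = sum(w * v for w, v in zip(x, A[i])) - y[i]
    # x <- x - alpha * grad f_i(x), where grad f_i(x) = 2 * r * a_i
    x = [w - alpha * 2.0 * r * v for w, v in zip(x, A[i])]

print(x)  # approaches [1.0, 2.0]
```

Each iteration touches a single example, which is the billion-times-cheaper step the talk is trading against the larger number of iterations.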
It's really annoying: we've got an algorithm that's slow, where you don't know how to pick the step size, and you don't know when to stop it. And yet, even with the step size problem, even with the slowness, people love this algorithm. So we've got a trade-off here: we can do a small number of very expensive steps, or an enormous number of very cheap steps, and anyone who tells you one algorithm is strictly better than the other is wrong. There's no strict dominance; it depends on what you want and what you have. Here's our deterministic algorithm: I go through my entire data set computing gradients and doing nothing, and once I've got every single gradient, I take one step to improve my cost. Then I go through my entire data set again and take one step; again, one step. But the steps make quite a bit of progress. The stochastic gradient method is different: it takes a whole bunch of tiny little steps, and after it's gone through the data set one time it's actually made a ton of progress, but the longer you run it, the less progress it makes; it starts to go flat. So if you only have this much time, you should pick the stochastic gradient algorithm; if you have this much time, you should pick the deterministic algorithm. If you only need this level of accuracy, just run the stochastic gradient algorithm: you'd only have to go through your data set once to get that accuracy. If you need this level of accuracy, because otherwise your self-driving car runs into white trucks or something like that, then you need the deterministic algorithm: you need that high accuracy. Okay, so everybody in this room probably knows where I'm going right now. We've got two algorithms; one does well in one situation, one does well in another. Everyone could probably think of some algorithm that tries to do something in between: maybe run the green algorithm for a while and then switch to the blue one, or make an algorithm that's intermediate and starts out looking like the green one and slowly turns into the blue one. You can do that, and I'll talk about that later, but what I want to say is that we'll actually get an algorithm that's better than those strategies, one that gets the best of both worlds and cleans the clocks of both of these algorithms. Okay, I just want to pause here to see if anyone wants to bring up any comments at this point. Audience: That graph you showed, is that theory you can prove, or is it your empirical observation? Mark: This is the theory; these are upper bounds on the progress of each algorithm. I'll show plots later that, in practice, look almost exactly the same. Audience: And intuitively, can you prove that there's no way the stochastic gradient error goes to zero? Mark: The error does go to zero, but only asymptotically, so it's very slow. Audience: Okay, thank you. So why does the stochastic method slow down? Mark: The issue is that when you have a really bad guess, almost any of those random web pages or random training examples you pick gives you a good direction. As you start to get better, that one random web page becomes a worse approximation of the direction you need to go: as your model improves, you need more information to keep making progress. Really, the key is that you need the variance between your different gradients to go down to zero, and if that doesn't go to zero, then what you have to do instead is make the step size go to zero; that's why stochastic gradient ends up taking tiny, tiny steps to keep making progress. I'm being completely blinded by that side of the room, so I can't really see what's happening over there. Okay, why don't I move on; someone stand up and start jumping up and down if the mic goes bad again. Okay. I'm definitely not the first person to think about speeding up stochastic gradient methods: the first paper on this topic was from 1958. Let's just talk about the things that are already out there.
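The variance point just made can be seen in a tiny numeric sketch. With made-up data that has no exact fit, the per-example gradients ∇fᵢ disagree with each other even at the minimizer, so their variance does not vanish, and that is exactly why a constant step size leaves you bouncing around. (The one-feature data here is invented purely so the numbers are easy to check; its minimizer is x = 1, the mean of the targets.)

```python
# Numeric sketch of the variance argument above: for INconsistent data
# (no exact fit), the individual gradients grad f_i still disagree at
# the minimizer, even though their AVERAGE is zero there.

def grads(x, A, y):
    """Per-example gradients of f_i(x) = (a_i . x - y_i)^2."""
    out = []
    for a, t in zip(A, y):
        r = sum(w * v for w, v in zip(x, a)) - t
        out.append([2.0 * r * v for v in a])
    return out

def variance(gs):
    """Average squared distance of the grad f_i from their mean."""
    n, d = len(gs), len(gs[0])
    mean = [sum(g[j] for g in gs) / n for j in range(d)]
    return sum(sum((g[j] - mean[j]) ** 2 for j in range(d)) for g in gs) / n

A = [[1.0], [1.0], [1.0]]   # one feature, three examples, made up
y = [0.0, 1.0, 2.0]         # no single x fits all three; minimizer x = 1

gs = grads([1.0], A, y)
print(variance(gs))          # strictly positive even at the minimizer
```

The mean gradient at x = 1 is zero (it is the minimizer), but the individual gradients are 2, 0, and −2, so a single random example keeps pushing you away from the solution unless the step size shrinks.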
You can play with the step size; momentum is very popular in the neural network world; there are things like averaging, AdaGrad, Adam, and RMSProp. Maybe you think, "I don't want to do gradient descent, I want to do Newton's method," and you make a stochastic version of Newton's method, or of Nesterov's method, which is more recent. I can prove to you that none of those methods are going to work: all of them still require an exponential number of iterations. The Newton one is really scary, because for deterministic methods Newton's method is awesome: it gives you superlinear convergence, or quadratic, depending on which community's terminology you use. The 1958 paper looked at constant-step-size stochastic gradient, and there are more recent papers on it too. These give you a polynomial number of iterations to a fixed solution quality, and after that they stop making progress. So if you just use a fixed step size, you go down very quickly to some solution quality, and then you start bouncing around, and you really have no idea what's going to happen there; it'll never really converge. If you're stuck using stochastic gradient, this is usually what I recommend people do. Then you can think of hybrid methods. Here you also get a polynomial number of iterations, but usually you eventually start making full passes through the data, so you have something that's stochastic early on but becomes more deterministic as you go. And finally, for some very special problems, we know you can actually get faster algorithms. If you have what's called a consistent least-squares problem, meaning you're solving a least-squares problem and your line goes exactly through every single data point, which basically never happens, then you can get a faster stochastic gradient algorithm; and there are a few other cases, like linearly separable classification. Okay, so that's the setup.
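The hybrid idea just mentioned, stochastic early on and increasingly deterministic later, can be sketched by growing the mini-batch over time. This is only an illustration of the concept: the linear growth rule, the step size, and the data (which again happens to admit an exact fit) are all made up, and real batching methods control the growth much more carefully.

```python
import random

# Sketch of a "hybrid" method: start with tiny random batches (cheap,
# stochastic steps) and grow the batch each step, so later steps behave
# almost like full deterministic gradient steps. Growth rule, data and
# step size are made up for illustration.

A = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]] * 10   # 30 made-up examples
y = [5.0, 2.0, 2.0] * 10                         # exact fit at [1.0, 2.0]

random.seed(0)
x, alpha, batch = [0.0, 0.0], 0.05, 1
for t in range(400):
    idx = [random.randrange(len(A)) for _ in range(batch)]
    g = [0.0, 0.0]
    for i in idx:                                # average gradient over batch
        r = sum(w * v for w, v in zip(x, A[i])) - y[i]
        g = [gj + 2.0 * r * v / len(idx) for gj, v in zip(g, A[i])]
    x = [w - alpha * gj for w, gj in zip(x, g)]
    batch = min(len(A), batch + 1)               # grow the batch each step

print(x)  # approaches [1.0, 2.0]
```

Early iterations cost almost nothing per step; late iterations cost nearly a full pass but have low-variance gradients, which is the trade-off the talk describes.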
Rather than going into the technical detail, since many people here are not technical, I'm going to tell you the story of how this algorithm came about, what we were thinking about along the way, and the many failures that happened. A lot of failures.

It actually starts in 2005, eleven years ago. I finished my masters at the University of Alberta, where we built a machine learning system for automatic brain tumour segmentation from MRI images. Imagine someone lying in the scanner and you're looking at the top of their head; they've got a tumour here. If you just do logistic regression or an SVM, with some cleverly chosen features, it gives you output that's okay, but it has holes and stray little points that may be hard to see. We found you could do better using something called a conditional random field. Deep learning is the hottest thing in machine learning right now, but in 2005 conditional random fields were the hottest thing: the CRF paper is the second most cited machine learning paper of the 2000s. Now they're coming back, because people realized the two aren't mutually exclusive; you can put a conditional random field on top of your deep network. The advantage of CRFs is that you can model dependence between labels. I can say that I can't really have a tumour pixel in the middle of nowhere, so I can model that adjacent pixels are likely to receive the same label.

In 2006 I came over to UBC to work with Kevin Murphy, and we wrote a paper on training conditional random fields with stochastic gradient, because they were really slow to train. Since CRFs were so popular at the time, that paper got a bunch of citations; if you look me up on Google Scholar, you'll find it's my top cited paper. That's a little embarrassing, because if you download my code, it uses a deterministic method. So the thing people have cited me the most for is, in some sense, something I absolutely did not believe in and do not use in practice when I give people code. Why? Stochastic gradient is easy to make look good in your papers, since you can play around a lot, but if you want something to work out of the box the first time, it's just not reliable: it's hard to pick the step size, it's hard to decide when to stop, and it goes really slowly even if you're lucky.

So I thought we would fix this problem. The next year we started playing around, trying to develop a better method; we didn't have much success, and a rejected NIPS paper is what came out of it. Then summer 2008 came, and with it my PhD proposal, and I said "we need to solve this problem". I looked it up today: this is the first version of that graph I showed you, though the colours were different; stochastic gradient was still green, but deterministic was red at the time. I showed this at my thesis proposal defence, saying this is a really important problem, this will be so exciting and impactful, and my committee was not too happy with that: "this is too hard", "you should focus more on existing projects", and so on. I mention that because I know some of my students are around, and I tell them some of these same things; it's good to see that it happens to everybody.

Okay, so then I got a postdoc in France. As some of you may know, it can take a long time to get a work visa in France, so I had this extra three-month limbo period, and one of my committee members offered me a sort of three-month postdoc. He was the one who had told me I couldn't work on this problem. He asked what problem I wanted to work on, he didn't have an idea of his own, and I said "let's work on this". So we made the red line, one of the obvious things you can think of: it was based on controlling the batch size instead of grabbing one random example at each iteration.
On the first iteration we grab one random example, on the next we grab two, then three, then maybe five, and so on: the number of random examples we look at slowly increases over time, and that way we transform from the green line into the blue line. When I went back to my CRF problem, the one I cared about, we got empirical results that looked much like the theory. The colours here are exactly the same as in the plot I showed. The deterministic method is L-BFGS, a very fancy quasi-Newton method. For stochastic gradient, I didn't want to commit to a single step size, so I plotted the three best step sizes I could find; that seems like a reasonable estimate of its performance. The red line is the new method, and there's a very small performance improvement: it cleans up the beginning, following the stochastic method, but then the slope is the same. That's just an extra ten iterations saved; not a big deal.

So in France we asked this question: can I have an algorithm that looks at only one random example in each iteration and requires only a polynomial number of iterations? I don't want to look at more and more examples on each iteration, because that eventually just gives me back the same deterministic thing. I've asked this question of audiences many, many times, always leading up to it very heavily. The first time I showed a slide like this was in Edinburgh, and a fire alarm went off at exactly that moment, so we all went downstairs, people had their smokes (there were lots of famous people there), and then we came back up.

So, yes, there is an algorithm that achieves this. It's called the stochastic average gradient, or the unfortunately acronymed SAG algorithm. It looks at one random example on each iteration, computes its gradient, and then updates its memory of that example. That is the difference from regular stochastic gradient: we keep track of something about each example, rather than just looking at random things and throwing them away. We then update our iterate based on our memory of all the examples. As in gradient descent, the update is based on all the examples; it's just that the memory for every example except the one we picked is out of date.

As an equation, it looks exactly like the gradient descent equation except that the gradients are evaluated at old iterates: x_{t+1} = x_t - (alpha_t / n) * sum_i y_i, where y_i is the last gradient I computed for example i, and on each iteration we choose one random example i_t to refresh (I don't know if that's the best notation). Imagine a table of gradients, one per example, for examples 1, 2, 3, all the way up to n. Each time, I pick a random row in that table, get rid of the old gradient, add in the new one, and then use the average as my direction. You can think of this as gradient descent with outdated gradients: instead of updating all n of my gradients on each step, I update one gradient on each step, and still take the gradient descent step.

Now, I'd like to say we came up with this all by ourselves, but our only contribution on the algorithm side was actually saying we should pick the example randomly instead of going through them in order. Going through in order is called the IAG algorithm, but picking randomly is incredibly important; it's what gives you the nice property. If you go through in order you don't get that property, and it actually works terribly.

So here are two experiments on a standard logistic regression with regularization problem. SG and ASG are two stochastic gradient methods that I tuned the hell out of, and if you look at our paper, we actually compared to 14 other stochastic gradient methods, because the reviewers just didn't believe us.
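To make the update described above concrete, here is a minimal sketch of the SAG step for a plain least-squares objective. This is my own illustrative code (the function and variable names are mine, and NumPy is assumed), not the speaker's released implementation:

```python
import numpy as np

def sag(A, b, step, iters, rng=None):
    """Minimal SAG sketch for least squares: f(x) = (1/n) * sum_i 0.5*(a_i.x - b_i)^2.

    Keeps one stored gradient per example; each iteration refreshes one random
    example's gradient and steps along the average of all stored gradients.
    """
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    grads = np.zeros((n, d))   # y_i: the last gradient computed for each example
    g_avg = np.zeros(d)        # running average of the stored gradients
    for _ in range(iters):
        i = rng.integers(n)                 # random example (crucial: cyclic order breaks it)
        g_new = (A[i] @ x - b[i]) * A[i]    # fresh gradient of f_i at the current iterate
        g_avg += (g_new - grads[i]) / n     # swap it into the average in O(d) time
        grads[i] = g_new
        x -= step * g_avg                   # a constant step size works for SAG
    return x
```

Note the O(d) trick: rather than re-summing all n stored gradients each iteration, the average is patched in place when one row of the gradient table is replaced.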
We submitted it, and they said, "no, you haven't compared to this." Okay, we compared to that and submitted again. "No, you haven't compared to this." Eventually we wrote that we're happy to compare to any stochastic gradient method you throw at us and include it in the final version; we don't believe that any of these algorithms, which we can prove take an exponential number of iterations, will outperform the new one. The new method isn't on this plot yet. These are the deterministic methods: AFG is Nesterov's accelerated method, and L-BFGS is a fancy quasi-Newton method. You see that on this dataset the deterministic methods eventually catch up to and pass the stochastic methods, while on the other it's a little bit closer.

Okay, so where does the new method go? I love doing this, because as a computer scientist you rarely get to do it: this is a polynomial-time algorithm where there was an exponential-time algorithm before. You get to do that either zero times or one time in your life. So let's do it again; look at the axes here. I actually had to change the y-axis to show the new method on the plot. That was crazy: we thought this was just a nice theoretical curiosity, and we never realized it would actually work in practice, so that was a pleasant surprise. You can see it follows the stochastic methods, but then it keeps going, keeps making progress, and its slope is actually better than the quasi-Newton method, which is trying to approximate the Hessian. Similarly here, it goes down; and what I was using to approximate the true solution apparently wasn't accurate enough, because the method went below it, which broke the plot.

Does anyone want to make comments at this point, before I go into one last thing?

"I think this is super cool, but do these plots show the training error or the generalization error?" Fantastic question. This is the training objective function: the logistic loss plus the regularizer, which is what the optimizer is trying to minimize. If you look at the test error, it's not so dramatic. At the time we didn't have an explanation of how the test error behaves for this model; just last year people started to figure out the test error analysis, and they did show that the test error converges a little faster, but it's not as dramatic a difference. The other thing I'll mention is that it's really easy to set the step size with this method, so in practice you can actually drive the test error down super fast.

"Can you explain the algorithm again? I couldn't really follow your explanation. What's your intuition for why this should perform better than stochastic gradient descent?" Right, okay. The problem with stochastic gradient descent is this: when I'm far away from the solution, the stochastic gradient is really close to the true gradient; it's good enough for most iterations. When I get close to the solution, all my individual gradients point in different directions, so my variance is very high. If you look at the bounds for stochastic gradient, the bottleneck comes from the variance of the gradients at the solution; the problem is that they have to point in all different directions in order to cancel out and make a minimizer. Here, what we're trying to do is converge to the true gradient, and the true gradient is going to 0, so instead of making my step size go to 0, I can just use a constant step size. Now, why can I do this? As I get towards my solution, x_{t+1} and x_t start to get close together; as I'm converging, my iterates get closer and closer, and what that means is that if one of these stored gradients is old, it's less of a big deal.
My outdated gradients start to look like the real gradient at my current iterate, and I actually converge to the true gradient descent iteration as t goes out to infinity. It's a really nasty proof, but it all magically happens to converge at the same rate.

"Did you mathematically show that your algorithm should outperform the algorithms you compared against, or did you just have empirical results?" No, the result is the theoretical result; the empirical result was the cherry on the sundae. It's a 20-page proof and it's really nasty, but luckily there are newer algorithms with simpler proofs.

"The history of the gradients that you're keeping: is that fixed, or are you going right back to the beginning, keeping all of them?" Okay, so I'm not summing over time here. I'm assuming I have a finite training set, and I'm summing over the examples. A stored gradient could actually go back to the beginning, if I haven't seen that example since then, so on average I may have a data point whose gradient is on the order of n log n iterations old, in the worst case.

"What's the limitation on the kind of problem you can solve with this, due to the assumption that you need to remember the updated gradients? Doesn't that decrease the class of functions you can actually minimize?" Okay, so what is the class of functions we can minimize? First off, we're making a convexity assumption, because we need that for gradient descent to work. Beyond that, we're actually making very standard assumptions. This is more directly for Yves, because he'll know what I'm saying: you need the gradient of each example to be Lipschitz continuous, and you need the function to be strongly convex. If you haven't heard those terms before, they're exactly the standard assumptions used to show that gradient descent needs only a polynomial number of iterations, so we're not doing anything fancy here. But I do need to store these gradients, so if I have n examples, say n is a million, and each of these gradients is a million variables, I'm going to run out of memory really quickly. We published this in 2012, and in 2012-2013 there were four papers proposing the exact same solution to that problem, so people have now gotten rid of the memory with a slightly different algorithm.

"I've got two questions, which are probably just the same question. The first is: why are you choosing your data points randomly? And the second: is the random choice optimal, or is there a better algorithm that's theoretically possible?" Let me answer that in about three slides.

"You have randomness here. When you developed this, were you just on a single machine, or did you test it in a distributed environment? What are the challenges of distributed training?" Okay, that's a great question, and it's a question I purposely do not think about. You know, busy academic; there are only so many problems you can work on, and I knew that many other people would work on that problem. And they did: there are now a bunch of papers on nice parallel and distributed versions of this algorithm. "Do you know, off the top of your head, what the issues were?" It's the usual issues with parallelism: how you trade off computation versus communication costs, how you load balance, and other such things. But it's not my area of expertise at all, so I'll just plead ignorance on that one.

"A question about the learning rate: you are using a fixed learning rate?" That's right. So first, for SAG you can use a fixed step size.
Which is kind of wild. For stochastic gradient you need the step size to go to zero for the method to converge, and that's what causes you all your problems. Here you can use a fixed step size, because this thing converges to the true gradient, which at the minimizer is zero.

"So suppose you set the learning rate at, say, 0.01, and you don't change it, even at epoch 10 or 20. Is that correct?" Yeah. As with all algorithms, you need to pick the learning rate well, but the theory works for any learning rate below a certain value. It's not like stochastic gradient, where I need a schedule like 1/t, then 1/(t+1), then 1/(t+2), and so on. Here there's some magical number called 1/L, and if the step size is smaller than that, any constant value is going to work. In practice we set alpha_t = 1, and there's a certain inequality we can check: if the inequality is satisfied, the algorithm is going to do what we want it to do. We just check it, and if alpha is too big, we divide it by two.

"In your plot, are the different curves using the same learning rate?" Okay, good question. For this algorithm, you only pick the learning rate on the first iteration; after that it uses a line search, since it's a stochastic method. In this plot, to be as generous as possible to the stochastic gradient methods, I tried all the step sizes and plotted the best one, picked after the fact, so the stochastic gradient methods are totally cheating in this plot. For SAG, it uses its line search: it's run once, and it adaptively finds the step size. If you change that initial step size, as long as you don't make it too small, the curve will look exactly the same, up to the location of these little bumps or something like that. I don't have the plots with error bars here, but if you look on arXiv you can see the error bars are actually quite small.

"I realize there's not going to be a proof, but have you done any experiments on non-convex problems, and how does this perform? Because state of the art still uses various stochastic methods, for example in deep networks." Okay, so I have not done that personally, but there's a new state-of-the-art algorithm for PCA, which is a non-convex problem, and it's based on the memory-free version of this algorithm. So PCA is a non-convex problem where this approach has set a new state of the art. In the case of neural networks, I've heard very mixed results: nobody really says it works badly, but some people say it works only about as well as other things, so why would you do something different? Some of my students are playing with that, and I think there's still a lot of room for improvement in neural network training.

"With the fixed learning rate and a memory of old gradients from the beginning, why does it not overshoot the optimum?" It does overshoot the optimum, but then it comes back. "And it's fairly quick to come back?" Yeah, it definitely overshoots; there are versions of it that try to fix that.

Okay, so since 2012, when the paper came out, there's been a whole bunch of work on the topic. One of the most notable follow-ups is a version called SVRG, which has no memory. Instead of being O(n times d), which is crazy if you have a large number of features and examples, it's only O(d): you just need to store one vector the same size as your solution vector, in exchange for a few more gradient evaluations. We also found out there's nothing really special about SAG; there are now about a dozen algorithms that have been shown to have this exact same theoretical property.
There are versions for sparse problems, constrained problems, and non-smooth problems; there are quasi-Newton versions, accelerated versions, versions for certain non-convex problems like PCA, and versions that use parallel and distributed computation. And, going back to some of the questions: there's also analysis showing that it can improve the test error. On the test error it's not a phase transition, though; you're not going from exponential to polynomial there. You can only improve the constants, but you can show that it does improve the constants.

Beyond improving speed, it also addresses some of the other annoying issues of stochastic gradient. It's easier to tune the step size, because as long as the step size is small enough you get the nice result. And it's easier to decide when to stop. The reason is that the average of the stored gradients converges to the true gradient, so if that value is super small, your true gradient is getting super small: you can't make much more progress, and your function is almost flat.

If you don't trust me, try it out: if your stochastic gradient algorithm is three lines of code, this is five lines of code. You can also get the code from my web page, which has the line search and all the fancy stuff, and it's inside scikit-learn. If you've seen an advertisement on the internet today that didn't come from Google, it most likely came from a model trained with this algorithm: the second-largest internet advertiser hired my co-author, and they pay him a lot of money. They actually tried out the algorithm before they hired him; they read our paper, tried it out, found it was fantastic, and then there were negotiations, and now he's being paid much more than UBC pays.

It's also used in academic applications. The two papers have something like four hundred citations, and one of the neat ones was called "building proteins in a day": a classically hard problem in computational biology is protein folding, and this algorithm was able to help drastically reduce the cost there.

Okay, but here's the sad part. Going back to the whole problem that motivated this, the CRF problem I wanted to solve: I ran it and compared it to some state-of-the-art methods. There was my nice hybrid method, there's L-BFGS, stochastic gradient, AdaGrad, whatever. It works maybe a little bit better, or about the same. So: I wanted to solve this problem for ten years, I went off to France, proved this amazing result with a fancy new algorithm, and on the problem I actually cared about, it doesn't really make a difference at all. I felt you shouldn't use it.

Okay, but that relates to one of the questions I said I'd answer later: maybe uniformly random sampling is too naive. For classic stochastic gradient methods, which need an exponential number of iterations, you can prove that non-uniform sampling is not going to solve the problem; it can improve your constants, but it can't change the rate. For this algorithm, non-uniform sampling can improve the rate. So we want to sample some training examples with higher probability than others: I don't want to just go to a random web page, I want to figure out which are the important web pages, visit those more often, and visit the pages that aren't affecting my model less often. The key idea is to bias the sampling towards the examples whose gradients change quickly. This is a delayed gradient method: if I have an example whose gradient changes very slowly, who cares if its stored gradient is delayed? It'll still look almost exactly the same. If the gradient changes very quickly, I need to see that example a lot in order to converge quickly. Recent work by us and others shows that this actually does improve the convergence rate, and in our code we have a nice trick to estimate the importance of each example as you go, so it tries to learn which parts of your data are important to visit often and which are not.
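As one hedged illustration of that idea (my own simplification, not the exact scheme from the paper or released code): sample example i with probability proportional to an estimate L_i of how fast its gradient can change, its per-example Lipschitz constant, so that fast-changing "hard" examples are visited more often than well-classified "easy" ones:

```python
import numpy as np

def lipschitz_weighted_sampler(lipschitz_estimates, rng=None):
    """Return a sampler that draws example indices with probability
    proportional to each example's estimated gradient-Lipschitz constant L_i.

    A simplified sketch of non-uniform sampling for SAG-style methods:
    examples whose gradients change quickly get visited more often.
    """
    rng = np.random.default_rng(rng)
    L = np.asarray(lipschitz_estimates, dtype=float)
    probs = L / L.sum()                     # normalize L_i into a distribution
    return lambda: rng.choice(len(L), p=probs)
```

For least squares, a natural estimate is L_i = ||a_i||^2, so rows with large norm get sampled more; a full method would also reweight the update to keep it unbiased, which this sketch omits.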
Now, I stole this example from an SVM web page or something like that. If you're in the setting of linear classification, the classic machine learning model, how fast do the gradients of the examples that are not support vectors change? If I'm doing logistic regression or something like that, they end up changing extremely slowly. So what actually happens is that the method focuses all its effort on these hard points and very rarely visits the points that are well classified. The algorithm automatically figures out what the hard examples in your training data are, samples those more often, and starts to ignore the easy examples. And now, if I go back to my CRF problem with the non-uniform sampling, I get that amazing improvement that was unexpected on the other problem; it now shows up here.

Okay, I'm going to wrap up. Take-home message: this SAG algorithm is a very nice theoretical result, and it actually works a little bit better than the theory predicts, which was really nice. It's a black-box stochastic gradient algorithm: it adapts to your problem's difficulty, there's a line search, and you know when to stop. It hasn't completely fixed all the annoying issues with stochastic gradient, but I think this is really nice: I don't have any stochastic gradient code on my web page, but I have this code on my web page, and if you download it, I'm pretty sure it'll work. And it's led to a whole bunch of other algorithms and applications.

Finally, I want to say: if you're looking for employees, or you're looking for an employer, I've done a lot of one-to-one matching over the past few years, and I'm trying to avoid that, so some students set up a data science job board. If you're looking for employers or employees, make a profile there. I have a hundred and eighty-three students taking my class this term, and I'm going to encourage all of them to sign up, so if you're really, desperately looking for machine learning people, there may be a hundred and eighty-three of them coming onto the job market in December or April. And finally, thank you so much for the invitation, thanks for coming out, and I'm happy to answer any more questions.

"Can you elaborate a little more on the line search, and on the way you test for termination?" Okay, let's just do this live. I need alpha_t to be less than or equal to 1/L. L is a measure of how fast the gradient can change; it's sort of a bound on the eigenvalues of the Hessian. That seems really annoying to compute, but it's just for one training example, not over the whole dataset. So what I can do is test the inequality that L has to satisfy at the random example I pick: ||grad f_i(x) - grad f_i(y)|| <= L ||x - y||. If my estimate of L is too small, this inequality gets violated, and I just double L. That's it. I stole it from the Russian guys from the 70s. I have no proof that it works; a guy in Japan told me that for his problem my trick didn't work, but then we found a minor tweak that made it work.

Pardon? Oh, when to stop. Okay, so the average of the stored gradients, (1/n) times the sum over i of grad f_i evaluated at the iterate where example i was last visited, converges quickly to the true gradient at x_t. Normally, when we implement gradient descent, we test the size of the true gradient to decide when to stop. Here I've got an estimator that converges, at the same rate my algorithm converges, to that quantity, so I test the norm of this quantity, which I know, as an approximation of the true gradient, which I don't know. That's something you completely cannot do with stochastic gradient, because with stochastic gradient you don't have this global view of the function.
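A sketch of those two tricks, doubling the estimate of L when the per-example Lipschitz inequality fails, and stopping when the averaged stored gradient is small. This is illustrative code under my own naming, assuming NumPy; the released implementation differs in its details:

```python
import numpy as np

def update_lipschitz(L, grad_f_i, x, y):
    """Double the Lipschitz estimate L until the test inequality
    ||grad f_i(x) - grad f_i(y)|| <= L * ||x - y|| holds for the sampled example."""
    lhs = np.linalg.norm(grad_f_i(x) - grad_f_i(y))
    gap = np.linalg.norm(x - y)
    while lhs > L * gap:   # estimate too small: the inequality is violated
        L *= 2.0
    return L

def should_stop(stored_grads, tol=1e-4):
    """Stop when the average of the stored gradients, which converges to the
    true gradient, has a small norm. Plain SGD has no analogous global check."""
    return np.linalg.norm(stored_grads.mean(axis=0)) < tol
```

With an estimate of L in hand, the step size is simply capped at 1/L, matching the constant-step-size discussion earlier in the talk.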
"There was a question earlier about using SAG in neural network architectures; I actually missed the answer to that. Could you repeat it?" The summary: mixed, not super exciting results, I would say.

"Okay, look: you won with your CRF problem when you went back to non-uniform sampling and found the improvement. Did you try that again on the logistic regression with regularization? Did it help?" Yeah, improvements there too. I normally show that plot; I didn't today. In our original paper we showed results on nine datasets. Now, you'd think I would have cherry-picked those two examples that are so dramatic, but seven out of the nine datasets had that dramatic effect. On two of the datasets, SAG was basically just on par with the state of the art, and once we did the non-uniform sampling, it made those two datasets look dramatic too. On the other datasets, the non-uniform sampling didn't really make a big difference.

"So the non-uniform sampling seems to work for particular problem sets; that's where it makes a difference?" Yeah, exactly. "And what do you think it was about those two datasets that made the non-uniform sampling have an impact? Is there something you could pin it to, particular sorts of datasets or problem sets where non-uniform sampling always works?" I would have to think about that; I don't know the answer. "Perhaps lots of irrelevant data points that aren't particularly informative?" Yeah, that would definitely make sense. It certainly learns to ignore, or at least sample very rarely, the very well-classified data points first. But for CRFs you don't quite have the same picture of support vectors; it's a bit more complicated.

"I'll update my earlier question, then: is that the optimal form of non-uniform sampling, or is there a lot of potential for future improvement?" Great question. Under certain assumptions, the one here is optimal, but those assumptions are a little brittle, in that if I start doing Newton-like versions or accelerated versions, it's not clear that the same scheme will still be optimal. I have a student working on this now, trying to find out if there is a better thing; it's not clear whether there is or not. And those assumptions are not about your dataset. "So, for the Newton-like methods you mentioned?" Yeah: there's a consistent way to do the non-uniform sampling that doesn't depend on what your data is, so you use the same strategy for every dataset.

"Hi, thanks for your contribution to science." I don't think I've ever heard that before. Thank you, you made my day. "So, I'm not a researcher, I'm a developer. Between working on your features, collecting more data, or working on novel training algorithms, where do you think a business would get the most value, say from one of your graduates entering the job market, or from some other junior machine learning hire?" A great question. I'm going to take a two-faced view of it. First: if I just want to make my model work better, should I work on a better optimization algorithm, better features, or more data? You almost always pick features in that case. Features can do a lot; having the right features makes a world of difference. Then you pick data, then you pick the algorithm. Now, in terms of hiring one of my students: they don't know what your problem is, so they may not be as useful at making features as you are; they're probably more useful for developing algorithms and so on. The other thing they can do: if you don't quite fit into the standard supervised learning framework, if for some reason you can't just fit into the simple cat/dog thing, or you need to do some sort of feature creation, then they're a bit more useful.

"Hello. Can memoryless SAG be applied to streaming data?" So, there is a paper called streaming SVRG, written by a bunch of guys from MIT last year. It's a very complicated paper, complicated even to explain what they were doing, but basically they showed that if you ran streaming SVRG, you did as well, with the same constants, as if you had exactly solved the optimization problem for every value of n as the data came in. So if I ran streaming SVRG on a thousand data points, I would do as well on test data as if I had exactly optimized my training error on those thousand data points. That sounds kind of cool, but there's a little thing hidden there, because you can overfit if you exactly optimize the training error on a thousand points. So yeah, streaming SVRG is definitely a thing, and it's a fascinating theoretical result, but I don't think anyone's actually using it yet. There are several NIPS papers on it this year, I noticed, including one by Michael Jordan, who's a serious, serious machine learning guy.

Okay, maybe we should stop there. Okay, thanks everyone for coming out!
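For reference, the memory-free SVRG variant mentioned several times above can be sketched like this for least squares (again my own illustrative code, assuming NumPy, not any released implementation):

```python
import numpy as np

def svrg(A, b, step, outer, inner, rng=None):
    """Memory-free alternative to SAG (SVRG): instead of storing n gradients,
    store one full gradient at a snapshot point and use it to correct each
    stochastic gradient. Memory is O(d), at the cost of a full pass per epoch."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(outer):
        snapshot = x.copy()
        full_grad = A.T @ (A @ snapshot - b) / n    # one full pass over the data
        for _ in range(inner):
            i = rng.integers(n)
            g_x = (A[i] @ x - b[i]) * A[i]              # gradient at current iterate
            g_snap = (A[i] @ snapshot - b[i]) * A[i]    # same example at the snapshot
            x -= step * (g_x - g_snap + full_grad)      # variance-reduced step
    return x
```

The correction term (g_x - g_snap + full_grad) is an unbiased estimate of the full gradient whose variance shrinks as x approaches the snapshot, which is what lets it use a constant step size like SAG while storing only one extra vector.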
Info
Channel: Polyglot Software Association
Views: 399
Rating: 5 out of 5
Id: 9TEPmEhd0t4
Length: 68min 52sec (4132 seconds)
Published: Tue Oct 18 2016