Vincent Warmerdam: Winning with Simple, even Linear, Models | PyData London 2018

Video Statistics and Information

Reddit Comments

The strength of deep learning is in learning complex features on high dimensional data. DL and the nice, linear models shown are not interchangeable as implied by "you can often win with simpler models that have properties that are much nicer...". I'd have preferred "the majority of today's business problems are still solved by engineering good kernels and features."

With that said the presentation touches on a lot of useful data science principles that people are not aware of / forgot (this poster especially).

👍︎︎ 7 👤︎︎ u/DeepGenomics 📅︎︎ Jun 17 2018 🗫︎ replies

Anyone know if the code for this talk is available somewhere?

👍︎︎ 4 👤︎︎ u/Minimalx 📅︎︎ Jun 17 2018 🗫︎ replies

This is really good. Really, really good.

👍︎︎ 3 👤︎︎ u/CalligraphMath 📅︎︎ Jun 17 2018 🗫︎ replies

Absolutely agree! I cannot overstress the value of being able to explain what's going on with the model to business at large...

👍︎︎ 6 👤︎︎ u/gandalfgreyheme 📅︎︎ Jun 17 2018 🗫︎ replies

Thanks for sharing.

Anyone got a link to an example to clarify the workflow for how to generate the kind of manual RBF features he uses towards the start for the time series example?

👍︎︎ 2 👤︎︎ u/4556654433225566 📅︎︎ Jun 17 2018 🗫︎ replies

Absolutely fantastic talk.

👍︎︎ 1 👤︎︎ u/BrokenTescoTrolley 📅︎︎ Jun 18 2018 🗫︎ replies
Captions
All right everyone, thanks for dropping by my talk. Today I'll be talking about how you can win with simple, and even linear, models. The reason I'm a little bit passionate about this is that if you live on the internet, or in our sort of little filter bubble, you may have heard variants of this quote, something along the lines of "you should be using deep learning". Typically the people who tell you this are blogs, Reddit, Hacker News and some YouTube stars, and especially the latter is kind of weird, because people are rapping about variational autoencoders and stuff. This got me thinking and actually inspired this talk, because it's getting a little bit worrisome. I love deep learning, I actually do it for clients, but to be very specific: it feels like people are focusing on the tools they're trying to apply more than on the problem they're actually trying to solve. Again, I use deep learning in production, but I prefer the simpler models, and in this talk I hope to convince you, and explain to you, why. It's not that deep learning isn't amazing, it really is, but there's a bit of hype around it that's distracting people from other great ideas, and some of those other ideas are actually greater; you may be selling yourself short if you pay too much attention to the hype. You can often win with much simpler models that have properties that are much nicer to have in situations beyond the notebook. I'm an industry kind of person, I really care about what happens in production, so my goal is to lure you back to the domain of boring, old, but ultimately beautiful simple models. I hope you're excited about that.

In particular there are a couple of topics that I'll discuss, and I'll be going over all of them relatively quickly due to time, so I hope to mainly leave you with a lot of intuition. I'll start by showing that XOR actually is a linear problem. I'll show you some great time series tricks, especially for predicting a long time ahead, and some small feature engineering tricks. We'll discuss streaming models, that is, models that learn data point by data point instead of on one large batch file. I'm also going to talk about recommenders and a video game recommender, and afterwards I'll end on something that I think might be the future of machine learning, namely hierarchical domain models.

So, the XOR problem. If you grab a book on neural networks, one of the main arguments you typically hear for why you should use a nonlinear model as opposed to a linear one is that a linear model cannot deal with the XOR problem. It's a textbook example and it looks something like this: let's say our task is to take the blue points and separate them from the red points, and people say "ah, you need a nonlinear model for that, because there's no line that will separate the classes". So let's do an experiment and put that into scikit-learn: from sklearn.linear_model import LogisticRegression. This lets me say the class depends on the x1 and x2 columns in this data frame; I have my y, I've got my X, I fit a logistic regression, and yes, the confusion matrix isn't great. So then what people typically do is say "okay, let's use a model that's easier to apply than to understand", maybe a support vector machine, and you can see that the confusion matrix is indeed a lot better.

What I could do, though, is just look at this data set. Sure, right now, with just these two features, I cannot separate the classes linearly, but how about I apply a small trick? I'm going to add a third column based on these two columns: a new column called x3 that is simply x1 multiplied by x2. Suddenly, when I put that into the logistic regression, the confusion matrix is actually pretty damn good. So yes, you can solve XOR with a linear model; the trick is that you just have to do a small feature engineering step, and that was enough to compete with the support vector machine. Now I have a logistic regression, which is easy to explain, less stuff goes wrong in production, and I can explain it to the person in a suit that I'm supposed to convince that machine learning is great. So I'm already seeing lots of benefits from adding one line of code. This is a very textbook example, just a demo of a linear method solving a nonlinear problem, but because the linear model is applied on a nonlinear feature we get much better interpretability, and there are many more tricks we can use. The main reason this matters for me as a consultant, especially at a company that's just starting out, is that when I leave I typically want my legacy to still be in production. I really like the idea that, supposing the company has great engineers who are maybe not the best scientists, I can leave a linear model behind and be pretty sure it's going to work for the next two years; if it breaks, they'll fix it. Complex models don't have this property. (And yes, there are two types of consultants; if you're interested in this, let's discuss business afterwards, shall we?) Okay, that was the first topic.
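For reference, here is a minimal sketch of the XOR trick described above; the data, column names and model settings are made up for illustration, since the talk's own code isn't reproduced on this page.

```python
# Noisy XOR-ish data: the label is 1 when x1 and x2 have the same sign.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.normal(size=1000)})
df["y"] = (df["x1"] * df["x2"] > 0).astype(int)

# Plain logistic regression on x1, x2 cannot separate the classes...
lin = LogisticRegression().fit(df[["x1", "x2"]], df["y"])
print(confusion_matrix(df["y"], lin.predict(df[["x1", "x2"]])))

# ...but adding a single interaction feature makes the problem linear.
df["x3"] = df["x1"] * df["x2"]
lin2 = LogisticRegression().fit(df[["x1", "x2", "x3"]], df["y"])
print(confusion_matrix(df["y"], lin2.predict(df[["x1", "x2", "x3"]])))
```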
Well, that was a very silly example of what I hope to do, so now for a more relevant example, something that actually happens quite often. Who here has a situation where you want to predict ahead, like if you're in retail or something like that? Right, this is something you typically see, and here too the pattern is nonlinear: I've got something here that I call x, and the target definitely doesn't have a linear relationship with x. A few things about this data set, to keep it fairly general: let's pretend it's daily data, we've got four years of it, and we want to predict a year ahead, maybe two. It's a fairly common forecasting task, I hope you all agree, and I also hope you can smell from here that feature engineering is going to play a large role. I'll show you one trick that I really, really like, and then I hope we all agree that it's a nice trick and that it works especially well in linear models.

What some people do is make dummy variables (hint hint, "dummy" contains the word "dumb", doesn't it?). They take a variable for each month of the year, everything within that month gets averaged, and you kind of get a seasonal pattern. That's true, and it's not the worst idea, but the thing I don't like about it is that I want April 30th to be able to say something about the data in the next month: points that are close to each other in time should be able to influence one another, independent of whether an arbitrary month boundary happens to sit between them. So again we're going to do a feature engineering trick. There's a somewhat complicated-looking formula shown on the slide, but basically what I'm doing is making some sort of hump that repeats itself every year. It's a radial basis function; "hump" is just another way of saying it.
There's a hyperparameter called alpha that lets you make the hump thinner or thicker, and the nice thing is that I can make such a hump for every January, for every February, for every March, and all the other months. When I put this into a linear regression, there's something to be said for it: you might actually get a nice, smooth fit over the data points, which lets you predict, say, a year ahead. What you would typically do is have some sort of function in pandas that generates these columns for you, so this is really a basis function for January, and then all the way up to December. Let's pretend this is the data frame we had; when we fit this with a normal linear model, this is what comes out, and compared to the dummy thing we were generating before, it looks a lot better: a nice smooth curve. Again, the only thing I did here was a nice little trick with generating features, and the cool thing is that I can go up to my manager and say "look, I filtered out the season; there might still be some pattern left, but at least I've quantified what the long-term season should be", and all of this with just a couple of lines of extra feature engineering. I'm willing to argue that's actually a pretty cool approach. If you tune the hyperparameter alpha I showed before, you get these sorts of fits; you still want to do the normal feature engineering and hyperparameter tuning, but we're still very flexible. In fact, we're so flexible that you can still keep your frequentist friends happy, because when you put this into a linear regression you can still do all the stuff you're used to doing with linear regression. Because we're still in the linear domain, we can remove features that are insignificant, for example, which is also kind of cool if you're into that sort of thing. I prefer other methods, but it's still nice that you can keep the frequentist friends along.

Another thing I like about this, and it deserves a quick mention: all the magic can be explained. Whenever I'm at a client I don't know for sure whether they have scikit-learn, or H2O, or Spark ML, or R, or Microsoft stuff, but I'm pretty sure that whatever stack they have, it has linear regression. So if I'm at least able to use some of these feature engineering tricks, I'm pretty sure I can always find a way to get it into production, even if they have a different stack than what I'm used to; you can implement this in Java or JavaScript. Amazing. And you can mold this feature approach: right now I've generated 12 features, but you can very well imagine making 24, one for every two weeks. You can also imagine having this feature for the long-term pattern across the year and another radial basis function with a much smaller width, within a single day for example, and because everything is linear it stays relatively straightforward. I hope you agree that's kind of a cool trick. What you can also do is have a separate radial basis function for different days of the week. In the Netherlands there's this weird thing where on Wednesdays people typically take the day off to be with the kids, it's a culture thing, so maybe the seasonal pattern is different for Wednesdays than it is for Fridays. So you can also just generate such a season for every single day of the week.
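A rough sketch of how such monthly RBF columns could be generated in pandas and fed into a linear regression; the exact formula, the alpha value and the wrap-around distance are assumptions, not the speaker's code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def add_monthly_rbf_features(df, date_col="date", alpha=0.01):
    """Add one 'hump' column per month that repeats every year."""
    day_of_year = df[date_col].dt.dayofyear
    for month in range(1, 13):
        # Centre of the hump: roughly the middle of that month.
        centre = pd.Timestamp(2000, month, 15).dayofyear
        # Wrap-around distance so late December stays close to early January.
        dist = np.minimum(np.abs(day_of_year - centre),
                          365 - np.abs(day_of_year - centre))
        df[f"rbf_{month:02d}"] = np.exp(-alpha * dist ** 2)
    return df

dates = pd.date_range("2014-01-01", "2017-12-31", freq="D")
df = pd.DataFrame({"date": dates})
df = add_monthly_rbf_features(df)
# Fake target with a yearly season, just to show the fit end-to-end.
df["y"] = np.sin(2 * np.pi * df["date"].dt.dayofyear / 365) \
          + np.random.normal(0, 0.2, len(df))

X = df.filter(like="rbf_")
model = LinearRegression().fit(X, df["y"])
df["season"] = model.predict(X)   # the smooth seasonal curve
```

The wrap-around distance gives exactly the property the monthly dummies lack: points close in time influence each other even across a month or year boundary.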
You can even extend this trick to generate a feature like this for every single holiday. For example, take Valentine's Day flowers: once Valentine's Day is done, nobody wants to buy roses anymore, so the price of roses just drops, and it's usually in the days before Valentine's Day that retail prices skyrocket. To accommodate these sorts of effects, which are kind of like a season because they occur every single year, you can also make a feature like this that depends on when Valentine's Day happened. You can still get away with doing a lot in linear models, you can interpret everything, and you can still tweak lots of stuff; try doing that with your neural network. And that's kind of the point: you could generate all these features and chuck them into a neural network, and that's fine, but it's not the point. The cleverness of what I just did came from creativity, from not focusing on the algorithm per se but from starting to think about the problem. If you understand your solution better than your problem, you're doing it wrong, I think. A lot of people would say "maybe just chuck it into some gradient boosted machine or deep neural network", and I do hope you see that you shouldn't expect these features to be generated automatically on your behalf; there's still something to be said for not deferring everything to the algorithm and thinking along a little bit.

Let's consider a small variant to show that even now we're still super flexible, because typically you don't have a season that repeats itself exactly every single time; what you typically have is something more like this, where the season itself also changes over time. So, oh my god, can a linear model still accommodate this? Yes, there are a couple of tricks you can do. Normally in a linear model you minimize the error; this is the squared error you can minimize. Other people prefer to minimize something else, like the absolute error, and that's also fine. Some other people say you can do this thing called ridge regression, where you basically punish the model if the weights get way too big, which is actually a neat trick, and we can just write it down. What you can also do, and this is what I really like, is say "well, you know what, certain errors are more important to me than other errors": I care way less about stuff that happened five years ago than about stuff that happened a month ago. So we do the same kind of trick. By the way, you can do this with a gradient descent method, but if you're a math nerd you can also write it down in matrix form and there's a closed-form solution, which is super cool. But what you can also do is simply make another column in your training data frame, call it "importance", and in scikit-learn the linear models, and I think support vector machines and logistic regression too, have support for this via sample weights. What you have to do beforehand is go to your pandas data frame and add some sort of importance; what I do here is an exponential decay going back in time, which gives me another hyperparameter to play with. This is what we had before, and this is what we have now: you can see we really don't care about the fit far in the past anymore, but the recent part, the part that matters for the future, is fitting better, even though it's still not perfect. And this is simple scikit-learn that we can just go ahead and use.
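A small sketch of the importance column / sample-weight idea with an exponential decay back in time; the fake data and the decay rate are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

n = 1460  # four years of daily data
df = pd.DataFrame({"time": np.arange(n)})
df["y"] = np.sin(2 * np.pi * df["time"] / 365) * (1 + df["time"] / n) \
          + np.random.normal(0, 0.2, n)

# Most recent observation gets weight 1, older ones decay towards 0.
decay = 0.005  # hyperparameter: how quickly old observations lose importance
df["importance"] = np.exp(-decay * (n - 1 - df["time"]))

model = LinearRegression()
model.fit(df[["time"]], df["y"], sample_weight=df["importance"])
```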
Yeah, so when you do this, and I'm going over this very fast because I think I made way too many slides for the time I have, one thing that happens is that you end up with a bunch of hyperparameters. (Whoops, speaking of optimization, better check the cable.) At least for me, there are these five lines of scikit-learn grid search and they work very well, but part of the stuff I'm doing here is in pandas, and you cannot really put all of the pandas operations into scikit-learn pipelines perfectly yet. So what I do is make one function that takes all the hyperparameters as function inputs, and then I use a library called evol that I really like to play with; it does evolutionary algorithms and it's good enough for my kind of hyperparameter tuning. In general, do whatever you want here, but keep it simple, that's the whole point.

Another trick we can do: this is a bit of syntax from R. In R this is a way for me to say "hey, I want to predict the variable y and it depends on all the other variables". What I can also do in R is say "you know what, I want an interaction term": don't just take the radial basis function that I had before, but multiply it by time. That way the seasonal pattern can grow as time increases, because time multiplies itself with that feature. That's the R trick, basically a DSL for model specification, but there's a cool library in Python called patsy that allows you to take a string just like that and it will generate these features on your behalf. It works on pandas dataframes, it automatically does one-hot encoding, and it's a pretty sweet library. To be very frank, when I saw that library is when I started doing less machine learning in R and started moving to Python; it really helps. When you run this, this is what the X matrix looks like: you still get the names of all the variables you make, you can see we still get all the original variables we were creating, and the only thing that's been added is that we also get time multiplied by the radial basis feature, and so on. It's pretty cool and it saves you a lot of typing. When you do this, though, you're going to generate a whole lot of features, and then you'll probably want to reduce them as well. Things you might consider: maybe use a ridge model to prevent overfitting, try to remove variables in your hyperparameter search function because you do want to keep things small, apply t-tests with statsmodels, or use whatever scikit-learn offers for feature selection. This is what we get when we apply the sample weights, and this is what we get when we apply the sample weights and add the interaction terms. Some parts are predicted better if you pay a bit of attention: these are two predictions a year ahead that are somewhat reasonable, but I can see that here it's predicting the end of the year a little bit worse, whereas there it was a little bit better, so it smells like hyperparameters. There's a lot of text on the slide, but basically it says: here's what I end up with, and it's pretty damn decent, which is the main point. All of this was done with simple models and a bunch of tricks, and I can still perfectly interpret the model and play around with it. And here's a nice thing: our example only had a few years of daily data, so only around 1,100 data points.
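A small sketch of the patsy formula trick with an interaction term; the column names and the formula itself are assumptions, not the ones on the talk's slides.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import Ridge

n = 1460
df = pd.DataFrame({"time": np.arange(n)})
df["rbf_jan"] = np.exp(-0.001 * ((df["time"] % 365) - 15) ** 2)  # one yearly hump
df["y"] = df["rbf_jan"] * (1 + df["time"] / n) + np.random.normal(0, 0.1, n)

# "rbf_jan * time" expands to rbf_jan, time and the rbf_jan:time interaction,
# so the seasonal effect is allowed to grow (or shrink) as time passes.
y_mat, X_mat = dmatrices("y ~ rbf_jan * time", data=df)
print(X_mat.design_info.column_names)  # ['Intercept', 'rbf_jan', 'time', 'rbf_jan:time']

model = Ridge(alpha=1.0).fit(X_mat, np.ravel(y_mat))
```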
The radial basis functions work in such a way that different months can actually overlap in features, so I'm using as much information as I can to generate all of this. Try doing that with a deep neural network; I don't know if you need more than a thousand data points, but you probably do, and there's at least something to be said that you shouldn't expect a recurrent neural network to automatically solve all of this for you. And if I have, say, a thousand different products, I can generate all these radial basis functions, run this per product, and I should still be able to filter out the season relatively nicely. There are two types of machine learning models: there's the panda, the one model you really want to care for and optimize perfectly; but in this case, where I've got a thousand models to train, I think this is a decent approach for general seasonal fitting.

Linear models are easy to maintain, and if a bug occurs or something goes wrong, we can inspect them. Linear models are also kind of fast to train, which is nice; with deep learning in production I sometimes have to wait. They're also very easy to explain to humans from management, which, especially if you live the consultant life, is a very useful thing. And here comes my favorite reason, which I didn't expect at all until I was running deep learning models in production: linear models are convex. When the optimizer says "I'm done", then I'm sure it's done and it will not find a better fit on the training set. When a neural network is done, it basically says "well, I don't have enough time to do anything more", but it can converge on something that's completely bonkers, and that is something I have to worry about in production on top of all the other stuff I've got to worry about: I now also have to check whether the convergence is okay, something I no longer have a guarantee for. TensorFlow is a cool tool, but it's even cooler if you don't need it. Honestly, if you have a scikit-learn algorithm, that's amazing, but if you can also just run a SQL query, the SQL query is better, right? The same thing goes for TensorFlow. And again, don't get me wrong, I've put these things into production and I like the algorithms, they solve problems I couldn't solve otherwise, especially in text nowadays I'm seeing things that are amazing, but production is dangerous and I don't want to wake up at 12 o'clock on a Sunday because something went wrong; I prefer the pager to stay quiet as well.

So, speaking of production: so far I've shown a lot of linear models that are just, you know, normal, static models, and we can do some cool things with them, but there are some extra tricks you can do, especially if we're considering production. There's this whole realm of machine learning that I haven't heard a single talk about at PyData yet, so let's talk about streaming. Because if you think about it, instead of handling data in a batch, where we have one giant file we want to learn from, it might be preferable to think about it in a stream setting. In production, data will probably come in as a stream; usually you don't have someone who says "here's a CSV file, please train", typically you get one data point at a time. This is kind of what it looks like if you do it in batch: you train some sort of function, you learn it, and until the next batch is done training, every event that comes in gets that one function applied to it, and that's more or less the prediction.
Most machine learning algorithms are functions in that sense: if you train with scikit-learn, something goes in and something comes out, and the same input always gives the same output. In the streaming situation, however, every data point that comes in might actually update the model state, and there's something to be said for being able to react quickly to the world and not having to wait for the batch algorithm that runs on a daily basis. There's an implementation in scikit-learn that sort of does this, and you can imagine that things like Apache Flink or Spark Streaming might have some interesting implementations as well.

I'll give you a small explanation of how some of these algorithms work. This is a point that lives in space, and we're going to do a regression on it. As far as this point is concerned, it doesn't care about the blue line or the yellow line: they fit equally well, they both go straight through the point, so these two lines are equal as far as that one point is concerned. I can make a second plot, though: this is again the point with a line going through it, a plot of x and y, but next to it you see a plot of the weight space, so this axis is the intercept and this one is the slope. The blue dot here corresponds with the blue line and the yellow dot corresponds with the yellow line, and as far as this one point is concerned, every single point on this straight line in weight space is equal in terms of fit. Now suppose I have some weights from the previous points I've seen; then I probably want to find the shortest distance from my current weights to that line. Well, we have linear algebra for that: project onto it orthogonally and you're done. What we can also do is say, well, we want to move in that direction, but instead of overfitting to this one point, how about we take a maximum step size, so we don't make very aggressive moves; that's the bound there. The idea then is that as the stream comes in, if my classification or regression algorithm already has a pretty good fit, then maybe I don't update; let's only update when there are giant errors, and when there are giant errors we update very aggressively. These are called passive-aggressive algorithms, and I love the name. It turns out that scikit-learn actually has passive-aggressive algorithms for linear regression and, I think, also for classification, and I've actually used this at a client, not so much for the benefit of the streaming aspect but more for the benefit of the much smaller memory footprint: if you only have to handle data one point at a time, there's no longer any need to have the entire data set in memory. So even if you're only interested in preserving memory this is an interesting algorithm, let alone the actual streaming aspect, which is also very powerful.
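A minimal sketch of scikit-learn's passive-aggressive regressor being fed one observation at a time; the data and the C value (which bounds how aggressive an update can be) are made up.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

model = PassiveAggressiveRegressor(C=0.1)  # C caps how aggressive an update can be

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

for t in range(1000):
    x = rng.normal(size=(1, 2))          # one event arriving from the "stream"
    y = x @ true_w + rng.normal(0, 0.1)  # its (noisy) label
    model.partial_fit(x, y)              # update the model state in place

print(model.coef_)  # should end up close to true_w
```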
So, a very small experiment, because there's still some stuff that can go wrong: that maximum step size can be big or small, and I'll go over this very quickly, you can read the slides later. Here you see what the fit is doing and here you can see what the weights are doing. If I allow a fairly big step size all the way through, you see that the fit is sort of all over the place. If instead I take very small step sizes at all times, it takes a very long time to even get near where it's supposed to be, but then the errors are much smaller. Typically what you want to do is be very aggressive in the beginning, say for the first 100 data points or so, allow it to make huge steps, and then only allow small steps afterwards. A thing that's also kind of cool about these algorithms: if this thing is deployed and updating data point by data point, and the world changes a little bit, it will react; the algorithm is actually able to change its weights over time. But management says we should implement TensorFlow anyway. This approach can run on the live system and is especially useful when you've got labels coming in on the stream; if you can update the weights in real time, that's very sensible, and if the world changes you can do some cool stuff. Extra details and math are on my blog or in the original paper that I refer to here.

And I can immediately hear you thinking, dear heckler: "Vince, this is all great, but when do you actually have a situation where a stream of data comes in with a label? Because if you have the label, you don't need a prediction, right?" Kind of true, but let's talk about recommender systems. The nice thing about a recommender is that when someone clicks on the internet, I get a stream coming in that says "yes, this user clicked that thing". So in this case, yes, I have a label, but I would really like to use it immediately to update some sort of system. In retail it's very common that I don't see someone for over half a year, only for the person to come back with completely new preferences: maybe half a year ago you wanted to buy a Jaguar, but now you've got a kid and you need diapers. I would really like to learn this within, preferably, two mouse clicks; I don't want to wait for the batch algorithm to update. And here's the nice thing. Suppose you have a recommender system doing collaborative filtering. I think it's fairly safe to say that the features of an item don't change that often over time, but the features of a user might. And here's the cool thing: the rating we're trying to predict, whether it's a click or an actual rating, is a linear model if you're doing collaborative filtering, and if you assume the item features are fixed, you can do passive-aggressive updating on the user features. Suddenly you can still do batch, right, you can still run the giant batch algorithm at night, but during the day we now have a nice little method to keep the model updated per mouse click. It's still a huge engineering challenge, but I'm no longer thinking only about accuracies and that kind of thing, I'm thinking about how to build a cool system. But management wants us to use TensorFlow. This is a general thing I see go wrong quite often, especially when companies are just starting out with this: don't think only about machine learning, think more about system design, because you want to leave the client with a system, not just an algorithm. An algorithm is thinking in just the notebook; a system is about actually making the world better.
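A rough sketch of the idea that, with the item vectors held fixed, the predicted rating is linear in the user vector, so a user can be updated per click; the factorization, the fake data and the bounded update rule (in the spirit of the passive-aggressive idea) are illustrative assumptions, not the talk's implementation.

```python
import numpy as np

n_factors = 8
item_vectors = np.random.normal(size=(500, n_factors))  # from last night's batch job
user_vector = np.zeros(n_factors)                       # one user's current state

def update_user(user_vector, item_id, rating, max_step=0.5):
    pred = user_vector @ item_vectors[item_id]
    error = rating - pred
    step = item_vectors[item_id] * error
    # Bound the step size so a single click can't move the user too far.
    norm = np.linalg.norm(step)
    if norm > max_step:
        step *= max_step / norm
    return user_vector + step

# One click comes in on the stream: user interacted positively with item 42.
user_vector = update_user(user_vector, item_id=42, rating=1.0)
```

The heavy factorization can still run as a nightly batch job; during the day, updates like this keep each user's vector fresh per mouse click.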
I just showed you collaborative filtering, but we can do something way simpler and arguably better. Let's pretend we're making a recommender for the Dutch BBC. The simplest idea you could come up with: you can calculate the probability that, if you saw this thing I call i, some Dutch show, you're also going to see some other show j; in other words, the probability that you've seen i and j together. That could already be an algorithm, right? But the problem with it is that everyone is probably watching the news, so no matter what you watch, I'm always going to recommend you the news if I use this. So, as a mental counterbalance, some sort of accounting: how about I divide this by the probability that you're going to watch that content anyway? The hope is that we give you somewhat specific content, because I'm giving you content that's often watched together, but I'm accounting for the fact that I may just be pointing everyone to popular content. You can make a sort of hello-world chart of what that looks like: suppose it's super normal that i and j are seen together, but the probability that you're going to watch j anyway is also quite high; then this is the score that comes out. But suppose i and j together are less likely, while the probability of watching j in the first place is super low; then j is actually relevant to start recommending. I hope it's clear that this is a very intuitive probabilistic argument, and if this goes to production I can write it in SQL; in fact, if this were a streaming algorithm, it's just counting, it's that simple. If you can run a SQL query, like I said, you can calculate this very easily. You can also write tests for it: if one of these numbers is ever lower than zero or higher than one, something has gone wrong, so you can write some decent unit tests, and all parts of the algorithm are very interpretable. You can even turn it into a personal recommender: take the recommendations for all the items a user saw, multiply the scores, and boom, you now have a personal recommender as well. So it's kind of weird if people consider starting with deep learning for recommenders. There's some very cool work being done in that space, I definitely don't deny that, but if you're starting out, you probably need to think about these sorts of things first. And suppose we're working at this theoretical company doing actual recommendations: there's an even simpler algorithm we could try, we could just recommend the next episode. It would be very awkward if this particular video service, as a potential client of mine, never tried recommending the next episode first; if that's something they didn't consider, that's kind of weird. And if they never implemented A/A testing or a random baseline just to test their baselines, I'd also find that kind of suspicious. And if they were thinking about using deep learning instead, I don't know, then they're probably not thinking beyond the notebook, and the hype around these algorithms can be super duper dangerous.
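A tiny pandas sketch of that co-occurrence score, P(i and j watched together) divided by P(j watched at all); the viewing-log format (user, show) is an assumption for illustration.

```python
import pandas as pd

views = pd.DataFrame({
    "user": [1, 1, 2, 2, 3, 3, 3],
    "show": ["news", "drama", "news", "sports", "news", "drama", "sports"],
})

n_users = views["user"].nunique()
p_show = views.groupby("show")["user"].nunique() / n_users  # P(j)

# P(i, j): fraction of users who watched both shows.
pairs = views.merge(views, on="user")
pairs = pairs[pairs["show_x"] != pairs["show_y"]]
p_together = pairs.groupby(["show_x", "show_y"])["user"].nunique() / n_users

scores = p_together.reset_index(name="p_ij")
scores["p_j"] = scores["show_y"].map(p_show)
scores["score"] = scores["p_ij"] / scores["p_j"]  # divide out popularity
print(scores.sort_values("score", ascending=False))
```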
So I'll show you two more examples to lure you back into the domain of simple models, and you're being a good crowd, I've already done 100 slides with you today, so this is going well. One example is about video games and the other is about chickens. Let's pretend we have a super enterprise business model: we run a video game service and people sit in a queue. The problem is that whenever people are in the queue and want to play a video game, they probably want to play against someone of roughly the same skill. It super sucks if you're really good and you're playing against a total newb, and it also super sucks if you're the newb getting your ass whooped. So we need some way to estimate player skill, and in this case, assume we have a stream coming in that can only say "this person versus that person, this person won". That's the only thing we can learn from; we don't know anything about what happened inside the match. I hope it's also clear that if this service is going to help people, we probably don't want to wait for a batch algorithm: after every single game we want the state to update, and we want something simple.

So here's a simple insight, the first way I would start thinking about this. If you have been playing this game for years, then I should be able to say that your skill estimate is much more precise than that of someone who just came in and never played a single game before. So if we're going to represent skill, instead of having the representation be a single number, it might be a good idea to say "hey, skill is actually a belief": I have a belief about your skill, I don't know exactly what it is, but it's a belief. A simple way to model this, follow along: I've got a distribution for player 1 and a distribution for player 2, and because before the game is played they're independent of each other, I also have the joint distribution of these two players. Suppose one of the two players won; then I can just throw away all the probability mass on one side of a diagonal. If player 1 was the winner, then all the probability mass on the other side has just been declared unlikely, because that's where player 1 loses. The only thing I have to do then is come up with a clever update rule that translates this back into marginals for every single player. One thing that's nice to note is that I typically also keep a margin around the diagonal, which means I can support draws: if two players are equal, their likelihoods should be equal too. You can then make a small simulation to prove that this idea works. These are two skill distributions; when two people play and one of them wins, you see that one gets nudged to the right and the other gets nudged to the left, and vice versa. You can simulate this to convince yourself that it works. You can also simulate what happens when one player is clearly better than the other and the two battle it out: the cool thing is that if the poor player wins from the good one, there's a huge shift in belief, because someone who was supposed to be poor winning from someone who was supposed to be super good is obviously a much bigger change than when two players who are both supposed to be bad play and one of them wins. It makes sense; it's basically information theory at its best. Also, if the good player wins from the bad player, there's hardly any updating, so I get a lot of common sense for free.

Still, I would like to know whether this works, so you have to find a data set to prove it, and I went for Pokémon. You can go to the PokéAPI and get the data on all the Pokémon, you can go to fan websites and get a formula that tells you how one Pokémon might beat another, and then you can actually simulate Pokémon games. And after running this... well, this is usually the point in the talk where the energy dips, so see this as the small bit of energy I'm giving you so that you can follow the final point of my talk.
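A toy numerical sketch of the belief-update idea, on a discrete grid of skill levels; the grid, the draw margin and the likelihood are illustrative assumptions rather than the exact model from the talk.

```python
import numpy as np

skills = np.linspace(0, 10, 101)          # possible skill levels
p1 = np.ones_like(skills) / len(skills)   # flat prior belief for player 1
p2 = np.ones_like(skills) / len(skills)   # flat prior belief for player 2

def update_after_win(p_winner, p_loser, margin=0.5):
    # Joint belief before the game: players are independent.
    joint = np.outer(p_winner, p_loser)
    # Keep only the region where winner_skill > loser_skill - margin.
    w, l = np.meshgrid(skills, skills, indexing="ij")
    joint *= (w > l - margin)
    joint /= joint.sum()
    # Translate the joint back into marginal beliefs per player.
    return joint.sum(axis=1), joint.sum(axis=0)

p1, p2 = update_after_win(p1, p2)  # player 1 beat player 2
print("posterior mean skills:", skills @ p1, skills @ p2)
```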
Anyway, before this gets meta: these are all the posteriors you get if you run this. Basically we have an original belief, we update it after we see a single game, that belief is held onto until another game gets played, and then we update the belief again. My conclusion is that Magikarp is the worst Pokémon, so I think it's safe to say that my algorithm works; Mewtwo is also in the top five. The takeaway is that designing this algorithm was a whole lot simpler once I decided that I don't have to find some algorithm that fits my data set. No: ignore all the algorithms, just think about it yourself, what makes sense, how would I model this with just common sense? It's mainly the observation that I want to use distributions instead of a single number for skill that made designing this easy, and I really like the algorithmic freedom I'm giving myself. (Got it, five minutes.)

This in particular fits the Bayesian mindset really well, and that's worth mentioning because this is a streaming algorithm as well. The way you can see this (if you think this is boring, sleep for like ten seconds): the belief that I have after seeing three data points is my prior, which is what a new person comes in with, times some sort of update for seeing the first data point, then the second one, then the third one. Hey look, the system is recursive; we can do streaming. This is a very general thing: we have a nice little system that we can define, and in layman's terms, if you come up with a sensible update rule for whatever we believe before we see a data point, an update rule that updates a probability, you're done, and it's lovely. Oh, and if you do the math, you can also do this for teams; there's a separate talk I did at Berlin Buzzwords, look that up if you want to know how to do this for Heroes of the Storm. The thing is, and I hope it's clear, the model I've created here is built from nothing more than the game history, but look at all the benefits we now have: we can quantify the uncertainty, we can apply the model in a streaming setting, and it's very easy to deploy, understand, debug and test, because we understand every single step. If you want to come up with such a model, my best advice is to take a step back from the hype: grab a beer, pen and paper, sit on a terrace, and just think about how you would prefer to solve the problem if you were not constrained by the machine learning tools that we have. We're blessed by those tools, I know, but we're also cursed in a way.

Let's talk about chickens as a conclusion, and I think I won't run too far over time. This is my favorite data set; whenever I give trainings I always say something about the chick weight data set. The idea is that there's a diet, there's a certain chicken that belongs to that diet, and at this point in time this chicken was on this diet and it weighed this much, and we can make a plot of it. It's also one of those data sets that makes me a proud vegetarian: there's a farmer who tries to get the chickens as fat as possible in 21 weeks, and it's sort of an A/B test on different diets. If you were to do this in R, some people would say "oh, you know what, I'll just one-hot encode the diet parameter, put in the time, and make some sort of linear model with a dummy variable for each diet". This model is wrong, and I hope you all agree.
Because at time point zero all the weights are almost the same: at time zero, all the diets are about the same. But the model I just proposed suggests there's some sort of constant I have to add depending on the diet you're in; that makes no sense, all wrong, all wrong. Now, because I also like to do things in R, there's a very cool trick in R: you can have a data frame inside of a data frame. You say group by diet, nest it, then you have a data frame within a data frame, and you can run the machine learning model per data frame. So the idea is that per diet I'm now going to run a separate linear regression, which is already much, much better, but it's still wrong: at least we're getting separate slopes, but it's not plausible that the intercepts are different. I've got a bit of prior knowledge: when a chick has just been born, they all have about the same weight, they come from the same distribution, and any model that says otherwise is wrong. So the conclusion, kind of, is that I just want to grab a piece of paper, write down this formula, and say "this is how I believe the world works": the weight is modeled by some sort of constant shared by all the chickens, plus some sort of slope (not intercept, a slope) that's different for every diet and multiplies the time. The only thing I want is for the computer to look at this, to have some sort of domain-specific model, and to just run it.

In R there's a cool DSL called rethinking, and there's a function there called map2stan; it's kind of like PyMC3, it's also kind of what Stan tries to do. The cool thing is that I can actually have a domain-specific model where both the features get generated and the model gets defined. What I'm doing here is saying: look, there's some sort of distribution that defines the weight, depending on a mean, which is some sort of intercept plus some sort of slope times time, and the slope is different per diet. The engine is clever enough to notice "hey, everything defined here that isn't a column in this data frame is something I have to estimate". You run this, it compiles down to Stan (and I'm running over time, just hang on for a moment, there's a break afterwards, right? I'm almost there, trust me), and this gets trained. Because this is a full Bayesian framework you get some nice benefits, namely uncertainty around the parameters that you estimate, which is also kind of cool. But I can go a step further, because if I look at this data set I see the variance increases over time, so I probably want to model that as well. Okay, I make a new thing on the back of my envelope, add a couple of extra parameters to this one function, and lo and behold, it trains all of this. The nice thing is that in this particular case I can actually inspect the model and confirm that this works way better, and I'm actually dealing with uncertainty here as well, so I can say "hey, the beta parameter for diet three is super high, and diet three has a much lower variance than the others, so diet three is probably the best". A neural network usually doesn't allow me to inspect models in as much detail as what I've just shown you, and something very precious is happening here: instead of modeling within the feature space, we can also keep models simple by modeling in the model space, where all sorts of features are generated automatically by this DSL, which makes creative modeling somewhat convenient, and that's the thing I really like.
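A rough PyMC3 sketch of this kind of hierarchical model; the talk itself used R's rethinking and map2stan, so this is an assumed, approximate translation with per-diet slopes, a shared starting weight and noise that grows over time, on fake stand-in data.

```python
import numpy as np
import pymc3 as pm

# Fake data standing in for the chick weight data set.
rng = np.random.default_rng(1)
n = 200
time = rng.integers(0, 22, size=n)                     # days on the diet
diet = rng.integers(0, 4, size=n)                      # four diets
weight = 40 + (5 + 2 * diet) * time + rng.normal(0, 1 + 0.5 * time)

with pm.Model():
    intercept = pm.Normal("intercept", mu=40, sd=10)   # all chicks start alike
    slope = pm.Normal("slope", mu=0, sd=10, shape=4)   # one growth rate per diet
    sigma0 = pm.HalfNormal("sigma0", sd=10)
    sigma_t = pm.HalfNormal("sigma_t", sd=1)           # lets the noise grow with time

    mu = intercept + slope[diet] * time
    pm.Normal("obs", mu=mu, sd=sigma0 + sigma_t * time, observed=weight)

    trace = pm.sample(1000, tune=1000)
```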
You can do this for house prices, you can do this for diseases that have certain symptoms, but I'm running out of time, so let's do conclusion time instead. I hope I've somehow inspired you, or at least convinced you, that simple models are still a worthwhile pursuit. Feature engineering can definitely still save the day, and it makes sense to come up with systems instead of mere algorithms. Simple algorithms are cool; simple algorithms have properties that complex algorithms are simply missing out on, and we should care about more than just the error on the test set. If you want to provide a service, I hope I've given a glimpse of all the other things you could be looking at that are actually highly beneficial. Especially in the case of the recommender: would you rather have a recommender that's 1% better, or one that's maybe 1% worse but updates for every single mouse click the user makes? You may be optimizing the wrong thing. Simple models can actually be very advanced and smart; complex models can actually be inarticulate and dumb. Unfortunately there seems to be a fear of missing out, and that's a dangerous pursuit, but let's start celebrating the fact that we can still get simple models to production. Simple models are easier to explain, understand, debug and maintain, so let's use them to solve problems. I definitely ran out of time, but this was 145 slides, so we did well.

Great, so here's the awkward thing: I am between you and beer, so if you want to ask questions we could do them now. Oh wait, no, there's a keynote: Holden is going to give an amazing keynote after this, so I'm not actually between you and beer, and we also have to clear this space because the walls have to be removed and such. Do we have time for questions? Five minutes of questions, okay.

"Thanks for a great talk. The response I usually get when I make this pitch, the idea that linear models should be used over deep learning, is about scalability. They say: okay, the feature engineering thing is really great and you showed me a really cool example, but what if I have a huge number of features and I don't have the time or the resources to do all of this feature engineering?"

So, feature engineering can be partly automated, right, and feature engineering isn't a solved, convex problem, so you're not going to do it perfectly anyway. If you just try a couple of simple ways to do feature engineering automatically, you can still manage a lot of things, I think. Again, the examples I showed you were sort of obvious things where domain knowledge was available to get the thing done the way you want it. There are of course still situations where it might make sense to go a different route, and that's fine, but consider the simple thing first; don't do the complicated thing before you've benchmarked a very simple thing, which includes recommending the next episode, for example. Thanks.

"The argument against linear models is not so much XOR; the typical argument is outliers, especially when linear models are used with somewhat naive error metrics. That's not a question so much as an observation, no response required, but: when you're working with linear models, what's your favourite way of dealing with noisy, messy, outlying data?"
Sure. My favourite way, and this is only if I've got time for it, and again there's this notion of the panda, the one model I have to take very careful care of because it's super important, versus the situation where I've got a thousand models that are all time series and I have to care less per model. One thing I like to do, also because it helps me understand the data: this is an example where I have also defined how the noise works, so I'm actually saying "look, the noise is going to increase over time, and this is how it increases", and for that I get uncertainty bounds coming out of my model. So preferably I do domain-specific modelling, and this is cheap, by the way: usually you go to a retail client, you have a coffee with, I don't know, the truck drivers, someone who's actually doing the work on the floor, and typically they will just tell you how this works, not necessarily in terms of outliers and sigmas, but they'll tell you how it works. Then on your side you can deal with the uncertainty however you want, you just have to model it appropriately, and that's still the art; it's not a solved problem, but you can manage. Another question?

"Up to now you've been showing us how to basically do feature engineering the way a support vector machine would be doing it. Are there any other ways that you would recommend?"

Okay, so, just to translate: I'm not saying that all complex models are evil and should be put in the dungeon, right? The main argument for starting with simple models is, if nothing else, to see them as a benchmark, and the simplest example for classification would be to take the most frequent class. Especially when you're doing things with recommendations, the thing that annoys me so much is the thing people never do: when you do your first A/B test, do an A/A test. That way you can test whether your A/B splitting works, because you don't want to be making decisions for five years and then realize none of them mattered. The next thing you should always do is a random test. Why random? Because it's the worst possible recommender ever: just show random stuff from all of your content, but you will at least measure how many people will click anyway. Knowing this, and then realizing "oh, my recommender is only 0.1% better than that", means you haven't figured out how a recommender should work on your platform yet. If nothing else, start simple this way, and the main reason is that otherwise you don't know what's happening. There's one more?

"It's not specific to the linear versus deep model thing, but more about streaming: how do you account for the fact that, as data comes in, you might be overfitting to recent points, or to a history bias, or to one super big outlier?"

So, whenever something goes to production, someone has to give me a reason why it's a good idea, and some people might also say "hey, doing it this way is not a good idea". But just as a general architecture for this specific case, I could argue we can still do the batch thing, run that every night, and only what happens during the day do we handle in a streaming fashion. That already limits some of the things that can go wrong, because the world is probably not changing completely overnight every single day. Another thing you can do if you're worried that you're maybe overfitting a bit:
in the passive-aggressive example that I gave, there is this maximum step size and you can specify it, so there are some tricks to deal with this. In this particular case, though, the main thing I would do is start monitoring it, because everyone can make hand-wavy arguments that this could happen or that could happen; I'm sure everything can go wrong, but maybe start measuring it first. And getting it live first, making sure the whole pipeline works, is a good investment either way. So your concern is definitely valid, but measure it first, then optimize, and it's a concern that holds for deep learning too. Any more questions? Okay, in that case: there are still tickets for PyData Amsterdam, where we're looking forward to serving everyone stroopwafels. Thanks for listening. And if you could all do me one thing: we all hate recruiters on LinkedIn, right? So what you can do to annoy them a bit (this is my actual LinkedIn profile) is, just for fun, list all the big data technologies that you're using and chuck in a few Pokémon names, then see whether the recruiters can tell which of these things are Pokémon and which aren't. If we all promise to do this right now, LinkedIn will be a much better place. Thank you.
Info
Channel: PyData
Views: 64,482
Rating: 4.9228597 out of 5
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, download, learn, syntax, software, python 3, data scientist, data science, data analytics, analytics, coding, PyCon, example, general linear model, pdf, machine learning
Id: 68ABAU_V8qI
Length: 43min 32sec (2612 seconds)
Published: Sun May 27 2018