Let's all welcome to the stage
Rebecca Bilbro with Visual Diagnostics
for More Informed Machine Learning. [applause] Hello. Can you hear me? OK. Welcome. Thank you for joining me
late in the day after a long day for my talk on Visual Diagnostics
for More Informed Machine Learning. First of all,
it's very nice to meet you. This is my first time at PyCon and my first time speaking at PyCon, and it's been a great time so far. I've really enjoyed myself. It's really nice to be here. So I want to tell you
a little bit about my background, just to provide some context. So the first question
you always get asked in the machine learning community:
yes, I do have a PhD, and no, it is not
in machine learning. I work now as a data scientist after spending
many years in academia, but I still consider myself
research proximal. I am part of
an open source collective called District Data Labs. We work in Washington, DC. This is something that we do
because we love it, in our free time. My day job is in a public startup
in the Department of Commerce, working on using machine learning
models to do precision policy. I have been programming in Python
for about two years now. So, that's a little bit about me. I also want to give you
a little bit of a sense of where I'm gonna go in the talk, but before I do that,
I have two questions for you. So, first question:
who is self-taught? Who considers themselves self-taught? You've self-taught machine learning,
self-taught Python, self-taught something. OK. How many people
have seen The Wizard of Oz? [laughter] Awesome. OK. [laughter] So... Assuming that those
two things are true for the majority
of the people listening, I want to tell you a little bit about
how I started doing machine learning. So, in the analogy, that's Kansas. Where I am doing machine learning
now: in the Land of Oz. How I got there:
I took the yellow brick road. And then there's what I think is next, what I want all of you to see, and what I want you to help me with, if you're willing. OK. Starting out in Kansas. So, there are a lot of self-taught
people out there. I am a self-taught
machine learning practitioner. I had a very circuitous path here. I did a lot of things
before I did this, but when I found Python, when I found machine learning, it was like love at first sight. And the reason is that Python
makes machine learning so easy. I know there was another talk
about machine learning right before this one, and maybe
some of you saw that too. You see the incredible power there -- what you can do with just a few lines
of Python code. For me, it was really kind of
encountering the Scikit-Learn library, which is just incredibly powerful -- the incredible API and what it lets you do in Python. So basically I started out doing
machine learning using Python, using Scikit-Learn, and it was
a matter of five simple steps: I would prep data, pick a model, fit that model,
validate it, and deploy. So I'm going to walk you through
how I did that. So I started by prepping the data, basically by just loading it into a pandas dataframe. I had a vague sense that this was probably not a scalable solution, and that maybe it was a bad idea to be holding all of that in memory, but I just went with it, and it worked fine most of the time.
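Just as a rough sketch, that prep step looked something like this -- the file name and the "target" column are placeholders for whatever your data actually has:

```python
import pandas as pd

# Load everything into a DataFrame (file name is a placeholder).
df = pd.read_csv("data.csv")

# Split into a feature matrix X and a target vector y
# ("target" is an assumed column name).
X = df.drop("target", axis=1)
y = df["target"]
```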
In order to be really smart about model selection, I would go to the leading experts, which are Google and Stack Overflow, and, luckily for me, the internet is full of people who know exactly what the best machine learning model is, and they were happy to tell me.
So I would then instantiate and fit this model. As many of you know, this is just a matter of a couple of lines to make happen -- this is, again, the incredible power of the Scikit-Learn API.
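Something like this -- the linear regression here is just a stand-in for whichever model the internet happened to recommend:

```python
from sklearn.linear_model import LinearRegression

# Instantiate and fit: a couple of lines.
# (LinearRegression is purely an example choice.)
model = LinearRegression()
model.fit(X, y)
```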
Then, because I am a nice person, I would validate my results: for regression, using the mean squared error, the root mean squared error, or the coefficient of determination; for classification, using a classification report. And if I was getting an F1 score or an R-squared of 0.8 or better, I'd feel pretty cocky, pretty good about myself. And if I wasn't, I would use gridsearch to try to get better results.
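Roughly the kind of thing I mean -- assuming a held-out test split (y_test) and the model's predictions on it (y_pred):

```python
from sklearn.metrics import mean_squared_error, r2_score, classification_report

# Regression-style checks (y_test and y_pred are assumed to exist).
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

# Or, for a classifier, per-class precision / recall / F1.
print(classification_report(y_test, y_pred))
```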
And then, at the end, when it was time to pull all of my hacky research code into something that could be deployed, I would use pipelines from Scikit-Learn to tie it all together. And that's it. It's as simple as that.
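A minimal sketch of what that looked like -- the scaler and the classifier here are just example steps:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Bundle the prep and the model into one deployable estimator.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)  # assuming a train split from earlier
```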
Except, at night, I'd go home, lie in bed, stare at the ceiling, and have this sick feeling in my stomach that I have -- [laughter] -- no idea what's really happening under the hood here. And, you know, traditionally,
machine learning was done by people who trained a long time to do it. You know, they went to school
for many years. They studied the models.
They studied the math. But I think that the future
of machine learning practitioners looks a lot more like me, because of Python, because of the tools
that are available now -- these libraries
that are coming out that support machine learning
for everyone. The problem is that it is important
to know what you're doing. And so, it's important
because machine learning is increasingly informing
all kinds of decisions that we make. You know, I work
for the federal government; I work for the
Department of Commerce. We use machine learning
to make US policy. Machine learning is used
to inform decisions in everything from wartime operations all the way to things like dating sites. So machine learning really needs to be informed. The problem is that
machine learning is easy, but informed machine learning
is really hard. Does anybody recognize this? [audience member laughs] Call it out. Yeah, so this is Anscombe's quartet. It's four data sets that have, essentially, identical statistical properties. But if you plot them, you can instantly see
that they are nothing alike. And if you tried to model
the behavior of these four data sets using the same
machine learning model, you would not do very well. And I think that the lesson here
is that sometimes the most powerful tool that we have to use in programming
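Plotting the quartet takes only a couple of lines -- Seaborn happens to ship it as a sample dataset:

```python
import seaborn as sns

# Anscombe's quartet, one regression plot per group: the differences
# that the summary statistics hide jump out immediately.
anscombe = sns.load_dataset("anscombe")
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2)
```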
So, we go to the Land of Oz. The Land of Oz is a place where things are not easy anymore. Machine learning is not easy.
It's hard, and it's complicated, and you don't always understand
what's going on. But you are committed,
sort of, to this idea that it's important to be
informed about what you're doing, and that you are doing this
in color, you know. This is machine learning
in Technicolor. So how do we turn the color on for machine learning? So my proposal to you is that you follow
the yellow brick road, and I'm going to tell you
how to do that. So this is the workflow that I use now
to do machine learning. As you can see,
it's not simple, it's not easy, it's not linear. But it's based on this notion
of the model selection triple. My feeling is that when we talk
about machine learning, you know, especially on Stack Overflow
or, you know, these places online -- I was sort of teasing earlier about asking the internet, crowdsourcing what the best model is. But when we talk about models, I think that the problem is
that we're so focused on what's the best model?
What's the best model? Everybody has their favorites. You know, decision trees, SVM, neural nets. Everybody has the one that is,
you know, their favorite. But the problem is that that
sort of gives us this tunnel vision about what machine learning really is. And so I think it's, you know,
more than just, "what's the best model?" I think it's three things. The first is feature analysis. So, feature analysis that supports
intelligent feature selection, intelligent feature engineering. Model selection, so picking
the model that makes the most sense for your problem,
for your domain space. And then finally,
hyperparameter tuning. So, once you have selected
the model and the features that are going into the model,
you know, picking the parameters that result in optimal performance, optimal scores. So, come with me on this journey
down the yellow brick road. So we're going to start with
visual feature analysis. So, one tool that probably
most people have seen before is boxplots or box and whisker. What I like about boxplots
for doing feature analysis is that you can very quickly
start to see, you know, the central tendency. You can start to look
at distribution, start to visualize outliers
for different features. And then you can move to something like histograms and examine a single feature: look at its distribution and see, is it a normal distribution, or should I be prepared for a non-normal distribution?
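In pandas that's only a line or two -- assuming the dataframe from before, with "feature_1" as a placeholder column name:

```python
import matplotlib.pyplot as plt

# Box-and-whisker plot of every numeric feature in the DataFrame.
df.plot(kind="box")
plt.show()

# Histogram of a single feature to check its distribution.
df["feature_1"].plot(kind="hist", bins=30)
plt.show()
```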
Using things like scatterplot matrices -- also called sploms -- you get pairwise plots of features, two by two. What we're looking for
here is relationships between pairs of features, right? So we're looking for
linear relationships, for quadratic relationships, for exponential relationships. We are looking out for things
like homoscedasticity and heteroscedasticity. So, we want to know
how the features are distributed
relative to each other. That's going to be important
for modeling. And sometimes, because sploms can get big in high dimensions, we use jointplots to examine a single pair and look at the relationship between just those two features.
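Seaborn gives you both in a line each -- the column names here are placeholders:

```python
import seaborn as sns

# Pairwise scatterplot matrix (splom) across all the features.
sns.pairplot(df)

# Jointplot to zoom in on one pair of features.
sns.jointplot(data=df, x="feature_1", y="feature_2")
```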
Radial visualizations are another great visual feature analysis tool. Here we've got the features
plotted on a unit circle, and what we're looking for is how much pull each of the features has -- how strongly they pull the data points towards them. The nice thing with radviz is that, if you don't have too many features, you can start to visualize separability, so it's really good for classification.
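pandas has a radviz helper built in -- "class" here is an assumed name for the label column:

```python
from pandas.plotting import radviz
import matplotlib.pyplot as plt

# Features on a unit circle; instances are pulled toward the features
# that influence them most.
radviz(df, "class")
plt.show()
```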
Parallel coordinates is another great way to hunt for separability. Here, instead of a unit circle,
the features are laid out along the x-axis -- each point on the x-axis is a feature -- and our data points are plotted as line segments. And so what we're looking for are chords of the same color, or braids. That starts to point us
to potential separability, which is really useful for classification.
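Again, pandas has this built in -- same assumed "class" label column:

```python
from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt

# Each instance becomes a line across the features; we hunt for braids
# of the same color.
parallel_coordinates(df, "class")
plt.show()
```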
So I propose that you can also use visualizations to do model selection. Many of you are probably
familiar with this already: the Scikit-Learn "choosing
the right estimator" flowchart. I really like this.
I think it's a great way to get started with Scikit-Learn. What I like best is that it kind of
makes you make a decision at every node. You have to really think, you know,
"How much data do I have? "What kind of --
what are my goals here? "Where am I gonna end up?" You know, you can very quickly
exhaust this, though, and it captures only a small fraction of what's actually in Scikit-Learn. So maybe we use other things, like the cluster comparison plots, which start to give you
this chance to visualize how different models behave across different data sets. And there's also
the classifier comparison plots. I really like using these. You can't really deploy them for every problem -- they don't really work in high dimensions, right? But what's nice is that
you can start to see the patterns in how an algorithm slices up the data space, right? So, each row is a data set. Each column is an algorithm. And you can start to see patterns about how these algorithms
are performing, and I think it's really useful
to have that in your mind when you're selecting between,
like, an SVM and a random forest. You know, they're not just words. You know, they behave
very differently. Recently I've been experimenting
with this notion that you could use, sort of,
graph traversal as a way to do model selection. This is something
that I'd be very interested to hear other people's ideas about: how model selection can be made interactive. Evaluation tools,
model evaluation tools, I think are also promising
for model selection. In this classification report heat map, for example, you can start to see that the darker blues are the places where the model is performing best, and the lighter colors are the places where it's not performing as well, although it's doing pretty well, generally, here.
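One way to build that kind of heat map yourself -- assuming y_test and y_pred from a fitted classifier:

```python
import numpy as np
import seaborn as sns
from sklearn.metrics import precision_recall_fscore_support

# Per-class precision / recall / F1 rendered as a heat map.
p, r, f1, _ = precision_recall_fscore_support(y_test, y_pred)
scores = np.vstack([p, r, f1])
sns.heatmap(scores, annot=True, cmap="Blues",
            yticklabels=["precision", "recall", "f1"])
```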
I think that being able to do comparison plots actually has a lot of promise. Here, you can very quickly see
from these ROC-AUC plots which model is performing best, because of the shape of the curve. I think that using these comparison tools could really be a very effective way of supporting model selection.
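A sketch of that kind of comparison, assuming a couple of fitted classifiers (the names are hypothetical) and a held-out test set:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Overlay ROC curves for two fitted classifiers and compare their AUCs.
for name, clf in [("logistic", logistic_model), ("forest", forest_model)]:
    scores = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label="%s (AUC = %.2f)" % (name, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.legend()
plt.show()
```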
Here is another example using regression, where visualizing what kinds of errors different models make relative to each other might help support model selection. Residual plots, I think,
are also very promising. You know, you can start to see here not just which model
is doing best, but why: because of bias, because of how the errors are distributed, because of heteroscedasticity. I think these are very promising tools.
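A residual plot is just a scatter of predictions against errors -- here assuming a fitted regressor and the test split from before:

```python
import matplotlib.pyplot as plt

# Predicted values against residuals (actual minus predicted).
y_pred = regressor.predict(X_test)
plt.scatter(y_pred, y_test - y_pred)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()
```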
Visual tuning. I made fun of gridsearch a little bit at the beginning, but I actually think
that it's a very useful tool. I just think that
we need to find ways to, kind of,
make it more thoughtful. In order to really make
good use of gridsearch, you already have to know a little bit
about the hyperparameter space, so it's not just kind of
stabbing in the dark. Validation curves, I think,
are a good alternative. They let you visualize the performance of a model across a range of values for a hyperparameter. So here, on the left-hand side, we've got underfit, because the training and cross-validation scores are both low. At some point they are both fairly high. And then on the right-hand side we have overfit, because the training score gets really good but the cross-validation score drops off.
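Scikit-Learn can compute those curves for you -- here with an SVC and its gamma parameter, purely as an example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Score the model across a range of gamma values, on both the training
# folds and the cross-validation folds.
param_range = np.logspace(-3, 3, 7)
train_scores, cv_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)

plt.semilogx(param_range, train_scores.mean(axis=1), label="training")
plt.semilogx(param_range, cv_scores.mean(axis=1), label="cross-validation")
plt.legend()
plt.show()
```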
You can also make gridsearch visual by integrating these heat-map-style tools. And here we've got
two hyperparameters kind of mapped against each other. And what we're looking for is that place,
that sort of sweet spot, where the two hyperparameters
come together to result in the highest-performing model.
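For example, reshaping GridSearchCV's mean test scores into a grid -- the SVC and the parameter values here are just illustrative:

```python
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search two hyperparameters, then look for the sweet spot as a heat map.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

# Rows are C values, columns are gamma values.
scores = search.cv_results_["mean_test_score"].reshape(
    len(param_grid["C"]), len(param_grid["gamma"]))
sns.heatmap(scores, annot=True,
            xticklabels=param_grid["gamma"], yticklabels=param_grid["C"])
```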
So, this is the part where I ask for your help. I think that we need to find a way to facilitate better workflows. This is my workflow. I'm hoping that I'll get the chance
to talk to a lot of you about how you do machine learning
and what your workflows are like. But I think that we need to
kind of find a way to facilitate more informed
machine learning, you know, whether it's for us,
to make it easier for us, or to make those tools
more accessible for other people. So, experimentally, because there's no place like home, I have, kind of, worked with
my colleagues at District Data Labs to push a pypi package
called Yellowbrick. The idea is that it would be
sort of a wrapper for all of these visualization
tools that already exist. You know, some of them
are in Scikit-Learn, some of them are in pandas,
some of them are in Seaborn. A lot of them are matplotlib. The idea is to pull them all together under a common API that would make deployment of those visualization tools as easy as using the Scikit-Learn API, which I think would really facilitate informed experimentation. So if you're interested
in working on that with me, please come see me. Again, I want to think about ways of making model selection
interactive, maybe as a sort of
graph traversal problem. I'm interested in the notion of maybe developing
visual steering techniques. The concept here is a slider: you tune the hyperparameter by dragging the slider, and you watch how the model performs as that hyperparameter
value changes. This comes from a blog post
by Scott Fortmann-Roe. So, if you would like to read more, the District Data Labs blog -- we've got posts on all of the
things that I just went over, and all of the code is open, and you're welcome to use it
and iterate on it. We've got posts just on kind of
the basics of getting started in Scikit-Learn. Then a post on doing
the visual feature analysis, a post on visual interactive
model selection, and a post on
visual hyperparameter tuning and model evaluation techniques. We have a bunch of, kind of,
open source projects. That's what we do in our free time
on the weekends, at night after the kids go to bed. These are some of the projects that we've been working on. Please check us out
and see what you think. Some of the projects we are actually
going to be sprinting on. So we have a sprint
at the end of this week on Baleen, which is an RSS ingester to support natural language processing, and Trinket, which is my project -- this concept
of building tools -- visual tools to support informed
machine learning. We also have two posters
on Wednesday, so please come by and visit us. We'll talk more
about machine learning. And finally, just my contact info. I'd be excited
to speak with all of you. I'll be around
through the end of the sprints. I hope that you will come
and sprint on Trinket with me. The idea of Trinket is a tool that everybody could use, free and open, that would wrap all of these tools
that have already been built and deployed in other libraries,
and bring them together under a common API to help people do informed machine learning, support interactive model selection, and make something incredible together. Thanks very much. [applause]