Let's all welcome to the stage
Rebecca Bilbro with Visual Diagnostics
for More Informed Machine Learning. [applause] Hello. Can you hear me? OK. Welcome. Thank you for joining me
late in the day after a long day for my talk on Visual Diagnostics
for More Informed Machine Learning. First of all,
it's very nice to meet you. This is my first time at PyCon and my first time speaking at PyCon, and it's been a great time so far. I've really enjoyed myself. It's really nice to be here. So I want to tell you
a little bit about my background, just to provide some context. So the first question
you always get asked in the machine learning community:
yes, I do have a PhD, and no, it is not
in machine learning. I work now as a data scientist after spending
many years in academia, but I still consider myself
research proximal. I am part of
an open source collective called District Data Labs. We work in Washington, DC. This is something that we do
because we love it, in our free time. My day job is in a public startup
in the Department of Commerce, working on using machine learning
models to do precision policy. I have been programming in Python
for about two years now. So, that's a little bit about me. I also want to give you
a little bit of a sense of where I'm gonna go in the talk, but before I do that,
I have two questions for you. So, first question:
who is self-taught? Who considers themselves self-taught? You've self-taught machine learning,
self-taught Python, self-taught something. OK. How many people
have seen The Wizard of Oz? [laughter] Awesome. OK. [laughter] So... Assuming that those
two things are true for the majority
of the people listening, I want to tell you a little bit about
how I started doing machine learning. So, in the analogy, that's Kansas. Where I am doing machine learning
now: in the Land of Oz. How I got there:
I took the yellow brick road. And then there's what I think is next, what I want all of you to see, and what I want you to help me with, if you're willing. OK. Starting out in Kansas. So, there are a lot of self-taught
people out there. I am a self-taught
machine learning practitioner. I had a very circuitous path here. I did a lot of things
before I did this, but when I found Python, when I found machine learning, it was like love at first sight. And the reason is that Python
makes machine learning so easy. I know there was another talk
about machine learning right before this one, and maybe
some of you saw that too. You see the incredible power there -- what you can do with just a few lines
of Python code. For me, it was really kind of
encountering the Scikit-Learn library, which is just incredibly powerful -- the incredible API and what it lets you do in Python. So basically I started out doing
machine learning using Python, using Scikit-Learn, and it was
a matter of five simple steps: I would prep data, pick a model, fit that model,
validate it, and deploy. So I'm going to walk you through
how I did that. So I started by prepping the data, basically by just loading it into a pandas dataframe. I had a vague sense that this was probably not a scalable solution, and that maybe it was a bad idea to be holding all of that in memory, but I just went with it, and it worked fine most of the time.
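Just as a rough sketch, that prep step looked something like this -- the file name and the "target" column are placeholders for whatever your data actually has:

```python
import pandas as pd

# Load everything into a DataFrame (file name is a placeholder).
df = pd.read_csv("data.csv")

# Split into a feature matrix X and a target vector y
# ("target" is an assumed column name).
X = df.drop("target", axis=1)
y = df["target"]
```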
In order to be really smart about model selection, I would go to the leading experts, which are Google and Stack Overflow, and, luckily for me, the internet is full of people who know exactly what the best machine learning model is, and they were happy to tell me.
So I would then instantiate and fit this model. As many of you know, this is just a matter of a couple of lines to make happen -- this is, again, the incredible power of the Scikit-Learn API.
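Something like this -- the linear regression here is just a stand-in for whichever model the internet happened to recommend:

```python
from sklearn.linear_model import LinearRegression

# Instantiate and fit: a couple of lines.
# (LinearRegression is purely an example choice.)
model = LinearRegression()
model.fit(X, y)
```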
Then, because I am a nice person, I would validate my results: for regression, using the mean squared error, the root mean squared error, or the coefficient of determination; for classification, using a classification report. And if I was getting an F1 score or an R-squared of 0.8 or better, I'd feel pretty cocky, pretty good about myself. And if I wasn't, I would use gridsearch to try to get better results.
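Roughly the kind of thing I mean -- assuming a held-out test split (y_test) and the model's predictions on it (y_pred):

```python
from sklearn.metrics import mean_squared_error, r2_score, classification_report

# Regression-style checks (y_test and y_pred are assumed to exist).
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

# Or, for a classifier, per-class precision / recall / F1.
print(classification_report(y_test, y_pred))
```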
And then, at the end, when it was time to pull all of my hacky research code into something that could be deployed, I would use pipelines from Scikit-Learn to tie it all together. And that's it. It's as simple as that.
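A minimal sketch of what that looked like -- the scaler and the classifier here are just example steps:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Bundle the prep and the model into one deployable estimator.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)  # assuming a train split from earlier
```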
Except, at night, I'd go home, lie in bed, stare at the ceiling, and have this sick feeling in my stomach that I have -- [laughter] -- no idea what's really happening under the hood here. And, you know, traditionally,
machine learning was done by people who trained a long time to do it. You know, they went to school
for many years. They studied the models.
They studied the math. But I think that the future
of machine learning practitioners looks a lot more like me, because of Python, because of the tools
that are available now -- these libraries
that are coming out that support machine learning
for everyone. The problem is that it is important
to know what you're doing. And so, it's important
because machine learning is increasingly informing
all kinds of decisions that we make. You know, I work
for the federal government; I work for the
Department of Commerce. We use machine learning
to make US policy. Machine learning is used
to inform decisions in everything from wartime operations all the way to things like dating sites. So machine learning really needs to be informed. The problem is that
machine learning is easy, but informed machine learning
is really hard. Does anybody recognize this? [audience member laughs] Call it out. Yeah, so this is Anscombe's quartet. It's four data sets that have, essentially, identical statistical properties. But if you plot them, you can instantly see
that they are nothing alike. And if you tried to model
the behavior of these four data sets using the same
machine learning model, you would not do very well. And I think that the lesson here
is that sometimes the most powerful tool that we have to use in programming
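Plotting the quartet takes only a couple of lines -- Seaborn happens to ship it as a sample dataset:

```python
import seaborn as sns

# Anscombe's quartet, one regression plot per group: the differences
# that the summary statistics hide jump out immediately.
anscombe = sns.load_dataset("anscombe")
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2)
```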
So, we go to the Land of Oz. The Land of Oz is a place where things are not easy anymore. Machine learning is not easy.
It's hard, and it's complicated, and you don't always understand
what's going on. But you are committed,
sort of, to this idea that it's important to be
informed about what you're doing, and that you are doing this
in color, you know. This is machine learning
in Technicolor. So how do we turn the color on for machine learning? So my proposal to you is that you follow
the yellow brick road, and I'm going to tell you
how to do that. So this is the workflow that I use now
to do machine learning. As you can see,
it's not simple, it's not easy, it's not linear. But it's based on this notion
of the model selection triple. My feeling is that when we talk
about machine learning, you know, especially on Stack Overflow
or, you know, these places online -- I was sort of teasing earlier about asking the internet, crowdsourcing what the best model is. But when we talk about models, I think that the problem is
that we're so focused on what's the best model?
What's the best model? Everybody has their favorites. You know, decision trees, SVM, neural nets. Everybody has the one that is,
you know, their favorite. But the problem is that that
sort of gives us this tunnel vision about what machine learning really is. And so I think it's, you know,
more than just, "what's the best model?" I think it's three things. The first is feature analysis. So, feature analysis that supports
intelligent feature selection, intelligent feature engineering. Model selection, so picking
the model that makes the most sense for your problem,
for your domain space. And then finally,
hyperparameter tuning. So, once you have selected
the model and the features that are going into the model,
you know, picking the parameters that result in optimal performance, optimal scores. So, come with me on this journey
down the yellow brick road. So we're going to start with
visual feature analysis. So, one tool that probably
most people have seen before is boxplots or box and whisker. What I like about boxplots
for doing feature analysis is that you can very quickly
start to see, you know, the central tendency. You can start to look
at distribution, start to visualize outliers
for different features. And then you can move to something like histograms and examine a single feature: look at its distribution and see, is it a normal distribution, or should I be prepared for a non-normal distribution?
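In pandas that's only a line or two -- assuming the dataframe from before, with "feature_1" as a placeholder column name:

```python
import matplotlib.pyplot as plt

# Box-and-whisker plot of every numeric feature in the DataFrame.
df.plot(kind="box")
plt.show()

# Histogram of a single feature to check its distribution.
df["feature_1"].plot(kind="hist", bins=30)
plt.show()
```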
Using things like scatterplot matrices -- also called sploms -- you get pairwise plots of features, two by two. What we're looking for
here is relationships between pairs of features, right? So we're looking for
linear relationships, for quadratic relationships, for exponential relationships. We are looking out for things
like homoscedasticity and heteroscedasticity. So, we want to know
how the features are distributed
relative to each other. That's going to be important
for modeling. And sometimes, because sploms can get big in high dimensions, we use jointplots to examine a single pair and look at the relationship between just those two features.
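Seaborn gives you both in a line each -- the column names here are placeholders:

```python
import seaborn as sns

# Pairwise scatterplot matrix (splom) across all the features.
sns.pairplot(df)

# Jointplot to zoom in on one pair of features.
sns.jointplot(data=df, x="feature_1", y="feature_2")
```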
Radial visualizations are another great visual feature analysis tool. Here we've got the features
plotted on a unit circle, and what we're looking for is how much pull each of the features has -- how strongly they pull the data points towards them. The nice thing with radviz is that, if you don't have too many features, you can start to visualize separability, so it's really good for classification.
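pandas has a radviz helper built in -- "class" here is an assumed name for the label column:

```python
from pandas.plotting import radviz
import matplotlib.pyplot as plt

# Features on a unit circle; instances are pulled toward the features
# that influence them most.
radviz(df, "class")
plt.show()
```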
Parallel coordinates is another great way to hunt for separability. Here, instead of a unit circle,
the features are laid out along the x-axis -- each point on the x-axis is a feature -- and our data points are plotted as line segments. And so what we're looking for are chords of the same color, or braids. That starts to point us
to potential separability, which is really useful for classification.
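Again, pandas has this built in -- same assumed "class" label column:

```python
from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt

# Each instance becomes a line across the features; we hunt for braids
# of the same color.
parallel_coordinates(df, "class")
plt.show()
```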
So I propose that you can also use visualizations to do model selection. Many of you are probably
familiar with this already: the Scikit-Learn "choosing
the right estimator" flowchart. I really like this.
I think it's a great way to get started with Scikit-Learn. What I like best is that it kind of
makes you make a decision at every node. You have to really think, you know,
"How much data do I have? "What kind of --
what are my goals here? "Where am I gonna end up?" You know, you can very quickly
exhaust this, though, and it captures only a small fraction of what's actually in Scikit-Learn. So maybe we use other things, like the cluster comparison plots, which start to give you
this chance to visualize how different models behave across different data sets. And there's also
the classifier comparison plots. I really like using these. You can't really deploy them for every problem -- they don't really work in high dimensions, right? But what's nice is that
you can start to see the patterns in how an algorithm slices up the data space, right? So, each row is a data set. Each column is an algorithm. And you can start to see patterns about how these algorithms
are performing, and I think it's really useful
to have that in your mind when you're selecting between,
like, an SVM and a random forest. You know, they're not just words. You know, they behave
very differently. Recently I've been experimenting
with this notion that you could use, sort of,
graph traversal as a way to do model selection. This is something
that I'd be very interested to hear other people's ideas about: how model selection can be made interactive. Evaluation tools,
model evaluation tools, I think are also promising
for model selection. In this classification report heat map, for example, you can start to see that the darker blues are the places where the model is performing best, and the lighter colors are the places where it's not performing as well, although it's doing pretty well, generally, here.
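One way to build that kind of heat map yourself -- assuming y_test and y_pred from a fitted classifier:

```python
import numpy as np
import seaborn as sns
from sklearn.metrics import precision_recall_fscore_support

# Per-class precision / recall / F1 rendered as a heat map.
p, r, f1, _ = precision_recall_fscore_support(y_test, y_pred)
scores = np.vstack([p, r, f1])
sns.heatmap(scores, annot=True, cmap="Blues",
            yticklabels=["precision", "recall", "f1"])
```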
I think that being able to do comparison plots actually has a lot of promise. Here, you can very quickly see
from these ROC-AUC plots which model is performing best, because of the shape of the curve. I think that using these comparison tools could really be a very effective way of supporting model selection.
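A sketch of that kind of comparison, assuming a couple of fitted classifiers (the names are hypothetical) and a held-out test set:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Overlay ROC curves for two fitted classifiers and compare their AUCs.
for name, clf in [("logistic", logistic_model), ("forest", forest_model)]:
    scores = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label="%s (AUC = %.2f)" % (name, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.legend()
plt.show()
```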
Here is another example using regression, where visualizing what kinds of errors different models make relative to each other might help support model selection. Residual plots, I think,
are also very promising. You know, you can start to see here not just which model
is doing best, but why: because of bias, because of how the errors are distributed, because of heteroscedasticity. I think these are very promising tools.
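A residual plot is just a scatter of predictions against errors -- here assuming a fitted regressor and the test split from before:

```python
import matplotlib.pyplot as plt

# Predicted values against residuals (actual minus predicted).
y_pred = regressor.predict(X_test)
plt.scatter(y_pred, y_test - y_pred)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()
```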
Visual tuning. I made fun of gridsearch a little bit at the beginning, but I actually think
that it's a very useful tool. I just think that
we need to find ways to, kind of,
make it more thoughtful. In order to really make
good use of gridsearch, you already have to know a little bit
about the hyperparameter space, so it's not just kind of
stabbing in the dark. Validation curves, I think,
are a good alternative. They let you visualize the performance of a model across a range of values for a hyperparameter. So here, on the left-hand side, we've got underfit, because the training and cross-validation scores are both low. At some point they are both fairly high. And then on the right-hand side we have overfit, because the training score gets really good but the cross-validation score drops off.
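Scikit-Learn can compute those curves for you -- here with an SVC and its gamma parameter, purely as an example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Score the model across a range of gamma values, on both the training
# folds and the cross-validation folds.
param_range = np.logspace(-3, 3, 7)
train_scores, cv_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)

plt.semilogx(param_range, train_scores.mean(axis=1), label="training")
plt.semilogx(param_range, cv_scores.mean(axis=1), label="cross-validation")
plt.legend()
plt.show()
```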
You can also make gridsearch visual by integrating these heat-map-style tools. And here we've got
two hyperparameters kind of mapped against each other. And what we're looking for is that place,
that sort of sweet spot, where the two hyperparameters
come together to result in the highest-performing model.
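For example, reshaping GridSearchCV's mean test scores into a grid -- the SVC and the parameter values here are just illustrative:

```python
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search two hyperparameters, then look for the sweet spot as a heat map.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

# Rows are C values, columns are gamma values.
scores = search.cv_results_["mean_test_score"].reshape(
    len(param_grid["C"]), len(param_grid["gamma"]))
sns.heatmap(scores, annot=True,
            xticklabels=param_grid["gamma"], yticklabels=param_grid["C"])
```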
So, this is the part where I ask for your help. I think that we need to find a way to facilitate better workflows. This is my workflow. I'm hoping that I'll get the chance
to talk to a lot of you about how you do machine learning
and what your workflows are like. But I think that we need to
kind of find a way to facilitate more informed
machine learning, you know, whether it's for us,
to make it easier for us, or to make those tools
more accessible for other people. So, experimentally, because there's no place like home, I have, kind of, worked with
my colleagues at District Data Labs to push a pypi package
called Yellowbrick. The idea is that it would be
sort of a wrapper for all of these visualization
tools that already exist. You know, some of them
are in Scikit-Learn, some of them are in pandas,
some of them are in Seaborn. A lot of them are matplotlib. The idea is to pull them all together under a common API that would make deployment of those visualization tools as easy as using the Scikit-Learn API, which I think would really facilitate informed experimentation. So if you're interested
in working on that with me, please come see me. Again, I want to think about ways of making model selection
interactive, maybe as a sort of
graph traversal problem. I'm interested in the notion of maybe developing
visual steering techniques. The concept here is a slider: you tune the hyperparameter by dragging the slider, and you watch how the model performs as that hyperparameter
value changes. This comes from a blog post
by Scott Fortmann-Roe. So, if you would like to read more, the District Data Labs blog -- we've got posts on all of the
things that I just went over, and all of the code is open, and you're welcome to use it
and iterate on it. We've got posts just on kind of
the basics of getting started in Scikit-Learn. Then a post on doing
the visual feature analysis, a post on visual interactive
model selection, and a post on
visual hyperparameter tuning and model evaluation techniques. We have a bunch of, kind of,
open source projects. That's what we do in our free time
on the weekends, at night after the kids go to bed. These are some of the projects that we've been working on. Please check us out
and see what you think. Some of the projects we are actually
going to be sprinting on. So we have a sprint
at the end of this week on Baleen, which is an RSS ingester to support natural language processing, and Trinket, which is my project -- this concept
of building tools -- visual tools to support informed
machine learning. We also have two posters
on Wednesday, so please come by and visit us. We'll talk more
about machine learning. And finally, just my contact info. I'd be excited
to speak with all of you. I'll be around
through the end of the sprints. I hope that you will come
and sprint on Trinket with me. The idea of Trinket is a tool that everybody could use, free and open, that would wrap all of these tools
that have already been built and deployed in other libraries,
and bring them together under a common API to help people do informed machine learning, support interactive model selection, and make something incredible together. Thanks very much. [applause]