Thanks for the introduction.
Thank you for having me. It's nice to be back on a college campus.
Quick background on me. This is gonna be a more nuts and bolts talk about data
science. I unfortunately did not go to Cal. I went to the Cal of the East Coast I guess, MIT.
I did a PhD in computer science there, and then my title has been, first, algorithms engineer, then data scientist, and now AI engineer. So I've been working as a data scientist for
almost a decade now and we founded Manifold with a couple of co-founders a couple of years ago.
Who are we? We call ourselves an AI studio, really using the buzzwords there, but we are a
services company. We're a consulting company that helps companies with accelerating their machine
learning and/or data engineering solutions. Most of our clients are typically non-tech companies,
larger non-tech companies that are taking their first journey into really becoming a more
data-driven company, putting machine learning and data engineering as core assets to their business.
We're actually headquartered right down the street in Oakland, and we also have an office in Cambridge; a lot of alumni stay there. So what is this talk gonna be about? Really, it was
kind of hard to fit everything into one talk but what I want to do is just share some mental
models for applying AI and make them real using some case studies from our work. So, at the top
level, the main model is that we've adapted a lot of techniques from CRISP-DM, if you're familiar with that, and human-centered design from IDEO, and we've come up with what we call our Lean
AI Playbook of how to go into a new company, a new situation where the business is trying
to get value out of AI and really making sure that we deliver a win in six weeks, in 12
weeks, in 24 weeks, because that's really why they are hiring us. They want us to really
accelerate getting value out of machine learning. I can't talk about all six of these steps
and I'll focus really mostly on Understand which is understanding the business and
understanding the data. A little bit on modeling and then a little bit on user feedback.
As machine learning people, data scientists, we tend to focus a lot here on the modeling stage.
In ten years of practice, this is really a very very small piece of the much larger puzzle, the
work that happens before and the work that happens afterward to really make the business get value
out of machine learning, so I'll talk a little bit about that. I'll jump right in, so Understand. One of the
mental models that we have is something... I like to put LaTeX on slides because PhD habits die hard, but I call it the AI uncertainty principle, which is really that the value you get out of the AI is upper bounded by the business value of the problem that you're solving, the data quality that you have to solve that problem, and lastly the predictive signal. Again, as data scientists, we tend to focus a lot on how good can I get the AUC, how good can I get the MAE, but that is the thing that you can't predict.
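Written out, the principle is roughly the following; the notation is an illustrative sketch, not the actual slide:

```latex
% Illustrative sketch of the "AI uncertainty principle" described above.
\[
\underbrace{\mathrm{Value}(\mathrm{AI})}_{\text{what the business gets}}
\;\le\;
\underbrace{V_{\text{business}}}_{\text{value of the problem}}
\times
\underbrace{Q_{\text{data}}}_{\text{data quality}}
\times
\underbrace{S_{\text{signal}}}_{\text{predictive signal}}
\]
```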
You have to actually do the data science to do it, so really it's very important early on in
the project to focus over here upfront, figuring out where to aim the AI and
assessing how good the data quality is. Notice this is multiplicative so if
any one of these things goes to zero, the value over here goes to zero, and that's bad
for us. It's bad for our clients. It's bad for the business. You don't want that. It erodes trust
and you don't want to do that as a data scientist, so how do you prevent that from happening?
Essentially, you have to do data understanding and business understanding. The two
techniques we use here are a business understanding workshop where we're really trying
to surface: Hey, get all the stakeholders in the room, this is usually CEOs, chief marketing
officer, CTOs, potentially even finance people, analytics people, to really figure out
if we were to solve this problem better, what could the ROI be? What often happens here is we will start an engagement that's about "Hey, we want to do preventive maintenance on these connected blenders we have that make milkshakes." Turns out that was not the highest-value place to focus the AI. It's better focused on "Hey, can I instead forecast
what flavors will be best to sell to each of these stores that can lift revenue by four to
five percent?" That's a much better place to aim our AI and therefore get better ROI. On the
other side, we're working with the tech team: the data analysts, the CTO, software engineers, to
figure out "Hey, let's catalog your data sources. How clean is it? How rare is the event that you're
trying to predict? Is it labeled well? Do we need to label more? Is the data joinable?" In many of
these large organizations the data has been siloed in various different CRMs or data sources, and
there's actually not even a join key, and so that is an issue that we face many times. It turns out you can use machine learning to learn a join key. I won't go deep on that, but these are problems that you should be thinking about.
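One common flavor of learning a join key, offered here purely as an illustration rather than Manifold's actual approach, is fuzzy record linkage: match entity strings across the two systems with character n-gram TF-IDF and nearest neighbors, keeping only confident matches.

```python
# Hypothetical sketch: linking two customer tables that lack a shared key by
# fuzzy-matching name strings. Not the method from the talk; just one common approach.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

crm_a = pd.DataFrame({"name": ["Jane Q. Doe", "ACME Oil Svcs", "Bob Smith"]})
crm_b = pd.DataFrame({"name": ["jane doe", "Acme Oil Services", "Robert Smith"]})

# Character n-grams are robust to small spelling and formatting differences.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
xa = vec.fit_transform(crm_a["name"].str.lower())
xb = vec.transform(crm_b["name"].str.lower())

# For each record in table B, find its nearest neighbor in table A.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(xa)
dist, idx = nn.kneighbors(xb)

matches = pd.DataFrame({
    "b_name": crm_b["name"],
    "best_a_match": crm_a["name"].iloc[idx.ravel()].values,
    "cosine_distance": dist.ravel(),
})
# Keep only confident matches; the 0.5 threshold is arbitrary for illustration.
print(matches[matches["cosine_distance"] < 0.5])
```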
So, let's make it real. One of our customers was one of the leading baby registries in the United States, and their CEO wanted to hire us to help serve her customers better: "I know that most of my revenue comes from a very small number of my customers. How can I tailor my service to serve them better? How can I find these customers?" So when we went
in there, we did the business workshop and we did a data audit. We came up with a spec. This was one
of these stories where we went in and they originally had a spreadsheet where you could track cohorts and whether they activated, whether they bought anything off of the registry or not. They kind of had the idea, but no one there had said, "Okay, let's get to the next revision of this." When we
did the business understanding, what we found out, what they really cared about was that they
wanted to be much more data-driven in marketing and in product, but their biggest problem was
that after people sign up, they didn't know whether they were going to be high LTV or not,
a high lifetime value or a valuable customer, until about nine months later with their baby
registry. That, it turned out after a lot of surfacing, was their biggest friction point in
their organization. So how could you make this shorter? We then did a data audit and they had a
lot of data. They had been running the company for almost ten years, so a lot of data: mobile app clickstreams, web clickstreams, marketing data from Facebook, Pinterest, all of
these things. They had data in their transactional database about how the customers are using the
product, all the way to even demographics about their customers. They can join it with census data
and other creepy data sources where you can put in people's emails and it'll tell you demographics
about them. So after cataloging everything, the spec that we came up with is: we're going
to build a model that will predict the final customer lifetime value after nine months every
day after sign up. So one day after sign up, two days after sign up, four days after sign up,
30 days after sign up. So every day we would make a new prediction of what the expected customer
lifetime value would be and in addition, we focused on only the transactional database because
turns out the data quality was very high there and there is a tax that you have to pay for every new
data source that you bring online. We thought, after looking at it, that the transactional
database has enough signal in and of itself. We don't need to pull in a terabyte of Heap analytics logs or a terabyte of Mixpanel logs. We can just focus here on the transactional database.
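A minimal sketch of what that spec implies for training-set construction, assuming a simple transactional schema (the files and column names below are invented): for each customer and each day after sign-up, build features from the transactions seen so far, labeled with whether the nine-month lifetime value landed in the high bin.

```python
import pandas as pd

# Hypothetical inputs: customers.csv has customer_id and signup_date;
# orders.csv has customer_id, order_date, order_total.
customers = pd.read_csv("customers.csv", parse_dates=["signup_date"])
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Label: did the nine-month lifetime value end up in the >= $500 bin?
final_ltv = orders.groupby("customer_id")["order_total"].sum()
high_ltv = (final_ltv >= 500).rename("high_ltv")

signup = customers.set_index("customer_id")["signup_date"]

rows = []
for d in [1, 2, 4, 7, 14, 30]:                      # days after sign-up at which we predict
    cutoff = orders["customer_id"].map(signup) + pd.Timedelta(days=d)
    early = orders[orders["order_date"] <= cutoff]   # only what we'd know by day d
    feats = early.groupby("customer_id")["order_total"].agg(["count", "sum"])
    feats.columns = ["orders_so_far", "spend_so_far"]
    feats["days_since_signup"] = d
    rows.append(feats.join(high_ltv, how="left"))

# Simplified: customers with no transactions by day d are omitted here.
train = pd.concat(rows)
```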
Another problem. I'm gonna use these two case studies throughout. So another client we had was an oil
services company out of Oklahoma City and their goal was to again be more data-driven
about their maintenance operations. They have thousands of machines out in the field, and they are breaking down. They actually don't sell the machines; they lease the machines and they sell uptime. This, again, came out of surfacing the business understanding. They sell uptime, and so they are also selling the maintenance contract along with it, and if the machine goes down, they lose money.
In addition, they have to roll trucks to go do the maintenance on them and so that was their burning
problem. We also did a data audit. Turns out this was one of the cases where they thought they had
much better data than they really did. They had almost a decade's worth of service logs from the
machines where the maintenance techs are typing in and clicking little checkboxes about what parts
they're using to fix these machines. They thought that they could use that to predict whether a part
was going to fail. It turns out this is what's very very common in human-input data: humans are inconsistent over time and across humans, so there's a lot of inconsistency in how the exact same failure was written up, diagnosed, and check-marked in these service logs. In addition, the service logs had a very quickly changing lineage, in the sense that how things were input three years ago is very different from how things were input a year ago and how they are input now. So the schema is changing, and all of this makes for a very very difficult data source
to work with. Instead, what we chose to focus on is a huge historical log of sensor
data. They have 54 sensors on these machines collecting all sorts of things like vibration,
temperature, the states of different registers on the gas compressor equipment. We chose to
focus on that because machine data is much more trustworthy: the lineage didn't change as much, because once a sensor is out in the field, it is out in the field. So the machine learning problem we ended up focusing down on and posing is to forecast whether a major fault will occur. Again, we have to define what a major fault is, and this is where the business value meets the data. We defined a major fault as the machine being down for more than two hours, because according to their maintenance organization, that is when they're getting many many more calls. That is oftentimes when they have to roll a truck. It's often indicative of a true major failure in the parts, because otherwise these are kind of like computers too: sometimes you can just reboot it and it comes back up and it's okay, but if it's down for two hours, typically we can't just knock it on the side and it'll be okay. There is actually something going on. That's the data understanding part.
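A small sketch of what that label definition could look like in code, under an assumed downtime-event schema (file and column names invented):

```python
import pandas as pd

# Hypothetical downtime log with unit_id, down_start, down_end per event.
events = pd.read_csv("downtime_events.csv", parse_dates=["down_start", "down_end"])

# Business definition surfaced above: a major fault is more than two hours of downtime.
events["major_fault"] = (events["down_end"] - events["down_start"]) > pd.Timedelta(hours=2)

# One label per unit per day: did a major fault start on that day?
labels = (
    events.assign(date=events["down_start"].dt.normalize())
          .groupby(["unit_id", "date"])["major_fault"]
          .any()
          .reset_index()
)
```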
I just can't emphasize the importance of this enough. This is such an important part of doing data science out
in the field because if you don't aim it at the right place and a problem that is solvable with
the data that you have, you will fail. It's upper bounded by zero and it's not going to be good
for anybody and you'll spin your wheels. It's all about reducing waste that way. Let me move
on. I think I'm not gonna talk about engineering. I'm gonna talk more about modeling here because
it's the fun stuff. But one mental model we also have, and this is actually from a friend of the firm. I don't know if you guys have seen this diagram, but this is by Monica Rogati. She's a data science celebrity who was on the original LinkedIn data science team from about ten years ago. She has this fantastic diagram, the AI Hierarchy of Needs, which basically puts modeling right at the top. All of the
stuff that we think is the sexy stuff is at the top. All the "boring stuff" is at the bottom
but the boring stuff is not actually that boring because it's the foundation. If you don't do this
stuff right, you will never get success out of that. This is why engineering is really important.
That being said, once you get to modeling, I can't emphasize this enough either and I'm
sure you're being taught this in your courses, build a baseline model. No exceptions.
We'll hire new people out of school and they'll want to go be like "Oh we should use WaveNet
and use this pre-trained model. We'll cut off these layers and we can put it in again." It's
just like "Dude, let's try division." That's oftentimes how a lot of these initial
conversations go and division is a great algorithm. It's been proven to work in a lot of places. [AUDIENCE]: *chuckling* So that's what I mean by a baseline model. Do the simple thing first and rules of thumb that
we have found useful over the years are these: if it's a regression problem, turn it into a classification problem. Quantize it, even if you end up doing multi-class, because classification is easier to understand than regression. You can look at AUCs, you can look at class errors. You can learn from that. Second is the usual model progression: I don't always start with just division, but usually we're starting with random forests, then going to gradient boosted trees, then going to deep learning. Random forests are awesome and, oftentimes, they're just the thing that we put into production because they're so easy to tune and so robust to overfitting. It's fantastic. Lastly, on the feature engineering side: pick a few features and iterate from there.
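A minimal sketch of those rules of thumb on synthetic data, purely illustrative: quantize a continuous target into a class, compare a trivial baseline against a random forest, and look at the AUC.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y_cont = make_regression(n_samples=5000, n_features=8, noise=10.0, random_state=0)
y = (y_cont > np.percentile(y_cont, 75)).astype(int)   # quantize: top quartile vs. the rest

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(X_tr, y_tr)   # the "division" of models
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("baseline AUC:", roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1]))
print("forest   AUC:", roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1]))
```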
This is a part of the engineering phase I didn't talk about. We're often working with the client to get a prioritized list of which features they think have the most predictive signal, and also scoring them against how difficult it would be to engineer them. Some things are a single SQL query. Some things require joining across 17 tables with like 24 sub-clauses in the SQL query. That's hard. Let's do the simple thing first, but maybe that hard thing has a lot of predictive signal and I do have to do it. So really it's about judging that and then iterating. Everybody here is very familiar with evaluation
metrics. I like to classify them into two buckets. One bucket is the Aggregate Metrics.
This is the thing that you're actually seeing. How well is it performing? These are the AUCs, ROCs,
TPR at some false positive rate. But oftentimes, especially in the learning phase, we're looking at the individual level, the sample-level metrics. Here is a common plot that we look at: I'm looking at the true negatives and seeing what the model predicted on them, and these are the true positives and I'm seeing what the model predicted on them. As you can see, some of the true positives are doing really really well, but many of them are not doing so well. What's going on? Who are these guys where it's predicting so low? What's up with them? We do this kind of analysis of just looking at the four corners and the middle, and letting that guide you on what features you should make next or perhaps how you should change the architecture.
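A small sketch of that sample-level plot, reusing the forest, X_te, and y_te from the previous sketch (matplotlib assumed): histogram the predicted probabilities separately for the true negatives and the true positives, then go look at the misbehaving tails.

```python
import matplotlib.pyplot as plt

p = forest.predict_proba(X_te)[:, 1]

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True)
axes[0].hist(p[y_te == 0], bins=30)
axes[0].set_title("True negatives: predicted P(positive)")
axes[1].hist(p[y_te == 1], bins=30)
axes[1].set_title("True positives: predicted P(positive)")
plt.tight_layout()
plt.show()

# The interesting samples are the tails: true positives scored near 0 and true
# negatives scored near 1. Pulling those rows out and eyeballing them is what
# guides the next round of feature engineering.
```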
Making this real again, going back to the baby registry problem, we started with the two simplest features, which were: What platform did they use to sign up? This is iOS or Android or Mac or Windows. And where did they come from? Pinterest or Facebook. Just with that simple model
and literally no features about the usage pattern. These are super easy to make. I can
just query it right out of the table. I got an AUC of 0.65. Then we added 11 more features that were in our priority list. Write the SQL query, put it into the Python model, and you see 0.90, so at that point, we're killing it. We had thoughts about how
we should do embeddings and we should do a multiscale convolutional
neural network. Forget it, 0.90 AUC, this is amazing. Seven days after sign
up, I can predict whether or not this is actually a lifetime value bin of $500 or
greater. I can predict that pretty accurately so that's amazing. Let's move on to the next problem
that's a better use of a data scientist's time. Similarly, the oil services problem. This was a much more difficult problem with a huge sensor-data time series. We had to sample it and do some sample rebalancing because the things that we're looking for are
rare events. We did some feature engineering, went into the random forest.
With a few features, got a not so great AUC of 0.65. Added a few more features, got to an
AUC of 0.78. This started to saturate a little bit. For the custom feature engineering, we made five or six more features. Still not doing great, so we're like, okay, let's move on to convolutional neural
networks. I think that could be really really good. We did a multiscale convolutional neural net where we look at a look-back window, similar to a WaveNet, if you're familiar with it, to try to predict out five days whether a failure is gonna happen.
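A hedged sketch of what a multiscale, WaveNet-flavored 1D convolutional model over a sensor look-back window could look like; the shapes and layer choices are assumptions, not the actual architecture. Input: seven days of hourly readings from the 54 sensors; output: probability of a major fault within the next five days.

```python
import tensorflow as tf
from tensorflow.keras import layers

lookback_steps, n_sensors = 7 * 24, 54
inputs = tf.keras.Input(shape=(lookback_steps, n_sensors))

# "Multiscale": parallel dilated convolutions capture short- and long-range patterns.
branches = []
for dilation in (1, 4, 16):
    x = layers.Conv1D(32, kernel_size=3, dilation_rate=dilation,
                      padding="causal", activation="relu")(inputs)
    branches.append(layers.GlobalMaxPooling1D()(x))

x = layers.concatenate(branches)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```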
After two weeks of trying to tune the hyperparameters and tune the architecture, we weren't getting materially better performance.
The AUC is about the same and so at that point we went on to the next stage and we were thinking
about doing some more advanced modeling and mixed effects modeling. We were like "You know what? Forget it. Diminishing returns." It's more important to move on in the cycle and put this
in front of the user. That's kind of the last step I'll talk about: user feedback. What do I
mean by user feedback? We are getting the predictions in front of the people who will actually use them. So in this context, for the baby registry company, it was in front of the
marketing team that was using this to make decisions on whether this campaign versus this campaign was
doing better, in front of the product team. For the oil services organization, this was in front of
the maintenance organization that is actually triaging these predictive maintenance
things. We're doing working sessions with them, and again, this is more of a design philosophy
but it works really really well in this space because you don't know how this is going to be
used and you don't know how well you have to do for the business to get value out of it,
so it's really important to get it in front of the user. Oftentimes, first, we're going with
nothing. We're just seeing what their flow is like right now. Then we go dump predictions into Excel,
give it to them. Then, perhaps we're making a Jupyter Notebook where we can change a few parameters,
playing with them there. Eventually, we'll build a web app
or something like that around it, but this is very important. What happens is "trust nobody," right? Nobody trusts models, especially black box models. Even I don't trust black box models. Something comes out and you're like, "It's probably wrong." That's my first instinct with a model. I'm an engineer by training, and it's magic that anything works at all, because you know the shortcuts we're taking, right? What we do is we'll do sensitivity analysis, and so we
actually have a package that we use internally that we've developed, that can probe the model in
different ways to see if the intuitions that the customer has match what's coming out of the model. For example: "Hey, does the predicted failure rate for the cohort match the historical average failure rate? If sensor A goes above this PSI, does the likelihood of failure go up?" These are known heuristics in people's heads. The model has to match them; otherwise,
your model is likely wrong. In addition, this is what builds trust in the model.
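The internal package isn't public, but a minimal sketch of that kind of probe, with the fitted model, the reference_row of features, and the sensor column name all assumed, looks like this: sweep one feature while holding the rest at a reference row, and check that the prediction moves the way the domain experts expect.

```python
import numpy as np
import pandas as pd

def sweep_feature(model, reference_row, feature, values):
    """Score copies of reference_row with one feature swept across values."""
    grid = pd.DataFrame([reference_row] * len(values))
    grid[feature] = values
    return model.predict_proba(grid)[:, 1]

psi_values = np.linspace(50, 400, 8)   # hypothetical pressure range
probs = sweep_feature(model, reference_row, "discharge_pressure_psi", psi_values)

# Heuristic check: failure likelihood should not fall as pressure climbs.
assert (np.diff(probs) >= -1e-6).all(), "model disagrees with the domain heuristic"
```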
The second thing we're often finding, and this is true, is that the predictions are never enough. The raw predictions are never enough; they're not solving the business problem. You have to build a UI around the AI.
What do I mean by that? I'll make it concrete. In the baby registry problem, the product team
came back to us and they're like "Okay, I want to change the product. I want to have this special
promotion where if people do these ten actions, they will get a free box that I will ship to
them that's worth twenty five dollars. I want to know if this will make my LTV better. Is
this trade-off worth it? I'm gonna have a higher customer acquisition cost but will the final
LTV be worth it? How do I answer that question?" This is the business problem that they care about.
Turns out that the model can answer that, but just giving them raw predictions is not enough. We have
to do a Temporal A/B test, where we have to come up with some new math to do that.
How do you do an A/B test with predictions? In addition, we have to retrain a model without
the features that would be confounded by this experiment. In the end, we gave them a tool
where you can take certain features out of the model, retrain it, and run it on two different cohorts.
You pass in two different CSVs. It runs it on the two different cohorts and it does a
modified Welch's test to tell you if the two predicted LTVs are really different from one
another and even gives a p-value. That's the thing that gave the product team value, not just the raw predictions.
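A rough sketch of the shape of that tool; the modified Welch's test itself isn't described, so this uses the standard scipy Welch's t-test, and the model, feature columns, and CSV layout are assumptions.

```python
import pandas as pd
from scipy.stats import ttest_ind

def compare_cohorts(model, feature_cols, control_csv, treatment_csv):
    """Score two cohorts with a model retrained without confounded features,
    then compare the predicted LTV distributions."""
    control = pd.read_csv(control_csv)
    treatment = pd.read_csv(treatment_csv)

    ltv_control = model.predict(control[feature_cols])
    ltv_treatment = model.predict(treatment[feature_cols])

    # Welch's t-test: does not assume equal variances between the cohorts.
    stat, p_value = ttest_ind(ltv_treatment, ltv_control, equal_var=False)
    return ltv_treatment.mean() - ltv_control.mean(), p_value
```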
Similarly, on the other side, on the oil services problem, we delivered the raw predictions
and we were really happy with ourselves in the sense of "Hey look, there's all these units
that have high probability of failure today" and we're thinking "Oh man, they're gonna roll
trucks and it's going to save the day. It's going to be fantastic." Turns out we took
the Excel spreadsheet to the maintenance people. They took a look at it. They double-clicked down into which units they were, and it turns out: "Oh yeah, those units? Man, we're driving those way out of range. We know that basin has really really high line pressure. They break down all the time. We know that. We're driving it out." It's like, "Oh. Okay. Yeah, this means nothing to me, because I know that I'm driving these machines in a place that will lead to more failures." So, at least that was a sanity check on the model that is predicting that it's going to
fail, but it doesn't have value. So what we ended up doing there to solve that first problem is looking at the differentials in the probabilities. The prediction comes out every day, and we're really now alarming on when the prediction changes. If it's 0.2 for a while and it jumps up to 0.6, that is what we alarm on.
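A small sketch of that differential alarm, under an assumed table of daily per-unit predictions (column names invented); the jump threshold is illustrative.

```python
import pandas as pd

# daily_preds is assumed: one row per unit per day with columns unit_id, date, p_fail.
daily_preds = daily_preds.sort_values(["unit_id", "date"])

# Trailing seven-day average of yesterday's and earlier predictions, per unit.
baseline = (
    daily_preds.groupby("unit_id")["p_fail"]
               .transform(lambda s: s.shift(1).rolling(7, min_periods=1).mean())
)

# Alarm on the change, not the level.
daily_preds["jump"] = daily_preds["p_fail"] - baseline
alerts = daily_preds[daily_preds["jump"] > 0.3]
```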
Secondly, there are so many features and sensors to look at, and there are so many different failure modes, that they wanted better direction in their triage. We ended up implementing what's called the tree interpreter. It's a way that you can interpret what is coming out of a random forest or a gradient boosted tree, and it tells you why it's making the prediction: what features are driving the probability higher, and what features are driving it lower?
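The open-source treeinterpreter package (pip install treeinterpreter) does this kind of decomposition for scikit-learn forests, splitting a prediction into a bias term plus per-feature contributions. Whether that exact package is what shipped in the client's web app is an assumption, as are the fitted forest, the single feature row x_today, and feature_names.

```python
from treeinterpreter import treeinterpreter as ti

# x_today: one row of today's sensor features, shape (1, n_features).
prediction, bias, contributions = ti.predict(forest, x_today)

# Rank the features pushing the fault probability up or down for this unit.
ranked = sorted(
    zip(feature_names, contributions[0, :, 1]),   # class 1 = "major fault"
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, contrib in ranked[:5]:
    print(f"{name}: {contrib:+.3f}")
```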
That sort of "explainable AI" is very very useful in the final web app that we delivered to this client, because it helped them direct their triage. Otherwise, it was just taking too long to figure out what could potentially be wrong. What
should I throw on the truck when I go out there? So that's it. There are
many other things that I didn't talk about, like: use Docker. Don't be a pirate, be the Navy, so be good about your software engineering practices. Embed high-cardinality categorical variables. Hopefully, I've been able to communicate a few of the mental
models that I found useful. Any questions? [AUDIENCE APPLAUSE] [HOST]: I can see that we are short on time. We can take a couple of questions but we will have to move on from there. [HOST]: Will you be around after? [SOURAV]: I will be around, yes. [STUDENT]: You've probably run into a lot of interesting
questions to answer every month and I'm wondering, if we go back to the previous slide.
I know modeling is kind of at the top of the pyramid but what percentage of time
are you really spending with your clients in each of these stages? [SOURAV]: Yeah, usually Understanding is one week. Engineering is where we're spending a lot of time. We have tooling that we've built up for the constants, like "Hey, we're gonna deploy to the cloud. We're gonna share and collaborate," but there are
a lot of specifics of the problem that we need to understand, the scale, the velocity of the
predictions that need to come out so we're using different things from the RISELab and the AMPLab. We also use Spark and Clipper and stuff like that, spending a lot of time here.
We want to quickly get through the modeling phase to get something in front of user
feedback. Then, and this is always unfortunate, there's a lot of time spent at Deployment because of the details of deploying it to the cloud, into their infrastructure. Then, the last stage is where we're not involved anymore, but we're monitoring this with the client.
Right now, that oil services company has been running validation for the past two quarters
on the product that we've delivered to them. [STUDENT]: Thank you for the presentation. We
talked during the break so I was excited to see it. One of the questions that I had was
on the slide you had about building trust in your model. How do you avoid possible goal-seeking? That's when you have an answer that you're expecting to see and then you work towards that. How do you avoid doing that? [SOURAV]: So, like overfitting to
expectations in some sense? That's a very good question, one that I haven't
thought about too much. Usually, these are basic heuristics that we're looking at.
I don't think we are. Nothing has been so specific that that antenna has gone up.
I know this is probably an unsatisfying answer, but usually these heuristics are really
reasonable things that we're looking at. If it were some very specific thing, like "All these points have to match to one; that has to be probability 0.71," that would be different.
It's never been that kind of an issue. These are much more global aggregated things that we're
looking at. [HOST]: Please join me in thanking Sourav. Thank you so much.