- Hello, this is Sean Owen. I'm a principal solutions
architect at Databricks, and I focus on machine
learning and data science here at Databricks. And I'd like to tell you today about what it looks like to do machine learning and data science on Databricks from end-to-end, from the data, all the
way through the modeling, to production. And along the way, we'll see
how this works in Databricks with some of the tools
you may have heard of, like Spark and Delta, but also of course, particularly MLflow. The data science life cycle
has a couple of steps at least. So there are probably three main parts here. There is the data engineering
where we take the raw data and we normalize it, we filter it, we extract what is a good representation for the analyst to use. And then the data science takes over, we analyze that data, we explore it, we build models, we select models. And once we've selected the best model, we need to get it out to
production in some sense. And the nice thing about Databricks and these tools we're gonna talk about is they support all
steps of this workflow. So I wanna show you today
how this all fits together. Within Databricks, of course, we're gonna see today the workspace, which is the piece you
interact with as a user, but there's a couple of
interesting pieces under the hood, Spark and Delta, which I alluded to, but also primarily MLflow. I wanna show you how MLflow supports users working in the workspace to implement a data science pipeline, from data all the way to production. So I wanna go ahead and
jump into a demo here. And for those that have not
seen this particular demo, I hope you enjoy it, for those that have, there are a couple of new elements here from maybe the last time
you've seen it in MLflow, and I'll highlight those along the way. And with that, let me
jump to the demo itself. Okay, let's take a look at this demo here. And this is my way of showing you what I think a simple data science problem would look like, solved
end-to-end on Databricks. So this is Databricks, for
those that haven't seen it, it's a hosted, web-based service where you interact with a
notebook-like environment here. So we can write documentation in Markdown, you can write code, which
you'll see in a second. And the one thing you don't really have to care too much about is the compute resources. So here I'm connected
to an existing cluster. If it didn't exist, I could create a pretty simple cluster. These clusters scale
up and they scale down without you having to
think too much about it. So Spark is available, a
lot of common libraries are available here, and I can
just really focus on my work. So in this example, I'm
gonna work on a problem that you're probably not working on, but I hope you can extrapolate and insert your problem here, but to make it make sense, let me explain what we're gonna look at. We're gonna try and
predict life expectancy as a function of health indicators from developed nations in the
past 10 or 20 years or so. And we're gonna grab some data from the World Health
Organization, the World Bank, and Our World in Data, and use it to build a model
that predicts life expectancy and see what that tells
us about life expectancy. And then we're gonna try
and move that to production, for example. So let's get started. We begin, as I said, in the
world of the data engineer, that's the first thing we have to do, we have to engineer the data. And the data here comes
in the form of CSV files, they're not that big. None of this would be any different if the data source were Parquet files or a database or whatever, Spark can read all of it. And I just wanna call out
that the data engineers, for example, could work
in a language like Scala, they can use SQL, they
can use Spark directly. And that does not mean that
data scientists have to. We'll see later that we're
gonna switch into Python even within the same notebook. And that's fine, you
can mix and match that within Databricks on top of Spark. No problem. So the details of the ETL are, frankly, not that interesting. I've never met a CSV file
that was perfectly formatted. And in this case, we have to
do a little bit of work here to massage the data, to get it into the right form here. We are trying to head for
a representation like this. So after a little bit of manipulation, a little bit of ETL,
we're getting something a little more like what we're gonna need for the modeling problem here, we're getting a table
that's country, year, and all these features and values. These are health indicator values, these are demographic values over time. And you can see even here that I've switched into SQL because I can easily query data frames that I'm creating with Spark, with Scala, in SQL, if I like. So I can, for example,
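As a rough sketch of that reshaping step, here's the same long-to-wide pivot in pandas; the demo does this in Spark with Scala, and the indicator names and values below are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical "long" records, one row per (country, year, indicator),
# roughly the shape a raw health-indicator export might arrive in
raw = pd.DataFrame({
    "Country": ["US", "US", "CA", "CA"],
    "Year": [2016, 2016, 2016, 2016],
    "Indicator": ["LifeExpectancy", "HealthSpendPctGDP"] * 2,
    "Value": [78.7, 17.2, 82.0, 10.5],
})

# Pivot to the "wide" shape the modeling step wants:
# one row per (country, year), one column per indicator
wide = (
    raw.pivot_table(index=["Country", "Year"], columns="Indicator", values="Value")
       .reset_index()
)
```

The same pivot can be expressed with Spark's `groupBy(...).pivot(...)` when the data is too big for one machine.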
also switch in that view, that table view into a
built-in visualization view. So same query, but here I'm
looking at life expectancy as a function of time and country. And this is using the
built-in visualization in Databricks based on Plotly. So you can see, for example, already that there's an outlier here. The life expectancy for one country here is low and declining, whereas others are higher and increasing, and this country is the United States. So you might wonder what's
different about the US that would explain that
difference in life expectancy. So the data engineers may continue here, they're gonna grab a different dataset. And you may have noticed along the way, I'm actually saving off some
of these health indicator codes and their descriptions as a table, because I'm gonna need that later to understand what
these column names mean. Let me come back to this in a second, because it's important
that I'm writing them as a Delta table. With more ETL on a second data source, we get it into a similar format: country, year, a bunch of values here, and we're gonna join
these a little bit later. Last but not least, we're
going to grab a dataset of drug overdose deaths
by country and year, because that may be relevant. And given these three data sets, the data engineering workflow
ends right about here. We're gonna combine
them all using SQL here into one dataset and write
that out as a Delta table. And I do wanna note that over here we've actually mixed and
matched the Spark APIs, SQL, and language-native UDFs. And that's easy to do in Spark, that's easy to do in Databricks. So it's just that whatever transformation you need to express, you can express it in
the most natural language that makes sense, and just get
on with creating the ETL job. But this, for this example, I'm gonna skip a little bit
past the data engineering and assume that this is
where the workflow ends. We joined all these tables, we have all these features, and we're gonna present
them to the data scientist for analysis, for modeling and so on. And we're gonna do this
by creating a Delta table, or writing this data as a Delta table, and we're actually registering
it as well in the metastore. So if we do this, we will see that in the data tab, here in Databricks, these several tables I'm writing show up, and that helps with
discoverability, for example. And if we clicked into, for example, the descriptions table, this is this lookup table I was writing. You'll see, for example, the schema, sure, sample
data, but also a history. So Delta tables, unlike most
data sets, can be updated, they can be modified, and as they are, there's a transaction
history that's created. So for example later, I could go back and query the results,
the state of this table as of a previous point in time, as of a previous version. And maybe that doesn't help too much for this particular lookup table, but it might make a big
difference for this input table, this so-called Silver table that we are creating here
for the data scientists. For example, with Delta the data scientists can come
back later and query the data as of two weeks ago, to see what the data looked like when they built their model. Delta also ensures that this
data set is transactional, so if this ETL job is running one night while the data science
inference job is running, no problem, that job is not gonna read the data in an inconsistent state, it's gonna read a consistent
version of the data. There's a couple of other
interesting things here, like for example, if the data engineering pipeline had a bug and we wrote some bad data
here into this input table, well, we could easily roll it back, and we could rerun the ETL
pipeline to update it as well. So although Delta is
just a storage engine, it still is valuable to data scientists because it offers some features that make this table, this interface between data engineering and data science, more reliable and more robust. So let's move on to the data science here. So the data scientists might pick up here, maybe in a separate notebook, although we're here in the same notebook and they could easily
use Python, for example. So we've switched to Python
here, and that's no problem. This data set is equally
readable as a table in PySpark. So we read this data, and maybe look at some
summary statistics here. We see that, for example, some of these features
are entirely missing, you know, they're present in no rows. Some of the features are
present only in a few of the years and countries as well. So we might figure we need
to do some imputations and inference of the missing values as part of our pipeline here. So let's go ahead and do
that as data scientists. Now we could do that
in PySpark, no problem, but I'm doing it in pandas here, just to make a point
that for data scientists who are used to pandas, and are used to the whole
Python ecosystem, no problem. You can use pandas here
pretty easily with Spark. And here for example, I'm
gonna take a Spark data frame, bring it down to one machine
as a pandas data frame, and then use pandas to
forward and backfill the missing values by country and by year. I could have done it in Spark, no problem, but I'm just making the point
that I can easily do that and then send it back to Spark. And this would not be a good
idea if the dataset was large, but if it's not, this
is totally legitimate. And I'll show you a couple of other ways that pandas interacts and
inter-operates with Spark later. So as data scientists, maybe
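That fill step can be sketched in plain pandas; the column names below are hypothetical stand-ins for the real features:

```python
import pandas as pd

# Hypothetical slice of the table: one row per (country, year),
# with gaps in a feature column
df = pd.DataFrame({
    "Country": ["US", "US", "US", "CA", "CA", "CA"],
    "Year": [2014, 2015, 2016, 2014, 2015, 2016],
    "LifeExpectancy": [78.9, None, 78.7, None, 81.9, None],
})

# Sort so the fills move through time, then forward- and back-fill
# within each country so values never bleed across countries
df = df.sort_values(["Country", "Year"])
df["LifeExpectancy"] = df.groupby("Country")["LifeExpectancy"].transform(
    lambda s: s.ffill().bfill()
)
```

Grouping before filling is the important part: a plain `ffill()` over the whole frame would let one country's last value leak into the next.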
the next step is a little bit of exploratory data analysis, maybe we wanna take this, a couple of key features here and look at how they correlate. We might wanna make some plots here, and here I'm gonna use Seaborn to create these various pair plots between a couple of key features. So Seaborn is a common library
for visualization in Python, and it's already built into
the machine learning runtime in Databricks. You don't have to install it, if it wasn't installed, it
would be easy to install, it's no problem, but things like Seaborn that you might commonly
use are just already there, they just work. So I create this pair plot
and we see things like, well, maybe GDP here is correlated with spending on healthcare,
that makes sense. And we also see that deaths
from opioid overdoses is, well, there's a clear
outlying trend here, and this series here in
brown is the United States. So we might file that away as well, and maybe that's an explanation of why the US life expectancy is lower, we need to investigate that. Moving along, maybe there's
a little more featurization that has to happen
between the Silver table and the final Gold
table of featurized data that we're gonna build a model on, that we're going to run the model on in production as well to do inference. And here there's just a
little bit of hand coding that has to happen, so
it's not very complicated, but you can imagine this
could be more complicated in general. And so this might be an
additional data pipeline you need to run repeatedly
to get from Silver data to this Gold table of featurized data, that's not just what
you built your model on, but what you can apply the
model to in production. And we're gonna register that
as a Delta table as well. So now we're almost ready
for a bit of modeling. So before we get to the modeling, I wanna introduce one other
library here, called Koalas. And this is for the
pandas users out there. So pandas is a common
data manipulation tool in the Python ecosystem. And for those that know pandas, but want to use the power of Spark to distribute computation
across a cluster, that's where Koalas comes in. So you can import Koalas, and load data from Spark
as a Koalas data frame, and then you get an object
that is manipulatable as if it's a pandas data frame, you can apply pandas syntax to it. But all these operations
are really happening on this distributed dataset. So here, I'm just doing
something pretty simple. I am splitting the data by
year into a train and test set, so we're gonna train up to 2014, and we're gonna test after 2014, but you could do more
elaborate things with Koalas. But here I'm just using it to show that you can use pandas-like syntax to manipulate data, even on top of Spark. But this data set is
actually pretty small, so we're gonna realize it as a pandas dataset and
move on to modeling. Now you might wonder, hm, isn't that interesting? I thought Databricks was about scale, about Spark, and
distributed cluster computing. And this dataset is small enough that building a model on it does not really require a cluster. But that doesn't mean
Spark's irrelevant here. So for example, given
this pandas data frame, we could write a little bit
of code here to train a model, a regressor with XGBoost, to predict life expectancy as a function of these 1,600 features
or so, and that's great. That does not need Spark though, because this dataset is small; XGBoost can handle it perfectly well on its own. But we don't build just one model. When we build models, we
really build hundreds of them, because we're not sure
what the best settings of all these hyper-parameters are. Maximum depth, learning rates,
regularization, and so on. And normally we would use scikit-learn, we'd use spark.ml, to do a grid search, or a random search, over all
these parameter combinations, but really no matter what, we're building hundreds of models. And that can happen in parallel, and because that can happen in parallel, can happen on a Spark cluster. So that's why, for example in Databricks, we ship and support a
tool called Hyperopt, it's an open source tool
for parameter tuning. So Hyperopt can help us figure out what the best values are of these hyper-parameters, we give it ranges and let it explore. And there's three things
that make Hyperopt appealing in this case. Number one, it's a Bayesian optimizer, so it conducts an intelligent
search over the space, it's not gonna bother looking
at parameter combinations that haven't worked out well, it's gonna focus its attention on the combinations that
seem to be working well and do a deeper dive. Second thing it can do
is integrate with Spark. So for example, if we use
the Spark integration here then when we run this, it's going to actually use the Spark cluster to build these models in parallel, not necessarily on the driver, and we get some speed up there, even though the models themselves, these XGBoost models do not
know about Spark at all. So there's still some value for Spark, even if your modeling process
does not use it directly. So as we run a tool like Hyperopt and try to let it optimize,
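In spirit, the tuning loop looks something like this plain-Python random search. It's only a stand-in for Hyperopt, whose Bayesian search picks candidates adaptively rather than at random, and the toy loss below is hypothetical, where a real objective would train an XGBoost model and return its validation loss:

```python
import random

random.seed(0)  # deterministic for the sketch

def objective(params):
    # Hypothetical stand-in loss: pretend depth 6 and learning rate 0.1
    # are ideal; a real objective would fit a model and score it
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2

best_params, best_loss = None, float("inf")
for _ in range(100):
    # Draw a candidate from the search space
    params = {
        "max_depth": random.randint(2, 10),
        "learning_rate": random.uniform(0.01, 0.3),
    }
    loss = objective(params)
    # Keep the best so far: this is the "best loss going down" you watch
    if loss < best_loss:
        best_params, best_loss = params, loss
```

Each iteration of that loop is independent, which is exactly why Hyperopt's Spark integration can farm the candidates out across a cluster.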
find the best model, now we'll see that its best
loss goes down, down, down. It's trying different hyper-parameters, and the best loss, lower is better,
is going down, down, down. And the other interesting
thing this does over time is to record what it does with MLflow. So as Hyperopt runs, it's actually logging
everything it does to MLflow. MLflow is this open source model tracking and deployment framework from Databricks. And the good news about MLflow is it integrates with a lot of tools and it can do a lot of
stuff for you automatically. So here we really didn't have
to do anything with MLflow, and yet we got all these results here in the experiment for the notebook. I think this is more
interesting if we pop this out, let me switch over here. And so we can see here, for example, all the 96 or so runs that MLflow created here. We can click into any of them and we can see, for
example, in this instance, for some hyper-parameters, it achieved a certain
loss value, and so on. This view is not that interesting, I think for parameter tuning, it's more interesting to, for example, select all these and compare all the 96 runs that Hyperopt created. And we can do this with MLflow, so what you're seeing right now is the MLflow tracking server, it's integrated into Databricks, you can run it on your own though, this is actually open source. And we might do a parallel coordinate plot and see, for example, that some of the best runs
here with the lowest loss, we can see what their
hyper-parameters look like, and maybe, you know, figure out
min child weights and so on. But back to the story here, we might take the results of this process, the best hyper-parameters and go ahead and run one more model, and log it with MLflow, on all the data with the
best hyper-parameters. And we're gonna do it manually here, but it's really not that hard. So even if you do it
manually, you import MLflow, you start a run, that's
the end of logging, you build your model, you log whatever parameters you want, and you log the model, which
is really important here, we're logging this XGBoost booster, and maybe here we also log
a feature explanation plot by a tool called SHAP. I'm gonna show you the plot later. I will tell you that by
the time you see this, I believe MLflow version 1.12 will support doing this automatically. So even though I had to write
a bit of code to do this, you will get the plot I'm
showing you now for free. So if we do this, we get this run here. And we can see we logged, you know, what were the best parameters here, and also the artifact itself, some of the environment requirements, so this requires xgboost==1.1.1 in MLflow. We even logged an example
of the input to the model, which I'll show you in a minute. And we logged this plot here, this feature explanation plot, which I will actually probably show you in a second here in the notebook. So we could declare victory and send this artifact
here, this pickled file, to some dev ops person to
deploy and wish them good luck, but we probably want a little
more process around that. So getting to production means managing the
promotion process at least. So that's where the
model registry comes in, and if you see the
models tab in Databricks, that's the model registry. And so this particular model is probably just the latest instance of many different models that we've built that implements this service
we're trying to build, this life expectancy predictor. And so we've registered this model, called gartner_2020, this is for Gartner. And we can see, for
example, in our notebook, all the various versions of this model that have existed over time. And so let's just focus on the active one. So as we start this example, maybe version 30 is the
current production model, and we've just created version 31, and we've registered that as
the current staging candidate of this model. Maybe we think we've done better here, and we wanna propose this as maybe the next production model. So that's what we do here with MLflow. Now, a couple of things could happen here. Maybe an automated process takes over, a web hook can trigger that to run a test on the staging candidate
and if the model looks good, we flip the bit and
promote it to production. But maybe, certainly for
narrative purposes here, maybe we wanna do a
manual sign-off process. So a data science manager could come in, load the latest staging candidate from the registry, and unpack those artifacts
and take a look at the plots. Just to answer the question, according to SHAP, this model says that the factor that most influences
predicted life expectancy is mortality rate from
cancer and diabetes. And for countries and years where that mortality is high, life expectancy is lower by about almost one and a half years, and where it's low, it's
higher by almost a year, which you know, makes sense, of course. But what we don't see in
here is drug overdoses, so apparently that's not
really an explanatory factor, it's really just the deaths
from cancer and diabetes, darkly. So after some more analysis, maybe the data science
manager decides that's fine, we can promote this to production. You can do that with the API, you can do it with the UI as well. So I could've gone in here and, you know, promoted this model to production. And of course this is all permissionable, so I can control who is allowed
to register new versions, who's allowed to transition them, who's allowed to transition
to production and so on. So let's get to production then. So normally, production's
kind of a hard part. So we have a model that
someone created in the lab and they send it over to dev ops engineers to implement in production, and that production environment
may be totally different. So one thing MLflow does really well is to try to translate
that model that you built in the lab to an artifact that's
usable in production. So we built an XGBoost booster, but what we probably need, maybe for a batch scoring
of the model is a Spark UDF, and MLflow can do that for you. So for example, this could
be my production job, this cell here. I load from the MLflow registry, the latest production
version of this model, not as an XGBoost booster, but as a Spark UDF, this becomes a function
that I can apply to data with Spark at scale, and then I read the latest featurized data that's landed in my Gold table, maybe now some data has
arrived for 2017 and 18, and I apply it with Spark and that's it, it's not more complicated than that. And I can, for example, join this with the actuals
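In spirit, that production cell is just "load a model, apply it to the newest rows." Here's a minimal pandas stand-in, where the trivial model function, feature name, and numbers are all hypothetical; in the demo the callable comes from the MLflow registry as a Spark UDF and runs at scale:

```python
import pandas as pd

# Hypothetical stand-in for a model loaded from the registry:
# any callable mapping features to a prediction
def predict_life_expectancy(gdp_per_capita: float) -> float:
    return 70.0 + 2.0 * (gdp_per_capita / 50000.0)

# Hypothetical "latest featurized data" that landed after training
gold = pd.DataFrame({
    "Year": [2017, 2018],
    "GDP_per_capita": [53000.0, 55000.0],
})

# Batch scoring: apply the model to every new row
gold["predicted_life_expectancy"] = gold["GDP_per_capita"].map(predict_life_expectancy)
```

Swapping the toy function for a UDF returned by MLflow is what lets the same one-liner score a table of any size.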
from previous years and recreate this plot
I showed you earlier to show the extrapolation here. So these are the predicted
life expectancies according to the model for 2017 and 2018. And this would not have been different if this were a streaming job, no problem, if data's arriving in a stream, you could also do the same thing. This could also be a SQL UDF as well. So you can load these models as Spark UDFs that are usable in Spark SQL as if they're SQL functions as well. Another interesting thing you can do with models is deploy them
as real-time services. So you can have MLflow
wrap up your model in a Docker container, which exposes it as a REST API that responds to JSON-formatted requests: you send it inputs,
you get an output back. And that can be deployed to
your choice of cluster manager, Kubernetes, Azure ML, SageMaker, but it's also deployable in Databricks. So I've done that here with my model, let me open the serving tab here. And so for testing or for low-volume usage, you can actually just
deploy this within Databricks as a REST endpoint here, and here's the URLs, and I can grab, let me just
quickly grab a bit of inputs, JSON formatted input that
would work in this model, and I believe, I hope this'll work. Yes. So given that first line of input in the correct JSON format,
I can send it to the model and it gives me back a
particular life expectancy of about almost 80 years here. So if you wanna deploy
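For reference, a request body for such an endpoint can be sketched like this. The feature names are hypothetical, and the exact accepted layout varies by MLflow version; the "split" dataframe orientation shown here is one commonly accepted shape:

```python
import json

import pandas as pd

# One hypothetical row of model inputs
row = pd.DataFrame([{"Year": 2016, "GDP_per_capita": 52000.0}])

# Serialize the dataframe in pandas' "split" orientation
# (column names plus rows of values) and wrap it as a JSON request body
payload = json.dumps(
    {"dataframe_split": json.loads(row.to_json(orient="split"))}
)
```

You'd POST that string to the endpoint URL with a `Content-Type: application/json` header and read the predictions back from the JSON response.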
your models as a service, no problem, MLflow can do that for you. Last but not least, maybe production for you
means not a batch scoring job, not a REST API, but some kind of dashboard
for business users. So maybe we wanna give to business users some tool where they can
explore what the model's saying and maybe evaluate some what-if scenarios. So let's take a look
at what life expectancy would've looked like in
the US over the years had that mortality rate
from cancer and diabetes been different. So if it were low or high, how would that have affected the life expectancy
according to the model? So we can write some code
here to load the model to run this series of predictions to create this heat map here, but in Databricks, we can also export this as a dashboard, like this. And maybe this is easier to
share with business users, it's simpler than the whole notebook. But we can also instrument this
with some widgets, as here, and we can make the plot
update as we update the widget. So maybe you wanna drill a little more, let me change this to 17, and as I change it, it reruns the cells, it reruns all the scenarios
and re-plots this, and I see a slightly different plot here. So that really concludes my
demo here from end-to-end. We took a couple of datasets, we did some feature engineering
with Spark in Databricks. We managed those features with Delta, we did data science and used some third
party plotting libraries and so on to explore the data. We used tools like Hyperopt to parallelize our model selection process with XGBoost on top of Spark. And once we got that best model, we logged it with MLflow and used the registry to manage its transition from, say, test or staging to production. And once we got to production, we showed how you can grab that model and simply deploy it as a batch job, as a streaming job, as a REST API, or even maybe as a dashboard as well. So I hope this has been helpful, I hope that you can see
how this might apply, how this pattern might apply
to whatever models you are going to build, and I hope you try MLflow
and get started today. Thank you.