(dramatic music)
- [Instructor] Welcome to Databricks. The Lakehouse is a simple and open data platform for storing and managing all of your data, and it supports all of your analytics and AI use cases. It's where data scientists, data engineers, and analysts collaborate to prepare and analyze data, build models, and deploy them to production. Today, we will focus on the data scientists, who, as you will see here, are trying to explain life expectancy from health indicator data. And along the way, we'll see how our platform supports the whole life cycle, from data ingest to production. The notebook-based environment you see here is Databricks'. You can edit, run, and share code, documentation, visualizations, and output in many languages. We will look at the problem in three parts: data access and preparation, modeling and interpretation, and finally deployment.
We're going to only briefly review the work of the data engineer in Databricks here. Her goal is to make raw data usable. This includes fixing errors in the input, standardizing representations, and joining disparate data to produce tables ready for use by modelers and analysts. Here, the inputs are just CSV files from the World Health Organization and the World Bank, plus an additional drug overdose data set that we included to explore as a factor. They can be read directly, as is, from distributed storage as Spark DataFrames. Note that these data sources could just as easily have been a SQL database, JSON files, Parquet files, and so on. The schema is automatically read, or inferred in the case of CSV.
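In code, that ingest step might look roughly like this; the storage paths and file names are placeholders rather than the exact ones used in the demo.

```python
# Read the raw CSV inputs from distributed storage as Spark DataFrames.
# Paths and file names below are hypothetical stand-ins.
life_raw = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")   # schema is inferred for CSV sources
    .csv("/mnt/demo/who_life_expectancy.csv"))

overdose_raw = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/mnt/demo/drug_overdose.csv"))

display(life_raw)   # Databricks' built-in table and chart rendering
```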
These files contain about 2,000 different health and demographic features for 16 developed countries over the last several decades. The goal will be to learn how these features predict life expectancy, and which one or two are most important.
So the data engineers might prefer to work in SQL and Scala, as here. Databricks supports these in addition to Python and R, which the data scientists might prefer. All may be used even within one notebook. And along the way, even the data engineers can query the data with SQL and see the results with built-in visualizations, like this one.
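A query like the one shown might look roughly like this; the table and column names are hypothetical, and in a notebook this could just as easily be a %sql cell.

```python
# Run a SQL query from Python and render it with the built-in visualization.
trend = spark.sql("""
  SELECT country, year, AVG(life_expectancy) AS life_expectancy
  FROM demo.life_expectancy_raw
  GROUP BY country, year
  ORDER BY country, year
""")
display(trend)   # choose a line chart from the plot options to see the trend
```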
They might want to take an early look at the data. And we see clearly that the trend in life expectancy between 2000 and 2016 looks different for the USA: it's low and declining, and the question will be, why?
So at the end of the data engineering workflow, the three data sources are joined on country and year and written as a registered Delta Lake table.
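A minimal sketch of that last step, assuming the three sources have already been cleaned into DataFrames; the DataFrame, table, and join-key names are illustrative.

```python
# Join the three cleaned sources on country and year.
joined = (who_df
    .join(world_bank_df, on=["Country", "Year"])
    .join(overdose_df,   on=["Country", "Year"]))

# Write the result as a registered Delta Lake table that others can simply read.
(joined.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("demo.life_expectancy"))
```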
To the data scientist, this representation doesn't make much direct difference. It's just a table that can be read like any other data source. But Delta Lake provides transactional writes, and lets the data engineers update and fix data from a bad feed, gracefully modify the schema, or enforce certain constraints on the data.
And that does matter to the data scientists. Anyone who has faced modeling on dumps of data from a database that might break or change in subtle or silent ways will appreciate the real data science problems that these factors can cause downstream, and that Delta Lake helps fix. Or, for example, consider the need to recall exactly what data was used to produce a model, for governance or reproducibility. Managing the data as a Delta Lake table allows the data scientists to query the data as of a previous point in time, and this helps with reproducibility, of course.
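Delta Lake time travel makes that query look roughly like this; the table name, timestamp, and version number are placeholders.

```python
# Query the table as it existed at an earlier point in time.
past_df = spark.sql("""
  SELECT * FROM demo.life_expectancy TIMESTAMP AS OF '2019-06-01'
""")

# Or, equivalently, pin to a specific table version:
# SELECT * FROM demo.life_expectancy VERSION AS OF 12
```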
Note that everyone can collaborate on this notebook, perhaps leaving comments like this one. So here we pick up from the world of the data scientist. Her goal is to explore the data, further enrich and refine it for this specific analysis, and produce another table of featurized data for modeling and production deployment. The collaboration begins simply: read the table of data that the data engineers produced. This notebook uses Python, and that ecosystem may be more familiar and useful for data science. The Spark API is the same, however, and the data tables are available in exactly the same way.
Now, you don't have to use just Spark and Databricks alone. For example, here we drop into pandas to fill in some missing values before returning to Spark. You can also define efficient UDFs, or user-defined functions, for Spark that leverage pandas.
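As a rough sketch of both ideas, with the table, fill strategy, and column names as assumptions rather than the demo's actual choices:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Drop into pandas to fill missing values, then return to Spark.
pdf = spark.table("demo.life_expectancy").toPandas()
pdf = pdf.ffill()                       # e.g., forward-fill missing values
filled_df = spark.createDataFrame(pdf)

# A pandas UDF applies vectorized pandas logic inside Spark at scale.
@pandas_udf("double")
def per_capita(value: pd.Series, population: pd.Series) -> pd.Series:
    return value / population

with_rate = filled_df.withColumn(
    "opioid_deaths_per_capita", per_capita("opioid_deaths", "population")
)
```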
The Python ecosystem offers libraries for visualization too, and with Databricks, likewise, you can just use these. Common libraries like Matplotlib and Seaborn are already built into the ML Runtime, and others can be added easily. Here the data scientist uses Seaborn to generate a pair plot of a few of the features, such as life expectancy, literacy rate, and opioid deaths per capita. It can highlight correlations, such as that between per capita expenditure on healthcare and GDP. And it also reveals a clear outlier. The outlier here with respect to opioid deaths is going to turn out to be the United States again.
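A pair plot like that might be produced roughly as follows, assuming features_df is the enriched Spark DataFrame and with illustrative column names.

```python
import seaborn as sns

# Pull a handful of features into pandas and plot pairwise relationships.
cols = ["life_expectancy", "literacy_rate", "opioid_deaths_per_capita",
        "health_expenditure_per_capita", "gdp"]
pdf = features_df.select(cols).toPandas()
sns.pairplot(pdf)   # correlations and outliers stand out visually
```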
After some additional featurization, the data is written as another Delta Lake table. Now, the data set is small enough that it's possible to manipulate and model it with tools like XGBoost and pandas, all of which are already available in the runtime.
But Spark will still be relevant in a moment. The data scientist is here developing modeling code using XGBoost. It will regress life expectancy as a function of the 1,000 or so features left in the input. But building a model really means building hundreds of them, in order to discover the optimal settings of the model's hyperparameters. Typically, a data scientist might perform a grid or random search over these values and wait hours while the platform crunches through each of them serially. Databricks, however, provides a Bayesian optimization framework called Hyperopt in its runtime, which is one modern and efficient way, among others, to perform the search in parallel. So given a search space, Hyperopt runs variations on the model in parallel across a cluster, learning as it goes which settings yield increasingly better results.
Hyperopt can run these variations in parallel using Spark, even though the simple modeling here doesn't need Spark itself. This parallelism can dramatically lower the wall-clock time that these hyperparameter searches take.
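A sketch of what such a search might look like; the search space, data splits (X_train, y_train, X_val, y_val), and parallelism are assumptions for illustration, not the demo's exact settings.

```python
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# A hypothetical search space over a few XGBoost hyperparameters.
search_space = {
    "max_depth":     hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "n_estimators":  hp.quniform("n_estimators", 50, 500, 25),
}

def objective(params):
    # Train one candidate model and report its validation loss to Hyperopt.
    model = xgb.XGBRegressor(
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        n_estimators=int(params["n_estimators"]),
    )
    model.fit(X_train, y_train)
    loss = mean_squared_error(y_val, model.predict(X_val))
    return {"loss": loss, "status": STATUS_OK}

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,                   # Bayesian (TPE) search, not grid/random
    max_evals=96,
    trials=SparkTrials(parallelism=8),  # run trials in parallel across the cluster
)
```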
The results are automatically tracked using MLflow in Databricks. This gives a quick view of the modeling runs that were created in this notebook, and we can drill into the experiment view. This is the MLflow tracking server, and this experiment shows a more detailed overview of these trials. We can use this to search through runs and even compare them, as here. For example, we might want to compare all the runs using a parallel coordinates plot and, in this way, figure out which combinations of hyperparameters seem to produce the best loss.
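The same comparison can also be done programmatically against the tracking server. This is a rough sketch, assuming a recent MLflow version, that the metric was logged as loss, and with a placeholder experiment path.

```python
import mlflow

# Search the experiment's runs and sort by the logged loss metric.
runs = mlflow.search_runs(
    experiment_names=["/Users/someone@example.com/life-expectancy"],
    order_by=["metrics.loss ASC"],
)
print(runs[["run_id", "metrics.loss", "params.max_depth", "params.learning_rate"]].head())
```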
The detail view also shows, for example, who created the model and with what revision of the notebook, when it was created, and all the details of the hyperparameters, including the loss. It also includes the model itself, along with feature importance plots that the data scientist created. The model can now be registered with the model registry as the current staging candidate model for further analysis.
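Registering a run's model and moving it to the Staging stage might look roughly like this; the run ID, artifact path, and registered model name are placeholders.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged in a tracked run under a logical model name.
model_uri = f"runs:/{run_id}/model"
result = mlflow.register_model(model_uri, "life-expectancy")

# Promote that version to Staging for review.
client = MlflowClient()
client.transition_model_version_stage(
    name="life-expectancy",
    version=result.version,
    stage="Staging",
)
```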
The model registry is one centralized repository of logical models that are being managed by Databricks. It manages artifacts and versions of this model, and manages their promotion through staging to production. This particular model has a few versions registered as versions of the same logical model, and they can exist in states like Staging and Production. The model registry is also accessible in the left navigation bar.
Instead of managing models as, for example, a list of coefficients written down in some file, or a pickle file stored on a shared drive, the model registry puts a more formal workflow around tracking not just the artifacts of the model, but which ones are ready for which stage of production deployment.
So next, a manager might review the model and the plots. Here is a feature importance plot created by SHAP.
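A plot like this might be produced with the shap library roughly as follows; model and X here stand in for the fitted XGBoost regressor and its feature matrix.

```python
import shap

# Explain the tree model's predictions and summarize feature impact.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)   # features ranked by impact on predicted life expectancy
```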
The plot shows that the feature that most influences predicted life expectancy is mortality from cancer, diabetes, and heart disease. Low mortality, in blue, indicates higher life expectancy, and the effect appears to span from about plus one to maybe minus 1.5 years of life expectancy. Year is next most important, which unsurprisingly captures many of the cumulative effects of better health over time. So notice that drug-related deaths do not appear to be a top explanatory feature overall; other diseases still dominate.
So after reviewing the model and the other plots, the manager might finally approve this model for production. The deployment engineer takes over here.
She loads the latest production model from the model registry, the one that has been approved for deployment, and MLflow automatically converts it, if desired, to a Spark UDF, or user-defined function. This means it can be applied to featurized data at scale with Spark, with just one line of code.
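That step might look roughly like this; the registry URI, table name, and the way the feature columns are selected are illustrative assumptions.

```python
import mlflow.pyfunc

# Load the Production-stage model from the registry as a Spark UDF.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/life-expectancy/Production")

# Score a DataFrame of featurized inputs at scale.
features_recent = spark.table("demo.features").filter("year >= 2017")
scored = features_recent.withColumn(
    "predicted_life_expectancy",
    predict_udf(*features_recent.drop("country", "year").columns),
)
```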
Now, this could work equally well in a batch scoring job or in a streaming job. For a second, compare this to, for example, handing the model, or even just the coefficients, to a software engineer to attempt to correctly reimplement as code that could be run in production.
Here, the exact model that the data scientist created is made available to production engineers, and nothing is lost in translation. So in this production job, for example, the model is applied to inputs from 2017-2018, for which life expectancy figures are not known. And this can be joined with the data up to 2016 to complete the plot we saw earlier, showing the extrapolated trends. Notice that this could be registered not only as a UDF in Python, but in SQL as well.
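A sketch of the SQL side, assuming the UDF returned by mlflow.pyfunc.spark_udf can be registered directly and with placeholder table and feature column names.

```python
# Make the model callable from SQL under a registered function name.
spark.udf.register("predict_life_expectancy", predict_udf)

preds = spark.sql("""
  SELECT country, year,
         predict_life_expectancy(mortality_rate, health_expenditure, gdp) AS predicted
  FROM demo.features
  WHERE year >= 2017
""")
```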
Note that the predictions appear fairly flat, but this is mostly because over half of the feature data is missing in later years. This model could also have been deployed as a REST API, or as a service in Amazon SageMaker or Azure ML.
So this has been a simple example of how Databricks can power the whole life cycle, from data engineering through data science and modeling, and finally the MLOps tasks like model management and deployment.