Data Science and Machine Learning on Databricks Demo

Captions
(dramatic music) - [Instructor] Welcome to Databricks. The Lakehouse is a simple and open data platform for storing and managing all of your data that supports all of your analytics and AI use cases. It's where data scientists, data engineers, and analysts collaborate to prepare and analyze data, build models, and deploy them to production. Today, we will focus on the data scientists, who, as you will see here, are trying to explain life expectancy from health indicator data. And along the way, we'll see how our platform supports the whole life cycle from data ingest to production. The notebook-based environment you see here is Databricks'. You can edit, run, and share code, documentation, visualizations, and output in many languages. We will look at the problem in three parts: data access and preparation, modeling and interpretation, and finally deployment.

We're only going to briefly review the work of the data engineer in Databricks here. Her goal is to make raw data usable. This includes fixing errors in the input, standardizing representations, and joining disparate data to produce tables ready for use by modelers and analysts. Here, the inputs are just CSV files from the World Health Organization and World Bank, plus an additional drug overdose data set that we included to explore as a factor. They can be read directly, as is, from distributed storage as Spark DataFrames. Note that these data sources could just as easily have been a SQL database, JSON files, Parquet files, and so on. The schema is automatically read, or inferred in the case of CSV. These files contain about 2,000 different health and demographic features for 16 developed countries over the last several decades. The goal will be to learn how these features predict life expectancy, and which one or two are most important.

The data engineers might prefer to work in SQL and Scala, as here. Databricks supports these in addition to Python and R, which the data scientists might prefer. All may be used even within one notebook. And along the way, even the data engineers can query the data with SQL and see the results with built-in visualizations, like this one. They might want to take an early look at the data. And we see clearly that the trend in life expectancy between 2000 and 2016 looks different for the USA. It's low and declining, and the question will be, why?

So at the end of the data engineering workflow, the three data sources are joined on country and year, and written as a registered Delta Lake table. To the data scientist this representation doesn't make much direct difference; it's just a table that can be read like any other data source. But Delta Lake provides transactional writes and lets the data engineers update and fix data from a bad feed, gracefully modify the schema, or enforce certain constraints on the data. And that does matter to the data scientists. Anyone who has faced modeling on dumps of data from a database that might break or change in subtle or silent ways will appreciate the real data science problems that these factors can cause downstream, and that Delta Lake helps fix. Or, for example, consider the need to recall exactly what data was used to produce a model, for governance or reproducibility. Managing the data as a Delta Lake table allows the data scientists to query the data as of a previous point in time, and this helps with reproducibility, of course. Note that everyone can collaborate on this notebook, perhaps leaving comments like this one.
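As a rough sketch of what this data engineering hand-off might look like in PySpark (the storage path and table names here are illustrative, not taken from the demo):

```python
# On Databricks, a preconfigured SparkSession is available as `spark`.

# Read one of the raw CSV sources directly from distributed storage,
# letting Spark infer the schema.
who_df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("/mnt/demo/who_health_indicators.csv"))  # illustrative path

# After the three sources are joined on country and year (join not shown),
# the result is written as a registered Delta Lake table.
who_df.write.format("delta").mode("overwrite").saveAsTable("life_expectancy_input")

# Delta Lake time travel: query the table as of an earlier point in time,
# which is what makes it possible to reproduce exactly the data a model saw.
historical = spark.sql(
    "SELECT * FROM life_expectancy_input TIMESTAMP AS OF '2021-01-01'"
)
```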
So here we pick up in the world of the data scientist. Her goal is to explore the data, further enrich and refine it for this specific analysis, and produce another table of featurized data for modeling and production deployment. The collaboration begins by simply reading the table of data that the data engineers produced. This notebook uses Python, whose ecosystem may be more familiar and useful for data science. The Spark API is the same, however, and the data tables are available in exactly the same way. Now, you don't have to use only Spark on Databricks. For example, here we drop into pandas to fill in some missing values before returning to Spark. You can define efficient UDFs, or user-defined functions, for Spark that leverage pandas too.

The Python ecosystem offers libraries for visualization too, and with Databricks, likewise, you can just use these. Common libraries like Matplotlib and Seaborn are already built into the ML Runtime, and others can be added easily. Here the data scientist uses Seaborn to generate a pair plot of a few of the features, such as life expectancy, literacy rate, and opioid deaths per capita. It can highlight correlations, such as that between per capita expenditure on healthcare and GDP. And it also reveals a clear outlier. The outlier here with respect to opioid deaths is going to turn out to be the United States again. After some additional featurization, the data is written as another Delta Lake table.

Now, the data set is small enough that it's possible to manipulate and model with tools like XGBoost and pandas, all of which are already available in the runtime. But Spark will still be relevant in a moment. The data scientist is here developing modeling code using XGBoost. It will regress life expectancy as a function of the 1,000 or so features left in the input. But building a model really means building hundreds of them in order to discover the optimal settings of the model's hyperparameters. Typically, a data scientist might perform a grid or random search over these values and wait hours while the platform crunches through each of them serially. Databricks, however, provides a Bayesian optimization framework called Hyperopt in its runtime, which is one modern and efficient way, among others, to perform the search in parallel. Given a search space, Hyperopt runs variations on the model in parallel across a cluster, learning as it goes which settings give increasingly better results. Hyperopt can run these variations in parallel using Spark even though the simple modeling here doesn't need Spark itself. This parallelism can dramatically lower the wall-clock time that these hyperparameter searches take.

The results are automatically tracked using MLflow in Databricks. This gives a quick view of the modeling runs that were created in this notebook, and we can drill into the experiment view. This is the MLflow tracking server, and this experiment shows a more detailed overview of these trials. We can use this to search through runs and even compare them, as here. For example, we might want to compare all the runs using a parallel coordinates plot, and in this way figure out which combinations of hyperparameters seem to produce the best loss. This detail view also shows, for example, who created the model and with what revision of the notebook, when it was created, and all the details of the hyperparameters, including loss. It also includes the model itself, along with feature importance plots that the data scientist created.
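As a minimal sketch, the Hyperopt search described above might look something like this; the search space, trial counts, and the pandas/NumPy splits `X_train`, `y_train`, `X_val`, `y_val` are illustrative assumptions, not values from the demo:

```python
import xgboost as xgb
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.metrics import mean_squared_error

# Hyperparameter search space (ranges are illustrative).
search_space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, -1),
    "min_child_weight": hp.loguniform("min_child_weight", 0, 3),
}

def objective(params):
    # Train one XGBoost regressor for a single hyperparameter combination
    # and report its validation loss back to Hyperopt.
    model = xgb.XGBRegressor(
        n_estimators=200,
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        min_child_weight=params["min_child_weight"],
    )
    model.fit(X_train, y_train)
    loss = mean_squared_error(y_val, model.predict(X_val))
    return {"loss": loss, "status": STATUS_OK}

# SparkTrials fans the trials out across the cluster; on Databricks these
# runs are tracked automatically in MLflow.
best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,   # Tree-structured Parzen Estimator, a Bayesian-style search
    max_evals=64,
    trials=SparkTrials(parallelism=8),
)
```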
The model can now be registered with the Model Registry as the current staging candidate for further analysis. The Model Registry is a centralized repository of logical models managed by Databricks. It manages artifacts and versions of this model, and manages their promotion through staging to production. This particular model has a few versions registered as versions of the same logical model, and they can exist in states like Staging and Production. The Model Registry is also accessible in the left navigation bar. Instead of managing models as, for example, a list of coefficients written down in some file, or a pickle file stored on a shared drive, the Model Registry puts a more formal workflow around tracking not just the artifacts of the model but which ones are ready for which stage of production deployment.

Next, a manager might review the model and the plots. Here is a feature importance plot created with SHAP. It shows that the feature that most influences predicted life expectancy is mortality from cancer, diabetes, and heart disease. Low mortality, in blue, indicates higher life expectancy and appears to explain roughly plus one to minus 1.5 years of life expectancy. Year is next most important, which unsurprisingly captures many of the cumulative effects of better health over time. Notice that drug-related deaths do not appear to be a top explanatory feature overall; other diseases still dominate. After reviewing the model and other plots, the manager might finally approve this model for production.

The deployment engineer takes over here. She loads the latest production model from the Model Registry, the one that has been approved for deployment, and MLflow automatically converts it, if desired, to a Spark UDF, or user-defined function. This means it can be applied to featurized data at scale with Spark, with just one line of code, as sketched below. This could work equally well in a batch scoring job or in a streaming job. For a second, compare this to, for example, handing the model, or even just the coefficients, to a software engineer to attempt to correctly reimplement as code that could be run in production. Here, the exact model that the data scientist created is made available to production engineers, and nothing is lost in translation. In this production job, for example, the model is applied to inputs from 2017-2018, for which life expectancy figures are not known. This can be joined with data up to 2016 to complete the plot we saw earlier, showing the extrapolated trends. Notice that this could be registered not only as a UDF in Python but also in SQL. Note that the predictions appear fairly flat, but this is mostly because over half the feature data is missing in later years. The model could also have been deployed as a REST API or as a service in Amazon SageMaker or Azure ML.

So this has been a simple example of how Databricks can power the whole life cycle, from data engineering through to data science and modeling, and finally the MLOps tasks like model management and deployment.
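A minimal sketch of that deployment step, assuming a registered model named `life-expectancy` with an approved version in the `Production` stage and a featurized table named `life_expectancy_features` (both names are illustrative):

```python
import mlflow.pyfunc
from pyspark.sql.functions import col, struct

# Load the Production-stage model from the Model Registry as a Spark UDF.
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/life-expectancy/Production",  # illustrative registry name
    result_type="double",
)

# Score featurized rows (for example, the 2017-2018 inputs) with one line of Spark.
features_df = spark.table("life_expectancy_features")  # illustrative table name
feature_cols = [c for c in features_df.columns if c not in ("country", "year")]
scored = features_df.withColumn(
    "predicted_life_expectancy",
    predict_udf(struct(*[col(c) for c in feature_cols])),
)

# The same UDF can typically also be registered for use from SQL.
spark.udf.register("predict_life_expectancy", predict_udf)
```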
Info
Channel: Databricks
Views: 6,837
Rating: 5 out of 5
Keywords: Databricks, MLOps, MLflow, Product, Demo
Id: R0SkPiulseo
Length: 10min 56sec (656 seconds)
Published: Tue Feb 02 2021