- [Facilitator] We're gonna
go ahead and get started. Welcome back to part four of How to Build a Cloud Data Platform: Machine Learning and
Business Intelligence. We will again have live Q&A
throughout the presentation and at the end, so feel
free to ask questions in the Q&A chat box which
is down at the bottom of your Zoom player. Furthermore, we've dropped
in yesterday's on-demand session link in the chat box
so you can access it there. All of the videos will also be on YouTube. So you can find the sessions
for days one, two and three on YouTube. Sessions one and two are up
and we are uploading three now. But eventually all four
will be on YouTube. So you can access the
previous day's content there. We will also be sending
a survey as a follow up. So please go ahead and take the survey. Let us know what you thought of the course and what you'd like to see in the future. So without further ado, I will pass it to lead
instructor for Databricks, Doug Bateman, who will take
you through today's session. Doug, take it away. - [Doug] All right. Thank you, Kayla. So welcome to part four of how to build a cloud big data platform for business intelligence
and machine learning. Today, we are gonna be focusing
on the business intelligence and machine learning aspects
of our big data platform. So this has been part
of a four-part series. And in the first part, we talked about the architecture
of a big data platform. And then in parts two and three, we actually designed and built a pipeline using Databricks delta lakes to build out our big data pipeline. And to construct this flow
from bronze to silver to gold. Today, we're gonna dive
into the machine learning and business intelligence
aspects of our big data pipeline. So now that you've gotten
the data, and it's clean, how do I use the tools to
actually explore the data, do exploratory analytics, and
how do I go about actually training a machine learning model? So if you haven't already
accessed the course materials, here is a short link that will take you to where you can go to access
these course materials. And that will bring you
out to this webpage here at academy.databricks.com and it's a private page
just for those of you who are taking the webinar. And you come in and you
can enroll in the course, which is a self-paced course,
and once you've done so, you'll see three courses. The one that we use for the first webinar, the content that we use
for the next two webinars, and then the demonstration notebooks. And these are the notebooks
that we'll be using as well for today's webinar. So for webinars three, and four, we use these demonstration notebooks and we're gonna really
focus in on these today. So, when you click Start, it will show you that you need to download this zip file here. Notebooks: How to Build a Cloud
Data Platform for BI and ML. You'll click on that. And then you'll wanna go to
Databricks Community Edition. So if you go out to
the Databricks website, you'll see this Try Databricks link. And from there, you'll be able to log in and actually create an account at Databricks Community Edition. And once you've got your account, you're able to click on Home. Choose this little drop
down arrow, select Import, and import the downloaded file right here into your workspace, and then you'll see a
folder here called Webinar. And inside of the webinar
folder will be the content that we're using. Today we're gonna focus on
notebooks three and four. But before we do so I
thought a short review of what we've talked
about in previous webinars should be in order. So to do that, first I
wanna just introduce myself. My name is Doug Bateman. I am a Principal Instructor
here at Databricks. I joined the company in 2016, and helped build out the training team. And I've got 20 plus years of experience doing consulting and engineering and architecting large solutions. Databricks is a company that's
really focused on this vision of unified analytics. So what this means is that
we're trying to create a big data scalable platform that can do your entire ML pipeline, starting with ETL and data cleansing, and then creating a data
warehouse out of that really using a data lake. We call that the lake house pattern, the idea that we're gonna use a data lake to implement a lot of the things
that are traditionally done with a data warehouse. And what this gives us is cheapness and scalability: low cost, high scalability, with elastic compute and elastic storage. And then you can go and
do your machine learning and your data science. So this whole idea of this
unified analytics platform coupling streaming, ETL,
analytics, machine learning and business intelligence. We are the original creators of the Apache Spark open source project as well as the Delta Lake project, which is a really powerful file format and runtime for processing and
storing your lake house files and MLflow for doing machine
learning experiment tracking. And we're gonna be
looking at MLflow today. And I say we're the original creators simply because this is now a very large, successful open source project across each of these three areas Apache Spark, Delta Lake and MLflow. And additionally, there are
200,000 plus companies worldwide using this platform to do their big data and machine learning. And so when we're using Databricks, we are looking at how do
I ingest data with ETL and make that available in my Data Lake. And then, using the Databricks workspace, to be able to query that data
lake and to gain insights, or to do machine learning
and model training. So our end users are using
that workspace to do this and that's what we're gonna
be working inside of today is a Databricks workspace. And in the previous webinars, we looked at this idea of a
pipeline for cleansing our data where the first goal is
just capturing the data. However it arrives, from whatever system, the first goal is just to
get a snapshot of that data. And we call that our bronze table. And the idea is we're gonna
use cheap elastic cloud storage to just capture as much data from different external
systems as possible. And we're not gonna worry
about cleaning it right away. And this solves one of the main pain points that people have often had with data warehousing, where so much time is spent uploading and cleaning the data before they can even use it that they become reluctant to upload the data into the warehouse at all, because so much work is involved. So we say, upload it now and
then clean it when you need it. So when it comes time to need the data, you'll then create a pipeline to build our cleaned or silver tables using those bronze tables. And we saw in webinar three that we could use structured streaming to keep those silver tables up to date whenever the bronze tables change. And then you can build your data marts by reading from the silver tables and writing out roll ups or
reports or featurized ml tables, tables that are ready for consumption by your end business users. And when we did this, here's
a nice little review of some of the features of Delta Lake. It's part of the notebooks that
you would have just uploaded. If you come in here
under the Demos folder, I've included a number
of really nice demos to help you remember
what you learned today. And I'm gonna use this now
just to do some of our review of the previous sessions. So if you go here to Demo 01, it's really all about what is
the power of the Delta Lake. Let's check to make sure that
my cluster is up and running. There we go. We'll launch our cluster. And while we're waiting
for a cluster to start, let's go ahead and do that
review of delta lakes. So the primary problem, anytime you're doing any type of business intelligence, is that data initially
is siloed and messy. It's in lots of different systems. It's spread out across the enterprise. And so the idea is to bring the data into a single source of truth. And a data lake is a
critical piece to that because it's able to store so much data. So we bring the data from
all these different systems into our data lake. Delta is this enabling technology that makes that really easy to do. And we're gonna talk about why
delta is one of the best ways to build your data lake. And then we can use Apache Spark to serve as this query
engine for our data lake and feed that information into business intelligence and reporting. And what we'd really like
then is to have data coming in from all sorts of different sources, read it into Apache Spark,
populate this data lake, and then be able to use Apache Spark to do machine learning and reporting. But the challenges we have are number one, we want data to be consistent. So if somebody is busy writing to a table, and we're reading from the table, we wanna make sure that
we get a consistent view of the data. That is, I don't see newly written data until it's finished writing; that's the atomicity constraint. And this is one of the critical things that we're looking to have our
Delta Lake solution provide. We also wanna be able
to do incremental reads from a large table saying,
show me what's new. And if a write to a table fails, we wanna be able to roll back or you wanna be able to access
historical views of the data. See what it looked like back when I trained my
machine learning model. And we need to be able to
handle late arriving data and update our downstream
views without having to go and reprocess or delay
processing downstream. And this is where delta lakes
really comes into the picture. Delta Lake is an enabling technology for building a data lake. So that's where the pun is. A Delta Lake is a technology
for building a data lake. And it allows us to
unify batch and streaming and to retain our historical
data as long as necessary. And we get to use
independent cheap elastic compute and storage to be
able to scale at low cost. So with delta lakes, we're
able to get isolation between different snapshot
versions of the table, only seeing new data
once it's been written. And if I'm in the middle of reading, I will not see what
new writers are writing until I am done with my reading job. I'm similarly able to
optimize the file layout to get large scalability. We saw the OPTIMIZE command and how it compacts
small files into big files, as well as really scalable
handling of the metadata of the table. We're able to go back in time. So we looked at the time travel feature in the previous webinar. And we can even replay
historical data using streaming so that we could backfill and
load our downstream tables, which was really, really powerful. And this gives us those
atomic, ACID-style guarantees. So we have our data ingestion with bronze, we clean it up with silver and then we do our aggregation with gold. And some people rightly pointed out, you could call these your load tables, your warehouse tables, and your data marts if you were using more traditional lingo. Now, so this is what we
really had gone and done. And we saw that using
delta is really easy. Instead of using the parquet file format, we just change our code to use the Delta file format.
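To make that concrete, here's a minimal sketch of the kind of one-word change involved; the paths here are placeholders for illustration, not the ones from the demo:

```python
# Hypothetical example (placeholder paths): the only change is the format string.
df = spark.read.json("/mnt/raw/health_tracker/")          # placeholder source

df.write.format("parquet").mode("overwrite").save("/mnt/lake/bronze_parquet")   # Parquet
df.write.format("delta").mode("overwrite").save("/mnt/lake/bronze_delta")       # Delta Lake
```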
Now this particular demo is written using Python. We had done ours using SQL, which had a lot of advantages in terms of how many people know SQL. And I wanted to just
scroll down a little bit and point out some of
the key words that we saw that we could use. We are able to delete from tables, we're able to update tables, we're able to merge into a table, that's the upsert operation. So it does an insert if the data is new, and an update if the data is already there. Which was really, really powerful. So we're able to use this merge syntax.
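For reference, the merge being described looks roughly like this; the table and column names here are illustrative, not necessarily the demo's:

```python
# Hypothetical upsert into a Delta table: update rows that match, insert the rest.
spark.sql("""
  MERGE INTO health_tracker_silver AS target
  USING health_tracker_updates AS updates
  ON target.device_id = updates.device_id AND target.time = updates.time
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```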
And then there's some features we didn't spend a lot of time talking about, but they are there, and they're powerful. And those of you who'd
like to read about them are welcome to come and
look at this notebook. And this is schema evolution. The idea that I can evolve
the schema of my delta table over time by adding certain
compatible types of changes. And so I'm actually able
to set mergeSchema to true when I append to a Delta table, in which case, if I'm adding columns, it will actually add new columns to that schema.
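As a rough sketch of what that option looks like (the table name is assumed from the earlier webinars):

```python
# Append a DataFrame that carries an extra column; mergeSchema lets Delta add it.
(new_readings_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("health_tracker_silver"))   # assumed table name
```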
Which is really nice, because anytime you've had a table for a long period of time, it becomes very important to be able to evolve the schema. We also saw we could do time travel, where we view the history of a table. So DESCRIBE HISTORY lets us see what the status of the table is, what's changed over time. And I can go back and view
prior versions of that table. Oops, what did I just do here? I managed to delete the cells, that's okay. We can go back and see prior versions of the tables. And so this was really, really cool, this power of the Delta Lake.
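The time travel queries being described look roughly like the following, assuming the silver table name from the earlier webinars:

```python
# Inspect the table's change log...
display(spark.sql("DESCRIBE HISTORY health_tracker_silver"))

# ...and query an earlier snapshot of the same table.
old_df = spark.sql("SELECT * FROM health_tracker_silver VERSION AS OF 0")
```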
So now we're gonna dive in and look at business intelligence, and how do I go about connecting a tool like Tableau to my data lake. To do that, we're gonna
need to first of all, launch Tableau. So I've got Tableau up and running, and I'd like to connect it to Spark. So Spark will be the query engine, and I'm gonna connect Tableau to Spark. So I'm gonna then need to
go and look at my cluster. And it looks like my
cluster is still launching. So I'm gonna go ahead and just grab one of these other clusters since it's already up and running. And I'm gonna click on
the Advanced Options here. And I see the section
here called JDBC/ODBC. And this is basically a way
in which I'm able to connect to a running cluster from an
external tool like Tableau. So I come over here to Tableau. And I simply scroll down and I said, I'd like to connect and I click More and I find Databricks here in the list. And I click on Databricks. Now, when you first do this, if you do not have the
Tableau drivers installed for Databricks, there'll be
a little prompt down here that will tell you hey, please download and install the drivers. And that actually just takes
you out to a little website at the Databricks website here, where you're able to download the drivers that you would need to
connect Tableau to Databricks. Now I've already installed those drivers. So at this point, I can go ahead and
point to the information that I see in my cluster. So server host name is
trainers.cloud.Databricks.com. And I would come here to
where it says HTTP path, I'll copy this. And then for username, I have a choice. I could use my Databricks username, but I would rather have an
application specific token. In which case, I'm gonna just
make the username be token. And I'm gonna generate a token
now for use with Tableau. So I'm gonna click up here on my account, and I'll go to user settings. And I click on access tokens. And I'll generate a new access token and I'll type in here Tableau. And I could set a lifetime that this token will be valid for. In this case, since it's
going out to the world, I'll keep its lifetime very, very short. And in fact, I'll revoke the other token that was showing on the page. And then I come back over
here and I paste that token into my new connection. And I click Sign In. At this point, I am now
connected using Tableau to Databricks. So at this point, I would choose which
database I wanna connect to. And the database that we've
been using was DB Academy. This is one that we set up
in the earlier webinars. And I click Search. Here we go, I've got a
match for DB Academy. Now it's connecting to DB Academy. And then if I click this
little search box on table, it will tell me what tables
are available in DB Academy. And here you can see our
health tracker tables that we created, including
our silver tables, and our gold table. So let's see here, there is
our silver table right here. And then daily patient average
would be our gold table. Let's go ahead and open
up our silver table. I click it. There we go, it's loading
the metadata for our table and sending the query out to Databricks, which is our big data cloud platform. There we go. And I could see name, heart
rate, time, date and device ID. And then I click here and say update now. And I'll get to view the users
and what their heart rate is. And now I have the full power
of Tableau at my disposal to be doing exploratory data analytics in a more familiar
business intelligence tool. And of course, Tableau has
a lot of really awesome graphing capabilities
and display capabilities. And so one of the things
that I wanna show you is some of the graphics
that you can do in Tableau. To do that, I'm gonna pick
a slightly more interesting data set that lends itself
to some nice graphics. So I'm gonna go to the
loans schema in my data set. And that's the schema that you would get if you ran that notebook
that I just showed you that was reviewing delta lakes. It's actually using the loans. Where's that, click again,
or maybe loan, singular? There it is loans. And then we look for the
tables that we wanna have. And these are ones that
come from that notebook that I had just demoed
that 01 delta review. And again, I could see the
gold, silver and bronze loan information. In this case, I'm gonna
load the gold table in Tableau. And choose update now. And what I'm able to see is
loan information by state. So this is a sample data set that just shows who's getting
loans for what purposes and in what state. It's a synthetic data set
like some of the others that we've been talking about. But now I can go and do some
really cool visualizations. So I'll come over here and
I'll click on sheet one. And I could say, I'd
really like to look at this map of states. So I come and I drag Address
State into my worksheet here. And it says, all right, I've got data it looks
like for all 50 states. But if I'm interested
in doing some display on some of the measures, I can change the plot
type to be a colored plot. And then I can come over here and say, I'd like to be plotting
based on the number of loans or the amount of the loan. Let's do a plot based
on the number of loans. So I'll drag count over here onto my map. And sure enough, I'm able
to get a nice visualization of all of the data that you're seeing here is being powered and delivered
to Tableau from Databricks. So what's happening inside
of the Databricks platform is, I'm using my data lake. And I've linked it to a really popular business intelligence tool. And this is really, really cool. Now, I don't have to
use a tool like Tableau to do this type of work. I am able to do a lot of this type of exploratory data analysis inside of Databricks itself as well. So to illustrate that, I'm
gonna come into our webinar and go to our demo that we
were just talking about, build and manage your data
solution with Delta Lake. And let's see if I can find
some good plots in here. We'll connect to Joel's cluster. Thank you, Joel. And let's see here. So again, I've got that same data right inside of Databricks. And then I can come here, and I could choose to do a map plot. So let's see, I would choose a map. And again, I could see this
information right here live. And one of the really cool
things that I can actually do if I want to, is I could
combine this with streaming: spark.readStream.table(...).createOrReplaceTempView(...). And I could make a temp view called gold_loan_stats_live. So that's a streaming view.
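Spelled out, that's roughly the following; the gold table name here is assumed from the demo notebook:

```python
# Register a continuously updating view over the gold Delta table.
(spark.readStream
      .table("loans.gold_loan_stats")                      # assumed table name
      .createOrReplaceTempView("gold_loan_stats_live"))

# Any query against the view is now a streaming query, for example:
# SELECT addr_state, count(*) FROM gold_loan_stats_live GROUP BY addr_state
```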
And then I can run this query on the streaming view. Oops, it's a temporary view, so it doesn't have a database name to go in front of it. And now I'm launching a streaming query. And let's choose a state to modify. So let's change the data for Pennsylvania. And we'll come back to
our streaming view here, oops, map, plot options, and I'm gonna be looking at
the amounts of the loans. Let's do the count of the loans. There we go. And let's change the amount
of the loans for Pennsylvania. So I could do, come in here,
I'll find Pennsylvania. The amount of the loans is
currently $45,000 roughly. But I could do an update
statement at this point. Back to map view again, and do update. And I have to update the
underlying gold table, not the view. So: update this table where the address state, addr_state, equals PA, and set the amount equal to... let's make it a really big number. And let's do Indiana instead, just because it's right in the middle of the country. Ah, and the name of the table was called loans.gold.loans_stats. Oh, but it's not a Delta table. We want it to be a Delta table. Shoot, we're gonna have to scroll down to find a Delta version of this table. Where's my Delta version? I'll just scroll down a little bit here. And this time, update gold_loan_stats. All right, we'll change Wisconsin loans. And we will see that data set change as the streaming query updates, which is really, really cool.
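The statement being typed is, roughly, a Delta UPDATE like this; the state and amount are just the values improvised on screen, and the table name is assumed:

```python
# Bump the loan amount for one state; the streaming view above picks the change up.
spark.sql("""
  UPDATE loans.gold_loan_stats
  SET amount = 1000000
  WHERE addr_state = 'WA'
""")
```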
So in this case, I updated Washington State. And you see Washington State now has a significantly
larger number of loans. So this is really cool, I could do this type of analysis
right inside of Databricks. Or I'm able to do this type of analysis using a tool like Tableau. And you can even see that Tableau is now showing the higher numbers as well. This is very, very powerful as a platform for business intelligence. Now, where I'd like to go next
will be to explore Databricks as a platform for machine learning. So to do that, we're gonna
open up this next notebook, 03-Machine Learning. And we're gonna choose a slightly
more interesting data set, the Airbnb data set. So these are rental prices
for Airbnb in San Francisco that were made open source
or available to public. So we're gonna come in here,
we'll run classroom setup. This just makes sure that
the datasets are available. And I'll run this a second time. I am getting a slight
warning message here. Because Joel's cluster,
the one that I was using, wasn't set up for machine learning. So I'm gonna switch it
over here to shared tiny, which has the machine
learning libraries installed. So one of the cool features of Databricks that I wanna point out is that Databricks you have a version of the Spark runtime that has a bunch of popular
machine learning tools pre-installed. So notice here I have a
choice between version 6.3 or version 6.3 ML. The ML version comes pre-installed with a lot of popular
machine learning libraries. So that's what we just did. And similarly, there are
versions even that work on GPUs for large scalability. So I'm gonna switch
over here to shared tiny, and I'll rerun classroom setup. And this time I won't
get the error message because the machine learning
libraries are pre-installed. There we go. Now this is the data set that
we're gonna be looking at and it's the Airbnb data set. And so it's been made
available here under DBFS. We just finished mounting this S3 bucket as the training mount in the Databricks file system. And now I'm able to read
in the data from Airbnb. Now the code today when
we do machine learning, we're gonna move away from
SQL and towards Python, which is a very popular
programming language for doing any type of machine learning. So to do that, we'll come in
here, we've loaded our data from the file, spark.read.parquet,
I give it the path. And it gives me a data frame, which is basically the Python
equivalent of a SQL query result. So I could immediately display that data frame, airbnbDF, and I would see the query results from reading that file. I can even do filtering, like filter where instant_bookable is true, or where room_type is Private room, and now I'm only seeing private rooms.
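In code, those steps look roughly like this; the parquet path is a placeholder for the one printed by the classroom setup:

```python
# Read the listings and filter them with the DataFrame API.
airbnb_df = spark.read.parquet("/mnt/training/airbnb/sf-listings/")   # placeholder path

display(airbnb_df)
display(airbnb_df.filter("room_type = 'Private room'"))
```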
So I get to use a more familiar data frames API. And Spark's data frames
are slightly different than some of the other
Python data frames libraries like pandas. There is a really great open
source project out there, excuse me, a really great open
source project called koalas. And koalas attempts to
mimic the pandas API, but powered by Spark. And it's definitely something that I encourage the data
scientists here to check it out. Koalas is another open source project that is currently being
sponsored by Databricks. So you could think of a koala
as being a panda on Spark. But I'm able to do queries
using Python as well as SQL. Now what I wanna do is take this data set that we had just done. And a good data scientist knows that they want to train the
model on different data than they use to evaluate it afterwards. So I'm gonna train the model today on the data I've got available. But in order to know that
my model is any good, I need to make sure that the
model makes good predictions on data that it didn't see
when it was being trained. So if you ever look at
stock market predictions people go past performance is
not necessarily an indicator of future results. What this is really saying is, just because you've got a model that does a great job on past data, it's how it does on unseen future data that's the real proof of quality. So what we're gonna do
is take 20% of our data and set aside for testing
and evaluation purposes. And we're gonna train our model on the remaining 80% of the data. So 80% of our data we're
gonna use to build a model and the 20% is gonna be data
that we never saw before. And we're gonna find
out how our model does on data that's never seen before. So I'm gonna split up this
data with an 80/20 split. And you'll see that in this case, my training data has 5,758 rows, so I've got that many records in that dataset.
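The split itself is a one-liner; a minimal sketch (the seed value is arbitrary):

```python
# Approximate 80/20 split; the seed makes it repeatable for a given partitioning.
train_df, test_df = airbnb_df.randomSplit([0.8, 0.2], seed=42)
print(train_df.count(), test_df.count())
```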
Now, by fixing a random seed, if I was to run this code again, I'm still gonna get that same split. It does the split in a highly parallelizable fashion, rather than doing a perfect 80/20 split. What it really does is, for every row, it rolls the dice. For every row, it rolls the dice. And if the dice come back greater than 80, it goes in the test set. If it's less than 80, it's in the training set. So if I don't fix the random seed, which I really don't need to do, you'll notice that the number of rows will vary: 5,733. I run it again: 5,720. That's because it's choosing
a different random sampling. And because each row is
done by rolling the dice, sometimes they end up in one set, sometimes they end up in the other set, but they got approximately an 80/20 split. And you ask why not a precise 80/20 split? And the answer is that
by using random numbers in this fashion, I'm more scalable because I don't have to
communicate across machines. Oh machine a, you got 50. Okay machine b, you should only get 20. That would be a lot more coordination and it wouldn't scale as well. So by doing an approximate 80/20 split, every machine is able to work in complete isolation from each other. And I still get a good
approximate 80/20 split. If I set the random seed, then
that random number generator is gonna produce the same
random numbers every time or so you would think. But even that is actually
subject to fluctuation. So if I change the number of partitions that I break my data up into, you'll notice that because
the data has been split up slightly differently, my
random number went from 58 at the end, to 28 at the end. So when my cluster size changes, or the number of partitions
of my data change, I still get some difference
in my randomness. Because remember, those random
numbers are being generated on each machine independently. Change the number of machines,
I change the random numbers. Or if I change just the
way the data is divided up, even if I don't change the machines, I change those random numbers. So far, so good. I'm gonna check the Q&A just to see if any questions have popped up. Oh, Dexter asked a good question. He asked, are access tokens not available for Community Edition? That is correct, Dexter. Access tokens are not
available on Community Edition. That's a feature of the enterprise full version of Databricks. So for Community Edition, you would just use your
regular username and password. And yes, the ML runtime is available in Databricks Community Edition. That's actually worth pointing out. I'll log into Databricks
Community Edition right now and point out that question. So if I come in here in the
Databricks Community Edition, when I go to launch my cluster, let's create cluster, my cluster, you'll notice I have a choice of either the machine learning runtime or the regular runtime. And this drop down is actually
one of my favorite features of Databricks. Notice I get different options for different versions of Spark. So for people who are
deploying Spark on premises, rolling out software updates is painful. But if you wanted to update
your version of Spark in Databricks, it's just a
matter of picking from a menu, which is really nice. One of the pains when you have
an on premises Spark solution is people go, how many
machines should we buy? Well, I'm not really sure. We haven't done testing yet. Well, we need to place the
order for the machines. And then you order the machines, they finish installing Spark and a new version of Spark comes out and you go but I want the new version. And they go, you're kidding, right? We just spent three months installing these machines for you. With Databricks you're able to
launch the number of machines that you need, on the
version of Spark you need, with a simple click in the menu. It's one of my favorite
features in all of Databricks. It's not the flashiest feature, but anybody who's done a real
project knows the benefit of just being able to spin
up a cluster in a few seconds as opposed to a few weeks. All right, back to where we were here. So at this point, I wanna do some linear
regression training. We're gonna start out
with a relatively simple linear regression training, where we're gonna just
try to predict the price purely based on the number of bedrooms. And I'm doing that for this webinar to keep things relatively
straightforward and digestible focusing on the machine
learning capabilities as opposed to trying to do a really elaborate machine learning example. If you were to take our three-day machine learning instructor-led course, we go through the full gamut
of doing cross validation and training with multiple features, feature extraction, categorical variables. Today, we're gonna keep
things a bit simpler, just doing single-variable
linear regression. So I wanna predict price based
on the number of bedrooms. How would I go about doing that? Well, step one here. Let's just look at information about price and number of bedrooms. And notice that when I call summary, I get to see, okay, your cheapest price, somebody's apparently
willing to rent out for $10 and somebody else is renting
out for $10,000 a night. That's quite the spread. But if we look at the median, the typical price per night is $150. Similarly, if we look at bedrooms, we'll see that somebody
is renting out their place with 14 bedrooms. This must be a Scottish
Lord with a castle. 14 bedrooms, very impressive. So it does appear that
we have some outliers in our data sets. So we just wanna keep that
in mind as we do our work. We could also do a little
plot here, a scatterplot, where I will — let me open the plot options. And I say I really
wanna just look at price and number of bedrooms. So that's number of
reviews, bedrooms and price. There's price. Now it's running a query, and setting up my plot to do a scatterplot of number of bedrooms versus price. Now, one key thing here, because I'm running
inside of a web browser, the visualizations that you'll see inside of your web browser in Databricks are gonna be based only
on the first 1000 rows. And that's because if
I sent a billion rows to my web browser, my web
browser would run out of memory. So Databricks by default
will limit those rows to only be 1000. Now if I used a popular plotting
library like matplotlib, I could actually do quite a few more rows. Or if I use Tableau, I could obviously get the full data set. But this is very interesting here because you could see
that for the most part, we do have something that
looks somewhat linear to start with, but it starts
to get a little bit weird as the number bedrooms
gets up to four and five. And I also have these
outliers like $10,000 for two bedrooms. Well, hey, nobody stops the
guy from listing his place. It just means we don't know how popular his place is gonna be. And this is an outlier, that will definitely
throw off my data science. Now, what we would like to do
is simple linear regression. So what I'd like to say is, hey, I wanna train predicting the price based on the number of bedrooms. But if you naively write this code, and then run it by
going linear regression, please fit a model to my training data, we're gonna get an error message. It says column bedrooms must be a vector and it was a double. So it was expecting an array
of doubles, but it was actually just a double. And I get this error message. Hmm, what's going on here? Well, the trick is based
on this label here, features column. It's actually expecting
an array of features, not a scalar value. So what I need to do is tell Databricks or tell Spark what are all
of the features I wanna use. So I'm gonna say input
columns is bedrooms. And it's going to, I could give it a list. I could give it bedrooms, I
could give it number of reviews, or num reviews. I can get a whole list of features, and it's gonna add a
column that is putting all those features into an array. In this example, we're just gonna train on a single feature. So that's the job of the VectorAssembler: it says, these columns in, this column out.
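That step, written out, is roughly the following (column and variable names as used in this demo):

```python
# Assemble the chosen input column(s) into a single vector column named "features".
from pyspark.ml.feature import VectorAssembler

vec_assembler = VectorAssembler(inputCols=["bedrooms"], outputCol="features")
vec_train_df = vec_assembler.transform(train_df)
```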
And now if I look at what the output of the VectorAssembler is, you'll see that it's added a new column at the end called features. And it looks like a sparse vector, though what that is, actually, is a
dense vector in this case, with the value one if it's one bedroom, value two if it's two
bedrooms, and so forth. It is a little bit weird when
you look at a vector type in this display, because the first cell is
one if it's a dense vector, zero if it's a sparse vector. The second item is the size of the vector. The third item would be the indices and then the last one
is the actual values. So if you're reading this, you really wanna look at two and three. In this case, I have a vector that just contains the value two in it. So it's adding a new
column to the data frame that combines a bunch
of these other columns. I now can feed that into my
linear regression example. So I can go, linear regression, I'd like to read in these features and predict the price, please. And what I'm returned is a linear regression model, that's a form of machine learning: a linear regression model that, given the features, will predict the price.
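A rough sketch of that fit and of pulling out the line's equation; the dollar figures read off the screen are the demo's, not something this sketch would reproduce exactly:

```python
# Fit a single-feature linear regression on the assembled training data.
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(vec_train_df)

print(lr_model.coefficients[0])   # slope: predicted price change per extra bedroom
print(lr_model.intercept)         # intercept of the best-fit line
```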
And I can look at the line, so I can go into this linear regression model, I can look at the slope and the intercept, and I can get the equation for the line. I can even label these here: price as a function of the number of bedrooms. So it's gonna be $120 per bedroom plus $50 as the best-fit line for that data set in San Francisco. But of course, that's
including some of our outliers. Now to find out how
good my model is doing, this is where our test
data set is gonna come in. So come down here to test data set. And I could go linear regression model, I would like to transform
this test dataset. Here's my test data set. I'm gonna transform it
to get my predictions. But remember, first I
have to run it through the vector assembler to extract
that vector of features, and then I can apply it to
the machine learning model. So let's run that and see how well we did. And we could see, for one bedroom here, the actual price was $130
and I predicted $173. And you'll notice that
every one bedroom place got predicted at 173. And that's simply because every one-bedroom has the same feature value, and we're only training on a
single feature currently. But we could train on lots of features. Now, the next step will be to evaluate how good our model is. But before we do that, I wanna show you another way of writing this code. So notice that I'm vector assembling. And then I'm creating a
machine learning model. We can package these two steps up into what's known as a pipeline. So, from pyspark.ml import Pipeline. And then I would say, pipeline equals a Pipeline, where the stages are first
gonna be to assemble the vector. And then to build a
machine learning model, a linear regression model. And what the pipeline does, is it simply allows me to have a whole series of transformations
lined up, back to back. So maybe I was doing one hot encoding or I was working with
categorical variables or string indexing. I could include all of those as stages in my pipeline. And now we just go
pipeline.fit testdataframe. Or let's start with my
training data frame or not. TestDF, and that's gonna
give me back a model. And then I could go model.transform, and I give it my test data
frame and I get my predictions. And this will display the
predictions like we did before. And these are my predictions,
price and prediction. So a pipeline allows me to have a series of transformations saved into this reusable component.
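Put together, the pipeline version of the same workflow is roughly:

```python
# Chain the assembler and the regression into one reusable pipeline.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

vec_assembler = VectorAssembler(inputCols=["bedrooms"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="price")

pipeline = Pipeline(stages=[vec_assembler, lr])
pipeline_model = pipeline.fit(train_df)          # fit on the training split
pred_df = pipeline_model.transform(test_df)      # predict on the held-out split

display(pred_df.select("bedrooms", "price", "prediction"))
# pipeline_model.save("/tmp/lr_pipeline_model")  # the fitted pipeline can be saved
```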
One really nice thing about doing a pipeline this way is I can actually, in turn, save the whole fitted model off to disk for later reading. So I could save the
model off for later use and read it back in again. Now, in order to determine
whether this is a good model for predicting Airbnb, a good data scientist needs to be prepared to do model evaluation. So this is where evaluating
the model comes in. So for this, I'm gonna
use a regression evaluator to compare the prediction
from the actual price. And I can get the root mean squared error. That's a standard metric
for determining how far off the predictions are from the price. One of the nice things about root mean squared error is that the units match.
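The evaluation step itself is short; a sketch, with the column names used in this demo:

```python
# Score the predictions against the actual price with RMSE, then with r squared.
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price",
                                metricName="rmse")
rmse = evaluator.evaluate(pred_df)
r2 = evaluator.setMetricName("r2").evaluate(pred_df)
print(rmse, r2)
```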
So in other words, my model is off by $290, typically. Wow, that is a pretty big error. But it's not at all surprising, because I've only predicted
based on the number of bedrooms. I didn't take other things into account like popularity of the listing or what neighborhood it's in. Or what are the standard
reviews for the listing. So it's not at all surprising that my model is off by $290 typically, as a standard variance. Simply because I didn't bother computing on anything other than bedrooms. And if I wanted to use a
different metric, I can like I could use r squared. In which case my r squared
metric change by labels here. R squared, in this case is
not a very good score, 0.12. One is very good, zero is not very good and negative would be
terrible for r squared. But this is a simple
machine learning pipeline consuming from our data lake. So notice that at the very top, I wanna link this back
to what we did before. I can read from tables in our data lake. It could be a parquet file,
it could be a Delta file, or it could actually be a table. I could go back to SQL here and go CREATE TABLE sf_listings USING parquet, in this case, with a LOCATION to tell it
where to find the file. And then instead of
going spark.read.parquet, I would just go spark.read.table with sf_listings. So I'm able to link to the datasets that are in my data warehouse. Alexis asked a question. She said, how could r squared ever actually be negative? That's an interesting
data science question and it is a good one. R squared actually, in
fact can be negative. Which means that the way
r squared is computed, when we looked at root mean squared error, you saw the units were in dollars. So let's come down here where
we did root mean squared error at the bottom. Root mean squared error,
the units are in dollars, because that's what my price is in. But that's annoying, because it means that if I changed my units to
something other than dollars, for example, I was doing
it in millions of dollars, then my RMSE would
suddenly be a lot lower. And so it's hard to
know what's a good RMSE and what's a bad RMSE. So the solution to that is to scale the root mean squared error. And the way you would do that
is you would just look at what if I did a naive model? Where I just used the average price which is represented as
x bar, the average price. How far off would I be, what would my RMSE be if I used the average price as opposed to a machine learning model? And so if we look at the formula for r squared, let me go ahead and
pull up a slide on that, because it is an interesting discussion. R squared. So here, on the top, we're looking at the residuals from our prediction: the actual minus the prediction, squared, and summed up over all the points. And then you compare that to what it would be if it was the actual minus the average, that is, the naive model where you were just predicting the average. And then you just use that ratio to scale how good we did. So instead of being in dollars,
I'm dividing out the unit. So if it was in millions
of dollars, or in dollars, it wouldn't matter. The units are gonna cancel out here. But what you'll notice is: what's the best possible score I could ever get? Well, if my predictions were perfect, the best score I could ever get would be a one, because there would be a
zero up here at the top. But if my predictions were terrible, the thing at the top could be a million. And if that's way worse than just using the naive model, which, say, gave you 10 on the bottom, notice this would in fact come out negative, Alexis. So while rare, it is possible to have a negative r squared, which means your model is doing worse than the naive model.
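For reference, the formula being described on that slide is the standard definition of r squared, with y-hat the prediction for a listing and y-bar the average price:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```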
So let's continue on now. Oh, by the way, here's a picture I meant to show you earlier. These are different types of plots that you could do in Tableau
using that same data set. Just wanted to show some
pretty pictures from Tableau. I had meant to mention that earlier. Let's go back to machine learning. So this is how we evaluate
our machine learning model. Now, what I'd like to do
next would be to show you how to use a technology called MLflow to track your machine
learning experiments. And in the process, we're gonna get to do a
slightly more interesting machine learning exercise. So this is notebook for MLflow. MLflow is basically a tool for
logging your ML experiments. Any data scientist that has been doing it for a while, they sit there changing what features they use, they change what machine learning algorithm they use, they try all sorts of different settings, and they end up building
hundreds or thousands of different machine learning models. And at some point, it's easy to get lost as to which ones did well. And then once you've zeroed in on that machine learning model, you wanna go a step further
and you wanna save it out and be able to use it in production. And this is where MLflow
comes into the picture. MLflow is a tracking tool for all of your machine
learning experiments. So we'll run classroom
setup that just makes sure that the datasets are available and that the necessary
libraries are available. MLflow comes pre-installed with the machine learning runtime. It's one of those libraries
you would have to install if you were not using the
machine learning runtime and were instead using the base runtime. So the difference between
the machine learning runtime and the basic runtime is
that the basic runtime does not have a bunch of
libraries pre installed, giving you total control over
what versions of libraries you want to use. Whereas if you use the
machine learning runtime, it has a lot of these popular
machine learning libraries pre installed, and we find
that that is extremely popular with data scientists who just wanna get up and running quickly. And if you wanna know
precisely what version of what libraries installed, you would just look at the release notes for the version of the ML
runtime that you're using. All right, so step one. We're gonna read our data set. And again, we could read it from the file, or we could read it from
the table sf_listings that we just created in
the previous example. Either way, we could
read it from a table name or directly from the underlying file. And like we did before, we're
gonna do an 80/20 split. Now, let's go a little bit further here. What we wanna do is a
standard machine learning pipeline again, so the
one you saw from before, but you're gonna see a bunch
of stuff surrounding it. But let's just recognize it first. We're gonna build a vector
assembler and linear regression. Remember seeing that. We're then gonna build a
machine learning pipeline that's gonna do vector assembling, followed by linear regression. And we are gonna train our model. And then we're gonna make
predictions on our test dataset, evaluate what is our
root mean squared error and compute that. So what is all the other stuff
I see on the screen here? Well, the rest of this is MLflow. What I'm gonna do is say MLflow, I wanna start an experiment. I wanna start training a model. So here's my run. And I'm gonna call this
run linear regression using a single feature. I could call it anything I want. I could call it fill. It's just the name. And this with statement
is a Python feature that says the experiment run is over when I leave the scope of the with statement. So by the time I get down
here to the very end, where I might be displaying something, it will have saved that
experiment to disk. The moment I leave the with statement, it will transmit that experiment
off to the tracking server and make it available to me for viewing. And so it gives me this
object here called run that I can now use for logging purposes. Now I'm gonna go MLflow,
I would like you to log that I'm really gonna be
training based on price and the number of bedrooms. Just a note I'm making, tracked in my log path. Think of MLflow as being a logging system. So I'm gonna log that this experiment was using price and bedrooms. I also just finished training
a machine learning model. Let's save the machine
learning model that we made, to my log, so that I can get to that
machine learning model later. And since I computed the root mean squared error, let's save the root mean squared error to my log as well.
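Condensed, the MLflow pattern being described looks roughly like this; the run name, parameter, and metric names are just the ones used in this walkthrough:

```python
# Train inside an MLflow run and log a parameter, the fitted model, and a metric.
import mlflow
import mlflow.spark
from pyspark.ml.evaluation import RegressionEvaluator

with mlflow.start_run(run_name="LR-Single-Feature") as run:
    pipeline_model = pipeline.fit(train_df)

    mlflow.log_param("label", "price-bedrooms")          # a note about what was used
    mlflow.spark.log_model(pipeline_model, "model")       # the model artifact

    pred_df = pipeline_model.transform(test_df)
    rmse = RegressionEvaluator(labelCol="price",
                               metricName="rmse").evaluate(pred_df)
    mlflow.log_metric("rmse", rmse)                        # the evaluation metric
```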
Now, managers of data science teams love this, because if they wanna go
and look then at what models their people are training, all that information has been logged and is available later on. So we're gonna train our
machine learning model, and let's see where it
got logged this time. That's not needed. All right, now, I come
over here in the top right and I see runs. And there it is. I can see in this case, I just ran a training on
this notebook right now. I called it, double click on it, I could see the version of the
notebook that actually ran. So this is the version of
the code that actually ran when I trained this model. I can see it was called price bedrooms and it had a root mean
squared error of 290. And if I want more detail, I can click on this little
icon to view the experiment. So here we go. I can see I did a linear
regression with a single variable on April 17th of 2020. These are ones that I did previously while testing stuff out. I could see who trained that model, the version of the notebook that was used when the model was trained, and then in information
that I chose to log like price bedrooms and the
root mean squared error. Previously, I'd done an experiment where I looked at the log price as opposed to the actual
price because I noticed there was a log normal
distribution to price. And I was able to train
a slightly better model by using log normal distributions. But I can see all of my past experiments. And then if I click on
this particular experiment, and drill down, I can see, again, the metric root mean squared error. But I also have this here artifacts, things that have been saved. So there is my machine learning model. It saved off the disk. Well, let's say that I really liked this machine learning model and I wanna start taking it to production. There's this link over
here, register model. Let's click it. So I'd say I'm gonna create
a new model in my registry. And I'll call this my Airbnb model. And I click Register. There we go. And now I can come over here on the left, you'll notice this button models. Now this button is available in the full version of Databricks. If you're on Databricks Community Edition, notice that you do not
see the model registry. So that is part of the
professional version of Databricks, as opposed to the free
open version of Databricks. But I'm able to register the model and I get this button on the left where I can see all of
my registered models. Now this is for the purpose of
taking a model to production. So notice that I get to say, when did I move that model into staging? When did I move that
model into production? So I come here now to my registry. I can look at my version of my model. And I could say, all right, I would like to transition
this model into production. And I can make a comment
about when I transition it into production, so that other people are then able to grab
that model and use it which is a really powerful feature. They just go through the Databricks APIs, and they can grab the latest
production version of my model. And as I have new versions of the model, you'll notice that this
version number will increase and I'll get to move that
version into either production or staging or production as they come out. Really, really powerful. This is known as the model registry. And we are even adding the capabilities soon to do model serving, so that you'll be able to
actually hit a REST endpoint
scoring based on the model. All right, so that is the
basic part of using MLflow as a logging API. So let's do a slightly
more interesting example. This time around, I'm
gonna grab more features than just the number of bedrooms. I'm gonna grab all the features I can get. So price is gonna be what
I'm trying to predict. Everything else is gonna be a feature that I can use for making predictions. And this little utility is a nice way of grabbing all the features. It uses the formula approach available in the R programming language.
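That utility is most likely Spark ML's RFormula; a rough sketch of how it would be set up, under that assumption:

```python
# Build the "features" vector (and a label) from an R-style formula string:
# "price ~ ." means predict price from every other column.
from pyspark.ml.feature import RFormula

r_formula = RFormula(formula="price ~ .",
                     featuresCol="features",
                     handleInvalid="skip")
# Typically used as the first stage of a Pipeline, ahead of the linear regression.
```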
Oh, Sid just asked a question that I can't resist answering. Can you do deep learning and
natural language processing? Yes, you can Sid. For an example of doing deep
learning using TensorFlow, I'm gonna point you over here to the demos 03
operationalizing data science. There is an example of
using Keras and TensorFlow. For examples of
doing NLP on Databricks, we have some really good blog posts that you might check out. Or you can take one of our
full trainings where we do NLP, natural language processing, NLP. All right, so I'm gonna grab
all the features this time, and put it into a vector, and
then do linear regression. And out pops my pipeline model. Again, I'm gonna log my model
as well as logging a label And any metrics. What is my RMSE? What is my r squared? So I'll run a bigger experiment, and see if I can do better
than my previous experiment. It's a little bit slower today because I'm using a tiny machine as opposed to a big cluster. Yesterday I was using big beefy machines for our delta pipelines. I didn't need to be. I just chose to do it to
reduce some of the waiting that we're doing in class. Today we're using this tinier machine. And the real reason for that is quota. When I tried to launch a bigger cluster, it turned out I didn't
have enough AWS quota, somebody else was using that quota. So I'm sticking with a smaller
machine for the moment. It's just gonna take a little bit longer to train our machine learning model. While we're waiting on that to run, let me see if my quota
is finally available to launch my machine. I'm gonna come down here
to my cluster again. Revisit Doug's cluster. Let's see if we can get it to come up, maybe quota is available now when it wasn't available earlier. Ah, somebody asked how
does the vector assembler know which data frame
to pull the data from? So if we go up here to where
there's a vector assembler, how does it know which
data frame to pull from? And the answer is it gets the data frame right here on line 16. So somebody said, how
does the vector assembler know which data frame? It gets it here. When I go pipeline.fit, it's literally taking
this training data frame and providing it to the vector assembler. And then it takes the output
of the vector assembler and provides that to
the linear regression. And then it takes the output
of the linear regression and that is my model that I've produced. And it's able to use that model when making predictions down here. So it's this line here that
tells it which data frame to get it from. And Brian Pan asked, is
there a visual UI for MLflow? I think we just got to see
that when I went over here and I clicked on runs. You got to see the visual
UI for looking at MLflow. So let's scroll back down here. There we go, I finished
doing my machine learning. Let's click on runs. Here's my latest run, price all features. And notice that indeed, my root
mean squared error did drop when I added other features. But it didn't drop enough. And one of the reasons it didn't
drop enough, is that again, I still have a lot of outliers
on the Airbnb dataset, because I'm not predicting
what the price should be to maximize profit. Rather, I'm trying to predict what people have been
listing their Airbnb's at and some people were
listing theirs at $10,000. So we'd have to deal with those outliers. The other reason why
this is somewhat high, in terms of error, is that I am assuming that
there is a linear relationship between price and the number of bedrooms. And it turns out that it's
not a linear relationship. It's an exponential relationship. So what I really wanna
be doing is looking at the log of the price to
do my linear regression. And these are the types of things that a good data scientist would do as they're running lots of experiments to try to find a model that does a good job making predictions. They might also try things
like neural networks, as opposed to doing linear regression to do this type of machine learning. And you would log all of
those experiments here using the machine learning
APIs or the MLflow APIs. So now, I could click on
this guy, open them up. I could scroll down to the model
and say register the model. And I would like to replace
the current Airbnb model with this new one. And I would click Register. And then I come over here to
the registrations pending, we'll give it a moment to
finish installing that model. While we're waiting, I'll look to see if there
are any other questions here. Ernesto said, well, how
does the pipeline know which vector assembler to use? So again, let's go back
to the code Ernesto. Notice that when I define the
pipeline, let's come up here, when I defined the pipeline, I tell it which vector assembler to use. The way I usually like to write this code that I think is a bit more readable, is when I define the pipeline, I often will define
the pipeline stages right here, inline, in a more linear fashion. So I'll say, here's your vector assembler. Here is your linear regression. And now it's very clear my pipeline is gonna do vector assembling and then linear regression. And remember, the vector assembler is producing a single vector
column called features. And then I chose to
put the features column into the linear regression. So notice that I'm explicitly naming which columns I want to be using here. All right, so my deployment is finished. Let's go back over; this time, it's now registered. And again, if I go over to
the model serving layer, so let me click on models, I see that version 2
is the latest version. But version 1 is the one
that's still in production. So I have the latest version, that's the development
version, the staging version and the production version. And I could click here on Airbnb. And I could say I would
like to take version 2, and move that into staging please. And now I've moved it
into the staging state. So now I could see that if you're playing in the staging area, you
should be using version two. And I can retrieve this model
using the Databricks APIs and actually do live scoring with it, which is really, really powerful. All right, I wanna show off a few other capabilities of MLflow. So in this experiment, I'm actually gonna do the log of the price as opposed to the actual price. So I'm gonna take my data
frame and I'm gonna add a new column called log price. That will be the logarithm of whatever the value
of the price column is. Now I wanna highlight I am importing a log function from Spark. I'm not using Python's log function. I'm using one from Spark
that knows how to work on entire columns of a data set, as opposed to an individual number. So the log function in Python would expect a floating point number. The log function in Spark works on that entire column of data, as opposed to an individual value.
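A small sketch of that transformation; note the import is Spark's column-wise log, not Python's math.log:

```python
# Add a log_price column using Spark's column functions.
from pyspark.sql.functions import col, log

log_train_df = train_df.withColumn("log_price", log(col("price")))
log_test_df = test_df.withColumn("log_price", log(col("price")))
```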
So at this point, we're gonna work on the log price. We're still gonna run
it through our formula. This time, I'm gonna do log price. And I wanna use all features except for the price, because I'm gonna use log price instead. I don't wanna predict log price from price, that would be cheating. So I'm gonna use all features except for the price. And again, I'm building a pipeline. It's gonna start with our formula, then linear regression, and I've got my pipeline. That's gonna train on my training dataset and yield a model. I'll log that model using MLflow. I'll then make some predictions, and I'll log the various scores telling me how well we're doing.
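Here is a sketch of that formula-plus-regression pipeline, assuming the same trainDF as before; the RFormula string is one way to express "all features except price":

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import LinearRegression

# "log_price ~ . - price": predict log_price from every column except price itself,
# so we are not cheating by predicting the log of price from price.
r_formula = RFormula(formula="log_price ~ . - price")

# RFormula's defaults ("features" and "label") line up with LinearRegression's defaults.
lr = LinearRegression()
pipeline = Pipeline(stages=[r_formula, lr])

pipeline_model = pipeline.fit(trainDF)   # then log the model and metrics as in the earlier sketch
```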
And NC just asked: is job scheduling and automation using Databricks available,
and the answer is yes. You can actually do job
scheduling over here in the Jobs tab. He also asks the question: could I use a tool like Airflow or Azure Data Factory to launch jobs? And the answer is yes. Integrations exist with Airflow, as well as Azure Data Factory and a number of other schedulers, to launch jobs. So you can define the jobs in the Jobs tab and then launch them using these other tools. Or you can launch them with
our own built-in scheduler and not rely on these other tools. Lots of options there. So here, I'm gonna train against a log-normal distribution. And while we're at it, let's have a little fun. Let's do plotting. So pyplot is the plotting module that comes with matplotlib; it's a Python library, matplotlib. Let's do a plot. And in this case, the plot is gonna be a histogram of the log price. And let's save that plot off to disk so that we can see if we get a better normal distribution of our data. So we're gonna make a plot, and notice that with MLflow, I can log the plot along with my model. So it'll show up here in the runs tab in just a little bit, as soon as this job finishes.
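A minimal sketch of saving a matplotlib histogram and attaching it to an MLflow run; the file path and DataFrame names here are illustrative:

```python
import matplotlib.pyplot as plt
import mlflow

# Histogram of log_price; toPandas() assumes the selected column fits in driver memory.
prices = airbnbDF.select("log_price").toPandas()

fig, ax = plt.subplots()
ax.hist(prices["log_price"], bins=50)
ax.set_xlabel("log(price)")
fig.savefig("/tmp/log_price_hist.png")

# Attach the image to an MLflow run so it shows up under that run's artifacts.
with mlflow.start_run(run_name="lr-log-price-plot"):
    mlflow.log_artifact("/tmp/log_price_hist.png")
```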
Oh, I bet I know why my cluster's not coming up. I've been trying to
figure out why my cluster didn't come up. I think I know the answer. I chose to use spot pricing to save money. But the spot price might be
very high on Amazon right now. So if I can't get a cluster
paying the spot price, I can ask it to give me a cluster
using the on-demand price. That's a nice feature of Databricks. Let's see if my big
cluster will come up now. And actually before I
launched the big cluster, let's change the machine type too just to make sure there's no quota issues. So I can come in here
and say, you know what, I don't wanna run on an i2; no wonder, I wanted to be running on an i3. Well, that would explain it. I don't have any i2 quota, I only have i3 quota. So we'll fix that as well and hit Confirm. And now my bigger cluster will come up. It's kind of fun, you get
to see the little controls inside of Databricks
that are available to you as you play around with that. There we go and sure enough
here is that plot we made showing that I have a nice
normal distribution curve, a bell-curve shape, for my data. So when I use log price, I get a good bell curve, which is much better for
doing linear regression. And let's see how my run did. So I come over here, I
close it and reopen runs or I click this little refresh icon. And sure enough, my
root mean squared error has dropped a little bit now that I'm using log price instead. MC asked the question; I believe, MC, you're asking how do we go about training the model? Your question may be a little bit unclear. If you're asking what algorithm we use to converge on the optimal solution, we're using gradient descent because it parallelizes very well. So yes, it is iterative. It uses gradient descent without adaptive step size. And similarly, you're able to use tools like Hyperopt to do hyperparameter tuning. And if you ran a bunch of cross validation or hyperparameter searches, they would show up here
in the runs tab as well, which is really cool. All right, so this is our latest model. Let's expand it. We'll come in here. And notice that what
got logged in this time is not just the root mean
squared error and the R squared, but under artifacts, I can see the model, but I can also see my picture. So I can log any plots that I
want along with my artifact, which is really, really nice
if I'm doing data science. Helps to keep all your
experiments straight, which is the real goal of MLflow. You can also query MLflow not just using the UI the way I did right here; I can also query MLflow using a Python client, or other clients, they're not just in Python. And I'm able to say, show me all the available experiments, please. And I can get back a list of all the experiments that I've trained. I can grab my current experiment, and I could see all the different runs that I did inside of my experiment. So I'm querying the API programmatically. And I can even, once I find a specific run, ask it to give me the saved model.
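A hedged sketch of that programmatic path, using the MLflow 1.x client APIs available at the time of this webinar; experiment_id is a placeholder:

```python
import mlflow.spark
from mlflow.tracking import MlflowClient

client = MlflowClient()

# List the experiments this workspace knows about (MLflow 1.x client API).
for exp in client.list_experiments():
    print(exp.experiment_id, exp.name)

# Pull back the runs for one experiment and pick one of them.
runs = client.search_runs(experiment_ids=[experiment_id], max_results=10)
for run in runs:
    print(run.info.run_id, run.data.metrics)

# Load the Spark model that was logged under a run's "model" artifact path.
chosen_run_id = runs[0].info.run_id
loaded_model = mlflow.spark.load_model(f"runs:/{chosen_run_id}/model")
```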
So I'm gonna allow this to run. While that's busy running, I'll come down here where I say load saved model. And notice that I'm actually able to find the model that was associated with a given experiment and run. So I can load my logged model that we did up above and have it available for use via the APIs, which is very, very nice. I wanna make sure my audio
didn't cut out there. So I'm able to log and access my logged models using the APIs. So somebody was asking
how do I go to the APIs to get this information? Somebody else asked, hey, could I see these ml
predictions in Tableau? Could I take the data that we just did, and see it in Tableau? That's a great question. So here, let's take a look
at the data frame we've got. It was called prediction DF. So I've got prediction data frame. Let's display it so we
can see our predictions. And he says, how would
I get those predictions over to Tableau? My other cell is still busy running where I was doing a query down here. I'm gonna stop this search. And I refresh my page. Sometimes refreshing the page helps when something's delayed. There we go, now it's running. Let's see if my other cluster, my fast cluster is now up, woohoo. We've been running on two cores. My other cluster is gonna have 48 cores, which is gonna give us
a lot more horsepower. But oh, why aren't my predictions showing up? That should not have been empty. Let's run that a second time here. There they are, perfect. So there's my predictions data frame. I could take this and say .write.saveAsTable, and let's put this into dbacademy.predictions. So I write this out as a table. And what I really should have done, it's a little late now, is gone format delta and then saved it as a table. So I'll actually do that; I'll make it a delta table. So first I gotta drop table dbacademy.predictions, then format delta, mode overwrite to replace anything that already exists, and save it off as a table.
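Roughly, the corrected write looks like this, using the predictionDF and dbacademy.predictions names from the demo:

```python
# Drop the earlier version of the table, then rewrite it as Delta.
spark.sql("DROP TABLE IF EXISTS dbacademy.predictions")

(predictionDF.write
    .format("delta")
    .mode("overwrite")            # overwrite anything that already exists
    .saveAsTable("dbacademy.predictions"))
```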
And as soon as this is finished writing, I'm going to click over here to Tableau. And let's go back to my
data source in Tableau. And I'm gonna say I wanna look at the DB Academy schema, again. There it is DB Academy. All right, and now I
can remove the old table that I was using, and
let's look for the table that we just wrote. It may not be written yet. It's still in the process of writing. So notice that I don't see
it until it's done writing. So it's in the process of
writing out that table now. And again, I'm writing on our teeny tiny Spark cluster right now. It's one that I leave running
all the time for quick things. Whereas the one that's got 48 cores, I would shut down as soon as I was done. In fact, if you look at that cluster, one of the cool features in Databricks, I can set a period in which
if nobody's using the cluster, it should automatically shut
down, which is really nice. All right, so I've written it out. Let's go to Tableau. I'll hit refresh. And there's predictions. And I'll drag predictions over here. Just gotta read it from my cluster. And I click update now. And sure enough, here's
all of my data from Airbnb, including the log prediction, which is what we were measuring the root mean squared error against. One thing I never actually did, and should have, is take that log prediction and exponentiate it back to the original prediction. And I don't think we actually took the log prediction and turned it back into a real prediction. So we should go back to our code, and really be comparing the prediction to the actual price, as opposed to comparing the log price
to the log prediction. So how would we do that? Well, we would come back to our code here. And let's see. So we've trained our
machine learning model. Well, what we ought to do, oh, here it is. We take the log prediction; we do have a column called prediction. My mistake is when I saved off the data, I saved the prediction data frame instead of the exponentiated one, where we actually had the real prediction. And that's why we're not seeing it. Well, that's no problem. I could just come back here and say I would like to save the exponentiated data frame, drop the old table, and write it out again. Or I could have actually done schema evolution, where it would have just appended that column, which would have been even cooler. And then you would actually see the real prediction there as well.
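A sketch of what that schema-evolution write could look like; exponentDF is a hypothetical name for the DataFrame carrying the exponentiated prediction column:

```python
# Schema evolution: let Delta add the new column instead of dropping and rewriting the table.
# exponentDF stands in for the DataFrame that includes the exponentiated prediction.
(exponentDF.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")   # allow the new column to be merged into the table schema
    .saveAsTable("dbacademy.predictions"))
```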
So notice that I'm able to work with my data on the data science side over here
and explore it from Tableau, which is really, really cool. So hopefully that answers
your question Baskara. And I'm gonna come over
to the Q&A channel here and see what else. Somebody had asked, how
could I consume a model? So Carlos, there are actually two ways you can consume a model. Via an API call; or actually, are you asking about the model hosting feature, where there's a live REST call? For that, I'm gonna point
you to the Databricks blog, where we actually show
you the live rest call, because that is a brand new
feature that's just coming out. So I'll point you to the Databricks blog for examples of the RESTful web service that can actually be what's
called model serving. Whereas what I've demoed here in the code, we used an API call to
retrieve and load the model, and then we were able to apply it. So in this case, I'm just using
code to retrieve the model and apply it. But using the RESTful API
is really, really cool. And take a look at some
blog posts for that. Let's see here, Jay asks,
can you save the artifacts to S3 or Blob storage? So actually, Jay, the artifacts are being saved to S3 or Blob storage right now. Remember, DBFS is in S3 or Blob storage, and you can control the path
of where this stuff is written. So it doesn't have to be
written in this directory. I could have put this
in a mounted directory that's in your blob and
save that data there instead which is a really nice feature, being able to access it
from any type of blob store. Remember, dbfs is just a
layer on top of blob stores. Anything written to DBFS is
in fact in the blob store. Let's see here. Ah, somebody asked why
is linear regression called machine learning. So Alfred, there's actually
many different types of artificial intelligence. Artificial Intelligence fundamentally is about finding patterns in past data so you can make
predictions on future data. And so linear regression is one form of artificial intelligence. It's a really simple form. It's not a terribly intelligent form. It's not magic from that perspective. But it's actually incredibly powerful. If you look at how the human brain works, each neuron in many ways, is really doing simple linear regression. If the voltage from this
incoming axon is this and that at the dendrite, it fires across the gap and that neuron fires. So your brain, this neural
network in your brain, is actually just this network of things that are doing relatively simple
linear regression training. Now, that's oversimplifying
the brain a little bit, I will grant you that. But the idea is that linear regression is a fundamental building block of really most any form of
artificial intelligence. But there are other algorithms we can use like decision trees, neural networks, all sorts of different
machine learning algorithms. The reason I chose linear
regression for this demo is because most people know it. And I don't have to explain
the science behind it because most people are
familiar with linear regression. But there are many, many,
many machine learning algorithms out there. But you would be surprised
how many data scientists build linear regression models
to do their predictions. All right, somebody else asked: how can I deploy the model myself if I'm not gonna be using a REST API? How can I deploy this model? So there are actually several ways you can take a machine
learning model to production. One of them is with Spark. So I could read the model in using Spark and schedule a nightly job to read in the data, apply the model, and write out the predictions. So that is known as batch. I would take my loaded model, apply it to whatever my nightly data is, and then write that back out as a table. So I can run nightly jobs that write out these predictions.
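A minimal batch-scoring sketch, with the model URI, input path, and output table as illustrative placeholders:

```python
import mlflow.spark

# Nightly batch scoring: load the model, score the new data, write out the results.
model = mlflow.spark.load_model("models:/Airbnb/Production")        # or a runs:/ URI

nightly_df = spark.read.format("delta").load("/mnt/data/nightly")   # path is illustrative
scored_df = model.transform(nightly_df)

(scored_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("dbacademy.nightly_predictions"))                  # table name is illustrative
```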
Or, Peter, if I wanted to, I could use streaming to apply these models. So let's take the data frame
that we were using earlier. Let's see here, where is that dataset? The Airbnb data frame; I'll come back down here. Where was I? Where I loaded the saved model. And what I could do, if I wanted to, is after I've loaded the saved model, I can actually go spark.readStream.parquet and give it a place where it's gonna be picking up streaming data. This really could be coming from Kafka; I'm gonna simulate it coming from a directory here. And I'm gonna set the option maxFilesPerTrigger to one, just to simulate streaming data. But now this is a streaming data frame. And so I could take my loaded model and do predictions on it. And then let's just do predictionDF dot, let's just count how many rows there are and display it for our sake. And you can actually see the count of the predictions as they're being done. So I'm feeding the streaming data frame from before into my machine learning model and making predictions in real time, which I could either write out to disk or, in this case, I'm just displaying the count to the screen.
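A sketch of that streaming-scoring setup, assuming the loaded_model from the earlier load step and an illustrative source_path:

```python
# Streaming reads require a schema up front, so borrow it from the existing files.
schema = spark.read.parquet(source_path).schema

stream_df = (spark.readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 1)     # simulate a trickle of streaming data
    .parquet(source_path))               # this could be a Kafka source instead of a directory

prediction_df = loaded_model.transform(stream_df)   # score the stream with the loaded model
display(prediction_df.groupBy().count())            # running count of scored rows (Databricks display)
```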
Maybe we'll have a little fun writing it out to disk. So I would go predictionDF.write.format delta, saveAsTable predictions. And I need to go .mode, overwrite anything that's previously there. And why did it complain? Ah, option instead of options; that's a little bit annoying, and we'll fix that. But then I'm gonna query this table: spark.readStream on predictions, group by everything, count, and I'll display that. So I'm gonna write it out to a table and then I'll read that table, just like we did in seminar three. The table doesn't exist yet, because this one up here errored out. Oh, if I am reading streaming data, one of the things I have to provide is a schema. So I have to go .schema, and I have to provide it the schema of the data when I'm doing streaming. So let's just extract the schema from our existing file really quickly. I'll just read the data that's already there with spark.read.parquet, get the schema, and provide that schema for streaming. Because the idea with streaming is you have to provide the schema up front, since the data hasn't arrived yet. Now in this case, I'm
simulating streaming. And now why is it complaining here where I call? Oh, writeStream, writeStream. Okay, we'll leave mode off. I have to write to a file. Lots of little things as I try to do a live code example: parquet or delta, and then I would save it off here to a file. And now I'll have, and I have to set a checkpoint, all the little things that we had to do in our streaming. Do some tab completion here to help me out. Writer dot, and I think it's an option I have to set. Okay, so we go .option, checkpoint path. This is what I get for live coding. checkpointLocation; checkpointLocation, there we go. Now what's written out is streaming, and I can turn around and have Spark read stream, format delta, and give it the path that I wanna be reading in. That's the checkpoint; I want my file.delta. And I could start seeing that data arrive as it shows up, but I've got to provide the schema again. All the little steps that come with doing streaming. It needs to finish writing the table the first time, so we'll give it a moment. It says it's writing data into it. I gotta wait for the streaming job to finish, and then we'll be able to see it deployed as streaming.
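Pulling those live-coded pieces together, a cleaned-up sketch of the streaming write and the follow-up streaming read might look like this, assuming the prediction_df streaming DataFrame from the sketch above; the paths are illustrative:

```python
# Write the streaming predictions out as Delta, with the required checkpoint location.
query = (prediction_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/airbnb/_checkpoint")
    .start("/tmp/airbnb/predictions.delta"))

# Then read that Delta path back as a stream to watch the counts grow.
counts_df = (spark.readStream
    .format("delta")
    .load("/tmp/airbnb/predictions.delta")
    .groupBy()
    .count())
display(counts_df)
```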
So this went a little bit into the weeds, but you could do batch predictions, and you could do streaming predictions. Peter, the third thing you can do is actually export the model using a library called MLeap, which would allow you to then serve it up in Scala or Java. Or you can build a RESTful web service. So those are really your four options: batch, streaming, export with MLeap, or use a RESTful API. All right, let's take a look, while we're waiting for this to happen, at some other questions that are coming in here. Ah, so Krishna is asking
how can I get my notebook? I believe that's what you're asking. So remember, all of these
examples are available at the Databricks Academy
link that we shared with you at the beginning here. Go to
https://tinyurl.com/cloud-data-platform. And you'll be able to
import this notebook. It downloads initially
as a Databricks archive, it's a zip file. You would import it into Community Edition and you'll be able to
then import that code into your notebook, which is really cool. So you just come out to our website and you'll be able to download it. And good news, it looks
like my example of streaming is finally working. We're gonna be able to get
a count here very shortly. Oh, okay, you'd like to get an export of what's on my screen. I can work with our marketing
team to upload an export of my examples, with all of the code that I've been adding,
to the website as well. And here we go. There is our count of predictions. So the big takeaway with all that was, I'm able to do predictions on
streaming data and batch data, export a model to MLeap, or use the RESTful API for serving, which is really, really cool. Durgadus asks: how can I combine this with automated deployment
and CI/CD together? We have some really
good blog posts on that. So I would encourage you to go
out to the Databricks website and check out the blog posts on CI/CD. Or if you have deeper questions, reach out to the Databricks sales team, and we can put together a demo for you of doing CI/CD, continuous integration and continuous deployment, with Databricks. We have some really great blog posts covering CI/CD; in the context here, we wouldn't be able to do it in the eight minutes that we have remaining today. Let's see what other questions we've got. Can we use a cluster on Microsoft Azure with Delta Lake functionality? So yes, Databricks will run on Azure. It's called Azure Databricks, and it's available as a
first party service on Azure. So just to demonstrate
what that looks like. If you are a Microsoft Azure customer, you have Databricks today, you don't even have to
talk to a salesperson. You would just log into your Azure portal and then you just type
Databricks at the top, and you see Azure Databricks. And at that point, you can
create a new Databricks workspace. You choose your Azure subscription. You would choose which resource group you wanna use; I should have mine in here somewhere, just search for Doug. There we go, Doug work. You choose which Azure region
you wanna be running in. And then there's different
tiers of Databricks. So there is the standard tier,
which is what we're demoing. But then the premium tier adds
in greater security controls, role based access controls, which are really powerful. So the different users can have access to other people's notebooks or not. If you're deploying jobs, I could have access to just the jobs logs but not to actually change the job. Those are features that are
available in the premium tier. And then you could just say review and create, or you could add custom networking if you wanted to, so you could set up your own VNet. And then eventually you would create your workspace. And evidently, I missed a step. What did I do wrong? Oh, it says I forgot to give my workspace a name. So I would click Create, and it's actually gonna
deploy right here in Azure, a full version of Azure Databricks for you using your existing Azure account. So this is really nice. It is so easy to use Databricks in Azure. It's a match made in heaven. And then Databricks integrates with a lot of the other
Azure technologies, including the Azure Data
Warehouse, Azure ML, lots of tools that are out there. Azure data factory, I strongly recommend. It's a great way to use Databricks. How could you access Kafka from
Databricks, Augustine asks. So if I wanted to read from Kafka, and I don't have a demo of reading from Kafka here, but if I wanted to read streaming data from Kafka: here I did spark.readStream and I said I wanna read from a directory that's parquet. I would just replace that with Kafka. And I would give it the path to my Kafka server and my Kafka topic, as well as any information needed to log into Kafka, and I could actually access Kafka from Spark. So Augustine, you would just look for Spark readStream Kafka for examples: spark.readStream.format kafka. And here's an example right here of reading from Kafka. So, format kafka; notice I tell it the host I wanna connect to and the topic I wanna connect to, and then I just call load, and now I'm streaming in from Kafka as opposed to a directory.
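For reference, a minimal Kafka source sketch; the broker address and topic are placeholders:

```python
# Reading a stream from Kafka instead of a directory.
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")   # placeholder host
    .option("subscribe", "airbnb-events")                     # placeholder topic
    .load())

# Kafka rows arrive as binary key/value columns; cast the payload before parsing it.
events_df = kafka_df.selectExpr("CAST(value AS STRING) AS json_value")
```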
Let's see here. Brian, you asked a question about containerization, but your question's a little bit too broad. If you could narrow it, I'll
try to answer that question. Databricks actually does
run inside of a container. So we're using Linux containerization. And the idea there is that it
makes it really easy for you to spin up a custom version of Databricks with the libraries that
you want pre installed. So when you go to launch a cluster, you can actually provide a
Docker image if you want to, with pre-installed libraries,
which is really nice. Somebody else asks, can we support Avro? So we can read Avro files with Spark; the caveat is that Avro is not a delta lake at that point. Delta Lake is built on top of parquet, but Spark can absolutely read Avro. So spark.read.format avro, .load, and I could point that at Avro files. And then I could turn around and save that into my delta lake. And now I'm ETLing data from Avro into my delta lake, which is really, really nice.
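A short sketch of that Avro-to-Delta hop; the paths are illustrative, and the Avro reader assumes the spark-avro support that ships with the Databricks runtime:

```python
# Read Avro files and land them in a Delta path.
avro_df = spark.read.format("avro").load("/mnt/raw/events-avro")   # path is illustrative

(avro_df.write
    .format("delta")
    .mode("append")
    .save("/mnt/delta/events"))          # now the data is part of the delta lake
```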
Let's see here. Somebody else asked, could we actually integrate Databricks with Git? We absolutely can. So one of the features
that I'm not demoing today, but let's see if I can point it out. If I come up here, where is
the option to link to GitHub? I may need to zoom out a little bit. There we go revision history. And it says git not linked. But I would click on this link and I can actually link it to GitHub. And then the version
controlling of my notebook will actually integrate
directly with GitHub and I can even add commit notes. And we have new features
coming out later this year, where Git will actually be able to have a group of notebooks that are all committed together to GitHub, which is really, really nice. So yes, I can do version control of my notebooks this way. And if version controlling your notebooks this way is not sufficient, another option you have is the Databricks command line. You can export notebooks using the Databricks command line and version control them that way as well. So the Databricks command line API is a really key piece
actually for any kind of CI or continuous integration pipeline. All right, and Bango thank
you for adding that doc here. There's really good documentation on Databricks where people can look at the
integration for version control inside of Databricks. Somebody else asked, is there a way to automatically configure the vacuuming and versioning of our tables to keep them up to date? So yes, there is. There's actually an auto optimize and auto vacuum feature in Databricks. So let's look at auto vacuum. Auto vacuum we may not do automatically, for safety reasons, but we can say, for example, keep the latest 50 versions and things of that nature. With the optimize command, we definitely have auto optimize; Databricks auto optimize. There we go, auto optimize. And you can read about how to set it up to automatically optimize, and notice this picture here: I had lots of small files that got compacted into fewer big files. That was a really nice feature of Databricks. We had run optimize manually, but you can actually configure a table to automatically optimize itself. It's just a property you set on your table: alter table, and you would set auto optimize to true and auto compact to true.
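For reference, a sketch of setting those table properties from a notebook; the property names follow the Databricks auto optimize documentation, and the table name is just the one from this demo:

```python
# Turn on auto optimize for an existing Delta table.
spark.sql("""
  ALTER TABLE dbacademy.predictions
  SET TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = true,
    delta.autoOptimize.autoCompact   = true
  )
""")
```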
All right, and with that in mind, we have reached our two hours. I really wanna thank everybody for joining us. If we did not get to answer your question, I would highly encourage you
to reach out to our sales team. They are happy to help answer questions and give demos of the features
that we did not cover. And one thing I wanna
highlight about that, is if you wanna reach them,
we have a nice bitly link for how you can contact our sales team. Or if you're interested in
other courses and trainings that we have available, here's a nice link to reach
our Databricks Academy. I really appreciate it
especially those of you who are here for all four sessions. I know that takes a lot
of time out of your week and I really hope you learned
a lot and got a lot of value that will help accelerate your projects. Thank you again for participating and we'll see you again soon.