How to Build a Cloud Data Platform Part 4 - Machine Learning and Business Intelligence

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
- [Facilitator] We're gonna go ahead and get started. Welcome back to part four. How to Build a Cloud Data Platform Machine Learning and Business Intelligence. We will again have live Q&A throughout the presentation and at the end, so feel free to ask questions in the Q&A chat box which is down at the bottom of your Zoom player. Furthermore, we've dropped in yesterday's on-demand session link in the chat box so you can access it there. All of the videos will also be on YouTube. So you can find the sessions for days one, two and three on YouTube. Sessions one and two are up and we are uploading three now. But eventually all four will be on YouTube. So you can access the previous day's content there. We will also be sending a survey as a follow up. So please go ahead and take the survey. Let us know what you thought of the course and what you'd like to see in the future. So without further ado, I will pass it to lead instructor for Databricks, Doug Bateman, who will take you through today's session. Doug, take it away. - [Doug] All right. Thank you, Kayla. So welcome to part four of how to build a cloud big data platform for business intelligence and machine learning. Today, we are gonna be focusing on the business intelligence and machine learning aspects of our big data platform. So this has been part of a four-part series. And in the first part, we talked about the architecture of a big data platform. And then in parts two and three, we actually designed and built a pipeline using Databricks delta lakes to build out our big data pipeline. And to construct this flow from bronze to silver to gold. Today, we're gonna dive into the machine learning and business intelligence aspects of our big data pipeline. So now that you've gotten the data, and it's clean, how do I use the tools to actually explore the data do exploratory analytics and how do I go about actually training a machine learning model? So if you haven't already access the course materials, here is a short link that will take you to where you can go to access these course materials. And that will bring you out to this webpage here at academy.databricks.com and it's a private page just for those of you who are taking the webinar. And you come in and you can enroll in the course which is self-paced course and once you've done so, you'll see three courses. The one that we use for the first webinar, the content that we use for the next two webinars, and then the demonstration notebooks. And these are the notebooks that we'll be using as well for today's webinar. So for webinars three, and four, we use these demonstration notebooks and we're gonna really focus in on these today. So, when you click Start, it will show you that you need to download this zip file here. Notebooks: How to Build a Cloud Data Platform for BI and ML. You'll click on that. And then you'll wanna go to Databricks Community Edition. So if you go out to the Databricks website, you'll see this Try Databricks link. And from there, you'll be able to log in and actually create an account at Databricks Community Edition. And once you've got your account, you're able to click on Home. Choose this little drop down arrow and select Import and import your content into the downloaded file right here into your workspace and then you'll see a folder here called Webinar. And inside of the webinar folder will be the content that we're using. Today we're gonna focus on notebooks three and four. 
But before we do so I thought a short review of what we've talked about in previous webinars should be in order. So to do that, first I wanna just introduce myself. My name is Doug Bateman. I am a Principal Instructor here at Databricks. I joined the company in 2016, and helped build out the training team. And I've got 20 plus years of experience doing consulting and engineering and architecting large solutions. Databricks is a company that's really focused on this vision of unified analytics. So what this means is that we're trying to create a big data scalable platform that can do your entire ML pipeline, starting with ETL and data cleansing, and then creating a data warehouse out of that really using a data lake. We call that the lake house pattern, the idea that we're gonna use a data lake to implement a lot of the things that are traditionally done with a data warehouse. And this gride is cheap and scalability. And so low cost, high scalability with elastic compute, elastic storage. And then you can go and do your machine learning and your data science. So this whole idea of this unified analytics platform coupling streaming, ETL, analytics, machine learning and business intelligence. We are the original creators of the Apache open source project as well as the Delta Lake project, which is a really powerful file format and runtime for processing and storing your lake house files and MLflow for doing machine learning experiment tracking. And we're gonna be looking at MLflow today. And I say we're the original creators simply because this is now a very large, successful open source project across each of these three areas Apache Spark, Delta Lake and MLflow. And additionally, there are 200,000 plus companies worldwide using this platform to do their big data and machine learning. And so when we're using Databricks, we are looking at how do I ingest data with ETL and make that available in my Data Lake. And then, using the Databricks workspace, to be able to query that data lake and to gain insights, or to do machine learning and model training. So our end users are using that workspace to do this and that's what we're gonna be working inside of today is a Databricks workspace. And in the previous webinars, we looked at this idea of a pipeline for cleansing our data where the first goal is just capturing the data. However it is from whatever system, the first goal is just to get a snapshot of that data. And we call that our bronze table. And the idea is we're gonna use cheap elastic cloud storage to just capture as much data from different external systems as possible. And we're not gonna worry about cleaning it right away. And this server comes one of the main pain points that people have often had about data warehousing, where there's so much time spent uploading and cleaning the data before they can even get it that they're afraid to even upload and clean the data or even upload the data into the warehouse, because so much work is involved. So we say, upload it now and then clean it when you need it. So when it comes time to need the data, you'll then create a pipeline to create our cleaned or silver tables using those bronze tables. And we saw in seminar three that we could use structured streaming to keep those silver tables up to date whenever the bronze tables change. And then you can build your data marts by reading from the silver tables and writing out roll ups or reports or featurized ml tables, tables that are ready for consumption by your end business users. 
And when we did this, here's a nice little set of review of some of the features of delta lakes. Is part of the notebooks that you would have just uploaded. If you come in here under the Demos folder, I've included a number of really nice demos to help you remember what you learned today. And I'm gonna use this now just to do some of our review of the previous sessions. So if you go here to Demo 01, it's really all about what is the power of the Delta Lake. Let's check to make sure that my cluster is up and running. There we go. We'll launch our cluster. And while we're waiting for a cluster to start, let's go ahead and do that review of delta lakes. So the primary problem, and anytime you're doing any type of business intelligence is that data initially is siloed and messy. It's in lots of different systems. It's spread out across the enterprise. And so the idea is to bring the data into a single source of truth. And a data lake is a critical piece to that because it's able to store so much data. So we bring the data from all these different systems into our data lake. Delta is this enabling technology that makes that really easy to do. And we're gonna talk about why delta is one of the best ways to build your data lake. And then we can use Apache Spark to serve as this query engine for our data lake and feed that information into business intelligence and reporting. And what we'd really like then is to have data coming in from all sorts of different sources, read it into Apache Spark, populate this data lake, and then be able to use Apache Spark to do machine learning and reporting. But the challenges we have are number one, we want data to be consistent. So if somebody is busy writing to a table, and we're reading from the table, we wanna make sure that we get a consistent view of the data. That is I don't see newly written data until it's finished writing, that out of necessity constraint. And this is one of the critical things that we're looking to have our Delta Lake solution provide. We also wanna be able to do incremental reads from a large table saying, show me what's new. And if a right to a table fails, we wanna be able to roll back or you wanna be able to access historical views of the data. See what it looked like back when I trained my machine learning model. And we need to be able to handle late arriving data and update our downstream views without having to go and reprocess or delay processing downstream. And this is where delta lakes really comes into the picture. Delta Lake is an enabling technology for building a data lake. So that's where the pun is. A Delta Lake is a technology for building a data lake. And it allows us to unify batch and streaming and to retain our historical data as long as necessary. And we get to use independent cheap elastic compute and storage to be able to scale at low cost. So with delta lakes, we're able to get isolation between different snapshot versions of the table, only seeing new data once it's been written. And if I'm in the middle of reading, I will not see what new writers are writing until I am done with my reading job. I'm similarly able to optimize the file format to get large scalability. We saw the optimized command and how it compresses small files into big files, as well as really scalable handling of the metadata of the table. We're able to go back in time. So we looked at the time travel feature in the previous webinar. 
And we can even replay historical data using streaming so that we could backfill and load our downstream tables, which was really, really powerful. And this gives us those atomistic or atomic guarantees. So we have our data ingestion with bronze, we clean it up with silver and then we do our aggregation with gold. And some people rightly pointed out, you could call these your load tables, your warehouse tables, and your data marts if you were using more traditional lingo. Now, so this is what we really had gone and done. And we saw that using delta is really easy. Instead of using the parquet file format, we just change our code to use the Delta file format. Now this particular demo is written using Python. We had done ours using SQL, which had a lot of advantages in terms of lots of people, no SQL. And I wanted to just scroll down a little bit and point out some of the key words that we saw that we could use. We are able to delete from tables, we're able to update tables, we're able to merge into a table, that's the up cert operation. So it doesn't insert if the data is new, it doesn't update if the data is already there. Which was really, really powerful. So we're able to use this merge syntax. And then there's some features we didn't spend a lot of time talking about, but they are there, they're powerful. And those of you who'd like to read about them are welcome to come and look at this notebook. And this is schema evolution. The idea that I can evolve the schema of my delta table over time by adding certain compatible types of changes. And so I'm actually able to set merge schema true when I append to a delta lake, in which case, if I'm adding columns, it will actually add new columns to that schema. Which is really nice because anytime you've had a table for a long period of time, it becomes very important to be able to evolve the scheme. We also saw we could do time travel, where we view the history of a table. So describe history. Let us see what's the status of the table, what's changed over time. And I can go back and view prior versions of that table. Oops, what did I just do here? I managed to delete the cells, that's okay. We can go back and see prior versions of the tables. And so this was really, really cool, this power of the Delta Lake. So now we're gonna dive in and look at business intelligence. And how do I go about connecting a tool like Tableau to my data lake. To do that, we're gonna need to first of all, launch Tableau. So I've got Tableau up and running, and I'd like to connect it to Spark. So Spark will be the query engine, and I'm gonna connect Tableau to Spark. So I'm gonna then need to go and look at my cluster. And it looks like my cluster is still launching. So I'm gonna go ahead and just grab one of these other clusters since it's already up and running. And I'm gonna click on the Advanced Options here. And I see the section here called JDBC/ODBC. And this is basically a way in which I'm able to connect to a running cluster from an external tool like Tableau. So I come over here to Tableau. And I simply scroll down and I said, I'd like to connect and I click More and I find Databricks here in the list. And I click on Databricks. Now, when you first do this, if you do not have the Tableau drivers installed for Databricks, they'll be a little prompt down here that will tell you hey, please download and install the drivers. 
And that actually just takes you out to a little website at the Databricks website here, where you're able to download the drivers that you would need to connect Tableau to Databricks. Now I've already installed those drivers. So at this point, I can go ahead and point to the information that I see in my cluster. So server host name is trainers.cloud.Databricks.com. And I would come here to where it says HTTP path, I'll copy this. And then for username, I have a choice. I could use my Databricks username, but I would rather have an application specific token. In which case, I'm gonna just make the username be token. And I'm gonna generate a token now for use with Tableau. So I'm gonna click appear on my account, and I'll go to user settings. And I click on access tokens. And I'll generate a new access token and I'll type in here Tableau. And I could set a lifetime that this token will be valid for. In this case, since it's going out to the world, I'll keep its lifetime very, very short. And in fact, I'll revoke the other token that was showing on the page. And then I come back over here and I paste that token into my new connection. And I click Sign In. At this point, I am now connected using Tableau to Databricks. So at this point, I would choose which database I wanna connect to. And the database that we've been using was DB Academy. This is one that we set up in the earlier webinars. And I click Search. Here we go, I've got a match for DB Academy. Now it's connecting to DB Academy. And then if I click this little search box on table, it will tell me what tables are available in DB Academy. And here you can see our health tracker tables that we created, including our silver tables, and our gold table. So let's see here, there is our silver table right here. And then daily patient average would be our gold table. Let's go ahead and open up our silver table. I click it. There we go, it's loading the metadata for our table and sending the query out to Databricks, which is our big data cloud platform. There we go. And I could see name, heart rate, time, date and device ID. And then I click here and say update now. And I'll get to view the users and what their heart rate is. And now I have the full power of Tableau at my disposal to be doing exploratory data analytics that are more familiar business intelligence tool. And of course, Tableau has a lot of really awesome graphing capabilities and display capabilities. And so one of the things that I wanna show you is some of the graphics that you can do in Tableau. To do that, I'm gonna pick a slightly more interesting data set that lends itself to some nice graphics. So I'm gonna go to the loans schema in my data set. And that's the schema that you would get if you ran that notebook that I just showed you that was reviewing delta lakes. It's actually using the loans. Where's that, click again, or maybe loan, singular? There it is loans. And then we look for the tables that we wanna have. And these are ones that come from that notebook that I had just demoed that 01 delta review. And again, I could see the gold, silver and bronze loan information. In this case, I'm gonna load the gold table in Tableau. And choose update now. And what I'm able to see is loan information by state. So this is a sample data set that just shows who's getting loans for what purposes and in what state. The synthetic data set like some of the others that we've been talking about. But now I can go and do some really cool visualizations. 
So I'll come over here and I'll click on sheet one. And I could say, I'd really like to look at this map of states. So I come and I drag Address State into my worksheet here. And it says, all right, I've got data it looks like for all 50 states. But if I'm interested in doing some display on some of the measures, I can change the plot type to be a colored plot. And then I can come over here and say, I'd like to be plotting based on the number of loans or the amount of the loan. Let's do a plot based on the number of loans. So I'll drag count over here onto my map. And sure enough, I'm able to get a nice visualization of all of the data that you're seeing here is being powered and delivered to Tableau from Databricks. So what's happening inside of the database platform, I'm using my data lake. And I've linked it to a really popular business intelligence tool. And this is really, really cool. Now, I don't have to use a tool like Tableau to do this type of work. I am able to do a lot of this type of exploratory data analysis inside of Databricks itself as well. So to illustrate that, I'm gonna come into our webinar and go to our demo that we were just talking about, build and manage your data solution with Delta Lake. And let's see if I can find some good plots in here. We'll connect to Joel's cluster. Thank you, Joel. And let's see here. So again, I've got that same data right inside of Databricks. And then I can come here, and I could choose to do a map plot. So let's see, I would choose a map. And again, I could see this information right here live. And one of the really cool things that I can actually do if I want to, is I could combine this with streaming spark.readstream.table.createOrReplaceTempView. And I could make a temp view called gold loan, gold_loan_stats_live. So that's a straining view. And then I can run this query on the streaming view oops, to temporary view so it doesn't have a database name to go in front of it. And now I'm launching a streaming query. And let's choose a state to modify. So let's change the data for Pennsylvania. And we'll come back to our streaming view here, oops, map, plot options, and I'm gonna be looking at the amounts of the loans. Let's do the count of the loans. There we go. And let's change the amount of the loans for Pennsylvania. So I could do, come in here, I'll find Pennsylvania. The amount of the loans is currently $45,000 roughly. But I could do an update statement at this point. Back to map view again, and do update. And I have to update the underlying gold table, not the view. Update this, set or where state equals PA, set amount our address state, a-d-d-r state. Set amount equals and let's make it a really big number. And let's do Indiana instead just because it's right in the middle of the country. Ah. And the name of the table was called loans.gold.loans_stats. Oh but it's not a delta table. We want it to be different Delta table. Shoot, we're gonna have to scroll down to find a delta version of this table. Where's my delta version? I'll just scroll down a little bit here. And this time, upload gold loan stats. All right, we'll change Wisconsin loans. And we will see that data set change as the streaming query updates, which is really, really cool. So in this case, I updated Washington State. And you see Washington State now has a significantly larger number of loans. So this is really cool, I could do this type of analysis right inside of Databricks. Or I'm able to do this type of analysis using a tool like Tableau. 
And you can even see that Tableau is now showing higher data as well. This is very, very powerful as a platform for business intelligence. Now, where I'd like to go next will be to explore Databricks as a platform for machine learning. So to do that, we're gonna open up this next notebook, 03-Machine Learning. And we're gonna choose a slightly more interesting data set, the Airbnb data set. So these are rental prices for Airbnb in San Francisco that were made open source or available to public. So we're gonna come in here, we'll run classroom setup. This just makes sure that the datasets are available. And I'll run this a second time. I am getting a slight warning message here. Because Joel's cluster, the one that I was using, wasn't set up for machine learning. So I'm gonna switch it over here to shared tiny, which has the machine learning libraries installed. So one of the cool features of Databricks that I wanna point out is that Databricks you have a version of the Spark runtime that has a bunch of popular machine learning tools pre-installed. So notice here I have a choice between version 6.3 or version 6.3 ML. The ML version comes pre-installed with a lot of popular machine learning libraries. So that's what we just did. And similarly, there are versions even that work on GPUs for large scalability. So I'm gonna switch over here to share tiny, and I'll rerun classroom setup. And this time I won't get the error message because the machine learning libraries are pre-installed. There we go. Now this is the data set that we're gonna be looking at and it's the Airbnb data set. And so it's been made available here under dbfs. We just finished mounting this s3 bucket to mount training in the data file system. And now I'm able to read in the data from Airbnb. Now the code today when we do machine learning, we're gonna move away from SQL and towards Python, which is a very popular programming language for doing any type of machine learning. So to do that, we'll come in here, we've loaded our data from the file, spark.read.parquet, I give it the path. And it gives me a data frame, which is basically the Python equivalent to a SQL query. So I could immediately display that data frame, AirbnbDF and I would see the query results from reading that file. I can even do filtering like filter where instant bookable is true and filter the queries. Or where room type is private room. Now I'm only seeing, where's room type. Private room. I'm only seeing private rooms. So I get to use a more familiar data frames API. And Sparks data frames are slightly different than some of the other Python data frames libraries like pandas. There is a really great open source project out there, excuse me, a really great open source project called koalas. And koalas attempts to mimic the pandas API, but powered by Spark. And it's definitely something that I encourage the data scientists here to check it out. Koalas is another open source project that is currently being sponsored by Databricks. So you could think of a koala as being a panda on Spark. But I'm able to do queries using Python as well as SQL. Now what I wanna do is take this data set that we had just done. And a good data scientist knows that they want to train the model on different data than they use afterwards. So I'm gonna train the data today on the data I've got available. But in order to know that my model is any good, I need to make sure that the model makes good predictions on data that it didn't see when it was being trained. 
So if you ever look at stock market predictions people go past performance is not necessarily an indicator of future results. What this is really saying is, just because you've got a model that does a great job on past data, it's how it does on unseen future data that's the real proof of quality. So what we're gonna do is take 20% of our data and set aside for testing and evaluation purposes. And we're gonna train our model on the remaining 80% of the data. So 80% of our data we're gonna use to build a model and the 20% is gonna be data that we never saw before. And we're gonna find out how our model does on data that's never seen before. So I'm gonna split up this data with an 80/20 split. And you'll see that in this case, my training data, I have 5,758 rows. So I've got that many records in that dataset. Now by fixing a random seed, if I was to run this code again, I'm still gonna get that same split. The way it does the split in a highly parallelizable fashion, rather than doing a perfect 80/20 split. What it really does is for every row, it rolls the dice. For every row, it rolls the dice. And if the dice come back greater than 80, it's in the training set or test set. If it's less than 80, it's in the training set. So if I don't fix the random seed, which I really don't need to do, you'll notice that the number of rows will vary 5700 to 33. I run it again 5720. That's because it's choosing a different random sampling. And because each row is done by rolling the dice, sometimes they end up in one set, sometimes they end up in the other set, but they got approximately an 80/20 split. And you ask why not a precise 80/20 split? And the answer is that by using random numbers in this fashion, I'm more scalable because I don't have to communicate across machines. Oh machine a, you got 50. Okay machine b, you should only get 20. That would be a lot more coordination and it wouldn't scale as well. So by doing an approximate 80/20 split, every machine is able to work in complete isolation from each other. And I still get a good approximate 80/20 split. If I set the random seed, then that random number generator is gonna produce the same random numbers every time or so you would think. But even that is actually subject to fluctuation. So if I change the number of partitions that I break my data up into, you'll notice that because the data has been split up slightly differently, my random number went from 58 at the end, to 28 at the end. So when my cluster size changes, or the number of partitions of my data change, I still get some difference in my randomness. Because remember, those random numbers are being generated on each machine independently. Change the number of machines, I change the random numbers. Or if I change just the way the data is divided up, even if they don't change the machine, I change those random numbers. So far, so good. I'm gonna check the Q&A just to see if any questions that popped up. Oh, Dexter asked a good question. He said our access tokens not available for Community Edition. That is correct, Dexter. Access tokens are not available on Community Edition. That's a feature of the enterprise full version of Databricks. So for Community Edition, you would just use your regular username and password. And yes, the ML runtime is available in Databricks Community Edition. That's actually worth pointing out. I'll log into Databricks Community Edition right now and point out that question. 
So if I come in here in the Databricks Community Edition, when I go to launch my cluster, let's create cluster, my cluster, you'll notice I have a choice of either the machine learning runtime or the regular runtime. And this drop down is actually one of my favorite features of Databricks. Notice I get different options for different versions of Spark. So for people who are deploying Spark on premises, rolling out software updates is painful. But if you wanted to update your version of Spark in Databricks, it's just a matter of picking from a menu, which is really nice. One of the pains when you have an on premises Spark solution is people go, how many machines should we buy? Well, I'm not really sure. We haven't done testing yet. Well, we need to place to order for the machines. And then you order the machines, they finish installing Spark and a new version of Spark comes out and you go but I want the new version. And they go, you're kidding, right? We just spent three months installing these machines for you. With Databricks you're able to launch the number of machines that you need, on the version of Spark you need, with a simple click in the menu. It's one of my favorite features in all of Databricks. It's not the flashiest feature, but anybody who's done a real project knows the benefit of just being able to spin up a cluster in a few seconds as opposed to a few weeks. All right, back to where we were here. So at this point, I wanna do some linear regression training. We're gonna start out with a relatively simple linear regression training, where we're gonna just try to predict the price purely based on the number of bedrooms. And I'm doing that for this webinar to keep things relatively straightforward and digestible focusing on the machine learning capabilities as opposed trying to do a really elaborate machine learning example. If you were to take our three-day machine learning instructor-led course, we go through the full gamut of doing cross validation and training with multiple features, feature extraction, categorical variables. Today, we're gonna keep things a bit simpler, just doing single variant linear regression. So I wanna predict price based on the number of bedrooms. How would I go about doing that? Well, step one here. Let's just look at information about price and number of bedrooms. And notice that when I call summary, I get to see, okay, your cheapest price, somebody's apparently willing to rent out for $10 and somebody else is renting out for $10,000 a night. That's quite the spread. But if we look at the median, the average price per night is $150. Similarly, if we look at bedrooms, we'll see that somebody is renting out their place with 14 bedrooms. This must be a Scottish Lord with a castle. 14 bedrooms, very impressive. So it does appear that we have some outliers in our data sets. So we just wanna keep that in mind as we do our work. We could also do a little plot here, a scatterplot where I will plot a computer plot options. And I say I really wanna just look at price and number of bedrooms. So that's number of reviews, bedrooms and price. There's price. Now it's running a query, and setting up my plot to do a scatterplot of number of bedrooms versus price. Now, one key thing here, because I'm running inside of a web browser, the visualizations that you'll see inside of your web browser in Databricks are gonna be based only on the first 1000 rows. And that's because if I sent a billion rows to my web browser, my web browser would run out of memory. 
So Databricks by default will limit those rows to only be 1000. Now if I used a popular plotting library like matplotlib, I could actually do quite a few more rows. Or if I use Tableau, I could obviously get the full data set. But this is very interesting here because you could see that for the most part, we do have something that looks somewhat linear to start with, but it starts to get a little bit weird as the number bedrooms gets up to four and five. And I also have these outliers like $10,000 for two bedrooms. Well, hey, nobody stops the guy from listing his place. It just means we don't know how popular his place is gonna be. And this is an outlier, that will definitely throw off my data science. Now, what we would like to do is simple linear regression. So what I'd like to say is, hey, I wanna train predicting the price based on the number of bedrooms. But if you naively write this code, and then run it by going linear regression, please fit a model to my training data, we're gonna get an error message. It says column bedrooms must be a vector and it was a double. So it was expecting an array of doubles, array of double, but it was actually just the double. And I get this error message. Mhhh, what's going on here? Well, the trick is based on this label here, features column. It's actually expecting an array of features, not a scalar value. So what I need to do is tell Databricks or tell Spark what are all of the features I wanna use. So I'm gonna say input columns is bedrooms. And it's going to, I could give it a list. I could give it bedrooms, I could give it number of reviews, or num reviews. I can get a whole list of features, and it's gonna add a column that is putting all those features into an array. In this example, we're just gonna train on a single feature. So that's the job of the vector assembler. It says these columns in, this column out. And now if I look at what the output of a vector assembler is, you'll see that it's added a new column at the end called features. And it's a sparse vector. So what that is actually a dense vector in this case, with the value one if it's one bedroom, value two if it's two bedrooms, and so forth. It is a little bit weird when you look at a vector type in this display, because the first cell is one if it's a dense vector, zero if it's a sparse vector. The second item is the size of the vector. The third item would be the indices and then the last one is the actual values. So if you're reading this, you really wanna look at two and three. In this case, I have a vector that just contains the value two in it. So it's adding a new column to the data frame that combines a bunch of these other columns. I now can feed that into my linear regression example. So I can go linear regression. I'd like to read in these features and predict the price please. And what I'm returned is a linear regression model, that's a form of machine learning, a linear regression model, that given the features will predict the price. And I can look at the line, so I can go into this linear regression model, I can look at the slope and the intercept. And I can get the equation for the line. So on average price, I can even label these here, price is num rooms or bedrooms. So it's gonna be $120 per bedroom plus $50 is the best fit line for that data point in San Francisco. But of course, that's including some of our outliers. Now to find out how good my model is doing, this is where our test data set is gonna come in. So come down here to test data set. 
And I could go linear regression model, I would like to transform this test dataset. Here's my test data set. I'm gonna transform it to get my predictions. But remember, first I have to run it through the vector assembler to extract that vector of features, and then I can apply it to the machine learning model. So let's run that and see how well we did. And we could see, for one bedroom here, the actual price was $130 and I predicted $173. And you'll notice that every one bedroom place got predicted at 173. And that's simply because every bedroom will be for 173. And that's simply because the fact that we're only training on a single feature currently. But we could train on lots of features. Now, the next step will be to evaluate how good our model is. But before we do that, I wanna show you another way of writing this code. So notice that I'm vector assembling. And then I'm creating a machine learning model. We can package these two steps up into what's known as a pipeline. So from pyspark.ml.import pipeline. From pyspark.ml.import pipeline. And then I would say, pipeline equals a pipeline, where the stages are first gonna be to assemble the vector. And then to build a machine learning model, a linear regression model. And what the pipeline does, is it simply allows me to have a whole series of transformations lined up, back to back. So maybe I was doing one hot encoding or I was working with categorical variables or string indexing. I could include all of them those as stages in my pipeline. And now we just go pipeline.fit testdataframe. Or let's start with my training data frame or not. TestDF, and that's gonna give me back a model. And then I could go model.transform, and I give it my test data frame and I get my predictions. And this will display the predictions like we did before. And these are my predictions, price and prediction. So a pipeline allows me to have a series of transformations saved into this reusable component. One really nice thing about doing a pipeline this way is I can actually in turn save the model and save it off to disk for later reading. So I could save the model off for later use and read it back in again. Now, in order to determine whether this is a good model for predicting Airbnb, a good data scientist needs to be prepared to do model evaluation. So this is where evaluating the model comes in. So for this, I'm gonna use a regression evaluator to compare the prediction from the actual price. And I can get the root mean squared error. That's a standard metric for determining how far off are the predictions from the price. One of the nice things about root mean squared error is that the unit's match. So in other words, my model is off by $290, typically. Wow, that is a pretty big error. But it's not at all surprising, because I've only predicted based on the number of bedrooms. I didn't take other things into account like popularity of the listing or what neighborhood it's in. Or what are the standard reviews for the listing. So it's not at all surprising that my model is off by $290 typically, as a standard variance. Simply because I didn't bother computing on anything other than bedrooms. And if I wanted to use a different metric, I can like I could use r squared. In which case my r squared metric change by labels here. R squared, in this case is not a very good score, 0.12. One is very good, zero is not very good and negative would be terrible for r squared. But this is a simple machine learning pipeline consuming from our data lake. 
So notice that at the very top, I wanna link this back to what we did before. I can read from tables in our data lake. It could be a parquet file, it could be a Delta file, or it could actually be a table. I could go back to SQL here go create table as of listings, using parquet in this case, location and tell it where to find the file. And then instead of going Spark.read.parquet, I would just go spark.read.table SF listings. So I'm able to link to the datasets that are in my data warehouse. Alexis asked a question. She said how could r squared ever actually be negative? That's an interesting data science question and it is a good one. R squared actually, in fact can be negative. Which means that the way r squared is computed, when we looked at root mean squared error, you saw the units were in dollars. So let's come down here where we did root mean squared error at the bottom. Root mean squared error, the units are in dollars, because that's what my price is in. But that's annoying, because it means that if I changed my units to something other than dollars, for example, I was doing it in millions of dollars, then my RMSE would suddenly be a lot lower. And so it's hard to know what's a good RMSE and what's a bad RMSE. So the solution to that is to scale the root mean squared error. And the way you would do that is you would just look at what if I did a naive model? Where I just used the average price which is represented as x bar, the average price. How far off would I get the, what would my RMSE be if I used average prices supposed to your machine learning model? And so if we look at the formula for root mean squared error, let me go ahead and pull up a slide on that because it is an interesting discussion. R squared. So here we're looking at the residuals from our prediction. So what did you predict? What was the actual or actually this is the actual versus, no prediction. Prediction minus the... no the actual minus the prediction squared. So that you're summing up the errors. And then you compare that to what would it be if it was the actual minus the average. So if you did a naive model where you were just looking at the average, how would that be? And then you just use that to scale how good we did. So instead of being in dollars, I'm dividing out the unit. So if it was in millions of dollars, or in dollars, it wouldn't matter. The units are gonna cancel out here. But what you'll notice is that, what is the worst possible score I could ever get? Well, if my predictions were perfect, the best score I could ever get would be a one because there would be a zero up here at the top. But if my predictions were terrible, the thing at the top could be a million. And if it did way worse than say, just using the naive model, which just gave you 10, notice this would in fact be negative lexis. So while rare, it is possible to have a negative r squared, which means your model is doing worse than the naive model. So let's continue on now. Oh, by the way, here's a picture I meant to show you earlier. These are different types of plots that you could do in Tableau using that same data set. Just wanted to show some pretty pictures from Tableau. I had meant to mention that earlier. Let's go back to machine learning. So this is how we evaluate our machine learning model. Now, what I'd like to do next would be to show you how to use a technology called MLflow to track your machine learning experiments. And in the process, we're gonna get to do a slightly more interesting machine learning exercise. 
So this is notebook for MLflow. MLflow is basically a tool for logging your ml experiments. So any data scientists that has been doing it for a while they sit there changing what features they use. They trained what machine learning algorithm they use. They try all sorts of different settings and the end up building hundreds or thousands of different machine learning models. And at some point, it's easy to get lost as to which ones did well. And then once you've zeroed in on that machine learning model, you wanna go a step further and you wanna save it out and be able to use it in production. And this is where MLflow comes into the picture. MLflow is a tracking tool for all of your machine learning experiments. So we'll run classroom setup that just makes sure that the datasets are available and that the necessary libraries are available. MLflow comes pre-installed with the machine learning runtime. It's one of those libraries you would have to install if you were not using the machine learning runtime. You are using the base runtime. So the difference between the machine learning runtime and the basic runtime is that the basic runtime does not have a bunch of libraries pre installed, giving you total control over what versions of libraries you want to use. Whereas if you use the machine learning runtime, it has a lot of these popular machine learning libraries pre installed, and we find that that is extremely popular with data scientists who just wanna get up and running quickly. And if you wanna know precisely what version of what libraries installed, you would just look at the release notes for the version of the ML runtime that you're using. All right, so step one. We're gonna read our data set. And again, we could read it from the file, or we could read it from the table SS listings that we just created in the previous example. Either way, we could read it from a table name or directly from the underlying file. And like we did before, we're gonna do an 80/20 split. Now, let's go a little bit further here. What we wanna do is a standard machine learning pipeline again, so the one you saw from before, but you're gonna see a bunch of stuff surrounding it. But let's just recognize it first. We're gonna build a vector assembler and linear regression. Remember seeing that. We're then gonna build a machine learning pipeline that's gonna do vector assembling, followed by linear regression. And we are gonna train our model. And then we're gonna make predictions on our test dataset, evaluate what is our root mean squared error and compute that. So what is all the other stuff I see on the screen here? Well, the rest of this is MLflow. What I'm gonna do is say MLflow, I wanna start an experiment. I wanna start training a model. So here's my run. And I'm gonna call this run linear regression using a single feature. I could call it anything I want. I could call it fill. It's just the name. And this with statement is a Python feature that says I am gonna assume the experience is over when I leave the scope of the with statement. So by the time I get down here to the very end, where I might be displaying something, it will have saved that experiment to disk. The moment I leave the with statement, it will transmit that experiment off to the tracking server and make it available to me for viewing. And so it gives me this object here called run that I can now use for logging purposes. 
Now I'm gonna go MLflow, I would like you to log that I'm really gonna be training based on price and the number of bedrooms. Just a note I'm making. Tracking in my log path. You'll be looking at MLflow as being a logging system. So I'm gonna log that this experiment was using price in bedrooms. I also just finished training a machine learning model. Let's save the machine learning model that we made, to my log, so that I can get to that machine learning model later. And since I computed the root mean squared error, let's save the root mean squared error to my model as well. Now managers of data science teams love this because if they wanna go and look then at what models their people are training, all that information has been logged and is available later on. So we're gonna train our machine learning model, and let's see where it got logged this time. That's not needed. All right, now, I come over here in the top right and I see runs. And there it is. I can see in this case, I just ran a training on this notebook right now. I called it, double click on it, I could see the version of the notebook that actually ran. So this is the version of the code that actually ran when I trained this model. I can see it was called price bedrooms and it had a root mean squared error of 290. And if I want more detail, I can click on this little icon to view the experiment. So here we go. I can see I did a linear regression with a single variable on April 17th of 2020. These are ones that I did previously while testing stuff out. I could see who trained that model, the version of the notebook that was used when the model was trained, and then in information that I chose to log like price bedrooms and the root mean squared error. Previously, I'd done an experiment where I looked at the log price as opposed to the actual price because I noticed there was a log normal distribution to price. And I was able to train a slightly better model by using log normal distributions. But I can see all of my past experiments. And then if I click on this particular experiment, and drill down, I can see, again, the metric root mean squared error. But I also have this here artifacts, things that have been saved. So there is my machine learning model. It saved off the disk. Well, let's say that I really liked this machine learning model and I wanna start taking it to production. There's this link over here, register model. Let's click it. So I'd say I'm gonna create a new model in my registry. And I'll call this my Airbnb model. And I click Register. There we go. And now I can come over here on the left, you'll notice this button models. Now this button is available in the full version of Databricks. If you're on Databricks Community Edition, notice that you do not see the model registry. So that is part of the professional version of Databricks, as opposed to the free open version of Databricks. But I'm able to register the model and I get this button on the left where I can see all of my registered models. Now this is for the purpose of taking a model to production. So notice that I get to say, when did I move that model into staging? When did I move that model into production? So I come here now to my registry. I can look at my version of my model. And I could say, all right, I would like to transition this model into production. And I can make a comment about when I transition it into production, so that other people are then able to grab that model and use it which is a really powerful feature. 
They just go through the Databricks APIs, and they can grab the latest production version of my model. And as I have new versions of the model, you'll notice that this version number will increase and I'll get to move that version into either production or staging or production as they come out. Really, really powerful. This is known as the model registry. And we are even adding the capabilities soon to do model serving, so that you'll be able to actually hit a rest endpoint to have it provide you scoring based on the model. All right, so that is the basic part of using MLflow as a logging API. So let's do a slightly more interesting example. This time around, I'm gonna grab more features than just the price. I'm gonna grab all the features I can get. So price is gonna be what I'm trying to predict. Everything else is gonna be a feature that I can use for making predictions. And this little utility is a nice way of grabbing all the features. It uses the formulaic approach available in the R programming language. Oh Sid just ask a question that I can't resist answering. Can you do deep learning and natural language processing? Yes, you can Sid. For an example of doing deep learning using TensorFlow, I'm gonna point you over here to the demos 03 operationalizing data science. There is an example of using Kerris and TensorFlow. For examples using doing NLP on Databricks, we have some really good blog posts that you might check out. Or you can take one of our full trainings where we do NLP, natural language processing, NLP. All right, so I'm gonna grab all the features this time, and put it into an actor, and then do linear regression. And out pops my pipeline model. Again, I'm gonna log my model as well as logging a label And any metrics. What is my RMSE? What is my r squared? So I'll run a bigger experiment, and see if I can do better than my previous experiment. It's a little bit slower today because I'm using a tiny machine as opposed to a big cluster. Yesterday I was using big beefy machines for our delta pipelines. I didn't need to be. I just chose to do it to reduce some of the waiting that we're doing in class. Today we're using this tinier machine. And the real reason for that is quota. When I tried to launch a bigger cluster, it turned out I didn't have enough AWS quota, somebody else was using that quota. So I'm sticking with a smaller machine for the moment. It's just gonna take a little bit longer to train our machine learning model. While we're waiting on that to run, let me see if my quota is finally available to launch my machine. I'm gonna come down here to my cluster again. Revisit Doug's cluster. Let's see if we can get it to come up, maybe quota is available now when it wasn't available earlier. Ah, somebody asked how does the vector assembler know which data frame to pull the data from? So if we go up here to where there's a vector assembler, how does it know which data frame to pull from? And the answer is it gets the data frame right here on line 16. So somebody said, how does the vector assembler know which data frame? It gets it here. When I go pipeline.fit, it's literally taking this training data frame and providing it to the vector assembler. And then it takes the output of the vector assembler and provides that to the linear regression. And then it takes the output of the linear regression and that is my model that I've produced. And it's able to use that model when making predictions down here. So it's this line here that tells it which data frame to get it from. 
And Brian Pan asked, is there a visual UI for MLflow? I think we just got to see that when I went over here and I clicked on runs. You got to see the visual UI for looking at MLflow. So let's scroll back down here. There we go, I finished doing my machine learning. Let's click on runs. Here's my latest run, price all features. And notice that indeed, my root mean squared error did drop when I added other features. But it didn't drop enough. And one of the reasons it didn't drop enough, is that again, I still have a lot of outliers on the Airbnb dataset, because I'm not predicting what the price should be to maximize profit. Rather, I'm trying to predict what people have been listing their Airbnb's at and some people were listing theirs a $10,000. So we'd have to deal with those outliers. The other reason why this is somewhat high, in terms of error, is that I am assuming that there is a linear relationship between price and the number of bedrooms. And it turns out that it's not a linear relationship. It's an exponential relationship. So what I really wanna be doing is looking at the log of the price to do my linear regression. And these are the types of things that a good data scientist would do as they're running lots of experiments to try to find a model that does a good job making predictions. They might also try things like neural networks, as opposed to doing linear regression to do this type of machine learning. And you would log all of those experiments here using the machine learning APIs or the MLflow APIs. So now, I could click on this guy, open them up. I could scroll down to the model and say register the model. And I would like to replace the current Airbnb model with this new one. And I would click Register. And then I come over here to the registrations pending, we'll give it a moment to finish installing that model. While we're waiting, I'll look to see if there are any other questions here. Ernesto said, well, how does the pipeline know which vector assembler to use? So again, let's go back to the code Ernesto. Notice that when I define the pipeline, let's come up here, when I defined the pipeline, I tell it which vector assembler to use. The way I usually like to write this code that I think is a bit more readable, is when I define the pipeline, I often will define the pipeline right here in more linear fashion. So I'll say here's your vector assembler. Here is you linear regression. And now it's very clear my pipeline is gonna do vector assembling and then linear regression. And remember, the vector assembler is producing a single vector column called features. And then I chose to put the features column into the linear regression. So notice that I'm explicitly naming which columns I want to be using here. All right, so my deployment is finished. Let's go back over to this time, it's now registered. And again, if I go over to the model serving layer, so let me click on models, I see that version 2 is the latest version. But version 1 is the one that's still in production. So I have the latest version, that's the development version, the staging version and the production version. And I could click here on Airbnb. And I could say I would like to take version 2, and move that into staging please. And now I've moved it into the staging state. So now I could see that if you're playing in the staging area, you should be using version two. And I can retrieve this model using the database API's and actually do live scoring with it which is really, really powerful. 
All right, I wanna show off a few other capabilities of MLflow. So in this experiment, I'm actually gonna do the log of the price as opposed to the actual price. So I'm gonna take my data frame and I'm gonna add a new column called log price. That will be the logarithm of whatever the value of the price column is. Now I wanna highlight I am importing a log function from Spark. I'm not using Python's log function. I'm using one from Spark that knows how to work on entire columns of a data set as opposed to an individual number. So the log function of Python would expect a floating point number. The log function in Spark would actually work on that entire column of data, as opposed to an individual row. So at this point, we're gonna work on the log price. We're still gonna run it through our formula. This time, I'm gonna do log price. And I wanna use all features except for the price because I'm gonna use log price instead. I don't wanna predict log price by price, that would be cheating. So I'm gonna use all features except for the price. And again, I'm building a pipeline. It's gonna start with our formula, then linear regression, I got my pipeline. That's gonna train on my training dataset, yields a model. I'll log that model using an outflow. I'll then make some predictions. And I'll log the various scores telling me how well we're doing. And NC just asked, is the job scheduling an automation using Databricks available and the answer is yes. You can actually do job scheduling over here in the jobs tab. He also asks the question, could I use a tool like Airflow or Azure Data Factory to launch jobs? And the answer is yes. Integrations exist with Airflow, as well Azure Data Factory and a number of other schedulers as well to launch jobs. So you can define the jobs in the jobs tab and then launch them using these other tools. Or you can launch them with our own built in scheduler and not rely on these other tools. Lots of options there. So here, I'm gonna train a log normal distribution. And while we're at it, let's have a little fun. Let's do plotting. So plot is a library that comes with matplotlib. It's a Python library matplotlib. Let's do a plot. And in this case, the plot is gonna be a histogram of the log price. And let's save that plot off to disk so that we can see if we get a better normal distribution of our data. So we're gonna make a plot and notice that with MLflow, I can log the plot along with my model. So it'll show up here in the runs tab in just a little bit as soon as this job finishes. Oh, I bet I know why my clusters not coming up. I've been trying to figure out why my cluster didn't come up. I think I know the answer. I chose to use spot pricing to save money. But the spot price might be very high on Amazon right now. So if I can't get a cluster paying the spot price, I can ask it to give me a cluster using the on-demand price. That's a nice feature of Databricks. Let's see if my big cluster will come up now. And actually before I launched the big cluster, let's change the machine type too just to make sure there's no quota issues. So I can come in here and say, you know what, I don't wanna run on an i2. No wonder I wanted to be running on an i3. Well, that would explain it. I don't have any i2 quota, I only have i3 quota. So we'll fix that as well hit confirm. And now that my bigger cluster will come up. It's kind of fun you get to see the little controls inside of Databricks that are available to you as you play around with that. 
There we go, and sure enough, here is that plot we made showing that I have a nice normal distribution, a bell curve shape, for my data. So when I use log price, I get a good bell curve, which is much better for doing linear regression. And let's see how my run did. So I come over here, close and reopen Runs, or I click this little refresh icon. And sure enough, my root mean squared error has dropped a little bit now that I've started using log price instead. MC asked the question, I believe MC you're asking how we go about training the model; your question may be a little bit unclear. If you're asking what algorithm we use to converge on the optimal solution, we're using gradient descent because it parallelizes very well. So yes, it is iterative. It uses gradient descent without an adaptive step size. And similarly, you're able to use tools like Hyperopt to do hyperparameter tuning. And if you ran a bunch of cross validation or hyperparameter searches, they would show up here in the Runs tab as well, which is really cool. All right, so this is our latest model. Let's expand it. We'll come in here. And notice that what got logged this time is not just the root mean squared error and the R squared; under artifacts, I can see the model, but I can also see my picture. So I can log any plots that I want along with my artifacts, which is really, really nice if I'm doing data science. It helps to keep all your experiments straight, which is the real goal of MLflow. You can also query MLflow not just using the UI the way I did right here; I can also query MLflow using a Python client, or other clients, they're not just in Python. And I'm able to say, show me all the available experiments, please, and I can get back a list of all the experiments that I've trained. I can grab my current experiment, and I can see all the different runs that I did inside of my experiment. So I'm querying the API programmatically. And once I find a specific run, I can even ask it to give me the saved model. So I'm gonna allow this to run. While that's busy running, I'll come down here where I say load saved model. And notice that I'm actually able to find the model that was associated with a given experiment and run. So I can load the model we logged up above and have it available for use via the APIs, which is very, very nice. I wanna make sure my audio didn't cut out there. So I'm able to log and access my logged models using the APIs. So somebody was asking how to go through the APIs to get this information. Somebody else asked, hey, could I see these ML predictions in Tableau? Could I take the data that we just produced and see it in Tableau? That's a great question. So here, let's take a look at the data frame we've got. It was called prediction DF. So I've got my prediction data frame, let's display it so we can see our predictions. And he says, how would I get those predictions over to Tableau? My other cell is still busy running where I was doing a query down here, so I'm gonna stop this search. And I refresh my page; sometimes refreshing the page helps when something's delayed. There we go, now it's running. Let's see if my other cluster, my fast cluster, is now up, woohoo. We've been running on two cores. My other cluster is gonna have 48 cores, which is gonna give us a lot more horsepower. But oh, why aren't my predictions showing up? That should not have been empty. Let's run that a second time here. There they are, perfect. So there's my predictions data frame.
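As a rough sketch of that programmatic access, here is what querying the MLflow tracking server from Python can look like: listing experiments, searching the runs inside one, and loading a logged Spark model back. The experiment ID is a placeholder, and depending on your MLflow version the listing call may be client.list_experiments() rather than client.search_experiments().

```python
import mlflow.spark
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Show all the available experiments (older MLflow versions use client.list_experiments()).
for experiment in client.search_experiments():
    print(experiment.experiment_id, experiment.name)

# Look at the runs inside one experiment, most recent first.
runs = client.search_runs(experiment_ids=["<experiment-id>"],  # placeholder ID
                          order_by=["attributes.start_time DESC"])
for run in runs:
    print(run.info.run_id, run.data.metrics.get("rmse"))

# Once you've found the run you want, load the model that run logged.
best_run_id = runs[0].info.run_id
loaded_model = mlflow.spark.load_model(f"runs:/{best_run_id}/model")
```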
I could take this and say .write.saveAsTable, and let's put this into dbacademy.predictions. So I write this out as a table. And what I really should have done, it's a little late now, is specify the delta format and then save it as a table. So I'll actually do that, I'll make it a Delta table. First, I gotta drop the table dbacademy.predictions. Then format delta, mode overwrite to replace anything that already exists, and save this off as a table. And as soon as this is finished writing, I'm going to click over here to Tableau. And let's go back to my data source in Tableau. And I'm gonna say I wanna look at the dbacademy schema again. There it is, dbacademy. All right, and now I can remove the old table that I was using, and let's look for the table that we just wrote. It may not be written yet, it's still in the process of writing. So notice that I don't see it until it's done writing. So it's in the process of writing out that table now. And again, I'm writing with our teeny tiny Spark cluster right now. It's one that I leave running all the time for quick things, whereas the one that's got 48 cores I would shut down as soon as I was done. In fact, if you look at that cluster, one of the cool features in Databricks is that I can set a period after which, if nobody's using the cluster, it will automatically shut down, which is really nice. All right, so I've written it out. Let's go to Tableau. I'll hit refresh, and there's predictions. And I'll drag predictions over here. It just has to read it from my cluster. And I click Update Now. And sure enough, here's all of my data from Airbnb, including the log prediction, which is what we were measuring the root mean squared error against. One thing I never actually did, and I should have, is take that log prediction and exponentiate it back to the original price scale. I don't think we actually took the log prediction and turned it back into a real prediction. So we should go back to our code and really be comparing the prediction to the actual price, as opposed to the log prediction to the log price. So how would we do that? Well, we would come back to our code here. And let's see. So we've trained our machine learning model. Well, what we ought to do, oh, here it is. We take the log prediction, and we do have a column called prediction. My mistake is that when I saved off the data, I saved the prediction data frame instead of the exponentiated data frame, where we actually had the real prediction. And that's why we're not seeing it. Well, that's no problem. I could just come back here and say I would like to save the exponentiated data frame, drop the old table, and write it out again. Or I could have actually done schema evolution, where it would have just appended that column, which would have been even cooler. And then you would actually see the real prediction alongside the log prediction. So notice that I'm able to work with my data from the data science side over here and explore it from Tableau, which is really, really cool. So hopefully that answers your question, Baskara. And I'm gonna come over to the Q&A channel here and see what else we have. Somebody had asked, how could I consume a model? So Carlos, there are actually two ways you can consume a model via an API call. Or actually, are you asking about the model hosting feature where there's a live REST call? For that, I'm gonna point you to the Databricks blog, where we actually show you the live REST call, because that is a brand new feature that's just coming out.
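Here is a minimal sketch, under the same naming assumptions, of exponentiating the log-scale prediction back into a real price and saving the result as a Delta table that Tableau can pick up from the dbacademy schema. The prediction_df name and the predicted_price column are illustrative.

```python
from pyspark.sql.functions import col, exp

# Turn the log-scale prediction back into an actual price prediction.
exp_df = prediction_df.withColumn("predicted_price", exp(col("prediction")))

# Replace the old table and write the result out in Delta format,
# so the Databricks connector in Tableau can query it.
spark.sql("DROP TABLE IF EXISTS dbacademy.predictions")
(exp_df.write
       .format("delta")
       .mode("overwrite")
       .saveAsTable("dbacademy.predictions"))
```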
So I'll point you to the Databricks blog for examples of the RESTful web service, which is what's called model serving. Whereas what I've demoed here in the code, we used an API call to retrieve and load the model, and then we were able to apply it. So in this case, I'm just using code to retrieve the model and apply it. But using the RESTful API is really, really cool, so take a look at some blog posts for that. Let's see here, Jay asks, can you save the artifacts to S3 or Blob storage? So actually, Jay, the artifacts are being saved to S3 or Blob storage right now. Remember, DBFS is backed by S3 or Blob storage, and you can control the path where this stuff is written. So it doesn't have to be written in this directory. I could have put it in a mounted directory that's in your own blob store and saved the data there instead, which is a really nice feature, being able to access it from any type of blob store. Remember, DBFS is just a layer on top of blob stores; anything written to DBFS is in fact in the blob store. Let's see here. Ah, somebody asked why linear regression is called machine learning. So Alfred, there are actually many different types of artificial intelligence. Artificial intelligence fundamentally is about finding patterns in past data so you can make predictions on future data. And so linear regression is one form of artificial intelligence. It's a really simple form. It's not a terribly intelligent form, it's not magic from that perspective, but it's actually incredibly powerful. If you look at how the human brain works, each neuron, in many ways, is really doing simple linear regression. If the voltage from this incoming axon is this, and that one is that, across the dendrites, it fires across the gap and that neuron fires. So your brain, this neural network in your brain, is actually just a network of things doing relatively simple linear-regression-style training. Now, that's oversimplifying the brain a little bit, I will grant you that. But the idea is that linear regression is a fundamental building block of really most any form of artificial intelligence. There are other algorithms we can use, like decision trees, neural networks, all sorts of different machine learning algorithms. The reason I chose linear regression for this demo is because most people know it, and I don't have to explain the science behind it, because most people are familiar with linear regression. But there are many, many, many machine learning algorithms out there. And you would be surprised how many data scientists build linear regression models to do their predictions. All right, somebody else asked, how can I deploy the model myself if I'm not gonna be using a REST API? So there are actually several ways you can take a machine learning model to production. One of them is with Spark. I could read the model in using Spark and schedule a nightly job to read in the data, apply the model, and write out the predictions. That is known as batch. So I would take my loaded model, apply it to whatever my nightly data is, and then write that back out as a table. So I can run nightly jobs that write out these predictions. Or, Peter, if I wanted to, I could use streaming to apply these models. So let's take the data frame that we were using earlier. Let's see here, where is that dataset? The Airbnb data frame, I'll come back down here. Where was I? Where I loaded the saved model.
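A hedged sketch of the batch option being described: load the logged model, apply it to a night's worth of data, and write the predictions back out as a table. The run ID, input path, and table name are placeholders.

```python
import mlflow.spark

# Load the model that was logged to MLflow earlier (run ID is a placeholder).
loaded_model = mlflow.spark.load_model("runs:/<run-id>/model")

# Nightly batch job: read the new data, score it, and write the predictions out as a table.
nightly_df = spark.read.format("delta").load("/mnt/datalake/airbnb/nightly")  # illustrative path
predictions = loaded_model.transform(nightly_df)

(predictions.write
            .format("delta")
            .mode("overwrite")
            .saveAsTable("dbacademy.nightly_predictions"))
```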
And what I could do, if I wanted to, is after I've loaded the saved model, I can go spark.readStream.parquet and give it a place where it's gonna be picking up streaming data. This could really be coming from Kafka; I'm gonna simulate it coming from a directory here. And I'm gonna set the option maxFilesPerTrigger to one, just to simulate streaming data. But now this is a streaming data frame. And so I could take my loaded model and do predictions on it. And then let's just take the prediction data frame, count how many rows there are, and display it for our sake. And you can actually see the count of the predictions as they're being made. So I'm feeding the streaming data frame from before into my machine learning model and making predictions in real time that I could either write out to disk or, in this case, just display as a count on the screen. Maybe we'll have a little fun writing it out to disk. So for that I would go predictionDF.writeStream.format delta and save it as a table called predictions. And I need to set the mode to overwrite anything that's previously there. And why did it complain? Ah, it wanted option instead of options, that's a little bit annoying, and we'll fix that. But then I'm gonna query this table: spark.readStream on the predictions, group by everything, count, and I'll display that. So I'm gonna write it out to a table and then read that table, just like we did in seminar three. The table doesn't exist yet, because this one up here errored out. Oh, if I am reading streaming data, one of the things I have to provide is a schema. So I have to call .schema and provide the schema of the data when I'm doing streaming. So let's just extract the schema from our existing file really quickly. I'll read the data that's already there with spark.read.parquet, get the schema, and provide that schema for the stream. Because the idea with streaming is that you have to provide the schema up front, since the data hasn't arrived yet. Now in this case, I'm simulating streaming. And now why is it complaining here where I call? Oh, writeStream, writeStream. Okay, we'll leave mode off. I have to write to a file. Lots of little things as I try to do a live code example. Parquet or delta, and then I would save it off here to a file. And I have to set a checkpoint, all the little things that we had to do in our streaming session. Let me do some tab completion here to help me out. Writer dot, and I think it's an option I have to set. Okay, so we go .option, checkpoint path, this is what I get for live coding, checkpointLocation. Checkpoint location, there we go. Now that what's written out is streaming, I can turn around and do spark.readStream, format delta, and give it the path that I wanna be reading. No, that's the checkpoint, I want my Delta file. And I can start seeing that data arrive as it shows up, but I've got to provide the schema again. All the little steps that come with doing streaming. It needs to finish writing the table the first time, so we'll give it a moment to write data into it. I gotta wait for the streaming job to finish, and then we'll be able to see it deployed as streaming. So that went a little bit into the weeds, but you can do batch predictions and you can do streaming predictions. Peter, the third thing you can do is export the model using a library called MLeap, which would allow you to then serve it up in Scala or Java. Or you can build a RESTful web service. So those are really your four options: batch, streaming, export with MLeap, or use a RESTful API.
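Here is a cleaned-up sketch of that streaming-scoring flow under the same assumptions: the parquet source path, checkpoint and output locations, and run ID are all placeholders, and display() is the Databricks notebook helper rather than standard PySpark.

```python
import mlflow.spark

# A streaming file source needs its schema up front, so grab it from the data already on disk.
source_path = "/mnt/datalake/airbnb/parquet"  # illustrative path
schema = spark.read.parquet(source_path).schema

# maxFilesPerTrigger=1 simulates a stream arriving one file at a time;
# in production this source could just as easily be Kafka.
streaming_df = (spark.readStream
                     .schema(schema)
                     .option("maxFilesPerTrigger", 1)
                     .parquet(source_path))

loaded_model = mlflow.spark.load_model("runs:/<run-id>/model")  # placeholder run ID
prediction_df = loaded_model.transform(streaming_df)

# Write the streaming predictions out in Delta format, with a checkpoint location.
query = (prediction_df.writeStream
                      .format("delta")
                      .option("checkpointLocation", "/tmp/predictions/_checkpoint")
                      .outputMode("append")
                      .start("/tmp/predictions"))

# Read the streamed output back in and show a running count, as in the demo.
counts = (spark.readStream
               .format("delta")
               .load("/tmp/predictions")
               .groupBy()
               .count())
display(counts)  # display() is Databricks-specific
```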
All right, while we're waiting for this to happen, let's take a look at some of the other questions coming in. Ah, so Krishna is asking how to get this notebook, I believe that's what you're asking. So remember, all of these examples are available at the Databricks Academy link that we shared with you at the beginning. Go to https://tinyurl.com/cloud-data-platform and you'll be able to import this notebook. It downloads initially as a Databricks archive, which is a zip file. You would import it into Community Edition and you'll be able to then import that code into your notebook, which is really cool. So you just come out to our website and you'll be able to download it. And good news, it looks like my example of streaming is finally working; we're gonna be able to get a count here very shortly. Oh, okay, you'd like to get an export of what's on my screen. I can work with our marketing team to upload an export of my examples, with all of the code that I've been adding, to the website as well. And here we go, there is our count of predictions. So the big takeaway with all that was that I'm able to do predictions on streaming data or batch data, export a model with MLeap, or use the RESTful API for serving, which is really, really cool. Durgadus asks, how can I combine this with automated deployment and CI/CD? We have some really good blog posts on that, so I would encourage you to go out to the Databricks website and check out the blog posts on CI/CD. Or if you have deeper questions, reach out to the Databricks sales team, and we can put together a demo for you of doing CI/CD, continuous integration and continuous deployment, with Databricks. We have some really great blog posts, and covering CI/CD in this context isn't something we'd be able to do in the eight minutes we have remaining today. Let's see what other questions we've got. Can we use a cluster on Microsoft Azure with Delta Lake functionality? So yes, Databricks will run on Azure. It's called Azure Databricks, and it's available as a first-party service on Azure. So just to demonstrate what that looks like: if you are a Microsoft Azure customer, you have Databricks today, you don't even have to talk to a salesperson. You would just log into the Azure portal, type Databricks at the top, and you'll see Azure Databricks. At that point, you can create a new Databricks workspace. You choose your Azure subscription, and you choose which resource group you wanna use. I should have mine in here somewhere, just search for Doug. There we go, Doug work. You choose which Azure region you wanna be running in. And then there are different tiers of Databricks. There is the standard tier, which is what we're demoing, but the premium tier adds in greater security controls, role-based access controls, which are really powerful. So different users can have access to other people's notebooks or not. If you're deploying jobs, I could have access to just the job's logs but not be able to actually change the job. Those are features that are available in the premium tier. And then you could just review and create, or you could add custom networking if you wanted to, so you could set up your own VNet. And then eventually you would create your workspace. And evidently, I missed a step. What did I do wrong? Oh, I forgot to give my workspace a name.
And I would click Create, and it's actually gonna deploy a full version of Azure Databricks for you right there in Azure, using your existing Azure account. So this is really nice. It is so easy to use Databricks in Azure, it's a match made in heaven. And Databricks integrates with a lot of the other Azure technologies, including Azure Data Warehouse, Azure ML, Azure Data Factory, lots of tools that are out there. I strongly recommend it; it's a great way to use Databricks. How could you access Kafka from Databricks, Augustine asks. So if I wanted to read from Kafka, I don't have a demo of reading from Kafka here, but if I wanted to read streaming data from Kafka, here I did spark.readStream and said I wanna read from a directory that's parquet. I would just replace that with Kafka. I would give it the address of my Kafka server and my Kafka topic, as well as any information needed to log into Kafka, and I could actually access Kafka from Spark. So Augustine, you would just search for Spark readStream Kafka for examples: spark.readStream.format kafka. And here's an example, right here, of reading from Kafka. So format kafka; notice I tell it the host I wanna connect to, the topic I wanna subscribe to, and then I just call load, and now I'm streaming in from Kafka as opposed to a directory. Let's see here. Brian, you asked a question about containerization, but your question's a little bit too broad; if you could narrow it, I'll try to answer it. Databricks actually does run inside of a container, so we're using Linux containerization. And the idea there is that it makes it really easy for you to spin up a custom version of Databricks with the libraries that you want pre-installed. So when you go to launch a cluster, you can actually provide a Docker image, if you want to, with pre-installed libraries, which is really nice. Somebody else asks, can we support Avro? So we can read Avro files with Spark, and the caveat is that Avro is not a delta lake at that point. Delta Lake is built on top of parquet, but Spark can absolutely read Avro. So I could go spark.read with format avro, call load, and point to Avro files, and then I could turn around and save that into my delta lake. And now I'm ETLing data from Avro into my delta lake, which is really, really nice. Let's see here. Somebody else asked, could we actually integrate Databricks with Git? We absolutely can. So this is one of the features that I'm not demoing today, but let's see if I can point it out. If I come up here, where is the option to link to GitHub? I may need to zoom out a little bit. There we go, Revision history. And it says Git not linked. But I would click on this link and I can actually link it to GitHub. And then the version history of my notebook will integrate directly with GitHub, and I can even add commit notes. And we have new features coming out later this year where Git will actually be able to have a group of notebooks that are all committed together to GitHub, which is really, really nice. So yes, I can do version control of my notebooks this way. And if version controlling your notebooks this way is not sufficient, another option you have is the Databricks command line interface. You can export notebooks using the Databricks CLI and version control them that way as well. So using the Databricks CLI and API is actually a really key piece for any kind of CI, or continuous integration, pipeline. All right, and Bango, thank you for adding that doc here.
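As a rough sketch of both of those answers, here is what a Kafka streaming read and an Avro-to-Delta load can look like in PySpark. The broker address, topic, paths, and table name are placeholders; on Databricks the Kafka and Avro readers are built in, while plain Spark may need the corresponding connector packages.

```python
from pyspark.sql.functions import col

# Streaming read from Kafka instead of a parquet directory.
kafka_df = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "kafka-broker:9092")  # placeholder host
                 .option("subscribe", "airbnb-events")                    # placeholder topic
                 .load())

# Kafka delivers binary key/value columns; cast the value before parsing it further.
events = kafka_df.select(col("value").cast("string").alias("raw_value"))

# Batch read of Avro files, ETL'd straight into a Delta table.
avro_df = spark.read.format("avro").load("/mnt/datalake/raw/events_avro/")  # illustrative path
avro_df.write.format("delta").mode("append").saveAsTable("dbacademy.bronze_events")
```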
There's really good documentation on the Databricks site where people can look at the integration for version control inside of Databricks. Somebody else asked, is there a way to automatically configure the vacuuming and versioning of our tables to keep them up to date? So yes, there is. There's actually an auto optimize and auto vacuum capability in Databricks. So let's look at auto vacuum. Vacuuming we may not do fully automatically, for safety reasons, but we can say, for example, keep the latest 50 versions and things of that nature. With the OPTIMIZE command, we definitely have auto optimize. Let's search for Databricks auto optimize. There we go, auto optimize. And you can read about how to set it up to automatically optimize, and notice this picture here: I had lots of small files that got compacted into fewer big files. That's a really nice feature of Databricks. We had run OPTIMIZE manually, but you can actually configure a table to automatically optimize itself. It's just a property you set on your table with ALTER TABLE, setting the Delta auto optimize property to true and auto compact to true. All right, and with that in mind, we have reached our two hours. I really wanna thank everybody for joining us. If we did not get to answer your question, I would highly encourage you to reach out to our sales team. They are happy to help answer questions and give demos of the features that we did not cover. And one thing I wanna highlight about that: if you wanna reach them, we have a nice bitly link for how you can contact our sales team. Or if you're interested in other courses and trainings that we have available, here's a nice link to reach our Databricks Academy. I really appreciate it, especially those of you who were here for all four sessions. I know that takes a lot of time out of your week, and I really hope you learned a lot and got a lot of value that will help accelerate your projects. Thank you again for participating, and we'll see you again soon.
Info
Channel: Databricks
Views: 5,463
Rating: 4.8987341 out of 5
Keywords: Databricks, delta lake, apache spark, machine learning, Business Intelligence, BI, business analytics, BI Tools, ML, what is machine learning, machine learning engineer
Id: GUP0YFXajkk
Length: 119min 4sec (7144 seconds)
Published: Mon Apr 20 2020