Getting Data Ready for Data Science with Delta Lake and MLflow

Captions
- Perfect, thanks very much. Hi everybody, my name is Denny Lee. I'm a Developer Advocate here at Databricks, and we're having a fun session called Getting Data Ready for Data Science with Delta Lake. This session will be slides, of course, but we'll definitely do demos, and on top of that we'll also have a live Q&A. So if you have questions, please click on the Q&A panel and add them there, because that's the best way for us to see what's going on, okay? And my apologies for those of you who are not currently in the Pacific time zone; good evening, or good afternoon for that matter, to you folks as well. So, who am I? Let's start off with that. Well, as I said, I'm a Developer Advocate here at Databricks. I'm a hands-on distributed systems engineer with a data science background, and I basically help build internet-scale infrastructure. I was at Microsoft, part of Bing and also part of SQL Server. I was on the teams that helped build what is now known as Azure HDInsight, back when it was Project Isotope, and I worked with very large implementations of SQL Server and Azure. And I've been working with Apache Spark since 0.5, so at least I have a good enough sense of imposter syndrome to be able to talk a little bit about Spark and data engineering. So before I go into the session, a little about Databricks. We are the company that accelerates innovation by unifying data science, engineering and business together with our unified data analytics platform. The founders of our company were the original creators of Apache Spark. We've also created Delta Lake, which is part of the Linux Foundation, MLflow, and as well Koalas. These are all really cool open source projects I'd love to talk to you more about; we're going to focus today primarily on Delta Lake. So, just to give some context about Apache Spark, if you're unfamiliar with it: Apache Spark at this point is basically the de facto unified analytics engine. That's an important aspect to call out, because it's one of the few frameworks that's able to uniquely combine data and AI. On the left-hand side, as you're seeing here, we've got big data processing, with ETL, SQL and streaming. On the right side we've got machine learning with MLlib and SparkR. These are things that Apache Spark incorporates, and it can connect to a lot of different data sources, whether you're talking about S3, Azure, Hadoop, Kafka, Parquet and other ecosystems. And that's part of the reason why you see Apache Spark as the de facto unified analytics engine. So let's talk about the theme of today's session, which is getting data ready for data science, okay? We're not going to cover all of the different aspects here, just because it's a lot to cover, right? But basically what you have here is the data science lifecycle. It typically starts with data preparation, and there's a whole set of tools involved with that: you've got Spark, of course, but in the ecosystem you've also got SQL and R, various aspects of Python, scikit-learn, pandas. That's where you're going to prep the data, and then you're going to need to figure out how to tune and scale that data preparation. There's a lot of ETL that has to go into this in order to get to the point where you're actually able to train on the data.
Then when I flip to the training side of things, there's a whole set of other tools: scikit-learn, PyTorch, Apache Spark again, R, XGBoost, TensorFlow, and so forth and so forth. And if you happen to be using a technology that I didn't mention, it's not meant as a knock against your technology; quite the opposite, the point is that the ecosystem for training, with all these other technologies, is quite large. And you're going to have to tune and scale that too, because after all, when you're training there's also a whole host of hyperparameters that you're going to have to deal with, right? You're going to have to tweak, change and optimize them. So this is the reason why it's so important that you have a mechanism like MLflow, which we'll show you at the tail end of this session, to track all these different variables. And then, all right, now that you've trained the model, you're going to have to deploy it. And again, you can do that with Apache Spark, or you can do it with Docker, Azure Machine Learning, SageMaker; there are all sorts of mechanisms for you to deploy, which may or may not be the same technology you trained with. And that goes right back to the raw data, because once you deploy it, there's going to be raw data that goes into it. You're going to have to have governance, to ensure that the data itself actually follows GDPR or CCPA compliance rules. We're going to get into that a little bit during today's session, because an important aspect of getting data ready for data science is exactly that: the ability to ensure you have GRC, as in Governance, Risk and Compliance, covered, so that any aspects of security, and just as importantly privacy, are covered. And then, in order to scale that raw data, again you're going to be using things like cloud storage or Delta or Kafka or Hadoop or whatever else. And as more data comes in, it's going to change, right? It's not just like BI, where it's just about aggregates and reporting a trend. Those trends and those changes, because there are more red toggles purchased or warmer temperatures or whatever else, invariably don't just affect the BI reports that you put out; just as importantly, if not more importantly in this particular case, they affect the data science you're going to do against that data. For the machine learning models that you're training, there's basically this concept of model drift, right? And it begins with data drift, the changes in data over time, and this is normal; it's an expected aspect of how things work. And so this is a continuous cycle, right? It doesn't end at one point. If you're successful and you're able to build excellent, awesome machine learning models, deep learning models or whatever it may be, you're going to continue doing this: you're going to continue training, continue deploying, continue getting new data. In other words, not just the data that you have, but additional data sources that maybe you originally never processed. Then you join that data, and you have to prep it again, and train again and deploy again, and so forth; it's a vicious cycle. An important aspect of that, now that I've just covered some of the technologies that are very common in that data science lifecycle, is this:
If I was to ask for a raise of hands (obviously we're on a YouTube Live / Zoom setup, so it's going to be a little difficult to pull that off), but if I asked you to raise your hand: how many of you know all of the technologies that I've listed here, and for that matter, all the technologies that are actually used in your environment? Forget about my list; just think about your own environment. Are you an expert in every single one of those technologies, from Kubernetes all the way to PyTorch? And while I'm sure there are some heroes listening to the session here today, the reality is that the vast majority of you not only don't know all these technologies, you really don't have the time to learn every single one, because you're actually trying to be great at one of these technologies, right? And so that's a focal point for us: how do we survive this data science lifecycle? Well, the first thing we often need to do when it comes to the data science lifecycle is to focus on this concept of the role of data engineering, okay? Data engineering ultimately becomes a very crucial aspect of how you actually deliver data science, and even today there are far too many teams that are built up incorrectly. So what ends up happening is that data science is the tool du jour, the word du jour, the job du jour, and deservedly so. Don't get me wrong, I'm not trying to insult anybody who's a data scientist here; quite the opposite, I think it's an awesome job. The context I'm coming from is more the fact that we hire all these data scientists, but we don't hire any data engineers, okay? And so this actually becomes really problematic, right? Because what ends up happening is that the data scientists themselves have to do the data engineering. And they really do have to, because it's not like the data will magically appear in tabular format, cleansed, partitioned, in Parquet, with no data problems whatsoever. I mean, if it does, that's great and I'm really happy for you, but in reality that's not what happens, right? What actually happens is that you've got a lot of data, you have to filter it, it's dirty, you have to aggregate or augment that data. All right, so in that process you actually need data engineers to bring software engineering rigor to that process. It's not meant as "okay, now I'm going to make a data scientist into a software engineer." Again, if you want to do that, awesome, and in fact we could probably have another discussion about software engineering rigor within data science itself. But before we can even talk about that, we actually need to talk about the rigor for data engineering. And so, for example, in one of my past lives I was at Concur, now part of SAP, where I helped build up the data science engineering team. What had happened there was that the company had hired a lot of data scientists first. Really smart people, but they were all using a disparate set of tools, right? Some were standard, some were not, and because they were using a disparate set of tools and really working in isolation from each other, what resulted was that we had different copies of the same data all over the place.
This resulted in a problem where the data itself wasn't clean, right, or it wasn't following the same business rules. And so one of the first things I did was to bring in data engineers, data science engineers specifically, engineers focused on how to get data cleansed and organized for data science. And then, exactly as we go down the slides, that allows us to develop, test and, most importantly, maintain those data pipelines, right? So you have reliable, quality data, delivered quickly, efficiently and securely. That's what you need: you need reliable data lakes. We're going to talk about data lakes shortly, but the concept is that whatever data source your data scientists are going to work with, it actually needs to be reliable. And this ends up being of paramount importance in order for you to have successful data science projects. So, when you look at these big data architectures, right, they get complicated quickly, and I've oversimplified things here just so I don't have to give you a gigantic diagram of all the different batch sources and all the different streaming sources. You've got input sources, typically broken down by latency: either batch or streaming. Batch could be all sorts of things, a file source, cloud storage, a database. Streaming could be picking up some form of REST API, Kafka, Kinesis, Azure Event Hubs, something of that nature, right? Ultimately, you're going to want to put this stuff in a data lake, okay, this concept of a centralized single source of truth. Although it's not really a single source of truth in the sense that it's completely and utterly trustworthy, at least not yet. Right now, up to this point, it's really just a store where I dump any and all data that we have. And when we brought in Hadoop during that era of the data lifecycle, that was it. The great thing about Hadoop was that we could just take all the data, store it in the file system, and it would automatically replicate it, so I could make sure it was reliable and I wouldn't lose the data. I had to deal with the concept of eventual consistency, but nevertheless the data was there. Yay, I'm good to go. And then we would do schema-on-read: basically, at the point in time I was querying the data, I would derive the schema out of it. But guess what, that's a lot of data, and we weren't sure what we were keeping. Maybe we needed to filter it out, maybe we needed to cleanse it, maybe we needed to remove it because it was actually corrupt. But nevertheless, now we have a data lake. And then there is ultimately a data consumer. Now, we talk about AI, we talk about machine learning, deep learning, BI, and from where I'm sitting, at least from the standpoint of reliable data lakes, there's actually no difference, right? It's very much this concept that you're going to have consumers that look at this data and want to understand how this data works. Okay, and that's it. Whether they're running a machine learning model or they're running a BI report, they're the data consumer. All right. So, these are the basic concepts. With these input sources and your data lake, you've got a store for structured and semi-structured data.
You pull data in from various input sources, and there is a single central location to access all this data, basically breaking the silos down. That's what's great about this: everybody can come in, and instead of all the different teams working separately from each other, we can all just go to the data lake and grab that data. That's awesome. It's an open, accessible format, there's no vendor lock-in, and you can run SQL and machine learning against a single source of truth. So, a big data architecture like this, where you have a data lake, is super promising, super awesome, but there are always issues with it, of course. And these are the data reliability challenges with data lakes, okay? First, failed production jobs leave data in a corrupt state, requiring tedious recovery. What that means is, for example, you're running a Spark job, or Hadoop or whatever, right? It's multiple tasks, a task fails, the entire job fails. Well, it has already written something to disk, and disk in this case can be cloud storage, so I'm not trying to specify one or the other. That means when somebody else goes ahead and reads from disk, there's a bunch of remnant data that could be read that's not trustworthy, not reliable, not something you want to actually be reading. All right, so that's a big problem, especially because the more data you're working with, the more likely jobs will fail. And if jobs fail, that means your data scientists are writing algorithms that are reading data that's incorrect, which is really, really bad. Second, there's a lack of quality enforcement, which means inconsistent and unusable data. That quality enforcement can be something as simple as your schema. In other words, an additional column shows up in the source data, not because there actually was a new column, but because the source data had an extra comma that your code never expected, so all of a sudden it's adding an additional column to the data, thereby putting data in the wrong format, and when you read it, you're actually reading the wrong thing. All right. And the final thing, but the most important thing, is the lack of transactions. This lack of transactions prevents you from ensuring trustworthy data sets; you get corrupted data when, for example, you have a failed production job, right? If you actually had a transaction protecting it, what would happen is that when the job fails, anything that was written to disk gets reversed: the files that were written are removed, okay? Because that's what a transaction would do, right? This is what traditional RDBMSs, relational database management systems, are actually able to do. And this lack of transactions makes it nearly impossible to combine batch and streaming together, right? Because as you have streams of data running in at the same time, at what point can you trust the data as it's coming in? Well, if you have a transaction protecting it, then sure, you've got something to work with. But if you don't have a transaction protecting that data, you don't know if the data being written is actually in its final state. And so how can you then combine that with your batch data? All right.
So now let's look at the pipelines, okay? And with those pipelines this gets really funky really fast, right? So for example, I'm going to use this particular design; you could use something else, whether it's Kinesis or Azure Event Hubs versus Kafka, but I'm just going to use Kafka as the example. When you have these complex pipelines, you're increasing the engineering overhead. The events go to Kafka; because you're doing streaming, you'll run Apache Spark Structured Streaming against Kafka, it gets written to the table, and the table gets continuously written to. And then, for example, there'll be a stream that goes from the table out for reporting, right? But these Spark jobs get exponentially slower as there are more and more of these small files, because that's what happens with streaming jobs, right? You create lots of really small files, and the more files there are, the more overhead there is for Spark, or for that matter any system, to read those files, and that overhead slows down performance. All right. Then let's add late-arriving data, right? In other words, the data that was supposed to be processed every five minutes or every ten minutes; in fact there's data from two days ago that they forgot to process, and I need to go back and process it. So in that case I potentially have to delay processing because of the delay in that data. Okay, maybe I could change that so I have a continuous stream, where the table on the left is receiving data that's constantly streamed in, but then I have a batch process that compacts and organizes it, so it takes care of the late-processing delays, okay? So now it's getting increasingly more difficult. But then, more often than not, this latency doesn't satisfy business needs, not to mention it's getting more complicated, right? So then what do you typically do? You typically switch: oh well, then let me do a lambda architecture, where I have streaming and I have batch. The upper portion is where I'm streaming from Kafka into a unified view of the streaming and batch data, so your AI and reporting can go off that. Meanwhile, another stream against Kafka writes data to the table in the bottom left, and I have a batch process that processes that data into the table. It processes it in batches so we can take into account the delays in the data, and then the batch process feeds into the unified view. Now the unified view has both streaming and batch, okay? And by the way, I'm going this fast not because I'm trying to emphasize that you're supposed to absorb it this fast; I'm going this fast because I'm trying to show you that this gets more and more complicated with every slide, okay? So now the problem with that particular option is that you're going to have to have a validation check on the data that goes through the top-left streaming path, so we can make sure that the batch data and the streaming data actually have the same values; otherwise your unified view will result in reconciliation problems. Okay.
And then if you need to fix mistakes, well, the validation takes care of the streaming data, but you need to reprocess the data that gets written to the batch table in the bottom left, and so forth and so forth, okay? So this becomes really, really complicated, and that's not even mentioning updates and merges. Again, this is a really complex thing to do. Okay, now I'll stop and breathe. The question then is, can this be simplified? Because in the end, I know many people are saying, "Hey, I'm a data scientist, I want to go talk about algorithms," but this is the reason we titled this Getting Data Ready for Data Science with Delta Lake and MLflow: we're trying to get across the point that there are all these data engineering pipeline steps you have to do before you get to do the fun stuff. We'll actually have sessions later on which dive into all the different machine learning algorithms that are available in Spark. So definitely chime in in the comments so we know how to customize these tech talks and talk about the stuff you're interested in, or which algorithms you'd like us to tackle first, like XGBoost or PyTorch or whatever else. But the point is that before I can even go to that area, I need to make sure I can handle the data in production. Right, because in production, this is the scenario we have. And of course the question I'm going to ask is, can this be simplified? Even if I had a dedicated data science engineering team focusing on building these pipelines, this is a lot to maintain. Right, so how can I simplify this? And the key to simplifying this is that we want to introduce ACID transactions back into the fold. Okay. ACID transactions were popularized by relational databases, which introduced this concept of atomicity, consistency, isolation and durability. We actually have separate videos that go into the details of what those mean, but the gist is that if you write something to disk, cloud or on-prem, it means it actually got written. If there's an error, it will fail out. Any other clients that are concurrently trying to read from it can reliably know that the data is either there for them to read or it isn't, and they'll get that information right from the get-go. If there are concurrent writes to the system, ACID transactions have the ability to coordinate those concurrent transactions and prevent corruptions, overwrites and any other issues. Once you have that, you can actually handle batch, streaming, updates and deletes easily. For example, if I go back to my days with SQL Server, right, I could handle streaming and I could handle batch at the same time in my one single database because I had ACID transactions. It allowed me to do super-fast OLTP, like in financial services, writing data super fast, and it was no problem, I could handle it, because I had ACID transactions. At the same time I was doing batch processing of aggregate data or other sources of data that I'd eventually join with the financial data. So, again, I could do that. The problem, of course, is that I couldn't scale it, at least not to big data, internet-scale sizes. Right, if I could get a database to do that, I'd still be on SQL Server, and this is not a knock on SQL Server, it's a great product.
It's just simply saying that when I have to work on big data problems, where I have to handle streaming, and I have to handle batch, and machine learning, data science, all these other things together, this is where, and why, we started looking at things like Apache Spark, right? And in order for data scientists, or for you to even do BI reporting reliably, we needed to reintroduce ACID transactions back into big data systems. And that's actually why Delta Lake is there. Delta Lake allows us to bring ACID transactions to Apache Spark and big data workloads. So ultimately you get consistent reads, and you get rollbacks; in other words, if there's an error, I can roll back on that, which is pretty sweet. So Delta Lake, as we say here, is an open format based on Parquet, with transactions, utilizing the Apache Spark APIs, okay? It allows us to address the data practitioner's dream: process data continuously and incrementally as new data arrives, in an efficient way, without choosing between batch or streaming. Right, it's the same difference to us. So that previous diagram gets simplified so it just becomes one line per segment, which is sort of nice. All right. The key features of Delta Lake are ACID transactions, schema enforcement, unified batch and streaming, and time travel with data snapshots. And we're actually going to demo that a few minutes from now, so you get to see it showcased, okay? Included with this is a concept called the Delta Architecture. In other words, when you design your systems, you want to make sure there is a continuous flow model that allows you to unify batch and streaming. The way you do that is you start with your source, and you build, in essence, what we deem a bronze table, right? This is raw ingestion. This is similar to how you traditionally used your original data lake, okay? You dump the data in there and you're basically good to go; it'd be equivalent to saying I'm doing lots of inserts to the data. So, what ends up happening is you have that, and then you have silver, all right? The silver is basically filtered, cleaned, augmented data, okay? That means when you process the data, you can filter out, for the sake of argument, all the logins that you don't need or whatever else, you can clean the data and augment it with additional information from other sources, and you can combine that together, okay? And then there's gold data, okay? This is where you run your business-level aggregates, this is where you do your AI, your streaming, your data science, your machine learning, deep learning, things like that. This is where you will often do your merges and your overwrites, okay? This isn't to say you can't do inserts in gold or you can't do deletes in bronze. It's just saying that from a traditional data flow perspective, basically a data pipeline perspective, it's more or less like this: inserts in bronze, deletes in silver, merges and overwrites in gold, all right? And like I said, it ultimately allows you to feed your streaming analytics and AI reporting.
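A rough sketch of that bronze/silver/gold flow, as it might look with Delta Lake and Structured Streaming. All paths, the event schema and the cleaning rules here are hypothetical; this is an illustration of the pattern, not the notebook used later in the demo.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Bronze: raw ingestion, land the source data as-is (inserts only)
    (spark.readStream.schema(event_schema).json("/data/raw/events")
         .writeStream.format("delta")
         .option("checkpointLocation", "/delta/_chk/bronze")
         .start("/delta/events_bronze"))

    # Silver: filtered, cleaned, augmented data
    (spark.readStream.format("delta").load("/delta/events_bronze")
         .filter(F.col("event_type").isNotNull())
         .dropDuplicates(["event_id"])
         .writeStream.format("delta")
         .option("checkpointLocation", "/delta/_chk/silver")
         .start("/delta/events_silver"))

    # Gold: business-level aggregates for BI / ML
    (spark.readStream.format("delta").load("/delta/events_silver")
         .groupBy("event_type").count()
         .writeStream.format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/delta/_chk/gold")
         .start("/delta/events_gold"))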
It also allows you to reprocess the data if there are corruptions or issues with the data, or changes in business requirements that require you to reprocess: because you have the original raw ingestion, you can go back to the beginning and reprocess it so you have the silver and gold data again. So this is the concept of the Delta Lake architecture, okay? And basically, each step that you see here is a progression of making things cleaner; the data quality goes up, all right? This architecture then reduces the end-to-end pipeline SLA, reduces the pipeline maintenance burden, and eliminates the lambda architecture for minute-latency use cases, okay? And we make this architecture real because you've got an optimized file source, ACID transactions, scalable metadata handling for high throughput, inserts, updates and deletes, okay? Data versioning, and a schema-on-write option to enforce data quality. Now, I skimmed through this really quickly because I'm going to show you how it all works. So you know what, I'm actually going to stop presenting now and go right to the demo, and I'll end with how you use Delta Lake, okay? So give me a second. All right, perfect. So here's a notebook, and by the way, this notebook will be available for you to download and use yourself. You can use it on Databricks Community Edition, so you don't need to purchase Databricks; it is a Databricks notebook, but you can use it on the free Community Edition. Now, what I've done up to this point is I've created a quick table that looks like this: the state and the counts, okay? Basically what we're doing is counting the number of loans, okay? That's all we're doing. Now, in order for us to do this, what I first did is convert this from Parquet to Delta Lake format. In this particular case I simply ran a CREATE TABLE statement, so I basically converted what used to be Parquet into Delta Lake. There's actually a Delta Lake one-step migration API as well that allows you to do that, but I just did it this way. Nevertheless, now I have a Delta table. And with this Delta table, you'll notice here there's delta, dbfs (this is the Databricks file system), loan_by_state_delta; I've called out the partition columns, things of that nature, and you're good to go. Okay, so in other words it basically lets you know the schema, all right? So I've just done this real quick setup. Now, I will run it live from this point onwards, okay? So, if I look at the file system, what you'll notice is that these are all the different Parquet files, and this looks pretty similar to pretty much any other set of Parquet data that you're used to, okay? You have a table, loan_by_state_delta, that has a bunch of Parquet files. Yay. Except the one big difference is the _delta_log folder. And the Delta log is exactly what it implies: it's a transaction log. Every single transaction, every single change or modification to the data is recorded, and it's recorded in these JSON files, which is what the _delta_log contains. You only see one of them right now, the zero file, mainly because that's the creation of the table. All right.
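A sketch of the two migration routes just mentioned, creating the Delta table from an existing Parquet table and the one-step migration API. Table and path names follow the demo, but the exact DDL and partitioning in the notebook may differ, and the CREATE TABLE ... AS SELECT form is as run on Databricks.

    # Route 1: CREATE TABLE ... USING delta from an existing Parquet-backed table
    spark.sql("""
      CREATE TABLE loan_by_state_delta
      USING delta
      AS SELECT addr_state, count FROM loan_by_state_pq
    """)

    # Route 2: the one-step, in-place migration API for an existing Parquet path
    from delta.tables import DeltaTable
    DeltaTable.convertToDelta(spark, "parquet.`/data/loan_by_state_pq`")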
So let's go ahead and run the streaming section, where I'm going to use a unified batch and streaming source and sink. What you'll notice here is that I have this Delta Lake silver path that I initially set up. All right, I'm going to do a readStream, run a streaming job, and read whatever data is coming from that Delta Lake silver path, the path I called out before, all right? So now I'm going to read that stream, and in this case I'm just going to output the data by state, so my apologies to you non-Americans; the dataset I have is currently only for the United States, okay? The explanation of this particular dataset, by the way, is that these are loans by state: in other words, the number of loans for each state that we have, okay? So let's let that run. Perfect, okay, I'm going to answer some questions real quick as we run this through. For some reason it's running a little slower than I expected. Actually, you know what, I'm going to see if I can just run it from here. Oh, that's why it's always good to have a couple of backups. Aha! All right, so I'm just going to skip that step, I want this step. Okay, cool. All right, this one is running a little faster today, so I might use that one. Okay. So, as I wait for the initial streaming jobs to kick in, a quick question was "will this recording of the Zoom webinar be available as well?" Yes, just as a quick note, this webinar, this tech talk, is going to be available for you to go ahead and work with, no problem at all, okay? So, all right. So right now, poor Iowa is null. Okay, there are no values for Iowa, so let's go ahead and fix that right now. This is the great thing about live demos. So I'm just going to insert data into it. Now, note the fact that I'm inserting data into loan_by_state_delta. What that means is I'm inserting data into the batch table: not the read stream, not the write stream, but the batch table. It's ultimately going to the same file system, but right now it's doing a batch insert as opposed to a write stream. And what you'll notice is Iowa is slowly, steadily building up more data; you can see it progressing from 450 to 900, because I'm just doing an old-school insert here right now. And by the way, in production you don't really want to do it this way; I'm just doing it so you can see I'm doing a batch write and a streaming read concurrently, that's actually all I'm trying to show here, okay? So that's just a quick call-out, okay? So, all right, I believe most of the data has been processed already, the six loops, there you go. And for that matter, I'll take a look at the data here. So if I look at the log (oh sorry, I forgot this bit, there you go), you'll notice that now I have multiple files: we have the original zero, but also one, two, three, right up to six, all here. These are the six commits associated with the six insert statements that you see here, okay? And we'll show you how that looks shortly; I just wanted to show you the file system first. And then I can go back to loan_by_state_delta, right? In other words, I had the read stream, which was showing it all here, okay, but I can also see it back in the batch table.
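A sketch of the concurrent streaming-read / batch-write pattern being demonstrated: one query streams out of the Delta table while a plain batch INSERT lands in the same table. Paths, values and the memory sink (used here in place of the notebook's display()) are illustrative.

    delta_path = "/delta/loan_by_state_delta"

    # Streaming read of the Delta table, aggregated by state
    loans_stream = (spark.readStream.format("delta").load(delta_path)
                         .groupBy("addr_state").sum("count"))
    query = (loans_stream.writeStream
                 .format("memory").queryName("loans_by_state")
                 .outputMode("complete")
                 .start())

    # Meanwhile, an old-school batch INSERT against the very same table
    spark.sql("INSERT INTO loan_by_state_delta VALUES ('IA', 450)")

    # The streaming aggregate picks the new rows up on the next trigger
    spark.sql("SELECT * FROM loans_by_state ORDER BY addr_state").show()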
And sure enough, it's the same value. That doesn't seem that interesting, but the important aspect here is that now I can read and write, whether it's streaming or batch, to the exact same file system, and I can do so in a reliable fashion. And that in itself is a very powerful thing, okay? And what's also great about Delta Lake is that if you're somebody who is used to SQL syntax, maybe because you come from a SQL background or, for that matter, you just like the syntax, which is understandable, you can now run deletes, updates and merges with Delta Lake on Spark, okay? So for example, I just created the same version of the same table except as a pq one, i.e. a Parquet version of it. So, for the sake of argument, if I was to run a delete against it, okay, it'll fail, right? The DELETE only works when it's backed by Delta. So, not a big deal. It's because I realized, wait, no, I wasn't supposed to put that data in Iowa, I was supposed to put it in Washington. So, no problem, I can go ahead and run the DELETE on the Delta table right there, okay? And so, boom, now that the delete's done, I can just run this particular statement and look at my states all over again, and I'm back pretty much to zero, where Iowa is null, okay? So now I want to run an UPDATE statement. Our seventh command did the delete; the eighth command does the update, where I'm going to update it because that 2,700 value that I saw there was actually supposed to go to Washington, not Iowa, okay? So, of course it fails with Parquet. And again, these failures are actually designed to happen. By the way, when you get the notebook and upload it and run it, you'll see three errors right away; those were two of the three errors that you just saw, okay? You'll see the third error pretty soon. And so then we have the UPDATE statement. Again, same idea: with Delta Lake I can go ahead and do that right away, not a big deal, we're good to go again, okay? And then again, let's review the data. Sure enough, here you go: Washington State has 2,700, okay? And then same idea if I want to do a MERGE statement. I'm just going to run that in Delta at this point, because I think I've nailed that point already. I want to update Iowa to 10, I want to change California to 2,500, and I want to make Oregon null for some reason. So nothing against Oregon, I was actually born there, so nothing against them. But normally a merge statement is actually these seven steps: identify new rows to be inserted, rows that have to be replaced or updated, rows that are not impacted by an insert or update; create a new temp table where all three sets have to land; delete the original table; rename the temp table back to the original table; and drop the temp table. Okay, that's what a merge usually is if you were to run this yourself in a standard distributed context. Or you can just run this one MERGE statement in Delta Lake. So again, we try to make things a little bit simpler. So again, we made California 2,500, we nullified Oregon, and sure enough, Oregon's null, California's 2,500 and Iowa's 10, okay? So this is really cool stuff. All right. But I also want to evolve the schema. So for example, the counts that I have here right now, that's just simply the counts, right? That's all this is.
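A sketch of the DELETE, UPDATE and MERGE steps shown above. Table and column names follow the demo; the specific values and the merge_source table are illustrative. The SQL forms are as run on Databricks; with open source Delta Lake 0.5 (pre-Spark 3.0) the same operations go through the DeltaTable Python/Scala API, shown at the end and discussed again in the Q&A.

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    # Delete the rows that landed in the wrong state
    spark.sql("DELETE FROM loan_by_state_delta WHERE addr_state = 'IA'")

    # Update the misattributed value so Washington carries the 2,700
    spark.sql("UPDATE loan_by_state_delta SET count = 2700 WHERE addr_state = 'WA'")

    # One MERGE statement replaces the seven-step temp-table dance described above
    spark.sql("""
      MERGE INTO loan_by_state_delta AS t
      USING merge_source AS s
      ON t.addr_state = s.addr_state
      WHEN MATCHED THEN UPDATE SET t.count = s.count
      WHEN NOT MATCHED THEN INSERT (addr_state, count) VALUES (s.addr_state, s.count)
    """)

    # Open source (pre-Spark 3.0) equivalents via the Python API:
    dt = DeltaTable.forPath(spark, "/delta/loan_by_state_delta")
    dt.delete("addr_state = 'IA'")
    dt.update(condition=F.expr("addr_state = 'WA'"), set={"count": F.lit(2700)})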
It simply tells me the count of loans over time; the real business value should actually be the dollar amount. So I've generated that, by the way; this is obviously fake data, okay? So I've got the state and I've got the count, but now I've also got the amount. And so now I want to say, yeah, let me include that amount information, because that's super important. Well, that's the third error here, because we actually want to enforce the schema. So this is a desired error. In other words, as you're putting data into your data lake with Delta Lake, you ensure that you're not going to overwrite the data you have there, or put in potentially corrupt data, because maybe it was human error and we weren't supposed to put the amount data in there at all; it was by accident, okay? Well, with schema enforcement, we're preventing that from happening. But if I add this option, mergeSchema set to true, what's great about this particular option is that then I can do exactly that. I am now saying, "No, I do want to evolve the schema, because what we have here is in fact correct, so let me go do that right now." And sure enough, now I can query the exact same data from the exact same table, loan_by_state_delta, and in this case I'm looking at it by amount as opposed to by count. All right. And then, we're going to finish this off with time travel and MLflow, okay? So I want to describe the history of my loan_by_state_delta table. Remember how I showed all the files? Well, here's a better view of that, right? Every single change I made, the initial creation, the six inserts, the delete, the update, the merge, and the append that we just did, they're all recorded right here. So now we know what transactions occurred. But what's really cool about this is that because I have those transactions and I also have the original files, if I simply query loan_by_state_delta VERSION AS OF 0, which is the initial version I originally created, I'm literally going back in time, and this is what my table looked like initially, okay? Where Iowa was null, Washington was 336, not 2,700, Oregon was still 182 and not null, and California had a different value of 2,006. Okay. Compare that to the ninth version of the table, which includes everything except the addition of the amount, and sure enough, here's what the table looks like now: Iowa's 10, Oregon's null, Washington's 2,700, California's 2,500. So all of that is actually retained. You can go back in time and recall what version of the data was associated with each transaction, and that's pretty powerful, as you can guess. Okay, so why is that powerful from a data science perspective? Here's what I'm doing: I'm simply downloading some U.S. census data because I'm building a really boring model, okay? A really boring model where I'm combining the Delta data, the loan-by-state data, for each and every different version of that data, with the population of each state, okay? And that's what this particular set of queries is for. But now what I'm doing is taking data versions zero, six, and nine.
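A sketch of the schema evolution and time travel just demonstrated. loan_by_state_amount_df stands in for the hypothetical DataFrame carrying the new "amount" column; paths and table names follow the demo. DESCRIBE HISTORY and VERSION AS OF are shown as run on Databricks; with open source Delta Lake 0.5 the equivalents are DeltaTable.history() and the versionAsOf reader option used at the end.

    # Without the mergeSchema option, Delta's schema enforcement rejects this append.
    (loan_by_state_amount_df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")   # explicitly opt in to evolving the schema
        .save("/delta/loan_by_state_delta"))

    # Inspect the transaction log: creation, inserts, delete, update, merge, append
    spark.sql("DESCRIBE HISTORY loan_by_state_delta").show(truncate=False)

    # Time travel back to the very first version of the table
    spark.sql("SELECT * FROM loan_by_state_delta VERSION AS OF 0").show()

    # The same thing via the DataFrame reader (works in open source Delta Lake 0.5)
    v0 = (spark.read.format("delta")
              .option("versionAsOf", 0)
              .load("/delta/loan_by_state_delta"))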
Zero is the original version, the real version of the data; six is the version of the data that got modified, where I went ahead and nullified Iowa and bumped Washington up to 2,700; and nine was the one where I really messed it up, where I also nullified Oregon, set Iowa to ten and messed around with California on top of that. Okay. So, what I'm doing here is just running a standard linear regression; I'm also running Yellowbrick so I can plot some residuals and visualize it. I'm predicting the loan count, that's all I'm doing, using the exact same settings for each and every version, basically as if I were deploying the model. So first I simply define the code, okay, and then I actually run it: we run the loan-count predictions for the three different versions, version zero, the original, the sixth version and the ninth version. The reason I'm calling this out is that I'm sort of faking this concept of data drift. As you're running your machine learning models, the data can change over time. Now maybe it's because you screwed up the model, or I screwed up the model for that matter, or maybe it's because the data actually did change, but how do you know? And so what we're doing here is a run, which, like I said, is a linear regression. And what's sort of cool is that this model is your standard linear regression pulled right from Spark. That's all this is. So the only thing I really did to make this somewhat interesting is that I added these statements: with MLflow, start a run, and I also include the data version, I log the metric and I log the model itself, okay? I also happen to log the Yellowbrick residuals view as well, just because it's sort of nice, okay? So that's it, that's all I've really done, all right? So what happens if I click on the Runs folder here? All right, all three runs are here, and as you can see, these are the older versions I did last night, but these are the ones I did live, as you can tell. Version zero has an RMSE of 65, okay? Not great, but not bad. But as you can tell, using the exact same hyperparameters (and again, this is linear regression, so not really that interesting), it gets worse. With version six, the one where I nullified Iowa and bumped up Washington State, the RMSE got worse, and then when I changed California, added data to Iowa that shouldn't have been added and nullified Oregon, as you can tell, it gets worse and worse. Right. And so let me click on MLflow and open up the window here, okay. And by the way, even though I'm running this all on Databricks, these are all open source technologies I'm talking about. I just happen to be showing you this on Databricks, but this is all open source stuff, okay? So, if I want to do a comparison, I can just compare the three different runs, versions nine, six and zero, and again, you can tell that the RMSE is much better when I go with version zero versus v6 and v9. You can also visualize the RMSE here to see the different runs.
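A minimal sketch of the MLflow tracking pattern described above: train the same linear regression against three versions of the Delta table and log the data version, the RMSE and the model. The census_population column name, paths and version numbers are illustrative, not the notebook's exact code.

    import mlflow
    import mlflow.spark
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    for data_version in [0, 6, 9]:
        # Read the Delta table exactly as it looked at that version
        df = (spark.read.format("delta")
                  .option("versionAsOf", data_version)
                  .load("/delta/loan_by_state_delta"))

        assembler = VectorAssembler(inputCols=["census_population"], outputCol="features")
        train = assembler.transform(df).select("features", "count")

        with mlflow.start_run():
            mlflow.log_param("data_version", data_version)

            model = LinearRegression(labelCol="count").fit(train)
            rmse = (RegressionEvaluator(labelCol="count", metricName="rmse")
                        .evaluate(model.transform(train)))

            mlflow.log_metric("rmse", rmse)
            mlflow.spark.log_model(model, "model")
            # the Yellowbrick residuals plot would be saved and logged via mlflow.log_artifact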
But what I also like about working with MLflow is that, for example, when I'm looking at this version nine run, I can tell the RMSE is horrible, right? But I also went ahead, like I said, and used Yellowbrick and looked at the residuals, and right away I can tell; by the way, "train" basically means the v0 data and "test" means the v9 data. The R-squared for the training data is 0.97, which is actually not bad, but the R-squared for my v9 data is really horrible, right? And so it's clearly indicating the drift here, the changes to the data. Now, of course this is known because we manually manipulated the data, but the point is that as time progresses, with Delta Lake you not only have the ability to reliably look at your data and trust your data, you actually have data versioning. So you can figure out what the heck's going on, and if things have changed or regressed, that's okay: you can go back and verify that yes, my v0 model actually looked good, and for that matter you can even redeploy that model. So we have the MLmodel file right here. It tells you exactly which versions you were using: what Python version, what Spark ML, the conda environment that goes with it. So these are the dependencies: MLflow, this version of PySpark, the version of Python, and for that matter all the metadata and the stages for your ML model are now stored directly in MLflow. So I can go back and redeploy it, or go back and change it, because there's always, invariably, the issue that maybe the data that came in had problems but later got cleaned up and fixed, so you can go back and use this model, okay? So as you can tell, there's a lot of power in combining these two open source technologies to help you get your data ready for data science. Okay. So I'm going to finish up my presentation here and then leave a little bit of time for questions. So, hey, how do I use Delta Lake? Okay. Obviously we're going to give you this notebook and you can download it, that's great. But if you want to run this yourself, directly from the terminal or from a Jupyter notebook, that's fine: add the Spark package with --packages io.delta:delta-core_2.12:0.5.0. I'm using Scala 2.12, and the current latest version is Delta Lake 0.5.0, so there you go. And I'll keep updating the slide deck as we keep updating the Delta jars, which we're doing at roughly a six-to-eight-week pace, okay? You can also, of course, use Maven, which is right there; same idea, your artifact ID is delta-core, and the current version as of this recording is 0.5. When you write, instead of using Parquet, in other words dataframe.write.format("parquet"), just simply change it to dataframe.write.format("delta"). And that's it. Now, just like that, you're using Delta Lake, okay? And we already talked about the updates and merges, we already talked about the time travel, and yes, because I have that time travel capability, pretend a bunch of those steps, six, seven, eight, nine, were wrong. I can now roll back. This is an example of it: basically I can roll back to the sixth version, or for that matter version zero of the data, the data that I actually could trust, and from that point onwards I'm back on that version, okay?
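The getting-started steps above, collected in one place; df stands for any DataFrame you would otherwise write as Parquet. The rollback at the end is a sketch of one way to do it at the time: read a trusted earlier version and overwrite the current table with it.

    # Launch PySpark (or spark-submit) with the Delta Lake package:
    #
    #   pyspark --packages io.delta:delta-core_2.12:0.5.0
    #
    # Writing Delta instead of Parquet is a one-word change:
    df.write.format("delta").save("/delta/events")    # instead of .format("parquet")

    # Roll back by overwriting with a version you trust (version 0 here):
    good = (spark.read.format("delta")
                .option("versionAsOf", 0)
                .load("/delta/loan_by_state_delta"))
    (good.write.format("delta")
         .mode("overwrite")
         .save("/delta/loan_by_state_delta"))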
And so this is a small but fundamental component of that data science lifecycle: by adding Delta Lake and MLflow, you can ensure that no matter what tools you're using to do your analysis, no matter what tools you're using to do your processing, you're going to have a reliable data lake, and you can track both the data versions and the model versions you're working with, okay? So if you want to build your own Delta Lake, please go to https://delta.io. And Delta Lake has actually gotten more and more powerful over time. We have various connectors: we're currently working on Hive, Presto is already out, we have connectors for Snowflake, Athena and Redshift, and Hive should become public very soon. There are lots of providers and partners, whether it's Tableau, Informatica, Qlik, Privacera, Attunity, Talend, WANdisco, StreamSets, and Google Dataproc recently announced that they're supporting Delta Lake as well, which is pretty cool, right? And, as you can tell, there's a ton of users of Delta Lake; I'm not going to go through them all, but we constantly update the delta.io page with all of that. If you want the notebook from today's webinar, or our tech talk here, try it out on Databricks Community Edition; you can download it from this link. We're going to send out an email with all this information anyway, which includes these slides, by the way, but yeah, it's right there for you to download and try out. I did want to call out a couple of things before I answer questions. Hey, if you've got more questions and you want to dive in deeper, why don't you join us at Spark + AI Summit, okay? Spark + AI Summit, June 22nd to 25th in San Francisco. We actually have two training days now, just because of their popularity. It's organized by Databricks, but it's really about the open source technology, so come on down, it's in San Francisco. Use the code DennySAI020 (that's Denny, my name, in the code), and you get 20% off. So go ahead and give that a shot. We'd love to see you down there, and we'd love to answer your questions both online and face to face, okay? And also if you have questions, come to delta.io or mlflow.org; we're already active on our Slack channels and on the distribution lists, so definitely go ahead and join us there, okay? So saying that, I'm going to leave a few minutes for questions, and we'll go from there, okay? Sorry, I'm trying to open the Q&A but it's not opening, all right, let's just close the chat. Okay, so I'm going to go through the questions here; I apologize, I'm just scrolling through them right now. Oh, by the way, as a small joke: when I said good morning everyone, and it's a good morning for San Francisco, that's from Karen. I'm actually Seattle based, so just a quick alert. There's a question: "Is there a difference between Databricks Delta, as in the Databricks version, versus the open source one?" By version 1.0 there will be no difference between the APIs.
Most of the API differences are actually resolved now, but I do want to call out that since we're a pre-1.0 project, we're still working to make sure that all of the functionality that exists in Databricks Delta, from a reliability perspective, is included in the open source version, okay? Oh, by the way, if you're asking questions, please don't use the Q&A panel, please put them in the chat panel; for some reason my Q&A panel is not opening right now. And so, by version 1.0, if not earlier, we'll make sure that every single feature that exists in Databricks Delta is also available in open source Delta Lake. Now, what is the difference then? The difference right now, primarily, is, for example, some of the UPDATE and DELETE statements that you saw. In the open source version you can run updates and you can run deletes, but you can only run them from the Python and Scala APIs. We're actually waiting for Spark 3.0 to release with Data Source V2, which allows us to make the changes needed to support the SQL statements as well, so that you can use Spark SQL for deletes and updates, like I showed you on the screen here, as opposed to using the API. But if you want to use the API, obviously you can use it right now to do deletes, updates and merges, okay? And so the main difference post-1.0 is that, well, all the APIs will stay exactly the same, it's just that Databricks itself is a managed service. So because we're a managed service, we're going to run Delta Lake for you; we're going to do performance enhancements, things of that nature. But it is an open source project, it's part of the Linux Foundation, so if you're going to put in PRs, yeah, we'll probably take them, right? Because it is an open source project. So we want to make that very clear, but right now, at least, that's the primary set of differences, okay? So there's a question: can I change the Delta store to Blob storage, a data lake store, or a database? Not to a database, at least certainly not yet. But you can certainly point Delta Lake at whatever file store you want. In other words, you want to run this on the Hadoop file system? By all means. If you want to run this on Azure Data Lake Storage Gen2, or S3, or Google Cloud Storage, that's fine; there's actually nothing stopping you from doing that, because, if you recall, in the end the primary thing we do is add an additional folder, _delta_log, where we put the transaction log. That's the primary difference that we've made, right? Outside of that, it's still technically the same Parquet files that you would see in a Parquet table on that file system, right? So we're still doing the same thing. There are certainly some optimizations done under the covers, but for all intents and purposes, it's still Parquet files. The most notable change is that when you're looking at the data directly, there's a lot more data in there, right? So for example, when I do a delete, technically I'm not doing a delete, right? Technically I'm writing a tombstone, because what I've done is create a new version of the data that doesn't contain the data that I deleted. This is the reason why I can go back and use time travel to basically build up the different versions, right?
I can go back to version zero and have the version of the data that says Iowa was null and Washington was 336, because the data still actually exists, right? It's still there. It's just that the transaction log allows me to keep track of that. All right, so that's the main difference. So if you tried, for the sake of argument, to run multiple transactions on your Delta Lake table but then just read the Parquet files without actually reading the Delta log, there could be multiple versions of that data sitting there, so presumably you could see duplicates or triplicates and so forth, because you'd actually be seeing the previous versions of the data as well. And because it's an open source project, we actually recently released a blog with Delta Lake 0.5.0 where we talked about manifests. This is how Presto and Athena can go ahead and look at Delta Lake data: they can read the Parquet data because you generate a manifest file, and the manifest file simply says, here are all the files that you should be looking at for the most current version of the data. That's it. So, for the sake of argument, there are 20 files in there, but if you want to look at the most current version, you only need to look at six of them; the manifest basically contains the file names of those six files. And then Presto or Athena reads that manifest, which tells them to go look at those six files, and from there you're pretty much good to go, okay? So hopefully that answers that question. So actually, in this case, I apologize, but I think our time is up now, so I will go ahead and end this session here. By the same token, if you do have questions, I highly advise you to join us at delta.io or mlflow.org, and as well, we will be sending an email out. Well actually, not so much an email, but we'll be sending out...
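For reference, a sketch of the manifest generation described in that last answer (the API was added in Delta Lake 0.5.0); the path here is illustrative.

    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "/delta/loan_by_state_delta")
    dt.generate("symlink_format_manifest")
    # This writes a _symlink_format_manifest directory listing only the Parquet
    # files that make up the latest version of the table; Presto or Athena then
    # define an external table over that manifest location.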
Info
Channel: Databricks
Views: 6,982
Rating: 4.9703703 out of 5
Id: hQaENo78za0
Length: 58min 45sec (3525 seconds)
Published: Thu Feb 27 2020