- Perfect, thanks very much. Hi everybody, my name is Denny Lee. I am a Developer Advocate
here at Databricks. We are having a fun session called, Getting Data Ready for Data
Science with Delta Lake. This session will be slides, of course but we'll definitely do demos and on top of that we'll also
go ahead and have a live Q&A. So if you have Q&A please go
ahead and click on the panel for Q&A and actually add
your questions there. Just because that's the
best way for us to see what's going on, okay? So, and my apologies for
those who are not currently in the Pacific Time zone, then good evening for you folks as well, or good afternoon for that matter. So, who am I? Let's start off with that. Well, as I said I'm a Developer
Advocate here at Databricks. I'm a hands-on distributed
engineer, systems engineer with data science background. I basically help build
internet-scale infrastructure. I was at Microsoft, part of Bing, and also part of SQL Server. I was on the teams that helped build what is now known as Azure
HDInsight, as part of Project Isotope, and worked with very large implementations of SQL Server and Azure. And I've been working with
Apache Spark since 0.5. So, at least I can have a good
sense of imposter syndrome to be able to talk a little bit about Spark and data engineering. So before I go into, (clears throat) excuse me, into the session, a
little about Databricks. We are the company that
accelerates innovation by unifying data science, engineering
and business all together with our unified data analytics platform. So the company, the
founders of our company were the original
creators of Apache Spark. We've also created Delta Lake, which is part of the Linux Foundation, as well as MLflow and Koalas. These are all really cool
open source projects, I'd love to go talk to you more about. We're gonna focus today,
primarily on Delta Lake. So, just to give some context
about the Apache Spark, if you're unfamiliar with it. The Apache Spark itself,
basically is at this point the de-facto unified analytics engine. It's an important aspect
actually to call out because it's one of the few frameworks that's able to uniquely
combine data and AI, where on the left hand
side as you're seeing here, we've got big data processing,
with ETL, SQL, streaming. On the right side we've
talked about machine learning, MLlib and SparkR. Well these are things that
Apache Spark incorporates, and it can ping a lot of
different data sources, whether you talk about S3,
Azure, Hadoop, Kafka, Parquet and other source ecosystems. And so that's part of the
reason why you see Apache Spark, as the de-facto (coughs) unified analytics engine. Apologies for that. So let's talk about what the
theme of today's session is, which is to get data ready
for data science, okay. We're not gonna go cover
all of the different aspects here just because it's a lot to unpack, right? But basically what you have here is the data science lifecycle. It typically starts with data preparation, and there's a whole set
of tools that are involved with that, okay? You've got, of course Spark
but then also in the ecosystem you've got SQL, not just Spark SQL, and various aspects of Python, scikit-learn, pandas, where you're gonna prep the data, okay, and then you're gonna need to figure out how to tune and scale that mechanism of data preparation. There's a lot of ETL that actually has to go into this in order to get to the point where you're actually able to train on the data. Then when I flip to the training cycle side of things, there's a whole set of other tools, which is like scikit-learn, PyTorch, also Apache Spark, R,
XGBoost, TensorFlow, and so forth and so forth. And if you happen to be using a technology that I didn't mention
it's not meant as a call out against your technology; quite the opposite, the point is that the ecosystem for training, and all these other technologies, is quite large. And you're gonna have to tune, and you're gonna have to scale that because, after all, when you're training there's also a whole host of hyper-parameters that you're gonna have to go
ahead and deal with, right? And you're gonna have to
tweak and change them, optimize them, right? So this is the reason
why it's so important that you actually have
a mechanism like MLflow, which we will show you at the tail end of this session, to track all these different variables. And then, all right, now that you've trained the model, you're gonna have to deploy it. And again you can do that with Apache Spark, or you can do Docker, or a machine learning service like SageMaker. There are all sorts of mechanisms for you to deploy this, which may or may not be the same sort of technology that you trained with. And that goes right back to the raw data because after you deploy it, there's gonna be raw
data that goes into it. You're gonna have to have governance to ensure that the data itself actually follows GDPR or
CCPA compliance rules. We're gonna get into that a little bit during today's session
because an important aspect of getting data ready for
data science is exactly that. The ability to ensure you have GRC as in Governance
Risk and Compliance, covered so that way you can ensure that any aspects of security and, just as important, aspects of privacy are covered. And then, in order to scale that raw data, again you're gonna be using
things like cloud storage or Delta or Kafka or
Hadoop or whatever else. And then, as more data comes in, it's gonna change, right. It's not just like BI, where it's just about aggregates and I have to report a trend, right. Those trends, those changes, because there are more red toggles purchased or warmer temperatures or whatever else, invariably not only affect the BI reports that you report out but also, just as important if not more important in this particular case, the data science you're gonna do against it. For the machine learning models that you're training, there's basically this concept of model drift, right? And it begins with data drift, like the changes in data over time, and this is normal, right? It's an expected aspect
of how things work. And so, this is a continuous cycle right. It's not like it ends at one point. If you're successful
and you're able to build excellent, awesome
machine learning models, deep learning models
or whatever it may be. You're gonna continue doing this, you're gonna continue training, you're gonna continue deploying,
continue getting new data. So in other words, not
just the data that you have but adding new additional data sources that maybe you originally never processed. Then you join that data and
then you have to prep it again, and then you have to
train again and deploy, and so forth, so it's a vicious cycle. An important aspect of
that also is that now that I just covered
some of the technologies that are very common in
that data science lifecycle. If I was to ask for a raise of hands obviously we're on a
YouTube Live/ Zoom setup so it's gonna be a little
bit difficult for us to pull that off but, if I was to ask you to raise your hand, how many of you know
all of the technologies that I've listed here,
and for that matter, all the technologies that actually are used in your environment, okay. So forget about even my
list, just talk about your environment of
every single technology are you an expert with Kubernetes, all the way to PyTorch, right? Are you an expert of every
single one of those technologies right? And while I'm sure there are some heroes that are listening to
the session here today, the reality is, the vast majority of you, not only don't know
all these technologies, really don't have the time to
learn every single technology because you're actually trying to be great at one of these technologies, right. And so that's a focal point for us, which is how do we survive
this data science life cycle. Well, the first thing we often need to do when it comes to the data science life cycle is to focus on this concept of
the role of data engineering, okay? Data engineering actually
ultimately becomes a very crucial crucial
aspect of how you actually deliver data science
and even today there's far too many teams that are
actually built up incorrectly. So what ends up happening is that data science is the tool du jour, or the word du jour or
the job du jour, and deservedly so. So don't get me wrong, I'm
not trying to insult anybody who's a data scientist here. Quite the opposite, I
think it's an awesome job. The context I'm coming from
is more of the fact that we hire all these data scientists, but we don't hire any
data engineers, okay? And so this actually becomes
really problematic, right? Because what ends up happening is that the data scientists themselves will actually have to do the data engineering, right? And they have to, because it's not like the data will magically appear in tabular format, cleansed, partitioned, in Parquet, with no data problems whatsoever. I mean if it does, that's great
and I'm really happy for you, but in reality that's
not what happens, right? What actually happens is
that you've got a lot of data you have to filter it out, it's dirty, you have to
aggregate or augment that data. All right, so in that process, right? You actually need data
engineers to actually bring software engineering
rigor to that process. It's not meant as a "okay, now I'm going to
make a data scientist "into a software engineer." Again, if you want to, that's awesome. And in fact I, we can probably
have another discussion about rigor, software engineering rigor, within data science itself. But before we can even talk
about that we actually need to talk about the rigor
for data engineering. And so, for example, in
one of my past lives, I was at Concur, now purchased by SAP. I had helped build up the data science engineering team. What ended up happening there was that the company had hired a lot of data scientists first. And so, really smart people, but they're all using a
disparate set of tools, right? And some were standard, some were not, and because they were using
disparate set of tools, and they were really working
in isolation from each other, what resulted is that
we had different copies of the same data, all over the place. This resulted in a problem where the data itself wasn't clean, right, or it wasn't following the
same business rules, right. And so one of the first things I had done was to bring in data engineers, or data science engineers specifically, engineers that were focused on the aspect of how to get data cleansed
and organized for data science. All right, and then exactly
as we go down the slides, it allows us to develop, test, and most importantly maintain those data pipelines, right. So you have this very reliable, quality data delivered quickly, efficiently and securely, right. That's what you need, you
need reliable data lakes, we're gonna talk about data lakes shortly, but this concept is that
whatever data source that your data scientists
are going to work with, it actually needs to be reliable. And this ends up becoming
of paramount importance in order for you to have
successful data science projects. So, when you look at these
like Big Data architectures, right, it gets complicated quickly, and I've oversimplified it here just so I don't have to give you a gigantic diagram of all the different batch sources and all the different streaming sources. You've got input sources, typically broken down by latency: either it's batch or it's streaming. Batch could be all sorts of things: a file source, cloud storage, a database. Streaming could be, well, you're streaming, picking up some form of REST API or whatever else, Kafka, Kinesis, Azure Event Hubs, things of that nature, right. Ultimately, you're gonna wanna put this
stuff in a data lake, okay. This concept of a centralized
single source of truth but it's. (coughs) Excuse me. Not really a single source
of truth from the standpoint that, oh, it's completely and utterly trustworthy, at least not yet. Right now, up to this point, it's actually a store, right, where basically I just dump any and all data that we have. And when we brought in the Hadoop stage of the data lifecycle, that was it. The great thing about Hadoop was that we could just take all the data, store it into our file system, and it automatically would replicate it so I could make sure it was reliable and I wouldn't lose the data. I had to deal with the concept of eventual consistency, but nevertheless the data was there. Yay, I'm good to go. And then we would do schema-on-read, where basically at the point in time I was querying the data I would actually derive the schema out of it. But guess what, like that's a lot of data. And we weren't sure what we're keeping, maybe we need to filter that out, maybe we need to cleanse it, maybe we need to go ahead
and actually remove it because that's actually corrupt. But nevertheless, now we have a data lake. And so, and there is
ultimately a data consumer. Now we talked about AI, we
talk about machine learning, deep learning, BI. And from where I'm sitting, at least from the standpoint of reliable data lakes, there's actually no difference, right. It is very much about this concept that you're going to have consumers
that look at this data, that want to understand
how this data works. Okay. And that's it. Right. So, whether they're running
a machine learning model or they're running a BI report, they're the data consumer, (mumbles). All right. So, these are the basic concepts. With these input sources
and your data lake, you've got a store for structured and semi-structured data. (cough) Excuse me please, sorry. You pull data in from various input sources, and there is a single central location to access all this data, basically breaking the silos down. That's what's great about this: everybody can come in. All the different teams, instead of actually working separately from each other, can just all go to the data lake and grab that data. That's awesome. All right. It's an open, accessible format, there's no vendor lock-in, and you can run SQL and machine learning against
a single source of truth. So, big data architecture like this
where you have a data lake, that's super promising super awesome, but there are always
issues with that of course. And these are the data
reliability challenges with the data lakes, okay. If you have failed production jobs. It leaves data in a corrupt state requiring tedious recovery. What that means is like for example you're running a Spark job,
or Hadoop or whatever, right, it's multiple tasks, the task
fails, the entire job fails. Well, it's written something to disk, right, and if it's written something to disk that means at that point in time when somebody else goes
ahead and reads from disk and disk in this case
can be cloud storage, so I'm not trying to
specify one or the other. There's a bunch of remnant data
that actually could be read that's actually not trustworthy. It's not reliable, it's not
something you wanna actually be reading. Alright so that's actually a big problem especially when the more
data you're working with, the more likely jobs will fail. And if the jobs fail, that means your data scientists
are writing algorithms that actually are reading data
that's actually incorrect, which is really really bad. (coughs) Excuse me. Okay. So there's a lack of quality enforcement which means this inconsistent
and unusable data. That quality enforcement can be something as simple as your schema. In other words, an additional column gets noted inside the source data because the source data had an extra comma; not necessarily because there was corruption, but maybe because there was actually text with a comma in it. The codebase never understood it, so all of a sudden it's actually adding an additional column to the data, thereby putting incorrect data in the wrong format, so when you read it, you're actually reading the wrong thing. All right. And the final thing, but
the most important thing is lack of transactions. This lack of transactions
itself basically prevents you from ensuring trustworthy data sets; you get corrupted data when you, for example, have a failed production job, right. If you actually had a transaction protecting it, what would happen is that when the job fails, anything that you wrote to disk will actually be reversed; it will remove the files that were written, okay. Because that's what a transaction would do, right. This is what traditional RDBMSs, Relational Database Management Systems, are actually able to do. And this lack of transactions
makes it nearly impossible to actually combine batch and
streaming together, right. Because as you have
streams of data running in at the same time, when, at what point can you trust
the data as it's coming in? Well if you have a transaction
that'll protect it, then sure, then you've got
something to work with. But if you don't have a
transaction protecting that data, you don't know if the data
that's being written in here actually is the final state. And so how can you then combine
that with your batch data. All right. So now let me look at the pipelines, okay. And so with those pipelines
this actually gets really funky really fast, right. So for example I'm gonna just happen to
use this particular design; you can use something else, whether it's Kinesis or Azure Event Hubs versus Kafka, but I'm just gonna use Kafka, for example. (coughs) So, when you have these complex pipelines, you're increasing the engineering overhead: the events go to Kafka because you're doing streaming, so you'll run Apache Spark Structured Streaming against Kafka, and it'll get written to the table. The table gets continuously written, and then, for example, there'll be a stream that goes from the table out to a report, right. But these Spark jobs get
exponentially slower as there are more and
more of these small files because that's what happens
with streaming jobs, right. You create lots of really small files, the more files there are, the more overhead there is
for Spark or for that matter any system to go ahead
and read those files, and that overhead basically
slows down performance. All right. Then let's add the fact that okay well, how about this late arriving data, right. In other words, the data that
was supposed to be processed every five minutes or every 10 minutes. In fact there's data from two days ago that, that they forgot to process. I need to go back in and process it so that means, in that case I have to potentially delay processing because of the fact that
there's a delay in that data. And then, okay, maybe I could change that: I have a continuous stream, where the left table is receiving data that's constantly streamed in, but then I have a batch process that goes ahead and compacts it and organizes it, so that way it takes care of the late processing delays, okay. So now it's getting
increasingly more difficult. Okay. But then, more times than not, this latency doesn't satisfy
business needs, not to mention it's getting more complicated, right. (cough) So then what do you typically do? You typically switch: oh well, then let me do a lambda architecture, 'cause a lambda architecture is where I have streaming and I have batch. So the upper portion is where I'm streaming, and then I have this Kafka stream to your unified view of the streaming and batch data, so your AI and reporting can go off that. Meanwhile I'm batching data: another stream against Kafka at the same time does the writes of the data to the table in the bottom left, and I have a batch process that goes ahead and processes the data into the table. It processes it in batches, so that way we can take into account the delays of data, and the batch processes into the unified view. Now the unified view has
both streaming and batch, okay. And by the way, I'm going this fast not
because I'm trying to emphasize that you're supposed
to listen to this fast, I'm going this fast because
I'm trying to explain to you this gets more and more complicated as I go through every slide, okay. So now the problem with that
particular option is that you're gonna actually have
to have a validation check that checks the data that goes
through the upper streaming, the top left streaming. So that way we can make
sure that the batch data and the streaming data actually have the same values, otherwise your unified view
will actually result in reconciliation problems. Okay. And then if you need to fix mistakes, well, the validation takes
care of the streaming data but you need to go ahead
and reprocess the data that gets written to the
batch in the bottom left, and so forth and so forth, okay. So this becomes really, really complicated, and then, not to mention, updates and merges. Again, this is a really
complex thing to do. (coughs) Okay, now I'll stop and breathe. The question then is,
can this be simplified? Because in the end, I know many people who are saying "Hey, I'm a data
scientist, I wanna go ahead "and talk about algorithms" but this is the reason
why we did talk about getting data ready for data science with Delta Lake and MLflow because we're trying to
get across the point that there's all these data
engineering pipeline steps that you have to do before
you get to do the fun stuff. We'll actually have sessions later on which dive into all these different machine learning algorithms
that are available in Spark (mumbles). So definitely chime in the
comments so we can know how to customize these tech talks. So we can talk about stuff that
you guys are interested in, but, or which algorithms you
would like us to tackle first like XGBoost or PyTorch or whatever else. But the point is that before
I can even go to that area, I need to make sure I can
handle the data in production. Right, because in production,
this is the scenario we have. And of course the
question I'm gonna ask is, can this be simplified? Even if I had a dedicated
data science engineering team that is focused on
building these pipelines, this is a lot to maintain. Right. So how can I simplify this. And so the key thing to
simplify this is that we want to introduce ACID
Transactions back into the fold. Okay. And so ACID transactions basically were popularized because of relational databases, okay. What they did is introduce this concept
of atomicity, consistency, isolation and durability. We have actually separate
videos that go into the details of what those mean, but the context basically is
that if you write something to disk, cloud or on prem, it means it's got written. If there's an error, it will fail out. Any other clients that are
concurrently trying to read from it can reliably know
that the data is either in there for them to read or it isn't, and they'll get that information right from the get-go. If you have concurrent writes to the system, those with ACID Transactions have the ability to coordinate the fact that there are concurrent transactions and prevent corruptions, overwrites and any other issues. Once you have that, you can actually handle batch, streaming, updates and deletes easily. For example, if I go back to my days with SQL Server, right, I could handle streaming and I could handle batch at the same time in my one single database because I had ACID Transactions, which allowed me to do super-fast OLTP; when I was in financial services, writing data super fast, and it was no problem;
I could handle things, I had ACID transactions at the same time. I was going ahead and
doing batch processing of aggregate data or other sources of data that I'll eventually join
with the financial data. So, again I could go do that. The problem of course is
that I couldn't scale it, at least not to big data
internet-scale type sizes. Right. If I could get a database to do that, I'd still be at SQL Server, and this is not a knock on SQL Server,
it's a great product. It's just simply saying
that when I have to work on big data problems where I
have to handle streaming, and I have to handle batch, and machine learning, data science, all these other things, all together, this is is where, why we
started looking at things like Apache Spark, right. And then, in order for data scientists, for you to even do BI reporting, reliably, we needed to reintroduce ACID Transactions back into big data systems. And that's actually
why Delta Lake's there. Delta Lake allows us to bring ACID
Transactions back to Apache Spark in big data workloads. So ultimately you get consistent reads, and you get rollbacks. So in other words if there's an error, I can roll back on that,
which is pretty sweet. So Delta Lake, as we say here, right, it is an open format based on
Parquet with transactions, utilizing the Apache Spark APIs. Okay. It allows us to address the
data practitioner's dream: process data continuously
and incrementally as new data arrives in an efficient way without choosing between
batch or streaming. Right. It's the same difference to us. Right. So that previous screen that
you saw gets simplified so it just becomes this
one line per segment that you're looking at,
which is sort of nice. All right. The key features of Delta Lake is that it has ACID Transactions,
schema enforcement, unified batch and
streaming and time travel with data snapshots. And we're actually going to demo that in a few minutes from now, so that way you can get to
see it in showcase, okay. (coughs) Included with this concept is the concept that's
called Delta Architecture. So in other words, when
you design your systems, what you want is, you want
to make sure that there actually is a continuous flow model that allows you to go ahead and unify the batch and streaming. So how you do that is that
you start with your source, and you build, in essence,
what we deem as a bronze table, right, this is raw ingestion. This is similar to how you
traditionally used your original data lake. Okay. You dump the data in there and
you're basically good to go. Right, that's how you basically start off. Right. It'd be equivalent to basically saying, I'm doing lots of inserts to the data. So, what ends up happening here is that you have that, okay then you have silver, all right. The silver basically is filtered,
cleaned, augmented data. Okay. And then, so what that means is that when you process the data. Oh excuse me. When you process the data, you can filter it out, for
sake of argument all the logins that you don't need or whatever
else that you want to do. Okay. You can clean the data and augment it with additional information
from other sources and you can combine that together. Okay. And then there's gold data, okay. So, this is where you can then run your business level aggregates,
this is where you can do your AI, your streaming, excuse me, your AI, your data science, your machine learning, deep learning, things like that. This is where then often you will do your merges and your overwrites, okay. This isn't to say you can't do inserts in gold or you can't do deletes in bronze. It's just simply saying that from a traditional data flow perspective, basically a data pipeline perspective, it is more or less like this: inserts in bronze, deletes in silver, merges and overwrites in gold, all right?
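For readers following along, here is a minimal sketch of what a bronze, silver and gold flow can look like with Spark Structured Streaming and Delta Lake. The Kafka broker, topic, paths and column names are assumptions for illustration, not the demo's actual configuration, and the Kafka source requires the spark-sql-kafka package on the classpath.

```python
from pyspark.sql import functions as F

# Bronze: raw ingestion, land the source events as-is.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
       .option("subscribe", "loan-events")                  # assumed topic
       .load())
(raw.writeStream.format("delta")
    .option("checkpointLocation", "/delta/_chk/bronze")
    .start("/delta/loans_bronze"))

# Silver: filtered, cleaned, augmented data.
bronze = spark.readStream.format("delta").load("/delta/loans_bronze")
silver = (bronze.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", "addr_state STRING, amount DOUBLE").alias("r"))
          .select("r.*")
          .filter("addr_state IS NOT NULL"))
(silver.writeStream.format("delta")
    .option("checkpointLocation", "/delta/_chk/silver")
    .start("/delta/loans_silver"))

# Gold: business-level aggregates for BI, data science and ML.
gold = (spark.readStream.format("delta").load("/delta/loans_silver")
        .groupBy("addr_state").agg(F.sum("amount").alias("total_amount")))
(gold.writeStream.format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/delta/_chk/gold")
    .start("/delta/loans_gold"))
```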
And like I said, it ultimately goes ahead and allows you to view your streaming analytics and AI reporting. It also allows you to reprocess the data if there are corruptions or issues with the data, or changes
in business requirements, that require you to reprocess because you have the
original raw ingestion, you can go back to the beginning and reprocess it so you have the silver, you have the silver and gold data again. So this is the concept of a
Delta Lake Architecture, okay? And basically, each step that you see
here is a progression of how you can make things cleaner, okay? As you go, the data quality basically goes up. All right? This architecture then reduces the end-to-end pipeline SLA, reduces the pipeline maintenance burden and eliminates the lambda architecture for minute-latency use cases, okay? And so, we make this architecture real because you've got the
Optimized File Source, ACID Transactions, Scalable metadata handling
for high throughput, Inserts, Updates and deletes, okay? Data versioning, and Schema on write option
to enforce data quality. Now, I skim through this really quickly because I'm gonna show
you how that all works. So you know what I'm actually
gonna stop presenting now and I'm actually gonna go
right to demo because I'll end with how do I
use Delta Lake, okay? So give me a second. All right, perfect. So. So here's a notebook, by the way, and by the way, this notebook will be available for you to go ahead and download and use yourself. Okay so this is a, you can use it on Databricks
Community Edition, so you don't need to purchase Databricks. It is a Databricks notebook, but you can use it on the free Databricks Community Edition; I can even pull that up actually here, okay? Now, what I've done up to this point is I've just created a quick table that looks like this. Address, sorry, the state,
and the counts, okay? Basically what we're doing, is we're counting the
number of loans, okay? That's all we're doing. (coughs) Okay, so let's sync here. Now, in order for us to do this, what I first did is I
easily converted this from Parquet to Delta Lake format. What I did is, in this particular case, simply run a 'CREATE TABLE' statement. So I basically converted what used to be Parquet into Delta Lake. There's actually a Delta Lake one-step migration API as well that allows you to do that, but I just chose to do it this way.
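As a rough sketch of those two routes, with hypothetical table and path names rather than the notebook's exact ones, this is roughly what that CREATE TABLE statement and the one-step migration API can look like:

```python
# Route 1: create a Delta table from the existing Parquet-backed table (CTAS).
spark.sql("""
  CREATE TABLE loan_by_state_delta
  USING delta
  PARTITIONED BY (addr_state)          -- partition column is illustrative
  AS SELECT addr_state, `count` FROM loan_by_state
""")

# Route 2: the one-step, in-place migration of an existing Parquet directory.
from delta.tables import DeltaTable
DeltaTable.convertToDelta(spark, "parquet.`/data/loan_by_state_pq`")
```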
Nevertheless, now I have a Delta table. And this Delta table basically, you'll notice here, there is Delta, dbfs, this is the Databricks file
system, loan by State Delta. I've called it the partition
columns, things of that nature, and you're good to go. Okay, so in other words
it basically lets you know the schema, all right. So I've just done this real quick setup. Now, what I will do is
I will now run it live from this point onwards, okay. So, if I look at the file system what you'll notice basically
as it's showing its stuff, okay, is that these are all the
different Parquet files, and so this looks pretty similar to just pretty much like any
other Parquet set of data that you actually are used to, okay. You have a table, loan by state Delta that has a bunch of Parquet files. Yay. Except the one big difference
is the Delta log folder. And the Delta log basically is
exactly what the name sort of implies: it's a transaction log. Every single transaction, every single change or modification to the data is recorded, and it's recorded in this format, as a series of these JSON files. Okay. And so the _delta_log contains those JSON files. So you've only seen one
of them, the zero file, mainly because that's the
creation of the table. All right. So let's go ahead and actually
run the streaming section, where I'm gonna go ahead and do a unified batch and streaming source and sink. So what you notice here is that I have this Delta Lake silver path that I initially had set up. All right, I'm gonna do a read stream; I'm gonna run a streaming job, and I'm gonna read whatever data is coming out from that Delta Lake silver path. That's this path here, by the way, that I called out before. All right. So now I'm gonna go read that stream and, in this case, just output the running counts by state.
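A minimal sketch of that read stream in PySpark, assuming the path and column name below; the in-memory sink is just a convenient way to inspect the running counts interactively:

```python
# Read the Delta path as a stream and keep a running count of loans by state.
loans = (spark.readStream
         .format("delta")
         .load("/delta/loan_by_state_delta"))      # illustrative path

by_state = loans.groupBy("addr_state").count()      # column name assumed

query = (by_state.writeStream
         .format("memory")                          # in-memory sink for the demo
         .queryName("loan_by_state")
         .outputMode("complete")
         .start())

# Peek at the running aggregate at any point:
spark.sql("SELECT * FROM loan_by_state ORDER BY addr_state").show()
```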
My apologies for you non-Americans: the dataset I have is currently
only for the United States. Okay. So, the explanation of this
particular dataset, by the way, is that these are loans. So these are loans by state. So in other words, the number
of loans for each state that we have, okay. So let's let that sucker run. So perfect, okay, I'm gonna answer some
questions, real quick, as we run this through. For some reason it's
running a little slower than I expected. Actually, you know what, I'm gonna see if I can
just run it from here. Oh, that's why it's always
good to have a couple backups. Ah ha! All right, so, I'm just
gonna skip that step. I want this step. Okay, cool. All right, So, this one is
running a little faster today so I might use that one. Okay. So, as I wait for the initial
streaming jobs to kick in, basically, the quick call out is "will this recording of the Zoom webinar "be available as well?" Yes, just as a quick note this webinar, this tech talk is gonna be
available for you to go ahead and work with no problem at all, okay. So all right. So, for some reason this one's
running a little slow today so I'm just gonna use this one, ha ha. All right, cool. So right now, poor Iowa is null. Okay there's no values
for Iowa so let's go ahead and run that right now. This is a great thing about live demos. So I'm just going to insert data into it. Now, note the fact that I'm inserting data into loan by state Delta. What that means is I'm inserting
data into the batch table, not the read stream, not the write stream, but the batch table. It's ultimately going
to the same file system, but it's really, it's right now doing a
batch process of insert, as opposed to doing a write stream. And what you notice is Iowa's slowly steadily
building up more data, as you can see that it's progressing, from 450 to 900 because (coughs) I'm just doing an old
school insert here right now but the context and by
the way, in production you don't really wanna do it this way. I'm just doing it so that way you can see I'm doing a batch write concurrently with a read stream; that's actually all I'm trying to do here, okay. So that's just a quick call out here, okay.
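A sketch of that concurrent batch write, assuming the same illustrative path and columns as above; both forms append to the very table the stream is reading:

```python
# Plain batch append against the same Delta path the stream is reading.
new_rows = spark.createDataFrame([("IA", 450)], ["addr_state", "count"])
(new_rows.write
    .format("delta")
    .mode("append")
    .save("/delta/loan_by_state_delta"))

# Or an old-school SQL insert against the table, as in the notebook:
spark.sql("INSERT INTO loan_by_state_delta VALUES ('IA', 450)")
```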
So, all right, I believe most of the data has been processed already; the six loops, there you go. And for that matter, I will go ahead and take
a look at the data here. So if I go ahead and look at the log. Oh, sorry, I forgot the
stuff, there you go. You'll notice that now
I have multiple files: we have the original zero, but we also have one, two, three, right up to six, all here. These are the six inserts associated with the six insert statements
that you have here. Okay. And we'll show you actually
how that looks shortly. I just want to show you
the file system first. And then I can go back to the
loan by state Delta, right. So in other words I had the read stream which was showing it all here, okay but I can also go ahead and
see it back in the batch. And sure enough, this is the same value. It doesn't seem that interesting, but the important aspect here is that, now I can read and write,
whether it's streaming or batch to the exact same file system, and I can do so in a reliable fashion. And that in itself is a very
powerful thing that I could do. Okay. And so, what's also great about Delta Lake is that now if you are thinking yourself like I am somebody who is
used to using SQL syntax, because you may come from a SQL background or, for that matter, just like the syntax, which is understandable. What's great is that with Delta Lake on Spark you can actually run deletes, you can run updates and merges, okay. So for example, I just created the same version of the same table, except I'm creating a pq one, i.e. a Parquet version of that. So for the sake of argument, if
I was to run a delete, okay, it'll fail, right? The delete only works when it's supported by Delta. So not a big deal. It's because I realized: wait, no, I wasn't supposed to put the data in Iowa, I was supposed to put that in Washington. So, no problem. I can go ahead and run the delete right there, okay.
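A sketch of that delete, with the SQL form he runs on Databricks and the equivalent DeltaTable API call (which is what open source Delta Lake 0.5 supports, as he notes later); the predicate is illustrative:

```python
# SQL form of the delete against the Delta table.
spark.sql("DELETE FROM loan_by_state_delta WHERE addr_state = 'IA'")

# Equivalent programmatic form via the DeltaTable API.
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/delta/loan_by_state_delta")
dt.delete("addr_state = 'IA'")
```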
And so, boom, now that delete's done, I can just go run this
particular statement, and I can look at my state all over again and I'm back pretty much to
the zero state where Iowa is null. Okay. (coughs) So now I want to run an update statement. So this is like our seventh
command, which is basically doing the update, sorry, the delete. The eighth command does this update, where I'm gonna try to update it because that 2,700 value that I saw there, I was actually supposed to give it to Washington, not Iowa, okay. So, of course it fails with Parquet. So, again, these failures are
actually designed to happen. So, by the way, when you get the notebook and you upload it and run it, you'll see three errors right away; those are two of the three errors that you just saw, okay. You'll see the third error pretty soon. And so then we have the update statement: again, same idea, with Delta Lake I can go ahead and do that right away. Not a big deal, we're good to go again, okay. And then again, let's review the data. Sure enough, here you go, okay. So, Washington State has 2,700, okay.
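And a sketch of the update, again in both the SQL form and the DeltaTable API form; the condition and new value are illustrative stand-ins for the notebook's:

```python
# SQL form: reassign that count from Iowa over to Washington.
spark.sql("""
  UPDATE loan_by_state_delta
  SET addr_state = 'WA'
  WHERE addr_state = 'IA'
""")

# Equivalent DeltaTable API call.
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/delta/loan_by_state_delta")
dt.update(condition="addr_state = 'IA'", set={"addr_state": "'WA'"})
```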
And then it's the same idea if I want to do a merge statement. So, I'm just
gonna run that obviously, I'm just gonna to run that
in Delta at this point because I think I've
nailed that point across. I want to update Iowa to 10. I want to change California to 2,500. I want to make Oregon
null for some reason. So nothing against Oregon,
I'm actually born there, so nothing against them. But normally a merge statement's
actually these seven steps: identify new rows that need to be inserted, new rows that have to be replaced or updated, and rows that are not impacted by an insert or update; create a new temp table where all three of those sets get written; delete the original table; rename the temp table
back to the original table; and drop the temp table, okay. That's what a merge usually is, okay, if you were to run this yourself in a standard distributed setup. Or you can just run this one MERGE statement in Delta Lake. So again, we try to make things a little bit simpler.
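Here's a hedged sketch of what that single MERGE statement can look like, using an inline source with the same three changes he describes (Iowa to 10, California to 2,500, Oregon to null); table and column names are assumptions:

```python
# One MERGE statement in place of the seven manual steps.
spark.sql("""
  MERGE INTO loan_by_state_delta AS t
  USING (SELECT 'IA' AS addr_state, 10 AS cnt
         UNION ALL SELECT 'CA', 2500
         UNION ALL SELECT 'OR', NULL) AS s
  ON t.addr_state = s.addr_state
  WHEN MATCHED THEN UPDATE SET t.`count` = s.cnt
  WHEN NOT MATCHED THEN INSERT (addr_state, `count`) VALUES (s.addr_state, s.cnt)
""")
```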
So again, we made California 2,500, we nullified Oregon and, sure enough, Oregon's null, California's
2,500 and Iowa's 10, okay? So this is really cool stuff. All right. But I also wanna evolve the schema. So for example, the counts
that I have here right now, that's just simply the counts, right? That's all this is. It simply tells me the count of loans over time, but the real business value of it should actually be the dollar amount. So I've generated this, by the way, okay. This is obviously fake data, okay. So I've got the state
and I've got the count, but now I've got the amount. And so now I wanna just go ahead and say, yeah, yeah, let me include
that amount of information because that's super important. Well, that's the fourth error here, because we actually want to enforce the schema. So this is actually a desired error. In other words, as you're putting data
into your data lake, with Delta Lake, you basically ensure that you're not gonna overwrite
the data that you have there or you're not going to go ahead and put potentially
corrupt data because maybe it was human error; we weren't supposed to put the amount data inside there, it was just by accident, okay. Well, that's fine. So with schema enforcement, we're actually preventing
that from happening. But if I add this option, the mergeSchema option set to true, well, what's great about this particular option is that then I can do exactly that. I am now saying to myself, "No, I want to evolve the schema because what we have here is in fact correct. So let me go do that right now."
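A sketch of that schema-evolution append, assuming a small DataFrame that now carries the extra amount column; without the mergeSchema option this same write is what throws the schema-enforcement error:

```python
# New rows that carry the additional 'amount' column.
loan_amounts = spark.createDataFrame(
    [("WA", 2700, 1500000.0)], ["addr_state", "count", "amount"])

# Opting in to schema evolution lets the new column through.
(loan_amounts.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/delta/loan_by_state_delta"))
```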
And sure enough, now I can query the exact same data from the exact same table, the loan by state Delta, and in this case I'm looking at it by amount as opposed to by counts. All right. And then we're going
to finish this off with time travel and MLflow, okay. So I want to describe the history of my loan by state Delta table. Remember how I had shown all the files? Well, here's a better view of that stuff, right. Every single change I did, the initial creation, the six inserts that we did, the delete, the update, the merge, and the append that we just did, they're all recorded right here. So now we know what transactions occurred.
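For reference, a sketch of the history and time-travel queries he's describing; the SQL forms below are the ones available in his Databricks environment, and the DataFrame reader option is the equivalent programmatic route:

```python
# Show the table's transaction history (creation, inserts, delete, update, merge, append).
spark.sql("DESCRIBE HISTORY loan_by_state_delta").show(truncate=False)

# Query an older snapshot with SQL time travel...
spark.sql("SELECT * FROM loan_by_state_delta VERSION AS OF 0").show()

# ...or with the DataFrame reader option.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/delta/loan_by_state_delta"))
```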
But what's really cool about this is that, because I have those transactions and I also have the original files, if I simply query loan by state Delta VERSION AS OF zero, which is the initial one I originally created, I'm literally going back in time, and this is what my table looked like initially. Okay. Where Iowa was null, Washington was 336, not 2,700, Oregon was still at 182 and not null, and California had a different value of 2,006. Okay. Versus the ninth version of the table, where basically I was able
to go ahead and include everything except the addition of the amount column, and sure enough, here's what the table looks like now. So Iowa's 10, Oregon's
null, Washington 2,700, California's 2500. So, all of that is actually retained. So you can go back in time
and recall what version of the data you had assigned
to that transaction. And actually that's pretty
powerful, as you can guess. Okay. Well, why is that powerful from
a data science perspective? So here's what I'm doing, I'm just simply downloading
some U.S. census data because I'm building
a really boring model. Okay. Oh, a really boring model
where I'm actually combining the Delta Data, the loan state data, for each and every single
different version of that data, by the population of that state, okay. And so that's what this
particular set of queries are for. But now what I'm doing is I'm doing for the data
version zero, six, and nine. Zero is the original version,
the real version of the data, six is the version of
data that got modified where I went ahead and, nullified Iowa and increased Washington at the 2,700 and then the ninth value
of it was basically one where I really messed it up
where I also nullified Oregon, added a ten to Iowa and also messed up, messed around with California
from the front of it. Okay. So, what I'm doing here is actually just running a standard, I'm actually running Yellowbrick so I can do some residuals, so I can just visualize it. But I'm actually just simply doing, (coughs) excuse me. I predict a loan count,
that's all I'm doing. So I'm doing a standard linear regression. I'm using the exact same
metrics for each and every one. It's basically like if I
was to deploy the model, so I'm just gonna run that right now. I'm sorry, I am gonna
simply run the code, okay. And then now I'm actually
gonna run the code. We're actually gonna run
the predict loan counts for the three different versions. Version zero, the original,
sixth version and ninth version. The reason I'm calling this out is that I'm sort of faking this
concept of data drift. And as you are running your
machine learning models, the data can change over time. Now maybe it's because you
screwed up on the model or I screwed up the model for that matter. Or maybe it's because the
data actually did change, but how do you know? And so what we're doing here is actually we're gonna actually do a run, which is basically like I said, we're going to do a linear regression. And what's sort of cool
is that this model, okay, is your standard linear regression pulled right from Spark. That's all this is. So the only thing I really did to make this somewhat interesting is that I added these statements
with MLflow's start_run, and I also included the data version, and I log the metric and I log the model itself, okay. I also happen to log the Yellowbrick residuals view as well, just because it's sort of nice, okay. So that's it. That's all I've really done, all right.
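A minimal sketch of that instrumentation: a standard Spark linear regression wrapped in an MLflow run that logs the data version, the RMSE metric and the model itself. The column names and hyper-parameters are assumptions, not the notebook's exact values:

```python
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

def train_on(df, data_version):
    # df is assumed to already have a 'features' vector column and a 'label' column.
    with mlflow.start_run():
        mlflow.log_param("data_version", data_version)
        model = Pipeline(stages=[LinearRegression(maxIter=10, regParam=0.3)]).fit(df)
        rmse = RegressionEvaluator(metricName="rmse").evaluate(model.transform(df))
        mlflow.log_metric("rmse", rmse)
        mlflow.spark.log_model(model, "model")

# One run per data version, e.g. for the versions shown in the demo:
# train_on(v0_features, 0); train_on(v6_features, 6); train_on(v9_features, 9)
```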
So what happens is that if I go ahead and click on the runs folder here, all right, all three runs are here and
as you can progress, okay, these are the older
versions I did last night, but this is the ones that I
did live as you can tell here. Version zero we have an RMSE of 65, okay. Not great, but not bad. But as you can tell using the
exact same hyper parameters, and again, this is linear regression, so not really that interesting, but still it gets worse. With version six, this is the
one where I, (coughs) excuse me, went ahead and bumped up Washington State, okay, and the RMSE got worse. And then I went ahead and changed California, added data to Iowa that shouldn't have been added, and nullified Oregon. As you can tell, it's
getting worse, worse and worse. Right. And so let me go
ahead and click on MLflow and open up the window here, okay. And by the way, even though I'm running
this all on Databricks, this is actually all
open source technologies I'm talking about. I just happened to be showing
you this on Databricks but this is all open source stuff, okay. So, if I want to do a
comparison, I can just compare the three different runs
for versions nine, six and zero. And again, you can tell that basically the RMSE is much better
when I go with, go here, right, with the zero versus v six, v nine. You can also visualize
it here, right here, at the RMSE to see the different runs. But what I also like about
when I'm working with, MLflow is that actually, for example, I'm looking at this one
version nine, right. So I can tell the RMSE is horrible, right. But I also went ahead, like I said, and did Yellowbrick, and I looked at the residuals, and right away I can tell. Okay, by the way, train basically means the v zero data and test means the v nine data. The training data basically is 0.97, so the R score is actually not bad, but the R-squared for my v
nine data is really horrible, right. And so it's really clearly
indicating the drift here, like the changes to the data. Now of course this is known because we manually manipulated the data, but the context here is that
as time progresses, over time, you have that ability to go
ahead and with Delta Lake, not only be able to
reliably look at your data and trust your data but you actually have data versioning. So you can figure out what
the heck's going on and, if things have changed
or progressed, it's okay. You can go back and verify, yes, my v zero model actually looked good. And so for that matter you
can even redeploy the model. So we have the ML model right here. It tells you exactly which
version that you were using, what Python version, what Spark ML, the conda environment that goes with it. So these are the dependencies for MLflow: this version of PySpark,
the version of Python, and for that matter all
the metadata and the stages for your ML model are now stored directly in MLflow. So I can go back and redeploy
it or go back and change it. So because there's always
invariably the issue that maybe the data that came in,
actually there were problems, but then later on they got cleaned up and fixed. So you can go back and
use this model, okay. So as you can tell, there's a lot of power by combining these two open source technologies
together to help you go ahead and get yourself
data ready for data science. Okay. So I'm gonna finish up
my presentation here and then I'll still leave
a little bit of time for questions. So, hey, how do I use Delta Lake? Okay. So to use it, obviously we're gonna give you this notebook, which you're going to be able to download. That's great. But if you wanna run this yourself, directly right from the terminal or from a Jupyter notebook, that's fine. Add the Spark package: --packages io.delta:delta-core. I'm using Scala 2.12, and the current latest version is Delta Lake 0.5.0, so there you go. And then I'll keep on
updating the slide deck when we keep on updating the Delta jars, which we're doing at a roughly
about six to eight week pace. Okay. You can also of course use
Maven, which is right there. Same idea: your artifact ID is delta-core, and the current version as of this recording is 0.5. When you write, instead of using Parquet, in other words dataframe.write.format("parquet"), just simply change it to dataframe.write.format("delta"). And that's it. Now, just like that, you're using Delta Lake, okay.
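Putting those two pieces together, a sketch of what that looks like in practice; the launch command is shown as a comment and the DataFrame is a throwaway example:

```python
# Launch Spark with the Delta Lake package on the classpath, e.g.:
#   pyspark --packages io.delta:delta-core_2.12:0.5.0

# Writing Delta instead of Parquet is just a format swap.
dataframe = spark.createDataFrame([("IA", 10)], ["addr_state", "count"])
dataframe.write.format("parquet").save("/tmp/loans_parquet")   # before
dataframe.write.format("delta").save("/tmp/loans_delta")       # after
```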
And we already talked about the updates and merges, we already talked about the time travel. And yes, because I have that time travel capability, pretend a bunch of those steps, six, seven, eight, nine, were wrong. I can now roll back. This is an example of it: basically I can go roll back to the sixth version. And so now, from this point onwards, I'm back to version six of the data, or sorry, version zero of the data for that matter, the data that I actually could trust, okay.
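One way to express that rollback at this point in Delta Lake's history is to read the trusted snapshot with time travel and overwrite the current table with it; this is a sketch under that assumption (dedicated restore commands came along later), with the path and version illustrative:

```python
# Read the snapshot you trust...
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/delta/loan_by_state_delta"))

# ...and overwrite the table with it, which becomes a new, trusted version.
(v0.write
   .format("delta")
   .mode("overwrite")
   .save("/delta/loan_by_state_delta"))
```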
And so this is a small but fundamental component of that data science life cycle: by basically adding Delta Lake and MLflow, you can ensure that
no matter what tools you're using to do your analysis, no matter what tools you're
using to do your processing, you're actually gonna
have a reliable data lake. And you can track both
the data version itself and the model versions that
you're working with, okay? So if you want to build
your own Delta Lake, please go to https://Delta.io. Delta Lake has actually gotten more and more powerful over time. We have various connectors: we're currently working on Hive, and Presto is already out. We have connectors for Snowflake, Athena and Redshift, and Hive should become public very soon. There are lots of providers and partners, whether it's Tableau, Informatica, Qlik, Privacera, Attunity, Talend, WANDisco, StreamSets; and Google Dataproc recently announced that they are supporting Delta Lake as well, which is pretty cool, right? And, as you can tell, there's a ton of users. I'm not going to go through them all, but there's a ton of users of Delta Lake, and we constantly update the Delta.io page for all that. If you want the notebook
from today's webinar, or our tech talk here, try it out on Databricks Community Edition; you can download it from this link. We're going to send you an email out with all this information anyways, which includes these slides, by the way. But yeah, it's right there for you to download and try out. (coughs) I did wanna call out a couple of
things before I answer questions. Hey, you got more questions, you
want to dive in deeper, why don't you join us at Spark + AI Summit, okay? Spark + AI Summit, June 22nd to 25th in San Francisco. We actually have two training days now, just because of the popularity of them. It's organized by Databricks but, you know, it's all about the open source technology. So come on down. It's in San Francisco. Use the code Denny, my name, DennySAI020, okay? With that code you get 20% off. So go ahead and give that a shot. Would love to see you down there. Love to answer your questions both online and also basically face to face, okay? And also, if you have questions, come to Delta.io or mlflow.org; we're already active on our Slack channels, we're already active on
the distribution list. So definitely go ahead
and join us there, okay? So saying that I'm
gonna leave you at this, I'm gonna leave a few
minutes for questions, and we'll go from there, okay? Sorry, I am trying to open
the Q&A but it's not opening; all right, let's just close the chat. Okay, so, okay, I'm gonna go through it, looking at the questions here. So I apologize, I'm just trying to scroll
through them right now. Oh, by the way, as a small joke here: when I did say morning everyone, and "it's a good morning for San Francisco", that's from Karen, I'm actually Seattle based. So just a little quick alert. There's a question about "is there a difference
between Databricks Delta" "as in the Databricks version
versus the open source one?" There is no difference between
the APIs by version 1.0. Most of the API changes and differences are actually all resolved now, but I do want to call out that, you know, since we're a pre-1.0 project, we're still trying to make sure that all of the functionality that exists in Databricks,
Databricks Delta, from a reliability perspective are included in the open
source version, okay? Oh, by the way, if you're asking questions, please don't use the Q&A panel, please put it on to the chat panel. For some reason, my Q&A panel
is not opening right now. And so, by version 1.0 if not earlier, we'll actually make sure that
every single feature that exists in Databricks
Delta is also available in open source Delta Lake. Now what is the difference then? The difference right now
primarily is, for example, some of the update, delete
statements that you saw. Those are not gonna be ready. You can run updates, you can run deletes, but you can only run them from the Python and Scala APIs in the open source one. We're actually waiting for
Spark 3.0 to release with Data Source V2, which allows us
to actually make some of the changes that allow us to go ahead and put the SQL statements in as well. So that way you can use Spark SQL, for deletes and updates. Like I'd shown you on the
screen here as opposed to, using the API. But if you want to use the API, obviously you can use the API right now to do deletes, okay? And updates and merges. And so the main difference post 1.0 is that well, all the API will stay exactly the same, it's just that Databricks
itself is a managed service. So because we're a managed service, we're gonna go ahead and run Delta Lake for you; we're gonna do performance enhancements, things of that nature. But it is an open source project, it's part of the Linux Foundation. So if you are going to put in PRs, yeah, we'll probably take them, right? Because it is an open source project. So we want to make that very clear, but right now at least that's the primary set
of differences, okay? So there's a question, can I change the Delta
store to Blob, Data Lake Store, or a DB? Not to a DB, at least certainly not yet. But you can certainly make the changes for Delta Lake to point to whatever file store you want. So in other words, you wanna run this on the Hadoop file system? By all means. If you want to run this on Azure Data Lake Storage Gen 2, or S3, or Google Cloud Storage, these are fine. There's actually nothing
stopping you from doing that, because if you recall in the end, the primary thing we do is
we add an additional folder _Delta_log where we put the transaction log into it. That's the primary difference
that we've made, right? Outside of that, it's still technically
the same Parquet files that you would see in a Parquet table, on that file system, right? So we're still doing the same thing. There are certainly
some optimizations done underneath the covers but
for all intents and purposes, it's still Parquet files. The most notable change is that you, when you're, trying to look at the data, there's a lot more data in there, right? So for example, when I do a delete, technically I'm not doing a delete, right? Technically I'm building a
tombstone because what I've done is I've created a new version of the data that doesn't contain the data that I deleted. This is the reason why I can
go back and use time traveling to basically build up the
different versions, right? I can go back to version
zero and have the version of the data that says Iowa was null and Washington was 336, because the data still
actually exists, right? It's still there. It's just that basically
because the transaction log, it allows me to keep track of that. All right? So that's the main difference. So if you tried to, for
the sake of argument, have multiple transactions on your Delta Lake table, but then you try to just
read the Parquet files without actually reading the Delta log, there could be multiple
versions of this data, so then presumably you could see duplicates or triplicates or so forth, because you're actually seeing the previous versions of the data as well. (coughs) So, because it's an open source project, we actually recently released a blog with Delta Lake 0.5.0 where we talked about manifests. And this is where Presto
and Athena can actually go ahead and take a look
at the Delta Lake data, they can look at the Parquet data because what you do is you
generate a manifest file. The manifest file
basically simply contains over here are all the files
that you should be looking at for the most current version of data. That's it. And so like for sake of argument, there are 20 files inside there, but if you want to look at
the most current version, you only need to look at, say, six of them. So the manifest basically contains the file names of those six files. And then Presto or Athena can read that manifest, which invariably tells them to go look at those six files.
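Generating that manifest is a one-liner with the DeltaTable API introduced in Delta Lake 0.5.0; the path here is illustrative:

```python
from delta.tables import DeltaTable

# Write the symlink manifest that Presto / Athena read to find the current files.
dt = DeltaTable.forPath(spark, "/delta/loan_by_state_delta")
dt.generate("symlink_format_manifest")
```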
And from there you're pretty much good to go, okay? So hopefully that answers that question. So actually, in this case, I
apologize, but I think we, our time is up now, so I will go ahead and probably end this session now. By the same token, if you do have questions, I highly would advise you
to join us at Delta.io or join us at mlflow.org, and as well, we will be sending an email out. Well actually, not so much an email, but we'll be sending out...