- Apache Spark has truly
revolutionized what people can do with data and with collecting these massive amounts of data. And we think the industry is now at the point where it can actually leverage that data in a very strategic way, using a pattern we call the Lakehouse. But before I get to the Lakehouse, I wanna set the context. What happened about 10 years ago is that data native companies
started using data and AI in a very strategic way. And they enabled hundreds of use cases that actually disrupted whole industries, companies like Airbnb, Netflix,
and Uber that were using AI for massive amounts of data processing. Uber was using AI to predict the price, the supply and demand
of riders and drivers, they were putting people in carpools. Airbnb actually figures out
when you search for a house, it lists you the houses that
it thinks will be accepted. It gives you price
recommendations if you're a host and it computes customer
lifetime value for customers. Netflix, they do content
suggestion and they actually watch the streaming quality as you're watching and they're using AI to improve it. And the thumbnails that you
see when you're on netflix.com, they're custom picked for
you using an AI algorithm and these companies didn't
just have one use case, they had enabled the whole
organization to use hundreds of use cases to
disrupt these industries. And they were not using
any legacy systems. They actually were heavily
leveraging open source, they were not locking the data
into a proprietary system, or a proprietary format and
they were stitching together these different open source
tools to enable these use cases. They used a unified approach,
they used open source to set the right foundation
so that you could do all the data engineering in
one place, and on top of it be able to do data science
and machine learning and data analytics and they
enabled lots of different views and teams to do machine
learning and data analytics. But today, enterprises don't have those armies of engineers to custom build
these systems for them. They have on-prem data and
they're moving into the Cloud and their architecture
looks very different. Today, what we see on-prem or in the Cloud is an architecture that
kind of looks like this. You start by storing all
your raw data in a data lake, often using Apache Spark,
and then on top of that, you have machine learning
and data science use cases and you use a separate data
warehouse and you copy the data into that data warehouse
for BI and reporting. This architecture is pretty
complicated and it's not what those data native
companies were using. This architecture leads to
two copies of your data: one in the data lake and another copy in the data warehouse. When you're doing BI on the data warehouse, that data is stale, because the most recent version of it is in the data lake. And you have to make big investments in data ops to make sure that the data is consistent: if you update the data in the data warehouse, it's now out of date with the data lake, or vice versa. So this architecture is a big impediment to actually unlocking
the true value of data. And more recently, we've seen
that some are now proposing an architecture, which is very
similar to the architecture of the data warehouses from the 1980s, which says, take all your
data, store it centrally in a proprietary data warehouse. And then if you wanna do data
science or machine learning, or real time, export
it out to the data lake and then do those use cases
on top of the data lake. This approach has several problems. Of course it's closed,
it's a proprietary system just like the 80s data warehouses. More importantly, data
science is actually not a first-class citizen
in these data warehouses so you have to actually
export your data out of it. And it's operationally and
financially very expensive to have these two systems. But it also turns out that to do data science and to actually be able to work with video data and audio data, you still have to use a data lake and store all those copies there as well, because you can't put those in a data warehouse. So at Databricks, we're excited about a new architectural pattern that
we call the Lakehouse, which offers a new way. You store all your data in a data lake, but then thanks to innovations
that I will be talking about, you'll be able to structure it in such a way that you can do data science and machine learning directly on that data in your data lake, but also analytics and BI, as well as the real-time use cases. It's completely based on an open format and open architecture
and you store your data in data lakes where you have them today; you don't need to move them anywhere else. It has first-class
support for data science and SQL based business intelligence. And it has excellent recency of data with one single source of
truth in the data lake. And this architecture works
great with the ecosystem of tools that are around it,
so it can actually work well if you wanna store a subset of
that data in a data warehouse, or if you have data warehouses
it can connect to them. Or if you have security
solutions or OLTP solutions, or ETL solutions or data
providers that you wanna bring in, it works with that ecosystem. So this architecture builds on a data lake, and data lakes themselves
just alone are not enough, they have problems, so we have
to fix those problems first, so I'm gonna talk about
how the Lakehouse pattern does that. So at Databricks, we
looked at all the problems that our customers were
having in 2013, 14, 15, with data lakes, and we
classified all the challenges that they were seeing. And there were lots of
different detailed challenges and you can see them on the slide here. And if you categorize those, you'll see that a bunch of them
have to do with reliability. It's hard to append data, it's hard to transactionally
change the data, it's difficult to deal
with real-time data, it's hard to version your data, but also there are lots of performance issues because you have these files, and data
lakes mainly deal with files. So you're getting into
performance problems when you wanna use this data downstream. You also have governance
problems because ultimately it's a file abstraction, it's not a structured data abstraction where
you can actually govern who has access to which
parts of that data. And most importantly, there
are lots of quality problems because these data lakes
often can become data swamps. So at Databricks, we open sourced a technology called Delta Lake, and Delta
Lake brings reliability, performance, governance,
and quality to data lakes. And the way it does that, is
that it actually leverages four technological breakthroughs
directly on data lakes. So for reliability problems, it actually comes with ACID transactions. These guarantee that any operation
you do on your data lake, either fully succeeds or
gets aborted and cleaned up as if it never happened,
and you can retry it again. The performance problems are
handled by storing indices that are well known to speed
up data processing directly on your data lake, speeding up that data access by 10X to over 100X. For governance problems, it comes with table-level access control and role-based access
control, so you can now reason at the level of structured
data, you can say who has access to a table or a column or a row. And finally, the quality
problems are addressed using schema validations and expectations. So you can specify exactly what schema the data in the data lake should have, and if it doesn't have that, it's stored in quarantine until it's fixed.
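To make those four capabilities concrete, here is a minimal, hypothetical sketch in Databricks-style Delta SQL; the table, the `updates` staging source, the columns, and the `analysts` group are all invented for illustration and are not from the talk:

```sql
-- Hypothetical Delta Lake table; names and columns are illustrative only.
CREATE TABLE events (
  event_id   BIGINT,
  event_date DATE,
  country    STRING,
  amount     DOUBLE
) USING DELTA;

-- Reliability: an upsert runs as a single ACID transaction; it either
-- fully succeeds or is cleaned up as if it never happened.
MERGE INTO events AS t
USING updates AS u            -- 'updates' is an assumed staging table of incoming records
  ON t.event_id = u.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Quality: a simple schema-level expectation; writes that violate it are rejected.
ALTER TABLE events ADD CONSTRAINT valid_date CHECK (event_date IS NOT NULL);

-- Governance: table-level access control for a (hypothetical) analyst group.
GRANT SELECT ON TABLE events TO `analysts`;

-- Versioning: query an earlier version of the table (time travel).
SELECT * FROM events VERSION AS OF 12;
```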
With this, our customers have been building curated data lakes. They leverage the expectations I mentioned and then specify exactly the quality metrics that they expect from their data. That way they can start
by storing all their data in the data lake, let's call
those the bronze datasets, that's just the raw data,
it might be a swamp. The next level is the silver datasets; the silver datasets are filtered
and they actually satisfy some of those quality metrics
that you've specified. So you can always trust that
the data in the silver table has been filtered based
on those quality metrics. And then for the gold tables,
you have the business level aggregates and the
highest quality of data. That way your applications can use the gold tables, but you can also store all of your data in the data lake in the bronze format.
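As a rough illustration of that bronze, silver, and gold flow, here is a hypothetical Delta SQL pipeline; the table and column names are invented for the example, and the quality filters stand in for whatever expectations you actually define:

```sql
-- Bronze: raw data as ingested; it may be messy.
CREATE TABLE bronze_events USING DELTA AS
SELECT * FROM raw_landing_zone;   -- assumed raw source

-- Silver: filtered so it satisfies the declared quality metrics.
CREATE TABLE silver_events USING DELTA AS
SELECT event_id, event_date, country, amount
FROM bronze_events
WHERE event_id IS NOT NULL
  AND event_date IS NOT NULL
  AND amount >= 0;

-- Gold: business-level aggregates for applications and BI.
CREATE TABLE gold_daily_revenue USING DELTA AS
SELECT event_date, country, SUM(amount) AS revenue
FROM silver_events
GROUP BY event_date, country;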
So the Lakehouse pattern, at its core, is based on Delta Lake. And we've mentioned at previous Data and AI Summits that Databricks has a data science and machine learning platform that enables data scientists and machine learning folks to do predictions on top of their data. But what about analytics? What about business
intelligence and reporting? And I'm super excited
to announce the Databricks SQL Analytics offering. SQL Analytics is situated
on top of Delta Lake, it's an engine that's
highly tuned and optimized for Delta Lake, and it has integrations with all the BI tools that you want so that you can very quickly
and efficiently visualize what you want in your dashboards
on massive amounts of data, that's sitting in the Delta Lake. And because enterprises have so much data that they wanna do BI on, we've actually priced
this really competitively. So we worked with Barcelona
Supercomputing Center, and the researchers there
ran a performance benchmark using the industry gold standard, TPC-DS. TPC-DS has 99 queries; they ran all of these and compared them with other cloud data warehouses, and SQL Analytics delivers 9 times better price/performance compared to those offerings. So this is awesome: now you can cheaply get access to those datasets that you
have in the data lake directly. So what does the SQL
Analytics offering look like? I'm gonna walk you through
some of the aspects of this. So first it has a user interface
that I'm gonna talk about. Second I'm gonna talk about
the built-in connectors and then the fine-grained
access control and governance, and then finally we'll talk a little bit about the performance. Starting with the UI, it's built on the Redash user interface, it's a familiar interface
where data analysts can go in, they can browse the databases
and the tables that they have, they can enter SQL queries
and they can see the tables and they can create beautiful dashboards from those visualizations. These dashboards can then be shared in the organization securely
to democratize access and insights to that data. It supports scheduling, so that it reruns those dashboards for you, and it can give you automatic alerts if specific triggers fire based on the data in the dashboards. And the Lakehouse and the
SQL Analytics offering comes with built-in connectors
for existing BI tools. It comes with very
simple connection set up, it's optimized performance
so that it's really fast based on the latest
drivers of ODBC and JDBC. And it's really quick and
easy to set up authentication and single sign-on. And I'd like to welcome
one of the thought leaders in the BI space, the CPO of
Tableau Francois Ajenstat to tell me a little bit what he's seeing with the Lakehouse pattern. Welcome Francois. - So great to be here,
thanks for having me. - So Tableau has been in
the center of analytics for a very, very long time. Can you share with me a
little bit what you're seeing in the analytics market,
especially in the Cloud? - Well, you know Tableau
has always been focused on helping people see and understand data. And increasingly we're seeing people move their data workloads to the Cloud. It simplifies data access, and they can scale to
even bigger data volumes. And so for us, it's all
about choice and flexibility, helping our customers analyze
data anywhere they are. But as data moves to the Cloud, people really wanna bring analytics and data science together
and really converge these two worlds together so they can see and analyze all of these
things in one place. - So we've talked a lot
about the Lakehouse adoption in the Cloud, what do you
think is driving this shift? - Well we see growing adoption
as well, and, you know, as more and more data
is moving to the Cloud, it's coming in from everywhere. Customer interactions,
web logs, IOT sensors, that's always a mess. And so we need to make
sense of all of that data and deliver it to the people that need it, in a reliable and high-performance way. And so what we hear is that our customers really wanna do their
analytics on their data lake and the Lakehouse architecture
is the way to do that. And that's why we're excited
to partner with Databricks to make that vision a reality. And with the new SQL Analytics endpoint with the Delta Engine, we're
gonna provide Tableau customers with an entirely new experience, a new architecture for doing analytics on that data lake in a way that wasn't possible before, with performance and scale that they never
thought was possible. - That's awesome. So we have a lot of joint
customers that are using Tableau on Databricks, can you share some of the maybe favorite
examples that you have? - We have so many customers,
it's hard to pick just one. But last month I heard a great story about the U.S. Air Force and
how they're using Databricks and Delta Lake to manage
all of their analytics and using Tableau to
visualize and analyze millions and millions of records every single day. You know, in their environment, there's too many different technologies, hosted everywhere with different rules, different levels of access,
different levels of maturity, and the data just kept multiplying and multiplying over and over. And so what the Air Force really did, is they standardized
their data on Delta Lake to make their data visible, accessible, understandable, linked, and trusted. And as a result of doing that
they've increased data use, they've increased data
literacy because the data is more efficient and available
to make better decisions. And now they've got the best
state-of-the-art technologies to manage their data and explore
all of that data at scale. - That's so awesome. So we did this announcement
of SQL Analytics, you know, obviously we've
been working together on that, but I'm curious how you're seeing Tableau plus SQL Analytics simplify
the customer's journey? - Well, I mean we're really
excited about the availability of the new Databricks SQL Analytics, you know, we've had a connector in Tableau for over a year now, and
we're now gonna be supporting the SQL Analytics really, really soon. What this really does
is it simplifies access and management for our customers, right? Being able to provide
exploratory analytics directly on the data lake will make the
experience for our customers easier, faster and then
they'll get value much, much more quickly out of all
of their data environments and deliver that to all of
their users through Tableau. - That's super exciting and
we're really excited to bring this out to all the
Tableau customers together. Thank you so much Francois. - Thank you. Excited to partner together. - So we've talked so far
about the UI of SQL Analytics, and we've talked about the
connectivity to these BI tools. Let's talk about the fine-grained
governance and security. This is really, really important
because you have your data in the data lake, how do we now make sure that we can really govern it? Well, SQL Analytics gives you a table-level view of your data. Everything is databases and
tables with columns and rows, and it gives you SQL
level data access control. That means in SQL you can specify who gets to see which database, which tables, which columns, and which rows. And this is really, really important, because now you can actually lock down your data and share it with many, many analysts in your organization.
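For example, with made-up database, table, column, and group names, and syntax roughly along the lines of Databricks table access control, that kind of policy can be expressed like this (a sketch, not the exact commands from the talk):

```sql
-- Table-level: let the analyst group read the table (names are hypothetical).
GRANT SELECT ON TABLE security_db.security_apps TO `analysts`;

-- Column- and row-level: expose a restricted dynamic view instead of the raw table.
CREATE VIEW security_db.security_apps_restricted AS
SELECT
  `user`,
  `date`,
  country,
  app,
  -- Only security admins see the sensitive column.
  CASE WHEN is_member('security_admins') THEN failed_logins ELSE NULL END AS failed_logins
FROM security_db.security_apps
-- Non-admins only see rows for their own region.
WHERE is_member('security_admins') OR country = 'United States';

GRANT SELECT ON TABLE security_db.security_apps_restricted TO `analysts`;
```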
We also provide a way for you to manage the usage of SQL Analytics: you can go in, see if there are runaway queries, and deal with them. And it comes with virtual
cluster monitoring so that you can actually
set up the endpoints and see how they're spinning up, how they're scaling up and down. And finally, it comes
with a really awesome query history browser,
where you can go in, see all the queries and click into them and actually debug what's going
on with each of the queries. Finally, SQL Analytics is really fast and that's really, really important. And when we talk about fast, we're talking about three
different dimensions. One, we wanna be able to bring in lots and lots of analysts. And what SQL Analytics enables you to do is to have SQL endpoints
that auto scale up and down based on the number of
analysts that are using it. So a lot of work has gone
into making it concurrent. Second, we've optimized the end-to-end latency of queries coming in from BI tools, all the way to the data that's sitting in Delta Lake. That's important: every step of the way has been optimized so that you can get snappy performance, especially if you have really large datasets. And finally, we optimized each query so that when it executes in
parallel, it's really fast. And this is through an
innovation called vectorization, and we're gonna hear much more about the technological
underpinnings of the performance of SQL Analytics from my
co-founder Reynold Xin, but before we do that,
I want you to see a demo of all of this in action from Brooke. - Thanks, Ali. To fully harness the
power of our Lakehouse, we're going to use the new
SQL Analytics workspace. To access it, all you need
is the click of a button from your Databricks workspace. For our demo today, we've
still fully acquired some security data for an
unnamed unicorn startup in the San Francisco Bay area. Our job is to help them identify
possible security threats as they prepare for hosting a major international conference. So over the next few minutes,
we'll write some queries, build visualizations, create a shareable and customizable dashboard
and craft a monitoring and alerting framework, all
with just a few lines of SQL. Let's get started by
digging into our data. I'm going to select the
Queries tab here on the left, create a New Query. And on the left, you can see,
I have my security database with my security apps table. So I have things like the
user, the date, the country, the app, admin logins, and failed logins and successful logins to a given app. So the first question I'd like to ask, is what is the ratio of failed logins to successful logins by country? So let's go ahead and write
the SQL query for that. I'm going to select country, sum of app failures (notice the autocomplete here) divided by sum of app success, as fail ratio, from our security apps table, group by country.
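Written out, the query is roughly the following; the exact table and column names below are my guesses at the demo schema, not a transcript of the screen:

```sql
SELECT
  country,
  SUM(app_failures) / SUM(app_success) AS fail_ratio
FROM security.security_apps
GROUP BY country;
```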
Let's save this and execute it. And so while it's great that I can see my data in a tabular format, I would really love to visualize this on a map of the world. So let's add a visualization. Here I'll select choropleth map as my Visualization Type, my Key column is country, and here in our dataset
we have the full name for the countries. And the value is fail ratio, and voilà, we suddenly have a map of the world to be able to derive
very fast insight from. So for example, I can see
that there's a large ratio of failed logins in countries like Brazil and the Philippines, but
relatively small fraction of failed logins in North America and European countries like Germany. So let's go ahead and
save this visualization. Now, what I want to do is I
want to give this query a name, Fail Login Ratio by Country. And I want to add it to
our security dashboard so other people can take a look at this and dig into the data a bit more deeply. To add it to our dashboard, I simply select these three dots down here and select Add to Dashboard. So I'm going to add this
to the security dashboard that the team has already created. And now let's go ahead and
dive into this Dashboard. So you can see this is
our security dashboard, and at the bottom this
is the new visualization that we just added. But now I wanna walk you through some of these other visualizations. So for example, we can
see from January 1st, 2019 through November 4th,
2020, these were the apps that had the highest
number of failed logins. The one that really
stands out is Cat Food, because most companies access dog food, so this seems a little bit fishy. The other thing that stands out, is the app failures by date. So we can see sometime around June, there's a high number of failed logins. Let's zoom in on this area here. So now we can see more clearly that there was this huge spike
of failed logins on June 1st. And I'm curious which
apps had the high number of failed logins. So we can adjust the parameters
of our widget over here. Let's fast forward to June 1st, as our start date, and let's make our end
date June 1st as well. And to re execute this query,
simply select apply changes. And so here we can see that
Workday had a high number of failed logins that day. And so on June 1st might not
be a historic date for you, it was the date this company
switched to using Workday for HR management. So although this wasn't a
malicious attack at all, the security team would
still like to be alerted in cases of high number of
app failures on a given date. So let's go ahead and create
an alert for that condition. To create an alert, I simply
select the Alerts tab here, and New Alert. We give it the query that
we want to trigger the alert based off, in this case
it's App Failures by Date. We want to trigger when the sum of app failures is greater than a given threshold. Let's call this threshold 20: if there are more than 20 failed logins in a given day, we want to trigger an alert. When triggered, send the alert just once, but to catch the attention of the security team, I'm going to create a custom template, "lots of failed logins." And that's all that's needed to create an alert.
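The underlying query the alert watches would look something like this (again with assumed table and column names); the alert itself simply fires when the aggregated value crosses the threshold of 20:

```sql
-- "App Failures by Date": the alert triggers when total_app_failures > 20 for a day.
SELECT
  `date`,
  SUM(app_failures) AS total_app_failures
FROM security.security_apps
GROUP BY `date`
ORDER BY `date`;
```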
Now the security team can go and set the refresh schedule on this query to determine how frequently they want to check this condition: every minute, every five minutes, every hour, et cetera. So now that you've seen
how to write queries, build visualizations, and interact with dashboards, let's take a look at the endpoints that are executing all of these queries under the hood. You can see here that we have
an Analytics Team Endpoint that's already up and running. And so when creating an
endpoint, you simply select the cluster size, extra small,
small, medium, et cetera. So you no longer need
to determine what type of instance do I want? How many instances? You simply select t-shirt
sized cluster sizing. You also have control over
the auto stop parameter. So if there's 120 minutes of inactivity, I want it to automatically stop so I stopped spending
money on these resources, my team isn't consuming. Two of the other really cool features are multi-cluster load
balancing and Photon. So if you have hundreds of analysts all on the same endpoint, they're going to have
resource contention issues. What multi-cluster load balancing does is automatically scale
to have multiple clusters and so the users can just
interact with the same endpoint, but under the hood their
queries are being run on different clusters. In this case, I specified a minimum of one and a maximum of four. Currently I only have one active, as I'm the only user
interacting with this endpoint. I've also enabled Photon for
faster query performance, which Reynold's talk
dives more deeply into. So now that you've seen
how the endpoints work at a high level, let's take
a look at the data governance and auditability of our workspace. To do that, we'll interact
with the History tab here. The history tab shows for all users, within a given timeframe, for all endpoints with a given
status, the query history of any of those conditions. So for example, if I select
the last query I ran, you can see the SQL code that
I had written, the status, when it started, how long
that command took, who ran it and on which endpoint. This way you have full
auditability and transparency over who's accessing your data. So now you've seen how to interact with the SQL Analytics workspace, to go back to the Databricks workspace, it's just the click of a button. You have control over the
data in your Lakehouse, and you have control over which workspaces you grant users access to. Whether it's the Databricks workspace, SQL Analytics workspace, or a bit of both, the choice is yours. Thank you and now I'd like to
hand it over to Reynold Xin, to do a deep dive into the Delta Engine and SQL Analytics endpoints
to discuss how they can get data warehousing performance at the scale of data lakes. Thank you. - Thank you, Brooke. And hi everyone again. Earlier this morning,
I talked about our work to make Spark more Pythonic, and that was entirely focused on usability. In this talk, I'll be
focusing on performance. The idea of the Lakehouse
is to build upon open format and architecture on data
lakes so we can offer first-class support for
analytics, real-time applications, data science and machine learning. Now one school of thought in the database community has been that it's impossible to achieve best-in-class performance with an open data lake architecture, especially when comparing with proprietary data warehouses. But Ali showed you earlier that we're able to achieve best-in-class performance on TPC-DS, the standard data warehousing benchmark, as a matter of fact beating most of the other data warehouses. So how do we do that? In this talk, I want to show you that there's no fundamental reason why we can't beat proprietary data warehouses with an open Lakehouse. And the result is a combination of the past few years of work. Now I'm gonna show you
some examples of what we have done. It's important for me to point out that there's no single metric for performance. When I use the word performance, it includes three separate dimensions: the first is concurrency, the second is latency, and
the third is throughput. For concurrency, we're referring to how many users or queries the system can support at the same time. For latency, we're referring to how fast the system can respond to a query in an end-to-end fashion, as perceived by the user. And for throughput, we mean, for a given query, how much data the system can process per second. Let's first dive into concurrency. So one of the benefits of the Lakehouse is its decoupled storage-from-compute architecture. In this model, you have all
of the data in the data lake, most often in object stores, which are extremely cheap. When some users need to do analytics, we simply create a cluster for them. The cluster is ephemeral, and it connects to the data lake to pull data from it on demand. When a lot more users show
up, a single cluster might not be able to handle all of the load. And this is very problematic
for the traditional MPP, or Massively Parallel Processing, data warehouse architecture that couples storage and compute. In order to add capacity to the system, we would be required to reshard all the data, effectively unloading all the data out of the system and then loading it back in again, and this can take days and sometimes even weeks to do. In the Lakehouse, this is
actually much simpler. All we need to do is
launch another cluster for the new set of users
and then they can do their work immediately. In this decoupled storage-from-compute architecture, all clusters have the
same view of the data, and they provide very
strong workload isolation between the clusters. And this really enables
infinite scalability by just launching more compute clusters. One of the challenges with this design, while it has great scalability, is that it can be difficult to manage. For example, think about
how would the admin know how many clusters to create? When should the admin create them? The workloads are not always consistent. How would the user know
which cluster to connect to out of the many that we have created? What if a team suddenly
has a workload spike and now needs more than one
cluster just for themselves? So SQL Analytics on Databricks
offers a new way to do this: instead of creating new clusters, admins can create a SQL endpoint for all their users. Instead of the user connecting directly to an individual cluster, they connect to a SQL endpoint, which acts as a proxy between the user and the cluster. When more users show up, the SQL endpoint automatically launches new
clusters to handle the load. As a matter of fact, SQL
endpoints offer even more than what I just talked about. Not only do they automatically launch new clusters when load increases, to maintain the same level of performance; they also shut down clusters
queries coming in and out of the system and their progress, SQL endpoints also know when a
cluster is in the error state and can restart that
cluster automatically. So not only does SQL endpoints provide much better concurrency
to all of the users, it also simplifies the management and makes the whole system
overall more reliable. Now after concurrency
let's talk about latency. For latency as I mentioned
earlier, it's important to think about not just how fast
the system itself responds, but the end to end latency
experienced by the user. And for this, I want to show
you some work we have done together with a BI vendors
to optimize the BI protocol, which is the communication
protocol between the client and the SQL endpoint servers. Here's what happens when we
try to run a query in Tableau or any of the other BI
tools like Power BI. Tableau would connect
to a Databricks server, and when the user tries to run
their query or a dashboard, what happens Tableau will send a message to the Databricks server, and
the message typically says in the first one, "Hey,
please run this query, and here's the query string." And when Databricks server sees the query on SQL endpoint side, you
assign the query a query ID and return that query
ID directly to Tableau. You might be surprised
and ask, "Hey, why don't the Databricks server return
the results immediately?" This is because the query
could be very long running, the query could be
scanning one row of data, or it could be scanning petabytes of data. It could finish in a second,
it could also finish in a day. For very long running queries, it's very wasteful to maintain a persistent connection just for the client to wait forever. So by returning a query ID, we enable the client to actually poll the Databricks server periodically in order to look at the
status of the query. What would happen, is
Tableau will send a message to Databricks asking, "Hey,
is query ID 123 ready yet?" Probably every second or so and then Databricks would respond, it's ready or it's not ready. When it's finally ready,
Databricks will send a message back saying,
"Hey, query 123 is ready" and Tableau now can send
a message requesting the results for the query and Databricks will send the query result back. This protocol looks pretty chatty, but it's really optimized for
the diverse range of queries that can run on the platform. It's not Databricks specific; a lot of systems work like this from the on-premise days. When the system is on-prem, the data warehouse is in a data center that's very close to where people are using Tableau, and all six messages can be completed very quickly. But now consider the cloud case, or a multinational company: we might have a user on one continent and the data center that processes all of the data actually on a different continent, due to, say, GDPR constraints. In this case, even if the
server is extremely efficient, just consider the time it takes for a network packet to propagate from the United States, for example, to Europe and back; that by itself would take 300 milliseconds. So just with six messages we're talking about two seconds of query wait time, even if the query itself does nothing. The reality is actually much
worse than what I've shown you because a dashboard would
trigger many queries, and typically one query will include multiple queries under the hood, because it includes looking at metadata, really compounding the latency. In SQL Analytics we worked with various BI vendors like Tableau and Power BI to improve this protocol. In the new version of the protocol, we will return the result immediately if the query is short running. Essentially the server will wait for a second or two for the query to run, and if the query completes within that time, the response will directly include the result rather than sending back a query ID. But if the query takes a while to run, we'll fall back to the old protocol. So with this, it now
takes only two messages, or even one round trip, to complete short queries. We've rolled this optimization
now to preview customers, and across their workloads we observed a six-fold reduction in latency for short queries. The last dimension I will
talk about is throughput and that is for a given query, how much data can a
system process per second? To do that I want to show
you the life of a query going through the entire system, by peeking into what happens in the backend of SQL Analytics. A query comes into the system through a SQL endpoint, and then it goes through the Delta Engine, which is the compute
engine for SQL Analytics. Delta Engine includes a query optimizer, an execution engine, a caching subsystem, and the physical layout
optimization for data lakes. Now we'll go through these components one by one to show the improvements we have made to drastically improve performance. So first is the query optimizer. Delta Engine includes a state-of-the-art query optimizer; it extends Spark's cost-based optimizer and adaptive query execution, which is effective at dynamic runtime optimization. This means even when statistics drift, or maybe it's just fresh data
that don't have statistics, the system could still
continue to optimize the query at runtime based on what
it has observed at runtime and this can deliver up to 18
times performance improvements for SQL star schema workloads. The next component is execution engine. Four months ago at the Spark
and AI Summit in San Francisco, we gave a preview of the brand new vectorized execution engine we've been working on, called Photon. Photon is designed for maximum performance; it's written from scratch in C++ and optimizes the CPU time of query execution using vectorized instructions. It can be 10 times faster than our old JVM-based execution engine. In this talk I won't get into the details of Photon; if you're interested in learning more about it, please see the Photon technical deep-dive talk by Alex Behm at this conference. Now the third component is caching. The execution engine, Photon, needs to read its data from somewhere, and the source of data in the Lakehouse architecture is in data lakes, which
are cloud object storage. The throughput between cloud object storage and the ephemeral clusters reading from it is naturally not very high; it's not as high as a local disk subsystem. The reality is that most query workloads have concentrated access on hot data. For example, most analysts
will probably only be reading for example, yesterday's
data or this week's data and users in the United States tend to sort of scan
the United States data and users in Europe tend
to scan the European data. To avoid repeatedly going
into the object stores, Delta Engine's caching subsystem reads the data from the object stores and then actually creates a new, ephemeral copy of it on local NVMe SSDs. But it's not just a dumb, raw-bytes copy of
the original Parquet data; the caching subsystem transcodes the data into a faster format, so when the execution engine, Photon, reads that data, Photon is no longer reading Parquet, it's actually reading a highly optimized format for high IO throughput. The last component is the
data lake, or how we organize the data physically in the data lake. One of the tricks data warehouses use a lot is to maintain auxiliary data structures so the engine can more efficiently skip data. And this is extremely important because query execution time depends primarily on the
doesn't actually allow a lot of auxiliary data
structure to be created because it's mainly sort
of to encode data itself so it has very limited data
skipping functionality. But with Delta Lake, it's
actually possible now to leverage Delta Lake's transaction log, to encode additional auxiliary data. As an example, Delta Engine
encodes min/max statistics for columns in the transaction log. And these min/max statistics are updated atomically together with the data, because they are done in the same transaction, so they never go out of date,
unlike the optimizer's statistics. And when a query shows up like the one I'm showing here, select star from events where year equals 2020 and UID equals 24,000, Delta Engine first looks at the transaction log to prune the list of files to read, before even reading them. It's smart enough to realize it does not need to read the files whose max year is less than 2020, because the predicate wants year equal to 2020, and then it applies a similar filter on UID. And in this example, Delta Engine would figure out it only needs to read one file, skipping two-thirds of the data.
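To make the file-pruning step concrete, here is the query as I understand it, with the kind of per-file min/max metadata Delta keeps in its transaction log shown as comments; the table, columns, and file statistics are illustrative, not actual log contents:

```sql
-- Per-file statistics recorded in the Delta transaction log (illustrative):
--   file_1: year in [2017, 2019]                        -> skipped (max year < 2020)
--   file_2: year in [2019, 2020], uid in [1, 10000]     -> skipped (uid range misses 24000)
--   file_3: year in [2020, 2020], uid in [20000, 30000] -> the only file actually read
SELECT *
FROM events
WHERE year = 2020
  AND uid = 24000;
```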
This is a toy example, but in reality almost no query scans the entirety of the table, so we often see 10 times, 100 times, even 1,000 times performance improvements just by doing better data skipping. Through the life of a query, we talked about how various
components Delta Engine can improve throughput of the system. And of course, by improving
the throughput here, the systems latency can also
reduce because the system's becoming more efficient as well. Now in this talk, we went through three
dimensions of performance concurrency, which is how many users and queries can a system
support at any given time? Latency, the perceive end to end how fast a system can respond to the user? And throughput for a given query, how much data can the
system process a second? We talked about how various
performance improvements into the different sub modules
can improve all those three dimensions drastically
and we put them together we can now deliver best
in class performance for SQL workloads. It's a tradition at Spark Summit now called Data and AI
Summit to bridge the gap between cutting edge
research and industry. So we have invited professor
Peter Boncz to join us and share with us some of his perspectives on the topic of our performance. Peter is a professor at CWI,
Netherlands National Research Institute for Mathematics
and Computer Science. But he's not just any academic,
we have all been benefiting from his work in the field as
he's one of the main pioneers inventors of columnar
databases and vectorization Peter's work laid down the
foundation for high performance analytic systems and it's implemented by virtually every data warehouse. We have been fortunate enough
to be collaborating with Peter at Databricks for the past three years through our engineering
office in Amsterdam. Hi Peter, thank you for joining us. - Thanks Reynold, thanks for having me. - So before we get into a
sort of technical questions, I'm actually kind of curious
about your experience working with the Dutch
government on COVID response, 'cause I've heard that's
what's keeping you busy lately. - That's definitely the case thanks for asking, yes. I'm advising Dutch government, it's actually fascinating, it's terrible on the one
hand but it's also yeah, it's a fascinating time and you see that fighting this crisis, well it's a very
multidisciplinary problem, like very, very many
different fields but certainly kind of computer science and data analysis are very important in this so yeah, it does keep me busy. - It's actually really, as you said, both sad and exciting to see how data and us as people that
work on data can contribute meaningfully to solve the
larger societal issues. But for today.
- Yeah. - I want to focus a little bit
more on some of the technical issues especially your
contributions to the field. I don't know if I've told you in the past, but long before we met
I knew of you from VLDB, sort of the premier academic database conference, in Lyon in 2009. At the time, I was a PhD student, and you won a ten-year Best Paper Award for your work on MonetDB, so I guess it was published in the 90s. Do you mind sort of just walking us through how the idea of MonetDB and columnar databases came into your mind? - Well yeah, I started
working on that when I was still young, a young PhD student, and we were interested in finding an execution model designed for query engines that was really suitable for analytical workloads. So we came up with this columnar model, so that we could make query interpreters much faster by doing column-at-a-time operations, and that proved to be really successful. - And that became one of the
foundational technologies of modern analytical databases, right? - That's true, yeah. I think if you look at all the very many technologies that we now see in products that are on the market for data analytics, almost all of them use
columnar storage, yes. - And certainly we at Databricks have benefited a lot
from that, what's next? 'Cause MonetDB was done in the 90s. It's been almost 20 years past. - Right so after MonetDB
of course was successful, and it's still around. We looked at it very carefully to see whether it could still be improved, and indeed it could. Column-at-a-time operations can be very efficient, but they also involve whole columns at a time, and columns can be huge objects. So after we did that, we tried
to make it more fine-grained, to look for a query execution model that would be more fine-grained, and that became vectorized query execution, which combined some of the classical advantages of query engines that have an interpreter with the efficiency of column-at-a-time processing. And vectorized execution also works very well on
modern hardware that has vector instructions
like SIMD instructions. - Thanks Peter. So one of the things I know this is also you're
not just a pure academic, you actually started a company
based on some of the work. Do you mind telling us a
little bit more about that? - Right. So this was about 2008, when we had created a new prototype, and that prototype became a spinoff of my research institute, CWI, and the spinoff was called VectorWise. And so VectorWise was the first vectorized query engine; at some point they got acquired. I was of course the designer of VectorWise, and Marcin Zukowski was my PhD student at the time. And we had a whole team, and many of those people are still in Amsterdam, but they now work for Databricks. - Yeah that's the reason
actually we came to Amsterdam; it's for you and your former team. - Yeah. - And you've been observing some of the work they've been doing. And one school of thought
in the database community is it would be impossible
to get great SQL performance using this open data lake architecture versus a proprietary data warehouse. What's your thought on that? - I think in order to make an engine that's really performant, you need to have a few things in place: you need to have an efficient storage format, for instance, one that is columnar, but you also need to have proper metadata, which is very important for query optimization and for being able to skip a lot of data. But if those kinds of things are in place, there is no reason why an open data lake or a Lakehouse could not be highly performant. - So that's our belief
as well; as a matter of fact, many of the ideas came from our discussions in, I think, 2017, when we visited San Francisco and discussed a lot of different aspects here. And since then, we've been working closely together in designing and implementing a lot of these different ideas. And we certainly hope, through our work and also
our work in the future, we'll be able to marry
the best of both worlds of data warehouses and data lakes. - Yes indeed. So now you're referring to the Delta Engine, and Delta Engine is a new engine for the Databricks product. And it is a vectorized query execution engine, and of course the input of, well, the Amsterdam team is clearly present in that, and from the benchmarks that I've seen, it's indeed a very high-performance engine. - Thanks, Peter. So last question. Where do you think the
industry is going next? What are the major things that excite you? - Well I mean, if you look at a proposition like Databricks, like data analytics in the Cloud and machine learning in the Cloud, people move to the Cloud because it's flexible and it is elastic and easily available, but it also should be easy to use. And therefore, I think that we
will see a lot of innovation, for instance, in systems
that automatically take care of many things. Like many organizations
are moving to the Cloud, but would like to have a system
that intelligently manages their data without having to have highly specialized database administrators or designers. So for instance, I think about, yeah, automatically optimizing
the layout of data without users ever having
to care about that, but the system getting
this right automatically that's something that is
important for Cloud data engines, and I think that this is also
a direction that Databricks would be probably moving into. - All right, thanks a lot Peter, for taking the time and
sharing your perspective. I think we have also benefited
from your work in the past and will continue to benefit
from your work in the future. I'd love to collaborate more on sort of making data layout more self-managing, so our customers can have an easier life. - That sounds great. Reynold thanks a lot. - Thank you Peter. Hopefully through this talk I've helped explain why there's no fundamental
reason we wouldn't be able to achieve best in class
performance with Lakehouse. We're very proud of the
work we've done so far, and we'll continue to work
towards more efficient systems, that are easier to use. Back to you Ali. - Okay great. Thanks Reynold and Peter. And thanks Brooke for that awesome demo. Now that you understand
the different parts of the Lakehouse architecture
and SQL Analytics, let's look at how it's being
used by one of the largest corporations in Europe. Unilever is a company that had the vision to see the potential of the
Lakehouse centric architecture very early on. They have been building towards
this vision and they've had a number of successes and
lessons learned along the way. To share, I'd like to
welcome Phinean Woodward, Head of Architecture and
Information and Analytics at Unilever. Phin thanks for joining us
at Data and AI Summit Europe. - Thank you for having me
it's a pleasure to be here. - So Phin can you tell us a little bit about Unilever's data and AI journey and what kind of outcomes
you were looking for? - Sure. I think it's worth just
reflecting on the history. Our on-premises estate has grown substantially over many years, and it certainly served the business well, but as it evolved over time, the dependencies between systems made developing solutions further more complex, in the face of increasing demand for speed and reduced costs. So with the emergence of
Cloud service providers and platforms, this really unlocked the opportunity to rethink our approach to
delivering these analytical services to meet the ever-growing
demands of our business. Simply lifting and shifting these services to Cloud wouldn't drive down the costs and wouldn't solve the
challenges we were facing. So we took the opportunity
to completely rethink our strategy and unlock
the new capabilities that Cloud offers. - That makes a lot of sense. So how does the Lakehouse
strategy fit in the architecture and where does Delta Lake
fit into that picture? - Sure. Unilever's Lakehouse strategy has got three layers in the architecture; we call them the universal data lake, the business data lake and the product data store. The universal data lake really is just a copy of the raw data, in the same format as the source system and in full; there's no filtering or any transformations, so in effect we ingest data once and only once into the data lake, without ever needing to go back to the source system to extract data. The business data lake then
adds business logic and context to that data, converting
technical field names into business terminology,
harmonizing master data and calculating reusable KPIs, so as to sort of calculate things
once and reuse many times. And then the final layer is
the product data store layer, and that's where our reporting and analytical solutions sit. Each is built and run using DevOps and created and developed independently of other products. And so therefore each of
those teams can develop and deploy changes at their own pace. Throughout the whole stack
Delta Lake is used to process the huge volumes of data
required efficiently and at speed. - I would love to hear some examples and potentially what effect
they've had on the business. - Sure. Databricks and the broader Azure platform really form the bedrock for all of our reporting and analytical solutions in Unilever, and we have a number of examples where this has enabled us to drive growth. One example is an analytical product built for the UK, which has enabled us to analyze category performance, undertake competitor analysis, and perform detailed SKU and promotion analysis. And these really helped
us to grow the category for our customers, increase
turnover for our products, and in one example, grew store penetration for one of our SKUs fourfold. In addition to financial benefits, users can save a lot of time in
deriving these insights, the ability to leverage the
on demand processing power and performance that Delta
Lake offers really enables us to process large volumes of data at speed and remove the need for manual efforts. Users can really then focus
on deriving the insights and taking the actions
needed to really unlock the benefits for the business. - That's really great to hear
as technologists you know, we're always deep in the tech
but it's so awesome to hear the use cases and how they're
affecting the business. So how does that affect, things like cost,
scalability and performance? - The key benefit that I see is the immense flexibility which it offers us: the ability to use products without needing to buy licenses, without having to forecast, buy, implement and manage infrastructure. Leveraging the auto-scale capability that Databricks offers really delivers us the
amount of compute we need just at the right time to
process large volumes of data. And that really drives
great value for meeting our business objectives and
driving our digital agenda. - That makes sense. And what kind of analytics workloads do you foresee in the future
moving into the Lakehouse? - I think in short the Lakehouse really will be the foundation
for all of Unilever's analytical products and
workloads in the future. As technologies evolve, we
will increasingly drive more advanced analytics,
machine learning and AI into the products that we are building and the data that we
have captured, curated and stored in the data lake. - And do you have any advice
for others that are embarking on their journey as they set
up their Lakehouse strategy? What would be some of sort of
words of advice be for them? - Cloud allows you to implement some of the best technologies
available to drive down costs, deliver data to the business
quicker than ever before. But if the trust in that data is ever questioned, then in my mind the whole data lake strategy is put at risk. So my top recommendation really would be: invest time and effort to ensure that the data lake is the single source of truth, and that the data and KPIs therein are cataloged, not duplicated, and made available to
all who need to use them. That for me is the key to
the strategy being a success. - And curating your data
lake is then key to that. - Absolutely. Fundamental. - Thank you so much Phin. - It's a pleasure. - We're really excited to
be working with customers like Unilever, they've
been leading by example on the Lakehouse architecture
and already realizing the incredible value
that Delta Lake brings.