SQL Analytics and the Lakehouse Architecture | Ali Ghodsi | Keynote Data + AI Summit EU 2020

Captions
- Apache Spark has truly revolutionized what people can do with data and with collecting all these massive amounts of data. And we think the industry is now at the point where it can actually leverage that data in a very strategic way, using a pattern we call the Lakehouse. But before I get to the Lakehouse, I wanna set the context. What happened about 10 years ago is that data-native companies started using data and AI in a very strategic way. And they enabled hundreds of use cases that actually disrupted whole industries, companies like Airbnb, Netflix, and Uber that were using AI for massive amounts of data processing. Uber was using AI to predict the price and the supply and demand of riders and drivers, and they were putting people in carpools. Airbnb actually figures out, when you search for a house, which houses it thinks will be accepted and lists those for you; it gives you price recommendations if you're a host and it computes customer lifetime value for customers. Netflix, they do content suggestion and they actually watch the streaming quality as you're watching and they're using AI to improve it. And the thumbnails that you see when you're on netflix.com, they're custom picked for you using an AI algorithm. And these companies didn't just have one use case, they enabled the whole organization to use hundreds of use cases to disrupt these industries. And they were not using any legacy systems. They actually were heavily leveraging open source, they were not locking the data into a proprietary system or a proprietary format, and they were stitching together these different open source tools to enable these use cases. They used a unified approach, they used open source to set the right foundation so that you could do all the data engineering in one place, and on top of it be able to do data science and machine learning and data analytics, and they enabled lots of different users and teams to do machine learning and data analytics. But today, enterprises don't have those armies of engineers to custom build these systems for them. They have on-prem data and they're moving into the Cloud and their architecture looks very different. Today, what we see on-prem or in the Cloud is an architecture that kind of looks like this. You start by storing all your raw data in a data lake, often using Apache Spark, and then on top of that, you have machine learning and data science use cases, and you use a separate data warehouse and you copy the data into that data warehouse for BI and reporting. This architecture is pretty complicated and it's not what those data-native companies were using. This architecture leads to two copies of your data, one in the data lake and another copy in the data warehouse. When you're doing BI on the data warehouse, that data is stale because the most recent version of it is in the data lake. And you have to make big investments in data ops to make sure that the data is consistent, because if you update the data in the data warehouse, now it's out of date with the data lake, or vice versa. So this architecture is a big impediment to actually unlocking the true value of data. And more recently, we've seen that some are now proposing an architecture which is very similar to the architecture of the data warehouses from the 1980s, which says, take all your data, store it centrally in a proprietary data warehouse. 
And then if you wanna do data science or machine learning, or real time, export it out to the data lake and then do those use cases on top of the data lake. This approach has several problems. Of course it's closed, it's a proprietary system just like the 80s data warehouses. More importantly, data science is actually not a first-class citizen in these data warehouses, so you have to actually export your data out of it. And it's operationally and financially very expensive to have these two systems. But also it turns out that to do data science and to actually be able to work with video data and audio data, you still have to use a data lake and store all those copies there as well, because you can't put those in a data warehouse. So at Databricks, we're excited for a new architectural pattern that we call the Lakehouse, which offers a new way. You store all your data in a data lake, but then thanks to innovations that I will be talking about, you'll be able to structure it in such a way that directly on that data in your data lake you can do data science and machine learning, but also analytics and BI, as well as the real-time use cases. It's completely based on an open format and open architecture, and you store your data in data lakes, where you have them today; you don't need to move them anywhere else. It has first-class support for data science and SQL-based business intelligence. And it has excellent recency of data with one single source of truth in the data lake. And this architecture works great with the ecosystem of tools that are around it, so it can actually work well if you wanna store a subset of that data in a data warehouse, or if you have data warehouses it can connect to them. Or if you have security solutions or OLTP solutions, or ETL solutions or data providers that you wanna bring in, it works with that ecosystem. So this architecture builds on a data lake, and data lakes alone are not enough, they have problems, so we have to fix those problems first, so I'm gonna talk about how the Lakehouse pattern does that. So at Databricks, we looked at all the problems that our customers were having in 2013, 14, 15 with data lakes, and we classified all the challenges that they were seeing. And there were lots of different detailed challenges and you can see them on the slide here. And if you categorize those, you'll see that a bunch of them have to do with reliability. It's hard to append data, it's hard to transactionally change the data, it's difficult to deal with real-time data, it's hard to version your data. But also there are lots of performance issues, because data lakes mainly deal with files, so you're getting into performance problems when you wanna use this data downstream. You also have governance problems, because ultimately it's a file abstraction, it's not a structured data abstraction where you can actually govern who has access to which parts of that data. And most importantly, there are lots of quality problems, because these data lakes often can become data swamps. So at Databricks, we open sourced a technology called Delta Lake, and Delta Lake brings reliability, performance, governance, and quality to data lakes. And the way it does that is that it actually leverages four technological breakthroughs directly on data lakes. So for reliability problems, it actually comes with ACID transactions. 
These guarantee that any operation you do on your data lake either fully succeeds or gets aborted and cleaned up as if it never happened, and you can retry it again. The performance problems are handled by storing indices that are well known to speed up data processing directly on your data lake, speeding up that data access by 10X to 100X. For governance problems, it comes with table-level access control and role-based access control, so you can now reason at the level of structured data: you can say who has access to a table or a column or a row. And finally, the quality problems are addressed using schema validations and expectations. So you can specify exactly what schema the data in the data lake should have, and if it doesn't have that, it's stored in a quarantine until it's fixed. With this, our customers have been building curated data lakes. They leverage the expectations I mentioned, and they specify exactly the quality metrics that they expect from their data. That way they can start by storing all their data in the data lake; let's call those the bronze datasets, that's just the raw data, it might be a swamp. The next level is the silver datasets; the silver datasets are filtered and they actually satisfy some of those quality metrics that you've specified. So you can always trust that the data in a silver table has been filtered based on those quality metrics. And then for the gold tables, you have the business-level aggregates and the highest quality of data. That way your applications can use the gold tables, but you can also store all of your data in the data lake in the bronze format.
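To make the bronze, silver and gold layers concrete, here is a minimal SQL sketch of the pattern being described, with hypothetical table and column names; Delta Lake expectations are surfaced differently across products, so a simple filter plus a CHECK constraint is used here only as an approximation.

```sql
-- Bronze: raw data as it lands in the data lake (hypothetical schema).
CREATE TABLE IF NOT EXISTS events_bronze (
  event_id   STRING,
  event_time TIMESTAMP,
  country    STRING,
  payload    STRING
) USING DELTA;

-- Silver: filtered data that satisfies the declared quality metrics.
CREATE TABLE IF NOT EXISTS events_silver USING DELTA AS
SELECT * FROM events_bronze
WHERE event_id IS NOT NULL AND event_time IS NOT NULL;

-- Enforce the expectation on future writes with a CHECK constraint.
ALTER TABLE events_silver ADD CONSTRAINT valid_event_time
  CHECK (event_time >= '2019-01-01');

-- Gold: business-level aggregates for downstream applications.
CREATE TABLE IF NOT EXISTS events_gold USING DELTA AS
SELECT country, DATE(event_time) AS event_date, COUNT(*) AS event_count
FROM events_silver
GROUP BY country, DATE(event_time);
```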
So the Lakehouse pattern, at its core, is based on Delta Lake. And we've mentioned at previous Data and AI Summits that Databricks has a data science and machine learning platform that enables data scientists and machine learning folks to do predictions on top of their data. But what about analytics? What about business intelligence and reporting? I'm super excited to announce Databricks' new offering, SQL Analytics. SQL Analytics sits on top of Delta Lake; it's an engine that's highly tuned and optimized for Delta Lake, and it has integrations with all the BI tools that you want, so that you can very quickly and efficiently visualize what you want in your dashboards on the massive amounts of data that are sitting in the Delta Lake. And because enterprises have so much data that they wanna do BI on, we've actually priced this really competitively. We worked with the Barcelona Supercomputing Center, and the researchers there ran a performance benchmark using the industry gold standard, TPC-DS. TPC-DS has 99 queries, and they ran all of these and compared them with other cloud data warehouses, and it is 9 times better price/performance compared to those offerings. So this is awesome; now you can get cheap access to those datasets that you have in the data lake directly. So what does the SQL Analytics offering look like? I'm gonna walk you through some of the aspects of this. First it has a user interface that I'm gonna talk about. Second, I'm gonna talk about the built-in connectors, and then the fine-grained access control and governance, and then finally we'll talk a little bit about the performance. Starting with the UI, it's built on the Redash user interface. It's a familiar interface where data analysts can go in, they can browse the databases and the tables that they have, they can enter SQL queries, they can create visualizations, and they can create beautiful dashboards from those visualizations. These dashboards can then be shared in the organization securely to democratize access and insights to that data. It supports scheduling, so that it reruns those dashboards for you, and it can give you automatic alerts if there are specific triggers that happen based on the data in the dashboards. And the Lakehouse and the SQL Analytics offering come with built-in connectors for existing BI tools. It comes with very simple connection setup, and it has optimized performance so that it's really fast, based on the latest ODBC and JDBC drivers. And it's really quick and easy to set up authentication and single sign-on. And I'd like to welcome one of the thought leaders in the BI space, the CPO of Tableau, Francois Ajenstat, to tell us a little bit about what he's seeing with the Lakehouse pattern. Welcome Francois. - So great to be here, thanks for having me. - So Tableau has been in the center of analytics for a very, very long time. Can you share with me a little bit of what you're seeing in the analytics market, especially in the Cloud? - Well, you know, Tableau has always been focused on helping people see and understand data. And increasingly we're seeing people move their data workloads to the Cloud. It simplifies data access, and they can scale to even bigger data volumes. And so for us, it's all about choice and flexibility, helping our customers analyze data anywhere they are. But as data moves to the Cloud, people really wanna bring analytics and data science together and really converge these two worlds, so they can see and analyze all of these things in one place. - So we've talked a lot about Lakehouse adoption in the Cloud, what do you think is driving this shift? - Well, we see growing adoption as well, and, you know, as more and more data is moving to the Cloud, it's coming in from everywhere: customer interactions, web logs, IoT sensors, and that's always a mess. And so we need to make sense of all of that data and deliver it to the people that need it, in a reliable and high-performance way. And so what we hear is that our customers really wanna do their analytics on their data lake, and the Lakehouse architecture is the way to do that. And that's why we're excited to partner with Databricks to make that vision a reality. And with the new SQL Analytics endpoint with the Delta Engine, we're gonna provide Tableau customers with an entirely new experience, a new architecture for doing analytics on that data lake in a way that wasn't possible before, with performance and scale that they never thought was possible. - That's awesome. So we have a lot of joint customers that are using Tableau on Databricks, can you share some of your favorite examples? - We have so many customers, it's hard to pick just one. But last month I heard a great story about the U.S. Air Force and how they're using Databricks and Delta Lake to manage all of their analytics and using Tableau to visualize and analyze millions and millions of records every single day. 
You know, in their environment, there were too many different technologies, hosted everywhere with different rules, different levels of access, different levels of maturity, and the data just kept multiplying and multiplying over and over. And so what the Air Force really did is they standardized their data on Delta Lake to make their data visible, accessible, understandable, linked, and trusted. And as a result of doing that, they've increased data use and they've increased data literacy, because the data is more accessible and available to make better decisions. And now they've got the best state-of-the-art technologies to manage their data and explore all of that data at scale. - That's so awesome. So we did this announcement of SQL Analytics, you know, obviously we've been working together on that, but I'm curious how you're seeing Tableau plus SQL Analytics simplify the customer's journey? - Well, I mean, we're really excited about the availability of the new Databricks SQL Analytics. You know, we've had a connector in Tableau for over a year now, and we're gonna be supporting SQL Analytics really, really soon. What this really does is it simplifies access and management for our customers, right? Being able to provide exploratory analytics directly on the data lake will make the experience for our customers easier and faster, and then they'll get value much, much more quickly out of all of their data environments and deliver that to all of their users through Tableau. - That's super exciting, and we're really excited to bring this out to all the Tableau customers together. Thank you so much Francois. - Thank you. Excited to partner together. - So we've talked so far about the UI of SQL Analytics, and we've talked about the connectivity to these BI tools. Let's talk about the fine-grained governance and security. This is really, really important, because you have your data in the data lake; how do we now make sure that we can really govern it? Well, SQL Analytics gives you a table-level view of your data. Everything is a database, tables with columns and rows, and it gives you SQL-level data access control. That means in SQL you can specify who gets to see which databases, which tables, which columns and which rows. And this is really, really important, because now you can actually lock down your data and share it with many, many analysts in your organization. We also provide a way to manage the usage of SQL Analytics: you can go in, see if there are runaway queries and deal with them. And it comes with virtual cluster monitoring, so that you can set up the endpoints and see how they're spinning up and how they're scaling up and down. And finally, it comes with a really awesome query history browser, where you can go in, see all the queries, click into them and actually debug what's going on with each of the queries.
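As a rough illustration of the SQL-level access control just described, here is a minimal sketch; the database, table, view, column and group names are hypothetical, and row- and column-level rules are expressed through a view, since the exact mechanism and GRANT syntax vary by platform and version.

```sql
-- Let a group of analysts read a whole table (hypothetical names).
GRANT USAGE ON DATABASE security TO `analysts`;
GRANT SELECT ON TABLE security.security_apps TO `analysts`;

-- Restrict another group to a subset of columns and rows via a view.
CREATE VIEW security.security_apps_emea AS
SELECT country, app, failed_logins, successful_logins
FROM security.security_apps
WHERE country IN ('Germany', 'France', 'Netherlands');

GRANT SELECT ON VIEW security.security_apps_emea TO `emea_analysts`;
```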
Finally, SQL Analytics is really fast, and that's really, really important. And when we talk about fast, we're talking about three different dimensions. One, we wanna be able to bring in lots and lots of analysts. And what SQL Analytics enables you to do is to have SQL endpoints that auto-scale up and down based on the number of analysts that are using them; a lot of work has gone into making it concurrent. Second, we've optimized the end-to-end latency of queries coming in from BI tools, all the way to the data that's sitting in Delta Lake, and that's important so that every step of the way is optimized and you get snappy performance, especially if you have really large datasets. And finally, we optimized each query so that when it executes in parallel, it's really fast. And this is through an innovation called vectorization, and we're gonna hear much more about the technological underpinnings of the performance of SQL Analytics from my co-founder Reynold Xin. But before we do that, I want you to see a demo of all of this in action from Brooke. - Thanks, Ali. To fully harness the power of our Lakehouse, we're going to use the new SQL Analytics workspace. To access it, all you need is the click of a button from your Databricks workspace. For our demo today, we've acquired some security data for an unnamed unicorn startup in the San Francisco Bay Area. Our job is to help them identify possible security threats as they prepare to host a major international conference. So over the next few minutes, we'll write some queries, build visualizations, create a shareable and customizable dashboard and craft a monitoring and alerting framework, all with just a few lines of SQL. Let's get started by digging into our data. I'm going to select the Queries tab here on the left and create a New Query. And on the left, you can see I have my security database with my security apps table. So I have things like the user, the date, the country, the app, admin logins, and failed and successful logins to a given app. So the first question I'd like to ask is: what is the ratio of failed logins to successful logins by country? So let's go ahead and write the SQL query for that. I'm going to SELECT country, SUM of app failures, notice the autocomplete here, divided by SUM of app success, AS fail ratio, FROM our security apps table, GROUP BY country. Let's save this and execute it. And while it's great that I can see my data in a tabular format, I would really love to visualize this on a map of the world. So let's add a visualization. Here I'll select choropleth map as my Visualization Type; my Key column is country, and here in our dataset we have the full names for the countries. And the value is fail ratio, and voila, we suddenly have a map of the world to derive very fast insight from. So for example, I can see that there's a large ratio of failed logins in countries like Brazil and the Philippines, but a relatively small fraction of failed logins in North America and European countries like Germany. So let's go ahead and save this visualization. Now, what I want to do is give this query a name: Fail Login Ratio by Country.
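Written out, the query dictated above might look roughly like this, assuming the table is called security.security_apps and the failure and success counts live in columns named app_failures and app_success:

```sql
-- Ratio of failed to successful logins per country (assumed column names).
SELECT country,
       SUM(app_failures) / SUM(app_success) AS fail_ratio
FROM security.security_apps
GROUP BY country;
```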
I also want to add it to our security dashboard so other people can take a look at this and dig into the data a bit more deeply. To add it to our dashboard, I simply select these three dots down here and select Add to Dashboard. I'm going to add this to the security dashboard that the team has already created. And now let's go ahead and dive into this dashboard. So you can see this is our security dashboard, and at the bottom is the new visualization that we just added. But now I wanna walk you through some of these other visualizations. So for example, we can see that from January 1st, 2019 through November 4th, 2020, these were the apps that had the highest number of failed logins. The one that really stands out is Cat Food, because most companies access Dog Food, so this seems a little bit fishy. The other thing that stands out is the app failures by date. We can see that sometime around June there's a high number of failed logins. Let's zoom in on this area here. So now we can see more clearly that there was this huge spike of failed logins on June 1st. And I'm curious which apps had the high number of failed logins. So we can adjust the parameters of our widget over here. Let's fast forward to June 1st as our start date, and let's make our end date June 1st as well. And to re-execute this query, simply select Apply Changes. And here we can see that Workday had a high number of failed logins that day. And while June 1st might not be a historic date for you, it was the date this company switched to using Workday for HR management. So although this wasn't a malicious attack at all, the security team would still like to be alerted in cases of a high number of app failures on a given date. So let's go ahead and create an alert for that condition. To create an alert, I simply select the Alerts tab here, and New Alert. We give it the query that we want to trigger the alert based off; in this case it's App Failures by Date. We want to trigger when the sum of app failures is greater than a given threshold. Let's call this threshold 20: if there are more than 20 failed logins in a given day, we want to trigger an alert. When triggered, send an alert just once, but to catch the attention of the security team, I'm going to create a custom template: lots of failed logins. And that's all that's needed to create an alert. Now the security team can go and set the refresh schedule on this query to determine how frequently they want to check this condition: every minute, every five minutes, every hour, et cetera.
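The App Failures by Date query that the alert is attached to might look something like the sketch below, again with assumed table and column names; the alert itself is configured in the UI to fire when the latest value of total_app_failures exceeds the threshold of 20.

```sql
-- Daily total of failed logins (assumed table and column names).
SELECT `date`,
       SUM(app_failures) AS total_app_failures
FROM security.security_apps
GROUP BY `date`
ORDER BY `date`;
```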
So now that you've seen how to write queries, build visualizations and interact with dashboards, let's take a look at the endpoints that are executing all of these queries under the hood. You can see here that we have an Analytics Team Endpoint that's already up and running. And when creating an endpoint, you simply select the cluster size: extra small, small, medium, et cetera. So you no longer need to determine what type of instance you want or how many instances; you simply select a t-shirt-sized cluster. You also have control over the auto stop parameter. So if there are 120 minutes of inactivity, I want it to automatically stop, so I stop spending money on resources my team isn't consuming. Two of the other really cool features are multi-cluster load balancing and Photon. If you have hundreds of analysts all on the same endpoint, they're going to have resource contention issues. But what multi-cluster load balancing does is automatically scale to multiple clusters, so the users can just interact with the same endpoint, but under the hood their queries are being run on different clusters. In this case, I specified a minimum of one and a maximum of four. Currently I only have one active, as I'm the only user interacting with this endpoint. I've also enabled Photon for faster query performance, which Reynold's talk dives more deeply into. So now that you've seen how the endpoints work at a high level, let's take a look at the data governance and auditability of our workspace. To do that, we'll interact with the History tab here. The History tab shows, for all users, within a given timeframe, for all endpoints with a given status, the query history matching any of those conditions. So for example, if I select the last query I ran, you can see the SQL code that I had written, the status, when it started, how long that command took, who ran it and on which endpoint. This way you have full auditability and transparency over who's accessing your data. So now you've seen how to interact with the SQL Analytics workspace; to go back to the Databricks workspace, it's just the click of a button. You have control over the data in your Lakehouse, and you have control over which workspaces you grant users access to. Whether it's the Databricks workspace, the SQL Analytics workspace, or a bit of both, the choice is yours. Thank you, and now I'd like to hand it over to Reynold Xin to do a deep dive into the Delta Engine and SQL Analytics endpoints and discuss how they can get data warehousing performance on the scale of data lakes. Thank you. - Thank you, Brooke. And hi everyone again. Earlier this morning, I talked about our work to make Spark more Pythonic, and that was entirely focused on usability. This talk, I'll be focusing on performance. The idea of the Lakehouse is to build upon an open format and architecture on data lakes so we can offer first-class support for analytics, real-time applications, data science and machine learning. Now, one school of thought in the database community has been that it's impossible to achieve best-in-class performance with an open data lake architecture, especially when comparing with proprietary data warehouses. But Ali showed you earlier that we're able to achieve best-in-class performance in TPC-DS, the standard data warehousing benchmark, as a matter of fact by beating most of the other data warehouses. So how do we do that? In this talk, I want to show you that there's no fundamental reason why we can't beat proprietary data warehouses with an open Lakehouse. And the result is a combination of the past few years of work. Now I'm gonna show you some examples of what we have done. It's important for me to point out that there's no single metric for performance. When I use the word performance, it includes three separate dimensions: the first is concurrency, the second is latency and the third is throughput. For concurrency, we're referring to how many users or queries the system can support at the same time. For latency, we're referring to how fast the system can respond to a query in an end-to-end fashion as perceived by the user. And for throughput, we mean, for a given query, how much data the system can process per second. Let's first dive into concurrency. So one of the benefits of the Lakehouse is its decoupled storage-and-compute architecture. In this model, you have all of the data in the data lake, most often cloud object stores, which are extremely cheap. When some users need to do analytics, we simply create a cluster for them; the cluster is ephemeral, and it connects to the data lake to pull data from it on demand. When a lot more users show up, a single cluster might not be able to handle all of the load. And this is very problematic for the traditional MPP or Massively Parallel Processing data warehouse architecture that couples storage and compute. 
In order to add capacity to the system, we would require resharding all the data, effectively unloading all the data out of the system and then loading it back in again, and this can take days and sometimes even weeks to do. In the Lakehouse, this is actually much simpler. All we need to do is launch another cluster for the new set of users and then they can do their work immediately. In this decoupled storage-and-compute architecture, all clusters have the same view of the data, and they provide very strong workload isolation between the clusters. And this really enables infinite scalability by just launching more compute clusters. One of the challenges with this design, while it has great scalability, is that it can be difficult to manage. For example, think about how the admin would know how many clusters to create. When should the admin create them? The workloads are not always consistent. How would the user know which cluster to connect to out of the many that we have created? What if a team suddenly has a workload spike and now needs more than one cluster just for themselves? So SQL Analytics on Databricks offers a new way to do this: instead of creating new clusters, admins can create a SQL endpoint for all their users. Instead of the user connecting directly to an individual cluster, they connect to a SQL endpoint, which acts as a proxy between the user and the clusters. When more users show up, the SQL endpoint automatically launches new clusters to handle the load. As a matter of fact, SQL endpoints offer even more than what I just talked about. Not only do they automatically launch new clusters when load increases to maintain the same level of performance, they shut down clusters when the load decreases to save cost. Because they see all the queries coming in and out of the system and their progress, SQL endpoints also know when a cluster is in an error state and can restart that cluster automatically. So not only do SQL endpoints provide much better concurrency to all of the users, they also simplify the management and make the whole system more reliable overall. Now, after concurrency, let's talk about latency. For latency, as I mentioned earlier, it's important to think about not just how fast the system itself responds, but the end-to-end latency experienced by the user. And for this, I want to show you some work we have done together with BI vendors to optimize the BI protocol, which is the communication protocol between the client and the SQL endpoint servers. Here's what happens when we try to run a query in Tableau or any of the other BI tools like Power BI. Tableau would connect to a Databricks server, and when the user tries to run their query or a dashboard, Tableau will send a message to the Databricks server, and the first message typically says, "Hey, please run this query, and here's the query string." And when the Databricks server on the SQL endpoint side sees the query, it assigns the query a query ID and returns that query ID directly to Tableau. You might be surprised and ask, "Hey, why doesn't the Databricks server return the results immediately?" This is because the query could be very long running; the query could be scanning one row of data, or it could be scanning petabytes of data. It could finish in a second, it could also finish in a day. For very long running queries, it's very wasteful to maintain a persistent connection just for the client to wait forever. 
So by returning a query ID, we enable the client to poll the Databricks server periodically in order to look at the status of the query. What happens is Tableau will send a message to Databricks asking, "Hey, is query ID 123 ready yet?", probably every second or so, and then Databricks will respond that it's ready or it's not ready. When it's finally ready, Databricks will send a message back saying, "Hey, query 123 is ready," and Tableau can now send a message requesting the results for the query and Databricks will send the query result back. This protocol looks pretty chatty, but it's really optimized for the diverse range of queries that can run on the platform; it's not Databricks specific, a lot of systems work like this from the on-premise days. When the system is on-prem, the data warehouse in the data center is very close to where people are using Tableau, and all of the six messages can be completed very quickly. But now consider the Cloud case, or a multinational company: we might have a user on one continent and the data center that processes all of the data on a different continent, due maybe to GDPR constraints. In this case, even if the server is extremely efficient, just consider the time it takes for a network packet to propagate from the United States, for example, to Europe and back; that by itself would take 300 milliseconds. So just with six messages we're talking about two seconds of query wait time even if the query itself does nothing. The reality is actually much worse than what I've shown you, because a dashboard would trigger many queries, and typically one query will include multiple queries under the hood because it includes looking at metadata, really compounding the latency. In SQL Analytics we worked with various BI vendors like Tableau and Power BI to improve this protocol. In the new version of the protocol, we return the result immediately if the query is short running. Essentially, the server will wait for a second or two for the query to run, and if the query completes within that time, the response will directly include the result rather than sending back a query ID. But if the query takes a while to run, we'll fall back to the old protocol. So with this, it now takes only two messages, or even one round trip, to complete the short queries. We've rolled this optimization out to preview customers, and on some of their workloads we observe a six-fold reduction in latency for short queries. The last dimension I will talk about is throughput, and that is, for a given query, how much data can the system process per second? To do that, I want to show you the life of a query going through the entire system by peeking into what happens in the backend of SQL Analytics. A query comes into the system through a SQL endpoint, and then it goes through the Delta Engine, which is the compute engine for SQL Analytics. Delta Engine includes a query optimizer, an execution engine, a caching subsystem, and physical layout optimization for data lakes. Now we'll go through these components one by one to show the improvements we have done to drastically improve performance. So first is the query optimizer. Delta Engine includes a state-of-the-art query optimizer; it extends Spark's cost-based optimizer and adaptive query execution that's effective at dynamic runtime optimization. 
This means even when statistics drift, or maybe it's just fresh data that doesn't have statistics, the system can still continue to optimize the query at runtime based on what it has observed, and this can deliver up to 18 times performance improvement for SQL star schema workloads. The next component is the execution engine. Four months ago at the Spark + AI Summit in San Francisco we gave a preview of the brand new vectorized execution engine we're working on, called Photon. Photon is designed for maximum performance; it's written from scratch in C++ and optimizes the CPU time of query execution using vectorized instructions. It can be 10 times faster than our old JVM-based execution engine. In this talk I won't get into the details of Photon; if you're interested in learning more about it, please see the Photon technical deep dive talk by Alex Behm at this conference. Now the third component is caching. The execution engine, Photon, needs to read its data from somewhere, and the source of data in the Lakehouse architecture is the data lake, which is cloud object storage. The throughput between cloud object storage and the clusters, the ephemeral clusters reading from the object stores, is naturally not very high; it's not as high as a local disk subsystem. The reality is that most query workloads have concentrated access on hot data. For example, most analysts will probably only be reading, say, yesterday's data or this week's data, and users in the United States tend to scan the United States data while users in Europe tend to scan the European data. To avoid repeatedly going to the object stores, Delta Engine's caching subsystem reads the data from the object stores and then creates a new, ephemeral copy of it on the NVMe SSDs. But it's not just a dumb, raw-bytes copy of the original Parquet data; the caching subsystem transcodes the data into a faster format, so when the execution engine, Photon, reads that data, it's no longer reading Parquet, it's actually reading a format highly optimized for high IO throughput. The last component is the data lake, or how we organize the data physically in the data lake. One of the tricks data warehouses use a lot is to maintain auxiliary data structures so the engine can more efficiently skip data. And this is extremely important, because query execution time depends primarily on the amount of data accessed once you have a fixed execution engine. And Parquet's design doesn't actually allow a lot of auxiliary data structures to be created, because it's mainly there to encode the data itself, so it has very limited data skipping functionality. But with Delta Lake, it's actually possible to leverage Delta Lake's transaction log to encode additional auxiliary data. As an example, Delta Engine encodes min-max statistics for columns in the transaction log. And these min-max statistics are updated atomically together with the data, because they are written in the same transaction, so they never go out of date, unlike the optimizer's statistics. And when a query shows up like the one I'm showing here, SELECT * FROM events WHERE year = 2020 AND uid = 24000, Delta Engine first looks at the transaction log to prune the list of files to read before even reading them. It's smart enough to realize it does not need to read the files whose max year is less than 2020, because the predicate wants year equal to 2020, and then we apply a similar filter on UID. And in this example Delta Engine would figure out it only needs to read one file, skipping two-thirds of the data. This is a toy example I'm showing, but in reality, almost no query scans the entirety of the table, so often we can see 10 times, 100 times, even 1,000 times performance improvement just by doing better data skipping.
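For reference, the query in this example might be written as below; the table and column names are assumed. On Databricks, the optional OPTIMIZE ... ZORDER BY command clusters the underlying files by a column so the per-file min-max statistics in the transaction log become tighter and more files can be skipped.

```sql
-- The toy query from the example (assumed names); Delta Engine prunes files
-- whose min-max ranges in the transaction log cannot match these predicates.
SELECT *
FROM events
WHERE year = 2020
  AND uid = 24000;

-- Optional: co-locate rows with similar uid values to improve skipping.
OPTIMIZE events ZORDER BY (uid);
```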
Through the life of a query, we talked about how various components of Delta Engine can improve the throughput of the system. And of course, by improving the throughput, the system's latency is also reduced, because the system is becoming more efficient as well. Now, in this talk, we went through three dimensions of performance: concurrency, which is how many users and queries a system can support at any given time; latency, the perceived end-to-end time in which a system responds to the user; and throughput, for a given query, how much data the system can process per second. We talked about how various performance improvements in the different submodules can improve all three dimensions drastically, and when we put them together we can now deliver best-in-class performance for SQL workloads. It's a tradition at Spark Summit, now called Data and AI Summit, to bridge the gap between cutting-edge research and industry. So we have invited professor Peter Boncz to join us and share with us some of his perspectives on the topic of performance. Peter is a professor at CWI, the Netherlands' national research institute for mathematics and computer science. But he's not just any academic; we have all been benefiting from his work in the field, as he's one of the main pioneers and inventors of columnar databases and vectorization. Peter's work laid down the foundation for high-performance analytic systems and is implemented by virtually every data warehouse. We have been fortunate enough to be collaborating with Peter at Databricks for the past three years through our engineering office in Amsterdam. Hi Peter, thank you for joining us. - Thanks Reynold, thanks for having me. - So before we get into the technical questions, I'm actually kind of curious about your experience working with the Dutch government on COVID response, 'cause I've heard that's what's keeping you busy lately. - That's definitely the case, thanks for asking, yes. I'm advising the Dutch government; it's actually fascinating, it's terrible on the one hand, but it's also, yeah, a fascinating time, and you see that fighting this crisis is a very multidisciplinary problem, involving very, very many different fields, but certainly computer science and data analysis are very important in this, so yeah, it does keep me busy. - It's actually really, as you said, both sad and exciting to see how data, and us as people that work on data, can contribute meaningfully to solving the larger societal issues. But for today. - Yeah. - I want to focus a little bit more on some of the technical issues, especially your contributions to the field. I don't know if I've told you in the past, but long before we met I knew of you from VLDB, sort of the premier academic database conference, in Lyon in 2009. At the time, I was a PhD student, and you won a ten-year Best Paper Award for your work on MonetDB. So I guess it was published in the 90s. Do you mind just walking us through how the idea of MonetDB and columnar databases came into your mind? 
- Well, yeah, I started working on that when I was still young, a young PhD student, and we were interested in finding an execution model, a design for query engines, that was really suitable for analytical workloads. So we came up with this columnar model, so that we could make query interpreters much faster by doing column-at-a-time operations, and that proved to be really successful. - And that became one of the foundational technologies of modern analytical databases, right? - That's true, yeah. I think if you look at all the very many technologies that we now see in products that are on the market for data analytics, almost all of them use columnar storage, yes. - And certainly we at Databricks have benefited a lot from that. What's next? 'Cause MonetDB was done in the 90s, it's been almost 20 years. - Right, so MonetDB of course was successful and it's still around. We were looking at it very carefully to see whether it could still be improved, and indeed it could. Column-at-a-time operations can be very efficient, but they also materialize whole columns at a time, and columns can be huge objects. So after we did that, we tried to make it more fine-grained, to look for a query execution model that would be more fine-grained, and that became vectorized query execution, which combined some of the classical advantages of query engines that have an interpreter with the efficiency of column-at-a-time execution. And vectorized execution also works very well on modern hardware that has vector instructions, like SIMD instructions. - Thanks Peter. So one of the things I know is that you're not just a pure academic, you actually started a company based on some of this work. Do you mind telling us a little bit more about that? - Right. So this was about 2008, where we had created a new prototype, and that prototype became a spinoff of my research institute CWI, and the spinoff was called VectorWise. And so VectorWise was the first vectorized query engine; at some point they got acquired. I was of course the designer of VectorWise, and Marcin Zukowski was my PhD student at the time. And we had our whole team, and many of those people are still in Amsterdam, but they now work for Databricks. - Yeah, that's the reason actually we came to Amsterdam, it's for you and your former team. - Yeah. - And you've been observing some of the work they've been doing. And one school of thought in the database community is that it would be impossible to get great SQL performance using this open data lake architecture versus a proprietary data warehouse. What's your thought on that? - I think in order to make an engine that's really performant you need to have a few things in place. You need to have an efficient storage format, for instance one that is columnar, but you also need to have proper metadata, which is very important for query optimization and to be able to skip a lot of data. But if those kinds of things are in place, there is no reason why an open data lake or a Lakehouse could not be highly performant. - So that's our belief as well. As a matter of fact, many of the ideas came from our discussions in, I think, 2017 when we visited in San Francisco and we discussed a lot of different aspects here. And since then, we've been working closely together in designing and implementing a lot of these different ideas. And we certainly hope, through our work and also our work in the future, we'll be able to marry the best of both worlds of data warehouses and data lakes. - Yes indeed. 
So now you're referring to the Delta Engine, and Delta Engine is a new engine for the Databricks product. And it is a vectorized query execution engine, and of course the input of, well, of the Amsterdam team is clearly present in that, and from the benchmarks that I've seen, it's indeed a very highly performant engine. - Thanks, Peter. So, last question. Where do you think the industry is going next? What are the major things that excite you? - Well, I mean, if you look at a proposition like Databricks, like data analytics in the Cloud and machine learning in the Cloud, people move to the Cloud because it's flexible and it is elastic and easily available, but it should also be easy to use. And therefore, I think that we will see a lot of innovation, for instance, in systems that automatically take care of many things. Many organizations are moving to the Cloud, but would like to have a system that intelligently manages their data without having to have highly specialized database administrators or designers. So for instance, I think about automatically optimizing the layout of data without users ever having to care about that, but the system getting this right automatically; that's something that is important for Cloud data engines, and I think that this is also a direction that Databricks will probably be moving in. - All right, thanks a lot Peter, for taking the time and sharing your perspective. I think we have benefited from your work in the past and will continue to benefit from your work in the future. I'd love to collaborate more on making data layout more self-managing, so our customers can have an easier life. - That sounds great. Reynold, thanks a lot. - Thank you Peter. Hopefully through this talk I've helped explain why there's no fundamental reason we wouldn't be able to achieve best-in-class performance with the Lakehouse. We're very proud of the work we've done so far, and we'll continue to work towards more efficient systems that are easier to use. Back to you, Ali. - Okay, great. Thanks Reynold and Peter. And thanks Brooke for that awesome demo. Now that you understand the different parts of the Lakehouse architecture and SQL Analytics, let's look at how it's being used by one of the largest corporations in Europe. Unilever is a company that had the vision to see the potential of the Lakehouse-centric architecture very early on. They have been building towards this vision and they've had a number of successes and lessons learned along the way. To share those, I'd like to welcome Phinean Woodward, Head of Architecture, Information and Analytics at Unilever. Phin, thanks for joining us at Data and AI Summit Europe. - Thank you for having me, it's a pleasure to be here. - So Phin, can you tell us a little bit about Unilever's data and AI journey and what kind of outcomes you were looking for? - Sure. I think it's worth just reflecting on the history. Our on-premises estate has grown substantially over many years and it certainly served the business well, but as it evolved over time, the dependencies between systems made developing solutions further more complex, in the face of increasing demand for speed and reduced costs. So with the emergence of Cloud service providers and platforms, this really unlocked the opportunity to rethink our approach to delivering these analytical services to meet the ever-growing demands of our business. 
Simply lifting and shifting these services to the Cloud wouldn't drive down the costs and wouldn't solve the challenges we were facing. So we took the opportunity to completely rethink our strategy and unlock the new capabilities that Cloud offers. - That makes a lot of sense. So how does the Lakehouse strategy fit in the architecture, and where does Delta Lake fit into that picture? - Sure. Unilever's Lakehouse strategy has got three layers in the architecture; we call them the universal data lake, the business data lake and the product data store. The universal data lake really is just a copy of the raw data, in the same format as the source system and in full; there's no filtering or any transformations, so in effect we ingest data once and only once into the data lake without ever needing to go back to the source system to extract data. The business data lake then adds business logic and context to that data, converting technical field names into business terminology, harmonizing master data and calculating reusable KPIs, so as to calculate things once and reuse them many times. And then the final layer is the product data store layer, and that's where our reporting and analytical solutions sit. Each is built and run using DevOps and created and developed independently of other products, and so each of those teams can develop and deploy changes at their own pace. Throughout the whole stack, Delta Lake is used to process the huge volumes of data required, efficiently and at speed. - I would love to hear some examples and potentially what effect they've had on the business. - Sure. Databricks and the broader Azure platform really form the bedrock for all of our reporting and analytical solutions in Unilever, and we have a number of examples where this has enabled us to drive growth. One example is an analytical product built for the UK, which has enabled us to analyze category performance, undertake competitor analysis, and perform detailed SKU and promotion analysis. And these have really helped us to grow the category for our customers, increase turnover for our products and, in one example, grow store penetration for one of our SKUs fourfold. In addition to the financial benefits, users can save a lot of time in deriving these insights; the ability to leverage the on-demand processing power and performance that Delta Lake offers really enables us to process large volumes of data at speed and remove the need for manual effort. Users can then really focus on deriving the insights and taking the actions needed to unlock the benefits for the business. - That's really great to hear. As technologists, you know, we're always deep in the tech, but it's so awesome to hear the use cases and how they're affecting the business. So how does that affect things like cost, scalability and performance? - The key benefit that I see is the immense flexibility which it offers us: the ability to use products without needing to buy licenses, without having to forecast, buy, implement and manage infrastructure. Leveraging the auto-scale capability that Databricks offers really delivers the amount of compute we need just at the right time to process large volumes of data. And that really drives great value for meeting our business objectives and driving our digital agenda. - That makes sense. And what kind of analytics workloads do you foresee in the future moving into the Lakehouse? - I think, in short, the Lakehouse really will be the foundation for all of Unilever's analytical products and workloads in the future. 
As technologies evolve, we will increasingly drive more advanced analytics, machine learning and AI into the products that we are building and the data that we have captured, curated and stored in the data lake. - And do you have any advice for others that are embarking on their journey as they set up their Lakehouse strategy? What would some of your words of advice be for them? - Cloud allows you to implement some of the best technologies available to drive down costs and deliver data to the business quicker than ever before. But if the trust in that data is ever questioned, then in my mind the whole data lake strategy is put at risk. So my top recommendation really would be: invest the time and effort to ensure that the data lake is the single source of truth, that the data and KPIs therein are cataloged, not duplicated, and made available to all who need to use them. That for me is the key to the strategy being a success. - And curating your data lake is then key to that. - Absolutely. Fundamental. - Thank you so much Phin. - It's a pleasure. - We're really excited to be working with customers like Unilever. They've been leading by example on the Lakehouse architecture and are already realizing the incredible value that Delta Lake brings.
Info
Channel: Databricks
Views: 15,313
Rating: 4.8956523 out of 5
Keywords: Databricks
Id: 9oYosh-AoX0
Length: 54min 22sec (3262 seconds)
Published: Wed Nov 18 2020