Databricks Delta: A Unified Management System for Real-time Big Data

Captions
Hello! Super excited to be here in Dublin, and I have a really exciting announcement around the unification of data warehousing and data lakes. But before I get to that, I want to talk a little bit about AI.

When we talk to our customers and the people using Spark, what they tell us is that they all want to be more and more data-driven. What's top of mind for them is that they want to use more data, they want to use it for predictive analytics, they want to predict things with machine learning, and they want to make better decisions. That's the main thing they're focused on. And if you look at the list of customers on this slide, it's actually mind-boggling: we have customers across many different verticals, and in all of these verticals there are use cases built on machine learning and predictive technologies, where they're using AI.

I want to focus on two of those use cases that I find really exciting. One is in the healthcare space. Databricks works with a lot of companies in healthcare, and in this case it's Regeneron. They have chronic medical records for 50,000 patients, so these are records of things like whether a patient visited a doctor or whether they had type 1 diabetes, all recorded there. They also have genetic records for the same 50,000 patients, their genomes sequenced. They run all of that through Databricks to figure out which genes are responsible for which diseases, and they can then use that to make better drugs and fight disease. It's an exciting use case, and we see a lot of these. Another interesting one is Riot Games. Riot Games is a gaming company, and what they're doing is analyzing, in real time, the dialogue people are having while they're playing games. If it gets heated, they can automatically detect hate speech using machine learning and then censor that user or shut them off. These use cases are really interesting, and they're really important for these companies.

So how are they doing this? What did they do to succeed in getting these use cases off the ground? It turns out the missing link is big data, and this is what connects back to my first slide about unifying data lakes and data warehouses with AI: big data is the key secret ingredient that these companies got right. In particular, the companies that know how to take the petabytes of data they have, whether it's customer data, sensor data, or click streams, scale it on modern machines, and run it through classical machine learning algorithms developed in the '70s and '80s, will get great results. However, it turns out this is really challenging; it's not easy. I have a slide here that's the same one you saw in Matt's talk. It's a research paper by Google. Google looked at all their different AI use cases and where they spend their effort when building these AI applications. The boxes indicate the different activities they had to do to get AI working, and the size of each box indicates the amount of effort that went into it. As you can see, there's a small little black box there, as Matt pointed out, that says "machine learning code."
According to Google, in most of their AI projects they spend very little time on artificial intelligence or machine learning; most of their time is spent on all this other stuff. In other words, the hardest part of AI isn't AI. The hardest part of AI, for Google, is the big data processing around it.

So I want to take you through a journey of what the data landscape has looked like over the last 30 or 40 years and see why this data processing is so challenging, why big data is so hard. It all started in the '70s, when companies had all this data in their databases about how their products were doing and their revenue in different regions, but they couldn't actually put it all together; they were flying blind. People came up with the concept of a data warehouse and said: let's take the most important data that you have in your different databases, ETL it all into a central data warehouse, make it really clean, and make sure you can rely on that data. Once you have it all stored in a central location, you can plug other apps in on top of it and get business intelligence out of that data. So they built these BI tools and data warehouses, and it was really awesome; companies could now do things they couldn't do before. In particular, the great things about the data warehouse were that the data in it was pristine: it was reliable, it was cleaned up, and it had a particular schema. You could do really fast queries on it; these were really performant systems, because you had such control over the data you put in there that you could optimize it and get great performance out of it. And finally, it gave you transactional reliability, which meant you could have lots of different business analysts using these BI tools simultaneously, concurrently hitting the same data warehouse, and it just worked: no errors, no corruption.

That was the great thing about data warehouses. But this was 30 or 40 years ago, and things have evolved quite a bit since then. Over time, people started to see the problems these data warehouses had. In particular, they were hard to scale. They were fine for the pristine, small data you had, but if you wanted lots and lots of data and you wanted to move it all in there, it was hard to scale, and it was not elastic: you couldn't just say, "I want to double the size of my data warehouse"; that was a complicated operation. They also required ETL to get your data in, and that ETL process was very cumbersome, so you did it once a week, which means the data in your data warehouse was stale and you were operating on old data; it wasn't real time. Then, most importantly, as I said, what's top of mind for all these companies today is AI and machine learning. Here you had all your data inside your data warehouse, all your customer data, all your product sales, everything you needed, but you couldn't ask the data warehouse to do predictions for you. You couldn't ask it, "Tell me what my revenue will be next quarter," or "Which of these customers aren't happy, so I should reach out to them?" The data warehouse simply couldn't do that; it just supported SQL, the Structured Query Language. And finally, it was all based on closed formats, so once your data was in the system it was hard to get it out; you were locked into these data warehouses. In short, it was not future-proof: it was missing predictions and real-time streaming, and it couldn't really scale.
So what happened about ten years ago is that the Hadoop vendors showed up and said: what you need is a Hadoop data lake. What a data lake lets you do is ETL all of your data; it's a scalable thing, so take all of your data and just dump it into this open data lake, and then you can do all kinds of use cases on top of it. Since these are open formats, you can get machine learning and all this other stuff working. All the problems you had with your data warehouse would go away, and you'd get all these benefits with the data lake, which is true: you got massive scale, so you could store petabytes of data in it; it was inexpensive; it was based on open source Hadoop technology; and it was based on open formats like Parquet and ORC, so you could easily take your data and move it around. And it had the promise of machine learning and real-time streaming. People got really excited and started storing all their data in these data lakes, and over the last ten years they've been storing more and more of it; many enterprises we talk to say they have a petabyte, or five petabytes, of data in there. Now they've gotten to the point where they actually want to deploy those aspirational AI applications, and they're running into a lot of trouble.

The problem with the data lakes is that the very things that were the advantages of the data warehouse are what you've lost. In particular, since you just dumped all your data there, you lost the pristine data with the controls and the schemas on it. This data is now inconsistent; it might be dirty, and it might not be uniform throughout a petabyte-scale data set. You can't just rely on it working, so it's hard to build analytics on top of this data; even basic analytics is hard, let alone advanced analytics. And finally, the performance is actually really bad, because you've just dumped the data in there and it hasn't been optimized for your particular use case. A lot of companies are seeing that once they start building applications on top of the data lake, they don't get the performance they were getting from their data warehouses. So in short, you've got a messy data store with really poor performance; that's what the data lake looks like.

So which of these have enterprises chosen today? I want to talk about the current state we're seeing in enterprises over and over again, and in particular I want to focus on one use case at a Fortune 100 company. It's an InfoSec use case. This is a company that takes all of the connections happening on their network between any two devices and stores them in a Hadoop data lake; that's billions of records a day coming in. They want to detect intrusions, they want alerting, and they want to build their SIEMs on top of this. They stored it all in the data lake, but it turned out that this data was messy and it was hard to build these use cases on top of it. So what they ended up doing is taking the data lake, with all these billions of records a day, and ETLing all of that data, using Spark actually, over to multiple different enterprise data warehouses. We see this picture over and over again: companies that have multiple data lakes and lots of enterprise data warehouses, and they have to ETL things and move data around between them.
Now, with the three enterprise data warehouses they have, they've built incident response, reporting, and alerting on top of this. But the architecture they've ended up with has a lot of disadvantages. First of all, it has all the flaws I said enterprise data warehouses have. Since it's costly to scale, they only keep two weeks' worth of data in the enterprise data warehouses, which means they can't see all the intrusions or all the attacks that have happened in their system. It's also really expensive to scale, and as I said, it's proprietary formats and it's hard for them to do any machine learning on top of it, so they're just running SQL queries against these enterprise data warehouses. The architecture they've ended up with has poor agility, because they can't really respond to new threats without machine learning; it has scale limitations; and they have no historical data, just those two weeks. And it actually took them six months and a team of 20 people to build this very complicated, large architecture.

So today I'm really, really excited to announce Databricks Delta. It's the first unified data management system that unifies data warehousing with data lakes. In particular, it gives you the scale of the data lake, it gives you the reliability and performance of the data warehouse, and it gives you the low latency that you get out of real-time streaming engines. Databricks Delta keeps the good parts of the data lake: you get massive scale, you store your data on blob stores like S3, and it's based on open formats, Parquet and ORC, so your data stays in formats you can read elsewhere. You can also do predictions, machine learning, and real-time streaming with it. But in addition to that, Databricks Delta controls the data and makes sure it is reliable, so you get the good parts of the data warehouse as well. We try to keep your data really pristine in Databricks Delta; we give you transactional reliability, so you don't get failures, you just have atomic transactions; and you get really fast queries, because there are a lot of optimizations that happen to the data as you move it into the system. We see anywhere between a 10x and 100x speed-up compared to taking Apache Spark and running it on a traditional data lake yourself. In short, it enables predictions, real-time analytics, and ad hoc analytics at massive scale.

I mentioned it has these properties: massive scale, reliability, performance, and low latency. How are we doing this? I want to say a little bit about how we achieve these four properties. The massive scale comes from the fact that we decouple compute and storage, so it's not a coupled appliance like traditional data warehouses. Instead, your data is stored on a blob store like S3, and you don't have to pay a lot for that; these blob stores are quite cheap today. Independently, you can scale your compute as you need, so as more and more people use the system during the day, it automatically scales up and down; you get that elasticity. That's how we get massive scale. Reliability comes from ACID transactions and data validation: we actually validate your data when it comes in, so that it follows the schema, and if it doesn't, we reject it and give you a chance to fix it. For performance, we've added traditional data warehousing techniques into Databricks Delta, so we do indexing and caching of your data; that's how we get the 10x to 100x speed-ups we're seeing. And finally, where we spend a lot of our time: this whole system is based on Structured Streaming from Spark, so you get the real-time streaming ingest that Spark provides.
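To make that data validation point concrete, here is a minimal sketch of schema enforcement on write. It uses the open-source Delta Lake APIs that later grew out of Databricks Delta rather than the private-beta product shown in the talk, and the table path is hypothetical; the point is only that a write whose shape doesn't match the table's contract is rejected instead of silently landing in the store.

```python
# Minimal sketch of Delta-style schema enforcement (hypothetical path),
# using the open-source Delta Lake APIs rather than the 2017 private beta.
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DateType, StringType
from pyspark.sql.utils import AnalysisException

spark = (SparkSession.builder
         .appName("delta-schema-enforcement-sketch")
         # Assumes the Delta Lake package is on the classpath.
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

schema = StructType([
    StructField("date", DateType(), nullable=False),
    StructField("eventType", StringType()),
    StructField("city", StringType()),
])

# The first well-formed write establishes the table and its schema contract.
good = spark.createDataFrame(
    [(datetime.date(2017, 10, 25), "login", "Dublin")], schema)
good.write.format("delta").mode("append").save("/tmp/events_delta")

# A producer sending the wrong shape of data is rejected at write time
# instead of silently corrupting the table.
bad = spark.createDataFrame([("oops", 123)], ["not_a_date", "whatever"])
try:
    bad.write.format("delta").mode("append").save("/tmp/events_delta")
except AnalysisException as err:
    print("Write rejected by schema validation:", err)
```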
So if we turn back to the use case I mentioned, the Fortune 100 company: they've actually been using Databricks Delta in production, and instead of billions of records they now have trillions of records coming in every day. It comes straight into Databricks Delta, where it gets ETL'd, and we do the schema validation right there, so we make sure the data is correct and pristine, and we store all the indices we need to speed it up. Then you can use Databricks Runtime and Spark to do SQL, machine learning, and streaming use cases. What they ended up with is an AI-capable data warehouse at the scale of a data lake. In particular, they can now do interactive analysis on two years' worth of data, so they can go back and analyze all the data they've stored, not just the two weeks they had in the previous architecture. And this whole thing took two weeks to build with a team of five people; instead of six months and a team of twenty, they could do it with fewer people in less time.

OK, so I want to take a step back. At previous Spark Summits we've talked about the Unified Analytics Platform that Databricks provides to its customers, and in particular we focused a lot on the notebooks, the dashboards, and the reporting we had; we focused a lot on collaboration to bridge the skills gap. Today, with the Databricks Delta announcement, we complete this architecture: you now have unified data management plus a unified experience across teams. You get reliable transactions and great performance together with the collaborative tools that the Unified Analytics Platform gives you.

OK, enough talking by me. I'm going to invite to the stage Michael Armbrust, who is the mastermind behind Databricks Delta and many other things in Spark, such as Spark SQL and the Structured Streaming engine, and he's actually going to do a live demo of this. So, fingers crossed, and let's welcome Michael to the stage. [Applause] [Music]

Thank you very much, Ali. I'm super excited to be here today to demonstrate just how much Databricks Delta can simplify the complexity that sneaks in when you try to build real-world production pipelines. Before I get into that, though, I want to start by looking at what it looks like when you build a cutting-edge pipeline today using state-of-the-art tools. In particular, I want to talk about one use case that's really near and dear to my heart: the pipeline I built at Databricks in the early days to understand how our users were using our cloud platform. This architecture probably looks a lot like things you've built at your own companies. There's a bunch of events coming in from Kafka, in this case logins to Databricks, clicks on features, Spark jobs running, things like that. I want to archive it into a data lake, I want to be able to do streaming analytics so I can understand what's happening right now in production, and I also want to be able to run historical reports where I can look back and see what the trends are and how usage is changing over time. So let's see how the system evolves. As Ali said, I've been working on Structured Streaming recently, so let's start with that part.
It's pretty easy with Spark: we've got great Kafka bindings, so you can read data from Kafka, parse out the JSON, and run a bunch of queries, and you can get streaming visualizations, set up alerts, and all those other kinds of things. So that's pretty easy. But what's next? Now I also want to be able to answer queries historically, and this is really where streaming-only systems start to fall short. Kafka is really good at storing a day or a week of data, but if you want to store years of raw events, you're going to need something else. I've heard of this thing called the lambda architecture, so maybe we'll duplicate the code and have two different pipelines. That's a little bit of extra work, but Spark makes it easier because I can at least reuse the code from streaming in batch, since it's the same APIs. OK, so we've got the lambda architecture, I've got stuff in a data warehouse, and now it's easy for me to do reporting with Spark: the same DataFrames, Datasets, and SQL.

So what's next? Messy data. It turns out there are tens of engineers across a bunch of different teams putting events into this Kafka stream. Some people decide to use a different date format, some people misspell event names, and we need something to check for this and catch it when it happens in production. So we'll go and build some more Spark jobs that do validation. It's kind of annoying that we have to check both the streaming version and the batch version again, but we can do it, and now we've added manual validation. The next question is mistakes and failures. That validation is great, but it's actually too late: by the time it's gone off, the bad data has already been produced, it's in the data lake, and we need some way to correct it. This is actually pretty difficult, especially in a distributed system. It's not only human errors; it's also things like the spot price spiked, my cluster died, half of the results were written out, and now I have to clean it up. A pretty typical technique here is to partition by some granularity, say by date or by hour, so now I have these nice clean boundaries and I can build a bunch of infrastructure that lets me do reprocessing: if anything goes wrong, I can fix the mistake and just go back and reprocess that whole day's worth of data. It's fairly coarse-grained, but I guess it works. So we've added reprocessing.

And now my users come to me and say this is way too slow, so we start looking at what's wrong. It turns out the streaming ingest was running really fast, but it was generating millions of tiny files on S3, so let's build some more Spark jobs that do compaction. We'll run these, and they'll take those tiny files and write out nice, giant, gigabyte-sized Parquet files: columnar-encoded, super fast to read. But then, oh man, that job happened to run while my reporting was running, and the reporting crashed with a FileNotFoundException. So we have to be really careful to schedule things so that we don't run compaction at the same time as reporting. And now we've got compaction.

OK, so this is a pretty complete pipeline, and the funny thing is I've actually seen this exact same story play out hundreds of times; in fact, we built it at Databricks four different times. So let's see how it's different with Delta by going to a demo.
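For reference, here is roughly what the streaming and batch legs of that pre-Delta pipeline might look like in PySpark. This is a sketch under assumptions, not the actual Databricks pipeline: the broker address, topic name, event schema, and S3 path are hypothetical, and the Kafka connector package is assumed to be available.

```python
# Sketch of the pre-Delta "lambda architecture": a streaming leg from Kafka
# for live dashboards, plus a duplicated batch leg over raw JSON in the data
# lake for historical reports. All names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("legacy-pipeline-sketch").getOrCreate()

event_schema = StructType([
    StructField("date", StringType()),
    StructField("eventType", StringType()),
    StructField("city", StringType()),
])

# Streaming leg: Kafka -> parse JSON -> live aggregation.
events_stream = (spark.readStream
                 .format("kafka")                                    # needs spark-sql-kafka
                 .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical
                 .option("subscribe", "events")
                 .load()
                 .select(F.from_json(F.col("value").cast("string"),
                                     event_schema).alias("e"))
                 .select("e.*"))

(events_stream.groupBy("date").count()
    .writeStream
    .outputMode("complete")
    .format("memory")            # in-memory sink, enough for a live chart
    .queryName("live_counts")
    .start())

# Batch leg: the same parsing logic, written a second time over the lake --
# exactly the duplication that the talk complains about.
historical = spark.read.schema(event_schema).json("s3a://my-bucket/events/")
historical.groupBy("date").count().createOrReplaceTempView("historical_counts")
```

Every extra concern from the talk, validation, reprocessing, and compaction, would be yet more jobs layered on top of these two legs.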
So here I am in a Databricks notebook, where I can run commands on my Spark cluster, and I'm going to start by actually creating a Delta table. We can do that using standard SQL DDL: we'll do CREATE TABLE events USING delta. This is going to be more than just a pointer in the Hive metastore; I actually want this to be a contract. So we're going to specify the schema, and we're going to say both what users of this table can expect to find and what rules I'm going to mandate that people producing data follow. We'll say there has to be a date, and it's of type DATE; we'll even say it's NOT NULL, so it has to be a valid date, you can't put garbage in here. We'll say there's an event type, which is a string, and we'll also have a city, which is also a string. Cool, so we'll go ahead and run that command. It goes off in the background and initializes the table, creating all of the transactionality and everything else we need.

Now that we've got the table, let's try loading some data into it, starting with that Kafka data. We'll just do spark.readStream on Kafka and tell it which topic to read from; we'll read from the topic "events". This gives us a DataFrame, and if we take a look at it, it's got the standard Kafka schema: a key, a value, and a whole bunch of metadata about exactly which broker it's coming from. Now, I happen to know that the value is JSON-encoded data with all the stuff we need, so we'll just go ahead and select the value and do writeStream into the table events. This kicks off a Spark job in the background that runs a whole bunch of tasks in parallel, reading from Kafka, parsing that JSON, encoding it as Parquet, and committing it into the warehouse. And what we actually see, here it goes, is that it's running at tens of thousands of records per second. Pretty cool.

Now that this is loaded into Delta, it's very easy for my users to do analytics on top of it; all they have to do is write some more SQL. Let's figure out how the number of events has been changing over time, to see if usage is growing or shrinking: SELECT count(*), date FROM events GROUP BY date ORDER BY date. This kicks off another stream in the background, and you'll notice my users don't have to know whether this data is coming from Kafka or from Kinesis, or how to parse it, or how to read JSON; all they have to do is write SQL. We immediately get something we can turn into a visualization, and it's an actual streaming visualization that's being updated in real time. If you look at this axis, you'll see that every couple of seconds it bumps up as more events are read in. Pretty cool. But actually this graph is pretty boring: it doesn't really mean anything to me that we have this many events today unless I can see it in the context of everything that's been happening over the last year. And this is where the fact that Delta integrates with both streaming and batch becomes very useful.
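Pulling those notebook steps together, here is a rough reconstruction in PySpark. It follows today's open-source Delta Lake and Structured Streaming APIs, so the exact syntax of the 2017 private beta may differ; the broker address and checkpoint path are hypothetical, `spark` is the session a Databricks notebook provides, and `display()` is the Databricks notebook helper used in the demo.

```python
# Rough reconstruction of the demo so far: the table-as-contract DDL, the
# streaming ingest from Kafka, and the query my users write against the table.
from pyspark.sql import functions as F

# 1. The table is a contract: a schema plus a NOT NULL rule on the date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        date      DATE NOT NULL,
        eventType STRING,
        city      STRING
    ) USING delta
""")

# 2. Stream from Kafka straight into the table; the Kafka value column holds
#    JSON-encoded events, so parse it on the way in.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical
       .option("subscribe", "events")
       .load())

parsed = (raw
          .select(F.from_json(F.col("value").cast("string"),
                              "date DATE, eventType STRING, city STRING").alias("e"))
          .select("e.*"))

(parsed.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events")  # hypothetical
    .toTable("events"))

# 3. Users never see Kafka or JSON -- they just query the table. Reading the
#    Delta table as a stream gives the continuously updating chart from the demo.
live = spark.readStream.table("events").groupBy("date").count()
display(live)   # Databricks notebook helper; renders a live-updating chart
```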
Fortunately, we've archived all of the historical records on S3, so we've got a bunch of JSON out there. I'll just do spark.read.json on the historical mount, reading from that historical bucket. This pulls up another DataFrame, and if we take a look at it, we'll see that we have the date, the event type, and the city; that's exactly what we need. So I'm going to go ahead and run a batch job and write it into the table events. This is going to be a much bigger job than that Kafka stream running up above, because it actually needs to process quite a bit of data. If we pull this out, you can see it split the work into 800 different tasks and is running hundreds of them in parallel, so it's chugging through here doing the conversion, and... oh man, it failed.

At this point, if I were using a traditional system, I'd probably have to worry about the fact that 380 tasks already finished; they probably put garbage out there, and I'd need to start over. But fortunately, this whole system is transactional: either everything happens or nothing happens, and you'll notice nothing has happened up here. So let's see if we can fix this error. It looks like somebody back in September decided to use the silly American format for dates, so let's fix that. It's pretty easy with the DataFrame API: I'm going to do withColumn on the date, and I'm going to use the SQL function called coalesce, which just takes the first valid value that comes out. We'll try taking the date and just casting it as a date, and we'll also try to_date, where we parse it with the format that looks like month month, day day, year year year year. Let's try this again. Now it kicks off the batch job again, running a whole bunch of work in parallel. What it's actually doing is reading the JSON from S3, parsing it, applying my extra date-format logic on top of that, and once the entire job completes, it makes a commit to this table, so it transactionally adds all of that data. It's finished, now it's committing, and bam, shortly we should see all of that information appear up here. Pretty cool.

Let's actually take a look at this; I think this is one of the coolest parts. A streaming-only system can handle tens of thousands of records per second, but one of the really nice things about Delta, because it's built on top of Apache Spark's Tungsten execution engine, is that when a huge backlog of data suddenly shows up in the system, we switch into this massive processing mode where we're processing hundreds of millions of records per second. Pretty cool.

OK, so that's pretty great: we've got our streaming analytics. But I want to talk a little bit about the performance of ad hoc queries as well. We talked about the small-files problem, and it turns out Delta is automatically doing compaction under the covers, so I want to talk about something a little bit harder, something that was typically reserved only for data warehouses. If I run a query like SELECT * FROM events WHERE city = 'Dublin', it doesn't actually need to read the entire table; it could touch a relatively small amount of it. I'm going to run it in this debug mode so we can see what's going on under the covers in the query planner. In a normal Spark job, we would still have to scan the entire table to find which data came from the city Dublin, but what Databricks Delta is doing under the covers is maintaining statistics and an index that allow us to skip and read only the data that's necessary. In this case, you can see there are 952 files in the data set, but it actually dropped all but one of them, which is pretty cool.
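As a sketch of that backfill fix and the selective query, under the same assumptions as above (the /mnt/historical mount point is hypothetical, and the American date format is the MM/dd/yyyy pattern described in the talk):

```python
# Backfill the archived JSON into the same Delta table the stream writes to.
# The whole job commits atomically, so the earlier failed attempt left nothing
# behind and this retry is safe.
from pyspark.sql import functions as F

historical = spark.read.json("/mnt/historical")   # hypothetical mount point

fixed = historical.withColumn(
    "date",
    F.coalesce(
        F.col("date").cast("date"),               # ISO-style dates parse directly
        F.to_date(F.col("date"), "MM/dd/yyyy"),   # fall back to the US format
    ),
)

(fixed.select("date", "eventType", "city")        # match the table's contract
      .write.format("delta")
      .mode("append")
      .saveAsTable("events"))

# Data skipping: per-file statistics let Delta read only the files that can
# contain Dublin rows instead of scanning every file in the table.
spark.sql("SELECT * FROM events WHERE city = 'Dublin'").show()
```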
Skipping like that can really speed up your ad hoc queries and reporting when you connect tools like Tableau. So let's review what we've done by going back to the presentation. We started off with this really complex pipeline where I had to manually handle unifying streaming and batch, I had to do manual validation, I had to worry about the logic and the boundaries at which I was doing reprocessing, and I also had to handle compaction. Databricks Delta acts as this unified store where I can take data from a variety of sources, whether it's Kafka, a data lake, Kinesis, or whatever, put it there, and then all of my users can read it with the strong guarantees of a data warehouse. But the really nice thing here is that we still maintain the scale of a data lake: we have transactions, so we get reliability; we have indexes, so we get performance; and it integrates deeply with Spark, so we can do low-latency streaming. If you think this is pretty cool and you'd like to try it out, I'm really excited to announce that we're opening it up for private beta; you can find out more at databricks.com/delta. Thank you very much. [Applause]
Info
Channel: Databricks
Views: 24,474
Rating: 4.889401 out of 5
Keywords: #ApacheSpark, #DataScience, #datawarehousing, #datalakes
Id: -eZkegBnyMU
Length: 27min 49sec (1669 seconds)
Published: Wed Oct 25 2017