Databricks, Delta Lake and You - Simon Whiteley

Captions
Hello, and welcome to the session. My name is Simon Whiteley, and I spend a lot of time working with data lakes in Azure, building giant Parquet-based big data systems, and it's been a really interesting year. Data lakes used to be really good at certain kinds of problems: they let you scale, deal with lots of different data types, and do some really hard things, but they're quite tricky - you have to work with them in a very specific way. In the past year that's changed. This thing called Delta has come along: a new file format based on Parquet that makes a lot of the things we think of as common in data warehousing, but which used to be very hard in data lakes, much easier. So I'm going to take you through that: a little bit about lakes, a little bit about Parquet, what this Delta thing is, and what it's unlocked, which is a new way of thinking about data lakes called the data lakehouse. We'll have a look and see what it actually means.

So this is me, in case you can see me - hello. I run a company called Advancing Analytics, a consultancy in the UK that deals with all sorts of analytics problems, commonly either data engineering, data science, or that sweet spot where they meet in the middle. We spend a lot of time working with Microsoft - we're Microsoft partners, and I'm always looking at what the next thing in Azure is, how that's changing, and how we can improve the architectures and patterns people are using. We're also Databricks partners, so we spend a lot of time talking to Databricks and figuring out what the next thing they're coming out with is, and Delta is by far one of the biggest innovations to come out of Databricks in the past year - it's really changed a lot of how we work. I'm also an MVP in the data platform, so that's a thing.

OK, so we've got this thing, the data warehouse. It's familiar, we work with it a lot, and it's really good: a managed, structured data model that we can give to the business, that we can report on, audit, and work with in a disciplined manner. Now there's a bit of a backlash against it - it's too rigid, it's too slow, it's not quick enough to adapt to change - so a lot of people jumped and said "I'm going to build a data lake", and often that just meant throwing a load of files into a distributed store and calling it a lake. It's been an interesting journey with data lakes: a massive hype curve of everyone adopting one, everyone building a lake, and not many people doing it very well.

When people started, they just had CSVs: they took some comma-separated data and landed it straight into the lake. When you're dealing with a text file like CSV, the file doesn't know what structure it has or how many columns; we have to figure that out when we read it. That's called schema-on-read, and it means we're always having to work out what structure it should be. Compression in CSVs doesn't really work that well either, so there are a lot of problems with that approach, and that's where Parquet came along. Parquet, similar to SQL Server columnstore indexes, is a columnar store: we take the rows and rows of data we had in the CSV and store it by columns instead.
That means we can compress it really well into very efficient files. So Parquet is great, and it has a structure: when we query Parquet we don't need to say "here's the structure" or "go and work it out" - we just read it, and it knows what structure it is, the same as a SQL table does.

But Parquet still has a load of things that don't happen automatically. You have to know how to work with it, and there are a lot of patterns you need to get around its limitations - again, similar to when SQL Server columnstore indexes first came out: they were read-only, so we had all sorts of patterns to update data somewhere else and swap it back in, and it was a whole mess. We're dealing with the same kind of thing with Parquet, and that's where Delta comes in.

Delta is a file format, the same as CSV and Parquet, except it's really a wrapper: the data lives as Parquet inside it, and it adds a transaction log over the top, which gives us things like transactional consistency, governance, auditing, and real deep performance tuning - loads of stuff. I'm going to talk about what that does, but one of the big things is that evolution at the top, from the warehouse to the lake, where people are now saying "I miss some of the nice features of a warehouse". That's where Delta puts us, the next step: I want all the good stuff about the lake - the scalability, the flexibility, the agility - but I also want the strong management features, the discipline, the auditing and governance of the warehouse. So this thing at the end, the data lakehouse, isn't an evolution of those two; it's a combination of them: all the good bits of a warehouse and all the good bits of a lake wrapped into this data lakehouse pattern, which is pretty cool.

OK, so what do we actually want from this thing? We want all the good bits from the warehouse. It should be structured and governed: I want to be able to check when users have gone and deleted data, and roll it back from a backup if I need to. It needs to be familiar - never underestimate how useful it is for things to be familiar. When someone comes along and says "I've got a brand new technology, you're all going to love it, and you can write it in SQL", everyone breathes a sigh of relief: we know SQL, we know how SQL works, and that's a really strong feature that often gets overlooked. And it's fast for specific things: single-record lookups and the joins it's optimized for - it's a relational engine. Massive aggregations over huge amounts of data, not so much; you can do it, but you end up having to throw a lot of power at it.

On the flip side, the lake world is super flexible: just throw more data at it, you don't even need to define it, you can land it in the lake and start capturing data as soon as you see it, without upfront planning, deployments and code releases. It's super cheap - normally a quarter or a fifth of the price to keep data in a lake compared to a warehouse - so there's a really big cost argument for lakes. And it has massive scale: it's distributed and it's fast, but for different things than a warehouse - it's really good at scaling over lots of data and doing aggregates over huge amounts of it.
So what we need is all those good parts of the warehouse and all the good parts of the lake brought together, and that's what the data lakehouse is trying to be. I want all the flexibility of the lake, all the scalability and the file-based patterns we've got, but with a layer over the top that gives me transactions, governance and an audit trail, real deep index-style tuning, and file management utilities. I want all of the maturity - and that's the key word, maturity - of a warehouse, and all of the agility and big-data-ness of a lake. That's what we're trying to build, and that's the aim for most people these days: this data lakehouse paradigm.

So let's talk about files. We've got this evolution - CSV, Parquet, Delta - so why? Let's step into a notebook. This is inside Databricks, a notebook I've set up with a few quick examples. I've got some sample data - the NYC taxi data, everyone loves it - and if I ask how big it is, it comes back at roughly 120 MB, so not huge, that's fine. We're going to read it in, telling Spark: infer the headers, because I want you to work out the column names, and infer the schema, because I want the data types - CSVs don't carry either, so Spark has to work it out on the fly. However, this 120 MB file is csv.gz - it's gzipped - and that means the read becomes single-threaded. Spark is all about parallelism, about running lots of tasks at once, and if I'm forced onto a single thread, things go slow: just inferring the schema took about 30 seconds.

Then I want to write it back down to disk. I take that data frame - the CSV I've read and defined - and say: write it out, include headers, overwrite what's there (because I've run this demo a few times), and put it in that location. That took 43 seconds for 120 MB, which is not great, because of the way Spark works: when I run that write, it goes back and reads the zipped CSV again, and the zip is what's causing the pain. Writing it out as Parquet takes about the same elapsed time, because again I'm going back to the zipped CSV.

But I've now got three options - a zipped CSV, an uncompressed CSV, and Parquet - so let's quickly compare what they look like. The zipped CSV, as expected, is around 120 MB. The uncompressed CSV (you can see some metadata gets created alongside it) is one big long file of about 600 MB, so it has grown a lot. And the Parquet file is about 140 MB - interestingly, not as small as the zipped CSV. With Parquet that all depends on the type of data: if it finds efficiencies it'll compress really well, and this particular data doesn't have many. So it's interesting to see what this is going to mean.

Now we can do this: spark.read on the gzipped CSV, on the plain CSV, and on the Parquet, and run those jobs - go back to each file and bring it in - with a .count on the end, saying: count it and tell me how many rows I've got.
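For reference, a rough PySpark sketch of what that demo is doing - the paths are hypothetical, and `spark` is the SparkSession a Databricks notebook already provides:

```python
# Minimal sketch of the read/write/count comparison above (hypothetical paths).
gz_path      = "/mnt/datalake/raw/nyc_taxi.csv.gz"   # gzipped CSV: single-threaded read
csv_path     = "/mnt/datalake/raw/nyc_taxi_csv"      # uncompressed CSV copy
parquet_path = "/mnt/datalake/raw/nyc_taxi_parquet"  # Parquet copy

# Schema-on-read: Spark has to infer column names and types from the text file.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(gz_path))

# Write the same data back out as uncompressed CSV and as Parquet.
df.write.mode("overwrite").option("header", True).csv(csv_path)
df.write.mode("overwrite").parquet(parquet_path)

# Compare how long a simple row count takes against each format.
print(spark.read.option("header", True).csv(gz_path).count())   # slowest: gzip is not splittable
print(spark.read.option("header", True).csv(csv_path).count())  # faster: parallel read
print(spark.read.parquet(parquet_path).count())                 # fastest: columnar + metadata
```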
The first one, the compressed csv.gz, took 14 seconds. The uncompressed CSV took five seconds, even though it's the bigger file, and that's all about parallelism: the zipped one had to be read one task at a time, while the uncompressed one, as you can see here, is being read eight tasks at a time - and, unsurprisingly, doing eight things at once finishes quicker than doing one thing at once. So uncompressed CSVs are faster than compressed CSVs because of parallelism. But then we look at the Parquet, and that took 0.75 seconds. It's just a lot faster - partly because Parquet keeps some statistics and can cheat and go to the stats, but mostly because columnar formats are simply faster for this kind of thing. Parquet is a speed demon when it comes to aggregate-style queries, and so much better than straight CSVs. Looking at the counts that come back, I was trying to make it a fair fight: both CSVs returned their header rows as data rows, and if I'd stripped those out their counts would have been even slower - and they still lost. So we like Parquet; Parquet is very good.

Let's carry on: why is Parquet good? With a CSV file, all the data is mixed up together, row by row. When I convert it to Parquet I can apply compression that's only possible because I'm no longer storing it row by row - each column is stored as a separate block of data - and that lets me do some interesting things. Say the first three values in a column are "A", "A", "A": instead of storing the string over and over, I store the value once plus how many times it repeats. That's run-length encoding, and it only works on runs of contiguous values, so the more repetition in your data, the better Parquet compresses. That makes Parquet great for things like dimension values, where you've got hierarchies, and for facts, where you've got a load of measures on a big fact table - it compresses those really well, and you can do distinct counts really quickly. There's also dictionary encoding: if I've got "Fred" and "Bob" repeated, I can store 1 and 2 instead, because they're smaller and compress further. So there's a lot inside a Parquet file that makes it efficient - it's a good data storage mechanism.

And along the top there's metadata, so the file carries its own schema, its own structure. That's why in my examples I tell the CSVs "go and work out the structure", but for Parquet I just say "read it" - it knows what structure it is, the same as a SQL table. I can run SELECT * without saying "by the way, here are the columns", because it's got an implicit schema inside it.
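To see that implicit-schema point concretely, here's a small sketch using the same hypothetical paths as above - the CSV has to be told, or left to infer, its structure, while the Parquet file already carries it:

```python
# CSV: without inferSchema, everything comes back as strings, so we either
# declare a schema or pay for an inference pass over the file.
spark.read.option("header", True).csv("/mnt/datalake/raw/nyc_taxi_csv").printSchema()

# Parquet: the column names and types are stored in the file's own metadata,
# so no inference is needed.
spark.read.parquet("/mnt/datalake/raw/nyc_taxi_parquet").printSchema()
```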
OK, so why don't we like Parquet? Why are we changing, and what is this new Delta thing going to give us? There are a few problems inherent in Parquet. One comes back to that run-length encoding: big contiguous runs of repeated values are what give us the good compression, but a Parquet file is immutable - I can't edit it or append to it once it's written, I have to make a new file. So every transaction is a new file; any time I change anything, that's a new file. If the data changes a lot, you end up with tons and tons of small files, and small files don't compress well at all. Big chunky Parquet files: great. Tiny ones: not so good. Which tends to mean you're forever picking everything up and repacking it, spending a lot of time managing file sizes just to get any kind of performance. It's awkward, because these small files really do impact things. In this example I've got one big file and two small files: the big file compresses a lot, because it has lots of repetition, but the two small ones don't get as much, so together they end up bigger than the one big file - which seems weird, but it's exactly that efficiency: the bigger the file, the more compression you get out of run-length encoding. It's a big deal.

Other problems: the metadata lives inside the file, which means every Parquet file has its own metadata, so to check something I'm opening up loads of files and reading lots and lots of metadata. That doesn't scale, and it starts to be an issue when I'm doing lots of big update queries and schema checks - things slow down, which is no good.

Then there's complexity - and familiarity is what it's all about. If I want to say "DELETE FROM my table WHERE something equals something", I can't do that against Parquet. SQL works for a lot of things, but operational SQL, like deleting data, doesn't. With Parquet I have to take my Parquet files, read them into a data frame, filter that data frame, write it back down to disk as a new Parquet file - a load of redundant writes along the way - and then probably swap it in for the existing one. That's fine, it's not a huge amount of work, but think about making this easy and familiar for people adopting the technology: you tell them "you can't update a file, you have to do several separate steps to change anything", and they look at you and go "really? you seriously use this thing?" It's a stumbling block for adoption.

Auditing is another one. Say I've got my nice delete statement, I go to delete a record, and I forget the WHERE clause - or I've commented it out - and I run it. All my Parquet is gone. No backups, no rollbacks, no time travel: I've lost all my data, which is horrendous. Now, I wouldn't literally write it as a SQL query - we've said that doesn't work - but if I had that same data frame process without a filter in it, there are lots of ways I can accidentally delete my data, and that's a big part of what maturity means.

And we don't have indexes, which is bad. If I say "SELECT * FROM my data WHERE key = something", it reads my entire table - a full table scan, every single time. That's sort of OK because Spark is a nice big chunky cluster, but you're swinging a sledgehammer every time, no matter how big the nut.
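As a concrete illustration of that operational complexity, here's a sketch of the read-filter-rewrite "delete" workaround described above - the path, column and staging location are hypothetical:

```python
from pyspark.sql import functions as F

source_path  = "/mnt/datalake/curated/sales"          # hypothetical Parquet folder
staging_path = "/mnt/datalake/curated/sales_rewrite"  # hypothetical staging location

# Read everything, then filter out the rows we want "deleted"...
remaining = spark.read.parquet(source_path).filter(F.col("segment") != 3)

# ...and write the survivors somewhere else (Spark can't safely overwrite a
# path it is still reading from), then swap the folders afterwards. Several
# manual steps, no transaction log, and no rollback if anything fails part-way.
remaining.write.mode("overwrite").parquet(staging_path)
```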
So that's why we need all these things, and why we're looking back at warehouses from the big data world: I miss it when it was structured, when it had a schema and it was governed, so if someone deleted something I could roll it back, and two people couldn't update the same thing at the same time, collide, and corrupt the data. And I want it to be fast when I'm looking for a single record, not doing a full table scan every time. That's the wish list; that's why people complain about lakes and say something has to change. And that's Delta: Delta is addressing exactly that selection of problems.

Here's the marketing spiel. According to Databricks, Delta is a unified data management system that brings reliability and performance - 10 to 100 times faster than Spark on Parquet - to cloud data lakes. That's not the sales pitch for me. Essentially they're saying it makes things faster, which is good, and there are plenty of people that pitch is perfect for, but it's not what I'm after. Yes, it goes faster, but it's a Spark cluster and Spark is fairly cheap, so you can just spin up a bigger one if it's slow; and yes, if it goes faster I can use a cheaper cluster, great. But all those problems we've been talking about aren't addressed in that pitch at all - although "data management system" is hinting at them.

So how would I describe it? Delta is an optimized, managed format for organizing and working with Parquet files. That makes a bit more sense. It optimizes them; it's a format, so it really is just a file format - we can say "write this out as Delta"; and it's managed, so there's a management layer. But it's still Parquet, and that's the interesting point: despite all those problems, Parquet is still a really good data storage format, so Delta still stores the data as Parquet files - it just adds things around them to solve the other problems. In essence, it's Parquet, but better. They should hire me to do the Databricks marketing: "Delta - it's Parquet, but better."

OK, so what does it actually do? We've established that Parquet is better than CSV; now we're saying Delta is better than Parquet - why? Take the example from earlier, deleting some data. I've got a data lake folder with a couple of Parquet files in it, and I want to - not literally via SQL, but keeping it simple - delete from this folder where the row belongs to a particular segment. I have to pick all those files up, read everything, filter it in memory, delete the old files and replace them with a new file that no longer contains segment three. That's a big operation, and I'm deleting data as part of it: if I delete the old files and it crashes before the new data is written, I'm in all sorts of trouble. So we end up doing more steps and more workarounds - pain.

What does Delta do when it attacks the same problem? When we first create the table, we've still got our two Parquet files - that hasn't changed, it's still Parquet - but the moment we create it we get this _delta_log: a folder sitting alongside the data that stores JSON files, and those JSON files are a transaction log.
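A sketch of creating a Delta table and peeking at that log - the paths are hypothetical, and this assumes Delta Lake is available, as it is on Databricks:

```python
# Write a DataFrame out in Delta format - the data itself is still Parquet,
# but a _delta_log folder with JSON commit files appears alongside it.
delta_path = "/mnt/datalake/delta/sales"                  # hypothetical location

df = spark.read.parquet("/mnt/datalake/curated/sales")    # hypothetical source
df.write.format("delta").mode("overwrite").save(delta_path)

# The first commit file records the "add" actions, including per-file row
# counts and min/max statistics for the leading columns.
first_commit = f"{delta_path}/_delta_log/00000000000000000000.json"
spark.read.json(first_commit).select("add").where("add is not null").show(truncate=False)
```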
So when I added those two files, the log recorded the fact that I added them and grabbed some statistics: it knows how many rows are in each file, and it knows the minimum and maximum values for a load of my columns (the first 32). It just captures that separately.

If I then delete some data, it adds another JSON file - commit 001. It still creates a new Parquet file, that part hasn't changed; it still has to read the data, filter it and write it back down. Except now I can use an actual DELETE statement and it works out what I'm trying to do and does it for me - and it doesn't delete the previous Parquet files. Instead, the transaction log marks them as removed: it says "I'm adding this new file, 002, and removing 000 and 001" - they're deleted, but only logically. If I queried the folder directly I'd see the data duplicated, all the data several times; but if I query the Delta table, it uses the transaction log and knows it no longer needs to look at those files. So when I say "SELECT * FROM my sales", it reads the log, sees that the most recent commit tells it to ignore the two old files, and reads only the latest one. At runtime, it decides which of the files sitting in that directory are the ones that actually make up the current table. That's really easy, and you can just look inside the JSON and see what was added, what was removed, and when. Every change anyone makes - a change of structure, adding columns, an archive job, an insert, an update, a merge - goes into the transaction log, and I can open it up and ask "who deleted all my data? OK, it was them - roll it back", because I've still got all the old data. I suddenly have change detection, I suddenly have backups, I suddenly have a layer of maturity across my data.

And I can do this kind of thing: query with a timestamp, asking "what did this look like at that point in time?" It ignores the transactions that happened after the time I asked for, which means the old files are treated as still active, and I get the results as of that moment. So I've automatically got temporal querying over my data, which is mind-blowingly cool. Adding a little transaction log hasn't just given me a bit of backup and restore, protection against mistakes, and an audit trail - it's given me time travel out of the box, just because different queries can look at a different active set of files. I can run SELECT * as of a timestamp, compare it to the current SELECT *, and ask what changed in between. Suddenly I can do loads of powerful things, and it's very, very cool.
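A sketch of that temporal querying - the path and timestamp are hypothetical; `versionAsOf` and `timestampAsOf` are the two Delta read options for this:

```python
delta_path = "/mnt/datalake/delta/sales"   # hypothetical Delta table location

# Current state of the table.
current = spark.read.format("delta").load(delta_path)

# The same table as it looked at an earlier point in time / at an earlier version.
as_of_time    = (spark.read.format("delta")
                 .option("timestampAsOf", "2021-04-01 09:00:00")   # hypothetical timestamp
                 .load(delta_path))
as_of_version = (spark.read.format("delta")
                 .option("versionAsOf", 0)
                 .load(delta_path))

# "What changed since then?" - rows in the current table that weren't there before.
current.exceptAll(as_of_version).show()
```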
Let's go and have a look at that. Back to our Parquet pain. We can do this: read the Parquet file I created earlier and write it into a new location - I'll call it my Parquet update - and again I'm just writing it out as Parquet, nothing special, so I'll let that run for a minute. Then there's a little bit of registration: if you want to use SQL anywhere in Spark you need Hive and its data catalog, so I create a database, drop the table in case I've run this demo before, and say: create a new table, so that from now on, when I query sqlbits.taxis, look at the Parquet living in that location. It's still in the lake, still flat files, the same as they were - I'm just saying I want to talk to them in SQL from now on, which again is pretty cool. So I'll let that run.

When it comes back, let's try: UPDATE the taxis table SET passenger_count = 0. And you can see - no. You can't do a SQL UPDATE, and you can't do a SQL MERGE or DELETE either: it's a Parquet file, it doesn't allow it. I'm blocked from that kind of thing because it's Parquet; I can't use the familiar statements.

Instead I can do it the PySpark way. I import some functions, define a data frame over my Hive table, and say: in my new data frame, override the existing passenger_count with zero. That's the equivalent of the UPDATE statement: I add a column with the same name as an existing column, so it replaces that column with a literal zero, and then I write the whole thing back down. That's fine, and it's quick because this is a small amount of data, but it's already three steps and it's not familiar - people have to learn that syntax and learn that they have to replace the existing files. It's not just an UPDATE statement, and that's annoying.
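A sketch of that three-step Parquet workaround - the table and column names follow the demo, the output path is hypothetical:

```python
from pyspark.sql.functions import lit

# 1. Read the Hive-registered Parquet table into a DataFrame.
taxis = spark.table("sqlbits.taxis")

# 2. "Update" by replacing the column with a literal - a new column with the
#    same name overwrites the existing one.
updated = taxis.withColumn("passenger_count", lit(0))

# 3. Write the whole thing back out to replace the existing files.
updated.write.mode("overwrite").parquet("/mnt/datalake/parquet/taxis_update")  # hypothetical path
```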
So let's do it in Delta. I've got dataframe.write, mode overwrite, and that's it: the only difference between creating this table with Delta and creating the earlier one with Parquet is that previously I said format "parquet" and now I say format "delta". It's baked in as a first-class citizen within Databricks, which is great. I write it out to a new location - delta_taxis - and do the same registration: create a table USING DELTA pointing at that location. I can look in that folder and watch what's being created.

Then we try again: update the passenger_count. Before that, a quick look at the data - SELECT * FROM sqlbits.delta_taxis - just to show you there's nothing up my sleeve: the passenger counts aren't zero, they're one, one, two, two, as you'd expect, taxis usually carry people. Then we run the update - a straight UPDATE statement, really easy. Under the hood it's still doing the awkward operation: reading the existing data, filtering it, writing a new file, and updating the Delta transaction log to say something has changed - but it does it all without me having to worry about it. It can take a little while. When it's done, we look at the table again and the passenger count is now zero everywhere. Easy as that: we can use familiar UPDATE statements because we're using Delta and going through this management layer, even though it's still Parquet underneath.

And we get this useful thing, DESCRIBE HISTORY, which tells us what's happened to the table. I can see my different versions; I can see the very first write; I can see the settings I had, the notebook and the cluster ID - I've got an audit log. I can see the number of files and the number of rows written, captured automatically. I can see the update I did, along with the number of updated rows, the number of added files, and the number of copied rows - so if I'd only updated a single record, I could see all the other rows that had to be copied across to rewrite the file. Loads of useful stuff, just because I said "make it Delta" - and it's still Parquet under the hood. That's why we like Delta tables.
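In contrast, a sketch of the Delta version - names follow the demo where they're mentioned, and the path is hypothetical:

```python
delta_path = "/mnt/datalake/delta/taxis"   # hypothetical location

# Same write as before, just format("delta") instead of format("parquet").
taxis = spark.table("sqlbits.taxis")
taxis.write.format("delta").mode("overwrite").save(delta_path)

# Register it so we can talk to it in SQL.
spark.sql(f"CREATE TABLE IF NOT EXISTS sqlbits.delta_taxis USING DELTA LOCATION '{delta_path}'")

# A plain SQL UPDATE now works, because the Delta layer handles the rewrite
# and records it in the transaction log.
spark.sql("UPDATE sqlbits.delta_taxis SET passenger_count = 0")

# Full audit of what has happened to the table: versions, operations,
# notebook/cluster ids, row and file counts.
spark.sql("DESCRIBE HISTORY sqlbits.delta_taxis").show(truncate=False)
```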
OK, some extra cool stuff Delta can do. Say I've got incoming data with four columns, but my table only has three - that's not going to work. If I try to append that to plain Parquet, it just breaks: "I'm a three-column Parquet file, you're giving me four columns - no." With a Delta table, it's up to me. I can tell it: no, if the schema is different, don't accept it, don't let it in. Or I can turn on auto-merge: if the new data can fit - "oh, it's just one extra column" - append that extra column, treat the existing rows as if the value were null, and carry on. It's baked in, so schema drift suddenly isn't a problem. That used to be a big deal in lakes: a file arrives with new columns and you'd have to read up the whole table, merge it together, work out what to do, and write it all back down to replace your metadata. Now it's just: insert it, it's fine, we can handle it.

So schema drift handling is baked in, and merge is baked in. The MERGE statement is just the SQL MERGE statement, nothing particularly new: MERGE INTO target table USING source table, give the merge condition, say what happens when matched and when not matched - do an update, do an insert - all that syntax is supported. And I can do it in Python or Scala too, so I keep my language flexibility; that's one of the things with Delta, it's SQL-first, SQL is a first-class citizen, which again is really cool. So let's have a look at some of that, over in the merge demo.

Here we go: a DROP TABLE IF EXISTS, and then I create a table - an absolutely blank table, no data, just "this is the structure and this is where to put it". Hopefully I've not created it before - I might have to clear out some data, but it should be fine - and this creates just a shell: no data, no Parquet files, but I do get a transaction log, so I can start tracking what happens as we go. It's taking a moment to set up my SQL session, so I'll give it a second.

When that's done we run an INSERT statement - and you might notice one thing: previously, when I was running SQL, I had a little %sql at the top, telling the notebook "everything here is Python, but treat this cell as SQL". I don't need that here, because this is a SQL notebook: everything is SQL by default, and I'd only have to flag a cell if it wasn't SQL. OK, so that worked - it's created an empty table and its transaction log - and now we insert some data and make sure that runs. Good.

So we've got a Delta table. If we look at the folder we can see the _delta_log and the addresses table, and there's no data in it - which makes sense, we haven't inserted anything yet, I've only given it a structure - but there is a transaction log. Looking at that first commit: there's some metadata, no "add" files and no "remove" files, because there's no data. Makes sense.

Then we work with it: run the INSERT, put some records in, and take a look. I've got my three records, all inserted, nice and quick. Looking at the log again, there's a new transaction - shockingly enough, it's adding some rows - and it says it added four new Parquet files, because I did things in an inefficient way. Sure enough, there's a load of tiny Parquet files, kilobytes each, which isn't great - but it was all handled automatically; I didn't have to tell it how to manage anything.
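The demo switches to a SQL notebook at this point; here is the equivalent expressed through `spark.sql` for consistency with the earlier sketches. The table name follows the demo, while the column list and location are assumptions:

```python
# Create an empty Delta table - no data yet, just a schema, a location,
# and a _delta_log folder containing the first commit.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sqlbits.addresses (
    customer_id INT,
    address     STRING,
    start_date  DATE
  )
  USING DELTA
  LOCATION '/mnt/datalake/delta/addresses'   -- hypothetical location
""")

# Insert a few rows; each statement becomes a commit in the transaction log.
spark.sql("""
  INSERT INTO sqlbits.addresses VALUES
    (11, '1 Main Street',  DATE '2021-01-01'),
    (12, '2 High Street',  DATE '2021-01-01'),
    (13, '3 Broad Street', DATE '2021-01-01')
""")
```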
OK, so now I create a temp view - basically a temporary table - just selecting literal values to build a dummy row that matches one of our existing customer IDs (we already had customer ID 11), and I'm going to say this is an updated address. Then we merge: MERGE INTO sqlbits.addresses - my original table - USING my updates - the view we just created - ON the customer ID. If it's matched, do a full update: set all the columns to the new values, replacing the existing record. If it's not matched, insert it as a new record. A completely generic MERGE statement, nothing crazy. Let's run it.

So we run the merge, take a quick look, and there it is: number 11 now has an updated address. Then we do the next one: another set of updates, this time for an entirely different customer, so this is the other side of the merge - it should be a straight insert. But look closely - that's sneaky - there's a country column. We didn't have country before; country isn't part of the table. So we're saying: here's a new record that doesn't currently exist, and it has a different structure with extra data. That's where the auto-merge feature comes in: if that happens, don't worry, just take it, just work with it. So I merge again - exact same syntax - push it through, and when it's done we can look at the result and at what it's actually done to the Parquet. (It's getting late in the day, so it's getting stuck occasionally.) There we go: it's updated, and we've got a country column. It's accepted the new data with a different structure as part of the merge and given me nulls for the existing rows, which is fine. Really cool features, just baked in.

And finally we've got the history: show me the steps this table has been through. There are several changes: the original CREATE TABLE, the first data written into it, the merge that updated a row, and the merge that added the new column - versions zero, one, two and three. And over here, if I query VERSION AS OF 1, that's my original table, exactly what it looked like at the start, with the original customer address. VERSION AS OF 2 shows the updated address, and AS OF 3 shows the country column coming in. It's as easy as that to do temporal querying. It's a little bit finicky about timestamps - I have to give the exact timestamp of the version I'm after; it won't take a rough "around this time" and find the nearest one - but honestly that's easy enough to write a little query for. Really interesting stuff.
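A sketch of those merges - the source view, the schema auto-merge setting and the versioned query are the moving parts; names are assumptions based on the demo:

```python
# Allow a later merge to bring in a column the target doesn't have yet.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# A one-row "updates" source: an existing customer with a changed address.
spark.sql("""
  CREATE OR REPLACE TEMP VIEW address_updates AS
  SELECT 11 AS customer_id, '1 Updated Street' AS address, DATE '2021-02-01' AS start_date
""")

# Generic MERGE: update on match, insert when there's no match.
spark.sql("""
  MERGE INTO sqlbits.addresses AS target
  USING address_updates AS source
    ON target.customer_id = source.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

# Time travel over the resulting versions.
spark.sql("SELECT * FROM sqlbits.addresses VERSION AS OF 1").show()
```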
Let's crack on quickly: managing Delta tables. There's a problem: with all those little incremental updates we saw those tiny Parquet files, and they're not good. Now we've got this command, OPTIMIZE, which is the Delta equivalent of running a defrag on an index. It takes all my small files and replaces them with big, chunky, well-compressed files - and when I say it deletes the small files, it actually marks them as logically removed in the transaction log rather than physically deleting them. That means it's an operation I can run whenever I want: it doesn't stop anyone working, because the small files stay active right up until the moment it finishes and the new files take over. It's a nice offline maintenance job to keep the lake clean. We said at the start that the small-file problem is a massive problem for Parquet; this just fixes it - really straightforward: compact them, pack them into the same file.

Then there's a special thing called Z-ordering that I can do at the same time. If you've worked with tabular models, you'll know that sorting on a certain column makes compression work better - again because of run-length encoding: sort on a column and you're more likely to get contiguous runs of values. Z-ordering is like that, but across several files. Delta has a thing called data skipping: if one file's minimum value is 0 and its maximum is 10, the next is 10 to 20, the next 20 to 30, and I run a query looking for the value 15, it can ignore the first and last files - it knows 15 can't be inside their min/max range. So just from those min/max stats it can do something similar to partition pruning automatically. Z-ordering says: when you write the data down and optimize it, organize it in a way that improves the likelihood of skipping files - prioritize the fact that if I filter on this column, or this one, or this one, the data is laid out so files can be skipped. It's like a multi-column sort aimed at maximizing data skipping, which is really cool. And I just run OPTIMIZE on my table with ZORDER BY - it's SQL-first, so I can run it in SQL, which is familiar.

In that example, if I've got several files and I ask "where's Brad? go and find Brad" - this is basically index seeking, how do I find that particular record - then before optimizing I'd read nearly all my data: only one of the files has min/max values that Brad can't possibly fall between, so I can skip that one file and that's it. Now, if I run OPTIMIZE - and OPTIMIZE always does the file compaction, so my four files get squeezed down to two more efficient ones - with ZORDER BY name, it also reorganizes the people alphabetically: my first file has Andy, Bob and Brad, my second has Fred, Jan, Tim and Tom. When I run that query again, I can skip a whole file and do a lot less work. And if you imagine this running across huge swathes of data, with big chunky queries, it gets even better and even faster.
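That compaction and Z-ordering step is a single SQL command; a sketch below, where the table and column names are assumed (`OPTIMIZE ... ZORDER BY` is the Databricks Delta syntax):

```python
# Plain compaction: rewrite lots of small files into big, well-compressed ones.
spark.sql("OPTIMIZE sqlbits.delta_taxis")

# Compaction plus Z-ordering: also cluster rows on the column we filter by,
# so file-level min/max stats let whole files be skipped (the "find Brad" case).
spark.sql("OPTIMIZE sqlbits.people ZORDER BY (name)")   # hypothetical people table
```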
Finally there's auto optimize, which is similar except it works towards 128 MB files. Optimized writes: the moment I go to write, it says "you're about to write this down as four tiny files - why not make it one bigger file?", and it interrupts the query plan to optimize the write, aiming for roughly 128 MB files. The other half, auto compaction, runs after the write finishes: it looks back at the folder and asks "should I do a bit of tidying up?", again targeting around 128 MB. That's different from OPTIMIZE, which goes for about one gigabyte - roughly the optimal Parquet file size, and what we're ultimately after. 128 MB is a bit small, but it's a sensible minimum to hit on every write. You can turn optimized writes and auto compaction on for a given table, or for your whole Databricks cluster, and it'll quietly work away producing 128 MB files; then you run an OPTIMIZE job later to clean everything up into those nice big one-gig files. So we've got performance, we've got compaction, we've got the equivalent of index defragmentation happening inside our table - and if you think how far we've come, from "I've got a CSV and I don't even know what data is in there" to a whole management layer with an audit log and time travel, that's huge.

In this example I've got my Delta table with some chunky files and lots of small files I've been working with, and I come along with eight chunks of data I want to write into it - just to illustrate those three things. Optimized writes: I'm no longer writing eight files; it takes those eight chunks, decides two files is more efficient, and changes my plan as it goes. Auto compaction: all those tiny files already in the table need cleaning up, so while we're there, pack everything into two chunky files. And after the fact, if I run an OPTIMIZE, that cleans everything up into big one-gig files. Those are the different options for making things run fast and keeping the table well organized.

Let's have a super quick look. We've got DESCRIBE DETAIL, which tells us, among other things, how many files make up a table. Run that and it says we've got five files - not many, but this is a tiny amount of data spread across five files, which is rubbish. Then there's my OPTIMIZE command - whoops, sent it to the wrong place - which I run against this one table. Off it goes, and when it finishes it reports what it did: removed five files, added one file - it took five small files and smashed the data together into one bigger, more optimal file. Run DESCRIBE DETAIL again and we now see a single file. There we go: one file.
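A sketch of turning those settings on and checking the file count - the table properties are the Databricks auto optimize switches, and the table name follows the earlier demo:

```python
# Turn on optimized writes and auto compaction for one table...
spark.sql("""
  ALTER TABLE sqlbits.delta_taxis SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")

# ...and check how many files the table spans before and after an OPTIMIZE run.
spark.sql("DESCRIBE DETAIL sqlbits.delta_taxis").select("numFiles", "sizeInBytes").show()
```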
So, all of that together, against the wish list we had at the start: I wanted it to be structured, governed, familiar and fast - things a lake never had. Well, now it is structured: I've got a schema, schema management, and I can handle schema drift. It's governed: there's a transaction log, so I can see whenever anyone went near it, and if two people try to write at the same time, the transaction log manages that. It's protected: if someone accidentally breaks something I can roll it back, I've got the full history, and I can control how long that history is kept - a level of governance you never used to have over this kind of data. It's familiar: as you've seen, we just write SQL to manage so much of it, which makes it remarkably easy to work with. It's fast, because it's got index-seek-like behaviour from Z-ordering and data skipping, plus the scalability of Parquet - it brings both together. And I've still got all the flexibility, the cheapness and the scalability of my lake. On the other side, the old problems: small files - gone, fixed by OPTIMIZE, I can clean them up as I go. Operational complexity - I've now got DELETE and MERGE, so I can simplify my code into statements I'm familiar with. It's auditable, I've got index-like tuning, and I've got jobs I can run as offline maintenance to look after the table. Most of my problems are solved. What I'm left with is this brilliant thing in the middle: the combined abilities of a warehouse and the functions and features of a lake in one system.

This was the idea of the modern data warehouse - the kind of thing I've been building for a long time: land some data in a staging area in the lake, pick it up, shape it, clean it, put it down as Parquet, then load it into a warehouse and serve it out to users. Different people sit around that: my data scientists use Databricks directly to do their data wrangling, my BI developer builds things in the warehouse doing the SQL-heavy work, and my report users are over there in Power BI. What we're saying now - certainly what Databricks are pushing, and there's a huge movement towards it - is: take that whole curated, star-schema-style layer and just have it in your lake, because we now have the tools and the maturity to manage those kinds of data models there, with several layers inside the lake. Then your data scientists and your engineers are all using the same tool, the same compute, the same set of languages; they can share notebooks, environments, features and functions over the same data. Power BI comes in through the same serving layer: when I save something as sqlbits.taxis, I can expose it as a table that Power BI can see and bring in, without anyone needing to know that under the hood it's this transaction-wrapped Parquet - they just query it through Databricks. Suddenly I have an incredibly powerful system on my hands, and that is what the data lakehouse is all about.

So hopefully the message is clear: data lakes have moved on a long way from their start, from the idea that they're a free-for-all you can use to build some crazy system that scales. They're now a mature, managed, very sophisticated data management platform that can do a whole load of what we do with warehousing, plus a whole lot more, because we've still got the flexibility of lakes - and that is pretty darn cool.

OK, that's been me. Thank you for listening. I'll be hanging around in the comments answering questions and generally being a nuisance, so feel free to pester me with anything about Delta, Spark and next-generation data lakes.
I'm on Twitter, so feel free to ping me there too, and I do lots of these videos on YouTube - so if you're interested in Synapse or Databricks or Delta, catch us on YouTube and I'll be talking about more and more of this stuff. Right, hope you enjoy the rest of the conference.
Info
Channel: SQLBits
Views: 2,751
Keywords: Big data analytics, Cloud, Data Lake, Databricks, Developing, Managing, Modern Analytics, Optimising, Python, Spark, delta lake
Id: y91r_DLMEq8
Length: 48min 2sec (2882 seconds)
Published: Wed Apr 28 2021