Advancing Spark - Give your Delta Lake a boost with Z-Ordering

Video Statistics and Information

Captions
Hello Spark fans, and welcome back to Advancing Spark. This week we're going to dive into something called zed ordering, or z-ordering depending on which side of the Atlantic you're from, and it's all to do with Delta Lake: how do we actually make the data in our files more efficient? It's kind of a hard concept; when I first came across Delta, the whole idea of z-ordering didn't make much sense to me at all. So we're going to run through how it works, what it is, why you should use it, and the impact it's going to have when you turn it on. If you like the video, don't forget to like and subscribe, and let's dive in.

Okay, so super quick: I've got a load of little files, four small files with some data in them, some integers, and then a name column, so you've got Bob, Fred, Andy, Tom, Brad, Tim and so on. Now Delta is super smart: if I write a query saying select count(*) where name equals 'Brad', it does a thing called data skipping, if it understands how your data is organised. Data skipping uses the min/max values in the Delta transaction log, so it can go to the transaction log and ask: does this file contain the data I need? Well, the first file has data from Bob to Fred, and Brad sits alphabetically in the middle of that range. The next one is Andy to Tom, and again Brad's in the middle. Then Brad to Tim, so Brad's obviously in there. And then Dan to Jim, and Brad's not in there because D comes after B. So even with data skipping working on these files, it would still end up reading most of that data, because Brad falls inside nearly every file's min/max range.

Z-ordering is all about changing the ordering of the data within those files so that data skipping becomes more effective, and it's built into the OPTIMIZE command. The OPTIMIZE command itself you can run without z-ordering, and that just says: here's a load of files, let's compact them down. So if I did have 200 files, they'll be more efficiently compressed, more efficiently held, and there's less overhead of going back and forth opening files on disk if I hold them as 10 files instead. That's fine, but it's not going to do data skipping that well; it might just by chance go "oh, that file doesn't have any data I need", but there's no rhyme or reason to it.

If we put z-ordering on instead, so we say take this and z-order it by the employee name, and I have to do that at the same time as running an OPTIMIZE command, then it's going to combine some of these files, so I still get the bin compaction. I'll end up with two files in this case, but you can see the choice of which records it puts in each of those files is driven by the column I z-ordered by. So I've now got Andy to Dan in one file and Fred to Tim in the other, and when I run my query it goes: well, I don't need to look in the second file, because I'm looking for Brad and that's definitely going to be in the first one, not the second. So the column you z-order by really, really impacts how well you're going to find your data. The middle column, the one with ones and twos, actually has minimum one, maximum two in both files, so I could have z-ordered by that instead, and then queries filtering on that column would be more efficient, but then my Andys, Bobs and Freds would be spread all over the place. So there's a bit of a balance: which column do you want to z-order by?
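For reference, here's a minimal sketch of the two flavours of OPTIMIZE being described, run from a Databricks notebook via spark.sql; the table name and column are hypothetical placeholders matching the example above:

```python
# Plain OPTIMIZE: bin-packing / compaction of small files only, no reordering.
spark.sql("OPTIMIZE employees")

# OPTIMIZE with ZORDER BY: compaction plus co-locating rows by the chosen column,
# so the per-file min/max stats in the transaction log become useful for data skipping.
spark.sql("OPTIMIZE employees ZORDER BY (name)")

# The same works directly against a path rather than a registered table:
spark.sql("OPTIMIZE delta.`/mnt/lake/employees` ZORDER BY (name)")
```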
How are you trying to optimise your data? If you're from a SQL Server background, this is a lot like a clustered index: essentially you're saying, I want to store my data on disk according to a specific search predicate to make it go faster. That's kind of what z-ordering is doing. You're giving it a logical dimension, saying use this attribute, and it changes how it puts data down to disk, so that hopefully when you filter on that specific column it can go: cool, I don't need to read the rest of it. That's what z-ordering is, and that's what we're trying to get working.

Okay, so there is a whole blog post about this on the Databricks site, talking about what z-ordering is, how it works, the whole bin-packing compaction thing, and it talks about data skipping. Most importantly it's got a few notebooks, so you can dive into the Python notebook to get an idea of how these optimisations work. I dove into that, pulled it out and imported it into my Databricks workspace, and I've gone through it; it's useful, it explains how things work. It pulls in some flight data, and again, whenever you're using Databricks, don't forget you will always have databricks-datasets available as a mount point, so you can get hold of all of the demo data really easily. The original notebook pulls that in as a DataFrame, does a load of stuff, writes it down as Parquet, then does an OPTIMIZE and ZORDER and shows you the query performance before and after. For me that's comparing two things at once: one, Parquet versus Delta, and two, pre-optimise versus post-optimise, so it doesn't really tell you what z-ordering itself does in that scenario. So I ran through it, followed it through, and changed it so it's just writing straight down to Delta.

It does a query before optimising, and you can see it took about 31 seconds. We can look at the query detail and ask what that actually did: it read 2,297 files, so there's a lot of work involved, lots of going off and opening lots of small files. Then it does the OPTIMIZE, z-ordering by the day of the week, and OPTIMIZE gives you a little "this is what I've done" summary: it removed 2,308 files and added 300, so it took 2,000-odd files and combined them into 300 much chunkier ones. And you can see it's got z-order stats, so it's created a z-cube, and essentially that z-order cube is what it uses to record what the data is z-ordered by, kind of like your clustered index key, so it can go off and use that for data skipping. But then the example query in the notebook just filters by a certain thing, runs it, pulls it back, and that actually isn't hitting data skipping. If we look at what happened in that query, we can see it read all 300 files. The compaction means each of those files is pulled back more efficiently, great, but that's not z-ordering paying off; it's not data skipping. So I thought, you know what, I'm going to make my own quick version of this, so we can have a look in a little bit more detail at what's actually happening.
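As a rough sketch of that notebook-style flow re-pointed at Delta (run in a Databricks notebook, where spark, dbutils and display are available; the dataset path and options are assumptions, so browse the mount first to see what's really there):

```python
# The databricks-datasets mount is always available -- browse it for demo data.
display(dbutils.fs.ls("/databricks-datasets/"))

# Read some demo flight data and write it straight down as Delta rather than Parquet,
# so a before/after comparison is purely about OPTIMIZE / ZORDER, not the file format.
flights = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/databricks-datasets/flights/departuredelays.csv"))

flights.write.format("delta").mode("overwrite").save("/mnt/lake/demo/flights")
```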
So again, back to databricks-datasets: I've got the NYC taxi data, and I'm pulling that down. I'm doing a little bit of globbing to say: I want the 2019 data, but don't try to read everything in and then filter it down, because this isn't partitioned; just give me the files that have 2019 in the title, so I get all the files that pertain to 2019. That's the plan. I'm using the first file's schema out of laziness: I do a quick read of a sample file and then apply that schema to the main read, otherwise it takes ages to infer it across them all. So I end up with this taxi DataFrame, which is the whole 2019 data set.

Then I'm doing a bit of fiddling to add some partition columns, because what I want is the pickup date as year, month and day, since that's what I usually partition on: year is yyyy, month is yyyy-MM and day is yyyy-MM-dd, which means you can do range queries on the partition columns, and that makes your life a hell of a lot easier. We can have a look at the data and see the new columns over at the edge, so I can see exactly what I'm going to partition by.

Then I write it down. I'm doing a quick filter first, because for some reason these files have pickup dates all over the place, so I'm making sure it only gives me 2019 data. And then I'm doing this cheeky little thing: repartition to 200. Now, 200 is normally the bad default number that shuffles give you, but here I'm deliberately forcing a bit of inefficiency because I want to test this out. I think the data reads in as 12 blocks, 12 different month files, and I'm telling it to force a shuffle into 200 blocks, so I'm going to have data all over the place. I wouldn't normally do that; I'm doing it so we can see how this works. Then I'm writing it down, partitioned, into a Delta table. So that's not going to be ordered: I'm not z-ordering at all, I'm not running an OPTIMIZE, and I don't have auto optimise turned on; this is shoving the data straight down onto disk. You can see a couple of stages in there: a first stage reading my 12 individual files, and then 200 tasks in the second stage, which is where I've forced it to go off and do 200.
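A rough reconstruction of those steps might look like the sketch below; the dataset path, file naming, column name and output location are all assumptions (the NYC taxi layout and schema vary by vintage), so treat them as placeholders:

```python
from pyspark.sql import functions as F

base = "/databricks-datasets/nyctaxi/tripdata/yellow/"   # path is an assumption

# Read one file first and reuse its schema, so Spark doesn't have to infer it
# across every file (which is what makes the full read take ages).
sample = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(base + "yellow_tripdata_2019-01.csv.gz"))

taxi = (spark.read
        .option("header", "true")
        .schema(sample.schema)
        .csv(base + "yellow_tripdata_2019-*"))           # glob: only the 2019 files

# Derive partition-friendly columns from the pickup date: year, year-month, year-month-day.
pickup = F.col("trip_pickup_datetime")                   # column name is an assumption
taxi = (taxi
        .withColumn("pickup_year",  F.date_format(pickup, "yyyy"))
        .withColumn("pickup_month", F.date_format(pickup, "yyyy-MM"))
        .withColumn("pickup_day",   F.date_format(pickup, "yyyy-MM-dd"))
        .filter(F.year(pickup) == 2019))                 # the raw files contain stray dates

# Deliberately force an unhelpful 200-way shuffle so the table starts life as lots of
# jumbled small files -- purely to show what OPTIMIZE / ZORDER has to fix later.
(taxi.repartition(200)
     .write.format("delta")
     .partitionBy("pickup_year")                         # partition choice is an assumption
     .save("/mnt/lake/demo/taxi_2019"))
```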
So when it writes that down, I'm going to have 200 different files. Let's go and see what that looks like. Over in our lake I've got the transaction log; I've already run all of this, because annoyingly OPTIMIZE takes a little while, so I ran it in advance. For that first JSON commit file we can ask what actually happened: download it, open it up locally, and I get a ton of stuff. Every file that was added in this original insert is recorded in the transaction log for my Delta table, which is good, and I can see it's all 2019. What Delta does is gather stats on each of these files, so I've got the stats column, I can see how many rows are in each file, and the minimum values: for my trip pickup time I can see 2019-01-01 across all of these files. So it's not grouped into months; it's just jumbled all of my data together, and every file has data from the first day of the year. If I scroll over to the max values, I can see dates right at the end of the year. So all 200 of my files have data spanning the entire range of that year, which means if I ask for one particular day somewhere in that year, it's going to have to read every single file: based on the min/max, every file potentially has the data, it's just all jumbled up. That's no good.

Okay, so that's our starting point; what do we do next? I quickly registered it as a Hive table, and actually OPTIMIZE works either on a table or directly on a path, but I wanted to try some stuff, so I can run a SELECT statement and go get some data. Then I can run an OPTIMIZE and say: go and optimise this, and z-order by the pickup date. It did a similar thing to the last one we saw: it removed 200 files, so it saw the 200 files I'd created, decided they were all garbage, and compacted them into 10.
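A sketch of that optimise step, using the hypothetical table and path names from the earlier snippets (OPTIMIZE works against either a registered table or a path):

```python
spark.sql("""
  CREATE TABLE IF NOT EXISTS taxi_2019
  USING DELTA
  LOCATION '/mnt/lake/demo/taxi_2019'
""")

# Compact the 200 jumbled files and co-locate rows by pickup date while doing it.
spark.sql("OPTIMIZE taxi_2019 ZORDER BY (trip_pickup_datetime)")

# The operation metrics (files removed/added, z-order stats) are also recorded in the
# table history -- another way to see the "removed 200, added 10" summary.
display(spark.sql("DESCRIBE HISTORY taxi_2019"))
```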
Now, there is a target size that the OPTIMIZE command aims for, and I could change that, making each file smaller or bigger; you can tweak it. Normally OPTIMIZE aims for about a gig, so it tries to make each file roughly a gig with a fairly even number of rows, to avoid data skew. So let's see what it's done. The total size hasn't really changed, so I haven't gained much compression from having these larger files, because the taxi data set is nearly all integers; there isn't much to gain from repeated values. If there were lots of strings, lots of repeated dimension-style values, that would compress a hell of a lot better. In this case it's all about how the data has been organised to enable data skipping. I can see my z-order stats again, so I've got my minimum cube size and so on, and it's gone in and rearranged my data: it's physically changed it.

So let's go back and look at what it actually did in the transaction log. Firstly, I've got removes: it's getting rid of all that junk, all those 200 files. And remember, it doesn't physically remove them, it logically removes them in the transaction log; it only physically removes them when you run a VACUUM command. Then I've got my 10 new files, and these are my nice optimised ones that have been reordered. I can see the number of records, and over in the stats I can see that for the trip pickup date, half of the files start at the 1st of January and half of them start around the 21st of June. So essentially what it's decided is that the most logical way to pack these in is to create just two groups: five files in one group and five in the other, with anything from the first half of the year in the first group and anything from the latter half in the second. I believe that's down to the minimum cube size: it's gone, okay, it only makes sense to do two cubes given the amount of data we've got. The more cubes you have, the more granular it is; I could force it so it's actually got 12 individual cubes, each of which is a file, but in this case it's looked at the amount of data and said, you know what, you're fine with just two groups. It's kind of like a b-tree, really: start of the year, end of the year, pick your collection of files. And if we look at the maximum dates, you can see the first chunk, which started on the 1st of January, goes through to the 21st of June, and the latter chunk, which starts around the 21st of June, goes through to the end of the year. So it's organised the data that sits in these files: it hasn't just done the compaction into 10 files, it's done a logical ordering of the data to enable data skipping.

Okay, so that's what's happened inside my data, and again, you see it all through the transaction log, and Databricks needs the transaction log in order to know what to use; it needs to know these files are organised in that way. That's essentially the index being kept there, so it knows what to do. So if I go and run some things: if I just run a plain count over the whole table and ask what's going on, we should see that it read all 10 files, which makes sense, because there are 10 files now and it ignores the original 200, which are obsolete.
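Two small sketches related to the above. The target file size OPTIMIZE aims for (roughly 1 GB by default) can be tuned; the config name below is the Databricks one and is an assumption for other environments. And rather than downloading the commit JSON and opening it locally, you can read the transaction log with Spark and pull the per-file stats out of the "add" actions; the path and the z-ordered column name are the hypothetical ones used earlier:

```python
from pyspark.sql import functions as F

# Optionally tune the target file size OPTIMIZE aims for (here ~256 MB).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 256 * 1024 * 1024)

# Read the JSON commit files; each "add" action carries a stats JSON string with
# numRecords, minValues and maxValues per column.
commits = spark.read.json("/mnt/lake/demo/taxi_2019/_delta_log/*.json")

adds = (commits
        .where(F.col("add").isNotNull())
        .select(
            F.col("add.path").alias("file"),
            F.get_json_object(F.col("add.stats"), "$.numRecords").alias("num_records"),
            F.get_json_object(F.col("add.stats"), "$.minValues.trip_pickup_datetime").alias("min_pickup"),
            F.get_json_object(F.col("add.stats"), "$.maxValues.trip_pickup_datetime").alias("max_pickup")))

display(adds.orderBy("min_pickup"))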
I had to read everything because I'm not filtering; I'm saying bring back everything. Actually, in terms of getting those counts it's going to be a little more efficient anyway, because the cardinality of the values in each file is lower: each file only contains a subset of those dates, which is great. But then if I try something where I'm filtering: I do a count from my Hive table, I include a partition filter just to remove that as a point of complexity, and then I filter on my z-ordered column where it equals a specific date. Now, we know month 09 is in the latter half of the year, so there are only five files that contain data pertaining to that month. Looking at the Spark jobs, if we go over to the associated query and get the details, I've got five files read. So it's automatically done data skipping, and that's not a configuration I need to turn on: if I've used z-ordering and I'm using Delta, then whenever I query with a search predicate on one of my z-ordered columns, it will automatically do data skipping. That's awesome, it's really, really good; it's just now a more efficient query. And you can do things like ranges, and all that kind of stuff works very nicely; you can do starts-with, because it's this kind of dictionary-based minimum/maximum arrangement, so range queries and the like work really nicely on a z-ordered column. And again, this query only read those five files. That is the point of z-ordering.

So there are a few things here. Each individual file has to record what values it's got, and if you're limiting the range of values in each file, those files are processed quicker, so generally queries around aggregates, counts and distincts on the columns you've z-ordered are going to go faster. But also, if you filter on a column you've z-ordered, you're going to get data skipping, and that is just a whole bunch of speed and optimisation and great stuff. So z-ordering: really good. Now, you can z-order by multiple columns, but obviously, because it has to organise by the first one and then the second one, you get diminishing returns on each additional column, and it takes longer: the more complexity you add, the less benefit you actually get. Again, same as a clustered index: if you try to cluster by a whole bunch of columns, it's not that useful by the end of it. So have a bit of a think. When you've got a Delta table and you're deciding how you want to put it down on disk, whether you want to z-order, and which column you should z-order by, is a fairly big decision.

Now, it also depends on having that OPTIMIZE step. In here I had to run it as a separate step; it doesn't just happen on write. There is Spark configuration for auto optimise, so you can have optimised writes and auto compaction, meaning every time you write to the table it does a bit of optimisation, a bit of compaction, but you can't z-order with those automatic ones. So if you want z-ordering, you need to do it as a deliberate, separate task. But you don't need to do it constantly; you can run it as a weekend maintenance job.
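A sketch of the kind of query that benefits: a partition filter plus a predicate on the z-ordered column. The table, columns and dates are the hypothetical ones used above; the point is then to check the query detail in the Spark UI and see how few files were read.

```python
display(spark.sql("""
  SELECT COUNT(*)
  FROM taxi_2019
  WHERE pickup_year = '2019'                      -- partition filter
    AND trip_pickup_datetime >= '2019-09-05'      -- predicate on the z-ordered column
    AND trip_pickup_datetime <  '2019-09-06'
"""))
```

Range predicates and prefix matches benefit the same way, since it's all driven by the per-file min/max values.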
All week long you're going off running things, inserting data, appending, running merges, doing whatever, and then at the end of the week you say: right, go and sort that table out again. To keep hammering the analogy, it's like defragging an index: it gets worse over time as these small files get created, and then you step back and say, go and reorder these, sort them out, make them more efficient. And that can be a nice asynchronous, offline task. Again, because of the way Delta works, you can be optimising a table while other people are querying it, and they won't use the new files until the optimise has finished and actually written the new transaction log record. So it doesn't have to be an offline task, it doesn't have to be a weekend maintenance thing; it really depends on when you want your cluster to be busy. Because that's the hard part: OPTIMIZE is a heavy, heavy compute task. I think when I optimised this it took 10 minutes, and it's not that big a table, given I can query it in two or three seconds. Actually reading the data is fast, but it's 10 minutes to bring it all back and sort it, and it's doing sort operations, which are horrendously slow at the best of times. So OPTIMIZE is not a fast thing to do, but it will hopefully make all your queries on that table faster, as long as you z-order on the right column. Once you pick the right thing to z-order by, you're going to have a good time. Generally you want to z-order by something with fairly high cardinality, because it benefits from having that range of different values; if you pick something with only one or two values, you're not going to get much data skipping, so it won't matter that much.

So, big performance thing: if you're using Delta, use z-ordering, check it out, it is all of the awesome. Alright, that's all I'm talking about today. Have a look into z-ordering, let me know what you think, let me know if you've seen it but never actually tried it, and whether you're now going to try it out, because hopefully you'll see some real performance benefits, and hopefully they'll outweigh the time it takes to actually z-order your table. A bit of advice: if you're going to do it on a multi-terabyte table, plan out a long window to go and do that. But once you've done it, you can add a partition filter to the optimise, so OPTIMIZE with a WHERE clause on a partition, then ZORDER BY. It's a big, big task to z-order for the first time, but then, if you're only updating certain partitions, you can z-order and optimise each individual partition as you go, so you're not having to do your whole table at once. Big job at first, then just a regular maintenance routine, and it's really not so bad. Okay, so if you liked the video, don't forget to like and subscribe, and we'll put another video that we made up there so you can go and check that out. Alright, see you next time. Cheers!
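A sketch of that maintenance pattern, again using the hypothetical names from the earlier snippets: one expensive one-off z-order of the whole table, then partition-scoped runs, with auto optimise / auto compact (which don't z-order for you) handling day-to-day small files. The table property names are the Databricks ones and may differ elsewhere.

```python
# One-off (expensive) full-table z-order:
spark.sql("OPTIMIZE taxi_2019 ZORDER BY (trip_pickup_datetime)")

# Regular maintenance: only re-optimise the partitions you've been writing into recently.
spark.sql("""
  OPTIMIZE taxi_2019
  WHERE pickup_year = '2019'
  ZORDER BY (trip_pickup_datetime)
""")

# Auto optimise / auto compact can be switched on per table, but z-ordering still has to
# be an explicit OPTIMIZE ... ZORDER BY job.
spark.sql("""
  ALTER TABLE taxi_2019 SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")
```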
Info
Channel: Advancing Analytics
Views: 4,929
Rating: 5 out of 5
Keywords: spark, data engineering, optimize, delta lake, databricks, optimise
Id: A1aR1A8OwOU
Length: 20min 31sec (1231 seconds)
Published: Tue Aug 04 2020