Azure Synapse Analytics - How does Delta Lake compare to Databricks?

Captions
Hello and welcome back, and thanks for joining me on my little journey exploring the new Azure Synapse Analytics: putting it through its paces, seeing what it can and can't do, how it compares to Databricks, and all of that kind of thing. One apology for the last video: I had an audio problem, and hopefully that should be fixed. It still echoes until I finish decorating, but we should be good.

I admit I've had a little cheat. I was having a play over the weekend, just trying one or two things out, and I want to show you what I found, mainly comparing it to some standard training that I do with Delta. When I'm running a Databricks training course and teaching people how Delta Lake works, there are a few things I'd normally run through just to show how it works, and not all of that is compatible. Again, that's one of the big differences between Databricks Delta, the premium version you get in Databricks, and the open-source Delta Lake. So we'll have a look at a few of those things.

Let's step in. Here is my little Databricks notebook, with lots of things in here, and I'll step through and see what works. Firstly, to make sure that's all cleared up: I don't want that table there any more. OK, that's not there. And I've got a lake, which normally has a folder called addresses, but I've deleted it, so we're starting from a blank slate, no data. Let's see what we can do.

So I'm going to run this. It's a basic CREATE TABLE, giving it a schema and telling it we're using Delta. It's not taking any data; there's no DataFrame here. I'm just saying: this is the structure, register it with Hive so it's available for you to write SQL against, use Delta so it's going to have a transaction log, and here's where to put it. OK, so I should now be able to see this. We go into my lake, and I've got addresses, and then I've got something in my transaction log. So I've got my addresses folder with no data in it and a Delta log with data in it. Blank slate, ready to go, happy days. And then, on the other side, in my
data I've now got my Delta database, and I can see addresses has been registered. So that's kind of a two-in-one: because I'm running in SQL, writing Spark SQL, everything's going to happen within Hive anyway. The entirety of this notebook, if I pop to the top, is written using SQL, so for all these commands it's assuming it's SQL I'm writing anyway.

OK, let's get something in. A really straightforward INSERT INTO statement with some dummy data, some random stuff. I just want to get some data in there so we can actually have a look at what it's doing. It does a couple of jobs: one job to update the data, and a second job to update the transaction log. And then we can go and run that just a couple of times, no big thing.

This is super inefficient. Parquet is a column-store file type, so it's great for holding large amounts of data, especially data with repeated values. The little picture at the top kind of explains it. If I've got a column of data where rows with the same value sit in a contiguous line of repeated values, it can squish that up and just record each value and how many times it repeats, for example three Cs in a row. That's called run-length encoding, and it's one of the compressions that you see heavily in SQL columnstore, Parquet, all these kinds of column-store style things. Repeated values in the rows get squashed down, and it's a really good way of compressing.

So we've got that, but I've written three separate Parquet files. Because Parquet is compressed, I can't just add data into an existing Parquet file. So as we go over to our data lake and hit refresh, I've got three separate Parquet files, because each new insert, each batch of changes, has to be a separate file. Not the best, but at least I've got some data.

OK, so I can do a quick SELECT * to see what's going on. I've got my data, and I can see a little bit of info. OK, I've got some basic data. Just giving it this, without any other info, it understands: oh,
there's a Delta table there for you, go nuts.

Two commands. Firstly, just the straight DESCRIBE. That's inside SQL, and very importantly only SQL, not DataFrames; describe is a very different function inside DataFrames. But in SQL, if I run DESCRIBE, I get a schema: what is this thing, here are the columns, here are the data types, I can start using it. Then there's DESCRIBE DETAIL, which gives me registration data. It says: this is a Delta table, this is its full name, this is the lake it's sitting in. It tells me things like how many files there are and the size in bytes. Super useful information when you're trying to work out how efficient it is, how well it compresses, all of that kind of stuff.

So we've got a little bit. Let's try and do that over in Synapse and see how we get on. First I'll just clear out my database if it's still there. This should give me an error if it's already been deleted. OK, cool, it's just saying no such object, so that's already gone and we can start.

Before we do anything more in Synapse, let me show you something quickly. Lots of people have had a little bit of confusion when they first go in, because inside the Spark pools you can have a look at the pool itself, and it tells you a bit about what's currently going on, what the pool looks like, all of that stuff. And it's got Delta Lake version 0.4. I mentioned last time there was a big announcement at Build that it's now 0.6, and we're not seeing that reflected. We should also be seeing .NET for Apache Spark, and we don't get that. But that looks like it's just a refresh issue with that actual screen; it looks like it does have the later libraries actually installed.

How can we tell? Well, we can question it. So I switch over to Python mode for a moment, I do import delta, let's call it dt, and then print out dt. Run that, and there we go. So I imported this library and brought it into context. Delta's pre-installed and all that, plus
we don't have to worry about registering and that kind of stuff. I just say: bring Delta into context and tell me what that library is. And it's like, OK, that's a library, this is where it sits, and you can see 0.6. So if you're ever not sure, if you've been looking at that screen going, well, which libraries are they actually using, don't forget you can just query it. Some libraries have a property you can go and query, but the easiest thing is to just look at the actual file it's bringing in; you can usually tell what version it is. So I'm pretty happy it's 0.6, with all the new features, not that I'm using them today. So no issues there, happy days.

Back to SQL mode. Let's try and do what we were doing over there. So: CREATE TABLE, same stuff, USING DELTA, here's my location. I've changed the location over because this is using the lake local to Synapse, not my separate lake. And this is not going to work. It throws an error saying it doesn't allow user-specified schemas, which is a bit weird. I think it's more than that.

So if I go back over to my lake, which has currently got addresses, I'll get rid of that folder, so we're starting from the same place as earlier, completely from scratch. And I say, go create this thing: there's nothing there, now go and create me a table. I get this error saying it doesn't exist. It's expecting me to register a table that already exists. So if the folder's not there, it throws an error; and if the folder is there but doesn't have data in it, I don't have anything working either, it still throws an error. It doesn't like the fact that inside this CREATE TABLE I'm trying to actually define the schema. So not the full surface area of what you get inside Databricks Delta is implemented in Delta Lake. That's just because Databricks haven't open-sourced everything; they've got to keep something for themselves, right? So it's interesting to see what we encounter. To get round that, I did a little quick bit
of PySpark. I need to tell it that this is PySpark I'm running (and again, I'm still getting used to the fact that it's two percent signs, not one percent sign as in Databricks, but that's fine). So that creates a DataFrame and writes it down: my DataFrame, dot write, dot format, save to my location. That should happily go away... no, oops, it's still in SQL. Let's fix that. I've been lazy, switching the whole notebook between PySpark and SQL as opposed to setting individual cells, and that doesn't work. OK, that's going off and running that job.

Now the next bit of baggage: that runs happily, but at this point it has not registered this table, so I do need to add that as well. We can tackle that once this finishes. OK, that's created my table, I've got my Delta table. So I should be able to see it in my root directory, and if I give this a refresh: I've got addresses, I've got some data, and I've got my Delta log. So it's in a similar state. We had to do it in one step, creating the data and creating the Delta table at the same time, but that's fine. And then I want to register it.

So I go back to my SQL script to run that. Even though the Delta table now exists, this script will still tell me off if I try to have those different bits and pieces in there; again, it can't take user-specified schemas. But I can strip that out, like we did last time, and just say: there's already a Delta table sitting there, go and create it. Oh, I need to create my database. It always helps when I clean things up too thoroughly. So it's just CREATE DATABASE, and that's going to create a new logical database for me. Then I create my table, run the script, and then we're in the same place. OK, we can do all the same stuff; it's just that the commands aren't exactly the same, and occasionally you have to take a couple of extra steps to get there.

OK, now the good bit. We've got DESCRIBE delta.addresses. Because we've now registered that Delta table with Hive, we can use the
DESCRIBE function, and there we go, I can see what my table actually looks like. Unfortunately, DESCRIBE DETAIL is yet to be implemented, because it comes back and tells me that a table is not supported in DESCRIBE DETAIL, please use a path instead. OK, a little odd, but fine. So I can go and grab that same path and just say, do the same thing over my path; because we're in SQL, I want single quotes. So I should be able to do that, based on that message. Then again, executing that... and I think that's just the preview; I think DESCRIBE DETAIL simply isn't implemented yet. That's not working. So a little bit of a warning: not everything's fully plugged in and not everything's fully working, but the core functionality of Delta is there.

OK, so let's have a look at some other stuff. A quick example of what's going on inside that transaction log. You'd see something like: this is when it was committed, this is who committed it, this is the operation it was doing. You can see that if you did an UPDATE statement, that'll delete the old files and replace them with new files. You do a MERGE statement, same thing. So there's this logical deletion and logical addition of files, and it'll give you some stats: how many records, minimum and maximum values, that kind of thing. It's super useful.

Now, Databricks has something on top of this: it's called OPTIMIZE, and that's one of the big, big things inside Databricks. Again, that's not going to be around in Delta Lake, the open-source version. What that does is, for those three files I've created... I've got lots of small, tiny inserts, and that's not very efficient. Why don't you just have one file that has all three records in, and then you're going to get better compression? OK, so it's looked at it, it's added a new file and removed three files. So it's optimized. There are other things like that as well, which we'll look at another time. Now, interestingly, it doesn't actually go ahead and do any
deletions. So if I go and look at my data and hit refresh, I can see my fourth Parquet file's been created. It's a little tiny bit bigger than the others, because it's got three records as opposed to one individual record. But I can still see all of the original files. That's because Delta never actually physically deletes things until it knows you're done with them. The idea behind Delta being: you're constantly changing these files, replacing them, logically deleting the old files, and then you can go back and say, I want to see what that looked like as of that date. And in order to do that kind of temporal query, it ignores certain records in the transaction log and goes: well, actually, I won't worry about the files that were created, and these files aren't going to be treated as logically deleted, because that transaction hasn't happened yet. OK. So it's only once you're outside of that window that you want to go through and physically delete things.

Now, that physical deletion happens via this VACUUM command. So of the two things we're thinking about: OPTIMIZE, that's not going to be available to us, we're not going to be able to optimize things, but VACUUM we should be able to do. So we can have a go at vacuum and see how that works. Let's switch over and see what I can do. I've got it ready.

OK, so just to try it: I'll do OPTIMIZE, and then my delta.addresses. I should get an error, because that command's not a thing. And, weirdly, it comes up in IntelliSense: it kind of offers OPTIMIZE as something I might want, lures me in, and then goes, no, that's not defined functionality, thank you. VACUUM, however, should definitely work. So VACUUM delta.addresses. That should go away, inspect it and tidy it up... and then again: table is not supported in VACUUM. So it's expecting you to never use Hive for any of the maintenance tasks, which is just a little awkward; it expects me to always go back to
that path. So we go back and work out where it is. I grab that path and say: don't use my Hive entry, use my path instead, go and have a look at the data. While that's running, to give you an idea of what that looks like on the other side: in this case I can point to the Hive table, and there are a few options you can use with VACUUM. You can say how many hours you want to retain: am I deleting anything that was logically deleted as of now, from a day ago, from seven days ago, from 30 days ago? It depends on how long you want to keep that rolling history. And there's a safety check built in that will tell you no if you try to delete anything that's too recent. So if I try to run this as is, it's kind of got its kid gloves on and goes: no, no, you're not allowed to. Are you sure you want to vacuum? You're only retaining zero hours; that's not long enough. So there's this little check that you can turn off: spark.databricks.delta.retentionDurationCheck.enabled. I'm saying, no, please turn it off, and now it'll allow us to run a vacuum with a shorter retention than seven days.

And I've got this option called DRY RUN, which will tell us what it's going to do. So this should say: yep, I've identified three files. Because we did OPTIMIZE, I've got one file that supersedes those three, so for those three files it says it can delete them. And if I actually run the vacuum, that will go through, identify those files again, and delete them. That's nice, that should work.

Then if we switch back to Synapse: what's going on there? I guess it's still in progress, running its vacuum. It's doing a lot of things for a table which I only created with barely anything in it. I think it's just that Synapse tends to report a lot more than Databricks, because every time we run a query it gives you a couple of jobs which seem to be telling you it's going through the optimization engine and deciding what
to do with the Spark context, rather than just the actual Spark jobs that are needed. So you'll be seeing a little bit of extra info when you run things in Synapse. It's just its way of doing things. It's doing lots and lots of things, even though I only had two rows in the table, and it's all still running.

While it's on that last one, I'll switch over, since it's not completed. Over in Databricks, that's gone off and done its thing. So on the Databricks side: originally we had these four Parquet files; we ran a vacuum, and we've now got one Parquet file. And we've still got our log of transactions, so it still has the history: I created the table, I added these Parquet files, I effectively deleted those Parquet files. We ran a vacuum, which has actually deleted them. The log entries are all kept, but we no longer have the data, so if we tried to do a temporal query, we wouldn't be able to do that.

I guess Synapse is still doing lots of things. No idea what it's actually doing, but eventually we should see it vacuumed. It seems like it is working properly, but certainly vacuuming two Parquet files shouldn't take that long. Unfortunately, as I've been going around and playing with things, I've had the Spark UI open, and that seems to be a very unhappy bunny currently. And there, my friend, it's given me an Active Directory token error, which is always lovely. Let's see if it's behaving any more. Right, this time I can get into the Spark UI, so we can see what it's actually doing. OK, so it's kicking off job groups: a single statement, and that kicked off so many different jobs. Again, over in Databricks, to do a vacuum we had two jobs, which I'm assuming were one to work out what it should do and the other one to execute it and find the files. Here we've got all these different things, with so many different stages. I'll try and see if we can figure out what's going on under the hood here, but that's looking fairly inefficient; that's
looking like it's doing a lot more than it really should be. I'll report back when I've figured out what's actually happening.

So that's the first thing I wanted to show you about Delta. Remember, Delta is awesome in that it's doing this kind of thing under the hood. It's giving you that way of doing traditional SQL, right? I can do INSERT, UPDATE, I can do MERGE, all in SQL or in Python and DataFrames. Normally with plain Parquet you'd have to write a load of boilerplate to manage that for you. Now I can just register those tables, write a bit of SQL, and it'll do things fairly efficiently under the hood. And certainly in Databricks we've got the OPTIMIZE command to sort that out, to tidy up and make it a little bit nicer to work with. Without it, we'd have to write something manually to say: pick up these files, put them back down again, and run that as a regular job, just to sort out a table: read from here, put it back to the target, overwrite, use this partitioning, that kind of thing. So it's OK; it's something we know is a limitation with Delta Lake open source anyway. But at least lots of commands are there and lots of things are actually working, even if not everything is fully implemented yet.

So I'll keep reporting back in terms of what I find. Hopefully you guys can get in and have a play. Let me know what things you're struggling with, let me know what's good, what's bad. Next time I'll be having a look a bit more into time travel, so I'll show you how we do versioning: how we can do an update and then say, well, what was that before the update and after the update, and how that changes. From my little investigation so far, that looks really solid in Synapse, so it should be good. I hope to see you next time. Cheers.
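
The run-length encoding idea mentioned in the captions can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how Parquet implements it internally; the sample column values and the function names are made up for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each contiguous run of equal values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column where equal values sit next to each other.
column = ["A", "A", "A", "B", "C", "C", "C"]
encoded = run_length_encode(column)
print(encoded)  # [('A', 3), ('B', 1), ('C', 3)]

# Decoding restores the column exactly, so nothing is lost.
assert run_length_decode(encoded) == column
```

The saving only appears when equal values are contiguous, which is why many tiny single-row files compress worse than one larger file.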
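
The logical add/remove bookkeeping described in the captions (three small inserts, a compaction, then a vacuum guarded by a retention window) can be modelled with a toy in-memory log. This is a simplified sketch under stated assumptions: the `Commit` class, the file names and the hour-based timestamps are hypothetical, and the real Delta log is a set of files with more detail than this.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    timestamp: float  # hours since an arbitrary start; illustrative only
    operation: str
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)


def live_files(log, as_of_version=None):
    """Replay commits 0..as_of_version and return the files that make up the table."""
    last = len(log) - 1 if as_of_version is None else as_of_version
    files = set()
    for commit in log[: last + 1]:
        files -= set(commit.removed)
        files |= set(commit.added)
    return files


def vacuum_candidates(log, now, retain_hours):
    """Files removed at least retain_hours ago that the latest version no longer uses."""
    current = live_files(log)
    cutoff = now - retain_hours
    return {
        name
        for commit in log
        if commit.timestamp <= cutoff
        for name in commit.removed
        if name not in current
    }


log = [
    Commit(0.0, "CREATE TABLE"),
    Commit(1.0, "WRITE", added=["part-1.parquet"]),
    Commit(2.0, "WRITE", added=["part-2.parquet"]),
    Commit(3.0, "WRITE", added=["part-3.parquet"]),
    Commit(4.0, "OPTIMIZE", added=["part-4.parquet"],
           removed=["part-1.parquet", "part-2.parquet", "part-3.parquet"]),
]

print(sorted(live_files(log)))                   # ['part-4.parquet']
print(sorted(live_files(log, as_of_version=3)))  # the three small files
print(sorted(vacuum_candidates(log, now=5.0, retain_hours=168)))  # []
print(sorted(vacuum_candidates(log, now=5.0, retain_hours=0)))    # the three small files
```

Reading the table as of version 3 still needs the three small files, which is why a vacuum with a zero-hour retention gives up the ability to query that older state even though the log entries remain.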
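
The path-based form of the maintenance commands that the captions fall back to can be written out as below. This is a hedged sketch: the helper names and the storage path are hypothetical, the statement shapes follow the Delta Lake SQL utility syntax as I understand it, and the captions report that DESCRIBE DETAIL did not work in the Synapse preview even with a path. The helpers only build strings; in a notebook they would be handed to `spark.sql(...)`.

```python
def quote_path(path):
    """Wrap a storage path in single quotes, as a Spark SQL string literal."""
    if "'" in path:
        raise ValueError("this sketch does not handle quotes inside paths")
    return f"'{path}'"


def describe_detail(path):
    """Build DESCRIBE DETAIL against a path instead of a Hive table name."""
    return f"DESCRIBE DETAIL {quote_path(path)}"


def vacuum(path, retain_hours=None, dry_run=False):
    """Build a VACUUM statement with optional retention and dry-run clauses."""
    parts = [f"VACUUM {quote_path(path)}"]
    if retain_hours is not None:
        parts.append(f"RETAIN {retain_hours} HOURS")
    if dry_run:
        parts.append("DRY RUN")
    return " ".join(parts)


# Setting named in the captions; turning it off allows retention under seven days.
DISABLE_RETENTION_CHECK = (
    "SET spark.databricks.delta.retentionDurationCheck.enabled = false"
)

table_path = "/example/lake/addresses"  # hypothetical location, not from the video

for statement in (
    describe_detail(table_path),
    DISABLE_RETENTION_CHECK,
    vacuum(table_path, retain_hours=0, dry_run=True),
    vacuum(table_path, retain_hours=0),
):
    print(statement)
```

Running the dry-run form first mirrors the order in the video: list what would be deleted, then run the real vacuum once the list looks right.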
Info
Channel: Advancing Analytics
Views: 7,294
Rating: 4.9207921 out of 5
Id: 6S-0JSDZUF4
Length: 20min 48sec (1248 seconds)
Published: Tue May 26 2020