Azure Synapse Analytics - Parquet, Partitions & PowerBI with SQL On Demand!

Video Statistics and Information

Captions
Hello and welcome back. I've been having a bit of a dig into SQL on demand. I put a video out earlier this week just opening it up, trying to get it working and seeing what it can and can't do. Some things just worked out of the box, which was really, really cool, and some stuff I was a little bit disappointed by, honestly; some things I was expecting to just work and they didn't. However, after I put the video out, some of the guys from Microsoft reached out and said, "You're kind of doing it wrong." So there are a few bits I can show you that put a little bit of icing on top of what they're doing in SQL on demand. Apologies for anything I missed. It turns out some of it actually works a little better than I showed, and some of the things just aren't implemented yet.

OK, here we are in the Synapse workspace. I've got a few things set up. I've got a couple of Spark notebooks here, so first I'm going to set something up really quickly. It's a demo I did on Databricks for a user group just the other day. I've got a list, just a quick little array of tables, and I'm going to iterate through that list. I'm going to create a new database called AdventureWorks, so I'm just making a new Spark database. Then, really simply, for everything in that list, iterating through each of those different tables, I want it to grab the data from one place in the lake and write it to a new place in the lake. So it's a DataFrame read, spark.read: I'm telling it to infer the schema and that there are headers, telling it it's CSV, and then loading it into a DataFrame. I'm throwing a quick bit of lineage on there so I know which file it came from, and then writing that out to Parquet. That's super, super important. In all the demos I did at first, certainly when I first turned Synapse on, I went straight to Delta. Delta is the open-source Delta Lake (if you're in Databricks, it's Databricks Delta), and it's essentially a transactional layer wrapped on top of
Parquet. So the actual data is still held in Parquet files, but it's like Parquet plus plus: it's Parquet with some bells, whistles and extra bits. Now, some of the things I couldn't get working simply aren't working for Delta yet. Hopefully "yet", with a big underlined "yet". So here everything loads as Parquet. As I said, for each of those tables it loops through and does this: get the data, read it, write it out to Parquet, and then run a bit of SQL. It's generating the SQL on the fly for each table: create it in AdventureWorks by table name, register it as Parquet, and here's the location. So I ran that and managed to get a set of AdventureWorks tables.

Now, one of the things I saw originally, in the first video I put out: I was looking in here and going, well, what is this address table? That's great, but we've got no columns. That looks like a Delta problem. Interestingly, going back into the ones I just registered using this script, I've got metadata and I can actually see what it is. So it looks like some of that stuff just hasn't been implemented for Delta yet, which is fair; we're still in preview. So that's cool. I can pick one of those and do the whole right-click, load to DataFrame, and that's all happy days. I'm not totally happy with the syntax, but it works, so that's good. And that is apparently a very important difference.

So this Spark database has a lot of tables, a lot of things in AdventureWorks. One of the things I desperately wanted to just work, without even thinking about it, is the ability to come in here and say, I want to open you up in SQL; I want SQL to understand what this is. So if I go over here and say I want a new SQL script, and I use the AdventureWorks database (again, I'm using a Spark database and not a SQL on demand database), I can say SELECT * FROM dbo. You don't have a schema in Hive. Hive doesn't have this idea; it's just two-level naming. I've got my database and I've got
my table. So dbo is what everything is by default, which makes some people squirm. Then I can go and run that and, lo and behold, I have some data. That didn't work when I tried to do it on a Delta Lake table; it does work when it's a Parquet table. Apparently this is selective replication. It's not querying the Hive database itself. It's just that if you register a Hive database through Synapse and you register certain kinds of table (at the moment, Parquet), it makes a shadow copy of that which is available to the SQL side. That means if I come into SQL, I can write queries directly on my Hive tables without worrying about it. It's a little odd that you have to have two copies, but it's hidden away, it's abstracted, it's done for me. So great, fantastic: if I'm using Parquet, I can go straight to SQL on demand, I don't have to wait for a cluster to start, and I can query my data. That opens up a whole load of stuff. We'll come on to some of those things.

The other side of things is having larger datasets. The classic example everyone does is the taxi dataset: New York taxis, a very famous open dataset. I've simply grabbed that and said, can I pull it in? We've got 66 million rows. Instead of just writing that out as it stands, I'm saying: I want you to be, very importantly, partitioned. I want you partitioned by a column called pickup month. It's Hive-style partitioning, so it's going to create a folder for each unique value of that column and segregate my data into each of those different folders. I'm giving it a format of Parquet rather than Delta. I'm going to write to just my local lake this time, so I'm not going back to that other lake. It's going into base, and I'm going to call it taxi. I'm just going to add something in here, because I have been playing around and I've done this before (cheating): I'm going to say the mode is overwrite. Normally, if you've got a DataFrame and you try to write somewhere that already has data, it'll throw an
error and say there's already data there, we don't know what to do. With overwrite, you can trash anything that's there, create a new copy of it and get that data written in there. So we'll have two things we can look at: a slightly bigger dataset that is partitioned, and a dataset that's registered as a Hive table, that came from Spark and is still Parquet. Then we can have a look at how that works on the SQL on demand side.

While that's running, let's just switch over into Management Studio. I can go back and get my AdventureWorks tables. The thing I found out last time with my base database: I was going, OK, what's here? But I couldn't see anything. There was nothing in here I could work with, so it was pretty useless. If I look at my AdventureWorks database, where we have the Parquet side of things, I can go into my external tables and, lo and behold, I see a lot of tables there. So that's the replicated side. When I'm looking in that base database and going, where are my tables, there are no objects, that's because they haven't replicated over to the shadow copy that's used for SQL on demand. These ones have. So I can query my products, do a quick SELECT, and it pulls it all out. It's got IntelliSense, and it works as if it was any other SQL table, except again this is going back to my Parquet. So that is a nice, snappy ability to do direct query against Parquet, and that is really hard. A lot of things can't do that. To do it in Databricks I have to have a cluster spun up, I have to be paying by the minute for however many nodes I've got sat there waiting for me, and it'll take four or five minutes to start my cluster before I can run my query and get my data back. So the fact I can do that and I don't have to wait opens up some stuff.

Back to my Spark job: that's still running, still thinking about it. I can show you what it's doing while it's waiting. Let's see what I have. This is what we're waiting for. Ah, there we go. So, let's
actually look at it. It's about to go and replace this stuff, but look inside my taxi folder. I just said make base, make taxi, put my data in there. But because I said partition by pickup month, it said, cool, OK, here's a load of folders for pickup month. That should work quite nicely. I'll just see if I can cancel that, actually, rather than waiting for it to run, and see how nicely it cancels. Come to think of it, that might kill my data; we'll see.

Importantly, we can now access it and we can start doing some SQL on it. I can grab the location. Again, I have to give the full location, so that's going to the root. Let's just grab any one of the Parquet files and say I'm doing a SQL query. We saw that last time: it gives you a very big list of stuff, and actually I don't really care about a single file. So I get rid of that, and we can put wildcards in; we put some stars in there. I could just do that and try to read the whole thing, and it can certainly do that. But this one is the important bit: where I've put a star for the partition column, and I've highlighted it. That's going to be really important in terms of how it knows how to query that data, which is kind of good and kind of bad. If we just run this, we should see some data there. OK. If I do a quick count, let's just see how much data I'm dealing with. That should be my 66 million. If I say I actually only want it from the second month, then I've got 9.7 million. So it's held in lots of different folders that we can go back and work with.

Now, the super important thing with that is that it's pushing down that predicate. I've got a filter on that pickup month, so I only want to read the files that are inside that folder; I don't care about the other folders. Here I've had to hard-code that in the path, which is the obvious way, but what we want it to do is partition elimination. Now, in Spark that's particularly good.
So in Spark, if I say I want a DataFrame, and I tell spark.read to load my stuff and go and get it from that location, it picks that up automatically. It knows that the data is partitioned. It just looks inside that folder and goes, oh yeah, there are some folders in there with the syntax of something equals something. It reads the contents and goes, well, that's got partitioning, and that is Hive-style partitioning. I don't have to tell it there's a certain column or tell it how many; it's just going to infer the fact that there is a column called pickup month, even if that doesn't exist within the data. That's literally all I need to do.

It's not quite as straightforward on the SQL on demand side. It's pretty good, but it's not quite as straightforward as I thought. I had to look the syntax up, and here it is. The super important bit is that, because it's an OPENROWSET, it automatically has this attribute of filepath, which you can use to append the file name onto each row, which is pretty cool anyway. And then I can say one, and what that does is pick out the first wildcard token it can find. So going in there, there's my star, and it can give me the value of that first asterisk I've put in there. So I can say that's equal to 2. Now, even though we've got a wildcard there, that should bring back our same 9.7 million. Checking: with just the wildcards I get 66 million; where I've got filepath equals 2, I get my 9.7. So that's doing partition elimination: it's reading less data. That becomes especially important for SQL on demand, because SQL on demand is priced by the amount of data, the throughput, and if you're reading fewer things then you pay less. So partition elimination saves you money, and it also speeds things up massively, because we're in a lake: we
don't have an index. There's no clustered index B-tree for it to find the record you're looking for; it'll read everything into memory, chop it down and then do its work. If you can do something like this and put partition elimination in, suddenly everything goes a lot faster. That makes a lot of sense. But having to do that, to me, isn't great, because it's just ignoring the fact that the column name is written literally in my path: this is the column, this is how I want to refer to it. It doesn't look like that's been implemented to pick that up automatically, so we're going to have to do something to tell it that's what pickup month is.

Did my Spark job come back? No, still waiting. It's being a bit funny tonight. OK. But what I can do, since that's just an attribute of the file, is get rid of my WHERE clause and just pull that in as a column, and then I can call it pickup month. If I just do a quick top hundred on there, OK, that's essentially appending it right at the end as if it was another column in my data. So now I've got pickup month in there. Then you might say, this should be a view. So I can do CREATE VIEW, and I'm going to call this dbo.taxi, AS. I have to do it in a SQL on demand database, though. I have my AdventureWorks database, which I can query and pull things back from, but I can't create objects in there. We get a little error saying it's not allowed for a replicated database, because again that is a shadow copy of our Spark Hive database. So I need to do it in a SQL on demand one, the one I created last time. Run that, and there we go, we've got that back. Now I can do a SELECT * FROM dbo.taxi. It's a big dataset, so let's do a top hundred. Run this, and it should be going off and running my query again. Now, right on the side, I've got my pickup month as a column in there, and I can go ahead and use that. So let's just see if I can do my COUNT(*), and I can say where my
pickup month equals 2. And again, that was nice and quick, so it looks like it's passing that down. It'd be great when we get a little explain button showing the query plan, to be able to prove the fact that it is doing partition elimination and that's why it's running quickly. But yeah, it's really, really nice that we can succeed in doing that. It's frustrating that we have to go back to the old days of manually specifying each of the wildcards when Hive syntax has been around for a while, but if that's the amount of work we have to do, that's not the end of the world. That's absolutely fine. And honestly, if we're doing this kind of thing and we're generating SQL programmatically to register a load of tables, then doing the same to register those for use in SQL on demand is not too much work. So yeah, some nice things in there, and some things that actually took fixing and now work. I don't know why Spark is being a bit flaky.

Cool, OK. However, the biggest thing for me in terms of being able to use SQL on demand on top of a replicated Spark Parquet table is Power BI. There have been a load of demos going around, with a lot of people talking about this, but that's the next big thing. You can see how much I use Power BI: Power BI Desktop is not the first thing on my list. So I've got Power BI Desktop up, and we've got that SQL endpoint we used last time that we can go and explore. I'd normally need to dive into the portal to grab that out, but I've got it here. There we go. In Power BI I can say Get Data, and I don't need anything special. I can just say I'm going from a database, from an Azure SQL DB. I tell it where it's coming from: that's my server, so my SQL on demand endpoint is the server it connects to. I'm going to go from AdventureWorks, which is again a Hive
SQL repository. Most important is the fact that I can choose Import or DirectQuery, and DirectQuery is big. Now, we've been able to do this kind of stuff in Databricks. If you have a Databricks cluster spun up, you can use the JDBC connector, Power BI can talk to it, you can see a list of all the Hive tables that are registered, and you can serve DirectQuery. But again, you need a cluster spun up. You need to have something sitting there waiting, going, is anyone going to send me a query? And that costs money. You're paying for all the time it's just sat there when no one's actually querying it; it's just ticking along, charging you. Whereas here I can say, well, actually I want my product and sales order detail tables, and I want to load them. It goes off, looks at those tables, brings them in, and then we can build a model and do all that kind of stuff. There we go, I've got my tables in there, and again these are Parquet files still in the lake. I'll create a relationship, just really quickly. OK, there we go. From there I can pick the product name and a value from the sales table. So that's a super quick report running DirectQuery directly off the lake, telling me my top products.

Now that, for me, is absolutely huge: just being able to query the lake in real time. One of the things that's always been a slight downfall when people talk about having that lakehouse thing, the data lakehouse, is this: I do a lot of dynamic Spark, I do a lot of elastic scale, spin something up, do some work, turn it off again, and then I get to Power BI, and Power BI can't read Parquet. I cannot just connect straight to the data lake and read Parquet files natively. You always have to have something in the way, some layer: connecting to a Databricks cluster, or connecting to SQL Data Warehouse (your Synapse SQL pools now) and having an external table. There's always going to
be something turned on that you're paying for, that acts as an intermediary allowing you to query it. That means people don't do DirectQuery over things like Parquet, because it's hard and costly. So you always end up doing import models: overnight, spin up a Spark cluster, generate your models, turn the Spark cluster off again. That's fine, that's a nice pattern, but it just meant DirectQuery was not a thing you do; you always had to put the data into a database. Now, the fact that we can do this kind of thing and have it working dynamically suddenly opens up stuff.

So yeah, a lot of things going on, and a lot of things still coming out. I should probably read the documentation before I start playing with some of this stuff, but it's always interesting to just see what you can do by having a play, by having a bit of an explore.

Wrapping up. Partitioning: it works. It's not as slick as it is with out-of-the-box Hive, but you can do it, it's performant, and you can make it easy for your users by wrapping it in some logical objects. Hive replicating over to the SQL on demand databases: it's limited in that it's only certain objects. Currently it looks like it's only direct Parquet tables that will replicate over, but then you can query them instantly in SQL on demand. Again, I hope that's something that just grows, and they add more and more things to that list of what can be shared, whatever your tool of choice. There's loads of stuff in there that's actually getting better and getting cool.

All right, that's me for today. I've got tons more things to explore: I've got a couple of orchestration things to look at, and I've got traditional SQL pools to look at. So join me next time and we'll dig further into even more of it.
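
The talk describes exposing each folder wildcard as a named column through a view, and suggests generating that SQL programmatically. The following is a minimal Python sketch of that idea, not the code used in the video. The storage URL, the view name dbo.taxi and the column name pickup_month are hypothetical placeholders, and the generated T-SQL assumes the OPENROWSET filepath(n) function behaves as described in the talk (returning the text matched by the n-th wildcard in the path).

```python
from typing import Sequence


def hive_glob(base_url: str, partition_columns: Sequence[str]) -> str:
    """Build a wildcard path with one '*' per Hive-style partition folder."""
    folders = "/".join(f"{col}=*" for col in partition_columns)
    # The trailing '*.parquet' matches the data files inside the last folder.
    return f"{base_url.rstrip('/')}/{folders}/*.parquet"


def build_partition_view_sql(
    view_name: str, base_url: str, partition_columns: Sequence[str]
) -> str:
    """Return T-SQL text for a view that names each partition wildcard."""
    # Wildcards are numbered from 1 in the order they appear in the path,
    # so the i-th partition folder maps to filepath(i).
    partition_select = ",\n".join(
        f"    r.filepath({i}) AS [{col}]"
        for i, col in enumerate(partition_columns, start=1)
    )
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT\n"
        f"{partition_select},\n"
        f"    r.*\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{hive_glob(base_url, partition_columns)}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") AS r;"
    )


if __name__ == "__main__":
    # Hypothetical storage account, container and folder names.
    print(
        build_partition_view_sql(
            "dbo.taxi",
            "https://example.dfs.core.windows.net/lake/base/taxi",
            ["pickup_month"],
        )
    )
```

The script only builds the statement as text; it would still need to be run against a SQL on demand database rather than the replicated Spark database. A query such as a count filtered on pickup_month could then be checked to see whether it reads only the matching folder. The value that filepath returns is text, so a filter would probably compare against '2' rather than the number 2.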
Info
Channel: Advancing Analytics
Views: 7,266
Keywords: spark, hive, data engineering, Powerbi, Power Bi
Id: hYl1jRWPhmc
Length: 20min 45sec (1245 seconds)
Published: Fri Jun 12 2020