Azure Synapse Analytics - Provisioned SQL Pool First Look

Video Statistics and Information

Captions
Hello, and thanks for joining once again as I dive deeper into the world of Azure Synapse Analytics. Spoiler warning: we're going to talk about SQL pools, and therefore I'm going to accidentally call it Azure SQL Data Warehouse halfway through this video. So SQL pools is our MPP-style thing; it's when we're saying "I want lots and lots of big chunky SQL servers to query massive amounts of data in parallel". Now, it's less of the exciting new weird stuff within Synapse Analytics, because it's already there, already live. If you go to the Azure portal today and look for Azure SQL Data Warehouse, this is what you'll see; the product is currently live. Except we're going to look at it as part of the wider workspace: what does it look like when we can also do things like Spark, SQL on demand and orchestration, and we have these provisioned SQL pools providing the same functionality that the existing product does? Does it work well? Can you use it through the workspace? How do queries work between on-demand and provisioned SQL pools? So don't forget to like and subscribe if you like the video, but otherwise, let's just see what works.

OK, so I've got my Synapse Studio here, and I haven't provisioned anything from scratch, so we'll do that first, then take a step back and talk a bit about what it is and how it works. Let's go down to my Manage area, where I've got SQL pools and Spark pools. We're creating a new SQL pool; I've already got on-demand, which is created by default. So I say I want a new SQL pool, it asks for a name (this will be my advancing warehouse), and then it asks for a performance level: basically, how much do I want to spend per hour, and how big can this thing scale? I can go pretty big, and I'll talk about that in a second, but I'm going to keep it really, really low and slow, nice and cheap. That's not the best idea for production, but it's fine for just playing around. Additional settings: I'm not restoring from a backup, I'm not creating from anything existing, and the collation is fine. Hit create, and that's off and provisioning.

While that happens, let's take a step back and talk about what it actually is. Whether you're familiar with Azure SQL Data Warehouse, or you've come across this Synapse thing and you're not sure what a SQL pool is: they're essentially the same thing. It's an MPP solution, a massively parallel processing engine. It essentially does SQL, but massively in parallel. It's a relational engine; it looks and feels like SQL Server. You can connect using Management Studio and think "OK, I'm just using SQL Server, great". Except any time you run a query, it distributes that query across a load of servers, they all work on it in parallel, and then it gives you your results back. It's kind of like a Spark engine: you have a brain, called a driver in Spark parlance or a control node in SQL pools parlance, and that is a single server. Whenever the pool is turned on, there's always at least that one server sitting there receiving queries, responding, talking to the workers, collecting results back, all of that stuff. That's your control node. Underneath that you've got a variable number of compute nodes, and that's the scaling. When we're choosing 100, 500 or 1000 data warehouse units, that's effectively saying how many compute nodes we want provisioned. I can change that number, but apparently if I change it, it turns my server off, changes the number, and turns my server back on again. So it's not a super quick, dynamic, elastic scaling thing; it's a "turn it off, wait a minute or two, and it's back at a different size" thing. It's quite chunky in its scaling.
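For reference, once the pool is running you can see that control-plus-compute topology for yourself through the PDW system views. A minimal sketch, using the standard dedicated pool DMV; nothing here is specific to this demo:

    -- Lists the control node plus however many compute nodes
    -- the current performance level provides.
    SELECT pdw_node_id, type, name
    FROM sys.dm_pdw_nodes;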
Then sitting underneath that, currently, are the 60 distributions, and they're essentially, as the picture shows, like little Azure SQL databases of their own. If I've got a 60,000-row table and I say "spread that out", it'll put a thousand rows in each of those 60 tables, one in each of the 60 databases. When I query it, each of those compute nodes will say "I'm going to query these databases, you query those databases", so they all have isolated workloads and isolated storage. Essentially it's doing the same things we do in Spark and big data: separation of storage and compute, taking your compute to the storage because moving data is expensive, running the queries locally and then collating the results afterwards. That's the architecture we're talking about.

So when I'm talking about 500 data warehouse units, I'm talking about one compute node. Now, that is a bit of a change. If you were around for the early days of SQL Data Warehouse, we had Gen1, where 100 data warehouse units meant one compute node. Every hundred you went up, you got an extra compute node, and they were fairly small little servers; we were just saying we want lots of servers. You could go up to 6,000 data warehouse units with Gen1, which meant 60 compute nodes, so each of my 60 distributions, those little data stores, had a one-to-one mapping to a compute node. You can't go any further than that, because you can't have more than one server querying the same data; that's our isolation of compute and storage together. Gen2 works the other way: the compute nodes are way, way chunkier, much beefier servers, but that means we now have much bigger hops in how we scale. I can still go up to 60 of them, but each one is now 500 data warehouse units. So going up: 1,000 is two compute nodes, 1,500 is three, 2,000 is four. That's the kind of chunkiness we're talking about.

When they did this, it was a loss for people like me who had a dev/test scenario and didn't fancy paying for DW500, because that's actually a little bit expensive for something I'm just doing some dev on. I wanted my nice, super cheap thing, 100 data warehouse units, fine for me to test on and play around with; not that fast, but it doesn't need to be. The 100 tier will still give you that, but essentially, to get 100 data warehouse units you're having a slice of a compute node. It's shared tenancy: there's a compute node that's shared between a few different people all trying to do little bits of dev and test. The small sizes, one, two, three, four hundred, share compute with someone else, so they're not for production. Performance is also not that predictable; not flaky exactly, just inconsistent, because you're sharing a bit of kit with someone else. So be a little careful with those lower tiers. They're completely fine for dev and test, kicking the tyres, that kind of thing; they're just not production grade. The minimum you want to be on is 500 data warehouse units if you're doing things in production.
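For what it's worth, that chunky resize can be driven from T-SQL as well as from the portal. A minimal sketch, assuming the pool is named advancingwarehouse (a stand-in for whatever you called yours) and using the Gen2 "c" service objectives; you run it while connected to the master database on the logical server:

    -- Scale to DW1000c, i.e. two Gen2 compute nodes.
    -- The pool goes offline briefly while it resizes.
    ALTER DATABASE advancingwarehouse
    MODIFY (SERVICE_OBJECTIVE = 'DW1000c');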
OK, so then: getting data into it. If we're using traditional things like SSIS or BCP, those kinds of tools, they're going to try and squirt data at the data warehouse, the provisioned SQL pool, and it's all going to hit that top node. You hit the control node, and that doesn't scale effectively. What you want is for your various compute nodes, however many you have, to be reaching down and pulling data into your warehouse. That means using things like PolyBase: using external tables, running a script that says "select star from my external table" and doing a CTAS into another place. That's the most efficient way to bring data in, because it means that if I say I don't want four compute nodes, I want ten, I want twenty, then loading just scales linearly: I get more threads that can read more bits of files at once from my lake, so I can do more work in parallel, and my query goes faster. So you want to be using PolyBase at some point to pull data in effectively.

Now, Databricks and Data Factory, the tools we commonly use in the lake area, can do this staging approach. Say I want to take data that's sitting in Databricks and get it into my provisioned SQL pool efficiently: they'll use staging, dumping the data from wherever it is into an intermediate blob store and then PolyBasing it in automatically. That's great, nice and efficient. Sometimes it makes you do that even if you're pulling data from a data lake, staging it into a blob and then loading it, which is an extra, redundant step it doesn't need, because you can PolyBase straight over the lake. Even so: if you're trying to get data into SQL pools, you want to be using these PolyBase external tables.

So we'll start off with that. I've got data in my lake; can I get the provisioned SQL pool we just created to pull that data in and make it accessible through my querying layer? Then I can do stuff with it: I can materialise it as a table, do whatever I want. That's the plan; let's see what we can do.
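That control-node-versus-compute-node distinction is the whole argument for PolyBase loading. A minimal sketch of the load pattern being described, with hypothetical names (ext.Product stands in for an external table over the lake, of the kind created later in the video):

    -- The SELECT over the external table is executed by every
    -- compute node in parallel, so load throughput scales with
    -- pool size instead of bottlenecking on the control node.
    CREATE TABLE dbo.StagedProduct
    WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
    AS
    SELECT * FROM ext.Product;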
OK, right, cool, so that's now provisioned. I've got my pool sitting there and I can see some details on it, including my endpoint; we'll use that in a bit to connect via Management Studio, but let's try and do something here in the browser first. There's one thing I noticed that I want to point out to you: because I've created this provisioned SQL pool, it basically creates a warehouse database. There's a database created in my SQL server with all my stuff in it, ready to go; an empty SQL data warehouse, essentially. You can also see I've got a SQL on-demand database that I created previously, a logical database, and I can see both of them in my databases list. Great. But when I create a new SQL script, I've got a choice of which SQL pool to run against: my on-demand SQL, the pay-as-you-go, always-on, responsive version, or my provisioned one, which does that whole "I need to spin up a pool" clustering thing. And if I select SQL on demand, in my database list I've only got master and my ad-hoc queries database; if I switch back and use the advancing warehouse, I've only got the advancing warehouse. They can't see each other, essentially.

To test that, on my warehouse I can quickly say: use the master database and select star from sys.databases. There we go; you can see exactly what I expected. Master and the warehouse are the only things this master database can see, and if I switch around to the other one, I get the other side. In terms of how we build Azure SQL databases currently, we build a logical SQL server and then attach databases or data warehouses to it, so the way this seems to work is that the on-demand SQL pool is in one logical SQL server and the provisioned SQL pool is in a different one. They're not aware of each other. Whether that's just the preview, or whether they'll never be able to see each other, no idea; there's probably a config thing we could do, but right now they're two different beasts. We can work with that.

OK, let's quickly show that we can connect to it. I grab that endpoint address, open up Management Studio, and connect to it just like any other SQL Server anywhere else. And I can connect; that's good. I've got some databases over on the side: my advancing warehouse with its various different bits. I can run a query, and it is just SQL Data Warehouse. You've got a load of DMVs, dynamic management views, the PDW versions that exist in SQL Data Warehouse land, or provisioned SQL pool land, sorry. So I get a full history of all the different queries that have ever been run, what happened, who ran them; a load of stuff about workload isolation, so I can see the importance levels, the resource classes, all of that. Again, that's just existing SQL Data Warehouse functionality, so I'm not going to go into it too much, but it works and I can see what's going on.

And in monitoring I think I get a similar thing: I've got SQL requests, which defaults to my on-demand pool, and I can switch it over to the advancing warehouse. Given those SQL request IDs, that looks like the same data; essentially it's just your history exposed through the workspace, which is kind of nice, being able to see what's going on. I suspect that history is just reading straight from that DMV, and that table is flushed any time I restart my server or scale my server. If I do anything like that with provisioned SQL pools, SQL Data Warehouse as was, it dumps all of that data; it's transient data that's there while you're using it. So I suspect the history is going to disappear when the pool restarts. I might be pleasantly surprised, but we can have a look.
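The DMV in question: a minimal sketch of the kind of history query the monitoring hub appears to surface. sys.dm_pdw_exec_requests is the standard request-history view in dedicated pools, and, as noted above, its contents don't survive a pause or a scale:

    -- Recent requests: who ran what, when, and how it went.
    SELECT TOP 100
        request_id,
        session_id,
        submit_time,
        total_elapsed_time,
        status,
        command
    FROM sys.dm_pdw_exec_requests
    ORDER BY submit_time DESC;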
OK, right, let's get some data in. I want to go into my linked data; I've got my lake in here. It's a really basic lake I built earlier, and I've got some Parquet in there. So that's a Parquet file, and against it I've got a New script option: I can open it automatically, which uses the OPENROWSET stuff; I can create an external table, which is what I want for the SQL data warehouse side, so let's make sure that works; or I can do a bulk load, which tries to build a Data Factory pipeline for me. So, create external table. It asks which pool I want to use, my advancing warehouse; I pick the database and what I want to call it, so let's call it dbo.product. I haven't created any other schemas, so I'm sure it would be upset if I tried. And let's get it to generate the script for me, because I want to see what it's trying to do.

Oh, OK, that was quick. So what have I got? Normally, to create a PolyBase table you need some objects in there: you need a master key and a credential, which I don't see, so I'm assuming it's using pass-through somehow. It's creating a file format, registering Parquet as a file format; you do need those SQL objects created. It's creating an external data source, exposing my data lake as a new source. And then it's creating my table for me: it's gone through and worked out what all the columns are. The wizard has just read that from the Parquet schema; Parquet has a schema built into it, so it's inferred the columns from inside that file and pulled them in. I see it's pointing at one particular file currently; we'll see whether we can point at a folder later, but let's try this first. And then it's doing a select at the end to prove that it works.
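I won't paste the generated script verbatim, but it follows the standard dedicated pool external-table shape. A hedged reconstruction, with hypothetical names, columns and storage paths; the real wizard infers the column list from the Parquet file itself:

    -- Register Parquet as a file format
    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);

    -- Data source over the lake. Note: no credential, which is why
    -- it only works where an identity can be passed through.
    CREATE EXTERNAL DATA SOURCE MyLake
    WITH (
        TYPE = HADOOP,
        LOCATION = 'abfss://data@mylakeaccount.dfs.core.windows.net'
    );

    -- External table over the single Parquet file
    CREATE EXTERNAL TABLE dbo.product (
        ProductKey  INT,
        ProductName NVARCHAR(200),
        ListPrice   DECIMAL(18, 2)
    )
    WITH (
        LOCATION    = '/raw/product/part-0000.parquet',
        DATA_SOURCE = MyLake,
        FILE_FORMAT = ParquetFormat
    );

    -- And a smoke test to finish
    SELECT TOP 100 * FROM dbo.product;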
Now, I'll be surprised if this works off the bat without any kind of credential. Normally you'd have an additional line in there saying "my credential is this", pointing at a SQL object that tells it how to connect. That's not there, but it might be using pass-through, so let's just try and run it. It's running on the advancing warehouse, connecting, starting, going off to do it... OK, a few things to note. I've not even created a master key; normally, if you're storing credentials, you need to create a database master key, an encryption key, which I can just generate, and then you store your credentials, like your blob access keys, encrypted under it. And... OK, well, that worked. Awesome. So obviously it must be using pass-through: it's either connecting as Synapse, using the Synapse managed service identity, or connecting as me, as my user. Either way it's managed to do it: it's connected to that location, gone off, and queried some Parquet for me. So if I go up to my workspace over here and do a quick refresh, it should be under external tables, and there we go. It's registered as an object I can have a play with.

Can I see that over in Management Studio, though? ... No, it didn't work, is the answer. Let me check what I called it... OK, hmm: "external file access failed due to internal error", an exception on the HdfsBridge. The HdfsBridge is the Hadoop file system bridge, the lake reader PolyBase is trying to use. So PolyBase failed, essentially, because it doesn't know how to connect, and that will be because the credential is missing; it didn't know what credential to use. That's interesting. Why does it work in one place and not the other? If I create a new script in the workspace and run the same thing against the advancing warehouse... yeah, that works, that's happy. So it works if I'm in the Synapse workspace; it doesn't work through Management Studio. Let's find out why.

I can go to my history, my advancing warehouse history, and see the failure. Oh, that could be why, actually. You can see in my history I've got the two queries; in one I'm doing a select top 100, in the other I'm doing the same thing; one worked, one didn't, and they're identical except the failing one ran as the SQL admin user. That was using a SQL login, and a SQL login isn't aware of any kind of Active Directory pass-through stuff, so I assume it's acting like SQL Data Warehouse currently does: it's not aware of whatever pass-through Synapse is using. It would be interesting to work out whether it's saying "I know who's actually connecting, and I'll check whether you've got access", and this SQL user simply doesn't have access, or whether it's a proper difference between connecting with Active Directory and connecting through the workspace. I've not set it up, but I could try connecting to the warehouse server from Management Studio with Active Directory auth and see if that passes through; I'll give that a test at some point.

What we can do is create the objects that are missing. So I can create a master key, and then I need to go and work out how I create a credential; it's been a while. We're talking to my lake, so I'll call the credential something like rootAdvancingADLS. The options here are a shared access signature or a storage key; I'll just give it the storage account key, since it's ADLS Gen2, and yes, I'll be changing all these keys after I've shown you them, because otherwise people would be straight in there. OK, that's created. Then I can create an external data source that says: rather than passing through whoever's connected, connect using this key. Now, that key is full admin access to my entire lake, so that's not necessarily the way you want to do it, but it's fine for testing how this works. Run that... it tells me the data source already exists, so let's give it a new name... OK, so that's created a new external data source. Then I grab the create-external-table script again and create the table again, except this time I point it at my new data source. Hopefully this should work using the old-school approach, without the pass-through stuff. Let's see: create external table, run, it seems happy... and if I hop back over to Management Studio, I can run the query now. OK, cool.

So that's good: PolyBase and all that kind of stuff looks like it works the same way it used to. They haven't really changed it, except there's now the ability to do it with pass-through as the logged-in user, if you're going through the Synapse workspace.
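The old-school objects being recreated there look roughly like this. A sketch with placeholder names and a placeholder key; the SECRET is the ADLS Gen2 storage account key, and the IDENTITY string is essentially ignored for key auth:

    -- One-off per database: the master key protects stored credentials
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Str0ng!Passw0rd>';

    -- Credential wrapping the storage account key
    CREATE DATABASE SCOPED CREDENTIAL rootAdvancingADLS
    WITH IDENTITY = 'user',
         SECRET   = '<storage-account-key>';

    -- Data source that authenticates with the key rather than
    -- relying on workspace pass-through
    CREATE EXTERNAL DATA SOURCE MyLakeWithKey
    WITH (
        TYPE       = HADOOP,
        LOCATION   = 'abfss://data@mylakeaccount.dfs.core.windows.net',
        CREDENTIAL = rootAdvancingADLS
    );

Recreate the external table against that data source and it becomes queryable from any client, Management Studio included.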
OK, good. So the next thing we should be able to do is the old CTAS, my create-table-as-select. Let's create a table and give it a name, dbo.DimProduct. Because we're in provisioned SQL pools, CTAS means two choices: a distribution and an index. The defaults are round robin distribution and a clustered columnstore index. Give me a second to type that out; it's been a while since I wrote one of these.

OK, so I'm saying the distribution is round robin, which means across all those 60 distributions. I'm going to take a copy of this data, which currently lives in that single Parquet file sitting in my lake, read all of it in, bring it into my provisioned SQL pool, and spread it out across those 60 distributions, putting some records into each of those databases. Round robin just says: make it as even as possible, spread the data out in a nice, even way. I could instead say hash, hashing on one of the columns, which says every row with the same value in that column will live in the same distribution. That is awesome for performance tuning and making sure things are aligned, but it also gives you a great ability to skew your data. If, instead of your data being really nicely spread across, most of it ends up sitting in one of those databases, you've lost all of your parallelism: you're paying for a big, expensive cluster and not using most of it. So hash is really good, but you need to know what you're doing. The other option is called replicate, and that basically says: put a copy of this table on each of my compute nodes, so it's available locally to each of them. If I've got four compute nodes, there will be four copies of that data. That means I don't need to worry about joining out across distributions and doing lots of data movement; everything already has a copy. That's for smaller tables where you want those efficiencies. It's exactly the same as being in Spark and doing a broadcast join, which says take that data and copy it to each of my workers so I don't have to do data movement and shuffles. Same concept, same thing.

So that should be able to run: it's dbo.DimProduct, round robin distributed, taking everything from the external table. Completed successfully. So if I refresh, I should have... and we've got... DimProduct. Great. And I can do a select star from dbo.DimProduct; OK, so that works happily.

And this enables me to do some interesting things, because I can join different things together. DimProduct is my materialised version: a hard copy, not linked to anything in the lake, living inside my provisioned SQL pool. dbo.product, the external table, goes back and queries the lake every single time I hit go on a select. So of my two selects, one is reading the lake and one is reading my internal storage, basically. But I can do a union: select star from one, union the other, wrap it in a sub-query, and run queries like that which say take everything together, join it up, treat it as the same thing. And that lake table: I could be streaming into it, I could have a constant influx of data, it could be a CSV file that people just drop in occasionally and replace. The lake side can be fairly loose, people playing around, doing what they want, whereas over here I can have my proper, official copy. We can have a blended data architecture.
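Pulled together, the CTAS and the blended query look something like this: a sketch using the demo's table names, with the three distribution choices discussed above shown as alternatives:

    -- Materialise the lake data inside the pool
    CREATE TABLE dbo.DimProduct
    WITH (
        DISTRIBUTION = ROUND_ROBIN,   -- or HASH(ProductKey), or REPLICATE
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT * FROM dbo.product;        -- the external table over the lake

    -- Blend the curated copy with whatever is in the lake right now
    SELECT * FROM dbo.DimProduct
    UNION ALL
    SELECT * FROM dbo.product;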
In this case I've not got a bigger count, because I said UNION, which got rid of the duplicates; with UNION ALL I should see double that number. So yeah, that's quite nice. Now, from my SQL data warehouse, from my provisioned SQL pool, I can see data in the lake and query it, I can see data I've got locally and query it, and I can join the two together and work with them. And I get to manage everything through Synapse.

So, housekeeping. I want to turn it off; I don't want to be spending money, and every hour that thing is turned on, it's costing me money. And I can scale it: that's where you say, well, 100 data warehouse units is OK but not very fast; you can go up to 500, which is my first whole compute node, and after that it scales in units of 500, because those are whole compute nodes. I can scale it from there, but again, that kills the queries; you wait for it to scale and come back on. So if I pause it now, it checks there are no active queries running before pausing, and then everything inside becomes unavailable. Now that I've paused it, if I went to write a SQL script, I wouldn't be able to see that database, and I'm assuming I won't be able to see any of my history either. Sure enough, in SQL requests I don't get any results: even though a minute ago I could go in and see all of my different requests, the history of how people were using my server, I now can't see anything, because the queries DMV went down with the cluster. It's only available while the cluster is turned on, which is... OK. One more thing: with it paused, I've still got Management Studio open, and that doesn't work either. "Cannot connect, database is paused", or words to that effect; a message you get very used to if you use SQL Data Warehouse and turn it off a lot.

So, it looks like Synapse sitting on top of provisioned SQL pools is essentially the same as SQL Data Warehouse currently is. It's got some nuance as to whether you're accessing it from within the Synapse workspace or from Management Studio: depending on whether you're a SQL user or not, it's going to work differently, because a SQL login doesn't get things like Active Directory pass-through, which, if you're not using Active Directory, makes a lot of sense. But what is in there just seems to work, so it's fairly straightforward, not that different to how you've seen it before.

OK, so that was a nice little starter. We've had a play, we can get in there, we can work with it. The main thing is going to be remembering to turn it off, because it is just an ongoing cost: when you press that button to turn it on, it starts charging you for that full hour, and if you scale it up and down within an hour, it charges you that full hour at the highest level you had it at. Even if you only put it up for five minutes and turned it straight back down, that full hour is charged at the highest cost. It doesn't have things like auto-terminate, and it doesn't have auto-scale; you can't say "if you're really quiet, just scale back down again". You decide how big you want it, turn it on, and work with it at that size.
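One practical note on that pause: since pausing kills whatever is in flight, it's worth checking for active work first. A minimal sketch using the same request DMV as before; this mirrors the usual pre-pause check, as far as I know:

    -- Anything not yet finished will be killed by a pause
    SELECT request_id, status, command
    FROM sys.dm_pdw_exec_requests
    WHERE status NOT IN ('Completed', 'Failed', 'Cancelled');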
Otherwise, though, it does seem fully plugged in: we can query it through the workspace, we can build scripts, we can build those external tables. It all seems pretty standard, and yeah, pretty good. So it's interesting to ask how big your data needs to be before you start using it. As a rough rule of thumb, once you're talking terabytes, it's probably worthwhile. Is it worth the spend if you've only got 500 meg of data? Maybe you'd be better off using SQL on demand. It depends on how often you query it and how many queries you run, because SQL on demand is paid by throughput, by how many terabytes you read, so there's a balancing act: at some point you use SQL on demand enough that provisioned SQL pools become cheaper. It comes down to how much you're going to use it; if you use it in sporadic bursts and you know when those bursts are coming, provisioning a pool might save you a lot of money. Lots of different factors. It'll be interesting to see how it develops and how it plumbs in; I haven't tried orchestration, haven't seen how dynamically we can build these things, and I want to dig more into whether it can query those on-demand tables, because it really should be able to. And yeah, don't forget to like and subscribe if you liked the video, check out some of the other videos we've done before, and otherwise we'll catch you next time.
Info
Channel: Advancing Analytics
Views: 6,854
Keywords: Azure, Microsoft, Synapse, MPP, SQL Datawarehouse
Id: bVGnczEP700
Length: 31min 39sec (1899 seconds)
Published: Tue Jun 30 2020