Azure Synapse Analytics - Introduction & Overview

Video Statistics and Information

Captions
Welcome back to the Synapse sessions, where this week we're going to go right back to the start. Azure Synapse Analytics went into public preview fairly recently, and I got super excited and dove straight into the Spark side and the SQL on-demand side, roaming around all the goodies we've not really been able to see before now. Over the weekend a few people got in touch and said "it sounds good, but what is it? I'm using Azure SQL Data Warehouse — isn't this just that with a different name? Where did Spark come from?" So this time we're going to have a quick run through the beginning: what it is, where it came from, how to put it in the right context, what's live and what's not. So less tech today, more context. And if you're joining us for the first one, welcome — don't forget to hit like and subscribe. I'm new to YouTube and I keep forgetting to tell people that, but if you do, I'd appreciate it. So, let's dig in.

Now, in true consultant fashion I have some slides. I just want to tell you the story of where it came from. Back in the day we had the granddaddy known as Parallel Data Warehouse (PDW), which got rebranded as the Analytics Platform System (APS), which is a whole mouthful. This thing was an appliance, so you had to buy the actual physical server with a specialist version of SQL Server that sat on top of it, and it worked as an MPP solution — massively parallel processing — essentially a whole load of SQL Servers all stuck together, working in unison, spreading the work out and then giving a single answer at the end. It's parallel scaling, but SQL style. So that was the Analytics Platform System, and it was very expensive, so not that many people used it, but the people who did use it loved it.

Now, when people started using the cloud more and more, they said, well, that would be a great fit, right? Rather than having just one fixed server that can't change, we can spread the work out across more servers, or fewer servers — it's the perfect story for something like Azure. So we had that come along in the form of Azure SQL Data Warehouse: a lot of the good ideas, the good thinking, the core SQL engine from PDW, the Parallel Data Warehouse, rebranded and with changes to how it works — and it is a different thing now. It took on scalable amounts of compute, which wasn't a concept back in the APS/PDW days. And that's been in Azure for an age, so lots of people are already using it, it's live, lots of people love it. SQL Data Warehouse is a thing, it is good, and it's fairly mature and stable now. So last November there was a big fanfare and announcements — was it Ignite?
Well, there was a big announcement saying "hey, it's now Azure Synapse Analytics". That was just a simple, straight rebranding announcement: we've taken Azure SQL Data Warehouse, we've changed the name, we've changed the logo, it's now called Azure Synapse Analytics. So all the docs and everything changed — except the actual product. It's the same thing; it's just Azure SQL Data Warehouse with another name. The confusing part is that it's part of an evolution, so there's a new, bigger thing coming, also called Azure Synapse Analytics. That's right, the new big crazy thing is called the same thing as the thing that's currently live — and they wonder why people are getting confused by their marketing. So we've got these two concepts: Azure Synapse Analytics (SQL Data Warehouse), and then the new thing, the thing that's just in preview, which is Azure Synapse Analytics workspaces, and that's so much more than just the data warehouse part. Synapse Analytics (SQL Data Warehouse) is still present in the new, big, all-together thing, but it's now called provisioned SQL pools. Get that as a naming evolution: we've gone from PDW — we knew where that was — to Azure SQL Data Warehouse — OK, it's strange that you name the product after what it's meant to be used for, but not all the use cases, fine — to Azure Synapse Analytics — OK, yeah, just a rebranding — to Azure Synapse Analytics, a bigger thing... oh, and the old thing is still part of it, but now it's called something different. Super, super confusing, and I get why people don't know what's going on.

So Azure Synapse Analytics, the SQL Data Warehouse version, is currently live. It's GA — generally available — meaning use it in production; it's a supported, full-blown, SLA-covered member of Azure. Azure Synapse Analytics workspaces, with the other things in there, is currently in public preview, so don't use it for production things: they're still building it, it's still buggy, and they're still building up the story that explains what's in this big new thing and why you should care.

So there's another slide, the marketing slide. This is the slide that every consultant under the sun just said "hey, this is what it's going to be" and pushed out — including Advancing Analytics. We were like "this is a new thing, look at all the boxes, that sounds great" — and it doesn't really tell the full story. Some great things: down the side we've got management, security, monitoring and the meta store, and having those all in one box, when previously they were managed in different places, is great — makes sense. The core confusing bit is all of those different languages. I've got SQL, Python, .NET, Java, Scala and R, and then I've got two different flavours — do I want provisioned or on-demand? — and it kind of looks like it's all a single story, but it's not. SQL I can use provisioned or on demand. The Spark side of things — when we're talking about Python, .NET, Java, Scala or R — doesn't work on demand; it's provisioned only, and that's not clear from the diagram. So it doesn't quite tell you how those things fit together: you have two different compute runtimes,
one that's SQL based and one that's Spark based, which is why you have that different functionality. And yeah, that's kind of what we've had to go on. So I've had a stab at pulling it back and saying let's just really simplify it, and let's talk about what's actually under the hood in this thing. Super easy: we have four boxes. First we'll talk about the compute options — what these different versions of SQL are, on-demand and not on-demand and all of that — then orchestration and getting things working, then storage, what it's based on, and then what else is in the workspace. Let's see if that makes a bit more sense than that one diagram.

So, options for compute. I've got two things on the SQL side. First, provisioned SQL pools, and that's when I want to say I know how many compute nodes I need: I want a certain number of SQL Servers in my cluster doing some parallel work, I want to provision it in advance, I want to say OK, turn my server on, and I expect I'll get some queries coming at some point. I get charged by the hour and I size it in data warehouse units — and if that's all sounding familiar, it's because that is exactly Azure SQL Data Warehouse. The rebranded badge of Synapse Analytics that used to be SQL Data Warehouse is now inside the Synapse Analytics workspace, now called provisioned SQL pools, and they are exactly the same thing. Provisioned SQL pools are Azure SQL Data Warehouse, but built into this larger workspace — so that should be super familiar if you're coming from there.

Now, on-demand SQL pools are something else entirely. That is a serverless version: I write a query, I hit go, and it instantly runs, and it doesn't run on any server I have to manage. I don't have to scale it, I don't have to say how big it should be, I don't have to predict workload in advance — great. I get charged per terabyte read, so if I use it a hell of a lot and I'm constantly hammering my lake, reading lots of data, that's going to cost me more than provisioning my SQL Data Warehouse. It's a balancing act, so it depends on how much I use it and what I do. But if I don't know yet, and I've got really, really unpredictable workloads, I can just use on-demand SQL pools really quickly to try and figure things out, get things going, and go from there — and that's great.

On the other side I've got my Spark element, so I've got provisioned Spark pools — there are no on-demand Spark pools, and again, you can't quite see that from that last diagram. So I can choose a Spark cluster: I can say I'd like a cluster with this many nodes, they should be this big, with this balance between memory and compute, all of that kind of thing. And I can run some queries, and that query can be Python or R or C#, it can be Spark SQL, it can be Scala — we've got lots of different flavours because it's Spark. The one weird one, the C# one, is kind of cool: that's the first time we've had a Spark engine where C# is a native language, which is great. Now, it's got to be said, this is not Databricks, and that's another thing that caused a lot of confusion — everyone was like "we've got Spark in Azure, it's called Databricks", which, to be fair, is what people said about HDInsight when Databricks came out.
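As a rough illustration of that on-demand (serverless) SQL pool pattern described above, here's a minimal sketch of the kind of query you'd run directly against parquet files sitting in the lake. The storage account, container and folder names are made up for illustration.

    -- Serverless ("on-demand") SQL pool: query parquet straight out of the lake.
    -- Nothing to provision or scale; you're billed on the data the query reads.
    -- The storage path below is hypothetical.
    SELECT TOP 100 result.*
    FROM OPENROWSET(
             BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/2020/*.parquet',
             FORMAT = 'PARQUET'
         ) AS result;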
So the Synapse Spark runtime is a different thing: it's much closer to vanilla Spark. If I went "I'm going to spin up a load of little local machines and install Spark on them by going to the Apache Spark foundation and downloading the latest cut of it", that would be a very similar version of Spark to what's currently installed inside Azure Synapse workspaces. If you go onto Databricks you'll see slightly different functionality — premium functionality. Essentially they've taken Spark and built a load of extra stuff around it so they can charge a premium Databricks price. They are two different things, but we do have Spark inside here. It runs on a cluster, so when you want to run it you've got to wait for your cluster to start, and so far that's been three or four minutes. It's pretty quick, but you do have that wait time — it's provisioned — and then when your queries are finished you can say "stay alive for 10 minutes, 15 minutes", whatever it happens to be. But that is something you have to think about.

Up at the top, lots of familiar friends: we've got Data Factory. For orchestration anywhere else in Azure we use Azure Data Factory, and now that's going to be part of Azure Synapse Analytics. So if you go into this workspace — this one pane of glass for doing everything — it's got a version of Data Factory in there, called "Orchestrate" or something like that, but if you look at it, it is just Data Factory. It's a slightly different version that doesn't have exact feature parity with the one that's live, because this is in preview, and I'm assuming that eventually they will just be the same thing. But you can log into Synapse Analytics and look at orchestration, then log into live ADF, and you'll see differences — things like the management layer only went over recently, because they're building stuff to make it Synapse friendly. I'm hoping that as soon as everything is live, they'll be the same product and it won't matter where you use it, or it'll even just be inside Synapse, but for now it's slightly different.

You've also got ADF mapping data flows. Now, I've kind of put that under orchestration, but actually it's another flavour of compute. Mapping data flows is essentially a draggy-droppy, GUI-driven Spark engine, so like in the old days of SSIS I can say: take some sources, combine these, do some aggregations, do a lookup, write it out to another source — and I can drag and drop to do all of that. When I hit go, that runs on a Spark engine, which again has to be provisioned, will stay alive for a certain amount of time, and gets charged for. So it's a different compute option — it's baked into the Orchestrate layer, but it's really a fourth compute choice.
And then finally we've got storage down at the bottom. When we provision Synapse Analytics we pick a lake for it to sit on, and that's just an existing Data Lake Storage Gen2 account — blob storage with the hierarchical namespace enabled — and it's exactly the same as it normally is. I just used a production one and it's all fine; it's only the ability to see it inside Synapse Analytics that's the new, preview-y bit. We've also got a metastore: as I go through doing things in Hive in Spark, or in SQL on-demand, I can save things. I can say "well, actually, there's a load of parquet over there, but I don't want people to have to go and find it in the lake each time — just call this dbo.table", and then people can select from it and it'll go off to that parquet. So I can have a metastore on top of my lake that describes all these different entities and objects, which is awesome. The plan, obviously, is that it works across languages. It's not full parity yet, but I can create some Hive tables in Spark and then read them from SQL on-demand, and that's great — I can just flex and switch between languages. Awesome.

The workspace elements are less about core functionality and more about what you get in the box that wraps all this stuff up. You've got the Studio, a new dev environment: a place I can go to write notebooks with IntelliSense, write SQL scripts, hook them together and manage all my codebase. I can use linked services from ADF, kind of, and say, well, here are all the things that Synapse is linked to, and use them elsewhere. So it's one big wrapper around it all with a bow on top. I've got monitoring, which again, if you're from Data Factory land, will be very familiar, but it is a one-stop monitoring plane. And I've got a management place where I can go and tinker and say: this is the size of my Spark pool, this is how many Spark pools I have, this is what I've done in SQL on-demand recently, this is how big my Azure SQL Data Warehouse (provisioned SQL pool) is. So lots of things all baked into one box. That, to me, is what Synapse Analytics is, and I wish they'd come out with a super simple diagram saying "it's just these things, guys", but it's taking a while.

A final thing to hammer home, given the different flavours we've got in here, is what's in preview. Provisioned SQL pools (aka SQL Data Warehouse) is the only bit here that's live, that you can use in production, that's supported, that has a full SLA behind it. Everything else — the Data Factory version that's inside Synapse, the lake integration inside Synapse, the whole of the SQL on-demand pools, the Synapse flavour of Spark — all of that is in public preview, and it's currently growing, having functionality added and having bugs found. So that's kind of the hardest question: what's live and what's not. The thing that used to be called SQL Data Warehouse that's now been rebranded — that's live; of course, it's been around for ages.
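As a rough sketch of that "friendly dbo name over parquet" idea on the SQL side — shown here as a plain view over OPENROWSET in a user database on the serverless endpoint, rather than the shared Spark/SQL metastore itself, and with a hypothetical path and made-up object names — it looks something like this:

    -- Give a folder of parquet files a dbo name so people can just SELECT from it.
    -- The lake path and names below are made up for illustration.
    CREATE VIEW dbo.Sales
    AS
    SELECT *
    FROM OPENROWSET(
             BULK 'https://mydatalake.dfs.core.windows.net/curated/sales/*.parquet',
             FORMAT = 'PARQUET'
         ) AS files;
    GO

    -- Consumers never need to know where the files live.
    SELECT COUNT(*) AS row_count FROM dbo.Sales;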
The rest of the stuff — aka Synapse workspaces — is not live yet; that's in public preview. Hopefully that helps clear those bits up. Another thing I was going to cover, finally, is those different compute options — the four computes. It's kind of tricky to say which one you should use; it almost feels like too many options, so which one do you actually pick to do your work? Let me give you a quick recap of the four: how they're charged and where you'd use them.

So, SQL Data Warehouse, aka provisioned SQL pools — I can't say that enough — is billed by the hour. If you say "I want six compute nodes", run a query, and then turn it down to two nodes, you're going to get charged for six for that full hour, and only then at the lowered scale: you get charged for the highest scale you've had within any given hour. So be careful, because if you're doing just small bits of work you can rack up costs really quickly. It's full T-SQL... well, it's not full T-SQL, it's the PDW variation, the MPP flavour of T-SQL, so certain things don't work — you can't do a recursive CTE, for instance, and there are restrictions on using a CTE when you're doing a CTAS or an INSERT INTO. But there are also extra bits of language, because we're dealing with SQL Data Warehouse: you have to know which distribution you're using, so when you create a table, should it be round-robin distribution, or hash distribution, or should it be replicated? These are all very MPP-related terms, so it's a slightly different flavour from the traditional SQL Server engine, and you need to be aware of that in some of the syntax you're using. So generally, what's it for? Huge data sets. If we're talking about data sets in the terabytes, SQL Data Warehouse is really good at scaling and working at that kind of size. If we're talking about huge, deep fact tables with only a few joins, it's really good at that kind of thing — big, hefty aggregations are where SQL Data Warehouse excels. And it's going to be a long time until I stop calling it SQL Data Warehouse rather than provisioned SQL pools, but we'll see.

On the other side, SQL on demand, as I said, is billed per terabyte read, so it's an awkward balance depending on how you're going to use it. If I write one query a day that reads a couple of terabytes, that's going to be far cheaper than leaving a SQL Data Warehouse turned on all day just in case I happen to write a query. But if I run that same query every ten minutes, that's going to blow my budget, so you've really got to be careful: how much you're going to use it dictates which one you should pick. For me it's for experimentation, quick things and ad-hoc, occasional workloads — SQL on demand is great for that and will end up being super, super cheap and really useful.
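Before getting back to the cost trade-offs, here's a minimal sketch of the MPP-specific T-SQL mentioned above for provisioned SQL pools — picking a distribution when you create a table, and transforming data with CTAS. Table and column names are made up for illustration.

    -- Distributed fact table: rows are spread across compute nodes by a hash
    -- of CustomerKey (ROUND_ROBIN and REPLICATE are the other options).
    CREATE TABLE dbo.FactSales
    (
        SaleId      BIGINT        NOT NULL,
        CustomerKey INT           NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    );

    -- Small dimension tables are often replicated to every compute node
    -- so joins don't have to move data around.
    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  INT           NOT NULL,
        CustomerName NVARCHAR(200) NOT NULL
    )
    WITH (DISTRIBUTION = REPLICATE);

    -- CTAS (CREATE TABLE AS SELECT) is the idiomatic way to transform data here.
    CREATE TABLE dbo.FactSales_ByCustomer
    WITH (DISTRIBUTION = ROUND_ROBIN)
    AS
    SELECT CustomerKey, SUM(Amount) AS TotalAmount
    FROM dbo.FactSales
    GROUP BY CustomerKey;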
On the on-demand side, though, things that touch a lot of data — if you're doing that a lot, you're going to rack up cost very quickly — so be a little careful about which one you choose for which workload. Maybe start off with SQL on demand, and then, when you've got a better idea of what your usage pattern looks like, you can say "we've just reached the point where we're now more expensive than just having a cluster", switch over, and use a provisioned cluster. That kind of makes sense. So yeah: ad-hoc, occasional access to lots of different things is where I see it fitting.

Mapping data flows, again, is billed Data Factory style, which is per execution of different activities, but when you execute it, it also has to spin up a Spark cluster, and while that Spark cluster is provisioned you're paying for it. So you've got a cluster cost as well as your Data Factory orchestration cost, and that cluster has an uptime: it'll finish the job and then you give it a time to live. You can say shut down immediately, or stay alive for the next 15 minutes in case another job comes in so it can execute immediately rather than having to start up again — but you're paying for that time while the cluster is turned on. And if you've got lots of things running occasionally, they might chain up so you've effectively got a cluster turned on permanently, and that's going to cost a lot of money. So you've got to be careful about how you balance that cluster uptime and the cost associated with it.

Finally, you've got Spark. Spark is billed in the same way as the mapping data flow clusters: it's a Spark cluster that has to be turned on. When you try to run your job it'll take three or four minutes to start, and you can set the cluster to stay on for an amount of time after the last query ran; if it gets no more queries it turns off again. Same problem: lots of occasional queries can play keepy-uppy, keeping that cluster alive, and that's going to cost you money. It can run Scala, Python, C#, SQL, R — the language flexibility is much greater on the Spark side. There's maybe more of a learning curve, because you're having to write code, so someone who's just used to SQL, or just used to graphical interfaces, is going to find it a bit steep when they first start, but it's so powerful. If you're doing a dynamic workflow — I want to set out a set of tasks and run the same thing for every single different thing that happens — I can write very generic workflows in Spark and apply them to a ton of stuff. It's very powerful. If I want to do machine learning, I'm probably going to do it in Spark, just because it's quite a rich ecosystem with tons of open-source stuff I can integrate and bring in as libraries. And if I'm going to work with anything really strange, like complex, gnarly, nested JSON data structures, or computer vision, analysing videos in real time, all that kind of stuff — I can do that in Spark; I can't really do that on the SQL side. So there's a use case for each of them.
So the graphical one is for your business analysts, the people who aren't that developer-savvy — they're not software engineers, they're not data engineers — they might want to use that so they can drag and drop and visualise the data flows. Then you've got a lot of people doing data science at scale, and they'd be writing Spark. And then there are a lot of traditional SQL analysts who want to use Management Studio and write queries against a proper Kimball warehouse and all of that kind of stuff — they'll want to be over on the other side. Now, just one thing to be aware of: if you've got a Spark cluster turned on, and a SQL Data Warehouse turned on, and people are writing SQL on demand, that's three different cost bases, and you're not getting economies across them. So you've got loads of flexibility, loads of choice, loads of options, but be a little careful between them, because you might end up paying three or four times what one system would cost. Everything's in one place, but with great power comes great ability to spend huge amounts of money.

All righty, so that is my quick overview of what Azure Synapse Analytics is and how it all fits together — I hope that's useful! Now, one of the first things I did was a walkthrough of my first 20 minutes of using Synapse, so you might want to find that video — it'll be linked over on one side — and you might find that useful. And if you haven't already, please subscribe and like the video, we do appreciate it, and we'll catch you next time. Cheers!
Info
Channel: Advancing Analytics
Views: 32,763
Rating: 4.9661493 out of 5
Keywords: Azure SQL Datawarehouse, MPP, Microsoft, Synapse, Tutorial
Id: 2DX7dgR8cEw
Length: 21min 43sec (1303 seconds)
Published: Tue Jun 16 2020