Advancing Synapse - Getting Started with HTAP & Azure Synapse Link

Captions
Hello, and welcome back to Advancing Synapse, where this week we're looking at a thing called Synapse Link. It's technically a feature for Cosmos DB, and it does a thing called HTAP: hybrid transactional analytical processing, otherwise known as "if you've got a massive OLTP system, slap a dirty great analytical layer on top of it that lets you query it without impacting your OLTP". That's what we're looking at. The whole thing is a feature built for Synapse, so that your Synapse workspace can query this analytical version of Cosmos without impacting anything; you can do lightning-fast analytical processing on your transactional layer. If you like the video, don't forget to like and subscribe; otherwise, let's have a quick look at how it works and how we set it up.

OK, so this is how they describe it on the blog. You've got your transactional store. It's Cosmos DB, so it's JSON, and in this case we're dealing with the SQL API (it's not enabled for all of the APIs yet, but it works with the SQL API and, I think, the Mongo API as well). You could be going in here doing massive, hugely scalable concurrent reads and writes. That's what OLTP is for: putting behind an application that has thousands and thousands of people hammering it and updating it, and it scales natively. But it's never been very good at analytics. You could never say "take my billions of rows and just aggregate them, give me these sums, give me these queries"; that's not really what it's for, and if you try, you consume massive amounts of compute and it gets really expensive.

So this is a feature you can now enable that says: whenever I make a change to my OLTP, essentially my document collection inside Cosmos DB, replicate it over into a little lake, which is basically blob storage with parquet that's automatically maintained and updated. I can go and update my records, change values, insert and load new records, and without me building any ETL or any logic whatsoever, it's going to update a load of parquet files that sit there ready for me to query. Then in Synapse I can hook that up and say "connect to that whole pool of parquet files" and query it as if it were local data. Really, really cool, and it gets rid of a whole slew of work.

Now, a little word of warning here. The whole HTAP thing, hybrid transactional analytical processing, is kind of an evocative statement. When they announced this, they said the days of warehousing are gone, we don't need to build analytical systems anymore, just do your analytics straight on your operational data. That is an absolute nightmare. Saying you don't need to shape your data in a way that makes sense for your business users, that you don't need centralized business logic, that you should just redo it every time: that makes no sense. So it's not a replacement for a good data platform, and it's not a replacement for semantic models, business logic and all that kind of stuff.

That said, since days gone by, people have tried to do operational reporting from their warehouses. You've got your operational OLTP with loads of stuff happening, then some extract, some transformation, some cleansing, some validation, some shaping, whatever, and it lands in your warehouse. And then people try to answer real-time operational questions, "what are my current stock levels", from that warehouse. That doesn't make any sense, because the data has gone through so many layers to get there.
So operational reporting, the "what's going on currently", the quick exploration, that kind of stuff: absolutely awesome to be able to do on your OLTP system. This HTAP thing is not a replacement for traditional analytics, and it's not a replacement for building your own lake and curating it properly, but it does unlock a huge amount of value and potential for querying your transactional systems directly, without having to build all that logic into your warehouse. So it's awesome, but it's not magic that's going to replace everything you currently do.

Let's have a look at what it is and how it works. Back into Azure, and we're going to need a new Cosmos DB, so let's go and create one. I've had this be hit and miss: sometimes it works straight away, sometimes it takes a while thinking about things, so let's see how it goes. I'm creating a new account, calling it "advancing-link", and I've got my choice of APIs. Again, this only works if I'm using the Core (SQL) API, or, I think, the Mongo API. I want the notebooks preview, because being able to use notebooks inside Cosmos is great (if you haven't used that, give it a go), and I'll put it close to me so it doesn't cost the earth. I can choose to make it free, which I always like: if you don't already use the Cosmos DB free tier, have a play, it doesn't cost you anything, which is perfect for this kind of experimentation. It's not production, I don't want geo-redundancy and all the things that cost lots of money, so I'll create it as is.

Off it goes creating my new Cosmos DB. There are a couple of things I need to enable in here. One thing to note: you can't enable Synapse Link, this separate little inbuilt lake, on an existing container. You can't just take a load of existing data and say "copy this in"; you need to create it as a new container. It's currently in public preview, and I'm hoping that by the time it goes GA you'll be able to convert an existing container, because all this is really doing is an extract (take all my data, compress it, land it as parquet), so it shouldn't be that hard. But for now, it has to be a new container.

It's still going, still thinking about it, and when it's done we can write some stuff. I've created some logic already: we're going to create some dummy records, insert them, then try querying them and see how it all looks on the other side. While we wait, I can talk through the code. First things first, we're using the Python SDK, so nothing particularly special here. We create a client, then I create a database and a container, and when we create that container I need to tell it that I want the analytical store enabled for this container. That's the thing that currently has to be done at creation time.

Then I've got a little record generator, just grabbed off the Cosmos GitHub. It's a dummy record describing a sale: made-up fields, pulling in the current date-time and some random figures. Essentially, I'm trying to generate some data, so we don't need to worry too much about what that code is doing; it's just building an individual object. And I've got a little loop saying: ten times round the loop, generate a random object and call container.create_item to insert it into my container. Really simple. Dump this into a notebook, run it, and it seeds the container with some sample data. Nothing particularly special happening in there; roughly, it looks like the sketch below.
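A minimal sketch of that setup with the azure-cosmos Python SDK. The endpoint, key and the field names on the dummy sale record are placeholders modelled on the demo; the important bit is analytical_storage_ttl, which is what switches the analytical store on for the new container:

```python
import random
import uuid
from datetime import datetime, timezone

from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint and key for the demo account created above
client = CosmosClient(
    "https://advancing-link.documents.azure.com:443/",
    credential="<account-key>",
)

# Create the database and a container with the analytical store enabled.
# analytical_storage_ttl=-1 means analytical data never expires; this is
# the flag that makes it an HTAP-enabled container.
database = client.create_database_if_not_exists("online-sales")
container = database.create_container_if_not_exists(
    id="sales",
    partition_key=PartitionKey(path="/saleId"),
    analytical_storage_ttl=-1,
)

# Seed ten dummy "sale" records, roughly like the generator in the video
for _ in range(10):
    sale = {
        "id": str(uuid.uuid4()),
        "saleId": str(uuid.uuid4()),
        "saleDate": datetime.now(timezone.utc).isoformat(),
        "totalDue": round(random.uniform(10.0, 2500.0), 2),
    }
    container.create_item(sale)
```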
OK, it's still thinking about deploying my Cosmos DB, so while we wait we can go and have a look in Synapse. What we have to do is use a special type of Cosmos linked service. In my Studio, over on the Data side, one of the options under "connect to external data" (essentially the fancy connections) is a data lake, or one of two different types of Cosmos DB integration: a SQL API integration or a MongoDB API integration. When we pick one, it asks a lot of questions and goes off to find our Cosmos DB. This is what creates the link that says "you're now part of my Synapse workspace, I'm going to be able to see you". It's quite nice in that we'll see it under Databases, as part of the actual data estate inside our Synapse account.

Right, finally; that took a little while. We can go in, we've got our Cosmos DB account, and there are a few things to do. First and most importantly, I need to enable the feature. There's that "Azure Synapse Link" feature, currently with status off, and I say yes, I'd like to use that. It gives me a bit of information about what it is, what HTAP is, all that kind of stuff, and I hit enable; it goes off, thinks about it, and sets that up for me.

While that's working, we can try to create a container. Hit "create container": do I have the option yet? No, not yet. We're waiting for that option; it's the new thing you'll see as soon as the feature is enabled, an analytical store toggle I can turn on. So, important: the feature needs to be on, needs to be set up, before you can start creating containers configured the right way.

While it's thinking about that, we can get our notebook ready. I'll create a new notebook and copy over the code I had in VS Code, the Cosmos imports and so on. It feels like a normal notebook; it's not quite the same (things like the cell outputs aren't identical, and you don't have the full history), but generally it's just a notebook. And I think I just saw that the feature is now enabled, which is good; we can go and use it. There we go. So, a straight copy of that code over, nothing special happening there, but it's ready for us to use. And just to check: yes, we've now enabled Synapse Link for our Cosmos DB. It has provisioned its own little data lake store, essentially blob storage that things get put into; I think it's ADLS Gen2.
OK, so we can now create our container. Add container: where does it need to go? I want the database to be called "online sales"; it's got some throughput; I'll call the container "sales"; and I need to give it a partition key, which is "saleId". Then: analytical store. I can now turn that on and off, because I've enabled the feature on my Cosmos DB account; every time I create a container, I get to decide whether HTAP is enabled for it, yes or no. So that goes off and gets created. Now, I could do this in a bit of Python script, or via the C# SDK; it's just nice to show you the option, because you know your Cosmos DB is set up for Synapse Link if you see it on your container.

I'll be able to see it now: refresh, and we've got "online sales" with our "sales" container ready to go. If we go into Settings, you can see we've now got the "analytical storage time to live", i.e. how long data lives in the analytical store. You know a container is HTAP-enabled if you see that option in there.

So that's good and that's working. In our notebook... oh, Cosmos DB. I've had this before: if you step away from your notebook and haven't hit save, it forgets everything you had. It's a bit like Synapse, in that there's this working copy as you move around, and the moment you leave the notebook, it's wiped. So we copy it all in again: painful, but not the end of the world. Let's get that in, and then the final bit: point it at the database and container, run it, and see how it goes. We said the database is called "online sales", and the container is just "sales"; imaginative. OK, run it. It's a little different from other notebook implementations in that I don't really get any feedback. I could print things out and force some output, but you don't get an "I succeeded, this is how long it took"; it just says "yes, I ran, be happy about that". Then I run the insert loop at the bottom. Looks like it ran. I'll hit save so I don't lose the code (step off this page without saving and it's gone again), go to Items, hit refresh, and there we go: I've got some items, my randomly generated data. So despite giving me no output, that notebook was actually doing things: it created these ten records inside my Cosmos DB collection. A quick count query, like the sketch below, is an easy way to double-check that the seed ran.

So: I've got Cosmos DB, and I've enabled my analytical store. Next, you go into Synapse and say "can you go and query it, can you talk to Cosmos DB properly?". Step over and do what we did before: connect to external data, it gives me that choice, I want to talk to Cosmos DB, and I'll call this one "online sales htap". Now, I have sneakily gone and enabled interactive authoring. If you don't, it will let you pick your subscription and your Cosmos DB account, but it won't tell you what databases there are; that's that weird ADF thing again. I pick "online sales", it finds the database I created, and when I hit create, it creates the linked service and also surfaces it as part of the data I can see. So if I hit refresh over here, then under Linked I've now got Cosmos DB, I can see online sales, and... there we go.
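As an aside, if you'd rather double-check the seed from code than from the portal, a quick cross-partition count against the transactional store does the job. A sketch, reusing the container object from the seeding script above:

```python
# Count the documents that landed in the container. Note this reads the
# transactional (OLTP) store, not the analytical store.
count = list(container.query_items(
    query="SELECT VALUE COUNT(1) FROM c",
    enable_cross_partition_query=True,
))[0]
print(f"{count} documents in the sales container")
```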
OK, let's do a little test just to show you this. I'll go back into Cosmos DB and create a container that is not enabled for this: inside "online sales" again, I'll call this one "products", and I'll turn the analytical store off. It's a different container, without the HTAP bit. So there are two types of container you can see when using Synapse: the ones that are pure OLTP, with no separate parquet store, and the ones that are full parquet, nice fast analytical stores. Synapse differentiates them by icon, so you know what you're connecting to depending on whether you see the HTAP icon or the OLTP icon.

Just to check, inside here we should have my two containers now: products and sales. Go into Settings: one doesn't have the analytical time to live, and one does. Again, the "analytical storage time to live" only exists if the analytical store is enabled. Over on the Synapse side, if I come in here and do another refresh, I should see both containers. There we go. Products has the classic icon, just a load of squares: that's OLTP. Sales has a load of squares with some bar charts, which I assume represent column store compressed segments: that means it's HTAP-enabled. So you can tell whether the analytical store has actually been turned on by looking for that icon.

Let's try and view it. New notebook, and I get some options. We can load to a DataFrame: give me a DataFrame, I want to start writing some PySpark. We can write to a container, pushing data into Cosmos; note that doesn't go via HTAP, it talks straight to the OLTP side. I can save it as a Hive table so I can reuse it. Or I can go streaming, so that as soon as I insert data it streams out to me (I think streaming goes via the change feed, but we'll see). Loads of options, but "load to DataFrame" is the easiest: the straight "I want you to load it up".

It has automatically generated the code for me; "advancingspark" is my Spark pool, I choose to use that existing pool, and I hit go, which starts up my Spark session. Let's look at what it's doing. It's a spark.read with its own format: there's a built-in driver for this Cosmos DB analytical store called "cosmos.olap", and it picks out a linked service and a container. That's nice, because it means I don't have to do the whole routine of going to my key store, bringing back my username and password, figuring out a managed identity, all those connectivity bits; it's baked in. I'm just saying: there's already a linked service, go to this container, bring it back for me. And it comes in as a DataFrame (it always annoys me that the generated code isn't formatted nicely, with all those trailing slashes, but there we go). As soon as it's a DataFrame, we can do whatever we want with it; it's the same as any other DataFrame. The generated cell looks roughly like the sketch below.
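A sketch of what the generated "load to DataFrame" cell looks like. cosmos.olap is the built-in format for the analytical store; the linked service name here is a placeholder for whatever you called yours when you connected the account:

```python
# Read the Cosmos DB analytical store through the Synapse linked service;
# no keys or connection strings are needed in the notebook itself
df = spark.read\
    .format("cosmos.olap")\
    .option("spark.synapse.linkedService", "OnlineSalesHtap")\
    .option("spark.cosmos.container", "sales")\
    .load()

df.show(10)
```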
While that's loading, let's write the next cell. My DataFrame is df, and we're going to want an aggregate, so we'll use the PySpark functions: from pyspark.sql, import the functions module. OK, and the previous cell has now run: my Spark session started and it kicked off the earlier job to show those first ten rows. There we go, I've got a load of data back, so that's working. I can see my data, and it's happily handling nested objects: you can have a struct, you can have nested arrays, all the complexity you get in JSON is properly represented; it's not going to do anything nasty to it.

So, what are we writing down here? I want an aggregate; I just want to sum a column. What have we got? Let's have a look... "totalDue", I think, makes the most sense. So I'm saying: replace my DataFrame with an aggregated version of it, summing the column called totalDue, then df.show() to see the results. Run it, and there we go: about 12,700, which is how much money my randomly generated records have made. It went in, added it all up, and that worked quite nicely. So that is OLTP, smashed down into parquet, brought back into Synapse, without a lot of connectivity work and without building a load of integrations. And now that the linked service is set up, I can just keep bringing new stuff in. The aggregation cell looks roughly like the sketch below.
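The aggregation is standard PySpark; the only assumption here is the column name, totalDue, from the dummy sale records:

```python
from pyspark.sql import functions as F

# Collapse the whole analytical store down to a single total
df = df.agg(F.sum("totalDue").alias("totalDue"))
df.show()
```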
So let's do a quick test. Back into my Cosmos DB, back into my notebook (which hopefully hasn't forgotten everything), make sure it's got the right context, and run it again... no, that didn't run properly, so: first cell, second cell, third cell. OK, that's gone in and created another load of records. If we look at my sales items, I've got more than nine records now, a lot more, probably eighteen. Then I can run the aggregate again; we're expecting that twelve thousand seven hundred to move, but it's random numbers, so we'll see.

Now, how quickly will the change come through? It hasn't yet; it's going back, looking at that data, bringing that JSON back again. What I'm trying to see is how quickly a change to the data is actually reflected in my Spark environment. Earlier I was tricking myself, wondering if I had to go back and recreate the DataFrame. I don't think I do; it would be madness if I did. It's not caching; I think it goes back to the source each time. So: grab my DataFrame definition, do a show, make sure it's working properly. It looks like it's not instantaneous, so it's not going to give me immediate real-time results. I don't know what the latency is, a few minutes or a few seconds, but I guess the point is that I'm not doing any ELT or ETL, I'm not extracting data, and I'm not building routines that poll for changes and decide when to push data across. It just comes across, ready for when I need it. Hmm, that's just not coming through at all... OK, I'll push some more data through, because maybe it hasn't been triggered. Run it again, there's even more data in there, and let's check whether it's coming through and how much data we've got.

OK, so I don't know whether it wasn't triggering a rebuild because not enough records had changed, or whether it's just fairly slow polling. It's not instantaneous and it's not a real-time stream; it's an asynchronous stream that gets pushed in there, and you can pull reporting from it. But how real-time is real time? We could connect to the Cosmos DB change feed and have it send each and every record, handle that in a streaming DataFrame, land it into Spark, and have it as real-time as the latency of sending each record allows (there's a rough sketch of that streaming route below). Or we can just click this button, and within a minute or two my users are querying the actual data in the system. So if you're trying to put this behind something like an online store asking "is this still in stock?", it's probably not the right tool; there's a little bit of latency. But if you're building it so people can do operational reporting throughout the day, "what have we sold this morning, how are things going, how is this selling", then you can do that directly off it: build out some Spark notebooks and start working with it.

One of the really cool things is that it's not just available in Spark: I think SQL on-demand can also access this stuff, meaning people writing SQL queries can hit refresh and query the same data.

Now, admittedly, it's not perfect; there are some limitations. It's inferring data types for the parquet, so if something has been written down as one data type and you change the data you're sending in, you'll get some clashes, some collisions. There's some inflexibility, when you're using the analytical store, around how much you can change the object. One of the selling points of Cosmos DB is that it's NoSQL, just a giant pile of JSON objects whose structure and schema you can change entirely, but that confuses some of the indexing and the structure of things, and obviously the parquet files created from it will be very confused if you massively change the object every time. So it's not a magic bullet, but what it does do, it does really well: we can go in, query the data, and look at it in almost real time, and for most people a minute or two away is as close to real time as you need, especially because you're not building ETL or giant data extraction routines to do all this stuff.
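For comparison, the genuinely real-time route mentioned above, reading the Cosmos DB change feed as a structured streaming DataFrame, looks something like this in a Synapse notebook. The changeFeed option names are as I recall them from the preview-era docs, so treat this as a sketch and check the current connector reference before relying on it:

```python
# Stream the Cosmos DB change feed (this reads the transactional store,
# not the analytical store) into a structured streaming DataFrame
df_stream = spark.readStream\
    .format("cosmos.oltp")\
    .option("spark.synapse.linkedService", "OnlineSalesHtap")\
    .option("spark.cosmos.container", "sales")\
    .option("spark.cosmos.changeFeed.readEnabled", "true")\
    .option("spark.cosmos.changeFeed.startFromTheBeginning", "true")\
    .option("spark.cosmos.changeFeed.checkpointLocation", "/tmp/changefeed-checkpoint")\
    .load()

# Dump each micro-batch to the console just to watch records arrive
query = df_stream.writeStream\
    .format("console")\
    .outputMode("append")\
    .start()
```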
So that's what Synapse Link is. Synapse Link is the analytical store behind Cosmos DB: whenever you change your collections, whenever you make updates to your OLTP store, your big pile of JSON objects, it asynchronously replicates them down to a parquet store, the analytical store baked inside Cosmos DB; and the Synapse Link linked service lets you query that analytical store directly. You can go in, build out a load of logic and a load of routines, and query your transactional database directly without impacting transactional compute. And that, I guess, is the big thing: think of the number of analytical systems we've built where you're told you can only extract data at night, you can't impact business-critical users, people are trying to use that CRM system, so you're not allowed to query it. This whole thing gets around that: essentially, you're querying a different store. It's like having a replicated report server, but without all the work it takes to build a replicated report server, just to expose your data to your analytical audience.

So, HTAP: it's cool. Again, it's not going to replace your entire analytics platform, but if you're currently using Cosmos DB, or you're deciding where to place an application that needs high-scale OLTP, then Cosmos DB is a great home for it (especially with the free tier if it's a small application), and the ability to query it directly from within Synapse is really cool. Now, I want to have a look at whether I can query that analytical store from elsewhere. Can I hook Databricks up to it, or do I have to push it into Synapse, land it, and then query it from other things? I don't know how accessible that analytical store is currently, and I don't know how easy it is to get around some of those limitations about changing data types. Again, we'll see.

So, if you think this is going to be useful, if you're excited and looking forward to using Azure Synapse Link as soon as it comes out (it's still in public preview, not GA yet), then let us know in the comments and tell us what you think. Otherwise, there are some videos for you to go and watch; subscribe to the channel if you're new to us, and we'll see you next time. Cheers!
Info
Channel: Advancing Analytics
Views: 1,698
Keywords: cosmosdb, spark, data engineering, azure, Synapse Analytics
Id: DWoW7ayPzJ4
Length: 26min 32sec (1592 seconds)
Published: Tue Jul 28 2020