Building a Data Warehouse Dimensional Model using Azure Synapse Analytics SQL Serverless

Captions
Hello everyone, welcome to one more session of Cloud Lunch and Learn. I have here with me today Andy Cutler, who is going to talk about building that DW dimensional model. Before we jump in and talk with Andy, I'd like you to scan the QR code and get free access to the resources we've got for you from Microsoft, and please provide your feedback and your insights, that would be really appreciated. I'd like to say thank you again to Microsoft as our sponsor. Andy, are you there? Hello Andy, how are you?

I'm very well, thank you.

Good. Tell me a little bit about yourself before we jump into this amazing session.

Yeah, sure. My name's Andy Cutler. I'm an independent consultant and contractor, and I primarily work with the Azure data platform: Synapse Analytics, Data Factory, Power BI. I've been using the Microsoft BI stack for well over 15 or 16 years, starting with the traditional SSIS and SSAS stack and then slowly transitioning, from around 2016, into the Azure data platform. That's why I'm here now talking about Synapse Analytics.

That's great, it looks like you have amazing experience, and I'm really looking forward to what you have to share. I'd like to highlight to the audience that if at any point you have questions, just type them into the chat; I'll be monitoring the social media channels and will bring them to Andy so he can answer them. The floor is yours, so feel free to share your screen and proceed.

I will share, let me just add it here. Okay, here we go, so let me know when you can see my screen.

Yes, we can see it. Thank you very much.

Well, firstly thank you very much Victor, and thank you very much Lisa, for inviting me on to talk about building a data warehouse dimensional model using the serverless SQL pools engine within Synapse Analytics. That title is a bit of a mouthful, I do appreciate that, but we're going to break it down step by step as we go through this session.

A little bit about me: I'm an independent consultant and contractor working with the Azure data platform. If you scan the QR code it'll take you to my Twitter, if you'd like to connect and carry on the Synapse Analytics conversation there, and there are my LinkedIn details and the website I blog at.

So what are we looking to get out of this session? We're going to talk a little bit about what Synapse Analytics is and the various services that are available within Synapse, then we'll drill into the serverless SQL pools service itself: what do we use it for, what can we use it for? We'll look at data warehousing options, and then we'll start to look at how we actually use serverless SQL itself. We'll look at creating the Synapse database environment in serverless, then connecting to our source data, and then how we can use serverless to take that data and transform it into a more analytical model, into dimensions and facts. We'll look at reading that dimensional data as well, and we'll finish with a few points around serverless SQL pools considerations.

Synapse Analytics itself is actually a set of services. We've got three pool engines in there: dedicated, serverless and Spark. Some of you may have already used dedicated SQL pools for a few years; they used to be known as the SQL Data Warehouse service. That's where you actually import data into a SQL database, into the dedicated SQL pools engine itself.
Serverless, on the other hand, is a SQL-based service that does not contain any data itself; it allows us to connect to data stored in Azure Storage, in Data Lake Gen1 and Gen2, and also Cosmos DB. With Spark pools (I think Spark 3.1 is in preview) we can bring our Python and Scala workloads into Synapse Analytics. We also have pipelines, which is Data Factory, and we've got Power BI integration as well.

Diving into serverless SQL pools themselves: it's a SQL-based engine, that's the crux of it, built into Synapse Analytics, so when you create a Synapse Analytics workspace you get serverless SQL pools by default. It allows us to not only read but actually write data to Azure Storage, using delimited files, Parquet files and nested data types like JSON, and to read from Cosmos DB. The important thing to note is that no data is stored within the serverless SQL pools engine itself; it is just an engine. If you'd like to read more about it, it's called the Polaris engine, and I can tweet a link out to the paper. We can write data back to Azure Storage, and we have familiar objects available within a serverless SQL pools database: we can create external tables that point to our data, views over that data, databases to house these objects, and stored procedures, and we've got access to some DMVs to help us administer and manage serverless SQL pools.

One thing to note is that the cost model of serverless SQL pools is all about how much data you're processing. It's not the amount of compute, and it isn't the length of time the compute is running; it's how much data you're processing, both for reading and for writing, and it's currently around five dollars per terabyte of data processed. Azure and Synapse Analytics isn't the only vendor working in this space: I'd like to call out Google BigQuery, Amazon Redshift (especially Amazon Redshift Spectrum) and Snowflake. Competition is good, because everyone is trying to get the best features into their offering.

Now we're going to dive into the serverless SQL pools scenario itself. I was interested in this service in terms of: can I build a data warehouse using serverless SQL pools as the processing engine? Why do I want to do this? Firstly, I'd like to leverage my existing SQL skills; I'll be honest, I'm not much of a Python developer, nor am I a Scala developer, but I'd like to use my SQL skills in a big data processing engine. Perhaps I have a lot of data stored in my data lake that I would like to take advantage of without having to pull it into a proprietary database system. I'm also interested in the data lakehouse concept: bringing relational constructs to data in a data lake.

In Microsoft's official documentation they actually state three use cases for serverless SQL pools. The first is ad hoc data exploration, in which you connect to data in a data lake and write SQL to process it; you might be crunching and aggregating a whole bunch of event and activity data stored in the lake.
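As a flavour of that exploration use case, a query can be written directly over files in the lake with OPENROWSET. This is only a minimal sketch; the storage account, folder path and EventType column below are invented for illustration, not taken from the demo:

    -- Aggregate event data sitting in the lake as Parquet, without loading it into a database first.
    -- The account, container, path and EventType column are hypothetical.
    SELECT e.EventType, COUNT(*) AS EventCount
    FROM OPENROWSET(
            BULK 'https://mydatalake.dfs.core.windows.net/raw/events/*/*.parquet',
            FORMAT = 'PARQUET'
        ) AS e
    GROUP BY e.EventType;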
The second use case is the concept of a logical data warehouse, in which we cast structure over data without moving that data; we'll explore how we advance that option later on. And lastly, data transformation: because serverless SQL pools is based on SQL, we've got the full power of the SQL syntax (some functionality isn't available) to help us process data as well.

When we're looking at setting up the Synapse environment, we can start by creating a serverless database that can house our schemas, our security, our views, all our objects. The first thing I'll do is create a database, and I can create schemas just like I would in a dedicated SQL pool, Azure SQL Database or SQL Server. For this demonstration we're going to be using Azure Active Directory authentication to pass through from Synapse Studio, when I'm writing queries, via the serverless service down to the data lake; there are multiple security mechanisms available, but here we're using Active Directory. Then I create an external data source, which tells serverless SQL where my data lake is. In Azure Storage you have various URLs that can be used to access the various services, and one of those is the DFS protocol, which is the data lake URL. I also need to tell serverless what type of data I'm processing and how my delimited text is actually delimited; in this example my field terminator is a pipe and the first row of data is row two.

When we create objects within serverless SQL pools, we've got two choices for how we cast structure over data in the data lake: external tables and views. There are differences between these two objects and their functionality. When you're creating an external table you must define the schema, the columns and their data types, and you're not able to do something called partition pruning: if your data lake has a folder structure such as year/month/day where you're partitioning data, yes you can select that data, yes you can filter on it, but you won't be able to specifically target source folders. With views we don't need to define a schema, we can select everything if we want to, but more importantly there's a function called filepath in serverless SQL pools with which we can expose the source folder hierarchy, so we're able to partition prune by filtering on it. The rest of this session will use views to cast structure over the data, and external tables to write data.

For our initial load, we want to create some views over the data that's in our data lake. We're going to create a view and, as you can see, it's using CSV files; we've specified the data source, which is the data lake URL, and the folder our CSVs are stored in, and I'll do that for the fact data as well. Note that within that syntax we have the filepath function, and within the BULK path we have asterisks, or wildcards. Those filepath positions point to those wildcards: filepath(1) points to the order year folder, filepath(2) points to the order month, and so on.
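To make that concrete, here is a hedged sketch of the kind of data source and CSV view being described. The storage URL, container, schema, folder layout, object and column names are all assumptions for illustration, not the demo's exact code:

    -- Database-scoped data source pointing at the data lake; with AAD passthrough no credential is needed here.
    CREATE EXTERNAL DATA SOURCE ExternalDataSourceDataLake
    WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/sourcedata');
    GO

    CREATE SCHEMA LDW;
    GO

    -- View casting structure over pipe-delimited CSV files partitioned by year/month/day folders.
    -- filepath(n) returns the folder matched by the nth wildcard, which is what enables partition pruning.
    CREATE VIEW LDW.vwSalesOrders
    AS
    SELECT
        CAST(fct.filepath(1) AS INT) AS OrderYear,
        CAST(fct.filepath(2) AS INT) AS OrderMonth,
        CAST(fct.filepath(3) AS INT) AS OrderDay,
        fct.*
    FROM OPENROWSET(
            BULK 'salesorders/*/*/*/*.csv',
            DATA_SOURCE = 'ExternalDataSourceDataLake',
            FORMAT = 'CSV',
            PARSER_VERSION = '2.0',
            HEADER_ROW = TRUE,          -- header on row one, so data starts on row two
            FIELDTERMINATOR = '|'
        ) AS fct;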
Now, the dimensional model is a methodology that has existed for many years, and with it we're looking to describe our data in two ways: measurements, or facts, the things we want to measure, and dimensions, the ways in which we want to measure something. We can think of dimensions as the reference data, the report labels, and of fact tables as the measurements themselves. The reason I'm marrying these two together is that Microsoft has a best practices document for serverless SQL pools. It tells you about co-locating storage accounts and optimising data types, and the one thing that did catch my eye is that it states: use integer-based data types if possible for your sort, join and group by operations. There's a screenshot from the best practices document itself. And where have I seen integers before? Surrogate keys in data warehousing. Just to pause on this for a minute: even if you were not creating a dimensional structure, it is still of benefit to join your data in serverless SQL pools on integers, because if you're joining on character data the amount of data processed may actually be larger than with integer-based joins, so that's an important consideration.

For the initial dimension loading we are going to use the CREATE EXTERNAL TABLE syntax, but as CETAS: CREATE EXTERNAL TABLE AS SELECT. We write a SQL query, we generate a row number as a surrogate key, and we join our source views, because essentially we are denormalising our data, and we save that data back to the data lake in Parquet file format. Parquet is a compressed, column-store file format that's very good for data warehousing and analytical workloads. Ultimately we're writing that data out as Parquet and using a row number as the surrogate key.

For the fact loading we follow the same process. Our source data is in a folder structure, and we'll come back to that when we go through the demonstration, but ultimately we are loading all of the data from our sales order lines and sales order views and writing that data out to the data lake as Parquet.
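A hedged sketch of what that dimension load can look like with CETAS; the file format, schema, location and the view and column names are placeholders rather than the exact demo objects:

    -- Parquet file format used by the CREATE EXTERNAL TABLE AS SELECT statements.
    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);
    GO

    CREATE SCHEMA STG;
    GO

    -- CETAS: denormalise the source views, stamp an integer surrogate key, and write Parquet back to the lake.
    CREATE EXTERNAL TABLE STG.DimStockItem
    WITH (
        LOCATION = 'conformed/dimensions/DimStockItem',
        DATA_SOURCE = ExternalDataSourceDataLake,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT
        ROW_NUMBER() OVER (ORDER BY si.StockItemID) AS StockItemKey,  -- integer surrogate key
        si.StockItemID,                                               -- natural key kept for fact lookups
        si.StockItemName,
        c.ColorName,
        pt.PackageTypeName
    FROM LDW.vwStockItems si
    LEFT JOIN LDW.vwColors c ON si.ColorID = c.ColorID
    LEFT JOIN LDW.vwPackageTypes pt ON si.PackageTypeID = pt.PackageTypeID;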
Ultimately, when we're selecting from the data, we don't actually want to use the external table. It's important to note that in this process I'm using the CREATE EXTERNAL TABLE AS SELECT syntax purely to write the data out using serverless SQL; I can then drop that table, the data itself in the data lake will not be dropped, and I can create views on top of that data. So I'm creating a view to select my DimStockItem dimension purely from that processed Parquet file, and I can create aggregate queries over that data by joining my fact data together with my dimension data.

A few considerations: we're charged based on the amount of data that's being processed, not the time or the processing power itself, so there are no compute time or compute tier costs, it's purely the amount of data processed. There's no caching of the data retrieved, so if a query performs the exact same operation and the data hasn't changed, there won't be any caching; it will add to your data processed again. And currently the data is immutable: you can write data into your data lake, but you're not able to update that data. You can with Delta, but that's outside the scope of this session.

Now I'd like to jump into a live demo and go through the process of creating these objects and writing the data out, so you can see what's happening. The one thing I'd just like to do is change that setting; I think it was jumping between sizes there.

So, we've created a database already, LunchAndLearn, and I'm going to create an external data source for my data lake account, specifying the URL. I'm going to create a file format for Parquet; I'm not going to create a file format for CSV, purely because I'm using views to connect to that CSV data. I'm going to create my schemas to store my objects in, I'll run an ALTER DATABASE command to support UTF-8 source data, and the last thing I'll do is run the security setup. This environment will be taken down after this demonstration, by the way, and my user identity is my own Azure Active Directory account, which will be used to pass through all the way down to the data lake.

After we've set our objects up, we'll look at our views. The syntax for our views is: we create a view and use the OPENROWSET command, we point to the folder location our CSV data is in, we specify the URL, i.e. the data source for our data lake, the format, parser version 2.0 (which is a faster CSV and delimited parser engine), a header row and a field terminator. So I've got stock items, colors and package types, and I've got my sales orders and my sales order lines data, and as you can see I've got this filepath function that I'm using to expose the hierarchical folder structure. I'll run those views, and if we have a look at our database... we don't seem to have our LunchAndLearn database showing there yet; that's okay, we'll go ahead and create our dimensions.

I'm going to skip back to the data lake just to show you the source data those views are pulling from: we've got our warehouse colors, our package types and our stock items as CSV files, and we've got our sales data as well, which is partitioned by year, month and day. I'm now going to run the CREATE EXTERNAL TABLE AS SELECT command, selecting from those views, and I specify the location I want that Parquet data to be stored in. I'll run that, and if we jump back to our data lake we can see that we have a new folder with our data inside, and we'll create a view over that data which points to the Parquet files. Lastly we'll create a DimDate dimension that is just based on a CSV stored in the data lake as well. After that we'll do a select all from our view, just to see that the data is being pulled through and denormalised, so we have our color name and package type.

Then we'd like to use the sales data we have, plus the dimensions we've created, to create a fact. Again I'm going to use the CREATE EXTERNAL TABLE AS SELECT syntax, joining to my dimension to get my surrogate key; if I run this process I'm in effect getting all of the data from my sales orders, and the last thing I'll do is create my view over that data. If I go back to the data lake we can see that we now have a facts folder, and if we go into that facts folder and into FactSales, we have our Parquet files available.
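Roughly, the fact load and the switch from external table to view might look like the sketch below; the object names, key columns and folder locations are illustrative assumptions rather than the demo's exact code:

    -- Write the denormalised fact out as Parquet, looking up the integer surrogate key from the dimension.
    CREATE EXTERNAL TABLE STG.FactSales
    WITH (
        LOCATION = 'conformed/facts/FactSales',
        DATA_SOURCE = ExternalDataSourceDataLake,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT
        d.StockItemKey,          -- surrogate key from the dimension
        so.OrderDate,
        sol.Quantity,
        sol.UnitPrice
    FROM LDW.vwSalesOrderLines sol
    INNER JOIN LDW.vwSalesOrders so ON sol.OrderID = so.OrderID
    INNER JOIN STG.DimStockItem d ON sol.StockItemID = d.StockItemID;
    GO

    -- The external table only exists to trigger the write; dropping it leaves the Parquet files in the lake.
    DROP EXTERNAL TABLE STG.FactSales;
    GO

    -- Read the written Parquet through a view rather than through the external table.
    CREATE VIEW LDW.vwFactSales
    AS
    SELECT *
    FROM OPENROWSET(
            BULK 'conformed/facts/FactSales/*.parquet',
            DATA_SOURCE = 'ExternalDataSourceDataLake',
            FORMAT = 'PARQUET'
        ) AS f;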
Just to note that in our source folders, if we go back up to our top-level folder and look in our sales order folder, we actually have three folders here, because our data is being exported by day; it could even be by hour or by minute, but we have a folder structure there. This is an important consideration when you're dealing with a large amount of data: usually you'll want to partition your data in the data lake with a hierarchical folder structure, however what we've done is simply load it all into one destination folder. Depending on how much source data there was, that determines how many Parquet files are written out; serverless SQL pools doesn't give you any options in terms of how many files or how many rows per file. I do have a blog post that goes through different volumes of rows being written out to a data lake via serverless and how many Parquet files, and how many rows per Parquet file, get written; links are at the end of the session.

Okay, so now that we've created our fact data, we can use these views like regular views that we'd select data from, so we've got our weekday name, we've got our color, and we've aggregated the total quantity. That's fine. However, going back to what I was saying, we loaded all of that data from separate folders into one folder. What we can do is break that down, so that we load a day's worth, or a folder's worth, of data from the source into a destination folder structure that matches the same pattern.

One of the supported objects we have within serverless SQL pools is stored procedures, so you can add any sort of processing logic you like, as long as the syntax is supported by serverless. What we're doing here is using the external table purely as a loading table: once we've written the select statement out using the external table, we can drop that external table and create it again pointing to a different location. And I have some dynamic SQL here that allows me to parameterise that location and also the source data, so I'm going to select a specific day's worth of data from a source folder and use that to write it out to the same folder structure. Now, if I run my stored procedure for my three dates (I've got three days' worth of data in my source system, just for demonstration purposes)... let's create that stored procedure, and if we go back we'll run that procedure and get the first day back. We'll go back into our data lake, into our new conformed folder, and as we can see we've got our FactSales, but we've now got a year, month and day structure, which means I can go in and run that for the second day, and I'll run it for the third day. If I come back in and refresh, we'll see I have a folder structure that represents what I would like the Parquet data to be stored in.
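A sketch of that parameterised load, with dynamic SQL dropping and recreating a transient external table per day; the procedure name, columns and folder convention here are assumptions rather than the exact demo code:

    CREATE PROCEDURE LDW.usp_LoadFactSalesDay
        @OrderDate DATE
    AS
    BEGIN
        DECLARE @Year  VARCHAR(4) = DATENAME(YEAR, @OrderDate),
                @Month VARCHAR(2) = RIGHT('0' + CAST(MONTH(@OrderDate) AS VARCHAR(2)), 2),
                @Day   VARCHAR(2) = RIGHT('0' + CAST(DAY(@OrderDate) AS VARCHAR(2)), 2),
                @Sql   NVARCHAR(MAX);

        -- The external table is only a loading table, so drop it before re-pointing it at a new folder.
        IF OBJECT_ID('STG.FactSalesLoad') IS NOT NULL
            EXEC('DROP EXTERNAL TABLE STG.FactSalesLoad');

        SET @Sql = N'
        CREATE EXTERNAL TABLE STG.FactSalesLoad
        WITH (
            LOCATION = ''conformed/facts/FactSales/' + @Year + '/' + @Month + '/' + @Day + ''',
            DATA_SOURCE = ExternalDataSourceDataLake,
            FILE_FORMAT = ParquetFormat
        )
        AS
        SELECT d.StockItemKey, so.OrderDate, sol.Quantity, sol.UnitPrice
        FROM LDW.vwSalesOrderLines sol
        INNER JOIN LDW.vwSalesOrders so ON sol.OrderID = so.OrderID
        INNER JOIN STG.DimStockItem d ON sol.StockItemID = d.StockItemID
        WHERE so.OrderYear = ' + @Year + '
          AND so.OrderMonth = ' + @Month + '
          AND so.OrderDay = ' + @Day + N';';

        -- The OrderYear/Month/Day filters come from filepath(), so only that day's source folder is read.
        EXEC sp_executesql @Sql;
    END;
    GO

    -- Example call for one day's partition:
    EXEC LDW.usp_LoadFactSalesDay @OrderDate = '2021-04-15';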
I'll then create a view, and I'm going to bring in the file paths of that data and also the file name, because I'd like to show the metadata attached to this data from the data lake. So if I create this view and do a regular select from it, we've got our date, our colors and our total quantity, but I'm also going to show the metadata associated with it. I've got a select query here that's selecting the year, the month and the date from the view; if we go back up to the view, you'll see I've essentially aliased filepath location 1 as the year, filepath location 2 as the month and filepath location 3 as the date, and if we look at this BULK statement we have wildcards 1, 2 and 3.

So now if I select this data and bring it up, and I just scroll down, we have the date from inside the files themselves, and we can see that we've got the 15th of April stored in the 15th of April folder structure; if we go to the 16th we've got the 16th, and for the 17th we've got the 17th. You can also see the source file name the data comes from; that's just exposed with the filepath and filename functionality.

What I'd like to do is one last check to see if my LunchAndLearn database appears, but I won't just yet. What I'll do is open up another database and just show the resources that have been created. We've got the external connection to our data lake, we've got our file format, we've got our external tables (just the transient tables we're using to load data), we have our schemas, including the custom schemas we created to store our views and external tables, and we have our views, and if we expand the views we've got our column definitions. Now, we can use SQL Server Management Studio or Azure Data Studio to connect; we don't have to log in and use Synapse Studio, so we have other SQL tools available to connect to and use serverless SQL pools.

Hey Andy? We have a question from Mark Marshall; I don't think you're seeing it on the screen.

Oh yes, I can see the question now from Mark: have you tried the following when writing data out, if you had, say, a million rows in Synapse, when writing into a file, do a SELECT TOP 2 million, or any number that is greater than the known data row count? I haven't, Mark, but we could do it. Ah, so it works with normal Synapse, but you've not tried it with on-demand: if the TOP number is greater, you should get a single file. That's great. We won't do it here, but what I want to do is create a new script, because I have seen some interesting behaviour when using TOP to write data out. Sometimes, if I had a select that was writing data out and I knew it was a million rows and I selected a million rows, yes, it would create a single file, whereas if I just let it do its own thing it could create several files. It's a great one though, so what I'll do is add a script that says: test SELECT TOP where TOP is greater than the known number of rows. I'd love to test that out, so I'm going to test it and probably tweet that out.
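Purely as an experiment, and nothing here is documented behaviour, that test could look something like the following, a CETAS whose SELECT uses a TOP value above the known source row count to see whether a single output file is produced; the object names are invented:

    -- Experiment only: does an oversized TOP influence how many Parquet files CETAS writes out?
    CREATE EXTERNAL TABLE STG.FactSalesTopTest
    WITH (
        LOCATION = 'tests/FactSalesTopTest',
        DATA_SOURCE = ExternalDataSourceDataLake,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT TOP 2000000 *      -- deliberately larger than the known row count of the source
    FROM LDW.vwFactSales;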
Okay, what I'd like to do, before we go back to the slides and finish up the session, is come back to this concept of data processed, because when you're using serverless SQL pools this is how you're being charged. So how do you find out how much you're using? We've got the Monitor hub available within Synapse, and under SQL pools we've got the built-in pool; this always exists. You can see I've also got a dedicated SQL pool there, and I could have any number of those, but I've got the one serverless built-in pool. If I click on that, we get some stats about the data we're processing: as we can see, I've processed 53.51 MB. I can change that to the last 30 days and see what's happening in this particular Synapse Analytics workspace, and we've got our successful and failed requests available here. If we click on SQL requests itself, we can see the requests that are being sent down to serverless SQL pools.

This is also really useful for when you're connecting external tools such as Power BI. I have a logical data warehouse blog series, about four posts, and the last one is about connecting Power BI, because ultimately you'd love to visualise this data. There are some caveats in terms of connecting to serverless and partition pruning to reduce the amount of data you're processing to return your query results, but when you're running your SQL statements and analysing your workloads, you can see the data processed metric here. Now, you're going to see a whole lot of 10 MB processed here; I'm not processing that much in this demo, but even if I was only processing one megabyte, the minimum data processed per query is 10 MB, so that's an important consideration. But as we can see, if I go back over the last 30 days and scroll down, we start seeing some juicier numbers: we've got two gigabytes here for a select, 6.43 gigabytes here. Ultimately you add all of this up and that's how much you're being charged. I do have on GitHub, and again I'll probably tweet the link out, a very embryonic repo with serverless SQL tools, and there's a select statement you can run to get all this data out of serverless SQL using DMVs.

Now, what happens if you would like to set some limits on this data processed? We can do that. On the built-in pool we've got this tiny little icon here, cost control, and we're able to control this using stored procedures as well, so you can execute a system stored procedure to update the cost control. What we're telling Azure is: if on any given day four terabytes of data have been processed, don't let any more queries be executed; for the weekly limit we have eight, and for the monthly limit sixteen. So for example, for my particular account, sixteen times three pounds fifty gives the total amount that I am prepared to spend on serverless every month.

I just want to try one more time on the databases to see whether LunchAndLearn appears... yes, there it is, LunchAndLearn down at the bottom. There we go, so there are our external resources and our views. And you can see I've created many, many databases here, so you can have a lot of databases in your serverless SQL pools service.
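For reference, that cost-control surface is exposed through a DMV and a system stored procedure; the statements below reflect my understanding of that interface (limits are set in terabytes per daily, weekly or monthly window), with the limit values simply mirroring the ones mentioned above:

    -- How much data has been processed so far in the current daily/weekly/monthly windows (in MB).
    SELECT * FROM sys.dm_external_data_processed;

    -- Cap the data processed; once a limit is reached, further queries are rejected for that window.
    EXEC sp_set_data_processed_limit @type = N'daily',   @limit_tb = 4;
    EXEC sp_set_data_processed_limit @type = N'weekly',  @limit_tb = 8;
    EXEC sp_set_data_processed_limit @type = N'monthly', @limit_tb = 16;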
Okay, so we'll jump back to the slides just to finish off. I have a blog, serverlesssql.com, where I've got various posts on using the serverless SQL pools service. There's a getting-started post which goes through actually creating a Synapse Analytics workspace, logging into Studio and starting to use the service. There's a four-part blog series in which I essentially go through the process this session has gone through, in a little bit more detail, and there are GitHub repositories for the SQL, so you can download the code. We look at slowly changing dimensions; we are pushing serverless SQL pools a little bit there in terms of its built-in functionality, but I offer a solution for tracking changes to source data over time using serverless and writing that data back to the data lake as Parquet. And we also connect Power BI to the views in our serverless SQL pools database. That one's the most interesting one, I would say, because it goes through how we partition prune if we're building a report and we've got date filters, for example: how do they get pushed down to serverless SQL pools, and how do we use the filepath function successfully to reduce the amount of data that is being processed?

For resources and further reading, there's the official documentation for Synapse Analytics, there's the lakehouse paper (yes, it's Databricks, but it's a definition of the data lakehouse itself), a little bit about dimensional modelling, and the Parquet file format itself. I'll leave it on the references and further reading slide. Thank you very much for listening, I hope it's been useful, and if there are questions we can dive in; feel free to ask anything, and if I don't know the answer I'll take it away, research it and get back to you, such as that TOP 1 million / TOP 2 million question; I'm pretty interested to see what happens with the behaviour there.

There's one more question here, Andy, let me show it to you: does that trick with TOP avoid Synapse choosing the Parquet file size and give you more control?

We'll find out, John, we will definitely find out. When I have tested using TOP before, I was sure that it reduced the number of files being created. I ran a CREATE EXTERNAL TABLE AS SELECT on a lot of data, quite a few gigabytes that were in separate date folders, and when I looked at the resulting folder there were a lot of Parquet files; it almost looked as if it had created a Parquet file for every source folder, which is a bit funny, because I wouldn't think serverless would essentially have a one-to-one relationship between a source folder and a Parquet file. When I was using TOP to write data out, to see what the file sizes were with a known row count, I'm sure it did reduce the number of Parquet files created. I'd love to test it now, but I'm not a hundred percent sure I've got the connection to the files I'd need, so I'll do it after this.

Great, Andy. I think so far this was really an amazing session; for me it was a master class on how to use serverless. I'd like to thank John and Mark for the questions as well. Let me just switch over a little bit here: I'd like to thank Microsoft again for sponsoring this session. I've sent a couple of links through the chat; please claim the vouchers for the certifications available there. It's important to note that the vouchers will expire this month, so go and get your voucher right away and register for the exams. I have a comment here from John saying thank you for a wonderful session. That's great feedback; Andy was brilliant for sure.

John, by all means: that blog series was written as we went along, so it was exploring what serverless can do on its own. What I'd like to do next is bring in pipelines, because we don't have too much control over how we're writing that data out. Maybe TOP could help us, but ultimately, with pipelines, if we did need to transform that source data, if it was just a little bit too inefficient to be reading lots of CSV files and we wanted to transform it into Parquet for serverless to then read ad hoc, pipelines would probably be a good way to go, because then we can partition our data and we've got a lot more control over what we're doing when we're writing that data out.

Ah, so Mark's asking about compute nodes. With dedicated that may be the case, but with serverless, well, I guess we don't know; we don't know how many nodes the Polaris engine is using for any given level of compute. It's an interesting question to ask, though.
level of compute it's an interesting question to ask though yeah so have some really good feedback sandy again was really really great session thank you so much for the participation and guys feel free to reach out to andy and his social social network and for sure you giving a lot of insights and i just trust and believe in his blog post about synapse is really cool uh i think john john's just skipping here so jones mentioned i never met a csv pipeline that source provide didn't change this schema looking forward to those articles yes nice one that's great guys that's great that's great good participation guys abstract anyway yes i'm from microsoft i i can say sometimes this will have respect we love we we love we love control we like to know what we're doing with our with our data engineering pipelines yeah exactly but yeah i i don't know i'm going to have a little look to see that exam voucher but if anyone was interested in in the synapse analytics route of certification then the dp 203 which is the new single data engineering certification is is really is really good it covers data lakes it covers data factory synapse dedicated serverless and power bi integration so it's really really good it's really really good certification oh yeah i didn't know about this i because i said faith both in dp 200 201 and and are you saying that the gp 203 is more heavy in terms of synapse what's a good question i mean it covered lots of different aspects it had streaming analytics event hubs and streaming analytics um data lakes and setting up folder structures and folder patterns um data factory so configuring sources and destinations and then yes so things like you know table distribution in dedicated sql pools you know hash distribution replicator distribution that kind of thing and then with serverless um the objects required to connect to and read data so creating external data sources uh and file formats yeah that makes sense looking forward to try this exam as well so i don't see any more questions here uh andy one more time amazing session thank you so much for providing this product if you have more sessions and you just bring to us we're looking forward to having one here well there is uh we're gonna look at bringing in a bit of spark a little bit of spark but yeah and the plugging in of business intelligence tools is is the big is the big thing because as i said this whole concept of partition pruning what you need to be doing and you know there are some considerations on that which i'd like to talk about yeah for sure okay thank you so much and that's it for today and see you next time okay thanks everyone take care thank you bye bye bye
Info
Channel: Cloud Lunch and Learn
Views: 400
Rating: 5 out of 5
Id: ayuHbfHJT0o
Length: 51min 0sec (3060 seconds)
Published: Wed Sep 01 2021