Modern Data Warehouse explained - James Serra

Captions
Hi, welcome everyone. My name is James Serra, and I'm a Data Platform Architecture Lead at EY. I joined EY recently; before that I was a data platform architect at Microsoft for seven years, where I spent a lot of time discussing the modern data warehouse with customers, which is the topic of this short video. I'll try to keep it under 20 minutes, with a lot of whiteboarding and no slides. The idea is to talk at a high level about what a modern data warehouse is: concepts such as a data lake and using a relational database. I'll go through the five stages of putting data into a data warehouse, where the whole idea is that I want to ingest all this data and use it to make better business decisions by having it all in one location. After talking through the concepts, most of the video will show which Microsoft products would be used for each of these stages within the modern data warehouse. So let's jump to whiteboarding.

We begin with the first stage, which is ingesting the data. You could have a lot of data on-prem and in the cloud that you need to collect in one location, and I'm going to focus this session on moving that data into a cloud like Azure. The second stage is where you take what you've ingested and store it someplace: this will be your data lake. The next stage is where you transform the data, and this is where we go from ETL to ELT: extract, load, and transform, instead of extract, transform, and load. The idea is that we move the data right into the data lake and then transform it. Then we go into stage four, which is to model the data, because the data as it sits in a data lake may not be easily consumable for the average end user; we need to model it and put it in a relational database to make it easier for the end user to consume. That leads to stage five: visualize, and I'll put ML here too. This is where we make those better business decisions by investigating the data through reporting or machine learning. Now, these are not hard and fast rules: you could visualize earlier in these stages, and the same goes for machine learning. But for this video we're going to break it down into five, and you'll see most architects break it down into these same five; they may name them a little differently, but in general they follow the same flow.

So let's look at stage one, ingesting. I may have a ton of different data sources that I want to ingest, and there could be many challenges just in this one step, because I could have data that's on-prem, data in the cloud, data that is really large, or streaming data. For the purposes of this video we're going to focus just on batch-oriented ingestion, not streaming; streaming adds a bit more complexity, and while the flow is generally the same, there are other options and products involved. I could spend many hours with customers just talking about this phase, because the challenges are real: I can have many petabytes of data sitting on-prem that I need to move into the cloud; I could have data that I have to move every day and struggle with its size; do I want to do incremental updates or a flush-and-fill; do I have a big enough pipe from on-prem to the cloud? All things to discuss in further conversations, but the idea is: I just want to collect and ingest all this data.

So where do we want to land it when we ingest it? That's stage two, and this is where our data lake comes into play. When we ingest data from the source, we want to put it into the raw layer of the data lake. Data lake design is another topic that you can spend many hours on, and I've done that with customers.
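The incremental-versus-flush-and-fill question comes up in almost every ingestion conversation. As a rough illustration of the difference (this is not any particular product's API; the table and field names are made up), a watermark-based incremental load looks something like this:

```python
from datetime import datetime, timezone

# Stand-in for an on-prem source table; each row has a last-modified timestamp.
SOURCE = [
    {"id": 1, "value": "a", "modified": datetime(2021, 4, 1, tzinfo=timezone.utc)},
    {"id": 2, "value": "b", "modified": datetime(2021, 4, 2, tzinfo=timezone.utc)},
    {"id": 3, "value": "c", "modified": datetime(2021, 4, 3, tzinfo=timezone.utc)},
]

def flush_and_fill(source):
    """Re-copy everything on every run: simple, but expensive for large tables."""
    return list(source)

def incremental(source, watermark):
    """Copy only rows modified since the last successful run's watermark."""
    return [row for row in source if row["modified"] > watermark]

# First run: no watermark yet, so everything comes over.
watermark = datetime.min.replace(tzinfo=timezone.utc)
batch = incremental(SOURCE, watermark)
watermark = max(row["modified"] for row in batch)  # persist this for the next run

# Next run: only rows changed after the stored watermark are moved.
SOURCE.append({"id": 4, "value": "d", "modified": datetime(2021, 4, 4, tzinfo=timezone.utc)})
delta = incremental(SOURCE, watermark)
```

The watermark approach keeps the daily pipe small, at the cost of tracking state between runs; flush-and-fill needs no state but re-moves the full table every time.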
You need to design your data lake up front to be able to handle all the different types of data you're going to ingest, and every customer is different, so the layers I'll talk about here are not always the same: some customers have more layers, and they may be named differently. But the idea is that I want to use ELT: land the data in its raw format in the data lake. Then, in stage three, that raw data needs to be cleaned, so I'll have a clean layer in the data lake. In stage three we have compute, and this compute takes the raw data and puts it in the clean layer. That may not be enough, because to allow other people to query and report on this data I may need to do more to it. This is where we get into what we typically see customers have: the presentation layer. Here I take the cleaned data, join it with other data, aggregate it, and do all these things to make it easier for end users to consume. At this point I could have some end users who want access to the data; these are usually power users or data scientists (signified on my whiteboard by somebody wearing a hat), and they can go query the data, maybe in raw format, but maybe also after it's cleaned, or in the presentation layer. This is the great thing about a data lake: schema-on-read means I can land data there and clean it very quickly and have people access it right away. Data scientists may even want a sandbox layer, where they take a copy of the data and experiment with it. That is the power of a data lake, and there are many benefits, which I cover on my blog; I've just touched on a couple here. The one that resonates with a lot of customers is this: if you have a relational database with a staging area where you're doing all this cleaning of the data, the data lake could, at the very least, be that staging area moved out of the relational database. Now I can clean the data and put tons of compute on top of it without impacting end users who are hitting the relational database. That means the idea of a maintenance window can go out the window, literally, because all the time spent cleaning the data happens outside of it.

Still, the data lake may not be in a format that is conducive for end users to consume; it may be too difficult or too technical for them to report off of. This is in large part because the data lake does not keep metadata alongside the data the way a relational database does. A data lake is really just a glorified file folder, just files and folders, so it may be too much for an end user. This is where we get into stage four: we model the data into a relational database. We land the data from the data lake, most of it (it doesn't have to be all of it), into a relational database in third normal form. We could even go one step further and put it into a star schema to make it even easier for end users. So what's happening is an enterprise solution that does a lot of the work to move the data along into a format that is easily consumed by an end user. The trade-off is more time and expense, but it's almost always worth it, because now an end user can just go to a dashboard canvas, drag fields onto it, and create dashboards, reports, and queries. And you get the best of both worlds, because somebody who has the technical chops can still access the data as it sits in the data lake.
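To make the raw-to-clean flow concrete, here's a minimal local sketch, using folders on disk as a stand-in for data lake zones (the zone names, file, and cleaning rule are illustrative, not a prescribed layout):

```python
import csv
import tempfile
from pathlib import Path

# Folders standing in for data lake zones.
lake = Path(tempfile.mkdtemp())
raw = lake / "raw" / "sales"
clean = lake / "clean" / "sales"
raw.mkdir(parents=True)
clean.mkdir(parents=True)

# ELT: first land the file in raw, exactly as it arrived from the source...
(raw / "2021-04-25.csv").write_text("id,amount\n1,10\n2,\n3,30\n")

# ...then a separate compute step reads raw and writes a cleaned copy,
# leaving the original untouched so the job can always be rerun.
with (raw / "2021-04-25.csv").open() as f:
    rows = [r for r in csv.DictReader(f) if r["amount"]]  # drop rows missing an amount

with (clean / "2021-04-25.csv").open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "amount"])
    writer.writeheader()
    writer.writerows(rows)
```

Because the raw copy is never modified, a bad cleaning rule just means rerunning the compute step, which is exactly the "no maintenance window" benefit described above.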
They don't have to wait until the data goes through this whole process. And then we get to stage five, where we visualize the data. We have an end user over here who says: "Oh, this is awesome, it's in a star schema, I don't have to know how to join the data, I can just click and drag and get value from the data right away."

So, at a high level, that's your modern data warehouse with the five stages. Usually the first question I get is: do you have to have all five stages, or can you skip some? The answer is that there are exceptions to everything; a lot depends on the type of data each customer has. As an example, some customers say: "Look, James, I wrote all these SSIS packages for my on-prem solution, and I'm moving to the cloud, but those SSIS packages don't use a data lake. Could I just repoint them and go straight from my on-prem relational database to a relational database in the cloud? Because otherwise I have extra work: I have to export the relational data out to files, like CSV, import them into a data lake, and then copy them into a relational database. Is it worth it?" Many times it is worth it, but to get a quick win, or to build the solution quicker, in some cases you may want to use those SSIS packages and build the solution bypassing the data lake, and then go back and add the data lake later. What will happen if you do this is that a power user or data scientist will say: "Wait a minute, you have data in the relational database that I need in the data lake, and it's not there; can you move it in?" And now you have to go the reverse way. You also miss out on all the additional benefits of a data lake that I list on my blog. Besides being the place to clean the data, a data lake gives you nearly unlimited storage that's very cheap, because with tiering we can make it very inexpensive to keep data there. And if I need to go back and rerun a job because it didn't clean the data properly, the data is still sitting in the data lake. So there are lots of reasons to keep all your data in the data lake, but understand that there may be exceptions.

OK, now let's look at how this would look in the Azure world, with the various products Azure has, some of which overlap in their use cases; hopefully I'll clear some of that up. For stage one, ingestion, the one product that most customers use is Azure Data Factory. It's a great orchestration tool that is very performant when it needs to move data from point A to point B; point A could be on-prem or in the cloud, and I want to move the data and land it in the data lake, so Data Factory would be your tool for that. What do I use for the data lake? In Azure we have Azure Data Lake Storage Gen2. It used to be Blob Storage; Data Lake Storage Gen2 is still using Blob Storage underneath the covers, but it puts a layer on top with a bunch of additional features, such as hierarchical namespace support and additional security. There are a lot of features in Data Lake Storage Gen2 that you'll want to check out to help with performance or to save money. For example, with object tiering I can specify whether data lives in the hot, cool, or archive tier, so, with some trade-offs, I can move data into archive and save a lot of money. So definitely make sure you check all of those out.

So now I have this data sitting in the Data Lake Storage Gen2 account, but that's strictly storage; I need to put compute on top of it to do things like cleaning, and there are a lot of options here. One is Azure Data Factory's Data Flows feature, which is a lot like SSIS, if you're familiar with that product, in that I can visually transform the data.
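The tiering trade-off mentioned above (cheaper storage in exchange for retrieval cost and latency) is easy to see with some arithmetic. The per-GB prices below are made up for illustration; real Azure prices vary by region and change over time:

```python
# Illustrative per-GB-per-month storage prices plus a per-GB retrieval fee,
# which is the trade-off for the colder tiers. These numbers are invented
# for the example, not actual Azure pricing.
TIERS = {
    # tier: (storage price per GB-month, retrieval price per GB)
    "hot":     (0.020, 0.00),
    "cool":    (0.010, 0.01),
    "archive": (0.002, 0.02),
}

def monthly_cost(tier, stored_gb, read_gb):
    """Storage cost plus retrieval cost for one month in the given tier."""
    store, read = TIERS[tier]
    return stored_gb * store + read_gb * read

# 10 TB of raw history that is read only 100 GB per month:
for tier in TIERS:
    print(tier, round(monthly_cost(tier, 10_000, 100), 2))
```

For rarely-read raw history the cold tiers win easily; for hot, frequently-queried data the retrieval fees would flip the comparison, which is why tier choice depends on access patterns.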
I can say: I have data in the raw layer, and I want to apply all these rules visually, low-code/no-code, to clean that data and land it into the clean layer of the data lake. Data Factory is great at that. If it doesn't give you all the features you need, maybe there's some library you have to use, and this is where a lot of customers start using Databricks. It's not an either/or. Databricks is a Spark-based solution where you create Spark notebooks, which can be too daunting for a lot of customers, so they want to use Data Factory Data Flows as much as possible; but there may be a case where "I can't support this in Data Flows," and they'll use Databricks, so a lot of customers use both. Interestingly enough, behind the scenes, when you go to run a Data Factory data flow, all that visual transformation you created is turned into code and run on a Spark cluster, which is why it's very performant. I like to tell customers: think of Data Factory Data Flows as putting a pretty interface over Spark. And Data Flows now has a couple of options: Mapping Data Flows, which is that visual interface, and also something called Wrangling Data Flows, which is Power Query. So you have two ways to transform the data; look at both and see what works best for your skill set. That's what I ask a lot of customers: what is your current skill set? That directs me to which product to use. And there are plenty of others: there's HDInsight, and there are third-party products you can use to transform your data. So make sure you look at all those products and features and match them against your current skill set.

All right, now let's go to stage four. This is where we want to model the data, and where we're going to land it, and this is where Synapse comes into play. I have a whole video on Synapse that you may want to check out, because there's a lot to it. In Synapse you can have a dedicated SQL pool, and this dedicated pool is a relational database; I can use Data Factory to take the data and put it into this relational database in Synapse. Now, here's where it gets very interesting: Synapse, which went GA a couple of months ago, has a lot of features. One of them, called Synapse Studio, allows you to do everything I've talked about inside Synapse Studio, so I don't have to jump out to use Data Lake Storage Gen2 or Data Factory; I can do it all within Synapse. Even Spark: Synapse has its own version of open-source Spark, so if I need Spark I can use it inside Synapse, under that single pane of glass. It even has Data Factory; it's not called Data Factory, but it uses the same code base, it's called pipelines and data flows in there, so I can use all of that inside Synapse without having to leave. You also have a great option in Synapse for cleaning the data, called Synapse serverless. As the name says, it's a serverless, pay-per-query option to query data sitting in a data lake using regular T-SQL. This is a big deal, because Databricks has its own version of SQL, and the same goes for other third-party products, whereas with Synapse serverless I can use the T-SQL I may already know and love to pull the data out of its raw format, clean it using T-SQL, and write it into the clean layer, or I can use Data Factory inside of Synapse for that. So, a lot of options: again, look at your skill set, look at what you need to get done, and then choose the best option. And for Synapse, go look at that other video I've created to see a lot more about its features.

And then finally, stage five, visualization and machine learning, is where Power BI comes in. Power BI has automated machine learning built in that you can use.
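To illustrate the modeling stage, here's a small sketch of the star-schema idea, using SQLite as a local stand-in for a dedicated SQL pool (the table and column names are made up for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Dimension tables describe the "who/what/when"; the fact table holds the
# measures plus foreign keys. An end user only ever joins fact -> dimension,
# which is what makes drag-and-drop reporting possible.
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date    VALUES (20210425, '2021-04-25');
INSERT INTO fact_sales  VALUES (1, 20210425, 10.0), (1, 20210425, 5.0), (2, 20210425, 7.5);
""")

# The kind of query a reporting tool generates behind a drag-and-drop
# canvas: simple star joins plus an aggregate.
rows = con.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 7.5), ('widget', 15.0)]
```

Compare this with a third-normal-form layout, where the same question might require many more joins; the star schema trades some storage redundancy for queries simple enough that an end user never has to think about join paths.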
You can also use a separate product outside of Power BI, such as Azure Machine Learning, but Power BI can do both, and it's excellent at reporting. It's also within Synapse Studio, so if I use Synapse Studio I have a single pane of glass where I can also build Power BI reports. Now, there are a lot of topics in this stage five; it can be a long conversation with customers on how to properly use Power BI to get the best performance and cost, because Power BI has all these options, like import mode and DirectQuery mode. Underneath the covers it uses a tabular model, so do I want to import the data into that tabular model, or do I want to pull it straight from the star schema sitting in Synapse? There's a lot to talk about here, but at a high level, think of Power BI as your reporting tool that ingests data from your dedicated pool in Synapse. Power BI also has options to pull data from the data lake, or even go back to the sources, but at that point Power BI becomes more of a self-service tool for your end user, your power user, and not an enterprise solution. And there's a lot of synergy between Power BI and Synapse now: I can do things in Power BI like use Power BI dataflows, which an end user can build, and then you can copy that logic into an enterprise solution in Synapse or Data Factory and reuse it. So think of Power BI as a great way for an end user to prototype. Instead of asking an end user, as an IT person, what their business requirements are, you can just say: go prototype it in Power BI, and I will take what you've done and use some of it inside this enterprise solution.

So there you have it. That's, at a very high level, the modern data warehouse with these five stages. I talked about the concepts, the high level of the data lake and the relational database, and then I plugged in the various Microsoft products that you can use when building this solution in Azure. Let me go back to my slide here: if you have any questions about what I went over, feel free to reach out to me; there's my email address, and also my blog, because a lot of what I talked about I go into in more detail there. I hope you enjoyed this video, I hope you learned a little something, and again, feel free to reach out if you have any questions. Thanks for watching.
Info
Channel: James Serra
Views: 5,825
Rating: 4.9716311 out of 5
Id: TZHykX6cEyc
Length: 19min 56sec (1196 seconds)
Published: Sun Apr 25 2021