Building a Data Warehouse from the Ground Up

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
i would like to welcome dustin to the data analytics and ai club thank you so much for bringing this absolutely amazing session that will help any beginner looking to start building a data warehouse others will revisit and perhaps learn something new or remember some basics dustin has been volunteering and speaking in events such as user group sequel saturdays probably more in the state of florida united states at least that i'm aware of he has visited us in the past hopefully he will join us again in the future in person event it's always a great pleasure to have you dustin and please when you're ready take it away and tell us more about yourself awesome thanks cecilia and i'm i'm really excited and really glad to be um involved here with the the presentation and and be involved with this community it's um really kind of exciting to get back into the groove of presenting after kind of the pandemic shut down and being so disconnected from everyone so it's it's nice to get connected back up so thank you so much for having me it's really really a special pleasure let me share my screen and make sure that you can see that okay okay okay yes we see your presentation awesome great okay all right so this session lexicilia said is called building a data warehouse from the ground up um just to introduce myself my name is dustin ryan i come from a background of data warehouse design and i've been working in this field for a little while i started off my career with a microsoft partner pragmatic works some of you might know of them and i kind of got my feet wet in the data warehouse etl analysis services reporting solution um space with pragmatic works um but then eventually my career took me to microsoft and i joined microsoft as a data platform solutions architect and i spent um a few years in that role then i transitioned after everything became cloud focused i transitioned over to a role as a cloud solution architect um but now i'm the director of data and ai for our southeast education customer so part of my role means that i'm focused on our southeast customers that are higher ed department of education and university or k-12 rather and so i help those customers adopt our data platform technology specifically in azure so i get to spend a lot of time with our customers talking about azure synapse and data lakes and data warehouses and power bi and data factory and kind of all those those cool things um you can find me blogging over at sqldusty.com and i live in northeast florida just outside jacksonville with my wife and three kids three cats three dogs and five chickens all right so this session just to kind of give you the the run rundown here on what this session is all about um i don't really have an agenda slide here but this is this is what i i'm intending to cover in this session is we're going to talk about the the fundamentals for designing a data warehouse star schema and we'll spend some time talking about the technology side of things but really the method that we're going to talk about here is not technology focused so even if you were designing a data warehouse on sql server or you're designing a data warehouse for azure synapse or you're designing a data warehouse for another platform really the steps that we're going to talk about you can use um for any platform so it's a kind of a technology agnostic approach and this is an approach that you might already be familiar with if you've read ralph kimball's data warehouse toolkit book um these are kind of the four steps that are i think it's the first five chapters that he goes over this uh the first five chapters uh for designing a data warehouse and if you are a data warehouse professional and you haven't read the data warehouse toolkit you need to go get the book and and read it it's it's fantastic and it's it's such an easy read and it's super clear too and so the four steps that we're going to talk about for designing a data warehouse star schema um come from that from that book so if you've read the book a lot of this is going to really sound familiar if you haven't read the book then you need to go get it if if your job or your passion is around data warehousing so we're going to talk here talk about why a data warehouse is useful for for analytics and reporting and then we'll jump into those four steps and part of the really cool thing about this session is that part of this session we're actually going to kind of whiteboard out the design of a data warehouse star schema together and so i can certainly do this on my own but it's more fun if you participate so i'm gonna look for participation from you guys for you to chime in tell me what you're thinking tell me your thoughts and kind of guide me as we go through that white boarding to design the the data warehouse together otherwise it's just me talking for like an hour and and that's really boring and it's it's not going to be very exciting so let's let's try to participate together um and for the sake of of my wi-fi bandwidth i'm going to turn off my camera i live out in the woods and wi-fi out here is not fantastic all right so let's jump in and if it if at any point you think that you're losing audio you're losing my audio or you can't see my screen for some reason um stop me let me know so that way i can pause and we can make sure that we've got everything correct so that way the presentation looks good especially for the sake of the recording okay all right so let's talk about this uh this slide here why a data warehouse and cecilia can you confirm for me real quick that you can still see my screen okay yes i'm looking to why a data warehouse is like okay all right great okay so let's talk about why i did a warehouse and when i first started in this field i remember that the story of this bank who shall remain nameless to pre to protect the identities of the uh the victims um that was running their reporting um uh they were reporting on their atm systems and so they had designed some queries they had built some reports out and they're running reporting on on atm transactions at this bank and they had some uh report developers that had written some queries in the and they were running their queries directly against the the production atm system and what had happened was these queries started to run more and more they started to create more load on production system and one day they actually took their atm system down because of the queries that were being run against their production systems um and it was actually a really bad situation for the bank they spent a lot of time trying to get this system back up and they wondered why they were having so much issues and it turned out they were running all of these reports against their production system created a lot of issues and so this data warehouse the data warehouse really is an environment where we can offload that reporting from the production systems right and when we talk about the reason why we need a data warehouse right is because a data warehouse serves a very different purpose than than a production transactional system right and in the case of like the bank that we're talking about here where you have uh the atm system right the that system and you see the two columns here oltp and a data warehouse the the transactional system really the purpose is for the execution of the business so whether that's banking or retail or transportation or education whatever it happens to be that that system is designed to execute the business right so to let people withdraw their money from the bank to let people check their balance those types of things and the data warehouse completely different purpose than that the purpose of the data warehouse is for the analysis of the business right so i want to see what i want to be able to analyze what are the transactions i want to look at a million transactions at a time right i want to be able to do things like maybe look at quantities of inventory i want to look at you know how much freight did we ship how many cheese balls did we sell whatever the the case may be and so a data warehouse for the is the purpose is for the analysis of the business the transactional system is for the execution of the business and because those purposes are really different the primary interaction between those two different types of systems is going to be really different in a transactional system and if let's just keep using the let's use a retail example here in a transactional system in a retail system right i need to you know insert a record very quickly when somebody buys a product right i want to insert a record into the database to say that you know cecilia bought a gallon of milk right and we need to track that in the database so that way over time maybe i want to be able to see how many gallons of milk sicilia bought you know and uh maybe dana bought uh you know so many packs of cookies right and every time she buys packs of cookies i need to be able to um sit uh write that record in the database so that way i can look and see how many gallons of milk cecilia bought how many packs of cookies dying about be able to see that right and a data warehouse is an aggregation of all of those transactions so different purposes and different primary interactions that transactional system i want to write a record i want to update a record or i want to delete a record and in a data warehouse i want to be able to upload potentially millions or billions or even trillions of records into that system or i want to pull out millions or billions of trillions of records in order to understand the analysis of the business right and so that interaction method in a data warehouse the primary interaction method is going to be select statements right i need to run queries against that data warehouse so i can analyze the business while the primary method of interaction for the transactional system we're inserting we're updating or deleting records it all depends on that that system and how we're executing business now from a temporal focus perspective a transactional system typically it's current you might have a little bit of history in something like an erp or some or like a retail system you might have a little bit of history there but for the most part it's going to be just what is the current state of of transactions of in this system right while a data warehouse is designed to not just hold what's currently going on but what has gone on in the past and and how much history are we going to keep in that data warehouse really depends on on your requirements and your organization's requirements but it could be everything from you know this past year to 10 years to 20 years or however many years you need to keep history back so the temporal focus is going to be very different between a data warehouse and a transactional system the design optimization between those two types of systems is also really different design optimization for a transactional system is we want to have maximum update concurrency right because hopefully if we're like a retail organization hopefully we're having lots of transactions as people buy products and so we want to be able to have those transactions go through as quickly as possible so that way people are not hindered from buying the products data warehouse right we're not looking so much for update concurrency but rather high performance queries right in that reporting system that's designed to support the analysis of the business we want those queries to be as fast as possible the design principles are also really different between two systems third normal form is typically what you see in like a transactional system in a data warehouse star schema design right because the name of the game here in the in the design of the data warehouse is to be able to get the records in as quickly as we can get the records out as quickly as we can and have as high performing select statements as possible okay all right so there are four steps here that we're gonna talk about as we go through this process together um and the four steps that we're gonna that we're gonna follow here um are you see them here on the screen here and the first step here is really the most important part right um and i'm sorry the second step here identifying the grain we're gonna talk about all of these steps we're going to jump into them in detail but as we go through these steps like i said we're going to design this star schema here here together okay so this is the part where i'm going to need you guys to participate okay now when we talk about what a well-designed star schema looks like and and we'll talk about why it's called a star schema if you're kind of new to this world but if you hear me use the word star schema if you ever use the word dimensional model what we're talking about is the design of the table structure in the data warehouse database but when we talk about building a good star schema there's some some qualities of a well-designed star schema that i kind of want to call out here it should be simple right you know the follow the kiss method right keep it simple stupid right the simpler the better and that has a lot of benefits because it it makes it easier to use it makes it very easy to understand and it also um eases the the etl process for the process that we design the etl or elt depending on you know what you design and the platform that you're using can dictate whether you might follow an etl pattern elt type pattern but having a simple design means that it's also not not only is it easier to query but it's also easier to load and a simpler star schema design also means that we can have it's easier to have high performing queries and it's also optimized and then i know that when we talk about these four steps we're talking about that where you know it's mostly technology agnostic i am going to call out some specific things that are relevant for analysis services or power bi um but a star schema is really great if you're planning to use tools like analysis services or power bi or power pivot or something like that okay so they really like a good star schema okay so the first step we're going to dive into here is identifying the business process and when we talk about the business process um we we have to first decide which process is where do we start and right and any organization has many different processes that are occurring at any given time but we have to first decide which is the business process that we're going to start with and sometimes this is a really challenging thing for an organization when you're saying hey we're going to build a warehouse but where do we start right and i have this conversation with customers all the time they recognize that they need an analytics platform that can help them answer the questions those tricky and tough to answer questions but they they don't really know where to start okay now to identify the business process this really requires an understanding of business requirements as well as an understanding of the available data and so when we talk about a business process what we're really talking about here is a is a natural business activity that's performed in your organizations that's supported for some sort of data collection and it's important to focus on a business process as opposed to a business department and sometimes that's a trap an organization you don't focus on departments and you focus on processes is because when you focus on a process this allows you to deliver information across a business in a more consistent manner otherwise we end up with a duplication of data so maybe maybe marketing says hey we need a database and so they focus on they build a marketing data warehouse right and then maybe the sales team that says well we need a database and so they build a sales data warehouse and then you end up with these uh duplications of data in the different departments and so focusing on a business process rather than a business department is a is a good way to ensure that you have a consistent delivery of data across across the organization um but the question bag well how do we decide which business process that we're gonna that we're gonna start with and the the most simple answer i can give you is really choose the business process with the most impact and the lowest risk and so that's why i have this little kind of cube graphic here on the screen when you think about where to start with if you're just starting from scratch right you don't have a data warehouse and and you're trying to figure out where we start choose choose a business process that is going to have a lot of impact for your organization but is relatively low risk and so when i say low risk i don't that doesn't mean you pick something that's easy although if you are new to the world of data warehousing there's some benefit to starting with something that is not super complex but if you can choose a an area of the business that is going to be super high impact and is low risk from the standpoint of you know we know where the data comes from we have the expertise in-house that can answer all the questions regarding the data itself we know what are the the relevant types of business metrics that we want to build up to measure right that's what i mean by low risk so we wouldn't want to choose something like an area of the business that you know maybe the person who knew everything about that area of the business just quit and so they're no longer with the organization and so all the questions that we're going to have are going to be really difficult to answer right so we wouldn't want to choose like an area like that so we want to choose an a business process that is going to be highly impactful so that we can illustrate the value of of the data warehouse in the and and have a high return on investment from the project we also want to choose something that is is relatively low risk so not not choosing an area that's really difficult to start with okay and to be honest you know the business is probably going to help you do this right they're going to give you a lot of feedback on where they think you should start right and if you think about right your own organizations and where you work at if you're a data professional you probably if i said hey think of that one person in your organization that is always asking you for data right and your mind probably goes to a specific person right we all have that one person that is always asking you for you to rerun that report or write that query or send me that report again or change the filters right they're always asking you to do those things right and and that person could be telling you hey that this is a business process that we don't have good reporting around and if we could really deliver on this particular request or this business process that could be highly impactful for the organization okay so if they're asking you questions like well what are the gross profit margins right for these product categories or what's the average account balance or you know what's the return rate by supplier you know those are all questions that that a a user in your organization might be asking you and those could be indications of of that's a valuable business process for you to kind of kind of start with and so the business can kind of help you that and and if you're listening and paying attention um they'll kind of clue you in onto what's a good business process for you to start with now for the sake of our example we're gonna we're gonna use the example of retail sales and there's a couple of reasons why we're gonna do that one is it's a fun one to use um and it's a and it's an example that i think most everybody can kind of play along with um so i don't want to pick like a really complex example that maybe people can't wrap their brains around so we're gonna pick something kind of simple because really the goal here is to to learn the method right not actually design like a retail data warehouse we just kind of want to learn the method of the four steps because you can take these four steps that we're going to go over or that we are going over and take this back to your organization regardless of whatever the business process is and i've used this many many times with with organizations across transportation logistics retail health care education lots of different things out there in the in the method works and so that's why we're using a retail sales example because it's simple it's easy and i don't want you guys to kind of get lost in the minutia of the business process because that's not really the goal here but we talk about this because it's an important first step okay so the second uh the second step here is we need to identify the grain and when we talk about the grain what we're talking about is the what a record in a fact table represents okay and we'll talk about that here in a minute we must this is really the most pivotal decision in the whole design process and i kind of called that out in the beginning and and the reason is is when we identify the grain of a fact table what we're talking about is the level of data detail that should be available for that part of the dimensional model okay so what is a fact table a fact table is a table in our data warehouse that contains the measurements of our of business success so when we talk about a fact table we're talking about the quantitative measurements now there are some nuances to designing a data warehouse a fact table can contain qualitative data um but we're going to keep it simple here for now there's there's specific use cases where you might have that um situation um and if you wanna know what those those uh those specific use cases are gotta get the you gotta get the data warehouse toolkit book it's it's critical uh diana called out she said hey these are all important steps even for designing a power bi data model dyna is 100 right that all this these steps here are are relevant whether you're designing a warehouse or even building a power bi data model but if you're building a power bi data model on top of a data warehouse you can follow those steps inside the data warehouse and that and all of that will trend transition over into your power bi data model very very nicely so diane is 100 correct the so we talk about fact table what we're talking about here is the quantitative measurements of business success and when we talk about the grain what we're really talking about is what does one fact what does one row in that fact table represent okay now it's recommended that your dimensional model be designed to support storage of the data at the most atomic or granular level now what does that mean and so what we're talking about is is data at the highest level of detail and there's a few reasons for that the reason is is because when you when you bring data into a fact table at the most detailed level that most atomic level that means that it can be rolled up in every way possible and this is really ideal for ad hoc query uh for users that are going to be exploring the data model whether they're probably gonna either they could be using a modern tool a bi tool like power bi or tableau or you know every organization that's got their excel gurus right and they just love excel and they love pivot tables and pivot charts right and so they they want to browse and explore that data model and so having a data stored at that most atomic level um is going to be really really valuable to ensure flexibility of the model i can think of a project um a project of um that i worked on with a care company and um we had the meetings with the with the customer and we were talking about how we're building the design of the data warehouse and they're saying well you know we only look at the data you know every so often and so i really think that we can aggregate some of this data and that'll help us um with data storage concerns and we say well are you sure you know you don't we we usually recommend that you store it at the most atomic level and here's the reasons why they said yeah we're sure we don't need it at this level and lo and behold we had a user was like well what about this report that i need to run i always ran this report with the old system and it looks at it at this level and we were in a pickle because we had we had decided that we were going to aggregate that data so choosing it at the most atomic level has a lot of benefits um over over trying to aggregate it um and in the long in the long run you know storage is is not all that expensive um so there's a lot of benefits sorting at the most atomic level okay michael has a question he says what would be an example of qualitative data in a fact table would it be reference to a dimension it's a good question michael and so when we talk about something like a qualitative data um this is a a good scenario would be um let's say that you have a qualitative field that has a one-to-one relationship with the grain of the fact table so that this is just an example i'm thinking about top head so let's say that you have like a sales order number right it's not quantitative because we can't aggregate it but a sales order number is a description right it's really a description of that particular sales order and so in that scenario right storing that sales order description it's a qualitative qualitative field but we store it in the fact table because if we if we have its own dimension what ends up happening is you have it really a just a one-to-one relationship between um that fact table and that dimension um and that doesn't mean that you know you don't break it out and like a bi tool have it in a separate table but in the underlying data warehouse um that that's often a scenario where you you can store that qualitative field in the um in that that fact table does that make sense michael awesome yeah that the book the data warehouse toolkit is is great it's a it's a great read and even if you don't go through the whole book you can really you can get a ton of value out of those first five chapters um i think after you get past the first five chapters the rest of the book is kind of focused on like specific business cases and it's still a great read i highly recommend it for everybody that is involved in data warehouse design um so the last the last thing here on this slide here is the reason that we can't the reason that we choose the most atomic level that most detailed level is that because it's because we can't predict the queries and in today's day and age data changes so fast and you may think hey look i know all the questions that the users are going to ask of the data it's just best to go ahead and and bring that data in at the most at the most atomic level because we just can't predict the query that the users are going to run all right now when we're when we're identifying the grain of a fact table this can be typically can be represented by a particular action in the business process um and so here's a couple a couple of examples here that we'll we'll dive in here into a second um maybe something maybe a transaction occurs and because we're using a retail example that's what we're going to go with here but it could also be something else maybe a balance has been recorded so maybe let's say every day we take inventory in the warehouse and we get a record of how much inventory we have in stock and we record that and that gets recorded in the data warehouse so that could be that's an activity right that is that is related to a particular action in the business process uh so we're taking inventory or we're recording the balances of all the different accounts right or where transactions retail transactions are occurring and so those are so those are specific actions in a business process okay now we must carefully craft our grain statement because this is really going to have great implications on the design of the star schema and and therefore impact the rest of the data warehouse and any future analysis services models or cubes or power bi models this this is all going to impact everything downstream from here so it's really important to get the the the grain statement correct now if you if at some point so let's say we define the grain statement and we realize hey we made a boo-boo we made an error we got to go back it's okay go back and start over with this process and make sure that you get the grain statement correct okay so for for our example here we're going to choose kind of a fun one we'll use blockbuster since they're not around anymore for our retail example here we're going to say that one row represents a a movie or product rented by a customer from an employee in a store on a date okay and when we define this grain statement really this grain here when we this statement here is capturing an action part of our retail business process and that action is we're selling a rental to a customer where they're buying candy at blockbuster or popcorn or whatever but this is this is a grain statement it captures the level of detail that we're going to record within our data warehouses fact table another example of this could be um a customer bought um a gallon of milk for 7.98 or on march 1st 2021 at store 358 using a bogo coupon right that's another example of a grain statement or another example could be one row represents a record of inventory taken on a single particular day so this is really the most important step in our data warehouse design and if you mess this step up it's going to cause a lot of rework and so this it's important to take take your time on this step okay um all right make sense okay so let's do this together here all right so let's capture our grain statement here okay one row represents i know the other one said movie but we'll call this one a a product purchase a customer from an employee at a store on a date okay cool all right i'm gonna put this at the top okay cecilia are you are you okay to see my whiteboard are you seeing that all right yes okay great okay so i'm gonna write that grain statement at the top so if i'm going through this with a customer what i'm gonna do is i'm gonna we're gonna write that grain statement up so that way there's no doubt the level of detail that we're capturing in the particular star schema that we're that we're designing at this part of the the process okay um now if at some point um you are working on your data warehouse design and you realize that you have a requirement that is going to violate this grain statement we need to come back and we need to adjust this um so let's say that uh you know maybe we need to capture if somebody uses a coupon right well this grain statement doesn't capture that right we didn't say anything about coupons or specials or anything like that if that's something that we need to capture we need to go in and adjust this grain statement so that we can make sure that we get this get this thing right okay all right all right now third step here is we're going to define the dimensions okay we talked about what a fact table is a fact table is a table in the data warehouse that contains the quantitative attributes right the measurements the business metrics could be things like cost account balance uh revenue profit right all these different types of things right in a dimension when we talk about a dimension a dimension table is a table that contains the descriptive data this is the quantitative or the qualitative data regarding the the business process right it's the who what where when right of the of the business process so it's things like if we know what our grain statement is right now our grain statement is right is a is a customer uh bought a product on a from a at a store on a given day from a from an employee right right that those are the who what where and when and so those are captured in what we call dimension tables so a dimension table can be a table that has to do with the products that are sold it can be a table that contains information around the customers who are buying products it can be the um it can be information around the dates right there's a date dimension right every every data warehouse should have a date dimension although i have um uh scene data warehouses that would that were designed without a date dimension how many of you have ever worked on a data warehouse that did not have a date dimension i had to tell that customer that uh you've done this wrong uh you need to go back do not pass go do not collect 200 right you have done this wrong and they're they're uh of course i won't tell you the customer was but their thinking was that well i don't have to include um the the date because i know that it's always going to be current and where i was just like oh oh let me help you uh but so the dimensions are the who what where and when of the data warehouse okay now if you've done a good job on the the the grain statement if you've correctly identified the grain statement coming up with the dimensions this should be really really easy now if you now as you're going through and identifying the dimensions those descriptive um kind of topic areas that are going to be represented in your data warehouse um and you let's say that you come up with a a a dimension that is not represented in the grain statement then we go back so the example we talked about a few minutes ago was um a coupon right well we didn't count was let's say hey we need our our coupon or our specials our special um sale dimension right well we didn't represent that in the grain statement so we should go back and we should adjust that okay all right so now if we look at this grain statement all right what are some of the dimensions this is the part where you guys are gonna have to come off mute or type it in the chat window what are some of the dimensions here that you are identifying in our grain statement all right so mona says products absolutely right so obviously we need to capture which product was sold or rented right michael said employee 100 right we've got employees that are running the register they're checking customers out so we want to know which employees are checking people out andrew said store right we want to know which store the transaction occurred at so we can do analysis across stores correctly right ja said date that's obviously a good one we want to be able to see how do sales compare from this year to the past years right that's another another good example and uh what else we got here yeah customer is another one here too okay all right so good job guys um um just call this out here because a good example ian said product employee product id store location time and price so let's talk about this okay so ian is right product is a dimension product id though is really an attribute of that dimension we'll talk about attributes and dimensions here in a few minutes but product id is an as a way to describe a product so we talk about product but we're talking about kind of a topic area that is used to describe quantitative measurements and that product id is really how we might describe a product now there are other ways we can describe a product right we could say what color is it how much does it weigh all those kinds of things so a product product id is not a dimension itself it's an attribute of that dimension we'll talk about the two same thing with store location right store location is an attribute of a store dimension what about time guys what do you guys think about time what should we if we let's no this isn't it it's not represented in our grain statement right we said on a day so we didn't say we didn't represent time here in our in our in our uh statement here but what about time what do you guys think which one i thought i'm sorry if i mispronounced your name vishwanath said that is a measure could be could be a measure right probably something in retail right we probably should capture time um right just to understand things like well what is what does sales look like in the morning versus versus the evening right or lunch time right this time so time could be a measure but i think in this case it's probably not a measure so when we think about like a time measure right a time measurement might be something like i want to measure the amount of seconds that have occurred right and that's not what we're doing here right we're not counting time we're not measuring time with a retail transaction i just want to know when it occurred so in this case time would really be like a dimension right could because it's a way for us to describe when that transaction occurred right we can capture the day in which it occurred we can capture the time in which it occurred um so if you guys think that we should capture that here the time of a transaction let's let's alter the grain statement here okay on a day at a given time okay all right so we'll alter that and in that case that that informs us to what the dimensions should be okay so we're going to alter this here we're going to have a time dimension too okay all right wilson said sales amount now this is a good one wilson it's a good suggestion we're going to capture that in data warehouse certainly but a sales amount is a quantitative measurement it's not a qualitative and the dimensions really are just for capturing the qualitative so a sales amount when we think about like a transaction sales amount that's going to be captured in a fact table we'll get to that part here now a sale amount could be a descriptive attribute if we talk about the sale amount of a specific product in the in the dimension so we can represent that here too um there's a couple of different ways we could capture that but it really depends on when we talk about a sale amount what are we talking about are we talking about the cost of a product for somebody to buy the product or are we talking about the uh the actual amount for that transit for that that the customer paid for the product out on the transaction okay all right so let's let's keep going here okay so we've got we've got these dimensions identified here and um using kind of these primary dimensions we could also kind of spawn off some secondary dimensions that don't really affect the grain of the tape the fact tables for something light so an example could be something like geography geography is related to a store location and it might go in a separate table but that wouldn't necessarily affect the grain of that that fact table even same thing like a sales territory maybe we had people that were related to sales territories that's related to a salesperson right dwayne said unless you want a price range with a mini dimension um so that could be a price range that could be that could be that's really a descriptive attribute of like a product right so you could say we have products that are in the zero to ten dollar range and then products that are in the 10 to 20 and right so on and so forth those are still like kind of a descriptive uh elements of that of those products um and that could exist in that dimension or we could have its own dimension there's a few different ways to kind of approach that but right that would really be like a descriptive thing so i think you're right there um now based on these dimensions we can then correctly identify the attributes that are included in that dimension so the dimension is kind of a kind of like a topical area in the in the in the uh the data warehouse that we're going to use to describe those transactions um and the attributes are ways to describe the dimension all right so let's let's continue to draw this out here and i'm going to do my best on this let's start with the let me make this a little bit bigger here a little bit smaller here okay no it's not cooperating all right all right so we'll call this our product dimension okay and i'm going to prefix it with a dim underscore and and just like i would if i was building this out in a data warehouse because that dim underscore helps me really look at this table in the data warehouse and quickly understand um what this what kind of table this is it's not a fact table it's a dimension table so i do dim underscore product help me quickly look at it see very easy for me to understand okay all right so what are some ways that we might describe a product um now uh ian already gave us one right ian gave us the product id so we'll go ahead and put that one in here what are some other ways we could describe this product right right mona eric you guys are are both correct right we could have something like a product code right maybe that's a code like in this in the system that is used represent a product right product name absolutely product category definitely um maybe things like product weight could be right product color right you guys are right on product uh type right all there are all kinds of different things right that we could come up here maybe the um product cost it's kind of a tricky one because it is quantitative but it's not a measurement that we use to kind of aggregate so it's just kind of a way for us to kind of break this down maybe into to cost buckets like price range buckets like dwayne was talking about earlier so we could include that in here right so these are all ways that we could describe this product all right now one of the interesting things here about this and and and one of the things you'll read about if you read the data warehouse toolkit is that one of the best practices to follow is to when you're when you're designing your dimensions you want to use a database key to represent that will serve as your primary key on this table right and there's a few reasons for that and that that key is what we call a surrogate key and i'm going to abbreviate it here sk for short okay sk for short and that sk is a database key that does not exist in the transactional system at all and there's a few really good reasons why we do not use use the product id that comes from the production transaction from the production sales transaction system can you guys think of some good reasons why we might not want to use the product id from the production system and we want to create brand new primary keys in this in this uh in our product dimension while you guys are noodling on that thank you cecilia that's uh we'll call we'll write this in here surrogate key all right we'll put this up here at the top okay ah lm is right what if the id changes well if that's if we've made the product id our primary key in the dimension table and the id changes we're in big trouble right so using a surrogate key which will act as the primary key on the database table is really important um what if we what if a blockbuster let's pretend that like blockbuster won the video the the video rental war and they acquired netflix right and now they have to bring in all these netflix products right what if uh in blockbuster system all the product ids are numbers and in uh netflix's system all of the product ids are widths right now we've got to bring those together and if i've if i'm counting on my product id to be a number i'm in i'm in trouble now so having a surrogate key to act as our database key that is unique to the data warehouse is important to insulate us from changes in the source system so if things change in the source system we want to be able to internally insulate ourselves from that all right here's another example what about slowly changing dimensions if i have a dimension that i need to track history in so let's say product cost changes um i want to be able to reflect that change in history of my dimension table and so therefore i might need to have two records for the same product well if the primary key is based on the product id then then i'm going to be in trouble because i'm not going to be able to have two records representing the same product in that in that in that table so here's an example here okay so let's say i've got product id number one and the product is dances with wolves okay and it costs you know 498 to rent that product and and if that part that cost to rent that product ever changes i want to be able to capture that and so usually the way that it's done in a data warehouse is you have a a begin date on the record and an end date on the record let's make this a little bit bigger okay and therefore i can create when that price when that price changes i can create that second entry in my in my data warehouse for that same product if my product id is the primary key on the table i will not be able to do this right and and duane's right right slowly this is a slowly this is a classic type 2 slowly changing dimension right and so there's a lot of good reasons not to use your system keys as primary keys in your data warehouse you want to create new product keys for that and there's a lot of patterns for that in the popular etl tools um that you can use to create those surrogate keys but that helps insulate you from changes in the data from the from the source system and it also helps you do things like slowly changing dimensions i just want to call that out um we can spend a lot of time talking about that but we don't have time to to do that okay cool all right so we've got to build out the rest of our data warehouse so let's do that so we've got product we've got employee i'm gonna do date as well we won't do we're not we don't have time to do every single one of them so i'm just gonna pick a few here all right all right employee date then store okay all right so let's fill in let's fill in some of the attributes for these different dimensions okay now we already talked about why we might want to have why we use a surrogate key so i'm going to go ahead and fill that in right employee sk what are some of the attributes of the employee dimension that that we might include in here ed's got a couple there so we might want to have employee name employee title right employee department right right what else you guys got any other good ideas for for those maybe there's like an employee id too right right you kind of got that they move right employee id right might have their you know higher date right there could be some other things right there could be a lot of other things in here too right so we'll just call it we'll call it this but you guys get the idea here okay so we've got a few different descriptions around this here okay all right let's do date now this one is going to be a little bit different we're going to have a date sk and uh what are some attributes of the state dimension right what are some ways that we can describe a date right we might have the year right you guys got it or you might have the month right we might have the quarter right it could be something like uh the month name right yeah okay yeah that's a good one damn maybe a holiday indicator or maybe like a weekend indicator too right it might be good to be able to know if a day is a weekend or a holiday uh diana getting getting fancy here right what about a fiscal year right fiscal year could be different than the calendar year so it might be good to have like a fiscal year like a fiscal order like a fiscal month right all of those things so there's all kinds of different things ways that we could describe um a date okay you guys are doing a good job you guys all get imaginary bonus points here okay all right let's do the let's do the last one here too let's do store right store sk what are some of the the attributes of a store that we might want to record in the data warehouse too yeah okay yeah so so store address yeah definitely it might be helpful to know where the store is at especially if we're going to do any type of geographical analysis of it um maybe like the store number yep for sure maybe like the store region right other things like that right yeah city right yep for sure right it's maybe uh square footage like uh duane said size right maybe we want to record the square footage maybe that's relevant in our analysis to be able to determine you know why one store has a different number of transactions the other one has something different right yep the open date yeah you guys are getting the idea here if it's owned or rented right all kinds of different things good job guys all right now one thing that i want to point out here um in this design is that um ah cecilia you you beat me to it you beat me to it i was actually just about to call this out here is it is considered a best practice to write out in in in detail the names of each attribute and and i did violate that a little bit here cecilia you'll notice that like i i did i did uh i did um kind of shorten here department right um but if i was following my own best practices right i would i would have wrote it out all the way same thing with like quarter right write these out all the way but for time's sake i was kind of abbreviating things but it's considered best practice to write everything out fully and the reason for that is is is certain abbreviations may make sense to you but remember the name of the game really the name of the game here for this data warehouse is is usability right we want this to be as user friendly as possible and one of the ways that we can do that is by is by writing things out and making it as easy to understand for our users as possible right so i don't want to use abbreviations i don't want to use acronyms or anything right here because if a user looks at this table and they can't understand what the columns represent right that's going to make it much harder for them to use use the the names of the uh use this data warehouse so it's considered a best practice to to not really use any abbreviations or acronyms in your table names or your columns okay so an example of that right employee department right ideally i wouldn't want to use that i wouldn't want to abbreviate that but i think you could you probably can make a case in scenarios like this where you where most people are gonna understand what dept so if you're gonna use abbreviations for things like that i think you could probably make a case for that but it's considered best practice to make sure that you write things out fully and and cecilia says that she was calling out like the the the cap the camel casing right of of capitalizing the the first letter of each each word it does make it a little easier to read right and that really the name of the game here is usability we want this to be as usable as possible okay so but it's a good call out thank you cecilia so avoid avoid using abbreviations where you can um now one thing i'll point out here in the when we talk about syria he's kind of going back to the surrogate key topic here for usually the surrogate key and we're talking about sql server here is is it's it's an identity column it's a primary key that's an identity column so it's an auto incrementing number that starts at at one or however you set your in any column up and it auto increments one two three four five six seven eight right until the end of time right and so they're typically not useful for users to see and so generally speaking when you are building your data warehouse and you're deciding which columns you're going to expose to the users through the views through the analysis service model through power bi whatever the bi tool is typically you hide the surrogate keys you don't want your users to use the circuit keys you don't want them to know they exist because they're really serve a functional purpose in the data warehouse and not a reporting purpose so generally speaking don't show the users the surrogate keys and there's a reason for that what happens is what i've seen happen before is that a user will will realize that a certain product is a certain surrogate key and they write that number down and they start using it and then let's say something happens with your product dimension you have to reload it and those surrogate keys change because they're they're just a really a practical functional purpose in the data warehouse that user that has built reports off that table that is using a specific surrogate key now their reports are broken so typically you don't want to show the surrogate keys to the user the circuit key like i mentioned is typically an auto incrementing number starting at one and going on and on and you know every time you insert a new product into that table right you would it will auto increment there is a exception to this and that is with the date dimension oftentimes in the date dimension you don't use an auto incrementing number you use what's called kind of a smart surrogate key and that smart surrogate key instead of starting at one let's say our date dimension starts at uh 2000 okay um so our smart circuit key could look like let me do this here like this okay so i'm not starting at one but i'm using a smart surrogate key and that's really just kind of a a practical thing i mean you could use an auto incrementing identity column for the serial key in the date dimension but using that smart key is just helpful for being able to quickly look at that table and understand what that date represents it represents in a fact table right and we'll talk about that here in a minute so i would have you know year 2000 january 2nd then the next row would be january 3rd the next row would be january 4th right and so it's it's not necessarily um auto incrementing because eventually we're going to get to the end of january right then we're going to roll over to february right um so that's the idea here is that a lot of times you'll see that in a date dimension is that you use a smart smart surrogate key okay all right any questions about any of that that we that we just went over any questions about it in the dimensions cert keys any of those types of things all right all right so we've built out some of our dimensions here i'm going to make some space up here at the top okay and let's go back to our our business process okay for our grain statement here okay so now that we've kind of fleshed out the dimensions and what those dimensions are going to look like now it's time to flush out that fact table and so remember that fact table is where we capture the measurements of this business process and so now we're going to talk about that what are those measurements of that business process all right so the last and final step here is is how do we define the measurements and this this step here is where we identify what are the business metrics or the measurements that measure success right and something like retail right it's probably gonna have something to do with sales amount and profit and revenue and those types of things right but depending on what your business processes it might might look a little different okay now when we're defining the business metrics here it's important to remember that these business metrics not violate the grain statement okay so that's why i went and reviewed the grain statement as we defined the business metrics we don't want to violate that grain statement so if we remember it said if we if we come up with a dimension or a measurement that violates the grain statement we need to go back to step two and redefine our grain statement to make it more accurate okay so let's say that we're um we're looking at this and we said oh well we need to capture you know what was the percent of discount based on you know the coupon the customer had maybe they had a rent one movie get one free type coupon right well we didn't capture coupon in our grain statement so we need to go back redefine our grain statement rework the dimensions and then come back to this okay so if you come up with anything that you feel like is going to violate that grain statement we need to go back to step two and hash that back out okay all right so when we talk about measurements right these are the quantitative numbers that we want to capture of how the bus how the how the business is going to measure success so it could be things like the sales amount the order quantity the cost amount if maybe there's a shipping amount maybe there's like a text a tax amount right something like that those are all different measurements right now when we talk about measurements or metrics right it's important to know that the best measurements to capture in your fact table are measurements that are fully additive or fully aggregateable all right so when we talk about fully additive measurements fully additive measurements are measurements that can be rolled up anyway any way they can be and typically we talk about fully added we're talking about like sums averages or counts or distinct counts um if you if you want to get technical right um um which will which we'll talk about here in a little bit um but when we these are i'm sorry not distinct counts um just regular counts distinct counts are going to be an example of a semi-additive we'll talk about the semi-additive and the non-additive here in a few minutes but fully additive measurements are things that can be summed they can be counted or they can be um averaged right those are just some examples right um there are certain measurements that are not fully additive that are best captured in the analysis services layer or the bi tool layer like power bi so what are some examples of if we if if we think about what are fully additive measurements that are counted or are summed or averaged semi-additive measurements are measurements that can't be just summed across all time in all um um in all dimensions right so what are some examples of maybe some semi-additive or non-additive measurements that we might want to hand or handle in the bi tool layer all right so you guys kind of beat me to it right the account balance right yeah an account balance is a great one duane because we can't just sum up account balances we're going to get whacky numbers if we sum up any account of the account balances right and so an account balance is a non-additive measurement right it we have to get the if we want to get the correct account balance right for a month right typically you take like the last day of the month right so i can't just sum up the balances across the month i got i can only i only have to take i'd have to take only the last account balance of the on the last day of the month right now something like eric had a question he said would you consider profit margin to be secondary and aggregate to be a secondary aggregate measure so something like the margin right um uh if we're talking about like a percentage right that's a semi-additive measurement right so that's typically something like i said like any type of percentage is going to be something that is best calculated in the bi layer because we can't aggregate those percentages in the data warehouse table we can't sum them together right it makes it really really difficult so typically things like that are percentages right are considered semi-additive and those are things that we're going to calculate in the bi tool there whether that's analysis versus power bi or something else okay now mona said count of id um now when you when you talk about like a count of id if you're talking about like a distinct count right that's a good example of a of a semi-additive measurement that you want to calculate in the in the bi tool in the bi tool layer so if you want to count like the distinct number of products sold it's a good example of a semi-additive measurement okay so whenever you have any type of measurement that is like a time-based calculation it could be like um year-to-date quantity right or the sales last year or um number of products sold this month right those are all examples of time-based calculations that are best handled in the bi tool layer and analysis and tools like analysis services and power bi are are incredibly well suited for those types of calculations right those time based calculations so something like profit margin right to to eric's point right a great semi-additive measure any type of time-based calculation does another semi-additive measures distinct count semi-additive any type ratio type calculations right those are all things that are typically handled really really well in the in the bi tool layer okay so when we define the measurements of our fact table right the best measures are fully additive okay all right so let's go back here let's add in the fact table okay alrighty we're going to call this back sales all right okay so let's do this obviously we want to capture the transaction amount right the sale amount so how much did the customer pay for the product when they rented the movie or they bought the popcorn or they bought um ed's already already uh listed out some of the things that we should put here in the fact table but we've got the sale amount right maybe there's a tax amount right we want to capture that we want to know how much the customer paid in tax was there like a discount amount or something like that right that could be something that we capture in there maybe there's like a quantity too right well how many boxes of popcorn did they did they purchased or how many bags of skittles or how many snickers bars did they do they might we want to capture that as well any other examples of any uh business metrics or measurements that we might want to capture here in the fact table okay all right on here and then of course the last thing here is we need to capture the we need to create these relationships between the fact tables and the uh dimensions all right so ideally we'd have the product sk captured in here we have the employee sk captured in here right sk store sk also the customer sk even though we don't have those tables represented here time is k okay all right and that's really where our relationships are going to exist right right and and typically what you have here depending on your now this is gonna this this greatly depends on your system right um typically these are your your sks are you going to be your primary keys on your dimension and this is going to be a foreign key relationship so we're talking about a good old sql server or something like that right you have primary key and foreign key uh relationship here and thank you cecilia pk stands for primary key and the fk stands here for a foreign key okay okay duane says please speak to the difference between order and order line usually a sales table would be at the line label level at the line level um yeah so and and i think what duane speaking here too is is it really just depends on the type of retail sales process that you're that you're that you're talking about here and and if you have multiple fact tables right you you know you'll see this actually you'll see this example if you ever go and look at the adventure works database um that's a great sample kind of data warehouse schema to look at you'll see that built out in in adventure works where you have the the sales order table and the sales order line sales line order um table you'll see that kind of built out in there and if you read ralph kimball's book the data warehouse toolkit he also talks about that too even though we don't have that represented here um this is just kind of a simple example but but it but it's a good a good call out dwayne all right rasheed said uh is it good is it a good idea to have a surrogate um e in um in a distributed data warehouse because of his distributed nature calculating next incremental number would need compute resources um oh i just solomon i just saw that was you solomon is a a colleague of mine that him and i have worked together on on pro projects in the past um it's a good question and and i'm trying to remember off the top of my head salman um in something like synapse i i and i i don't i don't know if is still the case but a primary keys are they supported in synapse i don't i don't believe they are and i'm trying to i have to go look it up but i but it's something like an mpp data warehouse like azure synapse when we talk about like the the dedicated sql pools i believe let me look and see i have to look it up yeah okay so primary keys are only supported um when non-clustered and non-enforced are both used so it looks like it is now supported so primary keys are now supported in azure synapse and if you go look at the docs if you go to look at the doc salmon um you will see the remarks on why it's good to have a primary key in the dedicated sql pool and it really has to do with the the query engine coming up with an optimal execution plan for a given query um you're right it does so if you have something like surrogate key generating that auto incrementing number that is going to require some compute resources in order to do that um but it sounds like that there's going to be some benefits to query execution that you might want to consider when you're talking about building out a surrogate key auto incrementing surrogate key and something like an mpp data warehouse so that's a really good question though um okay so there you have it we've built out a star schema and i i really just kind of wanted to walk through that this is an exercise that you might go through with your organization as you build out your data warehouse now obviously right here what we have is a very simple star schema and the reason it's called a star scheme is because it's kind of it's kind of the shape of a star right it's not a not a perfect star right but when you have a star schema usually it looks like you know something like this right you have a bunch of tables kind of around a fact table right and you've got arrows right so you call that's why they kind of call it a star schema and you might hear the term snowflake schema where you have you know secondary dimensions that are kind of built off of that off of your your primary dimensions and that's why they call it a snowflake schema because it starts eventually it starts to kind of look like a snowflake right um but i just kind of wanted to walk through this exercise here because you can follow this exercise with your organization as you're designing out the structure of your data warehouse okay now this is what it might look like after you take your first swing at building out your data warehouse but as you continue to build out your data warehouse you're going to go through this process over and over and over again so starting at the very beginning identifying that business process step two identifying the grain step three build out the dimensions step four build all the facts fill out the measurements right you'll go through those processes every every time and the goal here is as you go through those processes and you continue to build out other examples of of star schemas you want these dimensions to be what are called conformed dimensions conform dimensions are dimensions that can be shared across fact tables and so eventually you might have a what's called a a bus diagram that helps to tie dimensions to different fact tables so in this example here what we this is the first one that we we tackle here fact sales and it's related to product date employee store and customer but it's not related to data warehouse or or to the dem warehouse dimension or xyz dimension right and let's say the next business process we're going to tackle is inventory right well inventory can be tied to product and date but not employee right and so in this example as we built out inventory oops what i would do is i would have another eventually i would have another fact table over here that would be called fact inventory and we won't build out all the all the all the columns and attributes and everything here but it would also be uh oops not that one be related to these same dimensions and this is how you create the ability to analyze different business processes by related dimensions now i can look at my product sales and my product inventory in the same query in the same report okay so it's important to have use these conform dimensions so that way we can support analysis across different business processes okay eventually i would continue to build out these different facts and then i would be able to see how these different fact tables are tied to specific dimensions now uh damn move says can you please explain a jump dimension with a sample so that's a good question and they'll talk about this in the data warehouse toolkit a junk dimension is really just kind of a dimension that you use to hold a attributes that really don't tie to any one particular dimension that you already have in place and it's just called a junk dimension because it's kind of like a junk drawer you just use it to hold dimension attributes that really don't tie to anything um so let's think of an example of what a junk dimension might look like in this model um help me out guys here what's up what's a good example of maybe a dimension attribute that doesn't fall into one of these and doesn't doesn't necessarily fit into one of these dimensions but we still might want to capture in the in the data warehouse while you guys are thinking about that i'm gonna i'm gonna address sonny's question here he said so if i purchase five products in one transaction that one transaction will generate five sales fact records that's exactly right sonny you got it right so if i if i ran to dances with wolves and i bought a bag of skittles i'd have two rows inserted into that fact table that showed that i rented dances with wolves and i bought a bag of skittles that would be two separate rows let's see ed said if you had a category subcategory for a product hierarchy would that still be considered a star schema i mean it depends on how you build it out um but um [Music] i mean it i mean if you have like if you have secondary dimensions i guess technically they call that like a snowflake schema um but i mean i guess it doesn't really you know the the the nomenclature that you use whether star schema or snowflake schema for your for your data warehouse design it doesn't really matter i mean in my mind i think it just really depends on you know are you building a a as simple a design and as as optimal a design as possible um now you know we've only got five minutes left here and it probably would be good to you know mention how do you know if you should snowflake out a dimension um ralph kimball talks about that in the book um and we don't have time to kind of go into that but there are some easy rules to kind of follow um you know easy rules to kind of follow um you know to help you understand when you should snowflake something out domino said maybe for the junk dimension if the customer purchased any other snacks or something as a flag that could be i mean that could be right and really i mean really the takeaway from that is um when a you know when you have dimension attributes that really don't fit in any dimension and you kind of need you know its own place ed said genre i mean genre could go with like product i mean i mean maybe i mean maybe i mean there's there's a good example there um uh gregory set or info yeah that could be a really good one maybe you have like an like an order note or something i mean i don't know um it's kind of a you know being put on the spot kind of hard to come up with an example but i think you guys get the idea and and like i said ralph kimball talks about that in his book um so that's really all that i've got to share i hope that was helpful i hope that gave you some a good understanding of maybe what kind of where to start with data warehouse design really i think if you take away one thing from the session is get these resources um start the complete reference star schema fantastic data warehouse resource uh the data warehouse toolkit like i've mentioned it several times fantastic data warehouse resources if i think every data warehouse professional should have these books and they're on their bookshelf these are our two really incredible books that i highly recommend that you purchase if you're if you're doing anything around data warehousing thank you again cecilia for having me on and letting me do this session um it's a lot of fun to do this session and thank you thank you guys for all of you for uh participating and um asking lots of good questions and helping me build out the data warehouse design um i'll make sure that i share the slides i'll put out the um i'll put the the link to the slides they're at this url right here so you see it here at the top of your screen you go to tinyurl.com design dw ground up and you can download my slides and have that um as a reference but if you if you only have one take away from the session go get those books because the books are incredibly beneficial you
Info
Channel: South Florida Data Geeks
Views: 202
Rating: 5 out of 5
Keywords: datawarehouse, businessintelligence, business intelligence, analytics, data analytics, star schema, datawarehousing
Id: X_YcGkWzEiA
Channel Id: undefined
Length: 80min 44sec (4844 seconds)
Published: Fri Jul 16 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.