Header-detail vs star schema models in Tabular and Power BI

Video Statistics and Information

Captions
Ciao friends, and welcome to a new video from SQLBI. In this video I want to talk again about star schemas. I already recorded a video some time ago comparing a model built with a single table against a model with a proper star schema, and it turned out that the star schema was better. In this video I want to go in the opposite direction: I want to create a model with header-detail tables and compare it with a star schema.

Header-detail tables appear quite often in data modeling, because you might have invoices. You have an invoice header table that contains the customer, the date, and the other information related to the header, and then you have the detail rows. In the detail you have the product key, the quantity, the net price, and all the information related to the individual line. The two tables are linked through the order number, a column that uniquely identifies the order. So it is quite likely that you find header-detail structures in OLTP databases. You can store them as they are in your Power BI model, or you can create a star schema out of them by moving all the information into a single Sales table and then linking it to all the other dimensions.

The interesting question I want to answer is whether a header-detail model, which is a more normalized version of the star schema, is a good option for your data. I can already give you a very quick answer: it is not. No. By no means. Not at all. Are you kidding me? No, it is not. Header-detail tables are far worse than star schemas. We create star schemas all the time; seasoned data modelers always create star schemas because we know they work better. But as I am saying this, it might be just my personal opinion. Therefore, I am going to show you the same data model, with the same content and the same data, shaped in two different ways: as a header-detail model and as a star schema.
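Moving all the header information down to the line granularity is the step that turns the header-detail pair into a single fact table. As a minimal sketch of that idea (the table and column names here are assumptions, and in a real project this step normally belongs in the ETL or in Power Query, not in a DAX calculated table):

```dax
-- Hypothetical calculated table that denormalizes the header columns
-- into the detail rows, producing the single Sales fact table.
-- RELATED follows the many-to-one relationship from detail to header.
Sales =
ADDCOLUMNS (
    'Sales Detail',
    "CustomerKey",   RELATED ( 'Sales Header'[CustomerKey] ),
    "Order Date",    RELATED ( 'Sales Header'[Order Date] ),
    "Delivery Date", RELATED ( 'Sales Header'[Delivery Date] )
)
```

Once the header columns live in the fact table, the Sales table can be related directly to Customer, Date, and Product, and the header table is no longer needed.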
Then we make some considerations, both in terms of model size and performance, in order to draw the proper conclusions. Let's get started.

I have this model. This is our regular Contoso model, but this time it is much larger: it contains 1.4 billion rows in the sales detail and around 100 million rows in the header. This first model is shaped as a header-detail model. We have the detail containing the product key and all the information about the individual line, and then we have the header table, which is linked to the date of the order and to the customer. So we have two tables, linked through the order number column, which ties the detail to the header. You can shape the model this way, or you can go for a star schema. In a star schema, the information from both the header and the detail is stored in the Sales table, so Sales is related to Product, to Date, and to Customer.

The two models are different indeed. The star schema is a proper star schema, because we have a central table linked to all the other dimensions. The header-detail model might look like a star schema, but it violates one of the most important rules of star schemas. Let me elaborate on this. The star schema has the proper star shape. But if you look at the header-detail model, Sales Detail is a fact table: it is linked to Product, which is a dimension, and it contains numbers that we want to aggregate. Then we have the Sales Header table. Sales Header, again, is linked to dimensions (Customer and Date), and it might contain numbers that we want to aggregate. That makes both Sales Header and Sales Detail fact tables. This means that the relationship between Sales Detail and Sales Header is a relationship between two fact tables, and in a star schema you cannot have relationships between fact tables. This is one of the rules, and violating it means we are no longer dealing with a star schema. That is why header-detail models are not star schemas.

With that said, this might be just theory: we do not like the model because it is not a star schema, but we need to investigate a bit further to understand where the problem is. To do that, we use DAX Studio to look at the size of the model, and we start making further considerations. I have DAX Studio here on a different monitor, already connected to Contoso 100M, which is our model. Let's look at the size of the tables: View Metrics gathers the information, and then we see what kind of database we are dealing with.

As you see, we have Sales, Sales Detail, and Sales Header. Both Sales and Sales Detail contain the same amount of information, around 1.4 billion rows, whereas Sales Header is much smaller, around 100 million rows. Let's look at the header-detail pair first. The largest column of Sales Detail is the order number, which contains 94 million distinct values, and its total size (let me enlarge that a bit) is 11 gigabytes. It is a large structure, but that is expected because of the large cardinality. The same happens for the Sales Header table: there, the order number is 6.5 gigabytes. The numbers differ because we have only 94 million rows in Sales Header against 1.4 billion rows in the detail table. But the important point is that we are using 11 gigabytes plus 6 gigabytes, around 17 gigabytes in total, only to store the order number column. And the problem is not just the space used in the model; as we are going to see in a minute, the header-detail model is actually smaller than the star schema. The problem is the size of the column used to relate the two structures: whenever the engine has to follow the relationship between header and detail, it has to handle very large data structures, and this is going to affect performance, as we are going to see. But now
that we have a picture of the header-detail model, let's take a look at the star schema. In the star schema we only have Sales, and in Sales we again have the order number column, which is 11 gigabytes, roughly the same size as before, and then we have the other columns that are needed. It is worth noting that CustomerKey, Order Date, and Delivery Date are columns stored in the header in the header-detail model, whereas they are stored in Sales in the star schema. Therefore the star schema keeps more columns in the table with the larger number of rows, because we denormalized information into the fact table, and I would expect the star schema to be slightly larger than the header-detail model. We can actually check that: the total size of Sales is around 25 gigabytes, whereas Sales Detail and Sales Header together are 16 plus 6, around 22 to 23 gigabytes. So there is a difference of around 2 gigabytes, and it turns out that the star schema is indeed larger.

You might therefore think that, being larger, it is going to be slower. But if you only pay attention to size, you will be misled: what really matters is the usage of the columns. Let me elaborate on this. The largest column in Sales is the order number, and the important question is: when are we going to use it? We use it whenever we want to scan it, for example to compute the distinct count of the orders. So it is a large column, but we do not touch it unless we actually scan it. A distinct count of the order number might be slower on the star schema, but that is the only calculation that actually hits the order number column. In the header-detail model things are very different: in the header we are using 6.2 gigabytes, in the detail table 11 gigabytes, and the usage of the order number column is very different. We are not using the order number to count the number of orders; we are using it whenever we want to slice the detail by dimensions linked to the header, that is, whenever we want to follow the relationship.

Let me show you that with the diagram. Here we have the diagram of the header-detail model. Based on what we have seen, we know that the relationship between Sales Detail and Sales Header is based on around 100 million distinct values. We have not looked at it yet, but I recall that the relationship between Customer and Sales Header is around 2 million rows. Now, when are we going to use that relationship? Imagine we place a filter on the Customer table. The filter needs to travel all the way to the Sales Detail table, and to do that it first traverses the relationship with Customer, which is 2 million rows, quite large already, and then it has to traverse the relationship with the header, which is 100 million rows: a huge relationship. The presence of this relationship linking the two tables is going to affect performance by a lot.

The same scenario does not occur in the star schema, because there, moving a filter from Customer to Sales only uses the 2-million-row relationship; from Product to Sales it is only 18,000 rows, and from Date to Sales it is only a handful of rows. In the header-detail model, instead, whenever you move a filter from Customer or from Date down to Sales Detail, an operation you are going to do quite often, you need to use the largest column in the model as a relationship. Therefore, despite being smaller, I expect the header-detail model to be much slower. Of course, we need to prove it, so let's start writing some queries in DAX Studio and measure performance. The query needs to group by attributes of the Customer table, so that we are linked
to the header table, and then we sum values from the detail table, so as to use the relationship that we believe is going to be the problem. Let's do that. We already have DAX Studio open; we just need a bit more real estate by closing these panes, and we can write a simple query: EVALUATE over SUMMARIZECOLUMNS. As I said, we need a column from the Customer table, so we use Customer[Country], and then we compute an Amount as a SUMX over Sales Detail of Sales Detail[Quantity] times Sales Detail[Net Price]. Let's format it the proper way. So I am slicing by customer and computing a value on top of Sales Detail. We enable Server Timings, and finally we run it to get a first number; we always run it a couple of times, just to warm up the cache.

Here we have the number. Right now I am not interested in the total execution time, but in the storage engine CPU time, which is around 236,000 milliseconds. So we can write down that Amount on the header-detail model uses roughly 236,000 milliseconds of storage engine CPU. The degree of parallelism is quite high, but the storage engine CPU is what matters here. Now, what happens if I run the same query over the star schema? All I need to do is get rid of "Detail" here: Customer is related to the Sales table, so now I am using only the relationship between Customer and Sales. Let's run this, again a couple of times, and it turns out to be only about 15,000 milliseconds: the Amount calculation on the star schema takes 15,516 milliseconds. You can easily compute the ratio between 15,000 and 236,000: it is huge. The reason for this huge difference is the large relationship linking Sales Header with Sales Detail, which is based on a very large column. This is a price you pay whenever you slice the detail table by one of the dimensions that are linked to the header table. Therefore, if you slice using attributes from the Product table, you will notice nearly the same performance, because Product is linked to both Sales Detail and Sales through relationships of the same size, and the number of rows in the two tables is very close, so you will not see any difference. But as soon as you slice by year or by customer country, the relationship between the two fact tables comes into play and you experience really poor performance.

Moreover, it is worth noting that we are running this test with a very simple calculation, so the ratio between the two numbers is clearly in favor of the star schema; your mileage may vary by a lot. With a much more complex calculation things might look different, because the time required to compute the measure partially hides the time required to traverse the relationship. But that does not mean the traversal time is not spent: you are still spending more time, and there is no way a star schema is going to be slower than a header-detail model.

There are, however, scenarios where it might make sense to create a structure that somewhat resembles a header-detail model. Let me elaborate on this a bit. In a star schema, if I want to count the number of orders, I need to compute the distinct count of the order number. The order number is a very large column inside my fact table, and computing its distinct count might be quite heavy. In the header-detail model, if I only need the number of orders, I can simply do a COUNTROWS over the Sales Header table, following the relationship between Customer and header. Therefore, if I frequently have to compute the distinct count of orders or of buying customers, slicing by attributes that are on the Customer or on the Date table, it might be that the header-detail model is slightly faster, because I can perform the count on a table of only 100 million rows instead of 1.4 billion rows; moreover, I can do a COUNTROWS instead of a DISTINCTCOUNT, so it looks like I get better performance. This is for the number of orders, but the same holds for the number of customers, which becomes a distinct count over a much smaller table. The problem appears as soon as I do things a different way: if I have the customer key in the Sales Header table and I want to slice by product, the only way to make this work is to make the relationship bidirectional, and performance goes south quite quickly. Again, I want to show you that.

So let me write a different SUMMARIZECOLUMNS: we group by Customer[Country] and compute, say, the number of customers with a distinct count of Sales[CustomerKey]. Let's clear the previous numbers. I am grouping by customer country and computing the distinct count of the customer key from the Sales table, 1.4 billion rows. I run it, and it takes a while, because a distinct count is a heavy operation: you see, it is 76,469 milliseconds on the star schema. I can run the same calculation on the Sales Header table. Sales Header has only 100 million rows instead of 1.4 billion, so I would expect this to be much faster, and indeed it is: 15,906 milliseconds. So computing the distinct count over the header table is faster than computing it over the fact table. But, as we said earlier, that depends on the query you are running: if you only slice by attributes linked to the header table, then yes, speed will be better. But what happens if you slice by attributes on the Product table?
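The two distinct-count measurements just described can be sketched as DAX queries like the following (the table and column names are assumptions based on the model shown in the video; the two EVALUATE statements can be run as one batch in DAX Studio):

```dax
-- Distinct count over the star schema fact table (1.4 billion rows)
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Country],
    "Customers", DISTINCTCOUNT ( Sales[CustomerKey] )
)

-- Same calculation over the much smaller header table (~100 million rows)
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Country],
    "Customers", DISTINCTCOUNT ( 'Sales Header'[CustomerKey] )
)
```

The second query only touches Sales Header, which is why it can be faster, as long as every filter in the query can actually reach Sales Header.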
As usual, we need to test it. Instead of slicing only by customer country, we slice by Customer[Country] and by Product[Brand]. Now the query traverses two relationships: the one with Product and the one with Customer. For the star schema, the measure should reference Sales[CustomerKey]. Let's clear the numbers again and run it. That is the distinct count by country and by brand executed on the star schema, and this time the numbers start to grow: the distinct count takes 355,781 milliseconds on the star schema. OK, so we have a number. What happens if I do the same on Sales Header, slicing by country and by brand and computing the distinct count of 'Sales Header'[CustomerKey]? I run it, and it is way faster: 15,703 milliseconds.

But we are missing one important point. Look at the results: they are sliced by country, but not by brand. Why is that happening? Because of the data model we created. In the header-detail model, a filter placed on Product reaches Sales Detail, but it does not reach Sales Header; therefore the filter on Product is not effective against 'Sales Header'[CustomerKey]. To make this filter work, I need to make the relationship between Sales Detail and Sales Header bidirectional, which I cannot do in Power BI, because this is not a Power BI model, it is a Tabular model. But we can use Tabular Editor. So let's open Tabular Editor, already connected to the same model, and look at the relationship between Sales Detail and Sales Header on the order number: right now it is unidirectional. What I need to do is make it bidirectional. Once it is bidirectional, the filter on Product goes to the detail and then to the header, and at that point I can compute the distinct count of customers. So I save the change, and I still have my query that computes, by customer country and product brand, the distinct count over Sales Header, but this time it will work through the bidirectional relationship.

Let me run it. What I am going to do now is disappear for a while, because I have already checked this query and I know how long it takes; I will stay silent, cut the execution time out of the video, and we will look at the result afterwards. Oh, it finished. Let's look at the numbers in Server Timings. Here we are: that is 2.5 million plus 1.6 million milliseconds, say around 3,600,000 milliseconds of storage engine CPU. You might also be interested in the total execution time, which is around 18 minutes. So it took 18 minutes to compute the distinct count, and the reason is that relationship, which you still see as unidirectional in the diagram but is actually bidirectional now; if we refresh, you can see the bidirectional relationship. The time needed to compute a value over a large relationship is already very high, but over a large bidirectional relationship it is far worse. I would say totally insane: I have time to grab a coffee and rest a bit; you might have time to eat lunch while the engine is computing values.

At this point you might think that Tabular sucks at computing distinct counts over large bidirectional relationships. That might be an opinion, but I am sorry, that is on you: you built the wrong model. You built a header-detail model where a star schema would fit much better. Next time, think about the star schema, because star schemas always win.

So, as we have seen, we started with two models, a star schema and a header-detail model. We made several considerations and took some measurements, and even though header-detail models are slightly smaller, and in very specific calculations the header-detail model might perform better, overall the star schema is the clear winner. There is no reason to go for a header-detail model when you can build a star schema. Of course, that requires some extra effort: you start from a header-detail structure in the data source, and you need to build ETL steps to create your star schema, but it is totally worth it. Then again, we used 1.4 billion rows because I wanted the difference to be really visible; with a smaller model you will not notice such a huge difference, but you would still be working with a suboptimal model. Whenever you have the choice between anything else and a star schema, in my experience, the star schema always wins. Enjoy star schemas!
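As a footnote to the bidirectional-relationship experiment above: instead of changing the relationship in the model, the same behavior can be enabled for a single measure with CROSSFILTER. A minimal sketch, with table and column names assumed from the model in the video (the same performance warning applies, since the engine still has to cross-filter through the huge order number relationship):

```dax
-- Hypothetical measure: enables bidirectional filtering on the
-- header-detail relationship only while this measure is evaluated,
-- so a filter on Product can reach Sales Header through Sales Detail.
Customers :=
CALCULATE (
    DISTINCTCOUNT ( 'Sales Header'[CustomerKey] ),
    CROSSFILTER (
        'Sales Detail'[Order Number],
        'Sales Header'[Order Number],
        BOTH
    )
)
```

This scopes the bidirectional behavior to one measure rather than the whole model, but it does not make the traversal any cheaper.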
Info
Channel: SQLBI
Views: 11,505
Id: R8SaJ__UMHI
Length: 27min 26sec (1646 seconds)
Published: Tue Oct 12 2021