Star schema or single table in Power BI

Captions
Hello friends, and welcome to a new SQLBI video. In this video I want to answer one important question: is it better to build a regular star schema, or, performance-wise, is it better to denormalize all the columns and build a single-table data model?

The reason why people are interested in this is performance. In a regular star schema you need to pay the price of relationships, whereas if you denormalize everything into a single table you do not have to pay the price of a relationship, so the model might be faster. This is not actually true: in some scenarios it is faster, but in most scenarios a star schema is way better. I wanted to prove that, I wanted to give an authoritative answer, and to do that I created a data model with four billion rows, so that we see numbers that are meaningful and we see a difference between the two data models. I built the model both as a star schema and as a single table.

Let's look at what I did. This is the star schema: as in any star schema, we have the fact table in the center and the dimensions around it. Whenever I run a query on this model, I need to use the relationships. Indeed, I slice by a column in a dimension, but I group values coming from the fact table in the center, and in order to move the filter from the dimension to the fact table the engine needs to use the relationship. Therefore, it has to pay a price that depends on the size of the relationship, on the cardinality, and on a lot of other factors, but there is a price to pay.

Then I created another model where everything is denormalized into one table, so all the columns from the different dimensions are now in this single table. The logic why this model should be faster is that when I slice by one column of the table and I compute values from the same table, I do not have to pay the price of a relationship: everything happens in the same table.

I called these two models the slim model and the fat model. The star schema is the slim model, because Audience contains a smaller number of columns, whereas the data model where everything is in a single table is the fat model, because the table is much wider: it contains a lot of columns and potentially might be faster.

What we will do is first look at the size of the two data models and inspect the result; then we start running queries, and while running queries we do some reasoning to understand why sometimes one model is faster and sometimes the other one is better. At the end, once we have gathered all this information, we draw some conclusions and choose the best one. Let's get started.

Let's start by analyzing the size of the slim and the fat model. Size is important; it is probably the most relevant topic when looking at a Tabular data model, because scanning a small column is much faster than scanning a large column. That is why we analyze the size of the two models first and do some reasoning on top of that.

What I have here is DAX Studio, connected to both the slim model and the fat model. Let's start with the slim model: we look at the VertiPaq metrics, increase the font a bit, and start doing some reasoning. First of all, we have several tables: Audience is the primary fact table, and then we have all the dimensions. You see that the size of the dimensions is not relevant at all; the total space of the data model is used by the fact table, which is around 18 gigabytes. If we expand the columns, you see that some columns are large and some are small.
The Id Time column, for example, is around 6.5 gigabytes, whereas Weight is 2.6 gigabytes, and Age, which only has 96 values, is much smaller. The size somewhat depends on cardinality: Age is small because it does not have a lot of values. But it also depends on the sort order. Id Time has around 1,400 values and uses 6.5 gigabytes, while another column like Id Date, which again has around 1,500 values, is tiny: only 70 kilobytes. The reason for this difference between the two columns is the sort order: the date changes very slowly, whereas the time changes very frequently, so the VertiPaq compression algorithm cannot compress the time column as well as it compresses the date. So this is the regular model, and the number to remember is 18 gigabytes.

Let's look at the large model and its metrics. The first thing you notice is that we only have one table; we no longer have dimensions. But the most important part is the size: it is now 47 gigabytes. 47 gigabytes compared with 18 gigabytes is between two and three times larger. The reason is that the number of large columns has increased by a lot. Before, Id Time was using 6.5 gigabytes, and here it is the same, but look at the other columns: the hour-minute column, one of the attributes of Id Time, again uses 6.5 gigabytes, and so do the 5-minute period, the 10-minute period, and so on. All of these are columns that were previously stored as attributes of the Time dimension. Now they are denormalized in the fact table, so they change very frequently and they use a lot of space.

So the first important thing to notice is that the size of the entire model is much larger with the fat model than with the slim model. There are other considerations: while it is true that you need to pay the price of a relationship whenever you use the slim model, with the fat model you need to scan columns that are larger. If you group by the 5-minute period, you still have to scan 6.5 gigabytes. You do not pay the price of the relationship, but the size of the column at some point starts to weigh in and slow down the single-table model, the fat model.

Now that we have a picture of the size of the two models, let's start running some queries and gather some measurements. The first query groups by the individuals' age range and computes the sum of the audience weight. Individuals is a dimension that contains around 144,000 rows, and the column that joins the two tables is 2.5 gigabytes in size. Age Range contains a very small number of distinct values, because it groups individuals by age range, so it contains six or seven different values. But in order to move the filter from Age Range to the fact table, the engine needs to scan the Id Individual column, which is rather large.

Let's see the performance on the slim model first. We run the query, look at the server timings, and it takes 11 seconds. The first thing to note is the difference between the total execution time and the storage engine CPU time. The total execution time is the time we had to wait for the query to run; the storage engine CPU is the amount of CPU power used to answer the query. We are not going to look at the total execution time, because we are running on a powerful machine with 64 virtual cores; indeed, you see that the degree of parallelism is 53.6, very high. The number we are going to use is the storage engine CPU, which gives a better picture of how much power was used to answer the query. So 11 seconds is the time required to answer this first simple query on the slim model.
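As a rough idea of the shape of this first test, here is a minimal DAX sketch of the two variants. The table and column names (Audience, Individuals, Age Range, Weight) are assumptions based on what is mentioned in the video, not the exact code used in the demo.

    -- Slim model (star schema): the filter on Individuals[Age Range]
    -- reaches Audience through the relationship on the individual key
    EVALUATE
    SUMMARIZECOLUMNS (
        Individuals[Age Range],
        "Total Weight", SUM ( Audience[Weight] )
    )

    -- Fat model (single table): the grouping column and the aggregated
    -- column live in the same table, so no relationship is traversed
    EVALUATE
    SUMMARIZECOLUMNS (
        Audience[Age Range],
        "Total Weight", SUM ( Audience[Weight] )
    )

In the slim version the storage engine scans the Id Individual column to resolve the join; in the fat version it only scans the age range and the weight.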
One last thing: if we look at the xmSQL query, you see that we do have a join between the fact table and the dimension, working on the Id Individual column, so it is clear that we are paying the price of a join to answer this query.

What about the fat model? I run the very same query on the fat model, where this time I group by the audience age range rather than the individuals' age range, so everything happens in a single table. We run the query and it now takes 6.6 seconds, and if we look at the xmSQL query there are no joins. So 6.6 seconds against 11 seconds: working with the fat model is much faster. By not having to traverse the relationship we get a large benefit, because the relationship has a price, and if we can avoid using it the model is, of course, faster. Does it mean that the fat model is always faster? Not really. As we go on with the next queries, you will see that this advantage becomes smaller and smaller.

The second query does the same SUMMARIZECOLUMNS, but this time we group by the 5-minute time period. Time is a very large dimension, because the corresponding column in the fact table is very large, 6.5 gigabytes as we saw earlier. Therefore the price of the relationship is now higher, because the column to scan is larger; but in the fat model this column is large too, so I would expect the difference between the two models to be smaller. Let's run the query on the slim model first. It was 11 seconds with the previous query; now it is around 28 and a half seconds, and the join now happens on Time. So around 28-29 seconds is the time needed on the slim model. What about the fat model? Now the fat model needs to scan a very large column in the fact table, and it takes 25 seconds. Let's run it again just to be sure: yes, 25 seconds. So 28 against 25; the difference is no longer so relevant. It was more important earlier, because Id Individual denormalized in the fact table resulted in a smaller column, whereas the 5-minute period is very large, and being very large it takes time to scan. Even though we do not pay the price of the relationship, we need to scan large columns. As soon as your columns start to grow, the benefit of using the fat model is reduced.

The third query, let me gain some space, groups by different columns of the same dimension: in the slim model we group by columns of Time, and in the fat model we group by the corresponding columns of Audience, so it is the same query using different tables. Here we have one single join with the fact table, and we use different columns from the dimension. Let's look at the time required to run this query; I honestly do not remember the numbers, so let's find out. You see that the numbers start to grow: 65 seconds to answer the query. One second was the wait time, but 65 seconds is the total CPU time used, and in the xmSQL query we have one join and the grouping by the different columns. What about the fat model? Let's run it here. Now we have to scan multiple large columns, and it runs in 68 seconds. As usual, when the numbers are not as expected I run the query multiple times: 56 seconds is a better picture. Let's run it a third time: yes, 56; it had to wake up the CPU at the beginning. So 56 seconds against 65 seconds.
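Again as a hedged sketch with assumed names, the second and third tests likely have roughly this shape: on the slim model the grouping columns come from the Time dimension, on the fat model they are the denormalized copies inside Audience.

    -- Query 2, slim model: one relationship, but on a very large column
    EVALUATE
    SUMMARIZECOLUMNS (
        'Time'[Period 5 Min],
        "Total Weight", SUM ( Audience[Weight] )
    )

    -- Query 3, fat model: no relationship, but several large
    -- denormalized columns must be scanned in the fact table
    EVALUATE
    SUMMARIZECOLUMNS (
        Audience[Hour],
        Audience[Minute],
        Audience[Period 5 Min],
        "Total Weight", SUM ( Audience[Weight] )
    )

The exact columns used for the multi-column grouping are not spelled out in the captions; Hour and Minute here are placeholders for whatever time attributes were actually chosen.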
It is still faster to scan the fat table, but the numbers are getting closer and closer. They get closer as the size of the columns that need to be scanned increases. By the way, I was not entirely expecting this result: in this scenario the slim model only has one large column to scan, the relationship column, which is scanned once, while all the other grouping happens on the small dimension; the fat model instead has to scan three large columns. Anyway, it is what it is: the difference is starting to shrink.

So far we have only used a SUM, a simple calculation. Let's try a heavier calculation, like a distinct count. What happens if I group by the 5-minute time period and compute a distinct count of Id Individual? Individuals is a slowly changing dimension, so we have the option of running two kinds of queries: one that computes the distinct count of the surrogate key, Id Individual, and one that computes the distinct count of the natural key, Code Individual, which in the slim model is stored in the dimension, whereas in the fat model it is already stored in the fact table. Again, we will see a difference between the two, because in one case we have to pay for multiple relationships and in the other case we pay for no relationships at all.

Let's start with the distinct count of the surrogate key, grouped by the time period. We run it once, and you see that it starts to slow down; it is no longer as fast as before. Now it takes something like 498,000 milliseconds of storage engine CPU, close to 500 seconds, to do the grouping by the 5-minute period while paying the price of one join. The numbers start to grow, and the time is still all storage engine. What about the fat model? We have the same distinct count, we run it on the fat model, and we evaluate the difference. I honestly do not remember what the result will be, but because a distinct count is heavier than a sum I expect the numbers to be different, and they end up very close: the slim model takes 498 and the fat model 469. They are nearly the same. The reason is that paying the price of a relationship matters, but as soon as your calculations get heavier, that difference is no longer so relevant. So the distance between the slim model and the fat model is further reduced.

What if we compute the distinct count of the natural key? To do that on the slim model we need to traverse a relationship, because the column I want to count, Code Individual, is in the dimension: I need to summarize the fact table by the dimension column and then count the rows, and finally I group by the 5-minute period. In this query I no longer pay the price of only one relationship: I pay one relationship to do the grouping, plus another relationship because I want to count values that are in a different dimension. It is actually faster: you see 242,000 milliseconds. The reason is that the number of distinct values of the code is smaller than the number of distinct values of the surrogate key, because it is a slowly changing dimension, so we have fewer codes than surrogate keys, even though we pay the price of two relationships, one with Time and one with Individuals. So this is the value on the slim model. What about the fat model? On the fat model everything happens in the same table, so we no longer count the rows of a SUMMARIZE: it is just a distinct count of Code Individual, and the fat model is again faster, not by much, but still faster: 200 seconds against 242 seconds.
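Here is a sketch of the two distinct count patterns just described, again with assumed names. The surrogate key count is a plain DISTINCTCOUNT on both models; the natural key count on the star schema is expressed by summarizing the fact table by the dimension column and counting the rows, which is the shape the optimizer later turns into a distinct count over the relationships.

    -- Distinct count of the surrogate key (same shape on both models)
    EVALUATE
    SUMMARIZECOLUMNS (
        'Time'[Period 5 Min],
        "Individuals", DISTINCTCOUNT ( Audience[Id Individual] )
    )

    -- Distinct count of the natural key on the slim model:
    -- two relationships are used, one to group by Time and one
    -- to reach Individuals[Code Individual]
    EVALUATE
    SUMMARIZECOLUMNS (
        'Time'[Period 5 Min],
        "Individuals",
            COUNTROWS ( SUMMARIZE ( Audience, Individuals[Code Individual] ) )
    )

    -- Distinct count of the natural key on the fat model:
    -- the code is already denormalized in the fact table
    EVALUATE
    SUMMARIZECOLUMNS (
        Audience[Period 5 Min],
        "Individuals", DISTINCTCOUNT ( Audience[Code Individual] )
    )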
So for this kind of query the fat model is almost always faster; how much faster depends on the complexity of the query you want to run. But not all queries are the same. Let's try another distinct count on the slim model, but this time we do not group by one column: we group by multiple columns. I group by columns of Date and Time, and then I compute the distinct count of Code Individual, so now three different relationships are involved. Which one will be faster? Let's test it.

We run the first query on the slim model. What I expect is three relationships being involved: one with Date, one with Time to do the grouping, and one with Individuals for the distinct count. It is very slow: it takes 24 seconds of wall-clock time, and now the numbers really start to grow, around 1.2 million milliseconds of storage engine CPU, because we had to use multiple relationships. We have the distinct count of individuals and then the three LEFT OUTER JOINs. It is interesting to see that even though we asked for a COUNTROWS of a SUMMARIZE, the optimizer actually computed a distinct count using the three relationships, so the optimizer did quite a good job of understanding our intention. Three relationships, and this is the time needed to run the query on the slim model.

What about the fat model? We already have the query here; we run it, knowing it will take some time. This time the query no longer needs to traverse relationships, but it needs to scan the year, the month number, and the 5-minute period. Year and month number are not very large, the 5-minute period is large, and the engine needs to match up multiple columns: even though the columns are from the same table, it still has to pay the price of merging the partial results. You see that now the numbers are nearly the same: 1,161 seconds of storage engine CPU for the fat model against 1,191 for the slim model. The difference is really tiny. The more complex the query becomes, the smaller the difference becomes. So while it is true that the fat model is faster with very simple queries, as soon as your queries become a bit more complex, the distance between the fat model and the slim model is greatly reduced. What looks like a real gain at the beginning gets thinner and thinner as the queries grow in complexity. Again, we have no relationships involved here, and it still takes a lot of time.

What happens if we write an even more complex query? So far we have only used the storage engine; we never really used the formula engine, because we did not create heavy calculations, just simple sums and distinct counts. What if we compute a year-to-date? It is a simple calculation, but it requires the formula engine. The year-to-date is very simple to express in DAX on the slim model, whereas it is much more complex to express on the fat model, because on the slim model we can leverage the DATESYTD function, which we cannot use on the fat model: DATESYTD, like all time intelligence calculations, requires a properly designed date table. Nevertheless, because I wanted to compare the performance of the two, I used the very same code on both the slim and the fat model, so we have a comparison of speed and not a comparison of how simple or hard the DAX is to write on the two models. Let's look at the YTD on the slim model; we need a bit more space to look at the code.
To compute the YTD we group by year, by month number, and by the 5-minute period, so we have three different columns in the group-by, and then we compute the calculation by taking the current year and the current date. The year start is the beginning of the year: it is a date built from the current year and the 1st of January; the current date is the last visible date. We take the sum of the weight for all the dates after the 1st of January and before the current date, and ALL is used to remove any existing filter. So this is computing just a plain YTD.

Let's run it. Now we have both storage engine and formula engine at work at the same time, and you see that we no longer have a single xmSQL storage engine query: we have multiple queries, because the engine first needs to gather the individual values day by day and then it computes the result in the formula engine. You see that the formula engine starts to weigh in and eat part of the CPU usage. The main storage engine query runs in 60 seconds of storage engine CPU: it retrieves the sum of the weight grouped by date and period, using a join with Time and a join with Date. In the WHERE condition we have a condition on the date, because we first retrieve all the values at the day level, and then the formula engine computes the year-to-date by summing the individual values. The important note here is that the WHERE condition only contains the date: 1,796 different values is the size of the filter used to move the filter from Date to the fact table.

Now the same query on the fat model; it is the very same query, we just highlight it and run it. The problem of the fat model in this scenario is that the engine can no longer rely on specific optimizations that exist for star schemas: because all the columns come from the same table, it needs to use arbitrarily shaped sets to do the calculation. Let's first look at the total time: 140,000 milliseconds compared with 60,000 milliseconds, more than twice the amount of CPU required. The reason for the slower speed is that the first queries are nearly the same, but the query that retrieves the values day by day is no longer as simple as it was in the slim model. It is still grouping by date and period and computing the sum of the weight, but look at the filter: the filter is no longer only on the date, it is on the date and on the 5-minute period, and it contains 457,000 different values. The same filter on the slim model was using only around 1,700 distinct values and was working only on the date. The reason is that, because I am grouping by columns of the same table, auto-exist kicks in and the query becomes much more expensive. So not only are the benefits of the fat model reduced as the complexity of the query increases; you reach a point, where the code you write is no longer trivial, at which the fat model becomes much slower than the slim model. For very simple queries the fat model is faster, but as soon as the queries grow in complexity the fat model becomes much slower, really slower: here it required more than twice the CPU of the slim model.
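The year-to-date measure described above might look roughly like the following. This is a sketch under assumed names, written without DATESYTD so that the same pattern can be expressed on both models; on the fat model the date and period columns would come from Audience instead of the Date and Time dimensions.

    DEFINE
        MEASURE Audience[YTD Weight] =
            VAR CurrentYear = SELECTEDVALUE ( 'Date'[Year] )
            VAR CurrentDate = MAX ( 'Date'[Date] )        -- last visible date
            VAR YearStart   = DATE ( CurrentYear, 1, 1 )  -- January 1st of the current year
            RETURN
                CALCULATE (
                    SUM ( Audience[Weight] ),
                    ALL ( 'Date' ),                       -- remove existing date filters
                    'Date'[Date] >= YearStart,
                    'Date'[Date] <= CurrentDate
                )
    EVALUATE
    SUMMARIZECOLUMNS (
        'Date'[Year],
        'Date'[Month Number],
        'Time'[Period 5 Min],
        "YTD", [YTD Weight]
    )

On the slim model the filter produced by this pattern only involves the Date table, which is why the xmSQL WHERE condition stays small; on the fat model the same grouping triggers auto-exist across the denormalized columns, producing the much larger arbitrarily shaped filter discussed above.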
There is another scenario that I want to look at with you before leaving the topic, and that is very simple queries. There are a lot of queries that the engine needs to run just because you create slicers in a report. Imagine that you create a slicer by period, by year, by month, by whatever column you want to slice by. In order to fill the values of that slicer, the engine needs to run a very simple query: just the values of that column. But here is the difference: this query, when executed on the slim model, scans the dimension, whereas the same query executed on the fat model needs to scan the fact table, because that is the only table available. So to gather only a handful of values, the distinct values of the year for example, the fat model has to scan 4 billion rows, whereas the slim model scans a few thousand rows. There the difference is going to be quite important.

Let's look at that. We run the next query on the slim model first: it just computes the values of the 5-minute time period. If I run it, I expect it to be super fast, and indeed it takes 1 millisecond, because the query runs on the Time dimension, retrieving the 5-minute period from Time. What about the fat model? If we run the very same simple query on the fat model, it takes 111 milliseconds of elapsed time but, and here is the important number, 5 seconds of CPU time, because in order to retrieve the values of the 5-minute period the engine had to scan the Audience table: 4 billion rows.

And it is not that simple: not only is it already very slow, you also need to pay attention to cross-filtering. If you put just one slicer, that is easy; but what if you put multiple slicers in the same report? That happens all the time. If you put multiple slicers coming from different dimensions, in the slim model they do not cross-filter each other; but if you put multiple slicers on multiple columns of the fat model, they do cross-filter each other, because they belong to the same table. So the next query we look at is again VALUES, but this time filtered by another column. Let's look at the difference. The filtered VALUES on the slim model computes the values of the time period where the audience age is greater than zero; the filter on the age does not reach the Time table, and if I run the query it takes only 3 milliseconds, because it still selects the values from the Time dimension. The same query executed on the fat model takes a bit longer: indeed, 20 seconds. It takes 20 seconds of CPU time to run a simple query that retrieves the values of a column.

Now the difference is really huge. For queries that group and aggregate values, where relationships are involved, the fat model is faster, but not that much faster: as soon as the query increases in complexity, the benefit is reduced further and further, and when you reach a certain level of complexity the fat model starts to be slower than the slim model. But for very simple queries that only retrieve the values of columns, this is where the difference is really huge: the fat model is much, much slower than the slim model.
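The slicer-style queries discussed here are roughly of the following form (assumed names). On the slim model VALUES runs against the small Time dimension; on the fat model the equivalent expression has to scan the 4-billion-row Audience table, and a filter on another column of the same table also cross-filters the result.

    -- Unfiltered slicer query: the list of values shown in the slicer
    EVALUATE
    VALUES ( 'Time'[Period 5 Min] )

    -- Slicer query with another slicer already applied: on the slim model
    -- the filter on Audience[Age] does not reach the Time dimension;
    -- on the fat model the corresponding filter cross-filters the column
    EVALUATE
    CALCULATETABLE (
        VALUES ( 'Time'[Period 5 Min] ),
        Audience[Age] > 0
    )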
Now let's draw some conclusions by looking at all the numbers together. I have synthesized all the values here: these are all the queries that we executed, from 1 to 10, with the storage engine CPU time on the slim and on the fat model. What is important to note is the pattern. For the first queries the fat model was actually faster than the slim model, because the price of the relationship was rather important; but as soon as the queries start to increase in complexity, the difference is no longer so relevant. The more complex the code to run, the smaller the difference becomes, and if you look at the group-by over different dimensions, like these queries, the difference is tiny. The fat model starts to slow down as soon as you have code that is just not trivial. With very simple code that only uses the storage engine, the fat model is likely to be faster; but as soon as the formula engine weighs in, the fat model is much slower. And for the very simple queries that only retrieve values, look at the difference: you go from 1 millisecond to 5 seconds, and from 3 milliseconds to 20 seconds.

It is also important to look at the numbers together. It is true that a query can be faster on the fat model, 6.6 seconds against 11 seconds on the slim model, but as soon as you add a slicer to your report you have to pay the price not only of the query but also of the slicer, and if you sum those values together you get a much larger number than the same calculation on the slim model.

All this leads to a very simple conclusion: no matter what, the star schema still wins, and it wins by a lot. There is actually no real reason to merge everything into a single fat table. By doing that you get the feeling that some queries are faster, but you obtain a model that is almost three times as large and that is actually slower than the star schema for most queries. If you are ever in doubt whether to go for a star schema or for a single fat table, take the time to measure it, take the time to check the numbers. What you are likely to find is that the star schema always wins. Enjoy DAX!
Info
Channel: SQLBI
Views: 17,708
Id: qEWrYO1ioe0
Length: 35min 54sec (2154 seconds)
Published: Tue Apr 13 2021