Why Power BI loves a Star Schema

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

- Yooo! What's up? This is Patrick from Guy in a Cube. And in this video, I'm going to talk about what is a star schema and why it's so important to have a star schema when you're using Power BI. Stay tuned. (upbeat electronic music) If you're finding this for the very first time, be sure to hit that subscribe button to say up to date from all the videos from both Adam and myself. All right, so star schemas, what is a star schema and why, why do I need a star schema with Power BI? All right, so first, a star schema is a specialized data model that you typically see when you designing a relational data warehouse. There's a Kimball method that we often refer to. Kimball's got a great book on the data warehouse tool kit. There's also another book called "Star Schema". They're great if you want to really dig into this. Within your data warehouse, there's two types of tables that's classified as a dimension and a fact. The dimension is a table. It usually defines your business entities like product, and employee, and customer, and geography. And within that dimension, you have attributes. You know, like for an employee, you may have their start date, their department they work in, their first name, their last name, yada, yada, yada. But, there's two special columns on that dimension, one called the surrogate key and one called an alternate key. The surrogate key is usually the primary key that's defined within the data warehouse. The alternate key is the primary key from the source system. The dimension is typically used for grouping or filtering data when you're doing your analytics against your data warehouse. And then you have the fact table. The fact table, it's used to track events, or transactions, or enrollments, or sales. It's usually the table that contains the stuff you want to add up, the additive values. You want to sum them, you want to average them, you want to add them together and average them. It is related to each one of your dimensions using that surrogate key that I've talked about. And so, the fact usually sits in the middle and then each dimension is along the side like points on a star. And if you connect the dots, right, you get a star that's created from that database structure. So, let me show you what I'm talking about. If you take a look at this diagram layout, I have my fact table in the middle and then I have my dimensions. So, you can see these are the points of the star. So, you can go, boom, boom, boom. I'm not that good at drawing. But you can see, here's my fact table and here's the points of the dimension. A lot of times when you're using, like, in memory tools or doing analysis, we used to call it a decision support, that's back in the day, but when you're doing that, you'd have a data warehouse because it's designed in the way for you to do quick analysis, and aggregations, and grouping against these additive values. So, the aggregation, and grouping, and filtering is done by the dimensions, and the additive values are obtained from the facts. Okay? And so that's what a star schema is. If you want to get more details and more information about it, there's a doc on the Microsoft page that provides you some guidance on understanding what a star schema is and why it's useful for Power BI. You guys know I have this infatuation with really nice data models. And there's four reasons, four reasons why I think you should use a star schema when you're designing your Power BI data models. Okay? The first reason is usability. Let me show you what I'm talking about. So, let's say instead of having a star schema you just have one big flat file that contains all the data, right? Usually that's what I see. If I wanted to build a report with my one large fact table, I'd have to go and search throughout this looking for everything that I need. If I want to find the store information, if I want to find promotion information, or product information, I got to kind of filter through this. Where if I had a nice, clean star schema, check this out, look what the field list looks like compared to my wide flat table. Everything's kind of grouped, so I know I go here for my calendar type of information, for currency information, product. If I'm sharing this out with other people in my organization that's gonna build reports, it's cleaner and it's easier to use, and identify, and find stuff to build my reports as opposed to going through that long list of information. Okay? So that's number one, usability. Number two is, I see this a lot, if you have a nice clean star schema, your DAX is most of the times easier to write and it's simpler than the DAX going against a wide flat table. You don't believe me? Let me show you. Let's take this example here, back to my flat file. I want to create a measure that gives me sales amount year-to-date. Let's take a look and see what the DAX looks like for that measure. So, you can see right here that I've created a variable and then I'm doing, you know, some summing. This could be a measure right here. And then, I have to write my own filter to do year-to-date. Whereas if I have a nice dimensional model, I have my calendar as one of my dimensions, I would mark it as a date table. And then I can write a nice, clean measure right here that just uses the date time functions that are built into DAX. And it just works. Significantly fewer lines of code to write, and it just is easier to read and faster to write, okay? So number two, simpler DAX, possibly simpler DAX, all right? Number three is performance. Now I will say this, the star schema is perfect when you're scaling, to ensure scale. So, when you're working with a small volume a day, the 100,000 rows, even a couple of million rows, you may test this out and go, "Well, mine's just as fast." But when you start to get into millions, hundreds of millions, and billions of rows, you know, and you want to scale this out, ensure this scales, you should definitely look at a star schema because it will scale. Let me show you what I'm talking about from a performance perspective. Here's my report built with the star schema, exact same report. And then here's the one that's built against the flat file. And what I've done is I've opened up DAX Studio. This is the query that the one from the flat file will generate. And you'll notice that it's really similar and you can see what the timings are, right? 29 milliseconds. It's some time spent on the storage engine CPU. If I go to the one that's actually against my star schema, you can see that it's significantly faster. And I know you guys say, "Well, milliseconds." But this is just 7 million rows. But if you're working with larger volumes of rows, you should see more numbers. You can see, I didn't spend any time on the storage engine CPU. Just by using a star schema, I improved performance several times over, right? So we had three, which was performance. Number four is faster refreshes. And I have to be honest with you guys, I didn't even think about this until I actually started doing the load of the data. So, both of my Power BI desktop files are pulling from a SQL server database. And so I hit close and apply on both of them around the same time. And the one that was a star schema actually loaded significantly faster than the one that was a flat file. Phil Seamark has a really, really good blog post on his DAX.Tips page. And he was actually on our live screen this weekend talking about it, and he shows you how to visualize your Power BI refresh. And so I just followed the steps in his blog post. And what I found out was, if you take a look at this for the flat file, there was the auto-time intelligence table that Power BI builds 'cause I need those, 'cause I'm using those because I don't have a date table. And then, there's just this one flat table. And you can see that it took Power BI 34 minutes and 26 seconds to process that table. It took a total of 34 minutes and 32 seconds to refresh the entire data set. And I was like, holy smokes. And then, what's really significant here is that it only processed 2,069 rows per second. Okay? What do you think the star schema did? Let's take a look. So on the star schema, it only took 4 minutes and 41 seconds to load the exact same number of rows, 7.8 million. If you flash back over here, you'll see this one loaded 7.8 million. But, you can see it processed 28,000 rows per second. If those aren't four good reasons why you should use the star schema, I don't know what is. Again, you may have smaller sets of data and it works just fine. But if you want your model to scale, you should absolutely consider using a star schema. All right, what do you guys think? You got any questions, comments? Are you using star schemas today? Do you hate star schemas? I'd love to know. Let's continue the conversation. Where? In the comments below. This your first time visiting the Guy in a Cube channel, hit that subscribe button. If you liked my video, give me a big thumbs up. As always, from Adam and myself, thanks for watching. We'll see you in the next video.

Info

Channel: Guy in a Cube

Views: 47,186

Rating: undefined out of 5

Keywords: power bi star schema, power bi star schema vs flat table, snowflake schema, data analytics, data model, data modeling, data modelling training, data warehouse, data warehousing tutorial, datamart, datawarehouse, dimension table, dimensions, fact, foreign key, microsoft, power bi, power bi data modeling, power bi reports, power bi tutorial

Id: vZndrBBPiQc

Channel Id: undefined

Length: 8min 9sec (489 seconds)

Published: Wed Feb 24 2021