- Yooo! What's up? This is Patrick from Guy in a Cube. And in this video, I'm going to talk about what is a star schema and why it's so important
to have a star schema when you're using Power BI. Stay tuned. (upbeat electronic music) If you're finding this
for the very first time, be sure to hit that subscribe
button to say up to date from all the videos from
both Adam and myself. All right, so star schemas,
what is a star schema and why, why do I need a
star schema with Power BI? All right, so first, a star schema is a specialized data model that you typically see when you designing a relational data warehouse. There's a Kimball method
that we often refer to. Kimball's got a great book on
the data warehouse tool kit. There's also another book
called "Star Schema". They're great if you want
to really dig into this. Within your data warehouse,
there's two types of tables that's classified as a
dimension and a fact. The dimension is a table. It usually defines your business entities like product, and employee,
and customer, and geography. And within that dimension,
you have attributes. You know, like for an employee, you may have their start date, their department they
work in, their first name, their last name, yada, yada, yada. But, there's two special
columns on that dimension, one called the surrogate key and one called an alternate key. The surrogate key is usually
the primary key that's defined within the data warehouse. The alternate key is the primary
key from the source system. The dimension is typically
used for grouping or filtering data when
you're doing your analytics against your data warehouse. And then you have the fact table. The fact table, it's used to track events, or transactions, or enrollments, or sales. It's usually the table that
contains the stuff you want to add up, the additive values. You want to sum them,
you want to average them, you want to add them
together and average them. It is related to each
one of your dimensions using that surrogate key
that I've talked about. And so, the fact usually
sits in the middle and then each dimension is along the side like points on a star. And if you connect the dots, right, you get a star that's created
from that database structure. So, let me show you
what I'm talking about. If you take a look at this diagram layout, I have my fact table in the middle and then I have my dimensions. So, you can see these are
the points of the star. So, you can go, boom, boom, boom. I'm not that good at drawing. But you can see, here's my fact table and here's the points of the dimension. A lot of times when you're
using, like, in memory tools or doing analysis, we used to
call it a decision support, that's back in the day,
but when you're doing that, you'd have a data warehouse
because it's designed in the way for you to do quick
analysis, and aggregations, and grouping against
these additive values. So, the aggregation, and grouping, and filtering is done by the dimensions, and the additive values are
obtained from the facts. Okay? And so that's what a star schema is. If you want to get more details and more information about it, there's a doc on the Microsoft
page that provides you some guidance on understanding
what a star schema is and why it's useful for Power BI. You guys know I have this infatuation with really nice data models. And there's four reasons, four reasons why I think
you should use a star schema when you're designing
your Power BI data models. Okay? The first reason is usability. Let me show you what I'm talking about. So, let's say instead
of having a star schema you just have one big
flat file that contains all the data, right? Usually that's what I see. If I wanted to build a report
with my one large fact table, I'd have to go and search
throughout this looking for everything that I need. If I want to find the store information, if I want to find promotion information, or product information, I got to kind of filter through this. Where if I had a nice, clean star schema, check this out, look what the
field list looks like compared to my wide flat table. Everything's kind of
grouped, so I know I go here for my calendar type of information, for currency information, product. If I'm sharing this out with other people in my organization that's
gonna build reports, it's cleaner and it's
easier to use, and identify, and find stuff to build my reports as opposed to going through
that long list of information. Okay? So that's number one, usability. Number two is, I see this a lot, if you have a nice clean star schema, your DAX is most of the
times easier to write and it's simpler than the DAX going against a wide flat table. You don't believe me? Let me show you. Let's take this example
here, back to my flat file. I want to create a measure that gives me sales amount year-to-date. Let's take a look and see
what the DAX looks like for that measure. So, you can see right here
that I've created a variable and then I'm doing,
you know, some summing. This could be a measure right here. And then, I have to write my
own filter to do year-to-date. Whereas if I have a
nice dimensional model, I have my calendar as
one of my dimensions, I would mark it as a date table. And then I can write a nice,
clean measure right here that just uses the date time
functions that are built into DAX. And it just works. Significantly fewer
lines of code to write, and it just is easier to read
and faster to write, okay? So number two, simpler DAX,
possibly simpler DAX, all right? Number three is performance. Now I will say this, the star schema is perfect
when you're scaling, to ensure scale. So, when you're working
with a small volume a day, the 100,000 rows, even a
couple of million rows, you may test this out and go,
"Well, mine's just as fast." But when you start to get into millions, hundreds of millions, and
billions of rows, you know, and you want to scale this
out, ensure this scales, you should definitely
look at a star schema because it will scale. Let me show you what I'm talking about from a performance perspective. Here's my report built
with the star schema, exact same report. And then here's the one that's
built against the flat file. And what I've done is
I've opened up DAX Studio. This is the query that the one from the flat file will generate. And you'll notice that it's really similar and you can see what
the timings are, right? 29 milliseconds. It's some time spent on
the storage engine CPU. If I go to the one that's actually against my star schema, you can see that it's
significantly faster. And I know you guys say,
"Well, milliseconds." But this is just 7 million rows. But if you're working with
larger volumes of rows, you should see more numbers. You can see, I didn't spend any time
on the storage engine CPU. Just by using a star schema, I improved performance
several times over, right? So we had three, which was performance. Number four is faster refreshes. And I have to be honest with you guys, I didn't even think about this until I actually started
doing the load of the data. So, both of my Power BI
desktop files are pulling from a SQL server database. And so I hit close and
apply on both of them around the same time. And the one that was a star schema actually loaded significantly faster than the one that was a flat file. Phil Seamark has a really,
really good blog post on his DAX.Tips page. And he was actually on our live
screen this weekend talking about it, and he shows you how to visualize your Power BI refresh. And so I just followed the
steps in his blog post. And what I found out was,
if you take a look at this for the flat file, there was
the auto-time intelligence table that Power BI builds
'cause I need those, 'cause I'm using those because
I don't have a date table. And then, there's just
this one flat table. And you can see that it
took Power BI 34 minutes and 26 seconds to process that table. It took a total of 34
minutes and 32 seconds to refresh the entire data set. And I was like, holy smokes. And then, what's really
significant here is that it only processed 2,069 rows per second. Okay? What do you think
the star schema did? Let's take a look. So on the star schema,
it only took 4 minutes and 41 seconds to load the
exact same number of rows, 7.8 million. If you flash back over here, you'll see this one loaded 7.8 million. But, you can see it processed
28,000 rows per second. If those aren't four good reasons why you should use the star
schema, I don't know what is. Again, you may have smaller sets of data and it works just fine. But if you want your model to scale, you should absolutely
consider using a star schema. All right, what do you guys think? You got any questions, comments? Are you using star schemas
today? Do you hate star schemas? I'd love to know. Let's
continue the conversation. Where? In the comments below. This your first time visiting
the Guy in a Cube channel, hit that subscribe button. If you liked my video,
give me a big thumbs up. As always, from Adam and
myself, thanks for watching. We'll see you in the next video.