[MUSIC PLAYING] Welcome back to
MSPTDA number 02, Microsoft Power Tools
for Data Analysis. Now, before we can talk
about the tools like Power Query and Power Pivot, we've
got to get busy defining data analysis and business
intelligence terms. Here are some of the terms we're
going to learn in this video, like what is granularity
and why is it important for data analysis? Now, this video-- video
number 02, like last video, is mostly an informative video. We're not going to be
using any of the tools. But for all the
videos in this class, you can download the associated
files below the video. Both for this video
and last video, there are comprehensive
PDF notes. So if you like to
read in written form what you see in this video or
download this Excel file which has all the same information
that's over in the PDFs, click on those links
below the video. Now, we're going to start off
by going to the sheet DA-BI. And we have to define
the term data analysis. Now, in all the
prerequisite classes, this is the definition. Convert raw data into
useful information for decision makers. Now, that is what we're going
to be doing in this class. But since we're in
a business context, we're going to use the term
business intelligence, which if we change a few of the
words in the definition, we get convert raw data into
useful, actionable, refreshable information for decision
makers in a business situation. Now, the term
business intelligence has been around for a
while, and it's often associated with data modeling,
like this star schema we're going to use in this class. But really, these
are synonyms for us. Another word we'll be
using often is query. Now, query just means
to ask a question. Those of us from the Excel side
might not use this word as much as people from the
database or SQL site. But we're always asking
questions of the raw data. And guess what? For those of us in Excel, we
don't think of it this way. But when we change the
criteria or filter in a slicer, or drag and drop a field
into a pivot table, we are changing the
question or the query. Down here in this pivot
table, those of us from Excel, we say, well, we're going
to select the criteria or apply this filter April. But really when we
select April or June, we're changing the question
we're asking of the raw data, and thus changing the query. So query, we'll get
to use that a lot. Now, the goal of data analysis
and business intelligence, create useful, updatable,
actionable information for decision makers, whether
we're going from text in Excel files using Power Query to
get our useful information, various files to create a
Power BI Desktop dashboard, or connecting to
an SQL database, creating our star schema, and
then creating some reports. Now on the sheet Raw
Data, we all know this. No, this is not good raw data. Yes, this is good raw data. We cannot analyze easily
based on zip code. Here we can. Raw data is always the data
stored in its smallest form. And of course, we'll do things
like get files like this, and convert it to a proper
data set using Power Query. Now, Proper Data Set, let's
go over to that sheet. Proper data set, we
all probably know this. But let's make sure
we're on the same page. Field names in the first row,
records in subsequent rows. Now, there are lots
of synonyms for this. Field, column, headers. Those of us from the database
side would call these fields. Those of you that use
pivot tables all the time would call it field, because
in pivot table field list, it calls them fields. Now, in a lot of
the power tools, they use the word column. So fields or columns,
they are synonyms. Also headers, that's
another common synonym. Now, a field is very important. It defines what sort of
data goes in this column. Date goes in this column,
units in this column, sales in this column. Also, something
that we Excel people don't usually think about-- data type. What type of data? Is it a date, a whole number,
a decimal number, or currency? Now, in our power
tools, we'll have to define most of the
time using Power Query that a decimal or currency is
going to be in this column, a date is in this column. Also, field names, we want
to make sure and always name them using two criteria-- name the field smartly
and consistently. And we'll do that in this class. Now, sales are key. This is going to be--
and we'll talk about this later-- a foreign key that
connects to a primary key. But this one right here
is a product foreign key. So if I'm going to use
the word Key to define all of the foreign
keys or the primary key on the other side in the
lookup or dimension table, that's not a good name. I want to make
sure, Product Key. Now, naming things smartly is
one of those things that most humans on the planet Earth
that use computers are required to do, but most people don't. In this class, we'll be
vigilant about naming everything-- not
just the fields, but also the tables
and other objects. Now, this is an example of
a proper data set in Excel. It's very important in Excel
that you have empty cells all the way around your table. In other situations--
a database, Power BI, text files-- that's not important. But that is important in Excel. We'll have an example of
that in just a moment. We can also have proper
data sets in text files. Notice the field names
are in the first row, records in subsequent rows. This is called a
tab-delimited file, because the columns
and cells for each row are separated by a tab. Now, our tools can
interpret this correctly. For example, when we
import it into the data model in either PowerPivot
or Power BI Desktop using Power Query, Power
Query will have no problem interpreting this correctly. Here's another example
of proper data sets. Here they're listed
in a data model. But notice the names. This is a fact
table, which we'll define later in this video. But we're always going to
preface the name for our fact table with a little f. Same with our dimension
tables or lookup tables. We'll preface it
with a little d. Everything's going
to be named smartly and consistently, whether
they are field names or they're measures. Those are our
formulas we're going to use in our reports and
dashboards or our table names. Now let's go over to the
next sheet, Excel Table. If your data is coming
from an Excel spreadsheet and you're actually
going to use it in one of the tools
in this workbook, then you have to convert
it to an Excel table. Now, you can have data like this
on a sheet in an external Excel workbook and use Power
Query to import it. But when you have your
data here in Excel and you want to use it
in this current workbook, you've got to convert
it to an Excel table. How do we do that? Empty cells all the way around. Click in one cell in
the proper data set. Insert Table, or we'll just
use the keyboard Control-T. Now notice, that's why
we have to have empty cells all the way around. So I'm going to click
Escape, over to the side. Control-T and Enter. Why do we have to
use Excel tables? Because if we add new records
below this table or delete records, anything pointing to
this table object will update. Now in Excel, it doesn't matter
if it's a formula, chart, or pivot table. But Power Query, Power Pivot,
when Microsoft invented these, they said we want
these solutions to always automatically update
when source data changes. So that's why we're
always required to convert data in the current
workbook to an Excel table before we can import
it to our power tools. Now, we made a mistake here,
because we created a table, but we didn't name it. To name it, simply go up
to Table Tools, Design, and over to Properties. There it is. If you do this all the time,
we want to use the keyboard-- Alt-J-T-A. I do
Control-T, Enter, Alt-J-T-A when I create a table. Now we're going to
name this smartly. d, and then SaleRep. And I left out the s, because
I already had a SalesRep. But when I hit Enter, now
we have named the table. Now, the next term, we want
to talk about is grain. Now grain or granularity
is simply the size or level of the detail. In this sales table, we
have two different products sold at these two different
amounts for invoice 27002. Over here, we have
a single line that represents that same invoice. These two amounts add up
or have been aggregated to this total amount. The grain or granularity
of the sales table is at the product or
invoice line level. The grain or granularity of this
table is at the invoice level. Now, notice there
are two lines here. And this is considered
more granular. And the smaller size means
these numbers are smaller. Over here, they've been
aggregated to become bigger. For our product level
or invoice line level, the grain is smaller here. Over here, the grain is bigger. When we have less
granularity, that means there's more
aggregation in the number. Now, there's another
important way we want to think of granularity. Now in our data
models, we'll have to have something
called a date table. And we'll talk about
why we have to have it, and why we can't use
automatic grouping. But notice if we have the
fields Date, Month, and Year, we can drag these fields into
any pivot table or report. And guess what? For Sales, we'll get
total sales by the day. If we drag Month, by the month. The Year, of
course, will give us the total revenue for the year. So the granularity is increasing
as we go from day to month to year. So when we drop criteria into
a report or filter our reports, we have to think about
granularity also. Remember, it's the size
or level of the detail. Now, granularity
will be important when designing the data model,
the tables with our numbers, and when creating formulas both
in Power Query and the data model. Now, we'll come back and talk
much more about granularity in just a moment. But now I want to go over to
the sheet P and F. Primary key, foreign key. If you are from
the database side or you took the
pre-req classes, you know exactly what
these terms mean. And they will be important
for all of our tools. Here's a lookup or
dimension table. A primary key is simply
a unique identifier that prevents duplicate rows. We have exactly one row
for the product Fast Catch. So over in our
Sales table, if we need to look up, for
example, the price, we're not going to
make any mistakes. Because there's exactly
one 3 in the primary key, we're going to get
the right price. Now, primary key, there
are no duplicates. With the foreign key, there
can be many duplicates. 3 is listed multiple
times, because of course we can sell Fast Catch many times. Primary key, a unique list. Foreign key, many duplicates. Now, primary and foreign
keys will be important. Well, VLOOKUP in Excel-- if you're from the
Excel side, you know that you have to lookup a
particular product many times in a sales table to go over to
a unique list, find a match, and then go and get
a particular price. Relationships in the data model. We will actually
build a relationship between the primary key
and the foreign key. And that relationship will
be called a one-to-many. We have a unique listing
of each item one time. And of course, we can have
many duplicates over here. Now, through the
relationship, we'll be able to do all sorts
of amazing things. We'll be able to look up
the price, the product, and bring it over
to the Many side. We can actually look
up from the One side and pull for a particular
product all of the sales. We of course will also be able
to, from our dimension table, drag and drop any of the
fields into our reports and dashboards. We'll also be doing merges
in Power Query, where we rely on the foreign
and primary key to merge two tables. Now we want to go over to
the sheet F and D Tables. Fact tables, dimension tables. Now, a fact table,
that's the table that has the numbers we
need to summarize, like sales and units. Now, it's called a fact,
because this table contains the measurements of the
business activities. How much did we sell? How many units? Now, that word measurement
will be important, because when we get
to the data model, our DAX formulas that we'll
use to summarize and make calculations on our facts
will be called measures. Now, the fact table
has the numbers we want to make calculations on. It also will have foreign keys
that we use in relationships. And after we complete
the data model, the DAX measures
we create will be listed as a field with a special
measure icon in our field list. Dimension tables--
as we mentioned, they have first column
is our primary key or unique identifier. The remaining columns
are attributes that can be used as
criteria or filters in our reports and dashboards,
lookup values, or helper columns. The SalesRep and
Region column, these are the attributes
for filtering. Price, these would
be like lookup values that we need to use
in calculations. As for helper columns,
this is a date table that we mentioned earlier. And when we drag Month to a
pivot table or some report, the months don't sort correctly. They sort alphabetically instead
of by January, February, March. Well, we'll have to use a
helper column called MonthNumber and use it to assist this
column to sort correctly. Now, I want to go back
over to the grain sheet. What's so important about
the grain of the fact table is the grain of
a fact table will define what criteria
conditions filter and dimensions we can
use in our reporting. So for this table,
we're definitely allowed to use Product. Over here, because we've already
summarized or aggregated it, we cannot use the Product field
from that dimension table. So we always want to think about
the grain of the fact table. What is the smallest
possible size that's going to allow us
to do all the filtering and selection of criteria when
we set up a particular query? And most of the time, we'll
have the correct grain. But in some of the examples,
we'll have to fix the tables, because it does not
have the correct grain. Also, as we mentioned when
we looked at the date table-- grain from the dimension
tables, Day, Month, and Year. When we drag those
into our reports, they'll have a
different granularity. Now we want to go over to the
sheet DM for a Data Model. Now, before we look at some
examples on the next couple sheets, we want to talk
about some important concepts associated with the data model. Now, the structure
of the data model comes from the type of
measures and reports that the business
decision maker requires. That means up front, we have to
go find out where the data is and what they want,
before we ever start designing the data model. Now, requirements for
a good data model. Number one, the data model
must contain the necessary data for the end solution. That means up front,
we get the data, and then we bring it into the
data model using Power Query. Number two, the data model has
to be easy for the decision maker to use. Now, we're going to use the
star schema data model, which is an easy model to use. This also means that we'll have
to do things like name things smartly, hide any unnecessary
columns and tables, and so forth. Number three, the data model
has to allow fast queries. That means when we
change the criteria, the solution updates quickly. Now, the Microsoft Power
Tools were specifically invented so we can
import big data, create our reports
and dashboards and formulas, then
calculate quickly. Now, if our solution
doesn't, that means we have to go and try and
fix the DAX code or the data model. Number four, the data
model is easy to update when new data is available. Well, Power Query makes
that unbelievably easy. And finally, the data model
is easy to update structurally if needed. Now, both Power Query and
DAX meet this requirement perfectly, because
we can go and edit the DAX formulas or the
Power Query import transform at any time. Now, our data models will be
in Power Pivot and Power BI Desktop. The data models can contain
fact tables, dimension tables, relationships, helper
columns, and tables, DAX calculated
columns and measures-- we'll learn how to
create those later-- hidden and not hidden tables
and columns, and of course formatting. Our DAX measures or
formulas will actually be able to contain
formatting also. Now, some of the data
models we might encounter. A flat table, like we used to
do in Excel; star schema, that's what we're going to use
almost all the time; snow flake, and
even other models. Now let's go remind ourselves
on the Flat Table sheet about what we used
to do in Excel. Here is the sales table. Here's all the lookup tables. We had to do crazy formulas to
look everything up, bring it into a single flat
table, and then we could create our pivot table. The problem with
this is if you have lots of data, that
means lots of formulas, and calculation times slow down. Now, before we go talk about
the star and snow flake data models, we've got to
go over to the sheet R and talk about relationships. Now, the reason that we have
relationships between tables like this is so that
we can build solutions from multiple tables. Now as we mentioned,
the relationship between a dimension
and fact table will allow us to do
all sorts of things. We can look something
up on the Many side, look something up
on the One side. We can drag and drop
conditions or criteria into any reports or dashboards. But here's the
most amazing thing about these relationships. Now, if we have total
sales and total units, those are measures. Those are formulas. If this table has
a million rows, those formulas are calculated
over a million rows. But get this. If we have Product
as a field dropped into a pivot table
or a dashboard, when it gets to
the product Quad, that means this table
is filtered down to one row, one product. That filter will flow across
the one-to-many relationship, and it will actually
filter the fact table down to just the records
with the Quad product. And then the formulas calculate. Now, if you think
about what happens in Excel with array formulas,
if you have an array formula on a column for
Quad Product, in Excel, it calculates over every
single cell, even the ones without Quad. But in a data model with
relationships and DAX formulas, that doesn't happen. That's one of the most
amazing things about the data model and these relationships. Now, you can see that arrow. That means the filter will
flow from the One side to the Many side. Now, we're allowed to have
bi-directional filters, but those are usually
not the best case. There are some isolated
cases where those will be just what we want. But most of the
time, we're going to have this one-to-many
relationship, with the arrow pointing from
the dimension or lookup table over to the fact table. Now, there are other
types of relationships. One-to-many, that's what
we'll use most of the time. Many-to-many, and one-to-one. Now let's go over
to the sheet Star, and we want to talk about
the star schema data model. This is also known
as a dimension model because we have these dimension
tables surrounding one fact table. Now, the star schema,
one fact table with the numbers at
the correct granularity to match whatever
dimension tables we need. The star schema data
model will be easier to use than most models. It will allow faster queries
than most other models. And the beauty is we can select
a condition or criteria from any one of the dimension tables,
and all of the measures-- the formulas that we create-- will calculate accurately. I mean, the beauty
of this is that we have one formula we can
use in many reports, and whatever reports
we put in and slicers or filters we
connect to them, we can choose any one of those
conditions or criteria, and the formulas
calculate correctly. Now let's compare that to
the snow flake data model. Now, this is the same basic
data over on the star. We pulled in category
and manufacturer. We could see we have five
different tables here. But using Power Query, we
created the proper star schema data model. Over here, this actually
works fine in the data model. And actually, you can get
through the relationships to any one of these. But it's overly complicated,
especially if you have lots of these sub tables. A couple other things that are
problems, if you want to create hierarchies-- that
we'll learn later-- that's where you grew
multiple fields together so you can drag a single
hierarchy into a report. Well, you can't do that
in a snowflake data model. The other thing-- and
this is from Marco Russo and Alberto Ferrari,
their optimization course and in their book. They say that when you
get large data models, the actual cost of
the relationship will end up being more
in terms of efficiency than pulling these tables
in and having a table that contains duplicates. Now, as we'll see
in just a moment, the columnar database
behind the scenes that stores all the data
deals with duplicates in columns just fine. Now, one other thing
about this snow flake. If you are connecting
to some database, and the data model
comes in like this, and you don't have
a lot of data, it's perfectly all
right to use this. But in this class,
we're going to stick to the most efficient model,
the star schema model. Now, let's go just
for a second over to Relation D. This is
a relational database. Now, relational databases
are awesome for storing the company's raw data. Because in a
relational database, the goal is no
redundancy and accuracy. But if we're going to start
to try and query this, this gets pretty complicated. So we'll connect to some
relational databases like this, pull the data we need, and
create a much easier model to use. Now I want to go over to
this sheet columnar database. And this is sort
of jumping the gun, because later when we get
to talking about Power Pivot and DAX, we'll talk about
the columnar database. Columnar database
is also referred to as XVelocity, analytics
engine, or VertiPaq. And the columnar database,
if we have a proper data set like this, and we import
it into the data model either in Power Pivot
or Power BI Desktop, it doesn't get stored like a
normal table would in Excel. It gets stored as a
columnar database. It actually breaks
apart every column-- that's why it's
called columnar-- and stores only a unique list. Now, there's also a map that
we'll talk a little bit about later that is constructed. So when our formulas are looking
at the columnar database, it can reconstruct
it and give you whatever query results you want. But this is in large part
why the columnar database can take big data
and condense it down to a much smaller file size. Also in regards to
the DAX formulas, when they calculate
over a column with just the unique
values, it can calculate much more quickly. Now, the advantages to
the columnar database is it reduces file size. And in fact, even in
Excel when you don't have over a million rows-- if you have 500,000
rows in an Excel sheet, if you import it from one
workbook into another one just to store the data in
the columnar database and then build your
pivot table reports, it dramatically
reduces the file size. And of course, the
columnar database was the invention
from Microsoft that allows us to bring in hundreds
of millions of rows of data, or even more. And this columnar database
works with the DAX formulas to calculate
quickly on big data. Now let's go to our
last few sheets. I want to go over to
Clean and Transform sheet. Now, cleaning, transforming,
those are two almost synonyms. But I'm going to use
them differently. We'll clean the raw
data, which means we'll fix unusable raw data. When we have two fields
merged together in a cell, or some data source, when we
clean it and break it apart, we're cleaning the raw data. That's a piece of raw data,
and we're cleaning it. Transforming data sets, that's
when we fix unusable data sets. For example, we might add,
remove, or filter columns in the data set; combine, merge,
append, or unpivot data sets; taking relational
databases with many tables and converting just the data
we want into the star schema. Those are all transforming. Now, we'll do these two
things mostly in Power Query. Import data, that's going to
be exclusively in Power Query, not in Power Pivot. Even though Power
Pivot has a feature, we want to do it
through Power Query. Three other terms that
we're not going to use, but you might hear in
relation to what we're doing. ETL-- that simply means Extract,
Transform, and Load data. This is a term from
data warehousing. And that's exactly what we're
going to do with Power Query. Extract-- well, I
always say import. So we import the data. But what it really means is
we're extracting from some data source, then we're transforming
and loading to the data model. Data warehousing--
this is a term that means get data
from the original source and form it so that you can make
useful reports and dashboards. Data warehousing as
an intermediate step, where we take data from some
big relational database, format like into a
star schema, and then have it sitting
somewhere where we can access it to make reports. Now, there's much more
to data warehousing, and that's a term that's
made famous by Ralph Kimball. Ralph Kimball authored
some great books, like The Data Warehouse Toolkit. Now we'll be doing a version
of data warehousing, where we build star schema data
models with our MS Power Tools. And finally, SQL, that means
Structured Query Language. That is a computer language used
to build, maintain, and query databases. Now, we're not going to
write SQL in this class. But much of what
we do is designed to replace SQL with
an easier tool. And some of the terms like
Grouping in Power Query come straight from SQL. Now, I have a sheet from
one of the pre-req classes. And this was what we
called data analysis. Bad data, we cleaned it to
become a proper data set into useful information. In this DI and BI
class, would be able to connect to
lots of source data, and whether or not it's bad
or good data or useful data, we'll clean it,
transform it, and load it to become star schema
data model, and then into useful information. All right, if you
like that video, be sure to click that
thumbs up, leave a comment, and sub, because there's
always lots more videos to come from Excel
is Fun, including our next video,
video number three, where we'll have an
introduction to Power Query. All right, we'll
see you next video.