MSPTDA #02: Data Analysis Business Intelligence Terms: Microsoft Power Tools for Data Analysis Class

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[MUSIC PLAYING] Welcome back to MSPTDA number 02, Microsoft Power Tools for Data Analysis. Now, before we can talk about the tools like Power Query and Power Pivot, we've got to get busy defining data analysis and business intelligence terms. Here are some of the terms we're going to learn in this video, like what is granularity and why is it important for data analysis? Now, this video-- video number 02, like last video, is mostly an informative video. We're not going to be using any of the tools. But for all the videos in this class, you can download the associated files below the video. Both for this video and last video, there are comprehensive PDF notes. So if you like to read in written form what you see in this video or download this Excel file which has all the same information that's over in the PDFs, click on those links below the video. Now, we're going to start off by going to the sheet DA-BI. And we have to define the term data analysis. Now, in all the prerequisite classes, this is the definition. Convert raw data into useful information for decision makers. Now, that is what we're going to be doing in this class. But since we're in a business context, we're going to use the term business intelligence, which if we change a few of the words in the definition, we get convert raw data into useful, actionable, refreshable information for decision makers in a business situation. Now, the term business intelligence has been around for a while, and it's often associated with data modeling, like this star schema we're going to use in this class. But really, these are synonyms for us. Another word we'll be using often is query. Now, query just means to ask a question. Those of us from the Excel side might not use this word as much as people from the database or SQL site. But we're always asking questions of the raw data. And guess what? For those of us in Excel, we don't think of it this way. But when we change the criteria or filter in a slicer, or drag and drop a field into a pivot table, we are changing the question or the query. Down here in this pivot table, those of us from Excel, we say, well, we're going to select the criteria or apply this filter April. But really when we select April or June, we're changing the question we're asking of the raw data, and thus changing the query. So query, we'll get to use that a lot. Now, the goal of data analysis and business intelligence, create useful, updatable, actionable information for decision makers, whether we're going from text in Excel files using Power Query to get our useful information, various files to create a Power BI Desktop dashboard, or connecting to an SQL database, creating our star schema, and then creating some reports. Now on the sheet Raw Data, we all know this. No, this is not good raw data. Yes, this is good raw data. We cannot analyze easily based on zip code. Here we can. Raw data is always the data stored in its smallest form. And of course, we'll do things like get files like this, and convert it to a proper data set using Power Query. Now, Proper Data Set, let's go over to that sheet. Proper data set, we all probably know this. But let's make sure we're on the same page. Field names in the first row, records in subsequent rows. Now, there are lots of synonyms for this. Field, column, headers. Those of us from the database side would call these fields. Those of you that use pivot tables all the time would call it field, because in pivot table field list, it calls them fields. Now, in a lot of the power tools, they use the word column. So fields or columns, they are synonyms. Also headers, that's another common synonym. Now, a field is very important. It defines what sort of data goes in this column. Date goes in this column, units in this column, sales in this column. Also, something that we Excel people don't usually think about-- data type. What type of data? Is it a date, a whole number, a decimal number, or currency? Now, in our power tools, we'll have to define most of the time using Power Query that a decimal or currency is going to be in this column, a date is in this column. Also, field names, we want to make sure and always name them using two criteria-- name the field smartly and consistently. And we'll do that in this class. Now, sales are key. This is going to be-- and we'll talk about this later-- a foreign key that connects to a primary key. But this one right here is a product foreign key. So if I'm going to use the word Key to define all of the foreign keys or the primary key on the other side in the lookup or dimension table, that's not a good name. I want to make sure, Product Key. Now, naming things smartly is one of those things that most humans on the planet Earth that use computers are required to do, but most people don't. In this class, we'll be vigilant about naming everything-- not just the fields, but also the tables and other objects. Now, this is an example of a proper data set in Excel. It's very important in Excel that you have empty cells all the way around your table. In other situations-- a database, Power BI, text files-- that's not important. But that is important in Excel. We'll have an example of that in just a moment. We can also have proper data sets in text files. Notice the field names are in the first row, records in subsequent rows. This is called a tab-delimited file, because the columns and cells for each row are separated by a tab. Now, our tools can interpret this correctly. For example, when we import it into the data model in either PowerPivot or Power BI Desktop using Power Query, Power Query will have no problem interpreting this correctly. Here's another example of proper data sets. Here they're listed in a data model. But notice the names. This is a fact table, which we'll define later in this video. But we're always going to preface the name for our fact table with a little f. Same with our dimension tables or lookup tables. We'll preface it with a little d. Everything's going to be named smartly and consistently, whether they are field names or they're measures. Those are our formulas we're going to use in our reports and dashboards or our table names. Now let's go over to the next sheet, Excel Table. If your data is coming from an Excel spreadsheet and you're actually going to use it in one of the tools in this workbook, then you have to convert it to an Excel table. Now, you can have data like this on a sheet in an external Excel workbook and use Power Query to import it. But when you have your data here in Excel and you want to use it in this current workbook, you've got to convert it to an Excel table. How do we do that? Empty cells all the way around. Click in one cell in the proper data set. Insert Table, or we'll just use the keyboard Control-T. Now notice, that's why we have to have empty cells all the way around. So I'm going to click Escape, over to the side. Control-T and Enter. Why do we have to use Excel tables? Because if we add new records below this table or delete records, anything pointing to this table object will update. Now in Excel, it doesn't matter if it's a formula, chart, or pivot table. But Power Query, Power Pivot, when Microsoft invented these, they said we want these solutions to always automatically update when source data changes. So that's why we're always required to convert data in the current workbook to an Excel table before we can import it to our power tools. Now, we made a mistake here, because we created a table, but we didn't name it. To name it, simply go up to Table Tools, Design, and over to Properties. There it is. If you do this all the time, we want to use the keyboard-- Alt-J-T-A. I do Control-T, Enter, Alt-J-T-A when I create a table. Now we're going to name this smartly. d, and then SaleRep. And I left out the s, because I already had a SalesRep. But when I hit Enter, now we have named the table. Now, the next term, we want to talk about is grain. Now grain or granularity is simply the size or level of the detail. In this sales table, we have two different products sold at these two different amounts for invoice 27002. Over here, we have a single line that represents that same invoice. These two amounts add up or have been aggregated to this total amount. The grain or granularity of the sales table is at the product or invoice line level. The grain or granularity of this table is at the invoice level. Now, notice there are two lines here. And this is considered more granular. And the smaller size means these numbers are smaller. Over here, they've been aggregated to become bigger. For our product level or invoice line level, the grain is smaller here. Over here, the grain is bigger. When we have less granularity, that means there's more aggregation in the number. Now, there's another important way we want to think of granularity. Now in our data models, we'll have to have something called a date table. And we'll talk about why we have to have it, and why we can't use automatic grouping. But notice if we have the fields Date, Month, and Year, we can drag these fields into any pivot table or report. And guess what? For Sales, we'll get total sales by the day. If we drag Month, by the month. The Year, of course, will give us the total revenue for the year. So the granularity is increasing as we go from day to month to year. So when we drop criteria into a report or filter our reports, we have to think about granularity also. Remember, it's the size or level of the detail. Now, granularity will be important when designing the data model, the tables with our numbers, and when creating formulas both in Power Query and the data model. Now, we'll come back and talk much more about granularity in just a moment. But now I want to go over to the sheet P and F. Primary key, foreign key. If you are from the database side or you took the pre-req classes, you know exactly what these terms mean. And they will be important for all of our tools. Here's a lookup or dimension table. A primary key is simply a unique identifier that prevents duplicate rows. We have exactly one row for the product Fast Catch. So over in our Sales table, if we need to look up, for example, the price, we're not going to make any mistakes. Because there's exactly one 3 in the primary key, we're going to get the right price. Now, primary key, there are no duplicates. With the foreign key, there can be many duplicates. 3 is listed multiple times, because of course we can sell Fast Catch many times. Primary key, a unique list. Foreign key, many duplicates. Now, primary and foreign keys will be important. Well, VLOOKUP in Excel-- if you're from the Excel side, you know that you have to lookup a particular product many times in a sales table to go over to a unique list, find a match, and then go and get a particular price. Relationships in the data model. We will actually build a relationship between the primary key and the foreign key. And that relationship will be called a one-to-many. We have a unique listing of each item one time. And of course, we can have many duplicates over here. Now, through the relationship, we'll be able to do all sorts of amazing things. We'll be able to look up the price, the product, and bring it over to the Many side. We can actually look up from the One side and pull for a particular product all of the sales. We of course will also be able to, from our dimension table, drag and drop any of the fields into our reports and dashboards. We'll also be doing merges in Power Query, where we rely on the foreign and primary key to merge two tables. Now we want to go over to the sheet F and D Tables. Fact tables, dimension tables. Now, a fact table, that's the table that has the numbers we need to summarize, like sales and units. Now, it's called a fact, because this table contains the measurements of the business activities. How much did we sell? How many units? Now, that word measurement will be important, because when we get to the data model, our DAX formulas that we'll use to summarize and make calculations on our facts will be called measures. Now, the fact table has the numbers we want to make calculations on. It also will have foreign keys that we use in relationships. And after we complete the data model, the DAX measures we create will be listed as a field with a special measure icon in our field list. Dimension tables-- as we mentioned, they have first column is our primary key or unique identifier. The remaining columns are attributes that can be used as criteria or filters in our reports and dashboards, lookup values, or helper columns. The SalesRep and Region column, these are the attributes for filtering. Price, these would be like lookup values that we need to use in calculations. As for helper columns, this is a date table that we mentioned earlier. And when we drag Month to a pivot table or some report, the months don't sort correctly. They sort alphabetically instead of by January, February, March. Well, we'll have to use a helper column called MonthNumber and use it to assist this column to sort correctly. Now, I want to go back over to the grain sheet. What's so important about the grain of the fact table is the grain of a fact table will define what criteria conditions filter and dimensions we can use in our reporting. So for this table, we're definitely allowed to use Product. Over here, because we've already summarized or aggregated it, we cannot use the Product field from that dimension table. So we always want to think about the grain of the fact table. What is the smallest possible size that's going to allow us to do all the filtering and selection of criteria when we set up a particular query? And most of the time, we'll have the correct grain. But in some of the examples, we'll have to fix the tables, because it does not have the correct grain. Also, as we mentioned when we looked at the date table-- grain from the dimension tables, Day, Month, and Year. When we drag those into our reports, they'll have a different granularity. Now we want to go over to the sheet DM for a Data Model. Now, before we look at some examples on the next couple sheets, we want to talk about some important concepts associated with the data model. Now, the structure of the data model comes from the type of measures and reports that the business decision maker requires. That means up front, we have to go find out where the data is and what they want, before we ever start designing the data model. Now, requirements for a good data model. Number one, the data model must contain the necessary data for the end solution. That means up front, we get the data, and then we bring it into the data model using Power Query. Number two, the data model has to be easy for the decision maker to use. Now, we're going to use the star schema data model, which is an easy model to use. This also means that we'll have to do things like name things smartly, hide any unnecessary columns and tables, and so forth. Number three, the data model has to allow fast queries. That means when we change the criteria, the solution updates quickly. Now, the Microsoft Power Tools were specifically invented so we can import big data, create our reports and dashboards and formulas, then calculate quickly. Now, if our solution doesn't, that means we have to go and try and fix the DAX code or the data model. Number four, the data model is easy to update when new data is available. Well, Power Query makes that unbelievably easy. And finally, the data model is easy to update structurally if needed. Now, both Power Query and DAX meet this requirement perfectly, because we can go and edit the DAX formulas or the Power Query import transform at any time. Now, our data models will be in Power Pivot and Power BI Desktop. The data models can contain fact tables, dimension tables, relationships, helper columns, and tables, DAX calculated columns and measures-- we'll learn how to create those later-- hidden and not hidden tables and columns, and of course formatting. Our DAX measures or formulas will actually be able to contain formatting also. Now, some of the data models we might encounter. A flat table, like we used to do in Excel; star schema, that's what we're going to use almost all the time; snow flake, and even other models. Now let's go remind ourselves on the Flat Table sheet about what we used to do in Excel. Here is the sales table. Here's all the lookup tables. We had to do crazy formulas to look everything up, bring it into a single flat table, and then we could create our pivot table. The problem with this is if you have lots of data, that means lots of formulas, and calculation times slow down. Now, before we go talk about the star and snow flake data models, we've got to go over to the sheet R and talk about relationships. Now, the reason that we have relationships between tables like this is so that we can build solutions from multiple tables. Now as we mentioned, the relationship between a dimension and fact table will allow us to do all sorts of things. We can look something up on the Many side, look something up on the One side. We can drag and drop conditions or criteria into any reports or dashboards. But here's the most amazing thing about these relationships. Now, if we have total sales and total units, those are measures. Those are formulas. If this table has a million rows, those formulas are calculated over a million rows. But get this. If we have Product as a field dropped into a pivot table or a dashboard, when it gets to the product Quad, that means this table is filtered down to one row, one product. That filter will flow across the one-to-many relationship, and it will actually filter the fact table down to just the records with the Quad product. And then the formulas calculate. Now, if you think about what happens in Excel with array formulas, if you have an array formula on a column for Quad Product, in Excel, it calculates over every single cell, even the ones without Quad. But in a data model with relationships and DAX formulas, that doesn't happen. That's one of the most amazing things about the data model and these relationships. Now, you can see that arrow. That means the filter will flow from the One side to the Many side. Now, we're allowed to have bi-directional filters, but those are usually not the best case. There are some isolated cases where those will be just what we want. But most of the time, we're going to have this one-to-many relationship, with the arrow pointing from the dimension or lookup table over to the fact table. Now, there are other types of relationships. One-to-many, that's what we'll use most of the time. Many-to-many, and one-to-one. Now let's go over to the sheet Star, and we want to talk about the star schema data model. This is also known as a dimension model because we have these dimension tables surrounding one fact table. Now, the star schema, one fact table with the numbers at the correct granularity to match whatever dimension tables we need. The star schema data model will be easier to use than most models. It will allow faster queries than most other models. And the beauty is we can select a condition or criteria from any one of the dimension tables, and all of the measures-- the formulas that we create-- will calculate accurately. I mean, the beauty of this is that we have one formula we can use in many reports, and whatever reports we put in and slicers or filters we connect to them, we can choose any one of those conditions or criteria, and the formulas calculate correctly. Now let's compare that to the snow flake data model. Now, this is the same basic data over on the star. We pulled in category and manufacturer. We could see we have five different tables here. But using Power Query, we created the proper star schema data model. Over here, this actually works fine in the data model. And actually, you can get through the relationships to any one of these. But it's overly complicated, especially if you have lots of these sub tables. A couple other things that are problems, if you want to create hierarchies-- that we'll learn later-- that's where you grew multiple fields together so you can drag a single hierarchy into a report. Well, you can't do that in a snowflake data model. The other thing-- and this is from Marco Russo and Alberto Ferrari, their optimization course and in their book. They say that when you get large data models, the actual cost of the relationship will end up being more in terms of efficiency than pulling these tables in and having a table that contains duplicates. Now, as we'll see in just a moment, the columnar database behind the scenes that stores all the data deals with duplicates in columns just fine. Now, one other thing about this snow flake. If you are connecting to some database, and the data model comes in like this, and you don't have a lot of data, it's perfectly all right to use this. But in this class, we're going to stick to the most efficient model, the star schema model. Now, let's go just for a second over to Relation D. This is a relational database. Now, relational databases are awesome for storing the company's raw data. Because in a relational database, the goal is no redundancy and accuracy. But if we're going to start to try and query this, this gets pretty complicated. So we'll connect to some relational databases like this, pull the data we need, and create a much easier model to use. Now I want to go over to this sheet columnar database. And this is sort of jumping the gun, because later when we get to talking about Power Pivot and DAX, we'll talk about the columnar database. Columnar database is also referred to as XVelocity, analytics engine, or VertiPaq. And the columnar database, if we have a proper data set like this, and we import it into the data model either in Power Pivot or Power BI Desktop, it doesn't get stored like a normal table would in Excel. It gets stored as a columnar database. It actually breaks apart every column-- that's why it's called columnar-- and stores only a unique list. Now, there's also a map that we'll talk a little bit about later that is constructed. So when our formulas are looking at the columnar database, it can reconstruct it and give you whatever query results you want. But this is in large part why the columnar database can take big data and condense it down to a much smaller file size. Also in regards to the DAX formulas, when they calculate over a column with just the unique values, it can calculate much more quickly. Now, the advantages to the columnar database is it reduces file size. And in fact, even in Excel when you don't have over a million rows-- if you have 500,000 rows in an Excel sheet, if you import it from one workbook into another one just to store the data in the columnar database and then build your pivot table reports, it dramatically reduces the file size. And of course, the columnar database was the invention from Microsoft that allows us to bring in hundreds of millions of rows of data, or even more. And this columnar database works with the DAX formulas to calculate quickly on big data. Now let's go to our last few sheets. I want to go over to Clean and Transform sheet. Now, cleaning, transforming, those are two almost synonyms. But I'm going to use them differently. We'll clean the raw data, which means we'll fix unusable raw data. When we have two fields merged together in a cell, or some data source, when we clean it and break it apart, we're cleaning the raw data. That's a piece of raw data, and we're cleaning it. Transforming data sets, that's when we fix unusable data sets. For example, we might add, remove, or filter columns in the data set; combine, merge, append, or unpivot data sets; taking relational databases with many tables and converting just the data we want into the star schema. Those are all transforming. Now, we'll do these two things mostly in Power Query. Import data, that's going to be exclusively in Power Query, not in Power Pivot. Even though Power Pivot has a feature, we want to do it through Power Query. Three other terms that we're not going to use, but you might hear in relation to what we're doing. ETL-- that simply means Extract, Transform, and Load data. This is a term from data warehousing. And that's exactly what we're going to do with Power Query. Extract-- well, I always say import. So we import the data. But what it really means is we're extracting from some data source, then we're transforming and loading to the data model. Data warehousing-- this is a term that means get data from the original source and form it so that you can make useful reports and dashboards. Data warehousing as an intermediate step, where we take data from some big relational database, format like into a star schema, and then have it sitting somewhere where we can access it to make reports. Now, there's much more to data warehousing, and that's a term that's made famous by Ralph Kimball. Ralph Kimball authored some great books, like The Data Warehouse Toolkit. Now we'll be doing a version of data warehousing, where we build star schema data models with our MS Power Tools. And finally, SQL, that means Structured Query Language. That is a computer language used to build, maintain, and query databases. Now, we're not going to write SQL in this class. But much of what we do is designed to replace SQL with an easier tool. And some of the terms like Grouping in Power Query come straight from SQL. Now, I have a sheet from one of the pre-req classes. And this was what we called data analysis. Bad data, we cleaned it to become a proper data set into useful information. In this DI and BI class, would be able to connect to lots of source data, and whether or not it's bad or good data or useful data, we'll clean it, transform it, and load it to become star schema data model, and then into useful information. All right, if you like that video, be sure to click that thumbs up, leave a comment, and sub, because there's always lots more videos to come from Excel is Fun, including our next video, video number three, where we'll have an introduction to Power Query. All right, we'll see you next video.
Info
Channel: ExcelIsFun
Views: 66,585
Rating: 4.9882827 out of 5
Keywords: Excelisfun, Highline College, Data Analysis, Business Intelligence, BI 348, Mike Girvin, Microsoft Power Tools for Data Analysis 01, Free Business Intelligence Class, Online Business Intelligence Class, Terms for class, Granularity, Grain, Define Granularity, Grain of Fact Table, Granularity of Fact Table, Columnar Database, Star Schema Data Model, Relationships, Goal of Data Analysis and Business Intelligence, Data Models, What makes a good Data Model, MSPTDA
Id: IOl3qlCwmEs
Channel Id: undefined
Length: 31min 26sec (1886 seconds)
Published: Sat Jun 23 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.