Business Intelligence: Multidimensional Analysis

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
To make better decisions faster, managers require useful information that is readily available and flexible enough for analysis. This requirement isn't trivial, mostly because of the gap that exists between the mountain of raw data that companies collect, and the information that business people need. To bridge this gap, companies have to realize how information can be used for business analysis, and understand how computer systems convert raw data into useful information. To understand what makes information useful for business analysis, let's walk through a simple example of business intelligence in action. A major US beverage company is trying to break into the German market. They're selling four different flavors of soda: cola, cherry, grape, and lemon-lime; and they're interested in what's going on in four major markets: Munich, Frankfurt, Cologne, and Berlin. They want to begin by analyzing sales - a pretty reasonable place to start. This table shows their sales information for the last two quarters of 2013. It looks like the company has identical sales performance in both. The next step in their sales analysis may be to examine the same sales data by a few other perspectives, including the flavor of soda being sold and the market where it was sold. Hopefully you noticed that the total sales are the same - $32,000 - for all three views. This is a reassuring sign. It means you can be confident that you're looking at the same information - the soda sales of the beverage company - but with each view broken out into different categories. Let's take a minute to consider what we're looking at. These are the overall soda sales broken out by three distinct categories: time, geography, and product. These categories are called dimensions. Based on these data tables, it's not real obvious what question we should ask next. All we know at this point is that sales are identical for each of the two quarters, identical for each of the four products, and identical for each of the four geographies. The data is boring. I'm bored by it. You're probably bored by it. There aren't any interesting patterns, there isn't much to analyze, nothing's happening. Instead of looking for a fourth dimension to categorize the sales data, let's see what happens when we mash the existing three dimensions together to create a single, multidimensional view of the data. Nice! All kinds of patterns and anomalies start jumping out. Significant management information - which was obscured by separate analysis of the dimensions - suddenly becomes obvious. For example, lemon-lime and grape flavors don't sell in Cologne and Berlin in the third quarter, but cola and cherry do. It's vice-versa for the fourth quarter. This process of interacting with multidimensional views of the data - slicing and dicing - almost always reveals new and interesting information compared to isolated, single-dimension data lists. This is called multidimensional analysis. At this point, you're probably thinking that multidimensional analysis is an obvious way to analyze data - and you're right. Multidimensional analysis is very intuitive for businesspeople, because it represents a natural, easy, and effective way to analyze information. The more difficult task is getting the data into a format that supports multidimensional analysis. Let's take a minute and think about where a company's data originates. The sales information for the beverage company was generated by lots of individual shoppers plunking a soda down in a checkout line at a retailer. Each checkout transaction was then stored in a database, which was specially designed to store information about every checkout transaction in every store. This type of database is called an operational database, because its job is to support the day-to-day operations of a company. Even though data is captured and stored in an operational database, the data isn't necessarily available for business analysis. In fact, the opposite is usually true. These databases are structured for the purpose of running the day-to-day business by processing transactions. They aren't structured for effective business analysis. While the underlying technologies have changed pretty dramatically over time, operational databases still have the same basic functionality: they gather, update, store, retrieve, and archive data. In simplest terms, a rotary card file like the kind my grandfather used in the 1950s to store addresses is a database. At the high end, a large enterprise like Wal-Mart stores and manages trillions of records, taking up terabytes, and even petabytes, of storage in a database. Collectively, operational systems are usually referred to as online transaction processing systems - OLTP for short. To understand how an OLTP system works, let's walk through a common OLTP example: purchasing a tube of toothpaste at Target with a debit card. You walk up to the cash register, hopefully not having stood in line for too long, and place the toothpaste on the rubber conveyor belt. The cashier swipes the barcode on the toothpaste packaging across the point-of-scale scanner; the POS system looks up the price of the toothpaste, totals the cost of the transaction including sales tax, and prompts you to swipe your debit card. You swipe your debit card, input your PIN, and the POS system talks to your bank and transfers the cost of the toothpaste from your account to Target's. After the money is transferred, the POS generates a receipt, the cashier bags your purchase ,and you're out the door. This entire process is you interacting with an OLTP system, and this activity has three characteristics that fundamentally define the system. First, it processes a transaction - you're exchanging money for toothpaste according to the rules of the game, which are valid product, valid debit card, correct PIN code, money in the account, and so on. Next, it performs all of the elements of the transaction in real time: retrieving the price of the merchandise, verifying who you are, transferring the money, immediately updating your account, and so on. Everything about the transaction occurs more or less simultaneously. And finally, it processes lots of transactions on a continuous basis. Not just your purchase, but the purchases of every other Target shopper are processed continuously all day, every day. There are no rigid cutoffs, stops, or starts - at least none that are apparent to you or any other shopper checking out. OLTP systems are everywhere: order tracking, invoicing, credit card processing, retail point-of-sale, inventory control, personnel management, banking, airline reservations, telephone call processing, and more. OLTP is designed for managing the raw data of business, which requires efficiency and up-to-the-minute processing of transactions at the lowest level of detail. So why not use these same OLTP systems for business analysis? The answer lies in the questions you want to have answered efficiently. For example, "Did somebody buy toothpaste today at the Framingham Target?" is a transaction question that can be easily answered by an OLTP system without much fuss. While this sort of data can be served up quickly, it's not really useful for an analysis of the overall business. "As a trend over the past 6 months, what has been the average dollar sales of toothpaste per Target store per week in the greater Boston area? Which Targets sell toothpaste as the greatest percentage of their oral care category sales? Which Targets are seeing the fastest growth in toothpaste sales?" Those are interesting questions, with potentially interesting answers, that might affect how both Target and toothpaste manufacturers do business. But they aren't questions you can ask an OLTP system. There are way too many records to search, sort, and summarize. And don't forget the mathematical calculations that are also required to obtain the answers. Imposing these types of queries on the point-of-sale OLTP system on a regular basis would almost definitely interfere with the main business of ringing up shoppers' purchases. While OLTP systems are lousy for analysis, they do gather the data that is the foundation for multidimensional analysis. And that's the challenge that businesses face. It's those millions and billions of transactions in operational systems that are the rich basis of data that needs to be converted into useful information for business analysis. Getting it there isn't simple, especially when the data resides within multiple, disparately organized, and frankly, often crufty old legacy systems. Historically, IT departments spent the majority of their resources developing and maintaining operational systems. But in today's environment, such systems are almost always purchased as packaged software applications. These apps typically include meaningful reporting capabilities, which do have value for performing business analysis and are rightly part of an overall BI strategy. Unfortunately, virtually all of them suffer from two basic limitations. The first limitation is that they report only on their own internally gathered data, without the ability to combine data or absorb structures from other systems. This situation, where reporting from one system is essentially blind to the data from other systems, is commonly known as "stovepipe reporting" or "silos of data". When you report from these silos, it's difficult to create a broad set of metrics and key performance indicators that combine data from multiple systems. For example, a typical productivity measure in many industries is revenue per employee. That KPI is computed by combining revenue data from the order management system, and employee data from the human resources system. The second basic limitation of operational reporting is that it typically doesn't do a good job supporting high-speed multidimensional analysis. It's significantly slower, less intuitive, and less flexible than required. For these reasons, some type of BI system - where data from multiple operational systems, and possibly outside data sources, is pulled together for analysis - is part of every large company's infrastructure. Fortunately, all operational systems have export capabilities. Even better, the vendors who design and sell operational systems understand the limitations of data silos, and are increasingly designing their systems to be integrated with downstream BI systems. Most modern BI systems are designed for a specific analysis model: online analytical processing, or OLAP. The name was chosen because it contrasted well with online transaction processing, a term that was already in wide use. E.F. Codd, an IBM researcher who coined the term OLAP - he also invented SQL and relational databases - wanted to highlight the fundamental differences between transaction processing, and analytical processing. In 1993 he laid out a broad set of criteria for OLAP databases. Without getting into the minutiae, there are three critically important capabilities that all OLAP systems have to have. First, OLAP provides an intuitive data model that users who aren't necessarily trained as analysts can quickly understand. This model is multidimensional analysis - being able to see data through multiple filters, or dimensions, at the same time. The best managers and analysts were multidimensional thinkers, even before OLAP, asking questions like: What are our actual sales compared to the forecasted sales by region, by salesperson, and by product? What is our profitability by product and by customer? What is our backlog by product, by customer, and by time? Region, salesperson, customer, product, and time - the by, by, by things - are typical dimensions for business activities. Unlike standard relational databases, where dimensionality is present in the data but is hard to see, OLAP systems organize the data directly as multidimensional structures. OLAP systems also provide easy-to-use tools for users to get at the information in multiple and simultaneous dimensional views. Secondly, OLAP is fast for the user. Fast retrieval times let managers and analysts ask - and answer - more questions in a concentrated period of time than ever before. This addresses the "infinite query syndrome", where every interesting pattern and useful piece of information in a database can be theoretically explored. Finally, OLAP systems have robust calculation engines for handling the specialized calculation requirements that a multidimensional structure imposes. Think about being forced to write spreadsheet formulas that interlink more than two dimensions. That's not something easy for human beings to conceptualize. OLAP calculation engines structure the data in a way that allows an analyst to write simple, straightforward formulas that automatically perform across multiple dimensions. The power of OLAP comes from structuring the data so it naturally supports the way people do analysis work. The highest-level OLAP structure is a dimension. Simply defined, a dimension is a categorically consistent view of data - in other words, all of the members within a dimension, such as products, belong together as a group. There are two simple tests for this. First, can data about the members be compared? For example, the sales numbers of one product can be compared to other sales numbers in a product dimension. This is the famous "apples to apples" test. Secondly, can the data from the members be aggregated into summary members? For example, the members January, February, and March of a time dimension can be aggregated together to a value for Q1. One of the interesting things about dimensions is the ability to slice and dice multidimensional data, just like you might slice and then dice a tomato. Think of the slice as being a specific member of a dimension, for example, Product X in the product dimension. When we dice with OLAP, we're creating a series of intersections for that Product X slice with data from other dimensions. We can view sales of Product X by month, by region, and so on. The word "by" indicates how we're dicing the data. It's this multidimensional intersection of data, and how it's organized in an OLAP database, that makes multidimensional analysis especially interesting - and powerful. The ability to intuitively and instantly switch views is a basic feature of any respectable OLAP user tool. Another capability inherent in OLAP's multidimensional design is the pivoting and nesting of dimensions, which are point-and-click operations in good OLAP systems. Take a look at this table - it's the same soda data we looked at earlier, but the products have been pivoted from a row orientation to a column orientation, and the markets were pivoted from columns to rows. In addition, the quarters have been flipped on the row axis to an inside nested row position. Now you can see the column totals for products that weren't available in the previous version of the table. Multidimensional data in an OLAP system is typically visualized as a cube storage structure with lots of mini-cubes, or cells, making up the cube as a whole. The data is a cube because it has three dimensions of data: geographies, time periods, and products. This is a data cube that contains the information for our simple soda example. Each cell contains a value, which is the value of sales for the intersection of each unique combination of time period, geography, and product. For example, the value of the cell for cola-flavored soda in Cologne in the 3rd quarter is $2000. The total value of all the cells, as we saw in our data tables, is $32,000. Let's look at the cube again, this time highlighting a slice as it might be depicted on an OLAP-generated report. The cherry soda member of the product dimension crosses over all intersections of the time and geography dimensions. The sum of these highlighted cells is the total sales for cherry soda, which is $8,000. Slicing and dicing, in the OLAP world, is calling up subsets of cells that the user has selected to view from the master OLAP cube stored on a data server. In the real world, an OLAP cube usually has dimensions with hundreds, thousands, or even millions of members. When we call an OLAP database a cube it's a metaphor, not an actual storage structure. The technical definition of an OLAP cube is a multidimensional structure that stores and maintains discrete intersection values. Some OLAP database architectures may also include multiple cubes with intersections between them. Let's assume that you're a business analyst for a typical consumer goods company. Your job is to perform a thorough analysis of units sold, average prices, and dollar sales for 100 specific products that your company sells. Your analysis includes looking at patterns across 24 months for 200 major cities in the US. So how many data points are you going to have to look at? The answer is 1,440,000 - that's the three measures by 100 products by 24 months by 200 geographies. That's a lot of data points. Of course, not all of the products are sold in all 200 cities, and not all of the products were carried for all 24 months, but can you really know what products were carried where and when? This analysis problem is addressed in OLAP systems by letting you organize the data into hierarchies, which aggregate the data to higher and higher levels. For example, the monthly data can be rolled up - summarized - to quarter and year totals, the cities rolled up to regions and states, the products rolled up to product lines and product groups, and so on. Average prices can also be back-calculated at each level in the hierarchy, from the aggregations of the unit sales and dollar sales. The multidimensional analysis process becomes manageable, because a hierarchical organization of each dimension lets you start at the top of the hierarchy, and drill-down through the data as you see interesting patterns or anomalies. This top-down approach to analysis is similar to a game of 20 questions. You start by asking broad questions. Then, depending on the answers, you ask increasingly focused questions. Finally, you figure out the correct answer. Thinking top-down is a natural way for us humans to organize complex information, but it is a learned behavior that takes some practice. Multidimensional data, organized into hierarchies, is an intuitively comfortable way to analyze the mountains of data that come out of transactional systems. The mechanics of OLAP interfaces - especially pointing and clicking to drill-down into interesting layers - is made possible by rapid responses to queries. This functionality gives managers and analysts a completely new process for dealing with large amounts of data, a process called ad-hoc analysis. Paper-based reporting with fixed formats presumes a set list of questions that require answers. Ad hoc analysis doesn't have those constraints. Virtually any question - assuming a well-constructed OLAP system - can be answered in real-time with a few minutes of drill-down. Virtually any report can be formatted or graphed with pivoting and nesting of dimensions. Virtually anyone in the company, even executives who like to ask lots of questions, can be taught how to do it with minimal training. For an example of a hierarchy, let's use a sample time dimension from the year 2013. A hierarchy is an organization of levels within a dimension that reflects two things: how additive data is aggregated level by level, and the top-down drill-down path for users within the dimension. In our example time dimension, the hierarchy is a general calendar organization of the levels: year, quarter, and month. It's possible to have more than one hierarchy for the same dimension, depending on how you want to look at the data. Just remember that every hierarchy in the dimension uses the same bottom-level data. More than likely, you're going to supplement your analysis work with descriptive information that isn't part of the hierarchy but is still important to your analysis. This non-hierarchical information is known as attributes. Examples of attributes are model, size, list price, color, flavor, street address, FIPS time codes, and so on. So now that we've covered how you can structure and view data in an OLAP database using dimensions and hierarchies, it's time to talk about the actual data you're aggregating, storing, calculating, and otherwise analyzing. The data in most BI applications, and all OLAP systems, are called measures. In simplest terms, a measure is any quantitative expression. For example, dollar sales is an important measure for analyzing sales activity, as we saw earlier in the soda company example. In OLAP systems, a measure is what's being analyzed across multiple dimensions. For example, dollar sales of soda by month, by product, and by city. The measure is dollar sales, and each intersection of the three dimensions has a different dollar sales value. The total of dollar sales is the sum of the values of all combinations of members at the lowest level. There are four important parameters for what measures are, and how they work in multidimensional analysis. First, a measure is always a quantity, or a mathematical formula that yields a quantity. Secondly, a measure can take any quantitative format, for example, an absolute value like unit sales, a currency value like dollar sales, a percentage like percent sales, or a ratio like sales per employee. Next, a measure can be derived from any original data source or calculation. For example, a measure can be a direct input like unit sales, an aggregation like total unit sales, an average like average selling price, a formula like dollar sales divided by employees, a count like number of products, and so on. And finally, there has to be at least one measure to do any type of OLAP analysis. A typical OLAP cube will usually have lots of different measures, and in some extreme cases hundreds and even thousands of measures. The measures being analyzed depend on the purpose of OLAP system. For example, unit sales and sales per salesperson are important measures for a sales department. Inventory turns and debt-to-equity ratios are standard calculations in balance sheet analysis. Same store sales are a fundamental metric for large retail operations. Burn rate is an important cash metric for venture-financed startups. Measures in business intelligence are generally known by different names, depending on the application. The terms "metric" and "key performance indicator", or KPI, are used interchangeably to represent important measures you should pay attention to. The term "benchmark" refers to a measure used for making comparisons, for example, the average cost of goods sold. A ratio is a measure where the result is calculated specifically from dividing one measure by another, such as sales per salesperson. We've seen that the gap between raw data and business information can be bridged through a series of processes that start with OLTP systems, where the data is collected, and end with BI systems, where the data is organized into dimensions, hierarchies, and measures for analysis. OLAP systems enable ad hoc analysis, and on-the-fly construction of specialized reports, by allowing users to slice and dice their data by different dimensions, pivoting rows and columns, and drilling down through hierarchies. These standard OLAP system capabilities allow users to quickly ask and answer questions and identify patterns and anomalies otherwise obscured by single-dimensional analysis.
Info
Channel: Michael Lamont
Views: 111,906
Rating: 4.9393454 out of 5
Keywords: Business Intelligence (Industry), Software (Industry), Multidimensional Analysis, OLTP, OLAP, Analytics, Dimensions, Aggregates, Attributes, Measures, KPI, Key Performance Indicator, Data Cubes, Technology
Id: IhFkNmVmwn4
Channel Id: undefined
Length: 24min 14sec (1454 seconds)
Published: Tue Sep 16 2014
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.