Database Lesson #8 of 8 - Big Data, Data Warehouses, and Business Intelligence Systems

Captions
Good day, everyone. This is Dr. Soper here. And today we will be exploring the eighth and final topic in our series of lectures on databases, with today's topic focusing on big data, data warehouses, and business intelligence systems. Although the objectives for today's lecture are varied and manifold, broadly speaking, our goals are to extend our knowledge of database systems beyond relational databases by exploring some of the additional ways in which database technologies can be used to achieve organizational objectives. Along the way, we will explore the basic concepts of data warehouses and data marts. And we'll learn about the basic concepts and architecture associated with dimensional or multi-dimensional databases. We will also learn about business intelligence systems, online analytical processing, and data mining, all of which are very useful for helping managers to make decisions. And then toward the end of today's lecture, we will explore some of the fastest growing topics in the database world, including big data, the NoSQL movement, structured storage, and the MapReduce process. So without further ado, let's begin. I thought a good way of introducing some of the topics in this lecture would be through an exploration of human mathematical intuition. That is, do we as human beings have a natural intuition for finding truth in numbers? To examine this question, consider the following example. Here we have two people, whom I've named Bubba and Daisy. And both Bubba and Daisy are interested in purchasing new vehicles. Currently, Bubba drives a very inefficient vehicle. Let's say that he drives an old truck, and he gets 10 miles per gallon. Whereas Daisy drives a reasonably efficient car, and in her case, her current vehicle gets 30 miles per gallon. Now let's say that both Bubba and Daisy are interested in purchasing new vehicles that will have better fuel economy than their current vehicles, with the goal of needing to purchase less gasoline every year, thereby saving money. Bubba, however, does not want to give up all of the benefits of his big truck, so the vehicle he's looking at provides just 12 miles per gallon, as opposed to his current vehicle, which provides 10. Daisy, on the other hand, is very interested in fuel economy, and so she is planning to purchase a vehicle which delivers 50 miles per gallon, as opposed to her current vehicle, which has 30. Another way of considering these numbers is by looking at the percentage increase in miles per gallon. In Bubba's case, he is moving from a vehicle which gets 10 miles per gallon to one which gets 12 miles per gallon, and that is an increase of 20% in miles per gallon. Daisy, by contrast, is moving from a vehicle with 30 miles per gallon to one which provides 50 miles per gallon, and that is an increase of 66.7%. Next, for the sake of offering a fair comparison between Bubba and Daisy, let's say that both drivers drive 10,000 miles per year. Now at this point, I want to appeal to your mathematical intuition. Based upon the situation that I've described, I would ask you a simple question: which of these drivers, Bubba or Daisy, is going to save more money on gasoline every year? Well, if you are like most people, you will look at these numbers and you'll say that clearly Daisy's vehicle has a much greater increase in miles per gallon; therefore, she will be saving the most money every year on gasoline. Unfortunately, this is the wrong answer. In fact, Bubba will be saving more money every year on gasoline. And let's see why.
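The arithmetic behind the table that follows is easy to check for yourself. Here is a minimal sketch in Python, using only the numbers from the example above (10,000 miles per year and the four fuel-economy figures); it is an illustration, not part of the lecture itself.

```python
# Fuel consumed per year = miles driven / miles per gallon.
# The numbers come straight from the Bubba-and-Daisy example above.

miles_per_year = 10_000

def gallons_per_year(mpg: float) -> float:
    """Annual fuel consumption for a vehicle with the given fuel economy."""
    return miles_per_year / mpg

bubba_savings = gallons_per_year(10) - gallons_per_year(12)   # old truck vs. new truck
daisy_savings = gallons_per_year(30) - gallons_per_year(50)   # old car vs. new car

print(f"Bubba saves {bubba_savings:.2f} gallons per year")    # ~166.67
print(f"Daisy saves {daisy_savings:.2f} gallons per year")    # ~133.33
```

Because yearly savings depend on gallons per mile rather than miles per gallon, the smaller-looking improvement turns out to be the bigger one.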
Here we have a table which shows us how much gasoline both Bubba and Daisy are currently consuming based upon their driving habits versus how much they will consume if they purchase their new vehicles. In Bubba's case, he drives 10,000 miles per year. And in his current vehicle, he gets 10 miles per gallon. Therefore he will consume 1,000 gallons of gasoline per year. In his new vehicle, Bubba will still drive 10,000 miles per year, but now will get 12 miles per gallon, and therefore will consume 833.33 gallons of gasoline per year. By contrast, in Daisy's current vehicle, she gets 30 miles per gallon. And driving 10,000 miles per year, she consumes 333.33 gallons of gas. Whereas with her new vehicle, which gets 50 miles per gallon, she will consume 200 gallons of gasoline every year. Simple subtraction, then, shows that Bubba will save 166.67 gallons of fuel per year if he purchases his new vehicle. Whereas Daisy will only save 133.33 gallons of fuel per year. That is, Bubba will enjoy greater fuel savings per year by buying a new truck, which gets 12 miles per gallon, versus his old truck, which got 10 miles per gallon, than will Daisy by buying a new car which gets 50 miles per gallon, versus her old car, which provided 30 miles per gallon. So if you're like most people, in this problem your mathematical intuition has failed. And it is for this reason that I would argue that managers should not rely on their intuition when making important managerial decisions, but rather should rely on data. Many managers have begun to recognize the deficiencies in human intuition, and are hence adopting technologies like business analytics, business intelligence, data warehousing, and data mining, so as to better support their decision-making activities with the data that the organization has at its disposal. By the way, if you're interested, this problem is known as the miles per gallon illusion, and stems from the fact that we here in the United States commonly measure fuel economy in miles per gallon, as opposed to the much more appropriate measure, which would be gallons per mile. Now that we've seen for ourselves how human intuition can fail, in even very simple mathematical problems, we can begin discussing some of the systems and technologies that have been developed to help managers make better decisions and avoid the failures that are commonly associated with relying upon intuition alone. First among these technologies are business intelligence, or BI, systems. BI systems are information systems that are intentionally designed to allow managers to capitalize upon organizational data for the purpose of improving their decision making. It's important to note that these business intelligence systems do not support the real-time operational activities of the organization. Those activities, by contrast, are supported by transaction processing systems. Instead, these BI systems are designed to support managerial assessment activities, analysis, planning, controlling, et cetera. Broadly speaking, we can classify these BI systems into two categories. First are simple reporting systems, the purpose of which is to provide managers with the ability to look at data in a flexible, real-time way; that is, these reporting systems support simple data organization capabilities, such as sorting, filtering, and grouping data. And they often provide the ability to make simple calculations on data in real time. These simple calculations might include operations such as a sum, an average, or a count.
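As a rough illustration of the kind of sorting, filtering, grouping, and simple calculations such a reporting system provides, here is a minimal sketch using the pandas library. The sales data, the column names, and the choice of pandas itself are assumptions made purely for illustration, not anything prescribed in the lecture.

```python
import pandas as pd

# Hypothetical operational data: one row per sales transaction.
sales = pd.DataFrame({
    "region":  ["West", "West", "East", "East", "East"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [120.0, 75.5, 200.0, 99.9, 310.0],
})

# Filtering, sorting, grouping, and simple calculations (sum, average, count).
east_only = sales[sales["region"] == "East"].sort_values("amount", ascending=False)
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])

print(east_only)
print(summary)
```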
By contrast, the second category of BI systems is data mining applications. Data mining applications are typically designed not to rely on real-time data, but rather to rely on archived historical data. And the reason for this is that data mining applications typically allow for sophisticated analyses of an organization's data. Because these analyses involve complex statistical and mathematical processing, they typically cannot be conducted in real time. The advantage, of course, of such statistical and mathematical techniques is that they can deliver insights and create predictive models that simply are not possible with the simple types of calculations that are available within reporting systems. This figure depicts the relationships that exist between operational and BI applications. Whereas our operational applications are used by functional users and rely upon the operational DBMS, business intelligence applications are used by managerial users and can rely upon either the operational DBMS or a specialized DBMS that is made just for the business intelligence applications. And by extension, these BI applications can hence rely directly upon the operational database by way of the operational DBMS, for example, in the case of reporting applications, while they also might rely upon archived historical data, or other data sources, which are typically made available in the form of a data warehouse or a data mart. As a quick review, just remember that simple BI reporting applications typically rely on an organization's operational data, and they provide the ability to look at data in a simple way in real time. By contrast, data mining applications typically rely on archived historical data, and as such, do not provide a real-time view of the organization. The trade-off for this time lag is that data mining applications can use sophisticated statistical and mathematical techniques to create models which allow managers to perform what-if analyses, make predictions about the future, and, generally speaking, improve their decision making. As I noted earlier, business intelligence applications that provide capabilities which go beyond basic reporting typically rely upon extracts of the organization's operational database, along with data acquired from other sources, all of which can be aggregated and stored in a data warehouse. Thus a data warehouse commonly contains data from many different sources. Not only does it contain data from the organization's operational databases, but it can contain other internal and external data as well. In the case of external data, an organization may be able to obtain such data from publicly available sources, or it may purchase data. Examples of these external data sets might include information about what competitors are doing, what the market is doing, or expectations about future global trends. Together, data from all of these various sources are run through an ETL system, where ETL stands for extract, transform, and load, so as to clean and prepare the data for inclusion in the data warehouse. After this process, the data can actually be added to the data warehouse itself. And then our more complex business intelligence applications will have a source of data upon which they can rely when performing their tasks. Despite our best efforts at designing relational databases that will ensure the quality and integrity of the data that they contain, it is, unfortunately, still possible for problematic data to appear in the database.
Further, because our business intelligence applications may rely upon these data in support of managerial decision making, it is critically important that the data be of the highest quality possible, such that managers will have the greatest chance possible of making good decisions. Here, we're simply referring to the classic and clichéd concept of garbage in, garbage out; that is, if we are providing our managers with low-quality or problematic data with which to support their decision making, then we must expect the resulting decisions to be similarly problematic. So let's examine some of the problems that can arise in operational databases. A very common problem is what we call dirty data. And dirty data refers to a data value which is obviously incorrect. As an example, we might have the letter V stored as a gender code, instead of the more common M or F, for male and female. Or we might have a value of age stored as 213, which would be a remarkably old human being if that number were correct. Other problems with operational data include missing values and inconsistent data, where inconsistent data refers to data values that have changed and are not the same across all of our data sources. So perhaps a customer's phone number was updated in the operational database, and the previous value in the data warehouse is, therefore, incorrect. Additional problems with operational data include non-integrated data, that is, when we have data from two or more sources that we need to merge together in some way so that they can be added into the data warehouse. We may also have data in an incorrect format, in which case it will need to be transformed into the format that is required by the data warehouse. And of course, we may simply have too much data. There is a general concept in science known as parsimony, and this concept tells us that simplicity is often preferred. For example, suppose I construct a predictive model that is 95% accurate and relies upon three predictor variables to achieve that accuracy, but I could improve the accuracy to 96% by adding 20 additional variables. In most cases, the additional complexity involved in achieving such a marginal gain in predictive power would not be worth it. So in that case, we would prefer the three-predictor model over the model which contains 23 predictors. As I mentioned earlier, because a data warehouse often contains data from multiple data sources, the input data commonly need to be run through an ETL process before they can be stored in the data warehouse. Again, ETL here stands for extract, transform, and load. The extract step is simply pulling data from these various data sources. The transform step is cleaning, modifying, or processing the data in such a way that they are made appropriate for inclusion in the data warehouse. And the load step refers to the process of taking the transformed, cleaned, processed data and actually storing it in the data warehouse so that they can be used by whichever business intelligence applications might rely upon that data warehouse. Simple examples of such transformations might be transforming a country code into a country name. So we may have the country code US, for example, and in the data warehouse, we need to transform that into the name of the country, which might be United States. Or we may have a customer's email address, such as dan@dan.com, and we actually just want to store the email domain in the data warehouse for purposes of aggregation. In that case, we would want to transform the customer's email address simply into the domain, which in this case would be dan.com, and store the result in the data warehouse.
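As a concrete, simplified sketch of that transform step, the snippet below cleans a hypothetical source record in the spirit of the examples just described: it flags an obviously dirty age value, maps a country code to a country name, and reduces an email address to its domain. The record layout and the lookup table are assumptions for illustration only, not part of the lecture.

```python
# A minimal ETL-style transform over a hypothetical extracted record.
COUNTRY_NAMES = {"US": "United States", "CA": "Canada"}  # assumed lookup table

def transform(record: dict) -> dict:
    """Clean and reshape one source record before loading it into the warehouse."""
    cleaned = {}

    # Dirty-data check: an age of 213 is obviously incorrect, so reject it.
    age = record.get("age")
    cleaned["age"] = age if age is not None and 0 <= age <= 120 else None

    # Transform a country code (e.g., 'US') into a country name.
    cleaned["country"] = COUNTRY_NAMES.get(record.get("country_code"), "Unknown")

    # Keep only the email domain (e.g., 'dan@dan.com' -> 'dan.com').
    email = record.get("email", "")
    cleaned["email_domain"] = email.split("@")[-1] if "@" in email else None

    return cleaned

print(transform({"age": 213, "country_code": "US", "email": "dan@dan.com"}))
# {'age': None, 'country': 'United States', 'email_domain': 'dan.com'}
```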
Next, I would like to talk about the concept of a data mart. And I think the best way to understand a data mart is simply that it is a subset of the organization's data warehouse. Data marts are constructed to support a specific need within the organization. So this might be a subset of the data warehouse that is needed to support a particular project, or a particular functional area within the business, like advertising or marketing, for example. Or perhaps we need to create a data mart to support a specific group of employees within our organization, like a sales team. Regardless of the specific reason why we create a data mart, the general principle underlying their creation is simply that not all personnel, or not all managers, within the organization will need access to all of the organization's archived historical data. Personnel within the organization who perform new product development, for example, probably will not need access to data associated with human resources, such as employee salaries or employee benefits. Instead we might create a data mart just for the new product development team, which contains only those data that directly support their needs. At some point in your adventures in the database world, you may have heard the term dimensional database, or multi-dimensional database. And I'd like to take a few moments to talk about some of the concepts associated with these types of databases. To begin, it's important to note that dimensional databases are designed and implemented using exactly the same sort of database technology that we use to create our operational databases. That is, dimensional databases contain tables. Those tables are related to each other using primary key-foreign key links. We have the same concepts of cardinality, such as one-to-one relationships, one-to-many relationships, et cetera. So hopefully, operating within the familiar framework of the relational database world, understanding these dimensional databases will be reasonably easy. Broadly speaking, the idea with a multi-dimensional database is that we want to implement a non-normalized database structure for the purpose of vastly improving query speed. In an operational database, we typically implement a database design that is largely normalized; it might be in third normal form, or Boyce-Codd normal form, with perhaps a few tables being denormalized for the purpose of improving efficiency. And what a normalized database allows us to do is to store large amounts of data very quickly, while still preserving the quality and the integrity of the data in the database. The problem with this sort of rigid, normalized design, however, is that if we want to extract data from the database, we commonly need to perform computationally expensive join operations in order to get the information that we want. So a normalized relational database is very good for quickly storing information, but is very bad for quickly extracting information that we want. By contrast, with the sort of design that is implemented in a multi-dimensional database, storing the data in the database can be a very slow and laborious process. However, extracting data from the database is very fast. And the reasons for this are that we implement a non-normalized database structure, while simultaneously storing data at pre-aggregated levels of granularity within the dimensional database. An important point to note here is that these dimensional databases are used to track an organization's historical data, and therefore they almost always contain a date or time dimension. And it is this date or time dimension that provides us with the ability to store the same information aggregated at different levels of granularity within the multi-dimensional database. For example, imagine that we have operational data which represent all of the sales transactions for a store. So every time a sale is made, we may generate a sale record, which records information about that transaction. Now, this information is useful; however, it may not be as useful to a manager as it would be if it were aggregated up to a coarser level of granularity. Consider that if I were to take all of the sales data for one day and add them all together, then I have a daily total. Similarly, if I take the daily totals for seven consecutive days, then I have a weekly total. I can, in such a way, continue to create monthly totals, quarterly totals, yearly totals, et cetera. It is the same information that is available in the individual sales transactions, except that it has been pre-aggregated and stored in the database in that form, such that we can vastly improve query speed. That is, when a query is run where we want to look at the data in the form of weekly totals, or monthly totals, or quarterly totals, the database at that time does not need to aggregate all of the individual sales transactions in order to produce the result. The result already exists in the dimensional database because it has been pre-processed prior to being added into the database. So again, the purpose of these dimensional databases is to intentionally implement redundant data in a non-normalized design, such that we can vastly improve query speed. We want to be able to extract data very quickly from the database.
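To make the idea of pre-aggregated granularity concrete, here is a small sketch that rolls hypothetical individual sales transactions up into daily and monthly totals before they are loaded, so that queries at those levels never have to touch the raw transactions. The transaction amounts and dates are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical individual sales transactions (the finest level of granularity).
transactions = [
    (date(2013, 12, 1), 20.0),
    (date(2013, 12, 1), 5.0),
    (date(2013, 12, 2), 42.5),
    (date(2014, 1, 3), 10.0),
]

# Pre-aggregate to coarser levels of granularity at load time.
daily_totals = defaultdict(float)
monthly_totals = defaultdict(float)
for day, amount in transactions:
    daily_totals[day] += amount
    monthly_totals[(day.year, day.month)] += amount

# A query for "sales in December 2013" now reads a stored total directly
# instead of summing every individual transaction at query time.
print(monthly_totals[(2013, 12)])   # 67.5
```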
The most common data model for a dimensional database is known as a star schema. And the general characteristics of a star schema are that we have several dimension tables. In this case, we have a time dimension table, a customer dimension table, and a product dimension table. And at the intersection of all of those dimension tables, we have something called a fact table. Philosophically speaking, the reason that the intersection table at the center of the dimension tables is called a fact table is because a fact, in very real terms, is the intersection of information. For example, imagine that I'm interested in knowing how many of a particular product a specific customer purchased during the month of December. Well, the answer to that question is the intersection of the customer, the product in which I'm interested, and the specific time frame that I specify, in this case, the month of December. And the answer might be 14 units. So a specific customer purchased 14 units of a particular product during the month of December. The intersection of those three values is a fact. And it is for this reason that we label the table at the center of a star schema a fact table. To help you better conceptualize this concept, let's consider the intersection of two dimensions. And in this example, we're looking at the customer dimension contrasted with the product dimension in the form of a two-dimensional matrix.
The value contained in each cell within this matrix, then, is a fact. And it expresses to us the quantity of a particular product that was purchased by a particular customer. And you will notice, of course, if you recall from our last lecture, that this structure is very similar to a bitmap index. Extending this concept out into a third dimension, we can see here that we're representing a fact as the intersection of three different dimensions in a three-dimensional matrix. Along the horizontal axis I, again, have customers. Along the vertical axis I, again, have products. But now, along the z-axis, I have a time dimension. Therefore, the value contained in any of the cells in this three-dimensional matrix will tell me the quantity of a given product that was purchased by a particular customer during a particular date or time frame. Now unfortunately, human beings have a great deal of difficulty envisioning higher-order spaces beyond three dimensions. However, this concept scales very easily up to higher-dimensional spaces. So we might consider, for example, a fact to be the intersection of four dimensions, or five dimensions. And although it is not easy to depict such a situation, conceptually it is just a natural extension of the two-dimensional and three-dimensional examples we saw here. In any case, I hope you can now understand why databases designed in this way are called dimensional or multi-dimensional databases. Next, I'd like to talk briefly about OLAP and data mining technologies. If you recall from earlier in the lecture, we said that, generally, there are two broad categories of business intelligence applications. And they were reporting applications and data mining applications. Online analytical processing, or OLAP, then, is a technique that supports these reporting applications. That is, OLAP allows us to dynamically examine database data in real time, and apply simple transformations like sorting, filtering, grouping, et cetera. And it allows us to perform simple arithmetic functions, such as summing values together, finding an average, a count, the standard deviation, et cetera. And again, this is intended to be used in real time. By contrast, data mining techniques support the data mining category of business intelligence applications. And data mining, broadly, refers to a collection of mathematical and statistical methods that can be used to gain deep insights into an organization's data. Again, remember that the level of sophistication of these techniques generally requires that they not be executed in real time, so as to avoid interfering with the real-time operations of the organization. OLAP systems, then, when used in support of simple BI reporting needs, produce something called an OLAP report, which some people will refer to as an OLAP cube. And the general idea here is that our inputs are a set of dimensions, while our outputs are a set of measures. So recalling the two-dimensional and three-dimensional matrices that we saw just a few moments ago, a manager might select a series of dimensions. And the OLAP system might allow him or her to perform simple transformations or drill-down operations on the data which lie at the intersection of those dimensions so as to gain real-time insights into the organization. And here we see that these OLAP cubes can be constructed using our standard SQL SELECT queries. In this case, we're selecting a number of different dimensions, performing a join on four separate tables, and imposing some GROUP BY and ORDER BY requirements, as sketched below.
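Below is a minimal, self-contained sketch of such a query against a toy star schema, using Python's built-in sqlite3 module. The table and column names (SALES_FACT, CUSTOMER_DIM, PRODUCT_DIM, TIME_DIM, and so on) are invented for illustration and are not the exact tables shown in the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUSTOMER_DIM (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE PRODUCT_DIM  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE TIME_DIM     (time_id     INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE SALES_FACT   (customer_id INTEGER, product_id INTEGER,
                               time_id INTEGER, quantity INTEGER);

    INSERT INTO CUSTOMER_DIM VALUES (1, 'Dan'), (2, 'Daisy');
    INSERT INTO PRODUCT_DIM  VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO TIME_DIM     VALUES (1, 2013, 11), (2, 2013, 12);
    INSERT INTO SALES_FACT   VALUES (1, 1, 2, 14), (1, 2, 2, 3), (2, 1, 1, 7);
""")

# An OLAP-style report: total quantity (the measure) at the intersection
# of the customer, product, and time dimensions.
rows = conn.execute("""
    SELECT c.customer_name, p.product_name, t.year, t.month,
           SUM(f.quantity) AS total_quantity
    FROM SALES_FACT f
    JOIN CUSTOMER_DIM c ON c.customer_id = f.customer_id
    JOIN PRODUCT_DIM  p ON p.product_id  = f.product_id
    JOIN TIME_DIM     t ON t.time_id     = f.time_id
    GROUP BY c.customer_name, p.product_name, t.year, t.month
    ORDER BY c.customer_name, t.year, t.month
""").fetchall()

for row in rows:
    print(row)   # e.g., ('Dan', 'Widget', 2013, 12, 14)
```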
The result of such a query, in OLAP terminology, would then be a result set which represents a collection of measures that a manager could use to gain some insights into his or her organization. And of course, rather than constructing these SQL commands repeatedly, we might take advantage of the capability of relational databases to create views, so as to save the SQL statements used to produce common OLAP cubes in the database itself. Data mining, then, can be viewed as the convergence of many different disciplines. A skilled data miner needs not only a working knowledge of databases, but also statistical and mathematical knowledge, perhaps knowledge of artificial intelligence or machine learning algorithms, knowledge of data management technologies, and so forth. In the modern world, many people become highly specialized in one particular area. But the people who are most valuable to an organization often have expertise in two or more areas. And this is certainly the case with people who are experts at data mining. To conclude our overview of data mining, I just wanted to briefly describe some of the most common techniques that are used to perform data mining against an organization's data. Among these techniques is cluster analysis, in which the objective is to group sets of entities together according to their level of similarity along one or more dimensions. We also have decision tree analysis, in which we can process a large quantity of historical data and generate a decision tree, which tells us what to do under different circumstances in order to achieve some kind of desired result. We also have regression available to us as a very powerful data mining tool, the goal of which is to produce mathematical equations, or mathematical models, that not only describe the relationships between variables, but also provide us a basis for predicting future events based upon past observations. Data mining applications might also rely on sophisticated artificial intelligence or machine learning algorithms, such as neural networks or support vector machines. And recently, we've seen a rise in a technique known as market basket analysis, or affinity analysis, which allows us to look for patterns of co-occurrence, for example, determining which products are commonly purchased together. And the results of these affinity analyses can then be used as the foundation of a recommendation engine, which can suggest to you movies that you might like, books that you might like, et cetera.
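As one small, purely illustrative sketch of the affinity-analysis idea just mentioned, the snippet below counts which pairs of products co-occur most often across a handful of hypothetical shopping baskets. Real market basket analysis adds measures such as support, confidence, and lift, which are omitted here; the basket contents are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (one set of products per transaction).
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs", "coffee"},
    {"bread", "milk", "coffee"},
]

# Count how often each pair of products is purchased together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)   # ('bread', 'milk') appears together in 3 baskets
```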
Now I'd like to move into the final topic in our course on databases, and that is the rise of the big data paradigm. Scientists and researchers have recently noted an exponential increase in the quantity of data being produced by the human species. If my memory serves correctly, I believe the current rate of growth is that the amount of data doubles every 40 months. At present, the world is generating many exabytes of new data every single day. And if you're unfamiliar with the term exabyte, consider that one exabyte is slightly more than one million terabytes. So you may have a computing device at home that stores several terabytes of data. But consider that several million terabytes of new data are being generated by the human species every single day. And this situation creates a vast array of new and interesting problems for organizations. The term big data, then, refers to the rapidly expanding amount of data that is being stored and used by organizations. These data sets can be very large and very complex. And because of their size and complexity, they can be extremely difficult to process using traditional database technologies. And an important point to note is that much of what is considered big data is being generated by web 2.0 applications, and the emerging collection of web 3.0 applications. Traditional examples of web 2.0 applications might include social networking sites, video sharing sites, blogs, discussion forums, et cetera. So this rapidly accumulating quantity of data presents many challenges for organizations. Among these are simply capturing all of the data and storing it, and maintaining the data once we have it. This is also commonly referred to as curation: in the same way that the curator of a museum must maintain all of the ancient artifacts, so, too, must the curator of a big data set be able to maintain the quality and the integrity of the data in light of things like failing hardware and the need for the data to be used by many people from all over the world simultaneously. Additional challenges include things such as search. How does one search efficiently through such an enormous quantity of data? Then there is data transfer. Consider, for example, that if you have a 100-megabit network connection, you can transfer approximately one terabyte of uncompressed data per day. At this speed, it would take you more than a million days to transfer one exabyte of data. Further challenges include analyzing these massive data sets, visualizing the massive quantities of data, and so forth. In the past few years, a term has arisen in the area of big data that is used to describe the movement toward using non-relational databases in order to support these huge and highly distributed, highly replicated collections of data. And this term is NoSQL. Although the name, to many people, implies that SQL is not involved in these databases, that is, that they do not support SQL-like queries, this assumption is actually wrong. As it is used in contemporary database circles, NoSQL means not only SQL; that is, these very large databases, although they may not be based on relational algebra in the same way that a relational database is, nevertheless support querying through a SQL-like query language. Unlike the relational database world, where the relational model is fixed and predominates across all relational database vendors, in the NoSQL world there are many different architectures for non-relational databases that are currently being used. These include architectures which rely upon a key-value store, a wide columnar store, or a document store. There are also databases that rely upon graph theory, and so forth. Collectively, all of these different types of very large data stores are commonly referred to as structured storage. And they have a few attributes in common. First, they arguably employ simpler designs than relational databases. And second, they almost always have a looser consistency model than one will find in a relational database. Another way of saying that is that these structured storage databases do not provide ACID guarantees. If you remember, ACID is an acronym which stands for Atomicity, Consistency, Isolation, and Durability. And ACID guarantees are the hallmark of a normalized relational database.
We cannot expect to have that level of consistency in these massive, highly distributed, highly replicated, structured storage databases. When discussing the sort of data models that are actually used in these structured storage databases, I like to use the data model that is employed by the Apache Cassandra database as an example. And the reason for this is that it is one of the most popular structured storage database management systems, and it is currently the most popular wide columnar store that is available. Broadly speaking, the Apache Cassandra database can be classified as a hybrid key-value / wide-columnar database. So its architecture contains elements of both a key-value store and a wide columnar store. The Apache Cassandra database itself was originally created at Facebook by two of their software architects, after which it was transferred to the Apache Foundation, where it now resides as an entirely open-source and free database. Apache Cassandra has cross-platform support, the reason for this being that it was written in Java. So it can run on Linux-based machines, Windows-based machines, Unix, et cetera. And Cassandra supports a massively distributed database environment. That is, it allows us to subdivide our database among dozens, or hundreds, or even thousands of separate database servers, potentially spread out all over the world. The database is highly scalable and decentralized. By scalable here, I mean it's extremely easy to add an extra node, that is, an extra server, to the cluster, thereby expanding the size of the database. And by decentralized, what I mean here is that all of the nodes, that is, all of the database servers, that are involved in an Apache Cassandra database have the same role. And this provides the very desirable characteristic of there being no single point of failure. Another very valuable characteristic of the Apache Cassandra model is that it provides for automatic data replication. That is, the database itself can automatically make copies of data and store those copies in different locations throughout the cluster. This makes the database highly fault tolerant, such that if an individual node, that is, an individual database server, were to fail, the redundant data stores take over instantaneously. There's no downtime for the database at all. Further, Apache Cassandra supports the MapReduce process, which is a computational model for solving data processing problems in a highly distributed environment. And I'll talk more about the MapReduce process here in a few minutes. And to illustrate the legitimacy of the Apache Cassandra model, consider that it is currently used by CERN, and by organizations such as Constant Contact, Digg, Instagram, Netflix, Reddit, Walmart, Twitter, et cetera. Now let's talk about the Cassandra data model itself. As you know, in a relational database management system, related data for an application are stored together in a container which is referred to as a database, or sometimes as a schema. And within that database or schema, we have one or more tables. The analogous concept in Cassandra is something called a key space. That is, data for an application are stored together in a container known as a key space. And inside of that key space, instead of tables, we have something known as column families. So just as in the relational database world, a single DBMS might contain many databases, each of which contains many tables.
In the world of Apache Cassandra, the Cassandra database might contain many key spaces, each of which contains many column families. The column families, then, contain columns. But this is not the same as a column in a relational database. In Cassandra, a column consists of a name, that is, the name of the column; a value, that is, the data value for that column; and a timestamp, where the timestamp indicates the point in time at which the data value was changed. Related columns, then, are all stored within the same row. And each row is identified using a unique row key. This notion of a unique row key is directly analogous to the idea of a primary key in the relational database world. Rows in the Cassandra model, however, are not required to contain the same set or number of columns. That is, different rows within the same column family might have a different number of columns. And the number of columns in a particular row is allowed to expand or contract on an as-needed basis. A few additional important differences to note between the Cassandra data model and the relational data model are that in the Cassandra data model there are no formal foreign key relationships between column families. That is, we cannot establish formal relationships between column families within the same key space. And what's more, it is not possible to join column families together using a query. So whereas in the relational database model we can write queries that will join tables together, it is not possible in the Cassandra model to join column families together. Now, I know that this sort of verbal description of the Cassandra data model can be a bit challenging to follow. So let's look at a picture which I hope will help to make some of these concepts clearer. Here we see a graphical representation of the Cassandra data model. The outermost container represents all of the key spaces for the Cassandra database. And in this case, we have just two key spaces, one of which is labeled as the blog key space, and the other of which is labeled as the store key space, where the details of the store key space are not elaborated in this diagram. Again, the idea here is that a key space is roughly analogous to a database within the relational database world. This means that a key space, then, is typically oriented toward a particular need or a particular application. Within each key space, we can have one or more column families. In this case, we have a column family for users, and we have a column family for blog entries. Next, let's look at an example of a column family. And to begin, I will refer to the user column family. So here, we can see that we have three rows within the column family. And each row represents a unique user within the blog key space. A user, then, is represented as a collection of one or more columns. And remember that, in the Cassandra data model, the number of columns per row can vary from row to row. So in this first row, that is, the Dan 42 row, we see that we have three columns. The first column is the name column. Its value is Dan. And we have a timestamp. The second column is the email column. Its value is dan@dan.com. Again, it has a timestamp. And the third column is the phone column, which has a value and a timestamp. For the next user, we have only the name column. And for the third user, we have only the name and email columns. So there is no requirement that each row contain the same number, or even the same type, of columns.
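As a rough mental model only (this is not actual Cassandra code or CQL), the nested dictionaries below mimic the structure just described: a key space containing a column family, rows identified by row keys, and each row holding its own set of columns, where every column carries a name, a value, and a timestamp. The additional row keys, the phone number, and the timestamps are invented for illustration.

```python
import time

now = int(time.time())

# A toy, in-memory imitation of the Cassandra data model described above:
# key space -> column family -> row key -> column name -> (value, timestamp).
blog_keyspace = {
    "users": {                                    # a column family
        "dan42": {                                # row key; this row has three columns
            "name":  ("Dan", now),
            "email": ("dan@dan.com", now),
            "phone": ("555-0100", now),
        },
        "user2": {"name": ("Ada", now)},          # only one column in this row
        "user3": {"name": ("Grace", now),         # two columns in this row
                  "email": ("grace@example.com", now)},
    },
}

# Rows need not share the same number or type of columns.
for row_key, columns in blog_keyspace["users"].items():
    print(row_key, "->", list(columns))
```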
Next, let's look at the blog entry column family within this blog key space. Again, we see that each row within the column family contains multiple columns. In this case, both rows contain the same columns. But again, that is not a requirement. Here, the columns are the text column, the category column, and the user column. Note, particularly, that the values stored in the user column can be used to determine which blog entries were written by which users. However, remember that formal relationships between column families do not exist in Cassandra. That is, we do not formally establish primary key-foreign key relationships. So I hope that looking at this diagram of the Apache Cassandra data model demystifies things a little bit for you. I know that learning about these structured storage data models for the first time can be intimidating, but I hope that through this diagram, you can see that it's really not that complicated. And I hope that is encouraging for you. As I mentioned earlier, these structured storage databases are often highly distributed and highly replicated. That is, they may be spread across many, many different nodes or database servers. Now this structure has substantial advantages. Not only does it provide fault tolerance, but it allows for data requests to be handled by the nearest available node that is able to service the request. So for example, if you are in Tokyo, and it happens that a copy of the data in which you are interested is stored on one of my database nodes which is located near Tokyo, it's much more efficient for that node to handle your request than it would be to route the request to a distant geographical node, say, one which might be located in Berlin. The problem with this model, however, is that it can cause problems with consistency. Consider what happens when a data item is updated. If I update a data item on one node, it will take time for that update to cascade to the other nodes within the cluster that contain a copy of the data. So imagine that my distributed structured storage database contains 1,000 nodes, spread all over the world, and the data item I'm interested in updating is replicated across 100 of those nodes. I may then perform the update on one of the nodes. And until that update cascades throughout all of the other nodes in the cluster, any requests for that same data item that are made of those other nodes will return values that are out of date. And again, this is due to the fact that the database is widely distributed and widely replicated, and because we typically do not enforce an ACID level of consistency. Thus, in these replicated data environments, we commonly use a consistency model that is referred to as eventual consistency. And what eventual consistency means is that if no new updates are made to a specific data item for a period of time, eventually all of the requests for that data item will return the most up-to-date value, regardless of which node is servicing the request. And the timestamps that are recorded during each item update are the key which allows us to reconcile any inconsistencies in replicated data values between nodes.
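The snippet below is a deliberately simplified sketch of that timestamp-based reconciliation, often called "last write wins": given the copies of a data item held by several replicas, the copy with the newest timestamp is taken as the current value. Real systems layer far more machinery on top of this, and the replica names, values, and timestamps here are invented, so treat it only as an illustration of the idea.

```python
# Copies of the same data item, stored as (value, timestamp), held by different replicas.
replicas = {
    "tokyo":  ("555-0100", 1680000000),
    "berlin": ("555-0199", 1680000200),   # the most recent update
    "dallas": ("555-0100", 1680000000),
}

def reconcile(copies: dict) -> str:
    """Last-write-wins: the copy with the newest timestamp is treated as authoritative."""
    value, _timestamp = max(copies.values(), key=lambda vt: vt[1])
    return value

print(reconcile(replicas))   # '555-0199'
```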
Finally, I would just like to take a few moments to discuss the MapReduce process. Broadly speaking, MapReduce is a programming model that relies on parallelization in order to perform data processing tasks on these huge data sets, which may be distributed across many different servers, or many different nodes. So conceptually speaking, then, the MapReduce process involves two different types of nodes: a master node and worker nodes. Put simply, the master node is usually the node which receives the data processing request from the user, while the worker nodes are the nodes which are assigned parts of the processing task by the master node. This MapReduce process, then, unfolds in two steps. The first step is called the map step. And in the map step, the master node will take the data processing problem and subdivide it into a series of subproblems. And each of these subproblems is then assigned to, and carried out by, a worker node. The second step in the MapReduce process, then, is the reduce step. After the worker nodes have completed their assigned tasks, they pass the results of their work back to the master node. The master node will then do the final processing, or final combining, of those results in order to produce the overall answer to the problem, which is then returned back to the user. Again, I know that concepts such as this can be difficult to understand in verbal terms, so let's see if we can get a better idea using an image. So toward the top of this figure we have the master node. And toward the bottom, we have various worker nodes, which here are labeled one, two, three, and n, up to however many worker nodes we need to solve the problem. So the MapReduce process unfolds as follows. As input, the data processing problem is passed into the master node. The master node will then divide that data processing problem into subproblems, which are then assigned to and carried out by the various worker nodes. After completing their tasks, the worker nodes will return their results back to the master node, which performs the final combining and processing of the worker nodes' results in order to produce the final answer, which is then the output of the MapReduce process. Let's consider an example. Imagine that we are a wireless service provider, and that we use a highly distributed structured storage database, which has 1,000 different servers all over the world. Let's further assume that our 100 million customers are equally subdivided among our 1,000 servers. That means we would have data for 100,000 customers per node within our database environment. Now let's imagine that our data processing problem is that we want to figure out the average number of text messages sent during the month of November, and we want those results organized by age. So we would like to know what is the average number of text messages sent by 18-year-olds, and 19-year-olds, and 20-year-olds, and 21-year-olds, and so forth, all the way up to our oldest customers. Now let's see how the MapReduce process can be used to solve this data problem. First, the problem is passed to the master node. And the master node might subdivide the problem such that it instructs each of the 1,000 worker nodes within our database environment to count the total number of text messages sent by each customer during the month of November, and aggregate those results by age. The result of each worker node's task, then, would be a table of data, which might contain three columns. The first would be all of the distinct ages of the customers whose data reside on that particular worker node. The second column might be the number of customers who are that age. So we might have 1,000 18-year-olds, 714 19-year-olds, 235 20-year-olds, et cetera. And the third column would be the total number of text messages sent by customers of each particular age. So perhaps 18-year-olds sent 10 million text messages, 19-year-olds sent 9,800,000 text messages, and so forth. So each worker node performs this task for all of the customers whose data are stored on that node. And those results, then, are returned back to the master node. The master node will then combine the results. So it will calculate, for example, the total number of 18-year-olds and the total number of text messages sent by 18-year-olds, after which it can divide those two numbers in order to produce the average number of text messages sent by 18-year-olds. That process is simply repeated for customers of every age. And we then have the results, which we can send back to the user who requested them. So I hope that you can appreciate that this MapReduce process is a very clever way of efficiently handling data processing problems in distributed database environments by taking advantage of parallelization in order to solve the problem.
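Here is a compact, purely illustrative simulation of that map and reduce flow in plain Python: each "worker" maps its own slice of (age, message count) customer records into per-age subtotals, and the "master" reduces those partial results into the final averages by age. The sample records are made up, and a real deployment would, of course, run the map tasks in parallel on separate nodes rather than in a loop on one machine.

```python
from collections import defaultdict

# Each worker node holds its own slice of customer records: (age, texts_sent_in_november).
node_data = [
    [(18, 95), (18, 110), (19, 80)],        # worker node 1
    [(18, 120), (20, 60), (19, 75)],        # worker node 2
]

def map_step(records):
    """Worker task: per-age customer counts and message totals for this node's data."""
    partial = defaultdict(lambda: [0, 0])   # age -> [customer_count, message_total]
    for age, texts in records:
        partial[age][0] += 1
        partial[age][1] += texts
    return partial

def reduce_step(partials):
    """Master task: combine the workers' partial results into averages by age."""
    combined = defaultdict(lambda: [0, 0])
    for partial in partials:
        for age, (count, total) in partial.items():
            combined[age][0] += count
            combined[age][1] += total
    return {age: total / count for age, (count, total) in combined.items()}

partial_results = [map_step(records) for records in node_data]   # map
print(reduce_step(partial_results))                              # reduce
# e.g., {18: 108.33..., 19: 77.5, 20: 60.0}
```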
Well, my friends, thus ends our overview of big data, data warehousing, and business intelligence applications. And more broadly speaking, thus ends our series of lectures on database technologies. It is my sincere hope that you have found these lectures useful, and most importantly, that you now have the self-confidence to go out and start creating and working with these databases. It has been a great pleasure to lead you on this journey, and I hope that you have a great day.
Info
Channel: Dr. Daniel Soper
Views: 102,253
Rating: 4.9028902 out of 5
Keywords: Data Warehouse, Database (File Format Genre), Business Intelligence (Industry), Big Data, Data Mining (Technology Class), NoSQL (Software Genre), Structured Storage, MapReduce (Software), Apache Cassandra (Software), databases
Id: 4WX2MZvmzho
Length: 63min 13sec (3793 seconds)
Published: Wed Aug 07 2013