Good day, everyone. This is Dr. Soper here. And today we will be exploring
the eighth and final topic in our series of
lectures on databases, with today's topic focusing
on big data, data warehouses, and business
intelligence systems. Although the objectives
for today's lecture are varied and manifold,
broadly speaking, our goals are to extend our
knowledge of database systems beyond relational
databases by exploring some of the additional ways in
which database technologies can be used to achieve
organizational objectives. Along the way, we will explore
the basic concepts of data warehouses and data marts. And we'll learn about the
basic concepts and architecture associated with dimensional or
multi-dimensional databases. We will also learn about
business intelligence systems, online analytical processing,
and data mining, all of which are very useful for helping
managers to make decisions. And then toward the
end of today's lecture, we will explore some
of the fastest growing topics in the database
world, including big data, the NoSQL
movement, structured storage, and the MapReduce process. So without further
ado, let's begin. I thought a good
way of introducing some of the topics
in this lecture would be through an exploration
of human mathematical intuition, that is,
do we as human beings have a natural intuition for
finding truth in numbers? To examine this question,
consider the following example. Here we have two people, whom
I've named Bubba and Daisy. And both Bubba and Daisy
are interested in purchasing new vehicles. Currently, Bubba drives a
very inefficient vehicle. Let's say that he
drives an old truck. And he gets 10 miles per gallon. Whereas Daisy drives a
reasonably efficient car. And in her case,
her current vehicle gets 30 miles per gallon. Now let's say that
both Bubba and Daisy are interested in purchasing new
vehicles that will have better fuel economy than their
current vehicles with the goal of needing to purchase
less gasoline every year, thereby saving money. Bubba, however, does
not want to give up all of the benefits
of his big truck, so the vehicle he's looking
at provides just 12 miles per gallon, as opposed
to his current vehicle which provides 10. Daisy, on the
other hand, is very interested in fuel economy. And so she is planning to
purchase a vehicle which delivers 50 miles per
gallon, as opposed to her current
vehicle, which has 30. Another way of
considering these numbers is by looking at the percentage
increase in miles per gallon. In Bubba's case, he is
moving from a vehicle which has 10 miles a gallon to one
which has 12 miles per gallon, and that is an increase of
20% in the miles per gallon. Daisy, by contrast, is
moving from a vehicle with 30 miles per
gallon to one which provides 50 miles per gallon. And that is an
increase of 66.7%. Next, for the sake of offering
a fair comparison between Bubba and Daisy, let's say
that both drivers drive 10,000 miles per year. Now at this point, I
want to appeal to your mathematical intuition. Based upon the situation
that I've described, I would ask you a
simple question, which of these drivers,
Bubba or Daisy, is going to save more money
on gasoline every year? Well, if you are
like most people, you will look at these
numbers and you'll say that clearly Daisy's vehicle
has a much greater increase in miles per gallon,
therefore, she will be saving the most
money every year on gasoline. Unfortunately, this
is the wrong answer. In fact, Bubba will
be saving more money every year on gasoline. And let's see why. Here we have a
table which shows us how much gasoline
both Bubba and Daisy are currently consuming
based upon their driving habits versus how much they
will consume if they purchase their new vehicles. In Bubba's case, he drives
10,000 miles per year. And in his current vehicle,
he gets 10 miles per gallon. Therefore he will consume 1,000
gallons of gasoline per year. In his new vehicle,
Bubba will still drive 10,000 miles
per year, but now will get 12 miles per
gallon, and therefore will consume 833.33 gallons
of gasoline per year. By contrast, in Daisy's
current vehicle, she gets 30 miles per gallon. And driving 10,000
miles per year, she consumes 333.33
gallons of gas. Whereas with her new vehicle,
which gets 50 miles per gallon, she will consume 200 gallons
of gasoline every year. Simple subtraction,
then, shows that Bubba will save 166.67
gallons of fuel per year if he purchases his new vehicle. Whereas Daisy will only
save 133.33 gallons of fuel per year. That is, Bubba will enjoy
more fuel savings per year by buying a new truck, which
has 12 miles per gallon, versus his old truck, which
had 10 miles per gallon, than will Daisy by
buying a new car which has 50 miles per gallon,
versus her old car which provided 30 miles per gallon.
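If you would like to verify this arithmetic for yourself, here is a small Python sketch of the calculation, using the numbers from our example:

    # A quick check of the miles-per-gallon illusion, assuming both
    # drivers cover 10,000 miles per year.
    miles_per_year = 10_000

    def gallons_used(mpg):
        """Gallons of fuel consumed per year at a given fuel economy."""
        return miles_per_year / mpg

    bubba_savings = gallons_used(10) - gallons_used(12)   # 1000.00 - 833.33
    daisy_savings = gallons_used(30) - gallons_used(50)   # 333.33 - 200.00

    print(round(bubba_savings, 2))  # 166.67 gallons saved per year
    print(round(daisy_savings, 2))  # 133.33 gallons saved per year

So if you're like most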
people, in this problem your mathematical
intuition has failed. And it is for this
reason that I would argue that managers should
not rely on their intuition when making important
managerial decisions, but rather should rely on data. Many managers have
begun to recognize the deficiencies
in human intuition, and are hence
adopting technologies like business analytics,
business intelligence, data warehousing, and data mining,
so as to better support their decision-making
activities with the data that the organization
has at its disposal. By the way, if
you're interested, this problem is known as the
miles per gallon illusion, and stems from the fact
that we here in the United States commonly measure fuel
economy in miles per gallon, as opposed to the much more
appropriate measure which would be gallons per mile. Now that we've
seen for ourselves how human intuition
can fail, in even very simple mathematical
problems, we can begin discussing some of the
systems and technologies that have been developed to help
managers make better decisions and avoid the failures
that are commonly associated with relying
upon intuition alone. First among these technologies
are business intelligence, or BI, systems. BI systems are information
systems that are intentionally designed to allow
managers to capitalize upon organizational data
for the purpose of improving their decision making. It's important to note that
these business intelligence systems do not support
the real-time operational activities of the organization. Those activities, by contrast,
are supported by transaction processing systems. Instead, these BI
systems are designed to support managerial assessment
activities, analysis, planning, controlling, et cetera. Broadly speaking, we can
classify these BI systems into two categories. First, are simple reporting
systems, the purpose of which is to provide managers
with the ability to look at data in a
flexible, real-time way, that is, these reporting
systems support simple data organization capabilities,
such as sorting, filtering, and grouping data. And they often
provide the ability to make simple calculations
on data in real time. These simple calculations
might include operations such as a sum,
or an average, or a count.
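As a rough illustration of the kinds of operations such a reporting system performs, here is a short Python sketch that sorts, filters, and summarizes a few purely hypothetical sales records:

    # A tiny, hypothetical set of sales records a reporting system might display.
    sales = [
        {"region": "East", "amount": 120.00},
        {"region": "West", "amount": 75.50},
        {"region": "East", "amount": 43.25},
    ]

    # Simple data organization: sorting and filtering.
    by_amount = sorted(sales, key=lambda row: row["amount"], reverse=True)
    east_only = [row for row in sales if row["region"] == "East"]

    # Simple real-time calculations: a count, a sum, and an average.
    count = len(east_only)
    total = sum(row["amount"] for row in east_only)
    average = total / count if count else 0.0
    print(count, total, average)   # 2 163.25 81.625

By contrast, the second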
category of BI systems consists of data mining applications. Data mining applications
are typically designed not to rely
on that real time data, but rather to rely on
archived historical data. And the reason for this is
that data mining applications typically allow for
sophisticated analyses on an organization's data. Because these analyses
involve complex statistical and mathematical processing,
they typically cannot be conducted in real time. The advantage, of course,
of such statistical and mathematical
techniques, is that they can deliver insights and create
predictive models that simply are not possible with the
simple types of calculations that are available
within reporting systems. This figure depicts
the relationships that exist between operational
and BI applications. Whereas our operational
applications are used by functional users and
rely upon the operational DBMS, business intelligence
applications are used by managerial
users, and can rely upon the operational
DBMS, or a specialized DBMS that is made just for
the business intelligence applications. By extension,
these BI applications can rely directly upon
the operational database by way of the operational
DBMS, as in the case of
reporting applications, or they might instead rely
upon archived historical data, or other data sources,
which are typically made available in the form of a
data warehouse or a data mart. As a quick review, just remember
that simple BI reporting applications typically
rely on an organization's operational data. And they provide the
ability to look at data in a simple way in real time. By contrast, data mining
applications typically rely on archived historical
data, and as such, do not provide a real time
view of the organization. The trade-off for
this time lag is that data mining
applications can use sophisticated statistical
and mathematical techniques to create models which allow
managers to perform what-if analyses, make predictions
about the future, and generally speaking,
improve their decision making. As I noted earlier, business
intelligence applications that provide capabilities which
go beyond basic reporting, typically rely upon extracts
of the organization's operational database,
along with data acquired from other sources, all of which
can be aggregated and stored in a data warehouse. Thus a data warehouse
commonly contains data from many
different sources. Not only does it contain
data from the organization's operational
databases, but it can contain other internal
and external data as well. In the case of external
data, an organization may be able to obtain
such data from publicly available sources, or it may purchase data. Examples of these
external data sets might include information about
what competitors are doing, what the market is
doing, or expectations about future global trends. Together, data from all
of these various sources are run through an
ETL system, where ETL stands for extract,
transform, and load, so as to clean and prepare
the data for inclusion in the data warehouse. After this process,
the data can actually be added to the data
warehouse itself. And then our more complex
business intelligence applications will
have a source of data upon which they can rely
when performing their tasks. Despite our best efforts at
designing relational databases that will ensure the quality
and integrity of the data that they contain,
it is, unfortunately, still possible for problematic
data to appear in the database. Further, because our business
intelligence applications may rely upon these data in
support of managerial decision making, it is
critically important that the data be of the
highest quality possible, such that managers will
have the greatest chance possible of
making good decisions. Here, we're simply referring to
the classic and cliched concept of garbage in,
garbage out, that is, if we are providing our managers
with low quality or problematic data with which to support
their decision making, then we must expect
the resulting decisions to be similarly problematic. So let's examine some
of the problems that can arise in operational databases. A very common problem is
what we call dirty data. And dirty data refers
to a data value which is obviously incorrect. As an example, we might have
the letter v stored as a gender code, instead of the more common
m or f, for male and female. Or we might have a value
of age stored as 213, which would be a remarkably old
human being if that number was correct. Other problems with
operational data include missing values
and inconsistent data, where inconsistent data refer
to data values that have changed and are not the same across
all of our data sources. So perhaps a
customer's phone number was updated in the
operational database. And the previous value
in the data warehouse is, therefore, incorrect.
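To give you a feel for how such problems might be detected during data preparation, here is a simple Python sketch that flags dirty and missing values; the field names and validation rules here are just illustrative assumptions:

    # Illustrative data-quality checks for a hypothetical customer record.
    VALID_GENDER_CODES = {"m", "f"}

    def find_problems(record):
        """Return a list of obvious data-quality problems in one record."""
        problems = []
        if record.get("gender") not in VALID_GENDER_CODES:
            problems.append("dirty gender code: %r" % record.get("gender"))
        age = record.get("age")
        if age is None:
            problems.append("missing age")
        elif not 0 <= age <= 120:
            problems.append("implausible age: %r" % age)
        return problems

    # Flags both the gender code of 'v' and the age of 213.
    print(find_problems({"gender": "v", "age": 213}))

Additional problems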
with operational data include non-integrated
data, that is, when we have data from
two or more sources that we need to merge
together in some way so that they can be added
into the data warehouse. We may also have data
in an incorrect format, in which case, it will need to
be transformed into the format that is required by
the data warehouse. And of course, we may
simply have too much data. There is a general concept in
science known as parsimony. And this concept tells us that
simplicity is often preferred. For example, suppose I construct
a predictive model that is 95% accurate and relies upon
three predictor variables to achieve that accuracy,
but I could improve the accuracy to 96% by adding 20
additional variables. In most cases, the additional
complexity involved in achieving such a marginal
gain in predictive power would not be worth it. So in that case, we would
prefer the three predictor model over the model which
contains 23 predictors. As I mentioned earlier,
because a data warehouse often contains data from
multiple data sources, the input data commonly need to
be run through an ETL process before they can be stored
in the data warehouse. Again, ETL here stands for
extract, transform, and load. Where the extract step
is simply pulling data from these various data sources. The transform step is
cleaning, or modifying, or processing the
data in such a way that they are made
appropriate for inclusion in the data warehouse. And then the load step
refers to the process of taking the transformed,
cleaned, processed data and actually storing it
in the data warehouse so that they can be used by
whichever business intelligence applications might rely
upon that data warehouse. Simple examples of
such transformations might be transforming a country
code into a country name. So we may have the country
code US, for example. And in the data
warehouse, we need to transform that into the
name of the country, which might be United States. Or we may have a
customer's email address, such as dan@dan.com. And we actually just want
to store the email domain in the data warehouse for
purposes of aggregation. In that case, we would want to
transform the customer's email address simply into the
domain, which in this case would be dan.com, and store the
result in the data warehouse.
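As a small sketch of what the transform step might look like in code, consider the following Python function; the lookup table and field names here are hypothetical, and a real ETL tool would of course be far more thorough:

    # A hypothetical lookup table used during the transform step.
    COUNTRY_NAMES = {"US": "United States", "CA": "Canada"}

    def transform(record):
        """Transform one extracted record for loading into the warehouse."""
        return {
            # Replace the country code with the full country name.
            "country": COUNTRY_NAMES.get(record["country_code"], "Unknown"),
            # Keep only the domain portion of the email address.
            "email_domain": record["email"].split("@", 1)[1],
        }

    print(transform({"country_code": "US", "email": "dan@dan.com"}))
    # {'country': 'United States', 'email_domain': 'dan.com'}

Next, I would like to talk about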
the concept of a data mart. And I think the best
way to understand a data mart is simply that it is
a subset of the organization's data warehouse. Data marts are constructed
to support a specific need within the organization. So this might be a subset
of the data warehouse that is needed to support
a particular project, or a particular functional
area within the business, like advertising or
marketing, for example. Or perhaps, we need
to create a data mart to support a specific
group of employees within our organization,
like a sales team. Regardless of the
specific reason why we create a data mart,
the general principle underlying their
creation is simply that not all personnel,
or not all managers, within the organization
will need access to all of the organization's
archive historical data. Personnel within
the organization who perform new product
development, for example, probably will not need
access to data associated with human resources, such as
employee salaries or employee benefits. Instead we might
create a data mart just for the new product
development team, which contains only
those data that directly support their needs. At some point in your adventures
in the database world, you may have heard the
term dimensional database, or multi-dimensional database. And I'd like to take a few
moments to talk about some of the concepts associated
with these types of databases. To begin, it's important to note
that dimensional databases are designed and implemented
using exactly the same sort of database technology
that we use to create our operational databases. That is, dimensional
databases contain tables. They are related
to each other using primary key-foreign key links. We have the same
concepts of cardinality, such as one-to-one
relationships, one-to-many relationships,
et cetera. So hopefully, operating
within the familiar framework of the relational
database world, understanding these
dimensional databases will be reasonably easy. Broadly speaking, the idea with
a multi-dimensional database is that we want to implement
a non-normalized database structure for the purpose of
vastly improving query speed. That is, in an
operational database, we typically implement
a database design that is largely
normalized, that is, it might be in
third normal form, or Boyce-Codd normal form,
with perhaps a few tables being denormalized for the purpose
of improving efficiency. And what a normalized
database allows us to do is to store large
amounts of data very quickly, while
still preserving the quality and the integrity
of the data in the database. The problem with this sort
of rigid, normalized design, however, is that if we want to
extract data from the database, we commonly need to perform
computationally expensive join operations in order to get
the information that we want. So a normalized
relational database is very good for quickly
storing information, but is very bad for quickly
extracting information that we want. By contrast, with
the sort of design that is implemented in a
multi-dimensional database, storing the data
in the database can be a very slow and
laborious process. However, extracting data from
the database is very fast. And the reasons for this
are that we implement a non-normalized
database structure, while simultaneously
storing data in pre-aggregated
levels of granularity within the dimensional database. An important point
to note here is that these dimensional
databases are used to track an
organization's historical data, and therefore they almost
always contain a date or time dimension. And it is this date
or time dimension that provides us
with the ability to store the same
information aggregated at different levels
of granularity within the
multi-dimensional database. For example, imagine that we
have operational data which represents all of the sales
transactions for a store. So every time a
sale is made, we may generate a sale record,
which records information about that transaction. Now, this information
is useful, however, it may not be as
useful to a manager as it would be if it were
aggregated up to a coarser level of granularity. Consider that if I were to
take all of the sales data for one day and add
them all together, then I have a daily total. Similarly, if I take
the daily totals for seven consecutive days,
then I have a weekly total. I can, in such a way,
continue to create monthly totals, quarterly
totals, yearly totals, et cetera. It is the same
information that is available in the individual
sales transactions, except that it has been
pre-aggregated and stored in the database in that
form, such that we can vastly
improve query speed. That is, when a
query is run where we want to look at the data
in the form of weekly totals, or monthly totals, or quarterly
totals, the database at that time does not
need to aggregate all of the individual
sales transactions in order to produce the result. The result already exists
in the dimensional database because it has
been pre-processed prior to being added
into the database.
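Here is a brief Python sketch of that pre-aggregation idea, using a handful of made-up sales transactions; in a real dimensional database, totals like these would be computed and stored when the data are loaded:

    from collections import defaultdict
    from datetime import date

    # Hypothetical individual sales transactions: (date, amount).
    transactions = [
        (date(2023, 12, 1), 19.99),
        (date(2023, 12, 1), 5.00),
        (date(2023, 12, 2), 12.50),
    ]

    # Pre-aggregate the individual transactions into daily totals...
    daily_totals = defaultdict(float)
    for day, amount in transactions:
        daily_totals[day] += amount

    # ...and roll the daily totals up into monthly totals.
    monthly_totals = defaultdict(float)
    for day, total in daily_totals.items():
        monthly_totals[(day.year, day.month)] += total

    print(round(daily_totals[date(2023, 12, 1)], 2))   # total for December 1
    print(round(monthly_totals[(2023, 12)], 2))        # total for December 2023

So again, the purpose of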
these dimensional databases, then, is to
intentionally implement redundant data in a
non-normalized design, such that we can vastly
improve query speed. We want to be able
to extract data very quickly from the database. The most common data model
for a dimensional database is known as a star schema. And the general characteristics
of a star schema are that we have several
dimension tables. In this case, we have
a time dimension table, a customer dimension table,
and a product dimension table. And at the intersection of
all of those dimension tables, we have something
called a fact table. Philosophically
speaking, the reason that the intersection table
at the center of the dimension tables is called
a fact table, is because a fact in
very real terms is the intersection
of information. For example, imagine that
I'm interested in knowing how many of a particular product
a specific customer purchased during the month of December. Well, the answer
to that question is the intersection
of the customer, the product in which
I'm interested, and the specific time frame
that I specify, in this case, the month of December. And the answer
might be 14 units. So a specific customer
purchased 14 units of a particular product
during the month of December. The intersection of those
three values is a fact. And it is for this
reason that we label the table at the center
of a star schema, a fact table. To help you better
conceptualize this concept, let's consider the
intersection of two dimensions. And in this example,
we're looking at the customer
dimension contrasted with the product dimension
in the form of a two dimensional matrix. The value contained in each
cell within this matrix, then, is a fact. And it expresses
to us the quantity of a particular product
that was purchased by a particular customer. And you will notice,
of course, if you recall from our last lecture,
that this structure is very similar to a bitmap index. Extending this concept out
into a third dimension, we can see here that
we're representing a fact as the intersection of three
different dimensions in a three dimensional matrix. Along the horizontal axis
I, again, have customers. Along the vertical axis
I, again, have products. But now, along the z-axis,
I have a time dimension. Therefore, the value
contained in any of the cells in this
three dimensional matrix will tell me the quantity
of a given product that was purchased by a
particular customer during a particular
date or time frame.
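If it helps, you can think of this three-dimensional matrix as nothing more than a lookup keyed by the three dimensions. Here is a tiny Python sketch of that idea, with invented customer, product, and quantity values:

    # Facts stored at the intersection of three dimensions:
    # (customer, product, month) -> quantity purchased.
    facts = {
        ("Customer A", "Product X", "2023-12"): 14,
        ("Customer A", "Product Y", "2023-12"): 3,
        ("Customer B", "Product X", "2023-11"): 7,
    }

    # "How many units of Product X did Customer A buy in December?"
    # The answer is simply the fact stored at that intersection.
    print(facts[("Customer A", "Product X", "2023-12")])   # 14

Now unfortunately, human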
beings have a great deal of difficulty
envisioning higher order spaces beyond three dimensions. However, this concept
scales very easily up to higher dimensional spaces. So we might consider,
for example, a fact to be the intersection of four
dimensions, or five dimensions. And although it is not easy
to depict such a situation, conceptually, it is
just a natural extension of the two-dimensional and
three-dimensional examples we saw here. In either case, I
hope you can now understand why databases
designed in this way are called dimensional or
multi-dimensional databases. Next, I'd like to talk briefly
about OLAP and data mining technologies. If you recall from
earlier in the lecture, we said that,
generally, there are two broad categories of business
intelligence applications. And they were, reporting
applications, and data mining applications. Online analytical
processing, or OLAP, then, is a technique that supports
these reporting applications. That is, OLAP allows us to
dynamically examine database data in real time, and
apply simple transformations like sorting, filtering,
grouping, et cetera. And it allows us to perform
simple arithmetic functions, such as summing values together,
finding an average, a count, the standard
deviation, et cetera. And again, this is intended
to be used in real time. By contrast, data
mining techniques support the data mining category
of business intelligence applications. And data mining, broadly,
refers to a collection of mathematical and
statistical methods that can be used to gain deep
insights into an organization's data. Again, remember that the
level of sophistication of these techniques generally
requires that they not be executed in real
time, so as to avoid interfering with the real time
operations of the organization. OLAP systems, then, when used in
support of simple BI reporting needs, produce
something called an OLAP report, which some people
will refer to as an OLAP cube. And the general idea
here is that our inputs are a set of dimensions,
while our outputs are a set of measures. So recalling the two dimensional
and three dimensional matrices that we saw just
a few moments ago, a manager might select
a series of dimensions. And the OLAP system
might allow him or her to perform simple
transformations or drill down operations
on the data which lie at the intersection
of those dimensions so as to gain real time
insights into the organization.
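To make the idea of dimensions going in and measures coming out a little more concrete, here is a short Python sketch that rolls a few hypothetical fact rows up along whichever dimensions a manager happens to select:

    from collections import defaultdict

    # Hypothetical fact rows: (customer, product, month, quantity).
    fact_rows = [
        ("Customer A", "Product X", "2023-11", 5),
        ("Customer A", "Product X", "2023-12", 14),
        ("Customer B", "Product X", "2023-12", 7),
    ]

    def olap_measure(rows, dimensions):
        """Sum the quantity measure, grouped by the selected dimensions."""
        positions = {"customer": 0, "product": 1, "month": 2}
        cube = defaultdict(int)
        for row in rows:
            key = tuple(row[positions[d]] for d in dimensions)
            cube[key] += row[3]
        return dict(cube)

    # Drill down from product totals to product-by-month totals.
    print(olap_measure(fact_rows, ["product"]))
    print(olap_measure(fact_rows, ["product", "month"]))

And here we see that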
these OLAP cubes can be constructed using our
standard SQL SELECT queries. In this case, we're selecting a
number of different dimensions. We are then performing a
join on four separate tables, and are imposing some group
by and order by requirements. The result of this query
in OLAP terminology, then, would be a
result set which represents a
collection of measures that a manager could use
to gain some insights into his or her organization. And of course, rather than
constructing these SQL commands repeatedly, we
might take advantage of the capability of relational
databases to create views, so as to save the
SQL statements, which are used to produce common OLAP
cubes in the database itself. Data mining, then, can be
viewed as the convergence of many different disciplines. A skilled data miner needs
not only working knowledge of databases, but also needs
statistical and mathematical knowledge, perhaps knowledge
of artificial intelligence, or machine learning algorithms,
knowledge of data management technologies, and so forth. In the modern world, many
people become highly specialized in one particular area. But the people who are most
valuable to an organization often have expertise
in two or more areas. And this is certainly
the case with people who are experts at data mining. To conclude our
overview of data mining, I just wanted to
briefly describe some of the most
common techniques that are used to perform data mining
against an organization's data. Among these techniques
are cluster analysis, in which case, the objective
is to group sets of entities together according to
their level of similarity along one or more dimensions. We also have decision
tree analysis, in which we can process a large
quantity of historical data and generate a
decision tree, which tells us what to do under
different circumstances, in order to achieve some
kind of desired result. We also have regression
available to us as a very powerful
data mining tool, the goal of which is to
produce mathematical equations, or mathematical models, that not
only describe the relationships between variables,
but also provide us a basis for predicting
future events based upon past observations.
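As a very small taste of what regression looks like in code, here is a Python sketch that fits a simple straight line to a few invented observations and then uses the fitted line to make a prediction:

    # A few hypothetical past observations: (advertising spend, sales).
    observations = [(1.0, 12.0), (2.0, 15.0), (3.0, 21.0), (4.0, 24.0)]

    n = len(observations)
    mean_x = sum(x for x, _ in observations) / n
    mean_y = sum(y for _, y in observations) / n

    # Ordinary least-squares estimates for the line y = a + b * x.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in observations)
         / sum((x - mean_x) ** 2 for x, _ in observations))
    a = mean_y - b * mean_x

    print(a, b)          # intercept and slope of the fitted line
    print(a + b * 5.0)   # predicted sales at a spend of 5.0

Data mining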
applications might also rely on sophisticated artificial
intelligence, or machine learning algorithms, such as
neural networks or support vector machines. And recently, we've seen a
rise in a technique known as market basket analysis,
or affinity analysis, which allows us to look for
patterns of co-occurrence, for example, determining
which products are commonly purchased together. And the results of
these affinity analyses can then be used as the
foundation of a recommendation engine, which can suggest to
you movies that you might like, books that you might
like, et cetera. Now I'd like to move into
the final topic in our course on databases, and that is the
rise of the big data paradigm. Scientists and
researchers have recently noted an exponential increase in
the quantity of data being produced by the human species. If my memory serves
correctly, I believe the current rate of growth
is that the amount of data doubles every 40 months. At present, the world is
generating many exabytes of new data every single day. And if you're unfamiliar
with the term exabyte, consider that one exabyte
is slightly more than one million terabytes. So you may have a
computing device at home that saves several
terabytes of data. But consider that several
million terabytes of new data are being generated by the
human species every single day. And this situation
creates a vast array of new and interesting
problems for organizations. The term big data, then, refers
to the rapidly expanding amount of data that is being stored
and used by organizations. These data sets can be very
large and very complex. And because of their
size and complexity, they can be extremely
difficult to process using traditional
database technologies. And an important
point to note is that much of what is
considered big data is being generated by web 2.0
applications, and the emerging collection of web
3.0 applications. Traditional examples
of web 2.0 applications might include social networking
sites, video sharing sites, blogs, discussion
forums, et cetera. So this rapidly accumulating
quantity of data presents many challenges
for organizations. Among these are simply
capturing all of the data and storing it, maintaining
the data once we have it. This is also commonly
referred to as curation, in the same way that
the curator of a museum must maintain all of the
ancient artifacts, so, too, must the curator
of a big data set be able to maintain the quality
and the integrity of the data in light of things
like failing hardware, and the need for the data to
be used by many people from all over the world simultaneously. Additional challenges include
things such as search. How does one search
efficiently through such an enormous quantity of data? Data transfer is another challenge:
consider, for example, that if you have a 100
megabit network connection you can transfer
approximately one terabyte of uncompressed data per day. At this speed, it would take
you more than a million days to transfer one exabyte of data.
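Those figures are easy to sanity-check with a few lines of Python, using the binary units mentioned a moment ago, in which one exabyte is slightly more than one million terabytes:

    # 100 megabits per second, expressed in bytes per second.
    bytes_per_second = 100_000_000 / 8
    bytes_per_day = bytes_per_second * 60 * 60 * 24

    terabyte = 2 ** 40
    exabyte = 2 ** 60            # slightly more than one million terabytes

    print(bytes_per_day / terabyte)   # roughly 0.98 terabytes per day
    print(exabyte / bytes_per_day)    # roughly 1.07 million days

Further challenges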
include analyzing these massive data
sets, visualizing the massive quantities
of data, and so forth. In the past few
years, a term has arisen in the area
of big data that is used to describe the
movement toward using non-relational
databases in order to support these huge
and highly distributed, highly replicated
collections of data. And this term is called NoSQL. Although the name,
to many people, implies that SQL is not involved
in these databases, that is, they do not support
SQL-like queries, this assumption
is actually wrong. As it is used in contemporary
database circles, NoSQL means "not only SQL." That is,
although these very large databases may not be
based on relational algebra in the same way that a
relational database is, they nevertheless
support querying through a SQL-like
query language. Unlike the relational
database world where the relational model
is fixed and predominates across all relational database
vendors, in the NoSQL world, there are many
different architectures for non-relational databases
that are currently being used. These include
architectures which rely upon a key-value
store, a wide columnar store, or a document store. There are databases that
rely upon graph theory, and so forth. Collectively, all of
these different types of very large data
stores are commonly referred to as
structured storage. And they have a few
attributes in common. First, they arguably employ
simpler designs, as opposed to relational databases. And second, they almost always
have a looser consistency model than one will find in
a relational database. Another way of saying that
is these structured storage databases do not
provide ACID guarantees. If you remember,
ACID is an acronym which stands for Atomicity,
Consistency, Isolation, and Durability. And ACID guarantees
are the hallmark of a normalized
relational database. We cannot expect to have that
level of consistency in these massive, highly distributed,
highly replicated, structured storage databases. When discussing the
sort of data models that are actually used in these
structured storage databases, I like to use the
data model that is employed by the
Apache Cassandra database as an example. And the reason for
this is that it is one of the most
popular structured storage database management systems. And it is currently the most
popular wide columnar store that is available. Broadly speaking, the
Apache Cassandra database can be classified as a hybrid
key-value / wide-columnar database. So its architecture
contains elements of both a key-value store
and a wide columnar store. The Apache Cassandra database
itself was originally created at Facebook by two
of their software architects, after which it was transferred
to the Apache Foundation, where it now resides as an entirely
open-source and free database. Apache Cassandra has
cross-platform support. The reason for this being
that it was written in Java. So it can run on
Linux-based machines, Windows-based machines,
Unix, et cetera. And Cassandra supports a
massively distributed database environment. That is, it allows us to
subdivide our database among dozens, or
hundreds, or even thousands of separate
database servers, potentially spread out
all over the world. The database is highly
scalable and decentralized. By scalable here, I
mean it's extremely easy to add an extra node, that is,
an extra server to the cluster, thereby expanding the
size of the database. And by decentralized,
what I mean here is that all of the nodes,
that is all of the database servers, that are involved in
an Apache Cassandra database, have the same role. And this provides the very
desirable characteristic of there being no
single point of failure. Another very valuable
characteristic of the Apache Cassandra
model is that it provides for automatic data replication. That is, the database
itself can automatically make copies of data
and store those copies in different locations
throughout the cluster. This makes the database
highly fault tolerant, such that if an individual node,
that is an individual database server, were to fail, the
redundant data stores take over instantaneously. There's no downtime
with the database at all. Further, Apache Cassandra
supports the MapReduce process, which is a computational model
for solving data processing problems in a highly
distributed environment. And I'll talk more about
the MapReduce process here in a few minutes. And to illustrate the legitimacy
of the Apache Cassandra model, consider that it is currently
used by CERN, and by organizations such as Constant Contact, Digg,
Instagram, Netflix, Reddit, Walmart, Twitter, et cetera. Now let's talk about the
Cassandra data model itself. As you know, in a relational
database management system, related data for an
application are stored together in a container which is
referred to as a database, or sometimes as a schema. And within that
database or schema, we have one or more tables. The analogous
concept in Cassandra is something called a key space. That is, data for an
application are stored together in a container known
as a key space. And inside of that key
space, instead of tables, we have something known
as column families. So just as in the
relational database world, a single DBMS might contain
many databases, each of which contains many tables. In the world of
Apache Cassandra, the Cassandra
database might contain many key spaces, each of which
contains many column families. The column families,
then, contain columns. But this is not the
same as a column in a relational database. In Cassandra, a column
consists of a name, that is the name of the
column, a value, that is the data value
for that column, and the time stamp, where
the time stamp indicates the point in time at which
the data value was changed. Related columns, then, are all
stored within the same row. And each row is identified
using a unique row key. This notion of a
unique row key is directly analogous to
the idea of a primary key in the relational
database world. Rows in the Cassandra
model, however, are not required to contain the
same set or number of columns. That is, different rows
within the same column family might have a different
number of columns. And the number of columns
in a particular row is allowed to expand or
contract on an as-needed basis. A few additional
important differences to note between
the Cassandra data model and the
relational data model are that in the
Cassandra data model there are no formal foreign key
relationships between column families. That is, we cannot establish
formal relationships between column families within
the same key space. And what's more, it is
not possible to join column families
together using a query. So whereas in the
relational database model we can write queries that
will join tables together, it is not possible in
the Cassandra model to join column
families together. Now, I know that this sort
of a verbal description of the Cassandra data model can
be a bit challenging to follow. So let's look at a
picture which I hope will help to make some of
these concepts clearer. Here we see a graphical
representation of the Cassandra data model. The outermost
container represents all of the key spaces for
the Cassandra database. And in this case, we have just
two key spaces, one of which is labeled as the
blog key space, and the other which is labeled
as the store key space, where the details of the
store key space are not elaborated
in this diagram. Again, the idea here
is that a key space is roughly analogous
to a database within the relational
database world. This means that a key
space, then, is typically oriented toward
a particular need or a particular application. Within each key space, we
can have one or more column families. In this case, we have a
column family for users, and we have a column
family for blog entries. Next, let's look at an
example of a column family. And to begin, I will refer
to the user column family. So here, we can see that we have
three rows within the column family. And each row represents a unique
user within the blog key space. A user, then, is
represented as a collection of one or more columns. And remember that, in
the Cassandra data model, the number of columns per
row can vary from row to row. So in this first row,
that is the Dan 42 row, we see that we
have three columns. The first column
is the name column. Its value is Dan. And we have a timestamp. The second column
is the email column. Its value is dan@dan.com. Again, it has a timestamp. And the third column is
the phone column, which has a value and a timestamp. For the next user, we
have only the name column. And for the third user, we
have only the name and email columns. So there is no requirement
that each row contain the same number, or even
the same type of columns. Next, let's look at the
blog entry column family within this blog key space. Again, we see that each row
within the column family contains multiple columns. In this case, both rows
contain the same columns. But again, that is
not a requirement. Here, the columns are the text
column, the category column, and the user column. Note, particularly, that the
values stored in the user column can be used to determine
which blog entries were written by which users. However, remember that
formal relationships between column families
do not exist in Cassandra. That is, we do not
formally establish primary key-foreign key relationships.
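In Python terms, you might picture this blog key space as little more than a couple of nested dictionaries. Here is a rough sketch of that idea; the row keys, the phone number, the details of the second and third users, and the timestamps are all invented for illustration:

    # A rough, simplified picture of the blog key space.  Each column
    # family maps row keys to rows, and each row maps column names
    # to (value, timestamp) pairs.
    blog_keyspace = {
        "users": {                              # a column family
            "dan42": {                          # a row, identified by its row key
                "name":  ("Dan", 1700000001),
                "email": ("dan@dan.com", 1700000002),
                "phone": ("555-0100", 1700000003),
            },
            "user_b": {"name": ("Beth", 1700000004)},   # rows may have fewer columns
            "user_c": {"name": ("Carl", 1700000005),
                       "email": ("carl@example.com", 1700000006)},
        },
        "blog_entries": {                       # a second column family
            "entry_1": {"text": ("Hello, world!", 1700000007),
                        "category": ("General", 1700000008),
                        "user": ("dan42", 1700000009)},
        },
    }

    # There are no joins between column families; to find out who wrote
    # entry_1, we read its user column and look that row key up ourselves.
    author_key = blog_keyspace["blog_entries"]["entry_1"]["user"][0]
    print(blog_keyspace["users"][author_key]["name"][0])   # Dan

So I hope that looking at this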
diagram of the Apache Cassandra data model demystifies
things a little bit for you. I know that learning about
these structured storage data models for the first
time can be intimidating, but I hope that
through this diagram, you can see that it's
really not that complicated. And I hope that is
encouraging for you. As I mentioned earlier, these
structured storage databases are often highly distributed
and highly replicated. That is, they may be spread
across many, many different nodes or database servers. Now this structure has
substantial advantages. Not only does it
provide fault tolerance, but it allows for data
requests to be handled by the nearest
available node that is able to service the request. So for example, if
you are in Tokyo, and it happens that a copy
of the data in which you are interested is stored on
one of my database nodes, which is located
near Tokyo, it's much more efficient for that
node to handle your request than it would be to
route the request to a distant geographical
node, say, one which might be
located in Berlin. The problem with
this model, however, is that it can cause
problems with consistency. Consider what happens when
a data item is updated. So if I update a data
item on one node, it will take time
for that update to cascade to the other
nodes within the cluster that contain a copy of the data. So imagine that my distributed
structured storage database contains 1,000 nodes,
spread all over the world, and the data item I'm
interested in updating is replicated across
100 of those nodes. So I may then perform the
update on one of the nodes. And until that update
cascades throughout all of the other nodes
in the cluster, any requests for
that same data item that are made against
those other nodes will return values
that are out of date. And again, this
is due to the fact that the database is
widely distributed and widely replicated, and that we
typically do not enforce an ACID level of consistency. Thus, in these replicated
data environments, we commonly use a
consistency model that is referred to as
eventual consistency. And what eventual
consistency means is that if no new updates
are made to a specific data item for a period
of time, eventually all of the requests
for that data item will return the most
up to date value, regardless of which node
is servicing the request. And the time stamps that are
recorded during each item update are the key
which allows us to reconcile any inconsistencies
in replicated data values between nodes.
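One highly simplified way to picture that reconciliation is a last-write-wins rule, in which whichever copy carries the newest timestamp is treated as the current value. Here is a small Python sketch of that idea, with made-up replica values and timestamps:

    # Hypothetical copies of the same data item held by different nodes,
    # each recorded as a (value, timestamp) pair.
    replicas = {
        "node_tokyo":  ("555-0100", 1700000050),
        "node_berlin": ("555-0100", 1700000050),
        "node_denver": ("555-0199", 1700000090),   # the most recent update
    }

    def reconcile(copies):
        """Return the value with the newest timestamp (last write wins)."""
        value, _timestamp = max(copies.values(), key=lambda vt: vt[1])
        return value

    print(reconcile(replicas))   # 555-0199, the most recently written value

Finally, I would just like to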
take a few moments to discuss the MapReduce process. Broadly speaking, MapReduce
is a programming model that relies on
parallelization in order to perform data processing tasks
on these huge data sets that may be distributed across
many different servers, or many different nodes. So conceptually speaking,
then, the MapReduce process involves two different
types of nodes. There will be a master
node and a worker node. Put simply, the
master node is usually the node which receives
the data processing request from the user, while the worker
nodes are the nodes which are assigned to complete
part of the processing task by the master node. So this MapReduce process,
then, unfolds in two steps. The first step is
called the map step. And in the map step,
the master node will take the data
processing problem and subdivide it into a
series of sub problems. And each of these sub
problems is then assigned to, and carried out
by, a worker node. The second step in
the MapReduce process, then, is the reduce step. So after the worker nodes have
completed their assigned tasks, they pass the results of their
work back to the master node. The master node will then
do the final processing, or final combining,
of those results in order to produce the overall
answer to the problem, which is then returned back to the user. Again, I know that
concepts such as this can be difficult to
understand in verbal terms, so let's see if we can get a
better idea using an image. So toward the top of this
figure we have the master node. And toward the bottom, we
have various worker nodes, which here are labeled
one, two, three and n, up to however many worker nodes
we need to solve the problem. So the MapReduce process
unfolds as follows. As input, the data
processing problem is passed into the master node. The master node will then divide
that data processing problem into sub problems, which
are then assigned to and carried out by the
various worker nodes. After completing their
tasks, the worker nodes will return their results
back to the master node, which performs the final combining and
processing of the worker nodes' results, in order to produce
the final answer, which is then the output of the
MapReduce process. Let's consider an
example, imagine that we are a wireless
service provider, and that we use a highly
distributed, structured storage database, which has
1,000 different servers all over the world. Let's further assume
that our 100 million customers are equally subdivided
among our 1,000 servers. So that means we would have data
for 100,000 customers per node within our database environment. Now let's imagine that our
data processing problem is that we want to figure out
the average number of text messages sent during
the month of November. And we want those
results organized by age. So we would like to know what
is the average number of text messages sent by 18-year-olds,
and 19-year-olds, and 20-year-olds, and
21-year-olds, and so forth, all the way up
until our oldest customers. Now let's see how
the MapReduce process can be used to solve
this data problem. First, the problem is
passed to the master node. And the master node might
subdivide the problem such that it instructs each
of the 1,000 worker nodes within our
database environment to count the total number
of text messages sent by each customer during
the month of November, and aggregate those
results by age. The results of the
worker nodes' tasks, then, would be a
table of data, which might contain three columns. First would be all of the
distinct ages of the customers whose data resides on that
particular worker node. The second column
might be the number of customers who are that age. So we might have 1,000
18-year-olds, 714 19-year-olds, 235 20-year-olds, et cetera. And then, the total
number of text messages sent by customers of
each particular age. So perhaps, 18-year-olds sent
10 million text messages. 19-year-olds sent 9,800,000
text messages, and so forth. So each worker node
performs this task for all of the customers whose
data are stored on that node. And those results,
then, are returned back to the master node. The master node will
then combine the results. So it will calculate, for
example, the total number of 18-year-olds and the total
number of text messages sent by 18-year-olds. After which it can divide
those two numbers in order to produce the
average number of text messages sent for 18-year-olds. That process is simply repeated
for customers of every age. We then have the results,
which we can send back to the user who requested them.
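Here is a compact Python sketch of that overall flow, using just two imaginary worker nodes and a handful of made-up customers; in a real cluster, of course, the map step would run in parallel across all 1,000 nodes:

    from collections import defaultdict

    # Each worker node holds its own customers: (age, texts sent in November).
    node_1_customers = [(18, 120), (18, 80), (19, 300)]
    node_2_customers = [(18, 100), (19, 50), (20, 10)]

    def map_step(customers):
        """Worker node: count customers and total texts, grouped by age."""
        partial = defaultdict(lambda: [0, 0])
        for age, texts in customers:
            partial[age][0] += 1        # number of customers of this age
            partial[age][1] += texts    # total texts sent by this age
        return partial

    def reduce_step(partial_results):
        """Master node: combine the workers' results into averages by age."""
        combined = defaultdict(lambda: [0, 0])
        for partial in partial_results:
            for age, (count, total) in partial.items():
                combined[age][0] += count
                combined[age][1] += total
        return {age: total / count for age, (count, total) in combined.items()}

    partials = [map_step(node_1_customers), map_step(node_2_customers)]
    print(reduce_step(partials))   # {18: 100.0, 19: 175.0, 20: 10.0}

So I hope that you can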
appreciate that this MapReduce process is a very clever way
of efficiently handling data processing problems on
distributed database environments by taking
advantage of parallelization in order to solve the problem. Well, my friends,
thus ends our overview of big data, data warehousing,
and business intelligence applications. And more broadly speaking, thus
ends our series of lectures on database technologies. It is my sincere
hope that you have found these lectures useful,
and most importantly, that you now have
the self-confidence to go out and start creating and
working with these databases. It has been a great pleasure
to lead you on this journey, and I hope that you
have a great day.