Good day, everyone. This is Dr. Soper here. And today, we will be
exploring the fifth topic in our series of database
lectures, this one being entitled, Database Design. Although there are many
objectives associated with today's topic,
broadly speaking, our goals are to further
develop our skills such that we are able to transition
away from our conceptual data models to a point where we
feel comfortable implementing complex data structures
in a real-world database. To this end, we will review
the normalization process in this lecture. And we will also explore the
concept of denormalization and learn when denormalization
is an appropriate strategy in database design. We will also learn some
new information that will be useful in helping us
implement one-to-one, one-to-many, and many-to-many binary
relationships in a database. And toward the end
of the lecture, we will also gain
insights into how to implement these
one-to-one, one-to-many, and many-to-many relationships,
when those relationships are recursive. What's more, we
will further expand upon our knowledge of the
structured query language today by learning the
strategies and approaches that can be used in order to
work with these recursive relationships using SQL. To begin, I'd like to talk about
the process of transitioning from relations, to
entities, and ultimately into designs for tables that
we can actually implement in a real-world database. We begin with simple relations. And we learn the
parenthetical method of describing these
relations, an example of which is shown on your screen. So here, we have
an item relation. And we can see
that this relation contains five attributes. The first attribute,
itemId, is underlined, indicating that it
is a primary key. And then we have four
non-key attributes. We've already
learned how to take this initial, simple
conceptual model of a relation and translate that into an
entity consistent with the set of symbols that are used in
entity relationship modeling. Our item relation, then,
can be easily transformed into an entity. And we see an example
of such an entity here. Although modeling the entity
in this way can be very useful, we still need to make
additional decisions, and specify additional
details, before we can actually implement this entity as a
table in a real-world database. As an example of
these missing details, we might consider that we need
to specify data types for each of the attributes in the table. This information
will ultimately be required before we can
implement our entities as tables in the database. Transitioning from a data
model into a database, however, involves
several more steps than just identifying the
data types for our attributes. So, perhaps it would
be useful if we were to explore this process
in a slightly more formal way. To ensure that we are able to
successfully implement our data model, the first step that
we would like to complete is to create a table for each
of the entities appearing in the data model. As we go through the process
of implementing these tables, there are several
key tasks that we will need to complete
in order to ensure that the resulting
implementation of our database is satisfactory. One of the most
critical of these tasks is to specify a primary key. We must always remember to
specify a primary key for each of the tables in our database. After adding our attributes
or columns to the table, we need to specify several
properties for each column so that the database
engine will be able to treat the
column properly once the database is deployed. Among these properties
are, first, the data type: is the attribute an integer, a float, a varchar, a date-time, or some other data type? Second, the null status of the attribute: that is, are null values allowed? And third, default values, if
any, for the attribute. That is to say, if a specific
value for the attribute is not supplied,
is there a value that we would want the database
to fill in automatically for us? And, of course, we may also
impose certain constraints on the column so as to
preserve the integrity of the data contained therein. Examples of such
constraints might include a range constraint
in which the values appearing in a column must fall between
a minimum and maximum value. You may also specify a set
of acceptable values, such that each value
appearing in the column must exist within the
predefined set, and so forth. Finally, after completing
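To make these column properties concrete, here is a small sketch using SQLite as an example engine. The table and column names, constraints, and sample data are illustrative, not taken from the lecture slides; the sketch shows a data type, a null status, a default value, a range constraint, and a set-of-acceptable-values constraint in one table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE item (
        item_id     INTEGER PRIMARY KEY,            -- primary key
        item_name   VARCHAR(50) NOT NULL,           -- nulls not allowed
        unit_price  DECIMAL(10, 2) DEFAULT 0.00,    -- default value
        quantity    INTEGER CHECK (quantity >= 0),  -- range constraint
        category    VARCHAR(20)
            CHECK (category IN ('tool', 'part', 'supply'))  -- acceptable set
    )
""")
# unit_price is not supplied, so the database fills in the default for us.
conn.execute("INSERT INTO item (item_id, item_name, quantity, category) "
             "VALUES (1, 'Wrench', 5, 'tool')")
row = conn.execute("SELECT unit_price FROM item WHERE item_id = 1").fetchone()
```

If we attempted to insert a negative quantity, or a category outside the predefined set, the engine would reject the row with a constraint violation.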
all of these design tasks, we may want to
re-evaluate the structure of the table in light
of the normalization needs of the database. If, for example, our
performance needs require that we
denormalize the table, that action of
denormalization, which we will discuss and
describe later today, might require us to alter
the structure of the table somewhat. Before we discuss
denormalization, however, a brief
review of normalization and the normalization
process may be useful. If you recall from
an earlier lecture, we said that data tables which
are not properly normalized may be susceptible to problems
which we called modification anomalies. And there are three different
types of these anomalies. First are insertion anomalies, in which a problem arises when we attempt
to add new data into a table. Next, we have update
anomalies in which a problem arises when we attempt
to modify the data in a table. And finally, we have
deletion anomalies in which a problem arises when
we attempt to delete existing data from within a table. Tables which are
properly normalized will help us to avoid these
three types of problems. As a quick review,
let's see some examples of these three different types
of modification anomalies. Here, we see a list
of data in which we are recording information
about customers and the cars that they own. To begin, let's consider
a deletion anomaly. Let's say, for
example, that Leonard is no longer my customer. So I want to delete
Leonard from the database. By deleting the row of data
in the list for Leonard, however, I have
now lost all record that a type of car, known
as a 2010 Nissan Sentra, even exists. So by deleting Leonard, I
have inadvertently deleted other information from the list
that I may have found useful, or I may have
needed to preserve. This is a deletion anomaly. Next, let's consider
an insertion anomaly. Say that we want to add a
new customer to our table. Perhaps this customer's
name is Sheldon. But Sheldon doesn't drive. So we add information about
Sheldon into the table. But the result is that
we now have many missing values in our list. This is wasted space. And we might consider this
to be an insertion anomaly. Later on, perhaps, our
new customer, Sheldon, decides to learn to drive. And let's say that he purchases
a 2003 Volkswagen Golf. Now, a 2003 Volkswagen Golf
already exists in our list. So as we are adding
this information into the list for our
customer, Sheldon, assume that we make
a spelling mistake. Perhaps we misspell Volkswagen. Again, this is an
insertion anomaly that would not have occurred in
a properly-normalized database design. Finally, we can consider
an update anomaly. Let's say, for
example, that we want to change the car ID for
a 2003 Volkswagen Golf from 102 to 100. If I make that change
for my customer, Penny, I have now introduced
an update anomaly into the list because my customer, Sheldon,
has the same type of vehicle. And therefore, in
order to maintain the accuracy of the
data in the list, I would need to update the car
ID not only for Penny, but also for Sheldon. And this is an example
of an update anomaly because we've introduced a
problem into our table of data as the result of an
update operation. If you recall, the
approach that we take to avoid these
modification anomalies is to break tables apart into
smaller and smaller tables until they are no longer
susceptible to these anomalies. This is a process which
we can call normalization. In our previous example,
we had a large list of data that
contained information about two separate business
concepts, or business themes. That is, our list of data
contained information about customers and
cars in the same list. By breaking that list apart
into two separate tables, say, a customer table
and a car table, we now have a data design
that allows us to add, update, or delete data without
introducing any anomalies into the database. Next, let's briefly review some
of the definitions associated with the normalization process. First is a functional
dependency. Now, a functional
dependency refers to a relationship
among attributes within a relation, such that
the value of one attribute can be used to find or identify
the value of another attribute or set of attributes. That is, if the value
of one attribute can be determined by
the value of another, then that first
attribute is said to be functionally dependent
upon the second attribute. Further, if an
attribute can be used to find the values of other
attributes within a table, then that attribute is
known as a determinant. That is, the value
of the attribute can be used to determine the
values of other attributes within the table. Finally, recall the
concept of a candidate key. And a candidate key is simply
an attribute whose values can be used to
determine, or to find, the value of every other
non-key attribute within a row. And these candidate
keys can come in a few different flavors,
one being a simple candidate key which consists of the value
of just a single attribute. And the other being
a composite candidate key whose uniqueness comes by
combining the values of two or more attributes. Recall, also, that many
different types of normal forms have been defined. We began by exploring the
concept of a relation. And we said that every
relation is, by definition, in first normal form. That is, if a table meets
the definition of a relation, then that table is also
in First Normal Form. If we take our table
in first normal form, and we remove any
partial dependencies, then the table
meets the definition of second normal form. If we further remove any
transitive dependencies, then our table will
meet the definition of Third Normal Form. If we reach a point in
our normalization process where every determinant
within the table is also a candidate
key, then the table will meet the definition of
something called Boyce-Codd Normal Form. And as you can see, beyond
Boyce-Codd Normal Form, there is Fourth Normal Form,
Fifth Normal Form, Domain/Key Normal Form (DK/NF), and Sixth Normal Form. So many additional normal
forms have been identified. For our purposes, a relation
is considered normalized when every determinant
is a candidate key. And again, technically,
this is known as Boyce-Codd Normal Form. Boyce-Codd Normal Form is a
slightly more stringent version of Third Normal Form,
and it is sometimes called 3.5 Normal
Form for that reason. A database design which
has been normalized such that it is in Third
Normal Form or above is generally not
susceptible to the types of modification anomalies
that we described earlier. And, at this level
of normalization, a database design is also
almost always sufficient to handle any
business data problem. Next, let's consider a
normalization example. Assume, for this
example, that we want to create a
database table which will allow us to keep track
of information relating to our customers,
and also information relating to each of those
customers' accountants. So we create a customer table
that contains a customer ID. That will be our primary key. And we also track the customer's
name, address, city, state, zip code, and the accountant
ID for the customer, along with that accountant's
name and telephone number. In its current form, this
customer table is not normalized, because not all of the attributes in the customer table can be determined by the customer ID. Specifically,
consider that if we know a zip code, or a postal
code, then, by definition, we also know the city and state. That is to say, if
I know a zip code, I can determine
the city and state to which that zip code applies. Furthermore, in
this table, we also see that, if we know
the accountant ID, we can use that
information to determine the name of the accountant and
the accountant's phone number. To normalize this
database design, then, we will need to
break the table apart into a series of
smaller tables until we reach a point where every
non-key attribute is fully functionally determined
by the primary key. Our normalized solution, then,
contains three separate tables, a customer table, which
contains the customer ID as a primary key, the name
and address of the customer, and the zip code and
accountant IDs associated with the customer as foreign
key links back to their parent tables. We then have a zip code table
where zip is the primary key. And we have city and state
as attributes of the zip code table. And then we have an accountant
table where accountant ID serves as the primary key. And we have two non-key
attributes, accountant name and accountant phone. We have thus taken our
original customer table and broken it into three
smaller tables which, when considered
together, provide us with a normalized data solution. Now that we've reviewed
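As a sketch of this normalized three-table solution, here is one possible SQLite implementation. The exact column names and the sample data are illustrative assumptions; the point is that zip and accountant ID serve as foreign key links in the customer table, and a join reassembles the full picture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zipcode (
        zip   TEXT PRIMARY KEY,
        city  TEXT,
        state TEXT
    );
    CREATE TABLE accountant (
        accountant_id    INTEGER PRIMARY KEY,
        accountant_name  TEXT,
        accountant_phone TEXT
    );
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        name          TEXT,
        address       TEXT,
        zip           TEXT REFERENCES zipcode (zip),               -- FK
        accountant_id INTEGER REFERENCES accountant (accountant_id) -- FK
    );
    INSERT INTO zipcode VALUES ('90001', 'Los Angeles', 'CA');
    INSERT INTO accountant VALUES (1, 'Ann', '555-0100');
    INSERT INTO customer VALUES (10, 'Leonard', '12 Main St', '90001', 1);
""")
row = conn.execute("""
    SELECT c.name, z.city, z.state, a.accountant_name
    FROM customer c
    JOIN zipcode z ON c.zip = z.zip
    JOIN accountant a ON c.accountant_id = a.accountant_id
""").fetchone()
```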
the normalization process, next I would like to
talk about a new topic. And that is the concept
known as denormalization. As the name suggests,
denormalization is the opposite of the
normalization process. That is, in the
denormalization process, we have identified a
business need, or a business justification, for why two or
more tables within our database design should be combined
together into a single table. Common reasons for wanting to
denormalize a set of tables include simplicity of design
and speed of querying. The trade-off, however, is that,
by using a denormalized design, we are introducing the
possibility of modification anomalies into our database. Consider, for example, the
set of tables shown here. In this example, we have
combined the zip code table and the customer table from
our normalized solution into a single table, which here,
we are just calling Customer. Specifically, what
we have done is we have taken the city
and state attributes that used to exist in
the zip code table, and we have added them into the
customer table, along with zip code, such that the entire
customer address can now be read simply by examining
the rows of the customer table. From a performance
perspective, this may be desirable
because it allows us to extract a customer's
address from the table without having to perform a
join operation with the zip code table. In the relational
database world, join operations are
computationally and temporally expensive processes. It takes the database
engine a substantial amount of computational cycles
and time in order to join two tables together. And that extra time,
that extra effort, slows down the
query performance. By denormalizing the design,
and combining the zip code table and the customer
tables together, we can expect an increase
in query performance if our objective is to
extract customer addresses. What's more, we have
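A sketch of the denormalized design, again in SQLite with illustrative names and data: city and state now live directly in the customer table, so a customer's full address can be read with a single-table query and no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        address     TEXT,
        city        TEXT,   -- copied in from the former zip code table
        state       TEXT,   -- copied in from the former zip code table
        zip         TEXT
    )
""")
conn.execute("INSERT INTO customer VALUES "
             "(10, 'Penny', '4A Apartment Ln', 'Los Angeles', 'CA', '90001')")
# The entire address is available without any join operation.
row = conn.execute(
    "SELECT name, address, city, state, zip FROM customer").fetchone()
```

The trade-off described above still applies: nothing in this design stops two rows for the same zip code from carrying differently spelled city or state values.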
the additional benefit of simplicity. Our database design now contains
two tables instead of three, which makes it conceptually
easier to understand, perhaps easier to
maintain, certainly easier from the perspective
of writing SQL commands. There is, however, a cost to
this denormalization process. And the cost is our
denormalized design is now open to the possibility
of modification anomalies. In this case, because
we are storing the zip code and the city and
the state for each customer in the customer table,
this means, first, that we will have a great deal of
repetitive data in the customer table. For example, if we have
100,000 customers who live in Los Angeles, California,
that extra text, Los Angeles, California, will
appear redundantly in many, many, many rows
within our customer table. What's more, we now have
a much greater probability of data entry problems
existing in our customer table. One customer, for example,
might enter their city and state as Los Angeles,
California, where they spell out the state name in full. Another customer might use the
two-character abbreviation, CA, for California. Whereas another customer might
provide the city and state entirely as upper case,
while another customer might provide the same
information entirely as lower case. These types of
anomalies in the data would not be present
if we had retained our original, normalized design. But because we made the decision
to denormalize those tables, we are now exposed
to the possibility that these anomalous
data values might exist. In summary, then, you should
denormalize your data design only when there is a defensible
business reason for doing so. And remember that, by
denormalizing your database design, your database
will now be exposed to the possibility
of having problems that it might not
otherwise have. There are some
preventative measures that can be taken to help
avoid those problems, such as using
database constraints, or implementing validity
checks in software. However, each of
these strategies represents additional
overhead tasks. And in the end,
there may not truly be any gains in simplicity
from using such an approach. A major component of relational
database design, of course, is creating relationships
between tables. And to that end,
I'd like to provide some additional useful
information about how to implement one to
one, one to many, and many to many relationships
in a real-world database. First, let's consider a
one to one relationship. Now, the question
here is, in order to establish a relationship
between two tables, we need to use a link between a
primary key and a foreign key. But in a one to one
relationship, which table should contain the foreign key? Here are a few guidelines
that you can use in order to decide what to do. First, if both sides
of the relationship are optional-- that is,
if the minimum cardinality on both sides of the
relationship is 0, then it doesn't really
matter which table receives the foreign key in a
one-to-one binary relationship. You can place the
foreign key in whichever table makes most sense to you,
conceptually, as a human being. If, however, only one
side of the relationship is optional, that is to say,
if one side of the relationship has a minimum
cardinality of zero, while the other side
of the relationship has a minimum
cardinality of one, then the best
strategy is usually to put the foreign
key in the table that resides on the optional
side of the relationship. Let's look at some examples. First, consider the
relationship between an employee and a locker. This is a one-to-one
relationship such that an employee might
be assigned one locker, and each locker might be
assigned to a single employee. In this case, the
minimum cardinality in this one-to-one
relationship is zero on both sides-- that is,
the relationship is optional in both directions--
and therefore, we can place the foreign key in
whichever table we prefer. In the first example, we've
placed the foreign key in the locker table such that
a locker has an employee, whereas in the
second example, we've placed the foreign key
in the employee table such that an employee
has a locker. So whichever of
those two notions conceptually makes most sense to
you as the data designer should guide your decision as to
where the foreign key should be placed. If it makes more sense to
think about a locker having an employee, then
place the foreign key in the locker table. If it makes more sense to you to
think about an employee having a locker, then place the foreign
key in the employee table. If, however, we have a
one-to-one binary relationship, where the minimum cardinality
on one side of the relationship is zero, while the minimum
cardinality on the other side of the relationship is
one, then the foreign key should always go at the optional
side of the relationship. Let's consider this example
where we have a hospital patient and a hospital bed. According to this data
design, a hospital patient is always assigned to exactly
one bed in the hospital. But each bed in the
hospital might not be assigned to a patient. If a bed is assigned to a
patient, then, of course, it is assigned to a
maximum of one patient. But the possibility exists that
we have empty beds that are not assigned to any patients. In this case, we
want the foreign key to go on the optional
side of the relationship. That is, we would put the bed ID
as a foreign key in the Patient table. And we would say a
patient is assigned a bed. SQL queries for one-to-one
binary relationships are quite straightforward and
depend only upon our decision as to the placement
of the foreign key. In this first example, we placed
the locker ID as a foreign key in the employee table. And therefore, we
need to do the join on equality between locker IDs. Whereas, in the second
example, we made the decision to place employee
ID as a foreign key in the locker table. And we therefore
need to do the join by looking for matching values
of employee IDs between the two tables. Next, let's consider a
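Here is a sketch of the second placement in SQLite (the names and data are illustrative): the foreign key lives in the employee table, and a UNIQUE constraint on that foreign key is one way to keep the relationship one-to-one rather than one-to-many.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locker (locker_id INTEGER PRIMARY KEY, location TEXT);
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        -- Nullable (both sides are optional); UNIQUE keeps it one-to-one.
        locker_id   INTEGER UNIQUE REFERENCES locker (locker_id)
    );
    INSERT INTO locker VALUES (7, 'B Hall');
    INSERT INTO employee VALUES (1, 'Leonard', 7);
""")
# Join on equality between locker IDs.
row = conn.execute("""
    SELECT e.name, l.location
    FROM employee e JOIN locker l ON e.locker_id = l.locker_id
""").fetchone()
```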
one to many relationship. Just as with a one-to-one
binary relationship, in a one to many
binary relationship, we establish the relationship
between the two tables by creating a link between
a primary and foreign key. In a one to many
relationship, however, it is very easy to
identify the table that should receive the foreign key. All you need to remember
is that the foreign key belongs at the many side
of the relationship. And in these types of one to
many binary relationships, we typically refer
to the one side of the relationship
as the parent table, while we refer to the many
side of the relationship as the child table. Here, for example, we see a
one to many binary relationship between a team table
and a player table such that a team has 0
to many players, while a player belongs
to one and only one team. In this case, the player
table resides at the many end of the relationship. And we therefore know
that the foreign key belongs in that table. Team ID, which is
currently serving as the primary key
of the team table, is thus placed into the
player table as a foreign key. Just as with one-to-one
binary relationships, writing SQL queries to
join two tables, which are involved in a one to
many binary relationship, is quite simple. In this example,
we are performing the join in the WHERE clause. But we could just
as easily perform the join in the FROM clause
by using the JOIN keyword. Finally, we can consider many
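Both styles of join for the team and player tables can be sketched in SQLite as follows (the names and data are illustrative); the WHERE-clause join and the JOIN-keyword join return the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team   (team_id INTEGER PRIMARY KEY, team_name TEXT);
    CREATE TABLE player (player_id INTEGER PRIMARY KEY, player_name TEXT,
                         team_id INTEGER REFERENCES team (team_id));  -- FK
    INSERT INTO team VALUES (1, 'Comets');
    INSERT INTO player VALUES (100, 'Penny', 1), (101, 'Sheldon', 1);
""")
# Performing the join in the WHERE clause...
where_rows = conn.execute("""
    SELECT t.team_name, p.player_name
    FROM team t, player p
    WHERE t.team_id = p.team_id
    ORDER BY p.player_id
""").fetchall()
# ...or, just as easily, in the FROM clause with the JOIN keyword.
join_rows = conn.execute("""
    SELECT t.team_name, p.player_name
    FROM team t JOIN player p ON t.team_id = p.team_id
    ORDER BY p.player_id
""").fetchall()
```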
to many binary relationships. And although, conceptually,
we as human beings are easily able to understand
the notion of a many to many relationship, in
order to actually implement such relationships in
a real-world database, we need to create
an intersection table that will sit between
our two parent tables. The typical strategy for
creating these intersection tables is to use a composite
primary key in the intersection table, which is comprised
of the primary keys from each of the parent tables. It is, of course, also
possible to use a surrogate key as the primary key in
an intersection table. However, doing so has
some implications. We will take a look at
those implications shortly. First, however, consider
this conceptual example of a relationship between
a student and a class. So, conceptually speaking, a
student can take many classes. And a class can
contain many students. This is a many to
many relationship. We cannot, unfortunately,
implement this type of relationship directly
in a real-world database. Instead, we need to create
an intersection table that will serve as a look-up table
between the student and class tables. In this case, we've created
an intersection table named Student_Class. And we can see that it has a
composite primary key composed of the primary keys of the two
tables to which it is attached. Note here that each component
of the composite primary key is individually serving
as a foreign key link back to its parent table. Thus, the attributes involved
in the composite primary key are simultaneously
serving two purposes. First, they are a part
of the primary key for our intersection table. And second, when
considered individually, they are serving as
foreign key links back to their parent tables. From the perspective of the
structured query language, to write a query, which would
assemble a list of students and all of the classes that
each student has taken, we would thus need to join
three tables together. We would need to join our
student table and class table together. But in order to do that, we need
to use the student-class table as a part of the
join operation. And in this case, in order
to accomplish this task, we need to match student
IDs in the student table to student IDs in the
student class table, while also matching the class
number in the student class table to the class number
in the class table. In this way, we've linked all
three tables together in order to produce our query results. Next, let's consider a type
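This three-table join can be sketched in SQLite like so (names and data are illustrative): the Student_Class intersection table carries the composite primary key, and each component of that key joins back to its parent table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, student_name TEXT);
    CREATE TABLE class   (class_number TEXT PRIMARY KEY, class_name TEXT);
    CREATE TABLE student_class (
        student_id   INTEGER REFERENCES student (student_id),  -- FK link
        class_number TEXT    REFERENCES class (class_number),  -- FK link
        PRIMARY KEY (student_id, class_number)  -- composite primary key
    );
    INSERT INTO student VALUES (1, 'Penny');
    INSERT INTO class VALUES ('DB300', 'Database Design');
    INSERT INTO student_class VALUES (1, 'DB300');
""")
# Match student IDs between student and student_class, and class numbers
# between student_class and class, linking all three tables together.
rows = conn.execute("""
    SELECT s.student_name, c.class_name
    FROM student s
    JOIN student_class sc ON s.student_id = sc.student_id
    JOIN class c ON sc.class_number = c.class_number
""").fetchall()
```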
of intersection table known as an associative table,
or an associative entity. An associative entity is
simply an intersection table which contains additional
non-key attributes. And we call this an
association relationship. Revisiting our
student class example, we see here that we've added
a new attribute to the student class table, named
Grade, which provides us with additional information
about the relationship between the student
and the class. That is, for each intersection
between a specific student and a specific class,
we want the ability to record a grade which
indicates the student's performance in that class. This sort of a design is called
an association relationship. Next, as promised, I
would like to talk briefly about the implications of using
a surrogate key in conjunction with an associative entity. When an associative entity uses
a composite primary key that is made up of the primary
keys of the parent tables to which it is attached,
then each possible pair of matched values that appear
within the associative entity can appear a
maximum of one time. By contrast, if we decide
to use a surrogate key, then, in that case, the primary
keys of the parent tables are serving just as foreign
keys in the associative entity. And the implication, then,
is that each possible matched pair of values between
the two parent tables can appear many times in
the associative entity because those foreign keys
are no longer considered part of the primary key. In this first example,
we have a situation where we would like to allow
our customers to provide ratings for the products that we sell. This is a many to many
relationship such that, conceptually
speaking, a customer should be able to provide
ratings for many products. And each product should
be able to be rated by many different customers. In our first example,
we see that we are using a
composite primary key as the primary key for our
associative entity rating. Because of this, each
possible combination of customer ID and product
ID can occur only one time in the rating table. What this means from a practical
perspective, in this example, is that each customer
would be allowed to rate each product
only one time, according to this data design. By contrast, if we consider
our second example, we see that we are
using a surrogate key as the primary key in the
rating table, named Rating ID. And our customer ID and
product ID attributes are simply serving
as foreign key links back to their respective
parent tables. That is, they are no longer
part of the primary key in this design. They are just
simple foreign keys. In the second design, because
we are using a surrogate key, the combination of values
between customer and product ID, each unique combination
could potentially occur in the rating
table many times. So customer number 1
might rate product number 7 many different times,
the practical implication being that, in this
sort of a design, each customer would be
allowed to rate each product potentially many times. So there are
important implications to be considered when
making the decision as to whether a surrogate
key should be used in place of a composite
primary key in one of these associative entities. Our final topic
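The composite-key variant can be sketched in SQLite as follows (table and column names are illustrative): because customer ID and product ID together form the primary key, a second rating for the same customer-product pair is rejected by the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rating (
        customer_id INTEGER,
        product_id  INTEGER,
        stars       INTEGER,                   -- non-key attribute
        PRIMARY KEY (customer_id, product_id)  -- composite primary key
    )
""")
conn.execute("INSERT INTO rating VALUES (1, 7, 5)")
try:
    # Customer 1 tries to rate product 7 a second time.
    conn.execute("INSERT INTO rating VALUES (1, 7, 2)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

Swapping the composite key for a surrogate `rating_id` primary key, with customer ID and product ID as plain foreign keys, would make `duplicate_allowed` come out true: each pair could then appear many times.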
for this lecture is to consider how to
properly implement recursive relationships
in our database designs. If you recall, we said that
a recursive relationship is a relationship in which
a table is connected to or related to itself. To actually implement these
recursive relationships in a real-world
database, however, we need to consider a few points. First, with respect to
one-to-one recursive relationships, and one to
many recursive relationships, the process of implementing
the recursive relationship is quite straightforward. We simply need to use a foreign
key link within the table. And that will allow us to create
the recursive relationship. If, however, we have a many to
many recursive relationship, we will need to create an
intersection table in order to actually implement the
recursive many to many relationship in a
real-world database. So on the coming slides, we
will consider these three conceptual recursive
relationships. In the first example, we have
a sponsorship relationship which is a one to one
recursive relationship. In this case, each
person in the table is sponsored by 0
to 1 other people. That is, they can be sponsored
by, at most, one other person. In our second
example, we have a one to many recursive relationship. And this is a
referral relationship. That is, each customer might
refer many other customers to us, but each
customer is referred to by a maximum of
one other customer. So this is a one to many
recursive relationship. And finally, we have a
many to many recursive relationship in which a doctor
might treat many other doctors, while each doctor might be
treated by many other doctors as well. This is conceptually a many to
many recursive relationship. Let's see how we
actually implement these recursive relationships
in a real-world database. To begin, consider our
one-to-one recursive relationship. To implement this
type of relationship, we just add a foreign
key into the same table, in this case, a
person table, which will track the sponsorship
relationship between people in the table. Just as when we were
discussing where to place the foreign
key in a one to one binary relationship, when we
have a one-to-one recursive relationship, we
also have a choice. In this case, depending
upon the approach which makes the most conceptual
sense to us as human beings, we can implement the sponsorship relationship either by creating a foreign key which records the person that each person in the database has sponsored, or by instead recording the person who was the sponsor for each person. Either way, we capture exactly the same information. Regardless of which
strategy we take to implement the one to
one recursive relationship, we will see that
the approach which is used to run SQL
queries related to the recursive relationship
is quite interesting. Broadly speaking,
the strategy involves joining the table
to itself in order to extract the
information that we want. So we see in these
examples here, we are referencing the
same table twice each time, giving the table
a different alias. And this approach allows us
to join the table to itself. Next, let's consider a one to
many recursive relationship. In this case,
again, the strategy is to place a foreign key
inside the same table that will track the one to many
recursive relationship. In our example, we have
a referral relationship. And as an example, we
would like to use this one to many recursive
relationship to generate a list of customers
who have been referred by a specific chosen customer. So we might, for
example, want to know which customers were referred
by customer number 100. In this example, we would
see that customer number 200 and customer number
400 were referred to by customer number 100. So referred by, in this
example, is a foreign key link which enables the
one to many recursive relationship within the table. Notice that the
strategy for writing SQL queries for this type of
recursive relationship is the same as the
strategy that we used for a one-to-one
recursive relationship. That is, we simply query
the same table twice, giving each instance
of the table a different name-- here,
A and B-- and this allows us to join the table to itself. Finally, we have
the problem of how to implement a many to many
recursive relationship. In our example, we
said that we want each doctor to be able
to treat or provide care for many other doctors,
while each doctor can also be treated by, or receive
care from, many other doctors. So conceptually, this is
a many to many recursive relationship on
our doctor table. In order to actually
implement these many to many recursive relationships,
we will need to use an intersection
table which will function as a look-up table
that allows us to determine, in this case, which
doctors are being treated by which other doctors. The structured query
language statements that will allow us to
join these two tables together look a bit odd. But hopefully, they make
logical sense to you, just as with our one-to-one and one to
many recursive relationships, we must first select data
twice from the same table. And we do this by providing
each instance of the table a unique name, in
this case, A and B. And then we also
must select data from our intersection table. With all of our tables
defined, we then simply perform the
join operation, as we ordinarily
approach, generate a list of doctors and
patients who, in this case, conceptually are also doctors. Well, my friends, thus ends
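The many to many recursive case can be sketched in SQLite like this (doctor names and IDs are illustrative): the treatment intersection table holds two foreign keys that both point back to the doctor table, and the query selects from the doctor table twice, as A and B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doctor (doctor_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE treatment (   -- intersection (look-up) table
        provider_id INTEGER REFERENCES doctor (doctor_id),
        patient_id  INTEGER REFERENCES doctor (doctor_id),
        PRIMARY KEY (provider_id, patient_id)
    );
    INSERT INTO doctor VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO treatment VALUES (1, 2), (2, 1);  -- they treat each other
""")
# List each treating doctor alongside the doctor being treated.
rows = conn.execute("""
    SELECT A.name AS provider, B.name AS patient
    FROM doctor A
    JOIN treatment t ON A.doctor_id = t.provider_id
    JOIN doctor B    ON t.patient_id = B.doctor_id
    ORDER BY A.doctor_id
""").fetchall()
```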
our further exploration of database design. I hope that you learned
something interesting today. And until next time,
have a great day.