Database Lesson #5 of 8 - Database Design

Captions
Good day, everyone. This is Dr. Soper here. And today, we will be exploring the fifth topic in our series of database lectures, this one being entitled, Database Design. Although there are many objectives associated with today's topic, broadly speaking, our goals are to further develop our skills such that we are able to transition away from our conceptual data models to a point where we feel comfortable implementing complex data structures in a real-world database. To this end, we will review the normalization process in this lecture. And we will also explore the concept of denormalization and learn when denormalization is an appropriate strategy in database design. We will also learn some new information that will be useful in helping us implement one-to-one, one-to-many, and many-to-many binary relationships in a database. And toward the end of the lecture, we will also gain insights into how to implement these one-to-one, one-to-many, and many-to-many relationships when those relationships are recursive. What's more, we will further expand upon our knowledge of the structured query language today by learning the strategies and approaches that can be used in order to work with these recursive relationships using SQL. To begin, I'd like to talk about the process of transitioning from relations, to entities, and ultimately into designs for tables that we can actually implement in a real-world database. We begin with simple relations. And we learn the parenthetical method of describing these relations, an example of which is shown on your screen. So here, we have an item relation. And we can see that this relation contains five attributes. The first attribute, itemId, is underlined, indicating that it is a primary key. And then we have four non-key attributes. We've already learned how to take this initial, simple conceptual model of a relation and translate that into an entity consistent with the set of symbols that are used in entity relationship modeling.
Our item relation, then, can be easily transformed into an entity. And we see an example of such an entity here. Although modeling the entity in this way can be very useful, we still need to make additional decisions, and specify additional details, before we can actually implement this entity as a table in a real-world database. As an example of these missing details, we might consider that we need to specify data types for each of the attributes in the table. This information will ultimately be required before we can implement our entities as tables in the database. Transitioning from a data model into a database, however, involves several more steps than just identifying the data types for our attributes. So, perhaps it would be useful if we were to explore this process in a slightly more formal way. To ensure that we are able to successfully implement our data model, the first step that we would like to complete is to create a table for each of the entities appearing in the data model. As we go through the process of implementing these tables, there are several key tasks that we will need to complete in order to ensure that the resulting implementation of our database is satisfactory. One of the most critical of these tasks is to specify a primary key. We must always remember to specify a primary key for each of the tables in our database. After adding our attributes or columns to the table, we need to specify several properties for each column so that the database engine will be able to treat the column properly once the database is deployed. Among these properties are, first, the data type, is the attribute-- an integer, a float, a varchar, a date-time, or some other data type. The null status of the attribute-- that is, are null values allowed? Default values, if any, for the attribute. That is to say, if a specific value for the attribute is not supplied, is there a value that we would want the database to fill in automatically for us? 
And, of course, we may also impose certain constraints on the column so as to preserve the integrity of the data contained therein. Examples of such constraints might include a range constraint in which the values appearing in a column must fall between a minimum and maximum value. You may also specify a set of acceptable values, such that each value appearing in the column must exist within the predefined set, and so forth. Finally, after completing all of these design tasks, we may want to re-evaluate the structure of the table in light of the normalization needs of the database. If, for example, our performance needs require that we denormalize the table, that action of denormalization, which we will discuss and describe later today, might require us to alter the structure of the table somewhat. Before we discuss denormalization, however, a brief review of normalization and the normalization process may be useful. If you recall from an earlier lecture, we said that data tables which are not properly normalized may be susceptible to problems which we called modification anomalies. And there are three different types of these anomalies. First are insertion problems in which a problem arises when we attempt to add new data into a table. Next, we have update anomalies in which a problem arises when we attempt to modify the data in a table. And finally, we have deletion anomalies in which a problem arises when we attempt to delete existing data from within a table. Tables which are properly normalized will help us to avoid these three types of problems. As a quick review, let's see some examples of these three different types of modification anomalies. Here, we see a list of data in which we are recording information about customers and the cars that they own. To begin, let's consider a deletion anomaly. Let's say, for example, that Leonard is no longer my customer. So I want to delete Leonard from the database. 
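These column properties can be sketched in SQL DDL. Below is a minimal, hypothetical example, run through Python's built-in sqlite3 module so that it is self-contained; the column names, data types, and specific constraint values are illustrative assumptions rather than details taken from the lecture:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Item (
        ItemId    INTEGER PRIMARY KEY,         -- every table needs a primary key
        ItemName  VARCHAR(50) NOT NULL,        -- null status: NULLs not allowed
        UnitPrice DECIMAL(9,2) DEFAULT 0.00,   -- default value, filled in automatically
        Quantity  INTEGER CHECK (Quantity BETWEEN 0 AND 1000),          -- range constraint
        Category  VARCHAR(20) CHECK (Category IN ('Tool', 'Part', 'Supply'))  -- set of acceptable values
    )
""")

# No UnitPrice supplied, so the database fills in the DEFAULT for us.
conn.execute("INSERT INTO Item (ItemId, ItemName, Quantity, Category) "
             "VALUES (1, 'Hammer', 10, 'Tool')")
price = conn.execute("SELECT UnitPrice FROM Item WHERE ItemId = 1").fetchone()[0]

# A Quantity outside the 0-1000 range violates the CHECK constraint.
try:
    conn.execute("INSERT INTO Item VALUES (2, 'Widget', 5.0, 2000, 'Part')")
    range_ok = True
except sqlite3.IntegrityError:
    range_ok = False  # the database engine rejected the out-of-range row
```

The engine rejects the second insert outright, which is exactly the data-integrity protection these constraints are meant to provide.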
By deleting the row of data in the list for Leonard, however, I have now lost all record that a type of car, known as a 2010 Nissan Sentra, even exists. So by deleting Leonard, I have inadvertently deleted other information from the list that I may have found useful, or I may have needed to preserve. This is a deletion anomaly. Next, let's consider an insertion anomaly. Say that we want to add a new customer to our table. Perhaps this customer's name is Sheldon. But Sheldon doesn't drive. So we add information about Sheldon into the table. But the result is that we now have many missing values in our list. This is wasted space. And we might consider this to be an insertion anomaly. Later on, perhaps, our new customer, Sheldon, decides to learn to drive. And let's say that he purchases a 2003 Volkswagen Golf. Now, a 2003 Volkswagen Golf already exists in our list. So as we are adding this information into the list for our customer, Sheldon, assume that we make a spelling mistake. Perhaps we misspell Volkswagen. Again, this is an insertion anomaly that would not have occurred in a properly-normalized database design. Finally, we can consider an update anomaly. Let's say, for example, that we want to change the car ID for a 2003 Volkswagen Golf from 102 to 100. If I make that change for my customer, Penny, I have now introduced an update anomaly into the list because my customer, Sheldon, has the same type of vehicle. And therefore, in order to maintain the accuracy of the data in the list, I would need to update the car ID not only for Penny, but also for Sheldon. And this is an example of an update anomaly because we've introduced a problem into our table of data as the result of an update operation. If you recall, the approach that we take to avoid these modification anomalies is to break tables apart into smaller and smaller tables until they are no longer susceptible to these anomalies. This is a process which we can call normalization. 
In our previous example, we had a large list of data that contained information about two separate business concepts, or business themes. That is, our list of data contained information about customers and cars in the same list. By breaking that list apart into two separate tables, say, a customer table and a car table, we now have a data design that allows us to add, update, or delete data without introducing any anomalies into the database. Next, let's briefly review some of the definitions associated with the normalization process. First is a functional dependency. Now, a functional dependency refers to a relationship among attributes within a relation, such that the value of one attribute can be used to find or identify the value of another attribute or set of attributes. That is, if the value of one attribute can be determined by the value of another, then that first attribute is said to be functionally dependent upon the second attribute. Further, if an attribute can be used to find the values of other attributes within a table, then that attribute is known as a determinant. That is, the value of the attribute can be used to determine the values of other attributes within the table. Finally, recall the concept of a candidate key. And a candidate key is simply an attribute whose values can be used to determine, or to find, the value of every other non-key attribute within a row. And these candidate keys can come in a few different flavors, one being a simple candidate key which consists of the value of just a single attribute. And the other being a composite candidate key whose uniqueness comes by combining the values of two or more attributes. Recall, also, that many different types of normal forms have been defined. We began by exploring the concept of a relation. And we said that every relation is, by definition, in first normal form. That is, if a table meets the definition of a relation, then that table is also in First Normal Form. 
If we take our table in first normal form, and we remove any partial dependencies, then the table meets the definition of second normal form. If we further remove any transitive dependencies, then our table will meet the definition of Third Normal Form. If we reach a point in our normalization process where every determinant within the table is also a candidate key, then the table will meet the definition of something called Boyce-Codd Normal Form. And as you can see, beyond Boyce-Codd Normal Form, there is Fourth Normal Form, Fifth Normal Form, Domain/Key Normal Form (DK/NF), and Sixth Normal Form. So many additional normal forms have been identified. For our purposes, a relation is considered normalized when every determinant is a candidate key. And again, technically, this is known as Boyce-Codd Normal Form. Boyce-Codd Normal Form is a slightly more stringent version of Third Normal Form, and it is sometimes called 3.5 Normal Form for that reason. A database design which has been normalized such that it is in Third Normal Form or above is generally not susceptible to the types of modification anomalies that we described earlier. And, at this level of normalization, a database design is also almost always sufficient to handle any business data problem. Next, let's consider a normalization example. Assume, for this example, that we want to create a database table which will allow us to keep track of information relating to our customers, and also information relating to each of those customers' accountants. So we create a customer table that contains a customer ID. That will be our primary key. And we also track the customer's name, address, city, state, zip code, and the accountant ID for the customer, along with that accountant's name and telephone number. In its current form, this customer table is not normalized because not all of the attributes in the customer table can be determined by the customer ID.
Specifically, consider that if we know a zip code, or a postal code, then, by definition, we also know the city and state. That is to say, if I know a zip code, I can determine the city and state to which that zip code applies. Furthermore, in this table, we also see that, if we know the accountant ID, we can use that information to determine the name of the accountant and the accountant's phone number. To normalize this database design, then, we will need to break the table apart into a series of smaller tables until we reach a point where every non-key attribute is fully functionally determined by the primary key. Our normalized solution, then, contains three separate tables, a customer table, which contains the customer ID as a primary key, the name and address of the customer, and the zip code and accountant IDs associated with the customer as foreign key links back to their parent tables. We then have a zip code table where zip is the primary key. And we have city and state as attributes of the zip code table. And then we have an accountant table where accountant ID serves as the primary key. And we have two non-key attributes, accountant name and accountant phone. We have thus taken our original customer table and broken it into three smaller tables which, when considered together, provide us with a normalized data solution. Now that we've reviewed the normalization process, next I would like to talk about a new topic. And that is the concept known as denormalization. As the name suggests, denormalization is the opposite of the normalization process. That is, in the denormalization process, we have identified a business need, or a business justification, for why two or more tables within our database design should be combined together into a single table. Common reasons for wanting to denormalize a set of tables include simplicity of design and speed of querying. 
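A sketch of this three-table normalized design, with hypothetical sample data (the attribute names follow the lecture's description, but the specific values are invented for illustration; the SQL is run through Python's sqlite3 module so the example is self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ZipCode    (Zip TEXT PRIMARY KEY, City TEXT, State TEXT);
    CREATE TABLE Accountant (AccountantId INTEGER PRIMARY KEY,
                             AccountantName TEXT, AccountantPhone TEXT);
    CREATE TABLE Customer (
        CustomerId   INTEGER PRIMARY KEY,
        Name         TEXT,
        Address      TEXT,
        Zip          TEXT REFERENCES ZipCode(Zip),              -- foreign key to parent
        AccountantId INTEGER REFERENCES Accountant(AccountantId) -- foreign key to parent
    );
    INSERT INTO ZipCode VALUES ('90001', 'Los Angeles', 'CA');
    INSERT INTO Accountant VALUES (1, 'Howard', '555-0100');
    INSERT INTO Customer VALUES (10, 'Penny', '123 Main St', '90001', 1);
""")

# Reassemble the full customer record by joining back to the parent tables.
row = conn.execute("""
    SELECT C.Name, C.Address, Z.City, Z.State, C.Zip, A.AccountantName
    FROM Customer C
    JOIN ZipCode Z ON C.Zip = Z.Zip
    JOIN Accountant A ON C.AccountantId = A.AccountantId
""").fetchone()
```

City and state now live in exactly one place, so changing a zip code's city requires updating a single row rather than every matching customer.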
The trade-off, however, is that, by using a denormalized design, we are introducing the possibility of modification anomalies into our database. Consider, for example, the set of tables shown here. In this example, we have combined the zip code table and the customer table from our normalized solution into a single table, which here, we are just calling Customer. Specifically, what we have done is we have taken the city and state attributes that used to exist in the zip code table, and we have added them into the customer table, along with zip code, such that the entire customer address can now be read simply by examining the rows of the customer table. From a performance perspective, this may be desirable because it allows us to extract a customer's address from the table without having to perform a join operation with the zip code table. In the relational database world, join operations are computationally and temporally expensive processes. It takes the database engine a substantial number of computational cycles, and a substantial amount of time, to join two tables together. And that extra time, that extra effort, slows down query performance. By denormalizing the design, and combining the zip code and customer tables together, we can expect an increase in query performance if our objective is to extract customer addresses. What's more, we have the additional benefit of simplicity. Our database design now contains two tables instead of three, which makes it conceptually easier to understand, perhaps easier to maintain, and certainly easier from the perspective of writing SQL commands. There is, however, a cost to this denormalization process. And the cost is that our denormalized design is now open to the possibility of modification anomalies. In this case, because we are storing the zip code and the city and the state for each customer in the customer table, this means, one, we will have a great deal of repetitive data in the customer table.
For example, if we have 100,000 customers who live in Los Angeles, California, that extra text, Los Angeles, California, will appear redundantly in many, many, many rows within our customer table. What's more, we now have a much greater probability of data entry problems existing in our customer table. One customer, for example, might enter their city and state as Los Angeles, California, where they spell the state California. Another customer might use the two-character abbreviation, CA, for California. Whereas another customer might provide the city and state entirely as upper case, while another customer might provide the same information entirely as lower case. These types of anomalies in the data would not be present if we had retained our original, normalized design. But because we made the decision to denormalize those tables, we are now exposed to the possibility that these anomalous data values might exist. In summary, then, you should denormalize your data design only when there is a defensible business reason for doing so. And remember that, by denormalizing your database design, your database will now be exposed to the possibility of having problems that it might not otherwise have. There are some preventative measures that can be taken to help avoid those problems, such as using database constraints, or implementing validity checks in software. However, each of these strategies represents additional overhead tasks. And in the end, there may not truly be any gains in simplicity from using such an approach. A major component of relational database design, of course, is creating relationships between tables. And to that end, I'd like to provide some additional useful information about how to implement one to one, one to many, and many to many relationships in a real-world database. First, let's consider a one to one relationship. 
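The data-entry problem described here is easy to demonstrate. In this hypothetical sketch (sample data invented for illustration), three customers in the same zip code enter the same city and state three different ways, and nothing in the denormalized design prevents it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized design: city and state are repeated in every customer row.
conn.execute("""
    CREATE TABLE Customer (
        CustomerId INTEGER PRIMARY KEY,
        Name TEXT, City TEXT, State TEXT, Zip TEXT
    )
""")
# Inconsistent spellings of the same city and state creep in unchecked.
conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?, ?)", [
    (1, 'Penny',   'Los Angeles', 'California', '90001'),
    (2, 'Leonard', 'Los Angeles', 'CA',         '90001'),
    (3, 'Howard',  'LOS ANGELES', 'CA',         '90001'),
])
variants = conn.execute(
    "SELECT COUNT(DISTINCT City || '/' || State) FROM Customer WHERE Zip = '90001'"
).fetchone()[0]
# Three distinct spellings now exist for what should be one city/state pair.
```

In the normalized design, the city and state for zip 90001 would be stored once in the zip code table, making this inconsistency impossible.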
Now, the question here is, in order to establish a relationship between two tables, we need to use a link between a primary key and a foreign key. But in a one-to-one relationship, which table should contain the foreign key? Here are a few guidelines that you can use in order to decide what to do. First, if both sides of the relationship are optional-- that is, if the minimum cardinality on both sides of the relationship is 0-- then it doesn't really matter which table receives the foreign key in a one-to-one binary relationship. You can place the foreign key in whichever table makes the most sense to you, conceptually, as a human being. If, however, only one side of the relationship is optional-- that is to say, if one side of the relationship has a minimum cardinality of zero, while the other side of the relationship has a minimum cardinality of one-- then the best strategy is usually to put the foreign key in the table that resides on the optional side of the relationship. Let's look at some examples. First, consider the relationship between an employee and a locker. This is a one-to-one relationship such that an employee might be assigned one locker, and each locker might be assigned to a single employee. In this case, the minimum cardinality in this one-to-one relationship is zero on both sides-- that is, the relationship is optional in both directions-- and therefore, we can place the foreign key in whichever table we prefer. In the first example, we've placed the foreign key in the locker table such that a locker has an employee, whereas in the second example, we've placed the foreign key in the employee table such that an employee has a locker. So whichever of those two notions conceptually makes the most sense to you as the data designer should guide your decision as to where the foreign key should be placed. If it makes more sense to think about a locker having an employee, then place the foreign key in the locker table.
If it makes more sense to you to think about an employee having a locker, then place the foreign key in the employee table. If, however, we have a one-to-one binary relationship, where the minimum cardinality on one side of the relationship is zero, while the minimum cardinality on the other side of the relationship is one, then the foreign key should always go on the optional side of the relationship. Let's consider this example where we have a hospital patient and a hospital bed. According to this data design, a hospital patient is always assigned to exactly one bed in the hospital. But each bed in the hospital might not be assigned to a patient. If a bed is assigned to a patient, then, of course, it is assigned to a maximum of one patient. But the possibility exists that we have empty beds that are not assigned to any patients. In this case, we want the foreign key to go on the optional side of the relationship. That is, we would put the bed ID as a foreign key in the Patient table. And we would say a patient is assigned a bed. SQL queries for one-to-one binary relationships are quite straightforward and depend only upon our decision as to the placement of the foreign key. In this first example, we placed the locker ID as a foreign key in the employee table. And therefore, we need to do the join on equality between locker IDs. Whereas, in the second example, we made the decision to place employee ID as a foreign key in the locker table. And we therefore need to do the join by looking for matching values of employee IDs between the two tables.
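A sketch of the first placement choice, with the locker ID stored as a foreign key in the employee table and the join performed on equality between locker IDs; the sample names and values are invented for illustration, and the SQL is run through Python's sqlite3 module so the example is self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Locker   (LockerId INTEGER PRIMARY KEY, Location TEXT);
    -- Foreign key on the Employee side: an employee *has a* locker.
    CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, Name TEXT,
                           LockerId INTEGER REFERENCES Locker(LockerId));
    INSERT INTO Locker VALUES (7, 'Basement');
    INSERT INTO Employee VALUES (1, 'Leonard', 7);
""")

# Join on equality between the locker IDs in the two tables.
row = conn.execute("""
    SELECT E.Name, L.Location
    FROM Employee E JOIN Locker L ON E.LockerId = L.LockerId
""").fetchone()
```

Had we placed EmployeeId in the Locker table instead, the query would be identical in shape, just matching employee IDs between the two tables.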
All you need to remember is that the foreign key belongs on the many side of the relationship. And in these types of one to many binary relationships, we typically refer to the one side of the relationship as the parent table, while we refer to the many side of the relationship as the child table. Here, for example, we see a one to many binary relationship between a team table and a player table such that a team has 0 to many players, while a player belongs to one and only one team. In this case, the player table resides at the many end of the relationship. And we therefore know that the foreign key belongs in that table. Team ID, which is currently serving as the primary key of the team table, is thus placed into the player table as a foreign key. Just as with one-to-one binary relationships, writing SQL queries to join two tables which are involved in a one to many binary relationship is quite simple. In this example, we are performing the join in the WHERE clause. But we could just as easily perform the join in the FROM clause by using the JOIN keyword. Finally, we can consider many to many binary relationships. And although, conceptually, we as human beings are easily able to understand the notion of a many to many relationship, in order to actually implement such relationships in a real-world database, we need to create an intersection table that will sit between our two parent tables. The typical strategy for creating these intersection tables is to use a composite primary key in the intersection table, which is composed of the primary keys from each of the parent tables. It is, of course, also possible to use a surrogate key as the primary key in an intersection table. However, doing so has some implications. We will take a look at those implications shortly. First, however, consider this conceptual example of a relationship between a student and a class. So, conceptually speaking, a student can take many classes. And a class can contain many students.
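The two equivalent join styles mentioned here can be sketched as follows, using hypothetical team and player data (the names are invented for illustration); both queries return exactly the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Team   (TeamId INTEGER PRIMARY KEY, TeamName TEXT);
    -- The foreign key goes on the many side: each player references a team.
    CREATE TABLE Player (PlayerId INTEGER PRIMARY KEY, PlayerName TEXT,
                         TeamId INTEGER REFERENCES Team(TeamId));
    INSERT INTO Team VALUES (1, 'Tigers');
    INSERT INTO Player VALUES (100, 'Penny', 1), (101, 'Sheldon', 1);
""")

# Join expressed in the WHERE clause...
where_rows = conn.execute("""
    SELECT T.TeamName, P.PlayerName
    FROM Team T, Player P
    WHERE T.TeamId = P.TeamId
    ORDER BY P.PlayerId
""").fetchall()

# ...and the equivalent join using the JOIN keyword in the FROM clause.
join_rows = conn.execute("""
    SELECT T.TeamName, P.PlayerName
    FROM Team T JOIN Player P ON T.TeamId = P.TeamId
    ORDER BY P.PlayerId
""").fetchall()
```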
This is a many to many relationship. We cannot, unfortunately, implement this type of relationship directly in a real-world database. Instead, we need to create an intersection table that will serve as a look-up table between the student and class tables. In this case, we've created an intersection table named Student_Class. And we can see that it has a composite primary key composed of the primary keys of the two tables to which it is attached. Note here that each component of the composite primary key is individually serving as a foreign key link back to its parent table. Thus, the attributes involved in the composite primary key are simultaneously serving two purposes. First, they are a part of the primary key for our intersection table. And second, when considered individually, they are serving as foreign key links back to their parent tables. From the perspective of the structured query language, to write a query which would assemble a list of students and all of the classes that each student has taken, we would thus need to join three tables together. We would need to join our student table and class table together. But in order to do that, we need to use the Student_Class table as a part of the join operation. And in this case, in order to accomplish this task, we need to match student IDs in the student table to student IDs in the Student_Class table, while also matching the class number in the Student_Class table to the class number in the class table. In this way, we've linked all three tables together in order to produce our query results. Next, let's consider a type of intersection table known as an associative table, or an associative entity. An associative entity is simply an intersection table which contains additional non-key attributes. And we call this an association relationship.
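A sketch of the three-table join, using hypothetical student and class data; the intersection table's composite primary key is built from the two parent keys, as described above, and each half also serves as a foreign key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (StudentId INTEGER PRIMARY KEY, StudentName TEXT);
    CREATE TABLE Class   (ClassNumber INTEGER PRIMARY KEY, ClassName TEXT);
    -- Intersection table: composite primary key; each component is also
    -- a foreign key link back to its parent table.
    CREATE TABLE Student_Class (
        StudentId   INTEGER REFERENCES Student(StudentId),
        ClassNumber INTEGER REFERENCES Class(ClassNumber),
        PRIMARY KEY (StudentId, ClassNumber)
    );
    INSERT INTO Student VALUES (1, 'Penny');
    INSERT INTO Class VALUES (300, 'Databases'), (301, 'Statistics');
    INSERT INTO Student_Class VALUES (1, 300), (1, 301);
""")

# Join all three tables to list each student with the classes taken:
# match student IDs into the intersection table, and class numbers out of it.
rows = conn.execute("""
    SELECT S.StudentName, C.ClassName
    FROM Student S
    JOIN Student_Class SC ON S.StudentId = SC.StudentId
    JOIN Class C ON SC.ClassNumber = C.ClassNumber
    ORDER BY C.ClassNumber
""").fetchall()
```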
Revisiting our student class example, we see here that we've added a new attribute to the student class table, named Grade, which provides us with additional information about the relationship between the student and the class. That is, for each intersection between a specific student and a specific class, we want the ability to record a grade which indicates the student's performance in that class. This sort of a design is called an association relationship. Next, as promised, I would like to talk briefly about the implications of using a surrogate key in conjunction with an associative entity. When an associative entity uses a composite primary key that is made up of the primary keys of the parent tables to which it is attached, then each possible pair of matched values that appear within the associative entity can appear a maximum of one time. By contrast, if we decide to use a surrogate key, then, in that case, the primary keys of the parent tables are serving just as foreign keys in the associative entity. And the implication, then, is that each possible matched pair of values between the two parent tables can appear many times in the associative entity because those foreign keys are no longer considered part of the primary key. In this first example, we have a situation where we would like to allow our customers to provide ratings for the products that we sell. This is a many to many relationship such that, conceptually speaking, a customer should be able to provide ratings for many products. And each product should be able to be rated by many different customers. In our first example, we see that we are using a composite primary key as the primary key for our associative entity rating. Because of this, each possible combination of customer ID and product ID can occur only one time in the rating table. 
What this means from a practical perspective, in this example, is that each customer would be allowed to rate each product only one time, according to this data design. By contrast, if we consider our second example, we see that we are using a surrogate key as the primary key in the rating table, named Rating ID. And our customer ID and product ID attributes are simply serving as foreign key links back to their respective parent tables. That is, they are no longer part of the primary key in this design. They are just simple foreign keys. In the second design, because we are using a surrogate key, each unique combination of customer ID and product ID values could potentially occur in the rating table many times. So customer number 1 might rate product number 7 many different times, the practical implication being that, in this sort of a design, each customer would be allowed to rate each product potentially many times. So there are important implications to be considered when making the decision as to whether a surrogate key should be used in place of a composite primary key in one of these associative entities. Our final topic for this lecture is to consider how to properly implement recursive relationships in our database designs. If you recall, we said that a recursive relationship is a relationship in which a table is connected to or related to itself. To actually implement these recursive relationships in a real-world database, however, we need to consider a few points. First, with respect to one-to-one recursive relationships, and one to many recursive relationships, the process of implementing the recursive relationship is quite straightforward. We simply need to use a foreign key link within the table. And that will allow us to create the recursive relationship.
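The behavioral difference between the two key designs can be sketched as follows; the table and column names are illustrative, and the composite primary key is what enforces the one-rating-per-pair rule:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Composite primary key: each (customer, product) pair may appear at most once.
    CREATE TABLE RatingComposite (
        CustomerId INTEGER, ProductId INTEGER, Stars INTEGER,
        PRIMARY KEY (CustomerId, ProductId)
    );
    -- Surrogate key: CustomerId and ProductId are plain foreign keys,
    -- so the same pair may appear many times.
    CREATE TABLE RatingSurrogate (
        RatingId INTEGER PRIMARY KEY,
        CustomerId INTEGER, ProductId INTEGER, Stars INTEGER
    );
""")

conn.execute("INSERT INTO RatingComposite VALUES (1, 7, 5)")
try:
    # Customer 1 tries to rate product 7 a second time.
    conn.execute("INSERT INTO RatingComposite VALUES (1, 7, 2)")
    repeated = True
except sqlite3.IntegrityError:
    repeated = False  # the composite key rejects the duplicate pair

# Under the surrogate-key design, the duplicate pair is allowed.
conn.execute("INSERT INTO RatingSurrogate (CustomerId, ProductId, Stars) VALUES (1, 7, 5)")
conn.execute("INSERT INTO RatingSurrogate (CustomerId, ProductId, Stars) VALUES (1, 7, 2)")
count = conn.execute(
    "SELECT COUNT(*) FROM RatingSurrogate WHERE CustomerId = 1 AND ProductId = 7"
).fetchone()[0]
```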
If, however, we have a many to many recursive relationship, we will need to create an intersection table in order to actually implement the recursive many to many relationship in a real-world database. So on the coming slides, we will consider these three conceptual recursive relationships. In the first example, we have a sponsorship relationship, which is a one to one recursive relationship. In this case, each person in the table is sponsored by 0 to 1 other people. That is, they can be sponsored by, at most, one other person. In our second example, we have a one to many recursive relationship. And this is a referral relationship. That is, each customer might refer many other customers to us, but each customer is referred to by a maximum of one other customer. So this is a one to many recursive relationship. And finally, we have a many to many recursive relationship in which a doctor might treat many other doctors, while each doctor might be treated by many other doctors as well. This is conceptually a many to many recursive relationship. Let's see how we actually implement these recursive relationships in a real-world database. To begin, consider our one-to-one recursive relationship. To implement this type of relationship, we just add a foreign key into the same table, in this case, a person table, which will track the sponsorship relationship between people in the table. Just as when we were discussing where to place the foreign key in a one to one binary relationship, when we have a one-to-one recursive relationship, we also have a choice. In this case, depending upon the approach which makes the most conceptual sense to us as human beings, we can implement the sponsorship relationship either by creating a foreign key which records the person that each person in the database has sponsored, or by instead storing the person who was the sponsor for each person. Either approach records exactly the same information.
Regardless of which strategy we take to implement the one to one recursive relationship, we will see that the approach which is used to run SQL queries related to the recursive relationship is quite interesting. Broadly speaking, the strategy involves joining the table to itself in order to extract the information that we want. So we see in these examples here, we are referencing the same table twice each time, giving the table a different alias. And this approach allows us to join the table to itself. Next, let's consider a one to many recursive relationship. In this case, again, the strategy is to place a foreign key inside the same table that will track the one to many recursive relationship. In our example, we have a referral relationship. And as an example, we would like to use this one to many recursive relationship to generate a list of customers who have been referred by a specific chosen customer. So we might, for example, want to know which customers were referred by customer number 100. In this example, we would see that customer number 200 and customer number 400 were referred to by customer number 100. So referred by, in this example, is a foreign key link which enables the one to many recursive relationship within the table. Notice that the strategy for writing SQL queries for this type of recursive relationship is the same as the strategy that we used for a one-to-one recursive relationship. That is, we simply query the same table twice, giving each instance of the table a different name-- here, A and B-- and this allows us to join the table to itself. Finally, we have the problem of how to implement a many to many recursive relationship. In our example, we said that we want each doctor to be able to treat or provide care for many other doctors, while each doctor can also be treated by, or receive care from, many other doctors. So conceptually, this is a many to many recursive relationship on our doctor table.
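A sketch of this referral query, reusing the customer numbers from the lecture (100, 200, and 400); the customer names and the ReferredBy column name are invented for illustration. The table is queried twice under the aliases A and B, which lets us join it to itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        CustomerId INTEGER PRIMARY KEY,
        Name       TEXT,
        ReferredBy INTEGER REFERENCES Customer(CustomerId)  -- FK back into the same table
    );
    INSERT INTO Customer VALUES (100, 'Penny',   NULL);
    INSERT INTO Customer VALUES (200, 'Leonard', 100);
    INSERT INTO Customer VALUES (300, 'Howard',  200);
    INSERT INTO Customer VALUES (400, 'Sheldon', 100);
""")

# Which customers were referred by customer number 100?
# Alias A is the referring customer; alias B is the referred customer.
rows = conn.execute("""
    SELECT B.CustomerId, B.Name
    FROM Customer A JOIN Customer B ON B.ReferredBy = A.CustomerId
    WHERE A.CustomerId = 100
    ORDER BY B.CustomerId
""").fetchall()
```

As in the lecture's example, customers 200 and 400 come back as the referrals of customer 100.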
In order to actually implement these many to many recursive relationships, we will need to use an intersection table which will function as a look-up table that allows us to determine, in this case, which doctors are being treated by which other doctors. The structured query language statements that will allow us to join these two tables together look a bit odd. But hopefully, they make logical sense to you. Just as with our one-to-one and one to many recursive relationships, we must first select data twice from the same table. And we do this by providing each instance of the table a unique name, in this case, A and B. And then we also must select data from our intersection table. With all of our tables defined, we then simply perform the join operation, as we ordinarily would. And we can, using this approach, generate a list of doctors and patients who, in this case, conceptually are also doctors. Well, my friends, thus ends our further exploration of database design. I hope that you learned something interesting today. And until next time, have a great day.
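As a closing sketch, the many-to-many recursive doctor relationship described above might be implemented as follows; the table and column names are hypothetical, and the Doctor table is selected from twice (aliases A and B) alongside the intersection table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Doctor (DoctorId INTEGER PRIMARY KEY, Name TEXT);
    -- Intersection table linking the Doctor table to itself.
    CREATE TABLE Treatment (
        ProviderId INTEGER REFERENCES Doctor(DoctorId),
        PatientId  INTEGER REFERENCES Doctor(DoctorId),
        PRIMARY KEY (ProviderId, PatientId)
    );
    INSERT INTO Doctor VALUES (1, 'Dr. A'), (2, 'Dr. B'), (3, 'Dr. C');
    INSERT INTO Treatment VALUES (1, 2), (1, 3), (3, 1);  -- doctors treating doctors
""")

# Join the Doctor table to itself through the intersection table to list
# each treating doctor with the doctors who are that doctor's patients.
rows = conn.execute("""
    SELECT A.Name, B.Name
    FROM Doctor A
    JOIN Treatment T ON A.DoctorId = T.ProviderId
    JOIN Doctor B ON T.PatientId = B.DoctorId
    ORDER BY T.ProviderId, T.PatientId
""").fetchall()
```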
Info
Channel: Dr. Daniel Soper
Views: 185,583
Keywords: Database (File Format Genre), Database Design (Field Of Study), entity relationship models, physical database design, denormalization, binary database relationships, recursive database relationships, SQL (Programming Language), SQL joins, databases
Id: B_QU9OGwtbI
Length: 50min 11sec (3011 seconds)
Published: Tue Jul 23 2013