What is the first thing that comes to your mind
when you hear the word “database”? For many people, this question is more challenging
than it might seem at first. An answer like “a big file where much information
is stored” is not satisfactory and would not please potential employers. You should remember there are two main types
of databases – relational and non-relational. The former will be the focus of this course,
while the latter regards more complex systems. Although understanding non-relational databases
requires a serious mathematical and programming background, some of the logic applied in its
coding is the same as SQL. Likewise, relational databases have a few
advantages on their own. A small bit of theory will explain why they
are still the preferred choice in many companies and institutions. Databases’ main goal is to organize huge
amounts of data that can be quickly retrieved upon users’ request. Therefore, they must be compact, well-structured,
and efficient in terms of speed of data extraction. Today, people need such extra efficiency because
data occupies memory space and… the bigger its size, the more sluggish the database is
and the slower the retrieval process becomes. If we have a database containing one multi-million-row
table, with many columns, then every time a request has been received, the server must
load all the records, with all fields, and it would take too much time for a task to
be completed. Don’t forget every symbol is a container
of information and requires bytes of storage space. Hence, loading that much data will not be
an easy job for the computer. So, what allows us to contain so much data
on the server, but lets us efficiently use only the portions we need for our analysis? The secret lies behind the use of mathematical
logic originating from relational algebra. Please, don’t worry - we will not bother
you with math. Imagine each table with data is represented
by a transparent circle that contains all the data values of the table, categorized
by columns or, as we will often call them, fields. Now, if our database consisted of only one
table, a giant circle would represent the entire database, something like this huge
table from our fictional example with the “Sales” database. And when we need a piece of information from
the database, for example if we wish to see who has bought something on a certain date,
we will have to lift this whole big circle and then search for what we need. This challenge seems vague and the process
of data extraction will not be efficient. See what can happen if we split the circle
into 3 smaller circles, just as we did with the “Sales” database. One circle will stand for the “Sales”
table, the other for “Customers”, and the last one for “Items”. There are various theoretical combinations
between 3 or more circles, but in our database, we have the following model. “Sales” and “Customers” have the same
“customer ID” column, and “Sales” and “Items” have the same “item code”
column. This way, we can see the circles overlap as
they have common fields. So, if we’d like to extract the same information,
the names of the customers who have purchased something on a given date, we will need only
the “date of purchase” column from the “Sales” table and the “first” and
“last name” from the “Customers” table. So, to satisfy this request, we will not need
to lift the third circle from our database, “Items”. This way, we can save energy or, more technically,
increase efficiency. Less data, represented as only two of the
three circles, will be involved in this operation. The mathematical trick lies in relating the
tables to one another. Relationships were formed namely through these
common fields. Anyway, I am sure that now you understand
why we use the term relational databases. Some professionals may refer to the tables,
or the circles in our plot, as relations because, theoretically, they are the smallest units
in the entire system that can carry integral logical meaning. Likewise, the three circles are all part of
our “Sales” database. When we combine the database and its existing
relations, we obtain the famous term relational database management system, frequently abbreviated
as RDBMS. 😊
We hope this theoretical illustration makes things clearer. SQL is designed for managing relational database
management systems and can do that by creating relations between the different tables in
a database. To complete our theoretical preparation, we
will need to learn more database terminology. This is a necessary step that will help us
when we start coding in SQL later. Let’s go through the entire process of creating
a database. Assume our database containing customer sales
data has not been set up yet, ok? So, imagine you are the shop owner and you
realize you have been selling goods quite well recently, and you have more than a million
rows of data. What do you need, then? A database! But you know nothing about databases. Who do you call, an SQL specialist? No. You need a database designer. She will be in charge of deciding how to organize
the data into tables and how to set up the relations between these tables. This step is crucial. If the database design is not perfect from
the beginning, your system will be difficult to work with and wouldn’t facilitate your
business needs; you will have to start over again. Considering the time and data (…and money!)
involved in the process, you want to avoid going back to point 0. What do database designers actually do? They plot the entire database system on a
canvas using a visualization tool. There are two main ways to do that. One is drawing an Entity-Relationship diagram,
an ER diagram for brief. It looks like this… and, as its name suggests,
the different figures represent different data entities and the specific relationships
we have between entities. The connections between tables are indicated
with lines. This way of representing databases is powerful
and professional, but it is complicated. We will not focus in-depth on ER diagrams
in this course, but you should know they exist and refer to the process of database design. Another form of representation of a database
is the relational schema. This is an existing idea of how the database
must be organized. It is useful when you are certain of the structure
and organization of the database you would like to create. More precisely, a relational schema would
look like this. It represents a table in the shape of a rectangle. The name of the table is at the top of the
rectangle. The column names are listed below. All relational schemas in a database form
the database schema. You can also see lines indicating how tables
are related in the database. To this moment, it has been ideas, planning,
abstract thinking, and design. At this stage, it would be correct to say
SQL can be used to set up the database physically, as opposed to contriving it abstractly. Then, you can enjoy the advantages of data
manipulation. It will allow you to use your dataset to extract
business insights that aim to improve the performance and efficiency of the business
you are working for. This process is interesting, and, practically,
the main part of the course will be related to that. Remember – well thought-out databases that
are carefully designed and created are crucial prerequisites for data manipulation. If we have done good work with these steps,
we could write effective queries and navigate in a database rather quickly. You will often hear the term database management. It comprises all these steps a business undertakes
to design, create, and manipulate its databases successfully. Finally, database administration is the most
frequently encountered job amongst all. A database administrator is the person providing
daily care and maintenance of a database. Her scope of responsibilities is narrower
regarding the ones carried out by a database manager, but she is still indispensable for
the database department of a company. All right! So… In this lesson, we will use the “Sales”
database to illustrate the concept of relational schemas a little better. The data will be stored in 4 tables – “Sales”,
“Customers”, “Items”, and “Companies”. You see how these tables have a tabular shape. Let’s see how a relational schema can be
applied to represent them. Let’s build the landscape piece by piece. We care about the sales per customer of our
company, and that’s why we have a “Sales” table. It tells us the number with which each purchase
was registered, the date of the purchase, the respective customer ID, and the item code. So, in this table, the dates of a few purchases
may coincide, because it is normal to sell some of the same goods in the same day. Analogically, in the third column, the ID
of a customer may appear a few times, given that some customers make more than one purchase. People may have bought many units of the same
product; hence, there is a possibility to see the same item code a few times in the
last column. This reasoning does not apply to the first
field, though. Each purchase is unique! In databases, this means all the numbers in
this column will be different, because each purchase will be assigned a specific number
that is unique. There is a term for such type of field. A column (or a set of columns) whose value
exists and is unique for every record in a table is called a primary key. Each table can have one and only one primary
key. In one table, you cannot have 3 or 4 primary
keys. For instance, in our “Sales” table, “number
of purchase” can act as a single-column primary key, and there will be no other primary
keys. If the database creators decide, the primary
key of a table may be composed of a set of columns, not of just one column. In the “Sales” table, both “number of
purchase” and “date of purchase” can be thought of as a unique-identifier pair
for the data in this table. For instance, purchase number 1 and purchase
date, which is the 3rd of September, will form a unique pair and so will purchase number
2 and the same date, 3rd of September. This means these 2 rows are different. Please, don’t overlook the fact you cannot
have a combination of the same purchase number, 1, and the same date, 3rd of September, more
than once in your table. Either the number or the date must be different. This is because, as we already said, a primary
key must be unique. Good! Both ways of representation make sense. Since having a one-column primary key here
would mean all purchases in our database will be recorded under a different number, this
means they can go from 1 to, hopefully, a million, or two… This logic will preserve the uniqueness of
the purchases. Ok. So, we will proceed with “purchase number”
as a single-column primary key. Awesome! Therefore, you can think of a primary key
as the field or group of fields that identifies the content of a table in a unique way. For this reason, the primary keys are also
called the unique identifiers of a table. Another crucial feature of primary keys is
they cannot contain null values. This means, in the example with a single-column
primary key, there must always be a value inserted in the rows under this column. You cannot leave it blank. Please, be careful and don’t forget this
characteristic of the primary key! Technically, to create a relational schema,
we draw a table in the form of a rectangle, and we place its name on top. Then, we enlist the fields vertically (as
opposed to their horizontal organization in a spreadsheet). The field that is the primary key of the table
is usually quoted on top of the other fields. What is more important - it is always underlined. Keeping these simple rules in mind, when you
spot this table, you will immediately know: It is called “Sales”; its primary key
is “purchase number”; and there are three other fields – “date of purchase”, “customer
ID”, “item code”. This image corresponds to tabular data in
the following form… All right? Good. One last remark about primary keys. Please, remember that not all tables you work
with will have a primary key, although almost all tables in any database will have a single-column
or a multi-column primary key. To conclude, the term for representing a table
with the help of such a rectangle and the names of the fields is called a relational
schema. When you combine the relational schemas of
the tables we have in a database, this gives us a database schema. Informally, it is ok to call the representation
of the entire database in plural - relational schemas. All right! Excellent! The “Customer ID” column does not give
away much information about the purchaser, does it? We will store her first and last names, email
address, and number of complaints submitted in another table, called “Customers”. Here is a table containing all these fields,
as well as a “Customer ID” column, which is unique; there is no need to keep the information
about the same customer more than once, right? Therefore, in this table, the “Customer
ID” field is the primary key. The other columns represent the rest of the
information about a customer – first name, last name, email address, and number of complaints. Interesting! We already have two tables – “Sales”
and “Customers”! We know they are related through the “Customer
ID” field, and as you can see, a “Customer ID” column exists in the “Sales” table. The coincidence of the two column names is
chosen purely for convenience. The same relationship could be based on columns
with different names – “Column ID” and just “ID”, for example. What is important is that the information
in the “Sales” table can be obtained through the “Customer ID” column in the “Customers”
table. Great! Let’s stick to using the same name for both
columns. All right. Here is the big trick. “Customer ID” is a primary key for the
“Customers” table and a foreign key in the “Sales” table. In the relational schemas form of representation,
relations between tables are expressed in the following way – the column name that
designates the logical match is a foreign key in one table, and it is connected with
a corresponding column from another table. Often, the relationship goes from a foreign
key to a primary key, but in more advanced circumstances, this will not be the case. To catch the relations on which a database
is built, always look for the foreign keys, as they show us where the relations are. How is this shown on the canvas? First, you should indicate that “Customer
ID” is a foreign key in the “Sales” table with F.K. in brackets next to the column
name. Second, you must draw an arrow that begins
from the same level as the field name of the foreign key and ends pointing at the column
name from the table it is related to. Don’t worry; the arrow won’t be straight
– what matters are its start and its end. Observe the following example from our “Sales”
database to avoid confusion between primary and foreign keys. In the “Customers” table, “Customer
ID” contains different values; that is, different numbers are stored on every row,
and there are no missing values. This complies with the requirement for primary
keys to contain unique values only. In the “Sales” table, though, we see under
the “Customer ID” column we can have repeating and missing values. This is ok. For instance, it is normal to expect customer
number 2 to have executed three purchases in our shop, correct? Also, it is possible to have not registered
who executed purchase number 9. The values under this field in the “Sales”
table will be mentioned in the “Customers” table, as well. If we have registered information for four
customers with IDs 1,2,3, and 4, we cannot have a value of 5, 10, or 100 here, right? This also means, if you try to insert in the
“Sales” table a record with a “Customer ID” that differs from the ones we have already
seen in the “Customers” table, MySQL will raise an error and the record won’t be stored. Fantastic! By the same token, we can create the “Items”
table. The field “Item code” appears in this
table as a primary key and in “Sales” as a foreign key. Analogically to the previous example with
the “Customer ID” column from the “Customers” table, “Item Code” contains a letter and
a number forming a unique combination in the “Items” table, while in “Sales”, the
values could repeat, as is the case with item B2, for instance. Therefore, we can use “F.K.” to indicate
an “Item Code” is a foreign key in the “Sales” table and by drawing an arrow
starting from the “Item Code” field in the “Sales” table pointing at the column
with the same name in the “Items” table. Great! Remember that primary keys were called unique
identifiers of a table? Well, foreign keys can be called identifiers,
too, but they identify the relationships between tables, not the tables themselves. Amazing! Ok, great. Now we have a good idea of what the database
will look like. However, there is one missing bit – we have
to create a connection between the “Sales”, “Customers”, and “Items” table and
the “Companies” table you created earlier. And why is the “Companies” table so important
to us? Well, I hope you can already see that “Company
ID” is a primary key in it. Spotting this must be a piece of cake for
you already. On the other hand, you can rarely have two
or more companies with the same name, but it’s not impossible, especially if they
are registered in a different state or if they operate in different industries. How about the “Headquarters Phone Number”
field? Can you have two US numbers that are completely
identical? No! When you dial a phone number, there is just
one phone that will ring somewhere in the US, right? Therefore, we can conclude that the values
in this field have to be unique. But wait! This also means that “Headquarters Phone
Number” is a primary key! Would that be possible? We just said that there is one and only one
primary key in a table, and here the “Company ID” column is a primary key! Don’t worry, there is a name for this type
of field, and it is unique keys. They are used whenever you would like to specify
that you don’t want to see duplicate data in a given field, exactly as in our example
with the headquarters phone number. There is one substantial difference between
unique keys and primary keys, which is that they can contain null values. Meaning, if we don’t know the headquarters
phone number of Company D, we can still have this record and our table will be functioning
as part of the relational database. However, it won’t give us information about
the phone number of Company D. If, instead, we remove number two as company
ID and leave the name “Company B” and the relevant phone number, SQL will display
an error message because “Company ID” is a primary key and we are obliged to provide
data in every row of this column. In addition, a table can have 0, 1, 2 or more
unique keys. This means that depending on the requirements
of the database, a table may or may not contain one or more unique key columns. This is different with respect to the characteristics
of the primary key constraint: remember, you can have only one primary key per table. Furthermore, what is similar between the two
types of constraints is that both can be applied to multiple columns, not just to a single
column. In other words, the unique key constraint
can be defined by a single field from the table it is applied to, or it could comprise
a few of the columns in that table. Both scenarios are possible. Wonderful! This completes the creation of the entire
“Sales” database scheme! All right! Excellent! This is a lecture in which we will try to
illustrate that relationships between tables can be categorized. We will not explore all types of relationships
in detail given that this topic is rather theoretical and time consuming in its entirety. Instead, we will study the main types of relationships
you will likely need in your workplace. Relationships tell you how much of the data
from a foreign key field can be seen in the primary key column of the table the data is
related to and vice versa. What do we mean by that? As we already discussed, the “Customer ID”
column is a primary key of the “Customers” table. This means it contains only unique values
– 1, 2, 3, and 4. The same information about customer ID can
be found in the “Sales” table as a foreign key, but you will likely have a lot more than
4 rows there. Hence, the values from 1 to 4 can be repeated
many times because the same customer could execute more than one purchase. This is an example of a ‘one-to-many’
type of relationship: one value from the customer ID column under the “Customers” table
can be found many times in the customer ID column in the “Sales” table. In the relational schema, this is shown by
assigning the correct symbols at the end of the arrow. You should always read the symbols according
to the direction of the relationship you are exploring. For instance, think of it this way: a single
customer could have made one purchase, but she could have made more than one purchase
too! Therefore, the second symbol, which is next
to the rectangle, shows the minimum number of instances of the “Customers” entity
that can be associated with the “Sales” entity. When this symbol is a tiny line, it means
“one”. The symbol located next to the rectangle indicates
the maximum number of instances that can be associated with the “Sales” entity. The angle-like symbol stands for “many”. Let’s check the relationship in the opposite
direction – for a single purchase registered in the “Sales” table a single customer
can be indicated as the buyer. So, the minimum number of customers you will
find in the “Customers” table for whom we have such information as name, email, and
number of complaints, and that corresponds to a single purchase in the “Sales” table,
is one. At the same time, we just said that for a
given purchase we cannot have more than one buyer, meaning that the maximum number of
instances from “Sales” associated with “Customers” is also one. Therefore, for every purchase, we can obtain
the name, email, and number of complaints data for one customer, and we represent this
logic by drawing а line for the minimum, and а line for the maximum. The two symbols in closer proximity to the
rectangles form the relationship between the “Customers” and the “Sales” tables,
according to the specified direction. In our case, it is correct to say that the
“Customers” to “Sales” relationship is one-to-many, while “Sales” to “Customers”
is many-to-one. Great! The symbols showing us relationship limitations
are called cardinality constraints. There are other symbols that can be used too
– M or N for infinite associations, or a circle for optional instances, which would
have been the case if it wasn’t necessary for a registered person to have purchased
an item. There are some other types of relationships
as well – one-to-one, many-to-many, and others. This is information that we share with you
for general knowledge, as this is a specialized topic which is of interest mainly to advanced
users. To sum it up, relational (or database) schemas
do not just represent the concept database administrators must implement. They depict how a database is organized. They can be thought of as blueprints, or a
plan for a database, because they are usually prepared at the stage of database design. Drawing a relational schema isn’t an easy
job, but relational schemas will help you immensely while writing your queries – a
neat and complete visualization of the structure of the entire database will always be useful
when retrieving information.