Good day, everyone. This is Dr. Soper, here. And today we will be
exploring our seventh topic in our series of
database lectures with this lecture focusing
on database indexes. To begin, I want to emphasize
the importance of indexes to the overall performance
of database queries. Of all of the strategies that
we have for improving database performance, many
people, including myself, would argue that
database indexes are the single most critical tool
that is available to a database administrator for improving
the query performance of the database. An index itself is
simply a data structure of some sort that
contains an organized copy of some of the
data from one or more of the tables in the database. And this data structure can
be used to vastly improve query performance. Conceptually speaking, you
can think of a database index just as you might
think of an index at the back of a textbook. Imagine, for example, that I
gave you a chemistry textbook and I asked you to
find information about beryllium, which
is the fourth element on the periodic table. Without an index
or some other form of organizational
framework for finding the information in which
you are interested, it would take an extremely
long time for you to locate the information on
beryllium that I had requested. In this case, without
an index, your strategy for locating the
desired information would probably involve
flipping through the pages of the chemistry
textbook, one by one, scanning the content
of each page, and looking for information
about the element beryllium. If, however, you had an
organizational framework for the content of the textbook,
such as an index that you might find at the back
of the book, you could simply look through
the index, which in this case would almost certainly be
organized alphabetically, find beryllium,
and then the index would tell you which
pages within the textbook contained information
about beryllium. You could then
immediately skip directly to the pages in
the textbook which contain the information in
which you are interested. I hope you can
intuitively understand how an index can thus vastly
improve the speed with which SQL queries can be answered. When teaching people
about database indexes, I always like to begin
with an intuitive overview. Consider the table
of names that are shown on the right
side of your screen. You can see that we
have 22 rows of data. In the first column we see
the row position, from 1 to 22. And in the second column we
see a person's last name. And again, the last names
here are randomly ordered. Now imagine that we choose a
name at random from this list. If we were to begin
at the top of the list and move toward the bottom,
examining one row at a time, how many rows would
we need to inspect before we would be able to
find our randomly chosen name? Well the answer, of
course, is, it depends. And in this case it
depends upon the name. If, for example, we
had chosen Salehian as our randomly selected name,
and we started at the top, moving down, we would
find our desired name after inspecting
only three rows. If, on the other hand, the
randomly chosen name was Pham, we would need to inspect
20 rows before we would locate our desired name. Thus, the performance
of our search strategy varies widely and depends
upon the name which is chosen and the random order in which
the names happen to appear. A more useful metric for
assessing this search strategy then, would be to ask
the question, what is the average search
time if we were to repeat this process many, many times? So if we were to choose
one of these names randomly and go through
this linear search process of starting
at the top, working our way toward the bottom,
searching for our target name, if we were to repeat that
process over and over and over again, constantly
just choosing random names, what would the average
search time be? And in this case,
by search time, I'm referring to
the number of rows that we would need to inspect in
order to find our desired name. Well this search strategy
is known as a linear search. And the answer to our question
is that the average search time would be equal to
n plus 1 divided by 2, where n is the number
of rows in the table. In our example here, we
have 22 rows in the table and therefore, the
average search time to find any randomly
chosen name would be equal to 22 plus
1 divided by 2, which works out to 11.5 rows. So on average then, using
this linear search process, we would need to inspect
11.5 rows in order to find the name in which
we are interested. Another good metric for
assessing the efficiency of this search strategy
would be to ask, what is the maximum search time? That is, at most, how
many rows would we need to inspect to find
a randomly chosen name. And in this sort of
linear search strategy, if, by chance, we happen
to randomly select the last name in the
list, in this case, the last name in
the list is Israr, then we will need to inspect
all 22 rows before we locate the name in
which we are interested. Thus, in this sort of
linear search strategy, the maximum search
time will be n, where n is the number of rows. In our example, the maximum
search time would thus be 22.
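
To make the arithmetic concrete, here is a small Python sketch, not taken from the lecture slides, that simulates this linear search over a randomly ordered list of 22 placeholder names. The names themselves are invented stand-ins; the point is that the simulated average comes out very close to (n + 1) / 2, which is 11.5 rows for n = 22, and the worst case is all 22 rows.

```python
import random

# A stand-in for the 22 randomly ordered last names on the slide.
names = [f"Name{i:02d}" for i in range(1, 23)]
random.shuffle(names)

def linear_search(rows, target):
    """Inspect rows from top to bottom; return how many rows were examined."""
    for count, value in enumerate(rows, start=1):
        if value == target:
            return count
    return len(rows)  # target not present: every row was inspected

# Repeat the experiment many times, choosing a random target name each time.
trials = 100_000
total = sum(linear_search(names, random.choice(names)) for _ in range(trials))
n = len(names)
print("average rows inspected:", total / trials)  # about 11.5
print("(n + 1) / 2           :", (n + 1) / 2)     # 11.5
print("maximum rows inspected:", n)               # 22
```
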
Next, let's explore a different sort of search strategy. In this case, rather than having
a table in which the names were ordered randomly,
we instead have a table in which the names
are ordered alphabetically. And as we will see,
when we have this sort of organizational structure
imposed on the data, in this case, an
alphabetical sort, that one change alone
can vastly improve the speed with
which we can locate any randomly chosen name. Now that we know that
the names in our table are listed in
alphabetical order, what sort of search
strategy could we use that could improve
our performance? If we were to rely on the
same linear search strategy that we used in the
previous example, we would not
experience any gains in our average search time. And the reason why is,
with a linear search, we are not capitalizing on the
fact that our table now has an organizational structure. That is, the names are
listed in alphabetical order. Because we have an
ordered list of data, we can instead employ something
called a binary search strategy. And a binary search
strategy is much more efficient at allowing us to
locate any randomly chosen name because it capitalizes
on the knowledge that the names in the table
are in alphabetical order. So let's see how this binary
search strategy might work. Imagine that we
randomly choose a name. Again, we will choose
the name Salehian. The strategy involved
in a binary search might be described as
divide and conquer. So knowing that the names
are in alphabetical order and knowing how many rows there
are in the table, the best place for us to
begin our search is in the middle of the table. In this case, because we
have an even number of rows, there is no precise
middle row in the table. So we may select the row
location at n over 2 plus 1 to begin our search. Since we have 22
rows here, that would mean that we begin our
search by examining the value of row 12,
which in this case, is the last name, Mishra. So when we examine this row,
we will ask the question, is this the row that
we are looking for. And since we know we're
looking for Salehian and the row that we are
currently examining is Mishra, the answer to that
question is, no. But because we know
that the names are in alphabetical order,
we also now know that the row which
we are seeking is below row 12 in the table. Thus, after examining
a single row, we were able to
eliminate more than half of the possibilities
of where our randomly selected name might appear. That is, after
inspecting our first row, we now know that the
name we are seeking exists somewhere
between rows 13 and 22. We next repeat
the process again. We look at the row which lies
in the middle of our remaining set of rows. In this case, we again have an
even number of rows remaining. So we might look at the row that
is at location n over 2 plus 1. And since we have 10
rows remaining, that would mean the next place that
we will look will be at row six among our remaining set of rows. In this case, that means
the second location, where we would look for
our target name, would be the row which
contains the last name Spievak. And again, we would ask
ourselves the same question that we asked before, to wit: is this the row that we're seeking? Again, since we are
seeking the name Salehian, the answer to that
question is, no. And because the names are
in alphabetical order, we are now able to eliminate
Spievak and all of the rows that follow as possible
locations for our target name. Thus, after inspecting just
two rows in this table, we've been able to narrow
down the set of possibilities for where our target name
resides to just five rows. We then repeat the
process once again. Selecting the row in the
middle of our remaining rows. And since in this case, we
have five rows remaining, the row in the
middle is Salehian. We would again, ask the
question is this the row that we are seeking. The answer is yes. And therefore, we have
completed the search process. In this case, we only needed to
inspect three rows in the table in order to find the randomly
chosen name in which we were interested. But again, a better metric
of the relative performance of the search strategy is to
ask, what is the average search time if we were to
repeat this process many, many times, choosing a
name randomly each time for which to search. With this sort of
binary search strategy, mathematicians and
computer scientists have been able to prove
that the average search time will be log
base 2 n minus 1, where n is the number of rows. In our example, we have 22
rows so the average search time would be log
base 2 of 22 minus 1, which means on average,
we need to inspect approximately 3.5
rows in order to find any randomly chosen name using
this binary search strategy with an ordered table of data. That's much, much
better than the average from our previous linear search
strategy, which if you recall, required us to inspect
11.5 rows on average. As with our previous
example, we can also consider the maximum
search time required. And what is remarkable about
this binary search strategy is that the maximum
search time is simply equal to log base 2 n. So whereas the average is
log base 2 n minus 1, here the maximum search
time is log base 2 n. That is just one more
search than the average. Thus, with 22 records,
the maximum number of rows that we would need to
inspect before we would locate any randomly chosen name,
would be log base 2 of 22, or approximately 4.5 rows
that we would need to inspect.
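
As a companion to the earlier linear-search sketch, here is a hedged Python version of the binary search experiment over an alphabetized list of placeholder names. Because the formulas in the lecture are approximations, the simulated numbers will not match them exactly, but they land in the same neighborhood: an average of a little under four rows and a worst case of five rows for n = 22, versus 11.5 and 22 rows for the linear search.

```python
import random

# 22 placeholder names, stored in alphabetical order.
names = sorted(f"Name{i:02d}" for i in range(1, 23))

def binary_search(rows, target):
    """Divide and conquer: return how many rows were inspected to find target."""
    low, high, inspected = 0, len(rows) - 1, 0
    while low <= high:
        mid = (low + high) // 2       # examine the middle of the remaining rows
        inspected += 1
        if rows[mid] == target:
            return inspected
        if rows[mid] < target:        # target sorts later: discard the top half
            low = mid + 1
        else:                         # target sorts earlier: discard the bottom half
            high = mid - 1
    return inspected                  # target not present

trials = 100_000
total = sum(binary_search(names, random.choice(names)) for _ in range(trials))
print("average rows inspected:", total / trials)   # close to log2(22) - 1
print("maximum rows inspected:",
      max(binary_search(names, name) for name in names))  # log2(22) rounded up, i.e. 5
```
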
So the purpose of comparing these two search strategies, a linear search versus
a binary search, is to demonstrate the massive
gain in search performance that we can achieve if we impose
an organizational structure on the data through
which we are searching. In this case, that
organizational structure was alphabetizing
the list of names. Next, let's consider a few of
the major concepts associated with database indexes. As we saw in the previous
intuitive examples, without an indexing
strategy in place, the database has no
choice but to perform a table scan when it
is asked to locate one or more rows within a table. That is, in the
absence of an index, the database simply has to
adopt a naive approach, where it starts at the first
row in the table and says, is this the row I'm looking for? No. It moves on to the next row. Is this the row I'm looking for? No. And so forth until it finally locates the row it has been seeking. With an index in place,
the DBMS no longer needs to rely upon
such a naive strategy. And it can locate the desired
information much more rapidly. Indexes, then, can be created
on one or more columns within a database table. Each table can
contain potentially many different indexes. And each column within
a table might belong to many different indexes. As an example, imagine
that we have a table and we want to create an index
on the primary key column. So we instruct the DBMS
to create the index. And the contents of the index
will be the primary key value for each row in the table,
along with the row's ordinal position. That is, its row number
within the table. The primary key
values in this case would be stored
in a sorted order. So let's say that
they are stored in an ascending numerical order. Thus, whenever a query is run
against the database which involves the primary
key for this table, the DBMS can search through
the index rather than the table itself to find
the primary key value using the sort of binary
search strategy that we described previously. And then it can quickly
locate the row position, that is, the ordinal position of the target row within the table.
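
As a rough illustration only, and not the way any particular DBMS actually lays out its index pages, we can model such a primary key index in Python as a sorted list of (key value, row position) pairs and binary-search it with the standard bisect module. The customers table and its values here are entirely hypothetical.

```python
from bisect import bisect_left

# A hypothetical, unsorted table: (primary key, last name) per row.
customers = [(1007, "Pham"), (1001, "Salehian"), (1004, "Mishra"), (1002, "Spievak")]

# The "index": a sorted copy of the primary key values, each paired with the
# ordinal position of its row in the table above.
pk_index = sorted((pk, row_pos) for row_pos, (pk, _) in enumerate(customers))
index_keys = [entry[0] for entry in pk_index]

def find_by_primary_key(pk):
    """Binary-search the index, then follow the row position into the table."""
    i = bisect_left(index_keys, pk)
    if i < len(index_keys) and index_keys[i] == pk:
        row_pos = pk_index[i][1]      # the pointer stored in the index
        return customers[row_pos]     # one direct read of the actual table
    return None                       # primary key not present

print(find_by_primary_key(1004))      # (1004, 'Mishra')
```
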
So again, this is just like looking in the index in the
back of a textbook. If you know that the
values in the index are in alphabetical
order or numerical order, you can easily locate
the information you want. And there will be a
pointer which tells you on which page of
the textbook you should look for your
desired information. Again, although a table can
contain potentially many different indexes and although
certain columns might be involved in many
different indexes, there are certain types of
columns upon which an index cannot be created. And the determination as
to whether an index can be created on a
particular column depends on the
column's data type. Columns which have what we call
a large object data type cannot be indexed directly. There are some strategies, such
as using a hashing algorithm, which I will
describe later, that can be used to index these
sorts of large objects. However, generally speaking,
they cannot be indexed directly. In SQL Server, the
large object data types include text, ntext, which as we know is a data type that supports
unicode text, the image data type, varchar(max),
nvarchar(max), which, again, is a variable
length character string that supports unicode characters. And of course,
varbinary(max), which can be used to store binary
encoded data directly in the database. So these six data types are all
considered large object data types and columns which have one
of these data types cannot be directly indexed. Another important concept
to understand about indexes is that we are
making a trade-off. That is, although when
we properly use an index, we can vastly improve the
performance of our queries, the costs that we incur for that performance increase are additional storage
space requirements. And of course, all of the
additional maintenance requirements that come
along with the index. Put simply, the reason
that using an index increases the storage space
requirements of the database is that the index
contains a copy of some of the data within a table. And we need to store that copy
of the data on the file system. Let's see an example
using our list of names from earlier in this lecture. Let's imagine that the table
on the left is an index table. While the table on the right
is our actual table of data from within the database. As you can see,
the index table is storing a copy of some of the
data from the actual table. In this case, it's storing a
copy of all of the last names because we had
instructed the database to construct an index on
the last name attribute. Although our database now
requires more storage space, we can reasonably argue that
the performance gains realized through the expense of
this extra storage space are well worth it. As we saw previously,
if we were to use a binary search
against this index, we could locate any
randomly chosen name by inspecting just 3.5
rows within the table. And then the index
would contain a pointer, which tells us precisely where
to look in the actual data table for the row that
we've been seeking. Because indexes consume
extra storage space, it's important for a
database administrator to know how to calculate
how much extra storage space an index will require. A simple way of estimating
the required storage space for an index
is simply to multiply the number of rows in a table
by the average number of bytes that are required per row
for the indexed columns. As an example, imagine
that we want to create an index on two columns, named Last Name
and Department ID. And we know that
within our table, Last Names require an
average of 16 bytes per row while Department IDs require an
average of two bytes per row. And let's say that our
table contains 98,000 rows. We can then easily estimate
the storage space requirements for our index by multiplying
the number of rows, 98,000, times the average
number of bytes required per row for the indexed
columns, in this case, 16 bytes for each
last name on average. And two bytes for each
Department ID on average. So we multiply 98,000
times 18 and we determine that our index will
require about 1,764,000 bytes of extra storage space.
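
Here is the same back-of-the-envelope estimate written as a tiny Python sketch. The column names and per-row byte counts are just the assumed figures from the example above, not measurements taken from a real table, and real indexes carry some additional per-page overhead that this ignores.

```python
def estimate_index_size(row_count, avg_bytes_per_indexed_column):
    """Rows in the table times the average indexed bytes per row."""
    return row_count * sum(avg_bytes_per_indexed_column.values())

# Assumed averages: 16 bytes per Last Name, 2 bytes per Department ID.
size = estimate_index_size(98_000, {"LastName": 16, "DepartmentID": 2})
print(size)  # 1764000 bytes, roughly 1.7 MB of extra storage
```
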
Next, we'll talk about a few specific types of index structures. By far, the most common type of
index uses a B-Tree structure. And the B here
stands for balanced. So this is a balanced tree. In a B-Tree structure, we use
pointers and layers of nodes in order to organize our data. And using this
structure, we can quickly locate any data that we desire. Conceptually speaking,
traversing a B-Tree index is exactly the same as
the binary search strategy that we described earlier. So a B-Tree is arranged around
several layers of nodes. We begin at the root node level. And then we might pass
through one or more layers of intermediate nodes
until ultimately, we arrive at the leaf nodes. And it is at the
leaf nodes where we are able to get the
information that we need for answering whatever
query has been asked. Let's see an example
of how this works. So here we have a sample
balanced tree index. And again, we can give
ourselves the task of locating any randomly selected name. So I will, once
again, choose Salehian as my randomly chosen name. Beginning at the
root node, we then can immediately traverse to
the right side of the tree because we know that our
randomly selected name begins with a letter which
falls within the range M to W. We can then
move immediately to the intermediate
node for S. After which we arrive at the leaf nodes and
we can locate our desired name. Recall that a B-Tree
index is a balanced tree. And I hope you can
see in this example where the term balanced comes from. The objective, when constructing a B-Tree index, is to subdivide the data
as evenly as possible in a balanced way. And in following
this strategy, we can maximize the
search performance.
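
A real B-Tree node holds many keys and is shaped around the database's page size, so the following Python sketch is only a toy under those simplifying assumptions: a hypothetical two-level tree whose root holds separator letters and whose leaf nodes map last names to row positions in the table.

```python
# Leaf nodes: alphabetized names mapped to their row positions in the table.
leaves = [
    {"Ali": 4, "Brown": 9},          # names that sort before "M"
    {"Mishra": 12, "Pham": 20},      # names from "M" up to (but not including) "S"
    {"Salehian": 3, "Spievak": 17},  # names from "S" onward
]

# Root node: separator keys that tell us which child leaf to descend into.
root_separators = ["M", "S"]

def btree_lookup(name):
    """Descend from the root to the correct leaf, then read the row position."""
    child = 0
    while child < len(root_separators) and name >= root_separators[child]:
        child += 1                    # move right past each separator we meet or exceed
    return leaves[child].get(name)    # None if the name is not in the index

print(btree_lookup("Salehian"))       # 3
```
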
Before we move on to our next index structure, I want to speak
briefly about clustered versus nonclustered indexes. In a clustered index, we can
store the actual data rows themselves, that is, the data rows that, together, comprise the actual data table
in the database, at the leaf level of the index. And this can improve search time
because the database will not need to follow a pointer
to the actual data row within the table. At the leaf level of
the index, the data rows are already there. Because we are storing
the actual data rows that comprise the
data table at the leaf level of the index however,
this imposes a constraint such that we can have only one
clustered index per table. Primary key columns are,
therefore, excellent candidates for clustered indexes. Returning to our previous
example of a balanced tree index, imagine that at the
level of the leaf nodes, we are not simply
storing pointers here to the rows within the
actual database table. But instead, we are storing the
actual data rows themselves. Another way of
thinking about this is that in a clustered index,
we are breaking the actual table apart and storing its
rows in a specific way such that we can maximize
our search performance. In this case, the rows
become the leaf nodes at the bottom of
the B-Tree index. By contrast, in a nonclustered index, the leaf nodes contain the values
of the index columns and a pointer, which
tells the database where to look in the actual
database table, for the target row or rows. And of course, the
actual data row itself might be stored as a
leaf node of a clustered index, as we saw previously. Or it might just be stored
in something called a heap, where a heap is just a simple, ordinary table that does not use
a clustered index. A few important points to note
about nonclustered indexes are that nonclustered
indexes are slower than clustered indexes. Because again, in a
nonclustered index, the database must follow a
pointer to the actual data row. An advantage of nonclustered
indexes however, is that each table in a database
can contain more than one nonclustered index. And another advantage
of a nonclustered index is that the leaf nodes
of a nonclustered index are allowed to contain values
from non-indexed columns. And using this strategy
can often substantially improve query performance. Because an artfully
designed nonclustered index, which contains values
from non-indexed columns, can be used by the DBMS
to answer certain queries without actually having to
look at the real table of data itself. For example, imagine
that I have a product table where I have indexed
the product name attribute. Now let's imagine,
as well, that I have included the price
attribute as a part of the nonclustered index. Now if I run a query
where I'm attempting to find the price of
a particular product, the database can locate the
product within the index. And then it will immediately
know the price of that product without actually having to look
in the real data table itself. So it will be able
to answer that SQL query without ever having to
look in the actual data table. I hope you can
understand how this can improve query performance.
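
As a rough sketch only, here is the flavor of that idea in Python. The product names, prices, and row numbers are invented, and a real nonclustered index lives in B-Tree pages rather than a dictionary, but the principle is the same: the index entry carries a copy of the included column, so the query never touches the table.

```python
# Hypothetical product table: row position -> (name, price, long description).
products = {
    0: ("Widget", 4.99, "A very long description..."),
    1: ("Gadget", 19.95, "Another long description..."),
    2: ("Sprocket", 2.50, "Yet another description..."),
}

# Nonclustered index on product name. Each entry keeps a pointer (row position)
# plus an *included* copy of the price, even though price is not an index key.
name_index = {
    name: {"row": row, "price": price}
    for row, (name, price, _) in products.items()
}

def price_of(product_name):
    """A query asking only for the price is answered from the index alone."""
    entry = name_index.get(product_name)
    return None if entry is None else entry["price"]  # no table lookup needed

print(price_of("Gadget"))  # 19.95, without ever reading the products table
```
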
Next, let's talk about a different index structure. And in this case, we'll
look at a bitmap index. So in a bitmap index,
we create a table where we have the values
of one attribute listed along the horizontal axis. And the values of
another attribute along the vertical axis. Each cell value contains
a bit, that is, simply a value of 1 or
0, which indicates whether the value
of one attribute is associated with
the value of another. Important things to note
about bitmap indexes include that bitmap
indexes work best when one or both of
the attributes involved in the bitmap index has only
a relatively small number of unique values. Also, when they
are used properly, a bitmap index can require
only about one quarter of the storage space
of a tree-based index and can also be about
10 times faster. Let's see an example. So here we have two attributes,
Student ID and Student Grade, where a Student
Grade exists along the horizontal axis
of the bitmap index and Student ID exists along
the vertical axis of the bitmap index. And we can see all
the possible values for Student Grade are listed,
A, B, C, D, and F. And the bits, that is, the 1s
or 0s, which exist at the intersection
of a Student Grade and a Student ID indicate
to us whether or not that student is associated
with that particular grade. So if I write a
query which says, give me the Student IDs
of all of the students who received A's, the database
could use a bitmap index, such as the
one depicted here, and perform bit-wise
comparisons, which are computationally
extremely fast, in order to answer our query.
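
The following Python sketch uses made-up student IDs and grades to show the flavor of a bitmap index: one bit string per grade value, one bit position per student, so a question like "which students received an A" reduces to reading one bit string, and compound conditions reduce to bit-wise AND and OR operations.

```python
students = [101, 102, 103, 104, 105, 106]   # bit position i corresponds to students[i]
grades   = ["A", "C", "A", "B", "F", "A"]   # the grade earned by each student

# Build one bitmap (an integer used as a bit string) per possible grade value.
bitmaps = {grade: 0 for grade in "ABCDF"}
for position, grade in enumerate(grades):
    bitmaps[grade] |= 1 << position          # set this student's bit for this grade

def students_with(grade):
    """Decode a grade's bitmap back into the matching student IDs."""
    bits = bitmaps[grade]
    return [students[i] for i in range(len(students)) if bits & (1 << i)]

print(students_with("A"))                    # [101, 103, 106]
# A compound condition such as "A or B" is just bitmaps["A"] | bitmaps["B"].
```
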
Next, let's look at a third type of index, which we can call a hashed index. Conceptually speaking, the
idea with a hashed index is that we use a hashing
algorithm in order to convert some kind of
input value or input data into a location within an index. And that location within
the index, in turn, will contain a pointer
to the actual data row within a data table. These hashing
algorithms typically work by using prime numbers
as input along with something called a modulo operation, which
returns a remainder in order to generate a location
within an index. These types of hash indexes are
useful in several situations. Notably, in very
large databases where parallel processing or a
distributed database model is being used. A hashing algorithm can be
used to quickly determine which database server or
which processor is associated with the desired row or rows. And we can also
use hashed indexes in situations where we need
to index complex objects, such as images or other
types of large objects. Here, we see an example of
how a hashing algorithm can be used to allow us to
index complex objects, such as images. So we begin by running the image
through the hash algorithm. And the result of that operation
is a location within an index. So let's say that we run an image through the hash algorithm and it gives us a location
within the index of five. We can then traverse
our balanced tree until we get down to index location five, which then tells us to look in row nine
of the actual data table for the data in which
we are interested.
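
Here is a deliberately simplified Python sketch of that idea. The hash function, the prime bucket count, and the row number are all invented for illustration, and a production hash index must also deal with collisions, which this sketch ignores.

```python
BUCKETS = 7  # a prime number of index locations helps spread the hash values out

def hash_location(data: bytes) -> int:
    """Map arbitrary input bytes (for example, an image) to an index location."""
    value = 0
    for byte in data:
        value = (value * 31 + byte) % BUCKETS   # 31 is a small prime multiplier
    return value

# The hash index: index location -> row number in the actual data table.
hash_index = {}
image_bytes = b"...binary content of some image..."  # stand-in for a large object
hash_index[hash_location(image_bytes)] = 9           # this image's row in the table

# Later, to find the image's row again, we re-hash it and follow the stored pointer.
print(hash_index[hash_location(image_bytes)])        # 9
```
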
To finalize our lecture today, I just want to draw your attention
to some index considerations and some guidelines for using
and implementing indexes. First, remember that indexes
require additional storage space. And because of this,
we should generally only create indexes
on columns that are involved in common queries. That is, columns which commonly
appear in the WHERE clause of a SQL query, or perhaps in an ORDER BY or GROUP BY clause. This also means that
the database designer must know which
sort of queries will be run against the database
if he or she is to design an effective indexing strategy. A second point to
consider is that indexes should be used sparingly on
tables that are subjected to frequent updates. To understand why,
consider that whenever an insert operation,
an update operation, or a delete operation occurs
which affects an index column, then the index for that column
has to be rebuilt by the DBMS. Thus, if we have a table
which is updated frequently, then we are also having to
frequently rebuild the index. Because rebuilding
indexes takes time, if we create a lot of indexes
on tables that are frequently updated, the effort
required by the database to constantly
rebuild those indexes may actually slow the overall
performance of the database. Finally, just a few indexing
guidelines that I hope you will find useful when
designing an indexing strategy for your database. Again, if a table
is heavily updated, try to index as few
columns as possible. If you over-index a
heavily updated table, it may actually slow
the overall performance of the database rather than
improving the performance. If, however, a table
is updated rarely, then, assuming that you have
the storage space available, feel free to use as many
indexed columns as necessary to support the common
types of queries that are run against that table. Doing this will substantially
improve the performance of those queries. With respect to clustered
versus nonclustered indexes, remember that clustered indexes
are most effectively used on columns that do
not allow null values and whose values are unique. Because, by definition,
primary key columns contain unique values and values that are not allowed to be null, primary key columns are therefore good targets
for clustered indexes. And finally, the
performance benefits that you can achieve
through the use of an index are related to the
uniqueness of the values in the indexed column. That is, if an indexed
column contains many, many duplicate values,
the performance of the index will be quite poor. On the other hand,
if the indexed column contains many unique
values, that's when you can achieve the maximum
benefit from using an index. Well my friends, thus ends our
overview of database indexing. I hope that you learned
something interesting. And until next time,
have a great day.