Today we are going to run SQL queries against
a table containing ten THOUSAND records. {{ Maniacal laughter }} {{ Phone call }} What is it, I’m in the middle of a video… You don’t say? ALL in RAM?… Well, alrighty then... Today we are going to run SQL queries against
a table containing one .. Hundred .. MILLION records. {{ Maniacal laughter }} But don’t worry. By using indexes, we can rapidly speed up
queries so you do not have to experience the phenomenon known as boredom. We will work with a single table called ‘person’
containing 100 MILLION randomly generated people. The first row is an auto-generated primary
key called “person_id” The other columns are first_name… last_name…
and birthday. To create this table, we randomly generated
names using the 1000 most popular female names, male names, and last names in the United States. We did not weight the names by frequency when
generating our random sample. The datasets and the Python code used to generate
the random names are available for download from Github. Today, we are going to pay close attention
to how long each query takes to execute. So on the right, we will maintain a list of
queries and how many milliseconds each query took to complete. These values will vary depending on the specifications
of your workstation. We will begin by counting the number of rows
in the “person” table… There are indeed 100 MILLION records. Notice how long it took to perform this simple
count. 3 seconds and 562 milliseconds. Or 3,562 milliseconds. Let us put this query and the time required
in our table. As another speed test, count the number of
people with the last name “Smith.” Execute Over 100,000 Smiths... And the query took 4,261 milliseconds. For a second test, count how many people are
named “Emma.” Execute… Nearly 50,000. And note the time: 4,066 milliseconds. What if we count the number of people born
in May, 1980? For this we will use the BETWEEN operator. Execute… 166,000... And the query took 4,480 milliseconds. A pattern is starting to emerge. Let’s quickly gather a few more speed samples. How many Michaels are in our database? 4,127 milliseconds How many people have the last name
‘Hawkins’ or ‘Snow’? 5,315 milliseconds And how many people were born on April 1, 1995? 4,110 milliseconds. We ran 7 queries, and the execution times
were all very similar. This is because in order to answer each query,
the database had to scan all 100 MILLION rows to check each record for a match. Our 7 scans of the entire table took a combined
total of 29,921 milliseconds. This means on average my workstation was able
to scan 23,394,940 rows per second. While this may seem fast, we can do better… MUCH better… And we will do this by building an INDEX. The name ‘index’ was chosen because of
how an index works in a book. For example, if you want to learn about “normal
distributions”, you would not flip through your entire math book page by page, scanning
for the term “normal distribution.” Instead, you would go to the index, find the
term “normal distribution”, and you would see the list of pages. This technique is used by databases. When you create an index, the database will
generate a method to rapidly find data based on one or more columns. As our first example, let us create an index
on the “person” table. To do this, write “CREATE INDEX” and then
give your index a name. There are many different naming conventions
that you can use. The important thing is to be consistent. We will use the table name, followed by the
field name, then add “idx” to indicate this is an index. Next, write “ON” followed by the table
name. Inside parentheses, write the columns to include
in the index. We will start simply and create an index for
the “first_name” column. Execute. That certainly took a while. Remember, the database had to scan 100 MILLION
rows and build a “first name” index from scratch. Let’s test our new index by repeating an
earlier query where we counted the number of people named “Emma.” Execute. This time, it took only 508 milliseconds. If we repeat the count of people with last
name “Smith”, it takes 4,415 milliseconds. What’s going on here? This is about as long as it took last time. It did not speed up because our index was
built for the first name column. The last name was not indexed. And as you might suspect, if we search for
people born on April 1, 1995, the query should not run any faster than last time. Execute. 4,011 milliseconds. No real change. Our index improved the speed of searches by
first name, but had no effect on queries by last name or birthday. However, there are other queries where the
“first name” index will improve performance. For example, count the number of people named
“Julie Andrews.” Execute. This ran in just 514 milliseconds. By contrast, if you count people with last
name ‘Andrews’ and born on June 12, 1992... It takes 4,274 milliseconds. The full name query benefited from the first
name index, because the database was able to use it to quickly find all the “Julies.” From this smaller set of records, it then
completed the job by finding people with last name ‘Andrews’. The second query, however, had no index to
help. It had to perform a full scan of the table
to find all people with last name “Andrews”, and then find those with a matching birthday. Fortunately databases are clever about how
to use indexes. For example, if you count the number of people
born on October 31, 1985 with a first name of “Mia”... It executes quickly: 512 milliseconds... But why? Wouldn’t the database first scan all rows
to find people with a matching birthday, and then filter by first name? No! Just because we wrote it in this order does
not mean the database will do its work in that order. The database will consider all possible ways
to execute your query, then chooses the optimal path. We call each possibility a “query plan”,
and the “query optimizer” picks what it thinks is best. You’ll be happy to know a database can have
more than one index. Better still, each index can be created using
more than one column. Let us create a multi-column index. Write “CREATE INDEX”, then an overly long
but descriptive name. Next, write “ON” then the table name. In parentheses, list the columns you want
to go into this index. Important Note: The order matters. Think of this as sorting the data first by
last_name, then by first_name Execute… It is now time to take this new index for a test run. How many people named “John Williams”
are in the database? Execute… There are 36 people. And it only took 27 milliseconds to run this query. What about Julie Andrews? 26 milliseconds. Well done, index.. Well done... A word of caution. Indexes are not free. Just as indexes in a book require paper, indexes
in a database require storage. Furthermore, when you add data to a database,
it needs to create new records AND update all relevant indexes. So go forth and use indexes, but use them wisely. And before you go, please help us achieve
our goal of being the first channel with
One Hundred BILLION subscribers! {maniacal laughter
hahhahaa please subscribe, really}