There are many different ways to put a shelf
of books in order. Alphabetically by author, or by title. By subject. Or by the colour of the spine. Those are all valid approaches
for different situations: if you’re a bookstore that has a lot of
customers coming in and saying “I don’t remember the title, but I know
it was red…” then sorting by colour is a reasonable approach. In 1876, an American librarian called
Melvil Dewey published a four-page pamphlet describing how he thought a public library
should be ordered. Before Dewey, books in many libraries were
ordered by when they were purchased. As new books arrived, they’d just be allotted
to a shelf or a box that matched their size, and there’d be a central card file somewhere
listing where everything was. You would have to ask a librarian if you wanted
a certain book, and they’d retrieve it for you. Just browsing the shelves wasn’t a thing. Dewey gave each subject an index number, and then said that the shelves should be ordered
by that number. Over the years, the Dewey Decimal system
became a standard. Dewey had solved a serious problem: how to sort and find information,
without having to bother the head librarian. And because books were
clustered by their subject, it encouraged general-interest
browsing of the shelves. Side note: while researching this, I found out that Dewey was a terrible person. Not in an “oh, but he was of his time”
way, like, he resigned from the American Library
Association because of sexual harassment complaints in 1905. How bad do you have to be in 1905 to get kicked- His sexist and racist biases made it into
the early versions of the Dewey Decimal system. Lots of libraries these days use completely
different classifications. But, yeah, we do still remember his name. Anyway, here’s the key point: by creating
a useful index and sorting by it, readers could find books more quickly. That same principle of indexing applies to
computers, and has since the early days. There are two basic ways that a
computer can search: linear and binary. Let’s say you’ve got a stack of cards. They’ve got numbers on them that are somewhere
between 1 and 50, and you want to find lucky number 13. If it’s a shuffled stack, then there’s no shortcut: you just run through, looking at one card after another
after another after another, until eventually you find number 13. That is linear search: going through every card in a stack,
every element in a list, until you find the one
or many that you’re looking for. That’ll work on any list, in any order. But most of the time, it will be slow. But what if the cards are in order by number? Sure, you have to spend some time to sort them. I’ve got a whole other video about that. But once you’ve done that, once they’re in order,
you can use a binary search. Binary not in terms of
zeros and ones on disk, but binary in terms of
splitting things in two. Look at the middle item of the list. Is it higher or lower than the number we’re
looking for, 13? So in this case, it’s lower. So we can forget about the first half of the
list, and we can do the same again. Look in the middle of what's left, forget about the side that we don’t want,
look in the middle, forget about the side we don't want,
and so on, and so on, and so on, until eventually we find the card we’re looking for, or we know it’s not in the list at all. Okay, with eleven cards, binary search doesn’t make much difference. But with a million cards, it is lightning-fast compared to linear search. Provided that the list is in order.
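Here’s a rough sketch of both searches in code (Python, with a made-up stack of eleven cards):

```python
def linear_search(cards, target):
    # Look at every card in turn until the target shows up.
    for position, card in enumerate(cards):
        if card == target:
            return position
    return None

def binary_search(sorted_cards, target):
    # Repeatedly look at the middle card and discard the half
    # that can't contain the target. Requires a sorted list.
    low, high = 0, len(sorted_cards) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_cards[middle] == target:
            return middle
        if sorted_cards[middle] < target:
            low = middle + 1
        else:
            high = middle - 1
    return None

cards = [2, 5, 8, 13, 21, 27, 30, 34, 41, 45, 50]  # eleven cards, already sorted
print(binary_search(cards, 13))  # 3
print(linear_search(cards, 13))  # 3, but only after checking four cards
```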
But a stack of cards is a simple example. A database is much more complex. Let’s say we don’t just have numbers on the cards: we also have letters, and let’s add some colours as well. We can’t sort the cards by more than one
of these properties at a time. So right now, I can use binary search for number, but not
for letter or colour. In the same way that librarians can sort their
books by title, or author, or colour, you can only optimise for
one property at a time. For everything else, you’ve got to run through
every book on the shelf, every card in the stack,
every item in the list. Unless you use indexes. Now, physically rearranging data on a hard disk,
moving the cards around, that’s called a “clustered index”. It’s like making sure your dictionary is
in alphabetical order because that's mostly how
you're going to use it. But a “non-clustered index” is like the
index at the back of a book, or a library catalogue. It’s a separate store of just the property
you’re interested in, stored in a simple, easy-to-access format. If I want to find the card with the letter
K on it? Well, I can do a binary search
through the index at speed, identify the number of the card, and then
use another binary search to go straight to it. If I want to pull all the green cards, I can pull just those out of the colour index, note down all those numbers, and then pull just those cards. There’s no linear search happening anywhere.
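As a rough sketch in code (Python, with made-up cards stored in number order, plus separate letter and colour indexes):

```python
from bisect import bisect_left

# Each card: (number, letter, colour), stored in number order ("clustered" by number).
cards = [
    (3, "K", "red"), (7, "B", "green"), (12, "A", "blue"),
    (19, "Q", "green"), (25, "J", "red"), (31, "C", "green"),
]

# Non-clustered indexes: small separate structures that map a property
# back to card numbers, kept in an easy-to-search form.
letter_index = sorted((letter, number) for number, letter, _ in cards)
colour_index = {}
for number, _, colour in cards:
    colour_index.setdefault(colour, []).append(number)

numbers = [number for number, _, _ in cards]

def card_by_number(number):
    # Binary search straight into the number-ordered cards.
    i = bisect_left(numbers, number)
    return cards[i] if i < len(cards) and numbers[i] == number else None

# Find the card with the letter K: binary search the letter index,
# then binary search the main stack by number.
i = bisect_left(letter_index, ("K",))
if i < len(letter_index) and letter_index[i][0] == "K":
    print(card_by_number(letter_index[i][1]))   # (3, 'K', 'red')

# Pull all the green cards straight from the colour index.
print([card_by_number(n) for n in colour_index.get("green", [])])
```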
Again: it doesn’t make much difference with eleven cards. But with a thousand, or a million, or a billion, that’s the difference between waiting hours
for a response to your query, and having it in a fraction of a second. Of course, each index that you add takes
time to create, and space to store. If your computer’s ever been slow because
of the “Indexing Service” on Windows, or “Spotlight Search” on Mac, that’s it
taking the time to run through all your files and index
all the properties that it can, so that when you type something
into your search bar, it’ll be able to reply as quickly
as possible. But every time you make any
change to your database, if you add, remove, or change something, you also have to keep the indices...
indexes... whichever, up to date. So there’s a skill in choosing which properties,
which fields in a database, you want to create indices for. If you’re only going to search by colour
once or twice a year, it might not be worth indexing. As with many things in computer science, there’s a trade-off between space and time.
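Here’s a small sketch of that upkeep cost (hypothetical cards and indexes again): every new card has to be written into each index, as well as into the data itself.

```python
from bisect import insort

cards = {}          # number -> (letter, colour): the data itself
letter_index = []   # sorted (letter, number) pairs
colour_index = {}   # colour -> list of card numbers

def add_card(number, letter, colour):
    # One logical change, but every index must be updated too.
    cards[number] = (letter, colour)
    insort(letter_index, (letter, number))
    colour_index.setdefault(colour, []).append(number)

add_card(3, "K", "red")
add_card(7, "B", "green")
print(letter_index)   # [('B', 7), ('K', 3)]
print(colour_index)   # {'red': [3], 'green': [7]}
```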
Now, Google say that their full search index is 100 petabytes in size, and they’re customising every result from
it for every user, usually in under a second. Even after all the changes to Google over
the years, even after all the space that they’ve given
over to knowledge boxes -- and adverts -- even after all the optimisation they’ve added
to give the user exactly what they want… Google still put the time it took to answer every query
on every desktop search result. Because they're still proud of it. Because the logistics and computer science skills
required to do that are the sort of thing that PhDs are written about. Or, because it’s a trade secret, they’re the sort of thing that commands
very, very high salaries. It’s a long, long way between Dewey
and his decimal system and the Google search result. But the basic principle, "create an index",
is the same.
Good video, though I wish he had added some visualization for the viewer to comprehend how massive the difference is between a logarithmic running time and a linear one, rather than just mentioning that it works well for very large datasets.
A linear search is called an order n (or O(n)) function, which means that the time it takes to complete grows linearly with the size of the input. If it takes a second to search through a million entries, then it will take about 10 seconds to search through ten million entries.

A binary search is an order log2(n) function, which means that the time it takes grows logarithmically with respect to the input size. If it takes a second to search through a million entries, then it will take about 1.16 seconds to search through ten million entries.

To show just what a difference a binary search will make, compare the average time taken for a linear and a binary search in the following table, assuming searching 1 item takes a second:
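Roughly, under that assumption (one second per check, with a linear search averaging n/2 checks):

    Items                       Linear search (average)   Binary search
    1,000                       ~8 minutes                ~10 seconds
    1,000,000                   ~6 days                   ~20 seconds
    1,000,000,000               ~16 years                 ~30 seconds
    1,000,000,000,000,000,000   ~16 billion years         ~60 seconds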
At a certain point, a linear search will take on average longer than the age of the universe to complete while a binary search is still under a minute.
The key to the binary search is that it's throwing away about half the values at every step, which - even after just a few steps - discards a huge proportion of the values. For example, one thousand -> 500 -> 250 -> 125 -> 62 -> 31 -> 15 -> 7 -> 3 -> 1
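Here's a quick Python sketch of that halving, and of how the check counts compare (the numbers are just illustrative):

```python
import math

def halving_sequence(n):
    # Each binary search step throws away about half the remaining values.
    steps = [n]
    while steps[-1] > 1:
        steps.append(steps[-1] // 2)
    return steps

print(" -> ".join(str(s) for s in halving_sequence(1000)))

for n in (1_000, 1_000_000, 10**18):
    linear_avg = n // 2                  # average checks for a linear search
    binary = math.ceil(math.log2(n))     # checks for a binary search
    print(f"{n:,} items: ~{linear_avg:,} checks (linear) vs ~{binary} (binary)")
```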
Of course, one also has to consider how long it would take to sort a million million million items. Generally speaking, most good sorting algorithms are about O(n log n) on average, which will take even longer than the linear search, but if you can insert the values in sorted order in the first place (by using a binary search to find where they go, for example) then that saves you having to sort them all after the fact.
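In Python, for example, the standard library's bisect module does that binary-search insertion (a minimal sketch):

```python
import bisect

sorted_cards = []
for value in [31, 7, 42, 13, 25]:
    # bisect.insort finds the insertion point with a binary search
    # and keeps the list sorted as each value arrives.
    bisect.insort(sorted_cards, value)

print(sorted_cards)                          # [7, 13, 25, 31, 42]
print(bisect.bisect_left(sorted_cards, 13))  # index of 13, found with a binary search
```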
There's an interesting data structure that follows a similar principle, called a binary search tree. A binary tree is a collection of connected nodes in which each node has either no child nodes, one child node, or two child nodes. The start is called the root, and nodes with no children are called leaf nodes. To construct a binary search tree (BST), you insert each value by following a path through the tree, starting from the root, until you find a blank spot in which the value can be placed. You construct the path using the following rule: each time you reach a node, if the value you want to insert is smaller than the current node, you go left; otherwise, you go right.
So let's say you want to insert the values 10, 7, 9, 14, 11, 19, 4. The tree would be constructed as follows: 10 becomes the root; 7 is less than 10, so it goes to the left of 10; 9 goes left at 10 and right at 7; 14 goes to the right of 10; 11 goes right at 10 and left at 14; 19 goes right at 10 and right at 14; and 4 goes left at 10 and left at 7.
And if you want to find an item, you just walk through the tree following the same rule. Let's say you're looking to see if 11 is in the tree. Well, it's greater than 10 so you go right and reach 14. 11 is less than 14 so you go left, and there it is. If you want to see if 12 is in the tree you start at 10 and go right, because 12 is greater than 10. Now you're at 14, so you go left because 12 is less than 14. Now you're at 11, so you need to go right but there's nowhere else to go, so you know that 12 isn't in the tree.
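Here's a minimal BST sketch in Python using those same values and lookups (the class and function names are just for illustration):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None   # values smaller than self.value
        self.right = None  # values greater than or equal to self.value

def insert(root, value):
    # Walk down from the root, going left for smaller values and
    # right otherwise, until a blank spot is found.
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def contains(root, value):
    # Follow the same rule to search; each step discards about half
    # the remaining tree, assuming it is reasonably balanced.
    if root is None:
        return False
    if value == root.value:
        return True
    return contains(root.left, value) if value < root.value else contains(root.right, value)

root = None
for v in [10, 7, 9, 14, 11, 19, 4]:
    root = insert(root, v)

print(contains(root, 11))  # True
print(contains(root, 12))  # False
```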
With each decision you make to go left or right (either when inserting values or when searching for values), you're cutting out about half of the remaining tree (assuming the tree is somewhat balanced), so it works similarly to a binary search. As such, a BST has O(log n) insertion and O(log n) search on average.

One of the problems with a BST is that if you insert values in already sorted or close to sorted order, the tree ends up heavily weighted down one side, and in the worst case can actually just be more like a list, reducing its effectiveness to O(n) (linear). To handle that, there are other, more interesting data structures that attempt to self-balance. But that's for another thread.

Tom is awesome, and genuinely seems to just want to inform people. I really enjoyed this video. So often people want to bring up electronic voting, and I just want to send them to this video.
Binary search (specifically trees) is so cool, I really wish he'd shown more. For reference, if you had a 1,000,000-item array, it would take a million checks to find your book if it happened to be last.
By comparison, with binary search, log2(1,000,000) is about TWENTY checks.
BOX A113
i see you, Tom
Binary exclusion is also a troubleshooting process you can use.
There’s a “The Price Is Right” game where the contestant has to guess the price, and they’re given a range with a minimum and a maximum price.
I get so frustrated because I’ve never seen anybody do a binary search. They always start at a random point and move in small increments.
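For what it's worth, a tiny sketch of that binary strategy (the price and range here are made up):

```python
def guess_price(is_too_low, low, high):
    # Repeatedly guess the midpoint, halving the range each time,
    # like a binary search over the possible prices.
    guesses = 0
    while low < high:
        guess = (low + high) // 2
        guesses += 1
        if is_too_low(guess):
            low = guess + 1
        else:
            high = guess
    return low, guesses

secret = 1337                      # hypothetical price
price, n = guess_price(lambda g: g < secret, 1, 2000)
print(price, n)                    # finds 1337 in about log2(2000) ≈ 11 guesses
```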
I always use a paper dictionary as a binary search analogy.