How Binary Search Makes Computers Much, Much Faster

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments

Good video., though I wished he added some visualization for the viewer to comprehend how massive the difference is between a logarithmic running time vs a linear one, rather than just mention it works well for very large datasets.

👍︎︎ 80 👤︎︎ u/shmed 📅︎︎ Sep 28 2020 đź—«︎ replies

A linear search is called an order n (or O(n)) function, which means that the time it takes to complete grows linearly with the size of the input. If it takes a second to search through a million entries then it will take about 10 seconds to search through ten million entries.

A binary search is an order log2(n) function, which means that the time it takes grows logarithmically with respect to the input size. If it takes a second to search through a million entries then it will take about 1.16 seconds to search through ten million entries.

To show just what a difference a binary search will make, compare the average time taken for a linear and a binary search in the following table, assuming searching 1 item takes a second:

n O(n) O(log2(n))
1 1s 1s
10 10s 3.3s
100 1m 40s 6.6s
1000 16m 40s 9.9s
10000 ~2h 46m 13.2s
100000 ~1d 16.6s
1000000 ~11d 13h 19.9s
10000000 ~115d 23.2s
100000000 ~3y 26.5s
1000000000 ~31y 29.8s
10000000000 ~317y 33.2s
100000000000 ~3 thousand y 36.5s
1000000000000 ~31 thousand y 39.8s
10000000000000 ~317 thousand years 43.1s
100000000000000 ~3 million y 46.5s
1000000000000000 ~31 million y 49.8s
10000000000000000 ~317 million y 53.1s
100000000000000000 ~3 billion y 56.4s
1000000000000000000 ~31 billion y 59.7s
10000000000000000000 ~317 billion y 63.1s
100000000000000000000 ~3 trillion y 66.4s

At a certain point, a linear search will take on average longer than the age of the universe to complete while a binary search is still under a minute.

The key to the binary search is that it's throwing away about half the values at every step, which - even after just a few steps - discards a huge proportion of the values. For example, one thousand -> 500 -> 250 -> 125 -> 62 -> 31 -> 15 -> 7 -> 3 -> 1


Of course, one also has to consider how long it would take to sort a million million million items. Generally speaking, most good sorting algorithms are about O(n log(n))on average, which will take even longer than the linear search, but if you can insert the values in sorted order in the first place (by using a binary search to find where they go. for example) then that saves you having to sort them all after the fact.

👍︎︎ 66 👤︎︎ u/IRBMe 📅︎︎ Sep 28 2020 đź—«︎ replies

There's an interesting data structure that follows a similar principle called a binary search tree.

A binary tree is a collection of connected nodes in which each node has either no child nodes, one child node, or two child nodes. The start is called the root, and nodes with no children are called leaf nodes. To construct a binary search tree (BST), you insert each value by following a path through the tree starting from the root until you find a blank spot in which the value can be placed. You construct the path using the following rule: each time you reach a node, if the value you want to insert is smaller than the current node, you go left, otherwise you go to the right.

So let's say you want to insert the values 10, 7, 9, 14, 11, 19, 4. The tree would be constructed as follows:

    10

    10
  /
7

    10
  /
7
 \
  9

    10
  /    \
7       14
 \
  9

    10
  /    \
7       14
 \      /
  9  11

    10
  /    \
7       14
 \      /  \
  9   11    19

        10
      /    \
    7       14
   / \      /  \
 4    9   11    19

And if you want to find an item, you just walk through the tree following the same rule. Let's say you're looking to see if 11 is in the tree. Well, it's greater than 10 so you go right and reach 14. 11 is less than 14 so you go left, and there it is. If you want to see if 12 is in the tree you start at 10 and go right, because 12 is greater than 10. Now you're at 14, so you go left because 12 is less than 14. Now you're at 11, so you need to go right but there's nowhere else to go, so you know that 12 isn't in the tree.

With each decision you make to go left or right (either when inserting values or when searching for values), you're cutting out about half of the remaining tree (assuming the tree is somewhat balanced), so it works similarly to a binary search. As such, a BST has O(log n) insertion and O(log n) search on average.

One of the problems with a BST is if you insert values in already sorted or close to sorted order then the tree ends up heavily weighted down one side, and in the worst case can actually just be more like a list, reducing its effectiveness to O(n) (linear). To handle that, there are other more interesting data structures that attempt to self-balance. But that's for another thread.

👍︎︎ 20 👤︎︎ u/IRBMe 📅︎︎ Sep 28 2020 đź—«︎ replies

Tom is awesome, and genuinely seems to just want to inform people. I really enjoyed this this video. So often people want to bring up electronic voting, and I just want to send them to this video.

👍︎︎ 7 👤︎︎ u/Mansyn 📅︎︎ Sep 28 2020 đź—«︎ replies

Binary search (specifically trees) r so cool, I really wish he showed more. For reference, if u had a 1,000,000 item array, it would take 1mil checks to find your book if it happened to b last

By reference, w binary search, log2(1,000,000) is TWENTY checks

👍︎︎ 2 👤︎︎ u/Sharmat_Dagoth_Ur 📅︎︎ Sep 29 2020 đź—«︎ replies

BOX A113

i see you, Tom

👍︎︎ 1 👤︎︎ u/Aerik 📅︎︎ Sep 29 2020 đź—«︎ replies

Binary exclusion is also a troubleshooting process you can use.

👍︎︎ 1 👤︎︎ u/Son_Of_Borr_ 📅︎︎ Sep 29 2020 đź—«︎ replies

There’s a “the price is right” game where the contestant has to guess the price and they are given a range with a min and max price.

I get so frustrated because I’ve never seen anybody do a binary search. They always start at a random point and move in small increments.

👍︎︎ 1 👤︎︎ u/[deleted] 📅︎︎ Sep 29 2020 đź—«︎ replies

I always use paper Dictionary as binary search analogy.

👍︎︎ 1 👤︎︎ u/__konrad 📅︎︎ Sep 29 2020 đź—«︎ replies
Captions
There are many different ways to put a shelf of books in order. Alphabetically by author, or by title. By subject. Or by the colour of the spine. Those are all valid approaches for different situations: if you’re a bookstore that has a lot of customers coming in and saying “I don’t remember the title, but I know it was red…” then sorting by colour is a reasonable approach. In 1876, an American librarian called Melvil Dewey published a four-page pamphlet describing how he thought a public library should be ordered. Before Dewey, books in many libraries were ordered by when they were purchased. As new books arrived, they’d just be allotted to a shelf or a box that matched their size, and there’d be a central card file somewhere listing where everything was. You would have to ask a librarian if you wanted a certain book, and they’d retrieve it for you. Just browsing the shelves wasn’t a thing. Dewey gave each subject an index number, and then said that the shelves should be ordered by that number. Over the years, the Dewey Decimal system became a standard. Dewey had solved a serious problem: how to sort and find information, without having to bother the head librarian. And because books were clustered by their subject, it encouraged general-interest browsing of the shelves. Side note: while researching this, I found out that Dewey was a terrible person. Not in a “oh, but he was of his time” way, like, he resigned from the American Library Association because of sexual harassment complaints in 1905. How bad do you have to be in 1905 to get kicked- His sexist and racist biases made it into the early versions of the Dewey Decimal system. Lots of libraries these days use completely different classifications. But, yeah, we do still remember his name. Anyway, here’s the key point: by creating a useful index and sorting by it, readers could find books more quickly. That same principle of indexing applies to computers, and has since the early days. There are two basic ways that a computer can search: linear and binary. Let’s say you’ve got a stack of cards. They’ve got numbers on them that are somewhere between 1 and 50. and you want to find lucky number 13. If it’s a shuffled stack, then there’s no shortcut: you just run through, looking at one card after another after another after another, until eventually you find number 13. That is linear search: going through every card in a stack, every element in a list, until you find the one or many that you’re looking for. That’ll work on any list, in any order. But most of the time, it will be slow. But what if the cards are in order by number? Sure, you have to spend some time to sort them. I’ve got a whole other video about that. But once you’ve done that, once they’re in order, you can use a binary search. Binary not in terms of zeros and ones on disk, but binary in terms of splitting things in two. Look at the middle item of the list. Is it higher or lower than the number we’re looking for, 13? So in this case, it’s lower. So we can forget about the first half of the list, and we can do the same again. Look in the middle of what's left, forget about the side that we don’t want, look in the middle, forget about the side we don't want, and so on, and so on, and so on, until eventually we find the card we’re looking for or we know it’s not in the list at all. Okay, with eleven cards, binary search doesn’t make much difference. But with a million cards, it is lightning-fast compared to linear search. Provided that the list is in order. But a stack of cards is a simple example. A database is much more complex. Let’s say we don’t just have numbers on the cards: we also have letters, and let's add some colours as well. We can’t sort the cards by more than one of these properties at a time. So right now, I can use binary search for number, but not for letter or colour. In the same way that librarians can sort their books by title, or author, or colour, you can only optimise for one property at a time. For everything else, you’ve got to run through every book on the shelf, every card in the stack, every item in the list. Unless you use indexes. Now, physically rearranging data on a hard disk, moving the cards around, that’s called a “clustered index”. It’s like making sure your dictionary is in alphabetical order because that's mostly how you're going to use it. But a “non-clustered index” is like the index at the back of a book, or a library catalogue. It’s a separate store of just the property you’re interested in, stored in a simple, easy-to-access format. If I want to find the card with the letter K on it? Well, I can do a binary search through the index at speed, identify the number of the card, and then use another binary search to go straight to it. If I want to pull all the green cards, I can pull just those out of the colour index, index, note down all those numbers, and then pull just those cards. There's no linear search happening anywhere. Again: it doesn’t make much difference with eleven cards. But with a thousand, or a million, or a billion, that’s the difference between waiting hours for a response to your query, and having it in a fraction of a second. Of course, each index that you add takes time to create, and space to store. If your computer’s ever been slow because of the “Indexing Service” on Windows, or “Spotlight Search” on Mac; that’s it taking the time to run through all your files and index all the properties that it can, so that when you type something into your search bar, it’ll be able to reply as quickly as possible. But every time you make any change to your database, if you add, remove, or change something, you also have to keep the indices... indexes... whichever, up to date. So there’s a skill in choosing which properties, which fields in a database, you want to create indices for. If you’re only going to search by colour once or twice a year, it might not be worth indexing. As with many things in computer science, there’s a trade-off between space and time. Now, Google say that their full search index is 100 petabytes in size, and they’re customising every result from it for every user, usually in under a second. Even after all the changes to Google over the years, even after all the space that they’ve given over to knowledge boxes -- and adverts -- even after all the optimisation they’ve added to give the user exactly what they want… Google still put the time it took to answer every query on every desktop search result. Because they're still proud of it. Because the logistics and computer science skills required to do that are the sort of thing that PhDs are written about. Or, because it’s a trade secret, they’re the sort of thing that commands very, very high salaries. It’s a long, long way between Dewey and his decimal system and the Google search result. But the basic principle, "create an index", is the same.
Info
Channel: Tom Scott
Views: 1,194,042
Rating: undefined out of 5
Keywords: tom scott, tomscott, the basics, computer science, binary search, linear search, indexes, indices
Id: KXJSjte_OAI
Channel Id: undefined
Length: 6min 51sec (411 seconds)
Published: Mon Sep 28 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.