When deciding how to store large amounts of data, a key question is how to organize that data
so that we can operate on it efficiently. Let's start by looking at a popular
data structure, the binary search tree. A binary search tree consists of
connected nodes. Each node has a key, in this case a unique number, and can point
to a left child node and a right child node. Binary search trees obey the property that any
keys to the left of a node are less than that node's key, and any keys to the right of
a node are greater than that node's key. This allows for an efficient search
process. If we want to search for a key, we start at the root of the tree and compare
against the key there, going left or right depending on whether the key we want is less
than or greater than the key in the node. At each step, we keep only the half of the tree
that might contain the key we're looking for, saving time by not searching parts
of the tree we know can't contain it.
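To make the idea concrete, here's a minimal sketch of that search in Python. The Node class and its field names are just illustrative, not from any particular library:

```python
class Node:
    """A binary search tree node: one key, and optional left/right children."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def bst_search(node, key):
    """Return True if key appears in the tree rooted at node."""
    while node is not None:
        if key == node.key:
            return True
        # Halve the search space: smaller keys live on the left,
        # larger keys on the right.
        node = node.left if key < node.key else node.right
    return False

# Example: a small tree with root 8 and children 3 and 10.
root = Node(8, Node(3), Node(10))
assert bst_search(root, 10) and not bst_search(root, 7)
```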
At this point, we might wonder: if binary search trees work so well because at each step
we focus on only the relevant half of the search space, wouldn't it be better to have a
tree that splits into three instead of two, so that at each step we could reduce the
search space to a third of the size? Depending on the circumstances, though, this might
actually be less efficient. That's because each node, instead of holding one key, would
hold two: keys less than the first key go to the left, keys greater than the second key
go to the right, and keys between the two go in the middle.
The tree is definitely "shallower", meaning we have to look at fewer nodes to get to our
answer. But we also might need up to double the number of comparisons at each node:
rather than comparing against one key, we might need to compare against both. In the end,
we'll often end up performing more comparisons than in binary search.
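As a rough back-of-the-envelope count (assuming a balanced tree holding n keys): a three-way tree is only about log3(n) levels deep, but each two-key node can cost up to two comparisons, for a total of roughly 2 * log3(n) = (2 / log2(3)) * log2(n), or about 1.26 * log2(n) comparisons, versus roughly log2(n) in a binary search tree.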
But do more comparisons mean this approach is less efficient? That depends on which
operations take the most time. If you have a lot of data, say in a database or a file
system, it's likely that the most expensive part of any algorithm will be the time it
takes to fetch a new node of data. The processor can compare keys very quickly, but when
it's time to move to the next node, it has to wait much longer for that data to arrive.
It's in this context that the B-tree shines.
The idea behind a B-tree is to build a tree whose nodes can hold more than one key. Maybe
they have up to two keys, maybe ten, maybe hundreds or more. To demonstrate the idea,
we'll use a B-tree with nodes that have a maximum of four keys.
Once we have a B-tree, search works just as you'd expect. We start at the root node and
check our key against the keys in the node. Maybe one of the keys matches and we're done;
otherwise we need to search another node. However many keys a node has, it has one more
child than it has keys. This tells us where to search: if our search key is less than all
of the keys, we look at the leftmost child. If our search key is greater than all of the
keys, we look at the rightmost child. Otherwise, we look at the child between the two
keys our search key falls between. This process continues, one node at a time, until we
either find the key or run out of places it could be.
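Here's what that search might look like in Python. The BTreeNode class is again just for illustration; it keeps a sorted list of keys and, for internal nodes, one more child than it has keys:

```python
class BTreeNode:
    """A B-tree node: a sorted list of keys, plus one more child than keys
    (or no children at all, if the node is a leaf)."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def btree_search(node, key):
    """Return True if key appears in the B-tree rooted at node."""
    while node is not None:
        # Find the position of the first key >= our search key.
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == key:
            return True          # one of this node's keys matches
        if not node.children:
            return False         # a leaf with no match: the key isn't here
        node = node.children[i]  # descend between the surrounding keys
    return False

# Example: a root [13, 27] with three leaf children.
root = BTreeNode([13, 27],
                 [BTreeNode([2, 9]), BTreeNode([18, 21]), BTreeNode([30, 41])])
assert btree_search(root, 21) and not btree_search(root, 5)
```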
Because each node branches in so many directions, we cut down the search space quickly,
and we can find what we're looking for by checking far fewer nodes than a binary search
tree would require. But how exactly do we create a B-tree? And how do we make sure it
doesn't become imbalanced, deeper on one side than the other, which could slow down
operations?
Let's start with some rules of B-trees that will guide how we operate on them. First, the
leaves of the B-tree, the nodes that don't point to any other nodes, all need to be at
the same level of the tree. This keeps everything balanced: we're not allowed to have
some leaves deeper in the tree than others. Next, every node has a maximum and a minimum
number of keys. We get to choose the maximum, and the minimum is always half of the
maximum, rounding down if needed. So in our case, each node can hold two, three, or four
keys. What happens when we try to add our first key into a B-tree? We'll place
the key in the root node. But wait, you might say — we just said that
nodes need at least two keys. For B-trees, the root node is the exception to the
rule: the root node can have fewer keys, but all other nodes in the tree
need to have at least the minimum. So let's keep adding keys to the tree and see
what happens. Each time we add a key, that key will get added to the root node, in sorted order.
But remember, all nodes can only store up to a maximum number of keys — in this case, four. So
what happens when we try to add our fifth key? It's not going to fit. So any time
we have too many keys in a node, we need to split that node up. Remember that each
node requires a minimum of two keys. So we'll take the smallest two keys and make one node, and
take the largest two keys and make a second node. What do we do with this leftover middle
key? Well, all the keys in the left node are less than this key. And all the keys in
the right node are greater than this key. So we can push this key up into a new root
node that points to the left and right.
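As a small sketch of that split, assuming our maximum of four keys (the numbers here are made up for illustration):

```python
def split_keys(keys):
    """Split an overflowing list of keys (five, when the max is four) into
    a left half, a middle key to push up into the parent, and a right half."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Example: a root overflows to five keys and splits.
left, separator, right = split_keys([3, 9, 14, 21, 30])
# left == [3, 9], separator == 14, right == [21, 30]
# The separator becomes the sole key of a new root pointing at left and right.
```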
Now, this is starting to look more like a tree. When we add new keys, we always add them
at the bottom level of the B-tree. So when a new key comes along, we use the root node to
decide which child to visit next, and when we reach the bottom of the tree, we insert the
key in the appropriate spot. That process continues, with new keys getting added at the
bottom level. Notice that most of the time, inserting is very straightforward: we just
follow the B-tree nodes down to the bottom and insert the key there. It's only when we
completely fill up a node and then try to add a new key to it that we need to perform a
split. We split the node into two nodes and push the middle key up into the parent.
As we add more and more data, you might start to wonder: what happens when the parent
becomes full of keys too? It turns out this splitting algorithm is recursive. Suppose we
try to insert a new key that belongs in a node that's already full. We start as usual by
splitting the node into two nodes and pushing the middle key into the parent. But if the
parent is already full, pushing the middle key into it makes it overflow. When that
happens, we repeat the process: we split the parent into two nodes as well, pushing its
middle key up another level, and if the parent was the root, into a new root node. This
elegant recursive algorithm preserves all the properties of the B-tree. When a node gets
too full, we split it in two. And the only time the tree gets taller is when we split the
root node and create a new root above it, keeping all the leaves at the same level of the
tree.
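Putting the pieces together, here's a sketch of insertion with recursive splitting, reusing the BTreeNode class from the search sketch and assuming unique keys. MAX_KEYS is just our running example's maximum; this is the shape of the algorithm, not production code:

```python
import bisect

MAX_KEYS = 4  # the maximum from our running example

def insert(root, key):
    """Insert key into the B-tree and return the (possibly new) root."""
    # Walk down to the leaf where the key belongs, remembering the path.
    path, node = [], root
    while node.children:
        i = bisect.bisect_left(node.keys, key)
        path.append((node, i))
        node = node.children[i]
    bisect.insort(node.keys, key)  # insert into the leaf, keeping sorted order

    # Split upward for as long as the current node is overflowing.
    while len(node.keys) > MAX_KEYS:
        mid = len(node.keys) // 2
        separator = node.keys[mid]
        right = BTreeNode(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        if path:
            # Push the separator into the parent, next to the new right node.
            parent, i = path.pop()
            parent.keys.insert(i, separator)
            parent.children.insert(i + 1, right)
            node = parent
        else:
            # We split the root: the only way the tree ever gets taller.
            node = root = BTreeNode([separator], [node, right])
    return root
```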
But we don't just care about adding new data; we also want the ability to delete data. So
how do you delete data from a B-tree? Well, let's say there's a key we want to delete. We
start by searching through the tree as usual, starting from the root, looking for the
key. Once we find it, we can delete it from the tree. That can work out fine, but there
are a few trickier cases we need to think about. Take a moment and consider —
under what circumstances might this deletion algorithm run into problems with
the properties we've set up around B-trees? Here's one case to consider: what
happens if we try to delete a key, but the node already has the minimum
key count? If we deleted the key, the node would have fewer than
the minimum, which isn't allowed. So how do we fix the problem?
Well, if a node has too few keys, we'll take a key from a node next to it, one of its
"sibling" nodes, so to speak. In this case, let's take a key from the right sibling.
We'll take its smallest key, since we're moving it to the left. If we were taking a key
from the left sibling, we'd instead take its largest key, since we'd be moving it to the
right. But we can't just take the key from the sibling and drop it into the missing spot
in our node, because that would make the parent key between these two nodes wrong.
Remember that this parent key acts as a separator: all keys to the left of it should be
smaller, and all keys to the right of it should be bigger. So we can't simply move a key
from one sibling to the other like this. With a slight modification, though, we can make
it work. Instead of taking the key from the sibling directly, we let the key from the
sibling become the new separator, and we move what used to be the separator down into our
node. This preserves the properties of the B-tree: values less than the separator are to
the left of it, and values greater than the separator are to the right of it.
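In code, that rotation through the parent might look like this (a sketch using the BTreeNode class from earlier, shown for leaf children; for internal nodes we'd also move one child pointer across):

```python
def borrow_from_right(parent, i):
    """Fix the underfull child parent.children[i] by rotating through the
    parent: the old separator moves down into the child, and the right
    sibling's smallest key moves up to become the new separator."""
    child, sibling = parent.children[i], parent.children[i + 1]
    child.keys.append(parent.keys[i])     # old separator comes down to us
    parent.keys[i] = sibling.keys.pop(0)  # sibling's smallest key moves up

# Example: child [4] is underfull, separator is 9, right sibling is [12, 15, 20].
parent = BTreeNode([9], [BTreeNode([4]), BTreeNode([12, 15, 20])])
borrow_from_right(parent, 0)
# Now the child is [4, 9], the separator is 12, and the sibling is [15, 20].
```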
So this works, as long as we have a sibling we can take from. But what happens if, when
we remove a key, we fall below the minimum and our sibling nodes are also already at the
minimum? Now we can't take from a sibling. Instead, just as we could split nodes apart
when adding keys, we can merge nodes back together when deleting them. We take our node,
our sibling node, and the separator between them, and merge them into a single full node.
That works out fine here, though note that it removes a key from the parent, which might
leave the parent with too few keys. If so, we recursively apply the algorithm we've just
described: either taking from a sibling if possible, or merging with a sibling if not.
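Sketched in the same style, the merge takes the underfull node, the separator, and the sibling and combines them (here assuming the sibling to the right):

```python
def merge_with_right(parent, i):
    """Merge the underfull child parent.children[i], the separator key between
    them, and its right sibling into one node. With a minimum of two keys,
    that's at most 1 + 1 + 2 = 4 keys: exactly a full node."""
    child = parent.children[i]
    sibling = parent.children.pop(i + 1)
    separator = parent.keys.pop(i)
    child.keys = child.keys + [separator] + sibling.keys
    child.children += sibling.children
    # The parent lost a key and a child; if it's now underfull,
    # recursively borrow from or merge with one of *its* siblings.
```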
The other scenario we need to think about is what happens if we delete a key from the
middle of the tree rather than from a leaf. Then it's not as simple as just removing the
key, since that key was also acting as a separator between two subtrees, and we need a
new separator to replace it. That new separator still needs to be greater than everything
in the left subtree and smaller than everything in the right subtree. That gives us two
options: we can take either the largest value in the left subtree or the smallest value
in the right subtree. Either one works, and it becomes the new separator.
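Here's one way to sketch that, taking the largest value in the left subtree (the in-order predecessor) as the replacement separator:

```python
def replace_with_predecessor(node, i):
    """Delete the internal key node.keys[i] by overwriting it with its in-order
    predecessor: the largest key in the left subtree, found at the rightmost
    leaf below node.children[i]."""
    current = node.children[i]      # root of the left subtree
    while current.children:
        current = current.children[-1]
    node.keys[i] = current.keys.pop()  # predecessor becomes the new separator
    # The leaf we took from may now be underfull; fix it by borrowing
    # from a sibling or merging, exactly as described above.
```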
Of course, after we do this, it's possible that the node we took the key from now has
fewer than the minimum number of keys. But we already know how to fix that: we take a key
from a sibling, or merge two nodes together if that's not possible. So by taking care to
always
preserve these rules of B-trees, we've constructed a data structure that's
remarkably efficient for searching lots of data. Even though we have lots of nodes, we
only need to access a small number of them any time we're working with data,
resulting in substantial savings in time spent waiting for data to be retrieved. And the
property that nodes can be anywhere between half full and completely full turns out
to be especially helpful: it means we can split up nodes when they get too full, and
merge them together when they get too empty. The fact that many nodes aren't completely
full also makes inserting new data easy most of the time, since often we
can just fill in an empty space in a node without needing to change
anything about the rest of the tree. These properties make B-trees and variants
on B-trees popular and elegant choices for handling data, especially in databases and file
systems. And I hope this tour of the B-tree, its structure, and how it operates gives you
a better understanding and appreciation for what's really going on inside your computer
as it stores and navigates your data.