Understanding B-Trees: The Data Structure Behind Modern Databases

Captions
When deciding how to store large amounts of data, a key question is how to organize that data so that we can operate on it efficiently. Let's start by looking at a popular data structure, the binary search tree. A binary search tree consists of connected nodes. Each node has a key, in this case a unique number, and can point to a left child node and a right child node.

Binary search trees obey the property that any keys to the left of a node are less than that node's key, and any keys to the right of a node are greater than that node's key.

This allows for an efficient search process. If we want to search for a key, we start at the root of the tree and compare against the key there, going left or right depending on whether the key we want is less than or greater than the key in the node.

At each step, we pick only the half of the tree that might contain the key we're looking for, saving time by not bothering to search parts of the tree we know won't have our key.

At this point, we might wonder: if binary search trees work so well because at each step we focus in on only the relevant half of the search space, wouldn't it be better to have a tree that split into three instead of two, so that at each step we could reduce the search space to a third of its size?

But depending on the circumstances, this might actually be less efficient.

That's because each node, instead of having one key, would have two. Keys less than the first key would be to the left, keys greater than the second key would be to the right, and keys between the two would be in the middle.

The tree is definitely "shallower", meaning we have to look at fewer nodes to get to our answer. But we also might need to double the number of comparisons at each node: rather than comparing against just one key, we might need to compare against both. In the end, we'll often end up performing more comparisons than in binary search.

But do more comparisons mean this approach is less efficient? That depends on which operations take up the most time.

If you have a lot of data, say in a database or a file system, it's likely that the most expensive part of any algorithm will be the time it takes to fetch a new node of data. The processor can compare keys very quickly, but when it's time to search the next node, it has to wait much longer for the data to be ready.

It's in this context that the B-tree shines. The idea behind a B-tree is to store a tree with nodes that can have more than one key. Maybe they have up to two keys, maybe ten, maybe hundreds or more. To demonstrate the idea, we'll use a B-tree with nodes that have a maximum of four keys.

Once we have a B-tree, search works just like you'd expect. We start with the root node and check our key against the keys in the node. Maybe one of the keys matches and we're done, but otherwise we need to search another node.

However many keys a node has, it has a number of children equal to one more than the number of keys. This tells us where to search: if our search key is less than all of the keys, we look to the leftmost child. If our search key is greater than all of the keys, we look to the rightmost child. Otherwise, depending on which two keys our search key falls between, we look at the child in between those two keys.

This process continues, one node at a time, until we either find or don't find the key we're looking for.
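To make that search procedure concrete, here is a minimal sketch in Python. The `BTreeNode` class and the `btree_search` function are illustrative names chosen for this write-up, not code from the video; the sketch assumes each node keeps its keys in sorted order and, unless it is a leaf, holds exactly one more child than it has keys.

```python
class BTreeNode:
    """A B-tree node: a sorted list of keys, plus one more child than keys (none for leaves)."""
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # an empty list means this node is a leaf


def btree_search(node, key):
    """Return True if `key` is stored in the subtree rooted at `node`."""
    # Find the first position whose key is not smaller than the search key.
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    # Either the key is in this node...
    if i < len(node.keys) and node.keys[i] == key:
        return True
    # ...or we've reached a leaf and the key isn't anywhere in the tree...
    if not node.children:
        return False
    # ...or we descend into the one child whose range could contain it.
    return btree_search(node.children[i], key)
```

Only one node per level is ever visited, which is why the cost is dominated by how many node fetches the search needs rather than by how many key comparisons happen inside each node.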
Because each node branches in so many different directions, we cut down on the search space quickly, and we can find what we're looking for by checking far fewer nodes than a binary search tree would require.

But how exactly do we create a B-tree? And how do we make sure it doesn't get imbalanced, with the tree deeper on one side than the other, which might slow down operations?

Let's start with some rules of B-trees that will guide how we operate on them.

First, the leaves of the B-tree, the nodes that don't point to any other nodes, all need to be at the same level of the tree. This makes sure everything stays balanced: we're not allowed to have some leaves deeper in the tree than others.

Next, every node has a maximum and a minimum number of keys. We get to choose the maximum, and the minimum is always half of the maximum number of keys, rounding down if needed.

So in our case, each node can have two, three, or four keys.

What happens when we try to add our first number into a B-tree? We place the key in the root node.

But wait, you might say: we just said that nodes need at least two keys. For B-trees, the root node is the exception to the rule. The root node can have fewer keys, but all other nodes in the tree need to have at least the minimum.

So let's keep adding keys to the tree and see what happens. Each time we add a key, that key will get added to the root node, in sorted order. But remember, a node can only store up to the maximum number of keys, in this case four. So what happens when we try to add our fifth key?

It's not going to fit. So any time we have too many keys in a node, we need to split that node up. Remember that each node requires a minimum of two keys, so we'll take the smallest two keys and make one node, and take the largest two keys and make a second node.

What do we do with the leftover middle key? Well, all the keys in the left node are less than this key, and all the keys in the right node are greater than this key. So we can push this key up into a new root node that points to the left and right.

Now this is starting to look more like a tree. When we add new elements, we're always going to add them at the bottom level of the B-tree. So when a new element comes along, we use the root node to decide which node to visit next, and when we get to the bottom of the tree, we insert the key in the appropriate spot.

That process continues, with new nodes getting added to the bottom level. Notice that most of the time, inserting is very straightforward: we just follow the B-tree nodes down to the bottom and insert the value there.

It's only when we completely fill up a node and then try to add a new key there that we need to perform a split. We split the node into two nodes, and push the middle key up into the parent.

As we add more and more data, you might start to wonder: what happens when the parent becomes full of keys too? It turns out this splitting algorithm is recursive. If we look at this tree, for example, and try to insert a new key that belongs in a node that's already full, we start as usual by splitting the node up into two nodes and pushing the middle key into the parent.

But the parent is already full, so when we push the middle key into the parent, it overflows. If that happens, we repeat the process again: we split the parent into two nodes of its own, pushing the middle key into a new root node.
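One hedged way to express this insert-at-the-bottom, split-upward algorithm is to have each recursive call report whether its node split, so the caller can absorb the pushed-up middle key. This continues the `BTreeNode` sketch above; the names `insert`, `btree_insert`, and `MAX_KEYS` are my own, keys are assumed to be unique and not already present, and the sketch is illustrative rather than a production implementation.

```python
import bisect

MAX_KEYS = 4  # at most four keys per node, matching the video's example


def insert(node, key):
    """Insert `key` (assumed not already present) into the subtree at `node`.
    Returns None if the node absorbed it, or (middle_key, right_node) if the
    node overflowed and split, so the caller can absorb the pushed-up key."""
    if not node.children:                       # leaf: place the key in sorted order
        bisect.insort(node.keys, key)
    else:                                       # internal node: descend into one child
        i = bisect.bisect_left(node.keys, key)
        split = insert(node.children[i], key)
        if split is not None:                   # that child split: take its middle key
            middle, right = split
            node.keys.insert(i, middle)
            node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    # Overflow: keep the smaller half, hand back the middle key and the larger half.
    mid = len(node.keys) // 2
    middle_key = node.keys[mid]
    right = BTreeNode(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return middle_key, right


def btree_insert(root, key):
    """Insert into the whole tree, growing a new root if the old root splits."""
    split = insert(root, key)
    if split is None:
        return root
    middle_key, right = split
    return BTreeNode([middle_key], [root, right])  # the only way the tree gets taller


# Example: inserting 1 through 10 builds a two-level tree whose root holds [3, 6].
root = BTreeNode()
for k in range(1, 11):
    root = btree_insert(root, k)
```

Returning the (middle key, new right node) pair up the call stack mirrors the description above: splits propagate toward the root, and the tree only grows taller when the root itself splits.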
This elegant recursive algorithm preserves all the properties of the B-tree. When a node gets too full, we split it in two. And the only time the tree gets taller is when we split the root node into two and create a new root node, keeping all the leaves at the same level of the tree.

But we don't just care about adding new data. We also want the ability to delete data. So how do you delete data from a B-tree?

Well, let's say there's a key we want to delete. We start by searching through the tree as usual, starting from the root, looking for the key. Once we find it, we can delete it from the tree.

That worked well in this case, but there are a few trickier cases we need to think about. Take a moment and consider: under what circumstances might this deletion algorithm run into problems with the properties we've set up around B-trees?

Here's one case to consider: what happens if we try to delete a key, but the node already has the minimum key count? If we deleted the key, the node would have fewer than the minimum, which isn't allowed.

So how do we fix the problem? Well, if a node has too few keys, we'll take a key from a node next to it, one of its "sibling" nodes, so to speak.

In this case, let's take a key from the right sibling. We'll take the smallest key, since we're going to be moving it to the left. If we were taking a key from the left sibling, we'd instead take the largest key, since we'd be moving it to the right.

But we can't just take the key from the sibling and fill in the missing spot in our node, because that would make the parent key in between these two nodes wrong. Remember that this parent key acts as a separator: all keys to the left of it should be smaller, and all keys to the right of it should be bigger. So we can't just move a key from one sibling to another like this.

With a slight modification, though, we can make this work. Instead of taking the key from the sibling directly, we'll let the key from the sibling become the new separator, and we'll move what used to be the separator down into our node. This preserves the properties of the B-tree: values less than the separator are to the left of it, values greater than the separator are to the right of it.

So this works, as long as we have a sibling we can take from.

But what happens if, when we remove a key, we fall below the minimum and our sibling nodes are also already at the minimum? Now we can't take from a sibling.

Instead, just like when adding keys we could split nodes apart, when deleting keys we might merge nodes back together. We can take our node, our sibling node, and the separator between them, and merge them into a single full node.

In this case that works out fine, though note that sometimes the parent node may end up having too few keys as a result. In that case, we can recursively apply the algorithm we've just described: either taking from a sibling if possible, or merging with a sibling if that's not possible.

The other scenario we need to think about is what happens if we delete a key from the middle of the tree rather than from a leaf. In that case, it's not as simple as just deleting the key, since that key was also acting as a separator between two subtrees.

We need a new separator, and it still has to be greater than everything in the left subtree and smaller than everything in the right subtree. That means we have two options: we either take the largest value in the left subtree, or the smallest value in the right subtree. Either one works, and that becomes the new separator value.
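Both repair strategies, borrowing a key through a separator and merging with a sibling, together with the replace-with-a-neighboring-value trick for internal keys, can be sketched in the same style as before. This continues the `BTreeNode` sketch; the names `rebalance`, `delete_internal_key`, and `MIN_KEYS` are illustrative assumptions rather than any established API.

```python
MIN_KEYS = 2  # half of the maximum of four keys, as in the video's example


def rebalance(parent, i):
    """Fix up parent.children[i] after a deletion left it below MIN_KEYS:
    borrow a key through a separator if a sibling can spare one, otherwise merge.
    The caller should then check whether `parent` itself has too few keys."""
    child = parent.children[i]
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None

    if right is not None and len(right.keys) > MIN_KEYS:
        # The old separator moves down into our node; the right sibling's
        # smallest key moves up to become the new separator.
        child.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
        if right.children:
            child.children.append(right.children.pop(0))
    elif left is not None and len(left.keys) > MIN_KEYS:
        # Mirror image: borrow the left sibling's largest key through the left separator.
        child.keys.insert(0, parent.keys[i - 1])
        parent.keys[i - 1] = left.keys.pop()
        if left.children:
            child.children.insert(0, left.children.pop())
    else:
        # No sibling can spare a key: merge our node, a sibling, and the
        # separator between them into a single node.
        j = i if right is not None else i - 1   # merge children[j] and children[j + 1]
        a, b = parent.children[j], parent.children[j + 1]
        a.keys = a.keys + [parent.keys[j]] + b.keys
        a.children = a.children + b.children
        del parent.keys[j]
        del parent.children[j + 1]


def delete_internal_key(node, i):
    """Delete node.keys[i], a key separating two subtrees, by replacing it with
    its in-order predecessor, the largest key in the left subtree.
    (Taking the smallest key in the right subtree works just as well.)"""
    leaf = node.children[i]          # root of the left subtree
    while leaf.children:             # walk down to its rightmost leaf
        leaf = leaf.children[-1]
    node.keys[i] = leaf.keys.pop()   # promote the predecessor to be the new separator
```

A full delete routine would remove the key from its leaf (or call `delete_internal_key` first), then walk back toward the root calling `rebalance` wherever a node has dropped below the minimum.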
Of course, after we do this, it's possible that the node we took the key from will now have fewer keys than the minimum, but we know how to fix that already: we take a key from a sibling, or merge two nodes together if that's not possible.

So by taking care to always preserve these rules of B-trees, we've constructed a data structure that's remarkably efficient for searching lots of data.

Even though we have lots of nodes, we only need to access a small number of them any time we're working with data, resulting in substantial savings on time spent waiting for data to be retrieved. And the property that nodes can be anywhere between half full and completely full turns out to be especially helpful: it means we can split up nodes when they get too full, and merge them together when they get too empty.

The fact that many nodes aren't completely full also makes inserting new data easy most of the time, since often we can just fill in an empty space in a node without needing to change anything about the rest of the tree.

These properties make B-trees and variants on B-trees popular and elegant choices for handling data, especially in databases and file systems. And I hope this tour of the B-tree, its structure, and how it operates gives you a better understanding and appreciation for what's really going on inside your computer as it stores and navigates your data.
Info
Channel: Spanning Tree
Views: 310,109
Keywords: data structures, algorithms, computer science, trees, search, binary search, btree
Id: K1a2Bk8NrYQ
Length: 12min 39sec (759 seconds)
Published: Mon Apr 29 2024