PROFESSOR: All right,
let's get started. Today we have another
data structures topic which is, Data
Structure Augmentation. The idea here is we're going
to take some existing data structure and augment it
to do extra cool things. We take some data structure we've already covered. Typically, that'll be
a balanced search tree, like an AVL tree or a 2-3 tree. And then we'll modify it to
store extra information which will enable additional kinds
of searches, typically, and sometimes to
do updates better. And in 006, you've seen an
example of this where you took AVL trees and augmented AVL
trees so that every node knew the number of nodes in
that rooted subtree. Today we're going to see
that example but also a bunch of other examples,
different types of augmentation you could do. And we'll start out with
a very simple one, which I call easy tree augmentation,
which will include subtree size as a special case. So with easy tree
augmentation, the idea is you have a tree, like
an AVL tree, or 2-3 tree, or something like that. And you'd like to
store, for every node x, some function of the
subtree, rooted at x. Such as the number
of nodes in there, or the sum of the
weights of the nodes, or the sum of the
squares of the weights, or the min, or the max, or the
median maybe, I'm not sure. Some function f, which is a function of that subtree-- not a function of the node x alone, but some function of that whole subtree. Say the goal is to store
f of the subtree rooted at x at each node x in a
field which I'll call x.f. So, normally nodes have a left
child, right child, parent. But we're going to
store an extra field x.f for some function
that you define. This is not always
possible, but here's a case where it is possible. That's going to
be the easy case. Suppose x.f can be
computed locally using information from lower nodes. And we'll say,
let's suppose it can be computed in constant time
from information in the node x, from x's children,
and from the f value that's stored in the children. I'll call that children.f. But really, I mean left
child.f, right child.f, or if you have a
2-3 tree you have three children, potentially. And the .f of each of them. OK. So suppose you can
compute x.f locally just using one level down
in constant time. Then, as you might expect,
you can update whenever a node ends up changing. So more formally. If some set of nodes
change-- call this set S. So I'm stating a very general theorem here. Suppose there's some set of nodes that we've changed something about: we change either their f field, or some of the data that's in the node, or we do a rotation, moving those nodes around. Then we count the total number
of ancestors of these nodes. So this subtree. Those are the nodes
that need to be updated because we're assuming we
can compute x.f just given the children data. So if this data is
changing, we have to update its parent's value of f because it depends
on this child value. We have to update
all those parents, all the way up to the root. So however many nodes there are
there, that's the total cost. Now, luckily, in an AVL tree, or
2-3 tree, most balanced search structures, the updates
you do are very localized. When we do splits in
a 2-3 tree we only do it up a single
path to the root. So the number of ancestors
here is just going to be log n. Same thing with an AVL tree. If you look at the
rotations you do, they are up a single
leaf to root path. And so the number
of ancestors that need to be updated is
always order log n. Things change, and there's an
order log n ancestors of them. So this is a little more
general than we need, but it's just to point out if
we did log n rotations spread out somewhere in the tree,
that would actually be bad because the total number of
ancestors could be log squared. But because in the
structures we've seen, we just work on a single path
to the root, we get log n. So in a little more
detail here, whenever we do a rotation in an AVL tree. Let's say A, B, C, x, y. Remember rotations? Been a while since
we've done rotations. So we haven't changed any
of the nodes in A, B, C, but we have changed
the nodes x and y. So we're going to have to
trigger an update of y. First, we'd want to
update y.f and then we're going to trigger
the update to x.f. And as long as this one can
be computed from its children, then we compute y.f, then we
can compute x.f from its children. All right. So a constant number
of extra things we need to do whenever
we do a rotation. And because the rotations lie on a single path, the total cost is fine-- once we stop doing the rotations, in AVL insert say, then we still have to keep updating up to the root. But there's at most log n nodes to do that. OK. Same thing with 2-3 trees. We have a node split. So we have, I guess,
three keys, four children. That's too many. So we split to two nodes
and an extra node up here. Then we just trigger an
update of this f value, an update of this f value,
and an update of that f value. And because that just follows a
single path, everything's log n. So this is a general theorem about augmentation. Any function that's well behaved in this sense, we can maintain in AVL trees and 2-3 trees.
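To make that concrete, here's a minimal Python sketch of the update hook, assuming BST nodes with left, right, and parent pointers and a user-supplied local rule f; the names are placeholders, not code from the lecture.

    def update(x, f):
        # Recompute x.f in O(1) from x itself and its children's stored
        # .f values -- this is the "computed locally" condition.
        x.f = f(x, [c.f for c in (x.left, x.right) if c is not None])

    def update_to_root(x, f):
        # After changing x (an insert, delete, or rotation), recompute .f
        # along the path to the root: O(log n) nodes in a balanced tree.
        # After rotating x and y, update the lower node first, then the upper.
        while x is not None:
            update(x, f)
            x = x.parent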
And I'll remind you and state, a little more generally, what you did in 006, which are called order-statistic trees in the textbook. So here we're going to--
let me first tell you what we're trying to achieve. This is the abstract data
type, or the interface of the data structure. We want to do insert, delete,
and say, successor searches. It's the usual thing we want
out of a binary search tree. Predecessor too, sure. We want to do rank of a
given key which is, tell me what is the index of that key
in the overall sorted order of the items, of the keys? We've talked about rank a few
times already in this class. Depends whether you
start at 0 or 1, but let's say we start at one. So if you say rank of the key
that happens to be the minimum, you want to get one. If you say rank of the key
that happens to be the median, you want to get n over
2 plus 1, and so on. So it's a natural thing
you might want to find out. And the converse
operation is select, let's say of i, which is,
give me the key of rank i. We've talked about select
as an offline operation. Given an array,
find me the median. Or find me the n over
seventh rank item. And we can do that in linear
time given no data structure. Here, we want a
data structure so that we can find the
median, or the seventh item, or the n over seventh key,
whatever in log n time. We want to do all of these
in log n per operation. OK. So in particular, rank of
selective i should equal i. We're trying to find
the item of that rank. So far, so good. And just to plug these
two parts together. We have this data structure
augmentation tool, we have this goal
we want to achieve, we're going to achieve
this goal by applying this technique where f
is just the subtree size. It's the number of
nodes in that subtree because that will
let us compute rank. So we're going to use easy tree
augmentation with f of subtree equal to the number of
nodes in the subtree. So in order for
this to apply, we need to check that given a node
x we can compute x.f just using its children. This is easy. We just add everything up. So x.f would be equal to 1. That's for x. Plus the sum of c.f
for every child c. I'll write this in Python notation so it looks a little more like an algorithm. I'm trying to be generic here. If it's a binary search tree you just do x.left.f plus x.right.f. But this will work also for 2-3 trees. Pick your favorite data structure. As long as there's a constant number of children then this will take constant time.
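As a sketch, the subtree-size rule looks like this in Python, assuming each node stores a list of its children (two for a binary tree, up to three for a 2-3 tree, empty at a leaf):

    def update_size(x):
        # f is subtree size: 1 for x itself, plus the children's stored
        # sizes.  Constant time, since there are at most three children.
        x.size = 1 + sum(c.size for c in x.children)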
So we satisfied this condition. So we can do easy tree augmentation. And now we know we
have subtree sizes. So given any node. We know the number of
descendants below that node. So that's cool. It lets us compute
rank in select. I'll just give you those
algorithms, quickly. We can check that
they're log n time. Yeah. So the idea is pretty simple. You have some key-- let's
think about binary trees now, because it's a
little bit easier. We have some item x. It has a left subtree,
right subtree. And now let's look up from x. Just keep calling x.parent. So sometimes the parent
is to the right of us and sometimes the parent
is to the left of us. I'm going to draw this
in a, kind of, funny way. But this funny way has
a very special property, which is that the
x-coordinate in this diagram is the key value. Or is the sorted order
of the keys, right? Everything in the left subtree
of x has a value less than x. If we say all the
keys are different. Everything to the right of x
has a value greater than x. If x was the left
child of its parent, that means this thing
is also greater than x. And if we follow a
parent and this was the right child of
that parent, that means this thing is less than x. So that's why I drew it all
the way over to the left. This thing is also less
than x because it was a, I'll call it a left parent. Here we have a right
parent, so that means this is something
greater than x. And over here we have
a left parent, so this is something less than x. Let's say that's the root. In general, there's
going to be some left edges and some right
edges as we go up. These arrows will go either
left or right in a binary tree. So the rank of x is just
1 plus the number of nodes that are less than x. Number of keys that
are less than x. So there's these guys,
there's these guys, and there's whatever's
hanging off-- OK. Here I've almost violated
my x-coordinate rule. If I make these really
narrow, that's right. All of these things,
all of these nodes in the left subtrees of
these less than x nodes will also be less than x. If you think about
these other subtrees, they're going to
be bigger than x. So we don't really
care about them. So we just want to count
up all these nodes and all of these nodes. So the algorithm to do
that is pretty simple. We're just going
to start out with-- I'm going to switch from
this f notation to size. That's a little more natural. In general, you might
have many functions. Size is the usual
notation for subtree size. So we start out by counting
up how many items are here. And if we want to
start at a rank of 1, if the min has rank
1, then I should also do plus 1 for x itself. If you wanted to start at zero
you just omit that plus 1. And then, all I do is walk up
from x to the root of the tree. And whenever we go left
from, say x to x prime. So that means we
have an x prime. Its right child is x. And so when we went from x to
its parent we went to the left. Then we say rank plus
equals x prime.left.size plus 1 for x prime itself. And maybe x
prime.left.size is zero. Maybe there's no
nodes over there. But at the very least we have
to count those nodes that are to the left of us. And if there's anything down here, we add up all those things. So that lets us compute rank.
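Here's a hedged Python sketch of that upward walk, assuming nodes with left, right, parent, and the augmented size field, with ranks starting at 1.

    def rank(x):
        # One for x, plus its whole left subtree, plus whatever lies to
        # the left of each ancestor we pass on the way up.
        r = (x.left.size if x.left else 0) + 1
        while x.parent is not None:
            if x is x.parent.right:
                # x' = x.parent is to our left: count x' and its left subtree.
                r += (x.parent.left.size if x.parent.left else 0) + 1
            x = x.parent
        return r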
How long does it take? Well, we're just walking up one path from a leaf to a root-- or not necessarily
a leaf, but from some node x to the root. And as long we're using
a balance structure like AVL trees. I guess I want binary here,
so let's say AVL trees. Then this will take log n time. So I'm spending
constant work per step, and there's log n steps. Clear? So that's good old rank. Easy to do once you
have subtree size. Let's do select for fun. This may seem like review,
but I drew out this picture explicitly because we're
going to do it a lot today. We'll have pictures like
this a bunch of times. Really helps to
think about where the nodes are, which
ones are less than x, which ones are greater than x. Let's do select first. This you may not
have seen in 006. So we're going to
do the reverse. We're going to start at the root
and we're going to walk down. Sounds easy enough. But now walking down is
kind of like doing a search but we don't have a key
we're searching for, we have a rank we're searching for. So what is that rank? Rank is i. OK. So on the other hand,
we have the node x. We'd like to know the rank
of x and compare that to i. That will tell us whether we
should go left, or go right, or whether we happen
to find the item. Now one possibility
is we call rank of x to find the rank of x. But that's dangerous because I'm
going to have a for loop here and it's going to
take log n iterations. If at every iteration
of computing rank of x, and rank costs
log n, then overall cost might be log squared n. So I can't afford to-- I want
to know what the rank of x is but I can't afford to
say rank, open paren, x. Because that recursive
call will be too expensive. So what is the rank
of x in this case? This is a little special. What's that? AUDIENCE: Number of
left children plus 1. PROFESSOR: Number
of left, or the size of the left subtree plus 1. Yep. Plus 1 if we're counting,
starting at one. Very good. I'm slowly getting better. Didn't hit anyone this time. OK. So at least for the
root, this is the rank, and that only takes us constant
time in the special case. So we'll have to
check that it's still holds after I do the loop. But it will. So, cool. Now there are three cases. If i equals rank. If the rank we're
searching for is the rank that we happen to have,
then we're done, right? We just return x. That's the easy case. More likely is that I will be
either less than or greater than the rank of x. OK. So if i is less than the
rank, this is fairly easy. We just say x equals x.left. Did I get that right? Yep. In this case, the rank. So here we have x. It's at rank, rank. And then we have the left
subtree and the right subtree. And so if the rank
we're searching for is less than rank, that
means we know it's in here. So we should go left. And if we just said x
equals x.left you might ask, well what rank are we
searching for in here? Well, exactly the same rank. Fine. That's easy case. In the other situation, if
we're searching in here, we're searching for
rank greater than rank. Then I want to go
right but the new rank that I'm searching for
is local to this subtree. I'm searching for
i minus this stuff. This stuff is rank. So I'm going to let
i be i minus rank. Make sure I don't have
any off by 1 errors. That seems to be right. OK. And then I do a loop. So I'll write repeat. So then I'm going to
go up here and say, OK. Now relative to this thing. What is the rank of the
root of this subtree? Well, it's again going to be
that node .left.size plus 1. And now I have the new
rank I'm searching for, i. And I just keep going. You could write this recursively if you like, but here's an iterative version.
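A sketch of that loop, under the same assumptions as the rank sketch above:

    def select(root, i):
        x = root
        while x is not None:
            r = (x.left.size if x.left else 0) + 1  # rank of x in its subtree
            if i == r:
                return x
            elif i < r:
                x = x.left      # same target rank, inside the left subtree
            else:
                i -= r          # make the rank local to the right subtree
                x = x.right
        return None             # i was out of range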
So it's actually very similar to the select algorithm that we had, like when we
did deterministic linear time median finding or
randomized median finding. They had a very similar
kind of recursion. But in that case, they
were spending linear time to do the partition
and that was expensive. Here, we're just spending
constant time at each node and so the overall
cost is log n. So that's nice. Any questions about that? OK. I have a note here. Subtree size is obvious once you
know that's what you should do. Another natural
thing to try to do would be to augment,
for each node, what is the rank of that node? Because then rank is
really easy to find. And then select would
basically be a regular search. I just look at the
rank of the root, I see whether the
rank I'm looking for is too big, or too small, and I
go left or right, accordingly. What would be bad about
augmenting with rank of a node? Updates. Why? What's a bad example
for an update? AUDIENCE: If you add
a new minimum element. PROFESSOR: Right. Say we insert a new
minimum element. Good catch, cameraman. That was for the
camera, obviously. So, right. If we insert, this is off to
the side, but say we insert, I'll call it minus infinity. A new key that is smaller
than all other keys, then the rank of
every node changes. So that's bad. It means that easy tree
augmentation, in particular, isn't going to apply. And furthermore, it would
take linear time to do this. And you could keep inserting,
if you insert keys in decreasing order from there, every
time you do an insert, all the ranks increase by one. Maintaining that's going to
cost linear time per update. So you have to be really
careful that the function you want to store actually
can be maintained. Be very careful about that,
say, on the quiz coming up, that when you're
augmenting something you can actually maintain it. For example, it's very hard to
maintain the depths of nodes because when you do a rotation
a whole lot of depths change. Depth is counting from the root. How deep am I? When I do a rotation
then this entire subtree went down by one. This entire subtree
went up by one. In this picture. But it's very easy to
maintain heights, for example. Height counting from
the bottom is OK, because I don't affect
the height of a, b, and c. I affect it for x and y
but that's just two nodes. That I can afford. So that's what you want to be
careful of in the easy tree augmentation. So most the time easy tree
augmentation does the job. But in the remaining
two examples, I want to show you cooler
examples of augmentation. These are things you
probably wouldn't be expected to come up with
on your own, but they're cool. And they let us do more
sophisticated operations. So the first one is
called level linking. And here we're going to do it
in the context of 2-3 trees, partly for variety. So the idea of level
linking is very simple. Let me draw a 2-3 tree. Not a very impressive 2-3 tree. I guess I don't feel
like drawing too much. Level linking is the
idea of, in addition to these child and
parent pointers, we're going to add links
on all the levels. Horizontal links,
you might call them. OK. So that's nice. Two questions-- can we do this? And what's it good for? So let's start with
can we do this. Remember in 2-3 trees all
we have to think about are splits and merges. So in a split, we have,
for a brief period, let's say three
keys, four children. That's too many. So we change that to-- I'm going to change
this in a moment. For now, this is the split
you know and love, maybe. At least know. And if we think about where
the leveling pointers are, we have one before. And then we just need to
distribute those pointers to the two resulting nodes. And then we have to create a
new pointer between the nodes that we just created. This is, of course, easy to do. We're here. We're taking this node. We're splitting it in half. So we have the nodes
right in our hands so just add pointers between them. And key thing is, there's some
node over here on the left. It used to point
to this node, now we have to change it to
point to the left version. The left half of the node. And there's some node
over on the right. We have to change its left pointer to point to this right half of the node. But that's it. Constant time.
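As a sketch, the pointer surgery for a split might look like this, where level_left and level_right are assumed names for the horizontal links.

    def split_level_links(v, v_left, v_right):
        # v has just been split into v_left and v_right; splice them
        # into the level list where v used to sit.
        v_left.level_left = v.level_left
        v_left.level_right = v_right
        v_right.level_left = v_left
        v_right.level_right = v.level_right
        if v.level_left is not None:
            v.level_left.level_right = v_left
        if v.level_right is not None:
            v.level_right.level_left = v_right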
So this doesn't fall under the category of easy tree augmentation, because this is not isolated to the subtree. We're also dealing with its
left and right subtrees. But still easy to
do in constant time. Merging nodes is
going to be similar. If we steal a node from our
parents or former sibling, nothing happens in
terms of level links. But if we have, say, an empty
node and a node that cannot afford any stealing. So we have single child
here, two children, and we merge it into-- We're taking something
from our parent. Bringing it down. Then we have three
children afterwards. Again, we used to have
these level pointers. Now we just have
these level pointers. It's easy to maintain. It's just a constant
size neighborhood. Because we have
the level links, we can get to our left
and right neighbors and change where
the links point to. So easy to maintain
in constant time. I'll call it constant overhead. Every time we do
a split or merge we spend additional
constant time to do it. We're already spending
constant time. So just changes everything
by constant factor. So far, so good. Now, I'm going to have
to tweak this data structure a little bit. But let me first tell you why. What am I trying to achieve
with this data structure? What I'm trying to achieve is
something called the finger search property. So let's just think
about the case where I'm doing a
successful search. I'm searching for key x and I
find it in the data structure. I find it in the tree. Suppose I found one-- I
search for x, I found it. And then I search
for another key y. Actually I think
I'll do the reverse. First I found y, now
I'm searching for x. If x and y are
nearby in the tree, I want this to run
especially fast. For example, if x is
the successor of y I want this to
take constant time. That would be nice. In the worst case x and y
are very far away from me in the tree then I want
it to take log n time. So how could I interpolate
between constant time for finding the
successor and log n time for finding the
worst case search. So I'm going to call
this search of x from y. Meaning, this is a little
imprecise, but what I mean is when I call
search, I tell it where I've already found y. And here it is. Here's the node storing y. And now I'm given a key x. And I want to find
that key x given the node that stores key y. So how long should this take? Will be a good
way to interpolate between constant
time at one extreme. The good case, when
x and y are basically neighbors in sorted
order, versus log n time, in the worst case. AUDIENCE: Distance
along the graph. PROFESSOR: Distance
along the graph. That would be one
reasonable definition. So I have a tree which you
could think of as a graph. Measure the shortest
path length from x to y. Or we have a more
sophisticated graph over here. Maybe that length. The trouble with the
distance in the graph, that's a reasonable suggestion,
but it's very data structure specific. If I use an AVL tree
without level links, then the distance
could be one thing, whereas if I use a 2-3 tree,
even without level links, it's going to be a
different distance. If I use a 2-3 tree
with level links it's going to be yet
another distance. So that's a little unsatisfying. I want this to be an
answer to a question. I don't want to phrase the
question in terms of that data structure. AUDIENCE: Difference
between ranks of x and y? PROFESSOR: Difference between
ranks between x and y. That's close. So I'm going to look at the
rank of x and rank of y. Let's say, take the
absolute difference. That's kind of how far away
they are in sorted order. Do you want to add anything? AUDIENCE: Log? PROFESSOR: Log. Yeah. Because in the worst case
the difference in ranks could be linear. So I want to add a log out
here to get log n in that worst case. Add a big o for safety. That's how much time
we want to achieve. So this would be the
finger search property that you can solve this
problem in this much time. Again, difference in
ranks is at most n. So this is at most log n. But if y is the successor of
x this will only be constant and this will be constant. So this is great if you're
doing lots of searches and you tend to search for
things that are nearby, but sometimes you search
for things that are far away. This gives you a nice bound.
have, this is our goal. Log difference of ranks. On the other hand, we
have the suggestion that what we can
achieve is something like the distance in the graph. But we have a problem with this. I used to think that data
structure solved this problem, but it doesn't. Let me just draw-- actually
I have a tree right there. I'm going to use that one. Suppose x is here and y is here. OK. This is a bit of a small tree
but if you think about it long enough, this node is
the predecessor of this node. So their difference
in ranks should be 1. But the distance in
the graph here is two. Not very impressive. But in general, you have
a tree of height log n. If you look at the root, and
the predecessor of the root, they will have a rank
difference of one by definition of predecessor. But the graph distance
will be log n. So that's bad news, because if
we're only following pointers there's no way to get from
here to there in constant time. So we're not quite there. We're going to use another tweak
to that data structure, which is to store the data in the leaves. I tried to find a data structure
that didn't require this and still got finger search. But as far as I
know, there is none. No such data structure. If you look at, say,
Wikipedia about B-trees, you'll see there's a ton
of variations of B-trees. B+-trees, B*-trees. This is one of those. I think B+-trees. As you saw, B-trees or
2-3 trees, every node stored one or two keys. And each key only
existed in one spot. We're still only going to put
each key in one spot, kind of. But it's only going
to be the leaf spots. OK. Good news is most nodes
are leaves, right? Constant fraction of the
nodes are going to be leaves. So it doesn't change too
much from a space efficiency standpoint. If we just put data down
here and don't put-- I'm not going to put any
keys up here for now. So this a little weird. Let me draw an example
of such a tree. So maybe we have 2, and 5,
and 7, and 8, 9, let's say. Let's put 1 here. So I'm going to have a node
here with three children, a node here with two
children, and here's a node with two children. So I think this mimics
this tree, roughly. I got it exactly right. So here I've taken
this tree structure. I've redrawn it. There's now no keys
in these nodes. But everything else is
going to be the same. Every node is going
to have 0 children if it's a leaf, or two, or
three children otherwise. Never have one child
because then you wouldn't get logarithmic depth. All the leaves are going
to be at the same depth. And that's it. OK. That is a 2-3 tree with the
data stored in the leaves. It's a useful trick to know. Now we're going to do a
level linked 2-3 tree. So in addition to
that picture, we're going to have links like this. OK. And I should check that I can
still do insert and delete into these structures. It's actually not too hard. But let's think about it. I think, actually,
it might be easier. Let's see. So if I want to
do an insert-- OK. I have to first search
for where I'm inserting. I haven't told you
how to do search yet. OK. So let's first
think about search. What we're going to do is
data structure augmentation. We have simple
tree augmentation. So I'm going to do
it and each node, what the functions
I'm going to store are the minimum
key in the subtree, and the maximum
key in the subtree. There are many ways to
do this, but I think this is kind of the simplest. So what that means
is at this node, I'm going to store 1 as
the min and 7 as the max. And at this node it's
going to be 1 at the min and 9 at the max. And here we have 8 as
the min and 9 as the max. Again min and max of
subtrees are easy to store. If I ever change a
node I can update it based on its children,
just by looking at the min of the
leftmost child and the max of the rightmost child. If I didn't know 1
and 9, I could just look at this min and
that max and that's going to be the min and the
max of the overall tree. So in constant time
I can update the min and the max of a
node given the min and the max of its children. Special case is at the leaves. Then you have to actually
look at keys and compare them. But leaves only have,
at most, two keys. So pretty easy to compare
them in constant time. OK. So that's how I do
the augmentation. Now how do I do a search? Well, if I'm at a node and
I'm searching for a key. Well, let's say
I'm at this node. I'm searching for a key like 8. What I'm going to do is
look at all of the children. In this case, there's two. In the worst case there's three. I look at the min and max
and I see where does 8 fall? Well it falls in this interval. If I was searching for 7
1/2 I know it's not there. It's going to be
in between here. If I'm doing a successor
then I'll go to the right. If I'm doing predecessor
I'll go to the left. And then take either the maximum
item or the minimum item. If I'm searching
for 8 I see, oh. 8 falls in the interval
between 8 and 9, so I should clearly
take the right child among those two children. In general, there's
three children. Three intervals. Constant time. I can find where my key
falls in the interval. OK. So search is going to take
log n time again, provided I have these mins and maxs. If you stare at it
long enough, this is pretty much the same thing
as regular search in a 2-3 tree. But I've put the data just one level down. OK. Good. That was regular search.
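A hedged sketch of that downward search, assuming each node has min, max, and a children list (empty at a leaf), with the data stored at the leaves.

    def search_down(v, x):
        # Walk down to the leaf whose [min, max] interval is closest to x;
        # at most three children to inspect at each level.
        while v.children:
            # max(c.min - x, x - c.max) is <= 0 exactly when x lies in
            # c's interval, and grows with the distance otherwise.
            v = min(v.children, key=lambda c: max(c.min - x, x - c.max))
        return v   # the leaf containing x, or x's predecessor/successor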
I still need to do finger search, but we'll get there. And now, if I want to do
an insert into this data structure, what happens. Well I search for the key
let's say I'm inserting 6. So maybe I go here. I see where 6 falls. 6 is in neither of
these intervals. But it's closest to the interval
2, 5, or the interval 7. Let's say I go down to 2, 5. And well, to insert 6 I'll
just add a 6 on there. Of course, now that
node is too big. So there's still going to be a
split case at the leaves where I have let's say,
a,b,c, too many keys. I'm going to split
that into a,b and c. This is different from before. It used to be I would
promote b to the parent because the parent
needed the key there. Now parents don't have keys. So I'm just going to split
this thing, roughly, in half. It works. It's still the case that
whoever was the parent up here now has an
additional child. One more child. So maybe that node
now has four children but it's supposed
to be two or three. So if I have a node with four
children, what I'm going to do-- I'm supposed to use
these fancy arrows. What do I do in this case? It's just going to split
that into two nodes with two children. And again this used
to have a parent. Now that parent has
an additional child, and that may cause
another split. It's just like before. We just potentially split
all the way up to the root. If we split the root then
we get an additional level. But we could do all this and
we can still maintain our level links, if we want. But everything will take log n. I won't draw the
delete case, as delete is slightly more annoying. But I think, in
this case, you never have to worry about where
is the key coming from, your child or your parent? You're just merging nodes so
it's a little bit simpler. But you have to deal
with the leaf case separately from
the nonleaf case. OK. So all this was to
convince you that we can store data in the leaves. 2-3 trees still work fine. Now I claim that the graph
distance in level link trees is within a constant factor
of the finger search bound. So I claim I can get the finger
search property in 2-3 trees, with data in the leaves,
with level links. So lots of changes here. But in the end, we're going
to get a finger search bound. Let's go over here. So here's a finger
search operation. First thing I want to
do is identify a node that I'm working with. I want to start from y's node. So we're supposing that we're
told the node, a leaf, that contains y. So I'm going to
let v be that leaf. OK. Because we're supposing
we've already found y, and now all the data
is in the leaves. So give me the leaf
that contains y. So that should
take constant time. That's just part of the input. Now I'm going to do a
combination of going up and horizontal. So starting at a leaf. And the first thing I'm
going to do is check, does this leaf
contain what I want? Does it contain the key I'm
searching for, which is x? So that's going to be the case. At every node I store
the min and the max. So if x happens to fall
between the min and the max, then I'm happy. Then I'm going to do a
regular search in v's subtree. This seems weird in
the case of a leaf. In the case of a
leaf, this is just to check the two
keys that are there. Which one is x. OK. But in general I gave
you this search algorithm which was, if I decide
which child to take, according to the ranges,
that's a downward search. So that's what I'm calling
regular search here. Maybe downward would
be a little better. This is the usual
log n time thing. But we're going to claim
a bound better than log n. If this is not the
case, then I know x either falls before
v.min or after v.max. So if x is less than v.min
then I'm going to go left. v equals v.level-left, I'll call it, to be clear. You might say left
is the left child. There's no left child
here, of course. But level left is clear. We take the horizontal
left pointer. And otherwise x is
greater than v.max. And in that case
I will go right. That seems logical. And in both cases
we're going to go up. x equals x.parent-- whoops, v equals v.parent. x is not changing here. x is the key we're searching
for. v is the node. V for vertex. So we're always going
to go up, and then we're going to go
either left or right, and we're going to keep
doing that until we find a subtree that contains
x in terms of key range. Then we're going
to stop this part and we're just going
to do downward search. I should say return
here or something. I'm going to do a
downward search, which was this regular algorithm. And then whatever it finds, that's what I return.
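Assembling the pieces, a hedged sketch of the whole finger search, reusing search_down from before and assuming x lies within the tree's overall key range:

    def finger_search(x, leaf_of_y):
        v = leaf_of_y
        while not (v.min <= x <= v.max):
            if x < v.min:
                v = v.level_left    # horizontal step skips a whole subtree
            else:
                v = v.level_right
            v = v.parent            # then one step up a level
        return search_down(v, x)    # regular downward search from there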
I claim the algorithm should be clear. What's less clear is that
it achieves the bound that we want. But I claim that this will
achieve the finger search property. Let me draw a picture
of what this thing looks like kind of generically. On small examples it's hard
to see what's going on. So I'm going to draw a
piece of a large example. Let's say we start here. This is where y was. I'm searching for x. Let's suppose x is to the right. 'Cause otherwise I go
to the other board. So x is to the right. I'll discover that the range
with just this node, this node maybe contains one other key. I'll find that
range is too small. So I'm going to go follow
the level right pointer, and I get to some other node. Then I'm going to
go to the parent. Maybe the parent was the
parent of those two children so I'm going to
draw it like that. Maybe I find this
range is still too low. I need to go right to get to x,
so I'm going to follow a level pointer to the right. I find a new subtree. I'll go to its parent. Maybe I find that this subtree,
still the max is too small. So I have to go to
the right again. And then I take the parent. So this was an example
of a rightward parent. Here's an example of
a leftward parent. This is maybe the parent of
both of these two children. Then maybe this subtree
is still too small, the max is still smaller than x. So then I go right
one more time. Then I follow the parent. Always alternating
between right and parent until I find a node
whose subtree contains x. It might have actually, x
may be down here, because I immediately went to the
parent without checking whether I found where x is. But if I know that x is
somewhere in here then I will do a downward search. It might go left and then down
here, or it might go right, or there's actually
potentially three children. One of these searches
will find the key x that I'm looking
for because I'm in the case where x is
between v.min and v.max, so I know it's in
there, somewhere. It could be x doesn't exist, but
its predecessor or successor is in there somewhere. And so one of these
three subtrees will contain the x range. And then I go follow that path. And keep going down until I
find x or its predecessor or successor. Once I find its predecessor I
can use a level right pointer to find its
successor, and so on. So that's kind of the general
picture what's going on. We keep going rightward
and we keep going up. Suppose we do k up steps. Let's look at this
last step here. Step k. How high am I in the tree? I started at the leaf level. Remember in a 2-3 tree all the
leaves have the same level. And I went up every step. Sorry. I don't know what this
is, like the 2-step dance where, let's say every
iteration of this loop I do one left or right step,
and then a parent step. So I should call
this iteration k. I guess there's
two k steps, then. Just to be clear. So in iteration k, that
means I've gone up k times and I've gone either
right or left k times. You can show if you start going
right you keep going right. If you initially go left
you'll keep going left. Doesn't matter too much. At iteration k I am at
height k, or k minus 1, or however you want to count. But let's call it k. So when I do this
right pointer here I know that, for
example, I am skipping over all of these keys. All the keys down-- the
keys are in the leaves, so all these things down
here, I'm jumping over them. How many keys are down there? Can you tell me,
roughly, how many keys I'm skipping over when I'm
moving right at height k? It's not a unique answer. But you can give me some bounds. Say again. AUDIENCE: Number of children to the k power. PROFESSOR: Yeah. Except we don't know
the number of children. But it's between 2 and 3. Closer ones should be easy, but I fail. So it's between two
and three children. So there's the number-- if
you look at a height k tree, how many leaves does it have? It's going to be between
2 to the k and 3 to the k. Because I have between 2 and
3 children at every node. And so it's exponential in k. That's all I'll need. OK. When I'm at height k here,
I'm skipping over a height k minus 1 tree or something. But it's going to be-- So in iteration k I'm skipping,
at least, some constant times 2 to the k. Maybe to the k minus
1, or to the k minus 2. I'm being very sloppy. Doesn't matter. As long as it's exponential
in k, I'm happy. Because I'm supposing that
x and y are somewhat close. Let's call this
rank difference d. Then I claim the
number of iterations I'll need to do in this loop
is, at most, order log d. Because if, when I get
to the k-th iteration, I'm jumping over 2
to the k elements. How large does k have
to be before 2 to the k is larger than d? Well, log d. Log base 2. The number of
iterations is order log d, where d is
the rank difference. d is the absolute value between
rank of x and rank of y. And I'm being a
little sloppy here. You probably want
to use an induction. You need to show that
they're really, these items here that you're skipping
over that are strictly between x and y. But we know that there's
only d items between x or y. Actually d minus 1, I guess. So as soon as we've skipped over
all the items between x and y, then we'll find a
range that contains x, and then we'll go do
the downward search. Now how long does the
downward search cost? Whatever the height
of the tree is. What's the height of the tree? That's the number of iterations. So the total cost. The downward search
will cost the same as the rest of the search. And so the total cost is
going to be order log d. Clear? Any questions about finger
searching with level linked data at the
leaves, 2-3 trees? AUDIENCE: Sir, I'm not sure
why [INAUDIBLE] d, why is that? PROFESSOR: I'm defining d to be
the rank of x minus rank of y. My goal is to achieve
a log d bound. And I'm claiming that because
once I've skipped over d items, then I'm done. Then I've found x. And at step k I'm skipping
over 2 to the k items. So how big is k going to be? Log d. That's all. I used d for a notation here. Cool. Finger searching. It's nice. Especially if you're doing
many consecutive searches that are all relatively
close to each other. But that was easy. Let's do a more
difficult augmentation. So the last topic for
today is range trees. This is probably the coolest
example of augmentation, at least, that you'll
see in this class. If you want to see
more you should take Advanced Data Structures, 6.851. And range trees solve
a problem called orthogonal range searching. Not orthogonal search ranging. Orthogonal range search. So what's the problem? I'm going to give you
a bunch of points. Draw them as fat dots so
you can actually see them. In some dimension. So this is, for
example, a 2D point set. OK. Over here I will
draw a 3D point set. You can tell the
difference, I'm sure. There. Now it's a 3D point set. And this is a static point set. You could make this
dynamic but let's just think about the static case. Don't want the 2D points
and the 3D points to mix. Now, you get to preprocess
this into a data structure. So this is a static
data structure problem. And now I'm going to come along
with a whole bunch of queries. A query will be a box. OK. In two dimensions, a
box is a rectangle. Something like this. Axis aligned. So I give you an x min, x
max, a y min, and a y max. I want to know what
are the points inside. Maybe I want you to list them. If there's a lot
of them it's going to take a long
time to list them. Maybe I just want to know
10 of them as examples. Maybe this is a Google
search or something. I just get the first 10
results in the first page, I hit next then want the
next 10, that kind of thing. Or maybe I want to know how
many search results there are. Number of points
in the rectangle. Bunch of different problems. In 3D, it's a 3D box. Which is a little
harder to draw. You can't really tell which
points are inside the box. Let's say these three points
are all inside the box. I give you an interval
in x, an interval in y, and an interval in
z, and I want to know what are the points inside. How many are there? List them all. List 10 of them, whatever. OK. I want to do this in
poly log time, let's say. I'm going to achieve today
log squared for the 2D problem and log cubed for
the 3D problem, plus whatever the
size output is. So let me just write that down. So the goal is to preprocess
n points in d dimensions. So you get to spend a
bunch of time preprocessing to support a query which is,
given a box, axis aligned box, find let's say the number
of points in the box. Find k points in the box. I think that's good. That includes, as a special case, finding all the points in the box. So this, of course, we have
to pay a penalty of order k for the output. No getting around that. But I want the rest of the
time to be log to the d. So we're going to
achieve log to the d n plus size of the output. And you get to control how
big you want the output to be. So it's a pretty
reasonable data structure. In a certain sense we will
understand what the output is in log to the d time. If you actually
want to list points, well, then you have to
spend the time to do it. All right. So 2D and 3D are great,
but let's start with 1D. First we should
understand 1D completely, then we can generalize. 1D we already know how to do. 1D I have a line. I have some points on the line. And I'm given, as a
query, some interval. And I want to know how many
points are in the interval, give me the points in
the interval, and so on. So how do I do this? Any ways? If d is 1. So I want to achieve
log d, sorry, log n, plus size of output. I hear whispers. Yeah? AUDIENCE: Segment trees? PROFESSOR: Segment tree? That's fancy. We won't cover segment trees. Probably segment trees do it. Yeah. We know lots of ways to do this. Yeah? AUDIENCE: Sorted array? PROFESSOR: Sorted array
is probably the simplest. If I store the items in a sorted
array and I have two values, I'll call them x1
and x2, because it's the x min and x max. Binary search for x1. Binary search for x2. Find the successor of x1
and the predecessor of x2. I'll find these two guys. And then I know all
the ones in between. That's the match. So that'll take log n time to find those points and then we're good.
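For instance, with Python's standard bisect module, the sorted-array version is only a few lines.

    import bisect

    def range_query_1d(sorted_keys, x1, x2):
        # lo is the index of the successor of x1 (first key >= x1);
        # hi is one past the predecessor of x2 (first index with key > x2).
        lo = bisect.bisect_left(sorted_keys, x1)
        hi = bisect.bisect_right(sorted_keys, x2)
        return hi - lo, sorted_keys[lo:hi]  # count in O(log n), list in O(k)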
So we could do a sorted array. Of course, a sorted array is a little hard to generalize. I don't want to do a 2D
array, that sounds bad. You could, of course,
do a binary search tree. Like an AVL tree. Same thing. Because we have log n
search, find successor, and predecessor, I guess
you could use Van Emde Boas, but that's hard to
generalize to 2D. You could use level links. Here's a fancy version. We could use level linked 2-3
trees with data in the leaves. Then once I find x
min, I find this point, I can go to the successor
in constant time because that's a finger search
with a rank difference of 1. And I could just keep
calling successor and in constant time per item
I will find the next item. So we could do that easily
with the sorted array. BST is not so great
because successor might cost log n each time. But if I have the
level links then basically I'm just
walking down the link list at the bottom of the tree. OK. So actually level
linked is even better. BST would achieve something
like log n plus k log n, where k is the size of the output. If I want k points in the
box I have to pay log n. For each level linked I'll
only pay log n plus k. Here I actually only need
the levels at the leaves. Level links. OK. All good. But I actually want to
tell you a different way to do it that will
generalize better. The pictures are
going to look just like the pictures
we've talked about. So these would actually
work dynamically. My goal here is just to achieve
a static data structure. I'm going to idealize this
solution a little bit. And just say, suppose I
have a perfectly balanced binary search tree. That's going to be
my data structure. OK. So the data structure is not
hard, but what's interesting is how I do a range search. So if I do range query of the
interval, I'll call it [a, b]. Then what I'm going to do
is do a binary search for a, do a binary search for
b, trim the common prefix of those search paths. That's basically finding
the lowest common ancestor of a and b. And then I'm going
to do some stuff. Let me draw the picture. So here is, suppose here's
the node that contains a. Here's the node that contains b. They may not be at the
same depth, who knows. Then I'm going to look
at the parents of a. I just came down from some path
here, and some path down to b. I want to find this
branching point where the paths to a and
the paths to b diverge. So let's just look
at the parent of a. It could be a right
parent, in which case there's a subtree here. Could be a left parent in
which case, subtree here. I'm going to follow
my convention again. That x-coordinate
corresponds roughly to key. Left parent here. Maybe right parent here. Something like that. OK. Remember it's a perfect tree. So, actually, all the leaves
will be at the same level. And, roughly here, x-coordinate
corresponds to key. So here is a. And I want to return all the
keys that are between a and b. So that's everything
in this sweep line. The parents of the LCA don't
matter, because those parents are either going to be way over
the interval a to b. So what I've tried
to highlight here, and I will color it in
blue, is the relevant nodes for the search between a and b. So a is between a and b. This subtree is greater
than a and less than b. This node, and these nodes. This node, and these nodes. This node and these nodes. The common ancestor. And then the corresponding
thing over here. All the nodes in all
these blue subtrees, plus these individual
nodes, fall in the interval between a and b, and that's it. OK. This should look super familiar. It's just like when
we're computing rank. We're trying to figure out
how many guys are to our left or to our right. We're basically doing
a rightward rank from a and a
leftward rank from b. And that finds all the nodes. And stopping when those
two searches converge. And then we're finding all
the nodes between a and b. I'm not going to write down
the pseudocode because it's the same kind of thing. You look at right
parents and left parents. You just walk up from a. Whenever you get a
right parent then you want that node, and
the subtree to its right. And so that will
highlight these nodes. Same thing for b, but
you look at left parents. And then you stop when
those two searches converge. So you're going to
do them in lock step. You do one step for a and b. One step for a and b. And when they happen to hit the
same node, then you're done. You add that node to your list. And what you end
up with is a bunch of nodes and rooted subtrees. The things I circled in blue
is going to be my return value. So I'm going to return all
of these nodes, explicitly. And I'm also going to
return these subtrees. I'm not going to have
to write them down. I'm just going to return
the root of the subtree, and say, hey look. Here's an entire
subtree that contains points that are in the answer. Don't have to list
them explicitly, I can just give you the tree. Then if I want to know how
many results are in the answer, well, just augment to store
subtree size at the beginning. And then I can
count how many nodes are down here, how many
nodes are down here, add that up for
all the triangles, and then also add one for
each of the blue nodes, and then I've counted the size
of the answer in how much time? How many subtrees and how many
nodes am I returning here? Log. Log n nodes and log n rooted
subtrees because at each step, I'm going up by one for
a, and up by one for b. So it's like 2 log n. Log n. So I would call this an implicit
representation of the answer. From that implicit representation you can use the subtree-size augmentation to count the size of the answer. You can just start walking through one by one, do an in-order traversal of the trees, and you'll get the first k points in the answer in order k time.
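Here's a sketch of that query, written top-down from the root rather than bottom-up from a and b (it collects the same O(log n) nodes and O(log n) subtree roots); nodes are assumed to have key, left, and right fields.

    def collect(node, a, b, nodes, subtrees):
        # Find the split node where the search paths to a and b diverge.
        v = node
        while v is not None and not (a <= v.key <= b):
            v = v.left if b < v.key else v.right
        if v is None:
            return nodes, subtrees            # empty range
        nodes.append(v)                       # the split node is in range
        u = v.left                            # walk down toward a
        while u is not None:
            if a <= u.key:
                nodes.append(u)               # in range: keep the node...
                if u.right:
                    subtrees.append(u.right)  # ...and its right subtree
                u = u.left
            else:
                u = u.right
        u = v.right                           # symmetric walk toward b
        while u is not None:
            if u.key <= b:
                nodes.append(u)
                if u.left:
                    subtrees.append(u.left)
                u = u.right
            else:
                u = u.left
        return nodes, subtrees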
Question? AUDIENCE: Just a clarification. You said when we were walking up, you want to get
all the ancestors and their right subtrees. But you don't do that for
the left parent, right? PROFESSOR: That's right. As I'm walking up the tree,
if it's a right parent then I take the right subtree
and include that in the answer. If it's a left parent
just forget about it. Don't do anything. Just keep following parents. Whenever I do right
parent then I also add that node and
the right subtree. If it's a left parent I
don't include the node, I don't include
the left subtree. I also don't include
the right subtree. That would have too much stuff. It's easy when you
see the picture, you would write
down the algorithm. It's clear. It's left versus right parents. AUDIENCE: Would you include
the left subtree of b? PROFESSOR: I would
also-- thank you. I should color the
left subtree of b. I didn't apply
symmetry perfectly. So we have the right subtree
of a and the left subtree of b. Thanks. I would also include b if
it's a closed interval. Slightly more general. If a and b are not
in the tree then this is really the successor of a and
this is the predecessor of b. So then a and b don't
have to be in there. This is still a well
defined range search. OK. Now we really understand 1D. I claim we've almost
solved all dimensions. All we need is a little
bit of augmentation. So let's do it. Let's start with 2D. But then 3D, and 4D,
and so on will be easy. Why do I care about
4D range trees? Because maybe I have a database. Each of these points
is actually just a row in the database which has
four columns, four values. And what I'm trying to do
here is find all the people in my database that have a
salary between this and this, and have an age
between this and that, and have a profession
between this and this. I don't know what that means. Number of degrees between
this and this, whatever. You have some numerical data
representing a person or thing in your database, then this
is a typical kind of search you want to do. And you want to know how
many answers you've got and then list the first
hundreds of them, or whatever. So this is a practical
thing in databases. This is what you might call
an index in the database. So let's start. Suppose your data is
just two dimensional. You have two fields
for every item. What I'm going to do is store
a 1D range tree on all points by x. So this data structure makes
sense if you fix a dimension. Say x is all I care about. Forget about y. So my point set. Yeah. So what that corresponds to is
projecting each of these points onto the x-axis. And now also
projecting my query. So my new query is
from here to here in x. And so this data
structure will let me find all these
points that match in x. That's not good because
there's actually only two points that
I want, but I find four points in this picture. But it's half of the answer. It's all the x matches
forgetting about y. Now here's the fun part. So when I do a search
here I get log n nodes. Nodes are good because they
have a single key in them. So I'll just check for
each of those log n nodes. Do they also match in y? If they do, add
it to the answer. If they don't forget about it. OK. But the tricky part is I also
get log n subtrees representing parts of the answer. So potentially it could be that
your search, this rectangle, only has like five points. But if you look at this
whole vertical slab there's a billion points. Now, luckily, those
billion points are represented succinctly. There's just log
n subtrees saying, well there's half
a billion here, and a quarter billion here, and
an eighth of a billion here. Now for each of that
big chunk of output, I want to very quickly find
the ones that match in y. How would I find the
ones matching in y? A range tree. Yeah. OK. So here's what
we're going to do. For each node, call it x. x is overloaded. It's a coordinate. So many things. Let's call it v.
In this thing that I'm going to call the x-tree. So for every node
in the x-tree I'm going to store
another 1D range tree. But this time using
the y-coordinate on all points in these rooted subtree. At this point I really
want to draw a diagram. So, rough picture. Forgive me for not
drawing this perfectly. This is roughly what
the answer looks like for the 1D range search. This is the x-tree. And here I've searched between
this value and this value in the x-coordinate. Basically I have log n nodes. I'm going to check
those separately. Then I also have
these log n subtrees. For each of those
log n sub trees I'm going to have
a pointer-- this is the augmentation-- to another
tree of exactly the same size. On exactly the same
data that's in here. It's also over here. But it's going to
be sorted by y. And it's a 1D range tree by y. Tons of data duplication here. I took all these points and I
copied them over here, but then built a 1D range tree in y. This is all preprocessing. So I don't have to pay for this. It's polynomial time. Don't worry too much. And then I'm going
to search in here. What does the search
in there look like? I'm going to get, you know,
some more trees and a couple more nodes. OK. But now those items, those
points, match in x and y because this whole
subtree matched in x and I just did a y search, so I
found things that matched in y. So I get here
another log n trees that are actually in my answer. And for each of these nodes I
have a corresponding other data structure where I
do a little search and I get part of the answer. Every one. Sounds huge. This data structure sounds
huge, but it's actually small. But one thing that's clear is
it takes log squared n time, because I have log n
triangles over here. For each of them I spend log
n to find triangles over here. The total output is log squared
n nodes, for each of them I have to check manually. Plus, so over here,
there's log n, different searches I'm doing. Each one has size log n. So I get log squared
little triangles that contain the results that match in x and y.
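Stacking the two phases, a hedged sketch of the 2D query, reusing collect from the 1D sketch; point and ytree are assumed field names for a node's stored point and its associated y-sorted tree.

    def range_query_2d(xtree_root, x1, x2, y1, y2):
        # Phase 1: 1D query by x on the x-tree.
        xnodes, xsubtrees = collect(xtree_root, x1, x2, [], [])
        # The O(log n) individual nodes get their y-coordinates checked by hand.
        nodes = [v for v in xnodes if y1 <= v.point.y <= y2]
        subtrees = []
        # Phase 2: each x-matching subtree root points to a 1D range tree
        # on the same points, keyed by y -- that's the augmentation.
        for r in xsubtrees:
            ynodes, ysubtrees = collect(r.ytree, y1, y2, [], [])
            nodes += ynodes         # these now match in both x and y
            subtrees += ysubtrees   # O(log^2 n) pieces in total
        return nodes, subtrees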
How much space in this data structure? That's the remaining challenge.
because if you look at a key. So look at some
key in this x-tree. Let's look at a leaf
because that's maybe the most interesting. Here's the x-tree. x-tree has linear size. Just one tree. If I look at some key value,
well, it lives in this subtree. And so there's going to be a
corresponding blue structure of that size that
contains that key. And then there's the parent. So there's a structure here. That has a corresponding
blue triangle. And then its parent,
that's another triangle. That contains-- I'm
looking at a key k here. All of these triangles
contain the key k. And so key k will be
duplicated all this many times, but how many subtrees is k in?
about balanced binary search trees, each key lives
in log n subtrees. Namely all of its ancestors. Awesome. Because that means the
total space is n log n. There's n keys. Each of them is duplicated
at most log n times. In general, log
to the d minus 1. So If you do it in 3D,
each of the blue trees, every node in it has a
corresponding pointer to a red tree
that's sorted by z. And you just keep doing this,
sort of, nested searching, like super augmentation. But you're only losing a log
factor each dimension you add.