1. Persistent Data Structures

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: Welcome to 6.851, Advanced Data Structures. I am Erik Demaine. You can call me Erik. We have two TAs, Tom Morgan and Justin Zhang. Tom's back there. Justin is late. And this class is about all kinds of very cool data structures. You should have already seen basic data structures like balanced binary search trees and things like that -- log n time to do whatever you want in one dimension. And here we're going to turn all those data structures on their head and consider them in all sorts of different models and additional cool problems.

Today we're going to talk about time travel, or temporal data structures, where you're manipulating time as any good time traveler should. Then we'll do geometry, where we have higher-dimensional data, more than one dimension. Then we'll look at a problem called dynamic optimality, which is: is there one best binary search tree that rules them all? Then we'll look at something called memory hierarchy, which is a way to model more realistic computers, which have cache and then more cache and then main memory and then disk -- all these different levels. How do you optimize for that? Hashing is probably the most famous, most popular, most used data structure in computer science. We'll do a little bit on that. Integers: when you know that your data is integers, and not just arbitrary black boxes that you can compare or do whatever, you can do a lot better. You usually beat log n time. Often you can get constant time. For example, if you want to do priority queues, you can do square root of log log n time. That's the best known, randomized. Dynamic graphs: you have a graph you want to store, and the edges are being added and maybe deleted -- like you're representing a social network, and people are friending and de-friending -- and you want to maintain some interesting information about that graph. Strings: you have a piece of text, such as the entire world wide web, and you want to search for a substring. How do you do that efficiently? It's sort of the Google problem. Or you're searching through DNA for patterns, whatever. And finally, succinct data structures, which is all about taking what we normally consider optimal space, or n space, and reducing it down to the very bare minimum of bits of space. Usually if you want to store something where there are 2 to the n possibilities, you want to get away with n bits of space, maybe plus square root of n or something very tiny. So that's succinct data structures. So that's an overview of the entire class, and these are sort of the sections we'll be following.

Let me give you a quick administrative overview of what we're doing. Requirements for the class: I guess, first, attending lecture. Obviously if you don't attend lecture, there'll be videos online, so that's resolvable. But let me know if you're not going to make it. We're going to have problem sets roughly every week. If you're taking the class for credit, they have a very simple rule of one page in, one page out. This is more a constraint on us, to write problems that have easy or short answers. You probably need to think about them a little bit before they're transparent, but then they're easy to write up. And then scribing lectures -- so we have a scribe for today, I hope.
Here? Yes, good. So most of the lectures have already been scribed in some version, and your goal is to revise those scribe notes so that, if you don't like the handwritten notes, which are also online, there's something easier for people to read. Let's see. Listeners welcome. We're going to have an open problem session. I really like open problems. I really like solving open problems. So we've done this every time this class has been offered. If you're interested in also solving open problems, it's optional. In a couple of weeks, I will organize a weekly open problem session, and we'll try to solve all the things and push the frontier of advanced data structures. So in classes, we'll see the state of the art, and then we'll change the state of the art in those sessions. I think that's it. Any questions about the class before we get into the fun stuff? All right. Let's do some time traveling.

Before I get to time traveling, though, I need to define our model of computation. A theme in this class is that the model of computation you're working with matters. Models matter. And there's lots of different models of computation. We'll see a few of the main ones in this class. And the starting point, and the one we'll be using throughout today, is called a pointer machine. It's an old one from the '80s. And it corresponds to what you might think about if you've done a lot of object-oriented programming -- and before that, structure-oriented programming, I guess. So you have a bunch of nodes, and they have some fields in them, a constant number of fields. You can think of these as objects, or structs in C. They used to be records back in the Pascal days, so a lot of the papers call them records. You just have a constant number of fields. You can think of those as numbered or labeled; it doesn't really matter, because there's only a constant number of them. Each of the fields could be a pointer to another node, could be a null pointer, or could have some data in it. So I'll just assume that all my data is integers. You can have a pointer to yourself, you can have a pointer over here, whatever you want. A pointer machine would look something like this. At any moment, this is the state of the pointer machine. So you can think of this as the memory of your computer. And then you have some operations that you're allowed to do. That's the computation part of the model; you can think of this part as the memory model.

What you're allowed to do is create nodes: you can say something like, x = new node. You can, I don't know, look at fields: you can do x = y.field. You can set fields: x.field = y. You can compute on the data, so you can add 5 and 7, do things like that. I'm not going to worry about the rest -- I'll just write et cetera. This is more a model about how everything's organized in memory, not so much about what you're allowed to do to the data items. In this lecture, it won't matter what you're doing to the data items. We never touch them; we just copy them around. So am I missing anything? Probably. I guess you could destroy nodes if you felt like it. But we won't have to today, because we don't want to throw anything away when we're time traveling. It's too dangerous. And then the one catch here is: what are x and y? There's going to be one node in this data structure, or in your memory, called the root node. And you can think of that as the thing you always have in your head. This is like your cache, if you will. It's just got a constant number of things, just like any other node.
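To make the model concrete, here's a minimal sketch of a pointer-machine memory in Python. This is an editorial illustration, not from the lecture; the field count and the names are arbitrary assumptions.

```python
# A pointer-machine node: a constant number of fields, each holding
# either data (an integer), a pointer to another node, or null (None).
class Node:
    FIELDS = 3  # some fixed constant

    def __init__(self):
        self.fields = [None] * Node.FIELDS

root = Node()            # the one node we always hold in our head

x = Node()               # "x = new node"
root.fields[0] = x       # nodes are reachable only via the root
x.fields[0] = 42         # a field can hold data...
x.fields[1] = x          # ...or a pointer, even to yourself
y = root.fields[0]       # "x = y.field": follow a pointer
```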
And x and y are fields of the root. So that sort of ties things down: you're always working relative to the root. But you can look at the data, basically follow this pointer, by looking at the field. You could set one of these pointers -- I think I probably need another operation here, like x = y.field1.field2, that sort of thing, and maybe the reverse. But you can manipulate all nodes via the root, is the idea. You follow pointers, do whatever. So, pretty obvious, slightly annoying to write down formally. But that is a pointer machine.

And what we're going to be talking about today, in time travel, is: suppose someone gives me a pointer-machine data structure -- for example, a balanced binary search tree, or a linked list. A lot of data structures, especially classic data structures, follow the pointer machine model. What we'd like to do is transform that data structure, or make a new pointer-machine data structure, that does extra cool things, namely travel through time. So that's what we're going to do. There are two senses of time travel, or temporal data structures, that we're going to cover in this class. The one for today is called persistence, where you don't forget anything, like an elephant. The other one is retroactivity. Persistence will be today; retroactivity is next class. Basically, these correspond to two models of time travel. Persistence is the branching-universe time travel model, where if you make a change in the past, you get a new universe. You never destroy old universes. Retroactivity is more like Back to the Future, where you go back, make a change, and then you can return to the present and see what happened. This is a lot harder to do, and we'll work on that next class.

So, persistence. The general idea of persistence is to remember everything -- the general goal, I would say. And by everything, I mean different versions of the data structure. In data structures in general, we have update operations and query operations. We're mainly concerned with updates here. Every time you do an update, you think of it as taking a version of the data structure and making a new one. And you never want to destroy old versions. So even though an update like an insert or something changes the data structure, we want to remember that past data as well. And then, to make this reasonable: all data structure operations are relative to a specified version, and an update makes and returns a new version. So when you do an insert, you specify a version of your data structure and the thing you want to insert, and the output is a new version. So then you could insert into that new version and keep going, or maybe go back to the old version and modify that. I haven't said exactly what's allowed here, but this is sort of the general goal.

And then there are four levels of persistence that you might want to get. The first level is called partial persistence. This is the easiest to obtain. In partial persistence, you're only allowed to update the latest version, which means the versions are linearly ordered. This is the easiest to think about. And time travel can easily get confusing, so start simple. We have a timeline of various versions on it. This is the latest. And what we can do is update that version; we'll get a new version, and then our latest is this one. What this allows is looking back at the past, to an old version, and querying that version.
So you can still ask questions about the old versions, if you want to do a search on any of these data structures. But you can't change them. You can only change the most recent version. So that's nice. It's kind of like Time Machine on a Mac, I guess. If you've ever seen the movie Deja Vu -- which is not very common, but it's a good time travel movie -- in the first half of the movie, all they can do is look back at the past. Later they discover that actually they have a full persistence model. It takes a while, for dramatic effect.

In full persistence, you can update anything you want -- update any version. And so then the versions form a tree. OK, so in this model, maybe you initially have a nice line of versions. But now if I go back to this version and update it, I branch -- get a new version here. And then I might keep modifying that version sometimes. Any of these guys can branch. So this is why I call it the branching-universe model: when you update, you branch. No version ever gets destroyed here. Again, you can query all versions. But now you can also update any version; you just make a new version. It's a totally new world. When I update this version, this version knows nothing about this other future -- it's created its own future. There's no way to sort of merge those universes together. It's kind of sad. That's why we have the third level of persistence, which lets us merge timelines. It's great for lots of fiction out there. If you've seen the old TV show Sliders, that would be confluent persistence.

So, confluent persistence: you can combine two versions to create a new version. And in this case, again, you can't destroy old versions -- in persistence, you never destroy versions. So now the versions form a DAG, a directed acyclic graph. So now we're allowing -- OK, you make some changes, whatever. You branch your universe, make some changes. And now I can say, OK, take this version of the data structure and this version, and recombine them. Get a new version, and then maybe make some more changes. OK, what does combine mean? Well, it depends on your data structure. A lot of data structures have combine operations. Like, if you have two linked lists, you can concatenate them; that's an easy operation. Even if you have binary search trees, you can concatenate them reasonably easily and combine them into one big binary search tree. So if your data structure has an operation that takes as input two data structures, then what we're saying is: now it can take two versions, which is more general. So I could take the same data structure, make some changes in one way, separately make some changes in a different way, and then try to concatenate them, or do something crazy. This is hard to do, and for the most part it's an open problem whether it can be done. But I'll tell you about it.

Then there's another level, even more than confluent persistence. This is hard to interpret in the time travel world, but it would be functional data structures. If you've ever programmed in a functional programming language -- it's a little bit annoying from an algorithms perspective, because it constrains you to work in a purely functional world. You can never modify anything. OK, now, we don't want to modify versions; that's fine. But in a functional data structure, you're not allowed to modify any nodes, ever. All you can do is make new nodes. This is constraining, and you can't always get optimal running times in the functional world.
But if you can get a functional data structure, you have all these things. Because if you can't destroy nodes, then in particular, you can't destroy versions, and all of these things just work for free. And a bunch of interesting special cases are known -- like search trees, you can do in the functional world. And that makes all of these things easy. So the rest of this lecture is going to be general techniques for doing partial and full persistence, what we know about confluent, and what we know about functional -- a brief overview. Any questions about those goals, problem definitions? Yeah.

AUDIENCE: I'm still confused about functional, because--

ERIK DEMAINE: What does functional mean? Yeah, I guess you'll see what -- functional looks like all the other things, I agree. You'll see in a moment how we actually implement partial and full persistence. We're going to be changing nodes a lot. As long as we still represent the same data in the old versions, we don't have to represent it in the same way. That lets us do things more efficiently. Whereas in functional, you have to represent all the old versions in exactly the way you used to represent them. Here we can kind of mangle things around, and that makes things faster. Yeah, good question. So it seems almost the same, but it's nodes versus versions. I haven't really defined a version, but the point is just that all the queries answer the same way. That's what you need for persistence. Other questions? All right. Well, let's do some real data structures.

We start with partial persistence. This is the easiest. For both partial and full persistence, there is the following result: any pointer-machine data structure -- with one catch: a constant number of pointers into any node. So this is constant in-degree. In a pointer machine, you always have at most a constant number of pointers out of a node; but for this result to hold, we also need a constant number of pointers into any node. So this is an extra constraint. Such a data structure can be transformed into another data structure that is partially persistent and does all the things it used to do -- so I'll just say, it can be made partially persistent. You have to pay something, but you have to pay very little: a constant amortized multiplicative overhead, and a constant amount of additive space per change in the data structure. So every time you do a modification in your pointer machine -- you set one of the fields to something -- you have to store that forever. I mean, this is the best you could hope to do: you've got to store everything that happened. You pay a constant factor overhead -- eh, we're theoreticians; that doesn't matter. Then you get: any data structure in this world can be made partially persistent. That's nice. Let's prove it.

OK, the idea is pretty simple. Pointer machines are all about nodes and fields. So we just need to simulate whatever the data structure is doing to those nodes and fields, in a way that we don't lose all the information and we can still search it very quickly. The first idea is to store back pointers. And this is why we need the constant in-degree constraint. So if we have a node -- how do I want to draw a node here? -- maybe these are the three fields of the node. I want to also store some back pointers. Whenever there is a node that points to this node, I want to have a back pointer that points back, so I know where all the pointers came from. If there are only p pointers in, then this is fine: there'll be p fields here.
So it's still constant, still in the pointer machine model. OK, I'm going to need some other stuff too. So this is a simple thing, definitely want this. Because if my nodes ever move around, I've got to update the pointers to them. And where are those pointers? Well, the back pointers tell you where they are. Nodes will still be constant size, so we remain a pointer machine data structure. That's idea one.

Idea two is this part: this is going to store something called mods. It could stand for something, but I'll leave it as mods. So these are two of the fields of the data structure. Ah, one convenience here: for back pointers, I'm only going to store them for the latest version of the data structure. Sorry, I forgot about that; we'll come back to it later. And then the idea is to store these modifications. How many modifications? Let's say up to 2p -- p was the bound on the in-degree of a node. So I'm going to allow 2p modifications over here. And what does a modification look like? It's going to consist of three things -- let me get them in the right order -- the version in which something was changed, the field that got changed, and the value it got changed to. So the idea is that these are the fields here. We're not going to touch those. Whatever they are initially, they will stay that way. And so instead of actually changing things like the data structure normally would, we're just going to add modifications here to say: oh well, at this time, this field changed to the value 5. And then later on, it changed to the value 7. And then later on, this one changed to the value 23, whatever. So that's what they'll look like. There's a limit to how many mods we can store -- only a constant number per node -- and our constant will be 2p.

OK, those are the ideas. Now it's just a matter of making this all work, and analyzing that it's constant amortized overhead. So the first thing is: if you want to read a field, how would I read a field? This is really easy. First you look at what the field is in the node itself. But then it might have been changed. And remember, when I say read the field, I actually mean: I'm given some version v, and I want to know what the value of this field is at version v, because I want to be able to look at any of the old data structures too. So this would be, at version v: I just look through all the modifications. There are constantly many, so it takes constant time to just flip through them and say, well, what changes have happened up to version v? So I look at mods with version less than or equal to v. Those are all the changes that happened up to this point. I see, did this field change? I look at the latest one. That's how I read a field of a node -- constant time. There are lots of ways to make this efficient in practice, but for our purposes it doesn't matter. It's constant.
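As a sketch, here's what reading looks like in code. This is an editorial Python rendering with invented names; in partial persistence, versions are just integers, since they're linearly ordered.

```python
class PNode:
    """A node being made partially persistent."""
    def __init__(self, fields, p):
        self.fields = list(fields)  # original field values; never touched
        self.mods = []              # (version, field_index, new_value)
        self.max_mods = 2 * p       # mod capacity: 2p in the lecture's scheme
        self.back = []              # (owner, field) back pointers, latest version only

    def read(self, field, v):
        # Mods are appended in increasing version order (we only ever
        # update the latest version), so the last applicable one wins.
        value = self.fields[field]
        for version, f, val in self.mods:
            if version <= v and f == field:
                value = val
        return value
```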
The hard part is, how do you change a field? Because there might not be any room in the mod structure. So to modify -- say we want to set node.field = x -- what we do is first check: is there any space in the mod structure? If there are any blank mods, so if the node is not full, we just add a mod. The mod will look like (now, field, x); just throw that in there. Because right at this moment -- we maintain a time counter, and just increment it every time we do a change -- this field changed to that value. So that's the easy case. The trouble, of course, is if the node is full -- the moment you've all been waiting for.

So what we're going to do here is make a new node. We've run out of space, so we need to make a new node. We're not going to touch the old node, just going to let it sit. It still maintains all those old versions. Now we want a new node that represents the latest and greatest of this node. So: make a new node -- I'll call it node prime, to distinguish it from node -- with all the mods, and this modification in particular, applied. So we make a new version of this node. It's going to have some different fields -- whatever was the latest version represented by those mods. It's still going to have back pointers, so we have to maintain all those back pointers. And now the mods, initially, are empty, because we just applied them all. So this new node doesn't have any recent mods. The old node represents the old versions; this node is going to represent the new versions. What's wrong with this picture?

AUDIENCE: Update pointers.

ERIK DEMAINE: Update pointers, yeah. There are pointers to the old version of the node, which are fine for the old versions of the data structure. But for the latest version of the data structure, this node has moved to this new location. So if there are any old pointers to that node, we've got to update them in the current version. We have to update them to point to this node instead. The old versions are fine, but the new version is in trouble. Other questions, or all the same answer? Yeah.

AUDIENCE: So if you wanted to read an old version but you just have the new version, [INAUDIBLE]?

ERIK DEMAINE: OK--

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: The question is essentially, how do we hold on to versions? Essentially, you can think of a version of the data structure as where the root node is. That's probably the easiest. I mean, in general, we're representing versions by a number, v. But we always start at the root. And so you're given the data structure, which is represented by the root node, and you say: search for the value 5 -- is it in this binary search tree or whatever? And then you just start navigating from the root, but you know, I'm in version a million or whatever; I know what version I'm looking for. So you start with the root, which never changes, let's say, and then you follow pointers that essentially tell you, for that version, where you should be going. I guess at the root, it's a little trickier. You probably want a little array that says, for this version, here's the root node. But that's a special case. Yeah, another question?

AUDIENCE: So on the new node that you created, the fields that you copied, you also have to have a version for them, right? Because [INAUDIBLE]?

ERIK DEMAINE: These--

AUDIENCE: Or do you version the whole node?

ERIK DEMAINE: Here we're versioning the whole node. The original field values represent what was originally there, whenever this node was created. Then the mods specify at what times the fields changed. So I don't think we need times here.

All right. So we've got to update two kinds of pointers. There are regular pointers, which live in the fields -- things pointing to the node. But then there are also back pointers: if this is a pointer to a node, then there'll be a back pointer back to the node. And all of those have to change. Conveniently, the back pointers are easy. If there are back pointers to the node, we change them to node prime. How do we find the back pointers? Well, we just follow all the pointers, and there will be back pointers there.
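Continuing the sketch from above, the write path might look like the following. Again this is an editorial compression with invented names; the real construction maintains the back-pointer bookkeeping in full, where this sketch abbreviates it.

```python
def write(node, field, value, now):
    """Set node.field = value at time `now` (we only ever touch the latest version)."""
    if len(node.mods) < node.max_mods:
        node.mods.append((now, field, value))     # easy case: record a mod
        return
    # Node is full: make node', a fresh copy with every mod (and this write) applied.
    latest = [node.read(f, now) for f in range(len(node.fields))]
    latest[field] = value
    prime = PNode(latest, node.max_mods // 2)     # same capacity, empty mod list
    prime.back = list(node.back)
    # Back pointers exist only for the latest version, so rewrite them in
    # place, for free: everything node' points to should point back at node'.
    for f, val in enumerate(latest):
        if isinstance(val, PNode):
            val.back = [(prime, bf) if owner is node else (owner, bf)
                        for owner, bf in val.back]
    # Forward pointers to node live in fields, so updating them must itself
    # be persistent: recurse with write(). This can cascade, but amortized
    # it costs O(1), as the potential argument below shows.
    for owner, bf in prime.back:
        write(owner, bf, prime, now)
```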
Because I said we're only maintaining back pointers for the latest version, I don't need to preserve the old versions of those back pointers. So I just go in and change them. It takes constant time, because there are a constant number of things I point to, and each one has a back pointer. So this is cheap. There's no persistence here. That's an advantage of partial persistence. The hard part is updating the forward pointers, because those live in fields, and I need to remember the old versions of those fields. That we do recursively. Because to change those pointers, that's a field update -- that's exactly an operation of this form. So it's the same operation, but on a different node. So I just do that. I claim this is good. That's the end of the algorithm.

Now we need to analyze it. How do we analyze it? Any guesses?

AUDIENCE: Amortize it.

ERIK DEMAINE: Amortized analysis, exactly the answer I was looking for. OK, [INAUDIBLE] amortization. The most powerful technique in amortization is probably the potential method, so we're going to use that. There's a sort of more -- you'll see a charging argument in a moment. We want the potential function to represent when this data structure is in a bad state. Intuitively, it's in a bad state when a lot of nodes are full. Because then, as soon as you make a change in them, they will burst, and you have to do all this crazy recursion and stuff. This case is nice and cheap: we just add a modification, constant time. This case, not so nice, because we recurse, and that's going to cause more recursions, and all sorts of chaos could happen. So there are probably a few different potential functions that would work here. An old version of these notes said it should be the number of full nodes. But I think we can make life a little bit easier with the following: basically, the total number of modifications -- not quite the total, almost the total. I'm going to use c times the sum of the number of mods in latest-version nodes.

Because we're only changing the latest version, I really only care about nodes that live in the latest version. What do I mean by this? Well, when I made this new node prime, it became the new representation of that node. The old version is dead. We will never change it again. If we're modifying, we will never even look at it again, because now everything points to here. So I don't really care about that node. It's got a ton of mods. But what's nice is that when I create this new node, the mod list is empty. So I start from scratch -- just like reinstalling your operating system. It's a good feeling. And so the potential goes down by, I guess, c times 2p. When I do this change, the potential goes down by basically p.

AUDIENCE: Is c any constant or--

ERIK DEMAINE: c will be a constant to be determined. I mean, it could be 1; it depends how you want to define it. I'm going to use the CLRS notion of amortized cost, which is actual cost plus change in potential. And then I need a constant here, because I'm measuring running time against some combinatorial quantity. So this will be to match the running time that we'll get to.

OK. So what is the amortized cost? There are sort of two cases of modification: the cheap case and the not-so-cheap case. In general, the amortized cost -- in both cases -- is going to be at most the following. First of all, we do some constant work just to figure out all this stuff, make copies, whatever. So that's some constant time. That's the part that I don't want to try to measure.
Then, potentially, we add a new mod. If we add a mod, that increases the potential by c, because we're just counting mods, multiplied by c. So we might get plus one mod. This is an upper bound -- we don't always add one, but worst case, we always add one, let's say. And then there's this annoying part, which might happen, might not happen. So there's a "plus maybe." If this happens, we decrease the potential, because we empty out the mods for that node in the latest version. So then we get a negative 2cp change in potential. And then we'd have to pay, I guess, up to p recursions. Because -- how many pointers are there to me? Well, at most p of them, because there are at most p pointers to any node.

OK, this is kind of weird -- it's not exactly algebra here; I have this thing, "recursions." But think about how this would expand. All right, this part is constant time, that's good. And then -- I'll put a question mark here; it might be here, it might not. If it's not here, fine, constant. If it is here, then this gets expanded into this thing. It's a weird way to write a recurrence, but we get p times whatever is on this right-hand side. But then there's this minus 2cp. So we're going to get p times 2c here -- that's the initial cost -- so that will cancel with this. And then we might get another recursion, but every time we get a recursion, all the terms cancel. So it doesn't matter whether this is here or not: you get 0, which is great. And you're left with the original 2c. Constant.

OK, [INAUDIBLE] potential functions are always a little crazy. What's happening here is that, OK, maybe you add a mod -- that's cheap. But when we have to do this work and we have to do this recursion -- this is up to p updates or recursions -- we are charging it to the emptying of this node. The number of mods went from 2p down to 0, and so we're just charging this update cost to those modifications. So if you like charging schemes, this is much more intuitive. But with charging schemes, you always have to be a little careful -- you have to make sure you're not double-charging. Here it's obvious that you're not double-charging. Kind of cool and magical. This is a paper by Driscoll, Sarnak, Sleator, and Tarjan from 1989 -- very early days of amortization. But they knew how to do it.
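In symbols, the calculation on the board goes like this. This is an editorial rendering, with c' standing for the unmeasured constant work:

```latex
% Potential: mods in nodes that are live in the latest version.
\Phi \;=\; c \cdot \#\{\text{mods in latest-version nodes}\}

% Amortized cost of one field write: constant work, maybe one new mod,
% and -- only if the node was full -- a potential drop of 2cp paying for
% up to p recursive writes:
\mathrm{amort} \;\le\; c' + c + \Bigl[\, -2cp + p \cdot \mathrm{amort} \,\Bigr]_{\text{only if full}}

% If the bracket is absent, amort <= c' + c. If present, each recursive
% write contributes at most c' + c <= 2c (choosing c >= c'), so the p
% recursions cost at most 2cp, which the -2cp annihilates. Every level
% of recursion cancels, leaving amort = O(1).
```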
Question?

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: What happens if you overflow the root? Yeah, I never thought about the root before today. But I think the way to fix the root is: you have one big table that says, for a given version -- I guess a simple way would be to say, a version is not only a number but also a pointer to the root. There we go. Pointer machine. So that way, you're always explicitly maintaining the root copy, or the pointer. Because otherwise, you're in trouble.

AUDIENCE: Then can you go back to [INAUDIBLE].

ERIK DEMAINE: So in order to refer to an old version, you have to have the pointer to that root node. If you want to do it just from a version number, you would need some kind of lookup table, which is outside the pointer machine. So you could do it on a real computer, but a pointer machine doesn't technically allow it. So it's slightly awkward. No arrays are allowed in pointer machines, in case that wasn't clear. Another question?

AUDIENCE: [INAUDIBLE] constant space to store for [INAUDIBLE]. And also, what if we have really big numbers [INAUDIBLE]?

ERIK DEMAINE: In this model, in the pointer machine model, we're assuming that whatever the data items are, they take constant space each. If you want to know about bigger things in here, then refer to future lectures. This is time travel, after all -- just go to a future class and then come back. [LAUGHS] So we'll get there, but right now, we're not thinking about what's in here. Whatever big thing you're trying to store, you reduce it down to constant-size things, and then you spread them around the nodes of a pointer machine. How you do that is up to the data structure; we're just transforming the data structure to be persistent. OK, you could ask about other models than pointer machines, but we're going to stick to pointer machines here.

All right. That was partial persistence. Let's do full persistence. That was too easy. The same paper does full persistence; that was just a warm-up. Full persistence is actually not that much harder. So let me tell you basically what changes. There are two issues. One is that everything here has to change, but not by much. We're still going to use back pointers; we're still going to have the mods. The number of mods is going to be slightly different, but basically the same. Back pointers no longer just refer to the latest version -- we have to maintain back pointers in all versions. So that's annoying. But hey, that's life. The amortization, the potential function, will change slightly, but basically not much. Sort of the bigger issue, which you might first wonder about, and which is actually the most challenging technically, is that versions are no longer numbers. Because it's not a line: versions are nodes in a tree. You should probably call them vertices in the tree, to distinguish them from nodes in the pointer machine. OK, so you've got this tree of versions, and a version is just some point in that tree. This is annoying, because we like lines. We don't like trees as much. So what we're going to do is linearize the tree. Like, when in doubt, cheat.

How do we do this? With tree traversal. Imagine I'm going to draw a super-complicated tree of versions -- say there are three versions. I don't want to number them, because that would be kind of begging the question. So let's just call them x, y, and z. I mean, it's a directed tree, because we know which versions are older. This is, like, the original version; we made a change; and we made a different change to the same version. What I'd like to do is a traversal of that tree -- a regular traversal, as if you were going to sort those nodes. Actually, let me use color, high-def here. So here's our traversal of the tree. And I want to look at the first and the last time I visit each node. So here's the first time I visit x, so I'll write: this is the beginning of x, b_x. Then this is the first time I visit y, so it's the beginning of y. And this is the last time I visit y, so it's the end of y. And then -- don't care. Then this is the beginning of z, and this is the end of z, and then this is the end of x. If I write those sequentially, I get b_x, b_y, e_y, b_z, e_z, e_x. You can think of these as parentheses: for whatever reason I chose b and e, for beginning and ending, but this is like open paren, close paren. This is easy to do in linear time; I think you all know how.
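Statically, that b/e string is just the parenthesis (Euler-tour) sequence of the version tree. A quick editorial sketch:

```python
def parenthesis_sequence(children, root):
    """b/e sequence of a version tree given as {version: [child versions]}."""
    seq = []
    def visit(v):
        seq.append(('b', v))            # first visit to v
        for w in children.get(v, []):
            visit(w)
        seq.append(('e', v))            # last visit to v
    visit(root)
    return seq

# x with children y and z:
print(parenthesis_sequence({'x': ['y', 'z']}, 'x'))
# [('b','x'), ('b','y'), ('e','y'), ('b','z'), ('e','z'), ('e','x')]
```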
Except it's not a static problem. Versions are changing all the time. We're adding versions -- we're never deleting versions, but we're always adding stuff here. It's a little awkward, but the idea is: I want to maintain this order -- maintain the beginning and the end of each subtree of versions, you might say. This string, from b_x to e_x, represents all of the stuff in x's subtree, in the rooted tree starting at x. How do I maintain that? Using a data structure -- a data structure we haven't yet seen. It will be in lecture 8. This is a time travel data structure, so I'm allowed to do that. It's called an order maintenance data structure. You can think of it as a magical linked list.

Let me tell you what the magical linked list can do. You can insert -- I'm going to call it an item, because "node" would be kind of confusing given where we are right now -- you can insert a new item in the list immediately before or after a given item. OK, this is like a regular linked list. Here's a regular linked list. And if I'm given a particular item like this one, I can say, well, insert a new item right here. You say, OK, fine: I'll just make a new node and relink here, relink there. Constant time, right? So in an order maintenance data structure, you can do this in constant time. Wow, so amazing. OK, the catch is the second operation you can do. Maybe I'll number these: this is the update; then there's the query. The query is: what is the relative order of two nodes -- of two items, x and y? So now I give you this node and this node, and I say, which is to the left? Which is earlier in the order? I want to know: is x basically less than y in terms of the order in the list, or is y less than x? And an order maintenance data structure can do this in constant time. Now, it doesn't look like your mother's linked list, I guess. It's not the linked list you learned in school. It's a magical linked list that can somehow answer these queries. How? Go to lecture 7 -- forward reference -- lecture 8, sorry. For now, we're just going to assume that this magical data structure exists.

So, constant time -- this is great. Because if we're maintaining these b's and e's, we want to maintain the order in which these things appear. If we want to create a new version -- like, suppose we were just creating version z -- well, it used to be everything without this b_z, e_z, and we just insert two items in here, b_z and e_z. They're right next to each other. And if we were given version x, we could just say, oh, we'll look at e_x and insert two items right before it. Or you can put them right after b_x. I mean, there's no actual order here: it could have been y first and then z, or z first and then y. So it's really easy to add a new version in constant time. You just do two of these insert operations.

And now you have this magical order operation. If I'm given two versions -- I don't know, v and w -- and I want to know, is v an ancestor of w, now I can do it in constant time. So this lets me do a third operation: is version v an ancestor of version w? That's going to be true if and only if b_v and e_v nest around b_w and e_w -- b_v comes before b_w, and e_w comes before e_v. So that's just a few order tests. They're probably not all even necessary; this one always holds. But if these guys fit in between these guys, then you know. Now, what this tells us -- what we care about here -- is reading fields. When we read a field, we said, oh, we'll apply all the modifications that apply to version v. Before, that was a linear order, so it was just all versions less than or equal to v. Now it's all versions that are ancestors of v. Given a mod, we need to know: does this mod apply to my version?
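Here's the nesting test in code -- an editorial sketch in which a plain Python list stands in for the magical order-maintenance structure, so the order queries below cost O(n) rather than the O(1) the real structure achieves:

```python
class VersionOrder:
    """Maintains the b/e parenthesis sequence of the version tree."""
    def __init__(self, root):
        self.seq = [('b', root), ('e', root)]

    def new_version(self, parent, child):
        # Insert b_child, e_child immediately before e_parent:
        # two order-maintenance insertions, O(1) in the real structure.
        i = self.seq.index(('e', parent))
        self.seq[i:i] = [('b', child), ('e', child)]

    def order(self, a, b):
        # "Is item a before item b?" -- O(1) in the real structure.
        return self.seq.index(a) < self.seq.index(b)

    def is_ancestor(self, v, w):
        # v is an ancestor of w iff b_v .. e_v nests around b_w .. e_w.
        return (self.order(('b', v), ('b', w)) and
                self.order(('e', w), ('e', v)))

vo = VersionOrder('x')
vo.new_version('x', 'y')
vo.new_version('x', 'z')
print(vo.is_ancestor('x', 'z'), vo.is_ancestor('y', 'z'))  # True False
```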
And now I tell you: I can do that in constant time, through magic. I just test these order relations. If they hold, then that mod applies to my version. So w is the version we're testing; v is some version in the mod; and I want to know, am I a descendant of that version? If so, the mod applies, and I update what the field is. I can do all pairwise ancestor checks and figure out: what is the most recent version in my ancestor history that modified a given field? That lets me read a field in constant time. The constants are getting kind of big at this point, but it can be done. Clear? A little bit of a black box here. But now we've gotten as far as reading, and we don't need to change much else. So this is good news. Maybe I'll give you a bit of a diff.

So, full persistence -- the fully persistent theorem: done. Same theorem, just with full persistence. How do we do it? We store back pointers now for all versions. It's a little bit annoying. But how many mods do we use? There are lots of ways to get this to work, but I'm going to change this number to 2(d + p + 1). Wait, what's d? d is the number of fields. OK, we said it was a constant number of fields; I never said what that constant is. d for out-degree, I guess. So p is the max in-degree, and d is the max out-degree. So, just slightly more. The main reason for this is that back pointers are now treated like everyone else: we have to treat both the out-pointers and the in-pointers as basically the same. So instead of p, we have d + p. And there's a plus 1 just for safety. It gets my amortization to work, hopefully.

OK, not much else changes -- this page is all the same. Mods are still (version, field, value). Reading: OK, well, this is no longer "less than or equal to v"; this is now "with the nearest version that's an ancestor of v." That's what we were just talking about. So that can be done in constant time: check it for all of them, constant work.

OK, that was the first part. Now we get to the hard part, which is modification. This is going to be different. Maybe I should just erase -- yeah, I think I'll erase everything except the first clause. If a node is not full, we'll just add a mod, just like before. What changes is when a node is full. Here we have to do something completely different. Why? Because if we just make a new version of this node with empty mods, this one's still full, and I can keep modifying the same version. This new node that I just erased represents some new version. But if I keep modifying an old version, which I can do in full persistence, this node keeps being full, and I keep paying a potentially huge cost. If all the nodes were full, then when I make this change, every node gets copied; and then I make a change to the same version, and every node gets copied again. This is going to take linear time per operation. So I can't do the old strategy. I need to somehow make this node less full.

This is where we're definitely not functional. None of this was functional, but now I'm going to change an old node -- not just make a new one -- in a more drastic way. Before, I was adding a mod; that's not a functional operation. Now I'm actually going to remove mods from a node, to rebalance. So what I'd like to do is split the node into two halves. So I had some big node that was -- I'll draw it -- completely full. Now I'm going to make two nodes. Here we go: this one is going to be half full, and this one's going to be half full of mods.
The only question left is, what do I do with all these things? Basically, what I'd like to do is -- on the one hand, I want to have the old node, just where it used to be, with half of the mods removed: the second half, the later half. What does that mean? I don't know -- figure it out; it's linearized. I haven't thought deeply about that. Now we're going to make a new node with the second half of the mods. It's more painful than I thought: in reality, these mods represent a tree of modifications, and what you need to do is find a partition of that tree into two roughly equal halves. You can actually do a 1/3-2/3 split. That's also in a future lecture, whose number I forget. So really, you're splitting this tree into two roughly balanced halves, and so this 2 might actually need to change to a 3 -- but it's a constant.

OK. What I want is for the new node to represent a subtree of versions. Let me draw the picture. So here's the tree of versions represented by the old mods. I'd like to cut out a subtree rooted at some node. So let's just assume for now that this has exactly half the nodes, and this has half the nodes; in reality, I think it can be 1/3-2/3. But let's keep it convenient. So I want the new node to represent this subtree, and this node to represent everything else. This node is as if this stuff hasn't happened yet -- it represents all the old versions that are not in the subtree. This one represents all the latest stuff. So, like before, I want to apply some mods to these fields: whatever mods were relevant at this point -- whatever had been applied -- I apply those to the fields here. And so that means I can remove all of these mods; I only cared about these ones. Update these fields accordingly. I still have the other mods to represent all the other changes that could be in that subtree. OK. So we actually split the tree, and we apply mods to the new node.

Anything else I need to say? Oh, now we need to update pointers. That's always the fun part. Let's go over here. The old node hasn't moved, but this new node has moved. So for all of these versions -- the versions in the subtree -- I want to change any pointer that used to point to the old node to now point to the new node. In this version, it's fine: it should still point to the old node, because that represents all those old versions. But for a version in the subtree, I've got to point here instead. OK. So how many pointers could there be to this node that need to change? That's the tricky part of this analysis; think about it for a while. I mean, in this new node, whatever is pointed to, by either here or here, also has a return pointer. All pointers are bidirectional, so we don't really care whether they're forward or backward. How many pointers are there here? Well, there's d here and there's p here. But then there are also some additional pointers represented over here, in the mods. How many? Well, if we assume this magical 50/50 split, there are right now d + p + 1 mods over here -- half of them. Each of them might be a pointer to some other place, which has a return pointer in that version. So the number of pointers that we need to update is going to be 2d + 2p + 1. So we recursively update at most 2d + 2p + 1 pointers to the node. The good news is, this is really only half of them, or some fraction of them -- there were more pointers before. We don't have to deal with these ones.
That's where we're saving, and that's why this amortization works. Let me give you a potential function that makes this work: minus c times the sum of the number of empty mod slots. It's kind of the same potential, but before, we had this notion of dead and alive nodes. Now everything's alive, because everything could change at any moment. So instead, I'm going to measure how much room I have in each node. Before, I had no room in this node; now I have half the space in both nodes. So that's good news. Whenever we have this recursion, we can charge it to a potential decrease. Phi goes down by -- because I have a negative sign here -- c times 2(d + p + 1), I think. Because there's d + p + 1 space here, and d + p + 1 space here. I mean, we added one whole new node, and the total capacity of a node, in mods, is 2(d + p + 1). So we get that, times c. And this is basically just enough: the potential drop is 2cd + 2cp + 2c, and here we have 2d + 2p + 1 recursions, so the recursion gets annihilated by the 2cd + 2cp + c part, and there's one c left over to absorb whatever constant cost there was to do all this other work. So I got the constants just to work -- except that I cheated, and it's really a 1/3-2/3 split, so probably all of these constants have to change. Such is life. But I think you get the idea. Any questions about full persistence? This is fun stuff, time travel. Yeah?

AUDIENCE: So in the first half of the thing, where the if -- there's room, you can put it in.

ERIK DEMAINE: Right.

AUDIENCE: I have a question about how we represent the version. Because before, when we said store "now" [INAUDIBLE], it made more sense when "now" was like a timestamp or something.

ERIK DEMAINE: OK. Right, so how do we represent a version, even here, or anywhere? When we do a modification -- an update -- in the data structure, we want to return the new version. Basically, we're going to actually store the DAG of versions, and a version is going to be represented by a pointer into this DAG. One of the nodes in this DAG becomes a version. Every node in this DAG is going to store a pointer to the corresponding b character and the corresponding e character in this data structure, which then lets you do anything you want: you can query against that version, whether it's an ancestor of another version. So yeah, I didn't mention that. Versions are nodes in here; nodes in here have pointers to the b's and e's; and vice versa, the b's and e's have pointers back to the corresponding version node. And then you can keep track of everything. Good question. Yeah?

AUDIENCE: [INAUDIBLE] question. Remind me what d is in this.

ERIK DEMAINE: Oh, d is the maximum out-degree. It's the number of fields in a node, as defined right here. Other questions?

Whew. OK, a little breather. That was partial persistence and full persistence. This is, unfortunately, the end of the really good results. As long as we have constant-degree nodes -- in- and out-degree -- we can do full persistence for free. Obviously there are practical constants involved here, but in theory, you can do this perfectly. Before we go on to confluence, there is one positive result, which is: what if you don't like amortized bounds? There are various reasons amortized bounds might not be good. Maybe you really care about every operation being no slower than it was, except by a constant factor. We're amortizing here, so some operations get really slow, but the others are all fast to compensate. You can deamortize, it's called.
You can get constant worst-case slowdown for partial persistence. This is a result of Gerth Brodal from the late '90s -- '97. For full persistence, it's an open problem. I don't know if people have worked on that. All right. So, some mostly good results.

Let's move on to confluent persistence, where things get a lot more challenging. Lots of things go out the window with confluent persistence. In particular, your versions now form a DAG. It's a lot harder to linearize a DAG. Trees are not that far from paths; but DAGs are quite far from paths, unfortunately. But that's not all that goes wrong. Let me first tell you the kind of end effect, as a user. Imagine you have a data structure -- think of it as a list, I guess, which is the list of characters in your document. You're using vi or Word, your favorite whatever; it's a text editor. You've got a string of words. And now you'd like to do things like copy and paste. It's a nice operation. So you select an interval of the string and you copy it, and then you paste it somewhere else. So now you've got two copies of that string. This is, in some sense, what you might call a confluent operation, because -- yeah, maybe a cleaner way to think of it is the following. You have your string. Now I have an operation, which is: split it. So now I have two strings. OK. And now I have an operation, which is: split it. Now I have three strings. OK. Now I have an operation, which is: concatenate. So I can, for example, reconstruct the original string -- actually, I still have the original string, no biggie, because I have all versions; I never lose them. So now, instead, I'm going to cut the string here, let's say. So now I have this and this. And now I can do things like concatenate from here to here to here, and I will get this plus this plus this. OK -- this guy moved here. So that's a copy/paste operation, with a constant number of splits and concatenates. I could also do cut and paste. With confluence, I can do crazy cuts and pastes in all sorts of ways.

So what? Well, the "so what" is: I can actually double the size of my data structure in a constant number of operations. I can take, for example, the entire string and concatenate it to itself. That will double the number of characters, the number of elements in there. I can do that again and again and again. So in u updates, I can potentially get a data structure of size 2 to the u. Kind of nifty. I think this is why confluence is cool. It's also why it's hard. So, not a big surprise.
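Here's that doubling trick as a toy example: an editorial sketch of a persistent list whose concatenate is O(1) because versions share structure (names invented).

```python
class PList:
    """Immutable, hence persistent, list with O(1) concatenate.
    Internal nodes just point at two sub-versions, so u concatenations
    can describe 2^u elements."""
    def __init__(self, left=None, right=None, item=None):
        self.left, self.right, self.item = left, right, item
        self.size = 1 if item is not None else left.size + right.size

def concat(a, b):
    return PList(left=a, right=b)   # O(1): a new version pointing at both operands

v = PList(item='a')                 # one element to start
for _ in range(10):
    v = concat(v, v)                # combine a version with itself
print(v.size)                       # 1024 elements after only 10 updates
```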
But, here we go. In that case, the version DAG, for reference, looks like this: you're taking the same version and combining it with itself. So here I'm assuming I have a concatenate operation. And the effect is, every time I do this, I double the size.

All right. What do I want to say about confluent persistence? Let me start with the most general result, which is by Fiat and Kaplan in 2003. They define a notion called the effective depth of a version. Let me just write it down. It's kind of like if you took this DAG and expanded it out to be a tree of all possible paths. Instead of pointing to the same node, you could just duplicate that node, and then have pointers left and right. OK. So if I did that, of course, the size grows exponentially. It explicitly represents the size of my data structure. At the bottom, if I have u things, I'm going to have 2 to the u leaves at the bottom. But then I can easily measure the number of paths from the root to the same version. At the bottom, I still label it: oh, those are all v -- they're all the same version down there. So, an exponential number of paths; if I take the log, I get what I call the effective depth: e(v) is the log of the number of paths from the root to v. It's like, if you somehow could rebalance that tree, this is the best you could hope to do. It's not really a lower bound. But it's a number; it's a thing.

Then the result they achieve is that the overhead is log (the number of updates) plus the effective depth -- this is a multiplicative overhead, so you take your running time and multiply it by this. And this is a time and a space overhead. The maximum effective depth over all versions -- maybe even the sum of effective depths, but we'll just say max to be safe. Sorry -- a sum over all the operations; this is per operation. You pay basically the effective depth of that operation as a factor. Now, the annoying thing is: if you have this kind of setup, where the size grew exponentially, then the number of paths is exponential, the log of the number of paths is linear in u, and so this factor could be as much as u -- a linear slowdown. Now, Fiat and Kaplan argue that a linear slowdown is not that bad, because if you weren't even persistent -- if you did this in the naive way of just recopying the data -- you were actually spending exponential time to build the final data structure. It has exponential size; just to represent it explicitly requires exponential time. So you're losing a linear factor, doing u operations in u squared time instead of 2 to the u. So it's a big improvement to do this. The downside of this approach is that even if you have a version DAG that looks like this -- even if the size of the data structure is staying normal, staying linear -- so potentially you could be doubling the size, but we don't know what this merge operation is. Maybe it just throws away one of the versions, or does something -- somehow takes half the nodes from one side, half the nodes from the other side, maybe. These operations do preserve size. Then there's no great reason why there should be a linear slowdown -- but there is. OK? So it's all right, but not great. And it's the best general result we know.

They also prove a lower bound. The lower bound is the sum of effective depths, in total bits of space. OK, what does this mean? So even if this doubling is not happening, that's the number of bits of space you need in the worst case. This does not apply to every data structure -- that's one catch: they give a specific data structure where you need this much space. It's similar to this kind of picture; we won't go into the details. And you need this much space. Now, this is kind of bad, because if there are u operations, and each of these is u, that's u squared space. So it looks like we actually need a factor-u blowup in space. But to be more precise, what this means is that you need Omega(e(v)) space -- and therefore time -- overhead per update, if (and this is not written in the paper) queries are free. Implicit here is that they only want to charge slowdown and increased space to the updates you do, which is pretty natural: normally you think of queries as not increasing space. But in order to construct this lower bound, they actually do that many queries: they do e(v) queries and then one update, and they say, oh well, space had to go up by an extra e(v). So if you only charge updates for the space, then yes, you have to lose a potentially linear factor, this effective depth, potentially u. But if you also charge the queries, it's still constant in their example. So, open question: for confluent persistence, can you achieve constant everything?
Constant time and space overheads, a multiplicative factor per operation, both updates and queries. So if you charge the queries, potentially you could get constant everything. This is a relatively new realization, and no one knows how to do this yet. Nice challenge. I think maybe we'll work on that in our first problem session; I would like to. Questions about that result? I'm not going to prove the result, but it is a fancy rebalancing of those kinds of pictures that gets this log.

There are other results I'd like to tell you about. So, a brand new result -- that was from 2003; this is from 2012 -- no, '11, '11, sorry. It's SODA, which is in January, so it's a little confusing. Is it '11? Maybe '12. Actually, now I'm not sure. It's February already, right? A January, either this year or last year. It's not as general a transformation: it's only going to hold in what's called the disjoint case. But it gets a very good bound -- not quite constant, but logarithmic. OK, logarithmic would also be nice -- order log n, whatever n is. Pick your favorite n -- the number of operations, say.

If you assume that confluent operations are performed only on two versions with no shared nodes -- OK, this would be a way to forbid this kind of behavior, where I concatenate the data structure with itself; all the nodes are common. If I guarantee that maybe I, you know, slice this up -- slice it, dice it, whatever -- and then re-merge the pieces in some other order, but I never use two copies of the same piece, that would be a valid confluent operation over here. This is quite a strong restriction. If you try to violate it, who knows what happens; the behavior's undefined. It won't tell you, oh, those two versions have this node in common, you've got to make a second copy of it. So somehow you have to guarantee that confluent operations never overlap. But pieces can be reordered. Then you can get order log n overhead, where n is the number of operations. I have a sketch of a proof of this, but not very much time to talk about it.

All right, let me give you a quick picture. In general, the versions form a DAG. But if you make this assumption, and you look at a single node, and look at all the versions where that node appears, that is a tree -- because you're not allowed to re-merge versions that have the same node. So while the big picture is a DAG, the small picture of a single guy is some tree. I'm drawing all these wiggly lines because there are all these versions where the node isn't changing. This is the entire version DAG. And then at some of these nodes -- some of these versions, I should say -- the node that we're thinking about changes. OK, whenever it branches, it's probably because the actual node changed, maybe; I don't know. Anyway, there are some dots here where the node changed -- some at the leaves, maybe. Maybe some of them haven't changed yet. In fact, let's see: here, where it's changed, it could be that we destroyed the node. Maybe it's gone from the actual data structure, but there may still be versions down here. So it's not really a tree; it's a whole DAG of stuff down there. So that's kind of ugly. Wherever the node still exists, I guess that is an actual leaf of the DAG, so those are OK. But as soon as I maybe delete that node, then there can be a whole subtree down there.

OK. So now, if you look at an arbitrary version -- what we're thinking about is how to implement reading, let's say; reading and writing are more or less the same. I give you a version, I give you a node, and I give you a field.
So now, what we're thinking about is how to implement reading, let's say — reading and writing are more or less the same. I give you a version, a node, and a field, and I want to know: what is the value of that field, in that node, at that version? So where could a version fall? Well, it has to be in this subtree, because the node has to exist. And the field is maybe a pointer; a pointer could be to another node, which also has this kind of picture. They could be overlapping trees. In general, there are three cases. Either you're lucky, and the version you're asking about is a version where the node was changed — in that case, the data is stored right there. That's easy: you just say, oh, how did the node change? That's what the field is. OK, follow the pointer. A slightly harder case is a version in between two such changes, where the versions in between didn't update this node. So I sort of want to know: what was the previous version where this node changed? It can be done — not in constant time, actually, but in logarithmic time — using a data structure called link-cut trees, another fun black box for now, which we will cover in lecture 19, far in the future. OK, that's one case. There's also the case where the version is down here in a subtree, where I guess the node didn't exist. All these things can happen, and that's even harder — it's messy. They use another trick called fractional cascading, which I'm not even going to try to describe. But it's got a very cool name, and we'll be covering it in lecture 3, so stay tuned for that. I'm not going to say how it applies to this setting, but it's a necessary step here. In the remaining zero minutes, let me tell you a little bit about functional data structures. [LAUGHTER] Beauty of time travel. Functional — I just want to give you some examples of things that can be done functionally. There's a whole book about functional data structures, by Okasaki. It's pretty cool. A simple example is balanced BSTs. If you just want log n time for everything, you can do that functionally, and it's actually really easy. You pick your favorite balanced BST, like red-black trees, and you implement it top down, so you never follow parent pointers — so you don't need parent pointers. Then, as you make changes down the tree, you just copy. It's called path copying: whenever you're about to make a change, make a copy of that node. So you end up copying all the changed nodes and all their ancestors. There are only log n of them, so it takes log n time. Clear? Easy. It's a nice technique — there's a little code sketch of it below — and sometimes path copying is very useful. Link-cut trees, for example, can be made functional; we don't know yet what they are, but they're basically a BST, and you can make them functional. We used that in a paper. All right, deques. These are doubly ended queues — like a stack and a queue and everything; you can insert and delete from the beginning and the end. People are starting to know what these are now, because Python calls them that. But you can also do concatenation with deques, in constant time per operation. This is cool. Deques are not very hard to make functional, and you can concatenate them like we were doing in the figure that's right behind this board. Constant-time split is a little harder — that's actually one of my open problems: can you do lists with split and concatenate in constant time, functionally or confluently persistently or whatever? And you can mix the two: you can get log n search with constant-time deque operations. Another example is tries. A trie is a tree with a fixed topology; think of it as a directory tree.
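Here is what the path-copying technique from a moment ago looks like in Python — a minimal sketch using a plain unbalanced BST for brevity, whereas the actual claim is for a balanced BST like a red-black tree, so the copied path has length O(log n).

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    # Persistent insert: copy only the nodes on the search path,
    # sharing every untouched subtree with the old version.
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    else:
        return Node(root.key, root.left, insert(root.right, key))

# Usage: v1 and v2 are both queryable forever.
v1 = insert(insert(insert(None, 2), 1), 3)
v2 = insert(v1, 0)                 # copies the path 2 -> 1, shares node 3
assert v1.left.left is None        # version 1 is unchanged
assert v2.left.left.key == 0       # version 2 sees the new key
assert v2.right is v1.right        # structural sharing of the untouched subtree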
So maybe you're using Subversion. Subversion has time-travel operations: you can copy an entire subtree from one version and stick it into another version. So you get a version DAG — it's a confluently persistent data structure, though not implemented optimally, because we don't necessarily know how. But there is one paper; this actually came from the open-problem session of this class four years ago, I think. It's with Eric Price and Stefan Langerman. You can get very good results — I won't write them down because it takes a while. Basically, you pay a log-the-degree-of-the-nodes factor and get functional, and you can be even fancier and get slightly better bounds, like log log the degree, and get confluently persistent, with various tricks, including using all of these data structures. So if you want to implement Subversion optimally, we know how it could be done, but it hasn't actually been done yet, because there are those pesky constant factors. I think that's all. What's known about functional is that there's a log n separation: you can be log n away from the best. That's the worst separation known between functional and just a regular old data structure. It'd be nice to improve that. Lots of open problems here. Maybe we'll work on them next time.
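To make the Subversion analogy concrete, here is a hedged sketch of the naive functional approach — path copying on a fixed-topology directory tree — not the log-the-degree structure from the paper. The names Dir and set_path are made up for illustration; the last lines mimic Subversion's copy operation, linking a subtree from one version into another, which is exactly what turns the version graph into a DAG.

class Dir:
    def __init__(self, entries=None):
        # name -> Dir (subdirectory) or str (file contents); treated as immutable
        self.entries = dict(entries or {})

def set_path(root, path, value):
    # Return a NEW root with `value` stored at `path`; only the nodes
    # on the path are copied, everything else is shared with the old version.
    if not path:
        return value
    new_entries = dict(root.entries)
    child = root.entries.get(path[0], Dir())
    new_entries[path[0]] = set_path(child, path[1:], value)
    return Dir(new_entries)

# Usage: three versions sharing structure.
v1 = set_path(Dir(), ["src", "main.py"], "print('hi')")
v2 = set_path(v1, ["docs", "README"], "hello")
# Subversion-style copy: link v1's src subtree into v2 under a new name.
v3 = set_path(v2, ["backup"], v1.entries["src"])
assert v3.entries["backup"] is v1.entries["src"]   # shared node: the version graph is a DAG
assert "backup" not in v2.entries                  # older versions are unchanged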
Info
Channel: MIT OpenCourseWare
Views: 271,855
Rating: 4.96 out of 5
Keywords: pointer machine, persistence, partial persistence, full persistence, confluent persistence, bounded-degree pointer machine
Id: T0yzrZL1py0
Length: 83min 43sec (5023 seconds)
Published: Wed Dec 18 2013