Lecture 5 LQR -- CS287-FA19 Advanced Robotics at UC Berkeley

Captions
- Hi everyone. Welcome to lecture five of CS287 Advanced Robotics. Today, we'll cover optimal control for linear dynamical systems and quadratic cost, which at first might sound a little narrow, but you'll see it's actually quite general. Before I do that, a quick reminder slash announcement. Your homework one is due on Wednesday, okay? So you have a little less than a week left to complete your homework one. Any logistical questions? No? Okay, let's move ahead. So there's this thing called Bellman's curse of dimensionality which you might have encountered in your homework one if you're already working on it. You have an n-dimensional state space. The number of states in your state space will grow exponentially in the dimension n. Let's assume a fixed number of discretization levels per dimension. Let's say 10 discretization levels, then you have 10 to the n total states. And so as n gets even medium large, not even medium large, just like medium, 10 to the n can already be pretty big and hard to cycle over all states. So in practice, discretization is considered only computationally feasible up to five or six dimensional state spaces, even when using variable resolution discretization and highly optimized implementations. Maybe you can push it a little beyond that if you have a little more patience to get your results out, but that's kinda roughly the ballpark you can get to. As long as your system is six dimensional or less and you do a really good implementation, you can probably get this to work and get optimal solutions out no problem. But beyond that, it gets harder and harder. Function approximation might or might not work. You don't have to enumerate all states there, you can sample them. But because you're sampling them, you're actually not populating the entire state space, which makes it more efficient computationally, but you might actually not find the optimal solution anymore because you're not populating the space to properly run the dynamic program that value iteration is, to find in a guaranteed way the optimal solution. So with function approximation, often solutions end up being a little more local in practice. In this lecture, we're gonna look at optimal control for linear dynamical systems and quadratic cost, also known as the linear quadratic setting, or LQ setting, or LQR, linear quadratic regulator setting. It's a very special case and we can solve continuous state space optimal control exactly. And it will only require some linear algebra operations, which are very tractable to do with matrices of the dimension of the state space. So that's very feasible. So run time will be order H, the horizon, and then n cubed where n is the dimensionality of the state space. Still, a million dimensional spaces might be hard, but most practical state spaces are not super large and n cubed will work out fine. Quick note. There's a great reference on this. It's totally optional, but if you want to learn more, there's a book by Anderson and Moore, "Linear Quadratic Methods," and that's probably the most kind of tutorial style introduction, 100, 200 pages on this topic that we'll do in one lecture today. So there's more information in there if you want to learn more. There's also a very strong similarity between what we'll do today and what we'll do when we cover Kalman filtering. When we do estimation, we don't have perfect access to the state directly. We just get sensory observations. We'll have a similar thing going on. 
To do it exactly in full generality will be very hard, but there will be a special case where we can do it exactly and that will be called Kalman filtering and it'll be very similar to what we do today for optimal control. Okay, let's write out the assumptions under which we'll work today. So, the LQR setting assumes a linear dynamical system. What does that mean? It assumes xt plus one equals a matrix A times xt plus matrix B times ut. State here is denoted by x and then controls or actions by u. So, xt is a state at time t, ut is the input at time t, and I'll assume a quadratic cost function. We've been working with rewards so far and the optimal control literature is more customary to work with cost. Ultimately, it's all the same. If you want to work with rewards, you can just say oh, the negative cost is my reward, and you'll have the same problem you're solving. But we'll follow the notation that's common in this space and we'll also use cost. So we'll be minimizing now rather than maximizing expected values. And so cost, let's call it g, g for being in state xt and taking action ut is equal to quadratic. That's the assumption. So it's gonna be xt transpose Q xt plus ut transpose R ut. And there'll be some assumptions. The assumption will be Q and R symmetric and Q and R both positive definite. What does it mean to be positive definite? It means this thing here, Q positive definite if and only if for all z's that are the right form factor, z transpose Q z bigger than zero. Same for R. Okay, so those are our assumptions. Yes, question? - [Student] That's if z is not equal to zero. - Correct. That's a good point. For all z not equal to zero, it has to be strictly bigger than zero. Thanks. Now, what that means is that the cost is nonzero everywhere except when you're at the zero point. If you have zero state, zero controls, cost is zero. That's the best place to be. Everything else is worse. And we can also see that the system is set up that this is actually possible to achieve. If the state is zero and control is zero, you'll be at zero again. So this is the kind of problem where the optimal solution will drive you to zero and stay there for all times. Now, one thing I want to mention here, a bit of an aside, but if you think about let's say this form here, x transpose Q xt. Matrices are a little harder to think of than scalar entities, so it can be good to first think of okay, what does this mean when it's scalar? When it's scalar, it just means like it's a parabola. If x is a scalar, it's just x squared times some scale factor. And for it to be positive definite, that scale factor has to be positive. So it's a parabola that's shaped with a minimum. That's what it looks like in the scalar case. Now, in higher dimensional cases, well, one thing you know, if a matrix is symmetric, or hopefully, you know, and otherwise, you'll find out now, if a matrix is symmetric, for every symmetric matrix Q, Q is equal to some orthogonal transformation which are the eigenvectors. Let's call them V, lambda which is diagonal, V transpose. So this one is diagonal. And V transpose V is the identity matrix. So V is an orthogonal matrix meaning that all it's doing is rotating things. So what happens is when you multiply something with a symmetric matrix, effectively, you just rotate it, then rescale it, rotate it back. If you multiply it from both sides, from the front with x transpose and the back with x, you are on both sides just rotating the x, and then in the rotated space, rescaling the coordinates. 
And so if that's all you're doing is rotating things and then rescaling, all that is doing is a change of coordinate system, adjust the rotation. So when you try to build some intuition about this, you can say well, if I look at this cost function. If I pick the simplest possible coordinate system, I pick the one where everything's already rotated ahead of time, and then this V matrix will not have to do any rotation anymore. It will just kinda disappear. It'll be an identity. And then really, Q is just a diagonal matrix doing rescaling. And so in the correct or the easiest coordinate system, really what this cost is just squaring each of the coordinates of x and rescaling those scores with a positive number. That's all it's doing. It's just that sometimes, your coordinate system is off and you need to first rotate and then you do the squaring and scaling. So that can give us some intuition about what this is. It's really just penalizing every coordinate from being away from zero, but it's not doing it in an axis aligned way necessarily and hence, it can be a general symmetric matrix, doesn't have to be a diagonal matrix. In practice, almost everybody picks a diagonal matrix. It's much easier to pick a diagonal matrix as your cost function than to fill in a whole non-diagonal cost matrix. Then the other thing that you might be curious about. Q and R have to be symmetric, I'm saying. What happens if you pick a non-symmetric Q or R? Okay, let's see what happens. We'll do it as a little aside. We'll pick a non-symmetric Q, let's say. So what do we have? Only way this Q appears. Actually, I should. Usually, there's a 1/2 put in front which is just some kind of scaling. Doesn't really matter too much, but when you take derivatives, you get a two and then a 1/2 and the two cancel which makes it nice. So imagine we're looking at x transpose Q x. But now, let's actually not call it Q because we don't want to make the same assumptions we made before. Maybe it's not a symmetric matrix. Imagine it's just some matrix, A. Well, A we already have there. Let's pick some letter we don't have yet. Let's say, we don't have and will not have any time soon matrix F, I think. X transpose F x. F, not symmetric. What happens? Okay, let's take a look at what happens. X transpose F x is equal to, not symmetric. So for a non-symmetric matrix, we can actually write it as F plus F transpose over two plus F minus F transpose over two. We're just adding and subtracting F transpose. Okay. So, let's rewrite it that way. So what we really have is x transpose F plus F over, F transpose over two times x, plus x transpose F minus F transpose over two x. Now, look at this one here. This one is x transpose F x over two minus x transpose F transpose x over two. Okay, now, when you have matrices and you have let's say A, B, C transpose, that's the same as C transpose B transpose A transpose. That's how it works. If you have a scalar value, these are scalar values. You transpose a scalar value, it stays the same. So we might as well replace this one, it's just a scalar, with its transpose. Then we apply this thing over here and we see this one is equal to x transpose. Can people see that from the back, the bottom row? Okay, x transpose F x over two which cancels with this one. So this form here is equal to zero. So all we're left with from x transpose F x is x transpose, x transpose the symmetric part of F times x. So by using non-symmetric matrices, you actually don't gain anything. 
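[Note: a quick numerical check of the identity above — not from the lecture, just an illustration with a made-up non-symmetric matrix F — showing that only the symmetric part of F contributes to the quadratic form.]

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((3, 3))   # some non-symmetric matrix
x = rng.standard_normal(3)

F_sym = (F + F.T) / 2             # symmetric part
F_skew = (F - F.T) / 2            # anti-symmetric part

print(x @ F @ x)                  # the original quadratic form
print(x @ F_sym @ x)              # same value: only the symmetric part matters
print(x @ F_skew @ x)             # zero (up to floating point error)
```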
You're just in some sense obscuring the fact that in reality, all you're using is the symmetric part of that non-symmetric matrix. And so we might as well not obscure those things. We might as well say well, it's clear. We only ever use the symmetric part of the matrix, so let's just from the beginning say we only use symmetric matrices because what's the point in introducing an F and then really be working with F plus F transpose over two. Let's just directly work with the symmetric matrix which we know is the essence of what's happening anyway. Not just the essence, everything that's happening is the symmetric part. So the fact that I said Q and R have to be symmetric is not really a restriction on the formalism. It's just a way to be more transparent about what's really happening. Only the symmetric part is used anyway. Might as well do it upfront. The fact that they have to be positive definite. It should be clear why you want that. We're gonna be minimizing cost. And if there's any way to achieve a negative value, if there's something that achieves a negative value here, whatever that is, then you're gonna want to go to infinity in that direction z to minimize cost so (mumbles) very negative value and that's not really an interesting problem. Well, I tried to send off your system to infinity. It's typically not an interesting problem to solve. And so to make the problem interesting, it has to be positive definite so there's a clear goal to be achieved. And then the symmetric thing is just as you saw there. It's just a way of being transparent about what's gonna happen anyway. Any questions about this? Now, one thing you might say as well, hm, linear quadratic systems. That's maybe a lot of matrix algebra and you're like that sounds fun, but maybe it sounds very limiting to you and you're saying am I really gonna stick around in this lecture today 'cause all we're gonna see maybe is just these linear systems? So, I'm gonna give you a quick heads up that we can do a lot with these systems. We have a slide with the description here. What we're gonna cover in the first part of lecture is to solve this specific problem that I wrote on the board. What we'll do in the remainder of lecture is see how we can use it as the core to solve more difficult problems. How difficult? Essentially, arbitrary non-linear systems. So one good example of a very non-linear system is a helicopter. Helicopter flight is, the helicopter itself is non-linear dynamics due to just how rotations interact, how rotations are represented. They're kind of a complex thing to deal with. But then also the airflow around the helicopter makes it highly non-linear and so there's a lot of complicated dynamics happening. It's definitely not xt plus one equal A xt plus B ut. Not even close. But what we'll see first is the core of how to solve this, then how to extend this so we can start doing things like this. So what we're gonna watch here is an autonomous helicopter that will go through essentially the entire flight envelope of the helicopter, do every possible maneuver expert pilots could demonstrate, and at actually higher precision and similar speed as the expert pilots do. So this was some of my PhD work several years ago. This is all powered by LQR under the hood, inverted take off. Then it hovers. That's actually a hard problem in itself to stay in place with a helicopter. Very difficult problem. Check box, mark then. A split-S, it's a way to change direction. Very non-linear. 
The dynamics is not even close to linear anywhere here. Stall turn. Nothing close to linear. Dynamics is continually changing also due to the airflow around the helicopter. It can do spins at the top. Come out tail first, go up to 55 miles an hour at maximum speed. It can go inverted. And so, this is a highly unstable regime to fly in because when you're inverted just like holding up a broom, when you're holding up a broom on the palm of your hand, it's easy to tip over. When you're holding up the body of the helicopter here, then flips, rolls. None of it anywhere close to linear, but we'll actually be able to solve this problem, at least the control part of this problem with the machinery we're gonna build up today. So, let's work through that machinery. So, remember value iteration. In control, people often use J rather than V and J for cost, so if we had value iteration, it would be J for i plus one steps to go for being in state s equals min, 'cause we're minimizing now, over controls u. The cost encountered in that first step plus then the sum over future states probability of being in that future state given the state and action times the optimal cost to go. Sometimes, we'll call it value, sometimes, we'll call it cost to go. It's very commonly called cost to go which is all costs you expect to encounter when acting optimally from then onwards for i time steps starting in state s prime. Okay, so now we have those assumptions over there. LQR. What does it say? Well, let's fill it in. J i plus one and we're gonna have x now for state. J i plus one for x is the min over u. What are we doing? Well, the cost g is written over there. It's gonna be x transpose Q x plus u transpose R u maybe with a 1/2. You can kinda choose. The 1/2 is good to let things cancel out later. Then plus sum over all next states x prime. But actually, with sum over all next states x prime, we have a deterministic system here. Xt plus one equals A xt plus B ut. So we don't really need to sum over it. It's just gonna be one next state x prime. We even know what it is. That's state x prime. So J i steps to go for x prime which is equal to Ax plus Bu. So try to minimize this. Remember how we did value iteration. We said with zero steps to go, it's easy. So okay, let's do that. So we'll start with zero steps to go. When there are zero steps left, in this case, we're not gonna set it to zero. We could set it to zero. We're actually gonna set it to J zero x equals x transpose P zero x. All right. Looks like in the slides, I did it without the 1/2, so let's not do the 1/2 on the board either. So, if we set P zero equal to all zeroes, this means there is no end cost, but if P zero is different from all zeroes, then essentially we're saying if you don't end up at the all zero state, we have a preference between different states you might end up in. Along some axes, maybe there's a higher cost to end up somewhere far out there compared to other axes. Now, J one of x equals well, we wrote it over there, so let's rewrite that. Min over u of x transpose Q x plus u transpose R u plus J i of this thing. Well, that's J zero here. J zero is x transpose P zero x. So, we fill that in. We have then Ax plus Bu transpose P zero Ax plus Bu. Okay, and now we want to find the u that minimizes this. To find the u that minimizes this, what do we do? We compute the gradient with respect to u of that quantity. Well, let's see. What do we end up with? The gradient with respect to u of this quantity over here. This doesn't have u in it, so nothing there. 
U transpose R u, that becomes two R u. Then this thing over here becomes plus two times B transpose P zero Ax plus Bu and I want this to be equal to zero. If we can find a solution for that equal to zero, we've found the optimal control u. Well, if we do this, we find u is equal to negative. So the u appears here and appears here. We just need to do some grouping of terms and move to the other side. End up with negative R plus B transpose P zero B inverse B transpose P zero Ax. So we have a nice closed form for u, which is nice. That means we don't have to discretize u, we don't have to search over the best u. We actually have a closed form u as a function of x. And this we'll also call K times x. So we'll call this, what's sitting in front here, K. Then what we can do, we can fill u equals K times x into this thing over here to see what actually our value is at x. Once we fill in the optimal u, we can find out the value for that state x. Once we do that, what we end up with is let me write it over here, top of the board. Filling in, that's just a bit of simplification. You don't have to do anything special. We find J one of x equals x transpose P one x. P one equals Q plus, Q plus this. Actually, let's call this thing K one 'cause it's specific to the first iteration. It'll be Q plus K one transpose R K one plus A plus B K one transpose P zero, A plus B K one. And K one equals minus R plus B transpose P zero B inverse B transpose P zero A. So what this shows us, if we want to find the value function for one time step to go, we just need to compute K one. This will give us P one and then we have the value function for one time step to go. Now, the amazing thing that's happening here and this is very unique to the linear quadratic setting and that's why it's so powerful and a building block for other things, the amazing thing that's happening here is that this thing here is of the same form factor as we started with. We again have a quadratic. J zero was a quadratic and then J one is a quadratic in x which means that we can repeat this process again. If we want to find J two, we just need to up all the indices here by one and we'll have the same thing we can do. And repeat, repeat, repeat. For most systems, let's say you had a polynomial system. For most systems, what will happen is the degree of the polynomial will keep increasing as you do this and things will become more and more complex. We kind of get lucky here because when we take the gradient, the degree drops by one, and so the solution is just linear in x. We fill it in, we stay quadratic, and this never changes. Computationally, this is very feasible. I mean, this is just some matrix vector multiply, some matrix inverse, some matrix matrix multiplies. Same thing here. So computationally, this is very practical. Order n cubed where n is the dimensionality of your space. Any questions about this? This is the core idea for today. Yes? Actually, we still have the thingy today. Let's see. Oh. (students laughing) Obstacle, sorry. (students laughing) - [Student] So can P zero be any arbitrary matrix? - So can P zero be any arbitrary matrix? P zero has to be positive definite for this to be meaningful because if P zero is not positive definite, essentially what would happen is setting the gradient equal to zero, for that to be the right point, your function needs to be shaped to have a unique minimum. Otherwise, you have to go inspect where the gradient is zero: is it a minimum, or a maximum, or a saddle point? 
And so because P zero is positive definite, unique minimum for the quadratic form. When we set the gradient equal to zero, we know we have the unique minimum and that's the right thing. If P zero were the other way, shaped like that, and you went through the same math, well, then what you would find here is actually the way to achieve the maximum. And what you'd be doing is maximizing that objective, so something kinda weird going on that's not very natural. - [Student] So what does P zero mean like intuitively? - Intuitively, P zero means how much you care about being close to zero along different axes. So, quadratic forms, you can draw them from above as let's say maybe you draw quadratic form like this. What this means is that maybe the quadratic form has a value of one here, two here, four over here. And so what that means is that if you deviate in this direction from zero, things increase very rapidly and incur very high cost very quickly. So that's a bad place to end up if you can avoid it. Whereas in this direction, it increases much more slowly. So let's say you don't have enough time steps left to drive yourself all the way to the zero point, obviously, you want to be at zero. Let's say you don't have enough time steps left, you'd rather end up somewhere here away from zero than here. And in fact, this is just P zero, but if Q let's say were shaped the same way, if Q and P zero were the same, right, then at all times, you would prefer to be along this axis if you're not there yet. So if you start out let's say, I don't know, over here, you'd prefer to as quickly as possible get this way and then come in this way. Now, if it's possible of course, you might even prefer to go even faster directly to the goal, but the dynamics will limit you in what you can do. And so within what the dynamics allows you to do, you'll prefer paths along axes where you're not penalized as much. You're willing to have more deviation still if you have to make trade offs. So, P zero and Q are really trade offs is what they define. So one example would be for the helicopter, maybe you say you know what? I care a lot about position and orientation. And some entries in Q and P zero will be related to position. Some will be related to orientation. I think you might give it some thought. You might say well, actually, I probably care more about orientation. Why? Because if orientation is off, things get very complicated. And if I were to let's say approximate my helicopter with a linear system, well, the non-linears will start kicking in much more quickly if my orientation is off. But if I'm off by a position, the non-linears will not kick in actually at all 'cause where you are as a helicopter, the dynamics is not affected by your position. Similar thing would be let's say you were controlling a cart-pole. You were doing cart-pole balancing. Well, cart-pole balancing, if the angle here changes, the dynamics will be very different. The linearization of the dynamics will be very different. But if your location has changed, the linearization of the dynamics will be the same. And so what happens is if you are thinking about how should I design my cost function, if I know I'm gonna use a linear approximation to the system, then I should design my cost function in the way that the linear approximation is as good as possible. I'm gonna think about dynamics and say oh, position is less critical than orientation. 
Let me put a lot of penalty on orientation so it always keeps it nicely up and then it can worry about position later not because I care about it later, but because I think it's gonna fail otherwise. 'Cause my linear system that approximates the real system will become imprecise and my whole calculation based on linear systems will not be sufficiently precise to give me the correct solution. And so in practice, there's a lot of that going on as you design your P zero, and your Q, your R. And if you approximate your system with a linear system which is what we're gonna do because most systems are not linear, we're gonna approximate them that way, you're gonna have to think about how can I assure my linearization is good? And your cost matrices can help you ensure that you're more likely to be in the regime where your linearization is good. So there's a lot of intricacies going on there. It's probably a longer answer than you were looking for, but that's kind of the design aspects of it. Want to throw the box back? Nice. Were there any other questions? Okay. So, we now understand how to solve a linear quadratic system. Let's show the slides that have the (mumbles) math on this. So this is what we did on the board. We derived that only, you get back to quadratic form and this is the only calculus you have to do, just linear algebra. So since we have that, we end up where we started. J one is just like J zero. So we can go from J one to J two just like we went from J zero to J one. And for all times, we have a closed form update. This is pretty crazy by the way. You could try to give it a lot of thought and say can I come with any other dynamical systems and cost functions where if I do a value iteration back, if I can get an analytical solution. And then after an analytical solution, I get the same form factor back so I can just do this repeatedly and nothing blows up in terms of what I need to represent, I don't know of any others. I mean, there's a discrete case where we just enumerate, but in terms of continuous spaces, this is the only one I'm aware of where it just stays nicely in closed form. So that's how we get J two and then the full solution will be just iterating this. We just compute K, K i, P i, and keep repeating. The optimal policy for i-step horizon will be just pi of x is Ki times x. So you just store your Ki's and you just then use them at the appropriate time step. And if you want to know how much cost do I expect to get from this particular state x, it's x transpose Pix. It's a little fact here that it's guaranteed to converge the infinite horizon optimal policy if and only if the dynamics A, B is such that there exists a policy that can drive the state to zero. In some sense, this is intuitive. If there exists a way to drive the state to zero, then definitely in the infinite horizon case, you should drive it to zero so at some point, you stop encountering cost. So these Pi's will start converging because you're bounded in how much cost you're ever gonna encounter. If it's a path that drives you to zero, whatever the cost of that path, that's an upper bound on what the optimal control will do in the infinite horizon case. And if you're upper bounded and your series is always going up, it's guaranteed to converge and so you have convergence. Often, people do that. 
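[Note: a minimal numpy sketch of the backup just derived — the K and P recursion, with the minus sign folded into K so the policy is just u = K x. The A, B, Q, R, P0 below are made-up placeholder values (a crude double-integrator-style example), not anything from the lecture.]

```python
import numpy as np

def lqr_backup(A, B, Q, R, P):
    """One value iteration backup: given cost-to-go x'Px with i steps to go,
    return (K, P_new) for i+1 steps to go, with policy u = K x."""
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K = -(R + B'PB)^-1 B'PA
    P_new = Q + K.T @ R @ K + (A + B @ K).T @ P @ (A + B @ K)
    return K, P_new

def lqr_finite_horizon(A, B, Q, R, P0, H):
    """Run H backups; Ks[i] is the optimal policy with i+1 steps to go."""
    Ks, Ps, P = [], [P0], P0
    for _ in range(H):
        K, P = lqr_backup(A, B, Q, R, P)
        Ks.append(K)
        Ps.append(P)
    return Ks, Ps

# made-up example: 2D state (position, velocity), 1D control
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
P0 = np.zeros((2, 2))            # no terminal cost

Ks, Ps = lqr_finite_horizon(A, B, Q, R, P0, H=100)
print(Ks[-1])                    # with a long horizon this approaches the steady-state K
```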
They just keep writing until convergence and just use a steady state feedback matrix for all times even for finite horizon problems because well it's convenient to store only one matrix rather than having to keep track of time and many, many matrices. Let's revisit our assumptions now. So, we made a bunch of assumptions here. What this really is about right now is about keeping a linear system at the all zero state while preferring to keep the control input small. How can we extend this? We can actually extend this to affine systems, systems with stochasticity, regulation around non-zero fixed point for non-linear system, penalization for change in control inputs, linear time varying systems, trajectory following for non-linear systems. So, we have a lot of extensions to cover. Let's take a small break here, start again in two, three minutes. And in the meantime, I'll just project what we have as our main result so far. (students chattering) If we don't have anything, then we can set it to zero. Or if we want to find the infinite horizon version. - If we set it to zero, wouldn't u star be like zero? - At the very end, it would be. For the last time step, it would be, but from then onwards, it would not be anymore. - So P zero is zero. That means K one is also zero. - K one will be zero, yeah. K one for one time step to go. (students chattering) K one will be non-zero and essentially will start accumulating from there. - So P one will be Q. - P one will be-- - Q, yeah. - Correct, mm-hmm. So if you start with (faint speaking) zero equals zero, you're effectively starting with P zero equal Q and waiting for one iteration later. Mm-hmm. - So, P zero can be positive (faint speaking)? - Yes. As long as Q is positive definite. Q and R are really the ones that matter here. - Thank you. - So, we want to show that this successive multiplication is a one step optimal strategy is optimal globally, (faint speaking). Like let's say we have a finite horizon and then we work like backwards and show that this is optimal-- - The reason it's optimal globally is just 'cause it's value iteration. We're doing exact value iteration. We know value iteration is optimal, so it's a direct consequence of that. - Thanks. - All right, let's restart. So one question that came up during break is how do we know that this is optimal? And the reason we know this is optimal is that this is value iteration and we know value iteration is optimal. It's just that we just found out that we can actually do value iteration exactly in a continuous state space under this set of specific assumptions on dynamics and cost. But now, let's start revisiting those assumptions. How about in affine systems? What does that mean? That's an offset there. This means you can actually not keep the system at zero. Think about it. Because if xt is zero, ut is zero, then xt plus one will be c. So you will continually keep accumulating cost. There's no way around it here because whenever you're not at zero, if you get a cost, you cannot stay at zero. You can still go through the same math. One way you can do it is you can say well, I'm gonna go through the same math I did before, re-derive the update. And that could be a good exercise. You just do all the math we did in previous slides. There will just be some extra offset terms coming around. So then the control, u optimal control will not just be matrix K times x. It'll be matrix K times x plus some offset because you need to apply offset control. 
Another thing you can do and what you're gonna do a lot in what we're gonna see in these extensions, you can re-define your state space. So we could say what if our state is actually, we'll call it z now, the xt which I had before with an extra entry of one. If we set it up that way, a slightly bigger state, then we can actually write it again as a linear system in z space. And we can then say okay, now we can re-apply all the equations we've seen before. We're just solving it in z space and then we need to apply a control. We just need to make sure that okay, we turn our x into a z which is an x and a one at the bottom. And then we multiply with the K matrix that we found to get the controls. So it's a little trick to not have to re-implement anything when you go from linear to affine. How about stochastic systems? Imagine you have a wt here. That's not an offset term. Its mean is zero, but it has variance, non-zero variance. So there's some noise in the system. I encourage you to do this as an exercise to see if you can fully understand everything we derive. You can work through the same derivation that I did on the board. It's now stochastic, so it will have an expectation, expected value. And (mumbles) the expected value, what you'll see up here. Well, the expected value of w multiplied with anything that's not w will be zero. That will simplify out. But the expected value of w multiplied with w is a covariance matrix for w and that will be carried throughout and it will show that actually, you have a higher expected cost. Why? Well, it's natural. You have noise in the system. You can't control it as precisely when you have this noise. But the funny thing is that the optimal control policy, the K matrices you'll find are exactly the same. They don't change. You'll just see that you'll have an expected higher cost but use the exact same control strategy. Kind of an interesting result. Some people call it certainty equivalent control. Like in this particular setting, even though there is uncertainty in the system, you can design your optimal controller ignoring the uncertainty and it will give the same result in terms of your strategy. How about non-linear systems? Xt plus one is f of xt comma ut. We can keep the state, we can keep the system at state x star if and only if there exists some u star such that x star equals f of x star u star. That's what it means to be a fixed point. Let's assume there is a fixed point in the system. And we can see, can we stabilize it around that fixed point? That's like keeping a helicopter in hover, keeping a cart-pole balanced at the top. It's non-linear, but there's a stable point, and the question is how can you stabilize around it? 'Cause usually there is a stable point. Sure, you might say I just stay there, but there will be perturbations and you'll need to be able to steer back onto it. So it's not enough to know u star to stay there. You need to know a feedback strategy to stay there. Linearizing the dynamics around x star, what would it give you? Well, xt plus one equals, this is a first order Taylor expansion of the dynamics at x star u star. Well, that's really A and B, or equivalently, we can write it as xt plus one minus x star equals A xt minus x star plus B ut minus u star. If we now redefine our coordinate system, centered around x star and u star, we'll call that z, and for u, we call it v, then we have exactly what we had before. Zt plus one equals Azt plus Bvt and then our cost will be well, some cost for staying close to that. 
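[Note: a minimal sketch of that linearization step — not from the lecture. It gets A and B by finite differences around a fixed point, for some black-box dynamics f; the pendulum-style f, x_star, u_star below are made up for illustration.]

```python
import numpy as np

def linearize(f, x_star, u_star, eps=1e-5):
    """First-order Taylor expansion of x_next = f(x, u) around (x_star, u_star):
    A is (approximately) df/dx and B is df/du, by central finite differences."""
    n, m = len(x_star), len(u_star)
    A, B = np.zeros((n, n)), np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        A[:, i] = (f(x_star + dx, u_star) - f(x_star - dx, u_star)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m)
        du[j] = eps
        B[:, j] = (f(x_star, u_star + du) - f(x_star, u_star - du)) / (2 * eps)
    return A, B

# made-up example: a pendulum-like system near the upright fixed point
def f(x, u, dt=0.05):
    theta, omega = x
    return np.array([theta + dt * omega,
                     omega + dt * (np.sin(theta) + u[0])])

x_star, u_star = np.array([0.0, 0.0]), np.array([0.0])   # f(x_star, u_star) == x_star
A, B = linearize(f, x_star, u_star)
# then run the same LQR backups on z = x - x_star, v = u - u_star,
# and apply u = u_star + K (x - x_star)
```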
So we don't have to write any new code to find the linear quadratic optimal control around the stable point, not a stable point, a stationary point of a non-linear system. Okay. So, just run a standard LQR. Once you've done that, you need to find your controls. First, you'll turn your x into z. You'll find v from that. Then you can turn your v into your u and here it is, the equation to find u. All right. Here's another one. What if instead of penalizing for the controls, you want to penalize for the change in controls? Why might you care about this? Well, often what happens when you have a system, let's say this is your system and you run it. In the real world, your optimal controller, what you'll find is that there will be a lot of high frequencies for your high frequency control. Even if in the simulator when you run this, it might not be the case. But in reality, there's always a little bit of noise. There's a bit of mismatch between reality and your simulated system. So your optimal controller (mumbles) everything is deterministic is gonna actually use pretty large controls typically and constantly adjust the controls to whatever at that moment looks best. And you typically don't want that because typically what happens is when you have high frequency controls for a physical system, those high frequencies are hard to model. And actually the model that you're using, your linear system model that you have is often not very precise for the high frequencies. For example, for a helicopter, at high frequency, you just shake it apart. It rips itself apart at high enough frequency. And so that's not great 'cause you have all this kind of modes really of the physical system that are not modeled in your control model that you have for your system. So you want to avoid those high frequencies. Okay. So, what you can do, one thing you can do, Anderson and Moore looks at this, is you can frequency shape the cost function. What does that mean? You would say well, I'm gonna add extra variables to my state so I don't just have xt, but I also have xt minus one, xt minus two, xt minus three, xt minus four. I keep them all around in the state. That's easily done. You can expand your A matrix to have some effectively identities to memorize things from the past for a while. And then you can essentially set up something that penalizes for rapid changes in those. Or you can even set up a very specific filter on your past states if you have a certain frequency that you specifically want to suppress. Like maybe there's some kind of, you've done an analysis of your physical system and there's a certain frequency at which the system has a resonancy and the physical system will start accumulating energy and shaking itself apart that you might want to minimize anything that happens at that frequency. And you put something in your cost function to just avoid that frequency. Simple special case which works very well in practice, just penalize for changing control inputs. How do we do that? This is our original problem. But now, we want to penalize for change in controls. Well, solution A is to augment the state with the past control input vector. So we have xt, but also ut is stored in the state. And then we can penalize for how the new input is different from the one we've already stored in the state. Solution method B is the one we'll cover is you actually change the dynamical system to be expressed in terms of the change in control input rather than the actual control input. 
So for some reason, maybe originally you were controlling, I don't know, velocity, and now you're effectively controlling acceleration and you want to keep your acceleration small rather than keeping your velocity small. Okay, or maybe you want to keep both small. That's fine too. So what does it look like? What we have is, whoa, really animated, but okay. Xt plus one, ut is our new state variable. We can get that from xt and ut minus one which at the previous time are in our state. And then a delta in controls comes in through B. This ut minus one also gets multiplied with B. So actually in the top row, we get the original dynamics. Xt plus one equals A xt plus B ut minus one plus B delta ut. So that's the same as plus B ut. And the bottom row is just keeping track of the controls from the previous time. So this is what we have, just a redefinition of our state and input space. Our cost can be the same we had before where now there is an R prime introduced. Our original Q is here. Our original R is here and there's an additional R prime to penalize for the change in controls. And this then matches the standard LQR format shown above. I'd say for pretty much any physical system, you're gonna want to do this. It pretty much never works without doing something like this. What else might we want to generalize to? Linear time varying systems. So we had a stationary system before and now, there's an index there. The dynamics depends on time, At, Bt. Well, in all the math we did, there was nothing that assumed A and B would be staying the same over time. We can work through the exact same math, exact same update equation, we just need to keep track of the time indices. It will not converge to anything, it'll be varying over time, so you're not gonna get the same convergence properties, but you can do the exact same math and find a time varying linear controller that's appropriate for your time varying linear system. So update equations will look exactly the same, just need to keep track of the time indices. You might wonder how often do you really run into a time varying linear system? Like, that would be quite a coincidence. It's not linear, but it's time varying linear. That would be kind of really special. In practice, it's rare to have a time varying linear system that's actually what you run into, but it's very common to run into a non-linear system. And if you know the path you're going to follow in state space that you're trying to follow, then you can approximate in a time varying way with a sequence of matrices your dynamical system as a linear time varying system. So different linear approximation at each time step to what is actually a non-linear system. And so you'll get linear time varying control and you'll get again, quadratic costs. So let's look at what I just described which is the most direct application of this kinda linear time varying setting. We want to do trajectory following for non-linear systems. For example, we want our helicopter to follow a specific path. The helicopter's non-linear, so we can't just do the linear thing, but what can we do? Well, let's assume there's a feasible target trajectory. It's something that the helicopter can actually do or our system can actually do. How would we know what that is? It's not necessarily easy to know, but let's assume we know it maybe from somebody already executing it, a human pilot maybe, or maybe from having a precise analytical model from which you can somehow derive what is a feasible sequence of states. 
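[Note: a minimal sketch of the change-in-control augmentation from a few paragraphs above (solution method B) — not from the lecture. A, B, Q, R are the original matrices; R_prime stands for the extra penalty on how much the control changes between time steps.]

```python
import numpy as np

def augment_delta_u(A, B, Q, R, R_prime):
    """New state z_t = [x_t; u_{t-1}], new input is delta_u_t = u_t - u_{t-1}."""
    n, m = B.shape
    A_aug = np.block([[A, B],
                      [np.zeros((m, n)), np.eye(m)]])   # top block row gives A x_t + B u_{t-1}
    B_aug = np.block([[B],
                      [np.eye(m)]])                      # adds B du_t on top; bottom row: u_t = u_{t-1} + du_t
    Q_aug = np.block([[Q, np.zeros((n, m))],
                      [np.zeros((m, n)), R]])            # still penalize the control itself
    return A_aug, B_aug, Q_aug, R_prime                  # R_prime penalizes the change in control
```

The standard LQR backups then run unchanged on (A_aug, B_aug, Q_aug, R_prime).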
Well, feasible means that there exists control such that this trajectory gets followed. You might say well, wouldn't it be enough to just apply the sequence of controls and we're done? Why do we still need to do any work? The reason we still need to do some work is typically, even though you know your sequence that would follow the trajectory in principle, there will be perturbations. They will be thrown off your trajectory. Why? Well, there could be explicit perturbations like noise, wind pushing your helicopter around. Or there could be things that your f that you have there is not perfect. It's imprecise. And so even though you think it's feasible, it's actually not feasible. And you can think of the mismatch between the real f and the f you work with also as noise. It's not necessarily a noise, per se. It's not a noisy process. It's just that there's a mismatch between your f that you use and the real one, and so real update plus some noise is really what you're working with. So our problem statement then would come. So let's minimize the quadratic deviation from our target states at all times and quadratic deviation from our target inputs. And then well, we might say why do we penalize (mumbles)? Why do we need to stay close to our target inputs? Well, we can think of it in some senses as zero centering. We know the controls that we need if there's no perturbations and so that's a strong prior and we're going to penalize for deviating from that 'cause we know that's the right place to be. And then if we need to deviate, sure, we do it to stay on track, but by default, we're gonna try to stay close to u star. Then we transform this into a linear time varying case. Again, Taylor expansion. Xt plus one is some function of x star t u star t plus then what will be our A and B for those respective times, times xt minus xt star, ut minus ut star. And so here's our linear time varying system in a new coordinate system, the x minus x star and u minus u star coordinate system. And at this point, we're actually good to go. We can do the standard thing where we transformed it. We've got a standard LQR backup operations and the resulting policy at i time steps from the end will be thing thing over here which we knew is gonna be that format 'cause it's just a linear time varying system at this point and the target trajectory need actually not be feasible to apply this technique, so if you actually don't have a feasible trajectory, you can still go through the same math, but you'll have to keep track of an extra term here in what you do. So I should have an affine system rather than a linear system and so you have to keep track of that. If your trajectory is feasible, it'll work out fine. Yes? - [Student] What if you have only access to the states from the target trajectory and not the actions? - Oh, good question. What if you only have access to the states, not the actions. Then, we wouldn't know the u stars. We couldn't center it around them. And so we would end up with not necessarily knowing where we want u to be centered. We might essentially have no u star here, just set it to zero. That's maybe the best we know. And you can actually work through the same math, but again, you'll end up with an offset here. So whenever the trajectory's not feasible, 'cause effectively what we're doing is we're replacing the u stars with zeroes. It's an infeasible trajectory. We'll end up with offset term over here. And so we'll have an affine system to deal with rather than a linear system. 
But other than that, we can still run through the same math. Question here. Do you want to take the mic for a moment? - [Student] This is fun. Does error tend to like accumulate in the system 'cause you have your defined trajectory and then you're off by a little bit and then you keep going off, and off, and off. - So, absolutely. When you're trying to follow a trajectory, typically an error will happen. An error meaning you thought you were steering right back on to (mumbles) but you thought optimally control and back onto it. But then there's a perturbation, either a mismatch in dynamics so you're actually not steering onto it 'cause you had a wrong model you're using or there is maybe a actual perturbation, a force applied to your system that is an external perturbation, and you'll still be off. And so then really what matters is whether you're good at steering on relatively quickly compared to the perturbation force. If the perturbation force is very, very strong, pushes you very hard and the level at which you're able to steer back on is smaller than that, you'll end up not getting back on and it might go unstable. If your ability to control back on is very good, then yes, you'll never actually be on the target trajectory, you'll always be around it, but you'll stay around it. And so it really depends on your control bandwidth there. Do you have the ability to keep it on or do you have not enough control to keep it on? Like for example, a helicopter. If you're flying it in wind gusts up to like 10, 20 miles an hour can keep it on, but wind gusts like I don't know, 100 miles an hour, there's no way we're keeping it on the trajectory. It just doesn't have the ability to counter that. Over there. Behind you first. Oh. (audience laughing) Try way too much arch. - [Student] So what is the component here that keeps it on the trajectory (faint speaking)? - What's gonna keep us on track is the feedback matrix here. So, the controls we choose are gonna be such that if we deviate from where we were supposed to be. Let's say if we already were exactly where we're supposed to be at that time, then we'll just apply u star for that time. This is saying how much we're gonna deviate from u star and so if we're off from the trajectory, this K matrix will describe what controls we need to use to steer back on. Why does this happen? Well, the original optimization problem says that we need to stay close to the x stars. That's in the objective. So the optimum is be on and then that feedback matrix is just a consequence of our objective. (laughing) Try it. (faint speaking) You also had a question, right? - [Student] So it seems like the problem at the top of the page is equivalent to a convex optimization problem? I don't know if you know whether or not that's true or not, but if it is true, why not just use a solver rather than LQR backup iterations? - Yeah. So it's a good question. Next week, we'll look at ways to use solvers to solve for this. If you just solve it here. Thanks, Eric. If you solve this as a convex problem just finding the u's, you'll just find the u's. You will not find a feedback controller. But we'll see extra things we can put around it to get there, but in practice, the nice thing about solving it this way is in some sense that you're exploiting all the structure in the problem 'cause it's actually a dynamic programming problem. So we're actually using all the structure available to us to find a solution including a feedback matrix, including a cost to go matrix. 
And that will be really helpful as we'll see as even we see other methods to solve this, we'll still be very interested in a feedback matrix and a cost to go matrix. - Here you go. - Over there. Thank you, Eric. - [Student] So is it possible to like interlink two different LQR problems? For example, like use an easier model that kinda captures what's going on but deviates from reality to solve for the u star, and then have another model on the top that includes the non-linear part and put in the non-linear (mumbles). - Yeah. I mean, I think that's a very good idea and that often solving for the x stars, u stars is difficult, and if you can have ideas of how to simplify solving for it, then building up to it. I don't know of any kind of extremely principle way to get this done, but definitely people will try all kinds of things to find x star, u star, and we'll see more of that next week. So let's defer questions about how to find x star and u star to next week 'cause we'll see ways to do it then. All right. You got to far throw it here yourself. Careful, everyone. (students laughing) Nice. How about the most general case? We just want to solve this thing. How could we do this? Well, one thing you could say well, why not use a black box optimization solver? Maybe we can find the u's. But again, we'll just find the u's. We'll not find the feedback matrix. We'll not find a cost to go function. And in fact, we'll not necessarily be exploiting the structure of the problem. We can actually iteratively apply LQR and this is a very powerful way of doing this. So let's step through this step by step. We initialize the algorithm by picking either a control policy or a sequence of states and control inputs. So we'll do the control policy thing. We just assume we have some control policy, whatever it is. We have something. Maybe it's just all zeroes always. Then, start on step one. We execute the current policy and record the resulting state input trajectory shown here. Now, we have a feasible trajectory. We have something that actually just happened that we can do. If this was our target trajectory, we kinda could do just like LQR to stay on that trajectory. But actually, this is kind of an arbitrary trajectory. We picked some arbitrary sequence of controls, we found the trajectory. But it is feasible, which is nice. Then we compute the linear quadratic approximation of the optimal control problem around the obtained state-input trajectory by doing first-order Taylor expansion of dynamics and second-order Taylor expansion of the cost function. If I had a very general cost function, right, could be anything, but we can do first and second order approximations for dynamics and cost function respectively. Once we have that, we have a linear time varying quadratic control problem. So we could use LQR backups to solve the optimal control policy for each time which I'll call pi i plus one where i plus one here is the iteration in the outer loop of the algorithm. Then, we can go back here, execute that policy, and repeat. So this is kind of an iterative process where you have a policy, find a better policy, find a yet better policy over, and over, and over. Now, I haven't yet proven that this is always gonna improve the policy, but that's the intuition. That's what we're hoping for that. Now, what would it actually look like in equations? This is what it looks like. 
We have a linear approximation to dynamics, and we'll have a quadratic approximation to the cost function, and here's our new state zt vt for new controls centered around the previous trajectory we found. Then our A matrix, our B matrix, our Q and R. Here, we're assuming that our cost function depends on the x and u separately. Otherwise, you have some quadratic terms that cross between x and u. Usually, people don't have much crossing between x and u. They penalize for state, what they would like about state penalize for controls, what I care about for controls. And so most of the time, it'll look like this. But in principle, you could have a cross term between x and u. So this is all we need to compute, the A, B, Q, R, and then we can do our linear time varying system back ups and we can find our new sequence of feedback matrices. We're allowed based on that and go again. Okay, so as we look at it here, it's actually pretty simple. Yeah, you might have to get some derivatives, but actually, you can just do finite difference if you don't have an automatic differentiation through your system. Just do a finite difference. If you do have automatic differentiation available, you can just automatically get the derivatives out. Either way is fine. So, does this converge? It need not converge as I formulated it actually. The reason is the optimal policy, and this is real important, the optimal policy for the approximation we're making might end up not staying close to the sequence of points around which the LQR approximation was computed by Taylor expansion. So if there's a linear time varying system with quadratic cost, we find the optimal solution to it. But if that optimal solution is far away from the previous trajectory, then yeah, it's optimal for the linear time varying system, but it's in such a different part of the space where that linear approximation is just not precise. And so it's actually not optimal for the original system at all, and it might not even be an improvement. It might be worse. So you have to be careful about using your linear approximation and your quadratic approximation in the region where they're valid. How can you do this? You might say well let's just solve and then do some step sizing. You could try to do that, but actually, that might not be the best thing to do. You can do something much better here. We want to stay close to where our approximation is valid. Where is it valid? It's valid close to the xti's and uti's that we had from our previous trajectory 'cause that's where we linearized and quadraticized. So we're gonna add a cost term to stay close to that. And we know that if this cost term is big enough, if this dominates, we're gonna try to stay very close to where we were before and then the little bit of the actual cost we care about which will pull us off a little bit of that trajectory we had before to do a little better on the original cost function. And so if we set it up this way with a large enough alpha, but still a little bit of weight on the original cost, we'll gradually improve. We're guaranteed to improve. Why is this so much better than just a regular line search? Imagine you just did the regular thing and it was worse. You did it (mumbles) with a policy, but it was like going elsewhere and really bad? Well, it's not clear what that line search would do. You're kinda trying to decrease your controls. What are you trying to do in that line search? It's not clear, right? 
But what we have here is a way to ensure in a dynamic programming way everywhere along the trajectory that we ensure we stay close. It's part of our objective. When we're early on executing, we're already thinking ahead and saying don't just care about the cost g, make sure we also stay close to where we were before because otherwise I cannot trust my linear approximation, and hence, not the result of this calculation that I do to find controls. And so you'll have to do a bit of a line search on this alpha. You'll have to play around with alpha. If you make alpha close to zero, then you can make big updates and you can see maybe it's a lot of improvement. You're good to go. So what you would do is effectively, you would do this with some setting of alpha and if things improved, you would say okay, good to go. If things got worse, you might say okay, I need to make alpha larger 'cause it got worse which means the linear approximation was not good enough where I ended up, so I need to put more emphasis on that being valid. So, often is described as a trust region approach where you have a notion of your trust, the function you're optimize. You have an approximation to the thing you're optimizing and you have a region in which you trust that approximation and you're only willing to optimize within that region. It's not exactly a constrained region here. It's more penalizing for going too far away than having a hard constraint, but it's the same idea because we know when you have a hard constraint, effectively, you just put a Lagrange multiplier in front of it and it becomes a penalty again. It's equivalent. It's just down here, we might not know the exact setting we need to use for that alpha. But if we knew magically the right setting, it'd be good and we don't have to do anything more complicated than just checking. Is the solution making progress on the previous one we had or not? And if that's the case, we're good. If it's not the case, make sure we penalize more for deviation. Some practicalities. F is non-linear, hence this is a non-convex optimization problem, so you can get stuck in local optima. So good initialization matters for now. In the original linear system, linear time varying system, initialization didn't matter. We're gonna find the global optimum. But now, we have a non-linear problem. We have an initial policy we roll out and it's along that trajectory we're gonna start building improvements. And so wherever we start it will affect what kind of trajectory we find in the end of this optimization. The cost function g could be non-convex. And actually, then we can have issues where the Q that we've been working with, the second order approximation, is not positive definite. That's a problem 'cause as we talked about, you can get very weird behaviors. It's not positive definite. It's not even clear what you're doing. Without setting gradient equal to zero for a non-positive definite quadratic function, you're kind of going to that fixed point, but it could be a maximum or a saddle point. It's not clear that's where you should be going. So we should avoid that. So you should check. You should check that the Q and R that you find from your second order approximation are positive definite. And if they're not, should add terms to it. And in fact, if you make your alpha big enough, that alpha is essentially adding an identity scaled by alpha to your Q and R. 
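[Note: purely as an illustration of that last point — not lecture code. Q_t and R_t stand for the Hessians from the second-order cost expansion at one point along the trajectory, and alpha is the weight on staying close to the previous trajectory.]

```python
import numpy as np

def regularize(Q_t, R_t, alpha):
    """In the shifted coordinates (x - x_t^i, u - u_t^i), the stay-close penalty
    adds alpha * I to the cost Hessians; larger alpha means a smaller trust region."""
    return Q_t + alpha * np.eye(Q_t.shape[0]), R_t + alpha * np.eye(R_t.shape[0])

def is_positive_definite(M):
    # all eigenvalues of the symmetric part must be strictly positive
    return bool(np.all(np.linalg.eigvalsh((M + M.T) / 2) > 0))
```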
Then there's something else called differential dynamic programming. What I described so far, a lot of people would call iterative LQR, or iLQR. Then a lot of people talk about this thing called differential dynamic programming, and people don't always really distinguish between the two. They're two different things, but often people use the names interchangeably. They'll say I'm running DDP, or I'm running iLQR, and they might not even know which of the two they're actually running. But that's kinda fine; they're very similar, and ultimately it's just a vocabulary thing. The difference is that in what we saw so far, we do a linear approximation of the dynamics and a quadratic approximation of the cost. In differential dynamic programming, the "differential" thing, the approximation happens in the Bellman equation itself. So let's do a comparison with DDP, in the case where we just look at u. You take the Bellman equation shown on the left and do a second-order approximation of the Bellman equation itself. There would also be terms with x if we cared about that, but in this case we're just doing u. In iterative LQR, we have a Bellman equation where, ahead of time, we make the Bellman equation have quadratic cost and linear dynamics. When we instead do the approximation on the Bellman equation itself, as on the left, there is an extra term that appears. The details don't matter too much, but an extra term appears, and that extra term could again make things non-positive-definite, and you might have to deal with that. You might argue the extra term is great to have, maybe it makes things more precise, and so forth, so we can do that. I don't think people have particularly strong opinions about which of the two works better. In practice, it's often a lot easier to just do iterative LQR than to also worry about this extra term that pops up in the Bellman equation, which might make things non-convex again and harder to deal with.
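For concreteness, here is a minimal sketch of the iLQR-style backward pass that both methods share, for the linearized time-varying system with quadratic cost; the function name and interface are my choices, and it omits linear cost terms as well as the DDP extra term discussed above.

```python
import numpy as np

def lqr_backward_pass(A_list, B_list, Q_list, R_list, P_final):
    """Backward pass for a linear time-varying system with quadratic cost.

    Dynamics: z_{t+1} = A_t z_t + B_t v_t.
    Cost: sum_t (z_t' Q_t z_t + v_t' R_t v_t) + z_H' P_final z_H.
    Returns feedback gains K_t (so v_t = K_t z_t) and cost-to-go matrices P_t.
    """
    H = len(A_list)
    P = P_final
    K_list = [None] * H
    P_list = [None] * (H + 1)
    P_list[H] = P_final
    for t in reversed(range(H)):
        A, B, Q, R = A_list[t], B_list[t], Q_list[t], R_list[t]
        # minimize z'Qz + v'Rv + (Az + Bv)' P (Az + Bv) over v
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A + A.T @ P @ B @ K
        K_list[t] = K
        P_list[t] = P
    return K_list, P_list
```

In the DDP variant, forming the quadratic model from the Bellman equation itself brings in an additional term involving second derivatives of the dynamics, which is the extra term mentioned above.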
Okay, so we've covered a lot from this one core idea. Can we do even better? Yes. At convergence of iLQR and DDP, we end up with linearizations around the state-input trajectory the algorithm converged to. In practice, as we talked about, the system might not be on that trajectory, or not even that close to it, due to perturbations, the initial state being off, the dynamics model being off, and so forth. What might then happen is that we're out of the regime where the linear approximation is as good as we would want it to be. So what now? Well, if you're already working on your homework one, you've actually done something very similar. We can do a look-ahead. In homework one, you have a grid-based approximation to your state space, so you get a crude value function with a crude discretization. But then you can do a couple of steps of look-ahead to optimize the actions you take in the first few steps, and cap it off with the value function after those steps. If we do that here, effectively, the result of running iLQR is a value function for all times, a quadratic value function that shows how you want to approach the trajectory, what's a bad deviation versus a good deviation from it. And then, in the moment, you run an optimization to find the optimal controls for, let's say, a five-step or ten-step look-ahead. How are we gonna run that optimization? In your homework one, you do it with enumeration or the cross-entropy method and so forth. Here, we actually have a method: we can run iterative LQR inside this process. So at time t, when asked to generate a control input, we could re-solve the control problem using iLQR or DDP over time steps t through H, all the way to the end. That would be a lot of work. Or we solve over a shorter horizon and cap it off with the value function there. This gives you much better performance, because now, in the moment, you're looking at the non-linear model that you have and re-optimizing against it. In practice you can't look as far ahead as you'd need to make the best decisions, but you've done that ahead of time, and from that you have a value function, a cost-to-go function, that you can use, so you only have to look ahead five to ten steps. How far should you look ahead? We can actually measure this, right? For example, for the helicopter problem, we would look at what happens with five steps of look-ahead, ten steps, twenty steps. We noticed that with, I believe, about 40 steps of look-ahead, the optimization essentially always predicted being back on the trajectory. If after 40 steps of look-ahead your optimization predicts you're back on the trajectory, looking further ahead is not gonna help you, because looking further ahead just gets you onto the trajectory after 40 steps and keeps you on it after that. But the value function is already really good around the trajectory, so once you're pretty close to it, you already know you can keep it there. There's nothing interesting happening beyond that point. So you'd look at how long it typically takes to get pretty close again, to get back into the regime where the value function is precise and the linear and quadratic approximations are good, and that's the amount of look-ahead you need. If you do that amount of look-ahead, you're essentially doing the optimal thing; everything after that would be a waste of cycles, because the value function already tells you what you would have gotten from it. Now, you typically need a pretty optimized implementation, because if you're running this kind of look-ahead inside your control loop, and let's say you do 20 hertz control, then you have 1/20 of a second to run this iterative LQR, which might require multiple outer loops, to finally get your controls, execute the first one, and repeat. One of the nice things is that you have a good initialization, because you already have your K matrices, which give you your intended feedback control. You can use that as your initialization trajectory, and from there run iterative LQR to re-solve.
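A minimal sketch of that receding-horizon loop, under assumed interfaces: the dynamics function `f`, the `ilqr_solve` routine (assumed to use the offline quadratic cost-to-go as its terminal cost), and the offline references `x_ref`, `u_ref`, `K_ref` are all illustrative names, not from the lecture.

```python
import numpy as np

def receding_horizon(x0, H, f, ilqr_solve, x_ref, u_ref, K_ref, lookahead=10):
    """At every step, re-solve a short-horizon problem with iLQR, warm-started
    by rolling out the offline time-varying feedback policy from the current
    state.  Only the first re-optimized control is executed before repeating.
    """
    x = x0
    xs = [x0]
    for t in range(H):
        horizon = min(lookahead, H - t)
        # warm start: roll out the offline feedback gains from the current state
        u_init, x_sim = [], x
        for k in range(horizon):
            u_k = u_ref[t + k] + K_ref[t + k] @ (x_sim - x_ref[t + k])
            u_init.append(u_k)
            x_sim = f(x_sim, u_k)
        u_plan = ilqr_solve(x, t, horizon, u_init)  # re-optimize in the moment
        x = f(x, u_plan[0])                         # execute only the first control
        xs.append(x)
    return xs
```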
Okay. Now, here's another thing to think about. Multiplicative noise is a very interesting kind of noise model. Here, the noise is not simply added in; there's this matrix B_w, and the noise w_t enters multiplied by the control, so if you apply zero control, there will be zero noise. But the more control you apply, the more noise you introduce into the system. That's actually very common in reality, and there are models of human motor control where, if you model things this way, the model matches pretty well. If we produce a very high force output, we'll have more noise in our execution than if we do something low force. It's natural. You can actually re-derive the solution. It turns out that with this setting, you end up with a similar set of updates, finding K matrices and P matrices. And what you'll see happen, if you work through it, is that in some sense it becomes equivalent to having a new Q and a new R. That might make it easier to design your Q and R: if this is really how your system works, it might be easier to think about the right Q and R with this noise model in mind, and the derivation then essentially gives you a transformation telling you what the corresponding additive-noise Q and R would be. All right, let's look at a couple of examples here. Cart-pole. Here is the non-linear system. Definitely non-linear, a lot of non-linearity. Let's balance this cart-pole. We design an LQR controller for balancing. Then what we can do is ask: for a range of starting points, does this linear feedback controller bring us back? Let's take a look. Our horizontal axis here is z, the vertical axis is theta. We're not plotting x dot and theta dot, but I believe we start with zero for both. These are initial conditions, and green means the linear feedback controller succeeded. So, a pretty wide range of starting points from which the linear feedback controller succeeded. Definitely a non-linear system, but it was able to pull it in. This was for a diagonal cost matrix on the state and no penalty on controls. Now, what will happen in practice is you'll look at this and ask: what if I change the cost matrices, how well will it do? We talked about this earlier: your cost matrices affect your linear controller, and your linear controller affects whether you stay in a region where the system is precisely modeled by the linear approximation. Okay, let's change it. We'll penalize for controls. You might say that with smaller controls, the linear model will be better. Turns out it actually does worse. What might be going on here? Well, in the linear model, even with small controls, you can bring yourself back to the equilibrium state. So the controller gets penalized for controls and says, well, I'm gonna be patient, I'm not gonna apply too much, I can gradually bring it back. The linear system allows you to bring it back gradually, but the non-linear system is already falling; the pole goes down and you're never bringing it back. So it's an interesting effect, and it's not always super intuitive ahead of time. I would have thought that penalizing more for controls would do even better, because the smaller the inputs are, the better the linear approximation is, but actually something else happens: the controller doesn't apply enough control anymore, because it doesn't need to for the linear system, but the non-linear system actually needs larger controls. All right, it's 12:30, so let's stop here and we'll continue this next week.
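For reference, a minimal sketch of the balancing-gain computation behind an example like the cart-pole above, assuming the linearized A and B around the upright equilibrium are already available (e.g. from a finite-difference linearization as sketched earlier) and that Q and R are the cost matrices being discussed; using SciPy's discrete Riccati solver is my choice here, not something stated in the lecture.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_balance_gain(A, B, Q, R):
    """Infinite-horizon discrete-time LQR gain for balancing around an
    equilibrium (x_eq, u_eq) of the linearized system x_{t+1} = A x_t + B u_t.
    """
    P = solve_discrete_are(A, B, Q, R)                 # steady-state cost-to-go
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # feedback gain
    return K  # feedback law: u = u_eq - K @ (x - x_eq)
```

Sweeping the initial (z, theta) over a grid and rolling out this feedback law on the true non-linear simulator is then what would produce the green/non-green success plot described above.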
Info
Channel: Pieter Abbeel
Views: 4,722
Id: S5LavPCJ5vw
Length: 81min 49sec (4909 seconds)
Published: Sun Feb 16 2020