- Hi everyone. Welcome to lecture five of
CS287 Advanced Robotics. Today, we'll cover optimal control for linear dynamical
systems and quadratic cost, which at first might
sound a little narrow, but you'll see it's
actually quite general. Before I do that, a quick
reminder slash announcement. Your homework one is
due on Wednesday, okay? So you have a little less than a week left to complete your homework one. Any logistical questions? No? Okay, let's move ahead. So there's this thing called Bellman's curse of dimensionality which you might have
encountered in your homework one if you're already working on it. You have n-dimensional state space. The number of states in your state space will grow exponentially
in the dimension n. Let's assume a fixed number of discretization levels per dimension. Let's say 10 discretization levels, then you have 10 to the n total states. And so as n gets even medium large, not even medium large, just like medium, 10 to the n can already be pretty big and hard to cycle over all states. So in practice, discretization is considered
only computationally feasible up to five or six
dimensional state spaces. Even when using variable
resolution discretization and highly optimized implementations. Maybe you can push it a little beyond that if you have a little more
patience to get your results out, but that's kinda roughly
the ballpark you can get to. As long as your system is
six dimensional or less, you do a really good implementation, you can probably get this to work and optimal solutions out no problem. But beyond that, it
gets harder and harder. Function approximation
might or might not work. You don't have to
enumerate all states there, you can sample them. But because you're sampling them, you're actually not populating
the entire state space which makes it more
efficient computationally, but you might actually not
anymore find the optimal solution because you're not populating the space to properly run the dynamic program that value iteration is to find in a guaranteed
way the optimal solution. So with function approximation, often solutions end up being a little more local in practice. In this lecture, we're gonna look at optimal control for linear dynamical
systems and quadratic cost, also known as the linear
quadratic setting, or LQ setting, or LQR, linear quadratic
regulator setting. It's a very special case and we can solve continuous state space optimal control exactly. And it will only require some
linear algebra operations which are very tractable
to do with matrices of the dimension of the state space. So that's very feasible. So run time will be order H, the horizon, times n cubed where n is the dimensionality
of the state space. Still, a million dimensional
spaces might be hard, but most practical state
spaces are not super large and n cubed will work out fine. Quick note. There's a great reference on this. It's totally optional, but if you want to learn more, there's a book by Anderson and Moore, "Linear Quadratic Methods," and that's probably the most kind of tutorial
style introduction, 100, 200 pages on this topic that we'll do in one lecture today. So there's more information in there if you want to learn more. There's also a very strong similarity
between what we'll do today and what we'll do when we
cover Kalman filtering. When we do estimation, we don't have perfect access
to the state directly. We just get sensory observations. We'll have a similar thing going on. To do it exactly in full
generality will be very hard, but there will be a special
case where we can do it exactly and that will be called Kalman filtering and it'll be very similar to what we do today for optimal control. Okay, let's write out the assumptions under which we'll work today. So, the LQR setting assumes
a linear dynamical system. What does that mean? It assumes xt plus one equals a matrix A times xt plus matrix B times ut. State here is denoted by x and then controls or actions by u. So, xt is the state at time t, ut is the input at time t, and I'll assume a quadratic cost function. We've been working with rewards so far, but in the optimal control literature, it's more customary to work with cost. Ultimately, it's all the same. If you want to work with rewards, you can just say oh, the
negative cost is my reward, and you'll have the same
problem you're solving. But we'll follow the notation
that's common in this space and we'll also use cost. So we'll be minimizing now rather than maximizing expected values. And so cost, let's call it g, g for being in state
xt and taking action ut is equal to quadratic. That's the assumption. So it's gonna be xt transpose
Q xt plus ut transpose R ut. And there'll be some assumptions. The assumption will be Q and R symmetric and Q and R both positive definite. What does it mean to be positive definite? It means this thing here, Q positive definite if
and only if for all z's that are the right form factor, z transpose Q z bigger than zero. Same for R. Okay, so those are our assumptions. Yes, question? - [Student] That's if
z is not equal to zero. - Correct. That's a good point. For all z not equal to zero, it has to be strictly bigger than zero. Thanks. Now, what that means is that
the cost is nonzero everywhere except when you're at the zero point. If you have zero state,
zero controls, cost is zero. That's the best place to be. Everything else is worse. And we can also see that
the system is set up that this is actually possible to achieve. If the state is zero and control is zero, you'll be at zero again. So this is the kind of problem where the optimal solution
will drive you to zero and stay there for all times. Now, one thing I want to mention here, a bit of an aside, but if you think about
let's say this form here, xt transpose Q xt. Matrices are a little harder to think about than scalar entities, so it can be good to first think of okay, what does this mean when it's scalar? When it's scalar, it just
means like it's a parabola. If x is a scalar, it's just x squared
times some scale factor. And for it to be positive definite, that scale factor has to be positive. So it's a parabola that's
shaped with a minimum. That's what it looks
like in the scalar case. Now, in higher dimensional cases, well, one thing you know, if a matrix is symmetric, or hopefully, you know, and otherwise, you'll find out now, if a matrix is symmetric, for every symmetric matrix Q, Q is equal to some
orthogonal transformation which are the eigenvectors. Let's call them V,
lambda which is diagonal, V transpose. So this one is diagonal. And V transpose V is the identity matrix. So V is an orthogonal matrix meaning that all it's
doing is rotating things. So what happens is when
you multiply something with a symmetric matrix, effectively, you just rotate it, then rescale it, rotate it back. If you multiply it from both sides, from the front with x
transpose and the back with x, you are on both sides just rotating the x, and then in the rotated space,
rescaling the coordinates. And so if that's all you're doing is rotating things and then rescaling, all that is doing is a
change of coordinate system, just a rotation. So when you try to build
some intuition about this, you can say well, if I
look at this cost function. If I pick the simplest
possible coordinate system, I pick the one where everything's already
rotated ahead of time, and then this V matrix will not have to do any rotation anymore. It will just kinda disappear. It'll be an identity. And then really, Q is just a diagonal
matrix doing rescaling. And so in the correct or the
easiest coordinate system, really what this cost is doing is just squaring each of the coordinates of x and rescaling those squares
with a positive number. That's all it's doing. It's just that sometimes, your coordinate system is off and you need to first rotate and then you do the squaring and scaling. So that can give us some
intuition about what this is. It's really just
penalizing every coordinate from being away from zero, but it's not doing it in an
axis aligned way necessarily and hence, it can be a
general symmetric matrix, doesn't have to be a diagonal matrix. In practice, almost everybody
picks a diagonal matrix. It's much easier to pick a diagonal matrix
as your cost function than to fill in a whole
non-diagonal cost matrix. Then the other thing that
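As a minimal sketch of the rotate-then-rescale picture, here is a small NumPy check. The matrix Q and the vector x are made-up illustrative values, not from the lecture. It relies on `numpy.linalg.eigh`, which for a symmetric matrix returns the eigenvalues and an orthogonal matrix of eigenvectors.

```python
import numpy as np

# Illustrative values only (assumed, not from the lecture).
Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # symmetric, eigenvalues 1 and 3
x = np.array([1.0, -3.0])

# For symmetric Q: Q = V @ diag(lam) @ V.T with V.T @ V = I.
lam, V = np.linalg.eigh(Q)

y = V.T @ x                               # x expressed in the eigenvector coordinate system
cost_direct = float(x @ Q @ x)            # x^T Q x
cost_rotated = float(np.sum(lam * y**2))  # sum_i lam_i * y_i^2

print(cost_direct, cost_rotated)  # the two agree up to rounding
print(np.all(lam > 0))            # all eigenvalues positive means Q is positive definite
```

In the rotated coordinates y, the cost is just a positively weighted sum of squares, which is the axis-aligned picture described above.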
you might be curious about. Q and R have to be symmetric, I'm saying. What happens if you pick
a non-symmetric Q or R? Okay, let's see what happens. We'll do it as a little aside. We'll pick a non-symmetric Q, let's say. So what do we have? Only way this Q appears. Actually, I should. Usually, there's a 1/2 put in front which is just some kind of scaling. Doesn't really matter too much, but when you take derivatives, you get a two and then a 1/2 and the two cancel which makes it nice. So imagine we're looking
at x transpose Q x. But now, let's actually not call it Q because we don't want to
make the same assumptions we made before. Maybe it's not a symmetric matrix. Imagine it's just some matrix, A. Well, A we already have there. Let's pick some letter we don't have yet. Let's say, we don't have and will not have any time
soon matrix F, I think. X transpose F x. F, not symmetric. What happens? Okay, let's take a look at what happens. X transpose F x is equal to, not symmetric. So for a non-symmetric matrix, we can actually write it as
F plus F transpose over two plus F minus F transpose over two. We're just adding and
subtracting F transpose. Okay. So, let's rewrite it that way. So what we really have is
x transpose F plus F transpose over two times x, plus x transpose F minus
F transpose over two x. Now, look at this one here. This one is x transpose F x over two minus x transpose F transpose x over two. Okay, now, when you have matrices and you have let's say A, B, C transpose, that's the same as C transpose
B transpose A transpose. That's how it works. If you have a scalar value,
these are scalar values. You transpose a scalar
value, it stays the same. So we might as well replace this one, it's just a scalar, with its transpose. Then we apply this thing over here and we see this one is
equal to x transpose. Can people see that from
the back, the bottom row? Okay, x transpose F x over two which cancels with this one. So this form here is equal to zero. So all we're left with from x
transpose F x is x transpose, x transpose the symmetric
part of F times x. So by using non-symmetric matrices, you actually don't gain anything. You're just in some
sense obscuring the fact that in reality, all you're using is the symmetric part of that non-symmetric matrix. And so we might as well
not obscure those things. We might as well say well, it's clear. We only ever use the
symmetric part of the matrix, so let's just from the beginning say we only use symmetric matrices because what's the point
in introducing an F and then really be working with
F plus F transpose over two. Let's just directly work
with the symmetric matrix which we know is the essence
of what's happening anyway. Not just the essence, everything that's happening
is the symmetric part. So the fact that I said Q
and R have to be symmetric is not really a restriction
on the formalism. It's just a way to be more transparent about what's really happening. Only the symmetric part is used anyway. Might as well do it upfront. The fact that they have
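Written out, the board argument from this aside is roughly the following, where F is the arbitrary, possibly non-symmetric matrix introduced above:

```latex
\begin{aligned}
x^\top F x
  &= x^\top \frac{F + F^\top}{2}\, x \;+\; x^\top \frac{F - F^\top}{2}\, x, \\
x^\top \frac{F - F^\top}{2}\, x
  &= \tfrac{1}{2}\, x^\top F x - \tfrac{1}{2}\, x^\top F^\top x
   = \tfrac{1}{2}\, x^\top F x - \tfrac{1}{2}\, \bigl(x^\top F x\bigr)^\top
   = 0, \\
\text{so}\quad x^\top F x
  &= x^\top \frac{F + F^\top}{2}\, x .
\end{aligned}
```

The middle line uses only two facts: the transpose of a product reverses the order of the factors, and a scalar equals its own transpose.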
to be positive definite. It should be clear why you want that. We're gonna be minimizing cost. And if there's any way to
achieve a negative value, if there's something that
achieves a negative value here, whatever that is, then you're gonna want to go
to infinity in that direction z to minimize cost so
you get a very negative value and that's not really
an interesting problem. Trying to send off
your system to infinity is typically not an
interesting problem to solve. And so to make the problem interesting, it has to be positive definite so there's a clear goal to be achieved. And then the symmetric thing
is just as you saw there. It's just a way of being transparent about what's gonna happen anyway. Any questions about this? Now, one thing you might say as well, hm, linear quadratic systems. That's maybe a lot of matrix algebra and you're like that sounds fun, but maybe it sounds very limiting to you and you're saying am I really gonna stick
around in this lecture today 'cause all we're gonna see maybe is just these linear systems? So, I'm gonna give you a quick heads up that we can do a lot with these systems. We have a slide with the description here. What we're gonna cover in
the first part of lecture is to solve this specific problem
that I wrote on the board. What we'll do in the remainder of lecture is see how we can use it as the core to solve more difficult problems. How difficult? Essentially, arbitrary non-linear systems. So one good example of a very non-linear
system is a helicopter. Helicopter flight is, the helicopter itself
is non-linear dynamics due to just how rotations interact, how rotations are represented. They're kind of a complex
thing to deal with. But then also the airflow
around the helicopter makes it highly non-linear and so there's a lot of
complicated dynamics happening. It's definitely not xt plus
one equal A xt plus B ut. Not even close. But what we'll see first is
the core of how to solve this, then how to extend this so we can start doing things like this. So what we're gonna watch here
is an autonomous helicopter that will go through essentially the entire flight
envelope of the helicopter, do every possible maneuver
expert pilots could demonstrate, and at actually higher precision and similar speed as the expert pilots do. So this was some of my PhD
work several years ago. This is all powered by LQR under the hood, inverted take off. Then it hovers. That's actually a hard problem in itself to stay in place with a helicopter. Very difficult problem. Check box, mark then. A split-S, it's a way to change direction. Very non-linear. The dynamic is not even close
to linear anywhere here. Stall turn. Nothing close to linear. Dynamics is continually changing also due to the airflow
around the helicopter. It can do spins at the top. Come out tail first, go up to 55 miles an
hour at maximum speed. It can go inverted. And so, this is a highly unstable regime to fly in because when you're inverted
just like holding up a broom, when you're holding up a broom
on the palm of your hand, it's easy to tip over. When you're holding up the
body of the helicopter here, then flips, rolls. None of it anywhere close to linear, but we'll actually be able
to solve this problem, at least the control part of this problem with the machinery we're
gonna build up today. So, let's work through that machinery. So, remember value iteration. In control, people often
use J rather than V and J for cost, so if we had value iteration, it would be J for i plus one steps to go for being in state s equals min, 'cause we're minimizing
now, over controls u. The cost encountered in that first step plus then the sum over future states of the probability of being in that future state given the state and action times the optimal cost to go. Sometimes, we'll call it value, sometimes, we'll call it cost to go. It's very commonly called cost to go which is all costs you expect to encounter when acting optimally from then
onwards for i time steps starting in state s prime. Okay, so now we have those
assumptions over there. LQR. What does it say? Well, let's fill it in. J i plus one and we're
gonna have x now for state. J i plus one for x is the min over u. What are we doing? Well, the cost g is written over there. It's gonna be x transpose
Q x plus u transpose R u maybe with a 1/2. You can kinda choose. The 1/2 is good to let
things cancel out later. Then plus sum over all
next states x prime. But actually, with sum over
all next states x prime, we have a deterministic system here. Xt plus one equals A xt plus B ut. So we don't really need to sum over it. It's just gonna be one next state x prime. We even know what it is. That's state x prime. So J i steps to go for x prime which is equal to Ax plus Bu. So try to minimize this. Remember how we did value iteration. We said with zero steps to go, it's easy. So okay, let's do that. So we'll start with zero steps to go. When there are zero steps left, in this case, we're not
gonna set it to zero. We could set it to zero. We're actually gonna set
it to J zero x equals x transpose P zero x. All right. Looks like in the slides,
I did it without the 1/2, so let's not do the 1/2
on the board either. So, if we set P zero equal to all zeroes, this means there is no end cost, but if P zero is
different from all zeroes, then essentially we're saying if you don't end up at the all zero state, we have a preference between different states
you might end up in. Along some axes, maybe
there's a higher cost to end up somewhere far out
there compared to other axes. Now, J one of x equals well, we wrote it over there,
so let's rewrite that. Min over u of x transpose Q x plus u transpose R u plus J i of this thing. Well, that's J zero here. J zero is x transpose P zero x. So, we fill that in. We have then Ax plus Bu transpose P zero Ax plus Bu. Okay, and now we want to find
the u that minimizes this. To find the u that minimizes
this, what do we do? We compute the gradient with
respect to u of that quantity. Well, let's see. What do we end up with? The gradient with respect to u of this quantity over here. The x transpose Q x term doesn't have u in it, so it contributes nothing. U transpose R u, that becomes two R u. Then this thing over here becomes plus two times B transpose
P zero Ax plus Bu and I want this to be equal to zero. If we can find a solution
for that equal to zero, we've found the optimal control u. Well, if we do this, we find u is equal to negative. So the u appears here and appears here. We just need to do some grouping of terms and move to the other side. End up with negative R
plus B transpose P zero B inverse B transpose P zero Ax. So we have a nice closed
form for u, which is nice. That means we don't
have to discretize u. We don't have to search for the best u. We actually have a closed
form u as a function of x. And this we'll also call K times x. So we'll call this, what's
sitting in front here, K. Then what we can do, we can fill u equals K times x into this thing over here to see what actually our value is at x. Once we fill in the optimal u, we can find out the
value for that state x. Once we do that, what we end up with is let me write it over
here, top of the board. Filling in, that's just
a bit of simplification. You don't have to do anything special. We find J one of x equals
x transpose P one x. For P one, equal Q plus, Q plus this. Actually, let's call this thing K one 'cause it's specific
to the first iteration. It'll be Q plus K one transpose R K one plus A plus B K one transpose P zero, A plus B K one. And K one equals minus R
plus B transpose P zero B inverse B transpose P zero A. So what this shows us, if we want to find the value function for one time step to go, we just need to compute K one. This will give us P one and then we have the value
function for one time step to go. Now, the amazing thing
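For reference, the one-step backup derived on the board can be written as follows, using the lecture's convention without the factor 1/2:

```latex
\begin{aligned}
J_1(x) &= \min_u \Bigl[\, x^\top Q x + u^\top R u + (Ax + Bu)^\top P_0 (Ax + Bu) \Bigr], \\
0 &= \nabla_u[\cdot] = 2 R u + 2 B^\top P_0 (Ax + Bu), \\
u^\ast &= K_1 x, \qquad K_1 = -\bigl(R + B^\top P_0 B\bigr)^{-1} B^\top P_0 A, \\
J_1(x) &= x^\top P_1 x, \qquad
P_1 = Q + K_1^\top R K_1 + (A + B K_1)^\top P_0 (A + B K_1).
\end{aligned}
```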
that's happening here and this is very unique to
the linear quadratic setting and that's why it's so powerful and a building block for other things, the amazing thing that's happening here is that this thing here is of the same form
factor as we started with. We again have a quadratic. J zero was a quadratic and
then J one is a quadratic in x which means that we can
repeat this process again. If we want to find J two, we just need to up all
the indices here by one and we'll have the same thing we can do. And repeat, repeat, repeat. For most systems, let's say you had a polynomial system. For most systems, what will happen is the
degree of the polynomial will keep increasing as you do this and things will become
more and more complex. We kind of get lucky here because
when we take the gradient, the degree drops by one, and so the solution is just linear in x. We fill it in, we stay quadratic,
and this never changes. Computationally, this is very feasible. I mean, this is just some
matrix vector multiply, some matrix inverse, some
matrix matrix multiplies. Same thing here. So computationally,
this is very practical. Order n cubed where n is the
dimensionality of your space. Any questions about this? This is the core idea for today. Yes? Actually, we still have the thingy today. Let's see. Oh. (students laughing) Obstacle, sorry. (students laughing) - [Student] So can P zero
be any arbitrary matrix? - So can P zero be any arbitrary matrix? P zero has to be positive definite for this to be meaningful because if P zero is
not positive definite, essentially what would happen is setting the gradient equal to zero, for that to be the right point, your function needs to be
shaped to have a unique minimum. Otherwise, you have to go inspect whether the gradient is zero. Is it a minimum, or a
maximum, or a saddle point? And so because P zero
is positive definite, unique minimum for the quadratic form. When we set the gradient equal to zero, we know we have the unique minimum and that's the right thing. If P zero were the other
way, shaped like that, and you went through the same math, well, then what you would find here is actually the way to
achieve the maximum. And what you'd be doing is
maximizing that objective, so something kinda weird going
on that's not very natural. - [Student] So what does P
zero mean like intuitively? - Intuitively, P zero
means how much you care about being close to zero
along different axes. So, quadratic forms, you
can draw them from above as let's say maybe you draw
quadratic form like this. What this means is that
maybe the quadratic form has a value of one here,
two here, four over here. And so what that means is that if you deviate in this
direction from zero, things increase very rapidly and incur very high cost very quickly. So that's a bad place to
end up if you can avoid it. Whereas in this direction, it increases much more slowly. So let's say you don't
have enough time steps left to drive yourself all the
way to the zero point, obviously, you want to be at zero. Let's say you don't have
enough time steps left, you'd rather end up somewhere
here away from zero than here. And in fact, this is just P zero, but if Q let's say were
shaped the same way, if Q and P zero were the same, right, then at all times, you would prefer to be along this axis if you're not there yet. So if you start out let's say, I don't know, over here, you'd prefer to as quickly
as possible get this way and then come in this way. Now, if it's possible of course, you might even prefer to go even faster directly to the goal, but the dynamics will limit
you in what you can do. And so within what the
dynamics allows you to do, you'll prefer paths along axes where you're not penalized as much. You're willing to have
more deviation still if you have to make trade offs. So, P zero and Q are really
trade offs is what they define. So one example would
be for the helicopter, maybe you say you know what? I care a lot about
position and orientation. And some entries in Q and P zero will be related to position. Some will be related to orientation. I think you might give it some thought. You might say well, actually, I probably care
more about orientation. Why? Because if orientation is off, things get very complicated. And if I were to let's say approximate my helicopter
with a linear system, well, the non-linearities
will start kicking in much more quickly if
my orientation is off. But if I'm off in position, the non-linearities will not
kick in actually at all 'cause where you are as a helicopter, the dynamics is not
affected by your position. Similar thing would be let's say you were
controlling a cart-pole. You were doing cart-pole balancing. Well, cart-pole balancing, if the angle here changes, the dynamics will be very different. The linearization of the
dynamics will be very different. But if your location has changed, the linearization of the
dynamics will be the same. And so what happens is
if you are thinking about how should I design my cost function, if I know I'm gonna use a linear approximation to the system, then I should design my
cost function in the way that the linear approximation
is as good as possible. I'm gonna think about dynamics and say oh, position is less critical
than orientation. Let me put a lot of penalty on orientation so it always keeps it nicely up and then it can worry about position later not because I care about it later, but because I think it's
gonna fail otherwise. 'Cause my linear system that
approximates the real system will become imprecise and my whole calculation
based on linear systems will not be sufficiently precise to give me the correct solution. And so in practice, there's
a lot of that going on as you design your P
zero, and your Q, your R. And if you approximate your
system with a linear system which is what we're gonna do because most systems are not linear, we're gonna approximate them that way, you're gonna have to think about how can I assure my linearization is good? And your cost matrices can help you ensure that you're more likely
to be in the regime where your linearization is good. So there's a lot of
intricacies going on there. It's probably a longer answer
than you were looking for, but that's kind of the
design aspects of it. Want to throw the box back? Nice. Were there any other questions? Okay. So, we now understand how to solve a linear quadratic system. Let's show the slides that have
the (mumbles) math on this. So this is what we did on the board. We derived that only, you get back to quadratic form and this is the only
calculus you have to do, just linear algebra. So since we have that, we
end up where we started. J one is just like J zero. So we can go from J one to J two just like we went from J zero to J one. And for all times, we
have a closed form update. This is pretty crazy by the way. You could try to give it a lot of thought and say can I come up with
any other dynamical systems and cost functions where if I do a value iteration backup, I can get an analytical solution, and then after that analytical solution, I get the same form factor back so I can just do this repeatedly and nothing blows up in terms
of what I need to represent, I don't know of any others. I mean, there's a discrete
case where we just enumerate, but in terms of continuous spaces, this is the only one I'm aware of where it just stays nicely in closed form. So that's how we get J two and then the full solution
will be just iterating this. We just compute K, K i,
P i, and keep repeating. The optimal policy for i-step horizon will be just pi of x is Ki times x. So you just store your Ki's and you just then use them
at the appropriate time step. And if you want to know how
much cost do I expect to get from this particular state x, it's x transpose Pi x. There's a little fact here: this is guaranteed to converge to the infinite horizon optimal policy if and only if the dynamics A, B are such that there exists a policy that can drive the state to zero. In some sense, this is intuitive. If there exists a way to
drive the state to zero, then definitely in the
infinite horizon case, you should drive it to
zero so at some point, you stop encountering cost. So these Pi's will start converging because you're bounded in how much cost you're
ever gonna encounter. If it's a path that drives you to zero, whatever the cost of that path, that's an upper bound on what
the optimal control will do in the infinite horizon case. And if you're upper bounded and your series is always going up, it's guaranteed to converge
and so you have convergence. Often, people do that. They just keep iterating until convergence and just use a steady state
feedback matrix for all times even for finite horizon problems because well it's convenient
to store only one matrix rather than having to keep track of time and many, many matrices. Let's revisit our assumptions now. So, we made a bunch of assumptions here. What this really is about right now is about keeping a linear
system at the all zero state while preferring to keep
the control input small. How can we extend this? We can actually extend
this to affine systems, systems with stochasticity, regulation around non-zero fixed point for non-linear system, penalization for change in control inputs, linear time varying systems, trajectory following
for non-linear systems. So, we have a lot of extensions to cover. Let's take a small break here, start again in two, three minutes. And in the meantime, I'll just project what we have
as our main result so far. (students chattering) If we don't have anything,
then we can set it to zero. Or if we want to find the
infinite horizon version. - If we set it to zero, wouldn't u star be like zero? - At the very end, it would be. For the last time step, it would be, but from then onwards,
it would not be anymore. - So P zero is zero. That means K one is also zero. - K one will be zero, yeah. K one for one time step to go. (students chattering) K two will be non-zero and essentially will start
accumulating from there. - So P one will be Q. - P one will be--
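The case being discussed in this exchange, starting from a zero end cost, can be checked numerically. This is a small sketch with made-up matrices (A, B, Q, R below are assumptions, not values from the lecture), and `backup` is a hypothetical helper name for one value-iteration step.

```python
import numpy as np

# Illustrative system (assumed values, not from the lecture).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.diag([1.0, 2.0])
R = np.array([[0.5]])

def backup(P):
    """One value-iteration step: returns (K, P_next) given the current P."""
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    Acl = A + B @ K
    return K, Q + K.T @ R @ K + Acl.T @ P @ Acl

P0 = np.zeros((2, 2))   # no end cost
K1, P1 = backup(P0)
K2, P2 = backup(P1)

print(K1)   # all zeros: with one step to go, any control only adds cost
print(P1)   # equals Q
print(K2)   # non-zero from here on
```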
- Q, yeah. - Correct, mm-hmm. So if you start with (faint speaking) zero equals zero, you're effectively starting
with P zero equal Q and waiting for one iteration later. Mm-hmm. - So, P zero can be
positive (faint speaking)? - Yes. As long as Q is positive definite. Q and R are really the
ones that matter here. - Thank you. - So, we want to show that
this successive multiplication is a one step optimal
strategy is optimal globally, (faint speaking). Like let's say we have a finite horizon and then we work like backwards and show that this is optimal-- - The reason it's optimal globally is just 'cause it's value iteration. We're doing exact value iteration. We know value iteration is optimal, so it's a direct consequence of that. - Thanks. - All right, let's restart. So one question that came up during break is how do we know that this is optimal? And the reason we know this is optimal is that this is value iteration and we know value iteration is optimal. It's just that we just found out that we can actually do
value iteration exactly in a continuous state space under this set of specific
assumptions on dynamics and cost. But now, let's start
revisiting those assumptions. How about in affine systems? What does that mean? That's an offset there. This means you can actually
not keep the system at zero. Think about it. Because if xt is zero, ut is zero, then xt plus one will be c. So you will continually
keep accumulating cost. There's no way around it here because whenever you're not at zero, you get a cost, and you
cannot stay at zero. You can still go through the same math. One way you can do it is you can say well, I'm gonna go through the
same math I did before, re-derive the update. And that could be a good exercise. You just do all the math
we did in previous slides. There will just be some extra
offset terms coming around. So then the control, u optimal control will not
just be matrix K times x. It'll be matrix K times x plus some offset because you need to apply offset control. Another thing you can do and what you're gonna do a lot in what we're gonna see
in these extensions, you can re-define your state space. So we could say what if our state space is actually we'll call it z now and there is xt plus
one which I had before with an extra entry of one. If we set it up that way,
a slightly bigger state, then we can actually write it again as a linear system in z space. And we can then say okay, now we can re-apply all the
equations we've seen before. We're just solving it in z space and then we need to apply a control. We just need to make sure that okay, we turn our x into a z which is an x and a one at the bottom. And then we multiply with the K matrix that we found to get the controls. So it's a little trick to not
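A minimal sketch of the state augmentation, assuming the affine system is x' = A x + B u + c. The names `A_aug` and `B_aug` and all numeric values are illustrative assumptions. The check at the end confirms that the linear system in z space reproduces the affine step in x space and keeps the extra entry equal to one.

```python
import numpy as np

# Assumed illustrative affine system x' = A x + B u + c (values are not from the lecture).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
c = np.array([0.05, -0.02])
n, m = B.shape

# Augmented state z = [x; 1] gives the linear system z' = A_aug z + B_aug u.
A_aug = np.block([[A, c.reshape(n, 1)],
                  [np.zeros((1, n)), np.ones((1, 1))]])
B_aug = np.vstack([B, np.zeros((1, m))])

x = np.array([0.3, -1.0])
u = np.array([0.7])
z = np.append(x, 1.0)

x_next = A @ x + B @ u + c
z_next = A_aug @ z + B_aug @ u
print(z_next)   # first n entries equal x_next, last entry stays 1
```

One detail this sketch leaves out: the cost matrix also has to be padded to act on z, for example with a zero row and column for the extra coordinate.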
have to re-implement anything when you go from linear to affine. How about stochastic systems? Imagine you have a wt here. That's not an offset term. Its mean is zero, but it has
variance, non-zero variance. So there's some noise in the system. I encourage you to do this as an exercise to see if you can fully
understand everything we derive. You can work through the same derivation that I did on the board. It's now stochastic, so it
will have an expectation, expected value. And (mumbles) the expected value, what you'll see up here. Well, the expected value of w multiplied with anything
that's not w will be zero. That will simplify out. But the expected value of w multiplied with w is a
covariance matrix for w and that will be carried throughout and it will show that actually, you have a higher expected cost. Why? Well, it's natural. You have noise in the system. You can't control it as precisely
when you have this noise. But the funny thing is that
the optimal control policy, the K matrices you'll
find are exactly the same. They don't change. You'll just see that you'll
have an expected higher cost but use the exact same control strategy. Kind of an interesting result. Some people call it
certainty equivalent control. Like in this particular setting, even though there is
uncertainty in the system, you can design your optimal controller ignoring the uncertainty and it will give the same result
in terms of your strategy. How about non-linear systems? Xt plus one is f of xt comma ut. We can keep the state, we can keep the system at state x star if and only if there exists some u star such that x star equals
f of x star u star. That's what it means to be a fixed point. Let's assume there is a
fixed point in the system. And we can see can we stabilize
it around that fixed point? That's like keeping a helicopter in hover, keeping a cart-pole balanced at the top. It's non-linear, but
there's a stable point, and the question is how can
you stabilize around it? 'Cause usually there is a stable point. Sure, you might say I just stay there, but (mumbles) perturbations, and you'll need to be able to steer back onto it. So it's not enough to
know u star to stay there. You need to know a feedback
strategy to stay there. Linearizing the dynamics around x star, what would it give you? Well, xt plus one equals this is a first order Taylor
expansion of the dynamics at x star u star. Well, that's really A and B, or equivalently, we can write
it as xt plus one minus x star equals A xt minus x star
plus B ut minus u star. If we now redefine our coordinate system, centered around x star and u star, we'll call that z, and
for u, we call it v, then we have exactly what we had before. Zt plus one equals Azt plus Bvt and then our cost will be well, some cost for staying close to that. So we don't have to write any new code to find the linear
quadratic optimal control around the stable point,
not a stable point, a stationary point of a non-linear system. Okay. So, just run a standard LQR. Once you've done that, you need to find your controls. First, you'll turn your x into z. You'll find v from that. Then you can turn your v into your u and here it is, the equation to find u. All right. Here's another one. What if instead of
penalizing for the controls, you want to penalize for
the change in controls? Why might you care about this? Well, often what happens
when you have a system, let's say this is your
system and you run it. In the real world, your
optimal controller, what you'll find is that there will be a
lot of high frequency content in your controls. Even if in the simulator
when you run this, it might not be the case. But in reality, there's
always a little bit of noise. There's a bit of mismatch between reality and your simulated system. So your optimal controller (mumbles) everything is deterministic is gonna actually use pretty
large controls typically and constantly adjust the controls to whatever at that moment looks best. And you typically don't want that because typically what happens is when you have high frequency controls for a physical system, those high frequencies are hard to model. And actually the model that you're using, your linear system model that you have is often not very precise
for the high frequencies. For example, for a helicopter, at high frequency, you
just shake it apart. It rips itself apart at
high enough frequency. And so that's not great 'cause you have all this
kind of modes really of the physical system that are not modeled in your control model that you have for your system. So you want to avoid
those high frequencies. Okay. So, what you can do, one thing you can do, Anderson and Moore looks at this, is you can frequency
shape the cost function. What does that mean? You would say well, I'm gonna add extra variables to my state so I don't just have xt, but I also have xt
minus one, xt minus two, xt minus three, xt minus four. I keep them all around in the state. That's easily done. You can expand your A matrix to have some effectively identities to memorize things from
the past for a while. And then you can
essentially set up something that penalizes for rapid changes in those. Or you can even set up
a very specific filter on your past states if you
have a certain frequency that you specifically want to suppress. Like maybe there's some kind of, you've done an analysis
of your physical system and there's a certain frequency at which the system has a resonance and the physical system will
start accumulating energy and shaking itself apart. Then you might want to minimize anything that happens at that frequency. And you put something
in your cost function to just avoid that frequency. Simple special case which
works very well in practice, just penalize for changing control inputs. How do we do that? This is our original problem. But now, we want to penalize
for change in controls. Well, solution A is to augment the state with the past control input vector. So we have xt, but also
ut is stored in the state. And then we can penalize for
how the new input is different from the one we've already
stored in the state. Solution method B is the one we'll cover is you actually change
the dynamical system to be expressed in terms of
the change in control input rather than the actual control input. So for some reason, maybe originally you were controlling, I don't know, velocity, and now you're effectively
controlling acceleration and you want to keep
your acceleration small rather than keeping your velocity small. Okay, or maybe you want
to keep both small. That's fine too. So what does it look like? What we have is, whoa, really animated, but okay. Xt plus one ut is our new state variable. We can get that from xt and ut minus one which at the previous
time are in our state. And then a delta in controls comes into B. This ut minus one also
gets multiplied with B. So actually in the top row, we get the original dynamics. Xt plus one equals A
xt plus B ut minus one plus B delta ut. So that's the same as plus B ut. And the bottom row is just
keeping track of the controls from the previous time. So this is what we have, just a redefinition of
our state and input space. Our cost can be the same we had before where now there is an R prime introduced. Our original Q is here. Our original R is here and
there's an additional R prime to penalize for the change in controls. And this then matches the
standard LQR format shown above. I'd say for pretty much
any physical system, you're gonna want to do this. It pretty much never works without doing something like this. What else might we want to generalize to? Linear time varying systems. So we had stationary system before and now, there's an index there. The dynamics depends on time, At, Bt. Well, in all the math we did, there was nothing that assumed A and B would be staying the same over time. We can work through the exact same math, exact same update equation, we just need to keep
track of the time indices. The gains will not converge to anything,
they'll be varying over time, so you're not gonna get the
same convergence properties, but you can do the exact same math and find a time varying linear controller that's appropriate for your
time varying linear system. So update equations will
look exactly the same, just need to keep track
of the time indices. You might wonder how often
do you really run into a time varying linear system? Like, that would be quite a coincidence. It's not linear, but
it's time varying linear. That would be kind of really special. In practice, it's rare to have
a time varying linear system that's actually what you run into, but it's very common to run
into a non-linear system. And if you know the path
you're going to follow in state space that
you're trying to follow, then you can approximate
in a time varying way with a sequence of matrices your dynamical system as a
linear time varying system. So different linear
approximation at each time step to what is actually a non-linear system. And so you'll get linear
time varying control and you'll get again, quadratic costs. So let's look at what I just described which is the most direct application of this kinda linear time varying setting. We want to do a trajectory
following for non-linear systems. For example, we want our helicopter to
follow a specific path. The helicopter's non-linear, so
we can't do the linear thing, but what can we do? Well, let's assume there's a
feasible target trajectory. It's something that the
helicopter can actually do or our system can actually do. How would we know what that is? It's not necessarily easy to know, but let's assume we know it maybe from somebody already executing it, a human pilot maybe, or maybe from having a
precise analytical model from which you can somehow derive what is a feasible sequence of states. Well, feasible means
that there exist controls such that this trajectory gets followed. You might say well, wouldn't it be enough to just apply the sequence
of controls and we're done? Why do we still need to do any work? The reason we still need to
do some work is typically, even though you know your sequence that would follow the
trajectory in principle, there will be perturbations. You will be thrown off your trajectory. Why? Well, there could be explicit
perturbations like noise, wind pushing your helicopter around. Or there could be things that your f that you have there is not perfect. It's imprecise. And so even though you
think it's feasible, it's actually not feasible. And you can think of the
mismatch between the real f and the f you work with also as noise. It's not necessarily a noise, per se. It's not a noisy process. It's just that there's a mismatch between your f that you
use and the real one, and so real update plus some noise is really what you're working with. So our problem statement then would become: let's minimize the quadratic deviation from our target states at all times and quadratic deviation
from our target inputs. And then well, we might say
why do we penalize (mumbles)? Why do we need to stay
close to our target inputs? Well, we can think of it in
some senses as zero centering. We know the controls that we need if there's no perturbations and so that's a strong prior and we're going to penalize
for deviating from that 'cause we know that's
the right place to be. And then if we need to deviate, sure, we do it to stay on track, but by default, we're gonna
try to stay close to u star. Then we transform this into
a linear time varying case. Again, Taylor expansion. Xt plus one is some function
of x star t u star t plus then what will be our A and B for those respective times, times xt minus xt star, ut minus ut star. And so here's our linear
time varying system in a new coordinate system, the x minus x star and u minus
u star coordinate system. And at this point, we're
actually good to go. We can do the standard thing
where we transformed it. We've got a standard LQR backup operations and the resulting policy at
i time steps from the end will be this thing over here which we knew is gonna be that format 'cause it's just a linear time
varying system at this point and the target trajectory need actually not be feasible
to apply this technique, so if you actually don't
have a feasible trajectory, you can still go through the same math, but you'll have to keep
track of an extra term here in what you do. So you'd have an affine system rather than a linear system and so you have to keep track of that. If your trajectory is
feasible, it'll work out fine. Yes? - [Student] What if you have only access to the states from the target trajectory and not the actions? - Oh, good question. What if you only have
access to the states, not the actions. Then, we wouldn't know the u stars. We couldn't center it around them. And so we would end up with
not necessarily knowing where we want u to be centered. We might essentially have no u star here, just set it to zero. That's maybe the best we know. And you can actually work
through the same math, but again, you'll end
up with an offset here. So whenever the trajectory's not feasible, 'cause effectively what we're doing is we're replacing the
u stars with zeroes. It's an infeasible trajectory. We'll end up with offset term over here. And so we'll have an
affine system to deal with rather than a linear system. But other than that, we can still run through the same math. Question here. Do you want to take the mic for a moment? - [Student] This is fun. Does error tend to like
accumulate in the system 'cause you have your defined trajectory and then you're off by a little bit and then you keep going
off, and off, and off. - So, absolutely. When you're trying to follow a trajectory, typically an error will happen. An error meaning you thought you were steering right
back on to (mumbles), you thought you'd optimally
control back onto it. But then there's a perturbation, either a mismatch in dynamics, so you're actually not steering onto it 'cause the model you're using is wrong, or there is maybe an actual perturbation, a force applied to your system that is an external perturbation, and you'll still be off. And so then really what matters is whether you're good at
steering on relatively quickly compared to the perturbation force. If the perturbation force
is very, very strong, pushes you very hard and the level at which you're able to steer back on is smaller than that, you'll end up not getting back
on and it might go unstable. If your ability to control
back on is very good, then yes, you'll never actually
be on the target trajectory, you'll always be around it,
but you'll stay around it. And so it really depends on
your control bandwidth there. Do you have the ability to keep it on or do you have not enough
control to keep it on? Like for example, a helicopter. If you're flying it in wind gusts up to like 10, 20 miles
an hour, we can keep it on, but with wind gusts of like, I don't
know, 100 miles an hour, there's no way we're keeping
it on the trajectory. It just doesn't have the
ability to counter that. Over there. Behind you first. Oh. (audience laughing) Try way too much arch. - [Student] So what is the component here that keeps it on the
trajectory (faint speaking)? - What's gonna keep us on track
is the feedback matrix here. So, the controls we
choose are gonna be such that if we deviate from
where we were supposed to be. Let's say if we already were exactly where we're
supposed to be at that time, then we'll just apply
u star for that time. This is saying how much we're
gonna deviate from u star and so if we're off from the trajectory, this K matrix will describe
what controls we need to use to steer back on. Why does this happen? Well, the original optimization problem says that we need to stay
close to the x stars. That's in the objective. So the optimum is be on and then that feedback
matrix is just a consequence of our objective. (laughing) Try it. (faint speaking) You also had a question, right? - [Student] So it seems like the problem at the top of the page is equivalent to a convex
optimization problem? I don't know if you know whether or not that's true or not, but if it is true, why not just use a solver rather
than LQR backup iterations? - Yeah. So it's a good question. Next week, we'll look
at ways to use solvers to solve for this. If you just solve it here. Thanks, Eric. If you solve this as a convex
problem just finding the u's, you'll just find the u's. You will not find a feedback controller. But we'll see extra things
we can put around it to get there, but in practice, the nice
thing about solving it this way is in some sense that you're exploiting all
the structure in the problem 'cause it's actually a
dynamic programming problem. So we're actually using all
the structure available to us to find a solution
including a feedback matrix, including a cost to go matrix. And that will be really helpful as we'll see as even we see
other methods to solve this, we'll still be very interested
in a feedback matrix and a cost to go matrix. - Here you go.
- Over there. Thank you, Eric. - [Student] So is it possible to like interlink two
different LQR problems? For example, like use an easier model that kinda captures what's going on but deviates from reality
to solve for the u star, and then have another model on the top that includes the non-linear part and put in the non-linear (mumbles). - Yeah. I mean, I think that's a very good idea and that often solving for the x stars, u stars is difficult, and if you can have ideas of
how to simplify solving for it, then building up to it. I don't know of any kind
of extremely principled way to get this done, but definitely people will
try all kinds of things to find x star, u star, and we'll see more of that next week. So let's defer questions
about how to find x star and u star to next week 'cause
we'll see ways to do it then. All right. You got to far throw it here yourself. Careful, everyone. (students laughing) Nice. How about the most general case? We just want to solve this thing. How could we do this? Well, one thing you could say well, why not use a black
box optimization solver? Maybe we can find the u's. But again, we'll just find the u's. We'll not find the feedback matrix. We'll not find a cost to go function. And in fact, we'll not necessarily be exploiting the
structure of the problem. We can actually iteratively apply LQR and this is a very
powerful way of doing this. So let's step through this step by step. We initialize the algorithm by picking either a control policy or a sequence of states
and control inputs. So we'll do the control policy thing. We just assume we have
some control policy, whatever it is. We have something. Maybe it's just all zeroes always. Then, start on step one. We execute the current policy and record the resulting state
input trajectory shown here. Now, we have a feasible trajectory. We have something that
actually just happened that we can do. If this was our target trajectory, we kinda could do just like
LQR to stay on that trajectory. But actually, this is kind
of an arbitrary trajectory. We picked some arbitrary
sequence of controls, we found the trajectory. But it is feasible, which is nice. Then we compute the linear
quadratic approximation of the optimal control problem around the obtained state-input trajectory by doing first-order Taylor
expansion of dynamics and second-order Taylor
expansion of the cost function. If I had a very general
cost function, right, could be anything, but we can do first and
second order approximations for dynamics and cost
function respectively. Once we have that, we have a linear time varying
quadratic control problem. So we could use LQR backups to solve the optimal
control policy for each time which I'll call pi i plus one where i plus one here is the iteration in the outer loop of the algorithm. Then, we can go back here,
execute that policy, and repeat. So this is kind of an iterative process where you have a policy,
find a better policy, find a yet better policy
over, and over, and over. Now, I haven't yet proven that this is always
gonna improve the policy, but that's the intuition. That's what we're hoping for. Now, what would it actually
look like in equations? This is what it looks like. We have a linear
approximation to dynamics, and we'll have a quadratic approximation to the cost function, and here's our new state
zt vt for new controls centered around the previous
trajectory we found. Then our A matrix, our
B matrix, our Q and R. Here, we're assuming
that our cost function depends on the x and u separately. Otherwise, you have some quadratic terms that cross between x and u. Usually, people don't have
much crossing between x and u. They penalize state for what they care about in
state, and penalize controls for what they care about in controls. And so most of the time,
it'll look like this. But in principle, you could have a cross
term between x and u. So this is all we need to compute, the A, B, Q, R, and then we can do our linear
time varying system back ups and we can find our new
sequence of feedback matrices. We roll out based on that and go again. Okay, so as we look at it here, it's actually pretty simple. Yeah, you might have to
get some derivatives, but actually, you can
just do finite difference if you don't have an
automatic differentiation through your system. Just do a finite difference. If you do have automatic
differentiation available, you can just automatically
get the derivatives out. Either way is fine. So, does this converge? It need not converge as
I formulated it actually. The reason is the optimal policy, and this is real important, the optimal policy for the
approximation we're making might end up not staying close
to the sequence of points around which the LQR approximation was computed by Taylor expansion. So if there's a linear time varying system with quadratic cost, we find the optimal solution to it. But if that optimal solution is far away from the previous trajectory, then yeah, it's optimal for
the linear time varying system, but it's in such a
different part of the space where that linear approximation
is just not precise. And so it's actually not optimal for the original system at all, and it might not even be an improvement. It might be worse. So you have to be careful about using your linear approximation and your quadratic approximation in the region where they're valid. How can you do this? You might say well let's just solve and then do some step sizing. You could try to do that, but actually, that might
not be the best thing to do. You can do something much better here. We want to stay close to where
our approximation is valid. Where is it valid? It's valid close to the xti's and uti's that we had from our previous trajectory 'cause that's where we
linearized and quadraticized. So we're gonna add a cost
term to stay close to that. And we know that if this
cost term is big enough, if this dominates, we're gonna try to stay very
close to where we were before, and then there's the little bit of
the actual cost we care about, which will pull us a little bit off that trajectory we had before to do a little better on
the original cost function. And so if we set it up this
way with a large enough alpha, but still a little bit of
weight on the original cost, we'll gradually improve. We're guaranteed to improve. Why is this so much better than just a regular line search? Imagine you just did the
regular thing and it was worse. You did it (mumbles) with a policy, but it was like going
elsewhere and really bad? Well, it's not clear what
that line search would do. You're kinda trying to
decrease your controls. What are you trying to
do in that line search? It's not clear, right? But what we have here is a way to ensure in a dynamic programming way everywhere along the trajectory that we ensure we stay close. It's part of our objective. When we're early on executing, we're already thinking ahead and saying don't just
care about the cost g, make sure we also stay close
to where we were before because otherwise I cannot
trust my linear approximation, and hence, not the result
of this calculation that I do to find controls. And so you'll have to do a bit of a line search on this alpha. You'll have to play around with alpha. If you make alpha close to zero, then you can make big updates and you can see maybe
it's a lot of improvement. You're good to go. So what you would do is effectively, you would do this with
some setting of alpha and if things improved, you would say okay, good to go. If things got worse, you might say okay, I
need to make alpha larger 'cause it got worse which means the linear
approximation was not good enough where I ended up, so I need to put more
emphasis on that being valid. This is often described as
a trust region approach, where you have a notion of how much you trust the function you're optimizing. You have an approximation to
the thing you're optimizing and you have a region in which
you trust that approximation and you're only willing to
optimize within that region. It's not exactly a
constrained region here. It's more penalizing
for going too far away than having a hard constraint, but it's the same idea because we know when you
have a hard constraint, effectively, you just put a Lagrange
multiplier in front of it and it becomes a penalty again. It's equivalent. It's just down here, we might not know the exact setting we need to use for that alpha. But if we knew magically
the right setting, it'd be good and we don't have to do
anything more complicated than just checking. Is the solution making progress on the previous one we had or not? And if that's the case, we're good. If it's not the case, make sure we penalize more for deviation. Some practicalities. F is non-linear, hence this is a non-convex
optimization problem, so you can get stuck in local optima. So good initialization matters now. In the original linear system,
linear time varying system, initialization didn't matter. We're gonna find the global optimum. But now, we have a non-linear problem. We have an initial policy we roll out and it's along that trajectory we're gonna start building improvements. And so wherever we start it will affect what kind of trajectory we find in the end of this optimization. The cost function g could be non-convex. And actually, then we can have issues where the Q that we've been working with, the second order approximation,
is not positive definite. That's a problem 'cause
as we talked about, you can get very weird behaviors. If it's not positive definite, it's not even clear what you're doing. When you set the gradient equal to zero for a non-positive definite
quadratic function, you're kind of going to that fixed point, but it could be a maximum
or a saddle point. It's not clear that's
where you should be going. So we should avoid that. So you should check. You should check that
the Q and R that you find from your second order
approximation are positive definite. And if they're not,
you should add terms to them. And in fact, if you make
your alpha big enough, that alpha is essentially
adding an identity scaled by alpha to your Q and R. And so if you make your alpha big enough, at some point, they will
become positive definite again and that would be the way to get there. And that's a trick you might
want to do ahead of time. Not just see does it
work, does it not work. You just say okay, is everything
positive definite or not? If not, keep adding 'til you're
finally positive definite. Then there's something else called differential dynamic programming. What I described so far, a lot of people would call
iterative LQR or iLQR. Then a lot of people
talk about this thing, differential dynamic programming. And people don't always really
distinguish between the two. They're two different things, but often people use the
names interchangeably. They'll say I'm running DDP, I'm running iLQR, whatever, and it's actually they don't even know which one they're running of the two. But that's kinda fine. They're very similar. I mean, ultimately, that's
just a vocabulary thing. The difference is in what we saw so far, we do a linear approximation of dynamics, quadratic approximation of cost. In differential dynamic programming, the differential thing, the approximations are happening in the Bellman equation itself. So you will actually let's
do a comparison, DDP, and this is when we just look at u. You actually look at the Bellman equation shown on the left there. You'll do a second order approximation of the Bellman equation itself. There will be terms with x
also if we care about that, but in this case, we're just doing u. So in an iterative LQR, we have a Bellman equation where we ahead of time
make the Bellman equation have quadratic cost and linear dynamics. And what we see actually on the left when we do it on the
Bellman equation itself, there will be an extra
term here that appears. The details don't matter too much, but an extra term will appear. And actually, that extra
term could make it again, non positive definite, and you might have to deal with that. And so you might argue well, the extra term is great to have. Maybe it makes it more
precise and so forth, so we can do that. I don't think people have
particularly strong opinions which one of the two
might be working better. In practice, it's a lot easier often
to just do iterative LQR than to also worry about this extra term that pops up in the Bellman equation which might make things non-convex again and harder to deal with. Okay, so we've covered a
lot from this one core idea. Can we do even better? Yes. At convergence of iLQR and DDP, we end up with linearizations around the state input trajectory the algorithm converged to. In practice, as we talked about, the system could not be on the trajectory or not even that close
due to perturbations, or the initial state being off, the dynamics model
being off, and so forth. What then might happen is that we're kind of out of the regime where the linear approximation is as good as we would want it to be. So what now? Well, if you're already
working on your homework one, you have actually done
something very similar already. We can do a look ahead. In the homework one, you have a grid based
approximation to your state space and so you get a crude value function with a crude discretization. But then you can do a
couple steps look ahead to optimize for the actions
you take in the first few steps and then cap it off with the value function after those steps. So if we do that here, effectively, the result of doing the iLQR is a value function for all times, a quadratic value function that shows how you want to
approach the trajectory, what's a bad deviation versus a good deviation
from the trajectory. And then you run an optimization to find the optimal controls for let's say five step look ahead, 10 step look ahead in the moment. How are we gonna run that optimization? In your homework one, you do it with enumeration or cross entropy method and so forth. Here, we actually have a method. We can run iterative
LQR inside this process. So we can say okay, at time t, when asked to generate a control input, we could re-solve the control
problem using iLQR DDP over time steps t through
H all the way 'til the end. That would be a lot of work. Or over a shorter horizon and cap it off with a
value function there. And so this gives you a
much better performance 'cause now in the moment, you're looking at the
non-linear model that you have, re-optimizing against it. But in practice, you can't look as far ahead as you should to make the best decisions. But you've done that ahead of time and from that, you have a value function, a cost to go function that you can use so you only have to look
ahead five to 10 steps. How far should you look ahead? We can actually look at this, right? For example, for the helicopter problem, what did we do? We would look at okay, if we run it with five steps look ahead, 10 steps look ahead, 20 steps look ahead, what happens? We noticed that with, I believe, 40
steps of look ahead (I should look it up,
but it was some number around that), the optimization's prediction was essentially
always back on the trajectory. If after 40 steps of look ahead, your optimization predicts you
to be back on the trajectory, looking further ahead
is not gonna help you 'cause looking further ahead is just gonna get you on the
trajectory after 40 steps and then after that,
keep you on the trajectory. But the value function is already really good
around the trajectory, so once you're pretty
close to your trajectory, you already know that you can stay on it. There's nothing interesting
happening there. So you would actually look at okay, how long does it typically
take to get pretty close again, to get back into the regime where my value function is precise, where my linear and quadratic
approximations are good, and that's the amount of
look ahead that you need. If you do that amount of look ahead, you're essentially
doing the optimal thing. Everything after that
would be a waste of cycles. The value function already tells you what you would have gotten from that. So, now we need a pretty
optimized implementation typically 'cause if you're running
this kind of look ahead inside your control loop, and let's say you do 20 hertz control, then you have 1/20 of a second
to run this iterative LQR, which might require multiple outer loops to finally get your controls, execute the first one, and repeat. Now, one of the nice things is that you have a good initialization 'cause you already have your K matrices, which tell you your
attempted feedback control. You can use that as your
initialization trajectory. And then from there, run
iterative LQR to re-solve. Okay. Now, here's another thing to think about. Multiplicative noise is a very interesting kind of noise model. Here, the noise is not added in, but there's this matrix
Bw and the wt lives here. And if you apply zero control,
there will be zero noise. But the more control you apply, the more noise you
introduce into the system. That's actually very common in reality, and there are some models of human control where, if you model
human control this way, it matches pretty well. If we do very high force output, we'll have more noise in our execution than if we do something that's low force. It's natural. You can actually re-derive the LQR updates for this case. It turns out that with this setting, you can end up with a
similar set of updates to find K matrices and P matrices. And what you'll see, interestingly, is that
in some sense it becomes
equivalent if you work through it. It's as if you have
a new Q and a new R, but it might make it easier
to design your Q and R because if this is really
how your system works, it might be easier to think
about the right Q and R with this in mind. And then as you go through the derivation, you'll say well, it's the same as before, but it's as if I had
used a different Q and R. So it essentially gives you
a transformation that tells you what the right Q and R for the additive noise
situation would be if this is what you really care about. All right, let's look at
a couple of examples here. Cart-pole. Here is the non-linear system. Definitely non-linear. A lot of non-linearity. Let's balance this cart-pole. We design an LQR controller for balancing. Then what we can do is we can say okay, for all starting points or
a range of starting points, does this linear feedback
controller bring us back? Let's take a look. Our horizontal axis here is
z, vertical axis is theta. We're not plotting x dot and theta dot, but we start with I believe zero for both. These are initial conditions. Green means that the linear
feedback controller succeeded. So, a pretty wide range of starting points from where the linear
feedback controller succeeded. Definitely a non-linear system, but it was able to pull it in. This is for a diagonal
cost matrix on the state and no penalty on controls. Now, what will happen in
practice is when you do this, you'll look at this and say what if I change
the cost matrices? How well will it do? We talked about this earlier. Your cost matrix will affect
your linear controller and your linear controller will affect whether your system is precisely modeled by a linear system or not where you are. Okay, let's change it. We'll penalize for controls. You might say with small controls, the linear model will be better. Turns out it actually does worse. What might be going on here? Well, in the linear model, even with small controls, you bring yourself back
to the equilibrium state. And so you get penalized for controls and you say well, I'm gonna be patient. I'm not gonna apply too much. I can gradually bring it back. The linear system allows me
to bring it gradually back. But the non-linear system will
be like I'm already falling. And that cart-pole is down and you're never bringing it back. And so it's an interesting effect here where it's not always super
intuitive ahead of time. I would have thought that if I penalize more for controls, it's gonna do even better because well, the smaller the inputs are, the better the linear approximation is, but actually, something else happens. What tends to happen is that the controller doesn't apply enough
control anymore 'cause it doesn't need
to for a linear system, but actually it needs higher control for the non-linear system. All right, it's 12:30, so let's stop here and we'll
continue this next week.
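
Going back to the multiplicative noise remark from earlier in this lecture, here is a minimal sketch of how the re-derived updates could look. It is not taken from the lecture slides. It assumes the model x_{t+1} = A x_t + (B + Bw w_t) u_t with a scalar, zero-mean, unit-variance w_t, a terminal cost equal to Q, and a small double-integrator example. The function name `lqr_multiplicative_noise` and all numbers are hypothetical. Under these assumptions the noise only shows up as an extra control penalty, `Bw.T @ P @ Bw`, added to R in the usual backward recursion.

```python
import numpy as np

def lqr_multiplicative_noise(A, B, Bw, Q, R, horizon):
    """Finite-horizon LQR for the ASSUMED model
    x_{t+1} = A x_t + (B + Bw * w_t) u_t, with scalar w_t, E[w]=0, E[w^2]=1.
    Returns gains K_0..K_{H-1} (policy u_t = K_t x_t) and the cost-to-go matrix at time 0."""
    P = Q.copy()  # assumption: terminal cost matrix equals Q
    gains = []
    for _ in range(horizon):
        # E[(Bw w u)^T P (Bw w u)] = u^T (Bw^T P Bw) u: acts like an extra control cost.
        R_eff = R + Bw.T @ P @ Bw
        K = -np.linalg.solve(R_eff + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)
        P = 0.5 * (P + P.T)  # keep P symmetric against round-off
        gains.append(K)
    return gains[::-1], P  # reverse so that gains[0] is the gain for the first time step

# Hypothetical example: double integrator, time step 0.1.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
Q = np.eye(2)
R = 0.1 * np.eye(1)

# Bw = 0 recovers standard LQR; Bw = 0.5 * B means noise scales with the control applied.
gains_nom, P_nom = lqr_multiplicative_noise(A, B, np.zeros_like(B), Q, R, horizon=50)
gains_noisy, P_noisy = lqr_multiplicative_noise(A, B, 0.5 * B, Q, R, horizon=50)

print("first gain, no noise:  ", gains_nom[0])
print("first gain, with noise:", gains_noisy[0])
```

In this particular sketch only R is modified, because the assumed noise scales with the control alone. If the noise also scaled with the state, the same argument would give a modified Q as well.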