Least Squares - An Informal Introduction (Cyrill Stachniss)

Captions
Okay, I would like to talk today about least squares in a rather informal way, and introduce you to a tool that is very handy for solving a large set of problems, especially state estimation problems, and that is what we will be using it for here. Just as a reminder, we start out with this because we want to use it as a tool to address the simultaneous localization and mapping problem, and we will especially be looking into one variant of SLAM, the graph-based approach, where we use a graph consisting of observations and controls and treat this as an optimization problem. For that we will use the tool of least squares. Graph-based SLAM is nothing else than a least squares approach to SLAM, and in order to understand how that works, I would like to introduce you today to the least squares problem.

So what is least squares? It is a general approach for computing a solution of an overdetermined system, and it does that in a way such that the small mistakes that we make everywhere are minimized: we are minimizing the sum of the squared errors. Let's take a very concrete example first. What can be an overdetermined system? For example, we have a set of parameters that we want to estimate, and we get a set of observations which tell us something about those parameters. It is often the case that we actually have a large number of those observations, because we want to estimate those parameters as well as possible. So we have more observations than parameters, more information in those observations than unknown parameters. And given that we always have noise in our observations, we will not find a perfect solution; we will not find a set of parameters that brings the whole system to zero error. There will always be small error values remaining, and we try to find the parameters that minimize the sum of those squared errors. So overall, we say we have more equations than
unknowns, or parameters. The equations are the things that, at least in the way we usually use least squares, stem from our observations, and we want to find the unknowns, or parameters, such that the sum of the squared errors is minimized. That is the standard approach; it has been out there for more than 200 years, it is very frequently used in a lot of engineering problems, and we often use it in order to estimate model parameters given a set of observations. In the context of a SLAM problem, our observations can be sensor observations like camera data or laser rangefinder data, but also observations from our odometry, from the wheel encoders, so how the wheels moved. The parameters we are going to estimate are then often the positions of the platform in the environment at different points in time, as well as the locations of, for example, 3D landmarks within the environment. So we are trying to estimate those parameters, where are we and what does the world look like, given observations, and the way I formulate the problem here, taking those observations to estimate certain parameters of the state, is something that has been done for quite a while.

The least squares approach, as I said, has been around for quite a while: Carl Friedrich Gauss developed this approach at the young age of eighteen, and it had its first showcase a few years later, in 1801, when it was used to estimate the future location of the asteroid Ceres. That first showed that it is a good tool for solving relevant problems. Again, we will be using it here in the context of the simultaneous localization and mapping problem, and using it in order to build maps is something that has been there from the very early days on; even Gauss himself used it in order to survey distinct points in space. So it is not a very new problem. The only things which are new in robotics are the kinds of constraints and observations we have, and that we are interested in certain aspects, like doing this very fast, treating it as an online or an offline problem, and having data association challenges in there.
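To make the idea of an overdetermined system concrete in code, here is a minimal sketch: fitting a line to more observations than parameters, so that the sum of squared residuals is minimized. The variable names and data values are invented for illustration.

```python
import numpy as np

# Overdetermined system: 5 equations (observations), 2 unknowns (a, b)
# for a line model z = a*t + b. Data is synthetic, roughly z = 2t + noise.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
z = np.array([0.1, 1.9, 4.1, 5.9, 8.1])
A = np.stack([t, np.ones_like(t)], axis=1)   # each row is one equation

# Least squares picks the (a, b) minimizing the sum of squared residuals.
params, residuals, rank, _ = np.linalg.lstsq(A, z, rcond=None)
a, b = params
```

No choice of (a, b) makes all five residuals zero; least squares returns the parameters with the smallest sum of squared mismatches.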
So there are a lot of things which are kind of special in the way we address the SLAM problem, but the overall idea of using the least squares approach for solving this mapping problem has been around for quite a while.

Okay, so let's get slightly more concrete. How does it look? I have formulated it in a way that we say we have a given set of observation functions, here called f_i. What such an observation function does is basically compute an expected, or predicted, observation, assuming that it knows the state. The state, our unknowns, is this variable x here, and we say: given that we know what the world looks like and where we are, for example, we can actually compute what we are expecting to observe, and this is what this function f_i, my observation function, does. So f_i tells me what I should observe. Then we have our z_i, which is the actual measurement, the actual observation. It does not necessarily have to be a direct observation of the state x, but the measurement is strongly related to the state x, so that we can relate the state x through this observation function to our measurement, and this is exactly what happens with z-hat: z_i-hat is our predicted observation, which uses this observation function, as I described, to predict what we should see given we are in state x.

Okay, so the key trick here is to take what we are expecting to observe, assuming x is correct, and what we are actually observing, the observation that we actually obtain, and then relate them with each other to see how x should look so that those observations are as similar as possible. The goal of the overall approach is to estimate the state x which best explains my set of observations. Typically we have more than one observation; let's say we have N observations, from 1 to N, and
then we want to find the x which best explains observations 1 to N. We do that by looking at the mismatch between what we actually observe and what we expected to observe, and then looking into the squared error of that.

We can also explain this in a graphical way, in quite an easy form. Here x is the vector of unknowns, or parameters, that we want to estimate. Then we have our observation function, which tells us what we should see given that the state of the world is x, and this gives us our predicted observations; these are the z's with a small hat on top. This is basically what we as users need to provide: this function to compute the predicted measurements, and it depends, for example, on the sensor that we are using. Then we obtain our real measurements, which of course also depend on the sensor, now on the actual physical device and not on our model of the sensor. What we are then doing is basically comparing those pairs, z_1-hat with z_1, z_2-hat with z_2, and we typically compute differences between them, in the easiest form, and then square those differences. If we sum up all those squared differences into one sum, what we try to do is change x so that this sum gets minimized. So the problem is to find the x for which the discrepancies, the differences between the predicted measurements and the real measurements, are as small as possible.

Again, a small concrete example. Assume x contains the locations of 3D points, or 3D features, in the environment. So this x basically says that at a certain location there is a point, and it has a coordinate, say x equals 10, y equals 2, and z equals 3, just as an example. Then I have observations, in this case, let's say, from a camera which was positioned somewhere in the environment, and the observations are basically those points projected onto my image, so I get a pixel coordinate for
each of those 3D features. What I, as the designer of the algorithm, need to do is specify this observation function f, and in this case it describes the projection from the 3D world onto the 2D image plane. So I actually need to describe how a 3D point is mapped to a pixel coordinate: given that I put in a certain coordinate, let's say (1, 1, 1), this function tells me to which pixel coordinate this point (1, 1, 1) will actually be mapped. Then I can compare my actual observation, the pixel location that I actually got in my real image, with the pixel location that I computed using my function f, and I then basically shift around the locations of those points in the 3D world so that this difference gets minimized. The question of how I actually move those x around, how I vary this x, is what the approach will tell us.

Okay, so now let's get slightly more formal and see how this is actually set up. Everything boils down to computing this error, the discrepancy between the actual measurement and my predicted measurement, and in my notation this is typically called e_i, because we have multiple of those observations, indexed by i. This error function depends on the state, so it depends on my current configuration of the parameters, my current vector x. It is the observation actually obtained, so there is nothing I can change in there, because this is what I actually observed, minus this function f_i, which takes in the parameter x: e_i(x) = z_i - f_i(x). So if I change the x in here, this leads to a change in f_i(x), and it leads to a different difference between those two. This is just the vector difference, so this is a bold e_i, a vector referring to the difference between the actual measurement and the predicted measurement. Then, we make the assumption that we have normally distributed variables
with zero mean, so there is no bias in there; it is basically Gaussian noise that my sensor adds, to the pixel coordinate in the example of a camera, or it could be a difference in range for a laser rangefinder, for example. We also assume here that we know how good our sensor is, so we have a notion of the uncertainty of this sensor, which is expressed in an information matrix, here Omega_i. Every observation can, for example, have a different information matrix, and the information matrix is the inverse of the covariance matrix, so it basically tells me how much information this measurement actually conveys. We can then turn the error vector into a squared error: we take the error vector e_i, transpose it, and multiply it with e_i, so we square it, and we also put the information matrix in there as a weight. So you already have a weighted least squares here; if we say all observations have the same uncertainty, then we can also leave that out, and we just have the error function itself. But here we typically have, or assume to have, some notion of uncertainty, so we will keep this information matrix with us all the time; it encodes the uncertainty of the individual measurements or observations that we have. This e_i, non-boldface, is a scalar value, and it describes the error that one single measurement generates; it is the squared error term for that single measurement i.

Given that we have more than one observation, we are actually interested in minimizing the sum of all those squared errors. Our desired state x*, the optimal state we are going to compute, is found by minimizing a function F(x), capital F, which is the global error function, a scalar, and it can be expanded as just the sum of the individual squared error terms we just computed.
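The single weighted squared error term just described, e_i transposed times Omega_i times e_i, can be sketched in a few lines. The measurement values and the 2D pixel setting below are invented for illustration.

```python
import numpy as np

def weighted_squared_error(z, z_hat, omega):
    """Scalar error contributed by one measurement.

    z     : actual measurement z_i
    z_hat : predicted measurement f_i(x)
    omega : information matrix (inverse covariance) of this measurement
    """
    e = z - z_hat                 # error vector e_i = z_i - f_i(x)
    return float(e @ omega @ e)   # e_i^T Omega_i e_i

z = np.array([10.0, 5.0])                        # observed pixel
z_hat = np.array([9.0, 5.5])                     # predicted pixel
omega = np.linalg.inv(np.diag([0.25, 0.25]))     # covariance -> information
err = weighted_squared_error(z, z_hat, omega)
```

A confident sensor (small covariance, large information matrix) makes the same mismatch cost more in the sum; setting omega to the identity recovers the unweighted case.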
These are the squared error terms, which are also scalars, of the individual measurements, and we are simply summing over all measurements. We can expand it further and say this is the sum over the error vector transposed, times the information matrix, times the error vector, summed over all of them, and this is the function I want to minimize: the sum, over all the individual observations, of the error that results from comparing the actual observations with our predicted observations.

So the goal is then to find the minimum of that function, the parameter x. We are varying this x, and we are interested in the state x that gives the minimum of this function, a global minimum. In general this is a complex problem, especially depending on how this function f actually looks, so in practice we need numerical approaches in order to solve it. Typically this function f is also a nonlinear function, and this can lead to situations where we may not even be able to find a good solution for our problem. So we need iterative approaches, but still it is not guaranteed that we actually find the minimum all the time. What we typically need is a good initial guess, a good initial value for our x. If we are very far away from the true solution, then in most cases we will not find a reasonable solution, at least for most realistic real-world problems where these functions are nonlinear. We also assume that our error function is, let's say, reasonably smooth in the local neighborhood of the, hopefully global, minimum to which our initial guess is close; we assume it to be close enough so that, through an iterative error minimization, we can actually end up in the minimum-error configuration. Again, in general there is no guarantee that we find the global optimum; there are several variants of this problem for which one can
say something about convergence guarantees, but in the general form it is not possible. What we are doing here is solving this problem through an iterative least squares approach: iteratively computing a local linearization of this function, then treating it as if this function f, or the error function e, were a linear function, and then going step by step, iteratively, towards the minimum. Again, no guarantee that this actually works out.

Okay, so what do we need to do in order to solve this with iterative local linearization? The first thing is to compute a linearized variant of our vector error function e_i. We assume to have an initial guess, and we locally linearize this error function around the solution that we have. Then we take this local linearization and, through the equation that we have, square it, and get our squared error function. We then have a quadratic function in the end, and what do we need to do in order to compute the minimum of a quadratic function? That is something you may know from your analysis class at school: we compute the first derivative, set the first derivative to zero, solve the remaining system, and then get the minimum of that quadratic form. However, the setup here is that this quadratic function was an approximation in the first place, so we basically go back to the start and say: okay, assuming that my solution got a bit better than before, let's simply do the same procedure again. So we again compute the local linearization, approximate the general error function by a quadratic function, compute the minimum, and then we iterate this process.

Okay, so let's dive into the steps of how that actually works. The first thing we need to do is to linearize our original error function, because this function f is in general nonlinear, and we need to
compute an approximation for it so that we have a linear function, so that in the end we obtain a quadratic error term. For doing this we use something which is called a Taylor expansion, and for computing the Taylor expansion we need the first derivative of our function e_i. We now write our function as e_i(x + Delta x), where this x is our initial guess, so it is a kind of constant in the current iteration, and this Delta x is our variation around the initial guess: it is zero if we are at the initial guess, and it takes values slightly larger or smaller than zero to express the variations around it.

So how does this actually look? The first term is the error at the location of the initial guess, e_i(x), but this x here is now a constant, the constant initial guess, so this term is also a constant, typically abbreviated as e_i in the remainder; we just drop the "(x)" of the initial guess to have a more compact notation. Then, added to this constant offset, we have J_i, the Jacobian, which is the first derivative once we leave the one-dimensional world, multiplied with this Delta x, my vector of variations from the initial guess. So this is a linear function: e_i(x + Delta x) is approximately e_i + J_i Delta x. We have our linearization point, and a linear function in the neighborhood of this linearization point.

And how is this Jacobian computed? Just as a reminder, this is something I generally assume you know, how to compute Jacobians and deal with basic linear algebra, but as a short reminder: it is a matrix, where along one dimension we have the number of dimensions of my parameter vector x, and along the other the number of dimensions of my function f for which I want to compute the Jacobian. The entry in the top-left corner is the partial derivative of
the first component of the function with respect to the first dimension of x; next to it is the first component of the function derived with respect to the second dimension of my vector x, and so on. Going down, I keep the dimension with respect to which I compute the derivative, say x_1, constant, and change the component of the function: first component, second component, third component, and so on. So it is the matrix of my partial derivatives.

Okay, so this is now an approximation of my error function, of the individual error function for every single observation, and from the originally nonlinear error it turns into a linear function. This is a linearization, so this is an approximation; that is important to note.

What we now can do, given that we have the individual vector error function linearized, is compute the squared error term using this linearized function. So this is now e_i(x + Delta x) as before, but note that this e_i is now the squared error term; it is not a boldface character. First we put in the definition of this term, so we just write down the definition: the error in vector form, transposed, of (x + Delta x), times my information matrix, times the error vector again. This is just the definition of the squared error, weighted with the information matrix, our information about how much we can trust the observation, or how uncertain it actually is. What I am doing now is using the Taylor expansion that I just computed, and replacing the first and the second error term with the approximation from the Taylor step on the previous slide. So I can say: this is now not exactly the same anymore, just approximately the same, and in here is my Taylor expansion: this e_i, which was just the error vector at my linearization point, plus the Jacobian times Delta x.
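As a quick aside, the Jacobian just described can also be approximated numerically by finite differences, which is a handy sanity check for an analytic Jacobian. The toy function below is an invented stand-in, not the lecture's observation function.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x: one column per dimension of x,
    one row per dimension of f(x)."""
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        # Vary only dimension j, keeping all other dimensions constant.
        J[:, j] = (f(x + dx) - f(x - dx)) / (2.0 * eps)
    return J

# Toy function from R^2 to R^2; its analytic Jacobian at (1, 3)
# is [[2, 1], [0, 6]].
f = lambda x: np.array([2.0 * x[0] + x[1], x[1] ** 2])
J = numerical_jacobian(f, np.array([1.0, 3.0]))
```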
I can then just expand this expression, and I will get a number of different terms; this is obtained simply by expanding those brackets. We see we have some terms which are entirely independent of Delta x, we have some terms where we have a linear dependency on my unknown Delta x, and we have a fourth term which has quadratic elements: Delta x appears twice, in the form of a multiplication, in that part of the equation.

What I now can do is take this expression and rearrange it a little bit. This is just copied from the previous slide, and then I do a small rearrangement of these terms in order to group them, and basically introduce new variables to make the notation easier. I have a constant term, which I call c_i; it collects the constant elements of this expression. Then we have a second term, which we regroup, containing the elements of my error vector; these are the linear elements: it is two times the expression e_i transposed, Omega_i, J_i, and these elements I replace by b_i transposed, so by a vector b_i; this gives a vector times Delta x. And then there is a quadratic term, and again all the quantities which do not depend on my Delta x I summarize in one variable, H_i.

So I can just rewrite it using my c_i, b_i, and H_i, and the final equation reads c_i plus 2 b_i transposed Delta x plus Delta x transposed H_i Delta x: my constant term, my linear term, and my quadratic term. Again, I introduced these variables by packing everything into them where there were no
dependencies on Delta x. So far this was just a rearrangement of the individual error terms; there was no magic going on. The only approximation I made was putting in my Taylor approximation, and then I was just rearranging those terms.

This was done for one squared error term, for one single observation, and overall I want to do this for all the observations, since I sum over all of them. So I can do the same thing for the global error term: the global error F(x + Delta x), with x the initial guess and Delta x my variable, is just the sum over all the individual error terms. Again, this was the result of the previous slide, the term for one single constraint or one single observation, and what I then can do is move the overall sum into the individual summands inside the bracket. So I have the sum over all c_i, the sum over all b_i, and the sum over all H_i, because my Delta x is of course the same in all the sums. Then I have one element which collects all the constant terms, one part which collects all the coefficients of the linear terms, and one which collects all the coefficients of the quadratic terms. I can do the same rewriting, defining these as c, b, and H, without the index i, because they are now the sums of those terms, and by a simple rewrite I can then express the global error as the constant c, plus two times b transposed times Delta x, where this b is a vector, plus Delta x transposed times H times Delta x, where this H is a matrix. The b and the H, which are the important elements we will need later on, are defined here: one is the sum over all the coefficients of the linear terms, the other the sum of the coefficients of the quadratic terms. So, as a result: so far we have linearized our individual error terms, the vectors, and computed the quadratic error terms.
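The assembly of b and H just described, summing the per-observation contributions b_i = J_i transposed Omega_i e_i and H_i = J_i transposed Omega_i J_i, can be sketched as follows. The toy inputs are invented; in practice the errors and Jacobians come from the linearization at the current guess.

```python
import numpy as np

def build_system(errors, jacobians, omegas, dim):
    """Accumulate the linear system terms b = sum_i J_i^T Omega_i e_i
    and H = sum_i J_i^T Omega_i J_i over all observations."""
    b = np.zeros(dim)
    H = np.zeros((dim, dim))
    for e, J, omega in zip(errors, jacobians, omegas):
        b += J.T @ omega @ e    # coefficients of the linear term
        H += J.T @ omega @ J    # coefficients of the quadratic term
    return b, H

# One toy observation: error (1, 0), identity Jacobian and information.
errors = [np.array([1.0, 0.0])]
jacobians = [np.eye(2)]
omegas = [np.eye(2)]
b, H = build_system(errors, jacobians, omegas, dim=2)
```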
It turned out that everything turns into one big quadratic form. Our overall function that we want to minimize is expressed in this form, and our goal is to minimize it. Again, how do we compute the minimum of a quadratic form? You may think about how you compute the minimum of a parabola in the classes you had at school: you compute the first derivative of the function, set this first derivative to zero, solve the resulting equation for x, and this tells you where your minimum is. We are basically doing the same thing here, in this high-dimensional space: we compute the first derivative of this quadratic form, set it to zero, which leads us to a system of linear equations, and then we solve this system of linear equations to arrive at our solution. So it is basically the same thing you probably still know from your classes at school, except that we do not do it for the 1D case but for an n-dimensional case; the overall idea stays the same.

Okay, so assume we have a general quadratic form, x transposed H x plus b transposed x. There can also be a constant in there, but the constant is not really relevant for us, because it will be gone anyway when we compute the first derivative. The first derivative of this quadratic form is given by (H plus H transposed) times x, plus b. This is because we are now in the matrix form, where H is a matrix and not a scalar: as you may remember from school, the quadratic term would give two times x times the coefficient, and the linear term simply gives the coefficient, which stays the same up to
the transposition. And here the "two times H" becomes H plus H transposed, because we are in the matrix form, but you can see the direct analogy to the one-dimensional case. So now we know how to compute the first derivative of this quadratic form, and we can just take it and apply it to our problem. Again, this was our quadratic form, and we compute its first derivative. The c vanishes, because it does not depend on Delta x. The term two times b transposed times Delta x just turns into 2b; the factor two is still in here, so this part of the first derivative gives 2b. And the part Delta x transposed H Delta x gives us 2 H Delta x, since H is symmetric here. So this is the first derivative of our error function.

The next thing we need to do is set this to zero and solve it, and then we are done. So overall: we compute the first derivative, as we just did on the previous slide, and we set it to zero, so 0 = 2b + 2 H Delta x. You immediately see the two vanishes, and we can move b to the other side, so that in the end we have the system that we need to solve, H Delta x = -b, and you see this is a system of linear equations. Now we need to solve this system of linear equations to get the solution for our increments. I am writing this here by inverting H, so we basically multiply H inverse from the left-hand side; if we do this, this turns into the identity matrix, Delta x remains, and on the right we have minus H inverse times b, which gives the solution for the increments. In reality you would not compute this inverse explicitly; there are better ways to do this. But this is the one step that moves our Delta x towards the right solution.
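The remark about not computing the inverse explicitly can be made concrete: one solves H Delta x = -b directly, for example via `np.linalg.solve` or a Cholesky factorization, since H is symmetric and, for well-posed problems, positive definite. The small H and b below are invented toy values.

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Solve H dx = -b without forming H^{-1}.
dx = np.linalg.solve(H, -b)

# Equivalent route exploiting symmetry: Cholesky H = L L^T,
# then two triangular solves.
L = np.linalg.cholesky(H)
dx_chol = np.linalg.solve(L.T, np.linalg.solve(L, -b))
```

Both routes give the same increment; the factorization route is the usual choice for the large sparse systems that arise in graph-based SLAM.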
Okay, so this was one step of our approach, and in general we need to iterate it; this gives us the Gauss-Newton solution. What we are doing in every step: we linearize around the current initial guess x, for each measurement, so we need to linearize all our error functions, and we need to do that over and over again, because we are changing the initial guess; the initial guess is always the solution of the previous iteration, so we need to redo this linearization in every iteration. Then we compute our terms b and H by summing over all error functions: for b, multiplying the Jacobian transposed, the information matrix, and the value of the error function, and for H, the Jacobian transposed, the information matrix, and the Jacobian. This gives us the elements of our linear system; we solve this linear system, and then our new x, our new initial guess, is the old initial guess plus the Delta x* that we computed. This is an iterative approach: I iterate until all those Delta x are basically zero, or very close to it, and then I have hopefully converged to the global optimum. Again, there is no guarantee that it is the global optimum, but the hope is that, if our initial guess is good, we will actually end up at the global solution.

So this was a very brief and informal introduction to least squares and the least squares approach that we will be using in this course. Next I want to show a very short, very small, very simple example of how we can actually use this to perform state estimation, and I want to use an example now which is not SLAM but, let's say, odometry calibration. Again, as a short reminder: odometry is basically the position of our platform computed by counting the revolutions of the wheels. So we typically have odometry measurements which basically tell us how much we actually moved forward.
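Before turning to the example, the iterative procedure just summarized (linearize, build b and H, solve, update, repeat) can be sketched as a toy Gauss-Newton loop. Everything here, the function names, the calling convention, and the tiny one-dimensional problem, is invented for illustration.

```python
import numpy as np

def gauss_newton(x, error_fns, jacobian_fns, omegas, iters=10):
    """Toy Gauss-Newton: error_fns give e_i(x), jacobian_fns give
    J_i = de_i/dx at x, omegas are the information matrices."""
    for _ in range(iters):
        b = np.zeros_like(x)
        H = np.zeros((x.size, x.size))
        for e_fn, J_fn, omega in zip(error_fns, jacobian_fns, omegas):
            e, J = e_fn(x), J_fn(x)
            b += J.T @ omega @ e
            H += J.T @ omega @ J
        dx = np.linalg.solve(H, -b)   # increment Delta x
        x = x + dx                    # new guess = old guess + increment
        if np.linalg.norm(dx) < 1e-9: # stop once the increments vanish
            break
    return x

# Tiny problem: estimate a scalar x from two direct noisy observations,
# e_i(x) = z_i - x, so J_i = -1. The optimum is the mean of the z_i.
zs = [np.array([1.9]), np.array([2.1])]
error_fns = [lambda x, z=z: z - x for z in zs]
jacobian_fns = [lambda x: -np.eye(1) for _ in zs]
omegas = [np.eye(1) for _ in zs]
x_opt = gauss_newton(np.array([0.0]), error_fns, jacobian_fns, omegas)
```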
Typically, you have systematic errors in this process. Let's say the wheels have different air pressure, for example, so one is slightly larger or smaller; then you will experience a drift to one of the sides while navigating, or not everything is geometrically in perfect shape. What we want to do is find a way, using the least squares approach, to minimize or eliminate these systematic errors through a calibration procedure.

So what we assume here: we have, let's say, a sequence of a mobile platform driving around, we have those odometry commands, and we additionally have ground truth information, so we know where the platform actually moved to. This ground truth information can be made available by using certain onboard sensors, by running a SLAM system, by using an external motion capture system, by using a lidar scanner installed on top of the platform, or by externally tracking the system; how exactly is something we do not care about here, we just assume that we have ground truth odometry available. So we assume we have the odometry that we measured, u_i, and we have the ground truth odometry, expressed here as u_i*, say from an external tracking system. Then we make a very, very simple assumption: we can compute a corrected odometry command u_i', which is just a 3-by-3 matrix multiplied with our obtained odometry command. This matrix is basically our correction function, which performs the correction and turns the uncalibrated odometry into the calibrated odometry. It is a very simple function, but it is pretty good for us as an example of how to use the least squares approach. So we have our correction function f, and now we need to find our parameter vector x, which describes what this function looks like: these are the parameters of this function f, and that is the unknown that I actually want to estimate.
So I first ask: what is the state vector here? The state vector describes this calibration function that I want to compute, so it consists of the nine elements which define this function. My state vector is simply a nine-dimensional vector containing the nine elements of this function f. Okay, so what is my error function? Again, the error function is the discrepancy between the actual value and what my model predicts. In this case the error is simply given by the actual motion of the platform minus this calibration function applied to the information I actually get from my system, so e_i(x) = u_i* - X u_i. Okay, so the next step is to take my error function and compute its first derivative. How does this actually look? I need to take this function and derive it with respect to the individual dimensions of my state vector, so de/dx. My Jacobian will be a matrix which has nine columns, because we have these nine state dimensions, and three rows, because the error function is three-dimensional. It turns out that it has a very, very easy form: everywhere where nothing is written, the elements are zero. Why is that? This u_i is a constant vector, with components x, y and theta in the 2D case, and I need to multiply it with this matrix. So if I then compute the first derivative, for example for this
element, computing the first derivative with respect to x_11: the u_i* term is constant, and x_11 is multiplied only with the first element of this vector, so with u_{i,x}, the first component; this is a linear multiplication with x_11. The second and third components are multiplied with the other entries of the matrix, but they are all independent of x_11. So if I compute the first derivative with respect to x_11, the only term that remains is u_{i,x}, the first component of this vector. And it is the same for y, theta and so on, so I will have here a few nonzero elements, which are basically the elements of this vector, and all the other elements will be zero, because there is no dependency on those variables. What I additionally see in this Jacobian is that there is no dependency on x: x never shows up in this Jacobian. This is normally not the case, but here it is. What does it mean if the first derivative does not depend on the variable anymore? It means that my function was already a linear function beforehand, because I have a constant first derivative. And it means that I don't need to re-linearize: if I run my least squares approach, compute a solution and repeat the iteration, I will get exactly the same quantities again. So the initial guess doesn't matter; I just need one single iteration, and then I'm done and can compute my solution. So now I have my Jacobian, and then it is the standard procedure: I compute my H and my b, I get a system of linear equations, I solve it, and this gives me my state vector. I then plug the state vector into my calibration function and obtain my solution. Okay, rather straightforward. And the fact that the Jacobian is independent of x means that I don't need to iterate over this process.
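A minimal sketch of this Jacobian, under two conventions that are my choice and not fixed by the lecture: the error is e_i = u_i* - X u_i, and the state vector x stacks the rows of X. Then each row of the 3x9 Jacobian carries the (negated) components of u_i in its own block, and x itself never appears:

```python
import numpy as np

def calibration_jacobian(u):
    """3x9 Jacobian of e = u_star - X @ u w.r.t. the row-stacked entries of X.
    Row j holds -u^T in its own 3-element block; everything else is zero.
    Note there is no dependence on X: the problem is linear."""
    return -np.kron(np.eye(3), u.reshape(1, 3))

u = np.array([1.0, 0.5, 0.1])
J = calibration_jacobian(u)   # shape (3, 9)
```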
So, a few questions you may ask yourself. First: what would the parameters of this function look like if the odometry were perfect? How should this calibration function look if my odometry were already perfect? Okay, let's go back and look at this function. If my odometry is already perfect, that means u_i is already u_i*, and then, in order to bring this expression to zero, this matrix should actually be the identity matrix. If I have the identity matrix here, with ones on the diagonal and everything else zero, this term would always be zero, I would have no error remaining, and I would have found the minimum of my function. So that would be the case if the odometry were perfect. The next question is: how many measurements do I actually need to find a solution to my calibration problem? Here we need to look at how many unknowns we have and how much information every single observation provides. How many unknowns do we have? Nine, the nine elements of this 3x3 matrix. And every observation is three-dimensional, so it basically gives us three equations. Since we have nine unknowns, we need at least three observations, and typically more, in order to come up with a good solution. The next question: this matrix H is a symmetric matrix. Why is this actually the case? Any ideas why H is symmetric? I'll answer this question for you: it is because of the way H is constructed from the Jacobians. Let's go back so you can see better how H is set up: the matrix H is the Jacobian transposed times the Jacobian, with a weight matrix in between, and this weight matrix is the inverse of the covariance matrix, which is a positive semi-definite matrix.
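The counting argument can be checked numerically. This small sketch (same hypothetical row-stacked parameterization as before, unit weight matrix) accumulates H = sum_i J_i^T J_i: with only two odometry commands H is rank-deficient, with three independent ones it becomes full rank, and it is symmetric by construction:

```python
import numpy as np

def normal_matrix(us):
    """Accumulate H = sum_i J_i^T J_i for the 9-parameter calibration sketch."""
    H = np.zeros((9, 9))
    for u in us:
        J = -np.kron(np.eye(3), u.reshape(1, 3))  # 3x9 Jacobian per observation
        H += J.T @ J
    return H

rng = np.random.default_rng(1)
us = [rng.normal(size=3) for _ in range(3)]
rank2 = np.linalg.matrix_rank(normal_matrix(us[:2]))  # under-determined
rank3 = np.linalg.matrix_rank(normal_matrix(us))      # uniquely solvable
```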
So we multiply by J transposed and J from the left and the right hand side, and H is a sum of such terms; all the individual terms are symmetric, so H is symmetric in the end. And then the last question is: how does the structure of the measurements actually affect the structure of H? That may not be very clearly visible, but you can maybe guess it: if you compute those Jacobians, we have seen that a lot of elements in the Jacobian are zero, and if an element in the Jacobian is zero, it means this observation has no impact on a certain parameter in my state space. As a result, the more a measurement tells me about my state variables, the more densely populated these Jacobians will be, and this has a direct impact on my H matrix: the less an observation tells me about the variables, the sparser the corresponding part of the matrix. What we have in reality, especially in the context of the SLAM problem, is that an observation only relates a very small number of variables with each other, so these H matrices for the individual observations are very sparse; they consist of a lot of zero elements. This is something we will exploit later on in order to compute efficient solutions. I want to spend a few words now on how to efficiently solve a system of linear equations. What we want to do is solve Hx = -b, and in theory we can do this, as I said before, by computing the inverse of H and multiplying it from the left-hand side onto this equation. But this requires a full matrix inversion, which is something you don't do in practice. What you typically do is use matrix decomposition techniques, or, for large systems, other techniques like conjugate gradients. This could be a Cholesky
factorization, or you may exploit a QR decomposition; it depends a little bit on the properties of your system. I just want to go through a very brief example and show how you can use the Cholesky factorization to do this. If A is a symmetric and positive-definite matrix and the system we want to solve is Ax = b, we can use the Cholesky factorization. What the Cholesky factorization does is take this matrix A and split it up into the product L L^T, with L being a lower triangular matrix, so a matrix where everything above the main diagonal contains only zero elements. Cholesky provides us with this matrix L, so that we can express A as L L^T. Okay, so what can we do now? We can replace A with L L^T, so our linear system becomes L L^T x = b. Then we can say: L^T times x is a matrix times a vector, so it gives me a new vector y. This means I can first solve the system L y = b, where y = L^T x. This is easy to solve, because L, remember, is a lower triangular matrix, so you can very easily solve this linear system by forward substitution. In a second step you solve L^T x = y, which is exactly the missing part from the replacement before, and again this linear system is very easy to solve, because L^T is an upper triangular matrix, the transpose of the lower triangular one. So by computing the Cholesky decomposition we replace solving one complex linear system by solving two simple linear systems, and of course the main work is done within the Cholesky decomposition itself.
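The two-step substitution just described can be sketched in a few lines (a minimal example with a hand-picked 2x2 system; the helper names are mine):

```python
import numpy as np

def solve_lower(L, b):
    """Forward substitution for L y = b, with L lower triangular."""
    y = np.zeros_like(b)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def solve_upper(U, y):
    """Back substitution for U x = y, with U upper triangular."""
    x = np.zeros_like(y)
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# A small symmetric positive-definite system A x = b.
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

L = np.linalg.cholesky(A)   # A = L @ L.T, L lower triangular
y = solve_lower(L, b)       # step 1: solve L y = b
x = solve_upper(L.T, y)     # step 2: solve L^T x = y
```

Each triangular solve is cheap (quadratic in the dimension); the cubic-cost work sits in the factorization, as noted above.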
But once I have it, I can very, very easily solve my system of linear equations. So, to sum up how we use this Gauss-Newton method to compute a solution to a least squares problem: it is an iterative approach, and we typically start with an initial guess, so we have an initial idea of what the parameters of our system are. Depending on how nonlinear our actual problem is, this initial guess must be very good, or, if we have a linear problem, the initial guess doesn't matter, because it plays no role in the computation of the Jacobian. What we then need to do, and that is the actual work the designer of the algorithm has to do, is compute the linearizations of our error functions: we need to sit down and compute the first derivatives of these error functions. Once we have these first derivatives, we can compute the quadratic form, and the key elements we are interested in are the matrix H and the vector b. Why is that the case? Because these are the only two elements which remain when I compute the first derivative of this quadratic form and set it to zero; I then get a system of linear equations which contains just this H and this b. I then need to solve this system of linear equations, for example with the Cholesky decomposition, and this is a step which may take a substantial amount of computational resources. With that, I have finished one single iteration, and I go back to the beginning: the solution I just computed becomes my new initial guess, and I repeat this process until convergence. This leads me to a new configuration of the parameters I wanted to estimate, which, if converged, minimizes the squared error terms. So this was a short summary of how to solve the least squares problem in a very practical setup, and the way we will be using it here in the course.
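The loop just summarized can be sketched as follows; the names (`error_and_jacobian`, `Omega`) are illustrative, and a dense solver stands in for the Cholesky or sparse solver you would use in practice:

```python
import numpy as np

def gauss_newton(x0, measurements, error_and_jacobian, n_iters=20, tol=1e-9):
    """Minimal Gauss-Newton sketch: linearize, build H and b, solve, iterate."""
    x = x0.astype(float)
    for _ in range(n_iters):
        H = np.zeros((x.size, x.size))
        b = np.zeros(x.size)
        for z in measurements:
            # e: error, J: its Jacobian at x, Omega: information matrix
            e, J, Omega = error_and_jacobian(x, z)
            H += J.T @ Omega @ J
            b += J.T @ Omega @ e
        dx = np.linalg.solve(H, -b)   # in practice: Cholesky / sparse solver
        x += dx
        if np.linalg.norm(dx) < tol:  # converged: re-linearizing changes nothing
            break
    return x
```

For a linear problem such as the odometry-calibration example, the first iteration already yields the exact solution and the second produces dx of essentially zero, so the loop stops.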
Before ending, however, I want to go a step further and tell you a little bit about the relation of this least squares problem to probabilistic state estimation. We started out, if you think back to probabilistic robotics, saying that we want to represent the state of our robotic system as a probability distribution, and we argued that this is valuable because we can't compute the exact values, and probabilistic approaches have the advantage that we explicitly have a notion of uncertainty, which we can then exploit for state estimation but also for action selection processes. But from what I explained so far, it was not directly obvious how this relates to probabilities, right? So what I actually want to do now is answer the question: how does this least squares approach relate to probabilistic state estimation? What is the connection between them? For that I want to start with probabilistic state estimation and show that, through some assumptions we make, maximizing this probability distribution actually leads us to the solution that our least squares problem generates for us. So there is a direct link, under some assumptions. We start out by saying we have a probability distribution over my vector x, and to be general we have a time series here, because that is how we used it in the past; we have a set of observations and control commands, and those control commands can also very easily be interpreted as observations. By just applying Bayes' rule, the independence assumption and the Markov assumption, we can rewrite this in a form where we have some normalization constant resulting from Bayes' rule, some prior for x_0 at time 0, where we don't have any observations or control commands available yet, and then a product where we have our motion model, so everything related to u, and our
observation model, everything related to z, written in there as a product. Okay, so this is just expanding this term with the product rule, applying Bayes' rule a few times and exploiting the Markov assumption, very similar to the derivation of the Bayes filter. The only difference is that in the derivation of the Bayes filter I was only interested in x_t, and here in x_{0:t}, so this is actually even easier. Okay, so we have that. Now let's have a look at the log-likelihood. If we compute the log-likelihood of this term, so we put a log on both sides, then the products basically turn into sums: I have the log of eta, the normalization constant from Bayes' rule, the log of p(x_0), my prior, and then the sum of the logs of the motion model terms and the logs of the observation model terms. Okay, and now I'm making an assumption. Let's assume all these probability distributions in here are actually Gaussians: the prior is a Gaussian, my motion model is a Gaussian, and my observation model is a Gaussian. This is an explicit assumption; we are leaving general state estimation and making it more specific by restricting our distributions to Gaussian distributions. So how does a Gaussian distribution, or especially the log of a Gaussian distribution, look? It is some constant minus the term we have in the exponent, which is x minus the mean, transposed, times the inverse of the covariance matrix, times again x minus the mean. And now I can make the link and define what my error function looks like, or for now just do a replacement: let's replace this first term here by e transposed, this term by my Omega, and this term again by e_i. So
I'm basically just defining those variables, but of course I'm defining them in a way that makes the link to the least squares problem: I have my error term, my information matrix, which is the inverse of the covariance matrix, and my error term again, and this squared term has exactly the same shape as the squared error I had before. So if I take this log-likelihood of a single Gaussian distribution and plug it into the expression I had before, I obtain a term where I have an error term for the prior, an error term related to the odometry commands and an error term related to the observations, and they are all just added up, plus a constant. So if I'm now maximizing the log-likelihood of my probability distribution, this is actually equivalent to minimizing the error, right? Maximizing the probability distribution is equivalent to minimizing, ignoring the constant, which plays no role in a minimization or maximization, the error of my prior plus the errors resulting from the motion commands plus the errors from my observations. So far we didn't look into a prior and we didn't look into the motion commands; so far we basically had only the sum of the errors of the individual observations. This means that minimizing the squared error is equivalent to maximizing the log-likelihood of independent Gaussian distributions. If we assume everything is Gaussian and the distributions are independent, then computing the solution to the least squares problem is equivalent to computing the mean, the most likely configuration, of that probability distribution. And that is the link between the two: if I compute the minimum-error configuration, I get the mean of this Gaussian distribution. So by performing this error minimization I actually have a direct link to probabilistic state estimation.
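Written out in one chain (my notation choices: eta for the Bayes-rule normalizer, Omega = Sigma^{-1}, and e for the term inside the Gaussian exponent), the argument reads:

```latex
\log p(x_{0:t} \mid z_{1:t}, u_{1:t})
   = \log \eta + \log p(x_0)
     + \sum_{\tau=1}^{t} \log p(x_\tau \mid x_{\tau-1}, u_\tau)
     + \sum_{\tau=1}^{t} \log p(z_\tau \mid x_\tau)

\log \mathcal{N}(v;\, \mu, \Sigma)
   = \mathrm{const} - \tfrac{1}{2}\,(v-\mu)^{\top} \Sigma^{-1} (v-\mu)
   = \mathrm{const} - \tfrac{1}{2}\, e^{\top} \Omega\, e

\operatorname*{argmax}_{x_{0:t}}\; p(x_{0:t} \mid z_{1:t}, u_{1:t})
   = \operatorname*{argmin}_{x_{0:t}}\;
     e_{p}^{\top} \Omega_{p}\, e_{p}
     + \sum_{\tau} e_{u_\tau}^{\top} \Omega_{u_\tau}\, e_{u_\tau}
     + \sum_{\tau} e_{z_\tau}^{\top} \Omega_{z_\tau}\, e_{z_\tau}
```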
There are also ways to compute or approximate the covariance matrix by exploiting this matrix H that we have, so there are more links, but I wanted to show here the quite obvious and easy to derive one for how those two elements come together. So, to sum up: what we have introduced here is, let's say, not a very formal, more an informal introduction to the least squares problem, but at a level of math or formality that I think is good for understanding what is going on and which allows you to use this as a tool for solving a standard least squares problem. If you have very special setups you may need to dive deeper, but the majority of problems of this basic form you can actually solve with the techniques that I have explained here. We introduced the Gauss-Newton approach as an iterative approach for solving these nonlinear problems, and the key thing is that we use a linearization in order to turn the nonlinear error function into a linear one, approximate the squared error by a quadratic form and compute its minimum, and we need to iterate this process due to the linearization, so we need to be clear that this linearization is an approximation of the original problem. What we have also shown in the last minutes is the relationship between performing this error minimization and probabilistic state estimation: under the assumptions that we make about the probability distributions involved, we can show that minimizing the error is equivalent to maximizing the probability. The least squares approach is a very popular method in a lot of disciplines, not only robotics but basically all fields of engineering, and it is also very prominent, for example, in geodesy. If I said that all problems boil down to least squares problems, that would be wrong, but least squares plays a very, very important role in
here and also in a lot of other disciplines, and being able to solve least squares problems will dramatically simplify your life. In terms of general least squares and Gauss-Newton, basically every textbook on numerical calculus or optimization will cover least squares to some degree, and there are open resources such as Wikipedia that you can look at if you want to dive a bit more into the details. Depending on the community you are coming from, there are variants and a large body of literature on least squares, being differently formal and using different notations; even the notation that I used here may be different from the notation you are used to, so there is a whole variety in how this may look. I also want to point out the Probabilistic Robotics textbook on this relation of the least squares approach to probabilistic state estimation, where you can actually see what the similarities and differences are. So with this I would say thank you very much for your attention, and I hope that with this lecture you have gotten an idea of how least squares works and what you can do with it. In the upcoming lectures we will actually use this tool to do useful computations, such as looking into the simultaneous localization and mapping problem from a graph perspective and using least squares to find the most likely map configuration, for example. So with this, thank you very much for your attention, and see you soon.
Info
Channel: Cyrill Stachniss
Keywords: robotics, photogrammetry
Id: r2cyMQ5NB1o
Length: 61min 8sec (3668 seconds)
Published: Fri Apr 03 2020