Camera Parameters - Extrinsics and Intrinsics (Cyrill Stachniss)

Captions
Hello, I want to talk today about cameras, and especially about the parameters that we use to describe camera models: the so-called intrinsic parameters and extrinsic parameters. We want to dive a bit into the details of what those parameters are and which physical properties they actually describe, so that we get an idea of how the mapping from the 3D world into the camera actually works.

It is all about describing the following mapping: we have a point in the world, a 3D point with an X, Y, Z coordinate, and we have a pixel coordinate with an x, y coordinate in the image, and the question is how we actually get from this 3D world coordinate onto this 2D pixel coordinate. This is encoded in a transformation matrix P, and what we want to do today is understand what is inside P and which parameters we need to know in order to write it down in mathematical form, which then allows us to do this mapping from the 3D world to the 2D image. This is one of the very basic equations whenever you work with cameras and try to perform geometric estimates, so either estimating where a camera is or where 3D points in the environment are. Whenever geometry is involved, this equation will pop up at some point in time, so it is very important to understand how the different parameters affect the image generation process and how the 3D points are mapped into the 2D world.

We are doing things in homogeneous coordinates, so we have our 3D point with an X, Y, Z coordinate plus the additional dimension coming from the homogeneous coordinates, we have our transformation matrix, and we have a three-dimensional vector with the x, y image coordinate plus the additional coordinate from the homogeneous coordinate system; in short, x = P X. And again, just as a reminder: capital characters refer to a point in 3D, and lowercase characters refer to a point in the image. So, to make your life easier: whenever something is lowercase, it is in the image or in 2D, and whenever it is uppercase, it is in 3D.

The mapping from the 3D world to the 2D pixel coordinate is typically broken up into several transformations, because there are several coordinate systems involved. We typically have a world or object coordinate frame, which is our externally given coordinate frame; we have a camera coordinate frame, where everything is expressed with respect to the camera itself; then we have a 2D coordinate system related to the image plane; and we also have a coordinate system related to the sensor itself, which actually generates my image. We may have even more coordinate systems involved, but in the basic form these are the four coordinate systems, and the transformation P consists of a sequence of transformations performing the mapping through this chain of coordinate systems, so that we end up going from the world coordinate system all the way to the pixel. So the coordinate systems involved are, as I said: the world or object coordinate system; the camera coordinate system, where everything is expressed with respect to the camera; the image plane coordinate system, where everything is expressed with respect to the image plane; and the sensor coordinate system, where everything is expressed with respect to the sensor which actually generates our pixel information. The world coordinate system is called S_O, for object; S_K stands for the camera coordinate system; S_C for the image plane coordinate system; and S_S for the sensor coordinate system.
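To make the relation x = P X mentioned above concrete before we break it apart, here is a minimal numpy sketch; the 3 × 4 matrix P and the 3D point are made-up example values, not a calibrated camera:

    import numpy as np

    # Made-up 3x4 projection matrix and a 3D point (homogeneous coordinates).
    P = np.array([[800.0,   0.0, 320.0, 0.0],
                  [  0.0, 800.0, 240.0, 0.0],
                  [  0.0,   0.0,   1.0, 0.0]])
    X = np.array([0.5, -0.2, 4.0, 1.0])   # [X, Y, Z, 1]

    x_hom = P @ X                  # homogeneous image point (3-vector)
    x_pix = x_hom[:2] / x_hom[2]   # dehomogenize: divide by the last component
    print(x_pix)                   # 2D pixel coordinate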
Whenever you see those small indices at the top left of a point, they express in which coordinate system the point is expressed; and if there is no index on the upper left, that means we are in the world or object coordinate system. What we then can do is describe how to map a point through this chain of coordinate systems. The transformation can be written as follows: we take our point in the 3D world, expressed in the object coordinate system, and then there is a mapping from the object coordinate system into the camera coordinate system. If you see a lowercase index on the lower right and an uppercase index on the upper left, this means "from here to there", so a mapping from O to K, from the object coordinate system to the camera coordinate system. The next step is a projection from the camera coordinate system to the image plane, and then we have a transformation from the image plane onto the sensor chip. You can see that when you chain those transformations, the adjacent indices must always match; if they do, we have not made a mistake, and for this reason this form of notation is quite attractive. What we want to do is estimate those transformations; later on we can actually combine them into one single transformation.

So let's illustrate this with a small figure to get an idea of how that actually works. On the one side we have our world or object coordinate frame, which is just an externally defined coordinate system somewhere. Then we have our camera, which sits somewhere in the world, and this is described through one transformation. The camera itself, however, says: if I am working in the camera coordinate frame, then the camera itself is at (0, 0, 0) and it looks into the world. The origin of the camera coordinate frame is the projection center of my camera. Then I have an image plane, which is shifted from the camera origin by a constant, the so-called camera constant: how far is the image plane away from the projection center of my camera? And on the image plane I have a second coordinate system, the sensor coordinate system, which tells me where pixel (0, 0) is on the chip. This will in general be different from the origin of the image plane, because the (0, 0) point of the image coordinate system typically lies on the optical axis, but in the end we need to go to the sensor coordinate system, because these are the pixel coordinates that we actually get returned in our image.

There are a few things to note here. The x and y axes of the camera coordinate system K and of the image plane are identical, so it is basically just a shift in the z direction, exactly the shift by the camera constant, the offset of the image plane from the projection center of my camera. So we can express the origin of the image plane in the camera coordinate system as (0, 0, c): no shift in x, no shift in y, only a shift along the z axis (which, depending on the convention, can also be a negative shift). Once we have that illustrated, we can break down the individual mappings. The whole point is mapped from the world into the image, and we want to break this up into parts: from the world to the sensor through the object, camera, image plane and sensor coordinate systems, and then a second transformation inside the sensor coordinate system to take certain nonlinearities into account.
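Purely as a mechanical illustration of how such a chain composes, here is a sketch with placeholder matrices (identity-like values, shapes only; each factor gets its real meaning in the rest of the lecture):

    import numpy as np

    # Placeholder transforms for the chain  world -> camera -> image plane -> sensor.
    H_k_o = np.eye(4)                      # 4x4 rigid-body transform (extrinsics), world -> camera
    P_c_k = np.hstack([np.eye(3),          # 3x4 ideal projection, camera -> image plane
                       np.zeros((3, 1))])
    H_s_c = np.eye(3)                      # 3x3 affine map, image plane -> sensor

    P = H_s_c @ P_c_k @ H_k_o              # combined 3x4 projection matrix
    X = np.array([1.0, 2.0, 5.0, 1.0])     # homogeneous 3D point in world coordinates
    x = P @ X                              # homogeneous pixel coordinate (up to the nonlinear correction)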
The first transformation is from the object coordinate system to the camera coordinate system, and this happens in 3D; everything stays 3D. Then we have the mapping from the camera coordinate system onto the image plane, and this is the ideal projection. Here we are going from a 3D coordinate to a 2D coordinate, which means we are losing information: this is the missing depth information of a single image, and it is lost through this projection from 3D to 2D, which takes us from the camera coordinate system onto the image plane coordinate system. Then we have a mapping from the image plane coordinate system to the sensor coordinate system. This now happens in 2D: we are inside the camera, in the image plane, and we simply account for how the image coordinate system is shifted, and not necessarily only shifted but also scaled, with respect to the sensor coordinate system. Up to here, this is our linear model. Then there is a second transformation that I put in, and it describes the deviation from this linear model, because typically we have some nonlinear deviations, nonlinear errors, involved, for example through lens distortions. In reality these do not happen at the sensor; they already happen when we map from the 3D world through our lens into the camera, onto the image plane. But we put them in here as an additional transformation on the sensor frame in order to compensate for those nonlinearities. So up to this point everything is linear, and this last step is the nonlinear one which is added to it.

We group the parameters into two different sets, the so-called extrinsics and the so-called intrinsics. The extrinsics are everything which happens outside the camera: for example, where the camera is, how I move my camera through the world, how I am rotating my camera; these are the extrinsic things, the stuff that happens outside the camera. The second set of parameters are the so-called intrinsics, so what happens inside the camera: for example, how the sensor is mounted behind the projection center; these are all camera-internal parameters. The first transformation in the chain is the extrinsics, and all the rest is what we group together into the intrinsics. You can also see it intuitively: if you calibrate your camera, that is, you estimate all those parameters, the intrinsics will stay constant when you move your camera. If you walk with your camera through the world, the intrinsics stay the same, but the extrinsics change, because you are moving the camera through the world. Therefore it makes sense to break the parameters up into extrinsics and intrinsics, because the extrinsics will change very often, and your state estimation algorithms may focus mainly on the extrinsics once you assume the intrinsics to be known, because you have a so-called calibrated camera. So again: the extrinsics are everything that happens in the outside world, basically the pose of the camera in the world, and the intrinsics are everything that happens inside the camera, so how we get from the camera coordinate system to the pixel value itself. We start at the beginning of this chain in this video and will move forward until we reach its end.
So we start with the extrinsic parameters. That means we are at the object coordinate system and want to go to the camera coordinate system; this is the first part of the mapping process, and again it describes the pose of the camera in the 3D world. The main question is how many parameters are actually needed for that transformation. You can move your camera through the world, but you are not changing or distorting the world, or scaling it to make it larger or smaller; that is something we cannot do. The only things we can do are translate the camera and rotate the camera. So this is a so-called rigid body transformation, which consists of three translational components in X, Y and Z and three rotational components, yaw, pitch and roll. This sums up to six parameters that we need to estimate: three for the position and three for the heading. So the extrinsics are six parameters which describe the pose, that is, position and heading, of my camera in the world reference frame. This is an invertible transformation: it is a transformation from the 3D world into the 3D world, a rigid body transformation, and therefore always invertible.

So I can express my point P, which is some point in the world coordinate system, as X_P = [X_P, Y_P, Z_P]; it has three coordinates. What I also have is the projection center of my camera, which is typically referred to as X_O (depending on the literature, other symbols are in use, but I will use X_O here for the origin of the camera). This is the origin of my camera coordinate system, the projection center of my camera, the point in the pinhole model where all rays pass through a single point; it defines the camera coordinate system, and we will refer to it as X_O all the time. You can also see that the characters of this variable are slightly slanted; that means we are in Euclidean coordinates, so it is just an X, Y, Z coordinate without the additional dimension of the homogeneous coordinates.

The extrinsics can thus be described by a translation between the origin of the world coordinate system and the camera coordinate system, where X_O is the location of the projection center, and a rotation, expressed as the rotation matrix R from the object coordinate system into the camera coordinate system. If we want to express this in Euclidean coordinates, the point P expressed in my camera coordinate system is obtained by taking the X, Y, Z location of the point in its original object or world coordinate system, subtracting the location of the camera, and then performing the rotation: kX_P = R (X_P − X_O). However, we like to express things in homogeneous coordinates, because it is easier to express most of the quantities that way. As a short reminder, homogeneous coordinates are those coordinates where you add an additional dimension, and this additional dimension makes certain things easier; for example, you can express translations through a matrix-vector multiplication, which you cannot do in the Euclidean world, and this allows us to chain transformations in a very elegant way. Therefore we will be using homogeneous coordinates here.
In Euclidean coordinates, again, this was expressed as kX_P = R (X_P − X_O), and we can now write down the same thing in homogeneous coordinates. We take the Euclidean vector, add an additional dimension to it, and then we have our rotation expressed as a homogeneous transformation matrix: the matrix R sits in the upper left block, with a zero vector next to it and a zero vector and a one completing the last row. Then we have the translational component: a matrix with the identity in the upper left block and the negative X_O, the position of the projection center of my camera, as the translation part; this is exactly the shift we had before. We put the original point, now in homogeneous coordinates, on the right, multiply those matrices together, and get one transformation matrix which executes this transformation for us: kX = kH_O X, where kH_O is the product of the rotation block matrix and the translation block matrix. It maps from the world coordinate system to the camera coordinate system, so that a point in the 3D world is expressed as another point in 3D, but now in the camera coordinate system.

Just as a short reminder on notation: everything in a slanted font, like the variable before, is in Euclidean coordinates, and everything in an upright font is in homogeneous coordinates. So just by looking at the font you can distinguish which variables are expressed in Euclidean and which in homogeneous coordinates.

With this we are done with the first part: we have expressed the extrinsics, where the camera is in the world, and we can express this through this transformation matrix. Multiplying a 3D point from the right-hand side of this matrix turns it into a point in the camera coordinate system; it is basically executing a rigid body transformation.
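As a small numerical check of this rigid-body step, here is a sketch with made-up values for R and X_O (nothing estimated, just an illustration that the homogeneous 4 × 4 matrix reproduces R (X_P − X_O)):

    import numpy as np

    # Made-up extrinsics: rotation about the z axis and a projection center X_O.
    theta = np.deg2rad(30.0)
    R = np.array([[ np.cos(theta), np.sin(theta), 0.0],
                  [-np.sin(theta), np.cos(theta), 0.0],
                  [ 0.0,           0.0,           1.0]])
    X_O = np.array([2.0, 1.0, 0.5])

    # Homogeneous 4x4 transform: rotation block times translation block.
    H_rot = np.block([[R, np.zeros((3, 1))], [np.zeros((1, 3)), np.ones((1, 1))]])
    H_tra = np.block([[np.eye(3), -X_O.reshape(3, 1)], [np.zeros((1, 3)), np.ones((1, 1))]])
    H_k_o = H_rot @ H_tra

    X_P = np.array([4.0, 3.0, 2.0])
    X_hom = np.append(X_P, 1.0)

    # Both expressions give the same point in the camera coordinate system.
    print(R @ (X_P - X_O))        # Euclidean form
    print((H_k_o @ X_hom)[:3])    # homogeneous form, first three components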
The next thing we want to look into are the intrinsic parameters. Again, the intrinsic parameters are the parameters of what happens inside the camera, which are not affected by shifting and rotating the camera around in the world. In our list of transformations the extrinsic part is done, and now we are looking into the remaining chain: we move from K to C, then from C to S, and then do one more step inside the sensor frame to take the nonlinearities into account. What happens inside the camera is a projection from the 3D world into 2D, onto our image plane, and at that point a projection is involved. The central projection is not invertible, because we lose one degree of freedom: the information of how far away a point in the world is. So this transformation, the ideal projection, is not invertible; the other transformations, from the image plane onto the sensor coordinate system and inside the sensor coordinate system, are invertible. Therefore the arrow only goes in one direction.

This whole mapping, the whole intrinsics, can be split up into a three-step process. The first step is the ideal perspective projection onto the image plane: a point expressed in the camera coordinate system is mapped onto the image plane. The second step is a mapping onto the sensor coordinate system: your sensor is a chip which has pixels, and the pixels are read out with a certain index, what is pixel 0, what is pixel 1, and so on. Depending on how this chip is mounted in your camera, the (0, 0) coordinate will, for example, not be in the middle of your chip but typically in one of its corners; this is what happens from C to S, the mapping onto the sensor coordinate system. The last step is a transformation inside the sensor coordinate system, and it is basically a compensation for the fact that the previous transformations are idealized: we say it is the ideal perspective projection, and for the mapping onto the sensor we also only allow a certain, linear type of transformation. By keeping these as linear transformations we do not take certain nonlinear effects into account, and we basically summarize those nonlinear effects in this last transformation.

So again, in our overall setup, the extrinsics are done, that was a rigid body transformation, and we now move towards C through the central projection. We assume here an ideal projection, that is, we assume a distortion-free lens. That is not the case in reality, but we assume it to be perfect here; the nonlinear errors that we will have to deal with later on in the sensor frame are something that results, for example, from errors in your lens, from the lens simply not being perfect. The assumptions we make here are: the lens is perfect, and all rays are straight lines that pass through the projection center of the camera, so all rays pass through one single point. This is the perfect pinhole, again something which is not the case in reality, because the pinhole is not infinitely small, it has a certain size; but it is what we assume here, and this pinhole point defines the origin of our camera coordinate system. Furthermore, the focal point and the principal point lie on the optical axis, the optical axis passes perpendicularly through the principal point, and the distance from the projection center to the image plane is constant, the same distance everywhere, equal to the camera constant c. These are the assumptions we make, because we assume an ideal perspective projection here.

There is also one thing in terms of notation that we should mention. If we have a point in the 3D world, the ray passes through the projection center and then onto the image plane, and so we generate an image which stands upside down. This is the physically motivated model, how the coordinate system looks if you follow the physics: the z axis points backwards, and we have a positive camera constant. What is more frequently used, however, especially in computer vision and most applications, is a coordinate system where we basically rotate the image plane by 180 degrees so that it sits in front of the camera. So if this is my point in the 3D world and this is my camera's projection center, then in reality the image plane lies behind the projection center, but we rotate it and put it in front; then the image stands correctly upright, and the projection goes from the projection center through the image plane to the point in 3D. This is equivalent, although it is not in line with physics, but it makes visualizing things much easier, so we will mostly use this coordinate frame here; the two are equivalent to each other, differing just by a rotation of the image plane by 180 degrees.
In this case the z coordinate, the camera constant, is negative, so that the image plane sits on the correct side. Just to be clear, we will be using this not physically motivated but more convenient model here, where the image plane sits between the camera's coordinate origin and the point in the 3D world.

So what we can now say is that through this ideal perspective projection we are projecting a point from the 3D world onto the image plane, and in order to express where this point is actually projected to, we can use the intercept theorem. We have two straight lines through the projection center and two parallel lines: the image plane and the plane through the point parallel to it. A coordinate of the point in the world and the corresponding coordinate of the point in our image plane are related by saying: this distance divided by the Z coordinate of the point is the same as the corresponding distance in the image plane divided by the camera constant. In this way we can express how a point from the 3D world is mapped onto our image plane, and it is very straightforward. The x coordinate of the point in the image plane can be expressed by taking the X coordinate of the point P, expressed in the camera coordinate system, dividing it by the Z coordinate of that point, and multiplying it with the camera constant; and we can do exactly the same for the y coordinate, there is no difference: cx = c · kX_P / kZ_P and cy = c · kY_P / kZ_P. The z coordinate of the projected point will of course always be the camera constant, because we are projecting the point onto the image plane, and in the image plane the z coordinate is always c, my camera constant.

Okay, so now let's see how we can express this mapping in homogeneous coordinates. If we expressed it as a mapping from 3D to 3D, we would have the X, Y, Z coordinates of the point plus the additional dimension, and the transformation matrix would essentially be a diagonal matrix with the camera constant in the first three elements and a one that reproduces the Z coordinate in the last row. But with this projection we lose one coordinate in the end: the z coordinate of the projected point is always the camera constant anyway, so we can simply drop that third coordinate. By dropping it, the result turns into a two-dimensional point: just an x, y coordinate plus the additional dimension of the homogeneous coordinates, the point in homogeneous coordinates expressed in my image plane. What remains is a 3 × 4 matrix which has the camera constant c in the first two diagonal entries, a 1 in the third, and zeros everywhere else, applied to the X, Y, Z coordinates of the point. So we are mapping from the 3D world, a 4-vector in homogeneous coordinates, through this matrix onto the 2D image plane, represented by a 3-dimensional homogeneous vector.
Let's verify that what is written here is actually the correct transformation, because you may ask yourself why it looks like this and may not simply believe what the guy in the video says. So let's see whether, if we perform the multiplication of the vector with this matrix, we actually get the result that we originally wanted from the intercept theorem; that was the result derived from basic geometry, so that is what should come out. Take the rows of the matrix (in the video they are colored red, black and blue, and the point vector green): a first row (c, 0, 0, 0), a second row (0, c, 0, 0) and a third row (0, 0, 1, 0), and multiply them with the homogeneous point (X_P, Y_P, Z_P, 1). The first row times the point gives c · X_P; the second row is exactly the same except that the y coordinate survives, so it gives c · Y_P; and the third row, which has a 1 instead of the camera constant, gives just Z_P, the only coordinate that survives there. Now remember that we want Euclidean coordinates: in order to turn the homogeneous coordinates into Euclidean coordinates, we divide by the last component, Z_P. The last component turns into 1, and we get c times the x coordinate divided by the z coordinate and c times the y coordinate divided by the z coordinate, which are exactly the expressions I derived from pure geometry. So this transformation matrix indeed performs the ideal perspective projection from the 3D world onto the 2D image plane.

As a result of this, we can describe the mapping from the camera coordinate system to the image plane by this matrix cP_K, to which we multiply from the right-hand side the homogeneous vector expressing the 3D point, and this projection matrix is the 3 × 4 matrix which has exactly this form: c, c, 1 on the diagonal and all other elements zero.
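The same check in a few lines of code (values made up; cP_K denotes the 3 × 4 ideal projection with camera constant c):

    import numpy as np

    c = 0.05                                  # made-up camera constant
    P_c_k = np.array([[c,   0.0, 0.0, 0.0],   # ideal perspective projection, camera -> image plane
                      [0.0, c,   0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])

    X_cam = np.array([0.4, -0.1, 2.0])        # point in camera coordinates
    x_hom = P_c_k @ np.append(X_cam, 1.0)     # = (c*X, c*Y, Z)
    x_img = x_hom[:2] / x_hom[2]              # dehomogenize -> (c*X/Z, c*Y/Z)

    # Same result directly from the intercept theorem.
    assert np.allclose(x_img, c * X_cam[:2] / X_cam[2])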
Now we can combine what we have done before, the extrinsics plus the ideal projection, by multiplying those transformation matrices together. We had the transformation matrix kH_O, which maps from the world coordinate system to the camera coordinate system, and the transformation cP_K, which maps from the camera coordinate system onto the image plane. Since these are both homogeneous matrices, we just multiply them with each other, which gives us the projection matrix that maps from the world coordinate system onto my image plane. We can then go further and add the next mapping, onto the sensor chip, as yet another matrix multiplied from the left. What we are going to do is put all the elements that are intrinsic parameters into one matrix, the so-called calibration matrix: whatever gets added to the matrix containing the camera constant goes into this calibration matrix, which contains all the intrinsic parameters, while the other factor is the matrix containing the extrinsic parameters. We typically express the camera calibration matrix as a 3 × 3 matrix, and for the so-called ideal camera that we have here the calibration matrix is simply diag(c, c, 1).

We can use a special bracket notation to write the whole thing down: a notation where we write a 3 × 3 matrix, a vertical bar and a 3 × 1 vector, and the whole thing is a 3 × 4 matrix. Then we can express the overall transformation either in the factored form or all in one line: P = K R [I₃ | −X_O], where K is the calibration matrix, R is the rotation matrix of the camera, and [I₃ | −X_O] is a 3 × 4 matrix which has the 3 × 3 identity matrix as its left block and the negative shift as its last column. Just to make this notation clear: written with these brackets and the vertical line, this is a matrix whose left part is the 3 × 3 identity and whose right part is a three-dimensional vector sitting in the last column. This is a quite compact way of writing things down; everything fits into one line, with the rotation matrix in there and the calibration matrix in there, so that everything related to the intrinsics is stored in K, and the rest is the expression for the extrinsics as well as for the projection itself, going from 3D to 2D, whose parameter actually sits in K.

Having the calibration matrix in that form for the ideal camera, we can express the projection from the world into this camera exactly in this form, and if we want to expand all those matrices describing the transformation from the world onto the image plane, we can also write it out: the shift, then the rotation matrix, then the calibration matrix. If we perform all those operations and normalize, just to get an idea of what this looks like in Euclidean coordinates, we get the so-called collinearity equations for the image coordinates: the operations I need to execute in order to transform the 3D point into the x coordinate and the y coordinate in the image plane. You can already see from those equations that things get a bit ugly if I do everything in Euclidean coordinates; if I stick with homogeneous coordinates, it is easier to write down and you typically make a smaller number of mistakes, because you do not want to mess around with all those expanded expressions. Writing it down in homogeneous form is typically much easier, and the number of mistakes you make is smaller.
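A sketch of this equivalence (with made-up R, X_O and c; the expanded fractions below are the usual form of the collinearity equations for this ideal camera, written out here as an assumption since the video only shows them on the slide):

    import numpy as np

    c = 0.05
    theta = np.deg2rad(20.0)
    R = np.array([[ np.cos(theta), 0.0, -np.sin(theta)],
                  [ 0.0,           1.0,  0.0          ],
                  [ np.sin(theta), 0.0,  np.cos(theta)]])
    X_O = np.array([1.0, 0.5, -2.0])
    K = np.diag([c, c, 1.0])

    P = K @ R @ np.hstack([np.eye(3), -X_O.reshape(3, 1)])   # P = K R [I | -X_O]

    X = np.array([2.0, 1.0, 5.0])
    x_hom = P @ np.append(X, 1.0)
    x, y = x_hom[:2] / x_hom[2]

    # Collinearity equations: ratios of linear expressions in (X - X_O), scaled by c.
    d = X - X_O
    x_col = c * (R[0] @ d) / (R[2] @ d)
    y_col = c * (R[1] @ d) / (R[2] @ d)
    assert np.allclose([x, y], [x_col, y_col])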
Okay, so what we have done so far is going from the world coordinate system into the image plane, and now we need to do the next step: moving on to the sensor, the mapping from the image plane onto the sensor, assuming only linear parameters and ignoring the nonlinear errors. In our overall overview we now move from C to S, and this is typically done through an affine transformation. The important things here are, first, the location of the principal point: the principal point is the point where the optical axis passes through the image plane, so it tells us where the (0, 0) coordinate of the image plane coordinate system lies, that is, which pixel coordinate on the chip it actually refers to. The second thing is that we may have a scale difference between the x and the y dimension, which depends on how the chip is designed. And we traditionally also have a compensation for the shear of this transformation, although I have to say that this shear is typically zero for today's digital cameras.

So let's have a look at the location of the principal point. On our sensor chip, the pixel (0, 0) of the sensor coordinate system sits in a corner, while the optical axis passes, hopefully, somewhere near the middle of our image. The point where it passes through is (0, 0) in the coordinate frame of the image plane, and so there is simply a shift in x and a shift in y, which can be expressed by x_H and y_H: the principal point, which realizes the shift between the image plane coordinate system and the sensor coordinate system; it is just a shift inside the plane. What we also need to take into account is a scale difference between x and y: x may be scaled differently than y, based on how the chip is set up and how the wires inside the chip are spaced; they are not necessarily equally spaced in the x and the y direction. We add this by defining the scaling of x as 1 and a scale factor of 1 + m for the y coordinate, with a scale difference m. Depending on the book or the literature you are using, you may also find two different camera constants, or focal lengths, for x and y; this is basically equivalent to this scale difference m. So m is a value which in the perfect case would be 0, but is typically slightly larger or smaller than 0, depending on how the chip has actually been produced. Then there is a shear component s that we can take into account, which matters for analog film but typically not for digital cameras; for digital cameras s is a value which is typically around zero.

With this matrix, multiplied from the left-hand side, we add four additional parameters to our system: the scale difference, the shear, and the principal point coordinates in x and in y. They are multiplied on as an additional transformation in order to go from the image plane onto the chip itself. So let's look at how this changes our calibration matrix, which should include all the intrinsic parameters: the new calibration matrix is defined as the matrix describing the mapping onto the image plane combined with the matrix describing the mapping from the image plane onto the sensor, so the multiplication of those two matrices. The camera constant then populates the diagonal entries, the shear entry becomes c · s (still zero if s is zero), and for the y direction the camera constant is scaled by the factor 1 + m.
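A sketch of this composition (parameter values made up; the sensor matrix written here simply follows the description above):

    import numpy as np

    c, m, s = 0.05, 0.002, 0.0      # camera constant, scale difference, shear (made-up)
    x_H, y_H = 320.0, 240.0         # principal point (made-up)

    K_ideal = np.diag([c, c, 1.0])                 # mapping onto the image plane
    H_sensor = np.array([[1.0, s,       x_H],      # image plane -> sensor: shear, scale, shift
                         [0.0, 1.0 + m, y_H],
                         [0.0, 0.0,     1.0]])

    K = H_sensor @ K_ideal
    print(K)
    # [[ c   c*s      x_H ]
    #  [ 0   c*(1+m)  y_H ]
    #  [ 0   0        1   ]]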
This brings us, in the end, to what is called the calibration matrix of an affine camera. It is an affine transformation and consists of five parameters: the camera constant, the shear, the scale difference, and the x and y coordinates of the principal point. These are the five traditional parameters that are used when nonlinear errors are not taken into account, and this is the standard calibration matrix realizing this affine transformation. Again, depending on the literature you are looking into, you may find two different camera constants or focal lengths here, c_x and c_y; but you can see that this c would be c_x, and c_y would simply be c + c · m, so there is a direct relationship between those quantities.

The transformation that is expressed by this calibration matrix together with the extrinsic parameters is called the DLT, the direct linear transform, and here we have to be slightly careful, because DLT has two meanings in photogrammetry. The first meaning describes this transformation itself, the transformation we just derived, with six extrinsic parameters plus the five intrinsic parameters sitting in our calibration matrix. But DLT is also used for an approach that estimates those five plus six, so in sum eleven, parameters. So you have to look a little bit at how the word DLT is being used: it can be used to describe this expression, and it can also be used for an approach to estimate the parameters of this transformation. What we have here is a model for what is called an affine camera: a camera that uses an affine transformation in order to realize the mapping from the world coordinate system onto the sensor. The overall mapping from the world onto the sensor can be expressed through this affine transformation plus the projection, and it takes into account only the linear part, not nonlinear lens errors; those are something we will deal with additionally.

The homogeneous projection matrix expressing the DLT is referred to as P, and P has eleven parameters: five parameters sitting in K, three parameters sitting in R, and three parameters sitting in the shift. If we write it out in the Euclidean world, the entries P_11 up to P_34 are the twelve entries of my matrix P, and we get the resulting equations for the Euclidean image coordinates. You may wonder at this point: the matrix P is a 3 × 4 matrix, so it has twelve entries, but I said these are eleven parameters; how do eleven and twelve come together? They come together because this matrix P is a homogeneous matrix, and homogeneous entities are defined only up to a scaling factor. This means the last degree of freedom is taken by the homogeneous property: I can scale this matrix up and down and it is still the same transformation matrix in the homogeneous world, so one parameter is effectively lost because we are living in this homogeneous world. With these eleven parameters all the elements of the matrix are defined, because we know the matrix anyway only up to a scaling factor, and a matrix in homogeneous coordinates is the same entity if I multiply it with any scalar.
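A small illustration of that scale invariance (reusing the affine K and the P = K R [I | −X_O] pattern from above, with made-up numbers):

    import numpy as np

    c, m, s = 0.05, 0.002, 0.0
    x_H, y_H = 320.0, 240.0
    K = np.array([[c,   c * s,       x_H],
                  [0.0, c * (1 + m), y_H],
                  [0.0, 0.0,         1.0]])
    R = np.eye(3)                          # made-up extrinsics
    X_O = np.array([1.0, 2.0, -3.0])

    P = K @ R @ np.hstack([np.eye(3), -X_O.reshape(3, 1)])   # 3x4 DLT matrix: 12 entries, 11 DOF

    X = np.append(np.array([0.5, 0.4, 6.0]), 1.0)
    x1 = P @ X
    x2 = (7.3 * P) @ X                     # scaling P by any nonzero factor ...
    assert np.allclose(x1[:2] / x1[2], x2[:2] / x2[2])   # ... gives the same pixel after normalization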
This brings me to the end of the part looking at the linear model, which is the key part of the model we have here. In reality, however, we also have to look into the nonlinear errors that affect the overall process. So the second part looks into nonlinear errors: how do we take into account that certain aspects of the transformations we have discussed are actually affected by nonlinear errors? Those nonlinear errors of course happen over the whole pipeline from K to S, but we say: let's keep that part linear, the central projection and the affine transformation, and encode only the nonlinear aspects as a mapping inside the sensor frame, from S into a corrected sensor frame, to account for the nonlinear effects that have happened and to compensate for those nonlinearities.

There are several reasons for those nonlinear errors. Very popular ones are lens distortions: for example, with a very wide-angle lens you can see a barrel distortion, so that even straight lines look like a barrel, and these are things you want to get rid of if you want to estimate geometric properties of your environment based on a camera. Other reasons are, say, the non-planarity of your sensor or other aspects of an imperfect production process of your camera. Lenses will always be imperfect; you can even argue that the quality of lenses has decreased over the last 40 or 50 years, that we used to have higher-quality lenses than what we actually have in our cameras today, so it is even more important to look into those nonlinear errors and compensate for them.

So how do we do this? We do it with a general mapping where we update the x and the y coordinate of a point with a location-dependent shift. This location-dependent shift means that every pixel gets shifted a little bit in the x and y direction in order to compensate for the nonlinearity. If you keep the idea of a barrel distortion in mind, where straight lines are curved towards the outside, that would mean that all those points get moved a little bit towards the inside, so that the barrel turns into straight lines again. And this is location dependent: for every location in the image we get a different shift, an individual shift per pixel. I can express this by saying that the point in my most general, corrected frame is my original point in the sensor frame plus a small shift, and the important thing is that this shift depends on x, the location in the image, and on some nonlinear parameters q. So it depends on where the point is mapped to in my image plane, in order to model what the nonlinear, location-dependent disturbances actually look like. This is not a transformation that can be expressed very easily, because it can be different for every location in the image; typically I assume that there is a very smooth transition, or I express it in a parametric form, but in theory the transformation is different for every location in the image.

Just as a small example: here is an image of a keyboard taken with a regular webcam, and what you can see is that a line which is a straight line in reality appears curved in the image. That is one of the effects of a lens distortion, and it means that a straight line in the world is not mapped to a straight line in the image.
So the mapping is not straight-line preserving, and this is an error that we typically want to compensate. What camera calibration can do for us is that if we estimate all the intrinsic parameters, including those nonlinear errors, then I can actually correct the image so that the straight line turns into a straight line again, something you can see here. You can also see that, for example, in this part of the image the content has been shifted a bit towards the inside in order to make all the straight lines straight; that is something I can do with this location-dependent shift, shifting pixels depending on where they are located.

If I express this with an additional mapping matrix, which maps from the sensor frame to my corrected sensor frame, I can write it as a 3 × 3 matrix with ones on the main diagonal, so essentially an identity rotation in 2D, the homogeneous element for the normalization, and the shift, which is exactly the location-dependent shift parameter from before. I can multiply this matrix onto the left-hand side of the overall mapping designed so far, and then the mapping takes into account that there are nonlinearities in the process, compensated through this pixel-dependent shift. This basically means I am executing my transformation with my linear model and then doing some fixes in the end for the nonlinear errors that I actually have. If I multiply this matrix H with my calibration matrix K, in order to get a calibration matrix which also takes the nonlinear effects into account, then this calibration matrix looks like this: the rest stays exactly the same as before, only in the shift component we have the principal point in x plus the additional shift in x, and the principal point in y plus the additional shift in y.

So the main question now is what this Δx, Δy function actually looks like. There are different approaches for estimating it, and we can group them into two different families. The first are the physically motivated approaches: models motivated from physics, where I try to describe the errors with a concrete physical model, take the different effects that could occur into account, and then use the parameters I get from those physical models and add them to my process. The problem, however, is that there is a large number of reasons why there may be nonlinear errors; it may be very hard to describe them all, and it may also be hard to get all the parameters, or even an idea of what those parameters could look like. Therefore those approaches have become less and less popular. What we can do instead is just describe the phenomenon that actually happens, assume a model, and then try to fit parameters to this model so that we get the best images out under this model assumption. That is much easier to do, but it does not really help us to identify the underlying problem: with a camera calibration that only describes or compensates the phenomenon, you do not gain the understanding of where the problem actually lies, which is what a physically motivated model would do better. Nevertheless, this second family of models, the phenomenological ones, are the popular ones today.
One example is the barrel distortion: as I said before, straight lines are turned into a barrel-like shape, and especially if you work with a wide-angle lens you can see this effect. A model for the barrel distortion could be the following: I take my x and y location and add terms to x and to y which depend on what is called r, the radius, the distance from the principal point of the image plane to the pixel coordinate, so the further away the pixel, the larger r gets. The terms contain r squared and r to the power of four with different coefficients, and I need to estimate those coefficients q1 and q2 in order to capture the strength of this barrel distortion. So this is a simple model which uses just two parameters, q1 and q2, to compensate for the effect that a lens distortion in the form of a barrel distortion generates: it shifts my x and y pixel depending on how far the point is away from the principal point. At the principal point the radius is zero and there is no shift, and the further I move away, the larger this shift gets. Here is an example of how this radial distortion field could look: there is no distortion in the center, and the further you go towards the outside, the stronger the distortion is. I can then turn that around and shift the pixels in the opposite direction, so that my image gets undistorted using this information. In the end I can use this type of distortion model, but I can also use other types of distortion models that try to describe the phenomena I see in my distorted images, estimate their parameters, and compensate for them. This can get tricky if the models become more complicated, because there are also parameters which influence each other, since parts of them compensate for the same effects or enhance certain effects, so it is typically not trivial to do.
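A sketch of such a two-parameter radial model (q1, q2 and the principal point are made-up values; this is one common way to write a barrel-type distortion, kept deliberately simple):

    import numpy as np

    def distort(x_s, y_s, x_H, y_H, q1, q2):
        """Apply a location-dependent radial shift to a sensor-frame point."""
        dx, dy = x_s - x_H, y_s - y_H          # coordinates relative to the principal point
        r2 = dx**2 + dy**2                     # squared radius
        factor = 1.0 + q1 * r2 + q2 * r2**2    # 1 + q1*r^2 + q2*r^4
        return x_H + factor * dx, y_H + factor * dy

    # Made-up parameters: at the principal point the shift is zero, and it grows with r.
    x_H, y_H = 320.0, 240.0
    q1, q2 = -2.0e-7, 1.0e-13
    print(distort(320.0, 240.0, x_H, y_H, q1, q2))   # (320.0, 240.0): no shift at the center
    print(distort(600.0, 420.0, x_H, y_H, q1, q2))   # shifted towards the center (barrel)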
In practice, what happens is that we do everything as a two-step process: first the linear part, then the nonlinear part. So we first assume our linear model, take our 3D point and multiply it with the projection matrix P to get our point in the sensor frame, ignoring the nonlinearities, and then we do a second transformation which corrects the pixel location and performs this pixel-dependent correction in order to compensate for the nonlinear errors. The overall result is that the first step is the same for all pixels in my image and the second step is different for every pixel, so we get a point-dependent transformation overall: the first part is the same for all points, and the second part is individual for every point, so that we get individual corrections and the best possible corrections for our camera.

So this was the mapping describing how a point from the 3D world is mapped onto my 2D image and how I can compensate for certain errors, so that in the end I can compute a corrected image and can say for every point of the 3D world where it ends up in the 2D image, and have an understanding of what that mapping from the 3D world to the 2D image plane looks like. This is a very important piece of knowledge that you need if you want to work with cameras and perform geometric estimation. For geometric estimation, however, we typically want the process to work the other way around: we have a pixel coordinate and want to estimate where the point is in the 3D world, at least if we want to estimate the 3D geometry of the world given our camera image. It may be slightly different if we know the world geometry and want to localize our camera; then the mapping in one direction may be sufficient. But what we often want to do is obtain information about the scene from our camera image, and for that we would need to invert this process. As we have seen, however, the central projection is an issue here, because we lose information when we go from the 3D world to the 2D image plane, and also the nonlinear part can be a bit tricky for us. So let's see how we can invert the mapping, what we can say if we invert it, which information we can actually get, and which information we will not be able to retrieve; this gives us a pretty good idea of what happens.

The goal is to map back from the pixel coordinate in my corrected image and infer the coordinate of the point in the 3D world, and I can tell you right now that this is not fully doable: we cannot recover the X, Y, Z coordinates of the point from x_a alone, that is impossible. But we can constrain where that point can be in the 3D world, and this is also very valuable information, because in the end we can combine multiple images to actually get information about X. So, as before, we again use a two-step procedure: the first step is to map back to the sensor frame, and the second is to go from the sensor frame back to the 3D world.

Let's start with the first step: how do I get back from the corrected coordinate system, after the corrections have been applied, to my original sensor coordinate system? The problem is that in general this transformation cannot be inverted directly, because it requires me to know x, that is, where the pixel is in the image plane; but the only thing I have is the corrected coordinate x_a, and I do not know which x belongs to it. As a result, in order to invert this transformation I have to use an iterative approach, iteratively inverting the transformation so that I can obtain the sensor coordinate from the corrected coordinate. Again, the shift depends on the coordinate of the point in the image plane, which is information I do not have if I only have the final pixel location. So I start an iterative procedure: assuming my image does not have too weird a distortion, I can use the pixel coordinate of the corrected image point, x_a itself, as an initial guess, invert the transformation with that guess, and then keep iterating until the estimate of the sensor coordinate converges.
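A sketch of that iteration, reusing the made-up radial model from above (a simple fixed-point scheme; not the only way to do this):

    import numpy as np

    def delta(x, y, x_H, y_H, q1, q2):
        """Location-dependent shift of the radial model: corrected = point + delta(point)."""
        dx, dy = x - x_H, y - y_H
        r2 = dx**2 + dy**2
        g = q1 * r2 + q2 * r2**2
        return g * dx, g * dy

    def invert_shift(x_a, y_a, x_H, y_H, q1, q2, n_iter=10):
        """Invert the shift iteratively: start at the corrected point and refine."""
        x_s, y_s = x_a, y_a                        # initial guess: the corrected coordinate itself
        for _ in range(n_iter):
            ddx, ddy = delta(x_s, y_s, x_H, y_H, q1, q2)
            x_s, y_s = x_a - ddx, y_a - ddy        # feed the current estimate back in
        return x_s, y_s

    x_H, y_H, q1, q2 = 320.0, 240.0, -2.0e-7, 1.0e-13
    x_a, y_a = 600.0, 420.0                        # some corrected pixel coordinate
    x_s, y_s = invert_shift(x_a, y_a, x_H, y_H, q1, q2)
    dx, dy = delta(x_s, y_s, x_H, y_H, q1, q2)
    print(x_s + dx, y_s + dy)                      # forward model approximately reproduces (600, 420)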
Concretely, I plug in not the true image-plane coordinate but my corrected image coordinate x_a, invert the transformation, and that provides me with a first estimate of the point in my sensor coordinate system. Of course this is just an initial guess, because the transformation I inverted was evaluated at the wrong location; but the hope is that I am now closer to the true location in the sensor frame than I was before. And if this worked once, it should also work the next time, because now I have a better estimate: I take the result, feed it back in, and iterate the process. So I take the pixel coordinate in the sensor frame from the first estimate, do the process again while keeping my x_a, which gives me a second estimate, then a third estimate, and so on and so forth. Typically this approach converges quite quickly, so that through this iterative execution of the nonlinear function and its inversion I can turn the point from the corrected coordinate system into the sensor coordinate system, and then that step is done. One remark: this point is always expressed with respect to the principal point, because the distortions happen in the image plane and are therefore always referred to the image plane, so you may need to convert your coordinates with respect to the principal point; just to have said that.

The second step is that once I am in the sensor plane, I can try to invert the mapping which is the same for the whole image: I need to invert the mapping from the 3D world onto the 2D sensor chip, now assuming there are no nonlinear errors. I have my projection matrix and x = P X, and since we are living in homogeneous coordinates this is only defined up to a scalar, so I multiply a scalar λ onto it: λ x = P X. I can then expand P X into K times R times the 3 × 4 matrix with the identity matrix and the shift, multiplied with the point; the point can be expressed as a Euclidean point plus the additional dimension, and then I can split this up into two parts: K R times the X, Y, Z location in Euclidean coordinates, minus K R times X_O, the location of the projection center in Euclidean coordinates. Nothing special, I just multiplied the vector with the matrix. The λ, again, is an effect resulting from the homogeneous coordinates, the fact that everything is only known up to a constant, and this is something I can exploit in a second. What I now want to do is get X alone on one side of this equation and all the other quantities on the other side, because I want to know something about the capital X, where the point is in the 3D world. So I take the equation, add the K R X_O term to both sides so that it moves over, and then multiply from the left-hand side with (K R)⁻¹, so that K R turns into the identity matrix and only X survives. This gives me X = (K R)⁻¹ K R X_O + λ (K R)⁻¹ x, and since (K R)⁻¹ K R cancels to the identity matrix, what survives is just X_O, so X = X_O + λ (K R)⁻¹ x.
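A small numerical check of this result (made-up K, R, X_O again): every point X_O + λ (K R)⁻¹ x, for any positive λ, projects back onto the same pixel x, which is exactly the one degree of freedom that cannot be recovered.

    import numpy as np

    c = 0.05
    K = np.array([[c,   0.0, 320.0],
                  [0.0, c,   240.0],
                  [0.0, 0.0,   1.0]])
    R = np.eye(3)
    X_O = np.array([1.0, 2.0, 0.0])
    P = K @ R @ np.hstack([np.eye(3), -X_O.reshape(3, 1)])

    x = np.array([321.0, 239.0, 1.0])          # some homogeneous pixel coordinate
    direction = np.linalg.inv(K @ R) @ x       # ray direction (K R)^-1 x

    for lam in (0.5, 2.0, 10.0):
        X = X_O + lam * direction              # points along the ray, at different depths
        x_proj = P @ np.append(X, 1.0)
        print(x_proj[:2] / x_proj[2])          # always (321, 239)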
The second term, λ times (K R)^{-1} times x, is something I can do nothing more about, so let's have a look at what this equation tells us. It says that the point in Euclidean coordinates, so its X, Y, Z location, equals the X, Y, Z location of the origin of my camera, the projection centre, plus a scale factor, so an arbitrary one-dimensional number, times the product of (K R)^{-1}, a 3-by-3 matrix which is just my calibration matrix multiplied with my rotation matrix and then inverted, with a vector, and this product gives me a direction. So now we can interpret what that actually means, and this is a super important point: the term (K R)^{-1} x describes the direction of the ray from the camera origin X_0 to the point X in the 3D world. It says the point must lie on a ray starting at the projection centre and going out into the world, this product is the direction vector towards that point, and the λ basically tells me how far along that direction the point is; so this is the direction vector, and this is the length along the direction vector. That means (K R)^{-1} times the pixel coordinate tells me the direction of the ray: I can estimate on which ray the point X must lie in the 3D world, based on my pixel information, my knowledge of where the camera is in the world, and my intrinsic parameters, but I do not know how far away it is, because I cannot resolve the scale parameter. And this is exactly the one degree of freedom of information which is lost when projecting from the 3D world onto the 2D image plane: we lose one degree of freedom, and it sits right here, in the fact that I do not know where on that ray the point lies, only that it lies on that ray. It is one degree of freedom which cannot be resolved, and this is the reason why I cannot fully invert the mapping from a pixel coordinate to a 3D world coordinate. The really important thing to note, though, is that from the calibration matrix and the rotation of the camera, if I multiply those two matrices, invert the product, and apply it to a pixel coordinate, I get the direction of the ray from the origin of the camera to the point in the 3D world, and this is a very important piece of information that we will exploit when we think about 3D reconstruction tasks, so about estimating where a point can be in the 3D world. So it is important to understand what this inversion of the mapping actually means.

What we have discussed so far was the mapping from the 3D world to the 2D image location, with all the different steps through the different coordinate systems, and we discussed in which way we can go back from the pixel coordinate to the 3D world, which we can do only up to one degree of freedom that we cannot resolve. Before coming to the end, I want to briefly talk about different classifications of cameras, because different cameras have different names and a different number of parameters, and if people talk for example about a normalized camera or a Euclidean camera, you should at least get an idea of what that means in terms of the parameters. We can sort our calibration parameters by asking: where is the camera in the world, its X, Y, Z location; where is the camera looking, its rotation; what is the camera constant; what is the shift of the chip with respect to my image plane coordinate system, the principal point; and is there a difference
in scale between the x and y coordinate, is there a shear, and which nonlinear parameters are involved. So assume we have a camera where the nonlinear parameters are all zero, there is no shear, there is no scale difference, there is no principal point offset, the camera constant is one and the rotation matrix is the identity: that is something we call a normalized camera. A normalized camera is just a camera which is shifted with respect to the world coordinate system, so there is an X, Y, Z translation but no rotation involved, the camera constant is one, there is no shear, no difference in scale between x and y, and no nonlinearities: a very, very simple camera which is basically only allowed to be shifted around. If you also want to be able to rotate the camera, so look into different directions, but still not change the intrinsics, that is something we call a unit camera: it is just a camera which can also point into a different direction. If we have a camera with a different camera constant c, so the image plane does not sit at a distance of one from the projection centre, we call it an ideal camera; the name comes from the fact that it performs an ideal perspective projection. If we additionally have a shift of the image plane, so a principal point offset, but we do not scale the pixels and have no shear, this is called a Euclidean camera, because the transformation that happens in the image plane is just a shift in x and y. And if we have a camera that also has the parameters m and s, that is something we call an affine camera, because now we have a full affine transformation in the image plane; this is a straight-line-preserving camera, it assumes no nonlinear errors, and the DLT, the direct linear transform, is exactly this transformation. The important thing to note is that straight lines in the world stay straight lines in the image; if straight lines do not stay straight, that is an indicator of nonlinear errors, which happens when we add the additional parameters q1, q2 and possibly even more parameters.

These different cameras have different calibration matrices. For the unit camera the calibration matrix is the identity matrix, so the overall number of parameters is 6: 6 for the extrinsics and zero for the intrinsics, as there are no further parameters in here. The ideal camera just has a camera constant, which is the same for the x and the y direction, so that gives 6 plus 1, so 7 degrees of freedom. The Euclidean camera adds the shift of the principal point in x and y, so two additional parameters, 9 in total. The affine camera, which is the DLT, gives us 11 degrees of freedom, and the general camera has those 11 degrees of freedom plus the number of degrees of freedom needed to describe the nonlinearities (the calibration matrices of these classes are summarized below). If we have a camera for which we know this calibration matrix, we talk about a calibrated camera; if we do not know those parameters, we talk about an uncalibrated camera. So if you take your camera out of the box you just bought and start taking pictures with it, you do not know anything about the calibration parameters, so this is an uncalibrated camera. You may then sit down and implement an algorithm for calibrating the camera, that is, for estimating those intrinsic parameters, for estimating this K.
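As a compact reference, the calibration matrices of these camera classes can be summarized as follows. This is a sketch in the notation I believe is used in this lecture series, with camera constant c, scale difference m, shear s and principal point (x_H, y_H); treat the exact placement of m and s inside K as an assumption rather than something stated in the video.

```latex
% Calibration matrices per camera class (sketch; parameter counts from the lecture).
\begin{align*}
\text{normalized camera:}   &\quad K = I_3,\ R = I_3                        && 3 \text{ parameters (translation only)}\\
\text{unit camera:}         &\quad K = I_3                                  && 6 \text{ parameters}\\
\text{ideal camera:}        &\quad K = \operatorname{diag}(c,\, c,\, 1)     && 7 \text{ parameters}\\
\text{Euclidean camera:}    &\quad K = \begin{pmatrix} c & 0 & x_H \\ 0 & c & y_H \\ 0 & 0 & 1 \end{pmatrix} && 9 \text{ parameters}\\
\text{affine camera (DLT):} &\quad K = \begin{pmatrix} c & c\,s & x_H \\ 0 & c\,(1+m) & y_H \\ 0 & 0 & 1 \end{pmatrix} && 11 \text{ parameters}\\
\text{general camera:}      &\quad K \text{ as affine, plus } q_1, q_2, \dots && 11 + \text{\#nonlinear parameters}
\end{align*}
```

In all of these, the extrinsics contribute 6 parameters (3 for X_0 and 3 for R), except for the normalized camera, where the rotation is fixed to the identity.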
Once you have estimated that K, you take the images coming out of the camera and transform the points you extract from them with this calibration matrix, so that you then have a calibrated camera, because the locations of the points have been corrected accordingly in your new images. Sometimes people also use the term metric camera; this is basically a camera where the intrinsics do not change, so a camera that you want to use for measurements. Such cameras typically have certain properties, they are explicitly designed so that the intrinsics change as little as possible, but the term is not that frequently used anymore today. The process of estimating those intrinsic parameters is what we call camera calibration, and it is something that we will also look into.

In sum, what we have done in this lecture today is we looked into how we can describe the mapping from the world coordinate system onto the chip, so not just onto the image plane but onto the sensor coordinate system, so that we know which pixel corresponds to which world coordinate. In this respect the extrinsics are the parameters of the camera which describe where the camera is in the world and where it is looking, so everything which is outside the camera so to say, and the intrinsics are the things which happen inside the camera, so the mapping from the camera coordinate system to the sensor coordinate system, potentially taking nonlinearities into account. All those five intrinsic parameters that we have been using here are the ones which do not take into account any nonlinear errors; this is described by the DLT, the 11-degree-of-freedom transformation x = P X, which is what we refer to as the direct linear transform. We then also looked a little bit into the nonlinear errors, at least the barrel distortion, and into how the inversion of the mapping process looks: if I have pixel coordinates, how can I actually invert that mapping back to the original world coordinate system? And it turns out I cannot fully recover the position of the point in the 3D world, I can only say on which straight line, on which ray, the point may lie, and I have no idea how far it actually is away from my projection centre. Visually we can summarize it as follows: this was the overall mapping from the world to the sensor, these are our extrinsics, these our intrinsic parameters; we have the rigid-body transformation placing the camera in the world, then a central projection going from the camera coordinate system to the image plane, then an affine transformation going from the image plane onto the sensor, and then potentially some nonlinear errors that we need to compensate for. With this I am coming to the end of the lecture. I hope you now have a good understanding of how to map a point from the 3D world onto an image, where those points end up in your image, so which pixel a point in 3D will be mapped to, and how we can go back, and which information we can actually get out of that. Thank you very much for your attention.
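To tie the summary together, here is a minimal forward-projection sketch in Python under the same assumptions as before: a pinhole camera with the affine calibration matrix, the extrinsics R and X_0, and an optional simple radial distortion around the principal point; build_K and project are illustrative names, not an API from the lecture.

```python
import numpy as np

def build_K(c, m, s, x_H, y_H):
    """Affine calibration matrix: camera constant c, scale difference m,
    shear s, principal point (x_H, y_H) -- one common parameterization."""
    return np.array([[c,   c * s,           x_H],
                     [0.0, c * (1.0 + m),   y_H],
                     [0.0, 0.0,             1.0]])

def project(X_world, K, R, X0, pp=None, q1=0.0, q2=0.0):
    """World point -> pixel: rigid-body transform, central projection and affine
    map onto the sensor (all contained in K R [I | -X0]), then an optional
    assumed radial distortion about the principal point pp."""
    X_cam = R @ (np.asarray(X_world, float) - X0)   # world -> camera frame
    x_h = K @ X_cam                                 # homogeneous sensor coordinate
    x = x_h[:2] / x_h[2]                            # Euclidean sensor coordinate
    if pp is not None:                              # same distortion model as in the
        d = x - pp                                  # undistortion sketch further above
        r2 = d @ d
        x = pp + (1.0 + q1 * r2 + q2 * r2 ** 2) * d
    return x

# Example: an ideal-looking camera at the world origin, looking along the z axis.
K = build_K(c=1200.0, m=0.0, s=0.0, x_H=640.0, y_H=480.0)
R, X0 = np.eye(3), np.zeros(3)
print(project([0.5, -0.2, 10.0], K, R, X0))         # -> [700. 456.]
```

With q1 = q2 = 0 this is exactly the straight-line-preserving DLT case x = K R [I | -X_0] X; nonzero coefficients add the nonlinear errors that make straight lines bend.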
Info
Channel: Cyrill Stachniss
Views: 16,296
Keywords: robotics, photogrammetry
Id: uHApDqH-8UE
Length: 75min 25sec (4525 seconds)
Published: Wed Apr 15 2020