CVFX Lecture 17: Image formation and single-camera calibration

Captions
So now we're moving into Chapter 6, which is about the problem of match moving. Match moving, broadly, means: how do I take a computer-generated element and insert it into real footage so that it appears to live in the scene? There are a bunch of complicated things that sell the illusion that the object is really in the scene. One of them, for example, is lighting: you have to make sure the lighting of the inserted object matches the lighting of the physical scene. We're not going to talk about anything like that; for our purposes the most important thing is geometry. For example, when you put a creature into a shot, as in lots of movies like G-Force and The Smurfs with little animated characters playing around on tables and chairs, you need to make sure those creatures look like they're firmly planted on the surface they're running around on. If their feet penetrate the table, or they appear to float above it, or they're not perpendicular to the ground plane, things look really weird. There also has to be the right scale, rotation, and translation to place the new object into the 3D scene.

So what do we need to make that work? A couple of things. One is that we need to know, three-dimensionally, where all the points in the scene are, so that we can put the object in the right place. And if the camera is moving, we need to know how to re-project that 3D character from any particular camera perspective. So the problem of match moving, stated concisely, is estimating a camera trajectory in 3D, plus the 3D positions of feature points, based on the apparent motion of 2D correspondences. This is one reason we cared so much about features in Chapter 4. What is our basis for figuring out how the camera moves around? The only thing we can do is observe the images we've obtained, and what we can pull out of those images is feature points. Using the detection and matching algorithms from Chapter 4, we can not only match feature points image to image, but also hopefully build long tracks of feature points as the camera moves around. So typically, in a match-moving scenario, you have a bunch of images and the position of each feature point in each of them; if I trace out a trajectory, I get a little path of that feature point as it moves. I see the paths of a whole bunch of those feature points, and I use them to estimate the camera trajectory.

This problem is called match moving in the visual effects industry, or camera tracking, so there are a bunch of related names you may see. In computer vision the key problem is called structure from motion, sometimes abbreviated SfM. This is actually one place where professors and grad students made a visible impact on the visual effects industry. Some of the material from the early part of the class is still more on the academic side, but here a lot of work came out of the University of Oxford computer vision group, which worked hard to solve the structure from motion problem.
They then commercialized it into a package called boujou, which was picked up by the visual effects community and used to do all these effects. I remember seeing a talk by one of the Oxford professors saying that he had just seen a preview for the new Harry Potter movie, and he knew the Harry Potter match-move data was still sitting on his hard drive at the office, waiting to be fully processed. So there was definitely a connection between the film industry and the academic world in the context of this problem. (That's boujou: b-o-u-j-o-u.) The company that produced boujou was called 2d3, and I think 2d3 got bought out by somebody; I need to check on that. As we'll discuss a little later, and as you'll find in the homework, there are a bunch of commercial and freeware structure-from-motion packages you can use to solve this match-moving problem; that's why I want you to investigate them a little on the homework.

Another closely related field is called photogrammetry, which is a weird-sounding word, but it's basically the idea of looking at images and understanding the three-dimensional objects they correspond to. That came up very early, around the 1940s and 1950s, when people were flying planes over terrain, taking pictures, and wanting to build terrain maps from these aerial images. Lots of the underlying techniques we're going to talk about were discovered by US Army and Air Force researchers back in the day, who really understood the mechanics of how images are projected from the 3D world onto a 2D image plane. Some of the computer vision work of the 1980s and 90s was actually rediscovering things people had known in an entirely different field for many years; the ideas just hadn't percolated from the military world to the academic community.

Another term you may have heard of, which is related although we won't talk about it much here, is SLAM, which stands for simultaneous localization and mapping. That's really more of a robotics idea. You've probably seen the DARPA Grand Challenge kinds of things, where a robot, which could be an autonomous car or just a little buggy driving around a hallway, is sensing the world. Robots are not necessarily using images; they may use other sensors like ultrasound or lidar, but many do have cameras, and from these observations the robot tries to localize itself with respect to its 3D environment, which is really the same kind of problem we're trying to solve here. So all these terms and algorithms from different academic fields are really solving different aspects of the same problem.

It's a very rich area, and one thing I'll mention is that it's quite mathematical, so prepare yourself: this is going to be a derivation-heavy chapter, not so much today but definitely next week. All the ideas we'll talk about concern how the 3D world is projected onto a 2D image plane, and then how 2D correspondences imply things about the 3D world again.
So we have to understand how to work back and forth between 2D and 3D, and the branch of mathematics that covers this is called projective geometry. Unfortunately there's no real way to water it down too much, so I'm going to give you the high-level view, with the understanding that there are whole books written specifically about problems in projective geometry. There's a great book by Hartley and Zisserman called Multiple View Geometry (the second edition is out), and that book is, in my opinion, the Bible for really understanding the concepts I'm only going to cover at a somewhat high level in this chapter. So if you want to learn more projective geometry, that's a great book; I don't claim to have written a better description of those things.

The first thing we have to think about is where we get these feature tracks. We start with feature correspondences that are built into longer tracks. We already talked about the two-frame correspondence problem: find a point in image one and match it up in image two. You can imagine extending that idea to images three, four, five, six, seven: I push my correspondence forward until the matching is no longer reliable and I decide I can no longer say that this point in image n is the same one as the point in image n-1. So that's where these correspondences come from.

Here's a picture that illustrates the idea. The top row is messy because of all the dots, but this is what I might get from a slowly moving camera, which is the situation you're usually in with video, where you're moving a camera around a scene. Since the images are spaced apart by not much more than a fraction of a second, the way the neighborhood of a feature point looks in image one is pretty close to the way it looks in image two; it's only moved a thirtieth of a second. So in this top row we can usually get away with, for example, Harris corners and a simple sum-of-squared-differences metric for matching points. That's probably what's really going on under the hood of your typical matchmover: simple feature points, corners and blobs, and a simple metric for matching them. (There's a minimal tracking sketch at the end of this passage.)

Here, on the other hand, I've taken three images that are more widely spaced. In this case something like Harris corners is probably not going to work very well, as we learned, because the viewpoint change introduces rotations, scale changes, affine changes, even non-affine changes. So to match points between these images we may need something more like SIFT. For example, if I want to match this point on the front of the building to this point on the front of the building after I've moved the camera substantially, Harris corners are probably not going to do it; I'd need a more robust detector and descriptor. That said, this kind of problem comes up in structure-from-motion settings where you have a whole bunch of images that were not taken by a video camera, but you still want to figure out where the original cameras were.
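Going back to the video case above, here is a minimal sketch (not the lecturer's code) of building feature tracks with Shi-Tomasi corners, a Harris variant, pushed forward by pyramidal Lucas-Kanade tracking in OpenCV. The input file name is hypothetical.

```python
import cv2

cap = cv2.VideoCapture("shot.mp4")  # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Detect strong corners in the first frame.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                              qualityLevel=0.01, minDistance=7)
tracks = [[tuple(p.ravel())] for p in pts]  # one track per corner

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Push each corner forward one frame; drop points the tracker loses.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    keep = status.ravel() == 1
    tracks = [t + [tuple(p.ravel())]
              for t, p, k in zip(tracks, nxt, keep) if k]
    pts = nxt[keep]
    prev_gray = gray
```

Each surviving entry of `tracks` is exactly the kind of per-feature path across frames that the match-mover feeds to structure from motion. A real system would also re-detect corners as old ones are lost, as discussed below.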
We'll talk about this a little later, but a very cool thing came out a few years ago where you could search a database like Flickr for a keyword, say "Trevi Fountain" or "Statue of Liberty," and get literally millions of tourist pictures of these places. Then you ask: first, can I figure out where all the tourists were standing, and second, can I create a 3D reconstruction of the thing people are looking at from these images? In theory you should be able to, and in fact people have shown that you can. That's not quite the same as the match-moving problem, in the sense that there's no smooth camera path connecting all these images, but the same fundamental ideas apply.

The third image is a picture of a green-screen studio, where, again, in cases where you can physically introduce features into the world, by all means you should do so. Here I went to the green-screen studio and put a bunch of masking-tape crosses on the walls and on this object, so I know exactly where the object I care about is. That's something we talked about at the end of the features chapter: in the real world, visual effects artists will go in and place marks or geometry in the scene for tracking later, and the tracking they mean is for the purposes of later match moving. It would be dumb not to do something like that, especially in a green-screen environment where you've got a huge featureless background.

I'm not going to say much more about feature detection, because we already know how to do that from a couple of chapters ago. One thing I will say, though, is that we now know something we didn't know in Chapter 4: you should always make sure that the features you estimate are consistent with the epipolar geometry. We know that when nothing has moved in the scene, two pictures of the same scene are related by an epipolar geometry that says a correspondence can only lie in certain places. So before you do any match moving, you should weed out any correspondences that don't satisfy that constraint; they're only going to confuse you. (There's a sketch of this filtering step at the end of this passage.) You should also weed out correspondences you're not sure about; it may take a little hand-tweaking to say "this is a good track, this is a bad track." You may also need to add some tracks yourself by clicking on places a tracker might not find. Say you've got the center of a table with no feature there, but you need a creature dancing on that table: maybe you'll place a point by eyeballing it on the table throughout the shot, to build a track close to where you need it.

Another thing to think about is that as the camera moves, features get pushed off one side of the image, and new image texture you haven't seen before appears on the other side. So you should always be searching for new features as the camera moves: as features get pushed off one side, you should be scanning the newly exposed part of the image for new features and trying to match those. So, as opposed to a single image pair, feature matching here is a continuous process of looking for new material.
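Here is the filtering step mentioned above, as a hedged sketch: estimate the fundamental matrix with RANSAC and keep only the correspondences consistent with it. The function name is mine, and `pts1`, `pts2` are assumed to be Nx2 float arrays of matched feature locations in two frames.

```python
import cv2
import numpy as np

def epipolar_filter(pts1, pts2, thresh=1.0):
    """Keep only matches consistent with a single epipolar geometry."""
    # RANSAC needs at least 8 correspondences to fit F.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                     ransacReprojThreshold=thresh,
                                     confidence=0.999)
    inliers = mask.ravel() == 1
    return pts1[inliers], pts2[inliers], F
```

Anything flagged as an outlier here either moved in the scene or was a bad match, and either way it would only confuse the later camera estimation.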
Finally, one thing to be aware of is a situation I'd call a false corner. Here's a picture where I've got a foreground building and a background building, and according to almost any metric you might devise, these two features look like a good match: looking at the local area around the two features, there's nothing that would prevent me from calling this a good match. If I have enough of these, they might even pass the epipolar geometry test by mistake, or they might throw off the epipolar geometry estimation so that I'm confused about the actual geometry. The reason this is a false corner is that even though things look great in 2D, this is not the same thing in 3D: it's close on the front surface of one building, but if I move that feature onto the back building, I'm talking about an extremely different point in 3D. All of this is to say that you don't want to take the features that come out of your detector and descriptor-matching algorithm as gospel; you probably want to do a little hand-tuning before you go on to the next step.

Any questions about that? Let me also say: my book is really more about the theoretical underpinnings of match moving. There's also a great book by Tim Dobbert about the practical aspects of match moving: if you're going to set up a match move, how should you do it, what features should you try to find, what's the best way to set up your cameras so the match move succeeds? If you ever want to get down to the nitty-gritty of setting up your own match-move shot, and you have the freedom to do so, that's the book to look at; it's the hands-on, professional kind of book.

OK, so the first thing I have to discuss before any of this is: how does a camera form an image? So far, all the images in this class we've just assumed come from somewhere; we know they're shot by some camera, but we haven't known the 3D relationship between the world and the image plane. Now I need to make that relationship a lot more explicit. So let's talk about image formation.

Here's my camera, and here's the world. There's an opening right at the end of your lens called the aperture; when you click the shutter, that's what opens to let light into the camera. Behind the aperture there's some surface that forms the image: a point in the world gets projected down through the aperture onto what used to be a piece of film and is now basically a CCD array, a light-sensitive array. The principle behind film cameras and digital cameras is geometrically the same. I think it's easier to draw this from the side. Suppose this is a tree. Let's think about what happens to a point on the tree.
This point comes up through the aperture, so if I think about what the image plane sees, the tree appears upside down. The projection process turns the image upside down and flips it left to right. When you get film developed and look at the negatives, if you've ever done that, you don't notice the flipping, and your digital camera automatically shows you the image right-side up, but physically, the way things form on the image plane is flipped. Instead of dealing with this flip mathematically, what we often do in computer vision is make an equivalent assumption: that the image plane is actually sitting out in space in front of the camera. If I think about it that way, I get exactly the same image, except it's already right-side up. So from now on I'm going to assume there's a camera center; this plane we'll call the image plane; the stuff out there we'll call the scene or the environment; and the distance between the camera center and the image plane we'll call the focal length.

Now, I know that cameras like movie cameras are much more complicated than this; they have lenses that bend and focus light and so on. What I'm showing right now is called the pinhole projection model. You may have heard of pinhole cameras; maybe as a kid you did the science-fair experiment (I don't know if you can still do it) where you buy photosensitive paper, put it in the back of a shoebox in a darkroom, punch a pinhole in the shoebox, and let light stream through; you'll see an image forming on the photo paper. People have done this at crazy scales in cool art installations, with a whole barn: they black out the barn, put something photosensitive on the back wall, drill a little hole in the front, and let light stream in over the course of many hours, and an image forms on the back wall of the barn. So this is the simplest form of image formation. We'll talk a little about how things get more complicated, but let me take a question first.

[Student question: confusion about the focal length.] OK, so there is a little confusion about the focal length; you're probably thinking of the plane of focus, the region in which the image is sharp. Maybe the way to think about it is that the image plane sits at the distance where the image forms sharply, physically on the film; you don't want to think of it as being all the way out at the tree. For our purposes, the focal length is the physical separation between the point where light enters the camera and the position of the image plane.
It is a little confusing to relate "the image is in focus" to the focal length; that's a subtle point, so maybe we'll come back to it.

OK, let me switch back. This is a picture of what I just showed you, with a nicer font. You can see that if the image plane is behind the camera center, things get flipped around, but we're going to assume the image plane is in front of the camera center, so that things project in the same orientation onto the image plane.

Here's another picture of what's going on. We're going to talk about 3D points, for which I'll use capital letters, (X, Y, Z), and every 3D point gets projected down to a corresponding 2D point on the image plane, (x, y): capital letters for the 3D world, lowercase letters for the 2D world. The way I've set this up, there's something called the camera coordinate system: a three-dimensional coordinate system that tells us how to describe points in the world. Initially we're going to assume this coordinate system is centered at the camera center, that its x and y axes are aligned with the image plane, and that the positive z axis points out into the scene, so anything I can physically see with my camera has a positive Z coordinate.

One minor note: the coordinate system as I've drawn it is technically not a right-handed coordinate system; it's left-handed. For those of you who are engineers and grew up with the right-hand rule, most of the time we like right-handed coordinate systems, but in this case, to make our lives easier with the image plane in front of the camera, things are left-handed. It's not a big deal and doesn't really affect you, but I wanted to mention it.

Again, f is the focal length, the physical distance from the camera center to the image plane. Now, if I tell you (X, Y, Z), what is the corresponding projection onto the image plane? We can figure that out by looking at this picture from the side: here's the camera center, this line is the image plane, and here's the point out in 3D space; I'm looking at it from the position of somebody standing off to the side, with the x axis coming out of the board at you. I can figure out what y should be using similar triangles: the 3D point is Z away, the image plane is f away, and the relationship between Z and f is the same as the relationship between Y and y, which is the thing I don't know. So by similar triangles, the pinhole projection equations say that x is to f as X is to Z, and y is to f as Y is to Z.
Rearranging, the unknowns are x = f X/Z and y = f Y/Z. You'll notice that on the previous slide I put a little tilde on these. Why? Because the (x, y) this produces are not the same as the pixel coordinates my actual camera reports. Think about it: X and Z ought to be in the same units for this to make sense, and for a real picture those are on the scale of meters, while the focal length is on the scale of, say, micrometers. So this process gives me (x, y) measured in the same units as the focal length, which are probably extremely tiny: it tells me where on the minuscule physical image plane the 3D point ended up. You may or may not remember how big a 35 mm negative was; these days sensors are certainly no bigger than that, and CCDs are actually very small now. Think about how small the CCD in your smartphone must be. So this gives the physical position of the point on your sensor, but what you want is the actual pixel location.

The actual pixel coordinates look more like this. These terms shift the origin of the image from the center to a corner: when you look at an image in MATLAB or OpenCV, the (0,0) (or (1,1)) pixel is in the upper left-hand corner, and the largest pixel index is in the lower right-hand corner, whereas the projection equation assumes (0,0) is in the middle of the image. In photography and image processing you never think of (0,0) as the middle of the image, so this shifts it. And this pair of numbers scales x and y by the physical dimensions of a pixel. For example, suppose the projected coordinate turned out to be 3 micrometers, and my camera specification says one pixel is 1.5 micrometers wide: then the actual pixel location in the image is 2 pixels. So we have to use this little conversion to get the numbers the camera actually reports.

All of this is encapsulated nicely in what we call the camera calibration matrix, K. K is a 3x3 matrix with these entries, and the alphas are basically focal lengths in units of physical pixels: each is the ratio between the focal length and the width (or height) of a pixel in the x and y directions. So the camera calibration matrix encapsulates everything we need to know about how the camera forms images: it's got the size of the pixels, and it's got what we call the principal point, (x0, y0), which tells us where the center of the image sits.
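Collecting the spoken equations in one place (a reconstruction in the same notation, with tildes marking physical image-plane coordinates):

```latex
\tilde{x} = f\,\frac{X}{Z}, \qquad \tilde{y} = f\,\frac{Y}{Z}
\qquad \text{(ideal pinhole projection)}
```

```latex
x = \frac{f}{d_x}\frac{X}{Z} + x_0, \qquad
y = \frac{f}{d_y}\frac{Y}{Z} + y_0, \qquad
K = \begin{bmatrix} f/d_x & 0 & x_0 \\ 0 & f/d_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}
```

Here d_x and d_y are the physical width and height of a pixel, so f/d_x and f/d_y are the "alphas," the focal length expressed in pixels.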
This can get confusing as I go, so if you have any questions, ask. [Student question: what do you mean, versus the standard projection matrix, as in OpenGL?] Ah, OK. There are two pieces to a camera. One piece is inherent to the camera no matter where I put it in the world; this is a good time to say that these are called the internal parameters. The internal parameters follow the camera around: assume I can't zoom on the fly, so I bought a fixed-focal-length lens; then no matter where I place the camera in the world, it has these properties for how 3D points project onto its sensor. The other piece is where the camera is in the world; those are called the external parameters, which I'll talk about next. When you specify a camera matrix in OpenGL, you're specifying a matrix that combines these two things, and I'll describe how that works in a second. But it's important to separate the two, to make clear that some things are fundamental to the operation of the camera, and some things are independent of the internal parameters: just where the camera is in space.

Let me show you how this produces the image. If I assume everything is represented in the camera coordinate system, I can say that my observed pixel coordinate, and here's a little new notation, a symbol that means "proportional to," or put differently, "equal up to scale," is K times the 3D point. So I take K, which is [f/dx, 0, x0; 0, f/dy, y0; 0, 0, 1], multiply it by my 3D point (Xc, Yc, Zc), and get the vector (f/dx Xc + x0 Zc, f/dy Yc + y0 Zc, Zc). This is a three-dimensional vector defined up to scale; if I want the version scaled so the last element is 1, I divide through by the third element, and I get (f/dx Xc/Zc + x0, f/dy Yc/Zc + y0, 1). And that is exactly the projection equation we derived initially. So the idea is: take a 3D point in the camera coordinate system, multiply by K, and get the 2D image point. That tells me everything about how the camera maps points in the camera coordinate system onto the image plane.

You're going to see this notation a lot in this chapter: something that is fundamentally a two-dimensional quantity, represented as a three-dimensional vector that is proportional to something else. Part of the reason is that projective geometry unfortunately involves a lot of dividing one thing by another; that's inevitable, but writing the equations this way hides the division and makes things much easier to write. I don't have to divide anything if I just say "image coordinates are proportional to K times camera coordinates." This kind of notation, where we tack a 1 onto the end of something that doesn't necessarily need it, is called homogeneous coordinates, and you'll see it throughout this chapter.
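A small sketch of the homogeneous projection just worked out on the board: multiply a camera-frame 3D point by K, then divide by the third coordinate. All the numbers are made up for illustration.

```python
import numpy as np

f, dx, dy = 4e-3, 1.5e-6, 1.5e-6   # focal length and pixel pitch, in meters
x0, y0 = 960.0, 540.0              # principal point, in pixels
K = np.array([[f / dx, 0.0,    x0],
              [0.0,    f / dy, y0],
              [0.0,    0.0,    1.0]])

X_c = np.array([0.2, -0.1, 5.0])   # a point in the camera coordinate system (m)
p = K @ X_c                        # homogeneous image point, defined up to scale
x_pix = p[:2] / p[2]               # divide by the third element to get pixels
print(x_pix)
```

The divide-by-the-third-element step is exactly the hidden division that homogeneous coordinates let us postpone.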
[Student comment: this also makes the nonlinear transformation look linear.] Right, it looks linear, but it's not actually linear; that's a good point. OK, other questions or comments?

So the first thing to address is that real-world images are not formed by ideal pinhole projection, and one reason is the lens. With a normal lens, a real image probably looks like the right image here, with distortion that comes from the non-ideal optics. This usually manifests as what's called barrel distortion, where the image is bowed outward: instead of the checkerboard looking like a nice rectangle, its edges bulge out. The same is true up here: we know this light fixture should be a straight line, but it looks curved. Your brain is used to looking at images like this without noticing, but most cameras have some type of lens distortion. To apply everything that follows, we need to both model and undo this distortion, because the following material assumes ideal cameras.

So, a word about lens distortion. Lens distortion means that things that should be straight appear to bulge out, and the basic model is what's called radial distortion: the distorted coordinates look something along the lines of x_d = x (1 + k1 r^2 + k2 r^4) and y_d = y (1 + k1 r^2 + k2 r^4), where r^2 = x^2 + y^2 (the board work got a little hairy here; this is the model from the book). What we're saying is that the amount of bowing depends on how far the point is from the center: I'm assuming the center of distortion is the (0,0) of the image, and the further a point gets from (0,0), the more bowing you observe. Usually the middle of the image doesn't look too distorted, but the outer edges look bowed out. There are other kinds of distortion too; for example, the center of the bowing may not be the middle of the image, and there's something called tangential distortion; but radial distortion is the basic idea.

[Student question about the squared terms.] Let me explain that better. Look at the radius of the point from the center: the model says the amount of distortion depends on that distance, and I can add terms that are quadratic, quartic, and so on in that distance. Normally you can stop with just a couple of terms; you can even stop with the first term if the image is not too distorted. So to undo the distortion, what you need to do is estimate these lens distortion parameters, or coefficients, and the easiest way is to take an image like this, where I know the checkerboard should be square.
So I can assign world coordinates, or ideal image coordinates, to the true positions of the checkerboard corners, which I know should lie on a rectangular grid; then I observe the distorted checkerboard and use it to undo the distortion, because I see a bunch of examples of correct point versus distorted point. It becomes basically a linear least-squares problem, where I only have two numbers to estimate, k1 and k2, so I've got lots of equations in only two unknowns. It's not hard to undo the lens distortion if I know I'm looking at something that should have been a checkerboard. You can see more details of the derivation in the book, but that's the idea: you know what it should be, you observe what it is, and you undo the difference. One of the most common things to do when shooting a film is to place a checkerboard in front of the camera at the lens settings you're going to use for the actual shot; you photograph the checkerboard, and later you undo the distortion based on that image. That's a very common practice.

OK, now to the question about the OpenGL camera matrix: there's this other concept called the external parameters. So far we've assumed that this is my camera, this is my image plane, and here's a point out in the world, and I denoted those points with little c subscripts to indicate they're in the camera coordinate system. But in the real world we don't always describe points in the camera coordinate system; for one thing, we may not know it. There may be some other coordinate system in the world, and in that coordinate system I'd get a different description of the same 3D point; it's like a change of coordinates. Put another way: standing at the front of the room, it might make sense for me to describe 3D points in this room with myself as the origin, this as the z direction and this as the x direction; but if there's a camera over there, we're going to disagree on the 3D description of a point unless we have a way of converting between our coordinate systems.

That conversion is encapsulated by a rigid motion, meaning a rotation and a translation. This is the camera coordinate system, and this is what I call the world coordinate system, and we convert from the world coordinate system to the camera coordinate system with a rotation and a translation, together sometimes called a rigid motion: if I want to know how a point should be described in the camera coordinate system, I rotate the world coordinates and then push them over with a translation vector, Xc = R Xw + t. Here R is a 3x3 rotation matrix; even though it has nine numbers in it, it's really defined by only three angles, pitch, yaw, and roll, and if you write out the rotation matrix it's full of cosines and sines of those angles, so it really depends on only three numbers. So the whole rotation plus translation involves six numbers: three for the rotation, three for the translation.
This way I can take any point in the world coordinate system and convert it to the camera coordinate system, and once it's in the camera coordinate system I can push it down onto the image plane. Putting it all together: the image point I actually observe is proportional to K times the rotated and translated world point, and a different way to write that is K [R | t] times the scene point with a 1 appended. If you think about how this works, [R | t] is a 3x4 matrix: forget about the K for a second; it's like saying I multiply this 3x3 matrix by the 3x1 point, I multiply the translation vector by the 1, and I add. So this encapsulates the whole thing, and I'm going to call the whole product P: P = K [R | t] is what we call the camera matrix. It's a little confusing, because we have a camera calibration matrix, K, and a camera matrix, P; P is the one that ties everything together. In OpenGL you're probably specifying P: a 3x4 matrix that tells the camera how to interpret points in the world and put them onto the image plane.

Again, P is a 3x4 matrix, and even though it has 12 entries, there aren't actually 12 degrees of freedom, because there are only three degrees of freedom for the rotation and three for the translation, and if you look back at how we formed K, its entries are f, dx, dy, x0, y0; so the way we've written it there are 11 degrees of freedom. In practice, most cameras these days produce square pixels, meaning the aspect ratio, the ratio between dx and dy, is just 1. It used to be that you dealt with weirdly or poorly made cameras with non-square pixels back in the day, and, even crazier, non-rectangular pixels: pixels that were actually little parallelograms due to poor manufacturing or misalignment. There's a parameter called the skew that describes the angle of that parallelogram, but for any camera you'd encounter these days, you can rely on the skew being 0 and the pixels being square.

Putting this all together, if I represent image coordinates by a bold x and scene coordinates by a bold capital X, I can encapsulate the entire image formation process in one really simple equation: x is proportional to P X. Everything is inside the camera matrix P. And this sets up the problem of match moving: we don't know P for a given camera position, so we need to estimate it. In some sense, at every point of the camera's motion we need to estimate the corresponding P matrix. What's fundamentally changing as the camera moves is the rotation and the translation, the external parameters.
If we assume the camera is not continuously zooming, then the K matrix is fixed, and that fixed K will help us with our analysis later on. OK, questions about the setup? [Student question: is the world coordinate system also left-handed?] Yes, in this case the world coordinate system is also left-handed; we'll assume all the coordinate systems are left-handed, because we want this rotation matrix to be a normal rotation matrix with determinant 1. We could do it the other way, with a matrix of determinant -1, but we don't.

What I'll talk about for the rest of the time is what I'd call single-camera calibration: how could I find this P? There are two approaches I'll discuss. The first is actually pretty simple and is called resection, which is a fancy word for estimating P from known x and X. The reasoning is that if I knew a bunch of 3D points, and I also knew exactly where those points landed on the image plane, I'd have lots of information to estimate the camera matrix. Each correspondence gives me three equations in the 12 unknowns of P; actually, that's not exactly true, because one of those equations is redundant, so each correspondence really gives two independent equations in 12 unknowns. And, similar to the fundamental matrix, the camera matrix is only defined up to scale: if I multiplied the whole camera matrix by 12, the equation would still hold, because the up-to-scale relation removes any overall scaling. So from the get-go there are only 11 degrees of freedom to estimate, and some further degrees of freedom can be removed by knowing things like the pixels being square.

If you write this out, and I won't do it in any detail, you get equations like x_i expressed in terms of P11 X_i + P12 Y_i + ..., so you can see where the linear equations come from. If I set up a bunch of correspondences where I know the 3D coordinates and the corresponding 2D coordinates, I can estimate P: I stack the entries of P into a 12x1 vector p, the whole thing turns into a linear system with a 2n x 12 matrix, where n is the number of correspondences, and I solve it the same way we estimated the fundamental matrix or a projective transformation. So it's not hard to do; there's a sketch below.

The thing that is hard is that in the real world you rarely know the 3D coordinates of the things you care about. One way to get those 3D coordinates is, for example, with a laser rangefinder: point it at a bunch of points in the world that you have surveyed. That's a lot like what the surveyors on the side of the road with their instrument are doing: triangulating 3D points with a boresighted imager to figure out how to assign 3D locations to things in the world.
So if you have an accurate survey of the environment, and such a survey might come from, say, a lidar scan, which we'll talk about in Chapter 8, you can get those 3D points if you spend a lot of time measuring; it's not a totally infeasible thing to do. At the end of the story, what I get is P, and I know that P is actually made up of the camera calibration matrix K, the rotation, and the translation. Usually I want not just the camera matrix but also the corresponding K, R, and t. Getting K, R, and t from P is just a linear algebra problem, and I think I'm going to assign it on the next homework. Fundamentally, P looks like a 3x3 matrix next to a 3x1 vector, and that 3x3 block is K R, where K is upper triangular and R is a rotation matrix, also known as an orthogonal matrix. There's a linear algebra result that says there's only one way to decompose a given matrix into the product of these two kinds of matrices; it's related to the QR decomposition, or Gram-Schmidt if you did that in a linear algebra class. Once you do this factorization you have K and R, and then the last column of P equals K t; at that point I know K, so I apply its inverse to get t. That will be a homework problem, but MATLAB knows how to do this factorization for you, so it's not hard; there's also a sketch below.

A more common approach, since resection is all well and good but requires you to know the 3D positions of things in the world, which you typically don't, is what I'd call plane-based calibration. This usually estimates the internal parameters; that's the usual reason you do it, although you get some external parameters along with them. The idea is the following, and it's something Matt has a lot of experience with, since he did it for his undergraduate research project last year: we were in the position of having to calibrate a bunch of cameras, and the easiest way to do that is to show the cameras pictures of a checkerboard. You show the camera multiple pictures of a planar surface, and the most common thing to use is a checkerboard. I mean literally something like this: here I put a picture of a checkerboard on a computer monitor, since I know the monitor is nice and flat. You can also print out a checkerboard and tape it to a flat surface, as long as that surface isn't going to bend. Then you show the camera this checkerboard from a bunch of different positions.

One thing that's important to think about: if I have a fixed camera and I show it a checkerboard at different physical positions, that's fundamentally equivalent to the images I'd get if I kept the checkerboard fixed and moved the camera around.
I'd see the same images either way, so it's just a matter of perspective whether you think of the checkerboard as fixed in space or the camera as fixed in space. In this picture, clearly I moved the camera around physically, because my monitor wasn't moving around my desk; if I'd wanted to, I could have put the camera on a tripod, picked up my monitor, and moved it around, and I probably still would have seen the same checkerboard positions. It doesn't matter for understanding the image formation process of the camera, but it's good to keep straight in your head.

What I've done here, and I'll show you the mathematical derivation in a second, is make correspondences between the corners of the checkerboard squares. That's not hard to do: if you run Harris corners over an image like this, they'll be champs at finding these nice strong corner locations. The other thing you have to do is make sure you match them up the right way, but with these strong grid locations it's not hard to say "this is the upper left-hand corner of the checkerboard" and start counting along to figure out which corner matches which.

OK, let's do a little math. Let's assume the plane is fixed at Z = 0. This is by choice: I'm going to assume the plane is fixed and the camera moves around it, and since I have total freedom to choose my world coordinates (it's an arbitrary plane), I'll just say the plane is at Z = 0. That means any point on the plane has coordinates (X, Y, 0), and I can write the homogeneous version as (X, Y, 0, 1). Now, for any given position of the camera, I've got my K, my R, and my t, and for any given point on the plane, this is the image point I get, again up to scale. If I look at the columns of R, call them r1, r2, r3, and the translation t, you can see that the column r3 is always multiplied by that zero, so I can write this more simply as K [r1 r2 t] times (X, Y, 1); or, put differently, this is just some 3x3 matrix H times (X, Y, 1).

So what have I learned? I've learned that there's a homography, a projective transformation, between a point on the plane in the world and its 2D projection on the image plane. This makes sense, because one thing I told you way back when is that projective transformations are what relate two images of the same plane, and here I have exactly two planes: one is the plane in the world, and one is the image plane; those two planes are related by this H. So for every camera position i we have a projective transformation, a 3x3 matrix H_i, with H_i = K [r1 r2 t]_i at that position; the chain of reasoning is written out below.
What I'm assuming here is that I've got the same camera throughout: the focal length isn't changing and all the internal parameters are the same; the only thing changing is physically where I put the camera in the world. Now I'm going to use these H's to build the corresponding calibration. Let me write this a bit differently: each homography is only defined up to scale, so H_i equals some unknown scale factor times K [r1_i r2_i t_i], where the K is fixed and the rotations and translations change from image to image. (Sorry to all the viewers at home for the scratched-out board work.) Another way to say this is that I can move the K over by taking its inverse: the rotation columns satisfy r1_i proportional to K-inverse h1_i and r2_i proportional to K-inverse h2_i, where h1_i and h2_i are the first two columns of H_i.

And here is the trick. What do I know about these columns? They're columns of a rotation matrix, and the special things I know about rotation matrix columns are, number one, that the norm of each equals one, so each is a unit vector, and number two, that their inner product is zero, because the vectors are perpendicular to each other. So now I have constraints I can apply: one says r1_i transpose r1_i equals r2_i transpose r2_i, and the other says r1_i transpose r2_i equals 0. Following that through implies constraints involving the H's, which I know, and the K, which I don't. Skipping a couple of steps that are in the book: the first equation implies h1_i transpose (K K-transpose)-inverse h1_i equals h2_i transpose (K K-transpose)-inverse h2_i, and the second implies h1_i transpose (K K-transpose)-inverse h2_i equals 0. I realize this is the kind of grungy math that's hard to lecture about, but the main concept is that these are constraints on the matrix (K K-transpose)-inverse. If you write that matrix out, it's expressed in terms of the calibration parameters all combined together; for the moment, call it omega. Omega has five unknown nonzero entries (again, the book has the details), and each equation like this puts one constraint on those five unknowns. So each image gives me two equations in the five unknowns of omega; I collect enough images, solve for omega, and then back out the entries of K.
In the interest of time, I'll refer you to the book to understand what that process looks like mathematically, but fundamentally it's not that bad, and luckily you never have to do it yourself, because there's a great MATLAB camera calibration toolbox, and I believe the same functionality is inside OpenCV. This is the toolbox everyone uses, and it's very easy: you take your checkerboard images, import them into the toolbox, and the toolbox tries its best to extract the corners of the grid after you give a couple of mouse clicks in one of the images. It builds a little 3D coordinate system, finds the corresponding corner points in all the images, runs more or less the calibration process I just described, and comes back with the calibration parameters. It estimates the focal length, the principal point (x0, y0), the skew (which should confirm the pixels are square, not parallelograms), and some lens distortion coefficients if you need them, and it reports how big the error was between where the model thought the corners should be and where they actually are.

It also makes a plot of how much error you generated, in terms of the deviation between what the camera model says the 2D projection should be and what you clicked on in the image. Here you can see, color-coded by camera, that most of the cameras have very low error, but this pink camera had really crappy error; that means I have to go back to the pink camera and make sure those points are really right on the corners. After I do that, I can maybe achieve a distribution where the error is much lower for all the cameras, which is a good situation. Then you can view the result either under the hypothesis that the camera is fixed and you're looking at the plane from lots of different angles, so you see the estimated 3D positions of the plane, or reinterpret it as a fixed plane in space and see where all the cameras were. The great thing for you is that you don't have to worry about coding any of this up; it's a great toolbox and everybody uses it.

In fact, here's the story with Matt: we had to calibrate a bunch of cameras for a project, except the twist was that the cameras were hanging from a 30-foot-high ceiling. So Matt mounted a checkerboard on a binder or something, hung it from strings like a marionette, and maneuvered it under the cameras. We used those images of the checkerboard to undo the lens distortion, and, bringing it back to the last chapter, there was also overlap between the cameras, so we placed feature points on the floor to build a big mosaic of the floor. Maybe next time I'll show the results of undoing the lens distortion for that setup, if I still have them, or what the guys have been doing lately. It's kind of a fun thing; we actually do this in practice on a regular basis; it's not just some theoretical algorithm.

So, any questions about this process?
Again, you can calibrate cameras yourself very easily, and I invite you to do so.
Info
Channel: Rich Radke
Views: 60,974
Rating: 4.9035087 out of 5
Keywords: visual effects, computer vision, rich radke, radke, cvfx, camera calibration, camera matrix, internal parameters, external parameters, lens distortion, plane-based calibration, image formation, camera calibration toolbox
Id: 4-thTdR7Blg
Length: 72min 30sec (4350 seconds)
Published: Thu Mar 27 2014