Feature-based, Direct, and Deep Learning Methods of Visual Odometry

Captions
Okay, thanks for recording. And thanks, everyone, for attending today's visual odometry tutorial. This tutorial should be easy to follow for most people. I will first introduce some fundamentals of computer vision, then talk about traditional visual odometry, and then learning-based visual odometry, and we will find some links between the learning-based and the traditional methods: they all share the same basic knowledge in computer vision, especially geometry-based computer vision, which I'll talk about later. Before I get started, I just want to know how many of you have taken a computer vision class, either here or at a previous university. Okay, some of you, good.

First, I think everything we need to understand about computer vision, and later about visual odometry, whether traditional, direct, or feature-based, starts with the pinhole camera projection model. This part is essential for everything. The pinhole camera projection model is a mapping of a 3D point in the world coordinate system onto the 2D image coordinate system, so it is a mapping from 3D to 2D, but as we will see later this mapping is not linear: there is some homogeneous-coordinate business that prevents it from being a linear transformation. The whole projection can be written as x = P X, where the big X is the 3D world point, the little x is the 2D image point, and P is the projection matrix. This projection matrix consists of two parts, which we will introduce next.

So, in summary, the projection consists of two steps. First we transform from the world coordinate system (in 3D) to the camera coordinate system (also in 3D); this is a rigid-body transformation from 3D to 3D. Then we project the 3D point in the camera coordinate system onto the image coordinate system in 2D, which gives the 2D image that all our cameras produce.

Let's start with the first part, the transformation from the world coordinate system to the camera coordinate system. This 3D-to-3D linear transformation can be expressed by two components: a rotation and a translation. The rotation rotates about the three axes x, y, and z, and the translation moves the origin of the coordinate system from one position to another. In this figure we have a world coordinate system and some arbitrary point P, whose coordinates in the world frame are X_w. The same point also has coordinates X_c in the camera coordinate system. X_w and X_c are related by a rotation and a translation, and we can use a matrix T to represent this linear transformation.
The upper-left block of this matrix T is the 3x3 rotation matrix R, the right column is the 3x1 translation vector t, and the bottom row is (0 0 0 1); with it we can transform the coordinates from X_w to X_c by applying the rotation and translation, and we denote the whole thing by the transformation matrix T.

One important thing to notice is that the R and t defined here describe the linear transformation of coordinates from the world coordinate system to the camera coordinate system; they do not represent the motion of the camera itself. So this R and t are not the rotation and translation of the camera; they are the rotation matrix and translation vector that map coordinates in the world frame to coordinates in the camera frame. We need to be aware of the difference. One example that shows it: we have a 3D point whose world coordinates are X_w and whose camera coordinates are X_c, and, ignoring rotation for a moment, we can see that X_c = X_w - C, or equivalently C + X_c = X_w. This vector C is the translation of the camera itself; it is also the coordinate of the camera center in the world coordinate system. But it is not the translation that maps the point X_w into the camera frame: as we said, that transformation is denoted T, and here t = -C. In general, the relationship between the world-to-camera coordinate transformation and the motion of the camera itself is an inverse: the two matrices are inverses of each other. That is to say, if R and t denote the transformation of coordinates from the world frame to the camera frame, the camera motion is the inverse of T: its rotation is R^T (which equals R^-1, because for a rotation matrix the inverse equals the transpose) and its translation is -R^T t. Those are the real rotation and translation of the camera's motion.
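To make the distinction concrete, here is a minimal NumPy sketch (with a made-up rotation and translation, not values from the tutorial) that builds the 4x4 transformation T, maps a world point into the camera frame, and recovers the camera's own pose as the inverse of T:

```python
import numpy as np

# assumed example: world-to-camera rotation (about the z axis) and translation
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, -1.0, 2.0])

# 4x4 rigid-body transformation T = [[R, t], [0, 1]]
T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = t

X_w = np.array([1.0, 2.0, 3.0, 1.0])   # homogeneous world point
X_c = T @ X_w                          # the same point in camera coordinates

# the camera's own motion is the inverse transformation:
R_cam = R.T                            # camera orientation in the world frame
C = -R.T @ t                           # camera center in the world frame
print(X_c[:3], C)
```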
Next, let's talk about how we project a 3D point in the camera coordinate system to a 2D point on the image plane, that is, into the image coordinate system. We have the pinhole camera model shown here. The real image of a 3D point actually forms upside down on the other side of the pinhole, but to make the problem simpler we place a virtual image plane in front of the camera center and work with that instead.

Here is another illustration of the projection. We denote the 3D point as X_c in the camera coordinate system and the image point as x_image in the image coordinate system. This is the camera center, these axes form the camera coordinate system, and this small 2D frame is the final image coordinate system. Inside this 2D image coordinate system we construct another, very small coordinate frame, to make the derivation simpler, using the point p as its origin. This point p, the intersection of the principal axis z with the image plane, is called the principal point, and its coordinates are (c_x, c_y). In the ideal case (c_x, c_y) is half the image size, because the principal point sits at the center of the real image coordinate system.

If we look only at the plane spanned by y, the camera center, and z, we get similar triangles, and by the properties of similar triangles Z / f_y = Y / y, so y = f_y * Y / Z; the other plane (through z, the camera center, and x) is analogous and gives x = f_x * X / Z. This mapping takes a 3D point to a 2D image point, a mapping from 3D projective space to 2D projective space, and it is not a linear process: to come back from homogeneous coordinates we divide by Z, and that division is what makes the projection nonlinear.

Next we take the principal point into consideration: we add c_x and c_y to the coordinates we obtained in that small frame inside the real image coordinate system. (This may be a little confusing, because we constructed the very small 2D frame inside the real image coordinates; that is why we have to add the principal point coordinates c_x and c_y.) Writing x_image in homogeneous form, we get the final coordinates of the 2D point in the image coordinate system as shown in this equation.

So now we have the 3D-to-3D transformation and the 3D-to-2D projection; later we will combine them into one complete matrix, first rotation and translation, then projection, which is the P we introduced previously. The 3D-to-2D projection itself can be written in matrix form; this matrix is denoted K and is known as the camera intrinsic matrix (the camera intrinsic parameters), sometimes also called the calibration matrix. Its diagonal is f_x, f_y, and 1, and the last column contains c_x and c_y. If we multiply K with the 3D coordinates (X, Y, Z) in the camera frame and normalize the homogeneous coordinates, we get the final representation of the 2D point in the image coordinate system. Here f_x and f_y are the focal lengths along the x and y axes, and (c_x, c_y) is the principal point. In this ideal form there is no distortion in the projection model, but real lenses do have distortion (radial distortion, etc.), and we need to calibrate the camera to get these parameters in order to make the projection more precise.
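As a minimal sketch of this step (the intrinsics below are made-up values, not the tutorial's), projecting a camera-frame point with K looks like this:

```python
import numpy as np

# assumed intrinsics: focal lengths and principal point of a hypothetical camera
fx, fy, cx, cy = 718.0, 718.0, 607.0, 185.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

X_c = np.array([0.5, -0.2, 10.0])   # a 3D point in the camera coordinate system

p = K @ X_c                         # homogeneous image coordinates
u, v = p[0] / p[2], p[1] / p[2]     # divide by Z: the non-linear step
# equivalently: u = fx * X/Z + cx,  v = fy * Y/Z + cy
print(u, v)
```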
Then we combine the 3D-to-3D transformation with the 3D-to-2D projection: we combine the transformation matrix T and the camera intrinsic matrix K and get the final projection matrix P, which is P = K [R | t]. That gives a mapping from the world coordinate system in 3D to the camera coordinate system and then to the 2D image coordinates. Up to now we have finished the camera projection process. Any questions so far? Okay, let's continue.

Next, epipolar constraints. Epipolar geometry is another keystone for visual odometry, because visual odometry has to process multiple frames, not just a single frame; it is not just the relation between a 3D point and a 2D point, but the geometric relationship between different frames. The two frames can coexist in different ways: one image might be the frame taken at time step t and the other at time step t+1, so two consecutive frames, or they could be taken at the same time by multiple cameras, giving projections of the same 3D point into different views at the same instant. The geometric relationship we are talking about is called epipolar geometry, or the epipolar constraint.

First, the epipolar line, a very important definition in the epipolar constraint. If we project the 3D point X into the first view we get a 2D image point, and if we connect the optical center, this image point, and the 3D point we form a ray; projecting this ray into the other view gives the epipolar line shown here. This gives us a very important hint: to find the correspondence, which is the projection of the same 3D point into the other view, we only need to search along the epipolar line in that view. We all know that finding correspondences is vital for getting the camera pose and for the geometric relations that come later, so we can take advantage of the epipolar constraint when searching.

Next, the epipole. Consider a number of points in the first image, x1, x2, and x3. They define different rays r1, r2, and r3, and if we project these rays into the other view, we get three different epipolar lines. These epipolar lines intersect at the same point inside that image plane; this point is called the epipole. Notice that the epipole e2 is also the projection of the optical center of the first view: if we project that optical center into the other view, we get the epipole. Next, let's look at some special cases of epipoles and epipolar lines.
If the camera movement is a pure translation perpendicular to the optical axis, as shown in this image, the two image planes are parallel; as a consequence the epipolar lines are parallel and the epipole is at infinity: the lines l1, l2, and l3 never intersect inside the image because they are parallel, so the epipole effectively does not exist in the image in this case. This is a very common case for stereo cameras: after rectifying a stereo pair the epipolar lines are parallel, so we can find correspondences by searching along one direction, along the epipolar line, which makes matching much easier, and then we can do stereo matching and so on. Another special case is a pure translation along the optical axis. In this case both centers c1 and c2 and the epipoles e1 and e2 lie on the same line, the epipoles have the same coordinates in both images, and the epipolar lines form a radial pattern like this.

Quickly, the epipolar plane: it is the plane formed by three 3D points, the 3D world point X and the two camera centers C and C' of the two views. This plane intersects the two image planes, and those intersections are exactly the epipolar lines, here and here. A final definition in the epipolar constraint is the baseline, the line formed by the two optical centers c1 and c2; its intersections with the two image planes are e and e', which are the epipoles.

Next, the essential matrix. The essential matrix, denoted E, is a 3x3 matrix which encodes the epipolar constraint. It captures the geometric relationship between the two views: the rotation and translation between them are encoded inside the essential matrix. That is why it is so important: if we can find the essential matrix, we can recover the camera motion, the rotation and translation. The essential matrix is the cross-product matrix of t multiplied by R, that is, E = [t]_x R. The notation [t]_x means that a cross product between two vectors can be written as a matrix multiplying a vector; that matrix is the skew-symmetric (cross-product) matrix of t. Here [t]_x is 3x3 and R is 3x3, so E is also 3x3. This matrix E describes the relationship between two corresponding points x and x', written as x'^T E x = 0, so E connects each pair of correspondences. The point is: if we want to compute the rotation and translation between two views of the camera, we basically need to first find the correspondences x and x', use this relationship to estimate the essential matrix, and once we have the essential matrix we can take a singular value decomposition and recover the translation and rotation.
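Here is a small NumPy sketch (with an assumed, made-up relative pose) that builds E = [t]_x R and checks the epipolar constraint on a synthetic correspondence in normalized image coordinates:

```python
import numpy as np

def skew(v):
    # skew-symmetric "cross-product" matrix [v]_x
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

# hypothetical relative pose of camera 2 w.r.t. camera 1 (small rotation about y)
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.2, 0.0, 0.05])

E = skew(t) @ R                      # essential matrix E = [t]_x R

X1 = np.array([1.0, 0.5, 4.0])       # a synthetic 3D point in camera-1 coordinates
X2 = R @ X1 + t                      # the same point in camera-2 coordinates

x1 = X1 / X1[2]                      # normalized image coordinates (z = 1)
x2 = X2 / X2[2]

print(x2 @ E @ x1)                   # ~0: the epipolar constraint x2^T E x1 = 0
```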
Besides the essential matrix we also introduce the fundamental matrix, denoted F, which has a very special relationship with E, shown here. When talking about the essential matrix we usually do not consider the camera intrinsics, but in reality we always have to take the intrinsic matrix K into account. Taking K into consideration and connecting it to the correspondences gives the fundamental matrix: with the same intrinsics in both views, F = K^-T E K^-1, or equivalently E = K^T F K, so the intrinsic matrix is folded inside the fundamental matrix. The fundamental matrix and the essential matrix have this relationship because [t]_x R is exactly the essential matrix. Once we have correspondences x' and x, we can compute the fundamental matrix F; once we have F (and the intrinsics) we can get the essential matrix E; and once we have E we can do a singular value decomposition to get the translation and rotation and localize our camera. This is the essential part of visual odometry and of structure from motion in general. E is a special case of F, in the sense that F additionally takes the intrinsic matrix into account.

Next, some interesting relations between the fundamental matrix, the epipolar lines, and the epipoles. As described before, if we have a point x in one view, the corresponding epipolar line l' in the other view is the fundamental matrix multiplied by the image point in homogeneous coordinates, l' = F x. According to the epipolar constraint, x'^T F x = 0, and substituting l' gives x'^T l' = 0. This tells us that the point x' lies on the line l', which is exactly what we expect: we can find the correspondence on the epipolar line in the other view. Recall that a line can be written as a x + b y + c = 0 with parameters (a, b, c); a point (x, y, 1) lying on the line satisfies exactly that relationship, which again shows that the correspondence in the other view lies on the epipolar line of that view. There are also relationships between the epipoles and the fundamental matrix, but because of time you can read about those after the tutorial.

As we said before, to compute the fundamental matrix we need a set of matches. Given enough matches between the two views, the classic algorithm to compute the fundamental matrix is the eight-point algorithm; it is the most fundamental algorithm for solving the relationship between x' and x and obtaining F. More recently we can solve the problem with fewer correspondences: a publication in PAMI proposed a five-point algorithm that estimates the essential matrix (for calibrated cameras) from only five correspondences.
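As a hedged sketch of how this is usually done with OpenCV (the function names are real OpenCV calls, but pts1, pts2, and K are placeholders you would supply yourself; this is not the tutorial's exact code):

```python
import numpy as np
import cv2

def relative_pose_from_matches(pts1, pts2, K):
    """pts1, pts2: Nx2 float arrays of matched pixel coordinates (placeholders).
    K: 3x3 intrinsic matrix. Returns (R, t) of view 2 relative to view 1."""
    # fundamental matrix from the classic 8-point algorithm
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)
    # fold in the intrinsics to get the essential matrix: E = K^T F K
    E_from_F = K.T @ F @ K

    # in practice one usually estimates E directly (5-point inside RANSAC) ...
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    # ... and decomposes it (SVD plus a cheirality check) to recover R and t
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t
```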
called the Fundamental Matrix Song, which describes the properties of the fundamental matrix, and I want to share it with you: "The fundamental matrix, used in stereo geometry, a matrix with nine entries, it's square with size three by three; it has seven degrees of freedom, it has a rank deficiency, it's only of rank 2. Call the matrix F and you'll see: two points that correspond, column vectors x and x', x' transpose times F times x..." [the song plays]

The book mentioned in this song is Multiple View Geometry in Computer Vision, and a lot of this tutorial comes from that book. I think it is the holy bible of geometric computer vision: many of the fundamentals of geometric computer vision, visual odometry, and structure from motion come from it. If you are interested, you can read it; it can be downloaded freely on the web.

Next, let's briefly talk about the RANSAC algorithm. As we said, if we want to compute the fundamental matrix, then the essential matrix, and then the rotation and translation, the first step is to get feature matches, the correspondences, and with those we can compute the rest of the camera motion. Sometimes there is mismatching between the two frames, so we have outliers: maybe this point and this point are correctly matched (they are both corners of the church), but this point and this point are mismatched; a point on the church matched to the ground is obviously wrong. If there is some amount of outliers among the correspondences, how can we reduce their effect and get a more accurate estimate of the fundamental matrix? That requires the RANSAC algorithm, which stands for random sample consensus. Let's first go through a very simple example to see how RANSAC works, and then see how it contributes to the computation of the fundamental matrix.

Imagine we have several 2D points, and our goal is to estimate a line, the function of a line, that fits as many points in the data set as possible. This is very similar to linear regression in machine learning, but slightly different. What we do is: first we randomly sample two of the data points (two points are enough to define a line) and fit a line model to them, shown here. Then we count how many inliers this line model has: we define a threshold, and if the distance between a point and the line is smaller than or equal to the threshold, we count the point as an inlier. In this example we get six inliers. Then we repeat the process, sample two points again, and in this case we get fourteen inliers, so this model is much better than the first one: it fits more of the data. That is how RANSAC works.
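A minimal sketch of that toy line-fitting loop (synthetic data and made-up thresholds, not the slide's exact numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: points near the line y = 2x + 1, plus a few gross outliers
x = rng.uniform(0, 10, 40)
y = 2 * x + 1 + rng.normal(0, 0.2, 40)
y[:8] += rng.uniform(5, 15, 8)          # outliers
pts = np.stack([x, y], axis=1)

best_inliers, best_model, thresh = 0, None, 0.5
for _ in range(100):
    i, j = rng.choice(len(pts), 2, replace=False)    # 1. sample a minimal set (2 points)
    (x1, y1), (x2, y2) = pts[i], pts[j]
    if np.isclose(x1, x2):
        continue
    a = (y2 - y1) / (x2 - x1)                        # 2. fit the model y = a*x + b
    b = y1 - a * x1
    # 3. count inliers: points whose distance to the line is within the threshold
    d = np.abs(a * pts[:, 0] - pts[:, 1] + b) / np.sqrt(a * a + 1)
    n_in = int((d <= thresh).sum())
    if n_in > best_inliers:                          # 4. keep the best model so far
        best_inliers, best_model = n_in, (a, b)

print(best_model, best_inliers)
```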
Now let's see how RANSAC can be used to compute the fundamental matrix F. In each iteration we randomly choose eight correspondences from all the point correspondences, because the most basic algorithm for computing the fundamental matrix is the eight-point algorithm (or we could choose five and use the five-point algorithm). With these sampled correspondences we compute a fundamental matrix F, and then we count how many inliers fit it, that is, how many of all the correspondences satisfy |x'^T F x| less than or equal to a threshold. Ideally this quantity equals zero, but since there is noise we set the threshold to a very small number and count how many inliers there are for this F. We repeat the process several times and choose the best fundamental matrix and the corresponding best set of correspondences. Once we have the fundamental matrix and the camera intrinsics, we can compute the essential matrix, take its singular value decomposition, and get the translation and rotation, as denoted here.

Previously we talked about how important it is to match different frames, and one cornerstone of building those correspondences is extracting the points we use for feature extraction and matching. So the first question is: given two images from two different views, how do we choose good points for the correspondence search? If we look at this image of a mountain, with sky and a lot of snow, it is better to choose points on the mountain rather than in the snow or the sky, because a point in the sky could match almost anywhere; it is very hard to find a distinct correspondence in the other view. A point at the peak of the mountain, however, is easy to match: we can easily find its correspondence here. It is similar for these two houses: we should choose points from the corners of the windows, the house, or the roof. This gives us the hint that good points for computing correspondences come from corners and edges; such points are called interest points. This is very similar to the human visual system: when we first look at an object, we are more interested in corners, edges, and regions with a lot of texture.

There are some classic, very successful interest point detectors and descriptors, listed here; we will not go into the details. We can group them into three categories. The first is pure interest point detectors, which just detect corners and edges; two of the most successful are the Harris corner detector and Good Features to Track, also known as the Shi-Tomasi corner detector. Second, we have gradient-based detectors and descriptors, namely SIFT and SURF: SIFT is perhaps the best-known gradient-based feature, and SURF is a faster variant of SIFT.
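As a quick example of these first two groups with OpenCV (the image path is a placeholder, and cv2.SIFT_create needs a reasonably recent OpenCV build):

```python
import cv2

img = cv2.imread("frame_000000.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# group 1: Shi-Tomasi corners ("Good Features to Track"), detector only
corners = cv2.goodFeaturesToTrack(img, maxCorners=500,
                                  qualityLevel=0.01, minDistance=10)

# group 2: SIFT, a gradient-based detector *and* descriptor
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(corners), len(keypoints))
```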
Third, we have binary detectors and descriptors. Computing image gradients on the CPU used to take quite some time (less so on modern hardware), and we still want descriptors that can be computed faster, so we have binary detectors and descriptors, since binary computations are much faster than gradient computations. Here we have FAST and BRIEF: FAST is a detector that finds keypoints, the interest points, and BRIEF is a binary descriptor. A very successful combination of FAST and BRIEF is ORB, Oriented FAST and Rotated BRIEF: it adds an orientation to FAST and a rotation to BRIEF, which makes the descriptor more robust. That is why one of the most famous visual SLAM algorithms uses the ORB feature for feature detection and matching.

That is it for feature detection and matching. Now, what if we do not want to match features from one view to another? Can we first detect some keypoints (interest points, corners, edges, whatever) and then just use an algorithm to find where they are in the next image, without searching for correspondences? The correspondence search consumes a lot of time, so a more efficient way is to use an algorithm to compute the location of those points in the next frame directly; this gives us correspondences by another route. That algorithm is the Lucas-Kanade algorithm, or LK algorithm; it was developed here at CMU by Professor Takeo Kanade and his PhD student Bruce Lucas. The problem they try to solve is also called optical flow. What is optical flow? Imagine we have detected some interest points in this image, these four points; we can compute where they are located in the next image frame, and the vector that moves a point from here to here is the optical flow.

Let's talk about the Lucas-Kanade algorithm briefly. Imagine we have more than one frame, frames 1, 2, up to k, with some correspondences between them. We make the brightness constancy assumption: the pixel intensity of the 2D projections of a point does not change, that is, the intensity at this point, this point, and this point remains constant. We also make a small motion assumption. Brightness constancy lets us set the intensity at (x, y, t) equal to the intensity at (x+dx, y+dy, t+dt), and the small motion assumption allows a first-order Taylor expansion of the intensity, which takes us from here to here; using brightness constancy again makes these two terms equal, so the remaining gradient terms sum to zero. We end up with I_x u + I_y v + I_t = 0, where (I_x, I_y) is the spatial gradient of the pixel intensity in the x and y directions, u = dx/dt and v = dy/dt are the velocities of the pixel motion (what we want to compute), and I_t is the gradient of the pixel intensity over time.
Writing this equation more compactly, it can be denoted as the inner product of two vectors equal to minus I_t: grad(I) . (u, v)^T = -I_t. This is just one equation for one pixel, with two unknowns, so to compute u and v we use multiple pixels: for each interest point we take an image patch of its adjacent pixels. If the patch has n pixels, we get a matrix A of size n by 2 and a vector b of length n, where the flow vector (u, v) is the unknown and everything else is known. So we have A (a matrix) and b (a vector), and we can solve for u and v with the pseudo-inverse of A, as denoted here: (u, v)^T = (A^T A)^-1 A^T b.

This algorithm plays a very important role in both tracking and visual odometry, especially when we do not want to obtain correspondences by searching from one image to another: we just compute the optical flow and then know exactly where each point goes, so we have the correspondences, and then we do the same thing as before: compute the fundamental matrix, the essential matrix, the rotation and translation, the camera pose, and thus finish the camera localization, the visual odometry. The paper that further describes the Lucas-Kanade algorithm is "Lucas-Kanade 20 Years On: A Unifying Framework", published in the International Journal of Computer Vision in 2004; the authors are from CMU. If you are interested in further details, you can read that paper.

Next we will talk about how we can compute the camera pose and the depth without even knowing the correspondences; these are called direct methods. There are some drawbacks to feature-based algorithms, feature-based SLAM and visual odometry: computing feature descriptors is time-consuming, especially for gradient-based features such as SIFT, and feature-based SLAM algorithms always discard a lot of useful information in the image, because they only take the interest points to do the feature matching, compute the correspondences, and then the rotation and translation; as a result we cannot get dense correspondences. It is very hard to build dense correspondences from sparse features alone. One possible solution is to use the raw image pixels for matching after running a feature detector; this is called a semi-direct method. It avoids computing descriptors and saves the time spent on feature matching; to do it, we can use the Lucas-Kanade algorithm to track the interest points we detected in one image. Or we can directly match raw pixels without even a feature detector; that is the goal we are interested in in this part.
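Before the derivation, here is a sketch of that semi-direct tracking idea with OpenCV (two consecutive grayscale frames with placeholder file names): detect corners once, then track them with pyramidal Lucas-Kanade instead of matching descriptors:

```python
import cv2

# two consecutive grayscale frames (placeholder paths)
prev = cv2.imread("frame_000000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_000001.png", cv2.IMREAD_GRAYSCALE)

# detect interest points only in the first frame
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=10)

# track them into the second frame with pyramidal Lucas-Kanade optical flow
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(21, 21), maxLevel=3)

good_prev = p0[status.ravel() == 1]   # tracked correspondences in frame t
good_curr = p1[status.ravel() == 1]   # ... and their locations in frame t+1
print(len(good_curr), "points tracked")
```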
There will be some derivation, but this is the last part of the math. Back to our two-view geometry figure: we now call this view the reference frame and this one the target frame; the same 3D point projects into the two views as x and x'. We also have epipolar lines and epipoles, but we are not going to use them this time. For a point in the 3D world coordinate system, X_w = (X, Y, Z), we can back-project the 2D reference pixel to this 3D point using the intrinsic matrix and a depth that we estimate, and then we can warp the point, that is, project it into the target view, if we have the rotation and translation. But this rotation and translation must be estimated; they are not something we already have, they are exactly what we want to estimate. If we perform this warping for every point in the reference frame, we get a whole synthesized image in the target frame. Once we have the warping we have the photometric loss, and by minimizing, optimizing, the photometric loss we get the optimized rotation, translation, and depth. This is a (nonlinear) least-squares problem, and the standard solvers are Gauss-Newton and Levenberg-Marquardt.

That is all for the fundamentals of computer vision; this is the background you need to understand the famous, state-of-the-art traditional feature-based and direct visual odometry and visual SLAM systems. Any questions so far? Okay, cool, let's continue.

Next we will briefly talk about some traditional feature-based and direct visual odometry and visual SLAM systems. The first one is ORB-SLAM, perhaps the most famous feature-based visual odometry / visual SLAM algorithm to date. Its overview is shown in this figure: ORB-SLAM has three threads, tracking, local mapping, and loop closing. The tracking thread does what we described: it first extracts ORB features, then initializes the pose estimation by searching for correspondences, computing the essential or fundamental matrix, and then the camera pose. This thread also tracks the local map and makes the new-keyframe decision: we do not want to use every frame, we want to keep the mapping and the computation concise, so we keep only a subset of frames. If the camera frame rate is high there will be a lot of redundant frames, so we cut the redundant ones and keep only keyframes that see sufficiently different scenes. Once we have a keyframe it enters the local mapping thread, which runs at the same time, in parallel. In local mapping we compute 3D points by triangulating the feature points, creating new map points, and then run local bundle adjustment, which jointly optimizes the 3D points we created and the camera poses. Finally we have the loop closing thread, which optimizes the essential graph and performs loop closure: if we come back to a place we have already visited, we can fuse it with the previous estimate and correct the camera poses, which reduces the drift of the long-term pose estimation.
Some interesting ORB-SLAM results: the correspondence matching here works because of the ORB feature; thanks to ORB we can deal with different scales of the image. You can see this image is enlarged significantly compared with that one, and we can still get good matching between the two frames. There are even dynamic objects: a person is walking from here to here, and such dynamic objects introduce a lot of outliers for the matching, but thanks to the use of RANSAC we can reduce the effect of these moving objects, of these outliers. And here are the results on the KITTI odometry dataset: the estimate basically fits the ground truth.

Next, dense direct methods. The most famous algorithm here is DTAM, Dense Tracking and Mapping. DTAM computes the camera pose and the per-pixel depth by minimizing a photometric error; it is one of the most successful applications of the direct-method ideas we talked about earlier. First, the depth estimation: DTAM constructs a cost volume, which is a summation of per-pixel photometric errors. The photometric errors are what we described previously: given an estimated camera pose and a hypothesized depth, we warp a point from one view into another and form the photometric error; summing these photometric errors over different frames gives the per-pixel cost volume shown here. By minimizing this cost volume we estimate the depth: DTAM defines a depth range, a minimum and a maximum, and searches within this range for the depth that minimizes the photometric cost volume.

There is an interesting observation in DTAM, shown here. It takes samples from three different pixel positions, a, b, and c: at a there is no texture, b is a corner, and c is an edge. We can only easily find the local optimum of the cost volume when there is evident texture, evident features. That means even direct methods require some texture in order to optimize the photometric cost volume well. This figure shows that as we increase the number of images included in the cost volume we get better results: from left to right the number of images increases, and the more images we use, the better the depth estimate. Camera pose tracking is then done by minimizing the following term, another kind of photometric loss, a variant wrapped in an L2 norm. One thing to notice is that the DTAM system estimates the depths and the poses in an interleaved manner: we first estimate the depth, then with the optimized depth we optimize the camera pose, then with the new pose we estimate the depth again, alternating for each pair of consecutive frames coming into the system.
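To make the per-pixel photometric error concrete, here is a toy NumPy sketch (assumed K, R, t, and depth; nearest-neighbour lookup with no bounds checking, which a real system like DTAM would of course handle properly):

```python
import numpy as np

def photometric_residual(I_ref, I_tgt, K, R, t, depth, u, v):
    """Photometric error for one reference pixel (u, v), given an assumed
    depth and an assumed relative pose (R, t) of the target camera."""
    # back-project the reference pixel to a 3D point in the reference camera frame
    p_ref = np.array([u, v, 1.0])
    X_ref = depth * (np.linalg.inv(K) @ p_ref)
    # transform into the target camera frame and project with the intrinsics
    X_tgt = R @ X_ref + t
    p_tgt = K @ X_tgt
    u2, v2 = p_tgt[0] / p_tgt[2], p_tgt[1] / p_tgt[2]
    # intensity difference between the warped location and the reference pixel
    return float(I_tgt[int(round(v2)), int(round(u2))]) - float(I_ref[v, u])
```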
talk about another direct method, a semi-dense direct method called LSD-SLAM. The framework is shown here, and it is pretty similar to what ORB-SLAM does: a tracking thread, a depth estimation thread (which is basically the mapping, the 3D reconstruction thread), and map optimization, which minimizes a photometric loss over many different frames. One difference between LSD-SLAM and the DTAM system we just discussed is that LSD-SLAM does not use every pixel in the images: it extracts only the pixels with high gradient and uses those for the camera pose estimation and the depth estimation. That is why it is called semi-dense. The optimization goal is again the photometric loss, similar to DTAM: we minimize the photometric loss and thereby optimize the camera pose and the depths. Here are some qualitative results: this is the reconstructed point cloud. It looks a little bit messy, because only the high-gradient pixels are used, so the reconstruction is not continuous; you do not get the surface as a whole, and that is why it looks messy here.

Finally, let's briefly talk about learning-based visual odometry. In this tutorial it consists of three parts. The first is supervised learning-based visual odometry: we have ground truth, we use a neural network to regress the camera pose, we use the ground truth for supervision, and through backpropagation we update the weights of the neural network so it estimates the camera pose better; that is the idea of supervised learning. Then we have the self-supervised learning approach, which uses the photometric loss: without ground truth we cannot use a traditional supervised loss, but we can instead form a photometric loss by warping from one frame to another with the estimated poses and depths, and by minimizing it through backpropagation we learn to estimate the camera pose and depth. And we also have hybrid methods, which take advantage of both the depth-estimation ability of a neural network and the traditional geometric methods, and combine the two. Part of this content will be a little bit similar to what we discussed in the last lab seminar, where I introduced the direct methods.

Let's first talk about supervised methods. The first algorithm I want to introduce and share with you is PoseNet, a publication at ICCV 2015.
This is one of the first approaches to use a neural network for camera pose estimation. The idea behind it is to use a convolutional neural network that takes an image as input and regresses the camera pose. Since it is a supervised approach, we have ground-truth camera poses and use them as supervision, which gives the loss shown here: the approach splits the estimate into translation and rotation parts and links them with a weighting parameter beta. The network architecture they use is modified from GoogLeNet.

Let's look at some results. They collected several datasets themselves; the data collection process is basically to go outside with a camera, take some videos, feed these videos into a traditional structure-from-motion / SLAM pipeline to get camera poses, and use those poses as labels for training the neural network. One of the most important things to notice is that they used a traditional structure-from-motion / SLAM algorithm to label the videos used to train the network. They collected several sequences: King's College, Street, Old Hospital, Shop Facade, and St Mary's Church. This is a bird's-eye view: the green trajectory is the training frames, the blue one is the testing frames, and the red trajectory is the predicted camera poses. Our goal is for the predictions to be as close as possible to the testing frames, the ground truth. For King's College and Street the estimates and the ground truth are very similar; they basically overlap a lot. But for Old Hospital and the shop, the estimates and the ground truth in the testing phase are quite different; there is a lot of difference between the red dots and the blue dots. And then for the church sequence, the blue dots and the red dots are again quite similar. Can anyone tell me why this happens? Any idea? Okay, nobody. "You should start picking random people if there are no comments." "Is it because the trained model cannot estimate the rotation very reliably?" That is perhaps one reason. "And is there scale drift?" I think they did a seven-degree-of-freedom alignment, so I did not consider scale in this particular case. One important reason is the distribution of the training and testing data: when they are very similar, like in this sequence and this one, we can easily get results close to the ground truth, because here and here the training and testing trajectories have a lot of overlap, a lot of similarity in distribution. In those other two sequences the training and testing trajectories do not have so many intersections, not so much similarity in distribution. That is one of the important reasons.
Also, one of the biggest drawbacks of this method is that if the training and testing distributions differ too much, the estimation fails, as in this sequence and this one. Their idea is very simple, maybe too simple: input a single image, output a camera pose, and supervise with the ground-truth pose; it does not consider the relation between different frames at all.

That is why we have another method, DeepVO, "Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks", published at ICRA 2017. One of their biggest contributions is that the convolutional neural network does not take a single frame but two images stacked together. As we discussed in the fundamentals, the camera pose is a relationship between two frames; it is not something a single frame can give you. So they stack two consecutive images, feed them into a convolutional neural network to extract features from the pair, and then feed these features into a recurrent neural network, which also carries information from the subsequent frame pairs, and finally output the camera pose. As a result, because they use the information of two adjacent frames (and of the sequence through the recurrent network), they improve the camera pose estimation compared with the PoseNet approach we saw before. These are experimental results on KITTI, also compared with some traditional methods: the black dots are the ground truth and the blue lines are their method. The result is not perfect, but compared with PoseNet it is a significant improvement. For training they use all the sequences other than these four, which are used for testing: the KITTI odometry dataset has 11 sequences with ground truth in total, and they use seven for training and four for testing.

Next, a self-supervised approach, SfMLearner (unsupervised learning of depth and ego-motion from video). This is perhaps the most famous self-supervised method for learning-based visual odometry; I have shared this paper with many people many times, because it is really interesting and it has played a very important role in the development of the self-supervised learning approach. The idea is that training only needs unlabeled video clips, shown here, and there are two convolutional neural networks: one estimates the depth and the other estimates the camera pose, the rotation and translation. The paper was published at CVPR 2017. The key idea is very similar to what we talked about in the direct methods: a depth CNN estimates the depth of every pixel, and the two estimated relative camera poses are used to project into the two adjacent frames, I_{t-1} and I_{t+1}. That gives the photometric loss between these frames, and by minimizing this photometric loss through backpropagation we can optimize the weights of the neural networks and further improve the estimation of the depths and camera poses.
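A minimal sketch of the view-synthesis step this kind of self-supervised training relies on (assumed K, relative pose, and depth map; bilinear sampling and out-of-bounds handling are omitted for brevity):

```python
import numpy as np

def warp_coordinates(depth, K, R, t):
    """For every pixel of the reference view, compute where it lands in the
    target view, given a depth map and an assumed relative pose (R, t).
    Returns two HxW arrays (u_target, v_target) of sampling coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)   # 3 x HW

    X_ref = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project
    X_tgt = R @ X_ref + t.reshape(3, 1)                     # move to target frame
    p = K @ X_tgt                                           # project
    u_t = (p[0] / p[2]).reshape(H, W)
    v_t = (p[1] / p[2]).reshape(H, W)
    return u_t, v_t

# training then samples the target image at (u_t, v_t) to synthesize the
# reference view and penalizes the photometric difference (the photometric loss)
```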
Here is the network architecture; it is not that complex either: an encoder-decoder for the depth estimation, and for the pose estimation a similar encoder, but in the middle the features are taken out and passed through fully connected layers to output the rotation and translation. They also estimate an explainability mask with another network, which excludes pixels that do not fulfill the photometric consistency we talked about earlier, because moving objects and other parts that violate photometric consistency would have a bad effect on the overall estimation of the camera poses and depths.

The final paper I want to share is DVSO. This is a hybrid approach; I also shared this paper in the last lab seminar. It combines the advantage of depth estimation with a neural network and the advantage of traditional direct visual odometry. The method takes stereo frames during training but runs with a monocular camera at test time. One of the biggest advantages of using a neural network for the depth estimation is that you do not need to guess randomly, as LSD-SLAM and DSO do: those two traditional direct approaches initialize the depths randomly, whereas this hybrid method, DVSO, gets a decent depth estimate at the very beginning of tracking and then uses these depths in the further optimization of the camera pose. The rest of the components are very similar to LSD-SLAM: keyframes, and joint optimization of the camera poses and depths with the photometric loss. This is the depth-estimation network structure, and here are results on KITTI; there is a clear improvement in the comparisons, and I think the major improvement comes from the application of the network-predicted depths.

Okay, that is all for the theoretical part; now let's do some coding and relax. For those who already have epipolar.py: this Python script shows how to compute the fundamental matrix and draw the epipolar lines based on the fundamental matrix we computed. It loads two images; we can load a stereo pair, or two images from the time sequence. If we load the stereo pair, left and right, we compute ORB features, match them with a brute-force matcher, compute the fundamental matrix from the matched points, select the inlier points, compute the epipolar lines, and plot them. If you run this script (make sure you have the correct directory for your dataset), first loading the left and right images of the stereo pair, you should get results similar to this: the epipolar lines are parallel, because, as we said, the stereo pair is rectified and there is only a small motion between the two cameras, along the baseline, parallel to the image plane (perpendicular to the optical axis), which gives parallel epipolar lines.
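A hedged sketch of what a script like that typically does (the image paths are placeholders, and this is not the exact tutorial code):

```python
import numpy as np
import cv2

img1 = cv2.imread("kitti_left.png", cv2.IMREAD_GRAYSCALE)    # placeholder paths
img2 = cv2.imread("kitti_right.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# brute-force matching on the binary ORB descriptors
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# fundamental matrix with RANSAC; the mask marks the inlier correspondences
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
pts1_in, pts2_in = pts1[mask.ravel() == 1], pts2[mask.ravel() == 1]

# epipolar lines in image 1 corresponding to the inlier points of image 2
lines1 = cv2.computeCorrespondEpilines(pts2_in.reshape(-1, 1, 2), 2, F)
print(lines1.reshape(-1, 3)[:5])   # each line as (a, b, c): a*x + b*y + c = 0
```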
Now if you change the data a little bit, so that one image is the left image and the other is the image next to it in time, the two form a monocular time sequence. Let's see what the fundamental matrix and the epipolar lines look like in that case. If you run the script you should see a pattern similar to mine: the epipolar lines form a radial pattern, they all intersect at the same point, which is the epipole shown here.

If you have finished this small experiment, let's move on. Next we will extract ORB features and match them, to see the image-matching process that is the foundation for the essential matrix and, later, for the rotation and translation estimation. Again we load two images; we can load left and right, or the left image and the next one in time, the same pairs as in the epipolar example. If you run this script, OpenCV has built-in methods that draw the correspondences directly, which gives a nicer visualization, shown here. This is basically similar to what we did before, but previously we drew the epipolar lines, and this time we just connect the correspondences with lines: these are not epipolar lines, just connections between the two matched features.

"I have a simple question about the epipolar example. If we supply two images from two timestamps, one image and the next one, all the epipolar lines are concentrated at a single point." Yes, I think that is what we expect to get. "I mean, shouldn't that intersection point be the principal point of the camera?" Ideally yes, but I think in this particular case the translation is a little bit different from a motion that is exactly perpendicular to the image plane, that is, purely along the optical axis, which is why the epipole is not at the principal point. "And for the feature matching, we are not using RANSAC, is that right?" I think RANSAC is used in the fundamental matrix estimation: if you take a look at epipolar.py, the fundamental matrix is computed with cv2.findFundamentalMat using the RANSAC flag, which gives a better estimate of the fundamental matrix.

If you have successfully downloaded the KITTI dataset, sequence 09, image_2, you can continue with the visual odometry part. It consists of two scripts: one is the main visual odometry script, and the other contains pose-evaluation utilities, which are used
If you have successfully downloaded the KITTI dataset, sequence 09, image_2, you can continue with the visual odometry part. The visual odometry part consists of two scripts: one is the main visual odometry script, and the other contains pose evaluation utilities, some of which are used to read and load the poses; it also includes the transformations between quaternions and Euler angles, and so on.

Let me show you what it looks like. The algorithm is pretty simple: we write a loop and continuously read images from the image sequence, extract features and do the feature matching, compute the essential matrix, and from it the rotation and translation. We then convert this rotation and translation into the camera motion, accumulate and draw the camera motion, and finally save the camera poses to a txt file (a minimal sketch of this loop follows after the questions below).

If you run this file, you get three windows. The first shows what the ORB features look like; you can adjust how many features to extract, and the more you extract, the more computationally expensive it becomes. Another window shows the matching between two images; this is the monocular case, so you simply match two consecutive frames. The third window shows the trajectory. It is a little slow because it is written in Python; you could do the same thing in C++. For simplicity of illustration I decided to use Python for this tutorial, but it is essentially the same thing, since the Python module is just a binding of the C++ OpenCV; rewritten in C++ it would be much faster.

Question: I suppose the CSV file is the ground-truth odometry?
Answer: Yes. This is a small modification to the original dataset: I put a ground-truth pose file there, but you do not actually need it for this code.
Question: It would be better to include the ground truth so you can compare against the estimated poses. Where is the poses file, the sequence txt file in the files folder?
Answer: It should sit in the same directory as the code. In my case, inside the sequence 09 folder you can see the image_2 folder, and there is a file called poses.csv; that poses.csv is the ground truth.
Question: In the code I think you are loading the txt file, not the csv file.
Answer: Yes, in my case; take a look at the code, lines 112 to 116, which transform the pose into the TUM format that is often used: the first field is the timestamp, then the translation tx, ty, tz, and then the quaternion for the rotation.
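The main VO script itself is not pasted here, so below is a minimal sketch of the loop just described. The function names, the fixed scale, and the choice of ORB with a brute-force matcher are my assumptions for illustration; K is the camera intrinsic matrix for the sequence, and in the monocular case the translation scale is unknown, so the recovered trajectory is only defined up to scale.

```python
import cv2
import numpy as np

def match_frames(img1, img2, nfeatures=2000):
    """Detect ORB features in both frames and return matched pixel coordinates."""
    orb = cv2.ORB_create(nfeatures=nfeatures)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    return pts1, pts2

def run_monocular_vo(image_paths, K, scale=1.0):
    """Chain frame-to-frame motion into a (scale-ambiguous) camera trajectory."""
    cur_R = np.eye(3)
    cur_t = np.zeros((3, 1))
    trajectory = [cur_t.ravel().copy()]
    prev = cv2.imread(image_paths[0], cv2.IMREAD_GRAYSCALE)
    for path in image_paths[1:]:
        cur = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        pts_cur, pts_prev = match_frames(cur, prev)
        # Essential matrix via the five-point algorithm inside RANSAC.
        E, mask = cv2.findEssentialMat(pts_cur, pts_prev, K,
                                       method=cv2.RANSAC, prob=0.999, threshold=1.0)
        # Decompose E and keep the (R, t) that passes the cheirality check.
        _, R, t, mask = cv2.recoverPose(E, pts_cur, pts_prev, K, mask=mask)
        # Accumulate the relative motion into the global pose.
        cur_t = cur_t + scale * cur_R.dot(t)
        cur_R = cur_R.dot(R)
        trajectory.append(cur_t.ravel().copy())
        prev = cur
    return np.asarray(trajectory)

# The estimated positions can then be saved to a text file, e.g.:
# np.savetxt("trajectory.txt", run_monocular_vo(image_paths, K))
```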
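For comparing against the ground truth, here is a sketch of loading poses in the standard KITTI odometry format (one line per frame, twelve space-separated values forming the flattened 3x4 [R|t] matrix). The tutorial's own poses.csv may use a different layout, so this is only an assumption about the usual KITTI format.

```python
import numpy as np

def load_kitti_poses(path):
    """Load KITTI-style ground-truth poses as a list of 4x4 matrices."""
    poses = []
    with open(path) as f:
        for line in f:
            vals = np.array(line.split(), dtype=np.float64)
            if vals.size != 12:
                continue  # skip headers or blank lines
            T = np.eye(4)
            T[:3, :4] = vals.reshape(3, 4)  # [R | t], camera-to-world
            poses.append(T)
    return poses
```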
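And a sketch of the pose-to-TUM conversion referred to in the last answer, using SciPy for the rotation-matrix-to-quaternion step. The helper name and the exact formatting are mine, not the tutorial's code at lines 112 to 116.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_tum_line(timestamp, T):
    """Format a 4x4 pose matrix as one TUM-style line:
    timestamp tx ty tz qx qy qz qw
    """
    t = T[:3, 3]
    q = Rotation.from_matrix(T[:3, :3]).as_quat()  # SciPy returns (x, y, z, w)
    fields = [timestamp, t[0], t[1], t[2], q[0], q[1], q[2], q[3]]
    return " ".join("{:.6f}".format(v) for v in fields)

# Example: write all ground-truth poses in TUM format (file names are placeholders).
# with open("poses_tum.txt", "w") as f:
#     for i, T in enumerate(load_kitti_poses("poses.txt")):
#         f.write(pose_to_tum_line(float(i), T) + "\n")
```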
I think we are almost out of time. Great, it seems to be working. To wrap up: some of the code comes from the OpenCV tutorials in Python and some from open-source code on GitHub. Another code repository I really want to recommend is shown here: pySLAM, which was released recently. It is written in Python, implements basically every algorithm, mainly the feature-based ones, and also has bindings for the SLAM back end, so it is a fairly large project; I recommend reading it after the tutorial. It has a binding to Pangolin for visualization, shown here, and a Python binding for the g2o back end. g2o is a C++ tool for back-end graph optimization; it is written in C++, but pySLAM uses a Python binding to make it available from Python. I highly recommend reading some of the code and running the experiments. More importantly, it also shows how to use learning-based features for visual odometry, as you can see here: it supports many features such as ORB, ORB2, SURF, BRISK, and so on, as well as learning-based features like SuperPoint, which I think was introduced in yesterday's session.

I do not think we have time for the learning-based part, but if you follow the instructions in the file I shared, you will find an open-source implementation of the structure-from-motion learning network.

Question: Since we are out of time, can you add some information on pySLAM to your slides, add the link to the website, and also put up the code you used? Maybe we can take that KITTI sub-sequence and package everything up in one place, so someone could quickly replicate this next time.
Answer: Sure, no problem. One last thing: I thought we would have time to train and test the structure-from-motion learning network, but since we are out of time, I highly recommend that you train and test it yourselves. It performs monocular depth estimation and pose estimation with an end-to-end neural network, trained with the photometric loss that we discussed in the tutorial.

Okay, thanks everyone for attending today's tutorial, and sorry for any inconvenience with the data and the code. If you have any comments or questions, I am happy to answer them.
Info
Channel: AirLab
Views: 6,489
Id: VOlYuK6AtAE
Length: 114min 24sec (6864 seconds)
Published: Fri Sep 04 2020