CVFX Lecture 10: Feature descriptors

Captions
So the topic of today's lecture is feature descriptors. Last time we talked about different ways of detecting features. What was a feature? A feature was basically a good place in an image that we could reliably find in another image of the same scene taken at a different time or from a different perspective.

So, feature descriptors. As we'll talk about a lot in the next couple of chapters, the whole point of finding features is to be able to make correspondences between images of the same scene taken from different perspectives. What that means is that if this is my scene, this is my first view with my little camera, and this is my second view, then if I have a 3-D point out here and I project it back into these cameras, I'd like to be able to say that I can reliably match this location in image 1 with the corresponding location in image 2. For that to happen, I need a way of describing the pixels that appear around that location in image 1 and around that location in image 2 in such a way that they can be directly measured and compared. Ideally, if this is a feature F in image 1 and this is the corresponding feature F' in image 2, I want some sort of algorithm that takes a region of pixels around F and produces a descriptor, which I'll call D for the moment, and I want the descriptor I create around F to be approximately equal to the descriptor I create around F'. The descriptor is really nothing more than a vector, a list of numbers, so we want to figure out what is the best list of numbers to use.

There are lots of easy and immediate choices. The most obvious is: what if I just drew a little box of pixels around the feature location? Here's image 1, here's image 2, and if I just drew little boxes of pixels around the corresponding features, I could directly compare, pixel by pixel, the color values in those two blocks, using the sum of squared differences, for example. That may work okay when the two images are very similar. For example, if I've got a video camera and I'm moving it through the scene, the images are only separated by a fraction of a second, say a thirtieth of a second. There's not going to be that much difference between image 1 and image 2, so a block around the projection of a feature in image 1 is going to be basically the same block in image 2, and there's no big problem with that square of pixels deforming or not being directly comparable.
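As a rough illustration of that naive block comparison (not the code used in the course), here is a minimal numpy sketch; the function name `patch_ssd`, the block half-width, and the grayscale-float-image assumption are illustrative choices, not anything defined in the lecture.

```python
import numpy as np

def patch_ssd(img1, img2, p1, p2, half=7):
    """Sum of squared differences between two fixed-size blocks of pixels.

    img1, img2 : grayscale images as 2-D float arrays.
    p1, p2     : (row, col) feature locations, assumed far enough from the
                 image borders that the blocks fit.
    half       : the block is (2*half + 1) x (2*half + 1) pixels.
    """
    r1, c1 = p1
    r2, c2 = p2
    block1 = img1[r1 - half:r1 + half + 1, c1 - half:c1 + half + 1]
    block2 = img2[r2 - half:r2 + half + 1, c2 - half:c2 + half + 1]
    return float(np.sum((block1 - block2) ** 2))
```

A small SSD means the two blocks look alike, which, as the lecture notes next, is only a safe assumption when the two views are very close together.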
However, we're often interested in what I'll call the wide-baseline case. What does wide baseline mean? This distance here between the cameras is often called the baseline, and by wide baseline I mean cameras that are physically very separated. In that case, there are lots of situations where taking a block of pixels in image 1 and comparing it to the same-size block of pixels in image 2 is not going to work for me. One easy example: suppose one of these images is significantly zoomed in compared to the other. Maybe I should draw this a little better: say this is a guy, and here's a zoomed-in picture of the guy. If I draw a block of pixels around this feature, say I picked up on this guy's eye as a feature, then even if I found the same place in the other image, if I drew the block of pixels at the same size you can see that the two blocks of pixels contain substantially different information. Also, what would happen if these images were slightly rotated? Suppose instead I was looking at an image that was rotated like this; even if I could find this pixel again, not only am I at the wrong size, I'm also at the wrong orientation.

What we need is a way to abstract out all of this confounding information about scale and orientation; in some sense the descriptor should be what we call invariant to this extra stuff. That's why we talk about, for example, SIFT. For those of you that know about it, SIFT stands for scale-invariant feature transform, and the idea is that the SIFT descriptor is constructed in such a way that even if the images are at substantially different scales, I get basically the same descriptor when I'm looking at the same point in the two images. Does that make sense? That's the big picture. What I'll do today is quickly overview the main ways people construct descriptors.

From this picture you can see that the first thing we need to do is make sure we're comparing apples to apples: I need a region around the feature location in both images that contains the same apparent chunk of pixels. We talked last time about how we could use the normalized Laplacian to find the apparent scale of a feature, and that kind of thing is definitely going to help us. Last time we showed that I could find an algorithm that would say: if I landed on this guy's nose and found that this was the apparent scale of the guy's head, then the normalized Laplacian would hopefully give me the corresponding apparent scale of that feature in the bigger, zoomed-in image. So the first thing we're going to do is use this detected characteristic scale to construct the feature descriptor: I could say, okay, I'm just going to compare features inside the same circle.

Now, circles are not so easy to work with compared to squares for the purposes of computer programming, so typically what we do is try to draw a nice square around the pixels we want to compare. To get to the square, we need to decide what the orientation of the feature should be, because I can twirl a circle around and there's no "up side" to a circle, but for a square I do have to say which edge is the top; I need to say which side is up for this feature. For that, we compute what's called the dominant gradient orientation. If I draw my picture of this guy again (I don't know why I landed on this guy as an example), what I'd like to be able to say is: here is the square of pixels I'm going to use and this is the up direction on the square, and then if I found the same feature in the other image, hopefully I could find this square and call this the up direction, so that the two areas of pixels would be directly comparable.
All I have to do then is rotate both of these squares so that their edges are horizontal and vertical, and if I resample the squares to be the same size, hopefully this one-to-one mapping between pixels and corresponding locations in the square works out for me. This is not really that hard to do, so let me show you a picture of one way it's done. Here, for example, if you remember from last time, I had this example with the Japanese lantern, and here is a location that's been detected on that lantern. The radius of the black circle indicates the detected scale of that feature, and you can see this kind of makes sense in that the circle happens to be sitting nicely inside the triangle formed by this white region.

Now what I want to do is find the gradients in a region around this feature. Going back to my drawing: this was a sketch of what that feature looked like, and what I'm going to do is look at some neighborhood of that feature and count up the directions of the gradients. You can imagine I'm going to form a histogram of the directions of these arrows; it's like saying, all you pixels around this feature, tell me what direction you think the gradient is. The gradient, again, points in the direction of greatest increase. So I imagine there's a set of possible angles from 0 to 2π, and for every pixel in this region I increment the corresponding bin by one unit for the angle of its gradient. As I move along the pixels on this edge, I'm popping ones into this histogram, saying lots of pixels are voting for this as the orientation, and maybe I get a few votes here and a few there coming from the other strong edges. Inside regions where there's no strong intensity change, the gradient estimate may not be very good, so I may also get a bunch of noisy contributions in bins that don't really matter.

We can refine this idea a little. One idea is: why would I even bother counting pixels in some flat region? I know their gradient estimates are not going to be very accurate, and I know they don't play a big role in how the image looks. So instead of adding one for every pixel, I can weight the contribution of each pixel by how strong its gradient is. It's like saying: for those of you that have really strong edges, I want to count you a lot more; for those of you that don't, I'll take your opinion into account, but I don't care that much about it. That means that in this case I'm probably going to get a bunch of really strong contributions here and a bunch of very weak contributions there. In the same way, just as we discussed for the detectors, we can also weight the contribution of a pixel by how close it is to the center of the feature: I care more about pixels that are really close to my center point than about pixels that are within my summing-up region but farther away.
So basically, imagine I've got a Gaussian profile around the feature point, and there's a weight that says stuff closer to the center should get more weight. That's what this picture shows: here you can see I've already imposed that Gaussian weight, so the contributions of pixels out here are pretty small. Actually, I think I left out one step, which is that since we have the detected scale of the feature, I can also blur the patch first: if I think this feature is really relevant at that scale, then if I blur with a Gaussian kernel whose size is proportional to that scale, the edges that remain afterward are really the edges I care about. What I'm saying is that I want the gradient orientation that's important at that scale.

After I've taken these gradients and formed them into the histogram, you can see that this black bar is the one that stands out the most, and if I draw a box whose orientation corresponds to that black bar, you can see that the box basically lines up with what I'd imagine is a good orientation for that feature: it's aligned with the strong edge in the image. The idea is that if I looked at this image rotated and scaled and applied the same algorithm, I would get a box with the same apparent orientation in the other image. You can also see that I've made the box a bit bigger than the original circle radius. One way to think about this is that the radius is like the intrinsic scale of the feature, but if I drew a circle with two or three times that radius, the new radius would still be invariant: if I multiply my detected scale by three, and I multiply the detected scale by three in the other image, I'm still looking at the same apparent set of pixels. So what I've done is grow the set of pixels I want to include in my descriptor. Now I'm going to use the pixels inside this square to build up the vector of numbers that describes the feature. What I've got is a box that encodes both the scale and the orientation of the feature, and hopefully, if I see the same place in multiple images, I get boxes whose apparent orientation and scale all match up.

Let me pause and ask for any questions. [Student question: do you have to take the maximum of that orientation histogram, or could you take something like a weighted average?] There are different ways to do it, but I'd say it's better to take the maximum, because if I took the average of these numbers I probably would not get this high peak; what I really want is the mode of the distribution, the orientation where the thing really peaks.
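Here is a minimal sketch of the orientation-histogram step just described, written in numpy; the function name, the 36-bin choice, and the Gaussian width are illustrative assumptions, not the exact settings of any particular implementation.

```python
import numpy as np

def dominant_orientation(patch, num_bins=36):
    """Estimate the dominant gradient orientation of a square grayscale patch.

    Bins gradient angles into a histogram, weighting each pixel's vote by its
    gradient magnitude and by a Gaussian centered on the patch (pixels near
    the center count more). Returns the angle (radians) of the histogram peak.
    """
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)              # image gradients (rows, cols)
    mag = np.hypot(gx, gy)                   # gradient magnitude
    ang = np.arctan2(gy, gx) % (2 * np.pi)   # gradient direction in [0, 2*pi)

    # Gaussian spatial weight: closer to the center means a bigger vote.
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = 0.5 * max(h, w)                  # illustrative width, not tuned
    gauss = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))

    hist, edges = np.histogram(ang, bins=num_bins, range=(0, 2 * np.pi),
                               weights=mag * gauss)
    peak = int(np.argmax(hist))
    return 0.5 * (edges[peak] + edges[peak + 1])  # bin-center angle
```

Taking `np.argmax` here corresponds to picking the mode of the histogram, as discussed in the question above; a refinement would also keep any secondary local maxima.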
What I could do, instead of taking a single mode, is take multiple modes. For example, you could make an argument that the feature could also be reasonably described by a vector that's roughly perpendicular to this one, because there are strong edges going that way too. So I might take the local maxima of this distribution: maybe I would also pick this bar that sticks up sufficiently above its neighbors to be considered a possibility, and I'd have a feature with multiple dominant orientations. But averaging is probably not a good idea, because it will smear out the stuff I really wanted to keep.

Other questions? [Student question about using a circle instead of a box.] You could argue that it makes sense to use a circle instead of a box, but when it comes down to it we're going to be comparing pixels in regions, and it's kind of a pain to compare pixels in circular regions rather than box-like regions. Part of the reason for using a box is that SIFT uses a box, and I'll talk about what SIFT does in a minute. There are algorithms that use circular regions, and I'll talk about those a little later; you can use either, but in most cases I'd still want to estimate the dominant orientation of the feature. So we'll come back to whether we use a circle or a square depending on the descriptor.

The other thing I should mention, the last step, is that for every feature, once I've found its dominant orientation and made this box scaled according to its detected scale, I'm going to resample that box to a constant size. For any feature I detect, no matter how big or small, I make it an n-by-n block of pixels just by resampling. Typically, for whatever reason, the n you see in some of the literature is 41, so a 41-by-41 block; it's not like that's a particularly principled number, but it's the magic number you see in some places. The other thing I want is for these feature descriptors to be invariant to how bright or dark the image is: if I see the same feature on a building across the way in daylight, and then I see it at twilight, I'd like a descriptor that doesn't have that light-to-dark variation baked into it; I want to abstract that out as well. So what I can do is take the new intensities inside the block to be the old intensities minus their mean, divided by their standard deviation. This gives a kind of crude "illumination invariance" (I put that in quotes because illumination is really a very complicated thing that involves 3-D geometry and so on), but fundamentally what I want is something that is invariant to intensity shifts and linear intensity scalings.
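A minimal sketch of that resample-and-normalize step, assuming a grayscale float patch; the nearest-neighbor resampling and the 41-pixel default are illustrative stand-ins (a real implementation would interpolate properly).

```python
import numpy as np

def normalize_patch(patch, n=41):
    """Resample a patch to n x n and apply crude illumination normalization:
    subtract the mean intensity and divide by the standard deviation.
    Uses nearest-neighbor resampling to stay dependency-free; bilinear or
    better interpolation would normally be used.
    """
    patch = patch.astype(float)
    h, w = patch.shape
    rows = np.linspace(0, h - 1, n).round().astype(int)
    cols = np.linspace(0, w - 1, n).round().astype(int)
    resampled = patch[np.ix_(rows, cols)]
    return (resampled - resampled.mean()) / (resampled.std() + 1e-8)
```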
So now, after good feature detection, I've got image 1, which generates a whole bunch of features, and image 2, which also generates a whole bunch of features. Before I talk about the actual method of computing the descriptor, let's think about the matching. Suppose I generate some sort of vector for each of these guys; how do I do the matching? The most obvious thing to do is simply to compare each descriptor in image 1, one to one, with every descriptor in image 2, and compute a metric for how good the match is. For example, if a is a descriptor from image 1 and b is a descriptor from image 2, I could, for a given a, find the b that minimizes the Euclidean norm ||a − b||, which is the length of the difference vector. Sometimes it's easier not to have the square root, so I could just minimize ||a − b||², the sum of squared differences; I'd get the same ranking, so why bother with the square root.

Another thing you could do, which is a little more complicated, is use what's called the Mahalanobis distance between the two vectors: (a − b)ᵀ Σ⁻¹ (a − b). You'd use this if you had some reason to believe that different entries of the descriptor were statistically correlated with each other in different ways. For example, if you knew that the number in the first position was much more important than the number in the second position, the corresponding entries on the diagonal of Σ would be very different; or if you knew that entries 1 and 2 tended to go along with each other in a certain way, that would show up in the off-diagonal terms. I don't want to dwell on this too much, but basically, if I looked at lots and lots of feature descriptors and modeled how correlated they were element by element, I could build this Mahalanobis distance. Most of the time, though, we can safely assume we're using something like the sum of squared differences, which you'll sometimes see called SSD; that's really the Euclidean distance, or the L2 norm. Most of the time you can get away with something that just says: for every feature, find the best-matching feature in the other image, and that creates what's called a correspondence.
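Here are those three distance measures written out as a small numpy sketch; the function names are made up for illustration, and `sigma_inv` is assumed to be an inverse covariance you estimated from a collection of descriptors.

```python
import numpy as np

def euclidean_dist(a, b):
    """L2 (Euclidean) distance between two descriptor vectors."""
    return float(np.linalg.norm(a - b))

def ssd(a, b):
    """Sum of squared differences (squared L2); same ranking, no square root."""
    return float(np.sum((a - b) ** 2))

def mahalanobis_dist(a, b, sigma_inv):
    """Mahalanobis distance, given the inverse covariance of the descriptors."""
    d = a - b
    return float(np.sqrt(d @ sigma_inv @ d))
```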
Now, one thing that makes this process more complicated is: what happens if there are many similar features between the two images? Suppose I have a building with a bunch of windows, and another picture of the same building. Since window corners are generally good features, maybe I find this corner as my feature in image 1, and I find all of these corners as features in image 2, and locally all of them look like pretty great matches to that feature. So I'm going to find a whole bunch of candidates that have very low feature-to-feature distances, and yet I wouldn't really want to trust any of them. What I do is only accept a match if the ratio of the best-match distance to the second-best-match distance is sufficiently small. In the window case, the best match and the second-best match are going to have almost the same distance, which means that ratio is going to be close to one; whereas for a really distinctive feature that I can't find anywhere else in the second image, the best-match distance is going to be a lot lower than the second-best, and that's a better candidate for a good correspondence. So what I might use is what's called the nearest-neighbor distance ratio, and the idea is just to guard against cases where an individual match may look good judging only by the lowest distance, but would be a bad choice to go forward with as a real correspondence. If you want to get fancier you can do things like normalized cross-correlation, which we're not going to talk about right now.

Another thing I'll mention is that there is a strong geometric constraint on where matching features can possibly be. If you think about it, if I have this feature and that feature in one image, it really isn't geometrically possible for this one to match over here and that one to match way over there; there's a constraint that says the matches can't occur at arbitrary geometric locations. We're going to talk about that in a lot of detail in Chapter 5, on what's called epipolar geometry. Epipolar geometry is an encapsulation of the geometric constraints that govern where these matches can occur, so if I wanted to, I could also weed out bad matches by estimating the epipolar geometry. We'll talk about that a lot more next week.
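Before moving on to how the descriptor itself is built, here is a minimal sketch of the matching loop with the nearest-neighbor distance ratio test; the 0.8 threshold and the brute-force search are illustrative choices (a ratio somewhere around 0.7 to 0.8 is commonly used, and real systems replace this O(N²) loop with approximate nearest-neighbor search).

```python
import numpy as np

def match_with_ratio_test(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 to its nearest neighbor in desc2,
    keeping a match only if (best distance) / (second-best distance) < ratio.

    desc1, desc2 : arrays of shape (N1, D) and (N2, D), with N2 >= 2.
    Returns a list of (index_in_desc1, index_in_desc2) pairs.
    """
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # distance to every candidate
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if best / (second + 1e-12) < ratio:          # distinctive enough to trust
            matches.append((i, int(order[0])))
    return matches
```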
Okay, so at this point we can assume that we've detected a bunch of features using the methods we discussed last time, and we've drawn these scale- and orientation-invariant boxes around each feature; now I have to decide how I'm going to describe what's inside those boxes. That's really the heart of making a descriptor. Most of these descriptors are based on building histograms of the pixels inside the boxes, and of these, the SIFT descriptor is really the one you'll come across most often; it's used pervasively in computer vision. SIFT, which stands for scale-invariant feature transform, was invented by Lowe. Basically, I take the feature I found, with its detected scale and orientation, and turn it into my normalized patch over here. Then I divide this box into 16 smaller boxes, a 4-by-4 grid, and within each of these grid squares I build a little histogram of gradient orientations. The picture here shows a very crude histogram; in typical implementations it's an 8-bin orientation histogram. So I've got 16 grid squares and an 8-bin histogram in each square, which gives a 128-dimensional descriptor.

Now, there's some more secret sauce to how this actually works in SIFT. One part, again, is the idea of not weighting every pixel equally, but instead superimposing a Gaussian around the center of the patch so that pixels closer to the outside are weighted less than pixels close to the middle. There's also a little bit of secret sauce for pixels that are really close to the edge of one of the grid squares: it doesn't necessarily make sense for such a pixel to contribute only to its own grid square, because if I drew this grid around the feature in a different image, then just due to errors in my estimation, that pixel might land in the neighboring grid square instead. So what you do is what's called trilinear interpolation, which says that when a pixel is close to a boundary, it should contribute a little bit to each of the neighboring grid squares; if it's centered right in the middle of a grid square, I can be pretty sure it mostly contributes to that square, but as it gets closer to an edge it contributes a little bit to both squares.

I think I have a picture of this. Here's the original feature: again, here is the detected scale and orientation, here I've made a box that is, I think, about six times the size of the detected scale, and here is my rotated box. The little white sticks in each grid square indicate how much weight is in each histogram bin, in the direction the stick points. You can see, for example, that in this grid square there are lots of gradients going along this diagonal, which is why this stick is relatively large. It's not perfect; there are some big sticks that don't immediately correspond to, say, an up-and-down line, and this picture doesn't take into account all the smoothing that has to happen, but in general you want the big sticks to point along the gradients of the image inside the box. And here's the other picture I was showing you: suppose I zoom in on a 2-by-2 array of grid squares and look at this point, which has this gradient orientation. This point should contribute a little bit to all four of its adjacent grid squares, because it's close to that corner. And since I'm doing this very coarse quantization of gradient directions into eight bins, then just to be on the safe side, an estimated gradient angle that doesn't fall exactly at a bin center should contribute a little bit to the two bins on either side. So really, every pixel contributes a little bit to up to eight of the 128 entries of the SIFT descriptor. The details are a little tedious, but that's the way it works if you really get under the hood.
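Here is a deliberately simplified sketch of the 4-by-4-cells-times-8-bins idea; it skips the Gaussian window and the trilinear interpolation described above, so it illustrates the structure of the descriptor rather than being a faithful SIFT implementation.

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, num_bins=8):
    """Build a simplified SIFT-style descriptor from a normalized patch:
    split the patch into grid x grid cells and accumulate a magnitude-weighted
    orientation histogram (num_bins bins) in each cell, giving a
    grid*grid*num_bins vector (128-D for 4x4 cells and 8 bins).
    """
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)

    h, w = patch.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            rs, re = i * h // grid, (i + 1) * h // grid
            cs, ce = j * w // grid, (j + 1) * w // grid
            hist, _ = np.histogram(ang[rs:re, cs:ce], bins=num_bins,
                                   range=(0, 2 * np.pi),
                                   weights=mag[rs:re, cs:ce])
            cells.append(hist)
    return np.concatenate(cells)
```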
The last thing I want to say is that at the end of this, I have what I can think of as a 128-by-1 vector that represents the gradients in that patch. To make sure everything is comparable, I normalize this vector to unit length, so that I don't have one descriptor whose values are much, much bigger than another's. And if I see any big spikes, say one really large entry, the SIFT descriptor clips those large values (Lowe caps the entries at 0.2) and then normalizes again; the idea is to suppress entries that might come from noise or lighting rather than real structure. You could argue about why you do each of these little steps; I'm just telling you how Lowe proposed to do it. Suffice it to say that there have been a lot of experiments on how to make this descriptor as robust as possible to scale and viewpoint changes, and there are a bunch of little bits and pieces you have to keep track of to do exactly what Lowe proposed in his paper.
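A minimal sketch of that final normalization, assuming the descriptor is a 1-D numpy vector; the 0.2 clipping value is the one reported in Lowe's paper, and the small epsilon is only there to avoid dividing by zero.

```python
import numpy as np

def normalize_descriptor(desc, clip=0.2):
    """Normalize a descriptor to unit length, clip large entries, renormalize.
    Clipping suppresses single dominant gradients that may be caused by
    lighting effects rather than true image structure.
    """
    desc = desc.astype(float)
    desc /= (np.linalg.norm(desc) + 1e-12)   # unit length
    desc = np.minimum(desc, clip)            # cap any large entries
    desc /= (np.linalg.norm(desc) + 1e-12)   # renormalize
    return desc
```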
So at the end of the day, how does the descriptor matching work? Here's an example of SIFT descriptors detected for two images. The left image is actually substantially lower resolution than the right image; you can see that the text at the bottom of this lantern is really blocky compared to the text over here, so this image is quite a lot smaller in pixel size than that one. What you see are descriptors that have been automatically matched by this whole process: I found the descriptors and matched them using the nearest-neighbor distance ratio criterion, the one that makes sure I don't keep possible duplicates. You can see that what I get are pretty good correspondences; for example, this guy here has fundamentally the same detected scale and orientation as this guy over there, even though the images were very different sizes, and you can see there's a lot more detail inside this lantern than there appears to be over here. Most of these features are pretty good; there are a few that are not so great, like this one that doesn't seem to match anything over there, and there are a couple of extra ones over here. But this is the point where we can say: okay, maybe 80% of these features are already pretty good, and we can use those as the basis for, for example, estimating the 3-D transformation between these two images. That's what we're going to do in the next couple of chapters: use these correspondences to learn things about the geometric relationship between the cameras that took the images.

One of the homework problems is basically to do a similar experiment: build some sort of scene on your desk, take a picture of it from one angle, take a picture from another angle, zoom in a little, and then use the SIFT code I provided in the homework to try to match the two images and see how well it works. You're probably going to see something that resembles this; it's not like the features you get are going to be amazingly intuitive. You might look at these two images and say, hey, why didn't it pick out these edges of the lantern? I don't have an answer for why those don't get picked up by the SIFT descriptor, other than that they've been weeded out at some point in the process, either during detection or during descriptor matching. All that really matters is that at the end of the day you've got enough good features that do match; if you try to find rhyme or reason in these descriptors, you'll go crazy staring at the images.

Let me pause and ask if there are any questions. [Student comment about the filtering step.] Actually, that's a good point: it's possible that there were good SIFT descriptors here and the nearest-neighbor distance ratio criterion threw them out because it was difficult to distinguish, for example, this corner from that corner. I didn't show the features prior to the filtering process, so that may be a reason why they didn't survive. Other comments or questions?

Okay. So SIFT, when it came on the scene, totally energized the community, because it suddenly seemed possible to match images of the same scene taken under very different conditions. Now, SIFT is really only designed to be what we call scale- and rotation-invariant; technically, if I took images from very different camera perspectives, I wouldn't just have a scale change, I'd also have a perspective change, and there's nothing built into SIFT that makes it automatically invariant to perspective change. But it did seem to work pretty well in practice, so people started using SIFT all the time when designing machine vision algorithms, and people are still using it, although there have been a lot of incremental refinements since then. Let me mention a couple.

One has to do with changing the regions. We can abstract the whole SIFT idea as building histograms over certain regions; in SIFT, those regions happen to be the square cells of a grid, but there are other schools of thought. One is called GLOH, which stands for gradient location and orientation histogram. GLOH, as you can see in this picture, uses a different set of regions, and this comes back to the earlier question about whether you could use circular regions instead of square ones. Here the feature location sits in the middle, I aggregate gradients inside this central disk, and then I have sectors of rings that go further and further out. There are questions about how large the central circle should be and how the ring radii grow; they use what's called a log-polar grid. That gives 8 + 8 + 1 = 17 spatial regions, and inside each of these they quantize the orientation histogram into 16 angles instead of 8, so it turns into a different type of descriptor.

Another possibility people explored is called DAISY. This is the same kind of idea, where we create bigger and bigger regions of aggregation as we move away from the center, and then there are choices about how many "petals" of the daisy to use, how many rings to use, and how far apart the rings should be. What people have done is fairly exhaustive experimental validation: as I tune all these parameters, what gives me the best descriptor matching performance?
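To make the GLOH-style layout concrete, here is a small sketch that assigns an offset from the feature center to one of the 17 log-polar spatial bins (one central disk plus two rings of eight sectors); the radii used here are made-up illustrative values, not the tuned ones from the GLOH paper.

```python
import numpy as np

def log_polar_bin(dx, dy, radii=(6, 11, 15), num_angles=8):
    """Map an (dx, dy) offset from the feature center to a log-polar bin index:
    bin 0 is the central disk, bins 1..8 the inner ring sectors, 9..16 the
    outer ring sectors. Returns -1 for offsets outside the descriptor support.
    """
    r = np.hypot(dx, dy)
    if r >= radii[2]:
        return -1                      # outside the outer radius; ignored
    if r < radii[0]:
        return 0                       # central bin, no angular subdivision
    theta = np.arctan2(dy, dx) % (2 * np.pi)
    sector = int(theta / (2 * np.pi) * num_angles) % num_angles
    ring = 1 if r < radii[1] else 2    # inner or outer annulus
    return 1 + (ring - 1) * num_angles + sector
```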
People have compared the SIFT descriptor against the GLOH descriptor against the DAISY descriptor, and some have been shown to work a bit better in some settings; I think you'll find most people are still using SIFT, but there has been a lot of work on squeezing as much as you can out of the descriptor. In the DAISY case, for example, these are not hard-edged circular bins: you can see the bins actually overlap, and instead of just aggregating everything inside each circle, there's a Gaussian kernel over each circle that down-weights contributions toward its edge.

There have been a few other descriptors that are more related to 3-D feature detection and matching; in Chapter 8 we're going to talk about 3-D stuff, and there are a couple of algorithms called spin images and shape contexts that I'm going to save until we get to that 3-D section.

One thing you'll see a lot is called SURF. One problem with SIFT is that, as implemented, it can be kind of slow: you've got a bunch of Gaussians to compute, you've got this trilinear interpolation going on, and when SIFT first came out it was not so easy to implement on a mobile or embedded processor. So people looked for ways to make features that were faster to compute, and one that you commonly see used is SURF, which stands for speeded-up robust features. The idea is that I still use a grid, the 4-by-4 grid, but instead of computing all these gradient orientation histograms, I compute very simple wavelet responses to the pixels inside each region. I don't want to go into what wavelets are in general; what they use are Haar wavelets, which are nothing more than simple box filters, so applying them to the pixels inside the box is really easy: all I'm doing is adding and subtracting blocks of pixels. That adding and subtracting is much easier to do on a resource-constrained platform than all the detailed gradient-orientation work. So the advantage of SURF is that it's extremely fast, and it has comparable performance to SIFT, which is why you may see people using SURF features in the literature.
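To show why the SURF-style computation is cheap, here is a toy sketch of horizontal and vertical Haar-wavelet (box-filter) responses over a patch: nothing but block sums and differences. This illustrates the idea only; the real SURF descriptor evaluates such responses at sample points inside each grid cell, also accumulates their absolute values, and typically uses integral images to do the sums in constant time.

```python
import numpy as np

def haar_responses(patch):
    """Horizontal and vertical Haar-wavelet (box-filter) responses for a patch:
    simply the difference between the sums of two halves of the block, which
    is why this kind of descriptor is cheap on constrained hardware.
    """
    patch = patch.astype(float)
    h, w = patch.shape
    dx = patch[:, w // 2:].sum() - patch[:, :w // 2].sum()   # right minus left
    dy = patch[h // 2:, :].sum() - patch[:h // 2, :].sum()   # bottom minus top
    return dx, dy
```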
You may also see PCA-SIFT. I think PCA-SIFT is a bit of a misnomer, in the sense that it's reasonable to think: the SIFT vector is always 128-dimensional, so do I really need all 128 dimensions? Maybe I could get away with some sort of dimensionality reduction of the SIFT vector. That's not exactly what PCA-SIFT does; that's what I thought it did when I first heard of it. What they actually did was build a large collection of detected feature patches: they ran a DoG detector on lots and lots of natural images, generated the feature patches, learned what makes a good feature, and then built a basis for projecting a given patch onto this space of good patches. Then you can say: I'm basically projecting my feature's patch onto this basis, and I keep, say, the first 20 or 30 coefficients of that projection. So it requires a machine learning step; you need to get the basis from somewhere before you can use it for your own images. Anyway, I think most of the time these days you're going to see SIFT and SURF.

It's also true, and I don't want to get into this too much, that there are what are called rotation invariants. I'll only mention these in passing because they're a little mathematically complicated, but the idea is this: suppose I found a feature and drew a scale-invariant circle around it. One thing that would be invariant to the rotation of that feature would be, for example, the sum of all the intensities inside the circle; that number is the same no matter how I rotate the circle. If I added up all the gradient magnitudes inside the circle, that would also be invariant. And the idea is that, in a Taylor-series sense, I can describe the pixels in the circle in terms of what's happening at the center, plus the first derivative (the gradient), plus the second derivative, and so on, and by taking more and more derivatives and combining them I can get these so-called invariants. In that case I never have to worry about estimating a dominant orientation, because I'm creating a descriptor that is inherently invariant to rotation; there's no need for that initial orientation-estimation step. The catch is that the more derivatives I take, the noisier and noisier the estimates get. If you look at the book, you'll find references to very detailed studies where people have asked what makes a good descriptor and detector pair; we'll talk about that a little more next time.

[Student question: isn't this kind of like a checksum for the region?] It's actually kind of a neat idea to think of it as a checksum in some sense, although it's really not, since we're still comparing two things rather than verifying one. I don't want to get into this too deeply because it's mathematically complex, but what it comes down to is that there are some ways of taking the pixels inside this circle and creating numbers from them that will always be the same no matter how the circle is oriented. The problem is that the number of such invariants you can extract is not that big compared to the dimension of the SIFT descriptor, and the difficulty of extracting them accurately and consistently in a different image of the same scene makes it harder to get good matches. While it's true, for example, that if I just took the average color inside the circle, I could match that average color somewhere else in a different image of the same scene,
that wouldn't be discriminative enough for me to really do good feature matching; I need more numbers that have to match up between those two circles to do better. So part of it, for me, is the dimensionality of the vector you can create and the quality of that descriptor. I realize that's super hand-wavy, but that's the idea. Other comments or questions?

So let's think about this issue. One kind of problem, and this comes down to the fact that everything I've talked about so far is only scale and rotation invariance, is: what happens when I have some serious perspective change? Even if I were to draw a box around the apparent size of this letter R in both of these pictures, there's really no scaling or resizing of the box that's going to make the contents of this square match up with the contents of that square, because the images differ by a perspective change, and that's just not modeled by drawing a circle or a square around a region of pixels. What we'd need to build in this case is something more like an ellipse. Here's a picture of what are called affine invariant regions. The idea behind affine invariance is that instead of drawing a circle around a feature at its apparent scale, I want to find an ellipse (the yellow regions here) such that, when I look at the two different images, the apparent information contained inside the ellipses is the same. You can see there are cases where the ellipse you get is quite different from the circle, and the ellipses kind of follow along with the dominant edges around the feature: when you've got horizontal structure the ellipses squeeze out to be more horizontal, when you've got vertical structure they squeeze out to be more vertical, and in the matching ellipses, hopefully the same content sits inside each pair.

I don't want to get into the mathematics too much. The reason we do this is that it's needed for high perspective distortion, usually when the images are taken from very wide-baseline cameras that are very far apart. The philosophy is this: affine means not a full perspective change, but something like a combination of rotation, scale, and shear. If this is an affine transformation and this is an ellipse detected around a feature, what we want is that applying the affine transformation to the image and then finding the ellipse around the corresponding feature gives the same thing as taking the ellipse around the original feature and applying the transformation to that ellipse. That basically says I get the same apparent stuff inside the ellipse either way. There's an iterative procedure you can use to go from the original circles, which correspond to the apparent scale, to these ellipses, and it goes through the matrix we used in the feature detection process, the scale-normalized Harris (second-moment) matrix. It turns out that if you look at the eigenvectors and eigenvalues of that matrix, they tell you how the gradients are oriented inside the region, and what I do is slowly squeeze these circles into these ellipses.
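The linear-algebra core of that squeezing step can be sketched like this, assuming you already have the 2-by-2 second-moment (scale-normalized Harris) matrix of the region; real Harris-affine or Hessian-affine detectors iterate this together with re-estimating the location, scale, and matrix, so this is only the one-shot idea.

```python
import numpy as np

def affine_normalize_transform(second_moment):
    """Given the 2x2 second-moment matrix M of a region, return the 2x2
    transform A = M^(1/2) that maps the ellipse {x : x^T M x = 1} to the unit
    circle, so that patches can be compared after removing the local affine
    (shear/stretch) distortion.
    """
    # Eigen-decompose the symmetric, positive-definite second-moment matrix.
    vals, vecs = np.linalg.eigh(second_moment)
    # Matrix square root: stretch each eigen-direction by sqrt(eigenvalue).
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T
```

Warping the neighborhood by this matrix "un-squeezes" the detected ellipse into a circle, which is what lets the corresponding circular patches in the two views be compared directly.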
Once I'm done, I can reverse the process to turn the ellipse back into a circle. So for this feature with the arrow, say I eventually found that this ellipse was the affine invariant region; I turn it back into a circle, and if I compare the pixels inside the circle here to the pixels inside the circle in the other image, they're basically exactly the same. That means I can now build a SIFT descriptor, or a GLOH descriptor, or whatever I want, on top of these circles and feel good about it. So if I really wanted one extra step of affine invariance, I would do this kind of affine adaptation step before building the boxes for the descriptors: I'd first figure out what's going on in the local neighborhood of the feature and turn it into something where I don't have to worry about the affine distortion I might incur. There are a lot of papers that compare these features; for example, there are features called Harris-affine, which come from running the Harris detector, finding the characteristic scale the way we talked about last time, and then figuring out what the ellipse should be. Those features come along with not just a detected scale but a detected ellipse that tells me which region of pixels I should be using. This is a little bit advanced, but it's probably necessary for cases where the images you're trying to match are really far apart; in that case, things like plain SIFT will start to break down.

Now I want to say a couple more things about detectors that don't really fit into the taxonomy I talked about last time. One is this: these are called FAST corners, and it's a very simple idea. Let's step back and think about what makes a good feature: it should be kind of corner-like. The FAST idea was: here's a pixel, I can build this ring of 16 pixels around it, and I say that for this pixel to qualify as a feature, I want, for example, 12 of the 16 ring pixels to be either all darker or all lighter than the pixel in the middle. That would mean, fundamentally, that there's a dark corner surrounding this light chunk, or vice versa. This is a very fast test to apply to an image. Think back to last lecture: one of the problems with those feature detectors is that they can be computationally demanding, with all these Gaussians to compute and all these multi-scale comparisons to make, whereas this is a very, very fast way to detect corners. Even better, I don't necessarily have to compare all 16 pixels: there have been efforts to build what is basically a huge decision tree, where you say, okay, I'm at this pixel, now compare the pixel labeled 1, now compare the pixel labeled 3, now the pixel labeled 12, now the pixel labeled 7, through a series of if-then-else comparisons. Based on some historical machine learning about what makes a good feature, you basically have this hard-baked decision tree that you can quickly apply to the pixels around the center pixel to decide: am I a feature, yes or no? That test is extremely fast, and it can help a lot when you're doing things like robot navigation: you don't have time for all the fancy Gaussian convolutions, you just want to find good corner points quickly. So FAST isn't really related to the stuff we talked about last time, but it's still a reasonably competitive detector for some purposes.
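Here is a brute-force sketch of the FAST test as described (at least n of the 16 ring pixels all brighter or all darker than the center by a threshold t); the ring offsets are one ordering of the radius-3 Bresenham circle, and the thresholds are illustrative. The real detector additionally requires the n pixels to be contiguous on the ring and uses the learned decision tree of comparisons to exit early.

```python
import numpy as np

# Offsets (row, col) of the 16 pixels on a radius-3 Bresenham circle.
FAST_RING = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
             (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def is_fast_corner(img, r, c, t=20, n=12):
    """Return True if at least n of the 16 ring pixels around (r, c) are all
    brighter than center + t or all darker than center - t. Assumes (r, c) is
    at least 3 pixels away from the image border.
    """
    center = float(img[r, c])
    ring = np.array([float(img[r + dr, c + dc]) for dr, dc in FAST_RING])
    brighter = int(np.sum(ring > center + t))
    darker = int(np.sum(ring < center - t))
    return brighter >= n or darker >= n
```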
Another interesting idea is what are called maximally stable extremal regions, or MSERs. This is an idea that's fairly simple. Suppose I take this picture and I want to label all the pixels that are darker than some value; say I start with that value being 20. Compared to the original, this is basically finding all the really dark regions of the image: here, for example, it picks up some of the spots on this fugu fish, but not all of them; all the stuff that's in shadow gets turned to one; the lettering on the sign gets turned to one. So this is the "everything less than 20" image. Now suppose I look at the "everything less than 125" image: there's a lot more stuff that's white. If I go back to the original image, I'm still not picking up anything out here because this is all sunshine, but now I've got all the spots on the fish, I've got the letters on the sign, and there's starting to be some bleeding, because these pixels are all less than 125. And if I take all pixels less than 200, I've got even more of the image. So where is this going? This is related to an idea from image processing called the watershed transformation.

The idea is the following. Let me go to the screen for a second, because this is hard to draw on paper, which is why I made a figure for it. Suppose I consider this point in the image, a point that sits on a character; I'll make it an "a" instead of the Japanese character. What I can do is plot how big the island of binary pixels that pass the threshold test is, as a function of the threshold. Say this "a" has intensity 100 and the region outside it has intensity 200, so it's a dark "a" on a bright background. Now I ask: at this pixel, how many pixels are in the connected component containing it that pass the test, as a function of the threshold? As I go up toward 100, basically nothing much happens; when I pass 100, suddenly all the pixels inside the "a" are included in the set; and then, since all these pixels are locally darker than their surroundings, nothing much happens again until I get to 200, where I get another jump, because as I go above 200 the "a" gets subsumed into a huge white component that contains the background.
So what I want to do is find a threshold where the component is stable, in some sense. This is kind of a crappy picture, but let me set up some notation: say this is the size of the connected component containing pixel i at threshold t. This number is non-decreasing: the greater I make the threshold, the more pixels enter that connected component. What I want is a stable component, meaning that as I change the threshold, the size of that component doesn't change very much; I want the change in size, relative to how big the component is now, to be minimal. You can imagine that what I want is, for example, a really dark blob on a really bright background, so that even as I change the threshold, the size of the connected component around that blob stays fundamentally just the blob size; I'm not gathering in any more pixels, which means the blob is stable with respect to changing the threshold.

Maybe it's easier to show with the pictures. Again, this is the pixel inside the letter, and the size of the component is very small until the threshold reaches the intensity of the letter, where it jumps up; as I increase the threshold some more, I start including more and more of the background. You can see here that the letter is very well defined against the background; if I could show this as a more continuous curve (I'd show it in MATLAB, but I don't have it with me), you'd see that eventually the connected component becomes huge, because it includes all of the other bright background pixels. So what I see is a jump, then a stretch where the component hardly changes, then it starts growing again. If I take this measure of how quickly the blob is changing, what I get for this pixel is a picture like this, which says that right around here, whether I make the threshold this big or that big, the size is hardly changing; this minimum tells me the sweet spot where the blob is the most stable. I wouldn't want to choose this position right here, because if I added just one more intensity level to the threshold, the blob would suddenly get a lot bigger; I want to choose a region where changing the threshold a fair amount on either side makes little difference to the apparent size of the blob. If I choose the threshold corresponding to that minimum, I get this binary mask of pixels that are locally darker than their background and stable over a wide range of thresholds. In some sense, you can imagine that what this does is find either light blobs on a dark background or dark blobs on a light background.
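Here is a deliberately naive sketch of that size-versus-threshold curve and the stability measure for a single seed pixel; real MSER implementations compute this for all regions at once with a union-find, watershed-style sweep rather than re-flooding at every threshold, so this is only to make the definition concrete.

```python
import numpy as np

def component_size_vs_threshold(img, seed, thresholds):
    """For each threshold t, return the size of the connected component of
    {pixels <= t} that contains the (row, col) seed pixel (0 if the seed
    itself does not pass the test). Uses a simple 4-connected flood fill.
    """
    sizes = []
    h, w = img.shape
    for t in thresholds:
        mask = img <= t
        if not mask[seed]:
            sizes.append(0)
            continue
        visited = np.zeros_like(mask, dtype=bool)
        stack = [seed]
        visited[seed] = True
        count = 0
        while stack:
            r, c = stack.pop()
            count += 1
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and mask[rr, cc] and not visited[rr, cc]:
                    visited[rr, cc] = True
                    stack.append((rr, cc))
        sizes.append(count)
    return sizes

def stability(sizes, delta=1):
    """Relative change in component size as the threshold varies; the most
    stable threshold is where this measure is smallest."""
    s = np.array(sizes, dtype=float)
    return np.abs(s[2 * delta:] - s[:-2 * delta]) / np.maximum(s[delta:-delta], 1.0)
```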
This is good in the sense that, number one, it turns out you can find these blobs very quickly using the watershed transform from image processing, so there are computational benefits, which are nice. And once you've got this set of blobs, I can draw a bounding box around each blob and then go ahead and describe that bounding box with a SIFT descriptor, or with a DAISY descriptor, or whatever. So in some sense this is an alternative to a difference-of-Gaussians detector for finding good features; it's customized toward blobs that stand out against their background. I wouldn't characterize it as a super commonly used feature detector, but it is a competitive one, and if you look at the benchmarking datasets you'll always see MSER as one of the things people compare alongside DoG and the others.

The main takeaway here is that you can really mix and match how you detect features and how you describe them. A lot of times people have found, say, that you should use the SIFT descriptor but you'd actually be better off with the Hessian-affine detector; there's no law that says you have to use both the SIFT-mandated detector and the SIFT-mandated descriptor. You're free to change it around.

Next time, which I think will be a relatively short lecture, we'll talk about feature detector evaluation; we'll talk a little about how you introduce features into a scene, when you have the freedom to actually place features in the scene, so that they're as distinctive as possible; and then we'll talk a little about how feature detection is being used in computer vision today. That's where we're going. Any comments or questions about this? Okay, so let me stop my recording; I don't know why I always fail at closing this recording.
Info
Channel: Rich Radke
Views: 20,024
Rating: 4.9018407 out of 5
Keywords: visual effects, computer vision, rich radke, radke, cvfx, sift, sift features, descriptors, feature descriptors, surf features, mser, sift descriptor
Id: oFVexhcltzE
Length: 70min 11sec (4211 seconds)
Published: Mon Feb 24 2014