Optical Flow - Michael Black - MLSS 2013 Tübingen

Thanks, Bernhard. I'm delighted to be here. That seems a little loud; is that too loud? I tend to shout when I get excited while lecturing, and I also sometimes go kind of fast. Bernhard told me I've actually lost 15 minutes already, so I don't know yet what I'm going to skip; it'll be a little bit dynamic.

The other thing I'd like to do first is get a sense of the audience. I know all of you are here because you're interested in machine learning, and I'm going to talk about a computer vision problem called optical flow. The secret I'll let you in on right off the bat is that there's been almost no work on machine learning in optical flow. That's not entirely true, and I'll point out a few things and we'll talk about why that may be the case, but it means there is a huge opportunity here to do something new: to take everything you've learned in the last two weeks and apply it to this wonderful problem of optical flow estimation.

Before I get started: how many of you know what optical flow is? Okay. How many of those who know have used an optical flow algorithm for some purpose? Okay. And how many of you have actually implemented an optical flow algorithm? Okay, so it's like two people who will be bored in the beginning.

So what I'm going to do is introduce you to the problem: what it is, why people may care about it, and then some of the introductory material. This is not advanced-level optical flow. I'll do the introduction and take you through the formulation that I'd consider the classical one. If you pick up a paper on optical flow to read, you'll want to understand this classical formulation. That doesn't mean it's the right formulation, and this is very important, particularly when you start thinking about applying machine learning algorithms; maybe the classical formulation is not the right formulation, but it's important to understand it. Then, for those of you who get excited about this, there will be code, and you can take it away, play with it, improve upon it, and have a foundation to start with, and it's decent code. And then I'll give you an idea of where machine learning might play a role.

The term optical flow goes back at least to the work of the psychologist Gibson and his work on ecological perception. The idea is that as I move through the world, or a bird or any kind of animal moves through the environment, there's a changing pattern that spreads across what Gibson called the optic array; in a camera it would be the image plane. You have this constantly changing pattern as you move through the world, and this pattern induces what Gibson called a flow. This flow is the motion of the luminance pattern on the image sensor. The reason Gibson thought this was important is that, in his mind, it gave direct access to properties of the world that matter for an organism's survival. For example, if I'm an organism that happens to be an airplane and I want to land on this runway, the pattern of flow shown here by these vectors gives a sense of what I'm approaching: that I'm approaching this location here and I'm not about to crash into the ground. This might be useful if you're a rabbit and you want to hide in a hole; as you approach a hole there's a particular, distinct pattern of optical flow that you might recognize and be able to exploit. If a lion is jumping out to eat you, that will also present a distinct pattern of optical flow that's probably worth recognizing.
This notion is grounded in what Gibson called ecological optics, but there's also a mathematical definition. Let's consider a very simple world in which we have a camera, with a focal length and an image plane, and there are only six things the camera can do: it can translate in x, y and z, and it can rotate about x, y and z, so there are six degrees of freedom. Imagine I have a point out in the world, capital P, and it projects to some point, little p, in the image plane. As the camera moves, the projection of this point in the world onto the image plane moves, and it induces a two-dimensional vector which describes the motion of that three-dimensional point in the world. That is something we call the motion field.

There are a couple of things to notice here. The first is that one dimension of the three-dimensional motion of this point is lost in the projection onto the image plane, so we don't get back the three-dimensional structure of the world; we get back something related to it, but missing a little piece of information. This is what's called the motion field: it's related to the three-dimensional structure of the scene and how it is actually moving. Optical flow is actually a slightly different problem, or has a different formulation or definition: it is the 2D velocity field describing the apparent motion in the image. These two things are going to differ.

Here's a first thought experiment. Imagine a matte ball, a Lambertian ball; it reflects light in all directions, so it's really diffuse. Imagine it's rotating about the vertical axis in 3D, so it's spinning around. What does the motion field look like? Someone's pointing: yes, it's kind of a horizontal motion. So what does the optical flow look like? It's constant zero; there is no optical flow, you don't see anything in this case. Now imagine this case: I've got a stationary ball and I'm moving the light source around it. What is the motion field? I see some motion, I see a lot of hands waving; there's some vague sense that the motion field looks like this. But remember, the motion field is the three-dimensional motion of points in the world, and the ball is stationary. That's right: the motion field here is zero. And what does the optical flow look like? Those of you who were waving your hands before should wave them now. So these are two cases where the motion field and the optical flow field differ, and that tells us something right away: if we compute the optical flow, we may not be able to immediately get back the three-dimensional structure of the world and what's actually happening in it. Maybe. That's just important to keep in mind as we go forward.

So let me define a little more precisely what I mean by optical flow, and define some terms. Oh, what, too close? Okay, is that better? All right, thank you. And thank you for asking a question and putting up your hand, even though it wasn't related to the talk; it's still good, and I like questions, so ask more of those. So here are two frames in a video sequence, and for most of the rest of this talk we'll talk about only two frames, which may be kind of a crazy thing to do, but that's what we're going to do.
So here are two frames, and the image intensity function I is a function of the spatial position of the pixels, x and y, and of the time t. I'll typically use boldface to represent a vector, so boldface x here is (x, y). We can then think about how a point moves as a vector in 2D, with a horizontal component traditionally called u and a vertical component called v; don't ask me why it's u and v, it makes some sense, but that's how it is. It's a function of x and y, and maybe also of t, but we're going to ignore t here. This vector field is kind of hard to look at as a field of arrows, so we're going to color code it in the following way: small motions are close to white, not very saturated; the larger the motion, the more saturated; and color encodes the direction. This is an example of the optical flow between this pair of frames, how every pixel moved; remember, it's the apparent motion of the pixels. Any questions so far about the representation? Okay.

Just as a little aside, we're now going to forget everything we might know about physics. Maybe it's always a dangerous thing to start a talk by saying "let's forget physics"; we will come back at the end to think about whether physics might play a role and be interesting to think about. But we're going to forget everything we know about materials and light and optics and all of that stuff, and just focus on images and the pixel values in images. That's what we'll do today.

Now, I want to point out that there are lots of applications of optical flow. Lots of people do structure from motion: they try to estimate how the camera moves through the world, they try to estimate the 3D structure of the world, they try to estimate where the motion boundaries are in the scene. And there are lots and lots of recent applications, particularly in graphics. Video coding and compression use optical flow to figure out what's constant over time in a video and exploit that for compression. You can use it for pedestrian detection, video denoising, image resizing; you can analyze plant growth, the motion of ocean currents, meteorological information. People use optical flow all over the place.

One neat application is the so-called painterly effect, and this is why it's useful to put your code online if you're a PhD student. The folks who made the movie What Dreams May Come downloaded my PhD code. Every night they would shoot Robin Williams moving around in a world, take the footage, digitize it, and compute optical flow on all the frames. Then they had an artist paint one of the frames as an impressionist painting, and they used the optical flow field to move the paint around as though it were the stuff moving in the world. So they had things like trees blowing in the wind, and grass, and water, and it all moved and looked like a very beautiful moving painting; it was supposed to be heaven. Anyway, they won an Academy Award for this and I got nothing, so that's another reason to be careful about putting the code from your PhD thesis online. But whatever.

Another movie that used optical flow heavily, and you've probably seen it, is The Matrix, for the bullet-time sequences. They shot actors from many cameras, computed the optical flow between the views of the cameras and over time, and then they were able to interpolate any point in space and time fairly realistically.
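
To make that interpolation idea concrete, here is a minimal sketch of flow-based in-betweening, not the studios' pipeline and not code from the talk: it backward-warps both frames toward an intermediate time t and cross-fades, ignoring occlusions entirely. It assumes NumPy/SciPy, grayscale float images, and flow arrays u, v that map frame 1 toward frame 2.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def interpolate_frame(I1, I2, u, v, t=0.5):
    """Crude flow-based in-between frame: warp both images toward time t and blend.
    Ignores occlusions; real view/time interpolation does far more than this."""
    yy, xx = np.mgrid[0:I1.shape[0], 0:I1.shape[1]].astype(float)
    # Sample I1 slightly "back" along the flow and I2 slightly "forward".
    I1w = map_coordinates(I1, [yy - t * v, xx - t * u], order=1, mode='nearest')
    I2w = map_coordinates(I2, [yy + (1 - t) * v, xx + (1 - t) * u], order=1, mode='nearest')
    return (1 - t) * I1w + t * I2w
```

At t = 0 this returns the first frame and at t = 1 the second; everything in between only illustrates where the flow field enters the interpolation.
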
For The Matrix Reloaded they actually also took my code from my thesis and used it, but they applied it not to images but to 3D faces: they took the same code and applied it to models of faces, so they were able to morph faces and reproduce new faces, poses, and facial expressions they hadn't captured. Anyway, lots of applications.

But how are you going to do it? Our goal is to compute this 2D motion of pixels from one frame to another in a video sequence, and there are a bunch of steps we'll go through. The first is that we're going to make some assumptions about the problem, and it's important to examine your assumptions, so we'll come back again and again to what assumptions we're making, try to make them explicit and clear, and figure out where they might be a problem. Once you have some assumptions, we're going to formalize a function we can optimize, and as this suggests there are going to be local optima we have to deal with. Given a particular objective function, we'll have to come up with some kind of optimization method; for optical flow this turns out to be fairly specific to the problem, and there are a bunch of features, as you'll see, that make it a little tricky. Then there's the unfortunate part that's often left out of many papers: the dirty secrets, the implementation details. This is where you get your hands dirty, and what separates the good algorithms from the bad algorithms is often how they're implemented; all these things work together. And then there's one other little piece, which is how you evaluate. This is really important, and we'll come back to all of these things.

Let's start with our assumptions. The first and most famous assumption of optical flow is called brightness constancy. The idea is this: here are two frames of a video sequence again, and if we look at a little patch, say this patch here or this patch here, there's a structure in the scene that's moving, but its appearance doesn't really change very much. Its 2D location changes, but the pattern stays roughly the same, and the same here. So we're going to assume that the image at position (x, y) and time t looks like the image at the next time instant t+1, just with the pixels offset by this u and v. I've left out the x and y indices for u and v; I'll be a little sloppy with indices, so if something is not clear, stick your hand up. That's the fundamental assumption behind most optical flow algorithms.

It turns out we're going to need another assumption, and it'll become clear in a moment why. The other one is spatial smoothness. Here we're going to assume that neighboring pixels in an image are likely to belong to the same surface. That seems like a reasonable assumption: these two pixels here probably belong to a surface, and these two pixels probably belong to a surface. So we're going to look at a pixel grid, and for every pixel in the grid we'll look at its four nearest neighbors and assume they belong to the same surface. Surfaces in the world tend to be kind of smooth, and if the surfaces are smooth, then the optical flow at those pixel locations is probably similar. That's just an assumption.
I'm going to write that assumption as: for a particular pixel p, the flow, say the horizontal flow u, should look like the horizontal flow at a neighboring pixel in some neighborhood of p. Another way to say this is that we think the spatial derivative of the optical flow field is zero.

Given those assumptions, we can begin to formalize an objective function we could optimize. Let's put the first one in. We have a data term, call it E sub D; it's a function of the flow vectors' horizontal and vertical components, and it's just a sum over all the pixels in the image, where I assume these two things are the same and put a quadratic penalty on the difference, which we want to minimize. So here's a new assumption I just made: I stuck a 2 in there, which implies something about the noise distribution I expect; in particular, I assume it's Gaussian. Now I define a spatial term E sub S, which is a function of the flow field, and it's just the pairwise Markov random field term I described for you. Again I put 2s here, which assumes that the flow field is smooth, that deviations from smoothness are Gaussian, and that first-order smoothness is all that matters: I only care about first derivatives, and the flow derivatives are approximated with first differences. All of these are assumptions, and some of them are actually not very good, but it's important to write down what our assumptions are. Then I put these two things together with a weighting term lambda, which we'll have to deal with at some point, and I get an energy function E which combines a data term and a spatial term and looks like this. That's the classical formulation of the optical flow objective function.

So let's solve it, right? Now we come to the optimization part, and it gets slightly tricky. The problem is that this function, and the derivative of this function, is not linear in the flow field: the things we want to solve for, the flow vectors u and v at a pixel, are stuck inside this image function, and that makes it a little hard to optimize. We'd like to get them out of there, and the easiest way to do that is to linearize using a Taylor series approximation. This is what's typically done: we take this term, the image with the flow offsets, and do a first-order Taylor series approximation, so we write the image at (x, y) and time t plus the partial derivatives of the image in the x direction, the y direction, and the t direction. These two terms cancel out, and we're left with this equation where I've just replaced dx with u and dy with v, and we're assuming dt is small. So we end up with the following equation, which is typically written this way: I_x, the partial derivative of the image in the x direction, times the flow component u, plus I_y, the partial derivative in the y direction, times the flow component v, plus I_t, the temporal derivative of the image, and we assume this equals zero. This is called the optical flow constraint equation.

We've made a bunch of new assumptions here. In particular, for this first-order Taylor series approximation we assumed the optical flow is small; we assumed the image is a differentiable function, which it isn't, it's this discrete thing, so we're going to have to approximate those derivatives; and we also assumed that a first-order Taylor series is all we need.
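
Collecting the pieces just described into formulas (a reconstruction from the verbal description; $\lambda$ is the weighting term and subscripts on $I$ denote partial derivatives):

$$E_D(u,v) = \sum_{x,y} \big(I(x+u_{x,y},\, y+v_{x,y},\, t+1) - I(x,y,t)\big)^2$$

$$E_S(u,v) = \sum_{x,y} (u_{x,y}-u_{x+1,y})^2 + (u_{x,y}-u_{x,y+1})^2 + (v_{x,y}-v_{x+1,y})^2 + (v_{x,y}-v_{x,y+1})^2$$

$$E(u,v) = E_D(u,v) + \lambda\, E_S(u,v)$$

and the linearization that gives the optical flow constraint equation:

$$I(x+u,\, y+v,\, t+1) \approx I(x,y,t) + I_x u + I_y v + I_t \quad\Rightarrow\quad I_x u + I_y v + I_t = 0.$$
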
So that's a bunch more assumptions; we're starting to pile assumptions on top of assumptions, and it's getting a little unwieldy.

Let's look at this constraint at every pixel. It's a single equation with two unknowns, u and v, the horizontal and vertical motion. One equation with two unknowns gives us a line, and it means we don't actually know what the flow is along this line; we only know it's constrained to lie on it. This is why you'll hear optical flow described as an ill-posed problem, and it's for this very reason. You'll also hear about something called the aperture problem. I want you to look at this motion and tell me: what is the motion of the line? How is it moving, in what direction? Up and to the right, exactly, that's great. It could actually be moving in any of these directions; you don't really know. One way to think about it is that I'm looking through an aperture here, a hole, and there's some line moving behind the hole; the motion at this point is constrained to lie along this line, but I don't know in what direction it's moving. So here's the truth: this is the true video, and in PowerPoint I can push this to the background so you can see what was actually happening. This was the actual motion, but you interpreted it as moving up and to the right; it was actually completely horizontal. So the idea is that in a single small region of the image you might not have enough information to disambiguate the optical flow: what you get at a single point is a constraint that's ambiguous, and the motion could lie anywhere along this line. Any questions so far? I'm going a little fast because I want to get to some stuff. This is the interpretation that you had, and that most humans have, and the reason you don't have it when I play this version is that you see the endpoints, and the endpoints are not ambiguous, or are ambiguous in a different way, and the overall best solution for the line is horizontal motion.

I don't know if people have heard of this aperture problem and the optical flow constraint equation before, but it's really useful to visualize what's going on. Here's an image sequence, two frames of a Pepsi can, and if we look in this little red region here, every single pixel has one of these constraint lines. Remember, the constraint takes into account the partial derivatives of the image intensity function in the x and y directions, and because the intensity function in the image has many orientations, horizontal edges and vertical edges and edges at all kinds of orientations, we end up with constraint lines at all kinds of orientations as well. I'm plotting the constraint lines in this little window, probably not all of them, some fraction of them, and you'll notice that they all intersect at a point, or mostly, sort of, intersect at a point. This is the point that satisfies all of these constraints together, and that's what we would like to find in any given region of the image: a unique estimate of the motion that satisfies a bunch of constraints like this.
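
The talk takes the global route next, but the simplest way to see how a window full of constraint lines pins down a single (u, v) is a little least-squares solve, Lucas-Kanade style. A minimal sketch, assuming precomputed derivative images Ix, Iy, It over the window; the scaffolding is mine, not the talk's method:

```python
import numpy as np

def flow_from_window(Ix, Iy, It):
    """Least-squares intersection of the constraint lines Ix*u + Iy*v + It = 0
    collected over one small window (Lucas-Kanade style)."""
    ix, iy, it = Ix.ravel(), Iy.ravel(), It.ravel()
    A = np.stack([ix, iy], axis=1)      # one constraint line per pixel
    b = -it
    ATA = A.T @ A                        # 2x2 normal equations
    if np.linalg.cond(ATA) > 1e6:        # nearly parallel lines: no unique intersection
        return None
    u, v = np.linalg.solve(ATA, A.T @ b)
    return u, v
```

When all the constraint lines in the window are nearly parallel, as in a blank region or along a single edge, the 2-by-2 matrix is ill-conditioned; that is the aperture problem showing up in the algebra.
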
There are a bunch of ways to find that intersection. The first and most famous comes from a 1981 paper by Horn and Schunck. If you look at optical flow at all, you'll hear about these guys; they wrote a paper called "Determining Optical Flow", and they wrote down exactly the objective function I wrote down for you: quadratic data term, quadratic spatial term, and a lambda weighting the two. Then they did their best to optimize it, and this is where the optimization becomes important. Remember, Horn and Schunck were working in 1981; computers weren't that good in 1981, so they had to do a bunch of stuff to try to solve this problem. It was even hard to get image sequences into a computer in 1981, so it was pretty impressive. They wrote down this objective function, already linearized here with our Taylor series approximation, and they said: let's differentiate it with respect to the parameters u and v, set the derivative equal to zero, and solve. That's just what you would do. If you differentiate this thing and set it equal to zero, you get a system of equations that is linear in u and v. So they end up with a system of linear equations, and you can write them out as matrices, schematically: there's a data term, which is a diagonal matrix, a spatial term, which is a banded diagonal matrix of some form, and on the right-hand side you have a product of the temporal derivative of the image and the spatial partial derivatives of the image. So this is just a big system of linear equations, easy to solve, and it's also sparse; you write it this way and solve for the flow.

But Horn and Schunck were working in 1981, so they said: we now have a pair of equations for each point in the image, and it would be very costly to solve these equations simultaneously by one of the standard methods. So they implemented a really bad algorithm, a really bad approximation, and this is where the implementation details matter. They knew what to do; they just couldn't do it. Consequently, in every comparison for the 30 years after that, they got slammed as having the worst algorithm in history. Here is a table from an early evaluation of optical flow results on the synthetic Yosemite image sequence, and Horn and Schunck got the worst performance: more than 30 degrees of angular error. They were just wrong all the time. So for many years people said optical flow just can't be solved: it's ill-posed, you formulate this objective function, you can't optimize it, the problem is not even worth thinking about. But it turns out that what people weren't doing was looking at all of the assumptions and the implementation details.

Here are the assumptions, the big ones: brightness is constant; deviations from constancy are Gaussian; motions are small, i.e. less than a pixel (Horn and Schunck assumed that); a first-order Taylor series is a good approximation; the image is differentiable; the flow field is smooth; deviations from smoothness are Gaussian; first-order smoothness is all that matters; flow derivatives are approximated by first differences. Some of these assumptions are not so good, but they're easy to fix. Actually, just today I got mail from the International Journal of Computer Vision saying that our paper on this topic appeared this morning. It was written by a former student of mine, Deqing Sun, and another former student, Stefan Roth, and we decided to look at all of these assumptions and try to figure out which ones were really causing the problems, which ones mattered, and how you could fix them.
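
As a side note before those fixes: the "write it as one sparse linear system and solve it" route that Horn and Schunck could not afford in 1981 is short enough to sketch. This is my scaffolding, not code from the talk or the paper; Ix, Iy, It are precomputed derivative images, lam is the smoothness weight, and it is only practical for smallish images or coarse pyramid levels.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def grid_laplacian(h, w):
    """Graph Laplacian of the 4-connected pixel grid (first-difference smoothness)."""
    def path_lap(n):
        d = sp.diags([np.ones(n - 1)], [1], shape=(n, n))
        a = d + d.T
        return sp.diags(np.asarray(a.sum(axis=1)).ravel()) - a
    return sp.kron(sp.identity(h), path_lap(w)) + sp.kron(path_lap(h), sp.identity(w))

def horn_schunck_linear(Ix, Iy, It, lam=10.0):
    """Solve the linearized quadratic objective for one image pair in one sparse solve."""
    h, w = Ix.shape
    n = h * w
    ix, iy, it = Ix.ravel(), Iy.ravel(), It.ravel()
    L = grid_laplacian(h, w)
    A = sp.bmat([[sp.diags(ix * ix) + lam * L, sp.diags(ix * iy)],
                 [sp.diags(ix * iy), sp.diags(iy * iy) + lam * L]]).tocsr()
    b = -np.concatenate([ix * it, iy * it])
    uv = spla.spsolve(A, b)   # direct sparse solve; slow but simple
    return uv[:n].reshape(h, w), uv[n:].reshape(h, w)
```
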
To figure that out, we systematically changed only one thing at a time, which is something people hadn't been doing before. I'm going to go through a few of these and hopefully give you some insight, if I have time, into what matters and how people have dealt with it.

The first one is the linearization. For this Taylor series approximation to be valid, the flow has to be less than about a pixel, and that's no good, because real motions in real video sequences can be huge: in an action movie something could move a hundred pixels from one side of the image to the other, or even 200 pixels in a widescreen movie. So the assumption of small motions just doesn't hold. Well, back in 1992, and probably even earlier, Ted Adelson and colleagues were thinking about how to formulate this nicely, and the idea is to build an image pyramid: you convolve the image with a Gaussian, you subsample, and you repeat. There are lots of variations on the theme, but that's the basic idea. So we take an image, or a pair of images, and convert them into this spatial pyramid. Why is this important? Imagine the magnitude of my motion is 10 pixels here at the bottom. At this level up here it'll only be 5 pixels, here 2 and a half pixels, and here about 1 and a half pixels. Now I'm pretty close to the regime where my small-motion assumption is mostly valid, so I can apply my optical flow equation, with the linearized brightness term, at this coarse level and get a reasonable result.

So that's what we do. We're calling the flow w here, unfortunately, rather than u; I just couldn't edit this for some reason. This is the optical flow field at a particular level in the pyramid. We estimate it by optimizing this thing, by solving the system of linear equations and so on; then I take my image and warp it by the optical flow, and I'm not going to tell you exactly how to do that, there are a bunch of ways; then I refine it and project it to the next level. So I basically compute an approximation at the coarse scale, multiply it by a factor of 2 and increase its size, which gives me an estimate of what's going on at the next level; then I warp by that estimate, refine, and repeat. This way I incrementally estimate the large motions down here at the bottom, and this works pretty well; most optical flow algorithms do something like this.

There's one neat little thing. Typically these pyramids are symmetric: you subsample the image by the same amount in the vertical and horizontal directions. But with widescreen movies, because of the way directors shoot them, you end up with a lot more horizontal motion, and much bigger horizontal motions. So if you just subsample more in the horizontal direction, you can capture much wider motions. People hadn't done that before, but it works really nicely. Another thing that's important: if you're going to take something discretely sampled, like an image, and compute derivatives of it, you have to do that in a nice way. I'm not going to tell you how, but it turns out to be moderately important. If you do all of that, the coarse-to-fine thing, the optimization by solving a system of linear equations rather than Horn and Schunck's crummy method, and nicely implemented derivative filters, then Horn and Schunck isn't so bad.
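
A minimal sketch of that coarse-to-fine loop, assuming SciPy and float grayscale images; solve_increment stands in for whatever small-motion solver you use at each level, for example the linearized solve sketched above. All the naming is mine, not the talk's code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates, zoom

def coarse_to_fine_flow(I1, I2, solve_increment, n_levels=4):
    """Estimate flow by blurring/subsampling, then refining level by level."""
    I1, I2 = np.asarray(I1, float), np.asarray(I2, float)
    p1, p2 = [I1], [I2]
    for _ in range(n_levels - 1):                     # Gaussian pyramid: blur, subsample by 2
        p1.append(gaussian_filter(p1[-1], 1.0)[::2, ::2])
        p2.append(gaussian_filter(p2[-1], 1.0)[::2, ::2])
    u = np.zeros(p1[-1].shape)
    v = np.zeros(p1[-1].shape)
    for I1l, I2l in zip(reversed(p1), reversed(p2)):  # coarse to fine
        if u.shape != I1l.shape:                      # upsample flow, scale motion by the ratio
            fy = I1l.shape[0] / u.shape[0]
            fx = I1l.shape[1] / u.shape[1]
            u = fx * zoom(u, (fy, fx), order=1)
            v = fy * zoom(v, (fy, fx), order=1)
        yy, xx = np.mgrid[0:I1l.shape[0], 0:I1l.shape[1]].astype(float)
        I2w = map_coordinates(I2l, [yy + v, xx + u], order=1, mode='nearest')
        du, dv = solve_increment(I1l, I2w)            # small-motion solve at this level
        u, v = u + du, v + dv
    return u, v
```
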
The basic formulation, the assumptions of brightness constancy and spatial smoothness with Gaussian noise, actually produces an optical flow field here that isn't too bad. It has some properties we would expect; for example it's over-smoothed, because the assumption of Gaussian noise on the spatial term implies that everything should be smooth. At sharp discontinuities, here's the ground truth, here's what it should look like: you see this shell here moving, and its motion is pretty sharp in the real world, because there are two different surfaces, and Horn and Schunck tends to blur it. But it's not terrible.

Yes, sir? Indeed. I don't know that anyone has actually done a study with high frame rate cameras, but you're absolutely right: that makes many of the assumptions hold, and the pyramid effectively plays that role. People are more interested in pushing the problem of really large motions, as we'll come to at the end, because that's what happens in natural movies.

There's another way of thinking about this problem, which I won't go into. The objective function I wrote down is just the negative logarithm of a probabilistic formulation, where we model the probability of a flow field (u, v) conditioned on the two images, and it's proportional to a likelihood term, the probability of one image conditioned on the other image and the optical flow, times a prior on the optical flow. The likelihood was our data term and the prior was our spatial term. There are a bunch of formulations that look at this in a more probabilistic way; I'm not going to go into those today, and we'll stick with this good old-fashioned energy function formulation for now.

So we thought about those problems, and fixing them makes a big difference. But there are these other big problems: that brightness is constant, and that the flow field is smooth, are probably bad assumptions. So let's look at those. The problem is, how do we know? To understand what the truth is like, whether brightness constancy holds, whether the spatial derivatives of optical flow are actually Gaussian, we would need some ground truth optical flow. And here's where the machine learning part comes in; this is why people haven't been doing machine learning on optical flow: there haven't been any ground truth sequences. Why is that? Well, yes, early on people tried that, and they tended to produce very crappy image sequences: you've got a computer vision graduate student who doesn't know much about graphics, they make a little synthetic thing, and it doesn't really reflect the real world. So generating synthetic data has been really hard. Hollywood will do it for you, and years ago I went to people in Hollywood and said please, please, please give me your 3D data, all the information you use to generate your movies, and then I can generate ground truth. They all said, well, the people who do the graphics don't actually have the copyright, and the studios that own the copyright have no interest in giving it to you, so that whole thing died. The other thing is that, unlike something like the Kinect sensor, which is a camera with images and depth information, there's no camera in the world that measures the motion of pixels directly; there's no direct sensor that tells you how all the pixels are moving. So there's no way to just go out and capture lots of data, the way people are beginning to do with Kinect.
So you really have to rely either on some kind of really laborious manual labeling, or on complicated instrumentation of the world, which makes the images very unnatural, or you have to get some graphics data. And we got some graphics data. This was an idea of Dan Butler's, a PhD student at the University of Washington, and it's joint work with Jonas Wulff here at MPI and Garrett Stanley from Georgia Tech. The idea is that there are actually some people out there in the world making movies and making them freely available. The Durian open source movie project used Blender and a community of animators all around the world to make an animated movie, a really nice movie called Sintel. If you haven't seen it, I highly recommend watching it; it's a very nice movie. They made this movie and then they put all the data online, so we finally have something we can study.

I'll come back to some important issues to think about when using a synthetic movie, but first I want to give you an idea of what ground truth optical flow really looks like in a complex scene. Here are scenes from Sintel, and here's the color coding of how all the pixels in the image are moving. This is not the original Sintel; it's slightly modified by us, but it gives you a sense of the complexity and the richness of what's going on. It's a data set with over 1600 frames of ground truth data; these are widescreen, fairly high resolution, we have divided it into training and testing sequences, and it has large velocities of over 100 pixels per frame. It's online, and if people are interested I'll give you the address in a second.

The problem is: what can animated movies teach us about optical flow in the real world? As soon as you do anything in computer vision that uses graphics data, everyone gets really, really nervous and asks, is it realistic enough? If I learn something from a graphics movie and then take it and apply it to real images, will I have learned the wrong thing? So what's really important is whether the statistics of the world in Sintel are realistic enough for us to learn something of value. What we did, and this is really hard to do, is construct a set of what we called look-alikes. Here's a Sintel scene of flying through some mountains, and here's a scene from Batman Returns or something that looks visually a lot like it. We searched the internet for scenes with similar semantic content to Sintel. You probably can't see this, but here's a campfire scene in Sintel and here's one from Dances with Wolves; here's a market chase scene in Sintel and here's one from a James Bond movie; here's running through a bamboo forest and here's something like a Crouching Tiger, Hidden Dragon scene. So it turns out there are a lot of scenes in Sintel that are very much like scenes you've seen in real movies. We actually went to the Sintel group of animators, to their newsgroup, the forum where they communicate, and asked for people's ideas about which scenes in other movies were inspirational for scenes in Sintel. They were so offended; they insisted that Sintel is completely novel and not like anything else in the world. They were very offended that we even thought there might be similar things.
But you can see that maybe they were inspired by some of these things. One thing we had trouble finding was fight scenes in the snow, but it turns out there's a whole group of people in the world whose hobby is taking a home video camera and making fight scenes in the snow, so we were able to find a lot of people fighting with swords in the snow, and a lot of bamboo forests. These are some of the look-alike sequences at high speed.

The question is, they don't look exactly like Sintel, but are they similar enough? So we started asking about the natural statistics in these scenes. We looked at the image statistics; for example, one thing people look at is the marginal image derivative statistics. This is the log histogram of horizontal and vertical derivatives of the image intensity function, in Sintel, which is the red line, and in the look-alikes, which is the blue line, and they line up pretty well. What does that tell you? Not a whole lot, but these are the characteristics that people who study image statistics see: the derivatives are peaked at zero, mostly things are kind of flat, and then there are some large outliers, so you get these heavy tails. You can ignore the green line for now, unless you get interested in this. So the images at least have the right kind of first-order statistics.

The question that's maybe more interesting is whether the motion has the same kind of structure, and this is a little bit harder, since there's no ground truth motion for the look-alike sequences. So we did something a little bit fishy: we computed optical flow using an algorithm on Sintel, and we computed optical flow using an algorithm on the look-alikes; you have to assume those algorithms behave sort of reasonably, and then we compared the statistics. What we found is that on the look-alikes, and this is a little hard to see, there's a red line and a blue line here that kind of lie on top of each other: these are the log histograms of the horizontal motions and the vertical motions, and they're sort of similar. They're quite different from this dotted line, which is the log histogram of the ground truth horizontal and vertical motion; that's because the algorithm isn't able to capture the large motions present in Sintel. But at least it gives us some confidence that there's something similar about the motions in the two sets of scenes. Finally, if we look at the statistics of the spatial derivatives of the flow, here's the partial derivative of the horizontal motion in the y direction, for the look-alikes in blue and Sintel in red, and you can see these two things line up fairly well. That's good, but you'll also notice they don't look Gaussian at all: they're very peaked at zero and highly kurtotic. That suggests our spatial smoothness assumption was probably wrong.

So now that we have a database of image sequences with ground truth, let's come back and look at some of our assumptions, starting with this brightness constancy assumption. We have all these images, we have the flow fields, we now know which pixels in frame 1 correspond to which pixels in frame 2, and we can just look at the statistics of the brightness constancy difference: image 1 at pixel (i, j) versus image 2 at (i, j) offset by the true flow.
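
A minimal sketch of that measurement, assuming grayscale images scaled to [0, 1] and ground-truth flow arrays u_gt, v_gt (for example loaded from the Sintel ground truth); the function name and binning are mine:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def brightness_constancy_stats(I1, I2, u_gt, v_gt, bins=101):
    """Log-histogram of I1(x) - I2(x + w_gt) under the ground-truth flow."""
    yy, xx = np.mgrid[0:I1.shape[0], 0:I1.shape[1]].astype(float)
    I2w = map_coordinates(I2, [yy + v_gt, xx + u_gt], order=1, mode='nearest')
    err = (I1 - I2w).ravel()
    counts, edges = np.histogram(err, bins=bins, range=(-1, 1), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, np.log(counts + 1e-12)   # heavy tails show up clearly in log space
```
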
We find that these differences are heavily peaked at zero and, again, highly kurtotic with heavy tails, so this doesn't look very Gaussian. These are just marginal statistics, and the truth is this figure came out of an '08 paper which didn't use Sintel; it used a different training set that was much smaller. I couldn't find the corresponding Sintel figure this morning; my guess is the Sintel version is more symmetric, and this one just doesn't have enough training data.

Then there's the spatial term. Remember, we're assuming that a pixel here should look like its neighbors, but what can easily happen is something like this: we're looking at a pixel that belongs to some surface, and the neighboring pixel belongs to a different surface moving a different way. In some sense you can think of this as a spatial outlier, and you want to be able to detect that this pixel belongs to this surface and not that one. Again, this plot is out of an old paper, I don't have the Sintel picture here, but these are the horizontal and vertical flow derivatives computed a different way, and you get the same idea: heavily peaked at zero, with heavy tails. The sharp peak means that optical flow is usually smooth, so that assumption is pretty good, the world is pretty smooth; but you have these large discontinuities, where one surface is moving behind another surface with a different motion, and you really do have to model those.

This goes back now to my PhD thesis. I introduced this robust penalty function, which has roughly this shape; it's basically the negative log of this kind of heavy-tailed distribution, because remember we're working with the energy function instead of the probabilistic formulation. This rho function here is just one of the ones I used, and it has the nice property that as the errors grow large, whether it's a brightness error or a spatial derivative, the penalty you pay saturates. This means you can't be too influenced by some crazy measurement. If you just take the quadratic out of Horn and Schunck, replace it with a robust rho function, and choose the right one, you can get a very nice result. There are a bunch of things we could choose: here's the quadratic of Horn and Schunck; here's something I used in my thesis, the Lorentzian, which I just guessed might be good; and here's something called the Charbonnier, which looks like an L1 penalty, so it's nicely robust, but unlike L1 it's differentiable at zero, which is nice.
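
For reference, the three penalties being compared, in one place (a reconstruction from the verbal description; $\sigma$ and $\epsilon$ are scale parameters, and the exponent $a$ anticipates the generalized Charbonnier that comes up in a moment):

$$\rho_{\text{quadratic}}(x) = x^2, \qquad \rho_{\text{Lorentzian}}(x) = \log\!\Big(1 + \frac{x^2}{2\sigma^2}\Big), \qquad \rho_{\text{Charbonnier}}(x) = \sqrt{x^2 + \epsilon^2},$$

with the generalized form $\rho(x) = (x^2 + \epsilon^2)^a$, where $a = 0.5$ recovers the Charbonnier and slightly smaller values are slightly more non-convex.
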
Now, if you plug in one of those robust functions, keeping everything else exactly the same, just replacing the quadratic with something robust, you end up with much sharper discontinuities here, so you don't get the spatial blurring you'd get with the quadratic. That makes sense; it's a step in the right direction. But which robust penalty function should you choose? That is where we can begin to leverage the data set that we now have. There are several issues here: the optimization just became a lot harder, and we can't just differentiate this, end up with a system of linear equations, and solve anymore. There are a bunch of things we can do: gradient descent; starting with a convex solution and gradually making it non-convex, by starting with a quadratic and gradually introducing the more robust shape of the penalty function; or iteratively reweighted least squares, so that we solve a series of linear equations. There are a bunch of ways to go, and you get something that matches the world a little bit better, but it's just a little bit harder to deal with.

So that is where we get into a bunch of modern techniques for computing optical flow. How am I doing? I'm okay. There are a bunch of things that matter. We now have an objective function; we know that coarse-to-fine helps for dealing with large motions; I'm going to tell you about median filtering in just a minute; graduated non-convexity is how I deal with a non-convex energy function; and there are a bunch of other things. In fact, in this paper that just appeared, it was originally a CVPR paper but the journal version appeared today, we look at a whole bunch of different things. I'm not going to go through them all; in yellow are the ones I'll touch on briefly. I've already touched on coarse-to-fine, so I'll tell you only about median filtering and the penalty function we're going to use, and if you want to learn more there's a detailed evaluation in the paper.

Now, evaluation is important, and you have to decide what a good optical flow field is. I'm going to use EPE, which stands for average endpoint error: if the ground truth motion is this yellow vector and my estimate is this one, EPE is the Euclidean distance between the endpoints. You could also compute angular error, but they basically give the same kind of results. Using a data set with ground truth, we can look at different penalty functions, the quadratic, the Lorentzian, and the Charbonnier, and ask what the average endpoint error is with each. We see the Charbonnier is a little bit better, and we can test the significance: "Classic-C" uses the Charbonnier, and we can compare it with Horn and Schunck, the quadratic, or "Classic-L", the Lorentzian, and the difference is statistically significant. So this Classic-C thing seems to be good. In fact, it turns out that a little bit more non-convexity is better: you can write the Charbonnier with an exponent parameter a, where the classic Charbonnier has a = 0.5, making it like an L1, but values slightly smaller turn out to be better. We found that just a little bit more non-convexity really helped a lot.

Then there's one other key thing, a little trick that people were doing and not really telling anybody about in their papers; hang on a second, I'll show you. When you compute the optical flow it can be kind of noisy, and people thought: it's kind of noisy, maybe if I just run a median filter over the optical flow field, that will clean it up. It turns out that works great; it really cleans it up. If you turn on a median filter, the average endpoint error goes way down, much better than if you don't do this median filtering hack. It's just a little hack. Here it is with it on, and here with it off; these are normalized, and there are some little outliers here that are squashing the intensity function, so you don't really see the structure, but the endpoint error in this image is significantly lower than in this one. So we looked at this and realized: we're trying to optimize this energy function, and if we plug in this median filtering step, this hack, we end up with a higher energy but a lower endpoint error.
What does that mean? It means several things: it might mean our objective function is wrong, or that the solution we're finding is good but isn't actually optimizing our energy function, which means we're implicitly optimizing some other energy function. So we started to think about what that other energy is, and it turns out we can rewrite this, and this turns out to be an important thing. Before, we looked at spatial neighborhoods that were very small; we just looked at the nearest neighbors of a pixel. That isn't very powerful; it only tells you a little bit, about the first derivatives. It's much more powerful to look at a large neighborhood, but the question is how to do that, and the median filter gives us an idea. What's actually happening in the median filter is that we're looking at a whole region and its optical flow, computing a median value, and you can imagine setting the center pixel to be that median value. Another way to write that is as we've written it before: here's our data term as before, our spatial term as before, and a new constraint that says this pixel should look like all of its neighbors, not just its four neighbors but all the neighbors in, say, a 5-by-5 region, with an L1 penalty term. That is basically computing a median, and we could add this term into the equation; that's roughly what people were doing.

Now, it turns out to be hard to optimize this thing, so what you can do is split it into two pieces and reformulate it. We introduce an auxiliary optical flow field, which I'll write with a little hat; over this hatted field we're computing a median, basically, and then we connect it to the original optical flow equation we had before through what is in some sense a spring penalty, a quadratic penalty that pulls the two toward each other. The advantage of doing this is that when we differentiate with respect to u and v, this term drops away, and when we differentiate with respect to the hatted variables, that term drops away, so we have an alternating optimization between the two. This allows us to compute a median filter and couple it to the original optical flow equation, and this alternating scheme gets the same results as the median filter, but it now gives us an explicit objective function that we can understand: we've introduced a non-local term that allows us to integrate information over a large spatial neighborhood.

And that's a great thing, except when it's not. What happens when you do a median filter and you've got some very thin, small structure, like this gun here? If you look at the neighborhood around the tip of the gun, most of the pixels correspond to the motion of the background, not the motion of the foreground, so the median filter actually blurs away the tip of the gun; it gets rid of small structures. But now that we've introduced this spatial neighborhood term explicitly, and it's not just a median filter anymore, we can begin to look into what's going on there. So we're going to introduce a weighted non-local term, with a new assumption: motion boundaries are likely to coincide with image boundaries. Let me explain that. Here we have the median filter we're doing: we want the center optical flow value to look like all of its neighbors, except now we're going to compute a weight for each of these neighbors, and that weight is going to vary.
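
Putting the last few steps into symbols, as a sketch (the coupling weight $\lambda_c$, the non-local weight $\lambda_N$, the neighborhood $N_p$, and the weights $w_{p,q}$ are my notation, not necessarily the talk's):

$$E(u, v, \hat{u}, \hat{v}) = E_D(u,v) + \lambda E_S(u,v) + \lambda_c \sum_p \big((u_p - \hat{u}_p)^2 + (v_p - \hat{v}_p)^2\big) + \lambda_N \sum_p \sum_{q \in N_p} w_{p,q}\big(|\hat{u}_p - \hat{u}_q| + |\hat{v}_p - \hat{v}_q|\big)$$

With uniform weights $w_{p,q} = 1$, minimizing over the hatted field is exactly a median filter of the flow; with image-dependent weights it becomes the weighted median described next, and alternating between the two groups of variables gives the scheme just outlined.
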
So where are we going to get that weight from? Here's the weighted median we're going to compute; we get the weights by looking at the image intensity and some other properties. The further away a neighbor is from the center pixel, the less I want it to contribute, or said the other way, the closer it is, the more I want it to contribute. The more similar the image intensity or color values are, the more I want it to contribute. And there's another term for how similar the optical flow vectors are, which I won't go into. If we do that, we get the following masks, or weights, associated with a particular pixel. Take the center pixel here, on the gun of this little toy soldier: the brightness here indicates the weight in this weighted median filter, and things that are on the gun have the same color and therefore end up having a higher weight. B here is on the boundary between this fence thing and the background, so the center pixel prefers pixels on the left side; and this pixel right here, in this little V, likes pixels up here. So this is a way of incorporating image information into the optical flow equation, and it works like dynamite. Here's the ground truth optical flow and here's the estimated optical flow, ground truth and estimated, and they're starting to look quite similar. We call this Classic+NL, where NL stands for non-local, and it's a significant improvement: here's the result with just the regular median filter, and here's the weighted median filter built into the objective function in a nice way, and you can see that it really matches the image pretty nicely. This is true at motion boundaries as well.

I promised you some code, and this is all on Deqing Sun's webpage; if you search for Deqing Sun you'll find his software. It's widely used, and he has all these things implemented in lots of different variations, so it's easy to build on it. It's all in MATLAB.

In the ten minutes I have remaining, I would like to point out a few problems. Here's Classic+NL applied to this sequence, and you see that while it does a reasonable job of maintaining motion boundaries and getting a nice smooth flow that looks like this ground truth, it still screws up in some places where there are really fine structures and occlusion relationships; it misses these very fine bits of the leaf here. So it's not perfect. What's really going on here is that we formulated this classical model derived from Horn and Schunck, and we've found that motion boundaries are critical for accuracy, but this model doesn't really explicitly say anything about boundaries; that's not in the model. What we're doing is asking this little robust penalty function to do two things for us at once: to implement smoothness, that is, to constrain neighboring pixels to have similar flows, and at the same time to allow them to be different if they need to be different. It's an awful lot to ask of one little function. What's really going on is that there's some kind of segmentation that needs to happen. There are really two problems that exist in the computer vision literature, segmentation and motion estimation, and they tend to be treated separately. Here's an image of a scene, a single frame, and here's a segmentation produced by a particular segmentation algorithm; it's fairly representative.
What you can see is that, from a single image, the segmentation of the 3D world is often ambiguous. Here, for example, this can gets merged with the background segment, and things also get over-segmented: this can gets segmented into all kinds of pieces, which makes it look like multiple objects even though it's one object. From the optical flow information, if we had the ground truth optical flow, we would get very nice, precise segmentation information about where the objects are; but of course we don't have the precise optical flow information. So the question is whether we can combine this problem of estimating flow with this problem of segmentation, so that by solving them together we can use cues about the image and cues about the motion, and get both a better segmentation of the scene and a better optical flow estimate.

This idea has a long history and goes back at least to the work of Wang and Adelson in 1993 on a layered model of optical flow. It's an important one to think about. It introduces a bunch of new assumptions, but the key one is that the image can be decomposed into a series of overlapping layers, and each layer has a fairly simple motion associated with it. In this case the background is moving in one direction, there's an alpha matte, which is the segmentation, and the foreground is moving in another way. The generative model of an image is: I take my background pattern weighted by the alpha matte's complement, I take my foreground pattern multiplied by the alpha matte, I add them together, and I get an image; then I can warp these and generate a sequence of images. It's a very simple generative model of an image sequence, and it explicitly puts in this segmentation model, so it's a very nice place to incorporate information about image segmentation. I'm not going to go into the details, but with Deqing Sun and Erik Sudderth and some others we have a series of papers on doing this, where we revived this idea of a layered segmentation. Now we take a sequence like this and simultaneously segment it into layers and estimate the flow, and you see that these very fine structures are captured by the segmentation, because it's using image information to do the segmentation, and it allows us to model explicitly what's happening at an occlusion boundary, rather than trying to wrap it all up in this little robust function. There's an explicit generative model of what's going on at the occlusion boundary, which allows you to do a much better job of estimating the flow. So I think this is an interesting direction to go. We can also estimate occlusions: here in black are the ground truth occluded regions, and here are the estimated occluded regions.

So, apart from tuning some parameters and maybe learning some marginal statistics, there really hasn't been much learning in this field. Why? Now that you've seen this presentation of the classical formulation, do you have some ideas of why machine learning hasn't been used more? Or, well, number one, there wasn't training data before, but now that we have some training data, are there places where you think you could start to apply machine learning? Yeah. Yes, unfortunately, the layered model is kind of nice and simple, but it turns out most scenes don't actually come in a nice series of layers. If you think about walking in a forest, you have a ground plane and you've got all these things sticking up out of the ground plane, and sometimes the ground is in front of the trees and sometimes it's behind the trees.
But indeed, to train a layered model we don't have any ground truth about what those layers might be, so getting ground truth is probably the biggest hurdle. There are some other problems. A lot of the optimization I talked about makes things a little bit specific to the problem: having to deal with large motions, and the fact that people have done it in a coarse-to-fine way, kind of restricts how you think about the problem. And if you don't do it in this coarse-to-fine way you have a very, very difficult optimization problem, where any patch of the image could go to any other location in the image — it's a horrendous search problem. The other thing is that image sequences are just not as nice as images. For example, think of a little patch in an image in one frame. If I'm thinking about learning a convolutional model or something to represent images and denoise them, I can imagine looking at image patches in lots of frames. But the question is what happens when I introduce a temporal sequence: if I look at the next frame, the thing that was in that patch isn't there anymore. So I've got a problem if I want to do something naive, which is to say: I'm going to take a sequence of frames, take regions, look in these regions, and try to learn something about how things move — I've got the ground truth motion, I've got the images, I should be able to learn from a little spatio-temporal volume what the flow is. Except I can't do that here, because things move too fast; there's some scale issue that makes a naive application of machine learning algorithms really problematic. So, just to let you know: this classical formulation — we talked about a bunch of its assumptions — there's a whole bunch of stuff that still breaks it. Complex stochastic things with appearance change, raindrops and puddles: it's not clear what the optical flow is, or what it should be. Fire and waves and so on: again, it's not clear even what the definition of the problem is — what's the right answer for the optical flow of a bunch of waves, or something that's exploding? What's the right motion? Look at a car with a pole reflected in it: is it the motion of the car, is it the motion of the pole, is it some combination? Should I actually be thinking — now, I told you not to think about physics, but maybe I should think about physics again — about reflections, and maybe actually be modeling the material properties of the car and the surrounding scene and how it's reflected by the car? I don't really know what the right optical flow is for things like plants blowing in the wind, and flags. Does anybody really want a detailed pixel motion of all the leaves on this plant? Maybe someone just wants some kind of coarse textural information to say, oh, it's clearly a plant blowing in the wind. I don't know; it's sort of problem specific, and I don't really know what the right answer is. And then one fundamental problem is small things moving fast. This is even a problem for the classical formulations. In this example I showed you, the motions are so large that even at the coarse level of a pyramid — because this structure is so small — by the time you subsample and subsample and subsample enough that the motion is less than a pixel, you can't see the structure anymore; it's disappeared. So the motion of small things moving quickly is really a fundamental problem.
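A back-of-the-envelope way to see this (the numbers below are made up for illustration): each pyramid level halves both the apparent motion and the size of the structure, so by the time the motion is small enough for an incremental estimator to handle, a thin structure is far smaller than a pixel and has effectively been smoothed away.

object_width_px = 4.0    # e.g. a thin twig in the image
motion_px = 24.0         # how far it moves between the two frames

level = 0
while motion_px > 1.0:          # coarse-to-fine wants roughly sub-pixel motion at the top level
    level += 1
    object_width_px /= 2.0
    motion_px /= 2.0
    print(f"level {level}: width ~{object_width_px:.2f}px, motion ~{motion_px:.2f}px")

# By the last level printed, the motion is finally about a pixel, but the
# structure is well under a pixel wide and is no longer visible at that scale.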
And another thing that's missing: I only talked about two frames here, and obviously the world persists — there's temporal continuity. The question is what persists in time, and the optical flow is something that, in an essential sense, might not persist in time; it's the physical structure of the world that persists. So again, one might want to start thinking about the physics of the world and about surfaces and their properties. I promised to finish now, and I will finish now — sorry for going so quickly. The things I wanted to get across are that some of the early ideas, going all the way back to the 80s, were not so bad: the idea of Horn and Schunck, and then the layers idea of Wang and Adelson, are pretty good. The assumptions were wrong, but the assumptions can be improved — we can use this relatively realistic synthetic training data to improve them and to train the models — and the problem is not yet fully solved; maybe it's not even fully defined yet. Moving forward might mean thinking about the problem in a very different way, with more of a physical representation. But I also want to get across the idea that optical flow is actually pretty useful. You might not be interested in using machine learning algorithms to improve optical flow estimation, but you might walk away from this with some optical flow code and say, maybe I can use that in my problem. For example, Silvia Zuffi has a paper that's going to appear at ICCV where she uses optical flow to help with human pose estimation in a video sequence, and it turns out that if you just look at little movies of people moving around — just look at the optical flow — you can actually see a lot about what they're doing. Here are some nice examples of estimating the pose, and you see the optical flow underneath the pose of this puppet; if you want to know more you should ask Silvia about it. That's just one example among many where optical flow today is good enough to be useful. And then there's the Sintel training set — if you want to play with it you can download it from our website. Here are some more examples of the training data, just to give you an idea of the complexity. All right, I'm done. I don't know if we have any time for — I think we have a bit of time. Thank you very much, perfect timing. So, questions please — yes, in the back. Yeah, go ahead and shout, or I can repeat it. What about the retina? Hmm, it's a good question. So the question is: should we think about the retina? The retina has some different properties — for example, it has a high-resolution fovea and a low-resolution periphery — and some people have looked at camera models with similar properties. You might imagine that in the periphery you get very low spatial-frequency information, so you can kind of get the gist of the overall motions. So far that hasn't been terribly useful, because most people don't want a camera that's like the eye — they want to compress a movie sequence like this, or understand a movie sequence like this. So far, biologically motivated models derived from the retina, or from models of processing in the brain, have not been terribly successful; maybe that's because we don't know how the brain works. OK, there's a second question here. Indeed — going back to Gibson's whole motivation for this — if something is coming towards me it creates a looming pattern, and you can write down very simply that this is just an expansion of the optical flow field, so you can work out very quickly that something is coming towards you. You could imagine implementing a detector that very accurately predicts when I'm going to come into contact with an object and how I'm approaching it. It tells you about your heading: not just whether something's looming or moving away from you, but your own heading in the world. These are all applications of the optical flow field that people have looked at.
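As a concrete, heavily idealized version of that looming idea: for a fronto-parallel surface approached head-on, the flow about the focus of expansion is roughly u = x/τ, v = y/τ, so the divergence of the flow is 2/τ and the time-to-contact is about 2 divided by the divergence. This is my own minimal sketch under those assumptions, with a synthetic flow field standing in for real data:

import numpy as np

def time_to_contact(u, v):
    """Estimate time-to-contact (in frames) from the mean divergence of the flow."""
    du_dx = np.gradient(u, axis=1)
    dv_dy = np.gradient(v, axis=0)
    divergence = np.mean(du_dx + dv_dy)
    return 2.0 / divergence

# Synthetic expanding flow for a surface that will be reached in ~30 frames.
H, W = 128, 128
y, x = np.mgrid[-H // 2:H // 2, -W // 2:W // 2].astype(float)
tau = 30.0
print(time_to_contact(x / tau, y / tau))   # prints ~30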
Yeah — it's a good question. In addition to the 2D optical flow I talked about, there's something called scene flow, which is a field of three-dimensional vectors describing the 3D motion in the world — the thing that we started out with. Getting that from single images is actually quite hard unless you can assume the scene is rigid; if you can assume the scene is rigid you can do a whole lot of things, but in general the scene isn't fully rigid — there's lots of stuff moving around in it. People have been using stereo, and there are a few formulations that do stereo and motion together. We're extending the Sintel dataset to allow people to do something like that: we will have the stereo pairs and the optical flow, and then people can do both together. We're still trying to come up with exactly how to generate the evaluation data for this scene flow problem, but hopefully we'll get that online later this fall. Further questions? So you're suggesting that if I knew what the objects were in the scene, then I'd have a simpler problem of pose estimation of those objects over time. Absolutely — but that's a hard problem. People have often thought, well, if I had the optical flow then I would know where the boundaries of things were, and then I would know where the objects were; and you're saying, well, if I knew where the objects were, then the flow problem would be easy. Indeed, these are often chicken-and-egg problems, and the interesting thing is: can I maybe formulate those two things together? That's what the layered model tries to do for a very simple case, doing segmentation and motion estimation together. But you're suggesting that, for example, in a simplified world this could work: I'm driving down the road, I've got a camera, and I know that what I'm going to see are cars, bikes and pedestrians, so I might have fairly simple models for those, and the optical flow problem is now one of detecting those objects and estimating their motion. In an arbitrary scene it would be pretty hard, but in a constrained scenario maybe you could formulate them together. More questions? [Question about the boundaries of objects in the image.] Yes — this is a very good question, and the assumption behind it is absolutely right: where do optical flow algorithms still break? Well, I told you some places, but one is at the boundaries of objects. You want optical flow so that you can find the boundaries of objects, but the boundaries are exactly the places where optical flow doesn't work very well. So should you do something special for the boundaries? Indeed, that's something I've worked on several times: formulating specific models for detecting and modeling what's going on right at the boundary. You can do that. I think these layered models are a good way to go in some sense, because they tell you the motion of the foreground and the background and allow you to make inferences about the stuff that you can't see.
So if I have something moving in front of something else, then between two frames some stuff disappears and some stuff appears, and if I can track the layers over time then I can make good inferences about what's really happening at the boundary. So I think to do the boundaries well, you'd estimate motion over multiple frames, particularly trying to estimate the structure at those boundaries over multiple frames and making sure it's consistent across time — I think you could do really well. More questions? So maybe before we thank Michael, maybe it's also a chance for a little advertising. You met, or you saw, Michael, you saw Stefan and you saw me, and you saw a number of other people here from the Institute. We're only getting started: we're still building up and we will be hiring a fourth director, so we will have four departments here in Tübingen, plus the departments in Stuttgart looking at hardware issues. So I think we're really excited about the future, and this would be an interesting place — so if you go home with this idea, tell others about it, think about your own professional future, and so on; keep us in mind, and we'll be happy to see some of you again later on. This is one more video. OK — but while the question is coming in, just to follow up on the neural question about the retina: this is a video sequence that was shown to an anesthetized cat, and typically when people study motion in the brain they use very simple stimuli like sinusoidal sine-wave gratings and things like that. Our hypothesis is that natural scene statistics are actually really important to the brain, and that if we study motion processing in much more natural scenes, maybe we'll learn something different. What you're seeing are LGN neurons in the cat — on and off cells, in red and blue — responding to different kinds of things, both image structure as well as motion in the scene. With Garrett Stanley and Jose-Manuel Alonso we've been looking at using natural scenes like the ones I showed you from Sintel to get maybe a richer model of what it is the brain might be doing, and of what the neural code for motion is, and we have a paper on that — it's a little beginning. So, this question — this is related to scene flow estimation: when I'm trying to estimate the three-dimensional velocity of objects or a moving scene, do you think it is important to use dense optical flow? If I use sparse optical flow based on image descriptors, do you think it would serve the purpose? It's a very good question, and it comes down to something I touched on at the end: the application really, really matters. If all you want to do, for example, is figure out the 3D camera motion, then you should probably track some sparse feature points, do a RANSAC algorithm or something, and try to figure out the camera motion. If you want to do image coding, maybe you need something more dense; if you want to detect motion boundaries, maybe you need something more dense. I think it really depends on your application, and there are lots of applications where some sort of sparse representation of the scene may be sufficient.
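For the camera-motion case just mentioned, a minimal sketch of that sparse-features-plus-RANSAC route might look like the following — this is my illustration using standard OpenCV calls, not anything from the talk; the frame filenames and the intrinsics matrix K are placeholder assumptions you would supply yourself:

import cv2
import numpy as np

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
K = np.array([[700.0, 0.0, 320.0],     # assumed camera intrinsics
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

# Track a sparse set of corners from frame 1 into frame 2.
pts1 = cv2.goodFeaturesToTrack(img1, maxCorners=500, qualityLevel=0.01, minDistance=7)
pts2, status, _ = cv2.calcOpticalFlowPyrLK(img1, img2, pts1, None)
good1 = pts1[status.ravel() == 1]
good2 = pts2[status.ravel() == 1]

# RANSAC on the essential matrix, then recover the camera rotation and translation direction.
E, inliers = cv2.findEssentialMat(good1, good2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, good1, good2, K, mask=inliers)
print("rotation:\n", R, "\ntranslation direction:", t.ravel())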
OK — subjective versus objective optical flow: you made the point at the beginning that it's important to define it subjectively rather than objectively, in terms of the real motion, but in Sintel you use objective flow rather than subjective flow — how do you go between those two? Indeed — right, I should have made this clearer. What I really want out of the world, out of an image sequence, is the motion field; that's what I would really, desperately like to have. If I'm doing image coding, for motion JPEG or something, I don't care about that — I only care about the apparent motion. But if I want to build an intelligent system that moves around in the world like a living organism does, and uses the motion in the environment, then I think it really wants to get at the motion field. My point at the beginning was that you can't always do that, but it's what we would like. So when we evaluate algorithms with respect to Sintel, we evaluate them in sort of the most difficult way, which is: how well did the optical flow algorithm do at getting back the projected motion of the 3D structure of the moving world? And to do that, you might argue, well, maybe you should actually be formulating it as a motion-field problem — formulating it in the 3D world — and I was hinting at that towards the end of the talk.
Info
Channel: Max Planck Institute for Intelligent Systems
Views: 22,183
Rating: 4.9875388 out of 5
Keywords: Optical Flow, Machine Learning (Field Of Study)
Id: tIwpDuqJqcE
Length: 81min 26sec (4886 seconds)
Published: Fri Jan 31 2014