CVFX Lecture 25: Multiview stereo

Captions
Okay, so the topic of today's lecture is multi-view stereo. This is the last 3D acquisition method we're going to talk about, and it's the only one that doesn't involve any interaction with the scene. For lidar and structured light, in both cases we were pushing light into the scene: for lidar there really was no camera, we used a photodetector, so we pushed laser spots out into the world and measured them coming back; for structured light we pushed either a laser stripe or a white-light pattern into the world and looked at it with a camera. Part of the problem with those methods is that in each case we're interfering with the scene, pushing stuff into it that wasn't originally there. It would be great if we could do everything purely from images: just take a bunch of pictures of a scene and use those to reconstruct 3D. As you may know, these kinds of passive, many-image techniques have become much more popular lately, and we'll talk about some off-the-shelf consumer multi-view stereo tools at the end of class.

The way to think about it is that multi-view stereo is fundamentally a combination of things we've already talked about. We covered camera calibration in Chapter 6: if you've got a whole bunch of cameras, how do you determine where each one is positioned and oriented in 3D space just from image correspondences? So we start from the assumption that all the cameras are calibrated. We also talked in Chapter 5 about estimating a dense correspondence between a pair of images, in the context of optical flow and stereo. Now you can imagine estimating the correspondence across a whole large number of images as opposed to just a pair, but the idea is the same. If I know where the cameras are and I know every corresponding pixel between a pair of images, then I can simply triangulate through every corresponding pair and get a dense set of 3D points. That's exactly the fundamental idea behind multi-view stereo.
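As a concrete illustration of that triangulation step, here is a minimal sketch, not from the lecture, of linear (DLT) triangulation of a single 3D point from one corresponding pixel pair, assuming each calibrated camera is given as a 3x4 projection matrix; all names are illustrative.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from a corresponding pixel
    pair x1, x2 (pixel coordinates) seen by calibrated cameras with 3x4
    projection matrices P1, P2."""
    # Each view contributes two linear constraints on the homogeneous point X:
    #   u * (P[2] @ X) - (P[0] @ X) = 0   and   v * (P[2] @ X) - (P[1] @ X) = 0
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # X is the null vector of A, i.e. the right singular vector associated
    # with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # dehomogenize to (X, Y, Z)
```

A multi-view version would simply stack two rows per camera that sees the point; running something like this over every corresponding pixel pair is what produces the dense set of 3D points.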
There are lots of different ways to approach this 3D-from-images problem, and I'm going to highlight a few big-picture approaches. Let me say first, though, that I think it's still true, although the gap is closing, that something purely image-based, with no structured light and no laser scanning, is still not going to match the accuracy of a method where you push light into the scene and observe it. Multi-view stereo is never going to be quite as good as putting light stripes into the scene or scanning it with a laser; you can do pretty well, but not as well. Could it never be as good? Maybe I should back off my grand claim a little, but let me put it this way: where do purely image-based methods like stereo have problems? Say I'm looking at a flat white wall. Stereo will never be able to obtain actual correspondences on that wall; the best it can do is make some assumption, like "I'm looking at a plane" or "this surface doesn't change very much." Or put it this way: suppose you've got a wavy wall with not enough waviness to generate shadows or edges for a stereo algorithm to latch onto, but enough that you can't really match up points on the wall. Images are never going to do very well there, even super-high-resolution images, but something like a lidar scanner that physically bounces off a point on the surface and comes back to you is going to be able to disambiguate that kind of thing.

When you've got a highly textured object, multi-view stereo can do very well, and in theory the limits on how well you can do are the resolution of your input images and the computational power you have for dense correspondence. Multi-view stereo algorithms are generally made to operate on lots and lots of input images, but at the same time, if you really dug into the details, those algorithms are not well suited to taking, say, a thousand 4K digital cinema frames and producing a reconstruction. Computationally, multi-view stereo algorithms still start from more tractable image sizes, and they don't use all the images you have either; they probably throw out a bunch of them and use the rest to get incrementally better 3D. So if you have the time to spend and you've got great cameras, you could take several days to generate an amazing multi-view reconstruction, but then you have to ask whether you could have just done it with a co-located camera and a lidar or structured light scanner in a lot less time. That's where the balance is.

That said, there are things you can do with multi-view stereo that you can't do with lidar or structured light. For example, you can generate 3D reconstructions of things that don't exist anymore, or that aren't in front of you, just by looking at archival photos. This is very timely: there was a great article in The New Yorker this week about Paul Debevec at the USC Institute for Creative Technologies, the guy who developed the light stages we talked about a couple of weeks ago. One thing he said in that article, which was very interesting, is that it's now very conceivable that you could look at historical film of actors who are either not around anymore or who are forty years older and reconstruct their 3D facial geometry purely from images. You don't need to make them up to look younger; you can actually see them when they were younger and effectively get a 3D model just from the footage. I think they were saying there was one great shot of Harrison Ford from a movie of long ago where he was basically rotating his head under nice, even light, essentially the ideal capture you would stage if you were trying to reconstruct someone from images. Now you could take that, push it into a multi-view stereo algorithm, and get the Harrison Ford of thirty years ago.
It's very conceivable, to the point that there was an interesting discussion in the article about how actors probably already are saying, "I don't remember doing that on set," and it's probably because they didn't do it on set; they've been 3D rendered to look like they were doing it. So suddenly things like your likeness rights are going to become a much bigger issue in terms of what your representation has been doing; you lose control over your own body, in a sense. An actor's face and mannerisms are the most distinctive things they have, and if you can simulate them somewhere else, you've taken a lot away from them. It's slowly becoming a very interesting, tricky issue. It's a great article; unfortunately it's not free online, but I really encourage you to take a look if you have a chance.

Okay, let me just mention in passing that multi-view stereo was kick-started, in the same way as regular stereo and optical flow, by a multi-view stereo benchmark. If you're interested in doing this kind of research, just as with optical flow and regular stereo, there is a nice set of data. These datasets are basically little tabletop models; here's a little model stegosaurus. Someone went to the effort, a combination of Microsoft and Middlebury I believe, to create very nice 3D models of tabletop-sized objects observed by a whole lot of calibrated cameras. For example, their full datasets have more than 300 calibrated camera views of the subject, and they also used, I believe, some sort of laser scanning to generate very accurate 3D models of these objects. So what you have is, number one, the true 3D object and, number two, lots of calibrated camera images of that object, and this is a way you can benchmark your results against other people's. Just as with the other benchmarks you've seen, you have the ground-truth model; for example, here what you see on the left is one multi-view stereo algorithm's result, and this is the actual ground-truth scan. You can scroll your way down through different multi-view stereo algorithms: this one is a bit less detailed, these are a bit finer. The idea is that you upload your results, and these days, to be taken seriously as a multi-view stereo paper, you basically have to compare yourself against these datasets. There is also another benchmark by Strecha et al. that focuses on much larger outdoor scenes: instead of something on a turntable in front of you, they did benchmarking for building-scale scenes, so if you're doing larger-scale material, that's the dataset you should be benchmarking on, not this smaller one.

Okay, so let me talk about four basic techniques; I'm going to spend most of my time, I think, on the first one and the third one.
The first technique is what I would call a volumetric method, and this is actually related to the visual hull idea we talked about in the context of markerless motion capture. You just turned in a homework on silhouette-based estimation: I see the object, I look at it from multiple perspectives, and I obtain a different silhouette of the object in each view. The idea there was that I can do 3D reconstruction by coarsely voxelizing the space and then asking which voxels in the world are consistent with my images of the world. The visual hull starts from only the binary silhouette in each image; it pushes each silhouette forward into space, looks at where those extrusions intersect, and says the true object must lie somewhere within that set of 3D voxels.

The twist for multi-view stereo is that instead of thinking about just the silhouettes, we think about the actual images we saw. Let me go back to my little figure. Consider this voxel in 3D space. I already have all the calibrated cameras, so I project that voxel down into each camera view and ask: did I see roughly the same color in each of the three camera views? If I did, then that voxel is a good candidate to actually be part of the object. If I didn't, for example if the cameras saw red, green, blue instead of red, red, red, then that voxel can't be a true surface voxel, because the cameras didn't all see the same color. You do this modulo expected small color changes: instead of looking for a direct, 100% match, you have a cost function measuring how similar the observed colors are. You also have to allow for occlusion. Suppose the gray voxels are the true object: even though this object is convex, from this perspective one camera sees this voxel before it sees that one, so that camera doesn't necessarily have to agree on the color of the occluded voxel. The idea is that you consider the voxels in a very specific order, and the initial voxel coloring approach also required a specific configuration of cameras to prevent bad cases: you push your way through the voxels starting close to the cameras, sweeping a plane of voxels away from them, and at every step you look at all the voxels in the plane and ask which of them are consistent with all the pictures you took. That's a little restrictive, but it was the first approach: voxel coloring, by Seitz and Dyer.
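Here is a minimal sketch of that photo-consistency test, assuming pinhole cameras given as 3x4 projection matrices; it ignores the occlusion handling and the near-to-far sweep order that voxel coloring relies on, and the color threshold is an arbitrary illustrative choice, not the cost from the original papers.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X into pixel coordinates with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def voxel_is_consistent(X, cameras, images, color_thresh=30.0):
    """Crude photo-consistency check for a voxel centered at X.
    cameras: list of 3x4 projection matrices; images: matching list of HxWx3 arrays."""
    colors = []
    for P, img in zip(cameras, images):
        u, v = project(P, X)
        h, w = img.shape[:2]
        if 0 <= u < w and 0 <= v < h:              # voxel falls inside this image
            colors.append(img[int(v), int(u)].astype(float))
    if len(colors) < 2:
        return False                                # not enough views to decide
    colors = np.array(colors)
    # Agreement measured as the spread (standard deviation) of the observed colors.
    return colors.std(axis=0).max() < color_thresh
```

In the actual algorithms, a voxel judged consistent is assigned the agreed-upon color and kept, while inconsistent voxels are carved away.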
That was followed by something called photo-consistency, or the photo hull, due to Kutulakos and Seitz. The idea there is that you're trying to find the largest set of colored voxels in the world that are consistent with projecting those voxels down into each of your calibrated camera images; it's similar in spirit, carving away voxels that can't possibly be consistent. These algorithms are what I would almost call classical now. Part of the problem with them is that you have to voxelize your space very finely, which means that if you want a super-accurate scan, say millimeter or sub-millimeter accuracy, your working volume is necessarily restricted by how much memory and computational horsepower you have. The other thing is that with some of these earlier techniques you want to make sure you don't do anything you later regret: if you remove a chunk of voxels and later find out you made a mistake and needed them, it's hard to get those voxels back. There have been modifications to these voxel carving and voxel coloring algorithms aimed at letting you recover from such mistakes. In some sense, you can also imagine posing this as a huge multi-label graph cut problem, where the label for each voxel corresponds to a discrete disparity or a discrete position in each of the source images, a huge multi-label set that you could attack with some sort of massive alpha-expansion.

One of the nice things about these volumetric approaches is that, unlike the methods we're going to talk about next, they don't depend on any initial correspondences between the images. They just look at where the cameras are, push a plane of voxels through space, and chew away at 3D space until what's left is a colored set of voxels consistent with the images. The methods we'll talk about next leverage the ideas behind sparse and dense correspondence much more heavily, and consequently those, the ones really using correspondences from something like SIFT or feature patches, are the top-performing multi-view stereo algorithms now. So, like I said, volumetric methods were among the first approaches people looked at for this problem, but they're not what I would currently call the best methods.

The second family is what I would call surface deformation methods, and they inherit their methodology from an old image processing and computer vision algorithm called snakes, or active contours. The idea behind an active contour is that if I have an object in an image that I'm trying to segment, I build what looks like a flexible rubber band: initially a set of edges connected by vertices, which we call an active contour or a snake, and at every step the vertices of the snake try to wrap more tightly around the object.
The idea is that you build a cost function in which the edges of the snake are attracted to image edges: when part of the snake is out in the middle of a flat region, it is compelled to move toward somewhere it hits an image edge, and there are various ways of pulling the vertices inward until, at the end, they wrap tightly around the object you care about. This was a very common way to do image segmentation for a long time. A couple of things are tricky. Number one, you need a large number of nodes to wrap tightly around the subject; for my tree example, if I have only 20 nodes I'm probably not going to get a very conformal wrap around the object when I force straight lines between the vertices, so maybe I need 200 points on my snake instead of 20. Number two, there's a trick to keeping the topology a nice, simple closed curve: if I have something with a narrow isthmus between two regions, it's possible for the snake to cross in on itself, and that would be bad, so maintaining the topology, especially when the nodes get really close together, can also be a headache.

The way people would solve this problem now is, instead of using a literal snake, to use what's called a level set. The idea behind level-set image processing is that instead of pushing a curve around, you estimate a function of x and y that is exactly zero on the boundary of the thing you're trying to segment, negative inside the object, and positive outside. Say I'm just trying to segment this circle: I estimate such a function instead of tracking the contour directly. This gets around a lot of the issues of keeping track of how many vertices you need and of the snake's edges criss-crossing. The downside is that you're back to something like a volumetric method, where you have to finely discretize the space; once you know the value of this function at every grid point, you extract the locations where the level set is essentially zero. That's not to say this can't be done, it's a very common thing to do, but it is still a volumetric approach where you have to make choices about how fine your grid is.
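For the circle example, one standard choice of level-set function, my illustration rather than the specific one from the lecture, is the signed distance to the circle of radius r centered at (c_x, c_y):

```latex
\phi(x, y) \;=\; \sqrt{(x - c_x)^2 + (y - c_y)^2} \;-\; r,
\qquad
\phi = 0 \ \text{on the boundary}, \quad
\phi < 0 \ \text{inside}, \quad
\phi > 0 \ \text{outside}.
```

The evolving contour is then just the zero crossing of \phi, and changes in topology are handled automatically as \phi is updated.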
So how does this relate to multi-view stereo? The idea is that instead of a one-dimensional contour, what we have is a three-dimensional triangle mesh that wraps tightly around the object in the world that we want to model. You start with some really simple triangle mesh, maybe a sphere placed right around the object in space, and then you deform the vertices of those triangles until the mesh wraps around the 3D object you're trying to recover. What you need is a cost function, where S is, for example, a triangulated mesh or a level-set version of one, and you try to minimize the energy of the surface, which has a bunch of terms: a measure of how well the surface matches up with each of the images; then, similar to markerless motion capture, a term saying the silhouettes of the surface should project down onto the silhouettes in the images; and then usually also what's called an internal force, something that keeps the surface as tight as possible. You don't want a surface that is too loosey-goosey; you want something that's always compelled to find the tightest boundary around the object it can, so you usually have a force that keeps pushing the surface to conform. We already talked about the silhouette term in the markerless motion capture section, and the internal term you can think of as always pushing the snake to be tighter.

What about the image term, how consistent the surface is with each of the images? There are different ways to do this, but the easiest is the following, and I think I have a better picture of it than what I can draw. I have some candidate triangle mesh. I take a patch in one image, project it onto the 3D surface, and find out where that patch lands over in another image, which gives me this dashed-line box. Then I can do something simple like compute the normalized cross-correlation: for every box I draw in one image, I find the corresponding dashed-line boxes in all the other images and compute the cross-correlation between each pair. If the point on the surface is good, then I expect that when I project it back down into the images, the patches all roughly agree.

One thing you have to be careful of, and this is another way of illustrating the same picture, is that you don't usually want to match square patches in one image to square patches in another image. We had a picture long ago showing that what's square in one image can look very non-square from another viewpoint. Since you already have a candidate triangle mesh, you use that mesh to mediate what you should be comparing: you take the rectangle, push it up onto the mesh, find where the rectangle hits the mesh, which gives you some oddly shaped 3D chunk of surface, and then you project that chunk back down onto the other image. Then you compare the solid square here to the solid, oddly shaped region in the other image. You should really be using the expected surface to tell you what the right set of pixels to compare is. That idea, reprojecting the square onto the surface and then down into the other image instead of just comparing squares to squares, is definitely an important part of getting a surface-based deformation method like this to work.
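Here is a minimal sketch of that mesh-mediated comparison, simplified by treating the local surface as a plane with a given point and normal rather than intersecting the actual triangle mesh. The camera convention x ~ K(RX + t), grayscale images, and all the names are assumptions for illustration; bounds checks and interpolation are omitted.

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between two equally sized pixel vectors."""
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return float((a * b).mean())

def backproject_to_plane(cam, uv, p0, n):
    """Intersect the viewing ray through pixel uv of camera cam = (K, R, t)
    with the plane through point p0 with normal n."""
    K, R, t = cam
    center = -R.T @ t                                   # camera position in world
    ray = R.T @ np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    s = n @ (p0 - center) / (n @ ray)                   # distance along the ray
    return center + s * ray

def mesh_mediated_score(img_ref, img_other, cam_ref, cam_other, p0, n, u0, v0, half=3):
    """Compare a (2*half+1)^2 square around (u0, v0) in the reference image with
    its reprojection, through the candidate local surface approximated by the
    plane (p0, n), into the other image; nearest-neighbour sampling only."""
    K2, R2, t2 = cam_other
    ref, warped = [], []
    for dv in range(-half, half + 1):
        for du in range(-half, half + 1):
            X = backproject_to_plane(cam_ref, (u0 + du, v0 + dv), p0, n)
            x = K2 @ (R2 @ X + t2)                      # project onto the other image
            u2, v2 = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
            ref.append(float(img_ref[v0 + dv, u0 + du]))
            warped.append(float(img_other[v2, u2]))
    return ncc(np.array(ref), np.array(warped))
```

The point of the sketch is only that the second set of samples is dictated by the candidate surface, so a square in the reference image is compared against whatever shape it maps to in the other view.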
You can also simplify: if you don't want to compute this odd, non-square region exactly, at the very least you can compare the square against a skinny rectangle or a fat rectangle depending on the viewing angles; from the geometry you can tell whether this square will project to something skinny or something fat over there, and compare against at least a family of possible shapes rather than only the original square.

So what are the disadvantages of this technique? Since you're evolving this 3D mesh, on the plus side your resulting 3D reconstruction is not going to have any holes in it, because your original spherical initialization didn't have any holes; you're basically just pushing the mesh inward until it matches up with everything you saw. That's nice, because one thing you'll see in other multi-view stereo algorithms, like the ones we'll talk about next, is that you often have a missing-data problem: where you don't have good correspondences you end up with missing points in the 3D world, and wrapping a closed mesh around the object gets around that. On the other hand, this is not really the right thing to do for anything beyond a tabletop-style scan. The scans I showed you for the multi-view stereo benchmark are fine for a volumetric-style method: the stegosaurus, I don't think, has any holes in its topology, and the temple does have holes, but that's okay, because if you're using level sets, the level set can tolerate holes in the object; that's a nice thing about the level-set approach. That being said, if I'm looking to do a multi-view reconstruction of a scene like this classroom, where there are lots of independent surfaces, lots of holes, and lots of things that are definitely not part of one object, then this may not be the best way to go.

Okay, so what are the currently top-performing multi-view stereo algorithms? I think it's fair to say they're generally based on what I would call patches. There's a very well-known family of algorithms that started with what was called patch-based multi-view stereo, and patch-based methods are, these days, highly competitive. The idea is pretty simple. Again, the setup is that we have a whole bunch of calibrated cameras, and we can estimate feature points, for example SIFT and SURF features, in each of these cameras. This is going to be kind of a crappy picture because I didn't really plan it, so just imagine that's a straight line. The idea is that for every feature I can accurately match up across my set of images, I can triangulate through those matches to get a single 3D point in space. When you've got lots and lots of images of the same scene taken from viewpoints that aren't wildly different, you can generate a large number of sparse feature matches, very much like the SIFT features or Harris corners we talked about before.
So first we generate lots and lots of sparse feature matches between all the images in our set, and then we triangulate those out into 3D space and get a whole bunch of sparse 3D points. In some sense this is exactly the result of structure from motion: when we talked about match moving, the two parts were, number one, recovering the 3D camera positions and orientations, and number two, recovering sparse 3D point positions, which is exactly what you discovered on your match moving homework, where you got not only the 3D camera path but also a bunch of 3D points in space. Those are the starting point for a patch-based algorithm.

For every point that you can successfully triangulate out in the world, you build a little patch. You initially triangulate some point in the world, and that becomes the initial center of a patch; here I've drawn something like a five-by-five grid of points on the patch. So I estimate a center and a normal for the patch, I lay a grid on that patch in the 3D world, I project that grid down into each of the images that sees the patch, and then I run a little optimization problem over the center and the normal of the patch. For every good sparse feature point I have, I generate a little patch of texture in the world, and then I iteratively move the center and the normal of that patch around so that it matches up as well as possible with what gets projected back down into the images. The result is that, corresponding to every good point I would have gotten from something like structure from motion, I get not only a 3D point but a little oriented piece of texture, a little square out in the 3D world. At the end of this process I have not just a cloud of 3D points but a cloud of little textured patches.

How do you determine the normals of the patches? One way to think about it is to first ask which camera gives the best view of that patch. For example, you could take a reference camera and say the first candidate for the orientation of the patch is the normal that points from the center of the patch back toward that camera; that gives the initial estimate of the patch, and then the orientation is iteratively refined from there. So the initialization can just be the normal that points back at one of the cameras. Obviously you don't want the patch edge-on to the image; you want it as fronto-parallel to the image plane as you can get. The assumption is that a good patch of image texture corresponds to a reasonably fronto-parallel surface out there in the 3D world; maybe it's not exactly parallel to my image plane, but it's close enough that it gives me a good place to start.
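A minimal sketch of how such an oriented patch might be represented and scored, in the spirit of, but not copied from, PMVS; the grid size, the cell spacing, and the `sample` and `score` helpers (for example, the NCC function above) are assumed placeholders.

```python
import numpy as np

def patch_grid(center, normal, cell=0.005, size=5):
    """Build a size x size grid of 3D points on the plane through `center`
    with normal `normal` (cell spacing in scene units; values illustrative)."""
    normal = normal / np.linalg.norm(normal)
    # Two tangent directions spanning the patch plane.
    a = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, a); u /= np.linalg.norm(u)
    v = np.cross(normal, u)
    offsets = (np.arange(size) - size // 2) * cell
    return np.array([center + du * u + dv * v for dv in offsets for du in offsets])

def patch_score(center, normal, cameras, images, sample, score):
    """Photo-consistency of an oriented patch: project its grid into every
    visible image, gather the sampled textures, and average pairwise scores.
    `sample(img, cam, pts)` and `score(a, b)` are assumed helper functions."""
    textures = [sample(img, cam, patch_grid(center, normal))
                for cam, img in zip(cameras, images)]
    pairs = [(i, j) for i in range(len(textures)) for j in range(i + 1, len(textures))]
    return float(np.mean([score(textures[i], textures[j]) for i, j in pairs]))
```

A PMVS-style refinement would then adjust the patch center along the viewing ray of a reference camera and the two angles parameterizing the normal so as to maximize this score, using a generic nonlinear optimizer.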
What about features that are on the edges of things? That's a good point. The idea is that if I had a feature detection on the edge of something, I wouldn't be able to match it well across images in the first place: for example, if I have a SIFT feature on the edge of a building, half of it is on the building and half of it is background behind the building, so when I see it from another point of view I'm never going to get a good match. Those get thrown out, and I only seed this process with really good matches. That being said, I'll talk next time about how you can deal with matching when you've got something that's half on and half off a surface, when we talk about registration of 3D objects. Good question.

So once I have the initial set of patches in the world, that's still pretty sparse, and now I want to make a dense set of matches. Suppose this light-gray pixel was one for which I had a good match, corresponding to this nice 3D square out here. For the pixel next door, which didn't have a good SIFT feature, what I can do is project a ray out into the world until it meets the plane defined by the patch of the first pixel; that effectively extends the light-gray patch's plane out into the world. I find where the ray through the neighboring pixel intersects that plane, I generate a candidate darker-gray patch with that center and the same normal as the other one, and then I start to refine that patch. So I flow my correspondences out from the good sparse correspondences to places where I didn't necessarily have good SIFT matches, pushing patches out into the world. I'm seeding the world with good patches from something like SIFT and then expanding them outward, under the assumption that surfaces in the scene are relatively contiguous, so that neighboring points are not too different from each other. Obviously getting this to really work requires a lot of careful programming, but that's the basic idea: at the end of the day you go from a sparse set of little patches to a dense set of little patches, and then you can use something like the algorithm we're going to talk about next time to turn all those little patches into a more contiguous 3D model.
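The expansion step just described is essentially a ray-plane intersection; a minimal sketch, again assuming the pinhole convention x ~ K(RX + t) and illustrative names:

```python
import numpy as np

def expand_patch(K, R, t, pixel, seed_center, seed_normal):
    """Seed a new patch for a neighbouring pixel by intersecting its viewing
    ray with the plane of an existing patch (seed_center, seed_normal); the
    new patch starts with the same normal and is then refined like any other."""
    cam_center = -R.T @ t                                    # camera position in world
    ray = R.T @ np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    depth = seed_normal @ (seed_center - cam_center) / (seed_normal @ ray)
    new_center = cam_center + depth * ray                    # ray/plane intersection
    return new_center, seed_normal.copy()
```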
This is really the key idea behind some of the cool stuff that's been coming out lately. For example, one of the best researchers in this area is Yasutaka Furukawa; let's look at this video. This is what you can do if you have lots and lots of tourist photos of a certain place; here they've got the Colosseum, and I may skip around the video, so apologies to Yasu. You get these very highly detailed surface meshes, and I think you'll see as they zoom in that once you've got the 3D, you can relight the model so it matches a given photograph, or a given position of the sun at a certain time of day; the relighting of the scene really sells the idea that this is a 3D model. This is an original picture, and this is a rendered picture that matches it as well as possible, and then you can fly through the scene and see how good it looks. Here you can see there are some missing pieces in the windows and so on; I guess those are actually holes in the real environment, but you can see the patches are not totally perfect around the edges of things, while on the surfaces things look really good. We can skip forward: here's a reconstruction of Venice. Again, around the edges of objects you see a little bit of noise along all these spikes and towers, but on the actual surfaces of objects things look really good, and once you've textured the objects things look even better. This is really pretty impressive, and again it's possible because of the zillions of tourist photos of these places that you can feed into your multi-view stereo algorithm. The details of how you choose which images to use as the basis for your MVS reconstruction matter too, and there's a lot of good work on that as well.

Now, this is not to say that you could just pop this onto your computer and do it in a few hours. I think I mentioned in a previous lecture the "Building Rome in a Day" paper, where the idea was: if you took all these tourist photos of Rome and you had clusters of computational power, how would you get the 3D reconstruction done in 24 hours? That's not 24 hours on a single computer; that's a ton of computers, although lately it's become a bit more commodity. For example, you can now download this nice tool called VisualSFM. One nice thing about patch-based multi-view stereo, or PMVS, is that the code was made publicly available, so lots of commercial and hobbyist tools are built on top of it. VisualSFM is a very nice way of simply dropping images in and doing the automatic matching and the automatic 3D reconstruction. The matching may take a while; this is basically a real-time run of about three minutes of matching, and I'm not going to watch all three minutes, but it really is as easy as loading up some images and generating the correspondences. Then what you see is that at some point it starts inserting every camera into the scene one at a time; if you're reading closely, there's output about bundle adjustment and the number of 3D points. It's solving an incremental structure-from-motion problem, like the match moving problems we talked about in Chapter 6. After you solve all these one-by-one match moving problems, you execute the dense reconstruction with PMVS, and it takes a few more minutes to go from the sparse points to a dense set of points. That's a little harder to see, but at the very end of this video they toggle something so the view switches from the sparse cloud to the dense one, and you can see all the holes between the points fill in. This is definitely something to play around with if you're curious about doing dense reconstruction. Another thing, while I'm talking about it: there are also web-based multi-view reconstruction services, where you take your images and upload them to a server somewhere and it does the reconstruction for you.
One thing that's really... well, actually, let me go back for one second and finish up the lecture on this topic, and then I'll come back to the tools. That's a good question, though: yes, the Photo Tourism guys know Yasu very well, and there's definitely a big connection between that whole group of University of Washington alumni and this project. PMVS sits on top of, I believe, software from Noah Snavely called Bundler, which is basically the heart of Photo Tourism, a large-scale structure-from-motion system for image collections. So what you need to make all this work is to get Bundler and PMVS and put them together; when that came on the scene, it suddenly enabled the average person to do multi-view stereo in a way you couldn't have dreamed of before. I'll show one last thing at the end, but let me finish up the technical part of the lecture; this is just an example of a multi-view stereo result.

The last thing I want to say is that instead of trying to estimate the 3D positions of points directly, you can also think of this as trying to estimate a set of simultaneous depth maps from my images. For this pixel in this image, I want to estimate its physical 3D depth, and once I know that depth I can push this pixel forward and color that point in space. This is what I would call depth map fusion; the last of the four techniques is depth map fusion. The idea, and I can't really say it much better than this picture, is that each image takes turns being the reference image, and for every pixel in the reference image I hypothesize: what should the depth at this pixel be? I push a ray out into 3D space to where the pixel would be if it were really at that depth, then project that 3D point back down onto all the other images, look at the neighborhoods around the resulting pixels, and ask whether those neighborhoods match up with the neighborhood around my reference pixel. If they do, that depth is a good candidate; if they don't, it's probably not the right depth. Very crudely, you can imagine optimizing over the depth d(p) for every pixel p, scoring each candidate by how consistent the corresponding pixel neighborhoods are in the other images. It's actually very similar to what we talked about with the surface-based reconstruction, where I push points forward, project them back down into the images, and use image-to-image matches as the basis for comparison. And again, you probably want to be smart about not just comparing squares or rectangles around the corresponding points, but also using the deformed rectangles that come from perspective distortion as you view the surface from very different angles; I don't want to compare squares to squares, I want to compare squares to the squished or stretched versions of those squares.
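A minimal, brute-force sketch of that per-pixel depth search; real depth-map-fusion systems add plane sweeps, regularization across pixels, and the deformed comparison windows mentioned above. The `patch_at` and `score` helpers are assumptions, for example neighborhood extraction and NCC.

```python
import numpy as np

def best_depth(pixel, depths, ref_cam, ref_img, others, patch_at, score):
    """Brute-force depth search for one reference pixel.
    ref_cam and each camera in `others` are (K, R, t) with the pinhole model
    x ~ K(RX + t); others is a list of (camera, image) pairs; patch_at(img, uv)
    returns the pixel neighbourhood around uv; score is a similarity such as NCC."""
    K, R, t = ref_cam
    center = -R.T @ t
    ray = R.T @ np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    ray /= np.linalg.norm(ray)
    ref_patch = patch_at(ref_img, pixel)
    best_d, best_s = None, -np.inf
    for d in depths:
        X = center + d * ray                          # hypothesized 3D point
        total = 0.0
        for (K2, R2, t2), img in others:
            x = K2 @ (R2 @ X + t2)                    # reproject into the other view
            uv = (x[0] / x[2], x[1] / x[2])
            total += score(ref_patch, patch_at(img, uv))
        if total > best_s:                            # keep the most consistent depth
            best_s, best_d = total, d
    return best_d
```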
I guess the last thing I'd say about multi-view stereo in general is that you can obviously always do better with a hybrid of pure images and something like what we talked about last time. If I also allow myself to project structured light into the scene, just putting some light stripes into the world to give texture to the objects, then I can do even better, because now I've got the best of both worlds: all the stereo correspondences, plus some introduced texture that helps me with the featureless regions of the world.

Okay, so what I was saying is that this has become very commoditized. Here is this service from Autodesk called 123D Catch, now on your phone, although it doesn't actually run on your phone; you take the pictures on your phone. Here this guy is going around his kid taking pictures, then you upload them to a service and you get a 3D model back. People do this with their food, with their sushi, and you can see there are some places where you don't necessarily get all the data back, but Autodesk has done a pretty nice thing where you can now print out these objects: you can do some editing of the object, crop it with planes and so on, and then get a 3D print. So I was making action figures yesterday; actually, if you're going to make an action figure, this is probably an easier way to do it than the setup we had last time. I don't know how much they charge you to print the object, but the 3D reconstruction part is free. I downloaded this on my phone yesterday and made this model of my Totoro figure and the book it's sitting on. I just went around and took about 20 pictures, and you can see things are not perfect: there's a chunk of texture that, for whatever reason, it didn't get in here, so you're looking through the object, and the book is not super great either; it looks like this book has been bashed around a bit, and it's kind of blended into my leather coffee table. There are lots of reasons why this could happen. One thing that will defeat or at least mess up a multi-view stereo algorithm is reflections and specularities: the book is shiny, this was on a day when light was coming through the windows, and reflections on the surfaces of objects will definitely confuse an algorithm that's trying to decide whether things are photo-consistent. So if you really want to do this the right way, you should try to control your lighting so there aren't a lot of reflections bouncing off things. In terms of the Totoro, which is actually a matte object, I'm actually kind of surprised; I just didn't let it load long enough, but the 3D reconstruction of this guy is pretty good, except that you can see it didn't totally resolve the gap between this little tree branch and its head. If I zoom in a little, this appears to be a filled-in hole where there probably shouldn't be one.
You can also see there's a texture issue: the head of this guy has some sort of green leafy texture, which again came from an erroneous estimate of what the color should be at that surface, even though the surface shape itself is about right. Let me figure out how to get back to the non-zoomed version... okay, so now I'm back to where I was. You can see that not only did it get some of my table, it also got some of my sofa that's farther away, so it's only giving me the patches it was able to get good estimates for. But this is definitely worth playing around with, and if you're a little more careful than I was about taking the pictures, it does work pretty well.

One other thing, apropos of what's going on these days: has anyone played this L.A. Noire video game? Was I the only one? All right. L.A. Noire was a very interesting facial motion capture situation: instead of a traditionally animated face, they had all the actors perform their facial expressions inside a multi-camera studio, which is basically a multi-view stereo setup, and then they would stream a version of this captured dialogue into the game. So this is the original on the left, and on the right is the motion capture result, the multi-view stereo result; what they're really doing is estimating 3D position and texture over time. I've got to say that when you see it in the game, it really is very different from other avatars in video games. Part of the gameplay is that you're trying to look at people's expressions to tell whether they're telling the truth or lying, so part of the key was capturing actors making these kinds of shifty-eyed expressions, and I think it was fairly successful. There were points where it was kind of uncanny valley, but there were a lot of points where you could really buy that this was a person. As you watch toward the end of this video, it's not some crummy CGI version of someone; it really is the person's face that's been put onto these 3D models. The part that was a little tricky is that the bodies were separately motion captured and the moving heads were then attached onto those bodies, and things like clothes and hats inevitably look a little bit weird; when you put the hat on the person, the hat doesn't seem to sit totally right. But that was really an impressive step forward, and unfortunately I don't know that any other games have gone on to do anything quite like this, because it was a huge undertaking for them to figure out how to make it work. It was one of the selling points of the game, though, that you really got a sense of personality coming from the faces.

Another thing that just came off the press is the new Google camera app, which does refocusing, or blurring, of an image basically on your phone.
Maybe some of you are familiar with the Lytro camera. The Lytro camera does this with what's called a light field, in hardware: it actually captures a lot of different physical images of the scene, each from a slightly separated viewpoint, so you can do small changes in parallax, small changes in viewpoint, small changes in refocusing, because it's got lots of actual images of the world. Here, you've only got one image of the world, or so it seems. What's actually happening is that you're asked to move your camera a little bit as you take the picture, and it's really running a multi-view stereo algorithm that creates a depth map. So in addition to the picture, it has a depth associated with every pixel, and that's what lets you do interactive refocusing of the image: it can say all this stuff is closer to the camera, so it should stay sharp, and all this stuff is farther away, so apply a bit of blur to it. It's kind of like a software version of the Lytro camera, and I'm sure the Lytro people are probably a little bit unhappy about stuff like this. If you look at what's under the hood, they say they're using structure from motion, bundle adjustment, and multi-view stereo; the key is that you don't just have a single image to work from, you have a series of images that you use to create the depth map. If you look at this Google Research blog post after taking this course, you should be able to understand pretty much everything in that paragraph: first they do structure from motion, then bundle adjustment, then multi-view stereo. There are also some interesting problems about how you do this quickly on a phone, because unlike 123D Catch, this one isn't farming anything out to the cloud; it's all done on the spot, so they need fairly efficient structure-from-motion and multi-view stereo algorithms that can run on your phone.
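Going back to the lens-blur idea for a moment: a toy sketch, certainly not Google's algorithm, of how a per-pixel depth map can drive a synthetic refocus, by picking from a small stack of progressively blurred copies of the image; all parameter values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lens_blur(image, depth, focal_depth, max_sigma=8.0, levels=4):
    """Synthetic refocus: pixels near focal_depth stay sharp, pixels whose depth
    differs get a progressively stronger Gaussian blur.
    image: HxWx3 float array; depth: HxW array in the same units as focal_depth."""
    # Precompute blurred copies at increasing strengths (level 0 = sharp).
    sigmas = np.linspace(0.0, max_sigma, levels)
    stack = [image] + [gaussian_filter(image, sigma=(s, s, 0)) for s in sigmas[1:]]
    # Map each pixel's defocus (distance from the focal depth) to a blur level.
    defocus = np.abs(depth - focal_depth)
    level = np.clip(defocus / (defocus.max() + 1e-8) * (levels - 1), 0, levels - 1)
    level = np.rint(level).astype(int)
    out = np.empty_like(image)
    for l in range(levels):
        mask = level == l
        out[mask] = stack[l][mask]
    return out
```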
Is there anything else I want to say here? I think that's pretty much it for the moment. I would say again that if you want to know what's really the best multi-view stereo algorithm at a given moment, this Middlebury page is the first place to look to understand how good things are. You can sort by whether you care about accuracy or percentage completeness, and whether you care about all of the views or just a subset of the views, and you can see these entries, Furukawa-2, Furukawa-3; Furukawa is the researcher I told you about before, and these papers are the original patch-based multi-view stereo papers, so you can see they're still at the top of the stack. I think it's fair to say patch-based methods are still pretty good. I think I remember reading that the lens blur in the Google camera app is actually a computationally much simpler algorithm that probably wouldn't rank very high on this list, but it can run quickly on your mobile phone, and for the purposes of refocusing your selfie it looks good enough. You don't need a super-accurate depth map to do that kind of foreground-background blur, because of the kind of thing you're photographing: usually you're sufficiently separated from your background that the only real problem is telling which pixels are in the foreground, and you can probably do that pretty well with a relatively simple algorithm.

Okay, so, any questions about multi-view stereo? It really is worth downloading and trying the Autodesk 123D Catch; there's an app for the phone, a desktop version, and a web version, and I had a hard time getting all three of them to play well together, but you can certainly get your phone captures onto the web app and then fool around with them there. And like I said, I'd be very curious to see the whole pipeline going from acquisition to 3D printing; I think that would be pretty neat. Okay, I think that's it for today.
Info
Channel: Rich Radke
Views: 9,682
Keywords: visual effects, computer vision, rich radke, radke, cvfx, multiview stereo, voxel coloring, voxel carving, 123dcatch, 3d modeling, pmvs, cmvs, patch based multiview stereo
Id: 2pbH6A6d_Ow
Length: 55min 58sec (3358 seconds)
Published: Mon Apr 28 2014