Tutorial: 3D Deep Learning

Captions
Hi, and this is a tutorial about 3D deep learning. We are witnessing the rising popularity of 3D vision at CVPR, so in this tutorial we are going to introduce recent progress on applying deep learning techniques to the newly available large-scale 3D data. My name is Hao Su; I am a new professor at UC San Diego, and this tutorial is co-organized with my colleague Professor Leonidas Guibas from Stanford, together with Michael Bronstein, Evangelos Kalogerakis, Jimei Yang, Charles Qi, and Qixing Huang. We will cover the whole afternoon on the topic of 3D deep learning: there will be an opening remark for ten minutes, and then three main topics on how to apply deep learning to different types of 3D data.

I will first give an overview of the background: why there is a new field of 3D deep learning. Even before deep learning, we were deeply interested in understanding 3D, because the world around us is composed of 3D geometry, from the houses we live in to the cars we drive, from the food we eat to the clothes we wear. To understand the world we need to understand geometry, and 3D has very broad applications in robotics, augmented reality, autonomous driving, medical imaging, and so on. For those of you who are here, I probably don't need to say much about why 3D is important.

Historically, however, 3D geometry computing techniques focused on single models and lacked robustness. One reason is the lack of big 3D data. In computer graphics there are the famous Stanford bunny and Utah teapot, and in vision and graphics the Princeton Shape Benchmark has been used for many years, but its scale is not large and its diversity is not sufficient. Recently we are witnessing the rise of internet 3D models: you can find millions of man-made CAD models in online repositories, backed by the growing market of crowdsourcing for 3D modeling, so we see an opportunity for data-driven 3D geometry computing. We are also seeing more and more 3D sensors, such as the Kinect and the RealSense, and sensors from Google and many other companies, with higher and higher quality at lower and lower price; it is estimated that over 30 million units have been deployed worldwide, and we can anticipate how much 3D data we will have in the future.

With this background of big 3D data emerging, there comes the new field of 3D deep learning. It arguably started around the year 2015, with some seminal works and some big shape datasets, and it is very active due to huge industry interest. So what are the basic tasks of 3D deep learning? My classification is that there are three principal categories: 3D geometry analysis, 3D synthesis, and 3D-assisted image analysis. Concretely, for 3D geometry analysis there are tasks such as 3D classification (given a 3D object, maybe a volume or some other form, what is the object category), 3D parsing (given a 3D object, what are its parts, or given a scene, what are the objects in it), and the correspondence problem (building point-wise or part-wise correspondences across shapes). The second main category is 3D synthesis.
For example, you may want to build a 3D model out of a single image, you may want to complete a partial scan, or you may want to edit an image or a model in place under some constraints. The third category is 3D-assisted image analysis: for example, with an image as the query you may want to retrieve 3D models, possibly across views; there are many problems of this kind, such as cross-view image retrieval, cross-view image generation, or intrinsic image decomposition. In these problems 3D is not explicitly the output, but it sits somewhere in between. From the network perspective, we are designing networks with 3D as the input, as the output, or as an intermediate latent representation. For the sake of time, this tutorial will focus on two types of applications: 3D geometry analysis and 3D synthesis.

So far I have presented one dimension of categorizing 3D deep learning, in terms of tasks. A second dimension I want to mention is that there have been a number of new 3D deep learning algorithms, because 3D deep learning faces fundamental issues that differ from 2D deep learning. For images, we basically assume that the input lives on a 2D regular grid; an image has a unique representation as a 2D array of pixels, and this regular structure admits the convolution operation. For 3D this is very different: there are many representations, each tailored to its own application scenario. Let me list a number of the popular ones. The multi-view representation: for one object, you look at it from different viewpoints, which might give RGB or RGB-D images. The volumetric representation: this is more often seen in the medical imaging community, but you can also voxelize other types of 3D data, using a regular grid with an occupancy field, a distance field, or some other field-based representation. The polygonal mesh: a 3D model is represented as a collection of triangles or other polygons. The point cloud: the 3D data is a set of points, possibly unordered; this is a very common type of 3D data coming from raw sensors. And primitive-based representations: for example, a CAD model is made of a number of primitives, each of which might have its own control points, or some other parametric representation.

In general, we can classify 3D data into two types of representations: the rasterized form, which has a regular grid structure, and the geometric form, which is often irregular. From this perspective there is a second dimension along which to classify 3D deep learning algorithms: there are methods for the multi-view, volumetric, point cloud, mesh, and part-based representations. Historically, because of their similarity to images, the multi-view and volumetric lines of work started earlier, but within this year and last year we are witnessing more deep learning work on the other, irregular representations, and we see a number of them at this CVPR. The task space and the representation space form a Cartesian product, and the two together build a very rich space for 3D deep learning. Our tutorial will be organized along the representation dimension, but for every representation we will also introduce networks for different tasks.

As I mentioned before, there is a fundamental challenge in 3D deep learning: for images you have the regular structure and you can use convolution on top of it.
For 3D data, there are the rasterized forms I just introduced, where you can try to directly borrow convolutional neural networks from the image analysis field, but this raises a number of other challenges, as you will see in the tutorial; and there are the irregular representations, to which you cannot directly apply CNNs, so we have to invent new types of deep learning architectures. That basically finishes my overview of the 3D deep learning field, and in the rest of the afternoon we are going to listen to our lecturers talk about deep learning on specific 3D representations, proceeding in this order: the first part is deep learning on regular structures; the second part will be deep learning on point clouds and parametric models (we switched the order of the second and third parts to fit our presenters' schedules); and lastly we will give an introduction to deep learning on meshes, which have a graph structure. Thank you very much for coming, and let's welcome our first speaker to introduce deep learning on regular data.

[Evangelos Kalogerakis] Okay, well, thank you Hao for organizing this. My name is Evangelos Kalogerakis, from UMass, and I will first discuss the multi-view representations for 3D shape analysis and synthesis. Regarding view-based representations, the basic idea is to take a geometric representation of a 3D shape, such as a polygon mesh or a point cloud, and convert it to rendered views of this 3D representation; we produce 2D images on which we can apply traditional image-based convolution, or image-based architectures that have been used a lot in image analysis.

Geometric representations in general can have various artifacts. For example, if you take the chair that you see on the right and zoom in at the leg of the chair, you will see that the parts are disconnected. Now if you take this 3D shape and produce rendered images out of it, the images do not reflect this kind of defect; visually it is barely noticeable, so image-based networks will not be affected by this kind of artifact. Geometrically, however, there are severe implications: the geometry becomes non-manifold, the topology changes dramatically, and these can affect the performance of a method that operates purely on geometry. View-based representations try to circumvent this problem by converting these geometric data structures, polygon meshes or point clouds, into images.

Another good thing about this conversion is the following: a polygon mesh is a set of vertices connected with edges that form polygons; every vertex can have a different number of neighbors, every face can have a different number of neighbors, and the tessellation can be quite irregular, so you cannot order the vertices, and for each vertex you cannot create a nice ordering of its neighborhood. Convolution is therefore not trivial to implement on this kind of data structure. By converting everything to images, to a regular grid, you can apply traditional convolution. So the first advantage you get directly from this conversion is that you escape from geometric artifacts.
Representations like polygon meshes commonly have artifacts such as non-manifold geometry, irregular tessellation, and disconnected parts, and the rendered images will not be affected, or will only be minimally affected, by this kind of tessellation artifact. Another big advantage of view-based representations is that you can leverage existing image-based networks: you can take your favorite network that was trained on massive image datasets like ImageNet and apply it to 3D shape analysis; we will see later that this actually works. This means you can take the best of both worlds, combining images and 3D shapes, and for deep learning, the more data the better.

I would also like to mention that view-based representations are somewhat motivated by human vision. I don't want to say they are inspired by human vision, because nobody knows exactly how human vision works, but essentially we can recognize that a chair is a chair just by looking at it from a particular view: we analyze what we see, we don't see the interior of the shape, and we can infer that the shape is a chair just by looking at its surface from a particular view; if we want, we can take several different views and infer that we have a chair. View-based representations work somewhat like this: you take multiple views of the 3D shape, combine and aggregate information from the different views, and you can recognize that you have a chair. So it is somewhat motivated by, let's say, human vision, though it doesn't necessarily work like human vision.

Another quick remark is that 3D models, like the ones you have in computer graphics, are empty inside. If you take, for example, the model on the left and check what happens inside, you will see that the interior is empty, which means that maybe you don't want to spend representational power on the shape's interior. View-based representations analyze rendered views of the surface, so they don't spend representational power on the interior; they focus all of it on the surface.

Let me proceed by explaining the multi-view CNN architecture for classification; this is essentially the view-based CNN 1.0. The goal is classification: given an input geometric representation, like the chair you see here, we want to recognize that it is actually a chair; from all possible candidate classes, we want to infer which is the most likely class given the input geometric representation. The idea is to render the input shape from multiple views. In the first version of the view-based CNN, we placed cameras around the 3D shape: if the shape was upright oriented, we placed 12 cameras looking at the center of the shape, elevated slightly from the ground and looking down towards the shape. The rendering was done by passing the polygon mesh through the GPU, which is capable of very fast rasterization, finding all the pixels that are painted by the polygon mesh, and producing rendered images with the kind of shading effects you can see on the right. This shading can be produced by taking the surface normal and computing its dot product with the view vector.
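The shading just described amounts to a per-pixel dot product between the surface normal and the view direction. A minimal NumPy sketch, assuming a normal map has already been rasterized for the view (array and function names are illustrative, not from the paper):

```python
import numpy as np

def shade_from_normals(normal_map, view_dir):
    """Shade a rendered view by the dot product of surface normals and the view vector.

    normal_map : (H, W, 3) array of unit surface normals per pixel (background = 0).
    view_dir   : (3,) unit vector pointing from the surface toward the camera.
    Returns an (H, W) grayscale shading image in [0, 1].
    """
    view_dir = view_dir / np.linalg.norm(view_dir)
    shading = np.einsum('hwc,c->hw', normal_map, view_dir)  # cosine between normal and view
    return np.clip(shading, 0.0, 1.0)                       # keep only front-facing response

# Example: a flat patch facing the camera renders at full intensity.
normals = np.zeros((4, 4, 3)); normals[..., 2] = 1.0
print(shade_from_normals(normals, np.array([0.0, 0.0, 1.0])))
```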
So each rendered image is an encoding of the surface normal relative to the view direction. What can we do with these rendered images? You can design an image-based architecture that processes them and outputs a single descriptor. You want a single compact descriptor from all the images rather than a different descriptor per view, because when comparing shapes that would entail comparing all view descriptors of one shape against all descriptors of the other, which means a quadratic explosion. So you want to produce, out of all these views, a single nice compact representation.

Let me explain what happens inside this view-based architecture. You take the first rendered view and pass it through your favorite image-based CNN; this could be AlexNet, it could be ResNet, or VGG-M, as we used in the first implementation of this view-based network. You get feature maps; for example, at conv5, the fifth convolutional layer, VGG-M produces 256 feature maps of size 4x4 for the first view. The second view also passes through the image-based architecture and produces another set of 256 feature maps of size 4x4, and the same for the third view, the fourth view, and so on.

The trick is to aggregate these view representations into a single descriptor. To do this we introduce a special layer, the view pooling layer, which works as follows. Say the first feature map, the first filter in conv5, encodes the "chair-ness" of the image; it is not necessarily chair-ness, it could really be anything, but we have seen that when image-based architectures are trained they become sensitive to particular parts or objects in the input image, like dog faces, cars, chair legs, and so on. Say the first view produces, from this first filter, an activation of 8.0; the second view gives 1.0; the third view gives 0.0, maybe the chair is not well recognized in that view; you go over all the views and take the maximum of these activations across the views, so you get 8.0. You essentially use the view that is most informative about a particular kind of feature, and you do the same for the second feature map, the third feature map, and so on: an element-wise max operation across the multiple views. This is very easy to implement; for example, in Caffe there is an element-wise layer that does this element-wise aggregation.

Now that you have the aggregated representation, 256 maps of size 4x4, you can process it through the rest of the layers of your favorite image-based architecture, like fc6, fc7, and so on, to output a descriptor, and this descriptor can be mapped to probabilities; for example, a linear classifier can decide that this is a chair with a certain probability.
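The view-pooling step is just an element-wise max over the per-view feature maps. A minimal sketch, assuming the per-view conv5 features have already been computed and stacked (shapes follow the VGG-M example in the talk; names are illustrative):

```python
import numpy as np

def view_pool(view_features):
    """Element-wise max across views.

    view_features : (V, C, H, W) array, e.g. V=12 views, C=256 maps of size 4x4.
    Returns a single (C, H, W) aggregated feature map.
    """
    return view_features.max(axis=0)

feats = np.random.rand(12, 256, 4, 4)   # stand-in for conv5 outputs of 12 rendered views
pooled = view_pool(feats)               # (256, 4, 4), fed to fc6/fc7 and the classifier
print(pooled.shape)
```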
A key point is that you don't take all these filters and layers and train them from scratch. While optimization in neural networks is kind of hard and you can have many local minima, the more data you have the better. If you pass rendered images of shapes through image-based architectures pretrained on ImageNet, they still recognize your 3D shape; they can still output with high confidence that this is a chair. So why not use pretrained networks? They are trained on massive image datasets, and with more data it is more likely that you converge to a better local minimum. The idea is to take your network, VGG-M or AlexNet, pretrained on ImageNet, and then fine-tune it: change the parameters a bit, and it is crucial to change the parameters of the last layer especially, so that it performs well for shape classification. It is essentially a form of transfer learning that works well in practice.

In the very first implementation we used the VGG-M architecture. All the layers, both the part on the left of the view pooling layer and the part on the right, including the FC layers, were initialized from an architecture pretrained on ImageNet, and all we did was fine-tune it for 3D shape classification. Backpropagation can be implemented because the view pooling layer is differentiable: you can push the gradients all the way from the last layer, back through the FC layers, to the image-based branches on the left. An important detail is that all the branches on the left, the image-based branches with the first five convolutional layers, share parameters; they share the same filters. The reason is that a 3D shape can be oriented in different ways, so the views are unordered: the first view of one chair might look at it from one side, but another chair might be oriented differently and its first view would show something else. Since the views do not come in any particular order, all the branches should share the same parameters.

Again, we get a descriptor, and we can perform classification from the 4096-dimensional descriptor of the last fully connected layer, using a linear classifier. For retrieval, you can take this 4096-dimensional descriptor, call it x, and pass it through a linear transformation layer, W·x, to reduce its dimensionality to 128; that is what we did for retrieval. We trained this matrix W, which performs the dimensionality reduction, so that shapes from the same class have similar descriptors and shapes from different classes have dissimilar descriptors; in this way you can achieve better retrieval performance.

When we initially implemented this version 1.0 architecture, we saw an interesting thing: even without fine-tuning, an image-based CNN already performed much better than the first version of volumetric networks, both in classification accuracy and in retrieval. If you do fine-tuning on top of this, you get slightly better classification performance and especially much better retrieval performance.
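A hedged PyTorch-style sketch of the overall pipeline: one pretrained CNN trunk shared by all view branches, element-wise max view pooling, and a new classification head that is fine-tuned for shapes. VGG-M is not in torchvision, so vgg11 stands in for it here; the layer sizes and the 40-class output are illustrative, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiViewCNN(nn.Module):
    """Shared-trunk multi-view CNN: per-view features -> max view pooling -> classifier."""
    def __init__(self, num_classes=40):
        super().__init__()
        backbone = models.vgg11(weights="IMAGENET1K_V1")   # any ImageNet-pretrained CNN
        self.trunk = backbone.features                     # conv layers, shared by all views
        self.head = nn.Sequential(                         # fc layers, fine-tuned for shapes
            nn.Flatten(), nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes))

    def forward(self, views):                              # views: (B, V, 3, 224, 224)
        b, v = views.shape[:2]
        feats = self.trunk(views.flatten(0, 1))            # same filters applied to every view
        feats = feats.view(b, v, *feats.shape[1:])
        pooled = feats.max(dim=1).values                   # view pooling: element-wise max over V
        return self.head(pooled)

model = MultiViewCNN()
logits = model(torch.randn(2, 12, 3, 224, 224))            # 2 shapes, 12 rendered views each
print(logits.shape)                                        # torch.Size([2, 40])
```

In practice the trunk is initialized from ImageNet weights and only lightly fine-tuned, while the head (especially the last layer) is what changes most during training.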
The improvement is visible in the numbers shown in red on the slide. So it was working pretty well, and we were pretty excited about view-based representations. From this perspective, if you check which pixels are most important for determining the output probability of the ground-truth class, which can be done simply by examining the derivative of the output score for the ground-truth class with respect to the input images, you will see that the most important pixels lie on the silhouette or near internal feature curves: suggestive contours, corners, ridges and valleys of the 3D shape. This agrees with observations from the non-photorealistic rendering community, which say that internal contours like ridges, valleys, and suggestive contours are important for the perception of shape. So by examining the gradients you get a very fast approximation of these kinds of curves.

Then the people from Stanford showed that if you render your 3D shape with sphere renderings at multiple resolutions you can gain a bit more performance; you can also use AlexNet, and this provides a couple of percent more accuracy on ModelNet40. By then we basically had state-of-the-art performance on 3D shape classification, and while during recent years the gap between volumetric methods and view-based methods has closed, we still don't know which one is the best method, because everybody employs different architectural tricks and it is not clear what performs best; but essentially the view-based methods already offer very high performance.

Now, one problem with the architecture so far is the viewpoint selection procedure: before, it used fixed, pre-specified cameras. You can do something smarter to avoid losing surface information: you want to capture the whole surface so that you don't miss any important regions. Let me explain the view-based architecture 1.1 in the context of segmentation. Here the goal is to segment the 3D shape into labeled parts, and we have human annotations of labeled parts in existing segmentation datasets. We designed an architecture that takes as input a geometric representation of a 3D shape, renders it from multiple views, and outputs segmentation labels: a discrete label for every polygon of the mesh, which amounts to a segmentation and labeling of the input surface.

For viewpoint selection, we take the input shape, shown here in blue, and, for a particular distance from the surface, we first select cameras that maximally cover the surface area: we place the first viewpoint so that it covers most of the surface, mark the covered surface, then find the second viewpoint that covers most of the so-far-uncovered region, and we repeat this until the selected viewpoints cover 99.9% of the surface. We do this at different camera distances, placing cameras farther and farther away and following exactly the same greedy procedure, so that we capture the surface at multiple scales. That is one important difference with respect to the previous architecture: we do a kind of adaptive view selection.
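A sketch of the greedy coverage-based viewpoint selection just described, under the simplifying assumptions that we already know, for each candidate camera, which surface triangles it sees (e.g. from a visibility/rasterization pass) and that triangle counts stand in for surface area; names are illustrative:

```python
def select_views(visible_triangles, all_triangles, target_coverage=0.999):
    """Greedily pick viewpoints until ~99.9% of the surface triangles are covered.

    visible_triangles : dict mapping candidate camera id -> set of triangle ids it sees.
    all_triangles     : set of all triangle ids on the mesh.
    """
    uncovered, chosen = set(all_triangles), []
    while len(uncovered) > (1.0 - target_coverage) * len(all_triangles):
        # pick the camera that covers the largest so-far-uncovered region
        best = max(visible_triangles, key=lambda cam: len(visible_triangles[cam] & uncovered))
        gain = visible_triangles[best] & uncovered
        if not gain:            # remaining surface is invisible from every candidate
            break
        chosen.append(best)
        uncovered -= gain
    return chosen

# The same procedure is repeated with cameras placed at larger distances,
# so that the surface is captured at multiple scales.
```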
The second difference is that we render both shaded images, encoding surface normals relative to the view vector, and depth images, encoding surface positions relative to the camera. These pairs of shaded and depth images are concatenated into two-channel images and passed through fully convolutional networks (FCNs). The FCNs output per-view confidence maps: one confidence map per part label, per view. For example, here on the top, the first branch produces the confidences for the label "wings": the redder the area, the higher the confidence that this particular FCN branch has for detecting wings in this particular view. We do the same for the second view, the third view, and so on, so all these branches produce confidence maps for each part label.

Now, these confidences live in image space, and we have to aggregate them onto the surface. We implemented a projection layer, inspired by the view pooling layer of the previous architecture: for every triangle on the surface we find all the pixels that were rasterized, painted in other words, by that triangle, we access their confidence values, and we take the maximum of these confidences. In other words, if in a particular view a wing is detected with very high confidence, we use that confidence and project it onto the surface. We do the same procedure for each part label: we aggregate all these image-based confidences and project them onto the surface, obtaining a surface signal which is essentially a confidence per part label.
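A minimal sketch of this projection (image-to-surface pooling) step, assuming we already know which pixels each triangle rasterizes to in each view; variable names are illustrative, not from the paper's implementation:

```python
import numpy as np

def project_confidences(confidence_maps, pixels_per_triangle, num_triangles, num_labels):
    """Aggregate per-view, per-pixel confidences onto mesh triangles by max pooling.

    confidence_maps     : (V, L, H, W) per-view confidence maps, one channel per part label.
    pixels_per_triangle : dict (view, triangle) -> list of (row, col) pixels it rasterized to.
    Returns a (num_triangles, L) array of surface confidences.
    """
    surface = np.zeros((num_triangles, num_labels))
    for (view, tri), pixels in pixels_per_triangle.items():
        for (r, c) in pixels:
            surface[tri] = np.maximum(surface[tri], confidence_maps[view, :, r, c])
    return surface   # used as unary terms of the surface CRF described next
```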
The last and another important difference with respect to the previous architecture is that the confidences obtained through this projection might not be perfect: there might be, say, 0.1% of the surface that was not viewed by any viewpoint, so we need to propagate confidences to these areas and produce a coherent labeling throughout the surface. So we have a surface-based module to correct things and propagate confidences to regions that were invisible: a Conditional Random Field (CRF) whose unary terms are based on the confidences coming from the projection layer, and whose pairwise terms favor similar labels and confidences for neighboring faces with similar normals, and different labels otherwise. It essentially works by diffusing the confidences through inference in this probabilistic model, correcting inconsistencies introduced by the view-based part. The whole architecture, including the FCN branches, which share parameters as before, and the CRF parameters, is trained jointly: the gradients can be derived analytically, all layers are differentiable, and we backpropagate from the CRF to the different branches and train them together. So what we have is a marriage between the view-based model and the surface-based model, to get the best of both worlds. Any FCN module can be used here; we used the first FCN architecture of Long et al., but in general many architectures are possible. For example, one could use a more recent architecture such as U-Net, composed of an encoder that takes the input image and encodes it into a compact representation, a small feature map, and a decoder that upsamples it back to confidence maps for each label; the trick is that the encoder representations are concatenated with the decoder ones, which gives very good performance in image translation tasks. Of course, the more advanced the FCN module, the better; I think U-Net would be a better alternative to what we implemented.

If you look at the labeling created just by the FCNs, by projecting the confidences onto the surface without the CRF, you will see some inconsistencies: some blue areas on the frame of the motorbike, some tiny areas that were not visible from any view, that roughly 1% of the surface. These are corrected by inference on the CRF, the mean-field inference and the diffusion it performs, producing a much more coherent and correct labeling that agrees with the ground truth much more.

So what you see here is version 1.1. To summarize: viewpoint selection that is adaptive per shape, to minimize surface information loss, and the merging of the FCNs, the image-based networks, with a surface-based model, which offers good labeling accuracy. Right now we don't know which method is the best, because everybody is using their own splits, so it is not very clear.

There are challenges with view-based networks. The FCN branches only process visible points; occluded regions need to be treated by the surface model. View-based representations contain redundancy: two nearby views process largely the same area of the surface, so the surface is processed multiple times by nearby viewpoints; this redundancy might actually help when you have noise, but it is not free. These image-based networks are huge: they were developed for classification on ImageNet, for one thousand classes, and many filters will be permanently inactive because they were designed for classes that do not exist in ShapeNet; the number of parameters is huge, which is why they are very slow to train. I believe one could try to distill or compress these networks for the case of shape analysis, since many filters are not particularly useful. Also, when aggregating view-based representations via max view pooling, some information might be lost; we experimented with other options, such as aggregating surface information across multiple views through RNNs, LSTMs and gated recurrent units, but we did not see an improvement compared to max pooling; max pooling seems to be quite powerful.

Very briefly, I will mention that this view-based architecture can also be used for finding correspondences between 3D shapes. Given a point, you can produce a descriptor through a view-based architecture: you take the point, find views that observe the neighborhood of this point (the point must not be occluded in the selected views), take the images from this viewpoint configuration, pass them through your favorite image-based architecture, like AlexNet, and pool out a descriptor. You then train the architecture so that points that are in correspondence have similar descriptors, and otherwise dissimilar descriptors; for this you use the so-called contrastive loss.
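A standard formulation of the contrastive loss on point descriptors, as a sketch (the margin value and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(desc_a, desc_b, same, margin=1.0):
    """desc_a, desc_b : (B, D) descriptors of two points (from the view-based network).
    same            : (B,) 1 if the two points are in correspondence, 0 otherwise.
    Corresponding points are pulled together; others are pushed beyond the margin."""
    d = F.pairwise_distance(desc_a, desc_b)
    loss_pos = same * d.pow(2)
    loss_neg = (1 - same) * F.relu(margin - d).pow(2)
    return (loss_pos + loss_neg).mean()
```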
A good thing here is that, from human annotations of parts, you can perform a non-rigid alignment procedure to find correspondences and use them for training this architecture; so you get training correspondences essentially for free, or almost for free, given the human part annotations. We saw that even though the method is trained on synthetic correspondences, it generalizes to correspondences between 3D models and scans. Here one column shows 3D models on the left, and the next column shows scans with noise and missing regions, and the same for the second and third columns: similar colors correspond to points with similar descriptors. So you can build matches between scans and 3D meshes, and on benchmarks of man-made shapes we saw much better performance than the volumetric alternatives. That is what I wanted to mention, very briefly, about correspondences.

Finally, if you try to squeeze every bit of juice out of view-based CNNs, you can also use them for synthesis. Here you see two input sketches, two line drawings drawn by humans, and the goal is to get a 3D model out of these sketches; it is very common in graphics to prototype with sketches, and you want a 3D model out of them. We had an encoder that takes the input sketches and encodes them into a feature representation, and then decodes them into normal and depth maps for different viewpoints: one decoder branch outputs the depth and normal maps for a particular viewpoint, then a second branch, a third branch, up to twelve branches, each producing depth and normal maps. We also had a GAN discriminator that tells you during training whether the outputs are fake or real; this is common in generative models, a conditional GAN kind of architecture, which we adapted to the view-based setting. For training data, you can produce synthetic line drawings using standard non-photorealistic rendering algorithms, and in this way get pairs of synthetic line drawings with depth and normal maps from different views.

The thing is that during training, if you have a 3D model, the depth and normal maps are consistent because they correspond to a ground-truth 3D model. During testing, however, there might not be such consistency; there might be discrepancies between the different predicted depth maps. So we had an optimization-based fusion strategy. Briefly, you take the predicted depth maps and normal maps and you want to ensure that they agree, so you solve an optimization problem where the depths are the unknowns: you want the optimized depths to be as close as possible to the predicted depth maps, and at the same time you want the derivatives of the depths, which approximate the tangent directions of the surface, to be as orthogonal as possible to the predicted normals. In this way you fuse information from both the depths and the normals, which we found useful. Another thing, which is even more important, is cross-view consistency: you take the depth of a pixel in one view, get the 3D point by inverting the orthographic projection used for that view, then project this 3D point into another viewpoint, and you want the resulting depth of that 3D point in the other view to agree with the depth the network predicted for that view.
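A simplified NumPy sketch of that cross-view consistency check under the orthographic-camera assumption mentioned in the talk: lift each predicted depth pixel to a 3D point, transform it into another view's frame, and compare its depth there with that view's prediction. The camera conventions and pixel scaling here are assumptions for illustration only:

```python
import numpy as np

def lift_depth(depth, pixel_size=1.0):
    """Invert an orthographic projection: pixel (u, v) with depth d -> 3D point in camera frame."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    return np.stack([u * pixel_size, v * pixel_size, depth], axis=-1).reshape(-1, 3)

def reprojection_residual(depth_a, depth_b, R_ab, t_ab, pixel_size=1.0):
    """Residual between view B's predicted depth and the depth of view A's points seen from B."""
    pts_a = lift_depth(depth_a, pixel_size)
    pts_in_b = pts_a @ R_ab.T + t_ab                        # rigid transform: frame A -> frame B
    u = np.clip((pts_in_b[:, 0] / pixel_size).round().astype(int), 0, depth_b.shape[1] - 1)
    v = np.clip((pts_in_b[:, 1] / pixel_size).round().astype(int), 0, depth_b.shape[0] - 1)
    return pts_in_b[:, 2] - depth_b[v, u]                   # ~0 everywhere for consistent maps
```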
So you have a problem of fusing all the depth maps and normal maps together in one optimization, as a strategy to produce a consistent point cloud. You then perform surface reconstruction through your favorite point-cloud-to-mesh conversion technique, and optionally you can fine-tune the surface: you take the polygon mesh and deform it so that it matches the input contours more accurately. This is done because generative models for images and surfaces currently tend to lose details: you show such an image or 3D shape to someone from the vision community and they are very happy; you show the same image or mesh to someone from the graphics community and they say, "oh, this is horrible, you lose all the details." So, to also satisfy the graphics community, we have this fine-tuning procedure that deforms the surface so that it agrees with the input contours. In general, I think generative models for 3D surfaces based on deep learning are a promising direction for generating details compared to alternatives like volumetric methods.

Here you see a reference shape and the line drawings of this 3D shape; you want reconstructions that mimic the reference shape as much as possible. We saw that our method is much better than volumetric reconstruction baselines, and better than going to a database and performing nearest-neighbor retrieval, especially in terms of preserving regular structures and the topology of the 3D shape: on the right you can see it recovers the structure on the back of the chair, while other methods don't. The same holds for characters: it can recover somewhat better detail compared to the alternative methods. It works from a single sketch or from multiple sketches; of course, providing multiple sketches is a bit better because you have more information: if you draw from the front and the side, the fusion uses the information from both sketches.

A disadvantage of this pipeline right now is that the fusion, this optimization, is outside the network, so it is not end-to-end like the previous ones I discussed; this is an open problem. Another issue is that with view-based representations you cannot see through the surface: as I said, we do not model the interior, which might be seen as an advantage or a disadvantage, especially if you have some volumetric information inside, in other applications. Finally, as I said, there is some redundancy in view-based representations: different views might capture the same surface information, and the predictions might not necessarily be consistent, so this is something to think about with respect to view-based representations. Next is the volumetric representation; I didn't have a timer here, so I may have eaten into my colleague's time. Let me hand over to the next speaker.
[Jimei Yang] Hello everyone, my name is Jimei Yang, from Adobe Research. In the following I am going to review deep learning on the volumetric representation. There has been so much work recently on this topic that I cannot give a comprehensive overview of all the literature, so let's focus on some key ideas of how we can do deep learning on volumetric representations.

There are many kinds of volumetric data in the world, especially from scientific domains: for example, MRI and CT scans from medical imaging, manufacturing data for shape analysis, and scanned data of objects. Volumetric data is a very common way of digitizing the 3D world. We know that a 3D volume is essentially a straightforward extension of 2D image data, so given the successful applications of convolutional networks on image data, a natural way to process 3D volumetric data is to extend the 2D convolutional network to a 3D convolutional network. The basic concept is 3D convolution using 4D kernels. Look at this example: given an input volume of 30x30x30, we apply 48 kernels to get downsampled features, where each feature is just another, smaller volume. We can apply such 4D kernels repeatedly to extract higher-level features from the 3D volumetric data, so that we can perform 3D shape analysis and other tasks.

This explains why 3D convolution was adopted for shape analysis in the early days, for example for 3D shape classification on databases like ModelNet40 and ModelNet10. However, due to the voxelization step, where people convert a CAD model into a voxelized occupancy grid, a lot of detailed information is lost, so those voxel-based 3D convolution networks could not achieve results comparable to 2D convolution on rendered views for 3D shape analysis, as was just introduced by Evangelos.

On the other hand, people also use 3D convolution to generate volumetric data. The core operation is 3D deconvolution (transposed convolution), again using 4D kernels. In this example, given a vector z, we first convert it to a 4D volume with 512 channels, where each feature volume is 4x4x4; applying 4D kernels with 3D deconvolution, we get a feature volume with 256 channels at a higher spatial resolution; repeating this, we can generate, say, a 64x64x64 volume, and in this way we can generate 3D shapes. Recent work from MIT, published at NIPS 2016, uses both a 3D deconvolution network and a 3D convolution network to generate 3D shapes in an unsupervised way: there is a pair of 3D convolutional networks, one of which, the generator, maps a latent code z to a 3D shape, and the other, the discriminator, tells whether a generated shape is a real 3D shape or not. This is trained in the famous GAN framework, generative adversarial networks. The results are really amazing: the latent vector captures high-level properties such as the orientation of 3D shapes and their styles, so you can perform arithmetic operations in latent space and generate new styles of 3D shapes.
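A hedged PyTorch sketch of the volumetric generator pattern just described: a latent vector is reshaped into a small feature volume and repeatedly upsampled with 3D transposed convolutions up to a 64x64x64 occupancy grid. The channel progression follows the example in the talk; the latent dimension and kernel settings are illustrative.

```python
import torch
import torch.nn as nn

class VoxelGenerator(nn.Module):
    """z (latent vector) -> 64x64x64 occupancy volume via 3D transposed convolutions."""
    def __init__(self, z_dim=200):
        super().__init__()
        self.fc = nn.Linear(z_dim, 512 * 4 * 4 * 4)          # 512 feature volumes of size 4^3
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(512, 256, 4, stride=2, padding=1), nn.ReLU(),  # -> 8^3
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # -> 16^3
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # -> 32^3
            nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1), nn.Sigmoid())  # -> 64^3 occupancy

    def forward(self, z):
        x = self.fc(z).view(-1, 512, 4, 4, 4)
        return self.deconv(x)

vol = VoxelGenerator()(torch.randn(1, 200))
print(vol.shape)   # torch.Size([1, 1, 64, 64, 64]); a discriminator mirrors this with Conv3d layers
```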
Now I will talk about an important application of volumetric generation: reconstructing 3D shapes from a single image. Previous work focused on depth prediction, but depth is essentially a 2.5D representation, not a full 3D representation of the entire scene. On the other hand, model-based methods cannot easily generalize across different categories of objects. So let's see how we can use the volumetric representation and 3D convolution to reconstruct 3D shapes from a single image.

This is an early attempt from Stanford. Given the image, a 2D convolutional network encodes it into latent features, and then we map these latent features back into 3D space using a 3D deconvolution network, as shown here. They also generalize this single-image 3D reconstruction to multiple views, progressively absorbing more information from other viewpoints of the 3D shape. Looking into the network, the image encoder is very simple: given an image x, a couple of convolution and downsampling operations learn its latent features, a vector. The essential concept in their paper is a 3D convolutional LSTM, a recurrent network with memory cells: the image features from one viewpoint are passed in and memorized in this recurrent network, and when a new viewpoint of the 3D shape is observed, the information from all the views is combined to generate the 3D shape with better and better detail. The memorized features from the recurrent network are then fed into the 3D deconvolution network to generate the 3D shape, the operation defined on the previous slide.

They train the network in a supervised way with ground-truth 3D volumes, essentially voxelized 3D shapes from CAD models. They define a voxel-wise cross-entropy loss: in this equation, p indexed by ijk is the predicted likelihood that a particular voxel is occupied, and y is the binary label telling whether, in the ground truth, that voxel is occupied or not; the network is trained with this loss (written out below). The data comes from ShapeNet: tens of thousands of models, voxelized into 32x32x32 occupancy grids, with images rendered from those same shapes, which gives ground-truth correspondences for training.

Now let's look at some visual results. The first row shows input images and the second row shows the generated 3D shapes. This is pretty impressive: given two different views of the same chair, the network generates fairly well-aligned and similar shapes. Notice this chair and this table: from this viewpoint the network generates a really good 3D shape with four legs, very nice, because this viewpoint conveys a lot of detailed 3D information. At the same time, from this other viewpoint the chair is quite ambiguous: we cannot tell how many legs the table has, so the generated result clearly misses a leg. Unfortunately, this is a pretty common issue, an ambiguity you inevitably face in single-view reconstruction. That was the early attempt.
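The voxel-wise cross-entropy loss described above can be written out explicitly. With $p_{ijk}$ the predicted occupancy probability of voxel $(i,j,k)$ and $y_{ijk}\in\{0,1\}$ its ground-truth occupancy, summed over all voxels of the grid:

```latex
\mathcal{L} \;=\; -\sum_{i,j,k} \Big[\, y_{ijk}\,\log p_{ijk} \;+\; \big(1 - y_{ijk}\big)\,\log\!\big(1 - p_{ijk}\big) \Big]
```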
Recently there is another line of work that tries to combine the volumetric representation with 2D observations and viewpoints. They are concerned with a practical scenario where there is no ground-truth 3D data available: how can we train a network to generate 3D shapes from a single image? The core concept in this line of work is to implement the 3D-to-2D projection, the projection from a 3D shape to 2D given a camera. Let's look at the details. The essential concept is the ray: it starts from the camera center, passes through a pixel on the image plane, reaches the 3D volume, and passes through it, so there is an intersection between the ray and the 3D volume we want to generate. To get the correspondence between the 2D map (the image) and the 3D volume, for every pixel of the 2D observation we need to find the segment of the ray that intersects the volume. The way it is implemented is that for every pixel of the 2D observation we sample points along the ray, with disparity values in a given range, generating pseudo 3D points; given the perspective transformation, we apply the inverse transformation to map these sampled points into the coordinates of the target volume, so that we can perform trilinear sampling on the target volume and then reduce the sampled values along each ray to produce a 2D mask. This is how a 2D map is generated from the 3D volume; it implements exactly the perspective projection from 3D to 2D given a camera.

Given this perspective transformer layer, we can define losses such that we do not need ground-truth 3D volumes as supervision to learn 3D reconstruction from a single view. The network details are the same as before: an image encoder maps the image to a latent vector; given the latent vector, a 3D deconvolution network generates the 3D shape; and the key component is the perspective transformer network that projects the generated 3D volume to a 2D mask, with a 2D mask loss applied to these projections. The overall loss can be defined as a combination of the ground-truth volume loss (supposing we have a ground-truth volume) and the projection mask loss. This network is also trained on ShapeNet, on voxelized CAD models.

Let's see some visual results. Given an input image, here are the ground-truth 3D shapes visualized from different viewpoints; these are results trained only with the view (projection mask) loss, these with the combination loss, and these with the supervised voxel loss alone. You can see that with only views as supervision, the generated shapes are pretty close to those from the supervised voxel loss. Another thing to notice is that sometimes the view-based training generates even better results than the ground-truth voxel supervision; see the small hole here. The intuition is that all the viewpoints of the 2D observations essentially define a visual hull constraining the possible 3D shapes, and the network tends to generate the most solid 3D shape consistent with it, instead of one with holes.
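A simplified sketch of the projection idea: under an orthographic camera aligned with one axis of the voxel grid, the predicted occupancies can be reduced along the viewing rays (a max over depth) to obtain a differentiable 2D silhouette that a mask loss can be applied to. The actual perspective transformer layer samples the volume trilinearly along perspective rays; this is only the intuition, and the names are illustrative:

```python
import torch
import torch.nn.functional as F

def project_silhouette(occupancy):
    """occupancy : (B, D, H, W) voxel occupancy probabilities; rays assumed parallel to the D axis.
    A pixel is covered if any voxel along its ray is occupied; max keeps this differentiable."""
    return occupancy.max(dim=1).values            # (B, H, W) soft mask

def mask_loss(occupancy, gt_mask):
    """Binary cross-entropy between the projected silhouette and the observed 2D mask."""
    return F.binary_cross_entropy(project_silhouette(occupancy), gt_mask)
```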
But there is always a question here: given all the 2D observations and the known cameras, why not just reconstruct the 3D shapes first and then use the reconstructed 3D shapes as supervision to train the network? The following experiment answers this question. We designed two scenarios: in one, for every object the network is only allowed to observe a narrow range of views, for example covering only 90 degrees; in the other, for every object the network is only allowed to observe sparsely sampled viewpoints, so that it is not easy to first reconstruct a 3D shape for each object. Surprisingly, you can see from this row that view-based training with partial views generates results fairly similar to full-view training; the narrow-view setting is slightly worse, for example here, but it is still pretty close to the full views covering 360 degrees. The main intuition behind this is that the encoder-decoder model condenses and shares the partial view information across different shapes during training.

There is concurrent work at this CVPR, an oral paper, that implements the 3D-to-2D projection in a different way: they work with the rays directly instead of the mask, and they call it differentiable ray consistency. Given a pair of a 2D observation and a camera, for any ray emitted from the camera center they treat all the voxels of the target volume along that ray jointly. Once they find these voxels, they define a new concept called the ray termination probability, which builds the relationship between the pixel and all the voxels along the ray. A little detail here to clarify the concept; the intuitive explanation is this: suppose we have a ray; if the ray passes through the 3D volume without stopping, that means all the voxels along the ray are likely to be empty; on the other hand, if the ray stops at a certain voxel, that voxel is likely to be occupied and the voxels before it along the ray are likely to be empty. This is how they implement the concept of differentiable ray consistency across different viewpoints. Once they have this ray-based relationship between pixels and voxels, they can compare it to all kinds of 2D observations, such as foreground masks, depth, color, or even semantic information, and backpropagate to the 3D volume for supervision and training. This multi-view-based supervision for single-view 3D reconstruction essentially connects two worlds: volumetric-representation-based learning and the multi-view-based learning that was just introduced.
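The ray termination idea can be sketched as follows: if the voxels a ray passes through have occupancy probabilities ordered from the camera outward, the probability that the ray stops at a given voxel is the probability that all earlier voxels are empty and that voxel is occupied. A small NumPy sketch of this standard formulation (my reading of the idea, not the paper's exact code):

```python
import numpy as np

def ray_termination_probabilities(occ_along_ray):
    """occ_along_ray : (N,) occupancy probabilities of the voxels a ray passes through,
    ordered from the camera outward. Returns (N+1,) probabilities: entry i is the probability
    the ray terminates at voxel i; the last entry is the probability it escapes the volume."""
    occ = np.asarray(occ_along_ray, dtype=float)
    pass_through = np.cumprod(1.0 - occ)                  # prob. of passing each voxel unblocked
    reach = np.concatenate(([1.0], pass_through[:-1]))    # prob. of reaching voxel i at all
    stop = reach * occ                                    # terminate exactly at voxel i
    return np.concatenate((stop, [pass_through[-1]]))     # plus escaping the whole volume

print(ray_termination_probabilities([0.0, 0.9, 0.5]))     # mostly terminates at the second voxel
```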
You may notice that all of this previous work deals only with very low resolution 3D volumes, for example 32x32x32, which is pretty low. The problem, as you can see here, is that a low-resolution volume loses lots of details and introduces artifacts. Here the same CAD model is voxelized at different resolutions: more and more details are added until the volume represents the exact shape of the CAD model. People noticed this issue, and a key characteristic of volumetric data comes to the rescue: sparsity. As you increase the resolution of a 3D volume, very few voxels are actually occupied, which means that if we apply 3D convolution over the full grid, most of the computation is wasted. So how can we deal with this and scale 3D convolution up to higher-resolution volumes? The key idea is to adopt the octree structure to represent 3D volumes.

What is an octree? It is essentially a recursive data structure that partitions 3D space: every internal node of the octree has exactly eight children, the octants, which evenly partition its cell along the three dimensions. Given this representation, we can skip many of the computations that were wasted before. In this example, a 2D slice of a 3D shape is illustrated: when we apply a regular dense 2D convolution to it, every cell corresponds to one set of multiplications with the kernel, but actually only the colored pixels are meaningful multiplications for processing the data. If we use an octree, here actually a quadtree since it is 2D, and partition the space this way, we save a lot of computation and multiplications; essentially this gives the memory and computational efficiency needed for speeding up, or scaling up, deep learning on 3D volumetric data.

The key operations in octree-based convolution are the convolution itself, of course, and also pooling, similar to regular convolutional networks. A recent paper defines convolution on this octree structure to analyze 3D shapes; I will not go into the details of the representation, since a lot of mathematical detail would be needed to understand it completely, but I want to mention one key concept for implementing octree convolution: the neighborhood. Once we store the 3D data in an octree structure, we basically break the spatial neighborhoods of the cells, but to perform 3D convolution we need the 3D neighborhood structure. How can we recover it? In those papers they build a hash table to index all the neighborhoods when applying 3D convolution at a certain location. That is the key idea of applying octree convolution. These are the results they report on 3D shape classification: as they increase the resolution of the 3D volume, they achieve results comparable to the view-based multi-view convolutional networks introduced in the previous lecture.

Another paper is concerned with generating high-resolution 3D shapes. They also implement the octree structure in the convolutional network, but they introduce a new concept: every voxel is given one of three kinds of labels. One label tells whether the voxel is occupied (filled), another tells whether it is empty, and a third tells whether the voxel needs to be further partitioned, which they call "mixed". Here is an illustration of the concept. The features pass through the network with 3D convolutions; at a certain level of the octree, on the feature map they define a 1x1x1 convolution, which is essentially a three-way classifier that classifies each voxel into the three labels, color-coded here: yellow means the voxel is empty, dark green means filled, i.e., occupied, and the important one, red, means mixed: this voxel needs to be further partitioned at the next level to generate more details of the 3D shape. For the voxels labeled as mixed, their features are passed on to the next level for further processing.
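A hedged sketch of that per-voxel three-way labeling: a 1x1x1 convolution acts as a classifier over {empty, filled, mixed}, and only the "mixed" cells are propagated to the next, finer octree level. The channel count and example sizes are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class OctreeLevelHead(nn.Module):
    """Classify each cell of the current octree level as empty / filled / mixed."""
    def __init__(self, channels=32):
        super().__init__()
        self.classify = nn.Conv3d(channels, 3, kernel_size=1)   # 1x1x1 conv = per-voxel classifier

    def forward(self, feat):                                    # feat: (B, C, D, H, W)
        logits = self.classify(feat)                            # (B, 3, D, H, W)
        labels = logits.argmax(dim=1)                           # 0 empty, 1 filled, 2 mixed
        mixed = labels == 2                                      # only these cells get subdivided
        return logits, mixed                                     # 'mixed' features go to the next level

feat = torch.randn(1, 32, 8, 8, 8)
logits, mixed = OctreeLevelHead()(feat)
print(logits.shape, mixed.sum().item(), "cells to refine at the next level")
```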
At the next level, 3D convolutions are applied again, and this saves a lot of computation because they only need to run on the features of the voxels carrying the "mixed" label. They show experiments on single-image 3D reconstruction: from left to right, the first column is the input image and the following columns increase the resolution of the generated 3D shape. You can see that more and more detail is added as the resolution of this octree generating network increases. One drawback you can also see is that more noise appears on the surface of the generated shapes, but I think that is essentially a limitation of the volumetric representation itself rather than of the network. Overall these are very impressive results. That basically concludes my part of this lecture. We will have a 15-minute break, and after that Charles will introduce deep learning on irregular 3D data. Thank you.

Hello everyone, I'm Charles. In this session we will talk about deep learning techniques for point clouds: both point cloud analysis and how to synthesize or generate point clouds from images or other data formats. We will also mention parametric models: besides generating plain point clouds (just XYZ, possibly with RGB), you can parameterize the points with primitives, for example boxes or other primitives, and use those to reconstruct a 3D shape. The first part is point cloud analysis, based on two of our recent works: PointNet, which we also presented here at CVPR, and PointNet++, a hierarchical version of PointNet (a PointNet v2, if you like) that is currently on arXiv. The big motivation behind this work is that we observe many emerging 3D applications. For autonomous driving you have a lot of lidar data and you want to do perception in the real environment; for augmented reality you want to interact with the world in 3D, so you need to understand the 3D environment, and the data is usually 3D point clouds; and for shape analysis, when you design or deform shapes, you often work with point clouds as well. All these emerging 3D applications mean a very strong need for 3D deep learning. As you have already seen, there are many different 3D representations, and we argue that the point cloud is one of the most important among them. Why? Because it is very close to the raw sensor data, which fits the end-to-end learning idea of deep learning well — you get point clouds directly from lidar or depth sensors — and because it is quite canonical: you can convert a point cloud to other data formats, and convert many other 3D representations to point clouds, very easily. When you deal with point clouds it often feels much easier than dealing with meshes. In our work PointNet we designed an end-to-end learning framework for scattered, unordered point data. This is very different from prior work and seems to be one of the first pioneering works on deep learning directly on point clouds; previously, features were mostly handcrafted for specific tasks.
Ours is a unified framework that can be adapted to various tasks, including object classification, part segmentation, and semantic scene parsing, as shown here: you can classify a point cloud or an instance into semantic categories, you can segment individual objects into meaningful parts such as the legs and top of a table or the wings, engines, and tail of an airplane, and you can also semantically segment a 3D scene — here an indoor office — into regions such as ceilings, walls, chairs, and tables. One obvious challenge in dealing with point clouds is that the data has no regular structure: it is a set of vectors. We can represent the set by an array of N points, each point D-dimensional; in the simplest form just XYZ, possibly with RGB or normals. But this array is special: we can permute its rows and it still represents exactly the same point cloud. So to deal with this data, the model has to be invariant to the N! permutations of the rows of the array. Which functions are invariant to those permutations? Such a function is called a symmetric function: for any permutation pi applied to the inputs x1 through xn, the function value stays exactly the same, because it depends only on the set of points. You are already familiar with many symmetric functions — very simple ones like max, average, summation, or a histogram — but you can imagine that if you apply such a simple symmetric function directly to your point cloud, you lose basically all the information about the shape: you just get the centroid or the maximum coordinates, which does not make much sense for semantic tasks. So what we wondered is how to construct a family of symmetric functions with neural networks that is rich in representation power. The basic idea is function composition: we compose neural networks with a simple symmetric operation. In this construction, if the inner function g is symmetric, the entire function is guaranteed to be symmetric: we apply the same function h to every point, aggregate with the symmetric g, and then apply a final function gamma, so the whole function f remains symmetric. Concretely, given XYZ for each point, we apply the same function h to each point to project it into some high-dimensional embedding space; because this embedding space is redundant and high-dimensional, we can aggregate the embeddings from all the points through a simple symmetric function such as max, obtaining a single embedding for the whole set, and then pass it through another neural network gamma to get the global feature of the point cloud. We call this architecture the vanilla PointNet. A natural follow-up question is what family of symmetric functions can actually be represented by this construction — is it expressive enough to be meaningful for different tasks? We have a very interesting finding: we can prove that this form of function can approximate any continuous set function arbitrarily well, as long as you have enough neurons.
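Here is a minimal sketch of this composition; the layer widths and the class name are illustrative, not the exact published architecture:

```python
# A minimal sketch of the "vanilla PointNet" composition described above:
# a shared per-point MLP h, a symmetric max-pooling aggregation g, and a
# final MLP gamma.
import torch
import torch.nn as nn

class VanillaPointNet(nn.Module):
    def __init__(self, in_dim=3, embed_dim=1024, num_classes=40):
        super().__init__()
        # h: applied identically to every point (weights shared across points).
        self.h = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, embed_dim), nn.ReLU(),
        )
        # gamma: maps the aggregated global feature to class scores.
        self.gamma = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, pts):                        # pts: (B, N, 3), order irrelevant
        per_point = self.h(pts)                    # (B, N, embed_dim)
        global_feat = per_point.max(dim=1).values  # symmetric: max over points
        return self.gamma(global_feat)             # (B, num_classes)
```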
Still, it is very useful to study the theoretical foundations of this line of work. In practice we use a simple multi-layer perceptron as the function h, and another one as gamma; note that the same h is shared, with its weights shared across all points, to make the entire function symmetric, and we use max pooling as the aggregation. We also tried other simple symmetric functions, such as average or weighted average, but empirically max pooling consistently worked best in our experiments. The multi-layer perceptron is basically fully connected layers with ReLUs and batch normalization. On top of this vanilla PointNet architecture we add another module, which we call an input transformer network. The idea is that, given the array of data — n points with XYZ coordinates, which also serve as the features of each point — we predict a transformation from the point cloud itself using a network we call the T-Net. The T-Net takes the point cloud as input and outputs a transformation matrix, and it is itself a vanilla PointNet, so it is also permutation invariant. The predicted transformation is applied to the point cloud, attempting to transform it into some canonical frame so that the model is less affected by rotations of the point cloud. Rotations do not change the class or the part segmentation of an object, and we hope the network will be robust to them, so we add this transformation to roughly align the points. That is the hope; in practice it basically aligns the shapes to a few canonical orientations, differing by something like 90 degrees, but it does help the performance of the network. This is very similar to the spatial transformer network for images, if you know that work, but with one big difference: since we operate on points, the transformation is really simple — it is just a matrix multiplication. You no longer need bilinear or other interpolation, so it is fast and easy. To summarize the PointNet classification network: the input is n by 3 — in the simplest form just n points with XYZ. We first apply an input transform to try to align the point cloud to a canonical frame, then use a shared network on each point to project it into a 64-dimensional embedding, apply a similar transform in feature space, and then again use shared networks to project each point's 64-dimensional embedding to a final 1024-dimensional embedding. Then we use the all-important max pooling to aggregate the information from all points into a single global feature vector for the entire point cloud, and finally a few fully connected layers on the global feature output the classification scores. The entire system, including the transformer networks and the embedding projections, is trained end to end, so all the features are optimized jointly. We were also interested in extending this work to segmentation. A global vector alone cannot be used to segment individual points; to extend the network, we simply concatenate, for each point, its local embedding with the global feature — we take the embedding of that specific point and attach the global feature to it.
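A minimal sketch of that concatenation-based segmentation head follows; the dimensions and class name are illustrative:

```python
# A minimal sketch of the segmentation extension described above: each point's
# local embedding is concatenated with the tiled global feature, and a shared
# per-point MLP predicts a label for every point.
import torch
import torch.nn as nn

class PointSegHead(nn.Module):
    def __init__(self, local_dim=64, global_dim=1024, num_part_classes=50):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(local_dim + global_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_part_classes),
        )

    def forward(self, local_feat, global_feat):
        # local_feat: (B, N, local_dim); global_feat: (B, global_dim)
        B, N, _ = local_feat.shape
        tiled = global_feat.unsqueeze(1).expand(B, N, -1)   # copy to every point
        fused = torch.cat([local_feat, tiled], dim=-1)      # (B, N, local+global)
        return self.mlp(fused)                              # per-point class scores
```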
Each point is then represented by both its own embedding and the global context, and based on that we classify each point into different categories, so segmentation becomes a per-point classification problem. Let's look at some results. Even though this work is one of the first attempts to use deep learning on point clouds, it already achieves very promising results. For example, on the ModelNet40 40-class classification benchmark, which also appeared in previous talks, we achieve 89.3% accuracy, which is on par with or even better than the previous state of the art based on 3D CNNs. We also show results on part segmentation of objects, segmenting them into different semantic parts. On the left are results on partial input: we take 3D models and use virtual scans of the shapes, so we get partial scans with missing parts, and the network still performs well on those. On the right is the segmentation output on complete point clouds; you can see the results are basically very smooth, and numerically it outperformed the previous state of the art at the time the paper was published — we compared both with a method based on traditional features and with a baseline we built on 3D CNNs. We also show results on semantic scene parsing: given an input point cloud, which is a fused scan of indoor offices from a Matterport scanner, we segment the point cloud into different semantic regions, as shown here — the first row is the input, the second row the output — and it does a pretty good job on that as well. Amazingly, this output basically ranked first on this task, which was a big surprise to us too. We also show that PointNet is very robust to data corruption. In point clouds there are usually a lot of missing points: you cannot capture points uniformly, everywhere, at the same density. So we study what happens as data goes missing, from zero percent missing (all points present) toward one hundred percent (no points at all). If you drop most of the points, a volumetric baseline's accuracy falls a lot; but for our PointNet, even if you miss 50 percent of the points compared with the training setting, the accuracy drops by less than two percent. It is very robust to the number of points, even though it never sees these point counts during training, and it is also very robust to outlier points and to perturbations of point positions. We also compared directly with a 3D CNN: in this case both PointNet and the 3D CNN are trained on clean, complete shapes — the 3D CNN on complete voxel grids and PointNet on complete point clouds sampled from the surfaces of 3D meshes — and at test time we randomly drop points (or voxels) and see how the performance changes. As more and more points are dropped, PointNet stays very robust, while the 3D CNN's performance decreases very quickly as valid voxels disappear. Why is PointNet so robust to missing data? We tried to understand this by visualizing what the network has learned. The way we visualize it is to look at the global feature vector, which is the max over the per-point embeddings, and trace back which input points actually contribute to that global feature through the max pooling.
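As a minimal sketch of that trace-back (the function name is made up for illustration):

```python
# Because the global feature is a per-dimension max over point embeddings, the
# points that attain that max ("critical points") are the only ones influencing
# the global feature.
import torch

def critical_point_set(per_point_embeddings: torch.Tensor) -> torch.Tensor:
    """per_point_embeddings: (N, C) embeddings of one point cloud.
    Returns indices of the points that realize the max in at least one channel."""
    argmax_per_channel = per_point_embeddings.argmax(dim=0)   # (C,) winning point per channel
    return torch.unique(argmax_per_channel)                   # indices of critical points

# Points outside this set can be removed without changing the max-pooled
# global feature, hence without changing the classification result.
```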
If a point's embedding is small in every dimension, it has no chance to contribute to the final global vector through the max. So we know exactly which points do contribute, and we keep only those points — we call this the critical point set. To visualize it, we take an original shape represented by a colored point cloud (colored just for better visualization); the critical point set turns out to be a subset of the original points that basically captures the contour and skeleton of the original shape. We can also find which points would not affect the global feature vector at all: we brute-force search over the volume — the metric space — and for each candidate point check whether adding it would change the output. If any channel of its embedding exceeded the corresponding value in the global feature vector, it would affect the global feature; otherwise it cannot. We call the set of all points that do not affect the output the upper-bound set. Any point set that falls between the critical point set and the upper-bound set yields exactly the same global feature vector, and therefore exactly the same classification result and scores. Visualizing this makes the robustness clear: you can add or remove points, but as long as the set stays between these two sets, the result is unchanged. The features also generalize well to unseen categories: here we visualize some shapes whose categories were not included in the training set, and we still get very reasonable critical-point skeletons and very reasonable upper-bound sets, meaning the features learned by the network generalize to unseen categories and shapes as well. So PointNet is a very simple architecture that is very effective on different tasks — a uniform framework for segmentation and classification — and numerically it performs really well on several benchmarks. But comparing it with a 3D CNN reveals an important difference: a 3D CNN has hierarchical features with multiple levels of abstraction, whereas in the PointNet architecture, as the simple illustration shows, feature learning is either very local or fully global — you have either the embedding of a single point or the global vector of all points. Local context is very weak in this architecture: even though we can concatenate the local embedding and the global vector, that is just "very global" plus "very local". We would like something similar to the 3D CNN's multi-level abstraction, and that is the work in PointNet++. This lack of local context actually causes real problems when we try to generalize PointNet to large-scale scene understanding. For example, suppose we want to do semantic segmentation of a table and a cup. When training PointNet we normalize the input to zero mean, so the centroid of the point cloud sits at the origin; but in a scene the table can be translated anywhere in space, and the absolute coordinates of the points then change so much that it is very hard for the network to generalize. The same goes for instance segmentation: given a seed point on the table, we want to know which points belong to the same object as that seed point.
You would hope the prediction covers the entire table, but in practice it spills onto the chair as well, because the chair is at the same height — and that creates artifacts in your segmentation. Why? Because in training you see many different configurations of objects in space — combinatorially, exponentially many — and if you only look at a global vector it is very hard to summarize all of those configurations well. So the combination of having only a single per-point feature plus a very global feature, together with the strong dependence on absolute coordinates, makes it very hard to generalize to unseen configurations at large scale. Our idea in PointNet++, which is PointNet v2, is to use PointNet as a basic module and build a hierarchical learning framework, applying PointNet in a recursive way to learn hierarchical features. The idea is simple. In this 2D illustration, some points are gathered in a local region: we apply a PointNet to that local region to summarize it into a single point (the local region can contain multiple points), so we end up with a smaller point set, and then we repeat this recursively. Once we look at local regions, however, another issue appears: if the region is very small, it may sometimes contain only a very small number of points — in the extreme case a single point — and then the network cannot extract meaningful features from it. So we also have to deal with the problem of non-uniform sampling density if we want a hierarchy that looks at local regions. I will go through these two points one by one. First, hierarchical point set feature learning. We are given an input point cloud — here a 2D example — of N points, each with d-dimensional coordinates (here d = 2); each point can additionally carry a C-dimensional feature vector, for example RGB, intensity, or normals, whatever you have in your data. The first step is to sample the regions we are interested in: we use farthest point sampling to select a subset of points from the original point set roughly uniformly (it is a bit hard to see in the figure). Then we do grouping, searching a neighborhood for each sampled centroid: we use a radius-based ball query, with the query based on the d-dimensional coordinates, to find which points fall within a given radius of the centroid. Then we use a shared mini-PointNet to summarize each local region into a feature vector of higher dimension that represents the geometry and other features of that local neighborhood. As output we have N1 points, strictly fewer than N, still with d-dimensional coordinates for each sub-sampled point, plus a C1-dimensional feature for each local neighborhood. One thing worth mentioning: for each local region we extract features in local coordinates rather than global coordinates, which makes the network look at the local region itself instead of its absolute position in the whole point cloud. We can then apply this sampling and grouping again and again, obtaining multiple levels of abstraction for the point cloud: at each level we have fewer points, but each point captures a larger local neighborhood.
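Here is a minimal, unoptimized sketch of that sampling-and-grouping step; the function names are made up, and a shared mini-PointNet would then consume each group:

```python
# Farthest point sampling plus ball-query grouping in local coordinates,
# assuming small point clouds (no batching or spatial acceleration).
import torch

def farthest_point_sample(xyz: torch.Tensor, k: int) -> torch.Tensor:
    """xyz: (N, 3). Returns indices of k roughly uniformly spread points."""
    N = xyz.shape[0]
    chosen = torch.zeros(k, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    farthest = 0                                     # start from an arbitrary point
    for i in range(k):
        chosen[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=1)  # distance to newest centroid
        dist = torch.minimum(dist, d)                # distance to nearest chosen point
        farthest = int(dist.argmax())                # next centroid = farthest point
    return chosen

def ball_query_group(xyz: torch.Tensor, centroids: torch.Tensor, radius: float, max_pts: int):
    """Group up to max_pts neighbors within `radius` of each centroid,
    expressed in local coordinates (neighbor minus centroid)."""
    groups = []
    for c in centroids:
        d = ((xyz - xyz[c]) ** 2).sum(dim=1).sqrt()
        idx = torch.nonzero(d < radius).flatten()[:max_pts]
        groups.append(xyz[idx] - xyz[c])             # local frame, centered at centroid
    return groups                                    # each group feeds a shared mini-PointNet
```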
Comparing this PointNet layer with a convolutional layer: the input to the PointNet layer is a small, unordered set of points in a local region, and the neighborhood search is a ball query based on metric distance; for convolution, the input is a dense array of pixels or voxels, it is indexed and ordered (you have a well-defined order for your kernel), the neighborhood search is based on array indices — Manhattan distance, if you like — and there is a fixed number of pixels or voxels in each neighborhood. So the two are somehow similar, but also very different in many respects. For classification, once we have the abstracted set of points, each representing a local neighborhood, we apply another, global PointNet to summarize the entire set into a global vector and then use a few fully connected layers to predict the scores for each class. We can also generalize this architecture to segmentation. Segmentation requires assigning a label to every point of the original point cloud, but we only have features at the intermediate, sub-sampled points. So we first carry those features over, and then use interpolation to estimate, for each point at the denser level, its feature value from the sparser level; then we apply a unit PointNet — basically a multi-layer perceptron applied to each point independently — to update the feature vector, and we also use skip links to bring in the original features of that level. We repeat this level by level until we have a feature vector for every point of the original point cloud, and those per-point features are used for per-point classification. To summarize this first part: we use hierarchical feature learning on point clouds for both classification and segmentation.
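A minimal sketch of the interpolation step just mentioned (inverse-distance weighting over the k nearest sub-sampled points; the function name is made up):

```python
# Feature propagation for the segmentation branch: features of the sub-sampled
# points are interpolated back to the denser level. A shared per-point MLP and
# skip links would follow in the full network.
import torch

def interpolate_features(dense_xyz, sparse_xyz, sparse_feat, k: int = 3, eps: float = 1e-8):
    """dense_xyz: (N, 3) target points; sparse_xyz: (M, 3) sub-sampled points;
    sparse_feat: (M, C) their features. Returns (N, C) interpolated features."""
    d = torch.cdist(dense_xyz, sparse_xyz)            # (N, M) pairwise distances
    knn_d, knn_idx = d.topk(k, dim=1, largest=False)  # k nearest sub-sampled points
    w = 1.0 / (knn_d + eps)                           # inverse-distance weights
    w = w / w.sum(dim=1, keepdim=True)                # normalize weights per point
    return (w.unsqueeze(-1) * sparse_feat[knn_idx]).sum(dim=1)
```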
The second aspect we look at is the common issue of non-uniform sampling density. Sampling density can vary a lot in 3D point clouds, due for example to the perspective effect, radial density variation, or the motion of the sensor and objects; it is a very common problem. In CNNs, people have shown that very small kernels work very well — VGG-style networks use 3x3 convolutions everywhere — so we asked whether the same is true for point cloud learning. It turns out it is not. If you look at the figure again, you see that each local region contains a different number of points, which is very different from images, where every region has a fixed 3x3 = 9 pixels; for point clouds, since the density varies, sometimes you may have only one point in a region. The numbers confirm this: if you decrease the number of points per shape from around a thousand to just 128 points, as on this chair, the original PointNet degrades gracefully, but the hierarchical feature learning version drops very quickly — its accuracy falls a lot as the number of points decreases. What we would like instead is to handle density adaptively: trust regions with higher density, but do not trust regions with very low density; if a local region contains only a few points, we should look at a larger region so that we are more confident about the pattern of the point cloud. With that idea, we design some new layer architectures. One we call multi-scale grouping: instead of looking at a single scale of local neighborhood, we look at several scales — say radius one, radius two, radius four — and for each neighborhood we use a PointNet to extract features, then concatenate those features to get a combined feature for the entire neighborhood across the different scales (see the sketch after this paragraph). During training we also randomly drop out input points, so that the network can learn when to trust the small neighborhood and when to rely more on the larger one: you can imagine that if many points are dropped, the larger neighborhood gets more weight, whereas in high-density areas the network may look more at the local neighborhood. Another, parallel design is to combine local features from different levels: we have several set abstraction levels applied recursively, and we can also combine features coming from different levels — one from the current level and one from the previous level; the idea is similar. Result-wise, we show that adding these new layers to handle varying density makes the performance much better than before: comparing the red curve with the green one, as points go missing the performance stays nearly the same, even down to as few as 256 points — much, much better. Here are the ModelNet40 classification results, with two conclusions from this table. One is that by using hierarchical feature learning, PointNet++ is consistently better than the original PointNet, because the hierarchy gives it richer representation power. The other, somewhat surprisingly, is that by adding normals our PointNet++ can even beat the multi-view image-based methods on classification. Without normals, some shapes — keyboards, or chairs with sparse backs — are very hard to discriminate from other categories, but normals help the performance a lot, as you can see here. We also show numerically that adding the new layers for the density problem improves accuracy on a scene-level semantic segmentation task: the yellow bar shows the result with non-uniform density but without the new layers, where performance drops a lot compared to the complete input, while with the new architecture we can boost the performance by a large margin; and our hierarchical PointNet++ architecture is much better than the original PointNet on this large-scale scene segmentation task. Besides that, we show that PointNet++ is a very flexible architecture: it is not restricted to Euclidean distance or to rigid objects, and it can be applied to organic objects as well. For organic objects — say we want to classify animals such as a horse and a cat — two shapes can look very similar extrinsically because their poses are similar, yet belong to different categories. For such non-rigid object classification we want to look at the surface and geodesic distances rather than the extrinsic shape, and the PointNet++ framework can be applied there too: instead of Euclidean coordinates we can use geodesic distances and intrinsic coordinates for the grouping.
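Going back to the multi-scale grouping layer mentioned above, here is a very rough sketch; the `mini_pointnets` modules, radii, and function name are placeholders for illustration:

```python
# Multi-scale grouping around one sampled centroid: group its neighborhood at
# several radii, run a small PointNet per scale, and concatenate the features.
import torch

def multi_scale_feature(xyz, centroid_idx, mini_pointnets, radii=(0.1, 0.2, 0.4), max_pts=32):
    """xyz: (N, 3); centroid_idx: index of one sampled centroid;
    mini_pointnets: one small PointNet module per radius (stand-ins here)."""
    feats = []
    for r, net in zip(radii, mini_pointnets):
        d = ((xyz - xyz[centroid_idx]) ** 2).sum(dim=1).sqrt()
        idx = torch.nonzero(d < r).flatten()[:max_pts]      # always includes the centroid
        local = xyz[idx] - xyz[centroid_idx]                 # local coordinates
        feats.append(net(local.unsqueeze(0)).squeeze(0))     # per-scale feature vector
    return torch.cat(feats, dim=-1)                          # concatenated multi-scale feature
```

During training, randomly dropping input points (as described above) encourages the network to weight the larger-radius features more when the local density is low.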
Together with intrinsic features for each point, and evaluated on these benchmarks, we reach state-of-the-art performance on this task. To visualize what PointNet++ has learned, we look at the first-layer features of a model trained on the ModelNet40 classes, which are mostly furniture. We see some typical activation patterns — for example planes, corners, and poles — which are common structures in furniture. So in this first session I covered PointNet, a unified framework for deep learning on unordered point sets, and our extension, the second-generation PointNet++, which performs hierarchical feature learning on point clouds and deals with non-uniform sampling density. Next, Professor Leonidas Guibas will cover how to generate point clouds and how to do primitive-based shape generation.

Thank you, Charles. I will cover the final two topics of the point cloud session: point cloud synthesis, and representing geometry through primitives. Before I start, let me emphasize one point about the material Charles covered on PointNet and PointNet++: it is not really specific to points in 3D. You can work with point sets in any dimension — for example points in 3D plus other attributes like RGB, or physical information like stresses and strains — and, in the end, in any metric space, so one can use non-Euclidean metrics as well. (I will hold the mic closer to my mouth.) I will talk first about point cloud synthesis. The goal is to solve a problem that was already discussed by previous speakers in this workshop and also appeared several times in this conference: how to go from a single image of an object to a 3D representation of that object — and we will discuss which representation to use. This is quite a challenging problem, because in addition to inferring depth for every pixel you see, you also have to imagine, or hallucinate, the geometry of the occluded part of the object that you do not see. It is only possible because we have priors: we are essentially recovering geometry we have seen before in various contexts. Beyond the work here, there is also evidence from nature that this problem is solvable. Several birds, such as pigeons, have their eyes on the sides of the head, which means the receptive fields of the two eyes overlap very little — the binocular area is very small — so they mostly have to act on information from a single eye. And even though we humans have stereo vision, the baseline between our two eyes is very short, yet we have no trouble recognizing that objects seen from very different viewpoints are the same — which again means we are somehow able to recover the 3D structure of the full shape from two very close views. It is interesting to think about what representation in the brain makes this possible, but that is not the topic we will address here; instead we will try to solve the problem algorithmically, in computer vision. It is a problem that dates back to the early days of the field — the seminal work of Berthold Horn at MIT on shape from shading, on understanding 3D structure from lighting — going back more than forty years now.
In computer vision, as you can imagine, we will take a different approach. We have knowledge of objects in the world, and because the shapes of real objects are a set of measure zero within the set of all potential shapes that could exist, we have a chance of recovering the shape of objects in images from just a single view. We will follow the paradigm of learning from synthetic training data. I already mentioned that we now have large 3D repositories: our group has been developing the ShapeNet repository, which contains about two and a half to three million models. Given these models we can generate renderings of them — images that we can make look fairly realistic by adding backgrounds, different lighting, occlusions, and so on — and then we have ground truth: images of objects together with the 3D geometry that generated them, which we can use to train a deep network. The particular method I will show you was trained using roughly 220,000 shapes from two thousand categories, with about ten million images generated from that data. The question, though, is in what format to generate the 3D geometry, an issue already discussed extensively in the previous talks. The usual candidate for deep learning is the volumetric representation, because the regular grid structure is easy for the network to consume or produce. The trouble is that it is an expensive data structure — the size of the grid is cubic in the resolution — and with a low-resolution grid you tend to lose fine detail: in most of the world having thin legs is a plus, but here it is a minus, because you lose them. Also, in the grid representation, geometric transformations such as rotations are quite expensive. So instead, as you might imagine in this part of the presentation, we use a point-based approach: the geometry we generate is a point cloud. In this case we go from a single image of an object to a point cloud of roughly 1024 points, and I believe this is the first time such an approach has been described in the vision community. What we want to do is this: we have an image of an object, say this car that you see on the right; we segment the object, and from that we generate a point cloud in 3D space that captures the shape of the object. I show here two renderings of this point cloud, one in the same pose as the object and one from a different pose, so you can see that the reconstruction — the lifting from 2D to 3D — captures not just the front but also the back: it does infer the part of the geometry that is not visible from the viewpoint from which the image was taken. And here is a comparison in which everything is mapped to volumetric grids to make it fair: we start with an input table, then show reconstructions using volumetric techniques, our reconstruction using a point cloud converted to a voxel grid, and the ground truth. You can see that we do better at capturing fine details, especially the legs of the table. Let me now discuss the architecture of the network that accomplishes this up-conversion from the image domain to the 3D domain: a network trained with ShapeNet data that produces a point cloud.
Of course, to train the network we need ground truth, so we take our 3D model, render the view we use as input, and also sample it into a point cloud. What we realized is that the same shape can have multiple point cloud representations — it is not canonical in the way that scan-converting the shape onto a voxel grid is — so we need a somewhat nuanced way of comparing point clouds and defining the appropriate loss function. Let me say a few things about how to design this loss. It is a geometric problem: we have two sets of points, one from the ground truth and one from the reconstruction (shown here in blue and red), and we want to measure how close they are. This is non-trivial because we do not have associations between the points: we do not know which red point is supposed to correspond to which blue point. There are various ways to solve this correspondence problem. One classical method is to use an assignment or transportation technique that pairs the points so as to minimize the transport cost: you look for a correspondence phi that minimizes the sum of distances from each red point to its corresponding blue point — the classic optimal assignment — and the resulting metric is called the Earth Mover's Distance. Another way is to do something simpler: match each point of one set to the nearest point of the other set, and vice versa; this defines a variant called the Chamfer distance. Notice that we use squared distances in the formula because they are easier to differentiate; unfortunately this makes it not a true metric — it does not satisfy the triangle inequality — but we can still use it as a loss function. These are two possible distances, and one can think of others; in this setting we have to understand the relevant criteria for selecting one, which are both geometric and computational. On the geometric side, we want the distance to reflect natural semantic similarity or difference between shapes and to allow effective shape interpolation; at the same time we want the distance to be differentiable and fast to compute, because we have to do large-scale training. I should point out that both the Earth Mover's Distance and the Chamfer distance are naturally differentiable, except on a set of measure zero: as you move the points, there are certain discrete instants when the associations change, but that set is a very, very small part of the space of all point configurations.
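Here is a minimal sketch of the Chamfer distance just described (squared nearest-neighbour distances in both directions); the EMD would instead require solving an optimal assignment between the two sets, which is more expensive:

```python
import torch

def chamfer_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """p: (N, 3) predicted points; q: (M, 3) ground-truth points."""
    d = torch.cdist(p, q) ** 2                  # (N, M) squared pairwise distances
    p_to_q = d.min(dim=1).values.mean()         # each predicted point to its nearest target
    q_to_p = d.min(dim=0).values.mean()         # each target point to its nearest prediction
    return p_to_q + q_to_p
```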
Another fundamental issue is that the network is being asked to solve something ambiguous — something with multiple solutions. It looks like a regression problem, but it is not quite, because there can be multiple valid answers. First of all, for the part of the object we do not see, many possible things could be hidden behind or in front. But even in the part we can see there can be ambiguities: to make the distance comparison meaningful, we try to generate the point cloud of the object in the same pose and position in space as the original shape, and some views are ambiguous — for example, you may think the chair faces towards the front and to the right, but it could equally well face towards the back and to the right. Because of this ambiguity, the network really has to deal with a distribution of solutions: it is not a function mapping one image to one shape, it maps one image to a set of possibilities, and since it does not know which one you want, it somehow tries to produce a solution that mixes them. This is where the choice of distance makes a big difference. Think of a distribution of shapes: if you take a circle of points with variable radius, the mean shape in terms of the Earth Mover's Distance is a fairly nice, clean circle, while with the Chamfer distance you get a noisy version. On the other hand, if you start with something that has more combinatorial variation — say a bar at the bottom plus a few noisy outlier points — the Earth Mover's Distance tries to smooth and clean up the solution and merges the top and bottom outliers into mass in the middle, while the Chamfer distance, because it does not have to match one to one, does not pay much for those outliers. So it is hard to tell which distance is best. You can see from this real example that, on the one hand, the EMD gives a cleaner solution than the Chamfer distance when the shape is clean; on the other hand, that same force can make the EMD lose fine structure that the Chamfer distance captures, as the bottom arrow shows. So we experimented with both. Now let me talk a bit about the actual structure of the network. One observation basically built into the network structure is that the geometry of shapes — and again we represent the surface of the shape, not the volume — typically comes in two parts. There are parts that are smooth: large areas that are planar, spherical, cylindrical, or otherwise quite smooth, and most man-made shapes consist mostly of such surfaces. But there are also special areas, like corners, where the geometry becomes much more intricate. If we are going to do the reconstruction well, we must respect this distinction, because the statistics of these two types of geometry are quite different. So in the up-conversion — the network starts in the standard way, with a standard convolutional architecture, because the input is an image — the lifting into 3D splits into two different branches, one that captures the smooth behavior and one that captures the intricate behavior. For the first we use a deconvolutional approach, and for the other a fully connected approach; each of these produces a point cloud, and we simply combine the two point clouds to get the final result. Here is an example with these two branches: of the 1024 output points, we pre-allocate, I think, 768 to the smooth branch and 256 to the intricate branch. In the reconstruction the smooth portion is shown in blue and the non-smooth portion in red, and you can see that the red captures the extremities of the shape. The deconvolutional branch is essentially a traditional conv net computing a parametrization.
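A rough sketch of such a two-branch decoder follows; the layer sizes and class name are illustrative, not the paper's exact architecture:

```python
# Two-branch point cloud decoder: a deconvolution branch that emits a smooth
# grid of XYZ coordinates (a parametrization from a 2D array into 3D), and a
# fully connected branch for intricate structures; outputs are concatenated.
import torch
import torch.nn as nn

class TwoBranchPointDecoder(nn.Module):
    def __init__(self, feat_dim=512, fc_points=256):
        super().__init__()
        # Deconv branch: upsample a 1x1 feature map to a 24x32 grid = 768 points.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 128, kernel_size=(3, 4), stride=(3, 4)),  # 1x1 -> 3x4
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),   # 3x4 -> 6x8
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),    # 6x8 -> 12x16
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=2, stride=2),     # 12x16 -> 24x32 (768 pts)
        )
        # Fully connected branch: predicts fc_points free 3D points.
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, fc_points * 3),
        )

    def forward(self, feat):                          # feat: (B, feat_dim) image encoding
        grid = self.deconv(feat[:, :, None, None])    # (B, 3, 24, 32)
        smooth = grid.flatten(2).transpose(1, 2)      # (B, 768, 3) smooth points
        detail = self.fc(feat).view(feat.size(0), -1, 3)  # (B, 256, 3) intricate points
        return torch.cat([smooth, detail], dim=1)     # (B, 1024, 3)
```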
That is, it maps a smooth 2D parameter domain — in this case an array, like the plane — into 3D space, producing XYZ coordinates for the points of the parameter domain, and this is very good at capturing smooth shapes, as we know from traditional computer-aided geometric design. The combination of the two branches does quite well compared to baseline reconstructions: compared to a volumetric reconstruction in particular, we get a 63% error reduction. So the point here is that the representation really matters: how you encode your geometry can make a very big difference in the results you get. I also show some examples where, even though we trained the network with synthetic images, it works on real images too; on top you see the reconstructed shape both in the original view and from a different side, and in the bottom row you see that the network can do something even with shapes outside the categories it was trained on — it captures the basic structure of the shape. The same network can also be used to reconstruct not just from an image but from a partial scan, so it can be used to complete shapes. And this is just a beginning: there are many other problems to study, such as how to select the appropriate loss function and the trade-off between the Chamfer distance and the Earth Mover's Distance, how to add further structural constraints — for example symmetries you may know about — and how to go beyond single objects: right now, if you give this pipeline a collection of objects it will try to merge them into a single object, because that is the only thing it has ever seen. And of course we have to study further how generalizable the method is. Let me end by going to the last part, which deals with describing geometry by primitives. This is really an old topic in vision; even before the shape-from-shading work, going back to the early 70s, Binford at Stanford introduced the notion of representing 3D geometry by generalized cylinders — simple geometric primitives that can very efficiently encode very common geometries. The question now is: for any shape, can we find a representation of it in terms of a certain set of geometric primitives? The primitives we use in this work are just the most basic possible ones: boxes. So we take geometry in voxel format and try to replace it by the boxes you see here. Furthermore, we would like these boxes to be allocated consistently across similar shapes — we would like the boxes to somehow capture semantic information about the structure of the shape — and you can see here that the coloring of the boxes shown is consistent.
As I mentioned, in this work we start with a volumetric representation of the shape, and then we have an encoder — in this case a traditional volumetric net — that predicts the primitives. We start with something very simple, where we fix the number of primitives (I will say a bit later how this restriction can be removed): for now, say we predict M primitives to describe the voxels of the shape on the left. The nice thing is that we do not really need supervision, because the geometry of the shape itself serves as the supervision: the question is simply whether the union of the cuboid surfaces properly approximates the geometry we started with. And by computing the cuboids that cover the geometry, we have effectively also computed a segmentation of the geometry: each primitive attracts, or takes, certain voxels as belonging to it and therefore separates the voxels of the shape into different parts. Now there are two interesting aspects here. One is how to deal with a variable number of parts — not every shape has the same number of parts, and some chairs do not have arms, for example. One way to deal with that is to also generate an existence probability for each part: you still predict M parts, but for each of them you give a probability that it is actually present, so parts with low probability can simply be truncated away. The other interesting aspect is the loss function: how well the geometry of the cuboids predicts the geometry of the shape, in both directions. You want essentially every voxel of the shape to be close to one of the primitives, and you want every piece of every primitive to be close to some part of the shape. This gives a two-way distance, a bit like the Chamfer distance from before, where we measure the distance from the cuboids to the voxel grid and from the voxels back to the cuboids. We do that essentially by sampling points on the cuboids and measuring their distance to the original shape, and the nice thing is that the coordinates of these sample points are all linear functions of the dimensions of the primitive, so we have nice differentiability properties and can backpropagate through our net. In fact, because the dimensionality of the representation is extremely low — that is one of the advantages of representing geometry by primitives, the number of independent parameters is very small — this process can be quite fast. Another issue is the consistency of these decompositions: we would like not just to parse each shape into primitives individually, but to parse shapes consistently across a family. What really helps here — and this is more a property of the formulation than of the optimization — is that the whole pipeline is a continuous function: if the shape changes smoothly, the cuboid decomposition also deforms smoothly. What this means in practice is that if you have similar shapes, then corresponding structures, corresponding parts, will very likely be captured by corresponding cuboids, so there will be a rough one-to-one correspondence between the cuboids of the two shapes. So what is nice here is that we are getting, essentially without supervision, a segmentation that is consistent across shape families.
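A rough, simplified sketch of that two-way coverage loss is given below, ignoring rotations and the existence probabilities for clarity; it is an illustrative approximation, not the paper's implementation:

```python
# Points sampled on the cuboid surfaces should lie near the shape, and shape
# points (e.g. occupied voxel centers) should lie near some cuboid.
import torch

def two_way_coverage_loss(cuboid_samples: torch.Tensor, shape_points: torch.Tensor):
    """cuboid_samples: (M, S, 3) points sampled on the surfaces of M cuboids
       (their coordinates are linear in the cuboid dimensions, so gradients flow).
       shape_points: (P, 3), e.g. centers of occupied voxels."""
    samples = cuboid_samples.reshape(-1, 3)                     # (M*S, 3)
    d = torch.cdist(samples, shape_points)                      # pairwise distances
    prim_to_shape = d.min(dim=1).values.pow(2).mean()           # cuboids lie on the shape
    shape_to_prim = d.min(dim=0).values.pow(2).mean()           # shape is covered by cuboids
    return prim_to_shape + shape_to_prim
```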
One way to evaluate this is simply to measure how well the method does compared to others on segmentation, and here you can see that it is actually outperformed by some of the other, more classical methods on this measure. We also add a parsimony term in the regularization of the net, so that it tries to explain the geometry with as few primitives as possible: you can see that initially it uses rather more primitives, but as the iterations proceed the number converges to what we would think of as an economical explanation of the geometry of the shape. Once we have this machinery — and again, because we have ShapeNet models we can get training data — we can do the same thing from images: we can take an image of an object and try to generate a 3D model of the object in this cuboid form, and again this is only possible because of the link between 2D and 3D coming from the synthetic data. This is a very simple example with one kind of primitive; one can easily consider variations with spheres, cylinders, cones, and other geometric primitives. These primitives can be very useful for modeling: once you have fit geometric primitives to a shape, they offer natural deformations — you can imagine elongating a cylinder, making a cube flatter, and so on — so this could be used for design. It is a very interesting possibility of this representation: a point cloud representation is also good for design, but this one gives you natural handles by which to modify the shape, and again it would be nice to automate more of what artists do. And this brings me to the end of the presentation. I covered two aspects here. I talked about how to go from a single image to a point cloud — a 3D version of the underlying object — and we saw several interesting issues in that part: the issue of the loss function, because we no longer compare regular structures like voxel grids but point clouds, so we need distance functions that are semantically good but also cheap to compute and differentiable, so they can be put into end-to-end pipelines; the issue of doing the up-conversion through different paths, one meant to capture smooth geometry through a parametrization and the other, fully connected, meant to capture the more intricate aspects; and the issue of ambiguities, and how to get a net to predict not one but multiple structures — in fact, in the real implementation there is something like a random variable that you can use to select among multiple reconstructions of the same image, so if you try several times you can get different outputs that correspond to plausible variations of the unseen structure. In the second part, the main idea was to represent geometry by a very small number of simple geometric primitives. It was possible to accomplish that without training data, because essentially the geometry itself becomes its own supervision, and I think this is a very compact representation that, as I showed, can be kept consistent across related shapes and provides a great tool for shape manipulation. Thank you — we break for fifteen minutes. The original plan was to have the last session start at 4:30, but because we ended the second session a little earlier, we will start at 4:20, so please come back here in about fifteen minutes.
So I think we need to start — it is already 4:20, and the sooner we start, the sooner we finish. I am really sorry, and feel a bit guilty, to be the last one to speak here, standing between you and your dinner. In this last part of the tutorial I would like to talk about something that is, I would say, a little exotic for this audience: a kind of deep learning that we call intrinsic deep learning on manifolds. This is a rather different way of thinking about deep learning, or convolutional neural networks, on 3D data: we will not be consuming the 3D data as volumetric objects or images — that is, as Euclidean, vector-space objects — but will instead consider them as manifolds. This model is very common, very popular, and very natural in computer graphics and among people dealing with geometry; it is, I would say, less common in the computer vision community, which is why I call it exotic. Hopefully it will be clear, and those of you who attended our tutorial on deep learning on graphs on the first day will find some similar concepts — at the end it looks more or less the same. The main idea, I think, is illustrated well in this slide. Look at these surfaces, these three-dimensional objects that deform — and here I will be talking mostly about applications that involve deformable shapes. We basically want to be invariant to deformations: for example, we want to find correspondences across collections of shapes, where we want to factor out all the deformations and be insensitive to them. If we treat the object as a Euclidean object — as a volume or a range image — we can of course apply standard convolutional neural networks, as we have seen in the first part of the tutorial, but the complexity of the network and the amount of training data will be quite significant, because we will need to learn invariance to these deformations from the data: the Euclidean neural network, at least in a straightforward form, will not be invariant to deformations, not even to rigid transformations like rotations. So the main idea of this part of the tutorial is to define the basic operations of convolutional neural networks — in particular the filters — intrinsically, on the surface itself, in a way that makes invariance to deformations, at least to a certain class of transformations called isometries, automatically built into the neural network architecture. As you will see, as a result the networks become much smaller, with fewer parameters — you only need to represent whatever deviates from this automatic invariance — and consequently they also require much less training data. That is the main idea of the works I will be presenting. I will cover three classes of approaches: spectral methods, which are based on an analogy of the Fourier transform on manifolds; spatial methods; and embedding-based methods, a more recent class of constructions that will be presented later this year. To start with the basics of our model: we now model shapes as manifolds. Basically, the only thing you need to know about manifolds is that they do not have a global Euclidean vector-space structure; technically speaking, they are topological spaces that can be modeled locally as Euclidean spaces.
This local Euclidean model is what we call the tangent space, or, for two-dimensional manifolds — surfaces — the tangent plane. Do not be misled by the fact that these are two-dimensional manifolds: they are the boundary surfaces of three-dimensional objects, three-dimensional volumes; two-dimensional manifolds that model three-dimensional shapes. On the tangent space we define an inner product called the Riemannian metric; it allows us to locally measure angles and distances. Deformations of the surface that do not change the metric are called isometries, or metric-preserving deformations, and any property of the surface that can be expressed entirely in terms of the metric will survive such deformations; these are called intrinsic properties. We can also measure distances on surfaces; these are called geodesics, that is, shortest paths connecting pairs of points on the manifold. We will also be working with differential operators on manifolds. Again, without going into too much detail, assume you are given some smooth function that lives on your manifold — what we call a scalar field. You can define a Hilbert space — basically a space of functions with an inner product — on this manifold, and you can compute differential operators; in particular we will be interested in the Laplacian, called in differential geometry the Laplace-Beltrami operator. It takes a scalar field and returns a scalar field, and geometrically what it does is take the local average of the function over an infinitesimal neighborhood of a point and subtract it from the value of the function at that point, accounting correctly for the local geometry, for the curvature. What we like about this operator is, first, that it is intrinsic: everything expressed in terms of quantities derived from it will be deformation invariant by construction. Second, it is self-adjoint — symmetric, if you write it as a matrix — so it has orthogonal eigenfunctions, which we will interpret as a generalization of the Fourier basis. And it is positive semi-definite, meaning it has non-negative eigenvalues, which can be interpreted as analogues of frequencies in classical harmonic analysis. Working with discretized shapes, which as you saw at the beginning can come in many different forms, there are several discretizations of the Laplacian. If we deal with point clouds, we basically build a nearest-neighbor graph and assign weights that depend on the Euclidean distances between the points. If we work with meshes, which is a very popular representation in computer graphics, there is the ubiquitously used cotangent formula, which looks at the cotangents of the two angles opposite each edge. The bottom line is that it all boils down to a large sparse matrix on which we need to perform an eigendecomposition. The eigenfunctions, as I said, are real and orthonormal, and the eigenvalues are non-negative; so everything reduces to an eigendecomposition problem for a large sparse matrix. You can interpret the Laplacian eigenvectors, or eigenfunctions, as the smoothest orthogonal basis: the eigendecomposition can be posed as the minimization of what is called the Dirichlet energy.
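Here is a minimal sketch of such a discretization for a point cloud (a k-nearest-neighbor graph Laplacian with Gaussian weights, not the cotangent mesh Laplacian); the function name and parameter choices are illustrative:

```python
# Build a graph Laplacian on a point cloud and compute its first few
# eigenfunctions, which play the role of a Fourier basis in what follows.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
from scipy.spatial import cKDTree

def laplacian_eigenbasis(points: np.ndarray, k: int = 8, n_eig: int = 30, sigma: float = 0.1):
    n = points.shape[0]
    dist, idx = cKDTree(points).query(points, k=k + 1)        # k nearest neighbours
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].reshape(-1)                              # drop the self-match
    w = np.exp(-dist[:, 1:].reshape(-1) ** 2 / sigma ** 2)     # Gaussian edge weights
    W = sp.coo_matrix((w, (rows, cols)), shape=(n, n))
    W = 0.5 * (W + W.T)                                        # symmetrize
    L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W        # graph Laplacian L = D - W
    eigvals, eigvecs = eigsh(L.tocsc(), k=n_eig, sigma=-1e-6)  # smallest eigenpairs
    return eigvals, eigvecs                                    # "frequencies" and basis
```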
Now, if you look at the Laplacian eigenfunctions in the Euclidean one-dimensional case, you will find that these are basically sines and cosines, what we call the Fourier basis. You can easily see that if you take a cosine and take its second-order derivative, you get back the same cosine multiplied by the frequency squared; so these are the eigenfunctions, and those the eigenvalues, of the operator. Of course we can generalize this idea: instead of taking the standard Euclidean Laplacian, take the Laplacian on the manifold; then you get orthogonal eigenfunctions that generalize the Fourier basis. That is the main idea of spectral analysis. In the Euclidean case you can take a function and decompose it into a linear combination of sinusoids of different frequencies; this is what we call Fourier analysis. The forward Fourier transform, or the Fourier coefficients, are nothing else but the projections of our function onto these orthogonal basis functions, and the inverse Fourier transform takes the linear combination of the basis functions weighted by these Fourier coefficients. The same idea applies to manifolds; we just need to replace the basis with the eigenbasis of the manifold Laplacian.

We can apply this to solve PDEs; the simplest example of a PDE on a manifold is the heat equation. It encodes what is called in physics Newton's law of cooling. Here f represents the temperature at a spatial and temporal coordinate, and Newton's law of cooling tells us that the rate of change of the temperature of an object, the left-hand side, the temporal derivative, is proportional to the difference between its own temperature and the temperature of its surroundings, which is spatially encoded by the Laplacian; the proportionality coefficient is called the thermal diffusivity constant. On the manifold it is exactly the same equation; the only thing that changes is that we use the manifold formulation of the Laplacian. So we have an initial heat distribution, and as we allow time to run, this initial heat distribution diffuses over the manifold. We can express the solution of the heat equation by applying what is called the heat operator, the exponential of the Laplacian, where the exponentiation is understood in the operator sense: we apply it to the eigenvalues of the operator. What you see here is a forward Fourier transform: we project the initial condition onto the orthogonal basis, we attenuate by the exponentiated eigenvalues, and we compute the inverse transform. If I plug in the explicit expression for this inner product and exchange integration and summation, what I get is the fundamental solution, or the heat kernel, of this diffusion equation, a function that depends on two coordinates x and x' and on the time parameter. This is how heat kernels look; here they are centered around different points. The heat kernel tells you how heat propagates from a point to all the other points on the manifold. You see that it is geometry dependent and not shift invariant: at different points on the manifold we get different heat kernels. What is nice, though, is that it is deformation invariant; you see that it follows the deformations of the horse.
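Using eigenpairs like the ones computed above, the spectral solution of the heat equation and the corresponding heat kernel can be written down directly. A minimal sketch, assuming evals and evecs come from a symmetric Laplacian discretization as before; a mesh Laplacian would additionally carry a mass matrix in the inner products.

import numpy as np

def heat_kernel(evals, evecs, t):
    """k_t(x, x') = sum_i exp(-t * lambda_i) phi_i(x) phi_i(x'), truncated to the computed eigenpairs."""
    return (evecs * np.exp(-t * evals)) @ evecs.T   # dense (n, n) matrix

def heat_diffusion(f0, evals, evecs, t):
    """Solve the heat equation spectrally: forward transform, attenuate by exp(-t*lambda), inverse transform."""
    fhat = evecs.T @ f0                             # forward "Fourier" transform of the initial condition
    return evecs @ (np.exp(-t * evals) * fhat)      # inverse transform of the attenuated coefficients

# Example: diffuse a delta initial condition placed at vertex 0 for time t = 0.1.
# f0 = np.zeros(evecs.shape[0]); f0[0] = 1.0
# f_t = heat_diffusion(f0, evals, evecs, t=0.1)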
Now, this was the heat kernel on the manifold; let's see what happens in the Euclidean case. If instead of the manifold we take Euclidean space and plug in the orthogonal eigenbasis of the Laplacian, we see that, because of the properties of the exponential function, the heat kernel depends only on the difference between x and x', not on x and x' separately, and the solution integral is just a convolution of f0, the initial condition, with the heat kernel. This allows an interpretation in terms of signal processing: we can interpret the heat kernel as the impulse response, basically what the solution of the heat diffusion equation looks like after time t if we initialize it with a delta function. Now remember what a convolution is in the Euclidean case: it is given by the convolution integral, and it is a shift-invariant operation, meaning that if I shift one of the functions, the result is the same result shifted. It commutes with the Laplacian, so it is diagonalized by the same eigenbasis, and this is what is called the convolution theorem: the Fourier transform diagonalizes the convolution operation. This is the bread and butter of classical signal processing. We can also do filtering in the spectral domain: we take the Fourier transform of the signal f, multiply it by the filter, denoted here by g-hat, and compute the inverse Fourier transform. And why is this good? Because in the Euclidean case we can compute the Fourier transform fast using the FFT.

Let's write it in terms of matrices and vectors. The convolution, written as a matrix operation, is expressed by a matrix with a special structure: the coefficients of the filter run along the diagonals; this is what we call a circulant (a special Toeplitz) matrix, and, as we said before, it is diagonalized by the Fourier basis. So we can write it like this in matrix-vector notation: the Fourier transform of f, multiplication by a diagonal matrix (which means pointwise multiplication), and the inverse Fourier transform. This is the discrete version of the convolution theorem. Unfortunately, in the non-Euclidean case, on manifolds, we do not have shift invariance, so we cannot even define, or at least not straightforwardly define, the convolution integral. But we can take what is a property in the Euclidean case, the convolution theorem, and use it as our definition of what we can call the spectral convolution: we take the Fourier transform of f, take the Fourier transform of g, multiply them element-wise, and take the inverse Fourier transform. In matrix notation: Fourier transform of f, multiplication by a diagonal matrix of spectral coefficients, inverse transform. This matrix G is no longer circulant, so the result is not shift invariant; what is worse, the filter coefficients depend on the basis. Unlike the Euclidean case, where the basis is fixed because it is a property of the space in which we are working, here each manifold, each shape, comes with its own Laplacian and its own basis; we will see why this is a problem. Let's now fast-forward to the generalization of convolutional neural networks. We will start with the spectral-domain representation and formulate the convolution operation in the spectral domain. This is the work of Joan Bruna and colleagues; they were the first to do it on graphs, a spectral convolutional model on graphs. The idea is exactly the same convolutional layer that we have in the classical case, in image analysis for example, but the filters are now represented in the frequency domain.
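The spectral convolution defined through the convolution theorem (forward transform, element-wise multiplication by the filter's spectral coefficients, inverse transform) is essentially a one-liner in the truncated eigenbasis. A minimal sketch; truncating to k eigenvectors is an assumption made for efficiency.

import numpy as np

def spectral_conv(f, g_hat, evecs):
    """(f * g)(x) := Phi diag(g_hat) Phi^T f, with Phi the (truncated) Laplacian eigenbasis.
    f: (n,) signal on the vertices, g_hat: (k,) spectral multipliers of the filter, evecs: (n, k)."""
    return evecs @ (g_hat * (evecs.T @ f))

# A spectral CNN layer in the spirit described above simply treats g_hat as learnable,
# one set of multipliers per input/output channel pair, which is why the parameter count
# grows with the number of retained frequencies (and with n if no truncation is used).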
The way it works: you take your function, you compute the Fourier transform on the manifold or on the graph, you multiply it by a diagonal matrix of the spectral coefficients of the filter, and you take the inverse Fourier transform; you can do this with multiple filters, and the symbol here denotes a nonlinearity, for example the rectifier. This was the first conceptually important work, but it has many disadvantages. The number of parameters is of order n, the size of the input, in this case the number of vertices in the mesh, whereas in classical convolutional neural networks the number of parameters is fixed, independent of the input size. The complexity of the computation is of order n squared, because we do not have an FFT on manifolds or graphs; we need to multiply by this dense matrix of eigenfunctions. There is no guarantee of spatial localization of the filters, which is another of the nice properties of classical convolutional neural networks: there the filters are local; here the filters are not local at all, they can be anything. And what is worse, the filters are basis dependent, which means that if we learn the model on one shape, we cannot apply it directly to another shape; we will see how to remedy this problem, at least to some extent, with spectral transformer networks.

To give an illustration of this problem: this is a function given on a horse, some blobs, and I design a filter that does some kind of edge detection, computed in the Fourier basis of this horse. Now let's start deforming the horse and apply exactly the same filter to exactly the same function; you see that the result is totally different, and the reason is that the basis is unstable. Each deformation of the horse has a new basis that is not necessarily consistent with the basis of the original shape. You can see it here: this is the 52nd eigenfunction of the Laplacian, and it changes completely across different poses, and these are almost isometric deformations; with non-isometric deformations it would be totally different. We will see how to remedy this with spectral transformer networks and with spatial constructions of convolution in the following, but let's go back to the second problem we had, the lack of spatial localization, which was one of the fundamental properties of classical convolutional neural networks. We can express the localization property in the frequency domain: what should the spectral multipliers of the filter satisfy so that the resulting filter in the spatial domain, on the manifold, is localized? In the Euclidean case we can use a property of the Fourier transform that in signal processing is called the vanishing-moment property: if we want the higher-order moments of our function in the spatial domain to be small, we want the higher-order derivatives of its Fourier transform to be small. The bottom line is that localization in space translates into smoothness in the frequency domain. What we can do, then, is parameterize the filter using a smooth spectral transfer function; if we make it parametric, we can also make the number of parameters that express the filter independent of the input size, for example by using polynomials.
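A polynomial spectral transfer function g(lambda) = sum_r alpha_r lambda^r can be applied without ever computing the eigenvectors, by repeatedly multiplying the signal by the sparse Laplacian; this is where the fixed parameter count, the guaranteed r-hop support, and the linear complexity mentioned next come from. A minimal sketch with plain monomials; in practice the Chebyshev recursion discussed below is the numerically preferable variant.

import numpy as np

def polynomial_filter(L, f, alpha):
    """Apply g(L) f = sum_r alpha[r] * L^r f using only sparse matrix-vector products.
    L: sparse (n, n) Laplacian, f: (n,) signal, alpha: (r+1,) polynomial coefficients."""
    out = alpha[0] * f
    Lf = f
    for a in alpha[1:]:
        Lf = L @ Lf                 # one more power of L means one more hop of support
        out = out + a * Lf
    return out

# e.g. a crude degree-3 filter:
# f_filtered = polynomial_filter(L, f, alpha=np.array([1.0, -0.5, 0.1, -0.01]))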
To be completely correct, it is better to use Chebyshev polynomials, because they form an orthogonal basis, but without going into these details, we can consider a class of polynomial spectral transfer functions that are smooth by definition. With this, all the problems we have seen before are solved: the number of parameters is fixed; the filters are localized, and not only localized, they have guaranteed r-hop support, because we are taking r-th powers of the Laplacian and the Laplacian is a local operator, affecting only the one-ring neighborhood on the graph or the mesh; and we need no explicit computation of the eigenvectors, because everything boils down to applying powers of the Laplacian to our signal, which leads to linear complexity in the input size. However, we still do not have generalization across domains; that deficiency remains. And why do we not have generalization across different domains? Because the eigenvectors, especially the high-frequency ones, behave differently on different shapes; you can see here, for example, that the 20th eigenvector of the Laplacian looks very different.

I do not have time to go into details, but what we can do is the following: if we have a correspondence between two shapes X and Y, we can express it as a linear operator, which we call the functional map; that was also work from the Stanford group. If every function that lives on these manifolds can be expressed in the Fourier basis, say truncated at the first k coefficients, we can approximate this linear operator in the frequency domain as a low-rank operator, and then we have a matrix C that translates Fourier coefficients from one basis to the other, encoding the correspondence. This matrix usually looks like this: ideally, in an ideal world, it would be diagonal, because if the shapes were perfectly isometric their eigenfunctions would be the same, maybe up to sign; in practice, because the shapes are not exactly isometric, it has this funnel-shaped structure, because at high frequencies the eigenvectors differ significantly. So this matrix C encodes the correspondence between the shapes in the frequency domain. Now, if we write the correspondence operator in this form, we can absorb C into one of the bases and think of the product as a new, transformed basis; in this new basis the functional correspondence is encoded by the identity matrix. So you can think of the functional map as a kind of synchronization of the bases: I find a new basis, modifying the old one by means of a linear transformation, such that the new Fourier coefficients in these new bases speak the same language. Going back to spectral convolutions: if I take a spectral filter and apply it to two different shapes, I get completely different results, because the bases behave inconsistently. The idea of spectral transformer networks is to use some intermediate reference shape, shown here in gray: we do the filtering in the common basis of this shape by means of these functional maps, and you see that in this new basis the filters now behave consistently.
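The matrix C that translates Fourier coefficients from one shape's basis to the other's can be estimated from corresponding functions (for example, descriptors) by a linear least-squares solve. A minimal sketch, assuming a simple Euclidean projection onto truncated eigenbases; with a mesh Laplacian the projection would use the mass-matrix inner product.

import numpy as np

def functional_map(evecs_x, evecs_y, F, G, k=30):
    """Estimate C (k x k) such that C @ A is approximately B, where A and B are the spectral
    coefficients of corresponding functions F on shape X and G on shape Y.
    evecs_x: (n_x, >=k), evecs_y: (n_y, >=k) truncated eigenbases; F: (n_x, q), G: (n_y, q)."""
    A = evecs_x[:, :k].T @ F                        # (k, q) Fourier coefficients on X
    B = evecs_y[:, :k].T @ G                        # (k, q) Fourier coefficients on Y
    # Least squares: min_C ||C A - B||_F^2, i.e. solve A^T C^T = B^T column by column.
    C = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T
    return C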
So if this was the structure of the spectral convolutional neural network seen before (take a signal, compute its Fourier transform on the manifold, apply the spectral filter, and compute the inverse Fourier transform, possibly followed by a nonlinearity), here we introduce a spectral transformer module that modifies the basis, like in classical spatial transformer networks; here the transformation is applied in the spectral domain, in the k-dimensional space of Laplacian eigenvectors. That was work again from the Stanford group, presented actually at this conference a few days ago, and it solves the problem of the lack of generalization across different domains. Of course, now you need to explicitly compute the Fourier transform, and this comes at high computational complexity, because you cannot compute it in n log n operations; you need n squared operations. Here are some results: these are examples of normal prediction with spectral transformer networks, and you can see that it works very nicely; the color represents the orientation of the normal. This is another example of shape segmentation, similar to the problem shown in the first part of the tutorial, where, if you compare to the ground truth, the segmentation is very faithful.

So that was about spectral-domain methods for building convolutional neural networks; let's now move to the spatial domain. If we look at different possible definitions of convolution, we have seen in the Euclidean case the spatial-domain definition: we can think of convolution as applying some kind of template and running it over a signal; that was the convolution integral. In the spectral domain we have seen the convolution theorem, and the spectral CNNs were based on that analogy. So the question is: what would be the spatial equivalent of convolution on the manifold? Think of convolution as a kind of patch-based operation: in an image you extract a patch of pixels, you multiply it by some template, your filter, you sum up the result, and you move to the next position. We would like to generalize the same thing to manifolds, but the one thing that changes is that the patch becomes position dependent; we do not have shift invariance. Technically speaking, we can define a local system of coordinates in which we represent the function that lives on the manifold locally around a point, and we call this the patch operator. In the image analogy the way the patch is extracted is position independent, because of the shift invariance of the underlying space; here it is position dependent: the extraction of the patch at points x and x' will be different. A different way of thinking of the patch operator is to apply weighting functions to the function that lives on the manifold around the point, representing the local system of coordinates as local averages of the function. We can then define the spatial convolution with a filter in this way: take the local patch, multiply it by the filter, and sum everything up. This can be a continuous filter or a discrete one, depending on whether our family of weighting functions is continuous or discrete; in practice it will always be discrete. Probably the most intuitive and straightforward way of defining these patches is to use local geodesic polar coordinates: the patch will look like a radial-angular window here, but it will be defined intrinsically on the manifold.
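One way to read the patch operator: for every point, a family of weighting functions (one per local bin) turns the function around that point into a small vector of weighted averages, which is then multiplied by the template. A minimal sketch that assumes the bin weights have already been precomputed as a list of sparse matrices; how those weights are constructed (geodesic polar bins, anisotropic heat kernels, learned Gaussians) is exactly what distinguishes the constructions discussed below.

import numpy as np

def apply_patch_operator(weight_mats, f):
    """weight_mats: list of J (n, n) sparse matrices; W_j[x, x'] is the j-th weighting function around x.
    f: (n,) signal. Returns an (n, J) array: the 'patch' of f extracted at every point."""
    return np.stack([W @ f for W in weight_mats], axis=1)

def intrinsic_conv(weight_mats, f, g):
    """Spatial intrinsic convolution: multiply every patch by the template g (length J) and sum."""
    return apply_patch_operator(weight_mats, f) @ g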
The weights here will be radial and angular weights that localize, basically averaging the values of the function in each of these radial and angular bins around each point. If we plot the local polar system of coordinates in the plane, these are, for example, the weighting kernels used for these geodesic patches, and we define the geodesic convolution this way. It is illustrated here: this is the patch extracted around a point, and we multiply it by the filter g. You can already notice a problem with this definition: we have a rotation ambiguity. We can rotate the filter by some delta-theta, and there is no clear way, or at least no immediately clear way, to choose which of these orientations to use. There are several solutions to this problem. First, we can choose some canonical orientation, for example the maximum curvature direction, as you will see in the following. We can also take the angular coordinate and compute the standard Euclidean one-dimensional Fourier transform with respect to this coordinate; this translates the rotation ambiguity into a complex phase, and if we take the absolute value of the Fourier transform we get rid of the phase and make it rotation invariant. The third possibility is to keep all the rotations, what we call angular max pooling: we extract the patch, apply the filter, so it is like correlation with a rotating template, and then take the maximum. Actually, there was a paper from Duke at this conference where they use the same idea for images; we did it two years ago for meshes. This is how we express the convolutional layer in the spatial domain using these geodesic patches, using angular max pooling to resolve the rotation ambiguity. It gives us directional filters that are also spatially localized, the number of parameters is fixed, and all the operations are local, so in principle it has order-n computational complexity; of course, the angular max pooling potentially reduces the discriminative power of the approach.

Here is an example of what we can do: this generalizes the convolution, so you can build arbitrarily complex convolutional neural networks that work intrinsically on the manifold. Let me show you how we can use this kind of architecture for learning optimal local descriptors. As you know, at least before deep learning, there was a plethora of different local descriptors, both for images and for geometric data, that try to describe the local structure of the manifold around a point. Assume that we are given two shapes; we construct a Siamese architecture, two identical replicas of the same neural network with shared parameters, and at each point they produce feature vectors f and g. If we have a training set of shapes with known correspondences between some points (here corresponding points are shown in the same color), we want these features to be as similar as possible at corresponding points and as dissimilar as possible at non-corresponding points, and by minimizing this loss we try to make the descriptors as discriminative as possible. This is how it looks: what I show here is the distance in descriptor space from the white point on the shoulder to all the other points on this and other shapes, and cold colors mean that the distance is small.
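The descriptor-learning setup just described can be summarized by a Siamese loss over corresponding and non-corresponding point pairs. A hedged sketch of such a loss in NumPy; the margin value, the weighting gamma, and the exact penalty form are illustrative assumptions rather than the exact published loss.

import numpy as np

def siamese_descriptor_loss(F, G, pos_pairs, neg_pairs, margin=1.0, gamma=0.5):
    """F: (n_x, d), G: (n_y, d) descriptors produced by two weight-sharing networks.
    pos_pairs / neg_pairs: (m, 2) integer arrays of corresponding / non-corresponding point indices."""
    d_pos = np.linalg.norm(F[pos_pairs[:, 0]] - G[pos_pairs[:, 1]], axis=1)
    d_neg = np.linalg.norm(F[neg_pairs[:, 0]] - G[neg_pairs[:, 1]], axis=1)
    # Pull corresponding descriptors together; push non-corresponding ones at least `margin` apart.
    return (1 - gamma) * np.mean(d_pos ** 2) + gamma * np.mean(np.maximum(0.0, margin - d_neg) ** 2)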
This is HKS, the heat kernel signature, again a work from the Stanford group; it is basically the diagonal of the heat kernel, and you can think of it as a kind of multi-scale Gaussian curvature. You can think of this descriptor as a low-pass filter: it is very smooth, and if I move away from my point the descriptor changes very little; you see that its localization is in fact quite poor in this example. This is the wave kernel signature, a work from the group of Daniel Cremers, based on a different formalism coming from quantum mechanics. You see that this descriptor is much better localized but less discriminative; for example, points on the belly of this human for some reason look similar to points on the shoulder. And this is how the features produced by a geodesic convolutional neural network look: way better than the handcrafted descriptors. Indeed, you can look at these curves, which are different evaluations of the descriptors; the bottom line is that the higher the curve, the better the descriptor, and it significantly outperforms the standard handcrafted descriptors.

So that was the construction that we call geodesic convolutional neural networks. Let's now look at a different way of constructing these local patch operators, and for this purpose we consider again heat diffusion on manifolds. We have seen this diffusion equation, where the diffusion constant was assumed to be fixed everywhere on the manifold: the heat conductivity properties of the surface are equal at each point. We now want to consider a different kind of diffusion, where these diffusion properties are not only position dependent but also orientation dependent, what we call anisotropic diffusion. Here you can see the difference between isotropic and anisotropic diffusion: isotropic diffusion propagates heat equally in all directions, while anisotropic diffusion propagates heat preferentially in some direction, and, importantly, this construction can be made intrinsic: if I deform the shape, it will follow the deformations of the shape. Without going into too much technical detail, the particular class of anisotropic diffusion tensors we consider here consists of rotation matrices and a scaling; it allows us to define a Laplacian operator that now has two extra parameters: theta is the orientation with respect to some canonical direction (you can use the principal curvature direction, which strictly speaking is not intrinsic, but you can use any smooth or piecewise-smooth vector field defined on the manifold), and alpha is an elongation, telling how anisotropic this diffusion will be. We can define heat kernels as before, and now the heat kernel has three parameters: the scale, basically the size of the blob, the orientation, and the elongation. We use these anisotropic heat kernels as weighting functions for our patch operator, which lets us define the local patches as we have seen before with geodesic convolutional neural networks. Again these filters are directional, spatially localized, and require only order-one parameters per layer; unfortunately, the computation of the heat kernels, at least the way we did it in the paper, is expensive, because we also need to account for many different orientations.
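For reference, the heat kernel signature mentioned at the start of this passage is simply the diagonal of the (isotropic) heat kernel evaluated at several scales, so it can be computed directly from the eigenpairs used earlier. A minimal sketch; the logarithmic sampling of time scales is an assumption.

import numpy as np

def heat_kernel_signature(evals, evecs, times):
    """HKS(x, t) = sum_i exp(-t * lambda_i) phi_i(x)^2, stacked over a set of time scales.
    Returns an (n, len(times)) descriptor matrix."""
    return np.stack([(evecs ** 2) @ np.exp(-t * evals) for t in times], axis=1)

# e.g. hks = heat_kernel_signature(evals, evecs, times=np.geomspace(1e-2, 1.0, 16))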
This was again the example of the descriptors we have seen before; with anisotropic convolutional neural networks we get even better results. What is also important to note is that these kinds of patches, because intrinsic symmetries flip the orientation of the patch, automatically allow us to disambiguate between symmetric points on the shape, which intrinsic descriptors like the heat kernel signature cannot do.

Back again to the construction of the patch operator: the next step is to make this operator learnable, and that was the paper presented at this conference. If we construct a local system of coordinates around a point, we can create a parametric family of weighting functions, call them w, depending on some set of parameters theta, and we apply this family of weighting functions to our local coordinates; for simplicity we can, for example, consider Gaussians parameterized by a mean vector and a covariance matrix. This is how it looks: we again consider the same geodesic polar system of coordinates rho and theta, we apply the Gaussian weighting functions in these coordinates, and the spatial convolution will look like this. If I now exchange the positions of the integral and the sum, what we get is a Gaussian mixture; that is why we call this convolutional neural network architecture the mixture model network, or MoNet for short. Here we have an additional degree of freedom, a learnable patch operator; in images this would of course be superfluous, because you could already absorb these degrees of freedom into the filter itself, but here we have both learnable patches and learnable template coefficients. Compared to the previous two models, the geodesic CNN and the anisotropic CNN, where we had a fixed patch operator with fixed weighting functions, here we allow the weighting functions to be adjusted in an optimal way, and it produces much better results. We can actually show that it generalizes many previous convolutional or convolution-like neural networks, including of course the classical CNNs and also many models used for deep learning on graphs; we covered that in the first tutorial on the first day.

Let me show you how we use these kinds of architectures to learn deformation-invariant correspondence between shapes, which was one of the motivating examples for these applications. There are several ways of computing correspondence: of course you can use local descriptors, simply match them, find the nearest neighbors, and maybe do some post-processing. A different way is to treat the correspondence problem as a labeling problem: assume that we are given some reference shape; we take a point x on our query shape and design a deep neural network that produces an output vector with the dimensionality of the number of vertices in the reference shape. I am using the reference shape just for convenience; it is just a label space. The output can be interpreted as a probability distribution: the probability of point x corresponding to a point y on the reference shape. We can find the optimal parameters of the network by minimizing a cross-entropy (logistic regression) cost, basically the distance between the ground-truth correspondence distribution, a delta at the ground-truth point, and the distribution produced by the neural network; ideally we want the network to output a delta at the same location.
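The learnable patch operator of the mixture model network described above boils down to Gaussian weighting functions evaluated on local pseudo-coordinates. A minimal sketch for a single point; the diagonal covariances and the normalization of the weights are simplifying assumptions.

import numpy as np

def monet_weights(u, mu, inv_var):
    """Gaussian weighting functions of a MoNet-style patch operator.
    u: (m, 2) local pseudo-coordinates (e.g. rho, theta) of a point's m neighbors;
    mu: (J, 2) learnable means; inv_var: (J, 2) learnable inverse variances (diagonal covariances).
    Returns (m, J): the weight of every neighbor in every 'soft bin' j."""
    diff = u[:, None, :] - mu[None, :, :]                        # (m, J, 2)
    return np.exp(-0.5 * np.sum(diff ** 2 * inv_var[None, :, :], axis=-1))

def monet_patch(f_neigh, u, mu, inv_var):
    """Patch of the signal at one point: weighted averages of its neighbors' values, one per Gaussian."""
    w = monet_weights(u, mu, inv_var)                            # (m, J)
    return (w * f_neigh[:, None]).sum(axis=0) / (w.sum(axis=0) + 1e-8)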
Now, how do we evaluate the quality of the correspondence? Given the ground-truth correspondence and the correspondence obtained from the network, we measure the geodesic distance between these points and average it over all the points, which is the protocol used in the Princeton benchmark mentioned in the first part of this tutorial. In the Princeton benchmark we measure the percentage of correspondences that fall within a certain error radius around the ground truth, and we get these curves. Just to calibrate expectations: the yellow curve shows blended intrinsic maps, one of the state-of-the-art methods in computer graphics that does not use any learning; the green curve shows random forests, one of the first learning-based methods for finding correspondence; this is geodesic CNN, this is anisotropic CNN, and this is what you get with MoNet. You can notice that about 90% of the correspondences have zero geodesic error, meaning they are perfect, and the maximum error is roughly four centimeters, which is more or less the size of a human finger. To better appreciate this, what we plot here is the local correspondence error, the deviation from the ground-truth correspondence, calibrated in centimeters. These are blended intrinsic maps: the correspondence produced by blended intrinsic maps may look nice at first glance, but the error plot shows that it is actually pretty bad. This is what we produce with geodesic CNN: many points have zero or very small correspondence error, though there are some sparse points where the error is pretty large. This is what is produced by anisotropic CNNs, way better, and this is what we produce with MoNet: the correspondences are nearly perfect. It is better to visualize this by texture mapping: we take the reference shape and transfer this checkerboard texture from it to different versions of the shape, and we see that there are some distortions, some artifacts, but they are almost unnoticeable. Here are some more challenging settings with range maps: these shapes have different topology and some missing or invisible parts, and again the correspondence is pretty good; and here are some more examples with extreme partial correspondence.

Let's now look again at the formulation of correspondence as a classification problem. If we treat correspondence as classification, we throw away some important geometric information. Look at this example: say the gray point is our ground truth; if I have two correspondences, the blue one and the green one, which is better? Obviously the blue one, because it is closer to the ground truth in terms of geodesic distance; but in terms of the classification cost, if I did not hit exactly the ground-truth correspondence, it means I misclassified, so the cost does not account correctly for the geodesic distance. Also, the output of the network is not a pointwise correspondence; it is a probability distribution. So a more correct way of measuring the correspondence error is the soft correspondence error: we take the geodesic distance, weight it by the probability of x corresponding to y, and integrate over the reference shape. Another deficiency is that at training time, by minimizing this cost, we may encourage two nearby points to be mapped to nearby points on the target shape, but we cannot enforce this at test time, during inference: nothing in the network guarantees it.
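The evaluation protocol described here (the fraction of points whose predicted match falls within a given geodesic radius of the ground truth) and the soft correspondence error can both be sketched in a few lines, assuming a precomputed geodesic distance matrix on the reference shape.

import numpy as np

def correspondence_error_curve(pred, gt, geo_dist, radii):
    """pred, gt: (n,) indices of predicted / ground-truth matches on the reference shape.
    geo_dist: (m, m) geodesic distance matrix on the reference shape.
    Returns, for each radius, the fraction of points whose geodesic error is at most that radius."""
    err = geo_dist[pred, gt]                                     # per-point geodesic error
    return np.array([(err <= r).mean() for r in radii])

def soft_correspondence_error(P, gt, geo_dist):
    """Soft error: expected geodesic distance to the ground truth under the predicted
    distribution P (n, m) over reference vertices (rows sum to one)."""
    return (P * geo_dist[:, gt].T).sum(axis=1).mean()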
The prediction is a pointwise operation: the correspondence is predicted at each point of the query shape independently. This is exactly what is called in computer vision a structured prediction problem: for example, when you label pixels in images, you want to condition the output of the network at one pixel on what happens in its neighborhood. Exactly the same thing happens here: we want to produce a structured output, because the correspondence is not just pointwise; it depends on what happens around each point. Again we will use functional maps. As we have seen with spectral transformer networks, we can describe the correspondence between two shapes by a linear operator and express it in the frequency domain as a transformation of the Fourier coefficients, and if we have many corresponding functions, say some local features that, as we will see in a second, will be produced by our convolutional neural network, we can build a linear system of equations and find C from this linear system, a compact representation of the correspondence; in this context it is just solving a linear system of equations. Now, if we look at the functional correspondence in the spatial domain: because we truncate the Fourier basis, it maps a delta function to some blob. This blob is not necessarily positive; in fact, because of the truncation, we get what is known in harmonic analysis as the Gibbs phenomenon, so it will usually have some oscillating side lobes. If we take the absolute value of this and normalize it, we get a probability distribution, basically a soft correspondence of the point on the reference shape. So instead of learning pointwise descriptors, as we have seen before, we can do something more elaborate: the neural network learns local features as before, but instead of comparing these features and learning pointwise descriptors, we project them onto the Fourier basis, compute the functional correspondence C, produce the probability distribution from the functional correspondence, and use that probability distribution to compute the soft correspondence error; this is the cost that we minimize. This is a fixed layer; we need to backpropagate through the functional correspondence, and fortunately it has a closed-form expression: we need to backpropagate through this pseudo-inverse. You can consider the whole construction as a more sophisticated way of measuring the quality of the correspondence. And if we were happy with the performance of mixture model networks, this is how the functional maps network performs: the first benchmark on which we tested it is already saturated, with almost perfect correspondence. This paper is actually not published yet; it was just accepted to ICCV.
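Putting the pieces together, the structured-prediction layer just described takes the learned features on both shapes, estimates the functional map by a least-squares solve, maps spectrally truncated deltas through it, and turns the (possibly oscillating) result into a probability distribution to be plugged into the soft error. A hedged sketch of that forward pass under the same simplified projections as before, not the exact published architecture.

import numpy as np

def soft_correspondence(evecs_x, evecs_y, F, G):
    """evecs_x: (n_x, k), evecs_y: (n_y, k) truncated eigenbases; F: (n_x, q), G: (n_y, q) learned features.
    Returns P: (n_x, n_y), a row-stochastic soft correspondence matrix."""
    A = evecs_x.T @ F                                            # spectral coefficients of the features on X
    B = evecs_y.T @ G                                            # ... and on Y
    C = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T               # functional map: C @ A approximates B
    P = np.abs(evecs_y @ C @ evecs_x.T).T                        # (n_x, n_y); truncation causes ringing, hence the abs
    return P / (P.sum(axis=1, keepdims=True) + 1e-8)

# The soft error from the previous sketch, soft_correspondence_error(P, gt, geo_dist), then serves as the training loss.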
In the remaining time, let me briefly cover a different class of approaches that use some embedding space to define the convolution. The idea is to use a global parametrization to map the shape into a parametric domain that has a shift-invariant structure; this allows us to use a standard convolutional neural network in that space: you pull the convolution back from that space, and you may by construction guarantee invariance to some classes of translations. The deficiency, at least of the approaches I am aware of, is that this parametrization is usually non-unique, and the embedding usually introduces distortions. One of the things we lost when we moved from the Euclidean space to the non-Euclidean setting was shift invariance. So let's think again: what is translation invariance, and how can we define translation on a manifold? You can think of it as a locally Euclidean translation, or in other words a flow along a non-vanishing vector field: I move each point on the manifold infinitesimally in the tangent space by means of a vector field. Unfortunately, a result in differential geometry called the Poincaré-Hopf theorem tells us that the only surfaces on which such a vector field can be constructed are surfaces of genus one, surfaces that look like a torus. In a maybe more practical formulation, this is the hairy ball theorem: you cannot comb a hairy sphere without introducing singularities, but you can comb a hairy torus. So on the torus you can have shift invariance by construction; it is in fact the only two-dimensional manifold with this kind of shift-invariant structure. The idea, and this is work from the group of Yaron Lipman at the Weizmann Institute, presented at SIGGRAPH this year I think, is to use an embedding into the torus. The embedding into the torus is not uniquely defined; it is determined by selecting a triplet of points on the surface, and each triplet produces a different embedding; the convolution is then pulled back from the torus to the surface. You see that it distorts: it is a conformal map, so it preserves angles but scales the area locally, and different selections of these triplets produce different local scalings. You can consider that an advantage, since it allows you to zoom in on some regions of the manifold, but the problem is that different selections of the triplet give you different results. What they do is consider multiple selections of these triplets and then aggregate the results; probably a smarter way of doing it would be some kind of spatial transformer network, where you optimize for the best positions of these triplets of points inside the network itself. It produces pretty nice results in shape segmentation; I am not sure it is comparable to the results we have seen before, but I think it is a nice animal in this zoo of intrinsic convolutional neural networks on manifolds.

So, to conclude: intrinsic deep learning allows architectures that are deformation invariant by construction; this is what distinguishes them from the volumetric, image-based, or point-cloud-based approaches. In applications to deformable correspondence, for example, this implies that we need many fewer parameters and much less training data. The examples of correspondence I showed you were trained on 86 shapes, which by any standard of deep learning is an incredibly tiny training set; to achieve similar results with Euclidean methods, forget about it: the recent work of Hao Li's group from last year used 60 million training examples, I think, so it is incomparable in terms of training set size. And it is part of a bigger trend of what we call geometric deep learning, deep learning on graphs and manifolds.
There are several ways of defining intrinsic convolutions, each with its own advantages and disadvantages. Shape synthesis in this setting is still a very big open problem; we have some ideas how to do it, but I think so far it is not solved. So that's all; thank you very much for listening and for remaining so late, and if you have questions we will be glad to answer them. I also want to announce that the materials for this tutorial are already online: in the CVPR 2017 booklet there is the tutorial listing, and it also points to our website, 3ddl.stanford.edu I think, which has all our slides. That's it.
Info
Channel: ComputerVisionFoundation Videos
Views: 48,039
Rating: 4.9080458 out of 5
Keywords: CVPR17, Tutorial, 3D Deep Learning
Id: 8CenT_4HWyY
Length: 202min 39sec (12159 seconds)
Published: Thu Sep 21 2017