AI for Learning Photorealistic 3D Digital Humans from In-the-Wild Data

Video Statistics and Information

Captions
Good evening, and welcome to the April meeting of Silicon Valley ACM SIGGRAPH. I would like to thank the Bay Area ACM chapter and the Los Angeles ACM SIGGRAPH chapter for their help in co-sponsoring tonight's meeting. Tonight we will have a presentation by Matthew Chan talking about AI work at NVIDIA in creating 3D models from 2D images. I am Alice Jun, chair of Silicon Valley ACM SIGGRAPH. Joining us tonight are Bill Bruns from Bay Area ACM and Joan Collins from the Los Angeles ACM SIGGRAPH chapter.

Before we get to tonight's presentation, we would like to tell you about our organizations. Silicon Valley SIGGRAPH typically meets at various Silicon Valley companies about once a month. Our meetings are great for networking and for learning the latest technologies. Since COVID hit we have been online; however, we hope to get back to in-person meetings eventually. Now let me introduce you to the officers of our chapter: I am Alice Jun, the chair of the chapter; BR Simon is vice chair; Carl Anderson is treasurer; and Ken Trovi is a member at large. Finally, this could be you: we are always looking for more volunteers, and we need help with many of the chapter activities. In particular, one place where we could use a lot of help right now is finding venues for in-person meetings. If you like what you see here, please consider joining us. You can find the latest information on our chapter on our website or on our Meetup, and you can also contact us using this email. We also have a YouTube channel where many of our events are recorded. If you're watching this on YouTube, these links will be in the comments; otherwise, you can find this information later by typing "Silicon Valley SIGGRAPH" into Google.

One of the main events for people working in the computer graphics industry is the annual SIGGRAPH conference. This year the conference will be held in Denver during the last week of July. It's a great way to get immersed in the latest in computer graphics for a whole week. Registration for the conference is now open, and early registration will continue until May 14th; these prices will be lower than what will be available later. Please find out more information on our Meetup or at siggraph.org. I'd now like to invite Bill Bruns to tell you about Bay Area ACM. Take it away, Bill.

Thank you. SF Bay ACM has been in the Bay Area since the 1950s in different forms. We cover all software topics. We have both a YouTube channel with hundreds of videos from past meetings and a Meetup page with over 12,000 members; you can find these by searching for SF Bay ACM. We strive to have two meetings per month, one generally on AI and machine learning and the other on general computing topics. We've had topics on A/B testing, programming Android, earthquake detection, various things. Our next meeting, as you can see on the slide on the screen, is more on the AI and machine learning side, a tutorial on RAG, and you can join that either online through Zoom or YouTube or in person at Hacker Dojo in Mountain View. Hope to see you there. Thanks.

Okay, thank you, Bill. Now I would like to invite Joan Collins to tell you about the Los Angeles ACM SIGGRAPH chapter. Go ahead, Joan.

All right, thank you very much. I am Joan Collins, chair of the Los Angeles ACM SIGGRAPH, and thank you, Alice and Bill, for arranging this three-chapter event this evening. The other LA SIGGRAPH officers in Los Angeles include vice chair Larry Rosenthal, who produced our upcoming event on the metaverse, secretary Rick Hernandez, treasurer Dave Kender, membership Sharon Eisenberg,
and volunteers Fran and Donella Benjamin. Our upcoming meetings include what you see on screen: on May 14th, "Metaverse 2024: Can 3D Finally Provide a Path to the Metaverse Visions We're Hoping For?" In June we have our fourth annual art show; it is a month-long exhibition with over 20 artists in it. Coming up in September we also have an outdoor multimedia show for a very large audience; that event is called Open Sky. We're a very active chapter, and we hope you join one of the chapters here tonight. Back to you, Alice.

Okay, thank you, Joan. And now to introduce tonight's speaker, here's Ken. Go ahead, Ken.

Thank you, Alice. Tonight's speaker will be talking about the technology behind the 3D video conferencing shown at last year's SIGGRAPH Emerging Technologies and last month at the GPU Technology Conference. Matthew joined NVIDIA as a research engineer in 2022. They primarily work at the intersection between graphics and generative models, specifically how they relate to 3D scene synthesis, reconstruction, and understanding. They graduated from the University of Maryland, College Park in 2021 with a bachelor's degree in mathematics and computer science. If you could please hold off your questions until the end, or type them into the Q&A, and they will be addressed at the end of the talk. Thank you, Matthew, take it away. Please welcome Matthew Chan.

Thank you for that wonderful introduction. Great, well, hey everyone, thanks for stopping by. As they mentioned, I am Matthew, from NVIDIA Research. Today I'm going to be presenting a little bit about how NVIDIA approaches the problem of learning photorealistic 3D people using in-the-wild data. What that means is that we're going to leverage AI to create high-quality, robust, and efficient digital humans, all from easily capturable and widely available data online. I am on the very tail end of recovering from a little bit of sickness, so you'll have to forgive me if I'm frequently reaching for my water bottle. Otherwise, let's go ahead and jump right in.

So the first question that we should ask is: why do we even care? Why do we want to do this in the first place? There's a good number of applications for 3D digital humans, but I think the most obvious among them, and the one we have here, is telepresence. In the most literal sense of the word, telepresence is the ability to allow people to be digitally present even at potentially extreme distances. This is not really a new problem at all; it's actually quite old. A lot of the efforts that we see date back as far as the early 2000s and a little further, and there's a whole bunch of notable attempts at this problem; I'll show a few here. Way back, we found out that even something as simple as placing people in a physically accurate world, with a TV monitor and a digital reproduction of half a table through the screen, can help a lot with realism and feeling present. More recently, if we skip forward about 20 years to 2021, we saw Meta attempt to solve this problem through the use of Codec Avatars with their headsets. That same year, Google attempted to solve the same problem without headsets; they introduced Project Starline, a similar idea of 3D telepresence of people, but without a head-mounted display. Even more recently, last year, Apple, and this is something a lot of you have probably seen, produced the Persona avatars for FaceTime and other applications using the Apple Vision Pro.
So all of these approaches that I've just covered have really excellent quality. The thing they all have in common, though, is that they're trying to reproduce reality as we see it, and we wanted to know at NVIDIA what happens when we choose to stray from this assumption. Is it possible to bend reality for a little bit better communication? Let's see: what if, instead of reproducing reality, we could do this? (Do I have audio? Yes, I can hear you. Fabulous.) There's a lot of stuff going on here. If you look in the bottom left corner we have stylization; in the bottom right corner, and I'll play that one more time, we have translation; there's some eye contact; and a lot of this, if you caught that in the top right corner, is all coming from a single 2D input. This is just a sneak peek at what I'm going to be covering today. But before I get to exactly what it is that we're doing, I'm going to take a little detour into how we do this, and then at the end we'll really get to why we want to do it this way.

To start off with, it really all begins with generative models. So let's start with GANs, which stands for generative adversarial networks. GANs are relatively new; they're only about a decade old as of today, and they have progressed extraordinarily fast. Since the very first GANs that we saw in 2014, it took less than five years until we were able to produce humans and faces that were indistinguishable from real ones. To prove that, I have this study here, which showed that a collection of curated generated images is indistinguishable from real images for the average population, and we've seen even more of this in recent years.

So I'm going to do a little bit of a deep dive into what's actually going on under the hood here and how these things work, so that we can get a better idea of what we're doing in the end and why we care. What is the goal of a GAN? It's actually very simple: we want to train a model that converts random numbers into random people; in this case, we're converting into images of random people that don't exist. These images should ideally look real, but also ideally not actually be real. Let's dive into how these models are trained. In the case of GANs there are actually two networks: a generator on the left side and a discriminator on the right side. The task of the generator is to generate images from these random noise vectors, the latents all the way on the left, and the task of the discriminator is to tell the generator how good or bad the generated images are, and also how it can improve to make the images look more real.

In recent years NVIDIA has proposed a series of StyleGAN papers, each achieving state of the art in image synthesis at the time of release. On this screen I have generated images from StyleGAN2 on the left and StyleGAN3, the newer variant, on the right. In StyleGAN2 you notice there are a handful of artifacts, stuff like sticking: if you look closely at the beard here, and I'm not sure how smoothly it comes across on a Zoom call, you may see there's a little bit of sticking along the beard, as if the fine details are glued to the camera lens. On the right it looks a lot better.
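To make the generator/discriminator loop just described a bit more concrete, here is a minimal, hypothetical PyTorch sketch of one GAN training step; the tiny MLP networks, image size, and optimizer settings are toy stand-ins, not the StyleGAN architecture the talk refers to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a real system (e.g. StyleGAN) uses far larger convolutional networks.
latent_dim, img_dim = 64, 32 * 32 * 3
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(real_images):
    """One adversarial update: D learns real vs. fake, G learns to fool D."""
    batch = real_images.shape[0]
    z = torch.randn(batch, latent_dim)      # random numbers in ...
    fake = G(z)                             # ... images of people who don't exist

    # Discriminator: push real logits up and fake logits down (non-saturating GAN loss).
    d_loss = (F.softplus(-D(real_images)) + F.softplus(D(fake.detach()))).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: the discriminator's gradient tells G how to look more real.
    g_loss = F.softplus(-D(fake)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example: one step on a random "real" batch scaled to [-1, 1].
print(gan_step(torch.rand(8, img_dim) * 2 - 1))
```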
But you'll notice that even as I move the camera around, the person changes a little bit, and that's not really what we want, because while this looks really good in 2D, it's actually just a 2D image of a person; it's not really a 3D object. And while 2D is nice and all, 3D would be better. So one area where we've seen a lot of advancement in the last few years is using GANs, these same architectures with slight modifications, to create 3D objects and scenes instead of 2D images. I'm going to call these 3D GANs, or 3D-aware GANs. Two years ago we presented the 3D-aware image generator called EG3D, and just like the 2D GANs, 3D GAN models are trained using unstructured 2D images of people; you see that in the bottom left. Yet despite never really seeing 3D people, multi-view images, or any other inherently 3D information, these networks are able to generate multi-view-consistent images, videos, and objects, all in real time. What we really strove to answer here was the question: given the identical collection of images and data that we used to train the previous 2D GANs, instead of creating photorealistic 2D images of people that don't exist, can we create photorealistic 3D representations of people that don't exist? Unsurprisingly, because I'm here today, yes, we can. (Oh, I missed a few slides; let's skip these.)

So how do we train this 3D GAN? Let's go back to the basics, starting with the 2D GAN: the generator takes a random noise vector as input and uses it to create a 2D image; the discriminator on the right-hand side then observes that 2D image and provides feedback on the realism and on what should be modified to improve the realism. For a 3D GAN it's quite similar: you have your random noise on the left side, you have a generator, but instead of creating a 2D image it creates a 3D object. We then randomly sample a set of camera parameters from a distribution of cameras in some real-world dataset, take that camera, and render out a 2D image. We then feed that 2D image to the standard 2D image discriminator. What this means is that these 3D GANs, despite operating in 3D, can use identical discriminators and data to what we use to train the 2D GANs, with no modifications. To recap: 2D GANs and 3D GANs are very similar in architecture; the primary difference is simply that the generator creates a 3D representation and then differentiably renders it to a 2D image before passing the result to a standard 2D image discriminator.

An important note that I touched on in the last few slides is that we train entirely from 2D in-the-wild images rather than ground-truth 3D data. Just like the previous 2D GANs, 3D GANs train from these in-the-wild 2D collections and attempt to learn 3D representations from them. The advantage of this scheme is twofold. First, 3D data of any kind, be it 3D scans, depth maps, multi-view images, or stereo captures, is very difficult and expensive to collect at any reasonable quantity, and it's also near impossible to capture in in-the-wild scenes; by contrast, acquiring 2D images is trivial. The second reason we want to train from 2D images is that any model we train will then already be trained on in-the-wild data, meaning data we can acquire just walking down the street, and so it will be robust to whatever 3D reconstruction and understanding tasks we have, even when we take it out of a controlled lab situation and throw it into the wild.
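As a rough illustration of the 3D-aware variant described above, the sketch below keeps the 2D discriminator untouched and only changes the generator side: it emits a (toy) 3D volume, a random "camera" is sampled, and a crude differentiable projection produces the 2D image the discriminator sees. The voxel volume, the axis-picking camera, and the compositing rule are simplified placeholders, not the EG3D renderer.

```python
import torch
import torch.nn as nn

# Toy 3D generator: latent -> a small RGB + density voxel grid (placeholder for a triplane/NeRF).
latent_dim, res = 64, 16
G3d = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                    nn.Linear(512, 4 * res ** 3))                       # 4 = RGB + density
D2d = nn.Sequential(nn.Flatten(), nn.Linear(3 * res * res, 256), nn.ReLU(), nn.Linear(256, 1))

def sample_camera(batch):
    # Placeholder for sampling intrinsics/extrinsics from the real dataset's pose distribution:
    # here we just pick which spatial axis to integrate along.
    return torch.randint(0, 3, (batch,))

def render(volume, axis):
    # Crude differentiable "renderer": density-weighted mean of RGB along one spatial axis.
    rgb, sigma = volume[:, :3], volume[:, 3:].sigmoid()
    out = []
    for b in range(volume.shape[0]):
        d = int(axis[b]) + 1                                  # spatial dim of the per-sample tensor
        w = sigma[b] / (sigma[b].sum(dim=d, keepdim=True) + 1e-6)
        out.append((rgb[b] * w).sum(dim=d))
    return torch.stack(out)                                   # (B, 3, res, res) 2D images

z = torch.randn(4, latent_dim)
volume = G3d(z).view(4, 4, res, res, res)                     # generator outputs a 3D object
image2d = render(volume, sample_camera(4))                    # render a single random view
realism = D2d(image2d)                                        # ordinary 2D discriminator, unchanged
print(image2d.shape, realism.shape)
```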
So now we have an architecture such that the discriminator operates entirely on 2D views, with no 3D ground-truth data required, and such that the generator creates a single 2D view per scene, which eliminates the multi-view data requirements that many other approaches would impose when trying to create a 3D object generator. This method doesn't come without its drawbacks, though, primarily in computational cost. GAN training can often require as many as 25 million synthesized images or more to converge; that's quite a lot. Further, this type of training often requires rendering the entire image every time, and neural rendering, especially in a fully differentiable fashion as we do, is very computationally expensive.

Let's do a little bit of a deep dive into the underlying 3D representation so I can give you a little more context. For many 3D GANs, including the majority of the varieties I'm going to talk about today, NeRF or one of its variants is the representation of choice. NeRF, short for neural radiance fields, up at the top there, is a paper that was first published at ECCV 2020, so four years ago, not even that old. A NeRF is a formulation for neural volumetric representation in which an MLP learns the five-dimensional function that maps from a position in space and an incident ray direction to a volumetric density and a specular RGB color. To render from such a representation is quite simple: we just trace rays through the space and composite the final color by volumetric ray marching, though doing so often requires as many as 100 to 200 samples per ray. The good part about this representation is that it's trivially fully differentiable, which means we can use it very easily in a lot of our AI methods, and additionally it requires only RGB colors to train; there's no need for 3D ground truth such as depth maps or geometry, which may be ambiguous. The bad part is that this method is in many ways extraordinarily expensive: to render a single image at 512x512 resolution with, say, 100 samples per pixel would require as many as 25 million forward passes of the MLP.

Let's dig deeper into what that number means. To understand why these 3D GANs can be so computationally expensive, we need to understand this representation in context. The standard NeRF MLP consists of several stacked fully connected layers, that's the architecture on the left, which must be evaluated once per sample in order to render an image. Per image, in order to achieve high quality, we ideally want to render every ray, and per ray we require at least 96, but let's round that to 100, samples per pixel to prevent undersampling the volume and introducing noise. This leads us to the aforementioned roughly 25 million evaluations of the MLP per image. Beyond that, recall that per GAN we require 25 million images to converge. Combined naively, that could lead to over 600 trillion forward and backward passes of this network in order to train a single GAN model to convergence. At a glance it becomes obvious why this is prohibitively expensive in many ways. The good news is that there are a few things we can do to make this cheaper, so let's dissect that.
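To make the arithmetic above concrete, here is a small self-contained sketch of volumetric ray marching with a toy two-layer field (a real NeRF MLP is much deeper), plus the per-image and per-training-run evaluation counts quoted in the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy radiance field: (x, y, z, view direction) -> (density, RGB).
field = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 4))

def render_rays(origins, dirs, n_samples=100, near=0.0, far=1.0):
    """Volumetric ray marching: sample points along each ray, query the field, alpha-composite."""
    t = torch.linspace(near, far, n_samples)                      # sample depths along the ray
    pts = origins[:, None] + dirs[:, None] * t[None, :, None]     # (rays, samples, 3)
    d = dirs[:, None].expand_as(pts)
    raw = field(torch.cat([pts, d], dim=-1))                      # one MLP evaluation per sample
    sigma, rgb = raw[..., 0].relu(), raw[..., 1:].sigmoid()
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                       # opacity of each segment
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                       # contribution of each sample
    return (weights[..., None] * rgb).sum(dim=1)                  # composited pixel colors

# Cost of rendering one 512x512 image at 100 samples per pixel:
pixels, samples = 512 * 512, 100
print(pixels * samples)                 # ~26 million MLP evaluations per image
print(pixels * samples * 25_000_000)    # x ~25M images to converge -> ~6.5e14, i.e. >600 trillion

# Render a tiny batch of random rays with the toy field.
print(render_rays(torch.zeros(4, 3), F.normalize(torch.randn(4, 3), dim=-1)).shape)
```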
For starters, the network is ginormous; if we could shrink the compute cost of evaluation, that would be a great place to start. Additionally, if there were a smart way to reduce the number of pixels that we render, we should grab it. Finally, we waste a lot of compute evaluating samples in regions of space that aren't actually useful: in the case of rendering a head, that could be the empty space in front of a person's face where there's just air, or the occluded region behind a person's face where you can't see anything. If we were to intelligently place those samples only in the regions we care about, for instance right on the person's skin, that could save us a lot of compute. I'm going to wrap back around to that last one, since it becomes important in a few slides. The last thing I should mention is that it's in some ways potentially possible to reduce the number of images needed for training, but I'm not going to touch on that one too much.

Let's talk about the first point: how can we reduce the size of the network? Rather than the standard implicit MLP, which receives the input position and direction and directly encodes that into a volumetric density and color, or the explicit feature grid on the right, which directly looks up density and color from a predefined feature grid, we experimented with and proposed a hybrid representation which combines the two to get the benefits of both worlds. Our chosen hybrid representation consists of three axis-aligned feature planes which contain the spatially varying information, hence why we call it a triplane representation, and we combine that with a very small MLP, this little green box at the bottom, which is only two layers and is designed to decode the positional information into spatial density and color. To note: we still must run this decoder just as many times, tens of millions of evaluations per image in many cases, but because it's so much smaller than the original MLP, the cost of doing so is dramatically decreased. This triplane architecture also brings a good number of nice benefits to the table. First, it's easy to generalize: the stationary structure of the 2D planes makes it really easy for a model to learn and predict the content on these planes. Second, they're really memory efficient compared to 3D voxel grids and other representations, because they scale only quadratically rather than cubically with spatial resolution; this means we can fit a lot more spatial resolution into the same number of parameters, which is especially important when we're trying to push photorealism on a limited compute budget. Third, 2D planes are especially compatible with existing off-the-shelf 2D backbones; for instance, we can take existing and well-understood StyleGAN2 architectures and use them to generate these 2D feature planes.

Let's touch on the second point: another way to reduce the computational load of GAN training is to reduce the number of pixels we actually render with our NeRF. Many early 3D GANs leverage 2D convolutional super-resolution layers on top of a low-fidelity 3D representation. We can see that here: in the middle is our low-fidelity NeRF, our volumetric representation with its coarse geometry, and on the left is the super-resolved output. The underlying 3D object helps guarantee good geometric quality, and the additional convolutional layers add the missing details and ensure good image sharpness, all while being trained to maintain consistency with the underlying 3D representation.
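A rough sketch of the triplane lookup described a couple of paragraphs above: project each 3D point onto three axis-aligned feature planes, bilinearly sample, aggregate, and decode with a tiny two-layer MLP. The plane resolution, channel count, and summation used for aggregation are illustrative guesses rather than the exact published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, R = 32, 256                                           # feature channels and plane resolution (illustrative)
planes = nn.Parameter(torch.randn(3, C, R, R) * 0.01)    # XY, XZ, YZ feature planes
decoder = nn.Sequential(nn.Linear(C, 64), nn.ReLU(), nn.Linear(64, 4))  # tiny 2-layer MLP -> density + RGB

def query_triplane(points):
    """points: (N, 3) in [-1, 1]. Returns (N, 4) raw density + color."""
    feats = 0.0
    for plane, axes in zip(planes, [(0, 1), (0, 2), (1, 2)]):
        # Project each 3D point onto this axis-aligned plane and bilinearly sample its features.
        uv = points[:, list(axes)].view(1, -1, 1, 2)                 # (1, N, 1, 2) sampling grid
        sampled = F.grid_sample(plane[None], uv, align_corners=True) # (1, C, N, 1)
        feats = feats + sampled[0, :, :, 0].t()                      # aggregate by summation -> (N, C)
    return decoder(feats)

# The decoder still runs once per sample point, but it is tiny compared to a full NeRF MLP.
out = query_triplane(torch.rand(1024, 3) * 2 - 1)
print(out.shape)   # torch.Size([1024, 4])
```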
So, following the techniques I outlined just a few minutes ago, we can train a model on entirely 2D ground-truth data, from which we can synthesize these photorealistic images and videos with complete disentanglement of the scene and the camera, all at inference time, in real time.

There are a handful of other approaches in recent years that look to address the same issues we addressed, in different ways. For instance, rather than rendering a low-resolution image and applying 2D super-resolution with high-resolution supervision, a possible solution is instead to render a small patch of the image at full resolution. That is the approach taken in these two papers: on the top is Mimic3D and on the bottom is EpiGRAF. While this certainly allows higher fidelity within the receptive field of the patch, these methods can and often do struggle with global consistency in the final representation. (Let me skip ahead one slide.) Another approach that many other 3D GANs attempt is to bypass the expensive volumetric representation entirely, replacing it with sparse representations such as multiplane images on the top or radiance manifolds on the bottom. While this works quite well and certainly solves the computational complexity issue, these representations often need to be tailored to a specific domain and can suffer a lot in generalization ability, whether that's diversity of objects or difficulty in representing non-frontal scenes; these sparse methods are often tailored to a specific setting and often fail in these extreme cases.

Let's return to the super-resolution idea, though. This, while certainly cheap and efficient, is an imperfect solution: applying super-resolution in 2D has its downsides. Observe on the left-hand side the same 2D texture-sticking artifacts that we saw in StyleGAN2 and the original 2D GANs; in case you forgot, that's where, for instance, the hair texture in particular will stick to the camera and cause aliasing and flickering as the object moves. Additionally, while the geometric quality in the middle and on the right is quite decent and globally consistent, there is a lot of ambiguity, especially in the fine details.

So the next logical step for us was to ask how we can do better. Our goal was to maintain the extreme versatility and generalizability of these 3D volumetric representations, but still resolve the fine-grained 3D detail and maintain global consistency, by natively scaling the resolution of the 2D images that we render instead of rendering patches or low-resolution approximations, ideally without inheriting the exorbitantly high compute cost of NeRF. What if we just try naively? There are really two fundamental problems that we face. The first is computational complexity: if we naively scale the same setup to 512x512 resolution, it becomes prohibitively expensive to train, requiring over half a terabyte of VRAM to hold the forward and backward passes of the gradients per image. The second problem is quality: even at the same 100 samples per pixel, we're still undersampling the volume, leading to noisy artifacts in the rendering; if you look on the right side, you see these salt-and-pepper artifacts.
So if we do want to actually render each pixel, we really need to find another way to make our method cheaper. Recall that way back we had three easy ways to reduce the training cost of our GAN: first was the size of the NeRF representation, second was the number of pixels we want to render, and third was the number of samples per pixel. We're now going to go back and address that third one. Rather than sampling approximately uniformly along the rays in order to find points of high density, which, as you see on the top, naively samples both the empty regions in front of the person's face and the occluded regions behind the person's face, we really want to sample only the interesting parts, directly on the surface, as you see on the bottom. Our major modifications to enable this are twofold. First, we propose a generalizable feed-forward sampler, an oracle if you will, which tells us where the surface is located and which regions of space we should really be paying attention to. Second, we needed to restructure the volumetric representation away from the vanilla NeRF and toward a surface-aware, SDF-based neural field, which enables higher-fidelity 3D geometry and is much more amenable to the efficient sampling schemes we propose.

So how does this oracle work? First we take our image and cast a low-resolution probe through the volume using naive sampling methods. This gives us a low-fidelity estimate of the volume within the scene, and then we can train a convolutional neural network to convert this low-resolution information into an intelligent high-resolution estimate of where objects are in space. We can then use this in a second rendering pass for the full-resolution object, enabling us to efficiently sample only the regions of space we truly care about, even though we're operating at previously unattainable resolutions. At the end of the day, this allows us to create precise and extraordinarily high-quality geometric details, all because we are able to render each and every pixel that we see using 3D methods instead of 2D approximations.
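Here is a schematic sketch of that two-pass idea: a cheap low-resolution probe is refined by a small convolutional "oracle" into a per-pixel surface-depth estimate, and the full-resolution pass then places its few samples in a narrow band around that estimate. The resolutions, band width, and network here are simplified placeholders for the actual SDF-based pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplingOracle(nn.Module):
    """Turns a coarse depth probe into a high-resolution guess of where the surface sits."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, coarse_depth, out_res):
        up = F.interpolate(coarse_depth, size=(out_res, out_res),
                           mode='bilinear', align_corners=False)
        return up + self.refine(up)        # refined per-pixel surface depth estimate

def place_samples(depth, n_samples=16, band=0.05):
    # Instead of ~100 uniform samples per ray, place a few samples in a thin band
    # around the predicted surface (skipping empty space and occluded regions).
    offsets = torch.linspace(-band, band, n_samples)
    return depth[..., None] + offsets      # (B, 1, H, W, n_samples) depths to evaluate

oracle = SamplingOracle()
coarse = torch.rand(1, 1, 64, 64)          # pass 1: cheap low-resolution probe of the volume
surface = oracle(coarse, out_res=512)      # pass 2 input: high-resolution surface estimate
sample_depths = place_samples(surface)
print(surface.shape, sample_depths.shape)  # (1, 1, 512, 512) and (1, 1, 512, 512, 16)
```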
So now that we know how to make high-quality fake people in 3D, let's get to the interesting stuff, which is how we actually use that to make high-quality real people in 3D. I'm going to talk a little bit about the AI 2D-to-3D lifting approach that we submitted and presented at SIGGRAPH last year, called Real-Time Radiance Fields for Single-Image Portrait View Synthesis. This is an algorithm which, given a single RGB image, creates a 3D avatar of the person, which we can then manipulate and render. Our design philosophy here was really to answer the question: with a creative use of AI, how can we make this 2D-to-3D lifting as absolutely simple as possible? That meant we had a few goals to keep in mind. First, it needed to run entirely on consumer hardware, ideally in real time; the goal was a nice, easy algorithm that wouldn't require an entire data center of compute to run. Second, we wanted it to use single-view inputs only, for instance a single RGB image or a video feed from a common webcam that all of you have at home already; this eliminates the need for the expensive scans, complex captures, or unusual hardware that you all too often see when trying to recreate 3D from 2D. Finally, we wanted this to be a one-shot method for in-the-wild images: regardless of which person steps up to the camera and how they look that day, whether they've gotten a new haircut or changed the way they do their makeup, we wanted this to run instantly, without any per-person fine-tuning or adjustment.

An obvious place to start with this problem is GAN inversion. What is GAN inversion? I've spent the last, what is it, 30 minutes now talking about how we can use these generators, these green boxes on the left, to create 3D humans, so these models implicitly understand what a 3D human is. What that means is that we can iteratively invert these GANs in order to acquire a 3D human which, when we render a certain view, looks like the 2D target image, using a two-stage process. First, we use iterative refinement to find the latent code in a pre-trained GAN for a 3D human that approximately reconstructs the target image from a specific view, and then we can optionally add a lightweight fine-tuning step to the generator to make it match the exact pixel-wise identity. I'll give a little demonstration here, if this video will play. There we go. Here's what that looks like in practice: in the first second or so of that video we find the latent code that looks really close to our person, and in the next second or so we use that to update the generator a tiny bit to get the exact match. At the end of the day, what we have is a 3D object that, when rendered from the same view, looks like the input image, but is still truly 3D.

There are a few key issues with this method, though. The first is that the 3D representation, in this case the NeRF, is not necessarily easy to train from a single view; a lot of ambiguity gets introduced during the rendering operation as you flatten a 3D object to a single 2D image, so unless you have a very carefully designed optimization scheme, the NeRF can and easily will fall apart. Additionally, and possibly the more motivationally interesting caveat for us, such an iterative process is exceedingly expensive. The time-lapse I'm showing on this screen is a tuning process to match the person on the left that takes place over 20 minutes, and if we were to run this on a video we would need to repeat it for every single frame, which would be extraordinarily slow and certainly would not meet our real-time requirements.

So in the work that I'm presenting now, which is Live 3D Portrait, we wanted to explore a bit of a different training paradigm and ask the question: if, instead of trying to train from a limited set of 2D images, we had an infinite amount of photorealistic synthetic 3D ground-truth data, could we train an efficient model to directly do this 2D-to-3D lifting? Of course, just a few minutes ago I said, and I quote, 3D data is prohibitively expensive to source in quantity, so why am I proposing this? The answer is actually kind of easy: it's because I've spent the last half hour talking about how we can train a 3D model that creates infinite, cheap, accurate, and fast 3D data, which we can then render out into posed, photorealistic 2D images. So why don't we just go ahead and use that? To reiterate, what I'm proposing is that we take an existing pre-trained 3D foundation model and use it as a synthetic data source to create an infinite supply of 3D data, and that enables us to train a separate, brand-new AI model with no requirement for 3D ground truth at all.
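For contrast with the feed-forward approach, here is a minimal sketch of the iterative GAN-inversion baseline discussed above (the slow path this work replaces): optimize a latent code so the generator's rendering from the target camera matches the photo, then optionally fine-tune the generator for an exact match. The generator.render(latent, camera) interface is a hypothetical stand-in for a pre-trained 3D GAN, and real pipelines add perceptual losses on top of the plain MSE shown here.

```python
import torch

def invert(generator, target_image, camera, steps=500, lr=0.01):
    """Stage 1: iterative latent optimization against a single target view."""
    latent = torch.zeros(1, 512, requires_grad=True)       # start from a neutral latent
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        rendered = generator.render(latent, camera)        # hypothetical: render this latent's 3D head
        loss = torch.nn.functional.mse_loss(rendered, target_image)
        opt.zero_grad(); loss.backward(); opt.step()
    return latent.detach()

def pivotal_tune(generator, latent, target_image, camera, steps=200, lr=1e-4):
    """Stage 2 (optional): lightly fine-tune the generator for an exact pixel-wise match."""
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(generator.render(latent, camera), target_image)
        opt.zero_grad(); loss.backward(); opt.step()
    return generator

# Usage (assuming some pre-trained 3D generator `g3d`, a portrait `img`, and its camera `cam`):
#   w = invert(g3d, img, cam)                 # minutes of optimization per image ...
#   g3d = pivotal_tune(g3d, w, img, cam)
#   novel_view = g3d.render(w, some_other_camera)
```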
The architecture we proposed is a Vision Transformer-based feed-forward encoder which operates in real time on a single GPU, as fast as 16 milliseconds, versus the two and a half or more minutes for GAN inversion; that means we achieve roughly a 10,000x speedup compared to previous methods. Additionally, our proposed method avoids squeezing the representation through any low-dimensional GAN latent-space codes, which means we can preserve person-specific and highly complex details. Our method also requires no precise camera pose and is trained entirely with synthetic data, with no requirement for expensive or complex 3D ground truth.

If we look closer at the requirements of this network, though, we notice that there are actually two conflicting goals in the 3D reconstruction. The first is that, given a single image, 3D lifting needs to create a canonicalized 3D representation with accurate geometry: we need to take some arbitrarily posed image and convert it into a person where we know where they're looking and how they're posed. But the other thing that needs to happen is that all of the high-resolution details must be properly copied to the right locations, despite there being no obvious correspondence between a 2D input image and an output 3D object. To achieve this we propose a two-step approach: first we take our image and extract low-resolution 3D deep features with a deep feature extractor, and feed that to a Vision Transformer; we then combine these with the high-resolution but globally incoherent details using a second Vision Transformer, and that produces a final output triplane which is both 3D-consistent and still maintains the high-resolution details we originally wanted.

To re-emphasize: the model utilizes only strictly synthetic data that is generated on the fly during training. We found that in order to properly generalize to in-the-wild images, we needed to add somewhat aggressive augmentation schemes to our generated data. Some examples here: for faces we modify the pitch and the yaw of the training data on the left, and we also modify other camera parameters like the focal length and the principal point. The model actually generates not only the RGB images but a handful of other intermediate data products that we use during rendering, including the triplane, the depth, the rendered features, and the low- and high-resolution images, all of which can be, and often are, used to supervise our 2D-to-3D model training. To give a more complete idea of what happens at training time: at each gradient step we first sample a latent code; we then turn this latent code into a triplane using a frozen, existing 3D generator; we render that triplane from two different cameras, camera 1 and camera 2, whose intrinsics and extrinsics are sampled from reasonable priors; we then have the triplane encoder, which receives one of the images as input and predicts a triplane, which we again render from the same cameras; and then we compare the results via several losses, for instance standard reconstruction losses or adversarial, discriminator-based losses.
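A condensed sketch of one such training step, with a frozen pre-trained 3D generator acting as the infinite synthetic-data source. The generator/encoder interfaces, the camera prior, and the augmentation stub are hypothetical placeholders, and the real method uses additional adversarial, perceptual, and feature-space losses beyond the simple L1 terms shown.

```python
import torch
import torch.nn.functional as F

def sample_cameras(n):
    # Placeholder camera prior: small random perturbations around frontal poses.
    return [torch.randn(3) * 0.3 for _ in range(n)]

def augment(image):
    # Placeholder for the aggressive pose/intrinsics augmentation mentioned in the talk.
    return image + 0.01 * torch.randn_like(image)

def training_step(generator, encoder, opt, latent_dim=512):
    """One gradient step of the 2D-to-3D lifting encoder on freshly generated synthetic data."""
    with torch.no_grad():                                   # the 3D foundation model stays frozen
        z = torch.randn(1, latent_dim)
        gt_triplane = generator.synthesize(z)               # hypothetical: latent -> ground-truth triplane
        cam_in, cam_sup = sample_cameras(2)                 # intrinsics/extrinsics from reasonable priors
        input_view = generator.render(gt_triplane, cam_in)  # the single 2D image the encoder will see
        target_view = generator.render(gt_triplane, cam_sup)

    pred_triplane = encoder(augment(input_view))            # feed-forward 2D -> 3D lifting (~16 ms)
    pred_same = generator.render(pred_triplane, cam_in)
    pred_novel = generator.render(pred_triplane, cam_sup)

    loss = (F.l1_loss(pred_triplane, gt_triplane)           # supervise the 3D representation directly
            + F.l1_loss(pred_same, input_view)              # reconstruct the input view
            + F.l1_loss(pred_novel, target_view))           # and an unseen novel view
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```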
At the end of the day, here are the results. A few interesting and honestly really surprising things we found: the model shows amazing generalization to out-of-domain inputs. What that means is that we can use odd expressions, strange makeup and accessories, and, despite being trained only on photorealistic synthetic data, the model transfers extremely well even when applied to stylized and cartoonish images, as you see on the bottom. Awesome. We also found that despite having baked in no implicit or explicit temporal memory or constraints, this model still works amazingly well on videos, simply by processing each frame independently and in sequence.

Great, so now that we've gotten here we can finally get to the fun stuff. We've covered what we were looking at in terms of 3D human synthesis and how we actually do this 3D synthesis, but I think it's finally time we get around to why we want to do so. The most obvious application, and the one we have to show you here today, is a technique for 3D telepresence using AI. So why did we choose this really difficult task of lifting 3D objects from an ambiguous 2D input directly? One obvious reason is that it allows us to trivially compose existing 2D modification methods, for instance stylization: you can personalize your experience with animated avatars, stylizing them with simple text prompts. By applying these existing techniques to edit or stylize in 2D, we can then directly lift the result to 3D, unlike some other methods which rely on consistent and true 3D inputs. Here's an example of how such a pipeline could work. First, we create our stylized 2D image; for that you can grab your favorite text-to-image AI tool, say Stable Diffusion or DALL-E, and customize the avatar however you want. Then you capture a 2D video of yourself and animate the stylized image, for instance using NVIDIA Maxine's Live Portrait. Finally, we take that 2D stylized video and lift it into a 3D video in real time.

Another possible edit we could compose is the Maxine AI eye contact feature; this means we can maintain eye contact with the viewer even if we're reading from a script or looking somewhere else. As you see, in the middle eye contact is off; on the right, eye contact is on. While existing 3D methods wouldn't be able to do this kind of editing in 3D, we can, because our 3D approach works entirely off of a 2D video: if we know where we're rendering the head from, we can manipulate the eye gaze in the 2D video such that it maintains eye contact even after we render it into 3D. A couple of other fun things we can do: we can relight our input images using existing 2D algorithms, which may enhance the sense of realism as we transfer people from their homes into shared world spaces, or we can completely change identities in a photorealistic manner; obviously, for that one, if you applied it in a malicious fashion there are a few concerns about deepfakes and the like, so take this slide with a grain of salt. One of the more interesting applications, though, is consistent real-time translation. While it's really not difficult to dub over existing video, doing so suffers from obvious desynchronization between the audio and the video. By composing a multi-step pipeline of translation, then reanimating the face with the new audio in 2D, then lifting that 2D reanimated face to 3D, we can eliminate the desynchronization and still achieve accurate 3D reconstruction despite having no 3D ground truth, and all of this utilizes existing, well-understood 2D methods.
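Purely as an illustration of how these pieces compose, here is a hypothetical per-frame pipeline sketch; every function name on the models object (eye_contact, translate_speech, lift_2d_to_3d, and so on) is a made-up placeholder, not a real NVIDIA Maxine or diffusion API.

```python
def telepresence_frame(webcam_frame, audio_chunk, stylized_portrait, target_language, models):
    """One frame of the composed pipeline: edit in 2D first, lift to 3D last."""
    # 1. Optional 2D edits, using existing, well-understood 2D methods.
    frame = models.eye_contact(webcam_frame)                    # redirect gaze toward the camera
    audio = models.translate_speech(audio_chunk, target_language)
    frame = models.reanimate_face(frame, audio)                 # lip-sync the face to the new audio
    driven = models.animate_portrait(stylized_portrait, frame)  # drive the stylized avatar with the video

    # 2. Lift the edited 2D frame to a 3D triplane in real time (~16 ms on one GPU).
    triplane = models.lift_2d_to_3d(driven)

    # 3. Render the 3D avatar from whatever viewpoint the receiver's client wants.
    return models.render(triplane, camera=models.receiver_camera())
```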
And all of this that I presented here today isn't just speculation or thought experiments. As Ken mentioned at the beginning, we actually showed an early prototype of this pipeline at the SIGGRAPH Emerging Technologies exhibit last year, and earlier this year, just a few weeks ago, we had an upgraded version in our booth at NVIDIA's GTC conference, where we incorporated a handful of upgrades and extensions including XR headsets, higher-resolution models, and cloud microservices. Before I wrap up and open up for questions, and I'm sure there's going to be a good number, I want to take a moment to thank all of my incredible collaborators; without each and every one of these wonderful individuals on the screen, I'm really sure I wouldn't be here to present all this really fun stuff with you. So thank you, and thank each of you for stopping by. I think that's it; I will open up for questions now.

What you're doing is very interesting, and can you talk more about how you plan to apply this, the applications? Because, remember, in your opening you mentioned that you would get around to why you're doing this, and I'm very application oriented, so I'm just curious how this gets used in practice.

Absolutely, yeah. A lot of the stuff we've been looking at recently is, for instance: right now I'm on a Zoom call, you're on a Zoom call, I can't see you, and you can barely see me as a tiny little couple of pixels in the corner of your screen, probably. It's not a great experience; it's acceptable, but if there were four people talking, you wouldn't know who I'm talking to or what's going on. It's a little bit strange. But if instead you have four people sitting around a conference table, and I look to my left and talk to someone here, you know I'm now talking to someone else, or maybe I'll talk to someone over here. I don't know if you can see that in my camera, but there are a lot of social cues that we lose as we move to these little-pixels-on-a-screen calls where everyone's looking at everyone and no one at the same time. That's one of the things, and there have also been a lot of studies recently suggesting that, perhaps because of this lack of social cues, these Zoom calls and Teams calls are really mentally fatiguing. We just wanted to see, using these methods, whether we can improve 3D teleconferencing by making 3D objects and 3D people that you see on your screen; maybe it's a more fun, more engaging, or just more enjoyable experience.

Right on. So this would be like a metaverse: whether you're watching it on a screen or a VR headset, it would be like a three-dimensional conference room that you're creating?

Yeah, so instead of people on little tiles you'd have, for instance, people sitting around a conference table, and you can look at each other and communicate; that's a way to put it, for sure.

Very cool. So if you're looking at one person or another, I guess you're tracking my gaze or my head motion when I'm looking at different people on the screen, and then my avatar is replicating that?

Yeah, for sure, that's a way we can do it.

Cool, thank you.

But you know, Ed, there's also this: if you would prefer to have your very cool self sitting there as opposed to the person who did not want to wash their hair that morning, you could always be looking good.
Anyway, show up in your suit and your fancy outfit. Exactly. Yeah, even though you're in pajamas. Hey, we have Audrey coming in; I don't want to talk over her.

Yes, I especially like the idea of being able to put in a substitute when I haven't washed my hair, that's perfect. But I am also always interested: you go through all these efforts to create a 3D mesh and 3D objects; I would love to be able to download them in some way and take them into other packages and use them, with the animation of course attached to them, which would be an even bigger bonus. How do you envision that happening in the future?

It's ongoing work, and the reason why I say that is because at the moment, as I mentioned, we're using this volumetric NeRF representation, which is really not an ideal representation. It's easy to train and easy to learn, but it's very expensive and honestly kind of low fidelity; you've basically just got a cloud in the shape of a person.

Right, and you never turn that cloud in any way into a 3D mesh?

No, we don't. Some methods do; for instance, the Pixel Codec Avatars from Meta use an actual animatable mesh, and they transmit codes that animate that mesh instead, and it looks really good, but it requires that you create your mesh, which is difficult and expensive. We don't, and it's definitely ongoing work; we're doing a lot of work on that. But I agree: if we could take these objects we've created from 2D images and just drop them into Unreal or somewhere else, that would be great.

It would be wonderful, yes. So thank you for this wonderful presentation, I really loved it.

Thank you.

Yeah, thank you so much for the really nice work. I just wonder, have you encountered any temporal artifacts when you generate those videos, when the person moves or is talking? If so, have you applied any temporal regularization method or consistency loss to make sure the texture looks natural?

You all have such good questions. The answer, in these slides, is no, but as of about three or four months ago, yes. It's also ongoing work; it's a paper that I recently submitted that does more temporal fusion of frames and people as they evolve. For instance, if I am looking to my right here, you see this side of my face, and the current model just kind of forgets what's going on at the back of my head, so it will hallucinate it and come up with some reasonable guess. But if I have face paint on and I turn my head, it forgets that I had face paint on, so it will just erase the face paint until I turn back and the camera can see it again. We don't want that to happen, so I have a recently submitted paper, which I don't believe is public yet, that addresses this problem. At the moment, no, but it's ongoing work, and definitely, yes, we want to have that.

I have a question. It's not quite related to the neural nets and everything, but it's about the appearance of the eyes and whether they're tracking you or not. You look at the 3D models and the eyes are recessed in; why is that happening, and is that an accidental thing where you get the eye-tracking effect with it? The old sculptures used to have that, I think, right?
Yes, exactly. So it is an artifact of the training dataset. If you look way back at those old sculptures, it's actually a slightly different reason: the old sculptures have holes in their eyes because people used to put gemstones there, and then the gemstones got stolen, so there's nothing in the eyes now. Ours has a slightly different cause, which is that most people, when you take a picture of them, look at the camera, just because people like looking at and smiling at the camera. What that leads to is that most people, regardless of where the picture is taken from, have their eyes looking at the camera. But when you're trying to create a 3D model of a person, the model doesn't change no matter where you look at it from, yet you still want to mirror that artifact of people looking at the camera. The way that manifests is not by the eyeballs moving as you change the camera, because then obviously the 3D model would be moving and that would be bad. Instead it uses an optical illusion: if you're familiar with the term, the hollow-face illusion, where you have a concave object instead of a convex one, and it looks like the eyes are tracking you as you move left and right, even though the eyes are still, purely because of the optical illusion. So, to recap, it's an optical illusion that results from a dataset bias, whereas if we train on cats, which don't tend to look at cameras, for whatever reason cats don't like cameras, we don't get the same artifact. That's also, as I mentioned, a really great question.

Actually, I do have a question. If this is used for video teleconferencing, what is the bandwidth requirement?

That's a good question. Rather than streaming the renderings of the person in and out, we're actually streaming the full 3D representation, so these triplanes. For those of you familiar with machine learning, the triplane is a set of three tensors, each of which is shaped 256 by 256 with 32 channels. What that basically means is that we have a 2D image with 32 channels, and we just reorganize it into a big tile and then stream that with video compression. So nothing fancy under the hood, really; we just shove it through a normal video encoder, and that ends up being about 10 megabits per second up and down, which is honestly not that much more than a normal video call. It's quite manageable; I imagine if three or four more people turned on their cameras here, we'd match 10 megabits per second up and down.

How do compression artifacts manifest themselves in this model?

Yeah, because we're doing this untiling and retiling of the 3D object and then compressing it with 2D video codecs that are optimized for human perceptual vision, and not for whatever 3D representation parameterization we're streaming, there are, in many cases, visible compression effects. We do a lot of work to reduce those as much as we can on the training side, but when they do manifest, a lot of the time it's small flickerings, for instance smoothness and inconsistency especially in things like hair textures or super-fine details, details sometimes getting very slightly smoothed out, or slight inaccuracies in light color: the light might turn a tiny bit red for a frame, then a tiny bit blue for a frame, and then go back to normal for a while.
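A small sketch of the tiling trick just described: lay the three 256x256x32 feature planes out as one big 2D array that an ordinary video encoder can ingest, and undo the tiling on the receiving side. The tile layout and the back-of-the-envelope numbers in the comments are illustrative, not the shipped codec configuration.

```python
import numpy as np

C, R = 32, 256                       # 3 planes, each 256 x 256 with 32 channels

def tile(triplane):
    """(3, C, R, R) float features -> one big 2D 'image' a video codec can ingest."""
    assert triplane.shape == (3, C, R, R)
    rows = [np.concatenate(list(plane), axis=1) for plane in triplane]  # each plane: (R, C*R)
    return np.concatenate(rows, axis=0)                                 # (3*R, C*R)

def untile(image):
    """Inverse of tile(): recover the (3, C, R, R) triplane on the receiver."""
    planes = np.split(image, 3, axis=0)
    return np.stack([np.stack(np.split(p, C, axis=1)) for p in planes])

tp = np.random.rand(3, C, R, R).astype(np.float32)
frame = tile(tp)
print(frame.shape, np.allclose(untile(frame), tp))   # (768, 8192) True

# Back-of-the-envelope: raw is 3 * 32 * 256 * 256 floats, about 25 MB per frame,
# but after quantization and standard video compression it streams at roughly
# 10 Mbit/s, comparable to an ordinary HD video call.
```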
They're small, noticeable if you're really looking for them and know exactly what you're looking for, but for the most part they don't cause too many problems.

Thank you. I see a new hand just came up; go for it.

Is there a portal where you can input your own images and play around, or any thought of having one? And also, are there any hook points where you can hook your own functions into this processing chain?

For data privacy reasons we don't have that right now; it's just a matter of data privacy at the moment. I'm not sure whether we will get around to doing that in the near future. We hope to productize this, and perhaps it might show up somewhere else.

Do you need beta testers?

Not at the moment, but maybe we will soon. We don't have any way to publicly access this right now.

Okay, thank you.

Thank you. I was just wondering, what happens if you run a person through, say, the cat model, or vice versa?

That is a really good question. A lot of the time it doesn't work very well, because the cat model is trained on cats and the person model is trained on people. The first thing that fails is the landmark detectors: when you're looking for a person in an image and you only see a dog, it's not going to find a person. But if we really do our best to force an image of a dog through the person model, or a person through the cat model, it's really strange: you come out with something like a cat-shaped object but with human eyes and human teeth, and it's just uncanny. I don't happen to have any examples here, but you'll just have to believe me that it is extraordinarily creepy.

All right, thank you.

I see a few questions in the chat. The first is: do you have any links? The answer is yes, and I will get those to you in a moment. The second question I see here is: what's the expected latency of the encoded triplane representation, to the decoder, to 2D rendering, for the purpose of video conference streaming? The end-to-end latency is about 80 to 100 milliseconds, between when the photon leaves your face and arrives at the camera and when it exits the display on the other person's end. Let me budget that out: roughly zero milliseconds to capture an image, 5 milliseconds to track your face, about 15 milliseconds to stream it to a server, another roughly 15 milliseconds to convert that 2D image to a 3D object, another roughly 20 milliseconds to stream that back to the client, and then about 20 to 30 milliseconds to render the 3D object back out to an image that the receiver can see. End to end, as I mentioned, that's about 80 to 100 milliseconds, which is on par with existing 2D video conferencing systems; this Zoom call, and I know how to check it on Teams but not on Zoom, probably has about 100 to 200 milliseconds of latency as well. So yeah, thank you for the question.
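The end-to-end figure quoted above is essentially the sum of the per-stage budgets; here is a trivial sanity check with the stage values as stated:

```python
# Per-stage latency budget from the answer above, in milliseconds.
stages = {
    "capture": 0,
    "face tracking": 5,
    "upload to server": 15,
    "2D-to-3D lifting": 15,
    "stream back to client": 20,
    "render on receiver": (20, 30),
}
low = sum(v[0] if isinstance(v, tuple) else v for v in stages.values())
high = sum(v[1] if isinstance(v, tuple) else v for v in stages.values())
print(low, high)   # 75 85 -> roughly the quoted 80-100 ms once network jitter and overhead are included
```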
Any other questions? There was a question by Audrey in the chat as well: do you need beta testers? We do not at the moment; thank you for the offer, but we do not.

I would love to read the paper that you talked about. This is Audrey.

Yes, I have several papers; let me go ahead, now seems like a good time to minimize this stuff.

It was a lot to take in, which was great, and now I just want to review it and think about the whole thing.

There will be plenty, and we can post them in the YouTube description and on the Meetup afterwards too, if you don't have them handy.

I do have them on hand. I think people can save the chat also. Let's see if I can stick this into the right chat, to everyone. On that page, and I'm not actually sure why I'm missing some papers, but from the bottom up: Efficient Geometry-Aware 3D Generative Adversarial Networks is the first 3D generative model I covered. I'm missing some papers. Live 3D Portrait is the, not the second, that's the third one; that's the 2D-to-3D converter, and that's probably the paper you're interested in. And then What You See is What You GAN, rendering every pixel, here we go, this is the second one; that's the second model I introduced, the high-quality 3D generator. Those are, I believe, the three main papers I covered here today.

A question here again. This is cool stuff. I have an application; in fact, I'm writing specs for merging a real-world immersive stage, a live performance stage show, like in a dome, where the imagery on the dome comes onto the stage, you put live performers in there, and they're interacting with the immersive environment projected around them. And I want to lift those performers off the stage and put them into the metaverse as digital twins. How far are we from full-body capture and digital twin creation, kind of a thing, do you think?

Ongoing work. I'm working on a project right now; at the moment, as you may have noticed, we only do faces and a little bit of the upper shoulders, but mostly just faces. It turns out that heads are relatively easy, because with the exception of mostly the mouth, they don't really move around that much. But when you have a whole body that has a skeleton, and you can wave your arms and your hands, and you can occlude your head and do weird stuff, it becomes a little bit more challenging. So indeed, a lot of questions there, but we are working on it, and while I'm not sure I can give you a nice estimate, I hope within the next small single-digit number of years.

Yeah, that's cool. I mean, obviously we can have mocap trackers on the performers, which can be a little cumbersome in terms of costuming and everything. Another way of doing it is real-time volumetrics with clusters of cameras all around, and I've seen that in real time, with I think about a three-second latency, that Canon did of a real-time basketball game: you put on a headset and you could walk right out onto the basketball court and the players are running through you and everything, and they looked really good. Anyway, I just thought I'd ask.

Yeah, we have had some prior speakers, and you can check out our old videos on various ways of doing motion capture. Cool, thanks, Michael. Is there one thing that you didn't cover that you wanted to, and you didn't think you could fit in, one more little tidbit?

There are a few. I think future directions would be interesting, and actually that came out a lot in the questions:
Can we make it look better over time? Can we make it learn from you as a person? For instance, you go home and you see your brother, your sister, your family, and you know what they look like because you've seen them so many times, every day; can we make an AI do that, so that it can just be you when you're somewhere else? It can pretend to be you in the boring calls that you don't want to join, or at least, maybe not pretend to be you, but when you're in your pajamas it can make you look like you when you were not in your pajamas. Or questions like: can we do hands and body? There's a lot of future work, things that are really interesting and really exciting, coming up soon, we hope.

All right, well, we hope you're going to be at SIGGRAPH.

Yeah, I hope so too.

Did you submit something to SIGGRAPH this year, or which conference are you submitting to?

I didn't submit anything to SIGGRAPH this year, mostly because I was crunching for NVIDIA's GTC; same deadlines, SIGGRAPH was just a few weeks after the GTC submission, so I just didn't have the available waking hours to attempt it.

We will find you.

Yes, but you know, it is how it is; I might still attend, we'll see.

I actually have a question. Can I train the 3D model on a younger version of myself and then look younger in the video call? If I show a picture, would it track the real me in the video call and then use the younger version of myself to make the 3D image? Would that work?

Yeah, the great part about our method is that you can just take your 2D image of you, feed it through a 2D age reverser, whatever 2D model you want, it spits out a younger image of you, and then you just turn that into the 3D version. There's nothing stopping you; it should be quite easy to do.

Are you concerned about the next level of deepfakes, then?

Am I concerned about deepfakes?

Well, I mean, yes, impersonating people and more with that.

Yeah. The short answer is yes; the long answer is that we're doing our best to be responsible about this. Again, part of the reason we don't have a public version right now is that if you have a 2D image of a person, it's like, ah, that might be a deepfake, but if you have a 3D video of a person, that becomes a lot more convincing. So we're doing our best to be responsible and make sure that nobody can make deepfakes with this, but as with any technology which takes a human and turns them into another human, there are going to be some bad actors. The short answer: yes, I am concerned, and yes, I am thinking about it.

Amazing work, very impressive, thank you so much, Matthew.

Thank you. Yeah, thanks, everyone, for attending.

Yeah, thank you very much. So if there are no more questions, maybe we'll bring the meeting to an end. Thank you, Alice, for inviting us all in to witness Matthew in action. Thank you all. Thank you, great talk. Fingers crossed, here we go. Thank you, SIGGRAPH, for doing this too. Thank you, yes, thank you, cheers, bye everyone, cheers everyone, bye, okay, bye.
Info
Channel: Silicon Valley ACM SIGGRAPH
Views: 310
Id: julPkak63uE
Length: 77min 59sec (4679 seconds)
Published: Thu Apr 25 2024