Editing Faces using Artificial Intelligence

Video Statistics and Information

Captions
Over the past few years in machine learning we've seen dramatic progress in the field of generative models, and while there are a lot of different flavors of these generative models, in this video I want to talk specifically about one model called the generative adversarial network, or GAN for short. GANs were first invented by Ian Goodfellow in 2014, and since then these models have seen an incredible burst of research improvements and applications. One of the coolest things about GANs is that you can take the underlying architecture and train the model on any dataset that you want. Most famously, the researchers at Nvidia have trained these models to generate faces, leading to the popular website thispersondoesnotexist.com, but you can take the same underlying model and train it on anything you want: cats, for example, or cars, or bedrooms. Anything you have in terms of an image dataset, you can use to train this generative model. And because most of these generative models are open-sourced by their creators, a lot of nifty people on the internet have trained their own models on a variety of datasets, like this one that generates anime faces, or this one that generates album cover art, or this one that I personally trained on Google Earth satellite images. Needless to say, training your own generative model can be a lot of fun, but that's not everything.

It turns out that when you train a generative model on a dataset, which is, by the way, a fully unsupervised process because you're not using any labels, these models actually discover the underlying structure in that dataset. And once the model has discovered this structure, you can start using and exploiting it to do a variety of pretty cool things. So in this video I want to give you an overview of what you can actually do with the latent space of a generative model once it has been trained on a particular dataset. Have you ever wanted to see what you would look like as part of the opposite gender? What about playing with Barack Obama, or making Emilia Clarke smile? In this video I will show you how to play with the latent space of the most powerful generative models we have available today. And not only that: this video comes with a complete IPython notebook, so everything we'll be doing in this video you can do for yourself on any image that you want. Are you ready to dive in deep? My name is Xander, and welcome to Arxiv Insights.

Okay, I'll start with a quick overview of everything we'll be seeing in this video. Initially I'm going to do a quick introduction to generative adversarial networks for the people who don't have a lot of prior experience with this topic. We'll then do a five-minute technical deep dive on the objective function that is actually optimized when you train one of these GANs. We'll then look at some of the state-of-the-art techniques that people often use in these generative models, which allow them to generate such beautiful, high-quality images. And then the rest of this video will cover how you can actually use the latent space to manipulate any image that you want.

All right, so the idea behind GANs is pretty simple, but it's also very beautiful. In essence, what we have is two neural networks: the generator and the discriminator. The generator is in charge of the following: it gets a randomly sampled noise vector as an input; in most cases we'll sample it from a Gaussian distribution.
We take that noise vector, feed it into the neural network, follow it through a bunch of convolutional layers, and at the end we have an image. This image is then fed to a second neural network, the discriminator, and the discriminator has one job: it needs to look at an image and decide whether that image comes from the actual dataset (the real images we're training on) or whether it came from the generator and is thus a fake image. And because we are controlling the training process, we have the label: we know, for each image we feed to the discriminator, whether it was a real one coming from the dataset or a fake one coming from the generator. By using this label, we can backpropagate a training loss through the discriminator network in order to make it better. The nice thing is that the generator itself is also a fully differentiable neural network, so if we stick these two networks back-to-back, we can backpropagate the learning signal through the entire pipeline. This way, with the same single loss function, we can update both the discriminator and the generator network until they both get really good at their job. The most important trick in this pipeline is to make sure that both networks stay well balanced during training, so that neither of them gets the upper hand. If you manage to do so and you train for long enough, then eventually what you'll have is a generator that has been learning from the feedback of the discriminator network and can eventually generate images that look very similar to the dataset we've been training on.

Okay, so with that general introduction of the idea behind GANs, let's do a five-minute technical deep dive on the objective function that is used while training. To be clear, I'm going to look at the original objective function as it was published in Goodfellow's paper in 2014. Basically, what's happening is that the generator and the discriminator are playing a minimax game: the generator is trying to create images that fool the discriminator, and the discriminator is always trying to be right, trying to tell the difference between real and fake, generated images. What you can see in the loss function here is that we are minimizing this objective function for the generator, but maximizing it for the discriminator.

Let's go into a little more detail. The discriminator network outputs a single scalar value D(x) per image, which indicates how likely it is that this image x is in fact a real image coming from the dataset. It does the same for generated images: the generated images are G(z), where z is the noise vector and G is the generator network, and the discriminator again outputs a score D(G(z)) for these fake images. Now, while training, we want the discriminator to recognize real images x as real, so we want it to output a high value close to one. At the same time, we want it to recognize fake images G(z) as fake, and therefore output a low value close to zero. This is exactly what this loss function is doing. But notice that we still call this unsupervised learning, because the labels come by themselves: we don't have to label any of the training data, we just know whether an image came from the dataset or from the generator. At the same time, we also want the generator to create images that the discriminator actually thinks might be real.
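For reference, here is that original objective written out (this is the standard minimax formulation from the 2014 paper, using the notation above, with p_data the distribution of real images and p_z the Gaussian prior over noise vectors):

```latex
% Original GAN objective (Goodfellow et al., 2014):
% the discriminator D maximizes V(D, G), the generator G minimizes it.
\min_G \max_D \; V(D, G) =
    \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
    + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```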
So the generator is trying to create images that fool this discriminator into thinking those images are actually real, even though they were generated by the generator. Notice that the first part of the loss function doesn't depend on the parameters of the generator, so when we're optimizing the generator, the only thing we have to optimize is the second part of the objective function. And again, every time we calculate this objective function, we minimize it with respect to the generator parameters and maximize it with respect to the parameters of the discriminator.

If we then put this objective function into a concrete algorithm, this is what we get. At every training step we start with two things: we sample a batch of random noise vectors and a batch of images from the dataset. We then use the objective function we just saw to update the parameters of the discriminator, by doing gradient ascent with respect to its parameters. Importantly, while we're updating the discriminator, the generator network itself is fixed; we're not changing anything in the generator during this step. Once we've updated the discriminator for a couple of steps, we freeze the weights of the discriminator and move to the other part, where we actually train the generator. Notice that here we resample a new batch of random noise vectors, generate images with them, and then apply gradient descent on the second part of the objective function in order to update the parameters of the generator. And that is the full algorithm. Now, in practice there are a few additional tricks you can apply to make sure this objective function converges nicely and smoothly, because it tends to be a little unstable in the exact form we've just seen. In any case, nowadays we have a wide variety of objective functions that people use to train these GANs, but all of them are built on the same core idea we just saw in the algorithm. If you want a bit of an overview of all these different flavors, I'd really recommend checking out this Medium blog post; it's really good and gives you a nice overview of the entire GAN landscape.
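A minimal sketch of that alternating update, written with the equivalent binary cross-entropy losses rather than the raw minimax objective; the generator/discriminator modules, the optimizers, and doing a single discriminator step per generator step are my assumptions, not something specified in the video:

```python
import torch
import torch.nn.functional as F

def gan_train_step(generator, discriminator, opt_g, opt_d, real_images, latent_dim=512):
    """One alternating GAN update: first the discriminator, then the generator.
    `generator` and `discriminator` are assumed to be ordinary nn.Modules."""
    batch = real_images.size(0)

    # --- Discriminator step (generator held fixed via .detach()) ---
    z = torch.randn(batch, latent_dim)
    fake_images = generator(z).detach()          # no gradients flow into G here
    d_real = discriminator(real_images)          # logits: higher = "more real"
    d_fake = discriminator(fake_images)
    # Gradient *ascent* on the objective == descent on this cross-entropy form.
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Generator step (only G's parameters are stepped by opt_g) ---
    z = torch.randn(batch, latent_dim)           # resample a fresh noise batch
    d_fake = discriminator(generator(z))
    # Non-saturating generator loss (one of the common stability tricks):
    # push the discriminator towards calling the fakes real.
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    return loss_d.item(), loss_g.item()
```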
So, before we start actually messing around with some real images, I first want to introduce two final ideas that we're going to be using in our generative model. The first one is the progressive growing of the layers in your generative model. This was an idea published by Nvidia in their paper called Progressive Growing of GANs, and the idea is in fact not that difficult either. You basically start with a generative model that generates very small images of super low resolution, and at the same time the discriminator also gets to discriminate between very low resolution images. This makes the entire process super simple, so the network is very stable and converges quickly. Once that network has stabilized, you simply add an additional layer to both the generator and the discriminator architecture, which works at a slightly higher resolution, and you keep on training. There are a few additional tricks where, instead of adding this layer in one shot, you do it gradually by blending the previous layer towards the higher-resolution one, but in practice this is what happens: your generator starts by generating very low resolution images, the discriminator discriminates them, and then during the training process you gradually scale up the resolution of the images in your training pipeline.

The second very influential paper by Nvidia in the generative model landscape was a model architecture called StyleGAN, and in fact StyleGAN is the model that we're going to be using to manipulate our images. Traditionally, you take a generator architecture, it gets a random noise sample as an input, and you feed that noise sample through a whole bunch of upsampling and convolutional layers until you get an image. What the StyleGAN generator does is slightly different. First, it has a mapping network, and this mapping network takes the noise vector z and transforms it into a different vector called w. The important thing here is that the w vector doesn't have to be Gaussian anymore; the distribution of those w's can be whatever the generator wants it to be. Then, the actual generator architecture doesn't start from a random noise vector anymore; it starts from a constant vector, and this constant vector is actually optimized during training. It's kind of like a seed, a fixed seed at the beginning of the first layer of the generator architecture, but the actual values of that vector, while constant per image, are optimized during the training process. Finally, the output of the mapping network, w, is plugged into multiple layers of the generator architecture using a blending layer called AdaIN (adaptive instance normalization), and during training we also add noise at these layers.

For people wondering why you would actually use a mapping network like this: imagine that you have a dataset and we look at two properties, gender and facial hair, so we have male and female, and beard and no beard. Well, in most image datasets of people you will find very few women that have beards. In other words, our data distribution has a gap there, but if we're sampling from a Gaussian distribution, that distribution doesn't have any gaps. This is essentially what the mapping network allows you to do: you can sample from a simple, gap-free distribution, but then warp that distribution in such a way that it can have gaps, for example where there are no actual images. The idea is that if this warping of the space is already done by the mapping network, so your w vector is already in a good shape, then your actual generator, which takes that vector and turns it into an image, has a much simpler job, because the relationship between images and input vectors is more one-to-one; it doesn't have any gaps or strange distortions that it has to learn.

All right, so that is the StyleGAN generator architecture. In the StyleGAN paper they apply this generative architecture, which starts from a constant and uses a mapping network, but they also use the progressive growing of the layers, and with those two tricks combined they were able to create a very powerful GAN that was able to produce incredibly realistic images.
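To make that structure concrete, here is a toy sketch of a StyleGAN-style generator with the three ingredients just described: a mapping network, a learned constant input, and w injected at every resolution via AdaIN, plus per-layer noise. The layer sizes, depths, and noise scale are illustrative assumptions, not the real StyleGAN configuration, and progressive growing is left out entirely:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a Gaussian z to the intermediate latent w (a small MLP)."""
    def __init__(self, latent_dim=512, layers=4):  # StyleGAN itself uses 8 layers
        super().__init__()
        blocks = []
        for _ in range(layers):
            blocks += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*blocks)

    def forward(self, z):
        return self.net(z)

class AdaIN(nn.Module):
    """Adaptive instance norm: w controls a per-channel scale and bias."""
    def __init__(self, latent_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.style = nn.Linear(latent_dim, channels * 2)   # scale + bias from w

    def forward(self, x, w):
        scale, bias = self.style(w).chunk(2, dim=1)
        scale, bias = scale[:, :, None, None], bias[:, :, None, None]
        return (1 + scale) * self.norm(x) + bias

class TinyStyleGenerator(nn.Module):
    """Two-layer, 8x8 toy generator: learned constant seed, AdaIN styling, noise."""
    def __init__(self, latent_dim=512, channels=64):
        super().__init__()
        self.mapping = MappingNetwork(latent_dim)
        self.const = nn.Parameter(torch.randn(1, channels, 4, 4))  # learned constant "seed"
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.adain1 = AdaIN(latent_dim, channels)
        self.up = nn.Upsample(scale_factor=2)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.adain2 = AdaIN(latent_dim, channels)
        self.to_rgb = nn.Conv2d(channels, 3, 1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, z):
        w = self.mapping(z)                          # z -> w
        x = self.const.expand(z.size(0), -1, -1, -1) # start from the learned constant
        x = self.act(self.conv1(x))
        x = x + 0.1 * torch.randn_like(x)            # per-layer noise injection
        x = self.adain1(x, w)                        # style from w via AdaIN
        x = self.act(self.conv2(self.up(x)))
        x = x + 0.1 * torch.randn_like(x)
        x = self.adain2(x, w)
        return torch.tanh(self.to_rgb(x))            # (B, 3, 8, 8) toy image

# Usage: TinyStyleGenerator()(torch.randn(2, 512)) -> tensor of shape (2, 3, 8, 8)
```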
All right, so now that we've seen a bit of the prerequisites, it's time to actually go into our IPython notebook and start playing with the StyleGAN model ourselves. The general principle that we're going to leverage here is that when you train a generative model, the latent space usually learns the underlying structure of your dataset, and again, that structure is learned fully unsupervised, because we're not using any labels in the dataset.

So how can we leverage this structure? The core idea is that instead of manipulating images in the pixel domain, which is really difficult and very complicated, we're going to manipulate images in the latent space of that generative model. In order to do this, starting from any given image, we're going to have to find that query image inside the latent space of the generator. More specifically, let's say we start with this image of Barack Obama. The question is: how can we find the latent vector z such that, if we send z through the generator, we get this image of Barack Obama? That's the first problem we have to solve. You could try randomly sampling a whole bunch of these latent vectors and see which one is closest, but that's going to take a really long time; we need a better approach.

One of the most straightforward things you could try, given the fact that this generative model is a fully differentiable neural network, is to send gradients through it. You could randomly start from any latent vector z, generate a random image, then compare that image to the query image of Barack Obama and define a simple loss function, say the pixel-by-pixel L2 difference between the two images. Then you do gradient descent, not on the image, but by sending the gradients through the generator model and updating the latent vector z at the input of the generator. By applying gradient descent on this L2 pixel loss, we could in theory find the optimal latent vector z that gives us a good image of Barack Obama. Unfortunately, this doesn't really work: the L2 optimization starts going in the direction of Barack Obama, but before it gets there, it gets stuck in a very bad local minimum, an image that doesn't look at all like Barack Obama, and once we're in that well, the optimization simply can't get out anymore. So L2 optimization directly in pixel space doesn't work; we need a different approach.

This different approach is something we've seen in a lot of machine learning applications: the idea that you can use a pre-trained image classifier as a lens to look at your pixels. Rather than optimizing our L2 loss directly in pixel space, we send both the output of our generator and the query image through a pre-trained VGG network that was trained to classify ImageNet images. But instead of going all the way to the last layer and the classification, we cut off the head of that network and simply extract a feature vector from one of the last fully connected layers of the classifier. We send both images through the pre-trained classifier, extract a feature vector there, and this gives us a high-level semantic representation of what is in the image. It turns out that if we do gradient descent on this feature vector, rather than on the pixels of the image, our approach does work. To give you an idea of what that looks like, here's a small video of the optimization process on three query images: the images at the bottom are the queries, and the top images start from the average face in the StyleGAN space and then move towards the query images below.

Now, there is one final problem with this approach: it is really, really slow. It takes a very long time for the optimizer to actually find a good latent code.
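A minimal sketch of that feature-space optimization, assuming a pre-trained `generator` that maps a latent vector to an image (normalized to [-1, 1]) and using torchvision's VGG16 as the feature extractor; the exact layer, step count, learning rate, and the missing ImageNet normalization are simplifying assumptions:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Pre-trained VGG16 with the final classification layer cut off, so the output
# is a feature vector from one of the last fully connected layers.
vgg = models.vgg16(pretrained=True).eval()
feature_head = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(), *list(vgg.classifier[:4])).eval()
for p in feature_head.parameters():
    p.requires_grad_(False)

def embed(img):
    # Resize to VGG's expected input; a real pipeline would also apply
    # the ImageNet mean/std normalization (omitted here for brevity).
    return feature_head(F.interpolate(img, size=(224, 224), mode='bilinear'))

def project(query_img, generator, latent_dim=512, steps=500, lr=0.05):
    """Find a latent vector whose generated image matches query_img in feature space."""
    target = embed(query_img).detach()
    z = torch.randn(1, latent_dim, requires_grad=True)   # random starting point
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = generator(z)                         # generator weights stay fixed
        loss = F.mse_loss(embed(img), target)      # L2 distance in VGG feature space
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()
```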
I wonder if there's a way to make a really good guess for the starting point of our search in the latent space. With that idea in mind, why don't we make a dataset first: we sample a whole bunch of random latent vectors, send them through the generator, and generate faces. Once we have that dataset, we can train a ResNet to go from the images to their respective latent code. That's pretty clever, right? And in the repo you'll be using in the notebook, there is already a pre-trained ResNet like this, so you can just use it out of the box.

So here's what our pipeline currently looks like. We take a query image and send it through the ResNet, and this network gives us an initial estimate of the latent vector in the StyleGAN latent space. We then take that latent vector and send it through the generator, which gives us an image. On this image we apply a pre-trained VGG network in order to extract features from it, and we do the same thing for our query image of Barack Obama. Then, in that feature space, we start doing gradient descent: we minimize the L2 distance in feature space and send those gradients through the generator model all the way back into our latent code. Importantly, during this optimization process the generator weights themselves are completely fixed; the only thing we're updating is the latent code at the input of the generator.

Now, in this optimization process there are a lot of different things you can tweak: for example, which specific layer of the VGG network you use as the semantic feature vector, or whether you apply a mask to the face so that only pixels within the face region are used to compute the L2 difference. We can even add a penalty on the latent code we're optimizing, so that it doesn't move too far away from the concept of a face according to the StyleGAN network. Because, as it turns out, you can use this process to find any image you want inside the StyleGAN network, even one trained on faces; with this approach you could basically find a car inside the latent space of StyleGAN. The problem is that the vector which gives you a car is going to be very far away from the Gaussian distribution we actually started from, and since we later want to start manipulating this face, it's a really good idea to make sure that whatever latent vector we find stays close to the concept of a face inside the StyleGAN network.

Okay, so here's the entire process: we start with a query image, we send it through the ResNet to get an estimate, and then we run this latent space optimization until we finally get our optimized image, which is as close as possible to the query image we started from. Here are three additional videos of that process in action, on Albert Einstein, Emilia Clarke, and Barack Obama.
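A rough sketch of how such an encoder could be trained on the synthetic (image, latent) dataset described above; the repo ships a pre-trained encoder, so this is only to illustrate the idea, and regressing the input latent z directly (rather than an intermediate w code) plus the ResNet-50 backbone and learning rate are my simplifying assumptions:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

latent_dim = 512
# ResNet backbone with its classification head replaced by a latent regression head.
encoder = models.resnet50(pretrained=False)
encoder.fc = torch.nn.Linear(encoder.fc.in_features, latent_dim)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def encoder_training_step(generator, batch_size=16):
    """Sample latents, render faces with the fixed generator, and regress the latents back."""
    with torch.no_grad():
        z = torch.randn(batch_size, latent_dim)    # random latent codes ...
        images = generator(z)                      # ... and the faces they produce
    pred = encoder(F.interpolate(images, size=(224, 224), mode='bilinear'))
    loss = F.mse_loss(pred, z)                     # predict the latent code from the image
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```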
Okay, so now that we have the latent code which gives us our query image, or at least something very close to it, what do we do now? Well, now it's time to play. But in order to play with the latent space, we need another dataset, because what we want to do here is start messing with specific attributes of those faces: things like age, gender, smiling, glasses, all of those attributes that we know are important for how a face actually looks. So again we're going to randomly sample a whole bunch of latent vectors, send them through the generator to get our faces, and then apply a pre-trained classifier that was trained to recognize a whole bunch of these attributes. And if you wanted to, you could also hand-label these images with any kind of attribute you care about.

Okay, so now that we have all of those labels for the faces, what do we do with them? Well, if you look at the StyleGAN latent space, it's a 512-dimensional space, so it's very complicated, and what we really care about is how a certain direction in that latent space changes the face that comes out of the generative model. With the dataset we just created, we can place all of those faces at their respective locations in the latent space and start looking at the attributes we've collected, and it turns out those attributes are quite well separable by a relatively simple linear hyperplane in that latent space. Once we've found that hyperplane, if we take the normal to that hyperplane, this direction in the latent space basically tells us how to make a face look, for example, more female. Because the only thing I have to do now is take my query image, find its latent vector, and then, from that point in latent space, start walking in the direction of what makes a face more female.

In the IPython notebook you will see that, in the repo we're using, a whole bunch of these latent space directions have already been predefined for you. If you want, there is code available to train your own latent space directions on any attribute you care about, but the ones that are already there are kind of fun to play with. And with all of that done, I think it's time for you to start playing with StyleGAN. Thank you very much for watching. Don't forget to subscribe, follow me on Twitter, and support me on Patreon, and I hope to see you again in another episode of Arxiv Insights.
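To make the hyperplane idea above concrete, here is a small sketch of how such an attribute direction could be found and applied; the use of scikit-learn's linear SVM and the step size are illustrative assumptions, not necessarily what the notebook itself does:

```python
import numpy as np
from sklearn.svm import LinearSVC

def find_direction(latents, labels):
    """Fit a linear hyperplane separating e.g. 'smiling' from 'not smiling'
    latents and return its unit normal as the edit direction.
    latents: (N, 512) array of latent codes, labels: (N,) binary array."""
    clf = LinearSVC(C=1.0).fit(latents, labels)
    normal = clf.coef_[0]
    return normal / np.linalg.norm(normal)

def edit(latent, direction, strength=2.0):
    """Walk from a projected face's latent code along the attribute direction."""
    return latent + strength * direction

# Usage sketch (all variables hypothetical):
# smile_dir = find_direction(sampled_latents, smile_labels)
# edited_latent = edit(obama_latent, smile_dir)   # feed back through the generator
```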
Info
Channel: Arxiv Insights
Views: 370,961
Keywords: Generative Adversarial Network, GAN, Deep Learning, AI, Nvidia, Deep Generative model, generated faces, Artificial Intelligence, fake media, StyleGAN, Generative, Neural Networks, generative adversarial networks, GANs, face editing, face morphing, StyleGAN2
Id: dCKbRCUyop8
Length: 25min 27sec (1527 seconds)
Published: Fri Sep 13 2019