Introduction to GANs, NIPS 2016 | Ian Goodfellow, OpenAI

Captions
Thank you all for coming; this is a massive room. Today we will have six great invited talks, a panel discussion, and a selection of posters and spotlight presentations. I don't have much to say, but please welcome Ian Goodfellow from OpenAI, who will be giving the first talk of the day, an introduction to generative adversarial networks.

Good morning, and thank you everybody for coming. I'll explain first a little bit what my goals for this talk are. I know there are a lot of different people here at the workshop, and the main purpose of the talk is just to give everyone a little bit of context, so that you know what adversarial training is and what generative adversarial networks are. If you were at my tutorial on Monday you will probably have seen a lot of these slides before, but I'm also going to throw in a few new ideas, so that you feel like you've gotten something extra for your time. This talk is mostly for the people who have just arrived at the workshop and need some context.

This workshop is about adversarial training, and "adversarial training" is a phrase whose usage is in flux. I don't claim exclusive ownership of the phrase, but to avoid confusion I thought I'd comment a little bit on how it has been used before and how it's mostly used now. I first used the phrase adversarial training in a paper called "Explaining and Harnessing Adversarial Examples," and in that context I used it to refer to the process of training a neural network to correctly classify adversarial examples, by training the network on adversarial examples. Today other people have started using the phrase for lots of different areas: almost any situation where we train a model in a worst-case scenario, where the worst-case inputs are provided either by another model or by an optimization algorithm. So the phrase adversarial training now applies to lots of ideas that are both new and old. The way that we use the phrase adversarial training now, it
could apply to things like an agent playing a game against a copy of itself, like Arthur Samuel's checkers player back in the 1950s. So it's important to recognize that when we use the phrase adversarial training today, we're not only referring to things that were invented recently; the usage has expanded to encompass a lot of older things that also had other names, like robust optimization.

Most of today's workshop is about a specific kind of adversarial training, which is the training of generative adversarial networks. In the context of generative adversarial networks, both players in the game are neural networks, and the goal is to learn to generate data that resembles the data in the training set. The reason we call the training process for generative adversarial networks adversarial training is that the worst-case input for one of these networks is generated by the other player, so one of the players is always trained to do as well as possible on the worst possible input. It's worth mentioning that there is other work going on in the space of adversarial training where the goal is still to train on adversarial examples, inputs that were created, perhaps by an optimization algorithm, to confuse the model. You will see some posters about that here, and there is also some work about it in the Reliable ML workshop. I hope that clears up any confusion about the term adversarial training.

Generative adversarial networks are mostly intended to solve the task of generative modeling. The idea behind generative modeling is that we have a collection of training examples, usually large, high-dimensional examples such as images or audio waveforms. Most of the time we'll use images as the running scenario in the slides, because it's much easier to show a picture of an image than to play an audio waveform, but everything that we describe for images applies to more or less any other kind of data. So there are two things you might
ask a generative model to do. One is what we call density estimation: given a large collection of examples, we want to find the probability density function that describes those examples. The other thing we might do is try to learn a function, or a program, that can generate more samples from that same training distribution. I show that on the lower row here, where we have a collection of many different training examples, in this case photos from the ImageNet dataset, and we'd like to create a lot more of those photos. We create those photos in a random way, where the model is actually generating photos that have never been seen before but come from the same data distribution. In this case the images on the right are actually just more examples from the ImageNet dataset; generative models are not yet good enough to make images of this quality, but that's the goal we're striving toward.

The particular approach that generative adversarial networks take to generative modeling is to have two different agents playing a game against each other. One of these agents is a generator network, which tries to generate data, and the other is a discriminator network, which examines data and estimates whether it is real or fake. The goal of the generator is to fool the discriminator, and as both players get better and better at their jobs over time, eventually the generator is forced to create data that is as realistic as possible, data that comes from the same distribution as the training data.

The way the training process works is that first we sample some image from the training dataset, like the face we show on the left. We call this image x; that's just the name of the input to the model. The first player is the discriminator network, which we represent with a capital D. The discriminator network is a differentiable function with parameters that control the shape of the function; in other words, it's usually a neural network. We then apply the function
D to the image x, and in this case the goal of D is to make D(x) very close to one, signifying that x is a real example that came from the training set.

In the other half of the training process, we sample some random noise z from a prior distribution over latent variables in our generative model. You can think of z as a source of randomness that allows the generator to output many different images, instead of outputting only one realistic image. After we've sampled the input noise z, we apply the generator function G. Just like the discriminator, the generator is a differentiable function controlled by some set of parameters; in other words, it's usually a deep neural network. After applying the function G to the input noise z, we obtain a value of x sampled from the model, like the face on the right. This sample x will hopefully be reasonably similar to the data distribution, but it might have some small problems that the discriminator can detect. In this case we've shown a slightly grainy, noisy image of a face, suggesting that this graininess and noise is a feature the discriminator might use to detect that the image is fake. We apply the discriminator function to the fake example we pulled from the generator, and in this case the discriminator tries to make its output D(G(z)) near zero. Earlier, when we applied the discriminator to real data, we wanted D(x) to be near one; now the discriminator wants D(G(z)) to be near zero, to signify that the input is fake. Simultaneously, the generator is competing against the discriminator, trying to make D(G(z)) approach one.

We can think of the generator and the discriminator as being a little bit like counterfeiters and police. The police would like to allow people with real money to safely spend their money without being punished, but would also like to catch counterfeit money, remove it from circulation, and punish the counterfeiters. Simultaneously, the counterfeiters would like to fool the police and
successfully spend their money, but if the counterfeiters are not very good at making fake money, they'll get caught. So over time, the police learn to be better and better at catching counterfeit money, and the counterfeiters learn to be better and better at producing it.

In the end, we can actually use game theory to analyze this situation. We find that if both the police and the counterfeiters, or in other words both the discriminator and the generator, have unlimited capabilities, the Nash equilibrium of this game corresponds to the generator producing perfect samples that come from the same distribution as the training data. In other words, the counterfeiters produce counterfeit money that is indistinguishable from real money, and at that point the discriminator, in other words the police, cannot actually distinguish between the two sources of data, and simply says that every input has probability one half of being real and probability one half of being fake.

We can formally describe the learning process using what's called a minimax game. We have a cost function for the discriminator, which we call J superscript (D); it's just the normal cross-entropy cost associated with the binary classification problem of telling real data from fake data, where we have one minibatch of real data drawn from the dataset and one minibatch of fake data drawn from the generator. If we use the minimax formulation of the game, then the cost for the generator is just the negation of the cost for the discriminator. The equilibrium of this game is a saddle point of J superscript (D), and finding this saddle point resembles the process of minimizing the Jensen-Shannon divergence between the data and the model. We can use that to actually prove that we'll recover the correct data distribution if we go to the equilibrium of the game. We can analyze what the discriminator does as the players play this game, and see exactly what it is that allows generative adversarial networks to be effective.
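As a concrete illustration, here is a minimal one-dimensional sketch of this alternating training procedure. This is my own toy construction rather than code from the talk: the "generator" just shifts Gaussian noise by a learned offset theta, the "discriminator" is a logistic regressor, and the two take turns following the gradients of their respective cross-entropy costs. All function and variable names here are mine.

```python
import numpy as np

# Toy 1-D GAN. Real data ~ N(2, 1). Generator: G(z) = z + theta, so the
# model distribution is N(theta, 1). Discriminator: D(x) = sigmoid(a*x + b).
# We alternate gradient steps, using the non-saturating generator loss
# (maximize log D(G(z))) recommended later in the talk.

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_toy_gan(steps=4000, batch=128, lr_d=0.1, lr_g=0.03, seed=0):
    rng = np.random.default_rng(seed)
    a, b = 0.1, 0.0      # discriminator parameters
    theta = -1.0         # generator shift, starts far from the data mean
    for _ in range(steps):
        x_real = rng.normal(2.0, 1.0, batch)
        x_fake = rng.normal(0.0, 1.0, batch) + theta
        d_real = sigmoid(a * x_real + b)
        d_fake = sigmoid(a * x_fake + b)
        # Discriminator ascends mean log D(x) + mean log(1 - D(G(z))).
        grad_a = np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake)
        grad_b = np.mean(1 - d_real) - np.mean(d_fake)
        a += lr_d * grad_a
        b += lr_d * grad_b
        # Generator ascends mean log D(G(z)).
        d_fake = sigmoid(a * x_fake + b)
        theta += lr_g * np.mean(1 - d_fake) * a
    return theta

print(round(train_toy_gan(), 2))  # theta should approach 2.0, the data mean
```

At the end of training the generator's shift sits near the true data mean, and the discriminator's weight a has decayed toward zero, which is the one-dimensional analogue of the "D outputs one half everywhere" equilibrium described above.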
The basic idea is that if you take the derivatives of the minimax game's value function with respect to the outputs of the discriminator, you can actually solve for the optimal function that the discriminator should learn. This function turns out to be the ratio between p_data(x) and p_data(x) + p_model(x). You can do a little bit of algebra to rearrange that and recover p_data(x) over p_model(x), so we're learning the ratio between the density the real data is drawn from and the density the model currently represents. Estimating that ratio allows us to compute a lot of different divergences, like the Jensen-Shannon divergence and the KL divergence between the data and the model that is used for training with maximum likelihood. So the key insight of generative adversarial networks is to use supervised learning to estimate a ratio that we need in order to do unsupervised learning. There are also a variety of other papers, by Shakir Mohamed and his collaborators and by Sebastian Nowozin and his collaborators, that talk a lot about the different divergences you can learn with these kinds of techniques, and about how this estimation procedure compares to other techniques that had already been developed in the statistical estimation literature. But this is the basic idea right here: we're able to learn this ratio.

So far I've described everything in terms of the minimax game. I personally recommend that you not use exactly that formulation; instead, use a slightly different formulation where the generator has its own separate cost. The idea is that rather than minimizing the discriminator's payoff, the generator should maximize the probability that the discriminator makes a mistake. The nice thing about this formulation is that the generator is much less likely to suffer from the vanishing gradients problem. But this is more of a practical tip and trick than a strong theoretical recommendation, and some of the other speakers you'll see today might give other advice, so it's something of an open question exactly which tips and tricks work best.
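Written out, the optimal discriminator described above, the density ratio it provides, and the non-saturating generator cost just recommended are:

```latex
D^*(x) \;=\; \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)},
\qquad
\frac{p_{\text{data}}(x)}{p_{\text{model}}(x)} \;=\; \frac{D^*(x)}{1 - D^*(x)},
\qquad
J^{(G)} \;=\; -\tfrac{1}{2}\,\mathbb{E}_{z}\big[\log D(G(z))\big].
```

At the equilibrium of the game, p_model = p_data, so D*(x) = 1/2 everywhere, matching the "probability one half" behavior described earlier.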
One of the really cool things about generative adversarial nets is that you can do arithmetic on the z vectors that drive the output of the model. We can think of z as a set of latent variables that describe what is going to appear in the image. Alec Radford, the co-organizer of this workshop, and his collaborators showed that you can take the z vector corresponding to a picture of a man with glasses, the z vector for a picture of a man, and the z vector for a picture of a woman; if you subtract the vector for the man from the vector for the man with glasses and add the vector for the woman, you get a vector that describes a woman with glasses, and when you decode small jitters of that vector you get many different pictures of a woman wearing glasses. A lot of you may have seen a similar result before with language models, where the word embedding for "queen" could be used to do arithmetic: if you subtract off the word embedding for "female" and add the word embedding for "male," you get a vector that is very close to the word embedding for "king." In this case Alec and his collaborators have a slightly more exciting result, because they not only show that the arithmetic works in vector space, but also that the vector can be decoded into a high-dimensional, realistic image with many different pixels all set correctly. In the case of language modeling, the final result was a vector that was very near the word for "king," but there was no need to decode that vector into some kind of extremely complicated observation corresponding to a king.

Probably the biggest issue with generative adversarial networks, and to some extent with other forms of adversarial training, is that the training process does not always converge. Most of deep learning consists of minimizing a single cost function, but the basic idea of adversarial training is that we have two different
players who are adversaries, each minimizing their own cost function. When we minimize a single cost function, that's called optimization, and it's unusual for us to have a major problem with non-convergence; we might get unlucky and converge to a location we don't like, such as a saddle point with a high cost value, but we'll usually at least converge to some general region. When we play a game with two players, each simultaneously trying to minimize their own cost, we might never actually approach the equilibrium of the game.

In particular, one of the worst forms of non-convergence we see with generative adversarial networks is what we call mode collapse, or, if you're in on a little joke in our first paper, the Helvetica scenario. The basic idea behind mode collapse is that in the minimax formulation of the game, what we'd really like is minimization over G in the outer loop and maximization over D in the inner loop. If we solve this min-max problem applied to the value function V, we are guaranteed to recover the training distribution. But if we swap the order of the max and the min, we get a different result: if we minimize over G in the inner loop, the generator has no incentive to do anything other than map all inputs z to the same output x, namely the point that is currently considered most likely to be real rather than fake by the current value of the discriminator. So we really want to do min-max and not max-min. Which one are we actually doing? The way we train these models is simultaneous gradient descent on both players' costs, and that looks very symmetric; it doesn't naturally prioritize one direction of min-max or max-min. In practice, we often see results that look an awful lot like max-min, unfortunately, with G in the inner loop.

Using some very nice visualizations from Luke Metz and his collaborators, we see here that if we have a
target distribution we'd like to learn, with several different modes in two dimensions, the training procedure shown in the bottom row of images actually visits one mode after another instead of learning to visit all of the different modes. What's going on is that the generator identifies some mode that the discriminator believes is highly likely and places all of its mass there; then the discriminator learns not to be fooled by the generator going to that one particular location, and instead of the generator learning to go to multiple locations, it moves on to a different single location, until the discriminator learns to reject that one too.

One way we can try to mitigate the mode collapse problem is with the use of what we call minibatch features. This was introduced in the paper we presented on Monday night from OpenAI, where the basic idea is to add extra features to the discriminator, so the discriminator can look at an entire minibatch of data; if all the different samples in the minibatch are very similar, the discriminator can realize that mode collapse is happening and reject those samples as being fake. On the CIFAR-10 dataset, this approach allowed us to learn samples that show all the different object classes in CIFAR-10 for the first time. On the left I show what the training data looks like for CIFAR-10; you can see that it's not that beautiful to start with, because these are only 32 by 32 pixel images, so the resolution is very low. On the right we see the samples that come from the model, and you can actually recognize horses, ships, airplanes, cars and so on; the real object classes occur recognizably within these samples. On ImageNet there are a thousand classes, so it's much more difficult to resist the mode collapse problem. On ImageNet our model mostly produces samples that have the general texture of photographs but don't necessarily have rich class structure.
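The minibatch-features idea can be sketched very simply. This is a deliberately simplified version of my own devising; the actual method in the Improved Techniques for Training GANs paper projects features through a learned tensor, whereas here we just append each sample's mean L1 distance to the rest of the batch, a quantity that collapses to zero exactly when the generator collapses to a single point.

```python
import numpy as np

# Simplified minibatch feature: for each sample in a batch, its mean L1
# distance to the other samples. Near zero when the batch has collapsed
# to one point, so the discriminator can use it to detect mode collapse.

def minibatch_feature(batch):
    # batch: (n, d) array of per-sample feature vectors.
    diffs = np.abs(batch[:, None, :] - batch[None, :, :]).sum(axis=2)  # (n, n)
    n = batch.shape[0]
    return diffs.sum(axis=1) / (n - 1)  # mean distance to the other samples

rng = np.random.default_rng(0)
diverse = rng.normal(size=(8, 4))                      # healthy, spread-out batch
collapsed = np.tile(rng.normal(size=(1, 4)), (8, 1))   # mode-collapsed batch

print(minibatch_feature(diverse).mean() > 1e-3)  # True: large pairwise distances
print(minibatch_feature(collapsed).mean())       # 0.0: all samples identical
```

A discriminator given this extra feature can tell a collapsed minibatch from a diverse one even when each individual sample looks realistic on its own.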
We do occasionally get rich class structure. If I show you some very cherry-picked examples, we're able to make lots of different pictures of things like dogs, spiders, koalas, bears, birds and so on. We still see a lot of problems with the model, though. In particular, we often see problems with counting. We think this might have something to do with the architecture of our convolutional network: it's able to test whether a feature is absent or present, but it doesn't necessarily test how many times that feature occurs. So we see things like this giraffe head with four eyes, this dog with something like six legs, this kind of three-headed monkey, stacks of puppies rather than a single puppy, or a cat with one and a half faces.

We also often see problems with perspective, where the model generates images that are extremely flat. In particular, the image on the lower left looks to me like somebody skinned a dog, like a bearskin rug, and then took a picture with the camera looking straight down at it on the ground, while the picture in the lower middle looks to me literally like a Cubist painting; in the Cubism movement, artists intentionally removed all the perspective from an image and rearranged the object to show us different parts from different angles, while representing the entire thing as flat. In many cases we see images that are really quite nice but have some problem with global structure. A lot of the time this consists of images of animals where we don't actually get to see their legs: they have a head and torso attached, but the legs are never completed anywhere. And in my particular favorite generator sample so far, on the lower left, we have an image that we've actually named Fallout Cow, where we have an animal that is both quadrupedal and bipedal: it has legs, and it has the right number of them, but it has two different bodies.

So what are some things you can do with generative adversarial networks? There are just so many
different things that it's a little bit hard to show all of them; I showed a lot more in my tutorial, but I can show you just a few really quickly. One thing that came out recently is image-to-image translation. This is from the research group at Berkeley, and the basic idea is to take a conditional generative adversarial network and map from one domain to another. It can do things like take images that say, for every pixel, what kind of class should appear at that pixel, and turn that into a photorealistic scene with the desired objects in the desired positions. It can also take an aerial photo and turn it into a map, or take a sketch of an object and turn it into a photo of the object.

More recently, there was a very exciting result that finally developed the ability to generate realistic samples on the ImageNet dataset, from all 1,000 classes and with really good diversity. This result is called plug-and-play generative models, and it combines many different approaches to generative modeling, including generative adversarial nets, moment matching, denoising autoencoders, and Langevin sampling. The results are really excellent, and we see lots of different, very recognizable, high-quality images, with all the right numbers of legs and everything. So generative modeling has really come very far in just the last month, actually, and generative adversarial nets are part of that progress.

I have a few comments about exactly what it is that allows generative adversarial networks to work well, on kind of an intuitive level. One of the main things that's really different about generative adversarial nets, compared to other approaches to machine learning, is that they give a very nice way of telling the model that there are multiple correct answers. A lot of the time with supervised learning, we use something like mean squared error to tell the model what its output should have been. On the left, I show a little bit about what's wrong with the mean squared error
training process. Suppose we have the blue dot at the bottom of the slide representing some input, and we'd like to learn to map this input to some desired output. Suppose all the different green dots represent different possible outputs that are all valid. Suppose the label we had in the training set for this particular blue dot was the green dot on the far left, but the model produced the green dot on the far right. Mean squared error will induce the red error arrow, saying that instead of producing the dot on the right, the model should have produced the dot on the left, the one that appears in the training set. That means that over time, the blue dot will actually get mapped to something more like the mean of all these different green dots, and that causes us to learn things like blurry images when we try to predict the different images associated with some input.

Generative adversarial networks don't directly use a pair of inputs and outputs to tell the model what it should do. Instead, the discriminator learns how inputs and outputs can be paired, and then the discriminator tells the model whether it did a good job or not. So the discriminator would ideally learn that all of the different green dots are possible options, and when the generator produces the green dot on the right, the discriminator says that was a good thing to do. There are many different good things the model can do, and the discriminator will hopefully endorse all of them. So we now have a mechanism for saying that many different outputs are possible, instead of always steering the model toward one predefined answer.

We can see this especially in the context of next-video-frame prediction. A paper by Bill Lotter and his collaborators shows what happens when we use a few different kinds of models to predict the next frame in a video. On the left I show the ground truth, where we have a 3D-rendered image of a person's head.
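The mean-of-the-green-dots effect can be checked numerically. This is my own toy illustration, not an experiment from the talk: if the same input is labeled +1 half the time and -1 half the time, the prediction that minimizes mean squared error is their mean, 0, an output that matches neither valid answer, which is the one-dimensional analogue of a blurry image.

```python
import numpy as np

# Two equally valid targets, +1 and -1, for the same input.
targets = np.array([1.0, -1.0, 1.0, -1.0])

def mse(pred):
    # Mean squared error of a constant prediction against all targets.
    return np.mean((targets - pred) ** 2)

# Search over candidate constant predictions.
preds = np.linspace(-1.5, 1.5, 301)
best = preds[np.argmin([mse(p) for p in preds])]

print(abs(best) < 1e-6)     # True: the MSE-optimal prediction is the mean, 0
print(mse(1.0) > mse(0.0))  # True: committing to one valid answer costs more
```

Under MSE, confidently producing either valid answer is penalized more than hedging at the mean, which is exactly why an adversarial loss, which rewards any output the discriminator accepts, produces sharper results.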
You can see that the image is very sharp and has a clearly visible ear. Using a model that was trained with mean squared error, in the image in the middle, we see that the ear has vanished: the exact location of the ear is not especially predictable, and when the model averages over the many different possible places the ear could go, it vanishes. Similarly, the eye has become blurry. In the image on the right, we see what happens when an adversarial loss is included in the training process. In this case, the model is encouraged to produce samples that actually look realistic, and it knows that there are multiple different answers that are all possible, so it has been able to choose one of the many different sharp images that could happen at the next time step.

It's also worth thinking about what adversarial training looks like for people. Really, the way the idea of adversarial training emerged was that economists and other researchers in those kinds of fields were already thinking about how multiple agents acting in a market have their behavior influenced by the process of optimizing their own payoffs while all the other players optimize theirs. Some things I think could be interesting to look at from a machine learning point of view are whether cycles in markets can be explained by the failure of optimization algorithms to converge. If we have trouble fitting generative adversarial nets even when we have complete information, just think about how hard it is to choose prices for goods when there are many more actors and you don't have complete information about the market. I'm sure this has been studied to some extent, but I think that bringing the economics and machine learning people together could turn up some interesting ideas we didn't already know about. We've also seen lots of cases of things like auctions that are designed to make sure people pay the right price, and that's more or less what I was thinking of when I designed
generative adversarial nets.

One last remark: if we think about the way people learn, researchers like Ericsson have shown that the way to become really good at any particular task is to practice it a lot, but also to do deliberate practice. You don't just put in a lot of hours; you specifically choose subtasks, within the skill you're trying to master, that are especially difficult for you, and you get feedback from an expert who coaches you. You can think of adversarial training as capturing both of these aspects of developing a skill: rather than just training on lots and lots of training examples, you're training on the worst-case inputs that are really hard for the model, and in the case of generative adversarial networks you have an expert, the discriminator, coaching the generator on what it should have done instead. So a lot of insights from human psychology and human learning are actually telling us how we can make machine learning more effective.

In conclusion: adversarial training is a way of training a variety of models, in different ways that all involve working on a worst-case input. Generative adversarial nets are one of the most popular members of this framework; they're based on using an estimate of a ratio of densities to do unsupervised learning, and part of why they work so well is that they allow the model to have multiple correct answers. They also draw on a lot of the ideas that help humans learn really well. I'm almost out of time, but I think we might be able to have a few questions, if that's okay with the organizers.

[Moderator] We have several microphones on the sides. Okay, yes, first question over here.

[Audience] For sequential things like video, do you know of any work using a recurrent network that can actually generate a sequence of data, to generate a video, and what kind of challenge would that be?

[Goodfellow] Yeah, there is a paper about generating videos with generative adversarial networks; I forget the
exact title off the top of my head. There's also a paper here at this workshop today called Unrolled Generative Adversarial Networks, and I know that one of their experiments involves using a recurrent network to generate MNIST one pixel at a time, so you could check out their spotlight and poster.

[Audience] And how about generating language, a sequence of words in a discrete domain?

[Goodfellow] The thing that's difficult about that is the discrete outputs, which mean that the generator is not differentiable, so that's an open research area. It might be solvable using things like the REINFORCE algorithm, to do policy gradient on the parameters, or using things like Gumbel-softmax or the concrete distribution, or it might be possible to generate word embeddings from the generator and then decode them to discrete values, instead of generating the discrete values directly.

[Audience] And how about speech, which is continuous? We know that Google's WaveNet uses a different mechanism for generating speech; can this GAN framework be used equally well?

[Goodfellow] Yeah, so before I left Google I suggested that they try generating continuous waveforms with a GAN, and I don't actually know if they tried and a GAN didn't work, or if they just went straight to using PixelRNN-type methods to generate the continuous waveform. But I do think that the continuous waveform is the way to go with GANs, if that does work. The advantage GANs would have over WaveNet is that a GAN can generate the sample much faster: WaveNet needs to pass through a neural net for every single sample of the audio, so it's generating thousands of samples using thousands of passes through the network, and it takes about two minutes to generate one second of audio, whereas a GAN could generate a long waveform in one shot.

[Audience] It's interesting to see that WaveNet didn't really have the blurring effect you showed earlier from using a mean-squared-error-style criterion for images. Do you have an explanation for why its output is not blurred?
[Goodfellow] So the blurring effect is with mean squared error in real-valued spaces, and they were using discrete spaces, where they have a softmax distribution. Their loss function is not actually mean squared error; it's a categorical cross-entropy.

[Moderator] Okay, let's ask another question; you can check with him later. Any other questions? Okay, thank you.
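As a side note on the Gumbel-softmax (concrete) relaxation mentioned in the Q&A as one route to gradients through discrete outputs: the construction below is a minimal sketch of my own, not code from the talk. Sampling argmax(log-probabilities + Gumbel noise) gives an exact categorical sample; replacing the argmax with a temperature-controlled softmax gives a differentiable approximation that sharpens toward one-hot as the temperature goes to zero.

```python
import numpy as np

# Gumbel-softmax sketch: a differentiable relaxation of categorical sampling.

def gumbel_softmax(logits, tau, rng):
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())      # stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.2, 0.7]))

soft = gumbel_softmax(logits, tau=1.0, rng=rng)    # smooth, differentiable sample
hard = gumbel_softmax(logits, tau=0.001, rng=rng)  # nearly one-hot

print(abs(soft.sum() - 1.0) < 1e-9)  # True: a valid probability vector
```

In a GAN over discrete tokens, the generator would emit these relaxed samples during training so that gradients can flow back through the sampling step, annealing the temperature as training proceeds.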
Info
Channel: Preserve Knowledge
Views: 113,192
Rating: 4.9793134 out of 5
Id: 9JpdAg6uMXs
Length: 31min 25sec (1885 seconds)
Published: Thu Aug 24 2017