Ian Goodfellow: Generative Adversarial Networks (NIPS 2016 tutorial)

Captions
All right, so welcome to the third tutorial session. This one's on generative adversarial networks, so it is actually my great pleasure to introduce Dr. Ian Goodfellow. He did a master's and bachelor's at Stanford University, finishing there in 2009, at which point he moved to the University of Montreal, where he did a PhD with Yoshua Bengio and I, and after that he moved to the Google Brain group that same year, and after that he moved, just recently, earlier this year, to OpenAI, where he currently is. So I think that Ian is quite simply one of the most creative and influential researchers in our community today, and the fact that we have a room full of people ready to hear about a topic, GANs, or generative adversarial networks, that he invented two years ago, in a bar in Montreal I might add, is testament to that. So, without further ado, I give you Ian Goodfellow. Oh, I forgot to mention: he's requested that we have questions throughout, so if you actually have a question, just go to the mic and he'll maybe stop and try to answer your question. I'll try not to do that again.

Thank you very much for the introduction, Aaron, and thank you everybody for coming today. Let me tell you a little bit about the format here. Despite the size of the event, I'd still like it to be a little bit interactive, and let you feel like you can make the tutorial what you want it to be for yourself. I believe a lot that the tutorial should be a chance for you to get some hands-on experience and to feel like you're building your own mastery of this subject, so I've included three exercises that will appear throughout the presentation. Every time there's an exercise, you can choose whether you want to work on it or not. I'll give a little five-minute break, since I know it's hard to pay attention to a presentation for two hours straight, and if you'd like to work through the exercise, you can work through it; otherwise, just take a break and chat with your neighbors.

The basic topic of today's tutorial is really generative modeling in general. It's impossible to describe generative adversarial networks without contrasting them with some of the other approaches and describing some of the overall goals in this area that we're working on. The basic idea of generative modeling is to take a collection of training examples and form some representation of a probability distribution that explains where those training examples came from. There are two basic things that you can do with a generative model. One is that you can take a collection of points and infer a density function that describes the probability distribution that generated them; I show that in the upper row of this slide, where I have taken several points on a one-dimensional number line and fitted a Gaussian density to them. That's what we usually think of when we describe generative modeling, but there's another way that you can build a generative model, which is to take a machine that observes many samples from a distribution and then is able to create more samples from that same distribution. Generative adversarial networks primarily lie in the second category: what we want to do is simply generate more samples, rather than find the density function.
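To make the two modes concrete, here is a minimal sketch of my own (not from the talk): the first function fits an explicit Gaussian density to one-dimensional points by maximum likelihood, while the second just produces more samples without ever exposing a density function to the user.

```python
# A minimal sketch (not from the talk) contrasting the two modes of
# generative modeling on a 1-D dataset: estimating an explicit density
# versus building a machine that simply produces more samples.
import numpy as np

data = np.array([-1.2, -0.9, 0.1, 0.8, 1.1])   # hypothetical training points

# Mode 1: infer a density function (here, a Gaussian fit by maximum likelihood).
mu, sigma = data.mean(), data.std()
def p_model(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Mode 2: a sampler that produces new points from (approximately) the same
# distribution, without exposing a density function.
def sample(n):
    return np.random.normal(mu, sigma, size=n)

print(p_model(0.0), sample(3))
```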
As a brief outline of the presentation today, I'm first going to describe why we should study generative modeling at all; it might seem a little bit silly to just make more images when we already have millions of images lying around. Next, I'll describe how generative models work in general, and situate generative adversarial networks among the family of generative models, explaining exactly what is different about them and other approaches. Then I'll describe in detail how generative adversarial networks work, and I'll move on to special tips and tricks that practitioners have developed, which are less theoretically motivated but seem to work well in practice. Then I'll describe some research frontiers, and I'll conclude by describing the latest state of the art in generative modeling, which combines generative adversarial networks with other methods.

So the first section of this presentation is about why we should study generative models at all. Most of the time in machine learning, we use models that take an input and map that input to a single output. That's really great for things like looking at an image and saying what kind of object is in that image, or looking at a sentence and saying whether that sentence is positive or negative. Why exactly would you want to learn a distribution over different training examples? Well, first off, high-dimensional probability distributions are an important object in many branches of engineering and applied math, and this exercises our ability to manipulate them. But more concretely, there are several ways that we could imagine using generative models once we have perfected them.

One is that we could use the generative model to simulate possible futures for reinforcement learning. There are at least two different ways that you could use this. One is that you could train your agent in a simulated environment that's built entirely by the generative model, rather than needing to build an environment by hand. The advantage of using this simulated environment over the real world is that it could be more easily parallelized across many machines, and the mistakes in this environment are not as costly as if you actually make a mistake in the physical world and do real harm. Similarly, an agent that is able to imagine future states of the world using a generative model can plan for the future, by simulating many different plans that it could execute and testing which of them works out as well as possible. There's a paper on that subject, with Chelsea Finn as the first author, where we evaluated generative models on the robot pushing dataset, to start working toward this goal of using generative models to plan actions.

Another major use of generative models is that they are able to handle missing data much more effectively than the standard input-to-output mappings of machine learning models that we usually use. Generative models are able to fill in missing inputs, and they're also able to learn when some of the labels in the dataset are missing. Semi-supervised learning is a particularly useful application of generative modeling, where we may have very few labeled inputs, but by leveraging many more unlabeled examples, we are able to obtain very good error rates on the test set.

Many other tasks also intrinsically require that we use multimodal outputs: rather than mapping one input to a single output, there are many possible outputs, and the model needs to capture all of them. And finally, there are several tasks that just plain require realistic generation of images or audio waveforms as the actual specification of the task itself, and these clearly require generative modeling intrinsically.

One example of a task that requires multimodal outputs is predicting the next frame in a video: because there are many different things that can happen in the next time step, there are many different frames that can appear in a sequence after the current image.
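A toy illustration of my own (not from the talk) of the averaging problem described next: if two sharp next frames are equally likely, the single prediction with the lowest expected mean squared error is their blurry average.

```python
# Why mean-squared-error training blurs multimodal predictions: among these
# candidates, the average of the two sharp outcomes scores best under MSE.
import numpy as np

frame_a = np.array([1.0, 0.0])   # hypothetical sharp outcome A
frame_b = np.array([0.0, 1.0])   # hypothetical sharp outcome B

for c in [frame_a, frame_b, (frame_a + frame_b) / 2]:
    expected_mse = 0.5 * np.sum((c - frame_a) ** 2) + 0.5 * np.sum((c - frame_b) ** 2)
    print(c, expected_mse)       # the blurry average wins under MSE
```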
Because there are so many different things that can happen, traditional approaches for predicting the next video frame often become very blurry: when they try to represent the distribution over the next frame using a single image, many different possible next-frame images are averaged together and result in a blurry mess. I'm showing here some images from a paper by William Lotter and his collaborators that was published earlier this year. On the left, I show you the ground truth image, the image that should be predicted next in a video of a 3D rendering of a rotating head. In the middle, I show you the image that is predicted when we take a traditional model that is trained using mean squared error. Because this mean squared error model is predicting many different possible futures, and then averaging them together to hedge its bets, we end up with a blurry image where the eyes are not particularly crisply defined. Small variations in the amount that the head rotates can place the eyes in very different positions, and when we average all those different positions together, we get a blurry image of the eyes. Likewise, the ears on this person's head have more or less disappeared. On the right, I show you what happens when we bring in a more generative-modeling type of approach, and in particular when we use an adversarial loss to train the model. In the image on the right, the model has successfully predicted the presence of the ear, and has successfully drawn a crisp image of the eyes, with dark pixels in that area and sharp edges on the features of the eyes.

Another task that intrinsically requires being able to generate good data is super-resolution of images. In this example, we begin with the original image on the left, and then, not pictured, we downsample that image to about half its original resolution. We then show several different ways of reconstructing the high-resolution version of the image. If we just use the bicubic interpolation method, a hand-designed mathematical formula for what the pixels ought to be based on sampling theory, we get a relatively blurry image; that's shown second from the left. The remaining two images show different ways of using machine learning to actually learn to create high-resolution images that look like the data distribution. So here the model is actually able to use its knowledge of what high-resolution images look like to provide details that have been lost in the downsampling process. The new high-resolution image may not be perfectly accurate, and may not perfectly agree with reality, but it at least looks like something that is plausible and is visually pleasing.

There are many different applications that involve interaction between a human being and an image generation process. One of these is a collaboration between Berkeley and Adobe called iGAN, where the "i" stands for interactive. The basic idea of iGAN is that it assists a human to create artwork. The human artist draws a few squiggly green lines, and then a generative model is used to search over the space of possible images that resemble what the human has begun to draw. Even though the human doesn't have much artistic ability, they can draw a simple black triangle and it will be turned into a photo-quality mountain. This is such a popular area that there have actually been two papers on this subject that came out just in the last few months. Introspective adversarial networks also offer this ability to provide interactive photo editing, and have demonstrated their results mostly in the context of editing faces.
So the same idea still applies: a human can begin editing a photo, and the generative model will automatically update the photo to keep it appearing realistic, even though the human is making very poorly controlled mouse movements that are not nearly as fine as would be needed to make nice photorealistic details.

There's also just a long tail of different applications that require generating really good images. A recent paper called image-to-image translation shows how conditional generative adversarial networks can be trained to implement many of these multimodal output distributions, where an input can be mapped to many different possible outputs. One example is taking sketches and turning them into photos. In this case, it's very easy to train the model, because photos can be converted to sketches just by using an edge extractor, and that provides a very large training set for the mapping from sketch to image. Essentially, in this case, the generative model learns to invert the edge detection process, even though the inverse has many possible outputs that correspond to the same input. The same kind of model can also convert aerial photographs into maps, and can take descriptions of scenes, in terms of which object category should appear at each pixel, and turn them into photorealistic images. So these are all several different reasons that we might want to study generative models, ranging from the different kinds of mathematical abilities they force us to develop, to the many different applications that we can carry out once we have these kinds of models.

So next you might wonder how exactly generative models work, and in particular how generative adversarial networks compare, in terms of the way that they work, to other models. It's easiest to compare many different models if I describe all of them as performing maximum likelihood. There are in fact other approaches to generative modeling besides maximum likelihood, but for the purpose of making a nice crisp comparison of several different models, I'm going to pretend that they all do maximum likelihood for the moment. The basic idea of maximum likelihood is that we write down a density function that the model describes, which I represent with p_model(x). Here x is a vector describing the input, and p_model(x) is a distribution, controlled by parameters theta, that describes exactly where the data concentrates and where it is spread more thinly. Maximum likelihood consists in measuring the log probability that this density function assigns to all the training data points, and adjusting the parameters theta to increase that probability. The way that different models go about accomplishing this is what makes the models different from each other.
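In symbols (my transcription, using the standard convention, with m training examples), the maximum likelihood principle just described is:

```latex
\theta^{*} \;=\; \arg\max_{\theta}\; \mathbb{E}_{x \sim p_{\text{data}}} \log p_{\text{model}}(x;\theta)
\;\approx\; \arg\max_{\theta}\; \frac{1}{m} \sum_{i=1}^{m} \log p_{\text{model}}\!\left(x^{(i)};\theta\right)
```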
So, among all the different models that can be described as implementing maximum likelihood, we can draw a family tree, where the first place this tree forks is where we ask whether the model represents the data with an explicit density function or not. When we have an explicit density function, it looks exactly like what I showed on the previous slide: we actually write down a function p_model, and we're able to evaluate log p_model and increase it on the training data. Within the family of models that have an explicit density, we may then ask whether that density function is actually tractable or not. When we want to model very complicated distributions, like the distribution over natural images or the distribution over speech waveforms, it can be challenging to design a parametric function that is able to capture the distribution efficiently, and this means that many of the densities we have studied are not actually tractable. However, with careful design, it has been possible to design a few different density functions that actually are tractable; that's the family of models like PixelRNN, PixelCNN, and other fully visible belief networks like NADE and MADE. The other major family of distributions with a tractable density is the nonlinear ICA family. This family of models is based on taking a simple distribution, like a Gaussian distribution, and then using a nonlinear transformation of samples from that distribution to warp the samples into the space that we care about. If we're able to measure the determinant of the Jacobian of that transformation, we can determine the density in the new space that results from that warping.

Within the family of models that use an explicit density, the other set of approaches is those that do not have a tractable density function. There are two basic approaches within this family. One of these is the model family that approximates an intractable density function by placing a lower bound on the log-likelihood and then maximizing that lower bound. Another approach is to use a Markov chain to make an estimate of the density function or of its gradient. Both of these families incur some disadvantages from the approximations that they use.

Finally, we may give up altogether on having an explicit density function, and instead represent the density function implicitly; this is the rightmost branch of the tree. One of the main ways that you can implicitly represent a probability distribution is to design a procedure that can draw samples from that probability distribution, even if we don't necessarily know the density function. If we draw samples using a Markov chain, that gives us one family of models, of which the main example is the generative stochastic network. And then finally, if we would like to draw samples directly, we have models like generative adversarial networks; generative adversarial networks and generative moment matching networks are both examples of models that can draw samples directly but don't necessarily represent a density function.

So now let's look at each of these in a little bit more detail, and describe exactly what their advantages and disadvantages are, and why you might want to be in one branch of the tree or another. First, fully visible belief networks are the most mathematically straightforward. They use the chain rule of probability to decompose the probability distribution over a vector into a product over each of the members of the vector: we write down a probability distribution over x_1, and then we multiply that by the distribution over x_2 given x_1, and then x_3 given x_1 and x_2, and so on, until we finally have a distribution over the final member of the vector given all of the other members of the vector. This goes back to a paper by Brendan Frey in 1996, but has had several other advancements in the meantime. The current most popular member of this model family is the PixelCNN, and I show here some samples of elephants that it generated. The primary disadvantage of this approach is that generating a sample is very slow: each time we want to sample a different x_i from the vector x, we need to run the model again, and these n different times that we run the model cannot be parallelized. Each of these operations of sampling another x_i depends on all of the earlier x_i values, and that means that there's really no choice but to schedule them one after another, regardless of how much bandwidth we have available.
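A minimal sketch (assumptions mine, with a made-up conditional) of why this sampling is inherently sequential: each x_i is drawn from p(x_i | x_1, ..., x_{i-1}), so the n model evaluations cannot run in parallel.

```python
# Ancestral sampling from an autoregressive (fully visible belief network)
# style model: one model evaluation per dimension, strictly in order.
import numpy as np

def conditional(prefix):
    """Hypothetical stand-in for a learned p(x_i = 1 | x_1, ..., x_{i-1})."""
    return 1.0 / (2.0 + sum(prefix))    # any function of the prefix works here

def ancestral_sample(n):
    x = []
    for i in range(n):                  # cannot be parallelized:
        p = conditional(x)              # each step needs all earlier values
        x.append(np.random.rand() < p)
    return x

print(ancestral_sample(8))
```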
One other, smaller drawback is that the generation process is not guided by a latent code. Many of the other models that we study have a latent code that we can sample first, which describes the entire vector to be generated, and then the rest of the process involves translating that vector into something that lies in the data space. That allows us to do things like have embeddings that are useful for semi-supervised learning, or generate samples that have particular properties that we're interested in. Fully visible belief networks don't do this out of the box, but there are different extensions of them that can enable these abilities.

One very recent example of a fully visible belief net is WaveNet, and it shows both some of the advantages and some of the disadvantages of these fully visible belief networks. First, because the optimization process is very straightforward (it's just minimizing a cost function, with no approximation to that cost function), it's very effective, and it generates really amazing samples. But the disadvantage is that the sample generation is very slow; in particular, it takes about two minutes to generate one second of audio, and that means that, barring some major improvement in the way that we're able to run the model, it's not going to be usable for interactive dialogue any time soon, even though it is able to generate very good, lifelike audio waveforms.

The other major family of explicit, tractable density models is the family based on the change of variables, where we begin with a simple distribution, like a Gaussian, and we use a nonlinear function to transform that distribution into another space: we transform from a latent space to, on this slide, the space of natural images. The main drawback to this approach is that the transformation must be carefully designed to be invertible and to have a tractable Jacobian, and in fact a tractable determinant of the Jacobian. In particular, this requirement says that the latent variables must have the same dimensionality as the data space: if we want to generate 3,000 pixels, we need to have 3,000 latent variables. That makes it harder to design the model to have exactly the capacity that we would like it to have.

Another major family of models is those that have intractable density functions but then use tractable approximations to those density functions. Currently, one of the most popular members of this family is the variational autoencoder. The basic idea is to write down a density function, log p(x), where the density is intractable because we need to marginalize out a random variable z. Here z is a vector of latent variables that provides a hidden code describing the input image, and because the process of marginalizing these variables out, to recover simply the distribution over x, is intractable, we're forced to use a variational approximation instead. This variational approximation introduces a distribution q over the latent variables z, and to the extent that this distribution q is closer to the true posterior over the latent variables, we're able to make a bound that becomes tighter and tighter and does a better job of lower-bounding the true density. Unfortunately, this model is only asymptotically consistent if this q distribution is perfect; otherwise, there's a gap between the lower bound and the actual density, so even if the optimizer is perfect, and even if we have infinite training data, we are not able to recover exactly the distribution that was used to generate the data.
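The bound being described can be written as follows (my transcription; in practice q is usually also conditioned on x, as q(z | x), and H denotes entropy):

```latex
\log p(x) \;\ge\; \log p(x) - D_{\mathrm{KL}}\big(q(z)\,\|\,p(z \mid x)\big)
\;=\; \mathbb{E}_{z \sim q}\,\log p(x, z) + H(q)
```

The gap is exactly the KL term, which vanishes only when q matches the true posterior; that is the asymptotic-consistency caveat just mentioned.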
In practice, we observe that variational autoencoders are very good at obtaining high likelihood, but they tend to produce lower-quality samples; in particular, the samples are often relatively blurry.

Another major family of models is the Boltzmann machine. These also have an explicit density function that is not actually tractable. In this case, the Boltzmann machine is defined by an energy function, and the probability of a particular state is proportional to e to the negative energy of that state. In order to convert this to an actual probability distribution, it's standard to renormalize, by dividing by the sum over all the different states, and that sum becomes intractable. We're able to approximate it using Monte Carlo methods, but those Monte Carlo methods often suffer from problems like failing to mix between different modes, and in general Monte Carlo methods, especially Markov chain Monte Carlo methods, perform very poorly in high-dimensional spaces, because the Markov chains break down for very large images. We don't really see Boltzmann machines applied to tasks like modeling ImageNet images; they perform very well on small datasets like MNIST, but have never really scaled beyond that.

All of these different observations about the other members of the family tree bring us to generative adversarial networks, and explain the design requirements that I had in mind when I thought of this model. First, they use a latent code that describes everything that's generated later; they have this property in common with other models, like variational autoencoders and Boltzmann machines, but it's an advantage that they have over fully visible belief networks. They're also asymptotically consistent: if you're able to find the equilibrium point of the game defining a generative adversarial network, you're guaranteed that you've actually recovered the true distribution that generates the data, modulo sample complexity issues; so if you have infinite training data, you do eventually recover the correct distribution. There are no Markov chains needed, neither to train the generative adversarial network nor to draw samples from it, and I felt like that was an important requirement, based on the way that Markov chains had seemed to hold back restricted Boltzmann machines. Today we've started to see some models that use Markov chains more successfully, and I'll describe those later in the talk, but that was one of my primary motivations for designing this particular model family. Finally, a major advantage of generative adversarial networks is that they are often regarded as producing the best samples, compared to other models like variational autoencoders. In the past few months we've started to see other models, like PixelCNNs, competing with them, and it's now somewhat difficult to say which is the best, because we don't have a good way of quantifying exactly how good a set of samples is.

That concludes my description of the different families of generative models, how they relate to each other, and how generative adversarial networks are situated in this family, so I'll move on to describing exactly how generative adversarial networks actually work. The basic framework is that we have two different models, and they're adversaries of each other in the sense of game theory: there's a game that has well-defined payoff functions, and each of the two players tries to determine how it can get the most payoff possible within this game.
There are two different networks. One of them is called the generator, and it is the primary model that we're interested in learning: the generator is the model that actually generates samples intended to resemble those that were in the training distribution. The other model is the discriminator. The discriminator is not really necessary after we've finished the training process, at least not in the original development of generative adversarial networks; there are some ways of getting some extra use out of the discriminator, but in the basic setup we can think of the discriminator as a tool that we use during training and that can be discarded as soon as training is over. The role of the discriminator is to inspect a sample and say whether that sample looks real or fake.

So the training process consists of sampling images, or other kinds of data, from the training set, and then running the discriminator on those inputs. The discriminator is any kind of differentiable function that has parameters we can learn with gradient descent, so we usually represent it as a deep neural network, but in principle it could be other kinds of models. When the discriminator is applied to images that come from the training set, its goal is to output a value that is near one, representing a high probability that the input was real rather than fake. But half the time, we also apply the discriminator to examples that are in fact fake. In this case, we begin by sampling the latent vector z; we sample z from the prior distribution over latent variables, so z is essentially a vector of unstructured noise. It's a source of randomness that allows the generator to output a wide variety of different vectors. We then apply the generator to the input vector z. The generator function is a differentiable function that has parameters that can be learned by gradient descent, similar to the discriminator function, and we usually represent the generator as a deep neural network, though once again it could be any other kind of model that satisfies those differentiability properties. After we have applied G to z, we obtain a sample from the model, and ideally this will resemble actual samples from the dataset, though early in learning it will not. After we've obtained that sample, we apply the discriminator function D again, and this time the goal of the discriminator is to output a value D(G(z)) that is near... I'm sorry, I realize there's a mistake on the slide; actually it's backwards: the discriminator wants to make the value in this case be near zero, and the generator would like to make it be near one. So the discriminator would like to reject these samples as being fake, while the generator would like to fool the discriminator into thinking that they're real.

You can think of the generator and the discriminator as being a little bit like counterfeiters and police. The counterfeiters are trying to make money that looks realistic, and the police are trying to correctly identify counterfeit money and reject it, without accidentally rejecting real money. As the two adversaries are forced to compete against each other, the counterfeiters must become better and better if they want to fool the police, and eventually they're forced to make counterfeit money that is identical to real money. Similarly, in this framework, the generator must eventually learn to make samples that come from the distribution that generated the data.

So let's look at the generator network in a little bit more detail.
We can think of the generator network as a very simple graphical model, shown on the left. There's a vector of latent variables z and a vector of observed variables x, and depending on the model architecture, we usually have every member of x depend on every member of z. So I've drawn this as just a simple vector-valued model where we see one edge; you could also imagine expanding it into a graph of scalar variables, where it would be a bipartite directed graph. The main reason that generative adversarial networks are relatively simple to train is that we never actually try to infer the probability distribution over z given x; instead, we sample values of z from the prior, and then we sample values of x from p(x given z). Because that's ancestral sampling in a directed graphical model, it's very efficient; in particular, we accomplish this ancestral sampling by applying the function G to the input variable z.

One of the very nice things about the generative adversarial networks framework is that there are not really any requirements on G other than differentiability. Unlike nonlinear ICA, there is no requirement that z have the same dimension as x, for example. And Boltzmann machines require energy functions that are tractable and have different tractable conditional distributions; here we don't need to be careful to design a model whose conditionals are all tractable, because we only really need to make one conditional distribution tractable. There are a few properties that we'd like to be able to guarantee that impose a few extra requirements on G. In particular, if we want to be sure that we're able to recover the training distribution, we need to make sure that x has a higher dimension than z, or at least an equal dimension; this is just to make sure that we aren't forced to represent only a low-dimensional manifold in x space. An interesting thing is that it's actually possible to train the generator network even if we don't provide support across all of x space: if we make z lower-dimensional than x, then we obtain a low-dimensional manifold that assigns no probability whatsoever to most points in x space, but we're still able to train the model using the discriminator as a guide. That's kind of an unusual quirk that sets this framework apart from the methods based on maximizing a density function; those would break if we evaluated the logarithm of a zero density.

So the training procedure is to choose an optimization algorithm (you can pick your favorite one; I usually like to use Adam these days), and then repeatedly sample two different minibatches of data. One of these is a minibatch of training examples that you draw from the dataset, and the other minibatch is a set of input values z that we sample from the prior and then feed to the generator. We then run gradient descent on both of the players' costs simultaneously. In one optional variant, we can also run the update for the discriminator more often than we run the update for the generator; I personally usually just use one update for each player. Each player has its own cost, and the choice of the cost determines exactly how the training algorithm proceeds. There are many different ways of specifying the cost. The simplest one is to use a minimax game, where we have a cost function J superscript (D) defining the cost for the discriminator, and then the cost for the generator is just the negative of the cost for the discriminator.
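Here is a minimal PyTorch sketch of my own (architecture, data distribution, and hyperparameters all assumed) of the procedure just described: sample two minibatches, then take simultaneous gradient steps on each player's cost. The generator here uses the minimax cost, the negative of the discriminator's cost; the talk later recommends a non-saturating variant.

```python
# Simultaneous gradient descent on a toy GAN (numerical stabilization and
# architecture choices omitted or simplified for clarity).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))       # z -> x
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)

def real_batch(n):                       # hypothetical data distribution
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, 0.0])

for step in range(1000):
    x = real_batch(128)                  # minibatch from the training set
    z = torch.randn(128, 10)             # minibatch from the prior over z

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0.
    d_loss = -(torch.log(D(x)).mean() + torch.log(1 - D(G(z).detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (minimax version): minimize log(1 - D(G(z))).
    g_loss = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```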
So you can think of this as having a single value that the discriminator is trying to maximize and the generator is trying to minimize. What exactly is this value that the two players are fighting over? It's simply the cross-entropy between the discriminator's predictions and the correct labels in the binary classification task of discriminating real data from fake data. So we have one term where we feed in data, and the discriminator tries to maximize the log probability of assigning one to the data, and then we have another term where the discriminator aims to maximize the log probability of assigning zero to the fake samples. When we look for an equilibrium point of a game, it's different from minimizing a function: we're actually looking for a saddle point of J superscript (D), and if we're able to successfully find this saddle point, the whole procedure resembles minimizing the Jensen-Shannon divergence between the data and the distribution represented by the model.

So as our first exercise, which will be accompanied by a little five-minute break, we're going to study what the discriminator does when it plays this game. At the top of the slide, I've shown the cost function that the discriminator is going to minimize, and the exercise is to determine the solution for D(x), written in terms of the data distribution and the generator distribution. You'll also find that you need to make a few assumptions in order to get a clean solution to this exercise. So I'll give you about five minutes to work on this exercise, or if you don't want to do the exercise, feel free to talk with your neighbors, or just take a break for a minute, so that you don't need to remain attentive for too many consecutive minutes. I'm also happy to take questions from the mic during this time, if anyone's interested. Yeah, over there.

[Audience member] My question is: what prevents the generator from always generating the same image? You see what I mean? It could just lazily learn to always generate one single realistic image and be fine with this.

Yeah, that's a good question, and it's an important part of ongoing research in generative adversarial networks. Essentially, if we're able to correctly play this minimax game, then the generator is not able to consistently fool the discriminator by always generating the same sample: the discriminator would learn to recognize that individual sample and reject it as being fake. In practice, it's difficult to find a true equilibrium point of this game, and one of the failure modes is actually to generate samples that have too little diversity to them, and because of that, we're having to study ways to improve our ability to find the equilibrium.

[Audience member] Okay, thanks.

Yeah, over here?

[Audience member] So I'm on your left, actually; yeah, here, I'm raising my hand. I'm actually learning a bit about GANs as well, and variational autoencoders, and I see certain resemblances in terms of sampling in this z space. In what cases should I, when generating samples, use a GAN, and in what cases should I use variational autoencoders? Thanks.

If your goal is to obtain a high likelihood, then you would be better off using a variational autoencoder; if your goal is to obtain realistic samples, then you would usually be better off using a generative adversarial network rather than a variational autoencoder. You can kind of see this in the cost functions: the generative adversarial network is designed to fool the discriminator into thinking that its samples are realistic, and the variational autoencoder is designed to maximize the likelihood.
[Audience member] How do you sample from the data? Is it just a uniform distribution, or...?

That's also a really good question, and I think one that is a topic of ongoing research. The naive way of implementing the algorithm, and the one that everyone uses so far, is to sample uniformly from the training data and also to sample uniformly from the z space. But you could imagine that importance sampling could give us big improvements. In particular, most of the points that we train the generator on are wasted, because we're usually going to sample from points that are doing pretty well, and what we'd really like to do is find points that are doing very badly, or maybe points that lie on the boundary between two modes, in order to adjust those boundaries. So you could imagine that a procedure for doing importance sampling, where we visit latent codes that yield more important aspects of the learning process, and then reweight those samples to correct for the bias in the sampling procedure, could actually lead to an improvement.

[Audience member] So I just have one quick question. I'm surprised, well, extremely impressed, by this beautiful algorithm, but one thing that I'm rather confused by is: why don't strange artifacts appear in what's created by the generator? Once a sample is created by the generator, it could have some sort of non-visually-relevant artifact, whether it's a non-smoothness or something else, and then that would just mean the discriminator is set up to just win. Does that make sense?

Yeah, that makes sense. So there are unusual artifacts that appear in samples created by the generator, and in a lot of cases we're fortunate that those artifacts are somewhat compatible with the blind spots of the discriminator. One example: if we use a convolutional generator, the generator is somewhat inclined to produce unusual tile patterns. There's a really good blog post by Augustus Odena, Vincent Dumoulin, and Chris Olah (I'm sorry if I forgot anyone in that list of authors) about the checkerboard patterns that appear when you use deconvolution with large stride in the generator. The good news is that the discriminator is also using convolution, presumably with similar stride, and so it might actually become blind to the same grid patterns that the generator creates. That's maybe not the best answer, but more generally, there are a lot of artifacts that come out of the generator that don't really seem all that relevant to the sample creation process, and the discriminator spends a lot of its time learning to reject patterns that, ideally, it would just never have to encounter in the first place. For example, MNIST is a very simple dataset with just handwritten digits on a background; if you look at the weights that the discriminator learns in the first layer, they often look a little bit like a Fourier basis. So early on in learning, the discriminator is realizing that the generator often makes a lot of high-frequency stuff, and the data doesn't really have that frequency content, and so it is looking at this whole spectrum of different frequencies in order to figure out whether there's too much of different bands present or not. Really, it seems like it would be much better for the generator to go straight to making pen strokes, and for the discriminator to go straight to paying attention to pen strokes, instead of spending all of its time policing exactly how sharp the transitions between neighboring pixels are.
[Audience member] So, I just want to understand this objective function a little bit better. If you fix the generator so that it just does negative sampling... or rather, let me ask: what is the relation between this objective function and a negative sampling approach, the kind that is used with, say, word2vec?

Oh, negative sampling for word2vec; I haven't really thought about that one. One connection to negative sampling is that, when training Boltzmann machines, we generate samples from the model in order to estimate the gradient of the log partition function, and we call that the negative phase. You can think of the generative adversarial network training procedure as being almost entirely negative phase: the generator only really learns from the samples it makes, and that makes it a little bit like when you carve a statue out of marble, where you only ever remove things rather than adding things. It's kind of a unique peculiarity of this particular training process.

So, in the interest of time, I think I should move on to the solution to this exercise, but I'll continue taking more questions, probably most of them at the next exercise break. Okay, yeah, I'll take your question next, when I come to exercise two.

So, the solution to exercise one. As you'll recall, if you were paying attention to the questions rather than to the exercise, we're looking for the optimal discriminator function D(x) in terms of p_data and p_generator. To solve this, it's best to assume that both p_data and p_generator are nonzero everywhere. If we don't make that assumption, then there's this issue that some points in the discriminator's input space might never be sampled during its training process, and those particular inputs would not really have a defined behavior, because they're just never trained. But if you make those relatively weak assumptions, we can then just solve for the functional derivatives, where we regard D(x) as being almost like an infinite-dimensional vector, where every x value indexes a different member of the vector, and we're just solving for a big vector, like we're used to doing with calculus. So in this case, we take the derivative of the cost function with respect to a particular output value D(x), and we set it equal to zero. It's pretty straightforward to take those derivatives, and from there it's straightforward algebra to solve this stationarity condition, and what we get is that the optimal discriminator function is the ratio between p_data(x) and the sum of p_data(x) and p_model(x).

So this is the main mathematical technique that sets generative adversarial networks apart from the other models that I described in the family tree. Some of them use techniques like lower bounds; some of them use techniques like Markov chains. Generative adversarial networks use supervised learning to estimate a ratio of densities, and essentially this is the property that makes them really unique. Supervised learning, in the ideal limit of infinite data and perfect optimization, is able to recover exactly the function that we want, and the way that it breaks down is different from the other approximations: it can suffer from underfitting if the optimizer is not perfect, and it can suffer from overfitting if the training data is limited and it doesn't learn to generalize very well from that training data.
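Written out (my transcription of the derivation just sketched, using the cost with the one-half weights on each term), the discriminator's cost and its stationarity condition are:

```latex
J^{(D)} = -\tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\log D(x)
          -\tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{model}}}\log\big(1 - D(x)\big)
\quad\Longrightarrow\quad
\frac{p_{\text{data}}(x)}{D(x)} = \frac{p_{\text{model}}(x)}{1 - D(x)}
\quad\Longrightarrow\quad
D^{*}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}
```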
So far, I've described everything in terms of a minimax game, where there's a single value function, and one player tries to maximize it and the other player tries to minimize it. We can actually make the game a little bit more complicated, where each player has its own independently parameterized cost. In all the different versions of the game, we pretty much always want the discriminator to use the original version of its cost, where it's just trying to be a good binary classifier, but there are many different things we might consider doing with the generator.

In particular, one really big problem with the minimax game is that, when the discriminator becomes too smart, the gradient for the generator goes away. One of the really nice properties of the cross-entropy loss function that we use to train sigmoid classifiers and softmax classifiers is that, whenever the classifier is making a mistake, whenever it's choosing the wrong class, the gradient is guaranteed to be nonzero: the gradient of the cross-entropy with respect to the logits approaches one as the probability assigned to the correct class approaches zero. So we can never get into a situation where the classifier is unable to learn due to a lack of gradient: either it has gradient and it's making a mistake, or it lacks gradient and it's perfect. The discriminator has this particular property, but unfortunately, if we negate the discriminator's cost, then the generator has the opposite of that property: whenever the generator is failing to fool the discriminator completely, it has no gradient, because the output of the discriminator has saturated. What we can do, instead of flipping the sign of the discriminator's cost, is flip the order of the arguments to the cross-entropy function. Specifically, this means that, rather than trying to minimize the log probability of the correct answer, we have the generator try to maximize the log probability of the wrong answer. Both of these cost functions are monotonically decreasing in the same direction, but they're steep in different places. At this point, it's no longer possible to describe the equilibrium with just a single loss function, and the motivation for this particular cost is far more heuristic: we don't have a good theoretical argument that it places the Nash equilibrium in the right place. But in practice, we see that this cost function behaves similarly to the minimax cost function early in learning, and then later in learning, when the minimax function would start to have trouble with saturation and a lack of gradient, this cost function continues to learn rapidly. So this is the default cost function I usually advocate that most people use, even though it's not quite as theoretically appealing.

Generative adversarial networks did not really scale to very large inputs when my co-authors and I first developed them, and eventually they were scaled to large images using a hand-designed process called LAPGAN, which used a Laplacian pyramid to separate the image into multiple scales and generate each scale independently. But more recently, the way that they are usually used follows an architecture that was introduced in a collaboration between a startup called indico and Facebook AI Research. This architecture is called the DCGAN architecture, for deep convolutional generative adversarial networks. Even in the original paper, generative adversarial networks were deep and convolutional, but this paper placed greater emphasis on having multiple convolutional layers and on using techniques that were invented after the original development of generative adversarial networks, such as batch normalization. In particular, when we generate images, we might wonder exactly what we should do to increase the resolution as we move through a convolutional network. The answer from the DCGAN architecture is just to use a stride greater than one when using the deconvolution operator.
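A minimal sketch (layer sizes assumed by me) of that answer: transposed convolution ("deconvolution") with stride greater than one grows the spatial resolution at each layer.

```python
# Transposed convolutions with stride 2 double the spatial resolution,
# the DCGAN-style way of going from a latent code to a full image.
import torch
import torch.nn as nn

z = torch.randn(1, 100, 1, 1)            # latent code as a 1x1 feature map
up = nn.Sequential(
    nn.ConvTranspose2d(100, 64, kernel_size=4, stride=1, bias=False),   # 1 -> 4
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1, bias=False),     # 4 -> 8
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),                  # 8 -> 16
    nn.Tanh(),
)
print(up(z).shape)   # torch.Size([1, 3, 16, 16]): each stride-2 layer doubles H and W
```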
Another important contribution of the DCGAN paper was to show that it's important to use batch normalization at every layer except the last layer of the generator network; that makes the learning process much more stable. Since then, GANs have been applied to a wide range of large image generation tasks. DCGANs showed that you can generate really good images of bedrooms; in particular, many different datasets that have a small number of output modes work really well with DCGAN-style architectures. So here we can see that we're getting realistic beds, blankets, windows, cabinets, and so on, and that we have quite a variety of different kinds of lighting, and all the different sources of lighting are rendered in a very nice, realistic way.

Another domain where generative adversarial networks work well, because the number of outputs is restricted, is the domain of images of faces. DCGANs were shown to work very well on faces, and in particular the authors showed that the latent code is actually very useful for representing faces. Many of you have probably seen the result that language models with word embeddings can have properties where the word embedding for "queen", if you subtract the word embedding for "female" and add the word embedding for "male", gives a word embedding very close to the word embedding for "king"; so you can actually do algebra in latent space and have it correspond to semantics. The authors of the DCGAN paper showed that generative adversarial networks provide a similar property for images. In particular, if we take the embedding for images of a man with glasses, subtract the embedding for images of a man, and add the embedding for images of a woman, we obtain an embedding that corresponds to images of a woman with glasses. All of the images on this slide were generated by the network; none of them are training data; they all come from decoding different embeddings. So this shows that we're able to do algebra in latent space and have that algebra correspond to semantic properties, just like with language models. But what's even more exciting than the language model case is that we're actually able to decode this latent variable to a rich, high-dimensional image, where all the different thousands of pixels are actually arranged correctly in relation to each other. In the case of language models, we only had to find an embedding that was really close to the embedding for the word "king", but we didn't have to actually map from the embedding to some kind of complicated data space. So here we've shown we can go one step further and actually accomplish that mapping task.
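A toy sketch of my own (all arrays hypothetical) of the latent arithmetic just described; in the DCGAN work, the concept vectors were built by averaging the z codes of a few samples that exhibited each attribute:

```python
# Latent-space vector arithmetic: decode(mean(z_man_glasses) - mean(z_man)
# + mean(z_woman)) should yield a woman with glasses.
import numpy as np

rng = np.random.default_rng(0)
codes_man_glasses = rng.normal(size=(3, 100))  # stand-ins for codes whose
codes_man         = rng.normal(size=(3, 100))  # decodings showed each concept
codes_woman       = rng.normal(size=(3, 100))

z = (codes_man_glasses.mean(axis=0)
     - codes_man.mean(axis=0)
     + codes_woman.mean(axis=0))
# image = G(z)   # decoding this code with the trained generator
print(z.shape)
```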
When we try to understand exactly how generative adversarial networks work, one thing that's important to think about is whether the particular choice of divergence that we minimize is really important. In the past, I and several other people have argued that generative adversarial networks made good samples and obtained bad likelihood because of the divergence that we chose. I no longer believe that, and I'm going to give you an argument now that the divergence doesn't matter, but I will start by explaining why you might think that it should. If we maximize the likelihood of the data, that's equivalent to minimizing the KL divergence between the data distribution and the model distribution; that's shown on the left in this panel. Here the data distribution is represented by the blue curves, where we have a bimodal data distribution for this example, and the model distribution is represented by the dashed green curve. In this particular demonstration, I'm assuming that the model is a Gaussian with a single mode, so it's not able to represent the data distribution correctly. This is what the maximum likelihood solution to this problem would give us: the Gaussian ends up averaging out the two different modes.

The KL divergence is not actually symmetric. Maximum likelihood corresponds to minimizing the KL divergence with the data on the left and the model on the right, but we can actually flip that around: we can minimize the KL divergence with the model on the left and the data on the right, and when we do that, we get a different result, where instead of averaging out the two modes, the model, as shown in the panel on the right, will choose one of the modes. We can think of KL(data, model) as saying that the model should put probability mass everywhere that the data puts probability mass, and we can think of KL(model, data) as saying that the model should not put probability mass anywhere that the data does not put probability mass. On the left, it's really important to have some mass on both peaks; on the right, it's really important to never generate a sample in the valley between the two peaks, because none of the data ever actually occurs there. Both of these are perfectly legitimate approaches to generative modeling, and you can choose one or the other based on whichever task you are doing and what the design requirements for that task are.

The loss that we traditionally use with generative adversarial networks, mostly because it was the thing that popped into my head in a bar, as Aaron mentioned, is pretty similar to the divergence on the right. But since that night, I've realized that it's possible to use other divergences, and several papers by other people have been published on how to use other divergences, and I now no longer think that the choice of divergence explains why we get really good samples and don't get as good a likelihood. So here's how you can actually get maximum likelihood out of a generative adversarial network, where you approximately minimize the KL divergence between data and model, rather than model and data. For the discriminator network, you use the same cost function as before, which is just the binary classification task. For the generator network, we now sample from the generator and then penalize it according to e to the value of the logits of the discriminator. If the discriminator is optimal, this has the same expected gradient with respect to the parameters as the KL divergence between the data and the model does. So it approximates maximum likelihood by using supervised learning to estimate a ratio that would be intractable if we were to evaluate the maximum likelihood criterion directly.

In general, we can think of these different costs as being like reward functions. We can kind of think of the generator net as a reinforcement learning agent, where it takes actions and we reward its actions depending on the way that the environment responds. The thing that makes this particular reinforcement learning setup a little unusual is that part of the environment is another learning agent, in particular the discriminator. All these different costs have one thing in common: you can compute the cost using only the output of the discriminator, and then for every sample, you just give a reward that depends on exactly what the discriminator did.
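For reference, here are the three generator costs being compared, in my transcription (sigma inverse denotes the inverse sigmoid, so sigma^{-1}(D(G(z))) is the discriminator's logit):

```latex
\begin{aligned}
\text{minimax:} \quad & J^{(G)} = \tfrac{1}{2}\,\mathbb{E}_{z}\,\log\big(1 - D(G(z))\big) \\
\text{heuristic, non-saturating:} \quad & J^{(G)} = -\tfrac{1}{2}\,\mathbb{E}_{z}\,\log D(G(z)) \\
\text{maximum likelihood:} \quad & J^{(G)} = -\tfrac{1}{2}\,\mathbb{E}_{z}\,\exp\!\big(\sigma^{-1}(D(G(z)))\big)
\end{aligned}
```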
So if we look at a graph of the cost that the generator incurs as a function of the output of the discriminator, we can see that all these different costs decrease as we move from left to right. Essentially, that's saying that if you make the discriminator think the samples the generator created are real, then you incur a very low cost. We can see the way that they saturate in places, and also we can see how sampling along these curves gives us very different variance in the estimate of the gradient. The green curve that lies the highest is the heuristically motivated cost, which is designed not to saturate when the generator is making a mistake: if you look at the very extreme left, where the discriminator is outputting zeros, where the discriminator is successfully rejecting the generator's samples, this cost function has a high derivative value, so the model is able to learn rapidly early on, when its samples do not yet look realistic. Then, if we move downward in the series of plots, the blue curve, the minimax curve, is the one that we originally used to design this model framework and the one that's the easiest to analyze using the minimax theorem; this curve is relatively flat most of the way across and starts to curve down gently as the samples become more realistic. And then finally, the maximum likelihood cost, which has the negation of an exponential function in it, is very flat on the left side, but then shoots off exponentially downward as we get very far to the right. So we can see that we would actually incur very high variance in the estimate of the gradient if we were to use that particular cost function, because almost all the gradient comes from a single member of the minibatch, whichever one is the most realistic. Because of that, we don't usually use the maximum likelihood cost with generative adversarial networks; we use one of the other costs, which have nicer saturation properties and nicer variance properties. But it is a perfectly legitimate cost, and we can go ahead and train with it.

There are actually a few other ways of approximating the KL divergence, but none of the different ways of approximating the KL divergence give us blurry samples like we get with a VAE. We used to think that the VAE was using the KL divergence and got blurry samples, and that GANs were using the reverse KL divergence and got sharp samples, but now that we're able to do both divergences with GANs, we see that we get sharp samples both ways. My interpretation of this is that it is the approximation strategy, using supervised learning to estimate the density ratio, that leads to the samples being very sharp, and that something about the variational bound is what leads to the samples from the VAE being blurry. There's one other possibility, which is that the model architectures we use are usually a little bit different. VAEs are usually conditionally Gaussian, and usually have an isotropic Gaussian at the output layer; generative adversarial networks don't need to have any particular conditional distribution that you can evaluate, so the last layer is often just a linear layer, which would look kind of like a Gaussian distribution with a complete covariance matrix instead of a restricted covariance matrix. So it's possible that that complete covariance matrix at the last layer removes some of the blurriness. But we no longer think that the choice of the divergence is really important to understanding how generative adversarial networks behave.

Earlier, I showed you a family tree of different generative models, and I said we were going to pretend that all of them do maximum likelihood, and clearly they don't actually do that.
Earlier I showed you a family tree of different generative models, and I said we were going to pretend that all of them do maximum likelihood, even though clearly they don't actually do that. Now that we've seen how generative adversarial networks work in a little more detail, we can start to describe exactly how they compare to some of the more similar generative models. In particular, noise contrastive estimation is a procedure for fitting many different generative models, including Boltzmann machines and other types of generator nets, and noise contrastive estimation uses exactly the same value function that we use for the minimax game in generative adversarial nets. A lot of people look at this and think maybe these two methods are almost the same thing, and I myself wondered about that for a little while. It turns out that this same value function also appears for maximum likelihood, if you look at it the right way. What this value function consists of is: on the left, a term where we sample values from the data and measure the log of the discriminator function; on the right, a term where we sample values from a generator function and measure the log of one minus the discriminator function. It turns out that the differences between noise contrastive estimation, maximum likelihood estimation, and generative adversarial nets all revolve around exactly what the generator, the discriminator, and the learning process are. For generative adversarial networks, the discriminator is just a neural network that we parameterize directly; the function D(x) is directly implemented. For both noise contrastive estimation and maximum likelihood estimation, the discriminator is a ratio between the model that we're learning and the sum of the model density and the generator density. That probably got a little bit confusing right there: what is this model that we are learning, and how is it different from the generator? Well, it turns out that for noise contrastive estimation, the generator is used as a source of reference noise, and the model learns to tell samples apart from noise by assigning higher density to the data. So noise contrastive estimation might consist of generating samples from a Gaussian distribution and then training the discriminator function to tell whether a given input comes from the Gaussian distribution or from the data distribution, and it implements that discriminator function by actually implementing an explicit tractable density over the data and by accessing an explicit tractable density over the generator that creates the noise. May I ask a question? Yeah, go ahead. Because you have this nice slide there: my name is Jürgen Schmidhuber, from the Swiss AI lab, and I was wondering whether you can relate these very interesting GAN games to the other adversarial network that we had back in 1992, where you had two types of networks fighting each other, also playing a minimax game, where one of them tried to minimize an error function that the others were maximizing. It was not exactly like that, but it was very similar in many ways, because there you had an image coming in, and then you had these code layers, like in an autoencoder, and you tried to find a representation, initially a random representation of the image; but then for each of the units in the code layer there was a predictor which tried to predict that code unit from the other units in the code layer, and the predictors tried to minimize the prediction error, while the code units, just like the feature detectors,
tried to maximize it, trying to become as unpredictable as possible. Now, this is closely related to coming up with the reference noise vector that you just mentioned, because in the code layer you basically get, in the ideal case, a factorial code, where each of the units is statistically independent of the other units but still tells you a lot about the image. You can then attach an autoencoder to that and get a generative distribution: you just wake up the code layer units and randomly activate them according to their probabilities, and because it's a factorial code, you get images that reflect the original distribution of the images. So in many ways very similar, but in other ways different, and I was wondering whether you have comments on the similarities and differences of these old adversarial networks. Yeah, so Jürgen has asked me if I have any comment on the similarities and differences here, but he is in fact aware of my opinion, because we've corresponded about this by email before, and I don't exactly appreciate the public confrontation. If you want to form your own opinion about whether predictability minimization is the same thing as generative adversarial networks, you're welcome to read the paper: one of the NIPS reviewers requested that we add a description of predictability minimization to the generative adversarial networks paper, and we added our comments on the extent to which we think they are similar, which is that they're not particularly similar, to the NIPS final copy. Just for completeness, however: I reacted to exactly those changes, and it's not clear that you commented on or reacted to those points; there are comments which you did not address. I still think I would prefer to use my tutorial to teach about generative adversarial networks; if people want to read about predictability minimization, please do. Just to make sure it's on the record: the related work section and those comments have been added to the NIPS paper. So, returning to the comparison with noise contrastive estimation, which is far more similar to generative adversarial networks than predictability minimization is, in that it has exactly the same value function: we find that for noise contrastive estimation, the learning of the final generative model occurs in the discriminator, while for the generative adversarial network, the learning occurs in the generator. That's one way they're different from each other, and it has consequences for exactly what they are able to do. An interesting thing is that maximum likelihood estimation also turns out to use this same value function, and can also be interpreted as having a discriminator function inside it. The difference between noise contrastive estimation and maximum likelihood estimation is that for noise contrastive estimation, the noise distribution is fixed and never changes throughout training; if we choose a Gaussian as the reference noise distribution, then in practice learning tends to slow down relatively quickly, once the model has learned to create samples that are easily distinguishable from a Gaussian. In maximum likelihood estimation, we take the parameters of the model distribution and copy them into the noise distribution, and we do this before each step begins. So in some ways the maximum likelihood estimation procedure can be seen as the model constantly trying to learn its own shortcomings and distinguish its own samples from the data.
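A compact way to see the shared structure: the value function is the same across all three methods, and only the definition of D changes. A minimal sketch, assuming hypothetical density functions p_model and p_noise supplied by the caller:

```python
import numpy as np

# The value function shared by GANs, NCE, and this view of MLE:
# V = E_{x~data}[log D(x)] + E_{x~generator}[log(1 - D(x))]
def value_fn(D, data_samples, generator_samples):
    return (np.mean(np.log(D(data_samples))) +
            np.mean(np.log(1.0 - D(generator_samples))))

# In a GAN, D is a directly parameterized neural net. In NCE, D is built
# from an explicit model density and a fixed noise (generator) density:
def make_nce_discriminator(p_model, p_noise):
    def D(x):
        return p_model(x) / (p_model(x) + p_noise(x))
    return D
```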
In the generative adversarial networks approach, we constantly update the generator network by following the gradient on its parameters, and all three of these approaches constantly follow the gradient on the parameters of the discriminator. We can see how we get some computational savings relative to maximum likelihood by looking at the corners that both noise contrastive estimation and generative adversarial networks cut. For noise contrastive estimation, the main corner we cut is that we never update the noise distribution, and that eliminates a lot of computation right there. For generative adversarial networks, the corner we cut is that we don't need to maintain an exact correspondence between a density and a sampler. For maximum likelihood, at least in this particular implementation of it, we need to be able to sample from the model when we evaluate the term on the right, but we also need to be able to evaluate densities of the model in order to evaluate the D function, and we need to perform computations that convert between the density representation and the sampling procedure. Generative adversarial networks only ever sample from G and only ever evaluate D; there's no need to perform these conversions from densities to sampling procedures, and that provides a lot of computational savings. So I've completed the section of our roadmap on how generative adversarial networks work from a theoretical point of view, and now I'll move on to a few tips and tricks that should help you make them work better in your own practical applied work. The first really big tip is that labels turn out to really improve the subjective sample quality a lot. As far as I know, this was first observed by Emily Denton and her collaborators at NYU and Facebook AI Research, who showed that while the baseline generative adversarial networks they started from didn't work very well at all, you could get them to work really well if you made them class conditional. Mehdi Mirza and Simon Osindero had developed a conditional version of the generative adversarial network where you could give some input value that should control what output comes out, and Emily and her collaborators showed that if you use the class label as that input, you can create an output image from that class, and these images are much better than if you had just learned the density over images to begin with; a sketch of this conditioning idea appears below. Another thing is that even if you don't want to go fully to a class conditional model, you can learn a joint distribution over x and y, and even if at sample time you don't provide an input y to request a specific kind of sample, the samples that come out will be better. Tim Salimans and I did this in the paper we'll be showing at the poster session tonight; it's not a key contribution of the paper, but it's one of the tricks we used to get better images. One caveat about using this trick is that you need to keep in mind that there are now three different categories of models that shouldn't be directly compared to each other: models trained entirely without labels, models that are class conditional, and models that are not class conditional but benefited from the use of labels to guide the training somewhat. It wouldn't really be fair to make a class conditional model and then say it's strictly superior to some model that didn't use labels to improve its samples at all.
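Here is a minimal sketch of the conditioning idea: the class label enters the generator as an extra input alongside the noise vector. The layer sizes and names are made up for illustration, not from any particular paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGenerator(nn.Module):
    """Generator that takes a class label as an extra input."""
    def __init__(self, z_dim=100, n_classes=10, img_dim=784):
        super().__init__()
        self.n_classes = n_classes
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, y):
        # condition by concatenating a one-hot label onto the noise vector
        y_onehot = F.one_hot(y, self.n_classes).float()
        return self.net(torch.cat([z, y_onehot], dim=1))

# usage sketch:
# g = ConditionalGenerator()
# x = g(torch.randn(8, 100), torch.randint(0, 10, (8,)))
```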
Another tip that can really help a lot is a technique I call one-sided label smoothing, which we also introduced in the paper with Tim that we're showing tonight. The basic idea is that usually, when you train the discriminator, you train it to output hard ones on the data and hard zeros on the fake samples; it's much better if you train it to output a soft value like 0.9 on the data, while on the fake samples it should still strive to output zeros. It's called one-sided because we only smooth the side that's on the data. You can think of it as introducing some kind of leak probability, that sometimes the data has been mislabeled, that we accidentally gave you something fake and said it was real. In particular, this will reduce the confidence of the model somewhat, so that it does not predict really extreme values. It's important not to smooth the generator samples, and we can see this by working out what the optimal discriminator is. If we smooth by replacing the positive targets with 1 - alpha and the negative targets with beta, then we get a ratio of densities again, where in the numerator we have (1 - alpha) times the data distribution plus beta times the model distribution. Because this numerator determines where the output of the discriminator is large, and therefore determines where the generator wants to steer its samples, we need to make sure that the second term does not appear in the numerator; otherwise we would reinforce the current behavior of the generator. If the generator is making lots of weird pictures of grids, and we assign beta times p_model to those weird pictures of grids in the discriminator, we will just ask the generator to keep making weird pictures of grids forever, and the gradient near those images will not steer it away from them. That's why we always set beta to zero and only smooth using the alpha term on the data side. We didn't invent label smoothing; we just advocate the one-sided use of it for the discriminator. Label smoothing dates back to the 1980s, and I'm not sure where it originated. Christian Szegedy and his collaborators showed that it works really well for regularizing Inception models, and one of the really nice properties I've observed is how it compares to weight decay: weight decay will actually reduce the training accuracy of your model, and will cause the model to make classification mistakes by shrinking the weights until it's no longer possible to make the correct classification, if you turn up the weight decay coefficient enough. Label smoothing will not actually introduce mistakes; it just reduces the confidence of the correct classifications, but it never steers the model toward an incorrect classification. So for generative adversarial networks, this lets the discriminator still more or less know which direction is real data and which direction is fake data, but without misguiding the generator, and it gets rid of really large gradients; it gets rid of behaviors where the discriminator linearly extrapolates to decide that, if moving a little bit in one direction gives more realistic samples, moving very far in that direction will give more and more realistic samples.
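A minimal sketch of a discriminator loss with one-sided smoothing (the function and argument names are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def d_loss_one_sided_smoothing(d_logits_real, d_logits_fake, alpha=0.1):
    # real targets are smoothed down to 1 - alpha;
    # fake targets stay at exactly 0 (beta = 0), for the reason given above
    real_targets = torch.full_like(d_logits_real, 1.0 - alpha)
    fake_targets = torch.zeros_like(d_logits_fake)
    loss_real = F.binary_cross_entropy_with_logits(d_logits_real, real_targets)
    loss_fake = F.binary_cross_entropy_with_logits(d_logits_fake, fake_targets)
    return loss_real + loss_fake
```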
It's important to use batch normalization in most layers of the model. I won't go into batch normalization in detail, but the idea is that you take a full minibatch of input samples and normalize the features of the network by subtracting the mean of those features across the whole batch and dividing by their standard deviation; this makes the learning process much better conditioned. Unfortunately, the use of normalization constants computed across a whole minibatch can induce correlations between the different samples generated in the same minibatch. I'm showing you a grid of sixteen examples in the top image that were all in one batch, and then another grid of sixteen samples that were all in another batch, from the same generator model in both cases. The only reason there seems to be a common theme among all the examples in each image is that they're sharing the same mean and standard deviation normalization constants, and in this case the model has somewhat pathologically learned to have its output depend much more on the precise, randomly sampled value of that mean and standard deviation than on the individual values in the code. So in the top we see a lot of very orange images, and in the bottom a lot of very green images. To fix that problem, we changed to two different versions of batch normalization that process every example in the same way. The simpler is what we call reference batch normalization: you pick a reference batch of examples at the start of training and never change it, and you always compute the mean and standard deviation of the features on those reference images and use them to normalize the different images you train on. Every image throughout all of training is then normalized using statistics from the same reference batch, and there's no longer random jitter as we resample the images used to compute the normalizing statistics. Unfortunately, because we always use the same images, we can start to overfit to that particular reference batch. To partially resolve that, we introduced a technique called virtual batch normalization: every time you want to normalize an example x, you normalize it using statistics computed on both the reference batch and the example x itself, added to that batch. A lot of people ask me how to balance the generator and the discriminator, and whether they need to be carefully adjusted to make sure that neither one of them wins. In reality, I usually find that the discriminator wins, and I also believe that this is a good thing: the theory is all based on assuming that the discriminator will converge to its optimal solution, where it correctly estimates the ratios we're interested in, and we really want the discriminator to do a good job of that. In some cases you can get problems where, if the discriminator gets really good at rejecting generator samples, the generator doesn't have a gradient anymore. Some people have an instinct to fix that by making the discriminator less powerful, but I think that's the wrong way to go about it. I think the right way is to use things like one-sided label smoothing to reduce how extreme the gradients from the discriminator are, and to use things like the heuristic non-saturating cost instead of the minimax cost, which will make sure you can still get a learning signal even when the discriminator is able to reject most of the samples.
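A minimal sketch of virtual batch normalization over flat feature vectors; this is my own simplified rendering of the idea, not the exact published formulation:

```python
import torch

def virtual_batch_norm(x, ref_batch, eps=1e-5):
    # x: (B, F) minibatch; ref_batch: (R, F) fixed reference batch chosen once
    # at the start of training. Each example is normalized with statistics
    # computed over the reference batch plus that single example, so samples
    # in the same minibatch no longer share normalization noise.
    R = ref_batch.shape[0]
    ref_mean = ref_batch.mean(dim=0)
    ref_sq_mean = (ref_batch ** 2).mean(dim=0)
    # fold each example into the reference statistics with weight 1/(R+1)
    mean = (R * ref_mean + x) / (R + 1)          # broadcasts to (B, F)
    sq_mean = (R * ref_sq_mean + x ** 2) / (R + 1)
    var = sq_mean - mean ** 2
    return (x - mean) / torch.sqrt(var + eps)
```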
There are a few other things you can do to try to make sure that the coordination between the generator and the discriminator works out correctly. In particular, we really want the discriminator to always do a good job of estimating that ratio; we want it to be really up to date, having fit really well to the latest changes to the generator. That motivates running the update on the discriminator more often than the update on the generator. Some people still do this; I don't usually find that it works that well in practice, and I can't really explain why not. All the theory suggests it should be the right thing to do, but that particular approach doesn't seem to consistently yield an obvious payoff. We're now coming to the most exciting part of the roadmap, which is the research frontiers in generative adversarial networks. Can I get a quick check on how much time I have left? OK, yes. The biggest research frontier in generative adversarial networks is confronting the non-convergence problem. Usually when we train deep models, we are minimizing a cost function, so we're using an optimization algorithm to perform minimization. There are a lot of things that can go wrong with minimization, especially when training a deep model: you can approach a saddle point rather than a minimum; you can approach a local minimum rather than a global minimum (though we're starting to become skeptical that local minima are as much of a problem as we used to think); and you can have things like bad conditioning, high variance in the gradient, and so on. But for the most part you're going to go downhill until eventually you stop somewhere, unless your hyperparameters are really bad, and you don't usually need to worry that your optimization algorithm will fail to even converge. In the case of looking for an equilibrium of a game, it's actually pretty difficult to guarantee that you will eventually converge to a specific equilibrium point, or even that you will stop in some particular location that isn't a great equilibrium. To start looking at exactly how this works, we're going to do another exercise, where we analyze a minimax game and see what gradient descent does. For this game we have a scalar variable x, a scalar variable y, and the value function V(x, y) = xy. One player controls x and would like to minimize this value function; the other player controls y and would like to maximize it. The exercise is to figure out whether this value function has an equilibrium anywhere, and if so where; then to look at the dynamics of gradient descent, analyzing it as a continuous-time process, and determine what the trajectory that gradient descent follows looks like on this particular problem. I can take a few more questions while people work on this one. Now you have GANs that generate really, really nice results and train on a lot of data; for example, the VGAN work presented here is trained on 27 terabytes of video. The thing I'm wondering is, nobody has looked at all those videos, so how can you know that the GAN is not generating near duplicates? Is there any theoretical motivation? Is it related to overfitting? And are people trying near-duplicate search to see whether it's just very good at compressing this data instead of generating? Yeah, so duplicating a training example would definitely be a form of overfitting.
It's not something we really believe happens in generative adversarial networks, although we don't have a strong theoretical guarantee that it doesn't. One thing I can point out is that the generator never actually gets to see a training example directly; it only sees the gradients coming from the discriminator, so the discriminator would need to perfectly memorize a training example and then communicate it into the generator via the gradient. Another thing is that, because we have this problem with fitting games and finding equilibria, like people are analyzing in the exercise right now, we tend to underfit rather than overfit; I'd be quite happy if we started to overfit consistently. It's actually pretty difficult to measure how much we're overfitting, because you wouldn't expect the model to perfectly copy a training example; it's more likely to mostly copy a training example and then change a few small things about it. We do things like look for nearest neighbors: we generate samples and then find the most similar training example in terms of Euclidean distance, but it's really easy to make a small change that causes a gigantic difference in Euclidean distance, so it can be hard to tell whether that actually rules out near duplicates. It's also worth mentioning that in many cases generative adversarial nets aren't even necessarily compressing the data; sometimes we actually train them with more parameters than there are floating-point values in the original dataset. We're converting the data into a form where you can get infinitely many samples in a computationally efficient way. But yes, we are usually compressing, as you said. Next question: right now, in for example vanilla GANs, you're taking noise, doing noise shaping in a sense, and then reconstructing some signal, some image, in its native space, its native basis. What do you think of doing the generation in a more sparsified basis for those types of signals, for example a cosine basis, or the coefficients of some dictionary? Do you think that might make the learning of the GANs easier, or do you think it might not matter? So, just to check: should the output of the generator network be a set of basis coefficients? Yeah, for example coefficients in some natural basis, maybe a Fourier basis or some wavelet basis or a dictionary; I'm just wondering whether that makes the learning easier, because you can put more priors on these. As a member of the deep learning cult, I'm not allowed to hand-engineer anything. The closest thing I've done to what you're suggesting is that my co-author Bing Xu, on the original generative adversarial nets paper, was able to train a really good generator net on the Toronto Faces dataset by doing layer-wise pretraining. I wasn't able to get the deep, jointly trained model to fit that dataset very well back then; my guess is that it would probably work now that we have batch norm, which we didn't have back then. You can view what Bing did as being a little bit like what you're suggesting: when you train the output layer of the generator in the first training step, it essentially learns a dictionary that looks a little like wavelet dictionaries, and then when you start training the deeper layers of the generator, those layers are essentially learning to output wavelet coefficients. So I do think that would help.
Yeah, question: can we use GANs, after they are trained, to create a synthetic dataset for another classifier? The idea being that after the GAN is trained, it has captured the probability distribution of my input, and I can use it to automatically generate more images, the way we normally use dataset augmentation. Yeah, so my former intern Chenqi Chen, whom I mentored when I was at Google: I don't want to disclose his project, but I'll tell you that he's doing something cool related to that, and if you talk to him, he can decide whether he wants to disclose it; I don't think I'm giving anything away by saying that. I've also had a lot of other people tell me that sometimes, when they're evaluating a generator network to see how well it's doing, one test they'll run is to create a synthetic dataset using the generator, train a classifier on that new dataset, and then use it to classify the real test set. If that classifier is able to classify the real test set, they take that as evidence that the generator was pretty good, since it could be used to make a fake training set. There are a few downsides to that procedure: for example, if you were generating one mode way too often but still generating all the other modes occasionally, your classifier might still be pretty good even though your generative model is screwed up. But it does basically seem to work. In the interest of time, I'll move on to the solution of the exercise; there will be one more exercise where you'll get to ask a few more questions. The solution to this exercise, where we're looking at the value function V(x, y) = xy with x and y scalars: there is actually an equilibrium point at x = 0 and y = 0; when they're both zero, each player's gradient vanishes. We can then look at the gradient descent dynamics by analyzing them as a continuous-time system. If we evaluate the gradients, dx/dt = -y and dy/dt = x; the sign difference is because one player is trying to minimize the value function and the other is trying to maximize it. If we then solve this differential equation to find the trajectories, there are a lot of different ways of doing it, depending on which pattern-matching technique you're most comfortable with; my particular approach is to differentiate the second equation with respect to t, which gives d²y/dt² = -y. From that I recognize that we're looking at a sinusoidal basis of solutions, and from there you can guess and check the corresponding coefficients. We get a circular orbit, where the only thing the initial conditions change is what the circle looks like: if you initialize right at the origin, you stay at the origin, but if you initialize off the origin, you never get any closer to it. So gradient descent goes into an orbit and oscillates forever rather than converging. And that's continuous-time gradient descent, with an infinitesimal step size; if we use a larger step size, it can actually spiral outward forever. There are conditions you can check to see whether simultaneous gradient descent will converge, and they involve the complex eigenvalues of a matrix of second derivatives; I won't go into them, because it's not the kind of thing that makes for a nice talk, but the long and short of it is that the generative adversarial nets game does not satisfy the main sufficient condition for convergence.
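Here is a small simulation of the discrete-time version of the exercise above (my own sketch; the step size is arbitrary), showing the outward spiral that a finite step size produces:

```python
import numpy as np

# simultaneous gradient descent on V(x, y) = x * y:
# x minimizes, y maximizes, so the continuous dynamics are dx/dt = -y, dy/dt = +x
x, y, lr = 1.0, 0.0, 0.1
radii = []
for t in range(200):
    gx, gy = y, x                      # dV/dx = y, dV/dy = x
    x, y = x - lr * gx, y + lr * gy    # simultaneous (not alternating) update
    radii.append(np.hypot(x, y))

# each step multiplies the radius by sqrt(1 + lr**2), so the orbit spirals out
print(radii[0], radii[-1])
```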
Now, that doesn't mean they don't converge; it means we don't know whether they converge or not, according to the main criterion we can look at. In practice they do seem to converge sometimes, and they don't other times, and we don't have a great understanding of why. The most important thing to understand is that simultaneous gradient descent is not really an algorithm for looking for equilibria of a game; it sometimes finds them, but that's not really its purpose, and the most important research direction in generative adversarial nets is to find an algorithm that does find equilibria in these high-dimensional, continuous, non-convex spaces. It's important to mention that if we were able to optimize the generative adversarial network in function space, if we were able to update the density function corresponding to the generator, and the discriminator's beliefs about the generator, directly, then we could actually use convexity in function space to prove that simultaneous gradient descent converges for that particular problem. The reason this breaks down is that we don't actually update the densities directly: we update the G and D functions that do the sampling and the ratio estimation, and on top of that we represent G and D using parametric functions, deep neural networks, where the actual output values of G and D are very non-convex functions of the parameters. That causes us to lose all of our guarantees of convergence. The main way we see this affect the generative adversarial networks game is that we get behaviors like oscillation, where the generator continually makes very different samples from one step to another but doesn't ever converge to producing a nice consistent set of samples. In particular, the worst form of non-convergence, and one that happens particularly often, is what we call mode collapse, where the generator starts to make only one sample, or one similar theme of related samples. It usually doesn't output exactly the same image over and over; it might make every image a picture of the same dog, with the dog in different positions or with different objects in the background, or every sample it makes might be a beach scene, but it is essentially generating too few kinds of things. The reason mode collapse happens particularly often in the generative adversarial nets game is that the game is a little bit pathological in the way we specify the value function. If we look at the minimax version, the min-max and the max-min do different things. If we do the min-max, where we put the discriminator in the inner loop and maximize over it there, then we're guaranteed to converge to the correct distribution. In practice, we don't actually do the maximization in the inner loop; we do gradient descent on both players simultaneously. If we put G in the inner loop instead, that corresponds to a pathological version of the game, where the generator learns to place all of its mass on the single point that the discriminator currently finds most likely. Luke Metz and his collaborators produced a really nice visualization of this in their recent paper submitted to ICLR, where we have a target distribution, shown in the middle of the slide, with several different modes in two-dimensional space; then, as we move left to right over the course of training, we see the generative adversarial network learn to sample from different modes of that distribution, but never actually cover multiple modes at the same time.
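Writing the asymmetry down compactly, with V the value function from earlier (standard notation, my own transcription of the point being made):

```latex
\min_G \max_D V(G, D) \quad \text{(D in the inner loop: converges to the data distribution)}
\;\neq\;
\max_D \min_G V(G, D) \quad \text{(G in the inner loop: collapses onto the point D currently favors)}
```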
This is because simultaneous gradient descent can sometimes behave a little like min-max and a little like max-min, and we're just unlucky enough that it often behaves more like max-min and does the thing we don't want. Some people have explained mode collapse in terms of the fact that we use the reverse KL loss that I described earlier, when I said that I don't believe the reverse KL loss explains why we get sharp samples. Because the reverse KL loss would prefer to choose a single mode rather than average out two different modes, it does superficially seem like it might explain mode collapse, but I don't think it is actually the explanation in this case. For one thing, if we use the forward KL, we still get mode collapse in many cases. Also, the reverse KL divergence does not say that we should collapse to a single mode: it says that if our model is not able to represent every mode and put sharp divisions between them, it should discard modes rather than blur them, but it would still prefer to have as many modes as the model can represent. With generative adversarial networks, what we usually see is a collapse to a much smaller number of modes than the model can represent, and that makes me believe the problem is really that we're doing max-min, rather than that we're using the wrong cost. We often see that generative adversarial networks work best on tasks that are conditional, where we take an input and map it to some output; we're reasonably happy with the result as long as the output looks acceptable, and in particular we may not really notice if there's low diversity in the output. For example, in sentence-to-image generation, as long as we get an image that actually resembles the sentence, we're pretty happy with the output even if there isn't much diversity in it. Scott Reed and his collaborators have recently shown that for these sentence-to-image tasks, generative adversarial networks seem to produce samples that are much less diverse than those produced by other models. In the panel on the right we can see how the sentence "a man in an orange jacket with sunglasses and a hat skis down a hill" gives three different images of a man in essentially the same pose when we use a generative adversarial network, while the model developed in their paper achieves greater diversity in the output. One way to try to reduce the mode collapse problem is to introduce what Tim Salimans calls minibatch features: features that look at the entire minibatch of samples when examining a single sample. If that sample is too close to the other members of the minibatch, it can be rejected as having collapsed to a single mode; a simplified sketch follows below. This procedure led to much better image quality on CIFAR-10, where we're now able to see all ten classes of images. On the left I show you the training data, so you can see this data is not particularly beautiful to start with: it's 32 by 32 pixels, so relatively low resolution, and you can see there are things like cars, airplanes, horses, and so on. In the panel on the right we have a GAN trained with minibatch features, and it is now successfully able to generate many different recognizable classes, like cars and horses. Previous generative adversarial networks on CIFAR-10 would usually give only photo-texture blobs that looked like regions of grass, sky, or water, but would not usually have recognizable object classes in them.
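A simplified stand-in for the minibatch-features idea (not the exact formulation from the Salimans et al. paper; the distance statistic here is my own illustrative choice):

```python
import torch

def minibatch_closeness_feature(h):
    # h: (B, F) discriminator features for one minibatch. Append a statistic
    # measuring how far each sample's features are from the rest of the batch;
    # a mode-collapsed generator makes this statistic suspiciously small, which
    # the discriminator can then learn to detect.
    dists = torch.cdist(h, h)                              # (B, B) pairwise distances
    mean_dist = dists.sum(dim=1, keepdim=True) / (h.shape[0] - 1)
    return torch.cat([h, mean_dist], dim=1)                # (B, F + 1)
```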
On ImageNet, the object classes are not as recognizable, but if we go through and cherry-pick examples, we can find some relatively nice, recognizable images of many different kinds of animals, like dogs and maybe koalas and birds. If we look at some of the problems that arise with this sampling procedure, we can see some of the amusing things that convolutional networks get wrong. One thing in particular, which I think is probably due to the way pooling works in a convolutional network, is that the network is usually testing whether some feature is absent or present, but not how many times it occurs, so we tend to get multiple heads in one image, or animals that have more than one face on the same head. We also often get problems where the perspective of an image is greatly reduced; I think this might be due to the network not having enough long-range connections between different pixels in the image, making it hard to tell that things like foreshortening ought to happen. In particular, the picture of the gray and orange dog looks literally like a cubist painting to me, where the cubists intentionally removed the perspective. Some of them also just look like we've taken an animal, skinned it, laid its fur out flat on the ground, and taken an axis-aligned photo of it. We also see a lot of cases where individual details are great but the global structure is wrong: there's a cow that is both quadrupedal and bipedal, a dog whose eyes are different sizes from each other, and a cat with something like a lamprey mouth. We also often see animals that don't really seem to have legs, just vanishing into fur blobs that conveniently end at the edge of the image, so the network doesn't need to draw the legs. Did anybody notice anything that actually looked real in these samples? Aaron? Yeah, the cat was real, to test your discriminator network; good job, Aaron. Another really promising way to reduce the mode collapse problem, besides minibatch features, is called unrolled GANs. This was recently introduced by researchers at Google Brain and submitted to ICLR, and it's worth mentioning that a few other people had suggested doing this for a few years beforehand, so it's an idea that was floating around in the ether a little bit; I imagine some people in the audience are probably thinking "I told people about that", but the Google Brain team was the first to go ahead and get it to work really well. Revisiting the same visualization we saw earlier, the unrolled GAN is able to actually recover all the different modes. The way unrolling works is that, to really make sure we're doing min-max rather than max-min, we use that maximization operation in the inner loop as part of the computational graph that we backprop through. Instead of having a single fixed copy of the discriminator, we build a complete TensorFlow graph describing k steps of the learning process of the discriminator, so the generator is essentially looking into the future and predicting where the discriminator will be several steps later. Because it's the generator looking into the future rather than the discriminator, we're setting a direction for that min-max problem: we're saying that it's max over the discriminator in the inner loop and min over the generator in the outer loop, and that very elegantly gets us around the mode collapse problem.
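To see the mechanics on a small case, here is a sketch of unrolling applied to the toy game V(x, y) = xy from the earlier exercise; this is my own illustration (in PyTorch rather than TensorFlow, with arbitrary step sizes), not the paper's implementation. The x player differentiates through k gradient-ascent steps of its opponent, and with these settings the simultaneous updates now spiral inward instead of orbiting:

```python
import torch

def unrolled_value(x, y, k=5, inner_lr=0.1):
    # value the x player sees: V(x, y_k), where y_k is the opponent after
    # k ascent steps on V(x, y) = x * y, kept inside the autograd graph
    y_k = y
    for _ in range(k):
        (g,) = torch.autograd.grad(x * y_k, y_k, create_graph=True)
        y_k = y_k + inner_lr * g
    return x * y_k

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)
for step in range(200):
    (gx,) = torch.autograd.grad(unrolled_value(x, y), x)
    (gy,) = torch.autograd.grad(x * y, y)   # y itself is not unrolled
    with torch.no_grad():
        x -= 0.05 * gx   # x minimizes, looking k steps ahead
        y += 0.05 * gy   # y maximizes greedily
print(float(x), float(y))   # both head toward the equilibrium at (0, 0)
```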
Another really big, important research direction for generative adversarial networks is figuring out how to evaluate them. This is actually a problem broader than just generative adversarial nets; it's a problem for generative models across the board. Models with good likelihood can produce bad samples; models with good samples can have very bad likelihood; and even when we talk about good and bad samples, there's not really an effective way to quantify how good a sample is. There's a really good paper called "A note on the evaluation of generative models" that walks through a lot of corner cases to clearly explain the problems with the different metrics we have available today. For generative adversarial networks these problems are compounded by the fact that it's actually pretty hard to estimate the likelihood, though there is a paper on estimating the likelihood in submission to ICLR, so that problem might be cleared up pretty soon, once we have more experience with that particular methodology. Another research frontier is figuring out how to use discrete outputs with generative adversarial networks. I described earlier that the only real condition we impose on the generator network is that it be differentiable, and that's a pretty weak criterion, but unfortunately it means we can't really generate sequences of characters or words, because those are discrete, and if the output is discrete, the function isn't differentiable. You can imagine a few ways around this. One is to use the REINFORCE algorithm to do policy gradients and use that to train the generator network. There are also the recently introduced techniques based on the Gumbel distribution for making relaxations that allow you to train discrete variables. Or, finally, you could do the old-fashioned thing we used to do: I saw Geoff Hinton on Thursday, and he was mentioning how this reminds him a lot of the way Boltzmann machines were really bad at generating continuous values, so what we did there was preprocess continuous values to convert them into a binary space and then use Boltzmann machines from there. You could do the same thing in reverse with generative adversarial nets: have a model that converts binary values to continuous values, and then use generative adversarial networks from there. You could, for example, train a word embedding model and then have a generative adversarial network that produces word embeddings rather than directly producing discrete words. One very interesting extension of the discriminator is to make it recognize different classes, and this allows us to participate in an important research area: semi-supervised learning with generative adversarial networks. Originally, generative adversarial networks used just a binary output value that said whether things were real or fake, but if we add extra outputs saying which class an input belongs to, plus one fake class, we can take the discriminator and use it to classify data after we've finished training the whole process, and because it has learned to reject lots of fake data, it actually gets regularized really well. Using this approach, Tim Salimans and I and our other collaborators at OpenAI were able to set the state of the art on several different recognition tasks with very few labeled examples, on MNIST, CIFAR-10, and SVHN.
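A minimal sketch of the discriminator head for this semi-supervised setup (layer sizes and names are made up for illustration):

```python
import torch
import torch.nn as nn

class SemiSupervisedDiscriminator(nn.Module):
    """Outputs logits over K real classes plus one extra 'fake' class (index K)."""
    def __init__(self, in_dim=784, k_classes=10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.head = nn.Linear(256, k_classes + 1)

    def forward(self, x):
        return self.head(self.body(x))

# after training, classify real data by taking argmax over the K real classes only:
# d = SemiSupervisedDiscriminator()
# pred = d(torch.randn(8, 784))[:, :-1].argmax(dim=1)
```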
Another important research direction is learning to make the code interpretable. Peter Chen's InfoGAN paper, here at NIPS, shows how we can learn a code where different elements correspond to specific, semantically meaningful variables, like the position of an object in an image. Another research direction is the connections to reinforcement learning: recent papers have shown that generative adversarial networks can be interpreted as an actor-critic method, used for imitation learning, or interpreted as inverse reinforcement learning. Finally, if we're able to come up with a good algorithm for finding equilibria in games, we can apply that algorithm to many other places besides generative adversarial networks: things like robust optimization, literally playing games like chess and checkers, resisting adversarial examples, and guaranteeing privacy against an attacker who wants to thwart it. All of these application areas are examples of games that arise in artificial intelligence, and they might be improved by the same kinds of techniques that could help us improve generative adversarial networks. We're very close to out of time, but I'll give you five minutes to do this exercise, and I'll answer the last set of questions during it. This exercise jumps back to something from earlier: how I described that there's a different cost function you can use to get maximum likelihood with generative adversarial networks. I think this is a really good closing exercise, because it drives home the point that the key mathematical tool generative adversarial networks give you is the ability to estimate a ratio, and to see how that ratio estimation works, you're going to derive the maximum likelihood learning rule. In particular, we have a cost function for the generator network which is an expectation, over x sampled from the generator, of f(x), and we want to figure out what f(x) should be to make this cost function give us maximum likelihood. As a hint, you should first show that the derivatives of the cost function with respect to the parameters are given by an expectation of f(x) multiplied by the derivatives of the log likelihood; if you'd like, you can take that as given and skip to the last step. At the very end, you should figure out what f(x) should be, given this fact about the gradients: if you choose the right f(x), you get the maximum likelihood gradient. So I'll give you a few minutes to work on that, I'll take a few questions, and then I'll conclude. In your previous slides about the generator network, you mentioned an important assumption: that the function should be differentiable. What if the function is not differentiable? In some areas, such as bioinformatics, the data are categorical labels, not numerical values, so the function is not differentiable; in that situation, how do you generate synthetic data using a GAN? So, there haven't been any papers actually solving that problem yet. I talked about this a few slides earlier, and my recommendations are to try the REINFORCE algorithm to do policy gradients with discrete actions; to try the Concrete distribution and Gumbel-softmax, which are two recently released papers about how to train models with discrete outputs; or to convert the problem into a continuous space where generative adversarial nets can be applied.
So the thing with GANs is that they're very powerful at capturing the modes of the distribution, but not really, truly understanding what images are, in the sense that you start from z to generate x. The question is: if you increase the image size, presumably the number of modes of the distribution is going to increase exponentially. Practically this may not be a problem; maybe we just care about 100 by 100 pixel images. But suppose I'm interested in 2,000 by 2,000 pixel images: if I truly understand what images are and how they're generated, there's no difference between 100 by 100 and 2,000 by 2,000. My question is about way down the future: at the end of the day you're capturing modes of the distribution, and the modes are going to explode if you go to larger images. Well, a larger model can capture more modes as you use a bigger convolutional net. And I guess the nice thing about natural images is that when you increase the resolution, you're looking at a different level of detail, but within the same level of detail the same structure is repeated all across the image. Say we'd been studying 64 by 64 images and couldn't really see the individual hairs of an animal's fur; when we move up to a higher resolution where we can see the fur, we don't need to relearn the distribution over images of fur at every pixel separately. We learn one level of detail that can be replicated across the whole image, and we generate different z values at every x and y coordinate, which randomly decide the fine details of the fur, like which angle it should point in. Why do you think that, in practice, GANs don't scale well when you go to larger images? Oh, well, you might be surprised by what comes in a few slides. I think I should probably move toward the conclusion now. Recalling exercise 3, we're looking to design this f(x), the cost applied to every example produced by the generator, in order to recover the maximum likelihood gradient. We start by showing the property that we can write the gradient of the generator's cost as an expectation, taken with respect to generator samples, of f(x) multiplied by a likelihood gradient. That's relatively straightforward to show: the basic steps are to turn the expectation into an integral, use Leibniz's rule (which means making a few assumptions about the structure of the distributions involved), and then take advantage of our earlier assumption that the generator distribution is nonzero everywhere, which lets us say that the derivatives of p_g are equal to p_g times the derivatives of log p_g. That gives us a nice expression for the gradient of the likelihood in terms of samples that came out of the generator; but what we'd really like is the gradient of the likelihood in terms of samples that came from the data. The way we get that is importance sampling: we have this f(x) coefficient multiplying each of the gradients, and we can fix the problem that we're sampling from the generator when we want the effect of sampling from the data by setting f(x) to be p_data over p_generator. This means we'll have somewhat bad variance in our estimate, because we're sampling from the generator and then reweighting everything to make it look like we sampled from the data, but in theory it's unbiased. From there it takes a little bit of algebra to figure out exactly how to implement this ratio using the discriminator. We recall that the optimal discriminator gives us the ratio p_data over (p_data plus p_generator), and with a little more algebra we can rearrange that to say we need to set f(x) to negative e raised to the logits. This is maybe a lot to absorb right now, but I think it's pretty intuitive once you've worked through it slowly on your own, and it gives you an idea of how you can take this ratio the discriminator gives you and build lots of other things with it.
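The chain of steps just described, written out in notation (my own transcription; σ is the logistic sigmoid and a(x) the discriminator's logits):

```latex
\frac{\partial}{\partial\theta}\,\mathbb{E}_{x \sim p_g}\!\left[f(x)\right]
  = \int f(x)\,\frac{\partial}{\partial\theta} p_g(x)\,dx
  = \mathbb{E}_{x \sim p_g}\!\left[f(x)\,\frac{\partial}{\partial\theta}\log p_g(x)\right].

\text{Choosing } f(x) = \frac{p_{\mathrm{data}}(x)}{p_g(x)}
  \text{ turns this into the maximum likelihood gradient (importance sampling).}

\text{With an optimal discriminator } D(x) = \sigma(a(x)) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)},
  \qquad \frac{p_{\mathrm{data}}(x)}{p_g(x)} = \frac{D(x)}{1 - D(x)} = e^{a(x)},

\text{and since the generator minimizes the cost while likelihood is maximized, } f(x) = -\,e^{a(x)}.
```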
So, to conclude the talk, I'd like to show you some really exciting new results that came out using generative adversarial networks, and that address the last question about whether generative adversarial networks scale to very large images. A new model came out just last week; I seem to have a curse that every time I give a talk about something, an important new result comes out right as I've finished my slides, so I desperately made some new slides on the plane on the way here. Plug-and-play generative networks (or generative models, sorry) make 256 by 256 high-resolution images of all one thousand classes from ImageNet, with very good sample diversity. The basic idea is to combine adversarial training, moment matching in a latent space, denoising autoencoders, and Monte Carlo sampling using the gradient. The really cool thing is that they also work for captioning, or inverse captioning, where you generate images by giving an input sentence that describes the image. Overall, the basic technique is to follow a Markov chain that moves around in the direction of the gradient of the logarithm of p(x, y), with y marginalized out. You can use denoising autoencoders to estimate the required gradient, but to make the denoising autoencoder create really good images, it needs to be trained with several different losses. One of those losses is the adversarial networks loss, and that forces it to make images that look very realistic, as well as images that are close to the original data in L2 space. This confirms some of the tips I gave earlier in the talk. In the tips and tricks section, I said you often get much better results if you include class labels, and we see here that plug-and-play generative models don't make nearly as recognizable images if we generate samples without the class. We also see that the adversarial loss is a really important component of this new system: looking at the reconstructions of the denoising autoencoder, we begin on the left with the raw data, in the middle we show the reconstructed image, and on the right we show the reconstruction you get if you train the model without the adversarial network loss. So adversarial learning has contributed a lot to the overall quality of this current state-of-the-art model. In conclusion, I'd hope that everyone remembers that generative adversarial networks are models that use supervised learning to approximate an intractable cost by estimating ratios, and that they can simulate many different cost functions, including the one used for maximum likelihood.
The most important research frontier in generative adversarial networks is figuring out how to find Nash equilibria in high-dimensional, non-convex, continuous games. And finally, generative adversarial networks are an important component of the current state of the art in image generation, and they are now able to make high-resolution images with high diversity from many different classes. That concludes my talk. I believe we're out of time for questions, since we already took several of them during the exercise breaks, and I think Aaron will now announce that we're headed off to sign textbooks; I hope you know what room we're sending them to.
Info
Channel: Steven Van Vaerenbergh
Views: 88,628
Rating: 4.961102 out of 5
Keywords: nips, nips2016, conference
Id: HGYYEUSm-0Q
Length: 115min 53sec (6953 seconds)
Published: Thu Jan 18 2018