Understand the Math and Theory of GANs in ~ 10 minutes

Captions
Welcome to part two of the series on generative adversarial networks. In part one we gave a high-level introduction to GANs. By the end of this video you'll not only understand this equation, but will have learned an algorithm for solving it, as well as seen how to theoretically prove that if we solve it, we recover the perfect generative model. To get early access to videos and exclusive subscriber content, sign up for my mailing list at blog.zakjost.com. [Music]

Let's get started by recapping the data flow for a GAN. We start with a vector of randomly generated noise called z. This noise is the input to the generator, a neural network that transforms the noise input into a fake sample. We call the output of this neural network G(z). Next we have the real data, denoted here as x. Notice that the output of the generator has the same dimension as the real data; this is indicated here by three boxes. Finally we have the discriminator neural network. The input to the discriminator is either a fake sample or a real one, and its job is to output a single number that represents the probability that the input is from the real data distribution. If we use a fake sample as input to the discriminator, the output is denoted D(G(z)), the discriminator's estimate of the probability that the fake sample is real. If instead we use real data as input, the output is denoted D(x), the discriminator's estimate of the probability that the real sample is real. Since we know which ones are real and fake, we can assign the proper labels: 1 if it's a real sample, 0 if it's fake. Note that this adversarial framework has transformed an unsupervised learning problem, in which we just have raw data and no labels, into a supervised problem with labels that we create.

Now let's take a look at the cost function presented in the paper. It consists of two terms: the first represents the discriminator's predictions on the real data, the second the discriminator's predictions on the fake data. The first term is the expectation of the log of the discriminator output when the input is drawn from the real data distribution. In other words, if you sample a bunch of real data and give them to the discriminator, what's the average of the discriminator's predictions? Let's ignore the log transformation, because that just rescales the numbers. The discriminator wants D(x) to be a large number, because that represents high confidence that a real sample is actually real. The generator is not involved in this term at all. The second term is the expected value of the log of the quantity 1 - D(G(z)), where D(G(z)) is the discriminator's prediction on a fake sample. Once again, the expectation is just the average of the predictions when you draw a lot of noise inputs for the generator.

Let's take a deeper look at the D(G(z)) term. The discriminator wants D(G(z)) to be as small as possible, i.e. to have confidence that the fake samples are in fact fake. But since the generator wants to fool the discriminator, it wants this value to be as large as possible, i.e. for the discriminator to think the fake sample is real with high confidence. This sets up the adversarial game. So why does the cost function use 1 - D(G(z))? Because this transformation points the desires of the discriminator and generator in consistent directions across the two terms of the equation. Remember that the discriminator wants the first term to be large, because that's the probability of a real sample being predicted as real. By making the second term 1 - D(G(z)), we transform that quantity from something the discriminator wants to minimize into something it wants to maximize. So if we add the two terms together, the discriminator wants to maximize the entire thing and the generator wants to minimize the entire thing. Hence we end up with this optimization objective, which merely says to find a G that minimizes the cost function and a D that maximizes it.
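For reference, the equation being discussed (shown on screen in the video but not captured in the captions) is presumably the minimax value function from the original GAN paper (Goodfellow et al., 2014):

```latex
\min_G \max_D \; V(D, G) \;=\;
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  \;+\;
  \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first expectation is the term only the discriminator controls; the second is the term the two players fight over.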
OK, so we have a mathematical statement of our optimization problem; we still need a training algorithm that solves it. Let's step through the algorithm in the paper. For each training step we start with the discriminator loop. We repeat this loop k times before ever updating the generator, where the value of k is a choice we make. This allows the discriminator to converge before we attempt to optimize the generator, although in the experiments of this paper they just set k = 1. In any case, for each discriminator loop we start by drawing m noise samples and using the generator to transform the noise into m fake data samples. Next we sample m real data samples. At this point we have m fake samples and m real samples, so we can associate a label of 1 with each real sample and a label of 0 with each fake sample. We then pass the samples into the discriminator to get predictions, and use our labels and cost function to calculate a loss. Next we take the gradient of the cost function with respect to the discriminator's parameters. These gradients tell us how to change each parameter to most efficiently increase the loss function; this is like measuring the steepest direction to walk up a hill. So we update our discriminator parameters to maximize the cost function, or take a step up the hill. This completes the discriminator loop.

The generator loop is similar. We start by sampling m noise samples and using those as input to our generator to get m fake samples. Note that in this loop we're only interested in updating the generator's parameters, and this is done by taking the gradient of the loss function with respect to those parameters. Since the D(x) term depends only on the discriminator and the real data, its derivatives with respect to the generator are all zero. Since we know this in advance, there's no need to waste time drawing real samples and scoring the discriminator on them. That leaves us with a reduced cost function, whose gradient with respect to the generator parameters we can calculate and use to update the generator, walking down the hill. This completes a full training iteration. So that's how you train a GAN.
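As a concrete illustration of the alternating loop just described, here is a minimal PyTorch sketch. It is not the paper's or the video author's code: the tiny networks, the toy sample_real() distribution, and the values of k, m, the learning rate, and n_steps are all assumptions made for the example.

```python
# Minimal sketch of the GAN training loop described above (illustrative only).
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 3  # z dimension and real-data dimension (assumed)

G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_D = torch.optim.SGD(D.parameters(), lr=0.01)
opt_G = torch.optim.SGD(G.parameters(), lr=0.01)
bce = nn.BCELoss()  # -[y*log(p) + (1-y)*log(1-p)]: matches the two terms of the cost

def sample_real(m):
    # Stand-in for drawing m samples from the real data distribution.
    return torch.randn(m, data_dim) * 0.5 + 2.0

k, m, n_steps = 1, 64, 1000  # k discriminator updates per generator update

for step in range(n_steps):
    # Discriminator loop: step "up the hill" on the value function,
    # i.e. minimize BCE with label 1 for real samples and 0 for fakes.
    for _ in range(k):
        z = torch.randn(m, noise_dim)
        fake = G(z).detach()          # generator is frozen during D's update
        real = sample_real(m)
        loss_D = bce(D(real), torch.ones(m, 1)) + bce(D(fake), torch.zeros(m, 1))
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    # Generator loop: only the E[log(1 - D(G(z)))] term has a nonzero gradient
    # with respect to G, so we descend that reduced cost directly.
    z = torch.randn(m, noise_dim)
    loss_G = torch.log(1.0 - D(G(z)) + 1e-8).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```

If training behaves on this toy problem, the discriminator's outputs should drift toward 1/2 as the generator's samples approach the real distribution, which is exactly the fixed point derived in the theory discussion that follows.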
The other contribution of this paper is the nice theoretical results. The authors prove a major result: the minimum of the cost function is achieved if and only if the probability distribution of the generator matches that of the real data, which means the fake samples are indistinguishable from the real samples. The proof is done in two parts.

First, they show that for a fixed generator there is an optimal discriminator, and that when the real and generated distributions match, the resulting maximum of the cost function is a constant equal to minus log 4. To show this, let's write out the cost function but expand the expectation values into their integral form. We can then make a change of variables in the second term so that we integrate over the generator's output directly, instead of indirectly through the input noise variable. Now we have an integral over a single variable, denoted here as x, which represents data samples, and each term is weighted by the respective probability of it occurring in either the real data or the fake data. If we maximize the expression under this integral at every single point, we maximize the entire quantity. At any particular point, the expression under the integral has the form of a constant times log y plus a constant times log(1 - y), and we want to find the y that maximizes this. To find the maximum, we can apply the classic calculus trick of setting the derivative equal to zero and solving for the extremum. Applying this gives an equation which, with a little algebra, simplifies step by step until we end up with the solution. If we then plug the original variables back in for a and b, we recover the final form of the optimal discriminator. We can easily see that if the probability distributions of the real data and the generated data are identical, then the optimal discriminator returns a value of 1/2. This makes intuitive sense: if the real and fake data are identical, the discriminator just has to randomly guess which is which, a 50/50 shot. Let's plug this value of 1/2 into our cost function. This just gives a constant of log 1/2 in each of the expectations, which ultimately evaluates to minus log 4, since the expectations don't do anything to a constant. So we have shown two things: that the optimal discriminator is given by a simple relation between the probability distributions of real and fake samples, and that when these distributions are equal, the maximum of the cost function is minus log 4.

Second, they consider the generator, which wants to minimize the cost function. Let's look at the cost function again and plug in our equation for the optimal discriminator to get an upper bound on the loss. Next, let's write down the definition of the Kullback-Leibler divergence, or KL divergence, for reasons that will become apparent. Notice that the definition of KL divergence is an expectation value of the log of a ratio of probability distributions, which is precisely what we have when we plug in our expression for the optimal discriminator. Also notice that if we multiply the top and bottom of the inner expression by 2, we can pull one factor of 2 out as a minus log 2 and keep the other in the denominator. This is useful because it enables us to rewrite the cost function as this equation. All right, one last definition: the Jensen-Shannon divergence is defined as follows. It's very similar to KL divergence except that it's symmetric, meaning the Jensen-Shannon divergence from P to Q is the same as from Q to P; this is not true in general of KL divergence. At a high level, think of JS divergence as a distance measure between two probability distributions. Our cost function turns out to be 2 times the JS divergence between the real and fake data distributions, plus the constant minus log 4. And yes, this minus log 4 is the very same one that represents the maximum constant cost we incur from having an optimal discriminator when the data distributions match.

So this is the final form we've been working toward, and here's the important point of all this math: for an optimal discriminator, the generator is aiming to minimize this quantity. The minimum of any JS divergence is 0, and this occurs if and only if the two probability distributions are equal. So this shows that the cost function has one minimum, and that minimum is achieved when the generator perfectly maps the real data distribution. When this happens, the JS divergence term is 0, which implies the distributions are equal, and then our minus log 4 term falls out, which we separately showed is the cost of the discriminator making coin-flip decisions when the real and fake data distributions are equal. So that's the theory under the covers.
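Written out, the derivation sketched above goes as follows, with p_data and p_g denoting the real and generated distributions (notation follows the original paper):

```latex
% Cost for a fixed generator, after the change of variables to x:
V(G, D) = \int_x \Big[\, p_{\text{data}}(x)\,\log D(x) \;+\; p_g(x)\,\log\big(1 - D(x)\big) \Big]\, dx

% Pointwise, the integrand has the form  a \log y + b \log(1-y)  with a = p_data(x), b = p_g(x).
% Setting the derivative  a/y - b/(1-y)  to zero gives  y = a/(a+b), hence:
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

% Plugging D^* back in and multiplying each ratio by 2/2 pulls a -\log 2 out of each term:
C(G) = -\log 4
     + \mathrm{KL}\!\left(p_{\text{data}} \,\Big\|\, \tfrac{p_{\text{data}} + p_g}{2}\right)
     + \mathrm{KL}\!\left(p_g \,\Big\|\, \tfrac{p_{\text{data}} + p_g}{2}\right)
     = -\log 4 + 2\,\mathrm{JSD}\big(p_{\text{data}} \,\|\, p_g\big)

% JSD is nonnegative and zero iff p_g = p_data, so the unique minimum is C(G) = -\log 4,
% at which point the optimal discriminator outputs D^*(x) = 1/2 everywhere.
```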
To recap this video: we showed the architecture of a GAN and the training process, where random noise vectors are used as input to a generator, and the output of the generator is used as input to a discriminator neural network. To learn the discriminator, we treat it as any other classification problem, where the inputs are real samples and fake samples from the generator, and we try to predict the probability of being real. We then use the positive direction of the gradient of our loss function to update the discriminator in an attempt to maximize the cost, and the negative direction to update the generator to minimize the cost. This process is designed to mimic the theoretical proof, which showed that for an optimal discriminator, minimizing the cost function results in a generator that perfectly models the real data. So our training loops take turns optimizing the discriminator toward convergence while holding the generator fixed, and then updating the generator. Ideally this process is repeated until convergence to the global minimum.

I hope this takes some of the mystery out of the math for you. If you'd like another perspective on this topic, I have a three-part blog series on GANs you might find useful; I'll put a link in the description along with the original paper. And if you'd like early access to videos and exclusive subscriber content, don't forget to sign up for my free mailing list at blog.zakjost.com. Thanks for watching.
Info
Channel: WelcomeAIOverlords
Views: 60,686
Keywords: Generative Adversarial Networks, Machine Learning, Data Science, Artificial Intelligence, GANs, Deep Learning
Id: J1aG12dLo4I
Length: 12min 3sec (723 seconds)
Published: Mon Jul 08 2019