Ali Ghodsi, Lec : Deep Learning, Generative Adversarial Network, Oct 24 2017

Video Statistics and Information

Captions
Okay, let's start. Today's lecture will be about generative models. We are going to discuss two types of generative models: one of them is generative moment matching networks, and the other one is generative adversarial networks, which are called GANs. We will spend most of our time on GANs — they are very popular these days — but let me start with moment matching networks.

Before I tell you about moment matching networks, I need to tell you about one measure of similarity between two distributions that is used in statistics. It is relatively recent, and it is used in this type of network. The measure is called maximum mean discrepancy. Maximum mean discrepancy is a pretty counterintuitive but very interesting concept. Suppose that you have two distributions, distribution P and distribution Q, and you want to know how close these two are. So far you know at least one way of comparing two distributions — what is it? KL divergence, right? So that's one measure. Now suppose that I have a sample from P, call it X, and I have a sample from Q, call it Y, and say I have n samples here and m samples there. My question is: how close are these two distributions? Maximum mean discrepancy, or MMD for short, basically suggests that you can map this data through a transformation Φ: take X, apply some transformation Φ to it, and you have Φ(X). Do this for all of the points, then take the average of these Φ(x)'s: (1/n) Σ_{i=1}^{n} Φ(x_i) is the mean of the Φ(x)'s. Do the same thing for Y, with 1/m. Now MMD claims that the distance between distribution P and distribution Q can in fact be measured as the distance between these two means.

[Question: is this a theorem?] Yes — Φ(X) and Φ(Y) are two transformations of P and Q, but it depends on the type of this transformation; it's not true for every transformation, and it's not trivial actually, because what we are comparing here is only the mean. Before applying this transformation, suppose that this is my P and this is my Q: they have exactly the same mean, but they are very different distributions. How come, just by comparing the means of two distributions, you can say that these two distributions are the same? The idea is this: suppose that these are one-dimensional distributions, and suppose that I apply a Φ that maps them into a two-dimensional space, such that each x becomes (x, x²) and each y becomes (y, y²). In the original space, when I take the average of the x's and the average of the y's, they match each other, but the two distributions have different variances, so they are not the same. But suppose that I apply this Φ and now the points live in two dimensions. If I take the average of these two-dimensional points and the average of those two-dimensional points, do they match? No. The mean is now two-dimensional: the first element of the mean will match, but the second element will not, because the second element of the mean in the two-dimensional space is basically the variance in the original space. So you go to the two-dimensional space and take the mean, but that mean carries both the mean and the variance of the original space; if the two means don't match there, it means the moments here don't match. That's basically the intuition.
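A tiny numerical illustration of that intuition — not from the lecture, just a sketch in numpy; the feature map Φ(t) = (t, t²) and the two example distributions (same mean, different variance) are the ones described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two one-dimensional samples with the same mean but different variances.
x = rng.normal(loc=0.0, scale=1.0, size=10000)   # sample from P
y = rng.normal(loc=0.0, scale=2.0, size=10000)   # sample from Q

# Comparing raw means alone cannot tell them apart.
print(x.mean(), y.mean())              # both approximately 0

# Lift each point with the feature map phi(t) = (t, t^2).
phi_x = np.stack([x, x**2], axis=1)
phi_y = np.stack([y, y**2], axis=1)

# The lifted means now disagree in the second coordinate, which is exactly
# the second moment (variance) of the original sample.
print(phi_x.mean(axis=0))              # approx [0, 1]
print(phi_y.mean(axis=0))              # approx [0, 4]
```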
But the idea is that there might be some sort of Φ which takes care of all of the moments of a distribution, not just the first and second. Maybe you can go to a space such that you have all of the moments, and if the two means in that space are not the same, it means the moments of the two distributions don't match. If two distributions have the same set of moments, they are the same — that's basically the idea.

Now the interesting thing is that in fact you don't need to know this Φ. Why not? Suppose I want to compute this quantity, and these are the definitions of the two means. Basically what I need to measure is ‖(1/n) Σ_i Φ(x_i) − (1/m) Σ_j Φ(y_j)‖². I can write it as ((1/n) Σ_i Φ(x_i) − (1/m) Σ_j Φ(y_j))ᵀ ((1/n) Σ_i Φ(x_i) − (1/m) Σ_j Φ(y_j)). If I expand this, I have one term which is (1/n²) Σ_i Σ_j Φ(x_i)ᵀΦ(x_j), another term which is (1/m²) Σ_i Σ_j Φ(y_i)ᵀΦ(y_j), and then a cross term, which is −(2/(nm)) Σ_i Σ_j Φ(x_i)ᵀΦ(y_j). Now, I have dot products here: Φᵀ Φ can be replaced by — sorry — a kernel, right? So I don't need to know Φ; I can just put a kernel here, and this is also a dot product, so I can put a kernel there too. Basically I don't need to know Φ at all; I can rewrite everything in terms of kernels.

But it can't be just any Φ: Φ should have certain properties if you want to prove this — it should be bounded (within a unit ball) and have some other properties — so not every kernel will do the job here, only some specific kernels. One of them is the RBF kernel, for example: the RBF kernel maps the data into an infinite-dimensional Hilbert space, and you can match all of the moments of the two distributions. So basically you just need samples from the two distributions, and then you can use this measure to compute the distance between them. If you remember, KL divergence was not a distance — it was a divergence, it was not symmetric. This is symmetric, so it really is a distance: if two distributions match it is zero, otherwise it is greater than zero, so it is always nonnegative. You can measure the similarity between two distributions using this measure. This is called MMD.

Now, generative models. You know the difference, I think, between generative models and discriminative models. In generative models, basically you would like to generate data, and generating data can be quite easy if, say, you're generating data from a Gaussian. If I have a Gaussian distribution and I tell you, okay, generate a data point for me — meaning show me a point with the same sort of properties as the training data — you can sample from a Gaussian: you find the mean and variance of this Gaussian, and we know how to sample from Gaussians, from a uniform, from a Bernoulli; you know how to sample from simple distributions. So this is my sample, and if I ask you to sample a new point, generate a new point, you know how to do it. But what if my samples are handwritten digits? I show you some handwritten digits and I tell you, okay, generate a new one for me — meaning sample one. There is no difference between these two problems in nature: that was a two-dimensional vector I wanted to sample from, and this is a much higher-dimensional vector.
You can just vectorize the image — it's a long vector — and sampling from this means sampling a high-dimensional vector in such a way that, when I reshape it, it looks like a handwritten digit. That's a pretty difficult job, or at least it was pretty difficult up to a couple of years ago; it was very hard to estimate the distribution of data like this.

Now, the idea of generative models in deep networks is this: suppose that I have a network. I know how to sample from a uniform, I know how to sample from a Gaussian — can I feed this network with a sample from a uniform or a Gaussian such that the output looks like an image, looks like a face, a room, a building, whatever? The idea is that this network is a very complicated function, so I feed it with a simple uniform or Gaussian sample, and I expect the network to transform that vector into another vector which, when I reshape it, looks like a handwritten digit or an image or whatever. That's the idea.

To materialize this idea, suppose that we have chosen a data set that we want to sample from — say it's MNIST handwritten digits. So I have handwritten digits, i.e. MNIST, and I can have a network, say a feed-forward network; the input of the network is uniform or Gaussian noise, and for the output of the network I need to define an objective function. For the objective function I'm going to use MMD. Basically I would say: the output of this network is one data set, Q, and the original MNIST data is another data set, P, and I want to minimize the distance between P and Q — I want the distribution of one to be similar to the distribution of the other. And we have the tool for that: if I have a sample from P and a sample from Q, then I can compute this distance and train the network — change the weights of the network — such that the distribution of the output of the model matches the distribution of the data. It's a very nice objective function: it's differentiable, I can take its derivative, and I can do backpropagation and train the network. This is called a generative moment matching network, and it has been around since 2013 or so.

[Question about the kernel] Yes, you can — actually, in practice they play with this kernel a lot; in fact they add up several kernels, because the kernel is a measure of similarity. They use, say, an RBF kernel plus a polynomial kernel plus some other types of kernels, added up or in a linear combination. These choices are not really mathematical, they're heuristic, because some of the kernels you see in some papers violate the theorem: theoretically it's supposed to be a kernel corresponding to a bounded transformation, and they choose, for example, a polynomial kernel, which doesn't satisfy this.

[Another question] No, no — we're going to talk about GANs in a moment. Any question here? Yes — you know, this is MNIST, and say I have 10,000 handwritten digits; these are the samples that I have. Let me call that data set Q and call the samples y, to be consistent with what we had. These are my samples from Q. For my network, I randomly set some weights and feed it with random Gaussian or random uniform noise, and it generates something; suppose these are my samples from P. I would like to minimize the distance between P and Q. Q is fixed — I don't change MNIST.
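A minimal sketch of the kernel-form MMD estimator described above, of the kind that could serve as such a training signal — this is an illustration rather than the lecture's code, and the RBF bandwidth and function names are my own choices:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """RBF (Gaussian) kernel matrix between the rows of a and the rows of b."""
    sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x ~ P and y ~ Q."""
    n, m = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    return k_xx.sum() / n**2 + k_yy.sum() / m**2 - 2 * k_xy.sum() / (n * m)

# Samples from the same distribution give a small MMD; different distributions give a larger one.
rng = np.random.default_rng(0)
p1 = rng.normal(0, 1, size=(500, 2))
p2 = rng.normal(0, 1, size=(500, 2))
q  = rng.normal(1, 1, size=(500, 2))
print(mmd2(p1, p2))   # close to 0
print(mmd2(p1, q))    # clearly larger
```

In a moment matching generative network, x would be a minibatch of generator outputs and y a minibatch of MNIST images, and the same quantity — computed in a differentiable framework rather than numpy — is what gets backpropagated into the generator's weights.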
So basically it means that I have to change these weights until the distribution of the output of this network is similar to the distribution of MNIST.

Yes? [Question about choosing the kernel] Basically you play with different kernels until it works. Sorry — how do you know? If you think about this intuitively, not mathematically, and think of the kernel as a measure of similarity, this measure basically says that the distance between the two distributions is: how similar each point is to the other points of its own distribution, plus how similar the points of the other distribution are to each other, minus two times the similarity between the two samples. So intuitively it makes sense. It's not obvious that it's a measure of similarity between two distributions, but as a similarity measure it makes sense — how similar each sample is internally, and how different the two are from each other. As long as it's a good measure of similarity, you should be able to produce something reasonable out of it.

For MNIST, let me give you this example: suppose that this is one data point of a data set and this is another. According to our perception, these two are pretty similar shapes, but if you look at the Euclidean distance between them, it's huge — wherever one is black the other is white, and vice versa. So the Euclidean distance is big, and if I compute an RBF kernel, which is based on this Euclidean distance, it doesn't show any similarity between these two; it tells me they are very dissimilar. But they are similar according to my perception. I have to choose a kernel which matches my task, matches my perception. If in my task these two are similar, then an RBF kernel based on Euclidean distance would be a bad kernel — it doesn't do the job. So you have to choose the kernel carefully, so that it matches your intuition and perception about the data.

Okay, so that was moment matching. There is another way of making a generative model; it's called an adversarial network, or GAN — generative adversarial network. That's the first paper that introduced GANs, from 2014. It is very popular these days, and you see many applications of it. Let me show you two quotes, from Bengio and LeCun: LeCun believes that it's the coolest idea of the last ten years — the last decade — in machine learning, and Bengio believes that basically the future of AI depends on this type of approach.

You can see many interesting applications that have used GANs; here are a couple of them. You can have a network whose input is uniform or Gaussian noise and whose output is, say, an image of a person, and so on, but there are other types of applications: a network that takes this kind of label map as input and generates a realistic street view as output, for example; or takes as input a picture shot from an airplane or a drone and makes a map out of it; or the input is, again, labels or a schematic of a building and the output is quite realistic; the input is black and white and the output is color; the input is just edges and the output is a realistic image; the input is day and the output is night, and vice versa — night and day — and any kind of transformation or translation in the image space or the text space. You can see many interesting applications, and they use this technology.
They use GANs. Here's another example: these are not real pictures; these pictures have been generated by a network, and they look quite realistic. The data set was a data set of bedrooms, and these are bedrooms that don't exist, but they're quite realistic. You may have heard — or I showed you in the introduction — that in the space of words we can do arithmetic: you can compute king minus man plus woman is queen. In this paper they do the same thing in the space of images: this man with glasses, minus a man without glasses, plus a woman — that's the output, a woman with glasses. This one is quite interesting: caption to picture. We have a description and would like to generate an image; this image has been generated from this caption — the description is "the orange on a table next to a liquor bottle", and that's what the network generates from it.

The concept of adversarial learning is quite simple. Suppose that this is my generative model and I don't have this measure of similarity between two distributions. I could be the judge myself: if I had a way to take the derivative of my judgment, I could sit here, let the network generate something, look at the image and say, no, it doesn't look like MNIST — change it. I can do that, but I cannot take the derivative of my judgment. "Change it" means what? How can I backpropagate my judgment to the network and let the network correct itself? An adversarial network is basically the same thing, with the exception that you can take the derivative of this judgment: instead of me being the inspector here, there is a classifier. I put a classifier here, so I don't use, say, MMD. The classifier sits here, the output of the network comes in, and the classifier should classify it — tell me whether it is realistic or not. And how can the classifier decide whether it's realistic? I have a data set, MNIST — that's one class; I have the output of this model — that's another class. Train a classifier which distinguishes these two classes: the output of the network, and MNIST. Suppose you have a very good classifier which can make that distinction. Then you let the network train itself to produce outputs that look like MNIST, so that the classifier cannot distinguish the output of the network from MNIST. Basically the network tries to make itself better and better, to the point that it can fool the classifier. Now take this idea one step further: it's not a pre-trained classifier, it's a classifier that you train during the process of training the network, so the two are trained at the same time, and they play an adversarial game. The network would like to fool the classifier — to produce images in such a way that the classifier cannot recognize they're fake; the job of the generative model is to make images such that the classifier cannot tell they're fake. On the other hand, the classifier should make itself better and better, to the point that the generative model can't fool it. The classifier wants to detect any fake image, and the generative model wants to make images the classifier cannot detect. So they compete with each other, and you train both of them at the same time.
So that's the idea: you have real-world images, like MNIST, and you take a sample, which goes to a classifier; and you have a generative model, the generative model makes an image, and you take a sample of that too. The discriminator treats them as two classes — assign one to one class and, say, negative one or zero to the other class — so one of them is real and the other is fake. You want to make this classifier better and better, so it makes this distinction perfectly; on the other hand, the generator should get better and better, so that the discriminator cannot tell its output is fake. The latent variable works like this: z goes into the network — a sample from a uniform or from a Gaussian, for example, a so-called latent random variable. It is not part of the network; it's your sample. You can think of your generative model as this model: this is your observation, and the observation comes from a latent variable. Imagine that this is your model — everything that I observe is governed by a variable that I don't observe. If I have the model, then when I sample from this latent variable I can get one of these observations; but at the beginning I don't see the latent variable, I only see the observations — I only see the MNIST digits.

[Question] Ah, yes — if you have a data set that contains all ten digits and another data set that contains only twos, a classifier can distinguish these two classes, right? The idea here is that, in the end, the classifier has no way to distinguish the two: they are the same, they have the same distribution.

[Question about the objective] Yes — that's the objective function, and for the next half an hour I'm going to explain what this objective function is and how it works. So this is the objective function of a GAN. It is basically a min-max problem: we have a discriminator D and a generator G, and we solve a min-max. Given G, the generator, we try to maximize this cost function — find a discriminator, a classifier, or rather the weights of the classifier, that maximize the cost function; and given the classifier D, we minimize this cost function, computing the weights of the generator such that it is minimized. The cost function has this form: p_data is the real data — say MNIST, when I sample from MNIST — and p(z) is basically this z; this is G, and this x would be G(z), the output of the network. So I have the generator and I have the discriminator. I sample from MNIST — that's my real data, p_data — take a point from it, it goes to the discriminator, and the discriminator should say one, it's real data. And I feed the generator with z — a Gaussian, a uniform, whatever — and G(z) is my new x; it goes to the discriminator, and the discriminator should say zero, it's fake. Then we change the weights until these match, and the claim is that by doing this min-max you can actually match the distribution of your real data with the distribution of your generator — we'll see how. So the idea is: take G as given and compute D from the max; then assume D is given, compute the min, compute G; then assume G is given, compute D; and go on until both G and D are trained. Basically you switch between the two. Okay, that's the cost function.
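For reference, the value function being described is the standard GAN objective from Goodfellow et al. (2014), min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]. Below is a minimal sketch of the alternating optimization just described — an illustration only, not the lecture's code: the small fully connected networks, hyperparameters, and names are my own choices, and the generator step uses the common non-saturating variant (maximize log D(G(z))) rather than literally minimizing log(1 − D(G(z))).

```python
import torch
import torch.nn as nn

# Hypothetical small MLPs for flattened 28x28 digits; real image GANs usually use conv nets.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):                       # real_batch: (batch, 784), scaled to [-1, 1]
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: with G fixed, push D(real) toward 1 and D(G(z)) toward 0.
    z = torch.randn(batch, 100)
    fake = G(z).detach()                          # treat the generator as given
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: with D fixed, make D(G(z)) go toward 1, i.e. fool the discriminator.
    z = torch.randn(batch, 100)
    g_loss = bce(D(G(z)), ones)                   # non-saturating form of the generator loss
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The non-saturating generator loss is a practical choice (it gives stronger gradients early in training); the fixed-point analysis that follows is for the original min-max form.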
Let's just write the definition of the expectations: the expectation of log D(x) with respect to p_data is the integral ∫ p_data(x) log D(x) dx, and the other expectation, by definition, is the same kind of thing, ∫ p_z(z) log(1 − D(G(z))) dz. Now, this is p(z), this is my z, and this is G(z), the output of the network. So x is G(z); if x is G(z), that means z is G⁻¹(x), if it's invertible, and then dz is basically the derivative of G⁻¹ at the point x times dx. So what I'm going to do is replace z with G⁻¹(x) and dz with (G⁻¹)′(x) dx in these definitions. The first term doesn't change; in the second term, wherever I have z I change it to G⁻¹(x), and G(z) I change to x — G(z) is x and z is G⁻¹(x) — so I rewrite everything in terms of x, and my dz becomes (G⁻¹)′(x) dx. Now everything is in terms of x.

You know that z has a distribution — Gaussian, or uniform, or whatever — and G is a transformation applied to z, so I can compute the distribution of G(z): when you transform a variable, you can tell what its distribution is after the transformation, and it has to do with the Jacobian of the transformation. That's exactly what we're doing here. G is a transformation applied to z, and z has a distribution which we call p(z); after the transformation, the distribution of G(z) is basically that distribution evaluated at G⁻¹(x) times the Jacobian of G⁻¹, because of the transformation rule — that's how you compute a distribution after a transformation. So p_z(G⁻¹(x)) times the Jacobian at the point x is the distribution of G(z); I call it p_g, the distribution of the generated points. So when I go from this step to that step, changing all the z's to x, I get this term and this term, and their product is basically the distribution of the generated points, p_g. So I can just write p_g(x) log(1 − D(x)) here.

Given G, I have to maximize this cost function and compute D. Both integrals are over x, so I write them as one integral: ∫ [p_data(x) log D(x) + p_g(x) log(1 − D(x))] dx. I have to maximize this with respect to D, so take the derivative and set it to zero. To understand it better, assume the discrete case: you have a sum of terms, and you're taking the derivative with respect to D at one given x, so basically with respect to one term of that sum. The derivative of the first term is 1/D(x) times p_data, and the derivative of the second term is −p_g divided by (1 − D(x)); set this equal to zero. Multiplying through by D(x)(1 − D(x)): p_data times (1 − D(x)) minus p_g times D(x) equals zero, that is, p_data − p_data·D(x) − p_g·D(x) = 0, so D(x) is p_data divided by p_data plus p_g. So if I take the derivative and set it to zero, then D(x) = p_data(x)/(p_data(x) + p_g(x)) — that's the optimum D given G.
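Written out cleanly, this is the standard derivation from the original GAN paper, restated here for reference rather than copied from the board:

```latex
V(G,D) = \int p_{\text{data}}(x)\,\log D(x)\,dx + \int p_z(z)\,\log\bigl(1 - D(G(z))\bigr)\,dz
       = \int \Bigl[\,p_{\text{data}}(x)\,\log D(x) + p_g(x)\,\log\bigl(1 - D(x)\bigr)\Bigr]\,dx,

\frac{\partial}{\partial D(x)}\Bigl[p_{\text{data}}(x)\log D(x) + p_g(x)\log\bigl(1-D(x)\bigr)\Bigr]
  = \frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1-D(x)} = 0
\;\Longrightarrow\;
D^{*}_{G}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} .
```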
Okay, now let's get back to the cost function we had. Given G, we maximized for D and realized that the optimum D has this form. Now, in the objective function that we took the derivative of, replace every D by the optimum D, and then optimize for G — because that was the process: assume G is given, optimize for D; then assume D is given, optimize for G. I know that, given G, the optimum D is p_data divided by p_data plus p_g, so I replace every occurrence of D, assuming it is optimal, by p_data/(p_data + p_g). Are you okay with this so far?

[Question: ReLU is not invertible] Yes, ReLU is not invertible, but you can assume that there is another network. "This network is not invertible" means that if you feed the network with z and generate x, you can't use the same weights, put x in, and get z back out the other side. But it doesn't mean there does not exist another network which takes x and generates z — you can do that. No, that's my point: go ahead and use ReLU here, it doesn't matter. You can't run this network backward to recover z, but there is another network, call it L, which takes x as input and generates z as output, and that L acts as the inverse of this G. When you say ReLU is not invertible you're right, but you're under the impression that for G to be invertible I should be able to feed the network backward — it doesn't work that way; there is another network which acts exactly as the inverse of this one, and you can train that one.

Okay, now I'm going to divide this quantity by 2 and multiply it by 2. When I divide inside the logarithms by 2, I get a log of 1/2 here, which is negative log 2, and another log of 1/2 there, which is negative log 2; I add them up and pull out a negative log 4. So basically I divide these two terms by 2 and put a negative log 4 in front. Now, what is this? It's a KL divergence: a KL divergence between p_data and (p_data + p_g)/2, and the other one is a KL divergence between p_g and the same quantity, minus log 4 — and you would like to minimize it. KL divergence is always nonnegative; it is greater than or equal to zero, it cannot be negative. So this term is ≥ 0 and that term is ≥ 0, and if you want to minimize the whole quantity, what is the minimum value? Negative log 4, right — because the minimum happens when this term is zero and that term is zero, since they can't take any negative value. So the minimum is −log 4.
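Written out, the identity being built here — the standard one from the GAN paper; the sum of the two KL terms is twice the Jensen–Shannon divergence, which the lecturer refers to just below as the symmetric KL divergence — is:

```latex
\max_D V(G,D) = -\log 4
  + \mathrm{KL}\!\left(p_{\text{data}} \,\middle\|\, \tfrac{p_{\text{data}}+p_g}{2}\right)
  + \mathrm{KL}\!\left(p_g \,\middle\|\, \tfrac{p_{\text{data}}+p_g}{2}\right)
  = -\log 4 + 2\,\mathrm{JSD}\bigl(p_{\text{data}} \,\|\, p_g\bigr) \;\ge\; -\log 4,
```

with equality exactly when p_data = p_g.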
And don't forget that a KL divergence is zero exactly when the two distributions are equal. So the first term tells me that p_data equals (p_data + p_g)/2, and the second term tells me that p_g equals (p_data + p_g)/2. When is this possible? When p_data is equal to p_g. And that was our goal: I wanted to match the output distribution of the generator, which is p_g, with the distribution of the data. So by minimizing this cost function, given the optimum value of D, the two distributions will be matched.

[Question] Yes — this is basically the symmetric KL divergence. By definition, the symmetric version is the KL divergence of P with respect to M plus the KL divergence of Q with respect to M — because KL itself is not symmetric — where M is (P + Q)/2, and that's exactly what's happening here: one term is p_data with respect to M, the other is p_g with respect to M, and M is (p_data + p_g)/2. We divided by 2 and multiplied by 2 precisely to show that minimizing this matches each distribution to M, where M is the average of the two.

So by minimizing the cost function, given the optimum D, you are matching these two distributions. And the optimum D is p_data/(p_data + p_g); if p_data and p_g are matched, then p_data/(p_data + p_g) is one half, so the discriminator outputs 0.5.

So, in summary: this is our cost function, we have a generative model and a discriminative model, and we are looking for the G that solves this cost function. We assume D is given and compute G, then assume G is given and compute D, and we iterate between the two; this is basically equivalent to matching the two distributions. So we don't have an explicit cost function for matching two distributions here — in the MMD approach we matched the distributions directly, while here the matching happens through the classifier. Okay, any question? Yes — is it supervised? Adversarial learning by itself is neither supervised nor unsupervised; the cost function can be used anywhere — in an unsupervised problem or in a supervised one.

So let me show you a couple of examples. This is text to image: a caption generates images. If you want an idea of the general structure of this type of network, think of it this way: suppose I have a network which takes text and reconstructs text, like an autoencoder, and a network which takes images and reconstructs images. The first takes text and generates text, but in the middle it translates every text into a fixed-length vector; the second takes images and reconstructs the same image, but in the middle it translates everything into a fixed-length vector. If I take this part and that part and put them together, the input could be x and the output could be y. The only problem is that the fixed-length vector of one may not match the corresponding fixed-length vector of the other, so you need another network here which maps any fixed-length text vector to the corresponding fixed-length image vector. So suppose you have a data set of captions and corresponding images: you train one network only on captions and one network only on images — even on images you don't have captions for, and on text you don't have images for — you train this one, and you train that one.
Then you have a data set of captions and corresponding images. In the first network, any caption is translated to a fixed-length vector; in the second, any image is translated to a fixed-length vector. So in my data set of captions and images, this vector corresponds to this text and that vector corresponds to this image, and if I put a network between the two that maps this vector to that vector, then I can give text as input and get an image as output. That's not how these models were originally designed — this is a sort of abstraction of many different papers — but conceptually that's what's happening. You can do many different things with this part and that part: the input is an image and the output is text, or the input is text and the output is an image; this one is English to English, this one is French to French, and the combination could be an English-to-French or French-to-English translator. Imagine going from one domain to another — language, text, image, whatever. Conceptually that's what's happening, though every single paper has a different recipe.

So it translates text to images. These images are fake — they have been generated by a network — but they look very realistic. These flowers also don't exist; given the description of a flower, the image is simply generated by a network.

We learned about the variational autoencoder the other day, and the variational autoencoder can also be used as a generative model: it takes an input, it's an autoencoder, and at the middle there's a bottleneck with a constraint that the code should be similar to a Gaussian. Now, if I get rid of the encoder part and sample from a Gaussian, the output should look like the data set we trained on, so it is a generative model. Here, this is a variational autoencoder and this is a GAN; it's hard to see the difference — they're right beside each other — but the claim of the GAN paper is that the samples generated by the GAN show more detail and are more realistic. Something interesting: if you compare the likelihood of the data generated by the variational autoencoder with the likelihood of the data generated by the GAN, the variational autoencoder's likelihood is higher, but according to our perception the quality of the GAN is better. The likelihood of the variational autoencoder is higher because its objective function is designed for that — we are basically maximizing that log-likelihood (a variational bound on it) — whereas we never maximize the likelihood for a GAN. So the GAN is more realistic in terms of our perception, but lower in terms of likelihood.

[Question] Yes — but this is at a conceptual level; it's not that you take this recipe, upload it somewhere, and it works immediately. If you look at the papers that translate captions to images or images to captions, it's not as simple as what I explained, but it's an abstraction of many of these papers that do this kind of translation from one language to another or from one kind of image to another; conceptually that's what's happening.

[Question about domain adaptation] Yes — domain adaptation. I think you know what domain adaptation is: you have a model which has been trained on a training set, and the distribution of your test set is different from the distribution of your training set.
We saw examples the other day, and there are network-based models for domain adaptation. The idea is that you have a network, and you use another network as a transformation which matches this distribution with the distribution of your test set — and you can use maximum mean discrepancy to match the two distributions, or you can use a GAN to match them.

These are some real images from CIFAR-10, and these are some generated images, produced by a GAN trained to match the distribution of CIFAR-10; they look quite realistic. The source code for the original paper, and other implementations, exist if you want to try it. Any other questions? Okay, thank you.
Info
Channel: Data Science Courses
Views: 34,806
Rating: 4.9284434 out of 5
Keywords: Generative Adversarial Network, Gan, Deep Learning, Machine Learning, Generative models, moment matching networks
Id: 7G4_Y5rsvi8
Length: 67min 35sec (4055 seconds)
Published: Wed Nov 22 2017