Week 8 – Practicum: Variational autoencoders

Captions
Okay... 97, yes, almost 100, come on, three more please. I should invite my mom, since she hacked into this morning's conversation; it was funny. How the heck she managed to hack in, only God knows. Don't join with two devices just to increase the number. 100! Okay.

All right, so let's get back to the autoencoders we started last time: autoencoders and generative models. Let's restart with a quick review of the autoencoder. Again, we have the input at the bottom, in pink, as you can now see from the colours; then you have the rotation, the affine transformation, and you get the hidden layer; then another transformation, and you get the final output, which we are going to force to be similar to the input. Next to it you have a parallel kind of diagram, where each transformation is represented by a box. People call this network a two-layer neural net, because there are two transformations, but what I actually advocate is that this is a three-layer neural net, because for me the layers are the activations; that's usually the definition. Yann now uses those new symbols that look like a box with a round top.

So we have two different diagrams, and we can switch back and forth between the representations: sometimes it's easier to use the left one when we want to talk about single neurons, and sometimes we prefer the other one, which can also account for multiple layers, since each block here, the encoder and the decoder, can itself be several layers. So we have these two macro-modules. The input goes into an encoder, which gives us a code h: what was the hidden representation of a neural net is, when we talk about autoencoders, called the code. We therefore have an encoder, which encodes the input into this code, and a decoder, which decodes the code back into a representation living in the same space as the input. So on one side you have the autoencoder, and on the other you're going to see what a variational autoencoder is.

All right, so there you go: a variational autoencoder. It looks the same, so what's the difference? The first difference is that instead of having the hidden layer h, the code is now made of two things: E(z) and V(z), which are going to represent the mean and the variance of this latent variable z. We then sample from the distribution that has been parameterised by the encoder, and we get z, my latent variable, my latent representation, and this latent representation goes into the decoder. So I have a normal distribution with some parameters E and V; E and V are deterministically determined by the input x, but z is not deterministic: z is a random variable, which gets sampled from a distribution parameterised by the encoder. Let's say h was of size d; the code here is going to be of size 2d, because we have to represent all the means and then all the variances. In this case we assume that we have d means and d variances, and that each of those components is independent.
So we can also think of the classic autoencoder as encoding just the means: if you encode the mean and you have zero variance, you get back a deterministic autoencoder. In that case h might be of size d, and on the variational side E and V together total 2d, since we have d means and d variances. Does that mean we're sampling several distributions? It's going to be one multivariate Gaussian whose components are all independent of each other, so z is going to be a d-dimensional vector; and to sample a d-dimensional vector from that Gaussian you need d means and, in this case, d variances, because we assume that all the off-diagonal entries of the covariance matrix are zero and you only have the diagonal, where all the variances live.

Just to recap: the encoder maps the input set of samples into R^{2d}, so we map from x to the hidden representation, and the decoder maps the z-space back into R^n, the original space of x, so we go from z to x̂. Someone asked: are E(z) and V(z) the output of the encoder? Yes, E(z) and V(z) are just parameters that are deterministically output by the encoder; the encoder is the classical affine transformation and squashing, just a piece of a neural network that outputs some parameters. So this is the encoder, which gives me these parameters E and V given my input x, and this part is deterministic. Then, given those parameters, we have a Gaussian distribution with specific means and specific variances, and from this Gaussian distribution we sample one z. Then we decode; we're going to see what that means in a second, but basically you are encoding the mean and then adding some noise to that encoding. In the denoising autoencoder we took the input, added noise to the input, and tried to reconstruct the input without noise; here the only thing that has changed is that the noise is added to the internal representation rather than to the input. Does that make sense? "Yeah, that makes a lot more sense, thank you."

"I noticed that the notation looks like an expected value. Are we just generating a mean from z, or are we actually computing some kind of weighted average?" No, there is no averaging. Say d is 10, the size of the hidden representation: instead of having 10 values, the encoder now outputs 20 values, 10 representing the means and 10 representing the variances. So given my x we output a vector h; the first half of the vector represents the means of a Gaussian distribution and the other half represents the variances of that same Gaussian, so you have the mean of the first component together with its variance, the mean of the second component together with its variance, and so on.
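As a concrete sketch of this one-vector-two-halves idea (the names below are placeholders, not the notebook's exact code; the notebook shown later works with log-variances rather than raw variances):

```python
import torch

d = 10                                 # size of the latent representation in this example
x = torch.randn(4, 784)                # placeholder batch of flattened inputs
encoder = torch.nn.Linear(784, 2 * d)  # stand-in for the real encoder network

h = encoder(x).view(-1, 2, d)          # one output vector per sample, viewed as two halves
mu = h[:, 0, :]                        # first half: the d means, E(z)
var = h[:, 1, :]                       # second half: the d variances, V(z) (unconstrained here,
                                       # hence the log-variance trick used later)
```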
"So does that make z a 10-dimensional vector that's sampled from that distribution?" Yes: z here is going to be half that size, so the encoder gives me twice the dimension of z, one set of values for the means and one set for the variances, and then we sample from a Gaussian that has these values. So the network gives me not just the means, as for the classical autoencoder, but also the range I can pick things from. Before, when we were using the classical autoencoder, we only had the means, and you simply decoded the means; in this case you not only have the means, but you also have some variance, some variation around those means. So a normal autoencoder is deterministic: the output is a deterministic function of the input. With a variational autoencoder, the output is no longer a deterministic function of the input; it is a distribution given the input, a conditional distribution given the input.

We saw a similar diagram last time, where we were going from a specific point on the left-hand side to the right-hand side. In this case we start from a point, go through the encoder, and get some position here; but then there is an addition of noise. If you only had the mean, you would get just one z; but given that there is some additional noise, due to the fact that we don't have zero variance, that final z is not going to be just one point: it's going to be a fuzzy point. So instead of one point, one x is going to be mapped into a region of points; it's actually going to take up some space. How do we train the system? We train it by sending this latent variable z through the decoder in order to get x̂, and of course it's not going to land exactly on the original point, because perhaps we haven't trained yet; we have to reconstruct the original input, and to do that we try to minimise the squared distance between the reconstruction and the original input.

Then we had the problem from before: to go from the latent space back to the input space, we need to know, or to enforce, some distribution on the latent space. Last time we saw something similar with the standard autoencoder, where we went from one point x to one point z and then back to x. Now, instead, we are going to enforce a distribution over these points in the latent space. Before, we went point to point to point, and you don't know what happens if you move around in the latent space: if you have ten samples on the left-hand side, you automatically get ten latent variables on the other side, but you don't know how to travel between them, because you don't know how this space behaves. Variational autoencoders enforce some structure, and they do this by adding a penalty for being different, for being far, from a normal distribution: if you have a latent distribution which doesn't really resemble a Gaussian, then this term will be very high, and when we train a variational autoencoder we train it by minimising both this term over here and that term over there.
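Written out, the per-sample loss being minimised is roughly the following (a sketch in the lecture's E(z), V(z) notation; the weight β equals 1 in the original formulation and only shows up later in the session with the β-VAE):

$$
\ell(x, \hat{x}) \;=\; \ell_{\text{reconstruction}}(x, \hat{x}) \;+\; \beta \, D_{\mathrm{KL}}\!\Big(\mathcal{N}\big(\mathrm{E}(z), \mathrm{V}(z)\big) \,\Big\|\, \mathcal{N}(0, I_d)\Big)
$$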
The term on the left-hand side makes sure that we can get back to the original position; the term on the right-hand side enforces some structure in the latent space, because otherwise we wouldn't be able to sample from it when we'd like to use this decoder as a generative model. Okay, this is maybe not too clear yet, but let me give you a little bit more to think about.

How do we actually create this latent variable z? My z is simply going to be my mean E(z) plus some noise ε, which is a sample from a standard multivariate Gaussian, with zero mean and the identity matrix as the covariance matrix, where each component is multiplied by the standard deviation. You should be familiar with the equation at the top right: this is how you rescale a random variable ε, which again is standard normal; you have to use this kind of reparameterisation in order to get a Gaussian with a specific mean and a specific variance. So the noise in the latent variable z is just an encoded version of the noise we introduced in the input of the denoising autoencoder; here there is no noise in the input. You put the input into the encoder, the encoder gives you the two parameters E and V, and when you sample from this distribution you basically get z; the sampling part can be written exactly as that equation.

The problem with sampling is that we don't know how to perform back-propagation through a sampling module; actually, there is no way to back-propagate through sampling, because it just generates a new z. So how do we get gradients through this module in order to train the encoder? This can be done with the trick called the reparameterisation trick. The reparameterisation trick allows you to express the sampling in terms of additions and multiplications, which we can differentiate through. The ε is simply an additional input coming from wherever; we have no need to send gradients through that input. The gradients go through the multiplication and through the addition. So whenever you have gradients for training this system, the gradient comes down, and we can replace the sampling module with an addition: E plus ε multiplied by the square root of the variance. Now you have an addition, you know how to back-propagate through an addition, and therefore you get gradients for the encoder: given the output gradient, you can compute the partial derivatives of the final cost with respect to the parameters in this module.
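A minimal sketch of that reparameterisation in code, assuming mu and var are the tensors E(z) and V(z) produced by the encoder (the notebook shown later works with the log-variance instead):

```python
import torch

def reparameterise(mu, var):
    # z = E(z) + sqrt(V(z)) * eps, with eps ~ N(0, I)
    eps = torch.randn_like(var)   # the extra input; no gradients are needed through it
    return mu + var.sqrt() * eps  # gradients reach the encoder through the add and the multiply
```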
Now, as an intuition, this KL term allows me to enforce a structure in the latent space; that's how I'd like you to think about this character. So let's actually figure out how this stuff works. We have two terms in my per-sample loss: the first one is the reconstruction loss, and the second term is going to be this KL, this relative-entropy term. We have some z's here, which are spheres, bubbles. Why are they bubbles? Because we add some additional noise: we have the means, and the means are basically the centres of these regions, so you have one mean here, one mean over here, one over here, one over there. And then what the reconstruction term is going to do is the following. If these bubbles overlap, what happens? If you have one bubble here and another bubble there that is overlapping, so there is a region where they intersect, how can you reconstruct these two points later on? You can't. Are you following so far? If you have a bubble here and another bubble there, all points in this bubble should be reconstructed to the original input here: you start from an original point, you go to the latent space over here, and you add some noise, so you actually have a volume; then you take another point, and this other point gets reconstructed over there. Now, if these two guys overlap, how can you reconstruct the points in the middle? If the points are in this bubble I'd like to go back to this original point; if the points are in that bubble I'd like to go to the other point; but if the bubbles overlap, you can't really figure out where to go back. So the reconstruction term will try to push all those bubbles as far apart as possible, such that they don't overlap, because if they overlap then the reconstruction is not going to be good.

So now we have to fix this, and there are a few ways to fix it. Can you tell me, right now, how we can fix this overlapping issue? Why didn't we have this overlapping issue with the normal autoencoder? "Because there is no variance." Aha, and what does that mean? Can you translate what not having a variance means? "The spheres are not spheres; they're points." Correct. If you have just points, points will never overlap. Well, they would have to be the exact same point, and you get the exact same point only if the encoder is dead, or if you have the same input; it's unlikely that two points coincide. If, instead of points, you actually have volumes, then volumes can overlap, because they contain infinitely many points. So one option is going to be to kill the variance: then you have points, but this defeats the whole variational thing. Without this taking-up-space business, by killing the variance you no longer know what's happening between the points: if the codes take up volume, you can walk around in the latent space and always figure out where to go back; if they are points, as soon as you leave that exact position, you have no idea whatsoever where to go. Anyhow, first option: kill the variance. The other option, the one I show you here, is to push these bubbles as far apart as possible. If these means go very, very far, what's going to happen in your Python script? They will increase a lot, and the problem is that you're going to get infinities: this stuff is going to explode, because all these values try to go as far as possible such that they don't overlap, and that's not good. All right, so let's figure out how the variational autoencoder actually fixes this problem.

"Could you just clarify what you mean by pushing the points apart? Are you putting them in a higher-dimensional space?" No, no. If you don't have the variance, all those circles here, all those bubbles, are just points. Given that we do have some variance, they take up some space. Now, if the space taken by one bubble overlaps with another bubble, the reconstruction error will increase, because you have no idea how to go back to the original point that generated that sphere.
So the encoder has two options in order to reduce this reconstruction error: one option is to kill the variance, such that you get points; the other option is to send all those points away in whatever direction, such that they don't overlap. "Okay, yeah, that makes sense." Okay, good. So the reconstruction error gets this stuff to fly apart; but now let's introduce the second term. I would really recommend that you compute this relative entropy between a Gaussian and a normal distribution yourselves, so you can practice, maybe for next week. If you compute that relative entropy, you get this expression, basically four terms, and everyone should understand how it works... no, okay, I'm just joking, I'm going to actually explain it.

So we have this expression; let's try to analyse in a little more detail what these terms represent. The first term is the variance, minus the log of the variance, minus one. If we graph it, it looks like this: you have a linear function, and then you subtract a logarithm; minus the logarithm goes to plus infinity at zero and otherwise just decays; so if you sum the two and subtract one, you get this kind of cute function, and if you minimise this function you get exactly one. This shows you how this term forces those spheres to have a radius of one in each direction: if the radius tries to be smaller than one, this term goes up like crazy, and if it increases beyond one, it doesn't go up as quickly; so the radii stay roughly around one, and they won't get much smaller, because this term increases a lot there. In this way we have forced the network not to collapse these bubbles and also not to grow them too much, because otherwise they still get penalised. Then we have another term, E(z) squared, which is a classic parabola with a minimum at zero, and this term basically says that all the means should be pulled towards zero. So you get this additional force, shown in purple, and now all those bubbles get squashed together inside this bigger bubble. Here you get the bubble-of-bubbles representation of a variational autoencoder. How cute is this? Very cute. And how many bubbles can you pack? The only parameter telling you the capacity of your variational autoencoder is simply the dimension d, because given a dimension you always know how many unit bubbles you can pack into a larger bubble; it's just a function of the dimension you pick and choose for your hidden layer.

"Is the reconstruction loss, the first term, the yellow term, the one that actually pushes the bubbles further apart, and the rest of it is what keeps them from doing that?" Right. The reconstruction pushes things around because of this taking-up-volume business: if the codes weren't taking up volume, the reconstruction term wouldn't be pushing anything away, because points don't overlap; given that we actually have some variance, these points take up some volume, and therefore the reconstruction term tries to push them apart.
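For reference, the closed form being described, the relative entropy between the diagonal Gaussian produced by the encoder and a standard normal, works out to the following (worth deriving with pen and paper, as suggested above):

$$
D_{\mathrm{KL}}\!\Big(\mathcal{N}\big(\mathrm{E}(z), \mathrm{V}(z)\big) \,\Big\|\, \mathcal{N}(0, I_d)\Big)
\;=\; \frac{1}{2} \sum_{i=1}^{d} \Big( \mathrm{V}(z)_i - \log \mathrm{V}(z)_i - 1 + \mathrm{E}(z)_i^2 \Big)
$$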
So if you check again those few animations I showed you: at the beginning you had the points with the additional noise; then you have the reconstruction term, which pushes everything apart; then you have the variance term, which ensures that those little bubbles don't collapse; and then you have the final term, the spring term, since it's the quadratic term in the loss, which basically adds this additional pressure such that all the little guys get pulled back towards zero. But they don't overlap, because there is the reconstruction term. So: no overlap, due to the reconstruction; size not smaller than one, because of the first part of the relative entropy; and then all these guys are packed together by the quadratic part, which is the spring force. "Is the β a term that needs to be tuned, a hyperparameter kind of thing?" In the original version of this variational autoencoder there was no β; then there is a paper, the β-VAE, which just says you can use a hyperparameter to change how much these two terms contribute to the final loss. "That second loss term with the β, is that the KL divergence with the normal distribution?" Yes: between the distribution of z, which comes from a Gaussian of mean E and variance V, and the normal distribution; so this term tries to get z to be as close as possible to a normal distribution in the d-dimensional space. "And is that formula generic?" I would recommend you take paper and pen and try to write out the relative entropy between a Gaussian and a normal distribution; you should get exactly these terms. Yes, this KL is the relative entropy; just look up the formula for the relative entropy, which basically tells you how far apart two distributions are, where the first distribution is a multivariate Gaussian and the second one is a normal distribution. "A normal distribution is not the same thing?" The Gaussian has a mean vector and a covariance matrix; the normal has zero mean and the identity matrix as the covariance matrix. "We said earlier, though, that z should not have covariance; it should be diagonal, right?" Yes, it's going to be diagonal, and the values on the diagonal are those V's. So it's an off-centre big Gaussian versus a centred normal: it's off-centre, and each direction is scaled by the standard deviation of that dimension; if you have a large standard deviation in one dimension, it means that the distribution is very spread out in that direction. Does that make sense? "But is it aligned with the d axes?" Right, because again all the components are independent. "Is the reconstruction loss the pixel-wise distance between the final output and the original image?" We saw the reconstruction losses last week, and we have two options: for binary data we have the binary cross-entropy, and for real-valued data we instead use the half mean-squared error. These are the reconstruction losses we can use.

You talk more with me than with Yann. Well, not good, you should talk with Yann as well. But now we should be going over the notebook, such that we can see how to code this stuff up and also play with the distributions.
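A small sketch of those two reconstruction-loss options (x_hat and x below are placeholder tensors of reconstructions and targets in [0, 1]):

```python
import torch
import torch.nn.functional as F

x_hat = torch.rand(4, 784)  # placeholder reconstructions in (0, 1)
x = torch.rand(4, 784)      # placeholder targets in [0, 1]

bce = F.binary_cross_entropy(x_hat, x, reduction='sum')  # option for binary data
mse = 0.5 * F.mse_loss(x_hat, x, reduction='sum')        # option for real-valued data
                                                         # (half the sum of squared errors)
```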
Because, again, the main point was that before we were mapping points to points, and back to points. Right now, instead, you're going to map points to a region, and then that region back to points; and all the space is now going to be covered by these bubbles, because of the several factors we discussed. If you have empty space between these bubbles, then you have no idea how to go from that region back to the input space; the variational autoencoder instead gets you this very well-behaved coverage, this nice cover of the latent space. Okay, good. I can't see you; I miss you guys. So, any other questions so far? I hope you can see my screen. Someone just gave feedback that you can see it, yes. Okay, so: pDL, conda activate pDL, and jupyter notebook. Boom.

Okay, so I'm going to cover now the VAE notebook, and I'm going to just execute everything first, such that this stuff starts training, and then I'm going to explain things. All right, so at the beginning I'm just importing everything, as usual; then I have a display routine, we don't care, don't add it to the notes; I have some default values for the random seed, such that you're going to get the same numbers I get; then here I just use the MNIST dataset, we love MNIST, from Yann; for the device I set CPU or GPU, and in theory I could have used the GPU, because my Mac here actually has a GPU; and then I have my variational autoencoder.

So my variational autoencoder has two parts. It has an encoder here; let me turn on the line numbers. My encoder goes from 784, which is the size of the input, to d squared; for example, d in this case is 20, so that's 400; and then from d squared I go to 2 times d, half of which is going to be for my means and half for my sigma squareds, my variances. The decoder, instead, takes only d; you can see only d here; we go from d to d squared, and then from d squared to 784, such that we match the input dimensionality. And then finally I have a sigmoid, because my input is going to be limited from zero to one; they are images with values from zero to one. Then there is a module here which is called reparameterise, and if we are training we use the reparameterisation part. "Could you just say again why you use the sigmoid?" Yeah, because my data lives between zero and one; I have those digits from MNIST, and their values are going to be from 0 to 1. This last module here outputs things that go from minus infinity to plus infinity; if I send that through a sigmoid, it maps things into the 0-to-1 range. "When you say the values of the digits, you mean the activations, right?" So I use the MNIST dataset, and it is going to be both my input and also my target, and the values of these images will be ranging between 0 and 1; each pixel is a value between 0 and 1. "I think the inputs are actually binary; the inputs are all 0 or 1." Right, but my network will be outputting a real value between 0 and 1.
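Put together, the model just described looks roughly like this (a sketch reconstructed from the description above, not a verbatim copy of the notebook):

```python
import torch
from torch import nn

d = 20  # dimensionality of the latent space used in the session

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: 784 -> d^2 -> 2*d (first d outputs: means, last d: log-variances)
        self.encoder = nn.Sequential(
            nn.Linear(784, d ** 2),
            nn.ReLU(),
            nn.Linear(d ** 2, 2 * d),
        )
        # decoder: d -> d^2 -> 784, with a sigmoid so outputs live in (0, 1)
        self.decoder = nn.Sequential(
            nn.Linear(d, d ** 2),
            nn.ReLU(),
            nn.Linear(d ** 2, 784),
            nn.Sigmoid(),
        )

    def reparameterise(self, mu, logvar):
        if self.training:
            std = logvar.mul(0.5).exp_()              # std = exp(0.5 * log-variance)
            eps = std.data.new(std.size()).normal_()  # new tensor filled with N(0, 1) samples
            return eps.mul(std).add_(mu)              # z = mu + std * eps
        else:
            return mu                                 # deterministic at test time

    def forward(self, x):
        mu_logvar = self.encoder(x.view(-1, 784)).view(-1, 2, d)
        mu = mu_logvar[:, 0, :]
        logvar = mu_logvar[:, 1, :]
        z = self.reparameterise(mu, logvar)
        return self.decoder(z), mu, logvar
```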
Now, the reparameterisation: what do we do here? Given a mu and a log-variance (I'll explain later why we use the log-variance), if we are training, we compute the standard deviation, which is the log-variance multiplied by one half and then exponentiated, so I get the standard deviation from the log-variance. Then I get my epsilon, which is simply sampled from a normal distribution with whatever size I have here: from the standard deviation I get the size, I create a new tensor, and I fill it with normally distributed data. Then I return epsilon times the standard deviation, and I add the mu, which is what I showed you before. If I am not training, I don't have to add noise, so I can simply return my mu; in that case I use this network in a deterministic way.

The forward pass is the following. The encoder gets the input, which is reshaped such that I basically unroll the images into a vector; then the encoder outputs something, and I reshape that output such that I have batch size, then 2, then d, where d is the dimension of the means and the dimension of the variances. Then I have mu, which is simply the first part of this, and the log-variance, which is going to be the other part; and then I have my z, my latent variable, which is given by this reparameterisation of my mu and my logvar. Why do I use the log-variance? You tell me. Right, right: given that variances can only be positive, working with the log allows the encoder to output over the full real range.

Then I define my model as this VAE and I send it to the device; here I define the optimiser; and then I define my loss function, which is the sum of two parts: the binary cross-entropy between the reconstruction x̂ and the input x, which I sum over everything, and then the KL divergence, where we have the variance, which is the linear part, then minus the log of the variance, which is the logarithm flipped down, then minus one, and then we have the mu squared; and we try to minimise this stuff.

All right, the training script is very simple. You have the model, which outputs the prediction x̂; let's see here, right, forward outputs the output of the decoder, the mu and the logvar. So you take the model, feed the input, and get x̂, mu and logvar; you can compute the loss using x̂, x, mu and logvar, with x being the input but also the target; then we accumulate the loss item, clean up the gradients from the previous step, compute the partial derivatives, and then step. And then here I just do the testing and do some caching for later on. So we started with an initial loss of roughly 514, this is before training, and it goes immediately down to about 200, and then down to about 100.
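A sketch of the loss function and training loop just described, built on the VAE class sketched above (the choice of Adam and the learning rate here are assumptions; the KL term is the closed form derived earlier, written in terms of the log-variance):

```python
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=256, shuffle=True)

def loss_function(x_hat, x, mu, logvar):
    # reconstruction term: pixel-wise binary cross-entropy, summed
    BCE = F.binary_cross_entropy(x_hat, x.view(-1, 784), reduction='sum')
    # KL term: 1/2 * sum( V - log V - 1 + E^2 ), with V = exp(logvar)
    KLD = 0.5 * torch.sum(logvar.exp() - logvar - 1 + mu.pow(2))
    return BCE + KLD

model = VAE().to(device)                                   # VAE as sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimiser choice assumed

for epoch in range(10):
    model.train()
    train_loss = 0
    for x, _ in train_loader:
        x = x.to(device)
        x_hat, mu, logvar = model(x)
        loss = loss_function(x_hat, x, mu, logvar)
        train_loss += loss.item()   # accumulate the loss item
        optimizer.zero_grad()       # clean up gradients from the previous step
        loss.backward()             # compute the partial derivatives
        optimizer.step()
    print(f'epoch {epoch}: {train_loss / len(train_loader.dataset):.2f}')
```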
And so now I'm going to show you a few of the results. This is the input I feed to the network, and the untrained network's reconstructions, of course, look like, well, you know, but okay, that's fine, we can keep going. Then there's the first epoch, cool; second epoch, third, fourth and so on; and it looks better and better, of course. So what can we do right now? A few things. For example, we can simply sample z from a normal distribution and then decode this random stuff; so this z doesn't come from our encoder, and I show you now what the decoder does whenever you sample from the distribution that the latent variable should have been following. These are a few examples of how samples from the latent distribution get decoded into something: we got a nine here, we got a zero, we got some fives, so some of the regions are very well defined, a nine, a two; but other regions, like this thing here, or this thing here, or number 14 here, don't really look like digits. Why is that? What's the problem here? We haven't really covered the whole space; I just trained for one minute, and if I trained for ten minutes it would be working much better. So here those bubbles don't yet fill the whole space, and it's the same problem you would have with a normal autoencoder, without this variational thing: with plain autoencoders you don't have any kind of structure, any kind of defined behaviour, in the regions between different points; with the variational autoencoder we actually take up the space and enforce that the reconstructions of all these regions actually make sense.

Okay, so let's do some cute stuff, and then I am done. Here I just show you a few digits, and let's pick two of them; for example, let's pick a three and an eight, let me show you here. So we'd like to find an interpolation now between a five and a four: this is my five reconstructed and our four reconstructed. If I perform a linear interpolation in the latent space and then send it to the decoder, we get this one: the five gets morphed into a four, you can see, slowly, but it looks like crap. Let's try to get something that stays on the manifold: let's get, for example, this three, it's going to be number one, and then, say, this number 14 here, and I interpolate between these two. You can see my autoencoder actually fixed those kinds of issues, and you can see now how the three closes those little edges to look like an eight; all of them look kind of legit: kind of a three, kind of a three, a three that became an eight. So you can see how, by walking in the latent space, we get to reconstruct things that look legit in the input space. This would never have worked with a normal autoencoder.
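A sketch of those two experiments, decoding samples drawn from the prior and interpolating between the codes of two test images (model, d and device refer to the sketches above; x1 and x2 are placeholder test digits, so none of this is the notebook's exact cell):

```python
model.eval()
with torch.no_grad():
    # 1) generation: sample latent codes from N(0, I) and decode them
    z = torch.randn(16, d).to(device)
    generated = model.decoder(z).view(-1, 28, 28)

    # 2) interpolation: encode two test images (in eval mode z = mu),
    #    walk linearly between their codes, and decode each step
    _, mu1, _ = model(x1.view(-1, 784).to(device))
    _, mu2, _ = model(x2.view(-1, 784).to(device))
    frames = [model.decoder((1 - t) * mu1 + t * mu2).view(28, 28)
              for t in torch.linspace(0, 1, 10).tolist()]
```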
Finally, I'm going to show you a few nice representations of the embeddings of the means for this trained autoencoder. Here I just show you a collection of the embeddings of the test dataset; then I perform a dimensionality reduction, and I show you how the encoder clusters all the means in different regions of the latent space. So here is what you get when you train this variational autoencoder: this is the beginning, when the network is not trained, and you can still see clusters of digits; but then, as you keep training, at least after five epochs you get these groups to be separated, and I think if you keep training more you should get even more separation. So here I'm basically doing the testing part: I get all the means; my model outputs x̂, mu and logvar, so I append all my mus into this means list, I append all the logvars into this logvars list, and I append all the y's into this labels list during the testing pass. So I have a dictionary of codes here, with the mu, the logvar and the labels, and then later, here below, I compute a dimensionality reduction for epoch zero, epoch five and epoch ten. I use t-SNE, which is a technique for reducing the dimensions of the codes, which are twenty right now; the latent dimension is 20. So I fit it: my X is going to be, let's say, the first thousand samples of the means, and I get these Y's, which are basically a 2-D projection, somehow, of these twenty-dimensional mus. And then I show you in this chart how these 2-D projections look at epoch zero, before the first training epoch, and then at epoch five, where you can see how the network takes all this mess and puts it more nicely over here. I didn't visualise the variances; I'm thinking about whether I'd be able to do that as well, not sure. So each of these points represents the location of a mean after training the variational autoencoder; I haven't represented the area that these means actually take up.

"Aren't the means supposed to be random at epoch zero?" The randomness is in the encoder, right, but you still feed the encoder those input digits, and the input digits, all the ones for example, are kind of similar to each other; so if you perform a random transformation of those similar-looking initial vectors, you're going to get similar-looking transformed versions, though they are not necessarily grouped all together; most of them are. For example, let me turn on the colour bar so we can see what this stuff is; let's say these over here are the zeros: all zeros look similar, so even a random projection of those zeros will all end up roughly together. What you see instead is that this purple is all spread around, which means the fours; yes, there are very many ways of drawing a four, someone closes the top and someone doesn't. If you look on the right-hand side, almost all the fours are over here; there is just a little cluster next to the nines, because you can imagine that if you write a four a certain way, it's very similar to writing a nine; so those fours are very close to the nines just because of how people drew those specific fours. Nevertheless they are still clustered over here, while these others are spread around, and that's quite bad. Still, this diagram shows you that there is very little variance across the drawings of a zero: there is a specific mode, it's very concentrated here; but it's really not concentrated for these other digits.
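A sketch of that dimensionality-reduction step, assuming means is an (N, 20) array of the collected E(z) and labels holds the corresponding digit labels (the names and placeholder data are illustrative, not the notebook's):

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

means = torch.randn(1000, 20).numpy()           # placeholder for the collected means
labels = torch.randint(0, 10, (1000,)).numpy()  # placeholder for the digit labels

# project the first thousand twenty-dimensional means down to 2-D
E = TSNE(n_components=2).fit_transform(means[:1000])

# colour each projected mean by its digit label
plt.scatter(E[:, 0], E[:, 1], c=labels[:1000], cmap='tab10')
plt.colorbar()
plt.show()
```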
"I'm just curious, what are some other motivations or usages of variational autoencoders?" So the main point is what I showed you in class two weeks ago: a generative model. You cannot have a generative model with a classic autoencoder. In this case here, again, I didn't train this stuff a lot; if you train it longer you can get better performance. And the point is that my input comes from just this random distribution: by sending a random number, a number coming from a normal distribution, inside this decoder, and if this decoder is actually a powerful decoder, then this thing will actually draw very nice shapes or numbers. For example, those two images I showed you of the two faces in the first part of the class last time: those are simply, you take a number from a random distribution, you feed it to a decoder, and the decoder is going to draw you this very beautiful picture of whatever you trained the decoder on. And you cannot use a standard autoencoder to get these kinds of properties, because here we enforce the decoder to produce meaningful, good-looking reconstructions when the codes are sampled from this normal distribution; therefore, later on, we can sample from this normal distribution, feed things to the decoder, and the decoder will generate stuff that looks legit. If you didn't train the decoder to perform a good reconstruction when you sample from this normal distribution, you wouldn't be able to actually get anything meaningful. Okay, that's the big takeaway here. Next time we're going to see generative adversarial networks and how they are very similar to the things we have seen today.

"Hi Alfredo, I have a question about the yellow bubbles. Does each yellow bubble come from one input example?" Yeah. "So if we had 1000 images, that means we have exactly 1000 yellow bubbles?" Yeah. "And each yellow bubble comes from the E(z), V(z) distribution, together with the noise added to the latent variable?" So the bubble comes from here, let me show you. So here you get this x, and this x goes inside the model; whenever you send this x through the model, it goes inside forward, so x goes inside here and then it goes inside the encoder. That gives me this mu_logvar, from which I just extract the mu and the logvar. So far everything is like a normal autoencoder. The bubble comes here: my z now comes out from this reparameterise, and reparameterise is going to work in a different way depending on whether we are in the training loop or not. If we are not in the training loop, I just return the mean, so there is no bubble when I use the testing part; I get the best value the encoder can give me. If I am training, instead, this is what happens: I compute the standard deviation from this logvar, so I take the logvar, divide it by two, and take the exponential; I have e to the one-half logvar, such that you get the standard deviation; and then the epsilon is simply a d-dimensional vector sampled from a normal distribution, so this is one sample coming from this normal distribution, which is like a sphere in d dimensions, a sphere with a radius of roughly the square root of d; and then here at the end you simply rescale that thing. The point is that every time you call this reparameterise function you get a different epsilon, because epsilon is sampled from a normal distribution; so given a mu and given a logvar, you're going to get a different epsilon every time, and therefore, if you call this a hundred times, it's going to give you 100 different points, all of them clustered around mu with a radius of roughly the standard deviation. So this is the line which returns you just one sample every time, but if you call it in a for loop you're going to get a cloud of points, all of them centred at mu with a specific radius. And this is where the bubbles come from: the sampling of this thing. "So I have to run it 100 times if I want 100 samples?" You run it 100 times; each call of reparameterise gives you every time a different point, which is parameterised by this location and this kind of volume. "And the mu and log-variance come from one sample, one input example?" Yeah: my one input x here gives me one mu and one logvar, and that one mu and that one logvar give me z, which is one sample from the whole distribution.
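As a tiny sketch of that "bubble": for one input's mu and logvar (placeholder tensors of shape (1, d) below), repeated reparameterised draws give a cloud of z's centred at mu:

```python
import torch

d = 20
mu = torch.zeros(1, d)      # placeholder mean for a single input
logvar = torch.zeros(1, d)  # placeholder log-variance for a single input

std = (0.5 * logvar).exp()
cloud = mu + std * torch.randn(100, d)  # 100 different z's for the same input: the bubble
```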
If you run this function a thousand times, you're going to get a thousand z's, all of which will take up this volume. "Okay, got it, got it, thank you."

"About autoencoders, encoders and decoders in general: it looks like in this implementation it's fairly straightforward, just a couple of linear layers with a ReLU and a sigmoid. Previously, the encoders I've seen use attention and all this stuff. Is something as basic as this pretty much satisfactory; are they usually this basic, or more complex?" Okay, that's a softball for me. Everything we see in class is something I have tried, it works, and it's fairly representative of what is sufficient to get this stuff to run; I'm running on my laptop, on the MNIST dataset, and you can run several of these kinds of tests and play around. So today we have seen how you can code up an autoencoder, and all you need is like three or four lines of code. What are the differences compared with the plain autoencoder code? The difference is that you have this reparameterise method here, and then just these three lines over here; so you have roughly six lines, plus the relative entropy. The architecture is a completely different, completely orthogonal matter: one thing is the architecture, which depends on the kind of input, so you can use a convolutional net, you can use a recurrent net, you can use anything you want; the other thing is the fact that you convert a deterministic network into a network that allows you to sample, and then generate samples from a distribution. We never had to talk about distributions before; we didn't know how to generate distributions. Now, with a generative model, you can actually generate data which is basically a bending, a rotation, a transformation of the original Gaussian: we have this multivariate Gaussian, and the decoder takes this ball and shapes it to make it look like the input; the input may be something curved, so you have this big bubble of bubbles, and the decoder gets it back to whatever the input looks like. So what you need depends on the specific data you are using: for MNIST this is sufficient; if you use a convolutional version, it may work much better. The point is that this class was about the variational autoencoder, not about how to do crazy stuff; all the crazy stuff is simply adding several of the things I have been teaching you so far, but the bit about the variational autoencoder, I think, was mostly covered here. "Okay, okay, thanks." More questions? No? Okay, that was it. Thank you so much for joining us. Okay, everyone has almost left, 70%. See you next week. All right, all right, okay.
Info
Channel: Alfredo Canziani
Views: 17,917
Keywords: Deep Learning, Yann LeCun, autoencoder, over-complete, generative, variational autoencoder, posterior, prior, KL divergence, relative entropy, PyTorch
Id: 7Rb4s9wNOmc
Length: 58min 5sec (3485 seconds)
Published: Wed May 20 2020