Ali Ghodsi, Lec [7], Deep Learning, Restricted Boltzmann Machines (RBMs)

Captions
Okay, let's start. This is going to be our last lecture; the remainder of the course will be paper presentations. Today we are going to learn about the restricted Boltzmann machine (RBM). The restricted Boltzmann machine is basically the first model; it is a building block. It is not, by itself, a neural network or a deep network; it is a building block for making a deep network. The first paper in 2006, the paper that was the beginning of deep learning, had the Boltzmann machine as its building block. Maybe the reason we cover it last among all the other topics is that, in spite of the fact that it was the first model, it is rarely used at the moment; the idea started here, but it is not that common anymore.

The restricted Boltzmann machine is an unsupervised model with two layers; it is an undirected graphical model with a visible layer and a hidden layer. The idea is that, given the visible layer, given the input, I would like to have a representation of the data in a different space. So if you give me data in a five-dimensional space, I would like a representation of the same data in, say, a seven-dimensional space. You are familiar with the concept of dimensionality reduction: you have seen PCA as a linear dimensionality reduction, you have seen nonlinear variations as well, and in all of them you map the data into a space of a different dimension. One interpretation of what we are doing in dimensionality reduction is that we have a hidden variable and an observation, and given the observation we infer that hidden variable, which can live in a space of a different dimension. You can look at the restricted Boltzmann machine the same way: it is a method to represent the data in a different dimensional space, to have a different representation of the data. It is totally unsupervised.

It is called "restricted" because of the original Boltzmann machine. The restricted Boltzmann machine is like a bipartite graph: you have hidden units and you have observed units, and there are undirected links between hidden and observed units, but there is no link between units within the same layer, so one visible unit has no connection to another visible unit. In the original Boltzmann machine there are also connections between the units of each layer; the restricted Boltzmann machine is a model which restricts that type of connection, so there are connections only between hidden and observed units, not between units of the same layer.

The way that, given the observation v, we infer the hidden variable is by defining a probability distribution over the two layers and inferring the most probable configuration given the model. The model is an energy-based model, and there are many energy-based models, as I'm sure you have seen. The idea is that we can define a joint distribution over the observation and the hidden variable of the form p(v, h) = exp(-E(v, h)) / Z, where E is an energy function. I am going to define an energy function, and my goal is that the energy of a configuration of observation and hidden units is as low as possible. This type of model was originally motivated by physics, and that is why we have the concept of energy: physically there really was such a quantity, and to define a probability over it, you take the exponential of the negative of the energy.
To interpret that as a probability, you need to normalize it, dividing by a normalization factor to make sure it sums to 1. So this is our joint distribution over the observation and the hidden variable, defined through an energy function, and the energy function has the form

E(v, h) = -b^T v - c^T h - v^T W h,

where v is our observation, a vector, and h is our hidden variable, a vector possibly of a different dimension; b, c, and W are the parameters of the model: b is a bias vector, c is a bias vector, and W is a matrix. Z is the partition function, or normalization constant, which makes sure the distribution sums to one: it is the sum of exp(-E(v, h)) over all possible values of v and h.

Let's see how these parameters work, what they mean intuitively. As I said, W is basically a matrix of all the weights, of size (dimension of v) times (dimension of h); b is a bias vector of the size of v, and c is a bias vector of the size of h. If I rewrite the energy function in scalar form, the dot products become explicit sums: E(v, h) = -sum_k b_k v_k - sum_j c_j h_j - sum_{j,k} w_{jk} v_k h_j. Restricted Boltzmann machines are usually trained on binary data; there are variations of the RBM for real-valued observations, but ninety percent of the time it is binary data, and the model we are discussing now is for binary data. So the entries of v and h can be 0 or 1.

Now suppose b_k is negative. What is going to happen? I want to maximize the probability, so I want to minimize the energy, which means I prefer the negated terms above to be as large as possible. b_k is multiplied by v_k. If b_k is negative and I want to minimize the energy, what should v_k be? Zero: given the two choices, v_k = 0 avoids the negative contribution. If b_k is positive, then v_k should be 1, because a larger value is preferred. So we can think of b as a way to express a preference over the values of v: a negative b_k means you prefer v_k to be 0, and a positive b_k means you prefer v_k to be 1. The same holds for c: a negative c_j is a way to express a preference for h_j to be 0, and a positive c_j a preference for h_j to be 1.

So that is the role of those two parameters, and we have one more parameter, W. Now think of w_{jk}, and suppose w_{jk} is negative. What can you say about h_j and v_k? We have three kinds of terms: b_k v_k, plus c_j h_j, plus this term, w_{jk} h_j v_k. We want to maximize the negated energy, and if w_{jk} is negative I don't want this negative term to contribute to my cost function, I want to cancel it out, so I prefer either h_j or v_k, or both, to be 0. So a negative w_{jk} is a way to express a preference that at least one of h_j and v_k be 0.
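To make the notation concrete, here is a minimal numpy sketch of the energy function and the unnormalized probability; the names b, c, W, v, h are chosen to match the notation above, and the sizes (5 visible, 7 hidden units) are illustrative, echoing the earlier dimensionality-reduction example:

```python
import numpy as np

def energy(v, h, b, c, W):
    """E(v, h) = -b^T v - c^T h - v^T W h for binary vectors v, h."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

def unnormalized_prob(v, h, b, c, W):
    """exp(-E(v, h)); dividing by the partition function Z would give p(v, h)."""
    return np.exp(-energy(v, h, b, c, W))

# Illustrative sizes: 5 visible units, 7 hidden units.
rng = np.random.default_rng(0)
b, c = rng.normal(size=5), rng.normal(size=7)
W = rng.normal(size=(5, 7))
v = rng.integers(0, 2, size=5)   # binary observation
h = rng.integers(0, 2, size=7)   # binary hidden configuration
print(energy(v, h, b, c, W), unnormalized_prob(v, h, b, c, W))
```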
What about if w_{jk} is positive? Then I prefer both of them to be 1. So w_{jk} is a parameter which expresses a connection between this pair of random variables, how I want them to relate. This is my model: it gives me some freedom over the values of h given v, freedom to express a preference over individual elements of h being 0 or 1, and over their connection to v through W.

I told you that having this energy function you can define the joint distribution, with Z as the partition function. Unfortunately, computing the partition function, as usual in graphical models, is intractable: it is a sum over v and a sum over h, and there are exponentially many terms, so in practice it cannot be computed. And when Z is intractable to compute, computing the joint distribution is intractable, because we can't compute the normalizer. So we can't compute the joint distribution over v and h. That's the bad news. The good news is that the conditional probabilities are easy to compute: I can easily compute p(v | h), and I can compute p(h | v). The joint is intractable, but the conditionals are easy.

And if the conditionals are easy to compute and easy to sample from, how can I sample from the joint distribution? I know how to sample from h given v, and I know how to sample from v given h, but I need samples of (v, h) jointly. Keep sampling: the most obvious thing you can do is alternate, sampling conditionally back and forth, and that gives you the joint. That's the idea. They claim the conditionals are easy to compute; I will show you that they are, and then, having that, it is easy to sample jointly from this distribution.

For any model, if you want to do inference, you are going to need the joint distribution. When we have a graphical model, it means we are modeling a joint distribution; by definition, a graphical model is a way to model a joint distribution while adding some restrictions. If you have the joint distribution over some random variables, theoretically you can answer any query. So if I have a random variable for the weather, say rain, and I have the joint distribution, then I can answer any query: what's the chance that tomorrow is rainy given whatever today is? A graphical model is a way to model the joint distribution, but since the joint is most of the time pretty hard to compute, a graphical model imposes some constraints on it, using your domain knowledge to impose independence between some of the variables. That's the main idea of graphical models.

Here, if we would like to do inference, we need the conditional distribution, which is the joint divided by p(v): p(h | v) = p(v, h) / p(v) = (1/Z) exp(-E(v, h)) / p(v), where p(v) is itself the sum over h of p(v, h). So let's write it out.
We have

p(h | v) = (1/Z) exp(b^T v + c^T h + v^T W h) / [ sum_{h'} (1/Z) exp(b^T v + c^T h' + v^T W h') ]

(recall c goes with h and b goes with v). I can take the factor 1/Z outside the sum in the denominator and cancel it with the 1/Z in the numerator, so we get rid of the partition function that was problematic. Next, exp(b^T v + c^T h + v^T W h) = exp(b^T v) exp(c^T h) exp(v^T W h), and the same in the denominator; the first factor, exp(b^T v), is constant with respect to h, so it comes out of the sum and cancels as well. We are left with

p(h | v) = exp(c^T h + v^T W h) / sum_{h'} exp(c^T h' + v^T W h') = (1/Z') exp(c^T h + v^T W h),

where Z' is a new partition function. You may think: okay, we got rid of one partition function but we have another one to compute; soon you are going to see how, in practice, we can compute this one very easily. So those were the steps: the joint over p(v), which simplifies to exp(c^T h + v^T W h) over some Z'.

Now, I have an exponential of a sum, and an exponential of a sum is a product of exponentials. So I can write p(h | v) = (1/Z') prod_j exp((c_j + v^T W_{:,j}) h_j), where h_j is one element of h. What does this mean? It is very important: we have a very important property of h which is going to help us. Having the conditional probability of h factor as a product means that the elements of h are independent of each other given v: I can write the conditional probability of the first element of h given v, times the probability of the second element of h given v, and so on. So they are conditionally independent, which was also clear from the graphical model: given v, the h_j's should be independent.

Any question here? Yes: here I have the exponential of a sum, and e^{a+b} = e^a times e^b, so the exponential of a sum is a product of exponentials. And if I have the probability of a and b and I can write it as p(a) times p(b), they are independent; if they are not, I have to write it as p(a) times p(b | a). So this formula tells me that the probability of the vector h given v is p(h_1 | v) times p(h_2 | v) times p(h_3 | v), and so on, which means that if you observe v, then given that observation the elements of h are independent. You can do exactly the same for p(v | h) and show that given h the elements of v are independent; it is symmetric, there is no difference between v and h in this computation, it will be exactly the same.
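As a sanity check on the independence claim, here is a hedged numpy sketch that enumerates all hidden configurations of a tiny RBM, computes p(h | v) by brute force, and compares it with the product of the per-unit conditionals; the sizes and helper names are mine:

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
nv, nh = 3, 2
b, c = rng.normal(size=nv), rng.normal(size=nh)
W = rng.normal(size=(nv, nh))
v = np.array([1, 0, 1])

# Brute-force p(h | v): unnormalized weights exp(c^T h + v^T W h) over all h.
hs = np.array(list(itertools.product([0, 1], repeat=nh)))
weights = np.array([np.exp(c @ h + v @ W @ h) for h in hs])
p_brute = weights / weights.sum()

# Product of per-unit conditionals p(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]).
q = sigmoid(c + v @ W)
p_factored = np.array([np.prod(np.where(h == 1, q, 1 - q)) for h in hs])

print(np.allclose(p_brute, p_factored))  # True: the h_j are independent given v
```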
Now I can compute the probability of one element of h given v; if I need p(h | v) for the whole vector, knowing that each unit is independent of the others, I compute each unit's probability and multiply them together. The probability that one element h_j equals 1 given v is the joint over h_j divided by the normalizer, and the normalizer sums over both possible values of h_j: the term for h_j = 1 plus the term for h_j = 0. For h_j = 1 we have exp(c_j + v^T W_{:,j}); for h_j = 0, you remember the form, plugging in 0 gives e^0 = 1. So

p(h_j = 1 | v) = exp(c_j + v^T W_{:,j}) / (1 + exp(c_j + v^T W_{:,j})),

where the notation W_{:,j} means all rows, column j, of W. And that is the sigmoid function: the probability of one unit of h given v is just a sigmoid. If you need the probability of the whole vector h given v, because the units are independent, it is just the product over all j. The same goes for p(v | h): by symmetry, it is a sigmoid as well.

Now, having these two conditionals in this form, it is very easy to sample from a restricted Boltzmann machine using Gibbs sampling. Gibbs sampling here has two steps, as you know: given the observation v you sample h, then given that h you sample v, and you just repeat; the v and h that you end up with are samples from the joint distribution, according to Gibbs sampling. And knowing that each unit is just a sigmoid and the units are independent helps, because you can sample each unit independently, so you can sample all the units in parallel. It is actually very easy to sample: I give you an observation v, and suppose the model has already been learned, so I know W, b, and c (we haven't discussed this yet; I haven't told you how to learn the parameters, but suppose they are known). It is very easy to compute the sigmoid: you know c, you know your observation, and you have W. The sigmoid gives you a probability, say 0.2, meaning the probability that h_j equals 1 is 0.2. It is like a Bernoulli: you sample a uniform random number (in MATLAB, just rand), and if that uniform number is less than the probability you computed, then h_j is 1, else h_j is 0. So I can easily sample each unit, and I can do it in parallel for all the units given the observation, and then sample v the same way. So it is very efficient, very easy to sample, and you can sample v and h jointly from the model. Any question so far?
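Here is a minimal sketch of the sampling procedure just described, assuming the parameters b, c, W are already learned: compute the sigmoid probabilities, draw uniform numbers (the rand trick mentioned above, written in numpy), and threshold; the function names are my own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, c, W, rng):
    """p(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]); all units sampled in parallel."""
    p = sigmoid(c + v @ W)
    return (rng.uniform(size=p.shape) < p).astype(int)

def sample_v_given_h(h, b, W, rng):
    """By symmetry, p(v_k = 1 | h) = sigmoid(b_k + W[k, :] h)."""
    p = sigmoid(b + W @ h)
    return (rng.uniform(size=p.shape) < p).astype(int)

def gibbs_chain(v0, b, c, W, steps, rng):
    """Alternate the two conditionals (steps >= 1); (v, h) approaches a joint sample."""
    v = v0
    for _ in range(steps):
        h = sample_h_given_v(v, c, W, rng)
        v = sample_v_given_h(h, b, W, rng)
    return v, h
```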
So if it is clear that we can sample from the model easily once the model is learned, meaning the parameters are known, our main question should now be: how do we learn the parameters of the model? I give you some observations, so you have a bunch of v's in your training data set; h is not observed, because it is a hidden variable. You just have a set of images, say; those are your v's, and you need to compute W, b, and c. (Could you turn the lights off again? Thank you.)

So how do we learn the model? The most obvious thing someone can think of is to maximize the likelihood of the model given the data. I define an objective function which is just the log-likelihood over my observations, and I would like to maximize it. But the probability of an observation is in fact a sum over h: p(v) = sum_h p(v, h), since h is my hidden variable, and this joint distribution can be written in terms of the energy function. What happened here is that I had

sum_{t=1}^{n} log sum_h p(v^(t), h),

and p(v^(t), h) is basically (1/Z) exp(-E(v^(t), h)); the 1/Z is constant, so I can take it out and write log(1/Z) as -log Z, and since t runs from 1 to n I get n of those. So the log-likelihood is

l(theta) = sum_{t=1}^{n} log sum_h exp(-E(v^(t), h)) - n log Z.

Is it clear here? I was just writing out the log-likelihood of this function and then replacing Z by its definition, the double sum of exp(-E) over v and h.

So that is my log-likelihood. Now, if I want to optimize the model, if I want to learn the parameters, I have to maximize it: take the derivative with respect to my parameters and set it to zero, and my parameters are b, c, and W. If I denote all of my parameters by theta, the derivative of the log-likelihood is the derivative of the first term minus the derivative of the second term. For the first term: the derivative of a sum is the sum of the derivatives, so I have a sum over t = 1 to n of the derivative of the log; and the derivative of log u with respect to theta is u' / u, so the inner sum goes into the denominator and I have to differentiate it. Differentiating sum_h exp(-E(v^(t), h)) term by term gives, for each h, exp(-E(v^(t), h)) times the derivative of the exponent with respect to theta. Is that clear, how we have taken the derivative of the first term? (Question: is that derivative of theta minus this? Maybe you need a bracket there. Yes, it is better with a bracket: it is the derivative of the whole negative energy function, -E, not theta minus something.) Do you want me to do it on the board, or is it okay? Okay.
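Written out in my notation, consistent with the definitions above, the derivative of the first term is the standard log-sum-exp gradient, and the inner weights are exactly the conditional distribution p(h | v^(t)):

```latex
\frac{\partial}{\partial\theta} \log \sum_h e^{-E(v^{(t)},h)}
  = \sum_h \frac{e^{-E(v^{(t)},h)}}{\sum_{h'} e^{-E(v^{(t)},h')}}
    \left(-\frac{\partial E(v^{(t)},h)}{\partial\theta}\right)
  = \mathbb{E}_{p(h\mid v^{(t)})}\!\left[-\frac{\partial E(v^{(t)},h)}{\partial\theta}\right]
```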
Do exactly the same thing for the second term: there is an n in front, the derivative of the log brings Z into the denominator, and differentiating the double sum of exp(-E(v, h)) gives, for each (v, h), exp(-E(v, h)) times the derivative of the exponent. Again, it is not theta minus something; it is better with a bracket: the derivative of the whole negative energy function.

So now I have two terms, and if you look at them, the first term is the expectation of the derivative of the energy function with respect to the conditional distribution p(h | v^(t)), and the second term is the expectation of the derivative of the energy function with respect to the joint distribution p(v, h). To see this: if I write (1/n) sum_{i=1}^{n} x_i, that's the expectation of x when x is drawn from a uniform distribution; if I write sum_i w_i x_i, and the w_i form a distribution, let me write it as p(x_i), that's the expectation of x with respect to that distribution; and if the weights don't sum to 1, I can normalize them, and that is an expectation with respect to the normalized distribution. That is exactly what happened here: we had the sum of exp(-E) times the derivative, and we normalized it, as if exp(-E) were a weight. In the first term the observation v^(t) is fixed, I show you an observation and you are only summing over h; but in the second term both v and h are unknown and summed over. Remember, the second term comes from the fact that we wanted to differentiate Z, which is a sum over both v and h.

In short, I can write the gradient for one observation in this form:

d(-log p(v)) / d(theta) = E_{h ~ p(h|v)}[ dE(v, h)/d(theta) ] - E_{(v,h) ~ p(v,h)}[ dE(v, h)/d(theta) ].

Those of you familiar with undirected graphical models will recognize that this is not specific to the restricted Boltzmann machine; this form, expectations of the derivative of the energy, appears everywhere in undirected graphical models; here the second term is a double expectation, a double sum, because it is over both h and v.

The first term is easy to compute: I show you an observation v, and to compute the expectation you sample h given v and sum over the values you sampled; it is linear. But the second term is in most cases intractable: you have to sample both h and v and then sum it up. And don't forget why this becomes intractable: we are taking the derivative for maximum likelihood, which has no closed-form solution, so I have to do gradient descent, updating each parameter by subtracting the gradient. Now suppose that in each iteration you want to compute this derivative: for each iteration you would have to sample v and h and compute this expectation. To sample from (v, h) you have to run Gibbs sampling, which is easy to do, but it takes time for the burn-in period; like all MCMC techniques, the chain needs to mix well, you have to let it run until it reaches the stationary distribution, and only then take your sample. Imagine having to do that inside every iteration: it becomes completely intractable. The first term is okay, but the second term is not.
Okay, let's pretend for a moment that it is okay. If I take the derivative of the energy function with respect to the three parameters, that part is easy: the derivative of the negative energy with respect to W is the outer product h v^T, with respect to b it is v, and with respect to c it is h. Now if I want to write my update rules according to gradient descent, the total gradient for each parameter is the data-dependent first term minus the double expectation over v and h, and that second term, as I said, is the problematic one.

To avoid this problem there is an algorithm called contrastive divergence. It is not an exact method, it is an approximation, and it is a little bit hacky; but it makes sense intuitively, and people have done some theoretical work on it. The main intuition is: okay, you want to compute this double expectation, which is hard to compute; let's replace it with a point estimate, a point estimate in the sense that we evaluate the expectation using only one point. That's basically the main idea here. What does it mean exactly? I give you an observation v. Using the conditional distribution you can sample h, because you know p(h | v); given that h you can sample another v, and given that v another h. That is Gibbs sampling; go a couple of iterations of it and you arrive at a point, call it v-tilde. Use this v-tilde in place of your expectation: plug it into the derivative of the energy function. We were supposed to compute the expectation of the derivative of the energy function over all values of v and h; now you don't sample all values, you use one point, as if you computed the expectation over a single point. That's the approximation.

The intuition is this: look at the two terms in this derivative and think about what each of them is doing. I want the gradient of the negative log-likelihood to be small, which means I want to make the first term small and the second term large. The first term is a term about the data: it is based on my observations. The second term is a term about the model: it has nothing to do with my observations. Making the first term small means: I show you an observation, give me a representation of the data, an h, such that this configuration of observation and hidden units makes sense, in the sense that it has low energy, hence high probability. (Question: do you use that sampled h-tilde as well for the rest of the derivative? Yes, that's h-tilde.) Suppose my observation is this image of a digit three: the first term basically makes sure that you find a hidden variable, a representation of the three, which makes sense, in the sense that this pair,
this configuration of data, h and v, minimizes the energy function. The second term should be large, because it has a negative sign in front of it; if you want to minimize everything, that term should be large. What does that mean? Suppose I start the process by initializing everything with some random values, uniformly. If I sample v given a random h, what am I going to see? Not a digit; given a random h I am going to see some noise. So the first term makes sure that if I show you an image, you give me a configuration which minimizes the energy; the second term basically says: the assumption is that the model has some wrong beliefs that are not in the data; it has some random h and it generates some random v, and the probability of observing that noise should be small, while the probability of observing the data should be large. So in each iteration you change the parameters in a way that pushes down the probability of observing that kind of sample and pushes up the probability of observing the data. If you wanted to do this exactly, it should be an expectation over all possible values; since we cannot compute it, let's take one sample of the kind of data we don't want to see, one sample of that noise, and push its probability down, according to this particular belief, this generation, of the model.

It seems this works pretty well in practice: compute the second term, the double expectation, by a point estimate, meaning you get rid of the expectation, compute the derivative, and just use the sample that you have generated through Gibbs sampling starting from an observation. In this literature that sample is usually called a negative sample. So the idea of the algorithm is: replace the expectation by a point estimate, and obtain that point by Gibbs sampling; and you don't need to run the Gibbs sampling for a long run, just do it for a few iterations. In fact, people have shown that one iteration of Gibbs sampling is sufficient in many cases: given v, generate h, regenerate v, and that's your negative sample; and it does work well. Another important detail: start the chain with your observation, don't start it with some random initial value; start the Gibbs sampling chain at the data.

That's the detail of the algorithm; for time, I am going to skip ahead and show you this example. This is MNIST digits, and if you train a restricted Boltzmann machine and look at the weights it learns, this is a visualization of the weights. How do you interpret this? Think of them as features, the features that are supposed to explain your data. The way the paper explains these features is that they are like strokes of a pen: when you are writing different digits, you write different strokes. They just wanted to show that it learns interesting features. (You can turn the lights on now.)
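Putting the pieces together, here is a minimal sketch of one contrastive divergence (CD-1) update on a single binary observation, with the same hypothetical parameter layout as the earlier sketches; it follows the common practice of using hidden probabilities rather than binary samples inside the gradient terms, which is a choice of mine, not something stated in the lecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, b, c, W, lr, rng):
    """One CD-1 step: positive phase from the data, negative phase from
    a single Gibbs iteration started at the observation (the negative sample)."""
    # Positive phase: hidden probabilities and a sample, given the data.
    ph_data = sigmoid(c + v_data @ W)
    h_sample = (rng.uniform(size=ph_data.shape) < ph_data).astype(float)

    # Negative phase: one Gibbs step gives the negative sample v_tilde.
    pv_model = sigmoid(b + W @ h_sample)
    v_tilde = (rng.uniform(size=pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(c + v_tilde @ W)

    # Point-estimate gradient: data term minus negative-sample term.
    W += lr * (np.outer(v_data, ph_data) - np.outer(v_tilde, ph_model))
    b += lr * (v_data - v_tilde)
    c += lr * (ph_data - ph_model)
```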
So that was the restricted Boltzmann machine, and as I said, the restricted Boltzmann machine is a building block of deep networks. How does it contribute to a deep network? As I said, that was the first paper on deep networks. The problem with neural networks originally was that we were not able to train deep models using backpropagation: with many layers, backpropagation alone would not train them, and that was the reason no one actually trained a deep model until 2006, when Hinton trained the first deep model using this idea of the Boltzmann machine.

The first deep model that was proposed was layers of Boltzmann machines: these two layers are a Boltzmann machine, then the next two layers, and so on; it is completely unsupervised. I give some input to the model and I can compute a representation for it according to the Boltzmann machine, regardless of what the output task is, classification or regression or whatever. Now take that layer as input, treat the next one as hidden, train another Boltzmann machine, and find a representation of that layer; layer-wise you can go up and up, computing weights and a representation for each layer, up to your output. Now think of all of these weights as initial values of the weights: assume you have a full network with these as initial weights, and do some fine-tuning; give the input, take the output, and fine-tune, which could be the usual techniques like backpropagation, for example. That was the original idea. But now that we are able to learn deep models without this kind of machinery, by plain backpropagation, by assuming the activation functions are not sigmoid or hyperbolic tangent anymore but rectified linear units, there is basically no need to use the Boltzmann machine as a building block, and people rarely use it in deep structures; but it was common at the beginning of deep networks. And since each of these blocks is basically an undirected graphical model, a belief network, people used to call this structure a deep belief network: layers of belief networks stacked together.

The first paper was about a deep autoencoder. An autoencoder by itself is a structure in deep networks; originally researchers used the Boltzmann machine to build autoencoders and deep autoencoders, but more recently you can have an autoencoder without the Boltzmann machine. So I am going to tell you a little bit about autoencoders, but keep in mind that an autoencoder can be built with or without Boltzmann machines, and these days people usually don't use Boltzmann machines to build them.

Forget about deep networks; the idea of an autoencoder is that I have a layer for my observation x, which is a vector, and another layer which has fewer units than x: say x is in R^d, then I have h in R^p with p < d, and then I have x again as the output. It is like a bottleneck, with connections between consecutive layers. So suppose it is just a neural network, not deep but shallow, of this form: a feed-forward network where the input and the output are the same. I give the input and I would like to reconstruct that input at the output. And there is another restriction that people usually add: if the first weight matrix is W, you can tie the weights of the second layer to be W^T. The first part is usually called the encoder and the second part is called the decoder; that's why it is called an autoencoder. Given x, I would like to reconstruct x.
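As a sketch of the shallow tied-weight autoencoder just described, assuming a sigmoid encoder and a squared-error reconstruction loss (those specific choices are mine, for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencoder_forward(x, W, b_enc, b_dec):
    """Encoder h = sigmoid(W x + b_enc); tied decoder x_hat = W^T h + b_dec.
    W is p x d with p < d, so h is the bottleneck representation."""
    h = sigmoid(W @ x + b_enc)
    x_hat = W.T @ h + b_dec
    return h, x_hat

def reconstruction_loss(x, x_hat):
    """Squared reconstruction error ||x - x_hat||^2, the quantity to minimize."""
    return np.sum((x - x_hat) ** 2)
```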
Then I treat the middle layer as a representation of the data in a lower-dimensional, p-dimensional space. At the end of the day I have learned a transformation W which, applied to x, generates h: x is d-by-1 and W is p-by-d, so it generates a p-by-1 vector; it maps the data to a lower-dimensional space. We have, for example, PCA, which maps data to a lower-dimensional space, so we can think of the autoencoder as something similar, with the structure of a neural network. In fact, you can prove mathematically that for a simple linear autoencoder, the basis the autoencoder computes spans the same space as the eigenvectors in PCA; so in a sense they are identical, it is doing PCA.

Another way to get some insight about the problem is this: you can compute PCA by maximizing the variance in each direction, or you can compute PCA by minimizing a cost function; let me write it this way: minimize sum_i || x_i - U U^T x_i ||^2 subject to the constraint U^T U = I. If you try to solve this optimization problem, the solution is the eigenvectors of the covariance matrix, provided x is centered. Compare this with the autoencoder structure: that is exactly what is happening. U^T x is a transformation of x; if U is d-by-p then U^T is p-by-d, and it maps everything to p dimensions. U is orthonormal, so U acts as the inverse of U^T, the inverse of the projection, and it projects back: the d-by-p matrix maps the p-dimensional representation back to the original space. So you project to a lower-dimensional space and project back, exactly the same thing the autoencoder is doing, and then you minimize exactly the same reconstruction cost.

There are papers in the literature, not recent ones, where people tried to make this nonlinear by adding, for example, a sigmoid function in the hidden layer and so on, which is not helpful at all, because you can prove mathematically that the optimal solution to this problem is PCA; basically the optimal solution is linear, so if you want to minimize this cost function you don't need any nonlinearity there.

The paper that was actually published in Science and made deep networks a field was a sort of deep autoencoder: many of these layers. We were not able to train it directly, but by assuming the layers are Boltzmann machines they trained it; that was the first deep structure that was trained. But now you can have a deep autoencoder without Boltzmann machines, and people use it for dimensionality reduction; it can be an alternative for dimensionality reduction, and in fact it produces very good results. If you have many layers, then you do need to apply some nonlinearity in each layer, because if you don't, a stack of linear transformations collapses to one linear transformation, and it is exactly the same as PCA; nothing is gained.
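To see the PCA connection numerically, here is a hedged sketch of the reconstruction objective above, computing the top-p principal directions via SVD on centred data; the data and sizes are illustrative, and it demonstrates the PCA objective itself rather than training an autoencoder:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))          # 200 samples in R^5
X = X - X.mean(axis=0)                 # centre the data, as the derivation assumes
p = 2

# Top-p principal directions via SVD of the data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
U = Vt[:p].T                           # d x p, orthonormal columns: U^T U = I

# Project to p dimensions and back; this U is the optimum of
# min sum_i ||x_i - U U^T x_i||^2  s.t.  U^T U = I.
X_hat = X @ U @ U.T
print(np.sum((X - X_hat) ** 2))        # the minimal reconstruction error
```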
In the autoencoder literature, this type of structure is called an undercomplete autoencoder. An overcomplete autoencoder does not have the bottleneck structure: before, the input was d-dimensional and the hidden layer p-dimensional with p less than d; what if p is greater than d? So the input is x, the output is x, and h is larger than both. If I train a network of this form, what is h going to learn? (Student: it is going to be a higher-dimensional representation, almost lossless; some units could basically be zero or copies.) Exactly. First of all, you are not going to lose any information here; in the bottleneck case you lose some information because you go to a lower-dimensional space, but here you lose nothing. And, as you said, there is going to be either redundancy or exactly the same values copied into h. What is the most trivial way to do this reconstruction? Suppose I have some units in the input and more units in the hidden layer, and I would like to reconstruct x: the most trivial solution is to copy everything, each input unit to a hidden unit, and copy them back, and set the rest to zero, for example. If you do gradient descent on a feed-forward network like this, it is going to learn something like that, and it can reconstruct everything perfectly, so by itself it does not learn anything useful in the hidden units.

But there are variations that can be useful, and one of them is called the denoising autoencoder. The denoising autoencoder adds a layer of this form: if the input is x, I make a layer x-tilde, where x-tilde is noisy, not the same as x, and the output target is still the clean x. Then h can be a useful set of features. The noise could be of any kind: it could be additive noise, for example, or you can train the model so that the noise sets some values of x to 0; if x is a black-and-white image of 0s and 1s, just drop some of the values, as if you had not observed them.

Another way to learn something useful in this structure is to add a regularization term to the cost function. I have a cost function that needs to be minimized, based on my observations: the distance between the input and the output. We can add a regularization term to avoid the trivial solution, and the regularization term is the derivative of h with respect to x: it is a Jacobian matrix, since h is a vector and x is another vector, and we penalize its norm. In the case of the trivial solution, h was a copy of x, so if I change x, h changes immediately. This derivative tells me how much h is going to change when I change x, and I would like to penalize the case where h changes radically when I change x: I change x, but I don't want h to change immediately or radically. So you take the cost function and penalize it with this term.

So in the overcomplete setting, denoising is one structure and this penalty is another; they are useful in some image-processing tasks. And if you remember, I suggested a final project applying this to natural language processing, where the noisy data is not a noisy image but a noisy translation, a noisy sentence, and you want to make it correct.
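A minimal sketch of the masking-noise corruption described above: set a random subset of inputs to zero, feed the corrupted x-tilde to the encoder, and compare the reconstruction against the clean x; the corruption rate of 0.3 is an arbitrary illustrative choice:

```python
import numpy as np

def mask_corrupt(x, rate, rng):
    """Denoising-autoencoder corruption: zero out a random fraction of inputs,
    as if those values had not been observed."""
    mask = rng.uniform(size=x.shape) >= rate   # keep each entry with prob 1 - rate
    return x * mask

# Training pair: the corrupted x_tilde goes into the encoder,
# but the reconstruction loss compares against the clean x.
rng = np.random.default_rng(3)
x = rng.integers(0, 2, size=10).astype(float)  # binary "image"
x_tilde = mask_corrupt(x, rate=0.3, rng=rng)
```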
Any questions? Yes, go to the other board. (Student: is there any restriction on any of the matrices in each layer being orthonormal, like that U?) No, there is no such restriction, and that is why the autoencoder does not find exactly the basis that PCA finds, but it finds a basis that spans the same space as PCA. Okay, so let's get back here.
Info
Channel: Data Science Courses
Views: 35,637
Rating: 4.9529409 out of 5
Id: FJ0z3Ubagt4
Length: 73min 12sec (4392 seconds)
Published: Tue Nov 10 2015