S18 Lecture 22: Boltzmann Machines

Captions
Alright, it's 7:59; I'll start, since one minute isn't going to make any difference. I'm going to continue with the final lecture on Hopfield networks and Boltzmann machines. I've spent an extraordinary amount of time on Hopfield networks, and I'm going to continue to do so, because Hopfield networks are so much easier to understand than Boltzmann machines, which are usually presented as a bit of a mystical object: there's this amazing Boltzmann distribution, god knows where it came from; then you have this crazy mechanism which somehow incorporates the Boltzmann distribution; you end up with a generative model, believe it or not; you can model all kinds of funny stuff; and nobody really knows what it's used for. Hopfield networks, by contrast, are content-addressable memories; they make sense. And they're the right end to start from, because that's exactly where the whole theory of Boltzmann machines began. So I'm going to keep talking about Hopfield networks for a while and slide into Boltzmann machines. Unfortunately I prepared this particular series of lectures a little poorly, so I'm not going to get all the way to where I'd like to be on the topic of Boltzmann machines; maybe the next time I teach the course I'll reorganize this.

Anyway, a quick recap of where we were with Hopfield networks. We defined a Hopfield network as a network with loops, meaning the output of each neuron feeds back into the other neurons. Each neuron is a simple binary threshold unit, switched on or off: if the total field at a neuron is positive, the neuron goes positive, flipping if it has to, and if the total field is negative, the neuron goes negative. In other words, each neuron always tries to align itself with its local field and will flip if required to do so. So you end up with an evolving process: if you initialize the network with some set of values, each neuron tries to align itself with the local field; in doing so it changes the field at other neurons, so those may flip, and so on. The entire system evolves, and we saw that it continues to evolve until the energy term we defined, also called the Hamiltonian in spin glasses, converges to a local minimum. This energy is a quadratic in the state of the network, where by state I mean the binary pattern currently encoded by the network. Although the energy is quadratic, it can have multiple local minima, because we are really working on the corners of a lattice, not the continuous space of possible values.

Perceptrons, of course, have a weighted combination of inputs plus a bias to consider, and since the energy term is the sum over all neurons of the product of the neuron's value and its field, the energy will also include bias terms. But as we've done in the past, you can always think of the bias as just another weight attached to a neuron whose value is pegged at one. The rest of this lecture will use the quadratic term without explicitly writing out the bias, but keep in mind that there's actually one more neuron, pegged to one, which also has a weight to all of the other neurons.
Given this, here's how the whole thing evolves. You have an energy term, and the system keeps evolving until the energy reaches a local minimum. You initialize the network with some initial values, then you go through all of the neurons, check the local field at each one, and align the neuron to the sign of the field; the theta in the update rule is basically just a sign function. The updates can be done sequentially or all at once.

Here I'd like to draw a little diagram for the sake of illustration; we'll use it a few times. Think of this axis as time, 0, 1, 2, 3, 4, and so on, and think of these as my bits, y1 through y4 in this example. If I initialize the neurons in this manner, what happens at the next time step? At each time, each neuron sees a weighted combination of the same neurons at the previous instant, if you think of each step in the evolution as an instant, and responds accordingly. So although I've drawn the whole thing as looping, in reality there's an evolution: you can unroll the whole thing and think of it as an unrolled network, with this happening at each time step. The state of the network is simply the set of y values; I'll call this y0, this y1, and so on, just some arbitrary notation that names the state of the network, meaning all of its bits, at each time t, and this state keeps evolving through time. That's just an unrolling of the same thing. The reason this is not your standard recurrent neural network is that we don't have a predefined number of steps for which the system evolves; it continues until it stops, that is, until the energy converges and doesn't change anymore.

How exactly does this evolve? You can plot the energy of the network as a function of state. Keep in mind that the state is a discrete vector, so there's really no meaning to drawing the state as a continuous-valued variable as I have here; this is meant only for illustration. In the n-dimensional space, where n is the number of bits, the states are the corners of a hypercube; the space itself is continuous, so it's convenient to picture the energy as a continuous curve over it. If you start the system off at some point, it will evolve to reduce the energy of the state until it reaches a point where it can't go down any further, and it will stop there. These locations where the energy is at a minimum are stored patterns: if the network is initialized close to a stored pattern, it will eventually evolve into the stored pattern. So this is a content-addressable memory: we can recall memory content from partial or corrupt values. Again, this is a fundamental difference from other forms of memory as we know them. We saw how this works with a network that remembered a picture; each of these stable patterns is what we will call a memory.
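To make the recap concrete, here's a minimal sketch of the deterministic evolution in Python/NumPy, using the ±1 convention from the lecture; the function names and the stopping test are my own illustrative choices, not code from the course.

```python
import numpy as np

def energy(W, y):
    # Hamiltonian from the lecture: E = -1/2 y^T W y
    # (the bias is folded into W via an extra unit pegged to +1)
    return -0.5 * y @ W @ y

def hopfield_evolve(W, y, max_steps=1000, rng=np.random.default_rng(0)):
    """Repeatedly align each neuron with the sign of its local field,
    in random sequential order, until no neuron flips."""
    y = y.copy()
    for _ in range(max_steps):
        changed = False
        for i in rng.permutation(len(y)):
            field = W[i] @ y                 # local field at neuron i
            new_val = 1 if field >= 0 else -1
            if new_val != y[i]:
                y[i] = new_val
                changed = True
        if not changed:                      # stable state: local energy minimum
            break
    return y
```

Each accepted flip lowers the energy, which is why the loop must terminate at a local minimum of the Hamiltonian.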
When I say the network remembers a pattern, I mean that the pattern is one at which the network has a local energy minimum. So the network remembers this first pattern, and if I initialize the network to some degraded version of it, it will evolve until it recovers the pattern. Or here, for instance, the network was designed to remember this pattern, and if I give it just this partial pattern, it will fill it in. Observe that there are two different things going on here. In the first case I give it a noisy pattern, meaning every pixel may or may not represent the true value of the memory at that point; there's noise. In that case, when we let the network evolve, we let it change every single bit and just hope it comes to a local minimum which actually recovers the original pattern.

OK, so the Hopfield network can be trained to remember specific target patterns; I just showed you a network which remembered these two pictures, Bugs Bunny in one and Chip and Dale in the other. How did it learn to remember those patterns? It was trained to, by setting the weights in the network appropriately. Question one: can I use backprop to do this, and if so, how? Anyone? To see how, go back to the figure I drew on the board. If I gave the network some random initial input and let it evolve, it would evolve for several steps and eventually converge; at that point you could pick the target pattern closest to whatever it converged to, define an error, and propagate it backwards, keeping in mind that this is recurrent, so the weights at every step are tied. Although we draw all of these fancy, different-looking networks, they can all eventually be mapped back to your standard multi-layer perceptron; we've seen this for convolutional networks, we've seen this for recurrent networks, and now I'm telling you Hopfield networks are the same thing. But this is not the interpretation we're going to use.

We've seen in the past couple of classes that if the patterns have N bits, the network needs N neurons, and it can be designed to store up to N target N-bit memories, where a target memory is a pattern I want the network to remember. But we also saw that in the process of trying to make the network remember specific patterns, it ends up remembering other patterns you never intended it to remember. These are what we call parasitic memories, and the network can have an exponential number of them. So when you train the network, you want to design the weight matrix such that the energy of the target patterns is minimized and the energy at potential parasitic memories is maximized, so that when you initialize the network randomly, it's much more likely to evolve to and stop at a target pattern, not at a parasitic memory.
So here's what we really want to do when we train the network: minimize the total energy of all the patterns you're trying to store, and maximize the total energy of all the patterns you do not want to store. If this is the energy contour as a function of the state of the network, where the state ranges over all the bit patterns the network could take, then you have some states which are your target memories, and you want to bring the energy down at those locations; at all the rest of the potential memories you want the energy raised. So we have to learn weights W such that the sum of the energies of the target memories is reduced and the negative of the sum of the energies at the non-target memories is also reduced; in other words, the sum of the energies of the non-target memories is raised.

This we can do using simple gradient descent, and you can see why. The energy for any pattern, we said, is simply E(y) = -Σ_{i<j} w_ij y_i y_j. If I want the derivative of the energy with respect to a specific w_ij, all the other terms fall out, and it's simply dE/dw_ij = -y_i y_j; very straightforward, nothing fancy, because the w_ij is what you're differentiating with respect to. In matrix terms, if I write E = -y^T W y, then the gradient of the energy with respect to W is simply -y y^T, the outer product; that's exactly what we're seeing here. So the derivative of the first term is the sum over all the target patterns of the outer products of the memories I want to remember, and the derivative of the second term, which is trying to raise the energies of the non-target patterns, is the sum over all non-target patterns of their outer products, with the sign flipped. This is just your standard gradient descent rule, with some step size eta.

The overall rule therefore has two components. The first pulls down the energies of the target patterns; it looks like Hebbian learning. The second is like anti-Hebbian learning: it pulls up the energies of all the patterns you do not want to remember. Now, if I have N bits, I have 2^N possible parasitic memories, and to compute the second term explicitly I would have to sum over that entire exponential set of patterns. You don't really need to do that. As we saw in the last class, what we really want is to make sure that whatever we're trying to remember is remembered, meaning there are no valleys adjacent to the target memories that can suck the network down and away from them. So the learning can be modified: lower the target memories, and raise only the valleys adjacent to them, the ones that could potentially drag the network away from the targets. The way to do this is to initialize the network at the target memories and let it evolve; it will converge at the nearest valley.
Then your update rule lowers the energies at the target points and raises the energies at the valleys the network ended up in when you let it evolve from the memories you really want to store. This was very simple, right? So this was the plan: initialize W from the outer products of the target patterns, initialize the network at each target pattern and let it evolve, and apply the overall gradient descent rule. There's also a stochastic gradient descent version, which we already saw in the last class: you can do it one pattern at a time.

Moving on, we also saw a little extension to this: we don't really need to go all the way down to the valley to raise the energy. You just want to move away from the target pattern, and as you move away you want the neighborhood to be higher than the target pattern itself. One way to do that is to start from the target location, but instead of letting the network evolve until it hits a minimum, let it evolve for only a few iterations. You know those are states close to the target where the energy is lower, and you want to raise them; the longer you let it evolve, the broader the valley you're constructing around the target memories becomes. That gives us a very simple modified update rule: if you're doing stochastic gradient descent, you sample a target pattern (and, when sampling, if some memories are more important than others, you can sample the more important patterns more frequently and the less important ones less frequently), initialize the network at that point, run a few iterations so it goes down a little, and apply the gradient rule over that particular pair. That's the stochastic gradient descent version; all of this we've already seen in the previous class.

So far so good. Now, a question was asked two classes ago: when I try to recover the original pattern from this noisy picture, I don't actually recover it; instead I get all these dots. Why is that? Anyone? Somebody's got to be able to give me an answer. Everybody is sleeping, so these questions end up being rhetorical, because I never really get answers. Anyway, we mentioned parasitic memories: patterns the network remembers that you did not want it to remember, and this noisy version of the pattern is one of those memories. Say I'm trying to remember these two patterns; then you're going to get patterns between them, little dips here and there, which are parasitic memories. I've put little pictures with a cartoon parasite on the slide just in case you don't get the point. You're going to get many of these, and the network obviously has a tendency to get stuck in these locations. How could you make the network not get stuck in these parasitic memories? If the network is well trained, you'd expect the energy at the target memories to be quite low, and the energy at the parasitic memories not to be that low. If that's the case, what could you do to let the network escape from the parasitic memories? One thing I could do... yes: you shake it.
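Here's a minimal sketch of that modified rule under the same assumptions as the earlier snippet; hopfield_evolve is the hypothetical helper defined above, and capping it at a few steps (the few-iterations idea from the lecture) would broaden the valley being carved out around each target.

```python
def train_step(W, targets, eta=0.01):
    """One gradient step: a Hebbian term pulls target energies down,
    an anti-Hebbian term pushes up the valley each target evolves into."""
    grad = np.zeros_like(W)
    for y in targets:
        grad += np.outer(y, y)            # lower the energy of the target
        valley = hopfield_evolve(W, y)    # evolve from the target memory
        grad -= np.outer(valley, valley)  # raise the energy of the nearby valley
    np.fill_diagonal(grad, 0)             # no self-connections
    return W + eta * grad / len(targets)
```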
Here's what that means. The way we dealt with the Hopfield network so far, at each point we accepted a flip of a bit only if it resulted in a reduction of the energy. That meant that once you're stuck in a local minimum, you're never going to flip a bit again. But you can change the rule and say: sometimes, just sometimes, I'll let a bit be flipped even if we're at a local minimum. You permit changes in state even if the energy increases, particularly if the change in energy is small, which is what you'd expect at those little dips.

So I change the manner in which I update my bits. I define the local field at any neuron i as z_i = (1/T) Σ_j w_ij y_j (we'll see shortly what the T is for), compute a logistic at that neuron, P(y_i = 1) = σ(z_i) = 1 / (1 + e^{-z_i}), and use this logistic to decide what the output must be. Note that this is not the same as flipping: I'm using the logistic to decide the value of the output. If you plot the logistic against z: when z is very positive, the probability that the output is 1 is essentially one, so if the bit is currently 1 it remains 1, and if it's currently 0 it becomes 1 and stays that way. (When thinking in these terms it's more convenient to use a 0/1 convention than ±1; it's just a scaling and a shift.) When z is very negative, the output goes to 0: if it's currently 0 it remains 0, and if it's currently 1 it flips. If z is in the middle, it's going to be a bit noisy. So observe what happens: when z is a very large positive or negative number, the output almost assuredly aligns itself with the field, but for intermediate values of z there is a probability that the output does not align itself with the field.

The field quantifies the energy difference obtained by flipping the current unit, and if that difference is not large, the probability of flipping approaches 0.5. What do we mean by this? If the field is not very far from zero, the sigmoid term is close to 0.5, and if it's exactly zero the sigmoid is exactly 0.5. A sigmoid of 0.5 means it doesn't matter what the field is telling you the orientation must be: you're just flipping a fair coin to decide the orientation, and as the field moves away from zero, the coin gets more and more loaded. And what is this T? T is a temperature parameter, which we have drawn from thermodynamics.
If I increase T, z tends toward zero. Without T, z might be a very large positive number, but if I toss in a very large T, the network begins flipping bits even though, in the absence of T, the current position might be considered very stable. Why would you need a T of this kind? Look back at the figure: if a parasitic memory is a broad valley, or a deep one, then having a large T allows you to escape it. So T gives you a control that lets you escape unpleasant memories; the evolution of the network becomes more likely to escape spurious memories, and increasing T gives you a greater probability of escaping bad ones. If T is one, you get the traditional definition of the energy and the field we've talked about before; as T goes to 0, you get deterministic Hopfield behavior, where the neuron simply aligns itself with the current field and there's no probability of it misbehaving. Is that clear to everybody? All we've done is add a little bit of noise that lets you escape these ugly little dips in the energy contour.

This gives us a stochastic Hopfield network, whose evolution goes like so: you compute the field at each neuron (assume T is 1 in this algorithm), and each bit is flipped using a loaded coin whose probability of 1 is given by the logistic function. Questions: how long do we continue this, when do we stop, and when we stop, what is the output value? Remember, we're flipping coins all the time to decide what each bit is, so there's no such thing as a definite output value: had I continued one more step, maybe the bits would have changed. So what output do we report, given that the system is continuously evolving?

Let's look at the second question first. I let the system evolve: y0, y1, y2, and it keeps flipping, and eventually you get, say, y90, y91, all the way to y99. If you consider a small window of this evolution and compute the probability distribution over that window, at some point that distribution begins to converge, and once it is stable we claim the system is in equilibrium. Now I can compute averages over these states. Each state is many bits, and each bit takes the value 0 or 1. If I take the average of the bit values in the first position, that gives me the probability that y_1 = 1; the average in the second position gives me the probability that y_2 = 1, and so on. If this probability exceeds 0.5, I say the bit is 1; otherwise I call it 0. So I let the network evolve and use the probabilities to decide what the bits are. Easy enough. Now, in reality, remember we had a temperature parameter, and we wanted to use it to escape bad memories; what we will actually do is perform a process called annealing.
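A minimal sketch of this stochastic update, assuming 0/1 bits as in the lecture; the sweep structure and names are my own illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stochastic_step(W, y, T=1.0, rng=np.random.default_rng(0)):
    """One sweep: each bit is set to 1 with probability sigma(z_i),
    where z_i is the local field scaled by the temperature T."""
    for i in rng.permutation(len(y)):
        z = (W[i] @ y) / T                       # scaled local field at unit i
        y[i] = 1 if rng.random() < sigmoid(z) else 0
    return y
```

As T goes to 0 the sigmoid becomes a step function and this reduces to the deterministic update; at large T every bit approaches a fair coin flip.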
This annealing mechanism is based on the Metropolis-Hastings algorithm. You start off with a high temperature T0 and let the system evolve until you think it has approached some kind of equilibrium; then you lower the temperature and repeat the process, and you keep doing this until the temperature is as low as you want it, and then you take your measurements. In the version on the slides the lowest value of T actually used is finite, whereas if you want to approach the deterministic Hopfield network, T has to go all the way down to zero.

As before, for noisy pattern completion you initialize the entire network and let the entire network evolve; if you're trying to complete a pattern whose known portions are clean, you fix those portions and let only the rest of the network evolve.
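A sketch of that annealing loop, reusing the hypothetical stochastic_step from the previous snippet; the geometric cooling schedule and sweep counts are illustrative assumptions, not values from the lecture.

```python
def anneal(W, y, T0=10.0, T_min=1.0, cool=0.9, sweeps_per_T=50,
           rng=np.random.default_rng(0)):
    """Evolve toward equilibrium at each temperature, then cool,
    until T reaches T_min; take measurements afterwards."""
    T = T0
    while T >= T_min:
        for _ in range(sweeps_per_T):        # approach equilibrium at this T
            y = stochastic_step(W, y, T=T, rng=rng)
        T *= cool                            # lower the temperature
    return y
```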
Now let's go back to the first question: when do we stop, and what do I mean by equilibrium? I hand-waved around it, so let's be precise. Even before that, look at what's happening at an individual neuron: you compute a field and use it to compute a logistic probability that decides the neuron's value. So the probability that a particular neuron takes the value 1 is a conditional distribution, P(y_i = 1 | all the other y values). What, then, is the overall probability of the entire set of neurons? If I use this process to let the network evolve, eventually it arrives at statistically stable configurations. In that stochastically stable regime, what is the probability that the entire network is in a specific configuration of bits? That is, if I keep letting the network evolve by this rule, what is the probability that at any given time I see, say, the bit pattern 0101? I could compute this probability for each bit pattern, and if you work through the arithmetic, it turns out that the probability of any state y is given by the Boltzmann distribution (we saw this before, going the other way, from the Boltzmann distribution to the logistic; it works in both directions): P(y) = C · exp(-E(y)/T), where E(y) is the energy of the state. This is the thermodynamics formula with the Boltzmann constant set to 1; C is a normalization term which makes sure the whole thing is a probability distribution, and without it, it wouldn't be. Minimizing the energy is therefore the same as maximizing this probability, or the log probability.

So the Hopfield network is actually a probability distribution over binary sequences: randomly initialize it, let it evolve until it arrives at statistically stable configurations, and the probability of finding the network in any given state is given by the Boltzmann distribution. The Hopfield network can thus be viewed as a model for a probability distribution over bit strings, where the parameter of the distribution is the weight matrix of the network and the conditional distribution of the individual bits is a logistic. Because this whole thing models a Boltzmann distribution, we no longer call it a Hopfield network: we call the stochastic Hopfield network a Boltzmann machine. The Boltzmann machine is simply a Hopfield network where you don't follow deterministic rules of evolution but instead let the network behave probabilistically.

The entire model can therefore also be viewed as a generative model: it produces bit sequences, and the probability of producing any binary vector y is the Boltzmann probability for that vector. From this perspective, since the network is now stochastic rather than deterministic, you'd expect the mechanism used to train it to change, to account for the fact that the network is probabilistic. For any pattern you can define an energy, and the probability that the network produces that pattern. (I've magically changed symbols from y to s at this point, because I had s on my slides from some time ago and have been too lazy to change it; conveniently, s also stands for state.) By a state I mean the bit sequence giving the configuration of all the neurons in the network. The probability of producing a state s is proportional to exp(-E(s)), assuming T is 1, and the proportionality constant is one over the numerator summed over all possible states: P(s) = exp(-E(s)) / Σ_{s'} exp(-E(s')).

Now, a question: given that I've gone from deterministic to stochastic behavior, would you expect the learning rules to change? Anyone confident? The answer is: they don't, because adding a little bit of noise doesn't make things any different; the noise just cancels out. It's magical: I took a deterministic machine, added a bit of noise, and got a stochastic machine, and I'm about to give you pages and pages of math leading to an equation that I gave you two classes ago with one line of math; the two model exactly the same process.
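For intuition, on a small network you can compute this distribution exactly by enumerating all 2^n states; a toy sketch, with the 0/1 convention and energy sign as used above (only feasible for small n, which is exactly why the lecture resorts to sampling later).

```python
from itertools import product
import numpy as np

def boltzmann_distribution(W, T=1.0):
    """Exact P(s) = exp(-E(s)/T) / Z over all 2^n binary states."""
    n = W.shape[0]
    states = [np.array(s, dtype=float) for s in product([0, 1], repeat=n)]
    energies = np.array([-0.5 * s @ W @ s for s in states])  # E(s) = -1/2 s^T W s
    unnorm = np.exp(-energies / T)
    return states, unnorm / unnorm.sum()   # dividing by Z makes it a distribution
```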
To train the Boltzmann machine, we must learn weights that assign a desired probability distribution to the states. If I think of the network as a content-addressable memory, I want to assign more probability to patterns I like, things I want to memorize, and less to other patterns. Say I'm given some collection of training inputs. These give me the patterns I'd like to remember, and the relative frequencies of the patterns tell me which are more frequent and which are less; so at least amongst themselves, you can think of them as having a probability distribution, and what I really want is to train a network that captures that distribution: assign higher probability to patterns seen more frequently, and very low probability to patterns not in the set at all. You can think of this as modeling a probability distribution over bit sequences, but fundamentally it's no different from memorizing patterns. This goes back to what Gerald was saying a few classes ago: what the network really does, regardless of how you look at it, is memorize patterns; we have different perspectives, but really you're just memorizing stuff.

From the purely probabilistic perspective, what we're doing is maximizing the likelihood of the stored states. So let's look at the maximum-likelihood learning process. P(s) is the probability the network assigns to a state s, a bit-sequence configuration, and it's the Boltzmann distribution. Going back to our formula for reference, with E(s) = -Σ_{i<j} w_ij s_i s_j, we had

P(s) = exp(-E(s)) / Σ_{s'} exp(-E(s')) = exp(s^T W s) / Σ_{s'} exp(s'^T W s').

So what is log P(s)? Taking the log,

log P(s) = -E(s) - log Σ_{s'} exp(-E(s')),

which is what's written on the slide, except that there I've expanded the matrix operation out explicitly. That's the log probability of a single observation. Given the training data set, the total (or average) log probability assigned to the training instances is the average over all training instances of the log probability of each instance, and I can plug in the terms from above: the first term is the average over all the training patterns of the component obtained from the numerator. The denominator term doesn't depend on the training instance at all; it's a constant, so when you sum over N training instances and divide by N, the 1/N and the summation cancel, and all you're left with is the logarithm of the denominator, the term on the right.
To make that explicit:

(1/N) Σ_s log [ exp(-E(s)) / Σ_{s'} exp(-E(s')) ]
  = (1/N) Σ_s (-E(s)) - (1/N) Σ_s log Σ_{s'} exp(-E(s')),

and since the second term is not a function of s, it comes out of the summation: it's log Σ_{s'} exp(-E(s')) times (1/N) Σ_s 1, which is just 1, so you're left with the single log term on the right. These are the two terms whose derivatives we now need; this is the log probability of the training data that we're trying to maximize.

Take the derivative with respect to any w_ij. The derivative of the first term is simply the average over all the patterns you're trying to store of the product of the bits, s_i s_j; we saw this already. The key piece is the second term, the summation over all possible states of the network. Instead of working this out on the board, I've worked it out on the slides. You're differentiating log Σ_{s'} exp(-E(s')) with respect to w_ij; when you differentiate a log, the argument ends up in the denominator and you multiply by the derivative of the argument itself. Working it out:

d/dw_ij [ log Σ_{s'} exp(Σ_{k<l} w_kl s'_k s'_l) ]
  = [ 1 / Σ_{s''} exp(-E(s'')) ] · Σ_{s'} exp(-E(s')) · s'_i s'_j
  = Σ_{s'} P(s') s'_i s'_j,

because exp(-E(s')) divided by Σ_{s''} exp(-E(s'')) is just P(s'). So the derivative of the log of the denominator of the Boltzmann distribution ends up being the expected value of the product of the i-th and j-th bits across all possible states, all possible bit patterns. And this term cannot be computed exactly: it's a sum over every possible state, and there's an exponential number of them. So instead we'll initialize the network randomly, let it evolve, and after many epochs take a snapshot of the state.
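Collecting the steps above in one place, in the lecture's notation, with E(s) = -Σ_{i<j} w_ij s_i s_j and a training set S of size N:

```latex
\mathcal{L}(W) \;=\; \frac{1}{N}\sum_{s \in \mathbf{S}} \log P(s)
  \;=\; -\frac{1}{N}\sum_{s \in \mathbf{S}} E(s) \;-\; \log \sum_{s'} e^{-E(s')}

\frac{\partial \mathcal{L}}{\partial w_{ij}}
  \;=\; \frac{1}{N}\sum_{s \in \mathbf{S}} s_i s_j
  \;-\; \sum_{s'} \underbrace{\frac{e^{-E(s')}}{\sum_{s''} e^{-E(s'')}}}_{P(s')} \, s'_i s'_j
  \;=\; \langle s_i s_j \rangle_{\text{data}} \;-\; \langle s_i s_j \rangle_{\text{model}}
```

The first term is computable directly from the training data; the second is the model expectation, which is what the simulations below approximate.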
If I do this several times, I'll call each run a simulation, and each run gives me one simulated value of the state: I have the network, I initialize it, let it evolve, take a snapshot, and keep doing that. The expected value of the product of the bits can then be approximated by the average of that product over all the simulations. Given this, I have an empirical estimate of the derivative of the log probability; it's not exact, since the exact value requires the expectation over all possible states, but I've run simulations to approximate the denominator term. The empirical estimate of the derivative of the log probability of the training data becomes the average over the training instances of the bit products minus the average over the simulated samples of the bit products. Work through the algebra and you get this very simple rule:

d/dw_ij [(1/N) Σ_s log P(s)] ≈ <s_i s_j>_training - <s_i s_j>_simulation,

and the gradient update is simply w_ij ← w_ij + η (<s_i s_j>_training - <s_i s_j>_simulation).

What does this remind us of? Here is the overall training process: initialize the weights, let the network run to obtain simulated state samples, compute the gradient as written above, and update the weights. This is our Hopfield network all over again. The first term is over all the memories we're trying to store, and the second term is over a bunch of random samples. In other words, we went through this complicated-looking arithmetic (it's not really complicated, just exponents and derivatives) and it was unnecessary: had you worked your way from the Hopfield network as your primary model, you would have ended up at exactly the same learning rule. And all of the modifications we made to the learning rule there apply fairly directly to the Boltzmann machine: you can see how this whole thing might be modified so that you don't need to run many long simulations to generate the samples for the second term of the update rule; you could just start at each of the training patterns, run a few iterations, take only those samples, and raise their energies. That also works, and you'd expect it to improve the log likelihood the Boltzmann machine assigns to the training data. So it's very pretty; it ties up directly with everything we've seen. Questions so far? Anybody?
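Here's a compact sketch of that training loop for the fully visible machine, estimating the model term by sampling; stochastic_step is the hypothetical helper from earlier, and the burn-in and sample counts are illustrative.

```python
def bm_gradient(W, train_patterns, n_samples=20, burn_in=200,
                rng=np.random.default_rng(0)):
    """dL/dW ~ <s s^T>_data - <s s^T>_model, both estimated by averaging."""
    n = W.shape[0]
    data_term = np.mean([np.outer(s, s) for s in train_patterns], axis=0)

    model_term = np.zeros_like(W)
    for _ in range(n_samples):
        s = rng.integers(0, 2, size=n).astype(float)   # random initialization
        for _ in range(burn_in):                       # let the network evolve
            s = stochastic_step(W, s, T=1.0, rng=rng)
        model_term += np.outer(s, s)                   # snapshot of the state
    model_term /= n_samples

    return data_term - model_term

# gradient ascent on the log likelihood: W = W + eta * bm_gradient(W, patterns)
```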
OK. Now, adding capacity. The network can store up to N N-bit patterns, but we want to increase the capacity, and we saw how: just add a large number of neurons whose actual values you don't care about. I'm trying to store N-bit patterns, but I want to store more than N of them, so I append K bits to these N-bit patterns. I don't care what those K bits are, but now I have a network with N+K bits, so the new capacity is N+K patterns. We only care about the first N bits, the ones we're interested in inspecting; the rest are just there to augment the system and increase the capacity. So we'll call the first set, the neurons whose values we're actually going to inspect, the visible neurons, and the neurons whose values we don't care about, the hidden neurons. My network now has two sets of neurons: visible neurons, the bits we will inspect, and hidden neurons, the bits we won't. Although I've segregated them into two groups in the figure, the fact is that every neuron connects to every other neuron: every visible neuron connects to every visible and every hidden neuron, and every hidden neuron connects to every hidden and every visible neuron. And since we're agnostic to the configuration of the hidden values, for any given pattern of visible neurons there are 2^K possible patterns for the hidden neurons.

Now, if I augment the patterns I'm trying to store with K bits, what do I set those K bits to? I'm taking each N-bit pattern and appending K bits, so I have more bits and can store more patterns; but then I have a choice about what those K bits should be. Ideally you'd choose the K bits such that each pattern you're trying to store has the lowest possible energy, but you can't do that; it's an exponential search over the K bits. Also keep in mind that a single training pattern doesn't have to correspond to a single augmented pattern; it could correspond to any number of augmented patterns, so you're expanding the capacity and possibly also making the network more robust. But even then, how many do we take? The obvious solution is to be completely agnostic to the arrangement of the hidden units: I just want the weights to be such that, over all possible combinations of the hidden units, the energy of the target patterns on the visible units is minimized. There's no need to specify a particular pattern for the hidden units, although you could. The difference between the two approaches is that in the first you have a strictly deterministic scheme, where you always choose one or more specific patterns to augment the visible units; in the second, where you average over all of them, you end up with a nice probabilistic interpretation. This now becomes a regular Boltzmann machine as the literature speaks of it. Remember, what we saw so far was a Boltzmann machine without hidden units; we've now extended the basic framework to have hidden units, so the complete state pattern includes both. But there's a problem with extending it this way.
In the case of the Boltzmann machine with no hidden units, when I gave you training patterns, you knew exactly what the complete pattern was for each training instance. Once I extend it, even for the training patterns I no longer know what the complete pattern is: I know the values of the visible units, but I have no idea what values the hidden units took, since the target patterns are defined only over the visible neurons. So we'll change our perspective slightly. I could think of this again in a deterministic Hopfield-network framework, but the stochastic framework is much more interesting, provides more insight, and, as mentioned earlier, has the added benefit of being a generative model. Something interesting pops out here: if I think of this as a generative model, the visible units are the observations and the hidden units are the underlying latent representations of those observations.

There's a little bit of history here. This seems like a very simple model; there's nothing particularly complicated about it: we started with Hopfield nets, walked into Boltzmann machines, and everything was obvious. But if you go back to the 60s and 70s, the field of AI had generally come to the conclusion that the surface observations were all you had: while people agreed that there were underlying latent variables controlling everything you observed, it was considered generally tough, even impossible, to come up with nice mathematical ways of learning what those latent variables might be, simply because the relationships are nonlinear and who knows what those nonlinear relationships are. Faced with that much of an unknown, the general perception was: let's work with the surface, because it's impossible to figure out what goes on underneath. The Hopfield network, and the Boltzmann machine built on it, was possibly the first statistical model of this kind that employed hidden or latent variables governing the behavior of the visible observations and yet gave you a mechanism for actually learning what those latent variables might be. This was one of the earliest real forays into modern latent-variable models in machine learning, and if you read Hopfield's paper from 1982, he actually makes a specific case for this; it's very interesting.

That apart, what the Boltzmann distribution gives you is the probability distribution over the entire bit sequence, but what we're interested in is the probability distribution over only the visible bits. How do we get it? Simple: the bit sequence has two components, the visible components and the hidden components, and you marginalize out the hidden components; you sum the probabilities of all bit patterns in which the visible components take the value you want, and there are 2^K such patterns. Yes, a question? Wasn't the EM algorithm one of the first? I don't remember the date exactly; it's from the 70s, '77 I believe, Dempster and his students.
Given a probabilistic latent-variable model, that paper mainly shows that what was already a familiar algorithm at the time converges; that's about a learning algorithm. Here we're speaking of this in the context of AI, where the connection had not yet been made: I have actual processes I'm trying to model, with underlying latent variables that control the observations. If you look at Hopfield's papers, they speak of communication channels, for instance, where the output must be the same as the input and you want to automatically learn what the hidden representation is; that ends up being a nonlinear relationship. So while people knew these relationships existed, the idea that you could learn them in a clean, automated way was considered doubtful, at least according to the papers of the time.

Anyway, going back: we want to marginalize over the hidden variables. I partition every bit sequence into two components, the arrangement of the visible neurons and the arrangement of the hidden neurons, V and H. When I train the network to store specific memories, I want to store arrangements of the visible neurons; I don't care what the hidden neurons are. So I want to maximize the marginal probability P(V) = Σ_H P(V, H). This requires summing over H, an exponential state space, so we can't do the sum exactly; we'll use simulations again.

Here's how. Remember, the gradient had two components: the first was the expectation of the products of pairs of bits over the distribution of training samples, the second was the same expectation over the distribution currently represented by the network, and the difference between the two is the gradient. Now even the first expectation is hard to estimate, because of the hidden components. So for each training instance we fix the visible pattern and let the hidden neurons evolve, using exactly the same process as before. The network evolves for several steps, and at the end I have some value; that gives me one arrangement of the hidden neurons, and concatenating the hidden and visible neurons gives me my first training sample. I repeat the process K times and get K training instances. Observe something interesting: I'm getting K copies of a single visible pattern. I took one visible pattern, let the network evolve, took a snapshot, let it evolve again, took another snapshot, and so on, K times. As a matter of fact, you don't need to run K independent simulations; you can run one chain and take the final K patterns it produces, and the result is the same, because at equilibrium the behavior is always the same.
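A sketch of that clamped phase, assuming the state vector is laid out as nv visible units followed by the hidden units, and reusing the hypothetical sigmoid from earlier; snapshots are taken once the chain has presumably reached equilibrium.

```python
def clamped_samples(W, v, nv, n_samples=10, burn_in=100,
                    rng=np.random.default_rng(0)):
    """Fix the visible units to training pattern v, let only the hidden
    units evolve, and return full-state snapshots at equilibrium."""
    n = W.shape[0]
    s = np.concatenate([v, rng.integers(0, 2, size=n - nv).astype(float)])
    snapshots = []
    for t in range(burn_in + n_samples):
        for i in rng.permutation(np.arange(nv, n)):    # update hidden units only
            z = W[i] @ s
            s[i] = 1 if rng.random() < sigmoid(z) else 0
        if t >= burn_in:
            snapshots.append(s.copy())                 # one snapshot per sweep
    return snapshots
```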
Then we have the second component, where you're trying to raise the energy of the undesired patterns. There you unclamp even the visible units; you don't set them to any specific value. You let the network evolve, and after it has run for a very long time you take a bunch of samples at equilibrium; that's your second set of simulated samples. Now the derivative of the empirical estimate of the log likelihood of the training data with respect to any parameter is simply the average of the products of pairs of bits over all the samples derived with the visible units clamped to the training patterns, minus the same average over the samples obtained with the network unclamped. This is exactly the same rule as before: dropping in the hidden units didn't make things any more complex or any more mystical. All we've done to account for the fact that we're not specifying the hidden units' values is to sample them many times; taking just a single sample would be like assigning one particular hidden value to each training instance. The overall training looks like this: initialize the weights, run simulations separately to get clamped and unclamped samples, compute the gradient, and do a gradient update (ascent or descent; I keep getting my signs mixed up, so let me check the slides before I post them).

And that's it; that's all there is to Boltzmann machines. It's a very cute idea, a simple extension of Hopfield networks, but with the additional property of being a generative model that produces a Boltzmann distribution. Very early on it was pointed out that you don't really need the Boltzmann distribution: by defining your energy term differently, you end up with other kinds of distributions. Remember that the Boltzmann distribution only looks at interactions between pairs of units; with a different definition of energy you could look at interactions of triplets, or other kinds of cliques of units, and the whole thing generalizes to a standard belief-network kind of structure. So, to summarize: Boltzmann machines are stochastic extensions of Hopfield nets which enable the storage of many more patterns than Hopfield nets, and also enable the computation of probabilities of patterns and the completion of patterns.

Beyond training: once you've trained the network, suppose you want to complete a pattern, as in the two examples we saw. You clamp the bits you've observed, let the network evolve, take many samples toward the end, average them for the unknown bits, and set those bits accordingly. That's for completing a pattern. If you're trying to clean up a noisy pattern, you let the entire network, including the visible units, evolve, and you get the same kind of behavior. For good measure you can add in things like annealing, starting at a high temperature and working down to a lower one; observe, though, that the actual training process assumed the temperature was one.
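A sketch of that completion procedure, clamping the observed bits, letting the rest evolve, and thresholding the per-bit averages at 0.5; the mask convention and counts are assumptions, and stochastic_step is the hypothetical helper from before.

```python
def complete_pattern(W, y_partial, known_mask, burn_in=200, n_samples=200,
                     rng=np.random.default_rng(0)):
    """Clamp known bits, evolve the rest, read out P(bit = 1) > 0.5."""
    y = y_partial.copy()
    acc = np.zeros(len(y))
    for t in range(burn_in + n_samples):
        y = stochastic_step(W, y, T=1.0, rng=rng)
        y[known_mask] = y_partial[known_mask]   # re-clamp the observed bits
        if t >= burn_in:
            acc += y                            # accumulate equilibrium samples
    return (acc / n_samples > 0.5).astype(int)
```

For denoising, you'd simply pass an all-False mask so that every unit, visible included, is free to evolve.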
Now, filling in patterns and denoising patterns is easy, but there is one more thing you can use this for: classification, and also regression. How would I use it for classification? What was the standard mechanism when we trained a classifier? We had a multi-layer perceptron with inputs going in and a one-hot representation coming out. So let me just change the picture. Say I have x_1 through x_D as my input feature vector and I have L classes. I can append the class bits c_1 through c_L to my input pattern itself, so I get an extended vector. I am no longer thinking of it as inputs versus outputs; I am thinking of it explicitly as the joint distribution of the input features and the class labels. These are all my visible neurons now, I can of course also have hidden neurons, and I have all kinds of connections between the two. I can train the network, and later on, when I am given an input, I can treat this as classic pattern completion: I provide the feature bits, run the inference to determine the class bits, and that gives me my classification output.

Once you begin thinking of it this way, you realize it is fundamentally no different from, say, a regression problem. This could be classification, this could be regression: just by treating my inputs and outputs as joint variables that must be jointly modeled, the Boltzmann machine allows me to learn the distribution and make a posteriori predictions of the variables I am trying to estimate. The problem is that the Boltzmann machine takes forever to train. Why? For every single input I have to run a bunch of simulations, and then for the patterns I do not want to be storing I also have to run a bunch of simulations; so you must run a large number of simulations for every single input at each iteration. Clearly this is not a very computation-friendly model, if you think of it in those terms. You would like a simplification, and one answer to that is what we now call the restricted Boltzmann machine. That terminology came later; originally the model was called the harmonium, proposed by Paul Smolensky in the early '80s, and he has this big book chapter about it from '86 or so.

Here is the idea: take the same thing as your Boltzmann machine, but now separate the two kinds of units. The thing with the Boltzmann machine was that every neuron spoke to every neuron: the visible neurons spoke to the hidden neurons, the visible neurons also spoke to the visible neurons, the hidden neurons spoke to the visible neurons, but they also spoke to one another. Instead of that, segregate these things out and make it a bipartite graph: the hidden neurons do not talk to one another, hidden neurons only talk to visible neurons, and visible neurons only talk to hidden neurons. That gives you a restricted Boltzmann machine, and the network can now be separated like so. And now you don't have loopy inference: say I am trying to store a pattern; I don't actually have to loop the network. There were two components to my learning algorithm, and the first one was where I let the network evolve and ran simulations just to compute expected products of bits for the training instances. Even that required simulation before, but here it is going to end up becoming a one-step process.
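Here is a sketch of the classification-by-completion idea (clamp the feature bits of the joint vector, sample the free class bits, read off the average). It assumes some routine sample_completion that runs an already-trained network with a clamp mask; that routine and all the names here are hypothetical.

```python
import numpy as np

def classify(x, n_classes, sample_completion, n_samples=100):
    # The joint visible vector is [x_1 .. x_D, c_1 .. c_L]: clamp the D
    # feature bits, leave the L class bits free, sample, and average.
    D = len(x)
    v = np.concatenate([x, np.zeros(n_classes)])
    mask = np.concatenate([np.ones(D, bool), np.zeros(n_classes, bool)])
    class_votes = np.zeros(n_classes)
    for _ in range(n_samples):
        # sample_completion (assumed) returns one full visible vector
        # drawn with the masked bits held fixed.
        class_votes += sample_completion(v, mask)[D:]
    return int(np.argmax(class_votes))   # most frequently active class bit
```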
Here is how it works. The model still obeys the same rules as a regular Boltzmann machine, meaning each unit computes a weighted combination of its inputs and draws its state from a logistic of that field. But the hidden units only look at the visible states: the j-th hidden unit is drawn from a logistic computed over the weighted combination of all of the visible units, and the i-th visible unit is drawn from a logistic computed over the weighted combination of all of the hidden units. So the hidden units govern the visible units and vice versa; we have just removed a bunch of edges from the standard Boltzmann machine, and that simplifies things greatly.

Remember that for a full Boltzmann machine, for each training pattern we had to generate many copies of the hidden pattern and then create simulated patterns. Here it is much simpler: you anchor the visible units and you only need to sample once. There is no evolution, as you can see; because the visible units are anchored, the hidden units are not going to come back and change them, so a single sample gives you the simulated samples for the complete data. For the second component of training the full Boltzmann machine, we unclamped the visible units and let the entire network run for many rounds to generate the unclamped sample values. We do roughly the same thing here, except that for each training pattern you use the pattern to set the visible units and draw the hidden units, use those hidden units to draw new visible units, use those to draw hidden units again, and so on; you keep flipping back and forth in the generation. So the whole process looks like this: set your visible bits to the pattern you would like to store, draw the hidden values from them, go back and draw the visible values from the hidden values, draw the hidden values again, and repeat for a very long time, until the system converges.

Now observe that there are two things in this chain. The very first arrow already gives me the first component of my update, because it shows what the hidden units would be if the visible units were fixed to your training value; towards the end of the chain you get what you would get if everything were simply allowed to evolve. So the overall training rule looks like this: you don't even fix anything, you just initialize the visible units to the training pattern and let the network evolve, which gives you this complete chain, and you let it run for a very long time. The first term is simply the products of bits at the very first step: you fixed the visible units to your training pattern, drew samples for the hidden units, computed the products of bits, and averaged over all training instances. The second term is exactly the same thing, except computed at the very last step of the evolution. Other than that, everything remains the same.
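In code, the two conditionals of the restricted machine are just one vectorized logistic each; a minimal sketch, assuming a weight matrix W of shape (hidden, visible) and bias vectors b (visible) and c (hidden):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, c):
    # Hidden units see only visible units and are conditionally independent,
    # so one vectorized draw replaces the loopy evolution of the full machine:
    # h_j ~ Bernoulli(sigmoid((W v + c)_j))
    p = sigmoid(W @ v + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b):
    # Visible units see only hidden units:
    # v_i ~ Bernoulli(sigmoid((W^T h + b)_i))
    p = sigmoid(W.T @ h + b)
    return (rng.random(p.shape) < p).astype(float), p
```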
It's a very simple process. Now again, recall; let's go back to our Hopfield networks, I keep going back to them. What is this whole process actually doing? You are starting from some visible units and letting the network evolve; that is what the RBM chain is. Going back to the gradient: it had two components. The first was where you fix the visible units and let only the hidden units evolve, and here there is clearly no evolution in that sense: when I fix these yellow balls I can draw the blue balls, but those blue balls are not going to come back and change the yellow balls, and they don't affect each other directly. So for the first component of the gradient update rule no evolution is required; you only need evolution for the second component, which is fully simulated. And it turns out the two of them can be combined into one chain, like so: you just set the yellow visible units and let the network compute the hidden units, compute the visible units, compute the hidden units, and continue. The first component of the gradient can be computed from the bits after the first step, and the second component from the bits after the last step. The product v_i h_j is the product of the i-th visible bit and the j-th hidden bit; you compute its average over many runs and over all training instances. The second term is the product of the i-th visible bit and the j-th hidden bit at the end of the evolution, and the difference between the two is your gradient.

Now going back to Hopfield networks: what we were really doing there was starting from the pattern we are trying to store and letting the network evolve, and that is basically what is happening here. You set the pattern you want to store, let the network evolve, and you try to lower the energy at the pattern and raise the energy where the free-running network ends up. But we saw when training Hopfield networks that you don't need to go all the way down to the valley; you can stay very close. Which means I need only something like two iterations of this chain, and I can still expect the network to learn to push the patterns I want to store into valleys and to raise everything around them. So it's very pretty: it greatly simplifies the whole process, just by separating out the visible and the hidden neurons. We are still effectively using heuristics that change the statistical model in some manner, but all of them can be related back to the Hopfield network, and in fact pretty much all of these transitions could be made deterministic and applied to Hopfield nets as well.

We're kind of running out of time. Restricted Boltzmann machines are excellent generative models for binary or binarized data. What do I mean? We were speaking of bits all the time, but data are not necessarily bits; we'd like to work with real data. If you're working with real-valued data you could quantize it to binary values, and RBMs are fantastic for binary values, but they can also be extended to continuous-valued data; there's a very nice paper by Max Welling and Hinton from 2004 which explains how you do this.
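The "stay close to the valley" shortcut in the paragraph above is what is usually called contrastive divergence; here is a minimal sketch of one CD-k update, reusing sample_h_given_v and sample_v_given_h (and the numpy import) from the earlier sketch. Using the hidden probabilities rather than sampled bits in the statistics is a common noise-reduction choice, not something specified in the lecture.

```python
def cd_k(v0, W, b, c, k=1, lr=0.01):
    # Positive statistics: products of bits after the very first step,
    # with the visible units fixed to the training pattern v0.
    h0, h0_prob = sample_h_given_v(v0, W, c)
    v, h = v0, h0
    # Run the back-and-forth chain for only k steps instead of to equilibrium.
    for _ in range(k):
        v, _ = sample_v_given_h(h, W, b)
        h, h_prob = sample_h_given_v(v, W, c)
    # Negative statistics: products of bits at the last step of the chain.
    W += lr * (np.outer(h0_prob, v0) - np.outer(h_prob, v))
    b += lr * (v0 - v)
    c += lr * (h0_prob - h_prob)
    return W, b, c
```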
And it's useful for classification and regression exactly as we saw earlier. Here is how you build the continuous-valued restricted Boltzmann machine: observe that the visible units are no longer drawn from a binomial distribution through a logistic; you draw them from some other exponential-family distribution, but the rest of the arithmetic doesn't really change. In fact, you can even make the hidden units continuous-valued; the whole idea remains. So it's very pretty: the machinery we built starting with discrete-valued units carries over very nicely.

There is something more: you can actually make these things deeper. Once you begin segregating the units, you can say that the hidden units themselves are hierarchically arranged, and that gives you what is called a deep Boltzmann machine. The inference rules for a deep Boltzmann machine are very similar to those of a conventional Boltzmann machine or an RBM, but we won't have time to go into them. And there is a variant where you separate these bidirectional connections into two sets of unidirectional connections, one going one way and the other going the other way; this is what Radford Neal, Hinton, and their colleagues call the Helmholtz machine, and when the training idea we just saw is applied to the Helmholtz machine you get the wake-sleep algorithm.

So, to wrap up: I've outlined Boltzmann machines and restricted Boltzmann machines. Ideally I'd have spent another lecture on this, or compacted my first two lectures into something cleaner; there is a lot more to cover and I've barely touched the subject. There are various learning algorithms and inference algorithms, including mean-field approximations. The most interesting things are what these models are used for. They are used extensively as feature extractors, which was one of the original intents in designing Boltzmann machines in the first place; they're very nice as generative models; and for inference, for filling in missing values in data and the like, it turns out RBMs are excellent. And of course RBMs and deep Boltzmann machines can be made much more structured; I don't believe Ruslan is actually teaching 10-707 this time, but you can go and look at his slides, where he goes into a great deal of depth about these variants of deep Boltzmann machines. Anyway, I'll stop here. Any questions? No? OK. In the next few classes we're going to switch topics again: I'll be talking about reinforcement learning. I'll start off with Markov processes, Markov reward processes, and Markov decision processes, and how we can use these to play games; it's a series of four lectures, and I'm hoping that by the fourth lecture I can cover deep Q-networks and some of their variants. If I don't, Ryan here is going to be doing a recitation on the subject. It's one of the most fun topics you can work on in deep neural networks; don't miss it, and this boy knows his stuff better than I do. Thank you.
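To make the continuous-valued extension above concrete: with unit-variance Gaussian visible units (the common Gaussian-Bernoulli variant; the Welling and Hinton line of work treats the general exponential-family case), only the visible conditional of the earlier RBM sketch changes. A sketch under that assumption, reusing the earlier names:

```python
def sample_v_given_h_gaussian(h, W, b, sigma=1.0):
    # The visible conditional becomes a Gaussian whose mean is the usual
    # weighted combination of hidden units; sample_h_given_v is unchanged.
    mean = W.T @ h + b
    return mean + sigma * rng.standard_normal(mean.shape)
```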
Info
Channel: Carnegie Mellon University Deep Learning
Views: 39,718
Rating: 4.9483871 out of 5
Id: it_PXVIMyWg
Length: 81min 32sec (4892 seconds)
Published: Mon Apr 09 2018