Backpropagation and the brain

Captions
Hi there! Today we're looking at "Backpropagation and the Brain" by Timothy Lillicrap, Adam Santoro, Luke Marris, Colin Akerman, and Geoffrey Hinton. This is a bit of an unusual paper for the machine learning community, but it's interesting nevertheless, and let's be honest, at least half of our interest comes from the fact that Geoffrey Hinton is one of the authors. The paper proposes a hypothesis on how something like the backpropagation algorithm could work in the brain, because previously there has been a lot of evidence against there being anything like backpropagation in the brain. So the question is: how do neural networks in the brain learn?

The authors say there can be many different ways that neural networks learn, and they list them in a diagram where a network maps from input to output through weighted connections between neurons. The input here is two-dimensional, it maps through these weights to a three-dimensional hidden layer, and usually there is a nonlinear function at the output of each unit: the units take a weighted sum of their inputs, apply a nonlinearity, and propagate that signal to the next layer, and so on, until finally the output.

So how do these networks learn? One way of learning is called Hebbian learning. The interesting thing here is that it requires no feedback from the outside world. In Hebbian learning, you update the connections such that they match, or even amplify, their own previous outputs. You propagate a signal, and maybe this neuron spikes really hard while that one spikes really low; if you propagate a similar signal again, you want to match or strengthen those activations. No feedback is required: it's a self-amplifying or self-dampening process.
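As a concrete illustration, here is a minimal numpy sketch of a Hebbian update (the layer sizes, the input, and the learning rate `eta` are my own illustrative choices, not from the paper): each connection is strengthened in proportion to the product of the activities it connects, so presenting the same input again produces an amplified version of the previous response, with no external feedback involved.

```python
import numpy as np

# Hebbian update sketch: strengthen connections in proportion to the
# co-activity of the units they connect (no feedback from the world).
def hebbian_update(W, x, eta=0.1):
    y = W @ x                               # forward pass: weighted sum
    return W + eta * np.outer(y, x), y      # Delta W = eta * y x^T

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))                 # 2-d input -> 3-d layer, as in the diagram
x = np.array([1.0, 0.5])

W_new, y = hebbian_update(W, x)
y_again = W_new @ x                         # same input after the update
# Self-amplifying: y_again = (1 + eta * ||x||^2) * y, a scaled-up copy of y
```

The update rule only ever touches quantities local to the connection (pre- and post-synaptic activity), which is exactly why no error signal is needed.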
Ultimately, though, you want to learn something about the world, and that means you need some feedback from outside. By feedback we usually mean that the output goes into the world. Say this is a motor neuron: you do something with your arm, like hammering a nail, and you either hit the nail or you don't. Say you don't, and afterwards the nail is crooked: now you have feedback, usually in the form of some sort of error signal. It can be "this was good" or "this was bad", or "this was a bit too much to the left", and so on. The important part is that you get a single number of feedback, telling you how bad you were, and your goal is to adjust all of the individual weights between neurons such that the error gets lower. In Hebbian learning there is no feedback; it's simply a self-reinforcing pattern-activation machine.

In these first instances of perturbation learning, you have one single feedback signal, drawn here as a diffuse cloud, and every single neuron receives it: if the feedback is -1, every single neuron is punished by that. You can imagine it like this: you have your input x and you map it through your function f, which has weights w1 and so on. You map x through f and get a feedback of -1. Then you map x through f again, with a little bit of noise added to the weights, and get a feedback of -2. That means the direction of this noise was probably a bad direction, so you want to update the weights in the direction of the negative of that noise, modulated by a factor that tells you how much worse the perturbed run was, namely the difference between the two feedback values.
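This noise-based procedure is essentially weight perturbation. Here is a hedged sketch (the tiny linear model, the target standing in for "hitting the nail", and the step sizes `eta` and `sigma` are all illustrative assumptions of mine): perturb the weights, compare the scalar loss before and after, and move the weights against the noise in proportion to how much worse things got.

```python
import numpy as np

# Perturbation-learning sketch: the only feedback any weight ever sees is
# the scalar difference "how much worse did the perturbed network do?".
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))             # all weights of a tiny linear "network"
x = np.array([1.0, -0.5])
target = np.array([0.5, 0.0, 1.0])      # stand-in for "hitting the nail"

def loss(W):
    return np.sum((W @ x - target) ** 2)    # one number for the whole network

eta, sigma = 0.05, 0.01
loss_before = loss(W)
for _ in range(500):
    noise = rng.normal(scale=sigma, size=W.shape)
    delta = loss(W + noise) - loss(W)       # scalar feedback difference
    W -= eta * (delta / sigma**2) * noise   # step against harmful noise
```

On average this follows the gradient, but each step is driven by a single number, which is why this kind of learning is so much slower than backprop.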
So with a scalar feedback, you only tell the network as a whole whether it did right or wrong; you have no accountability for the individual neurons. All a neuron can conclude is "whatever we're doing here is wrong" or "whatever we're doing here is right, so I'm going to do more of the right things."

In backpropagation it is very different. You have your feedback, say -1, and then you do a reverse computation. The forward computation was a layer-wise weighted sum, and because you know how each output came to be from its inputs, you can do a reverse propagation of the error signal, which is of course the gradient: you derive the error with respect to the inputs of the layer. In the backpropagation algorithm, this tells each node exactly how it has to adjust its input weights in order to make the error go down. And because you propagate the error this way, each layer gets a vector target: no longer just one number, but a vector saying "these outputs would be beneficial, please change your outputs in the direction of (-2, -3, +4)", where the -2 is for this unit, the -3 for that unit, and the +4 for that unit. Each unit is instructed individually on the direction it should change in order to make the error lower.
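That per-unit instruction is exactly the gradient of the loss with respect to each hidden activation. A minimal sketch (linear layers and a squared-error loss of my own choosing): one application of the transposed output weights turns the single output error into a signed, per-unit target direction for the hidden layer.

```python
import numpy as np

# Backprop sketch: the scalar output error becomes a *vector* of signed,
# per-unit instructions for the hidden layer.
rng = np.random.default_rng(2)
W1 = rng.normal(size=(3, 2))    # input -> hidden
W2 = rng.normal(size=(1, 3))    # hidden -> output
x = np.array([1.0, 0.5])
target = np.array([1.0])

h = W1 @ x                      # hidden activations
y = W2 @ h                      # network output
out_err = y - target            # the one number the world gives us
h_err = W2.T @ out_err          # dL/dh for L = 0.5 * ||y - target||^2:
                                # each hidden unit gets its own signed direction
```

Each entry of `h_err` is the "decrease by two / increase by four" instruction for one hidden unit.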
You can see this is much more information than in perturbation learning, where all the units only know "the output was bad, and now it's better, so let's keep changing a bit in this direction." Here you have detailed instructions for each unit, thanks to the backpropagation algorithm.

People have long thought that, since backpropagation isn't really possible with biological neurons, the brain might be doing something like perturbation learning. But this paper argues that something like backpropagation is not only possible but likely in the brain, and it proposes this backprop-like learning with a feedback network. The authors differentiate sharply between the two regimes: on one hand scalar feedback, where the entire network gets one number and each neuron just gets that number; on the other hand vector feedback, where each neuron gets an individual instruction on how to update. They achieve vector feedback not by backpropagation, since the original formulation of backprop as we use it in neural networks is not biologically plausible, but with this backprop-like learning via a feedback network, which is constructed so that it can give each neuron in the forward pass detailed instructions on how to update itself.

They also have a little diagram of an error landscape: if you do Hebbian learning, you don't care about the error at all, you're just reinforcing yourself. If you do perturbation learning, optimization is very slow because you don't have a detailed signal; you rely on that one number, as if you were updating every single neuron of your neural network with reinforcement learning, treating the error as the reward and not using backprop.
With backprop, by contrast, you get a much smoother, much faster optimization trajectory. So the authors look at this and come to some conclusions.

First of all, here's backprop: as we said, you have the forward pass, where you compute these weighted sums and usually pass them through some sort of nonlinear activation. The cool thing about artificial neural networks is that once the error comes in, you can exactly reverse that computation: you can do a backward pass where the errors propagate through. The function doesn't have to be invertible; the gradients will flow backwards as long as you know how the forward pass was computed.

First they go into a discussion of backprop in the brain: how can we even expect it there? One cool piece of evidence, I find, is that they cite several examples where artificial neural networks were trained on the same tasks as humans or animal brains (I have no clue how they measure any of this), and then the hidden representations of the living neural networks and the artificial neural networks were compared. It turns out that networks trained with backpropagation match the biological networks much more closely in how they form their hidden representations than networks trained without backprop, and they cite a number of experiments showing this. So this gives you good evidence that the hidden representations look as if they had been computed by backprop and not by any of these scalar-update algorithms, and it is conceivable that we find backprop in the brain. Next, they go into the problems with backprop: why, so far, have we believed that backprop isn't happening in the brain?
I want to highlight two factors here that I find suffice to state the case; they have more. First of all, backprop demands synaptic symmetry in the forward and backward paths. If a neuron has an output connection to another neuron, you need to be able to pass information back along that connection, so the forward and backward connections have to be symmetric, and the weights need to be exactly equal. That's just not how neurons are structured: they have input dendrites, then the action potential travels along the axon, and backward travel of a signal along the axon is, I think, very, very slow, if it's possible at all. So neurons are generally not invertible or capable of this inverse computation. This is one reason why backprop seems unlikely.

The second reason is that error signals are signed and potentially extreme-valued, and I want to add, they also mention somewhere that error signals are of a different type. First, what does "signed" mean? We need to be able to adjust neurons in specific directions. If you look again at what we drew before: the first neuron must decrease by two, this one must decrease by three, and this one must increase by four. In backprop we need this. But even if we assume there is something like a reverse computation or signaling happening, we still have the problem that these output signals are usually in the form of spiking rates over time: if a neuron has zero activation, there is just no signal; if it has a high activation, it spikes a lot; if it has a low activation, it spikes sometimes. But a neuron cannot spike negatively; zero is as low as it goes. So the thought that there is signed information
in the backward pass is hard to conceive. Even if you imagine, instead of the backward connection (which we ruled out because of the symmetry problem), some kind of second neural network going in the reverse direction, you would still have the problem that it can only carry a positive signal or zero. The error signals might also be extreme-valued, which can't really be encoded with spiking, because spiking rates are limited in the range they can assume. And they are also of a different type. What I mean by that: if you think of this as a programming problem, then the forward passes carry activations, while the backward passes carry deltas: directions of change, gradients ultimately. The activations are sort of impulses, whereas the backward signals say "this is how you need to change." So it is fundamentally a different type of data that would be propagated along these backward directions, and that makes backprop very unlikely, because, as this paper says, we are not aware that neurons can switch the data type they're transmitting.

The paper then goes into its NGRAD hypothesis (neural gradient representation by activity differences), which states that the brain could implement something like neural-network learning by using an approximate backprop-like algorithm based on autoencoders. I want to jump straight into the algorithm, but actually, first they talk about autoencoders, which I find very interesting. What is an autoencoder? An autoencoder is a network that starts with an input layer, has a bunch of hidden layers, and at the end tries to reconstruct its own input: you feed data in here, you get data out here, and your error signal is the difference to your original input.
Usually, when we train autoencoders in deep learning, we also train them by backprop: we take this reconstruction error and propagate it back through the network. But think of a single-layer autoencoder with, say, the same number of units in the hidden layer as in the input. You have the input, the hidden layer, and the output, with a weight matrix going up, probably some sort of nonlinear function, and another weight matrix going down; they call these W and B. Another way to draw this: the weight matrix W goes up, a nonlinearity transforms the signal, and then B brings it back down. Drawn this second way, you can see it is kind of a forward-backward algorithm, where the error is the unit-by-unit difference between the input and the reconstruction. And you can train such an autoencoder simply by telling W and B to bring the input and the reconstruction closer together; this will become clear in a second. The basic idea is that you can train an autoencoder using only local update rules: you don't have to do backprop.

And that's what this algorithm proposes. Think of a stack of autoencoders, each transforming one hidden representation into the next; this is the feed-forward function. First of all, assume that for each of these forward functions f you have a perfect inverse g: you can perfectly compute the inverse function. Of course this g doesn't exist, but assume you have it.
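The single-layer autoencoder with local updates can be sketched as follows (the dimensions, the tanh nonlinearity, and the learning rate are illustrative assumptions of mine): the decoder weights B are trained with a purely local delta rule, reconstruction error times hidden activity, so no error signal ever crosses more than one layer.

```python
import numpy as np

# Single-layer autoencoder, trained locally: encode with W, decode with B,
# and nudge B by (reconstruction error) x (hidden activity) -- a rule that
# only uses quantities available at this layer.
rng = np.random.default_rng(3)
W = rng.normal(scale=0.5, size=(4, 4))    # "up" weights (encoder)
B = rng.normal(scale=0.5, size=(4, 4))    # "down" weights (decoder)
x = rng.normal(size=4)
eta = 0.1

for _ in range(300):
    h = np.tanh(W @ x)                    # forward: nonlinearity after W
    x_hat = B @ h                         # backward: reconstruction via B
    B += eta * np.outer(x - x_hat, h)     # local delta rule on B only
```

After training, B has learned to invert the encoding for this input, which is the role g will play in the stacked construction below.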
What you could then do is this. On the top layer, you know: I got this output from my forward pass, but I would like to have that desired output; the difference is your error signal, and this is exactly what you do in the output layer in backprop. But now, instead of backpropagating this error along the layers, we use the function g to invert f, and ask: what should the hidden representation in layer two have been in order for us to obtain the desired output? The claim is: had layer two produced g(desired output) as its hidden representation, then applying f to it would have landed us exactly where we want. Instead, layer two produced h2, we applied f to it, and we landed where we don't want. So now we know where we would want to be in layer two and where we actually were, and again we can compute an error there. And again, instead of backpropagating that error, we use the inverse of the forward function to propagate back the desired hidden representation. There is of course a relationship to true backprop here, but the important distinction is that we are not backpropagating the error signal; we are inverting the desired hidden states of the network. In each layer, we can then compute the difference between the forward-pass activation and the desired hidden state, and thereby obtain an error signal. And now we have achieved what we wanted: an algorithm that doesn't do backprop and only uses local information, meaning information within the same layer, to compute the error signal it needs to adjust itself.
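With the perfect-inverse assumption, the whole step can be written in a few lines (illustrative invertible linear layers of my own choosing; g is realized by an exact linear solve): the output target is pulled back through g into a hidden-layer target, and hitting that hidden target would achieve the output target exactly.

```python
import numpy as np

# Target propagation with a *perfect* inverse g = f^{-1}: instead of a
# gradient, each layer receives the hidden state it should have had.
rng = np.random.default_rng(4)
F1 = rng.normal(size=(3, 3))              # layer 1 forward weights
F2 = rng.normal(size=(3, 3))              # layer 2 forward weights
x = rng.normal(size=3)

h = F1 @ x                                # hidden representation
y = F2 @ h                                # output of the forward pass
y_desired = np.ones(3)                    # stand-in for the desired output
y_target = y - 0.1 * (y - y_desired)      # nudge the output toward the goal

h_target = np.linalg.solve(F2, y_target)  # g(y_target): hidden-layer target
layer2_error = h - h_target               # local error, no backprop needed
```

The local error `h - h_target` lives entirely within the hidden layer, yet pushing h onto h_target would realize the output target exactly.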
Also, the data type propagated by f is activations of hidden representations, and g likewise propagates activations of hidden representations. Both are always positive, can be encoded by spiking neurons, and so on. So this algorithm achieves what we want. They go a bit into detail on how the actual error update within a layer can be achieved; apparently, neurons within the same layer can adjust themselves toward a given desired activation.

So this algorithm achieves it, but of course we don't have this perfect inverse g, and therefore we need to get a bit more complicated. What they introduce is the following algorithm. The goals are the same, but now we assume we do not have a perfect inverse, only something that is a bit like an inverse: an approximate inverse. They suggest that if we have an approximate inverse g of f, we can do the following. This is our input signal; we use f to map it forward, layer by layer, all the way up, until we get our true error at the top, from the environment (this is the nail being crooked). Then we do two applications of g. First, we apply g to what we got in the forward pass, and this gives us a measure of how bad our inverse is: we had h2 in the forward pass, we pushed it forward through f and then back through our inverse, and we didn't land quite exactly where we started. This difference between our forward-then-inverted h and our true h tells us the mistake g makes. Second, we also project the desired outcome back using g, inverting the desired outcome as before. Previously we compared these two quantities directly, because this is what we got and that is what we want; but now we correct for the fact that g isn't a perfect inverse, and our assumption is that g probably makes about the same mistake on the desired outcome as it makes on the forward-pass value.
So what we do is take this error vector of g and add it as a correction, and what we get is the corrected desired hidden representation: corrected for the fact that we don't have a perfect inverse. Now again we have a local error that we can adjust to. All the signals propagated here are just neural activations, and all the information required to update a layer of neurons is contained within that layer of neurons. And this goes back through the network; this is how they achieve it.

Here is a close-up look, with the computations. For the forward updates, you adjust W in the direction of h-tilde minus h, where h-tilde is the hidden representation you would like to have: you update your forward weights so that your forward hidden representation moves closer to your backward (target) hidden representation. For the backward updates (W are the weights of f, and B are the weights of g), your goal is to make g a better inverse. Again you take a difference, but notice it is not the same error: in the W update you use what we labeled the error between the activation and its target, while in the B update you use the error of g, the difference between the true activation and its forward-then-inverted reconstruction. When you update g, you bring these two closer together, so that g becomes a better inverse; because you're dealing with an approximate inverse, you still need to learn that approximate inverse, and this is how you learn it. This algorithm now achieves what we wanted: local updates, check; data types, check; signed signals, check; and so on. I hope this was clear enough; in essence it's pretty simple.
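The corrected target and the two local update rules can be sketched for one layer like this (the linear layer, the imperfect inverse, the target, and the learning rate are my own stand-ins, not the paper's exact formulation):

```python
import numpy as np

# Difference-correction sketch for one linear layer h = F a, with an
# approximate inverse g(h) = G h.
rng = np.random.default_rng(5)
F = rng.normal(size=(3, 3))                            # forward weights (f)
G = np.linalg.inv(F) + 0.05 * rng.normal(size=(3, 3))  # approximate inverse (g)
a = rng.normal(size=3)                                 # input to the layer
h = F @ a                                              # forward activation
h_desired = h + np.array([0.2, -0.1, 0.3])             # target from the layer above

# Corrected target (the "two applications of g"): trust g, but add the
# mistake g makes on the point we actually saw in the forward pass.
a_naive = G @ h_desired            # plain inversion of the desired output
g_err = a - G @ h                  # g's error on the forward-pass value
a_target = a_naive + g_err         # corrected target for the layer below

# The two local weight updates:
F0 = F.copy()                              # keep pre-update weights for reference
eta = 0.01
F = F + eta * np.outer(h_desired - h, a)   # forward update: pull h toward target
G = G + eta * np.outer(a - G @ h, h)       # backward update: make g invert f better
```

Note that both updates use only quantities available at this layer: the activation, its target, and g's reconstruction of it, all of which are positive-encodable activity vectors rather than gradients.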
But it's pretty cool how they work around these constraints. They call this "difference target propagation", and I don't think the authors of this paper invented it; maybe they did, maybe they didn't, and this paper just frames it within this hypothesis. It is unclear to me, and I am not familiar with this line of papers, so sorry if I misattribute something here. They then go into how these things could be implemented biologically, and they present some evidence. They also state that we used to look at neurons in a very simplistic way, with input and feedback, whereas nowadays even the computational community views neurons in a more differentiated way, where you have, for example, different regions on the soma that can be separated from each other, interneuron interference, and so on. I'm not qualified to comment much on this, but I invite you to read it for yourself if you want. All right, this was my take on this paper. I find the algorithm they propose pretty cool. I hope you liked it; check it out. Bye bye!
Info
Channel: Yannic Kilcher
Views: 16,104
Keywords: deep learning, machine learning, biologically plausible, neural networks, spiking, neurons, neuroscience, hinton, google, deepmind, brain, cells, soma, axon, interneurons, action potential, backprop
Id: a0f07M2uj_A
Length: 32min 25sec (1945 seconds)
Published: Mon Apr 20 2020