A bio-inspired bistable recurrent cell allows for long-lasting memory (Paper Explained)

Captions
Hi there! Today we're looking at "A Bio-inspired Bistable Recurrent Cell Allows for Long-lasting Memory" by Nicolas Vecoven, Damien Ernst, and Guillaume Drion of the University of Liège.

This is not a paper that wants to push state of the art on anything. It takes a concept from biological research on actual neurons, the bistability property, and tries to introduce it into recurrent neural networks. On toy or small data, they show that this has the interesting property that these recurrent neural networks can then remember important things for much, much longer than our current recurrent architectures can. I believe this is a very interesting paper, and a nice refresher from the whole state-of-the-art number-pushing papers. So dive in with me to explore this. If you like content like this, also consider subscribing if you aren't, sharing it out, and leaving a like and a comment if you have any sort of comments.

All right. They basically say that recurrent neural networks provide state-of-the-art performance in a wide variety of tasks that require memory, which is true. So what do recurrent neural networks do? A classic recurrent neural network goes something like this: there is a hidden state at time step t, and there is a sequence of inputs that you have to work with, call them x1, x2, x3, x4, and so on, and at some point you have to provide an output y. This could be at every single time step, or sometimes just at the end. For example, the input could be a piece of text, maybe an email, and you need to decide whether or not it's spam; or it could be a time series of a patient in an ICU, and you need to decide whether or not to give the patient some medication. The applications of this are very wide, and any sort of series data will do.

So there's this hidden state, and at each time step the hidden state is updated to a new hidden state (call the first one h0) by incorporating the input: somehow the input x and the previous hidden state are made into a new hidden state, then the next input is taken in, a new hidden state is made, and so on. One property here is that the newest hidden state only depends on the hidden state right before it and the input that corresponds to it; it doesn't directly depend on hidden states further back. The other important property is that the functions that incorporate the input and that turn one hidden state into the next are always the same in each time step: the parameters are shared between them. There is of course a joint function of the two that actually produces the next state, and its weights are shared across all time steps; that's what makes the network recurrent. We call a single time step here a recurrent cell, and the question now is how to construct a recurrent cell.

Usually, recurrent neural networks run into the problem of either exploding or vanishing gradients. If you are into neural networks, you know this: the update is a weight matrix multiplied by the previous hidden state, and if you just multiply the same weight matrix over and over and over again, what happens pretty much depends on the value of that weight matrix. If its top singular value is higher than 1, the signal is going to explode, and if it's lower than 1, the signal is going to fade over time, and there's pretty much nothing you can do about it.
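You can see this numerically. Here is a minimal NumPy sketch (my own illustration, not from the paper) that multiplies a hidden state by the same matrix a hundred times; the norm of the result explodes or vanishes depending on the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)

def repeated_norm(top_singular_value, steps=100, dim=16):
    # build an orthogonal matrix scaled so every singular value equals
    # the target; repeated multiplication then scales the norm by that
    # value at each step, which isolates the explode/vanish effect
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    W = top_singular_value * Q
    h = np.ones(dim)
    for _ in range(steps):
        h = W @ h
    return np.linalg.norm(h)

print(repeated_norm(1.1))  # ~ 1.1**100 * ||h||, explodes
print(repeated_norm(0.9))  # ~ 0.9**100 * ||h||, vanishes
```

An orthogonal matrix is used here so that all singular values are exactly the chosen scale; a generic random matrix would show the same qualitative behavior, just less cleanly.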
So classic RNNs have looked like this: the next hidden state is a nonlinear function g of the current input and the last hidden state, where g can be some nonlinearity like a sigmoid or a hyperbolic tangent, and you simply multiply the two inputs by weight matrices and add them up. That's what we just looked at. As I said, this is problematic because of the vanishing or exploding gradients, and therefore people have come up with methods to solve it; you might know LSTMs and GRUs, which are used for exactly this. These cells are much more complicated than the standard cell we saw, but they are also much more effective, because they don't have the vanishing or exploding gradient problem. Their promise is that they can remember things for longer, because they allow the gradient to flow without these problems during backpropagation.

How does one of these look? In this paper they mainly look at the GRU, the gated recurrent unit, which is a simpler version of the LSTM; the LSTM is slightly more complex, but the principles are the same. These are the formulas for the GRU, and we're going to try to deconstruct them. As you can see, the inputs are this time step's input xt and the last hidden state; those are all the quantities we need, and we need to somehow output the next hidden state (the last hidden state is then also used to predict y, by the way, in all of these cases). First we calculate two things called z and r, and both are computed the same way: multiply the last hidden state and xt by weight matrices and run the sum through a sigmoid nonlinearity. In the drawing, every arrow is a multiplication by a weight matrix, so every arrow transforms its input; join the transformed hidden state and input in a sigmoid node and you get zt, join them again (with different weight matrices) and you get rt.

So far so good. Now we combine all of this in the last line. The z here sort of acts as a switch: z is the result of a sigmoid, so it's between 0 and 1, and the product here is the Hadamard product, the element-wise product between vectors, which means this is like a gating, like a switch. If z is 1 it selects one quantity, and if it's 0 it selects the other; of course it can be anywhere in between, but those are the ends of the spectrum. So z is a switch that selects between the last hidden state, which is one option for the next hidden state, and a second possibility: a hyperbolic tangent of some combination. Let's build that combination from the back. The input to the tanh is two things: first, the input xt, modulated by a weight matrix (every arrow can be a function, remember); and second, the last hidden state modulated by r. So r acts as another gate: r is again between 0 and 1 because it's the result of a sigmoid, and it element-wise gates the last hidden state, opening or closing that gate, before the result is fed into the tanh. It's a rather complicated setup, so let's analyze it.

First of all, the next hidden state is either the last hidden state or something new, and that choice is modulated by z, which is calculated from the hidden state and the current input. This allows the cell to look at the hidden state, which is sort of the information about what happened so far, and at the current input, which is the new information it gets from the sequence, and to decide: do I even want to update my hidden state? If not, it can just select the old path, and nothing happens; the next hidden state will be exactly the same as the last one. But the cell might decide, "wow, this new thing in the sequence is actually important, I should remember it." Remember that the task of the network is sometimes to remember things from the sequence: if the input is an email and we want to detect whether it's spam, the words "buy gold" in the sequence might be really important, and you need to remember them in the hidden state, because the only way information from x flows to y is through the hidden state. So at that point you'd want to update the hidden state, while at a less important input you'd want the hidden state to stay the old hidden state. z is the gate that allows us to do this.

If we do decide to update the hidden state, then what? We will incorporate the new input, but we can also decide how to mix the new input with the old hidden state. We don't simply discard the old hidden state: it still has a path through which it can be remembered, but it's a longer path, and it has to go through the r gate. This r decides which parts of the old hidden state pass through. It's an element-wise product, and r is between 0 and 1 at each position of the vector, so at each position r decides: is this worth remembering or not? If not, r is zero there, that position of the old hidden state becomes zero, and it is forgotten. That's the opportunity for the hidden state to incorporate new information: it can delete old information and incorporate the new input, which results in the new hidden state.

So there are two major things. First, we can decide whether or not to even incorporate new information; that's achieved by the z gate. Second, if we do update, we can decide which parts of the old hidden state to forget; that's the r gate. And how to update is then basically the result of the weight matrices associated with the tanh. So that's the gated recurrent unit, and it works a lot better than classic RNNs.
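To make this concrete, here is a minimal NumPy sketch of one GRU step as just described (my own illustration, using the convention from the video that z = 1 keeps the old state):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, params):
    """One GRU update. h has shape (d,), x has shape (n_in,)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)             # update gate: keep old state vs. take new
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate: which parts of h pass through
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate from input + gated old state
    return z * h + (1 - z) * h_cand          # element-wise switch old vs. new

def init_gru(n_in, d, seed=0):
    rng = np.random.default_rng(seed)
    shapes = [(d, n_in), (d, d)] * 3
    return [0.1 * rng.standard_normal(s) for s in shapes]

params = init_gru(n_in=3, d=5)
h = np.zeros(5)
for x in np.eye(3):          # feed a toy 3-step sequence
    h = gru_step(h, x, params)
print(h.shape)               # (5,)
```

Note that some references write the switch the other way around, with (1 - z) on the old state; the behavior is equivalent up to relabeling the gate.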
Having said that, they now turn to this property of neuronal bistability that occurs in actual neurons, and to a model of a neuron with this property. Forget everything we said about GRUs for now; we're just going to look at this. What usually happens in a neuron: you have a single neuron with input synapses, connections coming in from other neurons, and they are accumulated. In a classic model of a neuron they are just summed up, and then you run the sum through something like a step function: if the sum of all the inputs is smaller than a particular threshold, the output is nothing, and if it's higher than the threshold, the neuron fires. The inputs can be weighted and whatnot, but in this case the result is just a function of the inputs, and that gives you the input signal to the neuron.

Now there is the property that makes this interesting. The signal goes out and is integrated (this is an integrator), and that gives the output signal; but there is also this feedback connection, which means the signal that comes out at time step t is fed back and combined with the input before itself, so the neuron is sort of self-modulating. If you look a bit closer, you'll see that there is a minus sign, so the feedback is actually subtracted, and there is an F, meaning the feedback passes through a nonlinear function.

Let's first suppose F is simply monotonic, and estimate what happens. If the input is a very big number, the sum will be a big number and the output will be a big number; then F of that output is also a big number, because F is monotonic, and that big number is subtracted. So whenever the neuron gets very excited, the feedback pushes it back down. When it is very negatively excited, the feedback works in the exact opposite direction: the output is very negative, F of it is very negative, and subtracting that pushes the signal towards the positive. So the neuron self-stabilizes over time towards the zero point. That's the behavior if F is, for example, the identity function.

Now we make it a bit more complicated and assume F is not the identity, but, as they define it, F(Vpost) = Vpost - α tanh(Vpost). Then something very, very interesting happens, and it depends on this α. If α is between 0 and 1, we simply have our monotonic behavior again: the plot shows, for each output signal Vpost, what the feedback does, and this is the experiment we made before. If the signal is high, the feedback is high as well, and because it's subtracted, it pushes the signal back towards zero; if the signal is below zero, the feedback is below zero too, and subtracting it pushes the signal up towards zero. So zero is the stable point; the dynamics always push back towards it.

However, if we change just the parameter α to 1.5, a very different thing happens, as you can see in the figure. If the output signal is very high, the same thing happens: it gets pushed back. But there is now a regime where, even though the output signal is positive, it is pushed towards a particular nonzero point, and therefore there are two stable points. A stable point basically means that if the signal deviates from it, it gets pushed back towards that point; and these two stable points are not at zero, they sit at two symmetric nonzero values. That's pretty interesting, because it means you can potentially remember things with this cell. An output signal of zero is basically not informative, but here you can be in either the positive state or the negative state, and little perturbations will keep you in that state: the cell just keeps updating itself, stays stable, and always outputs that signal. And if you then provide some huge input signal, you can potentially throw the state over the hill to the other side, where it stabilizes at the other point. So this is a way to remember things within these biological cells. Pretty cool. The non-filled circle at zero, by the way, is an unstable point: technically, if you're exactly at zero you remain at zero, but if you perturb even a little bit, you move away from it. I hope this property is clear, and why it is so fascinating: we can use the fact that the stable points are not at zero, and that there is more than one stable point, to remember things.
They now try to bring this into the gated recurrent unit, and they call the result the bistable recurrent cell, BRC. The formulas look almost the same as the GRU's, so let's analyze the differences. The first, most striking difference is that a lot of the weight matrices have become single vectors: wherever the last hidden state is incorporated, you'll see that it is no longer multiplied by a weight matrix but instead enters through an element-wise product with a vector. That has a reason: what they want to model is individual neurons. On a biological level, a neuron can only feed back onto itself; in a layer of neurons, each one can only feed back onto itself. Whereas in a recurrent neural network, my hidden state is a vector, and if I transform it into the next hidden state, or into any quantity like this r, through a weight matrix, then any entry of the vector can influence any other entry, because there's a big weight matrix in the middle. They want to leave that away and model this as closely as possible to actual layers of neurons. So the input x can still be distributed to all the neurons, because technically the input comes from other neurons down below, which can all have connections to these neurons; but the feedback cycle is only observed within the individual neuron, and that's why these recurrent contributions are modeled by element-wise products with vectors.

The second thing: you again see a switch, this c switch, and like before it's a sigmoid combining the input and the previous hidden state; nothing new here. This switch is the same: the cell has the possibility of letting in the new information, the current xt, or ignoring it. The tanh term is the same as well: it's the combination of the new information, in case we want to let it in, with whatever parts of the old information we decided to remember. The difference is in this a. It used to be a sigmoid of the combination; now it's 1 + tanh. A very, very slight modification: tanh is between -1 and 1 instead of the sigmoid's 0 to 1, and the 1+ makes the result lie between 0 and 2. And we've seen before that there is a critical behavior in these functions: when a is between 0 and 1, the cell behaves like a classic gated recurrent unit, a classic GRU, but when a is between 1 and 2, you get exactly the bistable behavior we saw before. So depending on a, the cell is either a classic cell (a between 0 and 1) or a bistable cell (a between 1 and 2), and the network can decide by itself what it wants to do, because a is learned. That is really the only change, apart from the neurons only feeding back on themselves: this quantity is no longer between 0 and 1 via a sigmoid, it's between 0 and 2, because it's 1 plus the tanh. A very simple change, but the effect is pretty cool.

They show a schematic drawing of this: if a is between 0 and 1, you again have the single stable state at 0, but if it's between 1 and 2, you have two stable states at two nonzero points. We already saw this, but now it's for the recurrent cell itself, I believe, not for the neuron. And then they give an example of what happens when you run a particular input time series through such a cell while fixing the c and a parameters; so here c and a are not learned, they're just fixed, and you see what happens.

The blue line shows the classic behavior. In this blue case, c is moderately low, and we saw that c is the switch for whether to keep the old information or take up new information; if it's low, it means we take up new information. That's why, when the signal goes up, the blue line goes up as well, and when the signal goes down, the blue line goes down again, and so on; the blue line pretty straightforwardly follows the signal. In contrast, for the red line, a is fixed at 1.5, over the bistability threshold, while c is still at 0.2. Again, when the signal goes up, the red line goes up; but because the cell is near a stable point, when the signal goes down again, the red line doesn't go down enough: it sort of remembers the state it was in. The first bump down wasn't deep enough, and the state got pushed back up. Only when the signal goes down even further does it jump over the threshold, and the cell switches to the other state; and there it remains, because the next small bump up is not enough to bring it back. So you can see the cell sort of remembers its input, and small deviations or small changes in the signal don't manage to throw it out of that state; the signal needs to go down very far for the state to change. That's pretty cool remembering behavior.
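Here is a minimal NumPy sketch of the BRC update as I read it from the formulas (my own illustration, not the authors' code), together with a scalar version of the fixed-gate experiment: with a = 1.5 and c = 0.2, a single input pulse leaves a lasting trace instead of decaying back to zero:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def brc_step(h, x, params):
    # recurrent terms are element-wise vector products: each neuron
    # feeds back only onto itself, as described above
    Ua, wa, Uc, wc, Ux = params
    a = 1.0 + np.tanh(Ua @ x + wa * h)  # in (0, 2); a > 1 is the bistable regime
    c = sigmoid(Uc @ x + wc * h)        # switch: keep old state vs. update
    return c * h + (1 - c) * np.tanh(Ux @ x + a * h)

# scalar version of the fixed-gate experiment: a = 1.5, c = 0.2
h = 0.0
trace = []
for x in [2.0] + [0.0] * 20:            # one input pulse, then silence
    h = 0.2 * h + 0.8 * np.tanh(x + 1.5 * h)
    trace.append(h)
print(round(trace[-1], 2))              # settles near 0.86 instead of decaying to 0
```

With a below 1 the same loop would decay back to zero after the pulse, which is the blue-line behavior from the figure.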
Now remember, in the actual implementation these c and a parameters aren't fixed; they are also determined by the cell itself, and therefore the cell can decide by itself when it wants to remember things, how strongly it wants to remember them, and so on. So we're going to check this out in an actual implementation.

There is one last modification they make. They say they tried this and it doesn't really work; it works sometimes, but there is this issue that the neurons connecting only back onto themselves really makes the model much less powerful than a classic recurrent cell. It's closer to biology, but it's much less powerful. And there is this property they call neuromodulation: in real neurons, one neuron can influence another neuron by modulating exactly these a and c parameters. There are interconnections between the neurons that influence how much other neurons remember and forget things. So they decide to model that, and lo and behold, we're back to having weight matrices for these gates. They say this is not really a super biologically plausible way of implementing neuromodulation, but it's an easier way, and it brings us back closer to the GRU. So now the only difference to the GRU is the fact that where there was a sigmoid, there is now a 1 + tanh; the only difference is this property of bistability. I find that pretty cool.

So now we can actually compare. They first do these benchmarks, which are pretty neat. The first one is the copy-first-input benchmark (I'm having some trouble moving this paper around with my fingers). In this benchmark, the network is presented with a one-dimensional time series of T time steps, where each entry is a random number. After receiving the last time step, the network's output should approximate the very first input step. So all the network needs to do is remember the first thing it sees, and that should be learnable. It's not specified whether the initial hidden state is given to the network, but technically it doesn't matter, because the network can learn to use a designated bit in the hidden state (the hidden state is of size 100, I believe) that marks whether it has already encountered the first input or not. If the bit is not set, it's at the first time step, so it should incorporate the new information into the hidden state and set the bit; for every subsequent step the bit is already set, so it can simply close the gate that incorporates new information. It should be able to carry the first input all the way to the end by always closing that gate after the first step.

And what happens? The results are all up here, after training for 300,000 gradient descent iterations. You can see that when the series are pretty short, the LSTMs and the GRUs tend to perform well. The BRCs don't perform poorly, they're just performing worse; the error is still in the 0.01 regime or something like this. However, when you go up to around 300 steps, the GRUs and the LSTMs start to fail, because they are not made to explicitly remember for that long; they don't have this bistability property. Whereas now these cells excel: their error is still pretty low. And at 600 steps, the classic cells completely fail, they completely forget the input, while the nBRC at least is still able to remember the first input pretty well.
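For reference, the data for this benchmark is trivial to generate. Here is a toy version (my own sketch; the paper's exact sampling distribution may differ):

```python
import numpy as np

def copy_first_input_batch(batch_size, T, rng=None):
    """Copy-first-input benchmark as described above: the input is a
    1-D random series of length T, the target is simply its first entry."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal((batch_size, T, 1))  # (batch, time, features)
    y = x[:, 0, 0]                               # remember the very first step
    return x, y

x, y = copy_first_input_batch(32, 300)
print(x.shape, y.shape)  # (32, 300, 1) (32,)
```

The only knob that matters is T: pushing it from 5 to 300 to 600 is what separates the classic cells from the bistable ones.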
And this is still the first experiment, the copy-first-input benchmark. You can also see that even at length 100, where the GRU still learns the task, it learns it much, much later than the BRC, which learns it pretty fast. Only when the series are just five steps long does the GRU slightly outperform the BRC. So the general notion here is that the classic cells are more powerful on classic tasks, whereas these new cells shine wherever the classic cells fail because they can't remember things for very long. These new cells are not state of the art yet; possibly there are still some modifications to be made. We've had a pretty long history of optimizing GRUs and LSTMs; they haven't always worked as well as they do now, we just kind of know how to handle them. And I expect that if these cells take off, especially the nBRC, then with time we'll become as proficient at handling them, and they will probably become on par with, or even outperform, the LSTMs and GRUs on all tasks, and be especially good on tasks where you have to remember things. But for now, they're outperformed by LSTMs and GRUs.

The second, more interesting experiment is the denoising benchmark. They say the copy-first-input benchmark is interesting as a means to highlight the memorization capacity of the recurrent neural network, but it does not tackle its ability to successfully exploit complex relationships between different elements of the input signal to predict the output. So they have a new benchmark. In the denoising benchmark, the network is presented with a two-dimensional time series of T time steps, in which five time steps are sampled uniformly and marked. Let me just tell you what's going on. The time series is two-dimensional. In the first dimension you simply have a bunch of random numbers, say 5, 8, 2, 9, 3, 4, 0, 2, and so on (they're actually sampled from a Gaussian or so, not literally these values, but you can imagine it like this). In the second dimension you have a -1 almost everywhere, then at some points a 1, and at the last point of the sequence a 0; the 0 is simply a marker for the end of the sequence. What the network needs to do is output, in order, all the first-dimension elements where there was a 1 in the second dimension; in this example that would be, say, 9 and 4.

So think about what it needs to learn. Every time it sees a 1 in the second dimension, it needs to take the first dimension's value, put it somehow into the hidden state, and carry that hidden state forward; and when it sees a 1 again, it needs to put the second value into the hidden state as well, but not override the first one. If it were to just think "I need to put this into the hidden state," it would almost surely override the previous information. So it needs to be able to say: I've already stored a bunch of stuff in that part of my hidden vector (which will be, say, a hundred dimensions), maybe I should store this new thing over here. These are fairly complex things to remember, and technically GRUs and LSTMs are able to do it, but as we'll see, not as well.

The results are in a table, where N is a parameter for how far forward the ones can appear: when N is 0, the ones can be anywhere, but when N is, say, 5, the last five time steps surely don't contain a 1, which means only the first T - 5 steps can contain the ones. So the higher this number N is, the harder the task, because your learning signal is way, way further away from where you get the output.
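A toy generator for this benchmark might look as follows (my own sketch of the setup as described; the marker values and sampling details are my assumptions):

```python
import numpy as np

def denoising_batch(batch_size, T, n_forgetting=0, n_marked=5, rng=None):
    """Denoising benchmark as described above: dimension 0 holds random
    values, dimension 1 holds -1 everywhere except at `n_marked` positions
    marked with 1 (drawn from the first T - n_forgetting steps) and a 0
    end-of-sequence marker. The target is the marked values, in order."""
    rng = rng or np.random.default_rng()
    x = np.empty((batch_size, T, 2))
    x[:, :, 0] = rng.standard_normal((batch_size, T))
    x[:, :, 1] = -1.0
    y = np.empty((batch_size, n_marked))
    for b in range(batch_size):
        # pick marked positions, avoiding the last n_forgetting steps
        idx = np.sort(rng.choice(T - 1 - n_forgetting, size=n_marked, replace=False))
        x[b, idx, 1] = 1.0
        y[b] = x[b, idx, 0]
    x[:, -1, 1] = 0.0  # end-of-sequence marker
    return x, y

x, y = denoising_batch(4, 200, n_forgetting=100)
print(x.shape, y.shape)  # (4, 200, 2) (4, 5)
```

Raising `n_forgetting` (the N from the table) pushes all the marked steps further from the end of the sequence, which is what makes the task harder.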
So you can see that when N is low, the GRUs and the LSTMs perform pretty well, and these new cells also perform pretty well — just not quite as well. However, when the task gets harder and you actually need to learn a sparse signal over a long period of time, where in between you don't get any signal, the GRUs and LSTMs fail, while the BRCs are still able to learn these kinds of things. So that's fairly cool.

Now, from a researcher's perspective, I wonder if they first tried the task as I described it, discovered — ah crap, the classic cells can still do it — and then asked, okay, how can we make the task harder so there's a difference, like this? I wonder if they always had the idea with the N, or only introduced it after failing to produce a difference in the first place. I'm not sure. But with this benchmark they basically show that these cells are actually good: they can incorporate this information and reason about what they need to remember and what not.

In the end they also have this sequential MNIST task, where they feed an MNIST digit pixel by pixel, and at the end the output of the network needs to be the class of that digit. Again, here they have a parameter, called N_black. So they take an MNIST digit — like a three — unroll it into a single vector, feed it pixel by pixel into the recurrent network, and after that they attach a certain number of empty, black pixels, and after those the network needs to predict the y. You can see that if they ask the network for the class of the digit immediately after it's done, then the GRUs and LSTMs perform fairly well, as do the BRCs. But remember, an MNIST image has 784 entries, so attaching 300 black pixels is quite significant in terms of the length of these sequences — and then the GRUs and LSTMs can't learn. They can't learn to ignore these black pixels, because the learning signal is just too far away. But these cells can, because they can exploit this bistability property and remember things. Again, I wonder how this came to be; it seems pretty funny.

The last thing they do is investigate what happens inside their cells, and this, I feel, is the most interesting part. They do it on the denoising benchmark — the task we looked at before, where you need to remember five randomly selected numbers that are indicated by the second dimension. They show a sequence where the five numbers occur at positions 3, 100, 246, 300, and 376; these are the five positions where the second dimension indicates that the network must remember the value in the first dimension and output it. They analyze two things: the proportion of bistable neurons, i.e. how many neurons in a layer have an a that's higher than one, which means they are in the bistable mode; and the average value of c. Remember: if c is high, the cell doesn't let in new information, and if c is low, it lets new information in.

If we first look at c, you can see that every single time the second dimension indicates that this is one of the inputs to remember, the network immediately drops the c values. The different colors here are different layers — their recurrent network has multiple layers of these cells, as usual in recurrent neural networks. So this c, as you can see, goes up pretty quickly, but as soon as one of these inputs appears, c drops, which basically means the network realizes it must now let in the new information, and then c immediately shoots back up.
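As a rough sketch of the dynamics being analyzed here — assuming the BRC update roughly as the paper describes it, with a feedback gain a (bistable when a > 1) and an update gate c — the two plotted quantities could be measured per step like this (names and the single-layer weight shapes are my simplification, not the paper's exact formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def brc_step(x, h, params):
    """One step of a (single-layer) bistable recurrent cell, sketched from
    the paper's description: a is the recurrent feedback gain (bistable
    when > 1), c gates how much of the old state is kept."""
    Ua, wa, Uc, wc, U = params
    a = 1.0 + np.tanh(Ua @ x + wa * h)        # feedback gain, in (0, 2)
    c = sigmoid(Uc @ x + wc * h)              # update gate, in (0, 1)
    h_new = c * h + (1.0 - c) * np.tanh(U @ x + a * h)
    # the two quantities shown in the paper's analysis plots:
    frac_bistable = float(np.mean(a > 1.0))   # proportion of bistable neurons
    mean_c = float(np.mean(c))                # average gate value
    return h_new, frac_bistable, mean_c
```

With c near one, the cell just copies its state forward; when an input must be stored, c drops and the candidate term takes over — which is exactly the spiking behavior visible in the plots.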
It's as if the network says: all these other inputs have a negative one in the second dimension, so there's no reason for me to incorporate that information, it's not important — and as soon as the next marked input comes, c immediately shoots down again.

Now, this here is the last layer of the network, the highest layer, so sort of the most abstract information, and you can see that from input to input this value of c gets higher and higher, and the spikes, as they go down, go down to a higher and higher point. That means the network recognizes it needs to let in new information, but it lets in less and less new information the more things it needs to remember. So not only does it recognize — wait, I need to remember this — it also recognizes — but I probably shouldn't completely forget what I had previously, because those previous things are important to me. That's a pretty cool demonstration: the fact that these values go down at each marked input, and generally go up again after each new input is incorporated into the hidden state, is a pretty good indication that what they're saying is really happening.

The second plot shows almost the same: it shows how many of the neurons are actually in the bistable mode, and you can see, especially in the last layer, that the number of neurons in the bistable mode goes up and up after each of these steps, and the spikes correspond exactly to the points where the network has to let in new information.

Okay, cool. So I find the paper pretty cool, and I find this last experiment the coolest, where they can actually show: look, here's a pretty good indication that the thing we built does what we say it does. They also have a proof of the bistability when this a is higher than one; I won't go through it right here.
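The gist of that bistability claim can be illustrated numerically: with the input removed, the memory loop of a neuron is essentially the map h ← tanh(a·h). For a ≤ 1 its only stable fixed point is zero, so any stored value fades; for a > 1 there are two stable non-zero fixed points, so the neuron can latch onto a sign indefinitely. A quick sketch:

```python
import numpy as np

def settle(a, h0, steps=200):
    """Iterate the input-free memory loop h <- tanh(a * h) to convergence."""
    h = h0
    for _ in range(steps):
        h = np.tanh(a * h)
    return h

# a <= 1: the state decays toward the single fixed point at 0 (memory fades)
print(settle(0.5, 1.0))              # close to 0.0
# a > 1: the state settles on one of two symmetric non-zero fixed points,
# depending only on the sign of the initial perturbation (memory persists)
print(settle(1.5, 0.1), settle(1.5, -0.1))
```

This is only a one-neuron caricature of the proof, but it captures why a > 1 is the threshold the analysis plots are counting against.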
But if you want, you can look at that in the paper. I'm excited to see what happens with these kinds of architectures in the future, because it seems to be a pretty minor modification, and maybe with a bit more tuning — if we figure out what it takes to make these cells actually compete with the classic GRUs and LSTMs in regimes where long memory isn't necessary — I feel this could become kind of a standard building block in the recurrent neural network toolkit, even though RNNs have been somewhat outperformed by Transformers in recent years. All right, that was it for me. I hope you had fun with this paper, I invite you to check it out, and bye bye.
Info
Channel: Yannic Kilcher
Views: 7,318
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, gru, lstm, schmidhuber, bistable, bistability, neurons, biological, spiking, tanh, stable, attractor, fixed points, memory, memorize, sparse, long sequence, history, storage, remember, rnn, recurrent neural network, gated recurrent unit, forget, backpropagation, biologically inspired
Id: DLq1DUcMh1Q
Length: 49min 12sec (2952 seconds)
Published: Mon Jun 15 2020