Hopfield Networks is All You Need (Paper Explained)

Video Statistics and Information

Captions
Hi there. Today we'll look at "Hopfield Networks is All You Need" by researchers from the Johannes Kepler University in Linz and the University of Oslo. On a high level, this paper proposes a new type of Hopfield network that generalizes modern Hopfield networks from binary patterns to continuous patterns, and then shows that the retrieval update rule of these new Hopfield networks is equivalent to the attention mechanism used in modern transformers. It's actually a more general formulation of the attention mechanism, so it can be used to improve modern deep learning in a variety of ways. There is also a companion paper that applies this to immunology research and achieves state of the art on a task that is specifically suited to this type of attention.

All right, let's dive in together. We'll go over what this paper does and what it proposes. If you like videos like this, consider subscribing and sharing it out; I hope you're enjoying this. Also, thanks to my Discord community for bringing me up to speed on this paper; there were super interesting discussions there. If you're not on our Discord yet, I invite you to join, it's fun.

So what is a Hopfield network? A Hopfield network is a pretty old conceptualization of a neural network. Say we have five neurons, and everything is connected to everything else. The goal is to have a network in which you can store so-called patterns, where a pattern in this case is a binary string of size five, for example 1 0 1 0 0 or 1 1 0 1 0. You have a list of these patterns, and you adjust the weights somehow such that the patterns are stored in the network. As I said, this is an old model: you adapt the weights so that the patterns are stored.

What does it mean for a pattern to be stored? If you have stored a pattern, you can retrieve it, and in these old-style Hopfield networks you retrieve a pattern by providing a partial pattern. For example, you say "I want a pattern that starts with 1 1 0" and give that to the network. There is a so-called update rule, an internal rule that sends messages around and adjusts the values of the remaining neurons to whatever is most compatible with the network weights. If the weights have been adjusted correctly, then after applying this update rule the remaining neurons turn out to be a 1 and a 0, and the pattern 1 1 0 1 0 is retrieved. Had I input 1 0 1 at the beginning, the outcome would have been different, and hopefully the other stored pattern would have been retrieved. You can see the applications: you can use the first three digits as a kind of database key and the last ones as the value you store along with it, and then you simply provide the first few digits to look up the rest.
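To make this store-and-retrieve idea concrete, here is a minimal sketch of such a classic binary Hopfield network. The two patterns, the Hebbian storage rule, and the clamping scheme are my own toy choices for illustration, not taken from the paper.

```python
import numpy as np

# Hedged toy sketch: store two 5-bit patterns with a Hebbian weight matrix, then
# retrieve one by clamping the known bits ("1 1 0 ? ?") and updating the unknown
# neurons with the classic sign update rule.
patterns01 = np.array([[1, 0, 1, 0, 0],
                       [1, 1, 0, 1, 0]])
X = 2 * patterns01 - 1        # map {0,1} -> {-1,+1}
W = X.T @ X                   # Hebbian storage: sum of outer products of the patterns
np.fill_diagonal(W, 0)        # no self-connections

state = np.array([1, 1, -1, 0, 0])            # query "1 1 0" with the last two bits unknown
for i in [3, 4]:                              # update rule: only the unknown neurons change
    state[i] = 1 if W[i] @ state >= 0 else -1

print((state + 1) // 2)                       # -> [1 1 0 1 0], the second stored pattern
```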
You don't always have to provide exactly three digits either; this all depends on the setup. As I said, this is an old conceptualization of neural networks: people imagined that this is roughly how the brain works, "fire together, wire together."

A natural question is storage capacity. You might think: there are five neurons, so maybe I can store five different patterns accurately, because if I store too many, I can't expect to retrieve them all again; some will be so similar that many start with the same digits and I won't have a chance to retrieve the one I want, or the update rule will make a mistake. So you might guess the capacity is something like five because there are five neurons, or ten because there are ten connections. But it turns out that in modern Hopfield networks, with the appropriate update rule, you can store exponentially many patterns, exponentially many in the dimension of the patterns, which here would be the length of the pattern. The storage capacity of these networks is a little surprising.

This paper generalizes that to continuous states, or rather continuous patterns. A pattern is no longer a binary string but a sequence of floating point numbers, like 0.5, 1.3, and so on, and a sequence of floating point numbers is naturally depicted as a vector. So our patterns are going to be vectors that we store, and in high dimensions the vectors will be reasonably well separated from each other as long as we don't have too many. The paper shows that all the properties that hold for modern Hopfield networks with binary strings still hold for these continuous vector patterns. That means you can store exponentially many patterns in the dimension of the vector, which is pretty surprising: you'd think that after about one vector per dimension it might get shaky, but no, you can actually store exponentially many. A lot of this paper is about how to do that and the fact that it happens at all.

We've talked about update rules for these Hopfield networks, but I haven't really specified what an update rule is; I've just said that I enter a pattern, the network does something, and out comes the pattern that matches my query. This input is called a query, and that's on purpose: the overlap between attention-mechanism lingo and Hopfield-network lingo is deliberate, and we're going to conflate the two to make clear where they overlap. If you don't know what an attention mechanism is, watch my video on Attention Is All You Need; once you've watched that, this video will make a lot more sense.

So what does the update rule do? There isn't only one; there are many different proposals of Hopfield networks, and they all lead to different properties. But what an update rule ultimately does is minimize what's called an energy. Every type of Hopfield network is associated with an energy function.
The energy function of the modern Hopfield network for binary strings is the one shown in the paper. Here x is the pattern, the state of the Hopfield network, i.e. whatever is stored in the network, and xi (ξ) is the query that you enter. The energy is the quantity you have to minimize in order to retrieve the pattern you want. Now, we never work with the energy directly; you could, for example, use gradient descent to decrease it, but usually an energy function comes with an update rule, and the update rule is what I've talked about: you do something, the network does something, and you get the pattern out. What the network does is minimize its energy function, and the update rule is constructed so that the corresponding energy function is minimized. So the energy function is more of a theoretical consideration: you say, here is the energy function of my Hopfield network, there will be a corresponding update rule that minimizes it, and if you apply that update rule, possibly multiple times, the energy is minimized and you have retrieved your pattern. Or not: if you have stored too many patterns, it might fail.

The paper states the update rules for the old Hopfield networks, but we're not really interested in those. We're interested in the ones this paper cares about, namely where the patterns you store are vectors and the query is also a vector. You store all of these vector patterns in the Hopfield network, and then you come up with a query. In the binary case, the query was something like "I sort of know half of my binary string." In the vector case, it's more like "I sort of know the direction my vector should point in," and what you want to retrieve is the stored vector that has a large inner product with the query. So if I enter this query into my Hopfield network, what I hope is that a particular stored vector is retrieved. It's not exactly the same vector as the query; they point in slightly different directions. But you're saying "I kind of know what I want, something like this," and the Hopfield network answers "ah, I have something like that, it's this one right here."

The connection to the attention mechanism should be pretty obvious by now, but actually establishing it formally is the point of the paper, and it's pretty cool to see. They formulate a new energy, the energy of this new continuous Hopfield network. Specifically, they have to include an extra quadratic term because the states and queries are now continuous: the query itself appears in the energy function, so minimizing the energy means the query can never run off to infinity. The update rule is given alongside it, and we'll look at it in a moment, but remember: the update rule is what you actually implement in code.
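For reference, this is the continuous energy and its update rule as I recall them from the paper (hedged: I'm quoting from memory, so the exact constant terms may differ slightly). Here X collects the stored patterns, ξ is the query/state, β the inverse temperature, N the number of patterns, and M the largest pattern norm.

```latex
E(\xi) \;=\; -\,\mathrm{lse}\!\big(\beta,\, X^{\top}\xi\big) \;+\; \tfrac{1}{2}\,\xi^{\top}\xi
\;+\; \beta^{-1}\log N \;+\; \tfrac{1}{2} M^{2},
\qquad
\mathrm{lse}(\beta, z) \;=\; \beta^{-1}\log \sum_{i=1}^{N} e^{\beta z_i},
\qquad
M \;=\; \max_i \lVert x_i \rVert .

% Corresponding update rule (the one that becomes transformer attention):
\xi^{\text{new}} \;=\; X\,\mathrm{softmax}\!\big(\beta\, X^{\top}\xi\big).
```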
If I have a query, I plug it into this update rule together with the state of my Hopfield network, apply the rule, possibly multiple times, and out comes the Hopfield network's answer to my question. Interestingly, you can already see that if you rewrite a few of these quantities, namely if you take the beta here, which is the softmax temperature, to be one over the square root of d, take the query ξ to be the query matrix, and take the X to be the key matrix, then this update rule is exactly the attention mechanism of a modern transformer. That's the point of the paper: we can look at the transformer attention mechanism as a Hopfield network.

They have an interesting diagram at the end, in the appendix. This is typical Sepp Hochreiter, I guess; I remember the SELU paper had something like 60 pages of machine-checked proofs in the appendix, and this one has a roughly 70-page appendix, which is crazy. At the end of the appendix you'll find this diagram. Usually in an attention mechanism you have some input; attention mechanisms, or at least transformers, work on sequences or sets of objects, and from these you generate three things: the queries, the keys, and the values. You can either generate the queries from the same objects, which would be self-attention, or from a different input over here; it doesn't matter too much for our discussion. Either way, you use three different heads, or three different matrices, to transform that input into queries, keys, and values.

I often conceptualize it like this: you have your input set, and each element outputs a key, which is a vector, and also a query. The query is a request for information, while the key exposes something about that input element. So the input down here could be a sentence, say "my cat is very pretty," and the key vector for a token could encode something like "this is a noun" or "this is an animal." The query asks for other things. For example, the query vector generated from the token "cat" could recognize that "cat" is a noun, the object of the sentence, and ask the other tokens: are there any adjectives around that describe me? That would be a natural thing to ask if you were the noun; you'd want to know whether there are any modifiers for you. So it outputs a query, and a particular direction of that query vector could mean "adjectives." The word "pretty" is an adjective, so it outputs a key that says "by the way, I'm an adjective." If the cat node asks for an adjective and the pretty node outputs the adjective key, then, because the inner product between the two is high, information will be routed from "pretty" to "cat."
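Before going on, here is a small numeric check of the equivalence mentioned above, written in plain NumPy. The shapes and the tying of values to keys are my own choices for the comparison, not from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, N = 16, 8
K = rng.normal(size=(N, d))      # stored patterns = keys, one per row
xi = rng.normal(size=d)          # query / current state of the Hopfield network
beta = 1.0 / np.sqrt(d)          # the transformer's softmax temperature

# One step of the continuous Hopfield update rule: xi_new = X softmax(beta X^T xi)
xi_new = K.T @ softmax(beta * (K @ xi))

# The same computation written as one attention step with the values tied to the keys
attn_out = softmax((xi @ K.T) / np.sqrt(d)) @ K

print(np.allclose(xi_new, attn_out))   # True: one Hopfield update == one attention step
```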
So attention mechanisms are basically information routing; that's how I always describe them. In this paper, though, we look at it differently: the keys are the patterns stored in a Hopfield network, and by inputting a query, with the dot product and softmax acting as the update rule, I retrieve the appropriate pattern from the Hopfield network. The values are then simply a modification of the keys (a lot of people also just set keys and values to be the same thing), but the routing of information happens where you multiply the queries and the keys and put a softmax over the result.

From the perspective of a single node, say the "cat" node, it takes the inner product of its own query vector with all of the key vectors, and then normalizes with a softmax, which gives a distribution. In my drawing, the "my" token also matches somewhat, which is an accident I didn't plan, but in our example the "pretty" token matches best: its bar is the highest and widest. So the routing would mostly send information from the "pretty" token to the "cat" token, which makes sense here. That's the attention mechanism.

Now, if we interpret this as a Hopfield network whose update rule is this dot product plus softmax, we can think of applying the rule multiple times. What happens if we take this distribution and, instead of aggregating the values with it as usual, we aggregate the keys with it? If one key matches really well and the others match a little bit, what comes out is a weighted average with a lot of weight on that particular key, so the result is something very close to that key. Drawing the old key in green and the old query in blue, what comes out is not the query, but it's also not exactly the one matching key; it's a weighted average with that key dominating.

In a Hopfield network we would now go again: we use this aggregate of the keys, this weighted average, as the new query vector for that node and do the same thing again, taking the inner product with all of the key vectors. Since the new query is already an aggregate dominated by the best-matching key, the distribution that comes out is weighted even more heavily toward that key. If you do this iteratively, you end up in a situation where every weight is very low except the one on that single key, which dominates the distribution. And that's exactly how a Hopfield network works.
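Here is a toy illustration of that iteration (my own numbers, not from the paper): start from a noisy version of one stored pattern and watch the softmax distribution concentrate on it as we keep re-querying with the aggregated keys.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, N = 32, 6
K = rng.normal(size=(N, d))          # reasonably well separated random patterns
beta = 1.0 / np.sqrt(d)

xi = K[2] + rng.normal(size=d)       # query: a noisy version of pattern 2
for t in range(4):
    p = softmax(beta * (K @ xi))     # attention weights over the stored patterns
    print(t, p.round(3))             # the weight on pattern 2 should move toward 1
    xi = K.T @ p                     # aggregate the keys -> use the result as the new query
```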
I input the query, which is sort of what I want (I kind of know what I want), then I apply this rule multiple times, refining with each step, until I decide on a pattern. The Hopfield network is made for pattern retrieval, and the keys here are the patterns I want to retrieve. The patterns aren't stored in the network beforehand as in a classic Hopfield network; they are generated as in an attention layer, by the previous layer or by these matrices, but that doesn't matter for the Hopfield update rule. So the attention mechanism can be interpreted as making just one step of this update rule, but you can also think of making multiple steps and retrieving one particular key, deciding on a hard routing of particular information.

That only works if there are no other keys close to that particular key. If the query is such that several keys match, say three of them, then no matter how many times you apply the update rule, it will result in roughly the average of those three keys, because they all match and they all contribute to the weighted average in the next step, so the convergence is to something in the middle. That's going to be a central point of this paper: which situation are we in? They call the first situation retrieving a single pattern; the second situation, where multiple patterns match and are not well separated from each other, they call a meta-stable state. It's going to be pretty interesting to look at transformer language models like BERT and see whether they actually operate in the single-pattern-retrieval mode or in the meta-stable-state mode.

In the diagram, the only thing that distinguishes this from a plain attention mechanism is one branch: after you've multiplied the queries and the keys, you ask whether you want to do multiple updates. If yes, i.e. if you're in the Hopfield-network setting, you go back and use the keys together with the output of the softmax to generate a new query. The query is now generated from that output and the keys (the keys are the same thing; they're just drawn twice). That's exactly what we discussed. I hope it's somewhat clear that the attention mechanism is simply a one-step Hopfield-network pattern-retrieval algorithm with a particular update rule, namely the one that matches the energy function they propose. Of course they do this precisely because the update rule that falls out is the transformer update rule, though I don't actually know whether they reverse-engineered the energy function to match the transformer or first came up with continuous Hopfield networks and then discovered that it's the transformer; we'll maybe never find out.

So let's go on. There are, I believe, four or five theorems here that make some points about all this.
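Before the theorems, here is a toy construction (again my own, not from the paper) of the meta-stable case just described: two stored patterns that point in almost the same direction, so the iteration settles between them instead of picking one.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d = 16
base = np.ones(d)
K = np.stack([base + 0.1 * np.eye(d)[0],   # two nearly identical patterns
              base + 0.1 * np.eye(d)[1],
              -base])                      # and one clearly different pattern
beta = 1.0 / np.sqrt(d)

xi = base.copy()                           # a query equally close to the two near-duplicates
for _ in range(20):
    xi = K.T @ softmax(beta * (K @ xi))    # iterate the Hopfield/attention update

print(softmax(beta * (K @ xi)).round(3))   # roughly [0.5, 0.5, 0.0]: stuck between the two
```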
We'll go through the theorems; we won't go through the proofs or anything super in-depth, but it's pretty cool to go over them, and they are proved very rigorously. As I said, there's a roughly 70-page appendix, so have a look at that if you're up for it.

So we have an update rule, the update rule for the new Hopfield networks. The first theorem says that this update rule converges globally: if we apply it repeatedly, the energy converges to a fixed point as t goes to infinity. I don't want to claim too much here, but it basically connects the update rule to the energy, showing that this really is the update rule for that particular energy function. By itself that's not super interesting yet, but then we get to theorem 2.

Theorem 2 says that for this iteration, convergence holds as t goes to infinity to some stationary point; furthermore, the difference between the iterate at t+1 and the iterate at t goes to zero. That means not only does the energy converge, but the iterates themselves converge: the individual updates of the algorithm converge, so at some point the new iterate no longer changes, because the norm of the difference to the previous one goes to zero. The theorem says that either the sequence converges or, in the other case, the set of limit points is a connected subset, which is a bit over the top; and if the energy is finite, then any sequence generated by the iteration in equation (3) converges to some fixed point. Basically: the algorithm converges.

Then they define what it means for a pattern to be stored and retrieved, which is needed to establish the storage capacity of a Hopfield network. We've established that the update rule minimizes the appropriate energy and converges at some point, which means that if it converges we can retrieve the pattern it converges to. So now we define how many patterns we can actually store, and for that we need to know what it means for a pattern to be stored. We assume we have N patterns, each called x with a subscript, x_i, and we assume that around every pattern a sphere is given. How do we imagine this? We have a space containing the patterns we want to store (the paper actually considers patterns on a sphere, but we'll just conceptualize it as a plain space), and around every pattern there is a sphere. Naturally there is going to be a notion of well-separated patterns, and you can imagine it a little bit like this:
the spheres won't be touching each other. If the spheres aren't touching, the patterns are well separated. If we then initialize the query (remember, the query is a vector that sort of looks like a pattern, so it's close to some pattern in some notion of distance) somewhere inside a sphere, and the iteration converges to the pattern of that sphere, then we have retrieved the pattern.

It gets a bit more complicated than this, but not much. They say: a pattern is stored if there is a single fixed point inside the sphere to which all points that start inside the sphere converge, and none of the spheres intersect (the sphere of pattern i doesn't intersect the sphere of pattern j). And x_i is retrieved if the iteration in equation (3) converges to that single fixed point in the sphere; the retrieval error is the distance between them. Notice there are two things: x_i, the actual stored pattern, and x_i*, the retrieved fixed point. These Hopfield networks don't always give you back exactly what you stored; that's part of the nature of continuous networks. So for every pattern there is a sphere, and the pattern counts as stored if, wherever I start inside that sphere, I always converge to one particular point inside it. Maybe that point isn't exactly the pattern I stored but some nearby point; still, wherever I start, I always converge to it. I then don't retrieve the pattern itself but this nearby fixed point, and the retrieval error is simply the distance between the two. Ideally this distance is very small, but we can't guarantee it; there will be theorems that deal exactly with this retrieval error.

You can also see that if these spheres become larger, you can't accurately store a pattern anymore. The picture above is the ideal situation, but there are also situations where the spheres, the basins of attraction of the patterns, are so large that if I start between two patterns I don't converge to either of them; I converge to something in the middle. That's going to be one of these meta-stable states. We'll encounter both situations, and the second one isn't necessarily bad; keep that in mind, and keep this sphere picture in mind.

First we deal with the top situation, where we store patterns and then retrieve them. We assume a failure probability p, which in their example is pretty low, something like p = 0.001, a 0.1 percent probability of failing to retrieve your pattern, and randomly chosen patterns on the sphere with radius M, and they define some constants.
Then, with probability 1 minus p, the number of random patterns that can be stored (stored in the sense of having these non-intersecting spheres around them, so that you can retrieve them accurately, or at least retrieve something close to them) is lower-bounded by a quantity that involves the square root of p and a constant c, but crucially has the dimension d in the exponent. That means the capacity is exponential in the number of dimensions: if you add a dimension, you multiply the number of patterns you can store. This has been known for modern Hopfield networks with binary strings, so it's not hugely surprising, but it's still not what you would naively imagine. They give a few examples; you have to set the constants in a particular fashion for the bound to hold, but they mention cases where c is something like 3 and d is 20, so if you add a 21st dimension your storage capacity increases by a factor of 3, which is pretty cool. So we can store exponentially (not infinitely) many patterns in these networks.

The next theorem states that the update rule typically converges after one update if the patterns are well separated. Being well separated looks like the picture above, but you can also think of it in terms of dot products, since we operate in dot-product space: well-separated patterns all point away from each other. This notion is captured by the separation of pattern i, which is its inner product with itself minus the maximum inner product with any other pattern. That quantity is large when no other pattern is close to pattern i. When the separation is large, the retrieval rule (take the query, compute inner products with all patterns, re-weight the patterns by the softmax of those inner products, use the result as the new query, and so on) converges to the closest pattern, as we discussed, and this theorem says it converges pretty fast: it converges exponentially fast in this separation constant.

Theorem 4 states it precisely: with query ξ, after one update the distance of the new point to the fixed point is exponentially small in the separation Δ_i; the precise bound uses the Jacobian and its value in the mean value theorem. So the distance between the updated ξ after one step and the fixed point it converges to is bounded by the distance before the step times a multiplicative factor, and that factor, expanded via the Jacobian, is bounded by something containing the exponential of the negative separation. The higher the separation, the faster the algorithm converges.
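The separation quantity is easy to compute; here is a sketch following the definition just described (self inner product minus the largest inner product with any other pattern). The random patterns are just placeholder data.

```python
import numpy as np

def separation(X):
    """Delta_i = x_i . x_i - max_{j != i} x_i . x_j, one value per stored pattern."""
    G = X @ X.T                       # Gram matrix of all pairwise inner products
    self_sim = np.diag(G).copy()
    np.fill_diagonal(G, -np.inf)      # exclude the pattern itself from the max
    return self_sim - G.max(axis=1)

X = np.random.default_rng(0).normal(size=(10, 64))   # 10 random patterns in 64 dimensions
print(separation(X).round(2))         # larger Delta_i = better separated = faster retrieval
```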
Still, to say that it converges after one step might be a bit of bragging; I don't know whether exponential convergence entitles you to say "it converges after one step." What I'm especially unsure about is that n appears as a linear factor in that bound, and this shows up when you use their code. The code is available, which is pretty cool: it's implemented in PyTorch as a general module you can just drop in, so this is not only for transformers; you can replace LSTMs, you can replace pooling mechanisms, you can do a whole bunch of stuff. In the accompanying paper they do multi-instance learning on giant sets using these Hopfield layers. The code is definitely worth checking out, and maybe you'll want to replace some components of your own models with it.

But the question is: how many of these update steps should you do? Looking at the diagram, a transformer takes the input and goes through layer after layer, and each layer contains one of these attention mechanisms. If you interpret this as a Hopfield network and want to do multiple steps, you take the extra branch in the diagram, so in each layer you potentially do multiple steps. Whatever computational constraints transformers already had, this will certainly make them worse, and you also need to decide how many steps to do. You can hard-code that, of course, but they suggest doing steps until the norm between the old and the new iterate is small enough. You can't measure how close you are to the convergence point, because you don't know it in practice, but you can measure how far apart two successive iterates are; so you simply measure that, and if it's small enough, you stop.

That stopping quantity is closely related to the bound: since we've already proven convergence to x*, we can approximate the distance to x* by the distance between successive iterates, and that tells you how many updates you need. And that quantity is not just linear but, here, quadratic in n. Sure, it's exponential in the separation, but it's quadratic in n, and if I've learned anything from my fast-code courses, it's that constants matter when you're not dealing with an infinite number of steps. The number of steps you need will, I guess, depend on the sequence length in a quadratic fashion, so I'm not sure you can always claim this converges in one step. I might be badly mistaken here, and maybe none of this matters in light of the exponential decay, but I'm a bit worried about saying "this usually converges in one step." It's clear why they say it: the attention mechanism in transformers is a one-step application of this rule, and the theorem is a theoretical justification for interpreting it precisely as a Hopfield network. You'd object that in a Hopfield network you would do multiple steps, and they can answer that, even interpreted as a Hopfield network, it usually converges after one step.
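To make the "multiple updates with a stopping criterion" idea from above concrete, here is a hedged sketch of what such a module could look like in PyTorch. This is not the authors' published layer API; the class name, defaults, and stopping rule are my own choices, and the official code linked from the paper is what you'd actually want to use.

```python
import torch

class HopfieldUpdate(torch.nn.Module):
    """Attention with optional extra Hopfield update steps (a sketch, not the official layer)."""
    def __init__(self, dim, max_steps=3, tol=1e-4):
        super().__init__()
        self.beta = 1.0 / dim ** 0.5      # the usual 1/sqrt(d) temperature
        self.max_steps = max_steps
        self.tol = tol

    def forward(self, queries, keys, values):
        # queries: (B, Lq, d), keys/values: (B, Lk, d)
        xi = queries
        for _ in range(self.max_steps):
            attn = torch.softmax(self.beta * xi @ keys.transpose(-2, -1), dim=-1)
            xi_new = attn @ keys                     # re-query with the aggregated keys
            converged = (xi_new - xi).norm() < self.tol
            xi = xi_new
            if converged:                            # stop once the iterates stop moving
                break
        # read out the values with the (possibly multiply-updated) query
        attn = torch.softmax(self.beta * xi @ keys.transpose(-2, -1), dim=-1)
        return attn @ values

layer = HopfieldUpdate(dim=64)
q, k, v = (torch.randn(2, 10, 64) for _ in range(3))
print(layer(q, k, v).shape)    # torch.Size([2, 10, 64])
```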
So what you're actually doing in a transformer is, in effect, applying a Hopfield-network update rule to convergence. Anyway, I might be bickering on a high level here; these are luxury problems.

Theorem 4 was about how fast this converges. Theorem 5, the last one, says that the retrieval error, the distance between the pattern you converge to and the pattern you stored, is again bounded by something exponentially small in the separation.

So, going through them quickly once more: theorems 1 and 2 deal with the convergence of the algorithm and the fact that it actually minimizes the proposed energy; theorem 3 says you can store exponentially many patterns in terms of the dimension of your space; and theorems 4 and 5 say that the update rule converges exponentially fast, after one step if you believe that framing, and that the retrieval error also goes down exponentially fast with the number of update steps. That sounds pretty good, but as we've heard, it depends heavily on how well separated the patterns are, and it turns out that, at least in transformers, they aren't always well separated, possibly on purpose. Remember, the patterns aren't pre-stored as in a classic Hopfield network; if you interpret an attention mechanism this way, the patterns are generated by the network itself. The pattern matrix you retrieve from, and the query, are both generated by the attention mechanism (and as I said, this is applicable to many more domains than just this one).

There is one more slight modification you have to make for this to be exactly equivalent to an attention mechanism: you have to recast the values. Usually you have some input and you make queries, keys, and values from it using different projection matrices. To make it formally equivalent, the values have to be generated from the keys: you first multiply the input by the key matrix and then by the value matrix. I doubt this changes anything in practice; the only way it could really change something is if the key matrix were super low-rank and collapsed the space into very few dimensions, which the value matrix wouldn't do. But just so you know, the technical equality requires this slight modification.

Now, we said it might not always be the case that patterns are well separated and you retrieve a single pattern, and that's what they investigate in a pre-trained BERT model. They take a pre-trained model, I guess from Hugging Face, run a data set through it, and for each attention head (you have multiple heads in each layer) they look at what the softmax distributions look like over the course of the whole data set. If you believe this is a Hopfield network and that it converges in one step, then, if the patterns are well separated, you'd expect a distribution in which one dominant pattern gets almost all of the mass.
That dominant pattern is what you want to retrieve, and out it comes, bang: you retrieve that one pattern accurately. Anything else would mean the Hopfield network sort of failed, because it wouldn't give you back one particular pattern. So they run what I think is a pretty smart experiment: they ask how many bars of the softmax distribution you need to add up to reach 90 percent of the mass. This depends a bit on the softmax temperature, which is hard-coded in the attention mechanism as beta equal to one over the square root of d. If this really is a Hopfield network that retrieves one pattern, then one bar will be enough; that bar will carry something like 99 percent of the mass.

But there are other cases. Imagine the patterns, and the spheres the query gives rise to, are all overlapping. Then the update rule won't converge to any particular pattern. If you have two spheres that are far apart, the update rule converges to whichever one the query is closer to, but if they overlap, the energy landscape is such that the iteration converges to neither; it converges to somewhere in the middle, to the mean of the stored patterns. Taken to the extreme, the softmax distribution could be completely uniform, which basically means "I don't care where my information comes from, just average."

That has its applications. If you want to build a cheap sentiment classifier, a simple way is to take pre-trained word embeddings like GloVe or word2vec, assign each word its embedding, and just average the embeddings. You count on the fact that if there are a lot of negative words in there (bad, sad, angry), the average embedding will point more in the negative direction, and if there are a lot of happy words, it will point in the happy direction. So averaging information without caring where it comes from has its uses.

Call the number of bars needed to reach 90 percent k. In the single-pattern case k equals 1; in the uniform case k is roughly n, the number of inputs, or at least close to it, because you need almost all of the bars to reach 90 percent. And there is an in-between regime, the meta-stable states: you might have a couple of patterns here, a couple there, and a couple over there, almost like a clustering. Within a cluster the spheres overlap, but the clusters don't overlap with each other. If you start inside a cluster, you converge to the mean, not of all the patterns, but just of the patterns in that cluster. This is like a clustering in latent space: you can interpret these Hopfield update rules as converging not to a particular pattern but to a cluster of patterns.
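The k statistic itself is simple to compute; here is a sketch of how I understand it from the video (the smallest number of softmax weights whose sum reaches 90 percent of the mass). The example numbers are made up.

```python
import numpy as np

def k_to_90(attn_weights, mass=0.9):
    p = np.sort(attn_weights)[::-1]                 # largest attention weights first
    return int(np.searchsorted(np.cumsum(p), mass) + 1)

print(k_to_90(np.array([0.95, 0.02, 0.02, 0.01])))  # 1   -> single-pattern retrieval
print(k_to_90(np.full(512, 1.0 / 512)))             # 461 -> essentially uniform averaging
```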
That clustering picture is what happens if you ask something like "is there any adjective around?": all of those adjective patterns overlap in that region of query space, so the update rule converges to roughly their mean, which basically answers "yes, there is an adjective here," and the information is not routed from one single token. The resulting distribution has many small weights and then a couple of large ones, maybe two or three or four, corresponding exactly to the patterns in that cluster, and information is routed from all of the tokens in that cluster to the node that asked the query. These are the meta-stable states.

What they do is compute this number k over the entire data set, and the plots show its distribution for each attention head. In these plots, k increases along one axis; you let data points run through the model and measure k for a particular head. For example, one plot is layer 1, head 4: one layer, one attention head, and there the number k is distributed broadly. Contrast that with another head where almost all of the weight is on k equal to 1 or very small numbers. Those blue heads are your typical "retrieve one particular pattern" case: such an attention head is very specific, it looks at its token, decides what information it wants, and retrieves one particular thing from the other nodes. The broader heads behave more like averaging: "I want this kind of information, on average." I don't even know what the sequence length is here; I guess it's maybe 512.
So out of the 512 positions, the median k (the reported number is always the median) is 231: in the median, that head collects information from 231 of them. The green and orange heads correspond to these meta-stable states, where an implicit clustering is done in attention space, whereas the blue ones correspond to attention heads that ask for particular information, retrieve one or maybe a few patterns, and are happy with that. The red ones often just average: k is so high that you need almost all of the bars to reach 90 percent, which basically means a uniform distribution, "I don't care where information comes from, just give me the average over some particular space." As we said, that also has its uses.

It's interesting how this develops through the BERT model. At the bottom, in layer 1, a lot of the heads are simply doing averaging, and as you go up the layers the heads get more and more specific about the types of information they seek; then, interestingly, in the last layers you get a lot of these meta-stable states again. Interpret that as you want, but it sort of says: at the bottom you want general patterns, the middle layers are the logical workhorses that look for very specific things in the input (this is, I guess, where the thinking happens; I'm just making stuff up here, by the way, this is in no way established), and the last layers might already be aggregating types of information for the output, since after that you have language modeling or classification. That's how I interpret it.

These experiments are pretty interesting. In the last experiments of the paper, they actually replace attention heads by a simple averaging mechanism (and later by Gaussians, but here simply averaging) and show that if you replace layer 1 with just averaging, the perplexity doesn't rise much; it's pretty good. If you replace an entire later layer with averaging, the perplexity goes up more, and the correspondence with the previous plot is pretty much one to one: it tracks how many blue and green heads there are in contrast to how many red and orange ones. Where there are lots of blue heads, the error goes up when you average; and interestingly, at the end, where you have more meta-stable states, the perplexity still goes up more, so I guess you can only really replace the red, averaging heads with averaging. This is always averaging in one particular layer.
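As a side note, "replacing a head by averaging" is computationally trivial; a hedged sketch of the idea (uniform weights instead of the softmax, value path unchanged) looks like this.

```python
import torch

def average_head(values):
    """Uniform 'attention': every position receives the plain mean over the sequence."""
    # values: (B, L, d); equivalent to fixing the attention weights at 1/L everywhere
    return values.mean(dim=1, keepdim=True).expand_as(values)

v = torch.randn(2, 512, 64)
print(average_head(v).shape)    # torch.Size([2, 512, 64])
```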
They then go into more detail with one particular attention head from layer 6 and one from layer 12, tracked over the course of training (don't be confused, the plots go in the direction of training time; I was confused at first). For the layer 6 head, the number k is spread out at first, but it pretty quickly converges to a very small number, and there is a point, not a learning rate decrease I think, that looks like a phase transition (this is the blue training line): all of a sudden these attention heads decide "this is the thing I want to specialize in, this is the linguistic sub-task I want," and from then on they concentrate on one particular pattern per input; they really specialize. In the last layer, by contrast, you see that even during training they keep changing: first they also do this averaging, then they move into the meta-stable region, where k isn't 1 but also isn't a very high number, and they continue to learn. That's indicative, first, that training might not be done here, and second, it would be really interesting to see how this plays out with different sizes of transformers, especially the huge ones. The fact that they keep improving the longer we train them might be interpreted in light of what kinds of states the attention heads converge to. Do they stay in the meta-stable states, which would make sense since clustering is useful, or is this simply an intermediate step, and if you trained much longer they would also converge to k equal to 1 and really specialize? Or do we need more attention heads for this? I don't know. I think this is just the beginning of research in this direction; this number k is pretty simple to compute and apparently pretty revealing, which is pretty cool.

So that was the paper and its experiments. It's a sizable paper; the main text alone is 10 pages, and then there is the immune repertoire classification companion paper, which I'll spend one minute on. It's a set classification task: for each person you obtain a set of immune receptors and a single label saying whether that person is immune to a particular disease or not. A different person has a different set, and you have no idea which of these receptors is responsible for the person being immune or not; in fact you can't even decide based on whole receptors, only on sub-sequences of them, possibly in combination with each other, so there might not be a single one responsible but a combination. You don't have labels for the individual receptors, every person has a different number of them, they have different lengths, and there are tens of thousands per person; it's a giant task. So they build a system where they first use 1D convolutions to process the individual sequences, then apply this Hopfield attention mechanism, with learned queries, over the whole set, and train on the single output label per person.
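For a rough feel of that setup, here is a heavily simplified sketch (my own construction, not the companion paper's code): encode each receptor sequence with a 1D convolution, pool the whole set with attention against a learned query, and predict one label for the entire repertoire.

```python
import torch

class RepertoireClassifier(torch.nn.Module):
    """Toy set classifier: conv over each sequence, attention pooling over the set."""
    def __init__(self, n_tokens=21, d=32):
        super().__init__()
        self.embed = torch.nn.Embedding(n_tokens, d)
        self.conv = torch.nn.Conv1d(d, d, kernel_size=5, padding=2)
        self.query = torch.nn.Parameter(torch.randn(d))   # learned query for the pooling
        self.out = torch.nn.Linear(d, 1)

    def forward(self, sequences):                     # sequences: (num_receptors, seq_len) token ids
        h = self.embed(sequences).transpose(1, 2)     # (N, d, L)
        h = torch.relu(self.conv(h)).max(dim=-1).values   # (N, d): one vector per receptor
        attn = torch.softmax(h @ self.query / h.shape[-1] ** 0.5, dim=0)
        pooled = attn @ h                             # weighted average over the whole set
        return self.out(pooled)                       # one logit for the whole repertoire

model = RepertoireClassifier()
receptors = torch.randint(0, 21, (1000, 30))          # a toy "repertoire" of 1000 sequences
print(model(receptors).shape)                         # torch.Size([1])
```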
Surprisingly, that actually works, even with tens of thousands of sequences per set and only one label for all of them, and they achieve favorable results compared to other baselines on this task using these Hopfield networks, which is pretty interesting. But I'll let you look at that paper yourself.

I hope this made it a bit clearer what happens here. It would actually be pretty interesting to see what happens if we do, say, two rounds of these updates. Is that even desirable? Is it desirable to run this to convergence? Is there something good about not running to convergence, or does it not matter because it converges in one step anyway? I don't know, but have a look at the code; it's pretty cool. I hope you enjoyed this video. I'm sure you have many open questions, as do I; don't hesitate to ask me in the comments or join our Discord, where, as I said, there are lots of helpful people. And I'll see you next time. Bye.
Info
Channel: Yannic Kilcher
Views: 55,041
Rating: 4.9588571 out of 5
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, schmidhuber, hochreiter, lstm, gru, rnn, hopfield, attention, attention is all you need, transformer, bert, query, key, value, routing, pattern, retrieval, store, error, exponental, binary, continuous, hopfield network, lse, energy function, update rule, metastable, separation
Id: nv6oFDp6rNQ
Length: 65min 15sec (3915 seconds)
Published: Sun Aug 09 2020