Rethinking Attention with Performers (Paper Explained)

Video Statistics and Information

Captions
Hi there. Today we'll look at Rethinking Attention with Performers, by researchers of Google, the University of Cambridge, DeepMind and the Alan Turing Institute. This paper is yet another paper in the quest to make transformers more performant, and what better name to give to such a technique than the Performer. Performers are a new class of models that try to approximate the transformer. If you don't know what a transformer is, I've done a ton of videos on transformers and attention mechanisms, so there's more than enough material to look that up. Today we'll talk about Performers, and as I already said, they approximate transformers, but they do so without running into the classic transformer bottleneck: the attention matrix in the transformer has space and compute requirements that are quadratic in the size of the input. That limits how much input you can put into the model, so it limits how long a text you can input if you work with text, or how big your images can be if you work with images. The Performers get around this with a technique they call Fast Attention Via positive Orthogonal Random features, abbreviated FAVOR+. What's interesting is that FAVOR+ (I'll just call it FAVOR), this fast attention, is potentially useful beyond transformers. It was developed in the realm of transformers, but the authors say it "may be of independent interest for scalable kernel methods". You'll see that what they do is approximate the attention matrix by decomposing it, but they do it in a special way. If you know what random Fourier features are, you can maybe think ahead a little bit; if not, we'll get into it for sure. Honestly, I think this might be one of the enabling mini-breakthroughs in deep learning, not a big breakthrough, but a mini-breakthrough. I remember a time when we used sigmoid and tanh nonlinearities, believe it or not, you young kids. Before deep learning really took off, it was considered sensible to use sigmoid and tanh nonlinearities everywhere in your neural networks: first of all they were differentiable, which was cool, and they were sort of how nature does it, an approximation to the step function in a real neuron, and so on. They were just well motivated, so people thought that must be the way to go. But then of course it turned out that ReLUs are much easier, much more stable, give much better results, don't saturate, all these cool things. This here feels like the same thing, because right now we're doing this softmax thing in attention, and it's very important because it normalizes the attention matrix: what comes out is a distribution over the inputs, so it's well motivated. But, like the sigmoid, it has this exponential thing in there. The FAVOR algorithm is going to approximate this softmax, but it can be used to approximate much more. So maybe we'll find that if we swap out this nonlinearity, we might be able to build much better transformers, or whatever the models will then be called (Performers, I guess).
They already try this with ReLUs in this very paper. So the Performer is fully compatible with the regular transformer, and comes with strong theoretical guarantees: unbiased or nearly unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. The difference with the Performer is that there have been methods before that decompose the attention matrix into low-rank matrices, but those either don't work well or they rely on priors: you assume that your attention matrix has a certain structure, and if it doesn't, the method sort of fails. The method here is an unbiased estimator, and it converges to the true attention matrix as you add more of these random features. This is stated right there: provably not relying on any priors, fully compatible with regular transformers, which means that you can take a transformer checkpoint, plug it into this framework, and then just fine-tune a little bit to make use of the regular transformer's checkpoint, which is pretty cool. So we'll go through the paper. It's quite a math-heavy paper and we won't go through all of it; I just want you to get the idea of what these Performers do, what the reasoning behind them is, how you might be able to work with them or extend them, and where it's going from here. As always, if you like content like this, don't hesitate to share it out and tell your friends about it.

All right. The problem with attention, or the problem with transformers (I've done this a million times and you can go look it up): if you want to map a sequence of layer l into a sequence (or a set, or whatnot) of layer l+1, you need to compute these attention weights, one weight from each token here to each token in the next layer. So there is this matrix, called A, the attention matrix, and A is going to be of size L by L, and that is a problem if you have long sequences; you can already see this. The way this A comes to be is that, conceptually, the upper layer (it's all the same layer, but conceptually) emits things called queries, and the lower layer emits things called keys and values. The queries and the keys go together into a matrix: you multiply the queries by the keys, then, and this is the problem, you run this through a softmax nonlinearity to basically get a distribution, and then you multiply it by the values. So the query-key matrix, this attention matrix, tells you how to aggregate the values. If it weren't for the softmax, you could think of it like this: if the dimensionality of the queries, keys and values is some small d, then Q would be L by d, the transposed keys would be d by L, and V would be L by d. Because you have to do the softmax, you have to compute Q times K transposed first, which gives you the L by L matrix, which is the terrible thing. However, if you could somehow decompose the softmax operation, you could first multiply keys and values, which gives you a d by d matrix, and then you could multiply by the Q matrix, which would be much, much easier if d is smaller than L. It certainly wouldn't grow quadratically in L; it would just grow linearly in space and time.
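To make that complexity argument concrete, here is a minimal sketch (my own, not from the paper) of why dropping the softmax lets you reorder the multiplications:

```python
import numpy as np

# Minimal sketch (not from the paper): why dropping the softmax makes attention linear in L.
# Without a nonlinearity in between, matrix multiplication is associative, so
# (Q @ K.T) @ V  ==  Q @ (K.T @ V), but the left side builds an L x L matrix
# while the right side only ever builds d x d and L x d matrices.
L, d = 1024, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(L, d))
K = rng.normal(size=(L, d))
V = rng.normal(size=(L, d))

quadratic = (Q @ K.T) @ V   # materializes an L x L intermediate: O(L^2 d)
linear    = Q @ (K.T @ V)   # only a d x d intermediate:          O(L d^2)

print(np.allclose(quadratic, linear))  # True: same result, very different cost
```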
Here the attention mechanism is written out: it's made of queries, keys and values, and it's given by this formula. Now there is a bit of a technicality; I wasn't exactly correct in what A is. I called this thing here A, but they are very specific about what they mean by A: they simply mean the exponential function of the normalized queries times keys, and to get the actual softmax you still have to normalize. That's what D is for: you see the inverse here, D is constructed from A and normalizes A. But the normalization is of secondary importance. The important part is that this exponential cannot be easily decomposed; it's not like you can decompose the inner product inside it into two exponentials or something, otherwise the problem would already be solved. So what is this paper doing? Exactly what I just said was impossible. You have this matrix A, and you multiply it by V (again, forget about the normalization for now). They will decompose A into Q prime and K prime. They're called prime because they are not the queries and the keys, since the queries and the keys go into the exponential. The claim is that Q prime times K prime transposed is going to be approximately equal to the exponential of Q times K transposed, maybe normalized by the square root of d. You can see that this isn't obviously decomposable, and yet they decompose it, and the question is how. There have been papers before that try to decompose the attention matrix, I think the Linformer, and there is also the Reformer, which uses LSH, and so on. So there have been a number of tricks, but they all don't perform as well, which this paper also shows empirically, and they all rely on certain assumptions about the attention matrix and are not unbiased estimators in general. This paper gives an unbiased estimator, and it does so via a sort of kernel framework. First of all they make the problem more general: we have our attention matrix A, and its ij-th entry is some kernel function of query i and key j. In our case this kernel is exp of the query transposed times the key, the inner product of the two. However, you can think of any kernel function here. I'm not going to try to explain kernels in more detail; we had a fantastic episode of Machine Learning Street Talk, our podcast, where Alex Stenlake explained kernels in great detail, with very precise language and very understandably as well. What I am going to say is that kernels allow you to do things like this: you can think of a kernel as connecting two things; it represents an inner product in some other space. So the kernel function of two inputs is equal to some inner product of the two inputs after they've been pulled through this function phi, and that's what we're going to use. Usually when you learn about kernels you go in this direction: you would like to compute in a very high-dimensional space, but you can't, you can't do the inner products, you can't map this function phi explicitly, so instead you use the kernel function, which is equal to that inner product if you pick the right kernel for your particular phi.
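As a tiny, exact illustration of that correspondence (my own example, not from the paper): the polynomial kernel (x·y)² is literally an ordinary inner product in a bigger space whose feature map you can write down explicitly.

```python
import numpy as np

# Tiny, exact example (not from the paper) of the kernel <-> feature-map correspondence:
# the polynomial kernel k(x, y) = (x . y)^2 equals an ordinary inner product once both
# inputs are mapped through phi(x) = outer(x, x) flattened into a d^2-dimensional vector.
def phi(x):
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

print(np.dot(x, y) ** 2)         # kernel evaluated directly
print(np.dot(phi(x), phi(y)))    # same number via the explicit feature space
```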
In this paper, we're going to do it the other way around, because this thing here, the softmax function, is just a beast; we can't possibly compute it cheaply. However, if we could find out what inner product in what other space it corresponds to, we could just go to that other space and perform the inner product there, and this thing over here is linear: this here is the linear function, and this here is the nonlinear function, our softmax. So by going in this direction, by finding the phi function for the softmax kernel, we can construct all of this attention business in a linear fashion. That's what this paper does. It allows you to find these Q prime and K prime matrices such that, as over here, this is the kernel function and this here is linear, and then you can simply first multiply K prime by V and then multiply Q prime by the result, which relieves you of having this giant attention matrix. So how do they do it? If you know about random Fourier features, this is going to be a very similar thing. They're not going to explicitly construct the high-dimensional space such that this is exactly equal; they construct an approximation, and the approximation can be made arbitrarily good. You do that as follows. The question is: how do I have to map something into this other space so that the whole softmax business becomes just a linear operation? Ultimately, you would take your queries and map them through this phi, and you would take your keys and also map them through this phi, which gives you query prime and key prime. Then, in this higher (or lower, whatever) dimensional space, you take the inner product, and the inner product between the two is going to be approximately as if you had taken the original Q and K, multiplied them, and put them through a softmax. How do we do it? Here they define what the function needs to look like for this to hold, and again they stay very general. The function looks like the following: you have one function, called h, which is a deterministic function of your input and sits in front as a factor, and you also have a normalization factor. Then comes a vector: we are mapping our input into some other-dimensional space, and this is that vector. Now you have to pay a bit of attention: inside this vector you have l different sub-vectors, all concatenated after each other, so this is f1, then f2, f3, f4, and so on until fl. Ultimately you just concatenate them all, but it's important to keep in mind that within each of these sub-vectors you always have the same repeated term: omega times x, the inner product between omega and x, for omega_1 through omega_m. In each sub-vector this is repeated. So what are these omegas? First of all, the omegas are random vectors drawn from some distribution; in practice this is going to be an isotropic normal distribution, like the one shown here.
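To pin down that structure, here is a minimal sketch of the generic feature map (my own naming, not the paper's code): a prefactor h, a list of scalar functions, and a fixed set of random omegas, whose projections are concatenated into one long feature vector.

```python
import numpy as np

# Minimal sketch (my own naming, not the paper's code) of the generic random feature map:
# phi(x) = h(x) / sqrt(m) * [f1(w1.x), ..., f1(wm.x), f2(w1.x), ..., fl(wm.x)]
# where the omegas are m random vectors, drawn once at the start and then kept fixed.
def make_feature_map(h, fs, omegas):
    m = omegas.shape[0]
    def phi(x):
        projections = omegas @ x                   # the m inner products omega_i . x
        blocks = [f(projections) for f in fs]      # one sub-vector per scalar function f_j
        return h(x) / np.sqrt(m) * np.concatenate(blocks)
    return phi

# Example instantiation with the sin/cos choice discussed next:
d, m = 16, 128
rng = np.random.default_rng(0)
omegas = rng.normal(size=(m, d))                   # isotropic Gaussian draws, then fixed
phi = make_feature_map(h=lambda x: 1.0, fs=[np.sin, np.cos], omegas=omegas)
print(phi(rng.normal(size=d)).shape)               # (2 * m,): one block per function
```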
So what are the f's? The f's, f1 through fl, are going to be deterministic functions. In the example they give right here, f1 is the sine function and f2 is the cosine function, and then you have to specify h, which in this particular example is the constant function one (not the identity, the constant one), although in general it can be a function of x. Let's break this down a little. We have x, and x is going to be a vector; as I said, x is going to be one of the queries or one of the keys, one column or one row, however you conceptualize it, and we wonder how we want to map it. So x is some vector. Then we're going to take a bunch of omegas. It's important that the omegas are random, so they come from this isotropic normal distribution, but they remain the same throughout the algorithm. There is a method to resample them, but just conceptualize it as: at the beginning of the algorithm you choose these omegas and then you fix them. The omegas are also vectors, just a bunch of random vectors; let's take three. What you're going to do is compute the inner product between your x and each of the omegas. This gives you omega_1 x, omega_2 x, omega_3 x; these inner products are just numbers. Then you have your collection of functions, say function one is the sine and function two is the cosine. Now you make a table: you take each of the products you computed and put it through each of the functions, so you get sine of omega_1 x, cosine of omega_1 x, sine of omega_2 x, and so on. Then you take this table and flatten it into a big vector (the exact ordering doesn't matter as long as you always do it the same way): sine of omega_1 x, sine of omega_2 x, and so on, until you end with cosine of omega_3 x. That's the vector they're constructing, and these are those random features. So this is the vector you're constructing. Geometrically, what you do is basically this: your x is somewhere here (it's a bit hard to draw in low-dimensional space because you don't get the intuition), and if this is your x, you choose a bunch of these omegas, randomly sampled from an isotropic Gaussian: omega_1, omega_2, omega_3, omega_4.
Then you compute the inner products between x and each of the omegas, so essentially the projections onto each of them, or the angles, however you want to conceptualize it, and you make features out of these. This will sort of tell you how your vector stands relative to each of these random directions. The reason I say it's difficult in low dimensions is that here I have more omegas than dimensions (which is two), and that makes no sense: as soon as I have two vectors that are not collinear in two-dimensional space, if I project x onto both of them, x is already fully represented; there's no need for more. However, if you are in super-duper-high-dimensional space and you don't have as many features, then you get some interesting approximation properties. Now, this was an example: we don't always have the sine and the cosine here, that's purely illustrative. You can have only one function (like this f1), you don't need two, or you can have many. And you can choose how many omegas you sample; that is a parameter. So you have a couple of choices, and I want to make this clear: the choice of h and the f's go hand in hand; together they determine what the phi function is, and therefore which kernel function this phi corresponds to if you construct it like this. By choosing the correct functions, you tell the construction which kernel you would like to approximate, and then, by sampling the omegas, the more omegas you sample, the more accurately you approximate that kernel, and you can give approximation guarantees, as they do. So: the softmax kernel is given by this thing here, which we've already seen. And how do we approximate the softmax kernel? They show that right here: the softmax kernel is approximated by this thing, which is a bit of an ugly formula and contains the Gaussian kernel. They say: if we choose h equal to 1 (just a constant factor), f1 and f2 to be the sine and cosine, and the distribution D to be an isotropic normal distribution, then this is the Gaussian kernel. Then we simply have to choose h differently, this factor in front, to turn it into the softmax kernel. As long as we put this factor in front, this here represents an inner product: you have to think of the decomposition, f1 the sine and f2 the cosine, which makes it the Gaussian kernel, and then this factor in front for h makes it the softmax kernel. So if we choose h and f like this, then when we map our queries and keys through the phi function and take the inner product between them, that will approximate (better or worse, depending on how many omegas we've sampled) the result as if we had multiplied them first and then put them through the softmax function. You can see how this makes everything much easier, because we can put queries and keys through phi independently, and then it's just a linear operation, which allows us to do our trick of multiplying K and V first and then multiplying by Q, instead of the other way around, which is what we're forced to do when we apply the softmax directly.
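As a quick numerical sanity check on that claim (my own sketch, not the paper's code, assuming the prefactor h(x) = exp(||x||²/2), f1 = sin, f2 = cos and Gaussian omegas), the feature inner product should approximate exp(q·k) for a benign, small-norm pair:

```python
import numpy as np

# Sketch (not the paper's code): the trigonometric estimator of the softmax kernel.
# With h(x) = exp(||x||^2 / 2), f1 = sin, f2 = cos and omegas ~ N(0, I_d),
# phi(q) . phi(k) approximates exp(q . k) in expectation.
def phi_trig(x, omegas):
    m = omegas.shape[0]
    proj = omegas @ x
    h = np.exp(np.dot(x, x) / 2.0)   # prefactor turning the Gaussian kernel into the softmax kernel
    return h / np.sqrt(m) * np.concatenate([np.sin(proj), np.cos(proj)])

d, m = 16, 10000
rng = np.random.default_rng(0)
omegas = rng.normal(size=(m, d))
q = rng.normal(size=d) * 0.2         # small-norm inputs keep this estimator well behaved;
k = rng.normal(size=d) * 0.2         # the caveat discussed next is about what happens otherwise

exact    = np.exp(np.dot(q, k))
estimate = np.dot(phi_trig(q, omegas), phi_trig(k, omegas))
print(exact, estimate)               # should be close for this benign pair
```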
This was a long way to get here, but I hope you're with me, and so far this is actually pretty straightforward. The renormalization we can take care of easily. But there is a problem, and they argue this is why this hasn't been proposed so far: it doesn't work like this. Even though you approximate this kernel fairly well, it's a bad approximation in the places that matter, and they say so here: "There is however a caveat here. The attention module constructs for each token a convex combination of value vectors with coefficients given as corresponding renormalized kernel scores. That is why kernels producing non-negative scores are used. Applying random feature maps with potentially negative dimension values leads to unstable behaviors, especially when kernel scores close to zero (which is the case for lots of entries of A corresponding to not relevant tokens) are approximated by estimators with large variance in such regions. This results in abnormal behaviors, e.g. negative-diagonal-value renormalizers, and consequently either completely prevents training or leads to sub-optimal models." So what they're saying is that when you use a softmax, you always get positive values: if I have a bunch of numbers, positive, negative, very positive, negative, and I run them through a softmax, I get out a distribution, a kind of histogram of positive values. Now I'm trying to approximate this by the formula right here, and you can see these are vectors which give me sine and cosine coefficients, and I linearly multiply two such vectors together, which definitely means I can get negative entries, and so on. The renormalization then has to somehow take care of that, and especially around zero, where the original softmax matrix would have values close to zero, this approximation is really bad and has high variance. They also argue that a lot of attention values are close to zero, because we know that attention is sort of sparse, just by how the softmax works: it exaggerates the largest inner products and really dampens the low ones (actually, I might not even have drawn this correctly for the very negative case, but in any case). So they say that's why this doesn't work: it's a good approximation overall, but it has such high variance in the wrong places, namely around zero, where most values are. They call this estimator SM_m^trig, the softmax approximation with m sampled features, "trig" because it uses the sine and cosine functions. Now they're trying to remedy this, and for that they propose a different decomposition, a different approximation to the softmax kernel. They say we can also approximate the softmax kernel with the following formula (they have a proof for this which I'm not going to go through): you again sample these omegas, and then this inner product approximates the softmax kernel. And this can be further reduced to the thing right here, a deterministic expression given by that, with this cosh; cosh is the hyperbolic cosine, so cosh of x is e to the x plus e to the minus x, divided by two.
So this function approximates the softmax; that's just something you'll have to take from their proof. However, you can now see that this can be fairly easily represented as an inner product; you can already see it here: this is the part that comes from x and this is the part that comes from y. To write this in our earlier notation: the distribution we sample the omegas from is again a normal distribution, and the h function, the prefactor, is made up of the norm of x put through the exponential function. Then we have two options; actually I don't even know why they put the first one, the second option makes more sense and has a bit more of a factor in front. You have two functions: exp(u) and exp(-u). Remember, this is where we had sine and cosine before; now we have exp(u) and exp(-u). And we can quickly check that this gives us the same thing. If we take the inner product of these h functions, that gives us this (what is that, a big lambda?) expression right here. Let's just say we sample one single omega. Then our x gives a vector with two sub-vectors (since we have two functions), each of length one: the first entry is e to the omega x and the second entry is e to the negative omega x. If we put y through the same map (instead of x and y you can think of queries and keys), we get e to the omega y and e to the negative omega y. If we now take the inner product and resolve the exponentials, that gives us e to the omega times (x plus y), plus e to the negative omega times (x plus y); and there's a normalization factor, that's where the square root of two comes in. So this is exactly the hyperbolic cosine of omega times z, where z is x plus y, as they say somewhere here. So if we choose f1 and f2 to be exp(u) and exp(-u), then when we perform the inner product we get out exactly this formula number seven, and that is an approximation of the softmax kernel, just a different approximation than before. The cool thing about this approximation is that it only ever has positive values: these vectors here, and of course the factor in front of them, which is also an exponential, are all exponentials, so these are all going to be positive features, which is very, very nice. They also show this theoretically. This kind of funky graphic shows the ratio of the approximation mistake of the original approximation that we discussed and the new positive approximation we just built. You can see that in parts it's fairly similar, so this ratio r is fairly flat right here, but there are parts where it just shoots up: the error of the trig approximation shoots up while the positive approximation stays flat, or at least flatter, in those regions.
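Here is a sketch of that positive feature map (my own code, not the paper's), compared against the trigonometric one on a pair whose true softmax score is close to zero, which is exactly the regime the caveat above is about:

```python
import numpy as np

# Sketch (not the paper's code): positive random features for the softmax kernel,
# phi_pos(x) = exp(-||x||^2 / 2) / sqrt(m) * [exp(w_i . x)]_i  with omegas ~ N(0, I_d).
# Every entry is positive, and the estimator stays well behaved where exp(q . k) is near zero,
# which is where the sin/cos estimator has huge relative variance.
def phi_pos(x, omegas):
    m = omegas.shape[0]
    return np.exp(omegas @ x - np.dot(x, x) / 2.0) / np.sqrt(m)

def phi_trig(x, omegas):
    m = omegas.shape[0]
    proj = omegas @ x
    return np.exp(np.dot(x, x) / 2.0) / np.sqrt(m) * np.concatenate([np.sin(proj), np.cos(proj)])

d, m = 16, 10000
rng = np.random.default_rng(0)
omegas = rng.normal(size=(m, d))

q = rng.normal(size=d) * 0.4
k = -q + rng.normal(size=d) * 0.1        # strongly anti-aligned pair: exp(q . k) is close to zero
exact = np.exp(np.dot(q, k))

est_pos  = np.dot(phi_pos(q, omegas),  phi_pos(k, omegas))
est_trig = np.dot(phi_trig(q, omegas), phi_trig(k, omegas))
# The positive estimate should sit very close to `exact`; the trig one has huge variance
# in this regime and may even come out negative.
print(exact, est_pos, est_trig)
```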
In fact they can prove that, as the softmax values go to zero (that's the problematic region), the error of the trigonometric approximation can go to infinity, while the error of the positive approximation goes to zero. They have a number of theoretical results in here, and I think that's one of the main ones: this approximation succeeds exactly where the other approximation fails really quickly. They also have a variant where they don't build a vector of two sub-vectors but just one, with just the exponential function, and that amounts to the same thing, because if you sample omega you're going to get omega as often as negative omega (I believe), and thereby in expectation you get this hyperbolic cosine again; I think that's why this lower construction here also gives you the hyperbolic cosine. So, pretty cool: we simply use this approximation, we run our queries and our keys through it, and ideally we use more omegas than just one, maybe a bunch; the more we use, the better. We obtain a linear function that approximates the softmax function; the more we sample, the better the approximation, it's unbiased, and so on. They have a bunch of variants of it: a variant where you normalize the omegas, which gives you the regularized softmax kernel, which is not a softmax anymore but a regularized softmax, and they can approximate this in pretty much the same way, except instead of a normal distribution you use a uniform distribution right here. And they have a bunch of other things. One other improvement: so far we've simply sampled these omegas from a normal distribution, like here. They say we can strictly improve on this, which gives us an estimator with strictly lower variance, if we make sure that the omegas we sample are exactly orthogonal. They're already approximately orthogonal if we sample them in a high-dimensional space, but if we make sure they are exactly orthogonal, they give us an even better approximation. You can do that by the Gram-Schmidt orthogonalization (renormalization) procedure; it's a pretty easy procedure, and it doesn't mess with your unbiasedness whenever D is an isotropic distribution (isotropic just means the same in every direction, so a standard Gaussian or a uniform distribution would fulfill this, as long as it's centered; maybe even if it's not centered, depending on how you renormalize, but this is irrelevant). If you make them exactly orthogonal: "this leads to the first theoretical results showing that orthogonal random features can be applied to reduce the variance of softmax or Gaussian kernel estimators for any dimensionality d, rather than just asymptotically for large enough d, as it is the case for previous methods, and leads to the first exponentially small bounds on large deviations probabilities that are strictly smaller than for non-orthogonal methods." So we end up with bounds that are strictly smaller than if you don't use orthogonality. The only thing it requires is that m is smaller than or equal to d, i.e. the number of omegas you sample is at most the dimensionality that the original space operates in, which, let's say, will be the case in all their experiments.
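Here is a sketch of how such exactly orthogonal omegas can be drawn (the standard Gram-Schmidt-via-QR construction under my own naming, not the paper's code); note that it needs m ≤ d:

```python
import numpy as np

# Sketch (standard orthogonal-random-features construction, not the paper's code):
# draw a Gaussian matrix, orthogonalize (Gram-Schmidt via QR), then rescale each row to a
# norm drawn like that of an independent N(0, I_d) vector, so each row still looks like a
# Gaussian sample marginally, but the rows are exactly orthogonal. Only works for m <= d.
def orthogonal_gaussian(m, d, rng):
    assert m <= d, "can only have as many orthogonal directions as dimensions"
    gauss = rng.normal(size=(d, d))
    q_mat, _ = np.linalg.qr(gauss)                              # orthonormal columns
    directions = q_mat[:, :m].T                                 # m orthonormal rows of length d
    norms = np.linalg.norm(rng.normal(size=(m, d)), axis=1)     # chi-distributed row norms
    return directions * norms[:, None]

rng = np.random.default_rng(0)
omegas = orthogonal_gaussian(8, 16, rng)
print(np.round(omegas @ omegas.T, 3))   # off-diagonal entries are (numerically) zero
```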
Again, these are exponentially small bounds, which is pretty cool. I guess for you, the end user, what matters is that this works, and it does if you use all of their tricks with the positivity and the orthogonality. By the way, this here is where they show that the orthogonal MSE, the mean squared error, is smaller than the original one minus something, and as long as that something is greater than zero, you have something that's smaller. Then they prove a bunch of other things, again about this regularized (sorry, not regularized, I forget, the variant where you divide by the norm) kernel. In any case, they implement this in JAX. Oh great, wow, cool; okay, I have no opinion on JAX, but they have released the code and I'll of course link to it. Here you can clearly see, in this log-log plot, L, the size of the input, against the number of seconds it takes to go forward and backward through the model. The X here is the baseline, where you simply bypass the attention matrix: you take the identity function and just return the value matrix. You can see that the Performers scale fairly well with that baseline, and in fact they scale at the same slope, which is the important part: this is a linear slope, whereas the transformers, which are the dashed lines, all curve upwards, which of course is that quadratic requirement. The same holds in the backward pass; I don't know if they keep curving, I think it's also a straight line in the log-log plot, but with slope two instead of one like the linear models. Again, the comparison that matters is between the baseline and the lines you're looking at: if they have the same slope, they scale the same. And as you go higher, look at this: this is log L, so these are 2 to the 18th tokens, and I believe this is done on one GPU; yes, "an out-of-memory error on a V100 GPU". This is pretty good news for everyone who wants to run Performers in a kind of low-resource environment; by low resource I mean a regular deep-learning GPU instead of a thousand TPUs, which is pretty cool. They also show that the orthogonality is better than the iid features, and of course the positive iid features are better than the original trigonometric decomposition. And they show that you can take a transformer checkpoint, plug it into the Performer, and simply fine-tune a little bit to get it back to the performance the transformer was at. This, I believe, is the original training curve of the transformer, so it's not a fair comparison because the Performer starts from the checkpoint; at least that's how I interpret it, it's not clearly written. And they say: over here this trig thing works, this is the original approximation, it even works. However, if we do that on a more challenging, longer-sequence dataset, then you can see that the trig softmax just whacks out, that's this line here, and you actually need these better positive approximations. They also compare to the Linformer here, which is pretty cool.
The Linformer (I've made a video about it if you want to know more) also does random projections of the attention matrix, but you can see that the Linformer plateaus, along with the Performers if you don't redraw the random features. So in the Performer, if you redraw these random features, these omegas, at the right time (you can't just arbitrarily redraw them in the middle of a computation step, but at the end of a computation step you can redraw them for the next one), and even better with the regularized or normalized features, you get to the same level of performance that a standard transformer would get, but of course without the quadratic requirements. And lastly, as I said, they've already swapped out this nonlinearity for a ReLU: here they construct the Performer-ReLU, taking f equal to ReLU in equation five. Remember what f was: f was the sine and cosine in the first approximation, and exp(u) and exp(-u) in the second one. As I said, the big improvement in deep learning came when we swapped sigmoids for ReLUs, and here they're already trying to swap this as well, because they now have a method where they can basically plug in anything they want. So they plug in ReLU because it has worked well, and again this works pretty well. They also compare with the Reformer and the Linformer, as you can see, and of course they beat everything. Now, whether or not this method is going to be the next thing, the thing that everyone uses, we don't know; it's fairly possible, it's pretty cool, and it appears to be theoretically solidly grounded, but you never know from the experiments of a single paper. The broader impact statement, much respect: they just use it to tell you how awesome their paper is; there's no mention of any kind of ethical impact, and I'm all for these kinds of broader impact statements: research on transformers is going to be better because now people have access to it; it's backward compatible, that's pretty cool; it's applicable to biology and medicine because we can now take longer sequences. Yeah, I like these kinds of broader impact statements. The last thing here: the only problem is if you want to do causal attention, i.e. a generative model like a GPT-style model, where you have to do a bit of a trick. That's because your attention matrix isn't the full attention matrix, it's this lower-triangular matrix right here, so you can't just decompose it as before. But since you have a linear decomposition of this thing, you can do prefix sums: you compute key 1 times value 1; then key 2 times value 2 plus key 1 times value 1; then key 3 times value 3 plus key 2 times value 2 plus key 1 times value 1; and so on. These are all the big sums where the L goes away, so we do that first, and then we simply come along with the queries: we take q1 and multiply it by the first sum, q2 by the second, q3 by the third, and so on. You simply keep track of these prefix sums, and when the next query comes along you multiply it by all of the things that came before it in the prefix sum; that's how you get your causal, lower-triangular attention.
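Here is a sketch of that prefix-sum trick (my own loop-based version for clarity, not the paper's vectorized code): keep a running sum of the outer products k'_i v_iᵀ and a running sum of the k'_i for the normalizer, and let each query only see the sums accumulated up to its own position.

```python
import numpy as np

# Sketch (loop version for clarity, not the paper's code): causal linear attention via prefix sums.
# q_prime, k_prime: (L, r) feature-mapped queries/keys; v: (L, d_v) values.
def causal_linear_attention(q_prime, k_prime, v):
    L, r = q_prime.shape
    d_v = v.shape[1]
    running_kv = np.zeros((r, d_v))   # prefix sum of k'_i v_i^T
    running_k = np.zeros(r)           # prefix sum of k'_i, for the normalizer
    out = np.zeros((L, d_v))
    for i in range(L):
        running_kv += np.outer(k_prime[i], v[i])
        running_k += k_prime[i]
        out[i] = (q_prime[i] @ running_kv) / (q_prime[i] @ running_k)
    return out

# Tiny smoke test with random positive features: each output row only depends on positions <= its own.
rng = np.random.default_rng(0)
L, r, d_v = 5, 8, 4
out = causal_linear_attention(rng.random((L, r)), rng.random((L, r)), rng.normal(size=(L, d_v)))
print(out.shape)   # (5, 4)
```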
So even that is solved, a thing that I believe the Linformer wasn't able to do with its particular decomposition; I might be wrong here. All right. They have a bunch of experiments on protein analysis and so on, which of course wasn't really possible before because it was so heavy, and they also have ImageNet 64, as you can see right here, which is an impossible dataset for a classic transformer. As I said, they have code, and the code is in JAX, which, let's be honest, is ugly code, but it's code, so that's fairly cool. I want to point out that right at the bottom here is where the actual work happens. You can see that the queries and keys are constructed right here: query prime and key prime are pulled through this feature creator, which implements these kernels, and these can be, as we said, the exponentials, or the ReLUs, or the sine/cosine, whatnot. Then you multiply the queries and the keys, which gives you this W matrix, and all we need to do now is normalize it. So we renormalize by constructing this denominator right here, and then there's a whole block for the unidirectionality, which, as you can imagine, is pretty ugly. But the renormalization we constructed: "reciprocal" means we take the inverse, multiply it by the W, and return the result. This should be translatable into your favorite framework, PyTorch or TensorFlow, whatnot; maybe it's already been done, I haven't researched that particular thing. In any case, I invite you to check out the paper and the code, and play around with the functions used here. You don't even need to know which kind of kernels your functions correspond to; these papers always work that out, but back in the SVM days people just went nuts: plug in some functions, see what happens. Probably nothing good, but it's possible. All right, that was it for the Performer. I hope you gained something from this, kind of an understanding of how it works, and I wish you all the best. Bye.
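For reference, the code walkthrough above boils down to something like the following minimal NumPy sketch (my own mock-up with made-up names, not the paper's actual JAX implementation): pull queries and keys through a positive feature map, multiply keys by values first, then renormalize, never building the L by L matrix.

```python
import numpy as np

# Sketch (my own NumPy mock-up of the walkthrough above, not the paper's JAX code):
# bidirectional FAVOR-style attention with positive random features.
def phi_pos(X, omegas):
    m = omegas.shape[0]
    return np.exp(X @ omegas.T - np.sum(X ** 2, axis=-1, keepdims=True) / 2.0) / np.sqrt(m)

def performer_attention(Q, K, V, omegas):
    q_prime = phi_pos(Q, omegas)                    # (L, m) feature-mapped queries
    k_prime = phi_pos(K, omegas)                    # (L, m) feature-mapped keys
    numerator = q_prime @ (k_prime.T @ V)           # keys times values first: m x d_v intermediate
    denominator = q_prime @ k_prime.sum(axis=0)     # approximates the softmax row sums
    return numerator / denominator[:, None]

def exact_softmax_attention(Q, K, V):
    A = np.exp(Q @ K.T)
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
L, d, m = 6, 8, 20000
Q, K, V = (rng.normal(size=(L, d)) * 0.3 for _ in range(3))
omegas = rng.normal(size=(m, d))

print(np.round(performer_attention(Q, K, V, omegas), 2))
print(np.round(exact_softmax_attention(Q, K, V), 2))   # should be close to the approximation above
```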
Info
Channel: Yannic Kilcher
Views: 42,144
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, nlp, natural language processing, natural language understanding, data science, transformer, attention, attention mechanism, transformers, attention is all you need, gpus, tpu, linformer, reformer, explanation, imagenet64, kernels, gaussian kernel, softmax, softmax kernel, approximation, random features, random positive features, random fourier features, google, favor, machine translation
Id: xJrKIPwVwGM
Length: 54min 38sec (3278 seconds)
Published: Mon Oct 26 2020