Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)

Video Statistics and Information

Captions
Hi there, how's everyone doing today? We'll look at "Perceiver: General Perception with Iterative Attention" by Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and João Carreira of DeepMind. On a high level, this paper describes a model called the Perceiver, which interleaves a latent self-attention mechanism with a cross-attention mechanism. So it is a transformer, and the trick is that the data only enters the transformer through that cross-attention mechanism. This allows the latent array to be of significantly smaller size than the data array, and it solves, in part, the transformer's quadratic memory and compute bottleneck. The image, or rather the data, comes in multiple times through this stack, and the weights can be shared, making it essentially a recurrent neural network. The model works for any modality: the paper does not only images but also video, audio, and point clouds, and you have to change pretty much nothing about the input for the model to work. So this is a pretty big step towards, first of all, making transformers deeper, and second of all, applying the same model to very different modalities of data.

We'll dive into the paper and look at how it's done. It's actually a fairly simple idea, so it shouldn't take us too long; I always say that, but maybe today we'll achieve it. If you like content like this, tell me how you feel in the comments, leave a like, and tell your friends about it. Let's go.

They motivate the name Perceiver (it's not really tied to anything) by saying that biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, and so on. The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structure exploited by virtually all existing vision models. What do they mean? Say we have an image of a house (not a cat, what did you think?). An image-processing pipeline will usually assume that the image is some sort of grid, that you can localize any pixel by its x/y coordinate, and that each pixel is in some kind of relation to the pixels around it, and you usually build models according to that. A convolutional neural network very explicitly slides a filter with shared weights over the image, and therefore it directly says that what matters to a pixel is the pixels around it; only in the upper layers, after some pooling, do the receptive fields grow so that more and more information across larger distances is incorporated. On the other hand, a vision transformer like ViT does transformer-style attention, but because the images are so large (224 by 224 pixels is just too much to put into one transformer), it subdivides the image into patches and turns each patch into a vector. So it also essentially says that pixels that are close together go into the same vector and are treated as a group. The paper's point is that all current computer vision architectures have this somehow built in.
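As a rough illustration of the difference in input granularity, here is a small NumPy sketch (my own illustration, using the standard 224-by-224 resolution and 16-by-16 ViT patches; it is not code from either paper):

```python
import numpy as np

# A 224x224 RGB image.
image = np.random.rand(224, 224, 3)

# ViT-style tokenization: group each 16x16 pixel patch into a single vector.
P = 16
patches = (image.reshape(224 // P, P, 224 // P, P, 3)
                .swapaxes(1, 2)
                .reshape(-1, P * P * 3))
print(patches.shape)  # (196, 768): only 196 tokens, locality baked in

# Perceiver-style input: simply unroll every pixel into its own element.
pixels = image.reshape(-1, 3)
print(pixels.shape)   # (50176, 3): ~50k elements, no grid assumption at all
```

The 50k-element version is what the Perceiver wants to consume directly, and that is exactly what makes standard self-attention infeasible, as discussed next.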
Other modalities like audio and video come with similar baked-in assumptions, and the Perceiver is supposed to alleviate that. As the paper puts it, such assumptions induce helpful inductive biases but also lock models to individual modalities; in this paper they introduce the Perceiver, a model that builds upon transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.

Transformers, notably, are models that transform sequences to sequences, or let's say sets to sets. You have an input set, and what we've come to know as transformers are stacks of self-attention layers. In a self-attention layer you transform the input into an equally long output sequence, and in the middle you have the attention mechanism, which needs to compute a weight between every one of the inputs and every one of the outputs. With sequence length M, that gives rise to O(M²) compute and memory. If M is small, that's not a problem, but in NLP we usually deal with M on the order of 1,000 (ideally we would want more), while in computer vision M is easily something like 50k, which is about 224 squared. M² would then be 50,000 squared, and that just blows the memory of our computers; maybe not the ones of the future, but certainly the ones we have now.

So the problem is that these transformer architectures take too much memory, and this paper asks whether we can do a better job. In a usual transformer layer you compute queries, keys, and values from the same input, those get mingled together in the attention, and that gives you the next layer, where you again produce queries, keys, and values. The queries are of size M by D and the keys are also of size M by D, and if you multiply one by the transpose of the other, you can clearly see that this gives you a matrix of size M by M.
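To make the quadratic part concrete, here is a minimal single-head self-attention sketch in NumPy (an illustration with made-up random weights, not the paper's implementation):

```python
import numpy as np

def self_attention_matrix(x, d_k=64):
    """x: (M, D) input sequence; returns the (M, M) attention matrix."""
    M, D = x.shape
    Wq = np.random.randn(D, d_k) / np.sqrt(D)   # stand-in projection weights
    Wk = np.random.randn(D, d_k) / np.sqrt(D)
    Q, K = x @ Wq, x @ Wk             # both (M, d_k)
    scores = Q @ K.T / np.sqrt(d_k)   # (M, M)  <- the quadratic part
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return scores / scores.sum(axis=-1, keepdims=True)

# M around 1,000 (NLP) is manageable; M = 224 * 224 = 50,176 is not:
# the attention matrix alone would hold 50,176^2 floats, about 10 GB in float32.
```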
What this paper does is go back to what the very first transformer proposed. The original transformer, if you remember (and if you don't, you can go watch my video on it), was something like a generative model with an input sequence and an output sequence, where the output sequence might not be fully completed yet, since you want to predict the next thing; so there was a clear distinction between sequence A and sequence B. Sequence B would do self-attention, stacks of self-attention layers with the quadratic cost, and ultimately you want some output telling you what the next word is, so it's an autoregressive model. The input, however, was not fed in with self-attention but with cross-attention, with information flowing over from the input stack. Think of machine translation: here is the German sentence, and here is the half-finished English sentence you want to complete. To know what comes next, every part of the English sentence needs to attend to the English sentence, but every part of the English sentence also needs to attend to the German sentence; that's why you have these paths going over. None of the German sentence needs to attend to the English sentence. It could make sense, but it's a restriction that says the information flows from the German sentence to the English sentence. That results in cross-attention, where the keys and values are produced from sequence A, but the queries for this particular flow of information are produced by the target sentence. And you'll notice something: these can now be of different lengths. Notably, if sentence B is much shorter than sentence A, that results in a shorter set of queries, and you get not an M by M matrix but an M by something smaller; let's call it N. If N is much smaller than M, you don't have the quadratic bottleneck.

That's exactly what this model does. It's akin to a few things: to the original transformer, and also to DETR, the detection model, where what we called the things there are learned queries. So what do they do here? The goal is to have a latent array that is not huge: N here is a size we can handle in a regular transformer. The top row of the architecture, the latent transformer, is just a regular self-attention transformer with all its drawbacks, classic self-attention in every layer and every stack that follows, but we can handle it because N is relatively small; in this paper I think N is something like 500 or 1,000, which current hardware can deal with. The problem is when you want to bring in an image, and this is the smart part: they take the image and just unroll it into a byte array, so now M is huge, about 50,000.
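Here is the same computation as before, but with the queries coming from a small latent array instead of from the data. This is a minimal single-head sketch of the idea with random stand-in weights (the actual Perceiver uses multi-head attention, layer norm, MLPs, and so on; the sizes are just the ones discussed above):

```python
import numpy as np

def cross_attention(latent, data, d_k=64):
    """latent: (N, D_l) latent array; data: (M, D_d) unrolled input byte array."""
    Wq = np.random.randn(latent.shape[1], d_k) / np.sqrt(latent.shape[1])
    Wk = np.random.randn(data.shape[1], d_k) / np.sqrt(data.shape[1])
    Wv = np.random.randn(data.shape[1], d_k) / np.sqrt(data.shape[1])
    Q = latent @ Wq                   # (N, d_k): queries come from the latent
    K, V = data @ Wk, data @ Wv       # (M, d_k): keys/values come from the data
    scores = Q @ K.T / np.sqrt(d_k)   # (N, M): linear in M, not quadratic
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ V                   # (N, d_k): the whole input squeezed into N slots

latent = np.random.randn(512, 256)   # N = 512 latent vectors
data = np.random.randn(50176, 3)     # M = 50k unrolled pixels
print(cross_attention(latent, data).shape)   # (512, 64)
```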
However, because we produce the queries from the latent array and not from the image itself, we don't get the quadratic blow-up: the result is an N by M attention matrix, not an M by M one. So in this cross-attention module the image data comes into the transformer, but it is not transformed into an equally long sequence; it is transformed into a much shorter sequence, namely this latent state. On that latent state we run a transformer, producing a new latent state, and from that, queries are generated to do cross-attention again to the same image. The same image comes into the architecture at every single repeat, and so on. If this reminds you of a recurrent neural network, it sort of is one, especially because they say you can also share these weights between repeats. If you share the weights, it is definitely a recurrent neural network where the initial state is either learned or randomly initialized; in this case I'm pretty sure it's learned, though I might have misread.

So this concept relates to RNNs (in fact it is an RNN if you share the weights), and it relates to learned queries as opposed to generated queries: when you learn the queries, you can choose yourself how many there are. Conceptually, you can think of it like this: here is a bunch of learned queries that have no clue about the incoming data, just a generic set of queries, like "what would you like to know about this incoming data point?" You have a thousand things you can want to know and, say, 50,000 things to attend to, so you're going to choose a thousand criteria to gather from that input data. The way attention works is that you have a set of queries Q and a larger set of keys; every query exposes a vector, every key exposes a vector, and information is routed by means of high inner products: pairs with a high inner product get routed together. The queries essentially say what kind of things the model would like to know about the incoming data, and the keys say, for each pixel in the data, what kind of things that particular pixel offers to the model. If you do this only once, you might get some generic information, but then you get to do it again, and the later queries are a result of the earlier processing: the data comes through and influences the next queries, so the next queries can depend on the earlier data. You can pretty easily see that the next time you attend to the data, you do it in an informed fashion; you already kind of know what's in there, so you refine what you would like to know about the data, and you can keep refining, asking for more and more specific things the more you learn. So this is really a process of learning more and more about the data in a dynamic way, where the model itself decides what it would like to know next.
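Putting the pieces together, the overall computation is roughly the loop below. Treat it as a sketch under my reading of the paper: the single-head `attend` helper stands in for the real multi-head attention blocks, exactly which repeats share weights and whether the initial latent is learned are details I'm hedging on, and layer norms and MLPs are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(q_in, kv_in, W):
    """Single-head attention: queries from q_in, keys/values from kv_in."""
    Q, K, V = q_in @ W["q"], kv_in @ W["k"], kv_in @ W["v"]
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    s = s / s.sum(axis=-1, keepdims=True)
    return s @ V

D, N, M = 64, 512, 50176
def make_weights():
    return {k: rng.normal(size=(D, D)) / np.sqrt(D) for k in ("q", "k", "v")}

data = rng.normal(size=(M, D))     # unrolled input plus positional features
latent = rng.normal(size=(N, D))   # initial latent array (learned in the paper, I believe)

cross_W = make_weights()           # shared across repeats -> RNN-like behaviour
self_W = make_weights()
for repeat in range(8):            # the same data comes back in at every repeat
    latent = latent + attend(latent, data, cross_W)    # cross-attention: (N, M) scores
    latent = latent + attend(latent, latent, self_W)   # latent self-attention: (N, N) scores

prediction_features = latent.mean(axis=0)   # pool the latent for a classification head
```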
I think it's a great idea. It might be refined in the future, but it makes sense, and it solves the quadratic bottleneck. Oh wait, I almost forgot: I had a visual demonstration of how the quadratic bottleneck is solved. Here's a matrix, it's M by M, now watch... problem solved. (The lower matrix is supposed to represent N by M; I did not write that down.)

This not only lets you overcome the quadratic bottleneck, it also lets you build much deeper transformers. I believe their best architecture here had 48 transformer layers, which we can do in NLP but it takes a lot of hardware, and when they share the weights, the number of parameters is not more than, I think, a standard ResNet; it's comparable. Pretty cool. They apply this to pictures, to videos, to audio, to video and audio together, and to 3D point clouds. One has to say that for video they don't actually put the entire video in; they feed little space-time chunks of the video. So it doesn't yet solve all the problems with transformers: if a data point is huge, you still won't get it in, simply because it is linearly huge. What it does solve is things being quadratically huge.

The last thing to pay attention to is positional encodings. We now have a fully modality-independent architecture, and it's important to realize that the input array has nothing to do with an image as such. Is it an image? Who knows; we don't care. It is simply the array of unrolled pixels, or it's the audio data, sample after sample, and so on. There is no convolutional filter, no patching or batching or anything; it's just the raw data. You can even think of a situation where you feed in different parts of the data from time step to time step, in which case it really becomes like a recurrent neural network. But transformers are invariant to position: if I feed 1 2 3 4 5 into a transformer, it will do exactly the same thing as if I feed 3 1 2 4 5 (not much of a permutation, but still). That's a problem, because there is something to things being in a certain location; think of text, where word order matters. There is a clear distinction here: we don't want to build these things into the architecture, but we do want to give the model the possibility to exploit that information, because it's clearly there. A piece of text is not just a set, it is an actual string of ordered words.
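You can check that claim numerically: without positional encodings, permuting the inputs of a self-attention layer simply permutes its outputs in the same way, so the layer itself has no notion of where anything sits. A quick sanity check (my own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

def self_attention(x):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(D)
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    s = s / s.sum(axis=-1, keepdims=True)
    return s @ V

x = rng.normal(size=(5, D))         # five tokens, no positional encoding attached
perm = np.array([2, 0, 1, 3, 4])    # feed "3 1 2 4 5" instead of "1 2 3 4 5"
print(np.allclose(self_attention(x[perm]), self_attention(x)[perm]))  # True
```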
So what do we do? We give positional encodings along with the input. Positional encodings have been used all over the place, and transformers specifically need them. The way this paper does positional encodings is much like the first transformer paper: with Fourier features. If you have five inputs, you build up a bank of frequencies: the lowest frequency is something like a slow sine wave, and then come higher and higher frequencies (five inputs probably wasn't the optimal number to demonstrate this). By indexing into this bank, each position gets its encoding; position two might read something like 0.9, 0.9, -1 across the bank, while position three reads 0.9, -1, 1. With this kind of positional encoding, as opposed to a learned positional encoding, you can always detect when two things are close together, because in the lower frequencies they share the same numbers; but you also get very high resolution, because if two positions differ at the highest frequency yet match at all the frequencies below it, they are right next to each other. That's how you do positional encoding with Fourier features; again, I discussed this at length in my Attention Is All You Need video. Fourier features have the additional benefit that you don't rely on learned encodings, which means you don't rely on having a fixed maximum sequence length. You still have a kind of maximum here, but I like this more because it's independent, it's one less thing to learn, and the learning happens in the processing itself.
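Here is a small sketch of such an encoding for one axis. As I recall, the paper spaces the frequencies up to roughly the input's Nyquist rate and also concatenates the raw coordinate; treat the exact frequency schedule and band count below as assumptions for illustration:

```python
import numpy as np

def fourier_position_encoding(size, num_bands):
    """1-D Fourier positional features for `size` positions scaled to [-1, 1]."""
    pos = np.linspace(-1.0, 1.0, size)               # (size,)
    freqs = np.linspace(1.0, size / 2.0, num_bands)  # (num_bands,), up to ~Nyquist
    angles = np.pi * pos[:, None] * freqs[None, :]   # (size, num_bands)
    # raw coordinate plus a sine and a cosine at every frequency band
    return np.concatenate([pos[:, None], np.sin(angles), np.cos(angles)], axis=-1)

enc = fourier_position_encoding(size=224, num_bands=64)
print(enc.shape)   # (224, 129): 1 raw coordinate + 64 sines + 64 cosines
# For an image you would build this separately for the x and y axes and concatenate,
# giving the 2-D Fourier features mentioned below, appended to each pixel's value.
```

Nearby positions agree in all the low-frequency bands and only start to differ in the highest ones, which is exactly the closeness property described above.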
In terms of experiments, it's pretty simple: in vision they are on par with something like a ResNet-50, and they're doing pretty well without any sort of assumption that the input data is an image. That's the crazy part: other than the positional encodings, which are Fourier features in two dimensions, there is nothing here saying this is an image; it's simply an array of pixels. I think that's crazy.

Next is a visualization of the attention maps. In this model specifically, layer one has one set of weights, layers two to seven (I think) share a different set of weights, and layer eight has yet another set: layer one is blue, layers two to seven are green, and the last layer is orange. You can see the attention maps of different channels, and they stress that they don't overlay them on the image. The attention maps in the first layer really do attend to the image pixels: you can clearly see the dog in many of these attention maps, where the model attends to parts of the dog; it seems to attend to the intensity of the pixels. Then in layers two to seven the attention maps look like a grid, so they rely heavily on the positional encodings to build up this grid, though the grid is not always the same; it differs between channels. And then there is the last layer. My question would be: I see that these maps differ from channel to channel, but how different are they from input to input? Has the model just learned a general sequence of attention maps for all possible input images that happens to work well? It's kind of suspicious, because these maps look so regular. So my question would be how much these attention maps really depend on the input versus how much they are just general attention maps. I can totally see this model doing all the work in the latent transformer simply by having so many layers, with the attention not being too important: it would always do the same sort of attention no matter what the input is, and I can see a model like that performing well. In order to demonstrate that this idea really works as advertised, namely that the model itself selects what it wants to attend to, iteratively and informed by the data, it would be cool to see that these maps somehow depend on the data, because the grid pattern right now tells me that maybe they don't.

The last thing: as I said, they also apply this to audio, video, and 3D point clouds, and I think they outperform other methods there, reaching state of the art in a bunch of them, which is pretty cool. Computer vision has, of course, been one of the prime disciplines of deep learning research, so that benchmark is maybe a bit more competitive.

Finally, the ablations. They find, specifically, that as the number of latent variables goes up (that is N, the size of the query set, the thing we need to keep small to avoid the quadratic bottleneck), performance goes up; that at least validates the intuition that if we could do bigger transformers, it would probably be a good idea. For the number of attends, which I think is how many times the image goes into the structure, more is also better. And for the number of transformers per attend, that is, how many in-between self-attention layers you have each time you attend to the image, which gives the model time to process and to decide what to attend to next, we also see a rise. It would be interesting to see an interaction term between these last two, which would tell us whether it's just about making the model deeper or not.

Okay, that was all I had to say. You can check out the attention maps they have there yourselves; they have them for audio and I think for video as well, and there are a bunch of experimental details that are also pretty cool. I just think it's a cool idea, and I'm excited to see where people take this. All right, that was it from me. I'll see you next time. Bye.
Info
Channel: Yannic Kilcher
Views: 54,963
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, deep learning tutorial, what is deep learning, introduction to deep learning, deepmind, perceiver, cross attention, attention mechanism, attention is all you need, google deepmind, deepmind perceiver, perceiver model, perciever model, perciever, self attention, rnn, recurrent neural network, weight sharing, computer vision, natural language processing, fourier features
Id: P_xeshTnPZg
Length: 29min 36sec (1776 seconds)
Published: Mon Mar 22 2021