An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)

Video Statistics and Information

Captions
Hi there. Today we'll look at "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". This paper is a bit special. Andrej Karpathy tweeted it out, and I'm going to guess many of you have seen it already. It's a paper that's under review at ICLR. ICLR of course uses OpenReview, so all the submitted papers can be seen and can technically be commented on, and as you can see, it's anonymous. Good thing it's anonymous, because the double-blind review process relies on anonymity, so we can really evaluate this paper, which is a very interesting paper, on its merits, without having a clue who would be writing something like this.

Now, out of pure randomness, I just happen to have this in my Ctrl-C Ctrl-V memory, and I just pasted it here, I don't know why: this other paper called "Big Transfer (BiT): General Visual Representation Learning" by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai and others of Google Research. I've actually made a video about it, so if you're interested... totally not related at all. So disregard the fact that the paper we're discussing here uses a JFT-300M dataset that is not available to the public, only to Google, and that this other paper also trains on it. Also largely disregard the fact that their model is called ViT while the other paper's model is called BiT. Disregard the fact that they train on the exact same datasets, as you can see right here: ImageNet, CIFAR-100, Pets, Flowers, and VTAB, the Visual Task Adaptation Benchmark, by Google (I've done a video on that too). They do actually have ImageNet-ReaL here, which is just a set of new labels for ImageNet that comes out of a paper by Google with largely the same authors as this paper. Disregard the fact that the color scheme for the VTAB evaluation is exactly the same, as is the histogram plotting, and of course we don't even want to bicker about the plotting style with these bubble sizes. I mean, anyone in the world could just randomly have this much overlap with these models, and of course anyone just has the money lying around to train for 2.5k TPUv3-days, compared with 9.9k TPUv3-days for the BiT model. I guess you could just pick those numbers out of the paper, but what do I know? So no, don't worry, peer review is totally fine.

I hope I've made my point: this is by these people. And people say we need anonymous submissions on arXiv, because the danger is that people upload their paper to arXiv and then we can see who they are. I think this should prove to anyone that an anonymous arXiv is the crappiest idea. Why would you ever work against the core incentives of people? Clearly these authors have an incentive to make known who they are, and clearly we as readers have an incentive to figure it out, and to completely work against these incentives just seems dumb, it seems counterproductive, and it doesn't work, as you can see. What do you want to do, standardize the plotting styles? Standardize everything? Standardize the citations? I mean, come on. When they compare against things, they say our first point of comparison is, randomly, just Big Transfer, by these authors that we have no relation to, maybe, or maybe not. It's ridiculous.
You can't shield this with fake anonymity. It's actually counterproductive, and it only helps the big labs, this anonymity criterion. All right, let's actually dive into the paper after this rant. Well, yeah, don't worry, peer review: very pristine, very good, very anonymous, double-blind for sure.

The paper says: while the transformer architecture has become the de facto standard for natural language processing tasks (and we know this from the first "Attention Is All You Need" paper and things like BERT, GPT, GPT-2, GPT-3; transformers have revolutionized NLP), its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. Which is correct: in computer vision, convolutional networks have been incredibly successful since AlexNet, with ResNets being the major contributor there. Even this Big Transfer paper right here, all it does is scale up ResNets and then feed in more data. So CNNs are extremely powerful in computer vision. They continue: we show that this reliance on CNNs is not necessary, and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. And they go on saying that they outperform CNNs while requiring substantially fewer computational resources to train. Well, "substantially fewer" in these regimes of thousands of TPU-days is a bit ironic, honestly, but it's pretty cool.

So what's the deal with transformers and images? Classically, transformers are models that operate on sequences, or more precisely on sets. You'd have a set of words, which you can characterize as tokens, which I'm just going to draw as bubbles, and the transformer would somehow take all of these in and do something with them, and that something, in this particular case, is attention. Attention is a quadratic operation, which basically means you have to calculate the pairwise inner product between each pair of these bubbles, which becomes a very large task very quickly. You see, I even have trouble drawing, I think I drew this twice, but already with five tokens there are many, many interconnections, and you can imagine that if you are in NLP and have a paragraph that's maybe 500 tokens long, you need 500 squared connections. So this is one limitation of transformers: they work really well for NLP, but they are limited by the memory and compute requirements of that quadratic attention.

Images are therefore much harder for transformers, because an image is a raster of pixels, and there are many, many pixels to an image. Even ImageNet, which counts as large images in computer vision applications, is something like 250 by 250 pixels, which is small by human standards; we are used to looking at maybe 1000 or 2000 pixels of side length on a regular basis. To be clear, even the rasterization of this PDF, which you can recognize as blurry, is way more resolution than ImageNet images. So just the rasterization of images is a problem in itself, even for convolutional neural networks. But if you want to feed this into a transformer, you have to consider that every single pixel has to attend to every single other pixel. The image itself is 250 squared pixels, so the attention will cost you 250 squared, squared, which is impossible on current hardware, even for Google; well, maybe they can do it.
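Just to make that quadratic-cost argument concrete, here is a tiny scaled dot-product attention in NumPy. This is my own sketch, not code from the paper, and the token count and feature dimension are made up; the point is that the (n, n) score matrix is exactly the part that grows quadratically.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention; the (n, n) score matrix is the quadratic part."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # shape (n, n): n^2 entries
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

n_tokens, d = 500, 64                                # a 500-token NLP paragraph
q = k = v = np.random.randn(n_tokens, d)
out = attention(q, k, v)                             # score matrix: 500^2 = 250,000 entries

# Per-pixel attention on a 250x250 image would need (250*250)^2, roughly 3.9e9,
# score entries per head and per layer: the "250 squared, squared" blow-up above.
print(out.shape, (250 * 250) ** 2)
```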
So people have resorted to other things, like local attention, attending only to the area around each pixel, which of course is the foundational motivation behind convolutional neural networks: you learn kernels that are local and you slide them across the image as you go from layer to layer. In the first layer, this part might attend to a small cone around itself and that part to a cone around itself, but in the next layer, the thing that attends within the same cone will have a larger effective receptive field. So the receptive field grows with depth. Transformers, however, are able to attend within a single layer to everywhere.

This paper solves the problem not by going in the direction of local attention over pixels, but by doing global attention over image patches. They divide the image into patches, as you can see here, and one patch is in this case something like 16 by 16. They unroll these patches into a sequence (in the first instance it's really a set) and combine it with a positional embedding. Transformers naturally have no idea what is where; in a way, the transformer is a generalization of an MLP, of a feed-forward network. In a feed-forward network, you have fixed connections between the inputs and outputs, so this node here will always attend to that node there with the weight specified by that particular connection. In a transformer, this weight isn't a fixed number; it is computed on the fly, and it depends on what these exact inputs are. So while the MLP knows where information comes from, the transformer doesn't: it computes everything on the fly and is therefore permutation invariant. That's why a lot of applications add so-called positional embeddings to the inputs, where they simply say: this here is patch number one, this here is patch number two, this here is patch number three. You can do this in a more sophisticated way for images specifically, by saying this is position (1,1), this is (1,2), (1,3), then (2,1), (2,2), and so on. In the paper they claim they've tried this and it doesn't help; it's much easier to just say this is one, two, three, four, five.

And these are learnable embeddings, so you don't actually feed in the number one. What you have is a table with indices one, two, three, four, five and so on, each index is associated with a vector, and these vectors are learnable parameters. So whenever you say "this is the first patch", what you actually do is go to the table, grab the vector for index one, and put that vector, along with the patch, into the transformer. Now, the patch itself is still a small image, a 16 by 16 image, so you have to get it into a form the transformer can understand. One way of doing that is simply to unroll it: 16 by 16 is 256, so you get a 256-dimensional vector per color channel.
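Here is a small sketch of those two ingredients, patch unrolling and a learned position table, in NumPy. This is my own illustration rather than the authors' code, and the image size, patch size and embedding width are assumptions for the example.

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each unrolled into a flat vector of length p*p*C."""
    h, w, c = img.shape
    patches = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)            # (num_patches, p*p*C)

img = np.random.rand(224, 224, 3)                    # toy input image
patches = patchify(img)                              # (196, 768): a 14x14 grid of patches

# Learnable 1-D position table: one vector per patch index, trained like any other weight.
d_model = 768
pos_table = 0.02 * np.random.randn(patches.shape[0], d_model)
# Later, patch i gets pos_table[i] added to its embedding: "this is patch number i".
print(patches.shape, pos_table.shape)
```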
However, they find that it helps to first put that vector through a linear projection before it goes into the transformer. So there is one single matrix, which in this case is called E, for "embedding", haha. They take a patch like this, unroll it into a big vector, multiply that vector with the embedding matrix E, and that's what goes into the transformer, along with the position embedding: if this is, say, position seven, you go grab embedding number seven, add it, and put the result into the transformer. From here on it's a standard transformer, straight out of Attention Is All You Need.

What you also do is have one special input, a learnable embedding, like BERT's CLS embedding, that is not associated with any patch. You take the output of this particular input at the end and feed it into a standard classifier to classify the image. So it's a really simple architecture, except for the bottom part here: it's a transformer, one of the inputs is decided to be special, and the output at that position is taken as the classification. There are more outputs right here, but they are discarded; in the last layer, I would guess, they're not even computed, only this one, while in the other layers everything is always computed. So you have many transformer layers in here, and the transformer layers are of course made up from these blocks right here (not the embedded patches, but this block), and you see the multi-head attention: that's the expensive operation.

The paper completely discards the notion of convolutions. They have a variant where they, I believe, replace this patch embedding with a convolutional embedding, but I don't think it helps much; they really want to show that convolutions aren't necessary. I don't want to go too much into the details of the paper, also because it's subject to change; on OpenReview you can revise it and so on. But the experiments show, as you can see right here, that this Vision Transformer outperforms the convolutional networks by a pretty significant amount, sometimes small but sometimes also large, and costs less to train than these big convolutional networks, at least those of this one other paper. So it costs less to train. Of course, if you go with 16 by 16 patches, dividing your image into patches that are themselves bigger, your sequence of patches becomes shorter and you're computationally more efficient than with 14 by 14 patches. But the H (Huge) variant, I believe, also has more layers; there is actually a table up here: the Huge model has 32 layers and roughly double the number of parameters, and all of that gives you a higher computational requirement, still lower than the Big Transfer paper.

So the idea here is that you pre-train on these big datasets, like this JFT dataset, a weakly labeled dataset of 300 million images, and then you transfer to the other datasets, which just happen to be the same datasets this other paper used, plus the dataset that the same authors created after that paper came out. Don't worry about it.
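Putting the pieces described above together, here is a minimal PyTorch sketch of this kind of model. It is illustrative only, not the authors' implementation: the class name and dimensions are my assumptions, I use the stock nn.TransformerEncoder as the "standard transformer", and details such as GELU activations, dropout and the exact normalization placement are glossed over.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative ViT-style classifier: patch projection (matrix E) + [class] token
    + learned position embeddings + a standard transformer encoder + linear head."""
    def __init__(self, num_patches=196, patch_dim=768, d_model=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)                    # the single matrix E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))    # learnable special input
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, num_classes)                  # standard classifier

    def forward(self, patches):                    # patches: (B, num_patches, patch_dim)
        x = self.proj(patches)                     # embed each unrolled patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_emb   # prepend [class] token, add positions
        x = self.encoder(x)                        # standard multi-head self-attention blocks
        return self.head(x[:, 0])                  # classify from the [class] token's output

model = TinyViT()
logits = model(torch.randn(2, 196, 768))           # e.g. unrolled 16x16x3 patches of a 224 image
print(logits.shape)                                # torch.Size([2, 1000])
```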
They also test on this Visual Task Adaptation Benchmark, and you can see that, specifically in the natural-images subclass, actually both of these models make gains, but overall the Vision Transformer outperforms the ConvNets.

So what's the deal here, what's the deal with transformers? That's something I want to talk about; I don't want to go too much into the rest here. Of course, you can visualize the attention and see it's doing something sensible, and you can visualize the positional embeddings that are learned, which is pretty interesting. The position embeddings come out pretty sensible: you can see where they mostly pay attention to, and it seems like each positional embedding largely recognizes where it is in the image, even though you never tell it; you simply let it learn, and it ends up most related to the other positional embeddings in the same row or column. That's all sensible. You can also see the filters it learns; this is analogous to visualizing what convolutional networks learn, and it does something sensible, something we're very much used to. If you look at ConvNet visualizations, you'll see exactly filters like these. So it learns almost the same thing as convolutional neural networks, but it's not specifically programmed to do so.

Also, you can see that as you increase the depth of the network, the mean attention distance, the distance over which the attention goes, increases, and from about the middle of the network you pretty much have global computation. This is almost like the drawing I made of the CNN, where you have the different heads; in a CNN this curve would look like a line. The additional benefit you get in transformers is that at the very beginning you can already pay attention to things that are very far away. You cannot do that with convolutional networks or with local attention. So all this branch up here is the gain transformers can make: they can attend to very far away things right at the lower layers.
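For intuition, here is a rough sketch of how such a mean attention distance could be computed from one layer's attention weights. This is my own approximation of the idea (the paper's exact definition may differ in details), and the patch grid and patch size are assumptions.

```python
import numpy as np

def mean_attention_distance(attn, grid=14, patch=16):
    """attn: (heads, n, n) attention weights over n = grid*grid patch tokens.
    Returns, per head, the average image-space distance over which queries attend."""
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    centers = np.stack([ys.ravel(), xs.ravel()], axis=1) * patch       # patch centers in pixels
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)  # (n, n)
    return (attn * dists).sum(-1).mean(-1)   # weight distances by attention, average over queries

# Toy check: a head with uniform attention integrates information "globally" on average.
n = 14 * 14
uniform = np.full((1, n, n), 1.0 / n)
print(mean_attention_distance(uniform))      # large: on the order of half the image width
```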
Yeah, so, what's the deal with transformers? It seems like transformers are coming for everything. First, attention was introduced in LSTMs, and LSTMs with attention were the cool thing to do (and I think still are in some places in NLP), but then transformers completely replaced LSTMs in NLP, and now transformers are coming for vision. They have been paired with convolutions, as the introduction here said, but now they are replacing them. And here's what I think about this.

What you had in LSTMs and in convolutional neural networks were good inductive priors. Technically, if you think about it, something like an MLP, a feed-forward network like we looked at here, could learn any function, but it's kind of unstable: if you shift the image by a pixel, all the inputs are all weird, and so on. So a convolutional neural network for images seemed pretty good, because it has a good inductive prior, and the good inductive prior is this: probably what one pixel cares about is its immediate neighborhood, and then what that neighborhood as a whole cares about is its immediate neighborhood. That's sort of how we look at images: you integrate over small regions and then you connect the regions to each other, and so on. So this is a very sensible inductive prior for images, as the LSTM is for language. For language, the LSTM has the inductive bias of: first process this thing, remember some general state, then go to the next thing and incorporate it into the memory of what we already know, which updates our latent belief, then go to the next thing and incorporate that as well. That's how we read, and so the inductive prior of this model is actually very solid.

Inductive priors, or inductive biases: the name already contains it, it's a bias. We bias the model towards solutions that we think are, in general, useful. We tell the model: look, we know you could learn everything from data, no doubt about it, we have the theoretical results, but you don't have enough data, and we want to make it a bit easier for you, so we tell you that certain things, like convolutions, generally tend to be useful. So we restrict the model and bias it towards a certain solution, and the same goes for LSTMs. These are biases that we introduce in the statistical sense of bias, and they help the model become very good at tasks. However, now we are in a regime where we have lots and lots of data, and we know why it's called bias: because it will bias our estimator. Our estimator will no longer be the perfect one whose expected value matches the actual underlying quantity. Therefore we know that, given enough data, a biased model will in the end perform worse than an unbiased model; it's only in the not-enough-data limit that the biased model can perform better. I'm simplifying here, but now transformers come along, and transformers aren't just another architecture; transformers are basically a general compute thing. They're even more general than MLPs. People think MLPs are the most unbiased thing ever, because everything's connected to everything. No, transformers are actually more general, because not only is everything connected to everything, these connections are always computed on the fly. So a transformer is like the most general thing there is, in terms of deep learning, that we can train right now. I'm making bold statements, but that's how I think about it. If the CNN and the LSTM are more specialized MLPs, then the transformer is a less specialized MLP, and therefore it's not necessarily the architecture of the transformer that makes it so special; it's just the fact that it is a general computer.
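To make that "computed on the fly" point concrete, here is a tiny NumPy contrast; this is my own illustration with made-up dimensions. A feed-forward layer mixes tokens with a fixed, learned weight matrix tied to positions, while self-attention derives its mixing weights from the input itself, which is also why it needs positional embeddings to know what is where.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                 # 5 tokens, 16 features each

# Feed-forward mixing: W is a fixed parameter, so token i always listens to
# token j with the same strength, no matter what the input is.
W_fixed = rng.normal(size=(5, 5))
mlp_mix = W_fixed @ x

# Attention mixing: the weights are recomputed from the input itself.
Wq = rng.normal(size=(16, 16))
Wk = rng.normal(size=(16, 16))
A = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(16))   # depends on x, not fixed
attn_mix = A @ x

# Shuffle the tokens: attention's output is just shuffled the same way (it has no
# notion of position, hence positional embeddings), while the fixed-W mixing
# genuinely changes, because its weights are tied to input positions.
perm = rng.permutation(5)
xp = x[perm]
attn_perm = softmax((xp @ Wq) @ (xp @ Wk).T / np.sqrt(16)) @ xp
print(np.allclose(attn_perm, attn_mix[perm]))      # True
print(np.allclose(W_fixed @ xp, mlp_mix[perm]))    # False (in general)
```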
And we are now able to feed enough data into it such that it can actually learn these things. Not only can it learn the useful biases that we would otherwise have given it; you can see it learns the same things as a convolutional network, or very similar things: it learns these filters and so on, the kind of thing that, even before CNNs, we would have built in by hand as a wavelet filter. Back then we fed in wavelet-filtered inputs, and something like this would be at the top of the list. So it can learn that from scratch, but probably this thing is not exactly a wavelet filter; it's actually something that performs slightly better, something we couldn't have come up with as a bias to build in. And that's why it works better: it can learn almost the same things, but it can do so a bit better, because it has that much data.

So I believe the world is still open. Transformers aren't the end; transformers are simply one general computer. There can be others, there can be something even more general than a transformer, and the world is still wide open to build in inductive biases that are actually better than CNNs or LSTMs, or to build inductive biases into transformers, or, if you go in the other direction, to alleviate them. Because, as you can see right here (and in the formula you see this pretty well), there are inductive biases in the transformer, and if I had to guess, I would say the ones to go next are the skip connections in here. Now, the skip connections are very important for us to be able to train these architectures, because that's kind of where the gradient flows back; if you read the ResNet paper, the rationale is that you can go very deep and each layer only has to calculate the delta it applies to the input, roughly y = x + f(x), instead of transforming the input as such. It makes a lot of sense, but it is a strong inductive bias, and it pulls through all of the layers; as you can see here, the skip connection is pulled through all of the layers. This is a very strong inductive bias, and we tell the network: maybe it's sensible if you only calculate the diffs in each layer. If I had to guess, this is one of the next big things to go, once we have yet another order of magnitude more data and figure out how to train big networks without these skip connections.

So, as I said, it's not that transformers are very good architectures in the same sense that LSTMs and CNNs are very good architectures; it's the fact that transformers are so general that they are actually able to make use of the big data and the big compute that we now have and didn't have before, such that the inductive biases of the old models become unnecessary.

Again, totally random: check out this video if you're in the mood for a totally random, absolutely non-related paper to this one. Tell me what you think in the comments, and definitely keep an eye on this on OpenReview; it's going to be very, very interesting. All right, with that being said, that was it from me. Bye bye.
Info
Channel: Yannic Kilcher
Views: 126,464
Rating: 4.9464765 out of 5
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, attention mechanism, convolutional neural network, data science, cnn, transformer, attention is all you need, vaswani, beyer, google, google brain, google research, tpu, tpu v3, iclr, iclr 2021, peer review, anonymous, karpathy, andrej karpathy, twitter, review, under submission, big transfer, bit, vit, vision transformer, visual transformer, transformer images, transformer computer vision
Id: TrdevFK_am4
Length: 29min 56sec (1796 seconds)
Published: Sun Oct 04 2020