Vision Transformers (ViT) Explained + Fine-tuning in Python

Captions
Vision and language are the two big domains in machine learning: two distinct disciplines with their own problems, best practices, and model architectures. Or at least, that used to be the case. The Vision Transformer, or ViT, marks the first step towards a merger of both fields into a single unified discipline. For the first time in the history of machine learning we have a single model architecture that is on track to become the dominant model in both language and vision. Before the Vision Transformer, transformers were known as language models and nothing more, but since its introduction there has been further work that has almost solidified their position as state of the art in vision. In this video we're going to dive into the Vision Transformer: we'll explain what it is, how it works, why it works, and we'll look at how we can actually use it and implement it with Python.

So let's get started with a very quick 101 introduction to transformers and the attention mechanism. Transformers were introduced in 2017 in the well-known paper "Attention Is All You Need". They quite literally changed the entire landscape of NLP, and this was very much thanks to something called the attention mechanism. In NLP, attention allows us to embed contextual meaning into the word-level or sub-word-level token embeddings within a model. What I mean by that is: say you have a sentence, and two words in that sentence are closely related. Attention allows the model to identify that relationship and to understand those words with respect to one another within the greater sentence.

This starts within the transformer model with tokens being embedded into a vector space based purely on what each token is, so the token for "bank" is mapped to a particular vector that represents the word "bank" without any consideration of the words surrounding it. With these token embeddings we can calculate dot products between them: we get a high score when they are aligned and a low score when they are not. Within the attention mechanism, this essentially lets us identify which words should be placed closer together within that vector space. For example, take three phrases: "a plane banks", "the grassy bank", and "the Bank of England". The initial embedding of the token "bank" is the same in all three, but through the encoder's attention mechanism we move the token embedding for "bank" closer in the vector space to the other relevant words in each phrase. In "a plane banks", the word "bank" gets moved closer towards words like airplane, plane, airport, flight, and so on. For "the grassy bank", the token embedding for "bank" gets moved towards the embedding space of grass, nature, fields. And for "the Bank of England", the word "bank" gets moved towards finance, money, and so on. As we go through the many encoder blocks containing the attention mechanism, we are essentially embedding more and more contextual meaning into those initial token embeddings.

Now, attention did occasionally find itself being used in convolutional neural networks, which were the previous state of the art in computer vision. Generally speaking this produced some benefit, but it is somewhat limited: attention is a heavy operation when there is a large number of items to compare, because with attention you're comparing every item against every other item within your input sequence.
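To make the dot-product idea concrete, here is a minimal sketch of single-head scaled dot-product attention in PyTorch. The tensor sizes and the random projections are purely illustrative — they are not taken from BERT or ViT:

```python
import torch
import torch.nn.functional as F

# Toy "token embeddings": a sequence of 5 tokens, each a 64-dimensional vector.
x = torch.randn(5, 64)

# Learned projections would normally produce queries, keys, and values;
# here we just use freshly initialized linear layers for illustration.
w_q, w_k, w_v = (torch.nn.Linear(64, 64) for _ in range(3))
q, k, v = w_q(x), w_k(x), w_v(x)

# Dot products between every query and every key -> a 5x5 score matrix.
scores = q @ k.T / (64 ** 0.5)

# Softmax turns the scores into attention weights that sum to 1 per token.
weights = F.softmax(scores, dim=-1)

# Each output embedding is a weighted mix of all value vectors,
# i.e. each token is re-expressed in terms of its context.
contextualised = weights @ v
print(contextualised.shape)  # torch.Size([5, 64])
```

The 5x5 score matrix is exactly the "everything against everything else" comparison mentioned above, which is why the cost grows so quickly with sequence length.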
So if your input sequence is even a relatively large image and you're comparing pixels to pixels with your attention mechanism, the number of comparisons becomes incredibly large very quickly. In the case of convolutional neural networks, attention could only really be applied in the later layers of the model, where there are fewer activations to compare after a few convolutions. That's better than nothing, but it limits the use of attention because you can't use it throughout the entire network. Transformer models in NLP have not had that limitation and can instead apply attention over many layers, literally from the very start of the model.

The setup used by BERT, a very well-known transformer model, involves several encoder layers. Within these encoder blocks there are a few different things going on: a normalization component, a multi-head attention component (essentially many attention operations happening in parallel), and a multilayer perceptron layer. Through each of these blocks we encode more and more information into our token embeddings, and at the end of this process we get these super information-rich vector embeddings. These embeddings are the ultimate output of the core of a transformer model, including the Vision Transformer. From there, what we tend to find with transformer models is that we add another few layers onto the end which act as the head of the transformer. These take the information-rich vector embeddings and translate them into predictions for a particular task. So you might have a classification head, an NER head, or a question-answering head; they all differ slightly, but at their core they translate those information-rich embeddings into some sort of meaningful prediction.

The Vision Transformer actually works in exactly the same way. The only difference is how we pre-process things before they are fed into the model: rather than consuming word or sub-word tokens like BERT and other language transformers, the Vision Transformer consumes image patches. The remainder of the transformer works in exactly the same way. So let's take a look at how we go from images to image patches, and then to patch embeddings. The high-level process is relatively simple: first, we split an image into image patches; second, we process those patches through a linear projection layer to get our initial patch embeddings; third, we prepend something called a class embedding to those patch embeddings; and finally, we sum the patch embeddings with something called positional embeddings. There are a lot of parallels between this process and what we see in language, and I'll relate the two where relevant. After all these steps we have our patch embeddings, processed in exactly the same way we would process token embeddings with a language transformer. But let's dive into each of these steps in a little more detail.

Our first step is the transformation of our image into image patches. In NLP we do the same thing: we take a sentence and translate it into a list of tokens. In this respect, images are sentences and image patches are word or sub-word tokens. If we didn't create these image patches, we could alternatively feed in the full set of pixels from an image.
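As a rough illustration of that first step, here is one way to carve an image tensor into flattened patches in PyTorch. This is a sketch only — in practice the transformers library does the patching and projection with a single strided convolution — and I use 16x16-pixel patches to match the model used later in the video:

```python
import torch

image = torch.randn(3, 224, 224)   # channels, height, width
patch_size = 16                    # 16x16-pixel patches, as in ViT-Base

# unfold carves the image into a grid of non-overlapping patches...
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)

# ...then we flatten each patch into a single vector of pixel values.
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)

print(patches.shape)  # torch.Size([196, 768]) -> 196 patches, 768 values each
```

With 16x16 patches, a 224x224 image becomes a sequence of 196 "tokens", which is the kind of sequence length attention handles comfortably.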
But, as I mentioned before, feeding in raw pixels basically rules out attention: the number of computations needed to compare every pixel against every other pixel would be very restrictive on the size of images we could input; we could only use very, very small images. Consider that attention requires the comparison of everything to everything else. If we are working with pixels and we have a 224x224-pixel image, that means we would have to perform 224 to the power of four comparisons, which is roughly 2.5 billion comparisons — pretty insane. And that's for a single attention layer; transformers have multiple attention layers, so it's already far too much. If instead we split our 224x224-pixel image into 14x14-pixel patches, that leaves us with 256 of these patches, and with that a single attention layer requires a much more manageable 9.8 million or so comparisons, which is far easier to do. With that, we can have a huge number of attention layers and still not come close to the cost of a single attention layer over the full image.

After building these image patches we move on to the linear projection step. For this we use a linear projection layer, which simply maps our image patch arrays to image patch vectors. By mapping the patches to these patch embeddings, we reformat them into the correct dimensionality to be input into our Vision Transformer — but we're not putting them into the model just yet, because there are two more steps.

Our third step is the learnable embedding, or class token. This is an idea that comes from BERT, which introduced the use of something called a CLS, or classifier, token. The CLS token was a special token prepended to every sentence input into BERT; like every other token, it was converted into an embedding and passed through several encoder layers. Two things make CLS special. First, it does not represent a real word, so it acts almost like a blank slate being input into the model. Second, the CLS token embedding produced after the many encoder blocks is the embedding fed into the classification head used as part of the pre-training process. Essentially, we end up embedding a general representation of the full sentence into this single token embedding, because in order for the model to make a good prediction about the sentence, it needs a general embedding of the whole sentence in that single token — only that single token embedding is passed into the classification head. The Vision Transformer applies the same logic: it adds something called a learnable embedding, or class embedding, to the embeddings as they are processed by the first layers of the model. This learnable embedding is practically the same thing as BERT's CLS token. It's also worth noting that it is potentially even more important for the Vision Transformer than for BERT, because BERT's main mode of pre-training is masked language modeling, which doesn't rely on the classification token, whereas the Vision Transformer's usual mode of pre-training is actually a classification task. In that sense, we can think of this class token, or class embedding, as being very critical to the overall performance and training of the Vision Transformer.
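Continuing the earlier sketch, the linear projection and the prepended class embedding could look something like this. The dimensions follow ViT-Base (768-dimensional embeddings), and the variable names are my own:

```python
import torch

hidden_dim = 768                               # embedding size used by ViT-Base
projection = torch.nn.Linear(3 * 16 * 16, hidden_dim)

# Flattened pixel patches, e.g. the (196, 768) tensor from the previous sketch.
patches = torch.randn(196, 3 * 16 * 16)
patch_embeddings = projection(patches)         # (196, 768) patch embeddings

# A single learnable vector acts as the class embedding / "blank slate".
class_token = torch.nn.Parameter(torch.zeros(1, hidden_dim))

# Prepend it so the sequence becomes 197 embeddings long.
sequence = torch.cat([class_token, patch_embeddings], dim=0)
print(sequence.shape)  # torch.Size([197, 768])
```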
Now, the final step we need to apply to our patch embeddings before they are actually fed into the model is to add something called positional embeddings. Positional embeddings are a common addition to transformers because, by default, transformers have no mechanism for tracking the position of inputs — no order is being considered. That's a problem for language, and also for vision, but let's think in terms of language for now: the order of words in a sentence is incredibly important. If you mix up the order of words, it becomes hard for a person to understand what the sentence is supposed to mean, and it can even mean something completely different. The same applies to images: if we start mixing up the image patches, there's a good chance we won't be able to understand what the image represents anymore. In fact, that's exactly what we get with jigsaw puzzles — a ton of little image patches that we need to put together in a certain order, and it takes people a long time to figure out what that order actually is. So the order of our image patches is clearly quite important, but by default transformers don't have a way of handling this, and that's where positional embeddings come in.

For the Vision Transformer, these positional embeddings are learned embeddings that are summed with the incoming patch embeddings. Because they are learned, they are adjusted during pre-training, and if we visualize the cosine similarity between them, we see that positional embeddings close to one another have higher similarity — and, in particular, positional embeddings within the same row or the same column as one another also have higher similarity. So there seems to be something logical going on here: patches within a similar area are pushed into a similar region of the vector space, and patches in distant areas are pushed away from each other. There's a sense of locality being introduced by these positional embeddings.

After adding the positional embeddings and patch embeddings together, we have our final patch embeddings, which are fed into the Vision Transformer and processed through the layers of encoder blocks and the attention mechanism described before — a typical transformer approach. That is the logic behind the Vision Transformer and the new innovations it brings.
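To round off the preprocessing logic, the positional embeddings are just another learned table that is summed with the class-token-plus-patch sequence from the previous sketch. Again, this is a minimal illustration rather than the library's actual code:

```python
import torch

# One learned positional vector per position: 196 patches + 1 class token.
positional_embeddings = torch.nn.Parameter(torch.zeros(197, 768))

# The (197, 768) class-token + patch-embedding sequence built above.
sequence = torch.randn(197, 768)

# An element-wise sum injects position information before the encoder layers.
encoder_input = sequence + positional_embeddings
print(encoder_input.shape)  # torch.Size([197, 768])
```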
Now I want to go through an example implementation of the Vision Transformer and how we can actually use it. We start by installing the prerequisites: datasets and transformers from Hugging Face, plus PyTorch. Once that has run, we download a dataset that we can test all of this on and also fine-tune with. We'll be using the CIFAR-10 dataset, which we get from Hugging Face Datasets with "from datasets import load_dataset". We let that run — and one thing to check before going through everything is that we're using a GPU; if not, switch the runtime over and rerun everything. After the download we'll see that we have 50,000 images with classification labels in our train split, and we also download the test split, which has 10,000 of these.

Then we want to take a quick look at the labels. We have 10 labels — that's why it's called CIFAR-10 — covering classes like airplane, automobile, and so on. From there we can look at a single item within the dataset: we get a Python PIL object, which is essentially an image, and also the label. In this case that label corresponds to "airplane", because it's number 0, and we can check that by displaying the image. We can't really see it very well — it's very small — but it is an airplane, and we can map the label (0) through labels.names to get the actual human-readable class label.

Okay, cool. Next we load the Vision Transformer feature extractor. We'll be using the google/vit-base-patch16-224-in21k model from the Hugging Face Hub. What that name means is that the patches are 16x16 pixels, they are built from a 224x224-pixel image during pre-training, and "in21k" says the model has been pre-trained on the ImageNet-21k dataset. That is the model we'll be using, and we use this feature extractor, which is almost like a preprocessor for this particular model. We run that — it downloads the feature extractor pretty quickly — and we can see its configuration. So what is this feature extractor doing exactly? It takes an image, which can be any size and in a lot of different formats, and normalizes and resizes it into something we can then process with our Vision Transformer. We can see from the configuration that it will normalize the pixel values of the image, using the given per-channel values for red, green, and blue, and that it will resize the image to 224x224 pixels.

If we take our first image — the plane — we can run the feature extractor on it, returning tensors in PyTorch format because we're using PyTorch later on. What this returns is a dictionary containing a single key-value pair, "pixel_values", which maps to a single tensor. Looking at the shape of that tensor, we see a 224x224 pixel-values tensor. That is different from the original image: checking the size of the first training image, it's 32x32, so it has been resized up to 224x224, which is the format the Vision Transformer needs.

Now, later on we're going to want to train everything on GPU, not CPU. By default this tensor is on the CPU; we don't want that, we want to use the GPU wherever possible, so we check whether a CUDA-enabled GPU is available and use it if so. There is one available — we're on Colab — so that's great, it means everything will be much faster. The reason we need this is that we want to move everything onto that device, so where we use the feature extractor we append .to(device) so that everything is moved to the GPU for us.
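Putting the data-loading and preprocessing steps described above together, the code looks roughly like the sketch below. The model ID and dataset come from the video, but exact class names may differ between transformers versions (newer releases call the feature extractor ViTImageProcessor), so treat this as a sketch:

```python
import torch
from datasets import load_dataset
from transformers import ViTFeatureExtractor

model_id = 'google/vit-base-patch16-224-in21k'

# CIFAR-10 from the Hugging Face Hub: 50,000 train / 10,000 test images.
train = load_dataset('cifar10', split='train')
test = load_dataset('cifar10', split='test')
labels = train.features['label'].names          # ['airplane', 'automobile', ...]

feature_extractor = ViTFeatureExtractor.from_pretrained(model_id)

# Used later to move the model (not the inputs) onto the GPU if one is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def preprocess(batch):
    # Resize and normalize the PIL images into 224x224 pixel-value tensors.
    inputs = feature_extractor([img for img in batch['img']], return_tensors='pt')
    inputs['labels'] = batch['label']
    return inputs

# with_transform applies the preprocessing lazily, each time a batch is accessed.
prepared_train = train.with_transform(preprocess)
prepared_test = test.with_transform(preprocess)

print(prepared_train[0]['pixel_values'].shape)  # torch.Size([3, 224, 224])
```

Note that this version deliberately does not call .to(device) inside the preprocessing function — as mentioned in the walkthrough, the Trainer handles device placement of the inputs itself.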
Then we use with_transform to apply that preprocessing to both the training and the testing datasets — although in reality we're going to be using the test dataset more as a validation dataset. After all that, we're ready to move on to the model fine-tuning step. There are a few things to define: the training and testing datasets (already done), the feature extractor (also done), the model (easy to define), something called a collate function, an evaluation metric, and some training arguments.

Let's start with the collate function. When we're training with the Hugging Face Trainer, we need a way to collate all of our data into batches in a way that makes sense, and that requires a dictionary format: each record is represented by a dictionary containing the inputs, which are the pixel values, and also the labels. We run that, then we define our evaluation metric, which is going to be accuracy — you can read through it if you like, but it's pretty straightforward. Then we have the training arguments. These are essentially just the training parameters we're going to use to train the model: the batch size, where we're going to output the model, the number of training epochs, how often we want to evaluate the model (i.e. run it on the validation/test dataset), what learning rate to use, and so on. We run that, and it just sets up the configuration for our training.

Then we move on to initializing the model. Again, this uses the same approach as before: just as with the feature extractor, we initialize it from_pretrained with the model ID, the vit-base-patch16-224-in21k you saw before. One thing we do need to add here, because we're doing image classification, is the number of labels, or classes, that will be output from the classification head — in this case 10. We define that, move the model to the GPU, and with that we're ready to initialize our Trainer with all of the things we've just defined. To actually train the model we call trainer.train(); after that we can save the model, log our metrics, save our metrics, and save the current state of the trainer.

So I run that, and it seems we get an error, which I think is because we're trying to move the input tensors to the GPU twice — the Trainer does it by default, but earlier on we added .to(device) ourselves. So we remove that from the preprocessing function, rerun everything, pass things over to the Trainer, and try training again. It looks like we're having a lot more luck this time: the model is training, and it actually doesn't take too long, but I'm going to skip forward and stop it here. You can run the evaluation to get and view your evaluation metrics — the model is evaluated as it trains thanks to the Trainer — but if you'd like to check again, you can just run it.
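Assuming the prepared_train, prepared_test, labels, feature_extractor, device, and model_id objects from the previous sketch, the fine-tuning setup described above could look roughly like this. The hyperparameter values are illustrative rather than the exact ones used in the video, and argument names can vary slightly between transformers versions:

```python
import numpy as np
import torch
from transformers import Trainer, TrainingArguments, ViTForImageClassification

def collate_fn(batch):
    # Stack the per-example pixel-value tensors and labels into batch tensors.
    return {
        'pixel_values': torch.stack([example['pixel_values'] for example in batch]),
        'labels': torch.tensor([example['labels'] for example in batch]),
    }

def compute_metrics(eval_pred):
    # Simple accuracy: fraction of argmax predictions matching the true labels.
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return {'accuracy': float((predictions == eval_pred.label_ids).mean())}

training_args = TrainingArguments(
    output_dir='./vit-cifar10',         # hypothetical output directory
    per_device_train_batch_size=16,
    num_train_epochs=4,
    evaluation_strategy='steps',
    eval_steps=100,
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    remove_unused_columns=False,        # keep the raw 'img' column for with_transform
)

model = ViTForImageClassification.from_pretrained(model_id, num_labels=len(labels))
model.to(device)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_train,
    eval_dataset=prepared_test,
    tokenizer=feature_extractor,        # so the preprocessor is saved with the model
)

trainer.train()
trainer.evaluate()
```

The remove_unused_columns=False setting matters here: without it, the Trainer would drop the raw image column before the with_transform preprocessing ever sees it.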
For now, though, let's look at a specific example. We load this image — I honestly can't tell what it is; if we come down here, it should be a cat. We run it and see that it is indeed supposed to be a cat; it's very blurry and I personally can't tell. Next we load a fine-tuned model — a model that has been fine-tuned using this same process — which we can download from the Hugging Face Hub. We can also download its feature extractor; we don't strictly need to here, because it uses the same feature extractor, but in a real use-case scenario you would probably download everything from the particular model hosted on the Hugging Face Hub, so that's what you would do. We run that, which downloads the fine-tuned model, and we can see that it has the exact same feature extractor configuration.

We process our image through the feature extractor, returning PyTorch tensors, and then we use "with torch.no_grad()", which essentially makes sure we're not computing gradients like we would during fine-tuning — we're just making a prediction here, we don't want to train anything. We pass the inputs through the model and extract the logits, which are just the output activations, and then take the argmax: the position where the logits are highest corresponds to the class with the highest predicted probability. We extract that, map it to the labels, and output the label — and if we run it, we see that we get "cat". So it looks like we have fine-tuned a Vision Transformer using that same process, and the prediction is accurate.

Before 2021 — which really isn't that long ago — transformers were known as just being language models; they weren't used for much else. But now, as we can see, we're able to use transformers and get really good results in the field of computer vision, and we're seeing them used in a lot of places. The Vision Transformer is a key component of OpenAI's CLIP model, and CLIP is a key component of all the diffusion models that have popped up everywhere and that the world is going crazy over right now. Transformers are also a key component of Tesla's self-driving system. They are finding use in a huge number of places that would have been completely unexpected a year or even two or three years ago, and I think as time progresses we will undoubtedly see more use of transformers within computer vision, the continued use of transformers within language, and the two fields becoming more and more unified over time.

That's it for this video. I hope all of this has been useful and interesting, so thank you very much for watching, and I'll see you again in the next one. Bye!
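Finally, as a recap of the inference step walked through above, here is a minimal sketch. The checkpoint path and image filename are placeholders — substitute the model you fine-tuned (or any fine-tuned ViT classifier from the Hub) and your own image:

```python
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTForImageClassification

checkpoint = './vit-cifar10'   # placeholder: your fine-tuned model directory or a Hub ID
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(checkpoint)

image = Image.open('cat.png')  # placeholder image path

inputs = feature_extractor(image, return_tensors='pt')
with torch.no_grad():                    # no gradients needed for a prediction
    logits = model(**inputs).logits

predicted = logits.argmax(-1).item()     # index of the highest-scoring class
# If id2label wasn't set when fine-tuning, index into your own label list instead.
print(model.config.id2label[predicted])
```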
Info
Channel: James Briggs
Views: 21,563
Keywords: natural language processing, nlp, Huggingface, semantic search, similarity search, vector similarity search, vector search, vision transformer, vision transformer pytorch, vision transformer explained, vision transformer code, vision transformers huggingface, computer vision transformers, vit huggingface, hugging face, hugging face tutorial, transformers pytorch, huggingface pytorch, huggingface tensorflow, vision transformer tensorflow, huggingface pipeline, james briggs
Id: qU7wO02urYU
Length: 30min 27sec (1827 seconds)
Published: Wed Nov 23 2022