Finetuning Vision Transformers (VIT) with Huggingface Transformers and Tensorflow 2

Captions
Hello everyone and welcome to this new and exciting session, in which we are going to look at how to fine-tune an already trained Transformer, more precisely a Vision Transformer (ViT) model, using the Hugging Face library and TensorFlow 2. Hugging Face today is at the forefront of practical AI, and it lets practitioners around the world build, train and deploy state-of-the-art models very easily; it's also used by thousands of organizations and teams around the world. Don't forget to subscribe and hit that notification button so you never miss amazing content like this.

Here we have a host of different tasks which we could solve with readily available Hugging Face models, like audio classification, image classification, object detection, question answering, summarization, text classification and translation. In our specific case we are dealing with image classification. I have a tab open right here: you see we have this image classification page where you could already test an image on this ViT model, the ViT base model with patch size 16 and image size 224. We also have Facebook's DeiT base distilled model, again with patch size 16 and image size 224. You could browse a host of other image classification models here; as you can see, the models are sorted by number of downloads — this one was downloaded 219,000 times. We'll be working with, or rather fine-tuning, this ViT by Google right here. Here we have the model description (which we've already seen in the paper), the intended uses and limitations, and then how to use this model without any fine-tuning: you can pass in your image and already run classification on it using this ViTForImageClassification model.

You'll notice that this assumes we are dealing with a PyTorch model, so we can check the documentation, where you'll see this ViT model on the left side together with many other models which are available for free. We also have DeiT, the distillation Transformer — let's scroll down to D, there we go, here's DeiT and its documentation. We also have the Swin Transformer, which we looked at previously — let's check it out; there it is, the Swin Transformer we have already seen.

Now let's get back to our ViT, the Vision Transformer. You can go through the whole documentation here: we have the ViTConfig, the ViTFeatureExtractor (let's enlarge this so it becomes clearer), the ViTModel, ViTForMaskedImageModeling and ViTForImageClassification, and you'll notice we also have TFViTModel. The difference is that ViTModel is a PyTorch model, while TFViTModel is the TensorFlow model, which is what we'll be using — TFViTModel and TFViTForImageClassification. Recall that the ViT model starts from the patches; once you have the patches, you have the Transformer encoder, and from there you have the MLP head and finally your output. With TFViTForImageClassification we have all of this, from the patches right up to the MLP head, but with TFViTModel that last part, the MLP head, is not included, so what you get is only the encoder output.
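To make the difference between TFViTModel and TFViTForImageClassification concrete, here is a minimal sketch. The two checkpoints are the ones browsed above; the random channels-first tensor stands in for a preprocessed image, so in practice you would run a real image through the feature extractor first.

```python
import tensorflow as tf
from transformers import TFViTModel, TFViTForImageClassification

# Dummy channels-first batch: (batch, channels, height, width)
pixel_values = tf.random.uniform((1, 3, 224, 224))

# TFViTModel returns only the encoder outputs (no classification head).
backbone = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
hidden = backbone(pixel_values).last_hidden_state   # (1, 197, 768)

# TFViTForImageClassification adds the MLP head and returns class logits.
classifier = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
logits = classifier(pixel_values).logits             # (1, 1000) ImageNet classes

print(hidden.shape, logits.shape)
```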
So, with TFViTModel we just get those encoder outputs, whereas with TFViTForImageClassification we get everything up to the predicted classes. Apart from TensorFlow, they also have the code for Flax, so you could check out FlaxViTModel and FlaxViTForImageClassification. That said, we are now going to focus on fine-tuning this Vision Transformer model. Don't forget to subscribe and hit that notification button so you never miss amazing content like this.

Before doing anything, we'll start by installing the Transformers library, so we run pip install transformers. That's fine; we move on. Let's go ahead and start with the fine-tuning, but first let's get back to the documentation. Here we have the overview and the ViTConfig, and you'll notice that the different configuration values are basically what we've seen already: hidden size 768 as the default value, number of hidden layers 12 (so we're stacking 12 Transformer encoder blocks), number of attention heads 12, intermediate size 3072. This 3072 is for the dense layers in the MLP. The input coming in is, say, 1 by 256 by 768; up to this point the shape stays the same, and we expect the output to have the same shape, but inside the MLP we have two dense layers: the first dense layer converts it to 1 by 256 by 3072, and the next dense layer converts it back to 1 by 256 by 768. That's why they call 3072 the intermediate size — you can see it described as the dimensionality of the intermediate (feed-forward) layer in the Transformer encoder.

Then we have the hidden activation, GELU; the hidden dropout probability (so they have some dropout here); the attention probs dropout probability; the initializer range; and the layer norm epsilon value. To better understand this, we can go back to the LayerNormalization documentation in TensorFlow: epsilon there is 1e-3 by default, and we can see where exactly it's used. Recall that with normalization we compute x minus a given mean divided by a standard deviation; we do not want a situation where that denominator is zero and we get an infinite output, so we generally add some epsilon. By default in TensorFlow it's 0.001, while in Hugging Face here it is 1e-12. The model is encoder-only — there's no encoder-decoder — which is why is_encoder_decoder is set to False. Then image size 224, patch size 16, number of channels 3, whether to add a bias to the queries, keys and values (true), and encoder stride 16.
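As a rough illustration of the intermediate size and the layer-norm epsilon just described — a sketch only, using the 1 by 256 by 768 running example from the video rather than the 197-token sequence an actual 224-by-224 ViT produces:

```python
import tensorflow as tf

hidden_size, intermediate_size = 768, 3072

# Running example: batch of 1, 256 tokens, hidden size 768.
x = tf.random.uniform((1, 256, hidden_size))

# The MLP block: expand to the intermediate size, then project back down.
expand = tf.keras.layers.Dense(intermediate_size, activation="gelu")
project = tf.keras.layers.Dense(hidden_size)

y = expand(x)    # (1, 256, 3072)
z = project(y)   # (1, 256, 768)

# Layer normalization: Keras defaults to epsilon=1e-3, the Hugging Face ViT config uses 1e-12.
norm = tf.keras.layers.LayerNormalization(epsilon=1e-12)
print(y.shape, z.shape, norm(z).shape)
```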
You can remember the use of the stride from when we were extracting the patches: once we have an image and a patch size of 16 by 16, we move through 16 pixels to obtain the next patch, so that there is no gap or overlap between patches. Okay, so that's it — we understand the ViTConfig. You can also check out the usage example for ViTConfig, so let's copy this code and paste it here. There we go, we have the code pasted out here. You can see that you could easily create a ViT model without necessarily going through the whole process we had done before — that was just for educational purposes. If you want to build your own ViT, it's simple: you just specify the configuration, that is the ViTConfig, and then you initialize the model, and that's it. Let's run this, and let's also print out the configuration. We could change it — say we change the hidden size to some value like 144. This is how you could change the hidden size to suit your needs: you change that, run it, and you see you now have this new configuration.

Next we'll look at the feature extractor. The feature extractor is similar to what we've done already, that is taking in the input, resizing it, and then carrying out some normalization. You can check it in the documentation. The ViTModel there is a PyTorch model, so let's go to TFViTModel, which is for TensorFlow. There we go, we have our TFViTModel; you can expand the parameters, and here we have all the different arguments we could check out. There is also an example of how to use TFViTModel directly without going through any stressful process: first they have a dataset, and then an image extracted from it — you could instead get this image from your own dataset; they're loading it from the Hugging Face Hub, or rather Hugging Face Datasets. Then you have the feature extractor, and then the model. Note that the TFViTModel is created with from_pretrained, which means we are going to use a model which has already been trained, and you see the checkpoint specified here: vit-base-patch16-224-in21k.
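Before moving on, here is a minimal sketch of the two ways of getting a model that were just described: building one from a ViTConfig (including changing the hidden size to 144) and loading pretrained weights with from_pretrained. The docs snippet uses the PyTorch ViTModel, so using the TensorFlow class here is an adaptation:

```python
from transformers import ViTConfig, TFViTModel

# Default configuration: hidden_size=768, 12 layers, 12 attention heads,
# intermediate_size=3072, image_size=224, patch_size=16.
config = ViTConfig()
model = TFViTModel(config)          # a ViT with this config and random weights
print(config)

# The configuration can be customised, e.g. a smaller hidden size
# (it must stay divisible by the number of attention heads).
small_config = ViTConfig(hidden_size=144)
small_model = TFViTModel(small_config)

# Alternatively, load weights that have already been trained:
pretrained = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
```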
So that's it: the inputs are passed through the feature extractor before being passed to the model. We will see how to adapt this code so that we can fine-tune our own model in TensorFlow. As we said before, TFViTModel differs from the image classification model in that its outputs are not the final output classes but the hidden states from the Transformer encoder. Let's scroll to an example — there's one here: for this image it directly gives "cat", so that model predicts one of the 1,000 ImageNet classes, whereas here, with TFViTModel, we just have these hidden states.

We paste this code out here and get started with building our own ViT model based off this Hugging Face TFViTModel. We won't need the datasets part — we already have our own dataset — and we won't make use of the feature extractor. We keep the model; let's call it the Hugging Face model, and take the rest off. So we have the Hugging Face model, TFViTModel.from_pretrained, and that's it.

Now we're going to define some inputs. We have our Input layer, and we'll specify the shape; here we work with 224 by 224 by 3. Then, using the TensorFlow functional API, we get an output x by passing the inputs through this Hugging Face model — let's call it the base model. The base model takes in the inputs, and we call base_model.vit, this vit attribute, on the input, and from there we have our output x. When we run this, you see we download this 330-megabyte pre-trained model. We then build our Hugging Face model from these inputs and this output x. Running it again, we still get an error, and it's linked to the shape of the inputs. So let's change this to 3 by 224 by 224 and run again — you should see that now everything works fine. This means that the input of the base model, the Hugging Face model, should be of this shape: instead of 224 by 224 by 3 as we usually do, it's 3 by 224 by 224.
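Putting that together, a minimal sketch of wrapping the pretrained backbone with the Keras functional API and a channels-first input; the variable names base_model and hf_model follow the naming used in the video:

```python
import tensorflow as tf
from transformers import TFViTModel

# Pretrained ViT backbone (~330 MB download the first time).
base_model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

# The Hugging Face ViT expects channels-first inputs: (3, 224, 224).
inputs = tf.keras.Input(shape=(3, 224, 224))
x = base_model.vit(inputs).last_hidden_state   # (batch, 197, 768)
hf_model = tf.keras.Model(inputs=inputs, outputs=x)
```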
This means we need an extra layer which will convert 224 by 224 by 3 into 3 by 224 by 224 before passing it into our base model. Let's build that extra layer, taking inspiration from the resize-rescale layer which we had built already; we'll modify it specifically for this Hugging Face model, so we have this resize-rescale-for-Hugging-Face block. We resize so that every image which passes here is 224 by 224, we rescale, and then after rescaling we permute the axes by calling on this Permute layer. The way we call the Permute layer is such that we move the channel axis from the third position (the dimensions being 0, 1, 2, 3: batch by 224 by 224 by 3) to the first position, so the 3 goes into that position and the 1 and 2 shift to the right. The output will then be batch (the batch stays intact) by 3 by 224 by 224. Be careful to write (3, 1, 2) and not (3, 2, 1): we are going from height by width by channels to channels by height by width. So we just do 3, 1, 2 and that's it.

After the Input layer, before getting to the base model, x takes in this resize-rescale layer: the resize-rescale takes in the inputs, and what we pass to the base model is x. Let's set the input back to 224 by 224 by 3 as we're used to working, since the permutation handles the conversion, and run this again. Oh, we're getting an error: it says the layer is not defined — let's make sure that's how we called it. Oh wait, we just need to run that cell first; now everything is okay.

So now we'll pass in some inputs. We have this test image right here, and we have the model which takes in the test image. We also need to add the batch dimension, so we use expand_dims on the test image along axis 0. We run this, and we're told that one shape was expected but another was found. Let's go back up and change the input shape to 256 by 256, knowing that the resize will convert it back to 224, since our model takes 224. Let's run this again, and there we go, here's our output: we have the last hidden state, 1 by 197 by 768.
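A sketch of that resize-rescale-permute block and the quick test described above; the exact rescaling factor isn't shown in the video, so 1/255 is an assumption:

```python
import tensorflow as tf
from transformers import TFViTModel

IM_SIZE = 256  # incoming images; the Resizing layer brings them down to 224

base_model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

# Preprocessing adapted for the Hugging Face model:
# resize -> rescale -> move the channel axis from last to first.
resize_rescale_hf = tf.keras.Sequential([
    tf.keras.layers.Resizing(224, 224),
    tf.keras.layers.Rescaling(1.0 / 255),     # rescaling factor is an assumption
    tf.keras.layers.Permute((3, 1, 2)),       # (H, W, C) -> (C, H, W); batch axis untouched
])

inputs = tf.keras.Input(shape=(IM_SIZE, IM_SIZE, 3))
x = resize_rescale_hf(inputs)
x = base_model.vit(x).last_hidden_state       # (batch, 197, 768)
hf_model = tf.keras.Model(inputs=inputs, outputs=x)

# Quick check with a dummy test image; expand_dims adds the batch dimension.
test_image = tf.random.uniform((IM_SIZE, IM_SIZE, 3), maxval=255.0)
print(hf_model(tf.expand_dims(test_image, axis=0)).shape)   # (1, 197, 768)
```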
As we scroll, we also have this pooler output, 1 by 768, and scrolling down again we see that's all we get as outputs: the last hidden state and the pooler output. From the TFViTModel documentation, under the return values, you see the last hidden state and the pooler output, and we're told these are the two outputs we will always get, while the hidden states and the attentions are optional. If you want the attentions, all you need to do is set config.output_attentions = True (by default this is False), and for the hidden states we do the same with config.output_hidden_states = True. The documentation also explains the difference between the pooler output and the last hidden state — looking at the shapes, the pooler output is derived from just one of these positions, while the last hidden state is the full 1 by 197 by 768. This Hugging Face ViT model was built taking into consideration the class embedding, which means that if you want to carry out classification, you're better off taking the final hidden state of that class embedding.

If we want just the last hidden state, we can pick out the zeroth index of the outputs; we run this and we get only the last hidden state. And since we are interested only in the output corresponding to the class embedding, which sits at the zeroth position, we slice: for the first dimension we take all, for the next one we select the zeroth index, and for the last we take all. Let's run this again, and that's it, we have this output. Now, since we've converted this Hugging Face model into a TensorFlow model, we can call summary; scrolling down we see 86 million parameters, and we have the input, the sequential (which corresponds to the resize-rescale layer), the ViT, and then the slicing operator which lets us pick out that specific output.

Getting back here, we now add our final classifier. We'll call it the output: a Dense layer with the number of classes specified, and activation softmax as usual. Let's run this — we're getting an error because we didn't pass in x, so let's fix that and run again. Now, as the model is training, remember that the learning rate we use here isn't appropriate: we can't be using this kind of high learning rate when doing fine-tuning. We have to change it and use something lower, say 5 times 10 to the negative 5. So let's stop the training and restart the process.
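Putting the pieces of this passage together, a sketch of the full classification model; NUM_CLASSES is a hypothetical placeholder for the number of classes in your own dataset:

```python
import tensorflow as tf
from transformers import TFViTModel

NUM_CLASSES = 3   # hypothetical; use the number of classes in your dataset

base_model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

resize_rescale_hf = tf.keras.Sequential([
    tf.keras.layers.Resizing(224, 224),
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Permute((3, 1, 2)),
])

inputs = tf.keras.Input(shape=(256, 256, 3))
x = resize_rescale_hf(inputs)
x = base_model.vit(x).last_hidden_state          # (batch, 197, 768)
x = x[:, 0, :]                                   # hidden state of the class token
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

hf_classifier = tf.keras.Model(inputs=inputs, outputs=outputs)
hf_classifier.summary()                          # roughly 86 million parameters
```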
When we reinitialize our model and modify the learning rate, you see that the loss now drops much lower than what we had before with the higher learning rate, and our accuracy is already at 25 percent while we are still in the first epoch. So be careful when you're fine-tuning, or when you're updating all the parameters of an already trained model: make sure you use a very low learning rate.
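And a sketch of compiling with that low learning rate, continuing from the previous sketch's hf_classifier. The loss, metric, and the commented-out fit call (with hypothetical train_dataset / val_dataset) are assumptions, since the video only specifies the learning rate of 5e-5:

```python
import tensorflow as tf

# Fine-tuning with a deliberately low learning rate; loss and metrics are
# assumptions — adjust them to how your labels are encoded.
hf_classifier.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# hf_classifier.fit(train_dataset, validation_data=val_dataset, epochs=5)
```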
Info
Channel: Neuralearn
Views: 3,060
Id: 80mVsXJbyu0
Length: 23min 44sec (1424 seconds)
Published: Sat Jan 07 2023