Implement and Train ViT From Scratch for Image Recognition - PyTorch

Captions
Hey guys, what's up! Today we are going to implement something that is a nice blend between NLP and computer vision: the Vision Transformer. We will go over it briefly, and I have also prepared another video where I explain the paper in detail, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". In this video we will go over it briefly, implement it, and finally train it.

The motivation for the paper is that Transformers are very strong in NLP and usually give us state-of-the-art results, so we want to leverage that in computer vision too. But how are we going to feed images to a Transformer? The paper comes with a solution: take the image and divide it into what are called patches, as you can see on the left of the architecture figure — a patch, a patch, a patch, and so on. After we've divided the image into patches, we feed them into a Transformer encoder, just like BERT, and those patches act like tokens to the Transformer. In the paper the patches are 16x16 in size, which is where the name of the paper comes from.

Another thing to look at in this architecture is that, after we apply a linear projection to the patches, we feed position embeddings along with them — the purple ones in the figure. If you are familiar with BERT, we fed position embeddings to BERT too, so that is what this is all about.

The paper also mentions a learnable class embedding. Say we are doing sentence classification: instead of passing every output to the classifier, we store all the information in a token at the beginning of the sentence, usually a CLS token, and we only classify that token, say as positive or negative. The same logic applies here: we have a learnable embedding token at the beginning, denoted by an asterisk in the figure, and we classify only that. Keep in mind that it adds one token to the total, so the sequence length will be the number of patches plus one extra learnable embedding. We will implement that CLS token too.

Another thing we need to know when implementing this is how to compute the number of patches. For example, say we have a 3x3 image with a patch size of 1: one patch, one patch, one patch, and so on. To calculate the number of patches, we divide the image size (3) by the patch size (1) and square the result; that gives the number of patches, and we will actually use this formula. Likewise, to calculate the dimension of a patch — say a 1x1 patch — we multiply the patch height and width together with the number of input channels; in this case that is 1 x 1 x 3 = 3. So now we know how to calculate the number of patches and the embedding dimension of a patch.
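As a quick check, here are those two formulas in code (a minimal sketch, using the 3x3 example above):

```python
img_size, patch_size, in_channels = 3, 1, 3
num_patches = (img_size // patch_size) ** 2        # (3 / 1)^2 = 9 patches
patch_dim = patch_size * patch_size * in_channels  # 1 * 1 * 3 = 3 values per patch
print(num_patches, patch_dim)  # 9 3
```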
What happens after we feed everything — the position embeddings and the patches — to the Transformer encoder is that we take the learnable embedding, feed it to an MLP head (you can call it a classification head if you like), and classify the token as bird, ball, car, etc. In the paper an image of 224x224 with a patch size of 16x16 is usually used, but for our purposes we will use a smaller image and a smaller patch size.

If it is all clear, let's jump into the coding. For this I am using a Kaggle notebook, but you can use Jupyter or Google Colab, whatever you like. We will start by importing torch, and after that nn: from torch import nn.

Now let's start defining hyperparameters. The first one is the learning rate; I will use 1e-4. You can use whatever you want, but it worked well for me. We will need the number of classes, which says how many classes we will classify: if it is only cat vs. dog it will be two, and otherwise it changes accordingly. Here I will say 10, because we will train on the MNIST data. After that we need the patch size, which will be 4 in this case, and our image size will be 28. We will have 1 input channel and 8 heads. What is the number of heads? As you can see there is a Transformer encoder here, and a Transformer encoder has attention heads; this value decides how many attention heads we are going to use, which is eight in our case. After that we need a dropout value, for which I will use 0.001. We will need a hidden dimension, the hidden dimension of the MLP head; I will give it 768, and like I said, you can play around with it.

Next we need the Adam weight decay, the weight decay value we will give to our optimizer. In the paper they use a decay of 0.1; however, when I tried it, it didn't work that well, so I will keep it at zero, which is the default value. We also need betas for our Adam optimizer: the paper used a beta1 of 0.9 and a beta2 of 0.999, and we will copy those exact numbers. After that we need the activation function; the paper uses GELU ("the MLP contains a GELU non-linearity"), so let's just say "gelu". Then we need the number of encoders. In the figure you see only one Transformer encoder, but in reality we usually stack them on top of each other; in this case I want to stack four of them, so the number of encoders is four.

After that we have to define the embedding dimension. We talked about how to calculate it: if we have a 1x1 patch with three channels, we just multiply everything together. Here, instead of multiplying height and width, which are the same, we square the patch size and multiply by the input channels: 4 * 4 = 16, and 16 * 1 = 16, so it is 16. The other thing we need is the number of patches, which we also calculated before: take the height or width, whichever you like, divide by the patch size, and square it, which is (28 / 4)^2 = 49 in this case. After that, let's define our device and see what we are actually working with: it will be "cuda" if torch.cuda.is_available(), else we will run on the CPU.
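Collecting the hyperparameters described above in one place, roughly as they appear in the video (the constant names are my own):

```python
import torch
from torch import nn

LEARNING_RATE = 1e-4
NUM_CLASSES = 10          # MNIST digits 0-9
PATCH_SIZE = 4
IMG_SIZE = 28
IN_CHANNELS = 1           # grayscale
NUM_HEADS = 8
DROPOUT = 0.001
HIDDEN_DIM = 768          # hidden size of the MLP head
ADAM_WEIGHT_DECAY = 0     # the paper uses 0.1, but 0 worked better here
ADAM_BETAS = (0.9, 0.999)
ACTIVATION = "gelu"
NUM_ENCODERS = 4          # stacked transformer encoder layers
EMBED_DIM = (PATCH_SIZE ** 2) * IN_CHANNELS   # 4 * 4 * 1 = 16
NUM_PATCHES = (IMG_SIZE // PATCH_SIZE) ** 2   # (28 / 4)^2 = 49

device = "cuda" if torch.cuda.is_available() else "cpu"
```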
With the hyperparameters in place, the first part we are going to implement is everything up to the Transformer encoder: dividing the image into patches, creating the CLS token, and merging them with the position embeddings. I will call the class PatchEmbedding, and it will inherit from nn.Module. We start by defining __init__, which takes the embedding dimension, the patch size, the number of patches, dropout, and the input channels, and since it is a PyTorch class we call super().__init__().

Now let's define the function that will divide the image into patches; I will call it the patcher. For that we use nn.Sequential with a convolution, nn.Conv2d — we will just get help from that. We give it some parameters: in_channels equal to the input channels, out_channels equal to the embedding dimension, kernel_size equal to the patch size, and stride equal to the patch size. What this does is take the image and divide it into squares of patch size. After that we flatten it with nn.Flatten(2), and it is ready to go.

The next thing to define is the CLS token, the one we will do the classification with — the asterisk in the figure. We call it self.cls_token and define it as an nn.Parameter using torch.randn, which generates random numbers. You may see people initializing it with zeros, but I prefer random numbers; you can use zeros too. Inside we say 1, the input channels, and the embedding dimension, and since it is a learnable parameter we say requires_grad=True. The 1 in that shape will be expanded to our batch size, and the rest is just the input channels and the embedding dimension.

Next we need the position embeddings — the purple ones that we merge with our patches. Again we use nn.Parameter with torch.randn, and the size is 1 by the number of patches plus one by the embedding dimension. There is a plus one because the figure has patches 1 through 9 and, additionally, the CLS token acting as a patch. It is a learnable parameter again, so requires_grad=True. Lastly we need dropout, simply nn.Dropout(p=dropout).

Now we continue with the forward pass: def forward(self, x). We start by expanding the CLS token to the batch size of our input with self.cls_token.expand(x.shape[0], -1, -1), where -1 means that dimension is protected and left unchanged. Then we get our patches with self.patcher(x) and permute the result with (0, 2, 1). Now we have the patches, and we want to prepend the CLS token on the left, so we say torch.cat([cls_token, x], dim=1); this line merges the CLS token with the patches. After that we apply dropout with self.dropout(x), and we return x.
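Putting the class together, it looks roughly like this (a sketch; note that I also include the position-embedding addition in forward, which the video patches in a bit later during testing):

```python
class PatchEmbedding(nn.Module):
    def __init__(self, embed_dim, patch_size, num_patches, dropout, in_channels):
        super().__init__()
        # Conv2d with kernel_size == stride == patch_size cuts the image into
        # non-overlapping patches and projects each one to embed_dim.
        self.patcher = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim,
                      kernel_size=patch_size, stride=patch_size),
            nn.Flatten(2),
        )
        # The video passes in_channels as the middle dim; it equals 1 for MNIST,
        # so (1, 1, embed_dim) is the shape that actually works in general.
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim),
                                      requires_grad=True)
        self.position_embeddings = nn.Parameter(
            torch.randn(1, num_patches + 1, embed_dim), requires_grad=True)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        cls_token = self.cls_token.expand(x.shape[0], -1, -1)  # one per batch item
        x = self.patcher(x).permute(0, 2, 1)  # (B, num_patches, embed_dim)
        x = torch.cat([cls_token, x], dim=1)  # prepend the CLS token
        x = self.position_embeddings + x      # add learnable position info
        x = self.dropout(x)
        return x
```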
It is always good practice to check whether the model you defined works or not, so that's what we do now. We say model = PatchEmbedding with the embedding dimension, patch size, number of patches, dropout, and input channels, put it on our device, and define a dummy input x = torch.randn with dimensions of a batch size of, let's say, 512, 1 input channel, and an image size of 28x28. This will act as a dummy image; let's feed it to our newly defined model and print the output shape. We get an error, "size is not defined", because we forgot an equals sign; then another error, "num_patches is not defined", because of a typo; and "permute", because there is no "perter". Now we get our result: the batch size is 512, just like what we fed in. The second dimension is 50, which denotes the number of tokens we will feed; that is also correct, because we have 49 patches plus the additional CLS token, which makes 50. And 16 is the embedding dimension of our patches, as you can see. So it is working well.
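The check itself is just a few lines (a sketch; 512 is an arbitrary dummy batch size):

```python
model = PatchEmbedding(EMBED_DIM, PATCH_SIZE, NUM_PATCHES, DROPOUT, IN_CHANNELS).to(device)
x = torch.randn(512, 1, 28, 28).to(device)  # dummy batch of MNIST-sized images
print(model(x).shape)  # torch.Size([512, 50, 16]) -> (batch, 49 patches + CLS, embed_dim)
```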
Now it's time for the second part of the implementation. We have the patch embedding; now we will put in the Transformer encoder in such a way that we can feed our embeddings into it, define our MLP head — the classification end — and get our results. So let's get to it. I will call the class ViT, which stands for Vision Transformer, and it inherits from nn.Module. In __init__ it takes the number of patches, the image size, the number of classes, the patch size, the embedding dimension, the number of encoders, the number of heads, the hidden dimension, dropout, the activation, and the input channels, and again we call super().__init__().

First we define the embeddings block using the class we previously defined: self.embeddings_block = PatchEmbedding(...). It will divide the image into patches and give us the input to our encoders; its arguments are the embedding dimension, patch size, number of patches, dropout, and input channels.

Next we need the encoder layer. As you may see, we are not going to implement the whole Transformer encoder from scratch; that is out of scope. Instead we will import it directly from PyTorch. First we create a layer with nn.TransformerEncoderLayer, with d_model equal to the embedding dimension, nhead equal to the number of heads we defined, dropout equal to dropout, activation equal to the activation, which is GELU, and batch_first=True. What batch_first means is that when we defined our images, the batch size came first, so we tell the encoder layer that the batch dimension comes first in its input. Alongside that we say norm_first=True. This is only one encoder layer and, like I said, we will stack them on top of each other: we say self.encoder_blocks = nn.TransformerEncoder, give it the encoder layer we just defined, and specify num_layers as the number of encoders, which we set to four in the beginning.

Now we move on to the MLP head: self.mlp_head = nn.Sequential, where first we do a normalization, nn.LayerNorm with normalized_shape equal to the embedding dimension, and then we apply a linear layer for classification purposes, with in_features equal to the embedding dimension and out_features equal to the number of classes, which is 10 in this case.

Now that it is implemented, it is time to define the forward pass: def forward(self, x). The first thing is x = self.embeddings_block(x), which gets us the embeddings and the patches of the image. After that we feed it to the Transformer encoder: x = self.encoder_blocks(x). Then we feed it to the MLP head, taking only the CLS token — the token at index zero. Like we mentioned in the beginning, we don't classify the whole sequence of embeddings; instead we classify only the CLS token at the start, just like in a sentence classification task. It contains all the information the other embeddings have, it is a learnable parameter, and we feed it to the MLP head, classify it, and return x.

Again, it is good practice to check our work, so let's say model = ViT with the number of patches, image size, number of classes, patch size, embedding dimension, number of encoders, number of heads, hidden dimension, dropout, activation, and input channels, and put it on the device — actually, we have to put it on the device; I missed that. We define a dummy image as input again, torch.randn with batch size 512, 1 input channel, and a height and width of 28x28, feed it to our Vision Transformer model, and print the shape. We get an error, "takes 11 positional arguments", because we forgot the image size in the call; let's run it again. "embed_dim is not defined", because there is no such thing — it is spelled with an E. Another thing we missed is that after we concatenate the CLS token with the input patches, we actually need to merge them with the position embeddings; we have to add them together. The way we do that is, after the torch.cat concatenation, we say x = self.position_embeddings + x; we just add them up.
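With that fix in place (inside PatchEmbedding.forward, already shown above), the whole model and its check come out roughly like this. It's a sketch; img_size and hidden_dim are kept in the signature to match the video's 11-argument call even though this minimal head does not use them:

```python
class ViT(nn.Module):
    def __init__(self, num_patches, img_size, num_classes, patch_size, embed_dim,
                 num_encoders, num_heads, hidden_dim, dropout, activation,
                 in_channels):
        super().__init__()
        # img_size and hidden_dim are unused here; kept for parity with the video.
        self.embeddings_block = PatchEmbedding(embed_dim, patch_size,
                                               num_patches, dropout, in_channels)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dropout=dropout,
            activation=activation, batch_first=True, norm_first=True)
        self.encoder_blocks = nn.TransformerEncoder(encoder_layer,
                                                    num_layers=num_encoders)
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(normalized_shape=embed_dim),
            nn.Linear(in_features=embed_dim, out_features=num_classes),
        )

    def forward(self, x):
        x = self.embeddings_block(x)   # patches + CLS token + position embeddings
        x = self.encoder_blocks(x)     # stacked transformer encoder layers
        x = self.mlp_head(x[:, 0, :])  # classify only the CLS token (index 0)
        return x

model = ViT(NUM_PATCHES, IMG_SIZE, NUM_CLASSES, PATCH_SIZE, EMBED_DIM,
            NUM_ENCODERS, NUM_HEADS, HIDDEN_DIM, DROPOUT, ACTIVATION,
            IN_CHANNELS).to(device)
x = torch.randn(512, 1, 28, 28).to(device)
print(model(x).shape)  # torch.Size([512, 10])
```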
And we got the result: 512 by 10, which is exactly what we want. What it means is that we gave it 512 batches — you can call it 512 images — and we have 10 classes, so it returned a distribution over the ten classes for each image. Think of it like this: we have x1, x2, x3 as inputs, and for each one we have a result y1, y2, y3, and so on. y1 can be a dog, y2 can be a cat, y3 can be a cow, whatever; each of them gets a score per class, and when we are doing prediction we choose the one with the highest score. This is exactly what we want, and we can see that our Vision Transformer implementation is complete and working well.

The next thing we are going to do is take the MNIST dataset, train on it, and see how it performs. For training we will use the MNIST data, which you can access on Kaggle. Let's go over the dataset: we will have train and test files. The train file has a label followed by the pixel values, and the test file contains only the pixel values. You can go to the Digit Recognizer competition and just download all the files, or, if you are working in a Kaggle notebook, you can click "Add data", type "digit recognizer", and click the plus button when it pops up; it will be attached. Either way is fine — all we need is the data.

Before we import any of it, we will import some additional libraries, since we are going to do the training. The first of them is pandas: import pandas as pd. After that, from torch we import optim, for using Adam. From torch.utils.data we import DataLoader and Dataset. From torchvision we import transforms; we will do some data augmentation and similar things with it. From sklearn.model_selection we import train_test_split, to split the data. We import matplotlib.pyplot as plt to plot the data and the results. After that we import numpy as np, we import random, we import timeit to estimate the training time, and from tqdm we import tqdm for a nice-looking progress bar (we forgot an "s" in transforms there, by the way).

Now we add some additional hyperparameters. The first one is a random seed, so that our code will be reproducible. The second one is the batch size; I will give it 512, and depending on your GPU you can increase or decrease it. The number of epochs I will set to 40. Why this high number? When we use Transformers anywhere else, we may train for small numbers of epochs like two, three, or even one, but in those cases the models were already pretrained on massive amounts of data and we were just fine-tuning them on a smaller dataset. Here our Transformer model hasn't been trained on anything — basically we are training the foundation model from the beginning, with a small dataset compared to what a Transformer usually sees — so 40 seems reasonable, but you can change it however you like. We don't need anything else here; if we do, we can just come back and edit.

Now, to make sure of reproducibility, we do a few things: random.seed(RANDOM_SEED), np.random.seed(RANDOM_SEED), torch.manual_seed(RANDOM_SEED), torch.cuda.manual_seed(RANDOM_SEED), and torch.cuda.manual_seed_all(RANDOM_SEED). After that we set torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False, and we run it. It ran without a problem.
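All of those imports and the seeding boilerplate in one place (a sketch; the seed value itself is my choice, the video just asks for a fixed one):

```python
import random
import timeit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch import optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from sklearn.model_selection import train_test_split
from tqdm import tqdm

RANDOM_SEED = 42   # any fixed value works; pick one and keep it
BATCH_SIZE = 512
EPOCHS = 40

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed_all(RANDOM_SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```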
Now let's import the data. Let's create a new cell and start with train_df = pd.read_csv() with the path to your file; in my case the path can be copied from the Kaggle sidebar. You just have to put in the path to your train file. For test_df it is pd.read_csv() again, the same but with the test file. This is actually a competition too — recognizing these digits — and if you want to make a submission, you can copy the path of the sample submission CSV and say submission_df = pd.read_csv() with it. Making a submission is completely optional, but it is a fun thing to do and I am going to do it; if you want, you can just omit it.

Let's look at the data with train_df.head() and see how it looks (I forgot to run the cell first). Like we saw before, the label is in the first column and the rest is the pixels. Let's check test_df: no labels, just pixels in every column. And let's check submission_df too — again, only if you're planning a submission — it has an image ID and a label.

As you may have realized, we don't have validation data, so first we will create it: we split the train data into train and validation data frames using train_test_split. We pass train_df, say test_size=0.1, so 0.1 of all the train data goes into validation, set random_state equal to our random seed to make sure it is reproducible, and say shuffle=True, because we want to shuffle the dataset.

We will need three separate dataset objects — one for train, one for validation, one for test — and they will be very similar. Let's start with the train one: class MNISTTrainDataset(Dataset). We define __init__ taking self, images, labels, and indices. Since this is a PyTorch dataset, we need __init__, __len__, and __getitem__ functions for sure. So we say self.images = images, self.labels = labels, self.indices = indices, and self.transform = transforms.Compose. What we do with this is tweak our images: think of it as handing this transform an image, and the operations we define now will be applied to that image. First transforms.ToPILImage(), so we turn it into a PIL object. Then transforms.RandomRotation(15), so we rotate images randomly by up to 15 degrees; this is data augmentation. I put it here as a sample — you can apply much more extensive augmentation techniques, which will improve the results, but just know that this is how we apply them here. After that transforms.ToTensor(), turning everything into tensors, and transforms.Normalize with a mean of 0.5 and a standard deviation of 0.5. By the way, if you are going to apply more augmentation techniques beyond the random rotation, don't forget that they should only be applied to your train dataset.

With that out of the way we define __len__, which returns len(self.images); it is simple — when we call len() it returns the length of the dataset. And we want the __getitem__ function, which, given an index, returns the data at that specific index. We take self.images at the index and reshape it to 28x28, as type np.uint8. Why np.uint8? Because it is small; it is not necessary to use anything bigger like long for pixel values — uint8 is just fine. We take the label with self.labels at the index and the index with self.indices at the index, and we apply the transforms we defined to our image: image = self.transform(image). Then we return everything here as a dictionary: the image, the label, and the index.
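Here is that data loading and the train dataset class in code (a sketch; the /kaggle/input paths assume the notebook has the Digit Recognizer data attached):

```python
train_df = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test_df = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")
submission_df = pd.read_csv("/kaggle/input/digit-recognizer/sample_submission.csv")

train_df, val_df = train_test_split(train_df, test_size=0.1,
                                    random_state=RANDOM_SEED, shuffle=True)

class MNISTTrainDataset(Dataset):
    def __init__(self, images, labels, indices):
        self.images = images
        self.labels = labels
        self.indices = indices
        self.transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.RandomRotation(15),   # augmentation: train split only
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ])

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx].reshape((28, 28)).astype(np.uint8)
        label = self.labels[idx]
        index = self.indices[idx]
        image = self.transform(image)
        return {"image": image, "label": label, "index": index}
```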
The validation dataset is super similar, so I just copy and paste it. What is different is that, like I said, we are not going to apply data augmentation to the validation dataset, because we want the pure results on that data without augmentation. We can still keep the normalization and such, because at inference time — if you happen to deploy this model to production or anywhere else — you can still apply those normalizations to the input images. But if you put augmented data into your validation and test sets, it will distort your results and may mislead you. So we remove the RandomRotation, and we can remove the ToPILImage too — we remove them both — the rest is the same, and we change the name to MNISTValDataset.

The last one is the test dataset. I copied MNISTValDataset, pasted it, and will call it MNISTSubmitDataset. Since it is a test dataset we won't have any labels as input, because that is what we are trying to predict, so we remove labels from the constructor and from __getitem__. And as you may realize, there is no rotation or augmentation in test either, because, like I said, it would change your results and mislead you. The rest is the same; we just don't return any label. Let's run it.

As you may remember, after we defined the Vision Transformer modules we actually tested them to see whether they work, and we will do the same for the datasets: we will take some data and draw it. Let's start by plotting: fig, axarr = plt.subplots(1, 3) — one row and three columns, for train, validation, and test samples.

Then we create train_dataset = MNISTTrainDataset. First we give the images, via train_df.iloc[:, 1:] (there was a typo there at first). What this does: as you remember, the train data frame has the label in the first column and we don't want the label, so we take everything from column one onward — only the pixels. If you are familiar with numpy slicing, it may look familiar. We take .values and cast with .astype(np.uint8). After that we need the labels: train_df.iloc again, a similar procedure, but we take only column zero, because, as you can see, it contains the labels. As the last argument we give the indices (oops — we need .values there too, of course): train_df.index.values.
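The three dataset objects then get built like this (a sketch, assuming the MNISTValDataset and MNISTSubmitDataset variants are defined as described above):

```python
train_dataset = MNISTTrainDataset(train_df.iloc[:, 1:].values.astype(np.uint8),
                                  train_df.iloc[:, 0].values,
                                  train_df.index.values)
val_dataset = MNISTValDataset(val_df.iloc[:, 1:].values.astype(np.uint8),
                              val_df.iloc[:, 0].values,
                              val_df.index.values)
# the test csv has no label column, so no iloc slicing and no labels
test_dataset = MNISTSubmitDataset(test_df.values.astype(np.uint8),
                                  test_df.index.values)
```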
So we created our dataset object. Let's check that the __len__ function works with len(train_dataset), take index zero of the train dataset so we know the __getitem__ function works, and actually show an image from the dataset. We say axarr[0], which means we draw into the zeroth column of our subplot, and imshow the zeroth item of the train dataset. We select "image" from it, because __getitem__ returns the image, the label, and the index as a dictionary and we want the image — note it is the singular "image" key, not "images". And we squeeze it: the reason is that it comes back with an input-channel dimension, and we don't want that channel for plotting, so we basically squeeze it to get rid of it. Lastly we set cmap to gray, because I want to show it in grayscale, and give it a title with set_title("train image"), because this is a train image, and let's put a separator here.

Now we copy this and do it for validation: MNISTValDataset, iloc[:, 1:] and iloc[:, 0] again, but these will be the validation data frame, not train; the length of the validation dataset; print(val_dataset[0]); imshow of val_dataset[0]; and set_title to — you guessed it — "val image". Then we do it for test: test_dataset = MNISTSubmitDataset, where we feed only images and indices, and when feeding images we remove the iloc slicing. The reason is that in test we don't have to exclude a first label column; it already consists of only pixels, so we take it as it is, just with .values. We don't need the labels because we don't have labels; we only need the indices. One thing I missed here is that we want to draw the validation image into the second column of the plot, so I change its index to 1; the test image goes to the third column, index 2; and these last ones should say "test dataset" and "test image", not train. At last we call plt.show() and run it.

We get an error, "transforms is not defined", in the train dataset's ToPILImage line — there is an "s" missing. Let's run again, and it has run successfully. So this is the length of our train dataset, and this is what it returns when we specify an index: as you can see, this is the image tensor, this is the label it returned — so it represents an 8 — and this is its index. Let's look at the plot: as you can see, it is an eight. For the validation set, its size is 4,200; we have the image, all these tensors here, with label 8 and an index of 5,457; let's also check and verify its label, and as you can see it is an eight again. For the test set the size is 28,000, and the image is the tensor we see here; we don't have a label — that's what we're trying to predict — but it has an index. So it is all as we like.

Now that that is out of the way, let's define our data loaders. First the train data loader, which we will use while training to iterate through the data: DataLoader with dataset equal to the train dataset (sorry, not "batch"), batch_size equal to BATCH_SIZE, and shuffle=True. We do the same for the validation data loader with the validation dataset, and for the test data loader the dataset is the test dataset, but we won't shuffle that one. Let's run this.
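A sketch of those three loaders (the video only explicitly turns shuffling off for test; I mirror the train line for validation):

```python
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)  # keep submission order
```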
Okay, now it's time to move on to the training loop. The first thing we need is a loss function, so let's say criterion = nn.CrossEntropyLoss(). We also need an optimizer; for this, let's go to the paper and see which optimizer they used — it is Adam, so we are going to use the Adam optimizer too: optim.Adam(model.parameters()) with the betas we already defined and took from the paper, lr equal to the learning rate, and weight_decay equal to the Adam weight-decay hyperparameter.
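In code:

```python
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), betas=ADAM_BETAS,
                       lr=LEARNING_RATE, weight_decay=ADAM_WEIGHT_DECAY)
```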
I like to estimate the time of the training, so I will say start = timeit.default_timer(). Then, for epoch in tqdm(range(EPOCHS)), I add two more parameters, position=0 and leave=True. What they do is prevent tqdm from printing the progress bar on a new line at every iteration; instead it stays as one bar that just fills up. I will do this for all tqdm calls.

Let's put the model into train mode. We will need train_labels and train_preds, which will store the labels and the predictions respectively, and we have to store a running loss, which starts at zero. Now we iterate through the data loader: for idx, img_label in enumerate(tqdm(train_dataloader, position=0, leave=True)), for the same reason I just explained. We take the image from the batch with img_label["image"] — why "image"? Because that is the key we return from the dataset; don't let it confuse you. We convert it to float and put it on the device. We take the label the same way, img_label["label"], cast its type, and put it on the device as well.

For the predictions, we take the model and feed it the image, just like we did when we first defined the model and fed it the dummy input: y_pred = model(img). After that, to get the predicted label, we do y_pred_label = torch.argmax(y_pred, dim=1). What this does: I showed you that after we run the model on a batch it gives 512 as the batch size and 10 as the number of classes, and the scores have the highest number assigned to the predicted label, so we want to take the column with the highest number, which will be our prediction. We do that with argmax over dimension 1: say 512 images come in as the rows, and they have 10 columns each, one per class — we grab the one with the highest score. That's how we get the prediction.

Now we add them to our lists: train_labels.extend(...), where we move the tensors from the GPU to the CPU when storing them, because if you keep storing them on the GPU you will run out of memory — you would be using your GPU just for storage, unnecessarily. Whenever possible, move things to the CPU, especially if you are storing them. So we store the labels, and we store the predictions with the same procedure, y_pred_label.cpu().detach(). We compute the loss with the criterion we defined, criterion(y_pred, label). Now it's time to make sure learning is happening; for that we do optimizer.zero_grad(), loss.backward(), and optimizer.step() — these three lines make sure that learning happens. After that we need to update our running loss: train_running_loss += loss.item() (by the way, the plus sign goes here, not on the loss line). So we update it for each iteration of the train data loader, and after the loop we compute the train loss for the whole epoch: train_loss = train_running_loss / (idx + 1). The running loss adds up the loss of every batch and stores it, and with this division we store it as the loss for one epoch.

We did it for training and got the training loss; now we do the validation part, which is very similar. We first put the model into evaluation mode, then create val_labels, which will store the validation labels, val_preds, which will store the validation predictions, and val_running_loss, which will store the running loss for validation. We say with torch.no_grad(), which makes sure that no learning is happening here: we are just trying to get the validation score of the trained model, so we don't want any gradients. Then the same procedure: for idx, img_label in enumerate(tqdm(val_dataloader, position=0, leave=True)) — the tqdm arguments as discussed; img = img_label["image"].float().to(device); the label the same way; y_pred = model(img); and y_pred_label = torch.argmax(y_pred, dim=1), for the reason we just discussed in the training part, to take the highest-scoring class. Then we store them: val_labels.extend(label.cpu().detach()), moving them to the CPU first, which is the good practice, and val_preds.extend(y_pred_label.cpu().detach()) likewise. After that we take the loss with criterion(y_pred, label), update the running loss with val_running_loss += loss.item(), and compute the loss for the whole epoch: val_loss = val_running_loss / (idx + 1).

Now that train and validation are complete, one on top of the other, we want to print the metrics as the training happens. We do that with print: let's do a separator here, then print "Train Loss EPOCH" with the epoch number, epoch + 1, and the train loss formatted with :.4f, which means I only want four digits after the decimal point of the train loss value (something was wrong there — I hadn't closed the brackets).
We take the same line and do it for the validation loss: "Valid Loss EPOCH", epoch + 1, but this will be val_loss, again with :.4f. I feel like I missed something here, and it will probably give us an error; we'll find it then. Now let's print the accuracies, which we calculate on the fly. For the train accuracy of the epoch, we compute sum(1 for x, y in zip(train_preds, train_labels) if x == y) and divide it by len(train_labels), taking only four digits again with :.4f (I missed a parenthesis here — these match, these match, one more parenthesis is needed there). Let me explain what's going on: I iterate over train_preds and train_labels as x and y, and if the x and y values are equal, I count a one — we found a correct prediction. Say x, the prediction, is "cat", and y, the ground-truth label, is "cat" too: then we made a correct prediction and count a one. At the end we sum up all the ones and divide by the length of the data — the train labels, in this case — and the result is the accuracy. We copy that and do it for validation again; the only difference is that these are val instead of train. Then we put another separator, "-" multiplied by 30. We started a timer in the beginning; now we stop it — stop = timeit.default_timer() — and calculate how much time has elapsed, which is stop minus start; I only want two digits after the decimal, so :.2f seconds.

Now let's run it; if it gives an error, let's meet here again. If you're on Kaggle, before running it you need a GPU: go to the accelerator setting, choose a GPU, and turn it on. So let's start the training and meet back here. Okay, we got an error: I forgot parentheses here and here, and we don't need these; let's run it again. We got another error in the validation part — a missing parenthesis again; run it again. Our training is complete and we are back. Looking at the results, our train accuracy is around 0.94 at epoch 40, while the validation accuracy is around 0.92. You can pump those numbers up, but since this is a basic tutorial we will just continue.
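For reference, the whole loop with those parenthesis errors already fixed comes out roughly like this (a sketch; one deliberate change: the video casts labels with torch.uint8, but nn.CrossEntropyLoss expects int64 class indices, so torch.long is used here):

```python
start = timeit.default_timer()
for epoch in tqdm(range(EPOCHS), position=0, leave=True):
    model.train()
    train_labels, train_preds = [], []
    train_running_loss = 0
    for idx, img_label in enumerate(tqdm(train_dataloader, position=0, leave=True)):
        img = img_label["image"].float().to(device)
        label = img_label["label"].type(torch.long).to(device)  # long, not uint8
        y_pred = model(img)
        y_pred_label = torch.argmax(y_pred, dim=1)

        # store on the CPU so we don't eat GPU memory
        train_labels.extend(label.cpu().detach())
        train_preds.extend(y_pred_label.cpu().detach())

        loss = criterion(y_pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_running_loss += loss.item()
    train_loss = train_running_loss / (idx + 1)

    model.eval()
    val_labels, val_preds = [], []
    val_running_loss = 0
    with torch.no_grad():  # no gradient updates during validation
        for idx, img_label in enumerate(tqdm(val_dataloader, position=0, leave=True)):
            img = img_label["image"].float().to(device)
            label = img_label["label"].type(torch.long).to(device)
            y_pred = model(img)
            y_pred_label = torch.argmax(y_pred, dim=1)
            val_labels.extend(label.cpu().detach())
            val_preds.extend(y_pred_label.cpu().detach())
            loss = criterion(y_pred, label)
            val_running_loss += loss.item()
    val_loss = val_running_loss / (idx + 1)

    print("-" * 30)
    print(f"Train Loss EPOCH {epoch + 1}: {train_loss:.4f}")
    print(f"Valid Loss EPOCH {epoch + 1}: {val_loss:.4f}")
    print(f"Train Accuracy EPOCH {epoch + 1}: "
          f"{sum(1 for x, y in zip(train_preds, train_labels) if x == y) / len(train_labels):.4f}")
    print(f"Valid Accuracy EPOCH {epoch + 1}: "
          f"{sum(1 for x, y in zip(val_preds, val_labels) if x == y) / len(val_labels):.4f}")
    print("-" * 30)

stop = timeit.default_timer()
print(f"Training Time: {stop - start:.2f}s")
```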
The first thing we want to do now is torch.cuda.empty_cache(). What it does: you have loaded things into memory — the model, the images — and the GPU is nearly full, so before we continue, we offload what we can with this line. We want free space on our GPU, and that's what we do here.

Now let's continue with the predictions; as you know, we have a test data loader too. We start by creating a list for the labels, a list for the IDs (oops), and a list for the images. Again we put the model into evaluation mode, and again we use with torch.no_grad(), just like we did in the validation part — we don't want any learning or gradient updates. Similarly: for idx, sample in enumerate(tqdm(test_dataloader, position=0, leave=True)), the tqdm arguments we talked about so the bar stays in place. The image is sample["image"], which we put on the device. We store the IDs as integers, and we add one because, for the submission, the data frame indices start from zero while the submission IDs start from one; so it is int(i) + 1 for i in sample["index"]. If you don't want to do a submission, you don't have to store the IDs. We get the outputs from the model on the image. We also store the images with extend, because we will actually look at the predictions next to the original images; we first detach them and then move them to the CPU, since, like we mentioned, we don't want to take memory from the GPU. And we store the labels, again as integers, with torch.argmax over dimension 1, selecting the class with the highest score. Let's run this (it's "extend", not "extends") — and it's running.

Now let's actually plot some test data and see our predictions. We say fig, axarr = plt.subplots(2, 3), because we will plot six images in total, two rows and three columns, and we will need a counter. We iterate over the two rows and over the columns, put each image into row i and column j, take images[counter], squeeze off the channel dimension, and set cmap to gray so it shows in grayscale. We set a title saying the predicted value, labels[counter], then update the counter, and in the end we do plt.show().

As you can see, we have six images. This is a two — the model predicted it correctly; this is a zero, predicted correctly; this is a nine, predicted correctly; this is a zero that the model predicted as a four, which is a mistake; this is a three, predicted three; a seven, predicted seven. So out of the six images we chose, we have one wrong prediction.

If you want to continue with the submission part, you can do it like this: submission_df = pd.DataFrame(list(zip(ids, labels)), columns=["ImageId", "Label"]), then submission_df.to_csv("submission.csv", index=False), and submission_df.head() to check. This is your submission file; you can just go to the competition, click submit, and it will actually be submitted.

So that's all — we implemented the model from scratch and we trained it. If you liked the video, make sure to give it a like and subscribe to my channel; it helps a lot. And if you have any questions, don't hesitate to ask in the comments. See you next time!
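As a recap of that last part, here is a sketch of the inference, plotting, and submission code ("ImageId" and "Label" are the Digit Recognizer submission columns):

```python
torch.cuda.empty_cache()  # free cached GPU memory held over from training

labels, ids, imgs = [], [], []
model.eval()
with torch.no_grad():
    for idx, sample in enumerate(tqdm(test_dataloader, position=0, leave=True)):
        img = sample["image"].to(device)
        ids.extend([int(i) + 1 for i in sample["index"]])  # submission ids are 1-based
        outputs = model(img)
        imgs.extend(img.detach().cpu())
        labels.extend([int(i) for i in torch.argmax(outputs, dim=1)])

# show six test images with their predicted digits
fig, axarr = plt.subplots(2, 3)
counter = 0
for i in range(2):
    for j in range(3):
        axarr[i][j].imshow(imgs[counter].squeeze(), cmap="gray")
        axarr[i][j].set_title(f"Predicted {labels[counter]}")
        counter += 1
plt.show()

submission_df = pd.DataFrame(list(zip(ids, labels)), columns=["ImageId", "Label"])
submission_df.to_csv("submission.csv", index=False)
```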
Info
Channel: Uygar Kurt
Views: 12,107
Keywords: kaggle, kaggle competition solution, kaggle grandmaster roadmap, kaggle competition live, pandas, huggingface, tensorflow, pytorch, keras, transformers, nltk, deberta, llm, large language models, nlp, cv, computer vision, kan jee, yannic kilcher, rob mulla, Aladdin Persson, ViT, vision transformer, jupyter notebook, implement from scratch, MNIST, MNIST Dataset, handwritten text recognition, transformers for cv
Id: Vonyoz6Yt9c
Length: 75min 33sec (4533 seconds)
Published: Fri Sep 29 2023