How to Train DETR Object Detection Transformer on Custom Dataset

Video Statistics and Information

Captions
I need to learn transformers. I've been having this kind of imposter-syndrome crisis recently. Don't get me wrong, I feel pretty comfortable working with almost any computer vision algorithm that is popular right now, but when it comes to transformers I don't feel comfortable enough. That's why I decided that this week we'll dive deeper into the world of object detection transformers, and we'll learn how to train one of them on a custom dataset. So if you are just like me and you would like to learn something new today, sit back, relax, make some coffee, and prepare yourself for 25 minutes of pure struggle. But I promise that we'll use plenty of super cool and useful tools today, like PyTorch Lightning, Transformers, supervision, and TensorBoard, so there's a lot to unpack. Let's not waste any more time and train some models.

As usual, I prepared a dedicated Jupyter notebook for you; you will find it in the description below, so you can follow my steps during the tutorial. We scroll through the overview and table of contents (I highly encourage you guys to read it, you can learn a lot about the DETR model) and we stop by the nvidia-smi command. This is the first cell that we are going to execute, and it is here to confirm that we have access to a GPU. The execution may take a little bit of time, because we first need to connect to the actual GPU on the server, but when it's done it prints out the architecture of the GPU that we have at our disposal; in our case it's a Tesla T4. Now, if you don't see the same output in that cell, that most likely means that your runtime is not GPU accelerated. In that case you need to navigate to Runtime, then Change runtime type, and select GPU from the dropdown. I believe that's not the default value, so you just need to switch it to GPU and you should be ready to go. At the same time, keep in mind that the pop-up might look a little bit different than what you see on the screen; that's because I'm using Google Colab Pro, and you have fewer options in the free version.

Cool. The next thing is creating a HOME constant. I like to have it because it helps me manage paths to files and directories. When we examine its value, in the case of Google Colab it leads to the /content directory, and this is exactly the same directory that is the root of your file viewer in the left tab of Google Colab. We can confirm that by running the ls command: we get exactly the same result in the file viewer and in the terminal. I know it might sound super obvious to many of you, but I usually get like ten comments under each video asking about path management, so I just wanted to make sure that everybody understands where we are.

Cool, now let's set up our Python environment. We have a few dependencies that we need to install: we'll use supervision to annotate our frames and manage our detections, Transformers (obviously) to load our DETR model, timm to load the backbone for that model, roboflow to download the dataset, and PyTorch Lightning to manage the training. At this point we are basically ready to start the training, but before we do that I always like to take the pre-trained model and run it on some example image, just to confirm that all the installation steps were done properly. So let's do that. We start by downloading our example image (that one is coming from a private gallery, you will see it in a second), and after the download is completed we should be able to see that image in the file manager on the left-hand side. We can see that the name of the image is dog.jpeg, so let's use that name and our HOME constant to create the path to that image.
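For reference, here is a minimal sketch of the setup cells described so far. The checkpoint name and the exact constant values are assumptions; the notebook linked in the description is the authoritative version.

```python
# Dependencies installed earlier in the notebook (the actual install cell may pin versions):
#   pip install supervision transformers timm roboflow pytorch-lightning

import os

import torch

# On Google Colab the working directory is /content, the root of the file viewer.
HOME = os.getcwd()

# Run on GPU when available, otherwise fall back to CPU.
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Assumed base model; the tutorial fine-tunes a pre-trained DETR checkpoint.
CHECKPOINT = "facebook/detr-resnet-50"

# Path to the example image downloaded into the Colab file system.
IMAGE_PATH = os.path.join(HOME, "dog.jpeg")
```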
In just a second we'll be able to use that path to load the image and run it through the network, but before we do that we need to load the actual network into memory. With Transformers it's actually pretty easy: all we need to do is load two classes from the transformers package. The first one is DetrForObjectDetection, the second one is DetrImageProcessor. The first one is the actual model; the second one is a set of utilities that we'll use in conjunction with that model, for example to pre-process the image or post-process the detections.

We want to be able to run the model on CPU or on GPU, depending on the hardware that we have at our disposal, and because we are using PyTorch this is actually pretty easy: using torch.cuda.is_available we can set the value of the DEVICE constant to cpu, or to cuda:0 if the GPU is accessible. We will also create another constant called CHECKPOINT that will define the version of the base model that we'll use for our training, and we'll sprinkle in a few more constants that we'll use for detection post-processing later on. At this point, let's load the model. The whole process may take a few seconds to complete, but when it's done the whole model architecture gets printed on the screen.

Now we can scroll a bit lower, to the part where we will actually run the inference, and let me quickly go through the code snippet that we'll use. We start with torch.no_grad to disable gradient calculation, which will result in reduced memory consumption. The next part is pretty straightforward: we use OpenCV to read the image from the hard drive and use the image processor to apply all necessary transformations before the inference. At the very end of that line we call .to(DEVICE) to make sure that the image tensor lives on the same device as the model weights. Now that we have our inputs ready, we can push them through the model and save the results in the outputs variable. That output is still raw, so to make it useful we need to process it using the same image processor as before; it will take care of bounding box scaling and filtering by confidence, among other things.

Now we are ready to visualize our predictions, and to do it we'll utilize the supervision package. We start by converting our results into supervision Detections, and then we create a bounding box annotator that we can use to annotate the frames. The last thing that we need to do is call the annotate method, passing our image, the detections, and the labels that we generated one line above, to annotate our image with class names and confidences. Now we can run the cell and wait a few seconds for the model to generate the predictions.

We can see that the model generated double detections: for example, in the case of the backpack or the person we have two bounding boxes instead of one. It's a bit hard to see, but the car on the right side of the image also got detected two times, and this time we have two bounding boxes with two different classes, car and truck. Double detection is a very common problem with object detection models (YOLO, for example) and it can be easily solved with the non-max suppression (NMS) algorithm. When we scroll a little bit lower we can see another code snippet. This one is basically identical to the previous one; the only difference is that we apply non-max suppression to our detections. We can see that the line that previously only converted results into detections now applies NMS as well.
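Here is a condensed sketch of those inference cells, assuming the constants from the previous snippet. The threshold values are assumptions, and note that supervision's annotator API has changed across versions; the labels argument below follows the older BoxAnnotator interface that was current when this video was published.

```python
import cv2
import supervision as sv
import torch
from transformers import DetrForObjectDetection, DetrImageProcessor

CONFIDENCE_THRESHOLD = 0.5  # assumed post-processing constants
IOU_THRESHOLD = 0.8

image_processor = DetrImageProcessor.from_pretrained(CHECKPOINT)
model = DetrForObjectDetection.from_pretrained(CHECKPOINT).to(DEVICE)

image = cv2.imread(IMAGE_PATH)

with torch.no_grad():
    # pre-process: resize, normalize, and move the tensors to the model's device
    inputs = image_processor(images=image, return_tensors="pt").to(DEVICE)
    outputs = model(**inputs)

    # post-process: rescale boxes to the original resolution, filter by confidence
    target_size = torch.tensor([image.shape[:2]]).to(DEVICE)
    results = image_processor.post_process_object_detection(
        outputs, threshold=CONFIDENCE_THRESHOLD, target_sizes=target_size)[0]

# convert to supervision Detections and merge duplicate boxes with NMS
detections = sv.Detections.from_transformers(transformers_results=results)
detections = detections.with_nms(threshold=IOU_THRESHOLD)

labels = [
    f"{model.config.id2label[class_id]} {confidence:.2f}"
    for class_id, confidence in zip(detections.class_id, detections.confidence)
]
annotated = sv.BoxAnnotator().annotate(
    scene=image.copy(), detections=detections, labels=labels)
```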
Now, when we run the model, we notice two things. First of all, it produces detections much faster, because the model is already warmed up and all the weights are loaded into memory. Second of all, thanks to that small NMS change we get only one bounding box per object. So far so good: our model infers, which is a very good sign, and it looks like we installed everything properly. Now we can finally move forward and shift our focus to the actual training.

Obviously, if you want to train a model on custom data, you first need to have your dataset in the correct format. For this particular video I expect you to have your data in COCO dataset format. If you don't know what that is, make sure to watch our other video; the link to it is in the top right corner and in the description below, and shout out to Jacob for the great explanation. In the meantime, let me show you the dataset that I'm going to use in today's tutorial. I go to roboflow.com, sign in, and after my workspace is loaded I select football player detection: that's the dataset we are going to use to train our custom model. It comes from the recent Kaggle Bundesliga competition and allows me to detect players, goalkeepers, referees, and the ball.

If you want to follow along with this dataset, the only thing that you need is a Roboflow API key, so let's grab it. We navigate back to Roboflow, go into our profile, then Settings, then Roboflow API, and copy our key. Then we navigate back to the notebook, paste the key, hit Shift+Enter two times, and our dataset starts downloading. That process may take a little bit of time, because we have a few hundred images to download, so in the meantime let me show you how you would proceed if you didn't already have that snippet in the Colab and needed to obtain it on your own. I go back to my project, first select the version (I have four to choose from; I keep the latest one), then click the Export button, select COCO JSON as the format, and click Continue. After a few seconds I have a code snippet that I can just copy and paste into the notebook.

In the meantime our dataset got downloaded, and we can see it in the file explorer on the left side. In the datasets directory we have football player detection, version one, and importantly we have three subdirectories, train, valid, and test, that we will use for training and validation of our model. If we want to find our dataset on disk, we can use the location property of the dataset class, which stores the path to the root directory of the whole dataset.

Having annotation files is one thing, but being able to use them during training is another. Fortunately, PyTorch comes with a very handy CocoDetection class that we will now utilize to build our datasets, and later the data loaders. We start by extending that CocoDetection class, because we have a few custom things that we need to add. Let's start with the constructor. First of all, PyTorch's CocoDetection expects two paths: the first one to the image directory, the second one to the annotation JSON. In our case, images and annotations are located in the same directory, which means that we can easily infer the path to the annotations by joining the name of the annotation file with the path to the images directory, and then pass those as two separate arguments to the CocoDetection parent class. Another thing that we need to remember is that we need to pre-process our images before we push them through the neural network, and that's why we inject the image processor in the constructor and save it as a class field, to then use it in the __getitem__ method, as sketched below.
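A sketch of that dataset class, assuming Roboflow's COCO export convention of placing an _annotations.coco.json file next to the images (the exact filename is an assumption here):

```python
import os

import torchvision

class CocoDetection(torchvision.datasets.CocoDetection):
    def __init__(self, image_directory_path: str, image_processor):
        # images and the annotation JSON live in the same directory, so the
        # annotation path can be inferred from the image directory
        annotation_file_path = os.path.join(image_directory_path, "_annotations.coco.json")
        super().__init__(image_directory_path, annotation_file_path)
        self.image_processor = image_processor

    def __getitem__(self, idx):
        # parent class returns a PIL image and a list of COCO annotations
        image, annotations = super().__getitem__(idx)
        image_id = self.ids[idx]
        annotations = {"image_id": image_id, "annotations": annotations}
        # the injected image processor converts both into model-ready tensors
        encoding = self.image_processor(
            images=image, annotations=annotations, return_tensors="pt")
        pixel_values = encoding["pixel_values"].squeeze()  # drop the batch dimension
        target = encoding["labels"][0]
        return pixel_values, target
```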
Depending on the object detection transformer that we would like to train, we would inject a different image processor into the CocoDetection constructor, because in the Transformers API every model usually has a dedicated image processor that you should use. Like I said, we divided our data into three parts, train, valid, and test, and right now we need to load them as three separate CocoDetection datasets. When we run the cell, we can see how many images we have in each individual subset, and if we go back to our Roboflow dashboard we can see that those numbers correspond to the distribution that we defined in our project.

Now, I guess it would be a good idea to verify that our COCO annotations are being read correctly. To do that, we scroll a little bit lower to the next code snippet. The code here is quite straightforward: we start by using the PyTorch dataset API to get the IDs of all images in our dataset, then we randomly select one of them and load it from the hard drive. At the same time we load the annotations corresponding to the same image, and then once again use supervision, this time to convert the COCO annotations into a Detections object that we can use to draw bounding boxes on the source image. I know it's a lot of talking, but let's take a look at the results. We see our source image annotated with bounding boxes, and everything seems to be okay: we have the right classes, and the bounding boxes are in the right places. Because we select the image at random, we can just rerun the cell and take a look at a different image from the dataset.

To build a data loader we need to define a few things: the dataset, the batch size, and the collate function. We already spoke a lot about the datasets, so let's focus on the two other arguments. To speed up the training, the neural network can usually consume multiple images at the same time, and that group of images is called a batch. The larger the batch size, the faster we can consume data and learn. However, there is a catch: as the batch size grows, so does the memory consumption of our neural network, and at some point the training breaks, because we simply don't have enough memory to store our model weights, our images, and our gradients at the same time. All in all, selecting a batch size is a balance between the speed of the training and the memory allocation.

At the same time, our neural network expects to get the batch wrapped in a specific data structure, and that's where the collate function comes in. It is responsible for putting everything together in such a way that it will be understandable for our neural network during inference and training. In the case of DETR, the collate function is quite specific, because the authors of the paper decided to use a variety of image sizes during the training. That means that the tensors responsible for storing pixel values have different shapes, and it's basically impossible to directly batch them together. To solve that problem, they decided to pad every image in the batch to the shape of the largest image. But now you lose the information about the size of the original image, right? Well, not exactly. To be able to go back, they created an additional tensor called the pixel mask, which stores only ones and zeros: one if the pixel was originally in the image, and zero if it was added as padding. A pretty clever solution, but we need to handle that whole complexity in our data loader; a minimal sketch follows below.
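A minimal sketch of that collate function and the data loaders, assuming the dataset instances created above (the TRAIN_DATASET and VAL_DATASET names, and the batch size, are illustrative) and the image_processor from earlier:

```python
from torch.utils.data import DataLoader

def collate_fn(batch):
    # pad every image in the batch to the largest image's shape and build the
    # pixel mask DETR expects: 1 where a real pixel is, 0 where padding was added
    pixel_values = [item[0] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    return {
        "pixel_values": encoding["pixel_values"],
        "pixel_mask": encoding["pixel_mask"],
        "labels": [item[1] for item in batch],
    }

# batch size 4 is an assumption; raise it until GPU memory becomes the bottleneck
TRAIN_DATALOADER = DataLoader(TRAIN_DATASET, collate_fn=collate_fn, batch_size=4, shuffle=True)
VAL_DATALOADER = DataLoader(VAL_DATASET, collate_fn=collate_fn, batch_size=4)
```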
Okay, we are done with data processing and loading, and now we can focus on training the model. Model training is essentially a gigantic loop: every time we get a batch of data, we push it forward through the neural network, we calculate the loss to understand how far from the desired solution we are, we use the optimizer to understand how we can get better, and then we go backward through the neural network, tweaking our weights so that the next step will be just a bit closer to the desired solution. Sounds very simple, but writing those training loops can be a very tedious, mistake-prone task, so it's pretty much standard practice to use a library to manage the training process, and in this tutorial we'll use PyTorch Lightning to do that.

Just like with the data loaders, we start by extending a base class, in our case LightningModule. If you know something about programming, you know that when it comes to inheritance there is usually a set of methods that we are expected to implement, and we can learn about those from the PyTorch Lightning documentation; I highly encourage you guys to read it, or at least take a look at it. In our case most of the stuff is pretty straightforward, but there are a few tricky parts, so I will take you through the whole implementation (a sketch follows below).

We start in the constructor, where we initialize our model from the same checkpoint as a few minutes ago. Obviously, if we were training a different object detection transformer, we would need to use a different class to instantiate it. Another thing that is very specific to DETR is the backbone learning rate. Like I said before, training is just a gigantic loop where at every iteration we take a small step towards the desired solution, and the size of that step is decided by the learning rate. Usually the whole neural network has the same learning rate, but not this time, and we need to be able to handle that complexity in our module. That's why it takes two learning rates: the first one will be applied to the whole neural network, the second one only to the backbone. We save both of those learning rates as class fields. The next method that requires our attention is configure_optimizers, because the learning rate is one of the key factors that influence the optimizer's behavior; here we need to ensure that we include not only the base learning rate but also the learning rate for the backbone. The rest of the steps are pretty standard; we added a little bit of logging to the training and validation steps, but that's about it. At the end we just need to make sure that we pass the train and validation data loaders, and we can start training. Finally, to keep track of the key metrics during the training, we'll use TensorBoard; it will scan our logs directory and create real-time charts during the training.

Okay, all the stars have aligned and we can finally run our training. Just one more sanity check: we create an instance of our PyTorch Lightning module and pass one batch through it, just to confirm that everything works and doesn't break. Looks like we didn't run one of the cells; yeah, there it is, the collate function. Okay, now we should be able to run everything smoothly, hopefully. And it works, cool. Now we have one more step, which is to actually trigger the training. I'll just adjust the number of epochs and change it from three to, I don't know, 50 for example, and we can hit Shift+Enter and the training starts. 50 epochs, even with 600 images, takes quite a lot of time, so obviously we will not watch the whole process; I will speed it up for you.
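While the training runs, here is a sketch of the LightningModule described above. The learning rates, weight decay, and the id2label mapping are assumptions (in the notebook, id2label would be built from the dataset's COCO categories):

```python
import pytorch_lightning as pl
import torch
from transformers import DetrForObjectDetection

class Detr(pl.LightningModule):
    def __init__(self, lr=1e-4, lr_backbone=1e-5, weight_decay=1e-4):
        super().__init__()
        self.model = DetrForObjectDetection.from_pretrained(
            CHECKPOINT,
            num_labels=len(id2label),          # assumed mapping built from the dataset
            ignore_mismatched_sizes=True,      # swap the COCO head for our own classes
        )
        # two learning rates: one for the whole network, one for the backbone
        self.lr = lr
        self.lr_backbone = lr_backbone
        self.weight_decay = weight_decay

    def forward(self, pixel_values, pixel_mask):
        return self.model(pixel_values=pixel_values, pixel_mask=pixel_mask)

    def common_step(self, batch):
        labels = [{k: v.to(self.device) for k, v in t.items()} for t in batch["labels"]]
        outputs = self.model(
            pixel_values=batch["pixel_values"],
            pixel_mask=batch["pixel_mask"],
            labels=labels,
        )
        return outputs.loss, outputs.loss_dict

    def training_step(self, batch, batch_idx):
        loss, loss_dict = self.common_step(batch)
        self.log("training_loss", loss)
        for k, v in loss_dict.items():
            self.log("train_" + k, v.item())
        return loss

    def validation_step(self, batch, batch_idx):
        loss, _ = self.common_step(batch)
        self.log("validation_loss", loss)
        return loss

    def configure_optimizers(self):
        # two parameter groups: a lower learning rate for the backbone,
        # the base learning rate for everything else
        param_dicts = [
            {"params": [p for n, p in self.named_parameters()
                        if "backbone" not in n and p.requires_grad]},
            {"params": [p for n, p in self.named_parameters()
                        if "backbone" in n and p.requires_grad],
             "lr": self.lr_backbone},
        ]
        return torch.optim.AdamW(param_dicts, lr=self.lr, weight_decay=self.weight_decay)

    def train_dataloader(self):
        return TRAIN_DATALOADER

    def val_dataloader(self):
        return VAL_DATALOADER

# trigger the training; max_epochs matches the value chosen in the video
model = Detr()
trainer = pl.Trainer(max_epochs=50, accelerator="gpu", devices=1, gradient_clip_val=0.1)
trainer.fit(model)
```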
Okay, so it's epoch number 20. I decided to take an early look at the model, just to make sure that it trains and doesn't do anything stupid. So let's refresh the TensorBoard, switch to Scalars, and scroll a little bit lower, maybe to the training bounding box loss. It is a bit bumpy, but it still converges. What about other metrics, maybe the training loss? It is much smoother and, like I said, it still converges; the validation loss does as well. Okay, we still have 30 epochs to go, so see you later.

The training has completed, so I'd say let's put the model to the test and run some inference. The code that we'll use to do that is basically a combination of code that we already used. It can be divided into two parts: the first one picks a random image and displays the annotations, the second one runs the model and displays the detections. So let's execute it and take a look at the results. The top image, like I said, shows the annotations; quite a cool image, because we get to see all the classes. The bottom one shows the predictions, and straight away we can see a lot of false positives: we see multiple bounding boxes on the goalkeeper. I expected that, given the fact that we got multiple bounding boxes on our test image earlier, so let's change the IoU threshold to something lower and try to filter them out. One thing that is a bit inconvenient is that we still pick the image at random, so we won't be able to compare directly with the previous result, but it is quite noticeable that we no longer see multiple bounding boxes on the goalkeeper, for example. Something that we wouldn't be able to solve with NMS, obviously, are those false detections of referees: we see that the model decided to detect referees in a few random places around the field. Maybe a longer training would help with that; let me know in the comments if you have ideas on how to solve that particular problem. Still, I'm quite happy with this result, given the fact that the actual model was developed in 2020 and since then we have already seen major improvements in accuracy.

Okay, let's evaluate it on the whole test set before we wrap up the tutorial. Shout out to Niels for packaging the original COCO evaluator that came together with the DETR model, because right now we don't need to clone the original repository; we can just install it as a pip package. Let me just run the next two cells, and in a few seconds we should have the evaluation results. And yeah, the overall AP at the 0.50 IoU threshold is just around 0.41.

Cool, that's it for today. The whole video is significantly longer than I originally anticipated, but I guess that's because we dove very deep into the PyTorch and PyTorch Lightning APIs. I just hope that you found it useful and you learned something. I can say that I learned a lot; this was actually my first PyTorch Lightning project, so I am super happy with the result. If you want to see more tutorials like this, make sure to like and subscribe, and let me know whether this longer format works for you. Thanks a lot, my name is Peter, and I'll see you next time. Bye!
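For completeness, a sketch of the kind of test-set inference described above, assuming the TEST_DATASET from earlier, the fine-tuned Lightning module, and the same image_processor (all names illustrative, thresholds are assumptions):

```python
import os
import random

import cv2
import supervision as sv
import torch

model.to(DEVICE).eval()

# pick a random image from the test set via the underlying pycocotools index
image_ids = TEST_DATASET.coco.getImgIds()
image_info = TEST_DATASET.coco.loadImgs(random.choice(image_ids))[0]
image = cv2.imread(os.path.join(TEST_DATASET.root, image_info["file_name"]))

with torch.no_grad():
    inputs = image_processor(images=image, return_tensors="pt").to(DEVICE)
    outputs = model(pixel_values=inputs["pixel_values"], pixel_mask=inputs["pixel_mask"])
    target_size = torch.tensor([image.shape[:2]]).to(DEVICE)
    results = image_processor.post_process_object_detection(
        outputs, threshold=0.5, target_sizes=target_size)[0]

# a lower NMS IoU threshold merges the duplicate goalkeeper boxes mentioned above
detections = sv.Detections.from_transformers(transformers_results=results)
detections = detections.with_nms(threshold=0.5)

# persist the fine-tuned weights so they can be reloaded without retraining
model.model.save_pretrained(os.path.join(HOME, "detr-finetuned"))
```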
Info
Channel: Roboflow
Views: 23,531
Keywords: code walk-through, facebook research, detr, detection transformer, end to end object detection with transformers, object detection, tutorial, facebook ai, coco, bounding boxes, transformer, explained, attention mechanism, hugging face, pytorch lightning, torchvision, object detection with transformers, transformers, custom dataset, yolos, colab
Id: AM8D4j9KoaU
Length: 23min 26sec (1406 seconds)
Published: Fri Mar 03 2023