Detect Anything You Want with Grounding DINO | Zero Shot Object Detection SOTA

Video Statistics and Information

Captions
Grounding DINO is the latest state-of-the-art zero-shot object detector. I know that's a mouthful, but trust me, it is crazy. With regular object detectors you are pretty much bound to a predefined list of classes, and if you wish to expand that list you need to gather more data, annotate it, and retrain the model. That's a lot of work. With Grounding DINO, pretty much all you need to do is change your prompt, and in most cases it will successfully find any arbitrary object you would like to detect. I tested it on a few images, looking for unusual objects like knees or dog tails, and it performed really well. So without further ado, let's jump into the code.

This demo will be a bit different because I will not show you how to train the model. However, we'll go through multiple inference examples, and I will show you what I learned along the way. Spoiler alert: this is by far the most fun I have had with any computer vision model for as long as I can remember. Hi, it's Peter from the future. I'm just producing the video and I decided I need to qualify that statement: obviously, when we talk about computer vision in general, I had a lot of fun with models like Stable Diffusion, but when we talk about object detectors specifically, I stand by my original statement. This is the most fun I have had. Okay, back to the video.

Okay, let's get our hands dirty. As usual, we created a Jupyter notebook that you can use to follow along with this tutorial, so scroll a little bit lower and open "Zero-Shot Object Detection with Grounding DINO". Let me just increase the size of the window a bit so you have an easier time following the tutorial, and we are pretty much good to go. As usual, the first thing we do is execute nvidia-smi, just to confirm that we have access to the GPU. In the meantime, Google asks whether or not we want to run the notebook, because it was not created by Google; of course it wasn't, because it was created by me. Everything runs properly and we get the usual nvidia-smi output, so now we can proceed to create the HOME constant. We'll use it to manage the paths to model weights, configuration files, and the rest of the data.

The next thing is to install Grounding DINO. It is not distributed via pip just yet, so for now you need to clone the repository, enter the directory, and install all the dependencies. Spoiler alert: at the very end of that cell you can see that I also install roboflow, and that's because at the very end of this video I will show you how to use Grounding DINO and Roboflow to automatically annotate your dataset. Most of those dependencies are already installed in Google Colab, but keep in mind that if you run the installation on your own machine it may take a little bit of time.

Cool, so far so good. We are done with the installation, and now we can prepare to load the model into memory. When we load the model it takes two arguments: the first one is the path to the configuration file, and the second one is the path to the weights file. The first one comes with the repository; the second one needs to be downloaded, so that's what we do. The download shouldn't take too much time because the file is pretty small, and at the very end we just create a constant pointing to the Grounding DINO weights and confirm that the file downloaded properly.

Now, to be able to play a little bit with the model during the demo, I prepared a small set of images, so we just wget them, and after the download completes you will find them in the data directory. Those images come from my private album (myself and my dog), but obviously feel free to upload your own images.
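A rough sketch of that setup, assuming the layout of the IDEA-Research GroundingDINO repository and the SwinT checkpoint from its releases page; the exact paths, file names, and release URL may differ from the ones used in the notebook.

    import os

    HOME = os.getcwd()  # base directory for the repository, weights, and sample data

    # In the notebook these steps are shell cells, roughly:
    #   !git clone https://github.com/IDEA-Research/GroundingDINO.git
    #   %cd GroundingDINO
    #   !pip install -q -e .
    #   !pip install -q roboflow
    #   !wget -q -P {HOME}/weights https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

    from groundingdino.util.inference import load_model

    CONFIG_PATH = os.path.join(HOME, "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py")
    WEIGHTS_PATH = os.path.join(HOME, "weights/groundingdino_swint_ogc.pth")

    # Loading also pulls the text backbone weights from the internet on the first run.
    model = load_model(CONFIG_PATH, WEIGHTS_PATH)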
I'm actually super curious how the model will do with different examples. Okay, and now we can finally load the model into memory. Like I said, the function takes only two arguments, the config file and the weights file. It downloads a little bit more from the internet in the meantime, I believe for the backbone, and when that is done we can finally have some fun.

Now we can scroll a little bit lower and start our journey with the first example, and I guess that's the perfect moment to talk about multimodality. It is a really hot topic right now, because the upcoming GPT-4 model will be multimodal, which means it will be able to take an image and a prompt as input. It's exactly the same in the case of Grounding DINO, and that's how it differs from regular object detection models, where you usually only pass an image and get a list of bounding boxes that match the pre-selected class list. Here you pass an image and a prompt, and you get the list of bounding boxes that fulfill that prompt. That's super cool, because if you want to detect something new in the image you don't need to retrain the model; you just change the prompt.

So now let's go back to our first example. I selected the dog-3.jpeg image as the first one; we can see the raw image on the right side, and we will use a simple "chair" query. My intent here is to detect all the chairs visible in the scene. You can also see that there are two additional hyperparameters, box threshold and text threshold. They are there to improve the quality of your predictions. I will not tune them in this demo, but obviously feel free to experiment with different values and let us know in the comments what you find. Okay, enough talking. Let's run the first query and take a look at the results. Let me just make the whole image a bit smaller so we can see it all at once: the model was capable of detecting all the chairs visible in the scene.

Okay, but you might say that detecting chairs is cool but not super impressive; after all, the chair class is part of the COCO dataset, and pretty much any pre-trained YOLO detector would be able to do the same, and you would be right. But Grounding DINO has a lot more capabilities, so let me show you what we can do with just a little bit of prompt engineering. Let's run the model on the same image, but modify the prompt a bit. This time I will look only for a "chair with a man sitting on it", and when we run the model we can see that we have two detections: the first one is the man, and the second one is the only chair in the scene that is occupied by a person. That's pretty crazy if you ask me, because it allows you to create advanced constraints that previously would only be possible by writing a ton of Python code: detect the chair, detect the person, check whether the IoU of the chair and person detections is over some threshold, and filter out all other chairs, keeping only those with a high IoU. How much simpler is it to just write "give me chairs with a man sitting on it"?

Okay, so now what I want to show you is that we can detect multiple classes at the same time. Not only do I detect the chair, but I also decided to add dog, table, shoe, light bulb, basically anything I saw in the scene, and try to detect all of those objects at once. To do that, I just need to list the class names and separate them with commas. So let's take a look at the result, and it turned out really well, in my opinion.
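A minimal sketch of one such inference call, using the helper functions from the repository's groundingdino.util.inference module; the image file name and the threshold values below are just taken from the example above and common defaults, so they may differ in your setup.

    from groundingdino.util.inference import load_image, predict, annotate

    IMAGE_PATH = f"{HOME}/data/dog-3.jpeg"
    TEXT_PROMPT = "chair, dog, table, shoe, light bulb"  # class names separated by commas
    BOX_THRESHOLD = 0.35   # minimum box confidence to keep a detection
    TEXT_THRESHOLD = 0.25  # minimum score for matching a phrase to a box

    image_source, image = load_image(IMAGE_PATH)  # original array plus preprocessed tensor

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=TEXT_PROMPT,
        box_threshold=BOX_THRESHOLD,
        text_threshold=TEXT_THRESHOLD,
    )

    # Draw the boxes and the matched phrases back onto the original image.
    annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)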
Sure, most of those classes still come from the COCO dataset, but there are some, like tail or light bulb, that are not part of COCO and were detected just as well. Okay, so let's try to go even crazier and add even more classes, say paw, finger, and eye, and when we take a look at the result we see that it detected the eye and one paw, but no finger. You get the idea: you can detect a lot of stuff, and the objects you want to detect don't necessarily need to come from the COCO dataset, so you can get creative. I had a lot of fun just looking for different objects in the scene, and the additional benefit is that you can use language to create extra constraints and detect very specific objects. We will see one more example of that right now.

Okay, let's take a look at another image, this time dog-2.jpeg. That picture was taken in a restaurant, so you can see a lot of glasses on the table. Let's write a simple query to detect all the glasses on the table, and you can see that it did a pretty good job. It accidentally classified a salt container as a glass, but that's okay given that it is transparent and most likely made of glass. Now let's modify the query and look for "the glass that is most to the right", and sure enough, the model is capable of detecting that particular glass. It blows my mind that stuff like this is possible, and once again it's not about it being technically possible, because I could do that with YOLO: I would just run detection of glasses and grab the bounding box furthest to the right in the frame. It's the fact that using language to express that query is significantly easier.

Okay, enough of the examples using my images. You will find even more of them in the actual notebook, but I don't want you to be bored to death during this video, so let's move on and try to combine the power of Roboflow datasets with Grounding DINO's detection capabilities. For me, the biggest use case for a model like this is automated dataset annotation. Obviously, that won't work for every dataset, because sometimes the set of classes you are looking for is so obscure that even such a powerful zero-shot detector as Grounding DINO will not be able to deliver. However, I think it's always worth a try, because you might be lucky and spend something like fifty percent less time annotating bounding boxes in your project, and all you need to do is write a small text query.

So we are back in the notebook, and now we will pull a dataset from Roboflow Universe. The first thing I need to do is log in to Roboflow; we have a new CLI for that. I select the workspace I want to use to generate a token (in my case I keep the default one), the token is generated, I copy it, paste it into the input field, and press Shift+Enter so that it stays hidden, and we are pretty much authenticated. To make our life a little bit easier, I created a small utility that picks a random image from the dataset, and now the only thing left to do is download the dataset itself. I went for a dataset from Roboflow Universe that contains pictures of workers, so people in reflective vests and helmets, which I decided would be a pretty cool use case for us. The next step, right after the dataset is downloaded, is to create the text prompt, and I decided to take the default route: just concatenate every class name from that particular project and separate the names with commas.
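A hedged sketch of that part of the workflow with the roboflow Python package is below; the workspace, project, and version identifiers are placeholders rather than the exact ones used in the video, and the raw class names are illustrative.

    from roboflow import Roboflow

    # Authenticate with the token generated for your workspace.
    rf = Roboflow(api_key="YOUR_API_KEY")

    # Placeholder workspace/project/version; the video uses a workers dataset
    # (reflective vests and helmets) from Roboflow Universe.
    project = rf.workspace("some-workspace").project("construction-safety")
    dataset = project.version(1).download("coco")

    print(dataset.location)  # local path where images and annotations were saved

    # The "default route" for the prompt: join the project's raw class names with commas.
    class_names = ["helmet", "reflective", "head"]  # illustrative raw names, not the real ones
    text_prompt = ", ".join(class_names)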
It turned out that this was not the best choice. We can clearly see that the person has both a helmet and a reflective vest on, but the model only detected the helmet. This happened because the class names in the project were not very specific; they used abbreviations and a lot of assumptions, so instead of calling the class "reflective vest" it was only called "reflective", and because of that the model had a really hard time understanding what we wanted to detect. So let's go back to the Jupyter notebook and manually refine the list of classes. I went for reflective safety vest, helmet, hat, and non-reflective safety vest, and sure enough, after just a little bit of prompt engineering, the model is capable of detecting much more in the image.

And that's all for today. I hope you noticed that I had a ton of fun playing with this model. I really think that architectures like this are the future of computer vision. Many people speculate that GPT-4 will have similar capabilities; we actually have a video where we do a little bit of speculation on that, and I will include it in the description. For now, thank you very much for watching, make sure to like and subscribe, and stay tuned for more computer vision coming to this channel soon. My name is Peter and I'll see you next time. Bye!
Info
Channel: Roboflow
Views: 24,375
Keywords: Grounding DINO, Object Detection, Zero-Shot Object Detection, Text Recognition, Computer Vision, Cross-Modality Decoder, Natural Language Processing, Transformers, SOTA, State of The Art, Swin Transformer
Id: cMa77r3YrDk
Length: 14min 8sec (848 seconds)
Published: Tue Mar 28 2023