Accelerate Image Annotation with SAM and Grounding DINO | Python Tutorial

Video Statistics and Information

Captions
Image labeling is boring, takes time, and costs a lot of money. But with recent advancements in computer vision, namely the introduction of very capable zero-shot object detectors like Grounding DINO and groundbreaking models like SAM, you can actually automate most of that process. For most use cases you can literally write a Python script that will do 95% of the work for you while you are just chilling, and your only job is to go through those annotations at the very end, maybe add or remove some polygons here and there. The truth is, you just saved a ton of time. So let's dive in and let me show you how you can use Python to fully automatically annotate your raw images, or to convert your bounding boxes into segmentations. And stay around till the end of the video, because we have quite a big announcement to make.

Okay, let's open the notebook first. The link to the notebook is in the description of the video, but you can also find it in the Roboflow Notebooks repository; it is located in the computer vision skills section, first from the top. Opening the notebook redirects us to Google Colab. Before we start, let's zoom in so you have an easier time following the tutorial. Here in the top section you can learn more about the models we are going to use and the general idea behind the notebook. As I said in the intro, we will use a combination of Grounding DINO, a super powerful zero-shot object detector, and Segment Anything Model (SAM), which can produce masks based on a prompt. Importantly, you can use a bounding box as a prompt, so we will use the output coming from Grounding DINO to prompt SAM.

The first thing we need to do is run nvidia-smi, just to confirm that we have access to a GPU. The command should give you the same result as the one on the screen right now; if that's not the case, then most likely you are running your notebook in CPU mode rather than GPU mode. You can easily fix that by following the instructions in the "before you start" section. The models should run regardless, but they will be significantly slower on a CPU without optimization.

Great, now we need to set up our Python environment. Feel free to skip that part using the chapters in the YouTube video; it shouldn't take very long, but it can be boring, especially if you have already watched our Grounding DINO and SAM videos. For all of those who are still here, what is happening right now is that we are installing Grounding DINO. To do that we clone the repo and check out the latest commit hash at the time of recording this video; this is to make sure the notebook will keep running even if breaking changes are introduced in the repository later. At the end of that cell we install all project dependencies, and this is the part that will most likely take a few more minutes to complete, so let's speed up that process, shall we?

Great. The next thing we are going to do is install Segment Anything Model; in that case the process should be significantly faster, just a few seconds if I remember correctly, and we are done. Just two more dependencies to install, I promise. The first one is supervision. It is actually installed as part of the Grounding DINO repository, but its version is pinned to quite an old one, and today we are going to use one of the latest features, so we need to uninstall the old one and install 0.6.0. The other one is roboflow, as we are going to use datasets from Roboflow Universe but also upload our automated annotations into the Roboflow backend, so we can refine them in the Roboflow annotation tool.
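If you want to see the shape of that setup outside the notebook, here is a rough sketch of those Colab cells. The pinned commit hash lives in the notebook itself, so the value below is just a placeholder:

```
# Install Grounding DINO from source. The notebook pins a specific commit hash;
# PINNED_COMMIT_HASH below is a placeholder, check the notebook for the real one.
!git clone https://github.com/IDEA-Research/GroundingDINO.git
%cd GroundingDINO
!git checkout -q PINNED_COMMIT_HASH
!pip install -q -e .
%cd ..

# Install Segment Anything Model directly from the repository.
!pip install -q git+https://github.com/facebookresearch/segment-anything.git

# Grounding DINO pulls in an older supervision; replace it with 0.6.0
# and install the roboflow package for dataset download and upload.
!pip uninstall -y -q supervision
!pip install -q supervision==0.6.0
!pip install -q roboflow
```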
Awesome. Those who watched our recent Grounding DINO tutorial probably remember that to run that model you need two files: the configuration file and the weights file. The configuration file is part of the repository, so we have it already; the weights file needs to be downloaded, and we run wget to do that. It's a similar story when it comes to SAM: once again we need to wget the checkpoint.

When that is done, we are pretty much ready to load both of our models into memory. We start with Grounding DINO, and during initialization it will download a few more weights, probably for the backbone, nothing too crazy, so that process should finish relatively quickly, I hope. And after just a few seconds we have it. The second model is SAM. I'm using the larger version of the encoder; it's slower, but it produces better quality masks, and that's our primary objective here. It is also worth remembering that the image encoding part will only run once, regardless of the number of bounding boxes on the image, so in this particular use case, even if that encoding part takes a second longer, it's not a problem for us.

Great, the Python environment setup is done. We can finally play around with those models and, hopefully, learn how to annotate our data. As usual I will use my images as an example, but feel free to upload yours into Google Colab: just open the File Explorer on the left side and drag and drop your images. The one thing you need to remember is to put them into the data subdirectory; that way it will simply be much easier to follow the next steps of the tutorial, because you won't need to manage your file paths on your own and everything should work out of the box.

Great. So I'm setting up the data directory and downloading a few of my images into it, and we can pick one of them and use Grounding DINO to produce zero-shot detections for selected classes. Here is the list of classes we are going to use; obviously feel free to change the image, change the list of classes, or generally speaking play around with the parameters of our inference. And here is something quite interesting: we noticed that with a little bit of prompt engineering, specifically adding the word "all" before the name of the class, we get significantly better quality detections. Without it, especially with longer class lists, the model tends to return only one instance of an object of a given class, but that small hack pretty much ensures that you will get all of them, at least in most cases, because you never know. You can learn more about it from our blog post; the link will be in the description below. But in the meantime, back to our video.

Now we can put Grounding DINO to the test and run our first zero-shot detection on the example image. That might take a little bit of time, especially the first time around; it is pretty common for deep learning models to need a warm-up. Overall we got a pretty good result. Maybe the only weird thing that happened is that we got a double detection for the dog, and the reason is that my leg is weirdly splitting the dog in half, so we got one detection for the whole dog and one detection for the part of the dog that is above my leg. Those are certainly common problems of object detection, regardless of the model you are going to use; nothing unusual is happening here.
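For reference, here is a rough sketch of that detection side. The paths, thresholds, class list, and the `enhance_class_names` helper are illustrative; `Model.predict_with_classes` comes from the Grounding DINO repo's inference utilities, so double-check the exact names against the notebook:

```python
import cv2
import torch
from groundingdino.util.inference import Model
from segment_anything import sam_model_registry, SamPredictor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Paths are illustrative: the config ships with the cloned repo,
# the two checkpoints are the ones downloaded with wget above.
GROUNDING_DINO_CONFIG = "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py"
GROUNDING_DINO_CHECKPOINT = "groundingdino_swint_ogc.pth"
SAM_CHECKPOINT = "sam_vit_h_4b8939.pth"

# Load Grounding DINO through the inference helper bundled with the repo.
grounding_dino_model = Model(
    model_config_path=GROUNDING_DINO_CONFIG,
    model_checkpoint_path=GROUNDING_DINO_CHECKPOINT,
)

# Load the large (ViT-H) SAM encoder and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint=SAM_CHECKPOINT).to(device=DEVICE)
mask_predictor = SamPredictor(sam)

# The "all" prompt trick: prefix each class name so the model returns
# every instance instead of a single one.
CLASSES = ["person", "dog", "backpack", "shoe"]  # example class list

def enhance_class_names(class_names):
    return [f"all {name}s" for name in class_names]

image = cv2.imread("data/example.jpeg")  # hypothetical example image

detections = grounding_dino_model.predict_with_classes(
    image=image,
    classes=enhance_class_names(CLASSES),
    box_threshold=0.35,
    text_threshold=0.25,
)
```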
Now we can use our freshly obtained bounding boxes to prompt SAM. We wrap the whole segmentation part into a separate function, and maybe some of you remember that a few minutes ago I said the encoding part runs only once for every image; now you can clearly see that this is the case: we run the encoding, and then we loop over the bounding boxes to prompt the model. So let's run that logic and take a look at the result. We can see that for every bounding box originally extracted with Grounding DINO, we now have a corresponding mask.

That visualization is sick, but it's pretty hard to see the details of every mask, so let's plot our results in a different way and take a deeper look. Here we see every mask displayed on a separate plot along with the corresponding class name; every individual image is basically the result of prompting SAM with a different bounding box. Interestingly, if we zoom in we see that some of the bounding boxes produced multiple masks. This is something we will need to handle when we save those masks into a common annotation format, in our case Pascal VOC XML: each of those detached masks will be saved as a separate object. It happens behind the scenes, but it's worth knowing that it happens.

Okay, at this point we know how to process and annotate a single image; let's do the same for the whole set of images. Similarly as before, we define our class list and the other parameters of zero-shot detection, and below we have the code that we will use to automatically annotate all the images. I will just let it run and explain what is happening in the meantime. It's pretty straightforward: we list all the images in the directory, load each image using OpenCV, run Grounding DINO to extract the bounding boxes, and then filter out all the bounding boxes that don't have an assigned class. That can happen from time to time, so to prevent errors in the later part of the notebook we simply discard those bounding boxes. At the very end we run SAM to extract masks for every image.

Now the cool part: we can plot all of those masks on the corresponding images. In my case it's pretty simple, I only have eight images, but keep in mind that if you have a larger image set this may take a long time to finish, so either skip that cell completely or just pick some images from your dataset. This is really useful, because before we save our dataset to Pascal VOC XML and send it to Roboflow, we get a lot more intuition about the expected quality of our annotations. In case of any problems we can play around and change the parameters of our zero-shot detection to ensure the best possible prediction quality.
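Here is a rough sketch of that per-image helper and the batch loop. It assumes the `grounding_dino_model`, `mask_predictor`, `CLASSES`, and `enhance_class_names` names from the previous sketch; the boolean indexing on the detections and the `mask` field assume a supervision version that supports them, so treat the details as an approximation of the notebook rather than a literal copy:

```python
import os
import cv2
import numpy as np

def segment(mask_predictor, image_rgb, xyxy):
    """Prompt SAM once per bounding box; the image is encoded only once."""
    mask_predictor.set_image(image_rgb)
    masks_per_box = []
    for box in xyxy:
        masks, scores, _ = mask_predictor.predict(box=box, multimask_output=True)
        # Keep the highest-scoring of the candidate masks for each box.
        masks_per_box.append(masks[np.argmax(scores)])
    return np.array(masks_per_box)

IMAGES_DIRECTORY = "data"
images = {}
annotations = {}

for image_name in os.listdir(IMAGES_DIRECTORY):
    image_path = os.path.join(IMAGES_DIRECTORY, image_name)
    image = cv2.imread(image_path)
    if image is None:
        continue  # skip anything that is not an image

    detections = grounding_dino_model.predict_with_classes(
        image=image,
        classes=enhance_class_names(CLASSES),
        box_threshold=0.35,
        text_threshold=0.25,
    )
    # Drop boxes that did not get a class assigned.
    detections = detections[detections.class_id != None]

    # SAM expects RGB; prompt it with the surviving boxes.
    detections.mask = segment(
        mask_predictor, cv2.cvtColor(image, cv2.COLOR_BGR2RGB), detections.xyxy
    )

    images[image_name] = image
    annotations[image_name] = detections
```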
Now we are finally ready to save our predictions. Like I said before, we will use Pascal VOC XML to do that, and we will use one of the latest features of supervision to make it as simple as possible. There are three very important parameters at this stage: the min and max image area percentages and the approximation percentage. They can have a high impact on the quality of the annotations you will see in the Roboflow editor, so I guess it's worth spending a few seconds on a deeper understanding.

A few minutes ago I showed you a chart where you saw each mask as a separate image, and I mentioned that we would need to treat each detached part of a mask as a separate object; these parameters let you customize that process. The first two filter out masks based on their area in relation to the whole image: you can use them to drop masks that are too small or too large. Sometimes, especially for small images, Grounding DINO produces bounding boxes that pretty much cover the whole image; on the other hand, in many cases SAM, when it's not sure what to do with a given bounding box, produces a mask that is detached into multiple elements, even 20 or so. Those two filtering parameters let you get rid of both of those glitchy types of detections, as most likely they are not what you are expecting.

The third parameter steers the number of points that will constitute your annotation polygon. Now you may ask: what annotation polygon? Up until now we were only working with masks. You see, most annotation formats actually don't store masks; they convert those masks into polygons that cover pretty much the same area, and this is exactly the case for Pascal VOC XML. Obviously, depending on the number of points, the polygon may be more or less precise. So some of you, I assume, now have another question: why would I limit the number of points in my polygon? Doesn't the number of points translate directly into the quality of my mask? Well, not exactly, and not every time. Sometimes, especially for high-resolution images, the conversion between mask and polygon can cause a lot of glitches and weird geometry in your polygon, so it's actually better to keep 25 or 50 percent of the points of the original polygon and keep the geometry simpler, especially in cases where you don't lose much precision. You need to remember not to drop too many points, though, because in some cases you may end up with triangles or rectangles. The second important factor is the size of your annotation files: if you have thousands of points in your polygons and tens of polygons on a single image, the file for every image can grow and become hard to work with. Sure, it's a trade-off, and in some cases you actually aim for that level of precision, but in 99% of cases this is not what you need.

Okay, enough of the talking, let's finish the conversion and upload our annotations into Roboflow. But before we do that, let's open the File Explorer and take a look at the contents of the data directory. You can see that I have eight images over here, and when I save my masks to annotation files, a separate XML file is created for every image; this is exactly what we see right now in the File Explorer: an annotations directory with eight XML files in it. Now I can download one of those files onto my hard drive and show you its contents. Let me just bring the VS Code editor onto my shared screen, and here is the content of the XML file: we can see the name of the associated image file, the resolution of that image, and later in the file every object has a separate tag; in it you can find the bounding box and the polygon. That's pretty straightforward, so we can close VS Code.
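For reference, here is roughly what that export step looks like, assuming the `images` and `annotations` dictionaries from the loop above and the supervision 0.6.0 dataset API. The class and argument names here are from memory, and the output path is illustrative, so verify them against the supervision docs for your version:

```python
import supervision as sv

MIN_IMAGE_AREA_PERCENTAGE = 0.002  # drop masks smaller than 0.2% of the image
MAX_IMAGE_AREA_PERCENTAGE = 0.80   # drop masks covering more than 80% of the image
APPROXIMATION_PERCENTAGE = 0.75    # drop 75% of polygon points during mask-to-polygon conversion

sv.Dataset(
    classes=CLASSES,
    images=images,
    annotations=annotations,
).as_pascal_voc(
    annotations_directory_path="data/annotations",
    min_image_area_percentage=MIN_IMAGE_AREA_PERCENTAGE,
    max_image_area_percentage=MAX_IMAGE_AREA_PERCENTAGE,
    approximation_percentage=APPROXIMATION_PERCENTAGE,
)
```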
Given that we now have XML files created for every image, we can proceed to upload them into Roboflow for further refinement. We will use the roboflow package to do that, so let me first change the name of the project and its description, and then run the next cell to create that empty project in my Roboflow workspace. I follow the link in the Jupyter notebook to authenticate myself, generate the token, and copy and paste it into the notebook. This authentication is obviously required, because I will be doing things like creating projects and uploading images and annotations into them. Now that the cell has executed successfully, I can sign in to my Roboflow account, and sure enough I can see the empty project named "YouTube tutorial"; soon we will fill that project with images and annotations. So we just go back to the Jupyter notebook, press Shift+Enter, and wait a few seconds for our images and automatically generated annotations to be uploaded into the Roboflow account.

Looks like the process is complete. Now we can go back to our project, go into the dataset, and see that the images and annotations are there. We can quickly take a look, and when we zoom in we can see that some of them (not all of them) have those irregular edges. This is an unwanted effect, basically the low quality of the mask being converted into a low-quality polygon. You can partially solve that problem by dropping a larger number of points from the polygon, which indirectly smooths its edges. Let's switch the view to layers, because I think it's easier to examine scenes where we have multiple polygons of multiple classes. We can see the shoes, the backpack, the nose of the dog, the ears, and parts of the person, because the person got split in two as the backpack basically divided that polygon into two parts. Interestingly, we have two detections for the dog, so we can simply right-click on one of the polygons and delete it, and that's pretty much all the data cleaning for that particular image. Let's take a look at one more. Here we have the cup, which is pretty cool, the chair (well, not exactly the chair, but in this case we can let it pass), the shoes, the ears, and the dog divided into two parts, similarly to the person divided by the backpack in the previous image. I'm certainly very happy with the quality of the predictions for that image.

Okay, so that's it: zero-shot annotation with polygons. That part assumed that the only thing you have is images, but that is not always the case. Sometimes we already have a dataset; it's just that the dataset is annotated with bounding boxes, and for some reason, a business decision for example, we would like to migrate from bounding boxes to instance segmentation. Up until quite recently that wasn't an easy task, but with the release of Segment Anything Model it's actually really, really easy. We already showed how to do it fully in the UI of the Roboflow annotation tool, but this time I want to show you how to do it with Python. I will use a blueberry dataset available on Roboflow Universe as an example. Here it is: quite small, only 78 images, but the quality of the annotations is really good and there are a lot of objects in every image, so I think it's a perfect example.

We start by downloading the dataset into our Python environment. We don't need to authenticate again, because we are already logged in, and after just a few seconds the dataset is downloaded; like I said, it's pretty small. We can find the dataset location using the dataset.location property: when we hit Shift+Enter we see that it's located under /content/Blueberries-1, and when we refresh the File Explorer we see the contents of that directory. It's divided into three subdirectories: test, train, and valid. For the purpose of this tutorial I will only handle the train directory, but you can imagine just looping over them and processing them in the same way. To give us a little more insight into this particular dataset, I created a cell that you can rerun, and every time you do, you will see a different image from the dataset annotated with its bounding boxes; if I scroll back up and rerun it, a different image appears.
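If you want to reproduce that kind of cell without the notebook's plotting helpers, here is a rough, self-contained version that downloads the dataset and draws the boxes of one random training image using plain OpenCV and XML parsing. The workspace, project, and version identifiers are placeholders, not the real ones:

```python
import os
import random
import xml.etree.ElementTree as ET

import cv2
import roboflow

roboflow.login()        # authenticate once; it points you to a token page the first time
rf = roboflow.Roboflow()

# Placeholders: substitute the actual Universe workspace, project, and version.
project = rf.workspace("some-workspace").project("blueberries")
dataset = project.version(1).download("voc")

train_dir = os.path.join(dataset.location, "train")
xml_files = [f for f in os.listdir(train_dir) if f.endswith(".xml")]

# Pick a random annotation file, load its image, and draw the bounding boxes.
xml_path = os.path.join(train_dir, random.choice(xml_files))
root = ET.parse(xml_path).getroot()
image = cv2.imread(os.path.join(train_dir, root.findtext("filename")))

for obj in root.findall("object"):
    box = obj.find("bndbox")
    x1, y1, x2, y2 = (
        int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")
    )
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

cv2.imwrite("preview.jpg", image)  # or display it inline with matplotlib
```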
Okay, that's pretty cool. Now let's try to run SAM and convert those bounding boxes into segmentations. This example is a lot simpler than the previous one: here we only use SAM, so we don't need Grounding DINO to produce the initial bounding boxes; we get them from the dataset, and that speeds up the process, because Grounding DINO was actually the slowest part of the previous pipeline. However, as you can see, some images have around 100 bounding boxes, so we need to loop over the boxes and query SAM with every bounding box separately, and that can take a little bit of time. The tqdm progress bar says it will take around two more minutes, so let's speed up that process. Boom, the processing is done.

Now we can take a look at the result. Similarly as before, I have a cell that I can just rerun, and every time I do, it gives me a different image from the dataset, and you can see the masks. I mean, it's so awesome. I love that dataset, because it shows the true potential and power of SAM; I don't even want to imagine how much time it would take me to annotate those fruits manually. Just like before, we will save those masks in Pascal VOC XML format. Same drill, just a different set of images, but you know what, I will change the approximation percentage from 75 to 50, so I will keep more of the points in the polygons. That's because some of those images are quite low resolution, and for them we run the risk of ending up with those triangles and rectangles I mentioned before. We change the project description so that it's something unique, hit Shift+Enter, and create the empty project; when we go back to the Roboflow workspace we see that we have one more project, blueberries, and it's empty, just like before. Finally, we hit Shift+Enter again, loop over our images and annotations, and slowly send them one by one to the Roboflow backend.

It will take a little bit of time for the process to complete, but we can go to the project right now and see that the number is growing; in the meantime we can look at the images and annotations that are already there. For the first one, I think everything is just fine: we got almost perfect annotations for each of the fruits. Let's take a look at the next one. Funnily enough, we missed one of the blueberries in the middle, so we can pick the Smart Polygon tool from the toolbox on the right-hand side, hover over the fruit to see the preview of the mask, and click when the geometry is acceptable. Looks like I picked the wrong class, so let's fix that: right-click and select the proper one. And yeah, I guess that's it.
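For reference, the project creation and upload loop look roughly like this. The create_project and upload argument names are from memory and may differ between versions of the roboflow package, and the project name and directory paths are placeholders, so treat this as an assumption to verify against the docs:

```python
import os
import roboflow

rf = roboflow.Roboflow()        # assumes roboflow.login() was already called
workspace = rf.workspace()

# Create an empty instance segmentation project; argument names are from memory.
project = workspace.create_project(
    project_name="blueberries-auto-annotated",   # placeholder name
    project_type="instance-segmentation",
    project_license="MIT",
    annotation="blueberries",
)

IMAGES_DIRECTORY = os.path.join(dataset.location, "train")  # from the download sketch
ANNOTATIONS_DIRECTORY = "annotations"  # wherever as_pascal_voc wrote the XML files

# Send every image together with its automatically generated annotation file.
for image_name in os.listdir(IMAGES_DIRECTORY):
    if not image_name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    image_path = os.path.join(IMAGES_DIRECTORY, image_name)
    annotation_path = os.path.join(
        ANNOTATIONS_DIRECTORY, os.path.splitext(image_name)[0] + ".xml"
    )
    project.upload(image_path=image_path, annotation_path=annotation_path)
```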
It turned out to be a pretty long video, so thank you very much if you are still here. I hope you can see how using Grounding DINO and SAM can speed up your annotation process, regardless of whether you just have images or you have an object detection dataset you would like to convert into instance segmentation. Of course, the process is not perfect and you still need a human in the loop, but instead of spending literally hundreds or even thousands of hours annotating those images, for most use cases your work now boils down to clicking between images, confirming that everything is right, or maybe adding one annotation here and there; not even close to the time and money investment you had to make up until now.

And now the time has come for the announcement I teased at the beginning of the video. Over the last few weeks we experimented a lot with Grounding DINO and SAM, and we feel we have gathered a lot of expertise in using those particular models. Those models are obviously too slow to run in real-time environments; however, we see a lot of potential in using them to automatically annotate your datasets and use that data to train real-time detectors like YOLOv8, for example. The coolest thing about it is that in many cases you don't even need to annotate your data: you can simply distill the knowledge that is already in Grounding DINO, SAM, or other zero-shot models, and transfer it through the training process to those really fast object detectors. We are currently building a Python library that will hopefully let you do exactly this with minimum effort, so stay tuned, because we plan to release it very soon. In the meantime, like and subscribe to stay up to date, and maybe leave a star under the notebooks and supervision repositories. That's all for today. My name is Peter and I'll see you next time. Bye!
Info
Channel: Roboflow
Views: 39,643
Keywords: Segment Anything, Segmentation, Labeling, One-Shot, Detection, SAM, Segment Anything Model, Meta AI, Python, foundation models, Grounding DINO, Object Detection, Zero-Shot Object Detection, Text Recognition, Computer Vision, Cross-Modality Decoder, Natural Language Processing, Transformers, SOTA, State of The Art, Auto Annotation, Zero-Shot, Bounding Box, Segmentation Mask, Instance Segmentation, GroundedSAM, Grounded SAM, Promptable
Id: oEQYStnF2l8
Length: 24min 43sec (1483 seconds)
Published: Thu Apr 20 2023