YOLO-World: Real-Time, Zero-Shot Object Detection Explained

Video Statistics and Information

Captions
What if I told you that there is a model that you can use to detect all of these objects and more, without any training, and that it can run in real time? Well, at least if you have access to a powerful GPU, but you can still get a decent FPS on a cheap NVIDIA T4. Today we are going to talk about YOLO-World, a zero-shot object detector that is 20 times faster than its predecessors. We'll talk about the architecture, discuss the main reasons why it's so fast, and above all, I will show you how to run it in Google Colab to process images and videos.

Traditional object detection models such as Faster R-CNN, SSD, or YOLO are designed to detect objects within a predefined set of categories. For instance, models trained on the COCO dataset are limited to 80 categories. If we want a model to detect new objects, we need to create a new dataset with images depicting the objects we want to detect, annotate them, and train our detector. This, of course, is time-consuming and expensive. In response to this limitation, researchers began to develop open-vocabulary models. Not even a year ago I showed you Grounding DINO, a zero-shot object detector that back then blew my mind, and to be honest, I'm still impressed by its capabilities. All you have to do is prompt the model, specifying the list of classes that you are looking for, and that's it; no training is required. The downside of Grounding DINO was its speed: it took around 1 second to process a single image. Good enough if you don't care about latency, but pretty slow if you are thinking about processing live video streams. That's because zero-shot detectors usually use heavy Transformer-based architectures and require simultaneous processing of text and images during inference. And that brings us to YOLO-World, a zero-shot object detector that, at least according to the paper, is equally accurate and 20 times faster than its predecessors. If you would like to learn more about Grounding DINO and YOLO-World, you can find links to two of my blog posts covering those models in the description below.

Now let me give you a brief overview of the model architecture. YOLO-World has three key parts: a YOLO detector that extracts multi-scale features from the input image, a CLIP text encoder that encodes the text into text embeddings, and a custom network that performs multi-level cross-modality fusion between image features and text embeddings, whose name is so complicated that I won't even try to pronounce it: the Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). Using a lighter and faster CNN as its backbone is one of the reasons for YOLO-World's speed. The second one is the prompt-then-detect paradigm: instead of encoding your prompt each time you run inference, YOLO-World uses CLIP to convert the text into embeddings. Those embeddings are then cached and reused, bypassing the need for real-time text encoding.

Okay, enough of the talk, let's take a look at some code. But first, a short announcement: this week we will have our first community session, so if you have any questions about the YOLO-World model or about the code that I will show today, leave them in the comments. I will try to answer all of them during the live stream that we will host this week; you can find more details about it in the description below. You can also join the live and ask your questions in real time. That would be awesome, because I don't want to sit alone.

The link to the cookbook I'll be using is in the description below, and I strongly encourage you to open it in a separate tab and follow along. We click the "Open in Colab" button located at the very top of the cookbook, and after a few seconds we should get redirected to the Google Colab website. Now the first thing we need to do is ensure that our Colab is GPU-accelerated. As usual, we can do it by executing the nvidia-smi command; after a brief moment we should see a table containing information including the installed version of CUDA and the name of the graphics card that we have at our disposal. Now that we've confirmed that our Colab session has GPU support, it's time to install the necessary libraries. The first is Roboflow Inference, a Python package that we will use to run YOLO-World locally, but you can use it to run all sorts of different computer vision models. The second is Supervision, the computer vision Swiss army knife that we will use, among other things, for filtering and annotating our detections. The installation may take a few moments, so let's use the magic of cinema to skip it. To confirm that everything went smoothly, let's try to import the packages we need. Both OpenCV and tqdm are available in Google Colab out of the box, which is why we didn't include them in the installation section; if you're running the notebook locally on your PC, make sure those packages are also installed.
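The cookbook cells themselves are not reproduced in the transcript, so here is a minimal sketch of the setup step described above. The package extras and the import path are assumptions on my part; check the cookbook for the exact pinned versions.

```python
# Install the two libraries used in the walkthrough (run in a Colab cell).
# The extras/version spec is an assumption -- the cookbook pins exact versions.
!pip install -q "inference-gpu[yolo-world]" supervision

# Imports used throughout the notebook; cv2 and tqdm ship with Colab,
# so install them separately only if you run this locally.
import cv2
import supervision as sv
from tqdm import tqdm
from inference.models.yolo_world import YOLOWorld
```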
YOLO-World is available in four different sizes: S, M, L, and X, but for now only the first three are accessible via the inference package. Of course, along with different sizes you should expect different speeds and accuracies. In this tutorial I will use the L version, but you should use the version that is suitable for you based on your speed, accuracy, and hardware requirements. To load the model, we simply create an instance of the YOLOWorld class, which we imported a few cells above. This class has two core methods: set_classes and infer. As mentioned in the introduction, to avoid the need for real-time text encoding, YOLO-World utilizes the prompt-then-detect paradigm: by using the set_classes method, our prompt is encoded into an offline vocabulary. Let's choose a list of our classes: person, backpack, dog, eye, nose, ear, and tongue. We see that the CLIP model is being downloaded in the background; it will be used to convert our list of classes into embeddings. Now we just need to load our image and pass it as an argument to the second method I mentioned, infer. Then, using the utilities available in the Supervision package, we can visualize our results.

Well, unfortunately, from the entire list of classes that we provided, only two were detected: a person and a dog. In my experiments I noticed that classes outside of the COCO dataset are detected with a significantly lower confidence level, so let's try to lower the threshold to include detections that the model is less certain about. Our updated code is very similar; essentially only two things have changed. We drastically lowered the confidence threshold, from the default 0.5 to 0.003, and we updated the code visualizing detections to display not only the class name but also the confidence level. Such a low confidence threshold is not something that I would usually recommend, but in the case of YOLO-World this strategy works really well. In this visualization we see that this time significantly more of the wanted classes have been detected, but we have a new problem: duplicated detections. Each object is now associated with two or even three bounding boxes.
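A minimal sketch of the single-image workflow described above, assuming the inference and supervision APIs mentioned in the walkthrough; the model_id string, the image path, and the annotator classes are my assumptions, and newer supervision releases may name the box annotator differently.

```python
import cv2
import supervision as sv
from inference.models.yolo_world import YOLOWorld

# Load the L variant of YOLO-World; the model_id string is an assumption.
model = YOLOWorld(model_id="yolo_world/l")

# Prompt-then-detect: the class list is encoded with CLIP once and cached.
classes = ["person", "backpack", "dog", "eye", "nose", "ear", "tongue"]
model.set_classes(classes)

# Run inference with a very low confidence threshold, as discussed above.
image = cv2.imread("image.jpg")  # hypothetical path
results = model.infer(image, confidence=0.003)
detections = sv.Detections.from_inference(results)

# Visualize boxes plus "class confidence" labels with supervision.
labels = [
    f"{classes[class_id]} {confidence:.3f}"
    for class_id, confidence in zip(detections.class_id, detections.confidence)
]
annotated = sv.BoundingBoxAnnotator().annotate(image.copy(), detections)
annotated = sv.LabelAnnotator().annotate(annotated, detections, labels=labels)
sv.plot_image(annotated)
```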
To solve this problem, we'll use non-maximum suppression. Non-maximum suppression (NMS) is an algorithm that uses intersection over union (IoU) to estimate the degree to which detections overlap with each other; among detections whose IoU exceeds the set threshold, the ones with lower confidence are discarded. Some time ago I wrote a blog post about NMS, so if you want to learn more, the link is in the description. Once again we introduced only a minor change in the code: this time we use the with_nms method available in Supervision and set an aggressive IoU threshold value of 0.1. The lower the value, within the range between 0 and 1, the smaller the overlap between detections must be for one of them to be discarded. Comparing the results at each stage, we see that ultimately we could detect a lot more objects from the wanted list while maintaining the high quality of the obtained results. I know that confidence levels at around 1% appear quite unusual, but as I said, in the case of YOLO-World it just works.

YOLO-World shines brightest when processing videos, not just individual images. According to the paper, it can achieve up to 50 FPS on an NVIDIA V100; during my experiments I got 15 FPS on an NVIDIA T4, a much more budget-friendly alternative. The transition from processing a single image to processing entire videos is fairly straightforward: we just loop over the frames of the video and run inference for each of them. Because we look for the same objects on each frame, we only need to encode our class list once.

In our next experiment we will attempt something more ambitious than detecting the dog class. In this video we see an object with holes that are being filled with a yellow substance; let's check if YOLO-World will be able to locate those filled holes. Because the objects that we are looking for are hard to define, choosing the right prompt was a challenge. I tried several variants, but ultimately using a color reference proved to be the most effective. One thing that I didn't mention in the intro, but I think is quite important: the authors of the paper show examples of using color and position as references, so don't hesitate to use them in your prompts. Let's load the first frame of our video and run the set_classes method, setting our prompt to "yellow filling". As before, we run inference with a low confidence threshold of 0.002. Now our model successfully detects the individual holes, but along with them it accidentally detects the entire object with holes. We observed the same effect a few months ago while testing Grounding DINO: both models tend to return large, high-level bounding boxes that in some sense meet our criteria but are not the objects that we are looking for. This especially happens when processing images or video with low resolution.

To solve this problem, we'll filter our detections based on their relative area: if a given bounding box occupies a larger percentage of the frame than a set threshold, it will be dropped. It sounds quite complicated, but implementing it with Supervision is a piece of cake. First, using the VideoInfo class, we'll obtain information about the resolution of the video, the width and the height of the entire frame. Knowing these values, calculating the total frame area in pixels is quite straightforward. On the other hand, Supervision also provides easy access to the area of individual bounding boxes. Now all we need to do is divide the individual areas by the area of the entire frame to obtain the relative area. Thanks to NumPy, the entire operation can be vectorized and performed simultaneously for all bounding boxes. By setting a relative area threshold, in my case 0.1, we can construct a logical condition that allows us to filter out bounding boxes larger than 10% of the entire frame. When we visualize the result, we see that only detections representing the filled holes remain. Awesome!
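A minimal sketch of the two filtering steps described above, assuming supervision's Detections API (with_nms, the area property, and boolean indexing); the video path is hypothetical and the cookbook's exact code may differ.

```python
import supervision as sv

# 1) Drop duplicated boxes with non-maximum suppression.
# An aggressive IoU threshold of 0.1 discards boxes that overlap even slightly.
detections = detections.with_nms(threshold=0.1)

# 2) Drop boxes that cover too much of the frame (relative-area filtering).
video_info = sv.VideoInfo.from_video_path("video.mp4")  # hypothetical path
frame_area = video_info.width * video_info.height

# detections.area returns the area of every bounding box as a NumPy array,
# so the relative area of all boxes is computed in one vectorized step.
relative_area = detections.area / frame_area

# Keep only boxes smaller than 10% of the frame via boolean indexing.
detections = detections[relative_area < 0.1]
```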
The final step is to process the entire video and save the result in a separate file. To do this, we will use two utilities available in Supervision: the frame generator will help us loop over the frames of the source video, while VideoSink will take care of recording the result. All the code we've written so far goes inside the for loop and will be triggered for each frame of the video. We see that YOLO-World has successfully solved a task that traditionally would have required training a model on a custom dataset.
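Putting the pieces together, here is a minimal sketch of the per-frame loop described above, assuming supervision's get_video_frames_generator and VideoSink utilities; the file paths are hypothetical, and the prompt and thresholds are the ones mentioned in the walkthrough.

```python
import supervision as sv
from inference.models.yolo_world import YOLOWorld
from tqdm import tqdm

SOURCE_VIDEO = "input.mp4"   # hypothetical paths
TARGET_VIDEO = "output.mp4"

model = YOLOWorld(model_id="yolo_world/l")
model.set_classes(["yellow filling"])  # encode the prompt once for all frames

video_info = sv.VideoInfo.from_video_path(SOURCE_VIDEO)
frame_area = video_info.width * video_info.height
frames = sv.get_video_frames_generator(SOURCE_VIDEO)
box_annotator = sv.BoundingBoxAnnotator()

with sv.VideoSink(TARGET_VIDEO, video_info=video_info) as sink:
    for frame in tqdm(frames, total=video_info.total_frames):
        results = model.infer(frame, confidence=0.002)
        detections = sv.Detections.from_inference(results)
        # Same filtering as for the single frame: NMS plus relative-area filter.
        detections = detections.with_nms(threshold=0.1)
        detections = detections[(detections.area / frame_area) < 0.1]
        sink.write_frame(box_annotator.annotate(frame.copy(), detections))
```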
So, is YOLO-World a golden solution, the model that puts an end to training on custom datasets? Well, no. There are still cases where I would choose a model trained on a custom dataset over a zero-shot detector like YOLO-World. Let's start with the obvious issue: latency. YOLO-World is much faster than its predecessors, but still significantly slower than state-of-the-art real-time object detectors. Therefore, if we need fast processing and have limited computational resources, we still need to rely on more traditional solutions. YOLO-World is also less accurate and reliable than detectors trained on a custom dataset. There are, of course, cases where YOLO-World may prove useful, especially when we tightly control the environment. A perfect example is one of the snippets I showed you in the intro: I think YOLO-World could be successfully deployed at a croissant factory, for instance, to keep count of daily production; unexpected objects cannot suddenly come into the view of the camera, and the croissants move one by one. On the other hand, preparing the suitcase demo took me quite some time, and I still must admit I did some cherry-picking. YOLO-World excels at detecting suitcases; however, other objects often appear in the frame as well, and the model may occasionally misclassify them as suitcases.

So, to sum it up, I encourage you to prototype with YOLO-World, check if it works in your use case, and use the techniques that I showed you today to refine detections. However, be prepared that at the end of the day you may still need to train your model on a custom dataset, especially if you don't control the environment, you have poor camera placement, or you simply look for objects that for some reason cannot be detected.

YOLO-World opens the way to use cases that so far have been impossible, like open-vocabulary video processing or using zero-shot detectors on the edge, but it's just the beginning. By combining YOLO-World with fast segmentation models like FastSAM or EfficientSAM, we can build a zero-shot segmentation pipeline that runs dozens of times faster than the Grounding DINO plus SAM combo I showed you a few months ago. Inspired by this idea, I created a Hugging Face Space where you can use YOLO-World and EfficientSAM to process your images and videos. You can even take this idea a step further and build a video processing pipeline that automatically removes the background behind detections, uses diffusion-based models to dynamically replace it, or removes detected objects from the frame entirely by combining YOLO-World, EfficientSAM, and ProPainter.

YOLO-World is an important step in making open-vocabulary detection faster, cheaper, and widely available. While maintaining almost the same accuracy, it is 20 times faster and five times smaller than leading zero-shot object detectors. I highly encourage you to take a look at our materials; the links to the cookbook and the Hugging Face Space are in the description below. If you have any questions, leave them in the comments, and I'll try to answer all of them during the upcoming community session. And that's all for today. If you liked the video, make sure to like and subscribe, and stay tuned for more computer vision content coming to this channel soon. My name is Peter, and I'll see you next time. Bye!
Info
Channel: Roboflow
Views: 21,599
Keywords: YOLO-World, zero-shot object detection, open-vocabulary object detection, real-time object detection, Ultralytics YOLO-World, RepVL-PAN, Image-Pooling Attention, prompt-then-detect, YOLO-World vs GroundingDINO, YOLO-World vs GLIP, Roboflow Inference, Roboflow Supervision, multimodality
Id: X7gKBGVz4vs
Length: 17min 48sec (1068 seconds)
Published: Wed Feb 21 2024