How to run YOLO-World for real-time custom object detection without any training or datasets

Captions
No datasets, no training, only a few lines of code. We now have this new YOLO-World model for open-vocabulary object detection, so we don't need anything: we don't have to label our data, and we don't have to go in and fine-tune an object detection model before we can deploy it. With this YOLO-World model we can specify what types of classes and labels we want to detect, or we can just leave it blank and it's going to detect whatever it can in the image. We're going to cover that in this video: we'll go through the GitHub repository, see what it can do, look at some benchmarking, and then see how we can run it in code, with some crazy good results at the end.

So let's jump straight into the project page. We can see here "YOLO-World: Real-Time Open-Vocabulary Object Detection", because we don't need any labels anymore, and the cool thing is that this model actually runs in real time. It is built on top of the YOLOv8 architecture from Ultralytics, so it only takes a few milliseconds to process the images. Now we have open vocabulary running in real time: you can use your webcam, move it around, you can use videos, throw everything at the model as we do with the YOLOv8 models, and it's going to give you the results and detect arbitrary objects in the scene. You don't have to label data; you can just tell it to detect bottles, as we're going to see later in this video.

First of all, they also have a demo on Hugging Face, so let's go and take a look at that one. It should be running on a T4. We can enter the different classes to be detected, separated by commas; the default is just the 80 COCO dataset classes, which is basically what the pre-trained models use, but you can also specify whatever you want yourself. So just drag and drop an image here: we have a bunch of bottles, and we're going to run this video through a bit later. Let's do it with the default prompts, hit submit, and see if we're able to process the image. It takes a couple of seconds to load the model and do the inference, and now we can see the results. We get a sink, probably with a bit of a low confidence score, but we're actually able to detect all of these bottles in the frame. It does miss some detections, like the bottles in the back, because they are really hard to see, but all the bottles that are clearly in the foreground it gets. And this model has not been trained on anything here; we're just specifying these classes. If we don't want the sink, we can also just specify "bottle", hit submit again, let it process, and now we're only detecting the bottles: the sink is removed. So you can go into this Hugging Face space, play around with it, drop in some images, and also play around with the prompts.

We can also deploy and export an ONNX model directly, and you'll get a download link. We can specify the maximum number of boxes we want to detect, the score threshold for the confidence score, and also the IoU threshold for non-maximum suppression. If you hit "deploy and export ONNX model", it's going to take these specific classes together with the YOLO-World model, which is built on top of YOLOv8, and export that model together with your labels.
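If you prefer to do that export from code instead of the Hugging Face space, here is a minimal sketch with the Ultralytics API. The "worldv2" weight name is an assumption on my part: it is the variant Ultralytics documents as exportable, whereas the v1 world weights keep the CLIP text encoder attached and may not export the same way.

```python
# Hedged sketch: bake a custom vocabulary into YOLO-World and export it to ONNX.
from ultralytics import YOLO

model = YOLO("yolov8l-worldv2.pt")   # weights are downloaded automatically on first use (assumed name)
model.set_classes(["bottle"])        # the open-vocabulary prompt, e.g. only "bottle" as in the demo
model.export(format="onnx")          # writes an .onnx file next to the weights
```

Confidence and NMS thresholds like the ones in the demo UI are applied at inference time in your own runtime; they are not baked into the exported graph here.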
So now we have an ONNX model. We don't have any data, we have not labeled our data, and we have not trained any models; we just use this out of the box, specify the prompt, and it runs in real time. Here you can see the YOLO-World large ONNX model that we can download and use in our own applications and projects (a quick way to sanity-check the exported file is sketched at the end of this passage).

Going back to the project page, we have the demo, we can read the abstract, and they also have a paper you can go through. The main idea behind an open vocabulary is that we don't have a fixed set of labels: the model is trained in a zero-shot fashion on tons of different classes and objects, and at deployment time a user specifies what prompts, or what classes and objects, we want to detect. We take our user vocabulary and throw it into a text encoder, which uses CLIP in this case. In their example the text is "a man and a woman are skiing with a dog"; during the training phase the text comes from the online vocabulary, and at deployment it comes from the user input. That gives us vocabulary embeddings, which are combined with the YOLOv8 backbone, and the model then throws out the results at the end. You can go through the paper if you want to dive more into the details, but here we're just going to go over the benchmarks and the highlights, then look at the results and see how we can run it in code, and also how fast it runs.

In the framework highlights we can see that it uses a CLIP-based text encoder, as I just mentioned, and that it is positioned as the next generation of YOLO detectors. Here are all the datasets it has been pre-trained on: large-scale vision-language datasets that empower YOLO-World with strong zero-shot open-vocabulary capability and grounding ability, which basically means we can detect objects that we have not trained on before. And again, fast inference speed: this runs about as fast as the YOLOv8 models.

If you go down and take a look at the performance, they compare against methods like GLIP and Grounding DINO, which were used for this before and which use a Swin Transformer as the backbone. Now we can use the convolutional backbones from YOLOv8 and get significantly higher processing speed. Before, with GLIP or Grounding DINO, you got one or two frames per second on an NVIDIA V100, which is a very high-end GPU, and the average precision and recall were significantly lower compared to what we have now. You can see the FPS numbers for the YOLO-World models; again we have small, medium, and large variants. This is pretty cool: we can get up to 50, 60, 75 frames per second. I'm actually running this as well, and you're going to see that we get even better performance and inference speed compared to these numbers. So we have good precision and good inference speed, which is pretty crazy considering we don't have to do anything. We can also see a speed-accuracy curve showing how significantly better it is compared to the other methods, up to a 20x speedup. They also have a bunch of visualizations going over a number of examples that you can dive into, but right now let's just jump straight into it and see how we can actually run it.
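Before wiring the downloaded or exported ONNX model into your own application, a quick sanity check is to load it and print its input and output shapes. This is a hedged sketch: it assumes you have onnxruntime installed and uses the file name from the export sketch above.

```python
# Inspect the exported YOLO-World ONNX graph with onnxruntime (assumed file name).
import onnxruntime as ort

session = ort.InferenceSession("yolov8l-worldv2.onnx")
for tensor in session.get_inputs():
    print("input:", tensor.name, tensor.shape)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape)
```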
If you go inside the Ultralytics documentation, they have already implemented this YOLO-World model; there was basically a pull request that has been merged. So now we have this YOLO-World model with a short description, as we just went over, the key features and so on, and you can see how to get the weights. You can also see how to use it from the command prompt: we just call yolo predict, specify the model and the path to the source, and the source could be a NumPy array, a YouTube video, a webcam stream, a video path, etc. You can throw all of that in as the source and it's going to do the predictions straight away; you can also specify save=True and show=True. We can set the different prompts beforehand, so right now we set the classes, for example person and bus, if you're only interested in those two classes, but we're going to see that in just a second. This is how easy it is to run: no dataset, as I mentioned, no training, only a few lines of code; we can actually get away with only two lines of code after we've imported YOLO from ultralytics.

So let's jump straight into this Python script. I have a bunch of videos we can run through, to test it out both without specifying any classes and with a couple of classes specified. First of all, from ultralytics we're going to import YOLO, then we have cv2, and we're also going to import supervision from Roboflow as sv, because we're going to use that to annotate our frames. We'll jump into the GitHub repository and the documentation in just a second, so you can see how easy it is to extract the results from Ultralytics and then visualize them with supervision.

After that we can set up the model. We just specify YOLOv8 medium, as we do in all the other videos where we're using Ultralytics models, and now we just have to add "world". You could also change this out with segmentation if you want to run segmentation, leave it out for plain object detection, or use obb for oriented bounding boxes, pose, or classification. So this is how easy it is to use the YOLOv8 models from Ultralytics, but right now we're going to use the world model, starting with the bottle example: I have a video where these bottles are running in production. Then we set up our sv bounding box annotator, which is how we take the results from Ultralytics and throw them into supervision, and we also set up a label annotator so we can visualize our labels on top of the bounding boxes. Now we need to specify the path for our video capture, and it's going to open up that video (it could also be a webcam stream and so on), and we set up a video writer so we can store the results in a video file. Then we have a while loop as always: while the capture is opened, we read an image, throw that image directly through our model with model.predict, and we get the results out. Then we take the results and create a new sv.Detections instance by calling the from_ultralytics method, passing in the result at index zero, and we now have our detections that we can use for drawing our bounding boxes and our label annotations.
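Putting those pieces together, here is a minimal sketch of the kind of script described above. The weight file, video path, and output name are assumptions for illustration; any YOLO-World weights and any video path or webcam index should work the same way.

```python
import cv2
import supervision as sv
from ultralytics import YOLO

# Load YOLO-World weights (medium variant, as in the video) and set the prompt.
model = YOLO("yolov8m-world.pt")
model.set_classes(["bottle"])

# supervision annotators for drawing boxes and labels on top of the frames.
bounding_box_annotator = sv.BoundingBoxAnnotator()
label_annotator = sv.LabelAnnotator()

# Video source (assumed path) and a writer to store the annotated result.
cap = cv2.VideoCapture("bottles.mp4")   # 0 here would use a webcam instead
fps = cap.get(cv2.CAP_PROP_FPS) or 30
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    # Run YOLO-World on the frame and convert the first result to supervision detections.
    results = model.predict(frame)
    detections = sv.Detections.from_ultralytics(results[0])

    # Draw bounding boxes and labels, then save and show the frame.
    frame = bounding_box_annotator.annotate(scene=frame, detections=detections)
    frame = label_annotator.annotate(scene=frame, detections=detections)
    writer.write(frame)
    cv2.imshow("YOLO-World", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
writer.release()
cv2.destroyAllWindows()
```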
So this is how we can use it with Roboflow and supervision. Let's jump in and take a look at the supervision GitHub repository from Roboflow, because it's a really cool tool that you should definitely know of when you're playing around with computer vision models, object detection models and so on. You can take the results from a number of different frameworks and models and then do some cool visualizations directly. They have a bunch of examples in there that you can go over; the only thing you need to do is pip install supervision, and then we can set up the model as we have now with the YOLO model, extract the detections from Ultralytics, and then go down and create our annotators, which is the exact same thing we have already done inside the code. They also have a bunch more examples, so if you want to use, for example, a Roboflow model, you can do that as well, and we can also get all these different types of annotations. If we just scroll through it, we can see things like crosshairs, heat maps, and zones directly; we just need to specify that. So there are a bunch of different annotation styles; definitely go through this GitHub repository. They also have nice documentation going over all of it, all the different arguments you can specify for the box annotator and the label annotator. We can detect and annotate, we can track objects, and we can also filter objects. If you go inside the API reference for annotators, they have a bunch of different annotator types: bounding box, round box, box corner, color, circle, and if you want to do some more hardcore things they also have pixelate and blur. Say you want to detect faces in an image and just blur those out: you can do that directly with only a few lines of code (there's a small sketch of that below). This is pretty cool, and that's why I'm using it; you should definitely test it out, because we don't have to deal with extracting all the results and setting up all the putText calls, rectangles, lines and so on in OpenCV.

Let's jump back into it. Now we have everything set up with supervision and we're extracting the results, so let's go and take this video with the bottles, where we can see the bottles running on a production line. I'm just going to copy the relative path, go back into our script, overwrite the path so it's correct, and now we should be good to go. We can open up a new terminal and run the program directly. This is brand new from Ultralytics, so if you have an earlier version of ultralytics, make sure that you uninstall it first or just upgrade it with pip, because that is required; if you're getting any errors, it is most likely because of that, so just pip uninstall it or upgrade it.
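As a hedged example of those "more hardcore" annotators, here is a small sketch of swapping in supervision's BlurAnnotator so the model's detections get blurred instead of boxed. The prompt, image path, and output name are made up for illustration.

```python
import cv2
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8m-world.pt")
model.set_classes(["face"])            # illustrative prompt; any class works

blur_annotator = sv.BlurAnnotator()    # alternatives: BoxCornerAnnotator, CircleAnnotator, PixelateAnnotator, ...

image = cv2.imread("people.jpg")       # assumed input image
detections = sv.Detections.from_ultralytics(model.predict(image)[0])
blurred = blur_annotator.annotate(scene=image.copy(), detections=detections)
cv2.imwrite("people_blurred.jpg", blurred)
```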
Well, I've already done that, so I can hit python yolo_world.py, press enter, and it's going to run the program. On the first run it's going to download the model weights; I've already been running this, so they are already in my directory. Here we can see the bottles running on the conveyor belt, and we get really nice detections. You can see the label is just 0, because right now we only care about one class, so it's basically just a bottle. We can always swap that out, or we can specify the labels as well by setting the arguments for the label annotator with supervision, and then it's going to show "bottle" instead of a 0 (there's a small sketch of that after this section), but it doesn't really matter too much. We're going to play around with a bunch of videos here, but let's just run it again and see that it keeps track of all of these objects; it doesn't even lose detections in any of the frames, so it almost looks like we're running object tracking on top of this.

Let's go and grab some other examples, some other videos. Maybe we can just try not specifying anything and run it quickly again. There we go, it opens up in just a second; let's see if it's capable of detecting the bottle. Yes, even though we don't specify anything, it just sets the label to 39, most likely because that is the index in the COCO dataset, which Ultralytics uses as the default here.

Let's try to find another video. Let's grab this one, copy the relative path, and go back again; you can see how easy it is to run it on different videos, and we can also specify our webcam in just a second. Now we should be able to run it on the fish, so let me just specify "fish", save, and run, and we'll see the results in a second. There we go: again we have label 0, because we only have one class that we care about, and it even gets the fish here in the background. These are pretty crazy results, and again I'm running this with the medium model on a 3070 RTX graphics card from NVIDIA: 12 milliseconds of inference, so that's around 80 frames per second. This is pretty crazy, and even higher compared to the benchmarks they had on the project page.

Okay, let's try another video, one where we don't specify anything, and see if we can detect some cars. I'm just going to go through some of them; yeah, let's grab this one and go back again. We don't specify any classes up at the top, and then we run it; again, you can also specify whatever class you actually want to detect, just as a prompt. We can see all of the detections: some of the cars are pretty small in the images, but it's still able to detect them, and we even see some traffic signs. Class 2 is a car in the COCO dataset, so this is pretty crazy: it's only running with a few lines of code, and we can set up these label annotators, bounding boxes and so on.

Let's try a last one, where I'm basically just specifying the index to my webcam. Let's run that and pull up my webcam; I'm just going to grab it, so we have the webcam over here, and it's going to open up in just a second. Now I've opened up the webcam; I've specified the index for my web camera, the one I'm using in this example. If you only have one camera attached to your computer, it's just going to be 0. But let's just pull this away.
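For reference, the label annotator mentioned above accepts explicit label strings, so you can show the class name and confidence instead of the raw index. This is a hedged sketch that reuses the results and detections names from the script above.

```python
# Build human-readable labels from the Ultralytics class-name map and the
# supervision detections, then pass them to the label annotator.
labels = [
    f"{results[0].names[class_id]} {confidence:.2f}"
    for class_id, confidence in zip(detections.class_id, detections.confidence)
]
frame = label_annotator.annotate(scene=frame, detections=detections, labels=labels)
```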
We can see that we're detecting me as a person. Let's turn it around and see if it's capable of detecting some of the other things: we have a phone, let's see if we have keyboards, and yes, we have two keyboards, and we should also have a monitor here, which is also correct. So again we get all these predictions. Let's see if we can get this water bottle; we detect that as well, sometimes as class 39 and sometimes as 41, but we can go and filter those detections with intersection over union and so on (sketched below), so this is pretty cool. We have a speaker, and we have the mouse here, which is detected as the same class, so it's not perfect, but we can go in and prompt for it. So let's now prompt it for "computer mouse" and see if it's actually capable of doing that. I just type "computer mouse", run it, and now it should only detect the mouse I have in front of me. I have not tried this before, so let's see. Okay, here we go: we can see that it is detecting the mouse. We get some false predictions over here, which could be because the confidence scores are pretty low, but let's just try to move it around; we don't get any other ones over there, and if I move the mouse, even though we have the white keyboard in the background, it keeps pretty good track of the object. So this is pretty cool: test it out yourself, try to throw in different prompts. Again, it's really easy to use with Ultralytics, and you can also combine it with supervision for these cool visualizations, like the blurring and so on.

So thank you guys for watching this video; I hope you have learned a ton. This is a very cool model: you don't need any dataset and no labeling anymore, which takes up a lot of time, and we don't have to train our models, export them, and use them in our own applications and projects. Now we can just try this to start with and see if we actually get good enough results before we set up the whole computer vision pipeline. If you want to learn more about object detection and object tracking, how we can train these models from scratch, and also the theory behind them, I have courses on my website, so definitely go and check those out. We also cover research papers: how to read research papers and implement architectures from them, with the architecture on one side and coding it from scratch on the other side. So if you're interested in any of those, definitely check them out on my website, or else I'll see you guys in the next one. Until then, happy learning.
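One small note on the duplicate 39/41 detections mentioned above: supervision's Detections has a with_nms helper that merges overlapping boxes by IoU, and class_agnostic=True is the relevant switch when the same object is reported under two different class ids. The threshold value here is just an example.

```python
# Drop overlapping duplicates across classes with class-agnostic NMS.
detections = sv.Detections.from_ultralytics(results[0])
detections = detections.with_nms(threshold=0.5, class_agnostic=True)
```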
Info
Channel: Nicolai Nielsen
Views: 3,845
Keywords: Object Detection, YOLO-WORLD, Real-Time, Open-Vocabulary, Real-Time Open-Vocabulary Object Detection, Object detection without training, Object Detection no dataset needed, how to do object detection with yolo-world, how to use open-vocabulary object detection, clip based encoder, Yolov8, next generation of Yolo detectors, fast inference speed, real-time object detection, ultralytics
Id: 7EmfDY2Ooqo
Length: 17min 14sec (1034 seconds)
Published: Fri Feb 23 2024