Segment Anything with Webcam in Real-Time with FastSAM

Captions
Hey guys, welcome to a new video. In this video we're going to take a look at the new FastSAM model and see how we can do live inference on a webcam. In one of the previous videos I showed you how to set up the FastSAM model and how to run inference on individual images; in this video we're going to open up a live webcam stream and do live inference with the segment anything model. FastSAM is up to 50 times faster than the original Segment Anything Model (SAM) from Meta AI, which we're also going to see. There is still a trade-off between accuracy and speed: we won't get the same accuracy as the SAM model, but we still get pretty good accuracy.

So let's jump straight into Visual Studio Code and take a look at how to set this up so we can run live inference on a webcam, or any other camera or video stream. I already have a video covering the setup of the FastSAM model, so definitely check that out first; there we went over how to set up the model, did a comparison with the SAM model, and so on. In this video we're going to jump straight into it. The nice thing about FastSAM is that we can pass in a point, or a bounding box (a region of interest) and segment everything inside it. We can also use a text prompt: we can say "segment the dog in the image" and it will only segment out the dog instead of segmenting everything. If you don't specify anything, it will simply segment everything in the image, just as the original SAM model does.

First of all we import FastSAM and FastSAMPrompt, plus PyTorch, NumPy, OpenCV (cv2) and time, so we can measure how fast inference actually runs. They quote the inference speed, just passing a frame through the model, as around 50 milliseconds on their benchmark, and I get similar results on my computer with an RTX 4090. That is on the higher end, but you can still run this fairly fast on lower-budget GPUs, and it is still significantly faster than the original SAM model; if you have a large dataset or a long video stream, this model will take far less time.

Next we set up an instance of the FastSAM model. This is the same as in the last video: there is a small model and also the default model, which is the extra-large one. The FastSAM models are based on the YOLO architecture; they use a YOLOv8 backbone with a segmentation head on top. If I go inside the directory here you can see the FastSAM repository cloned from GitHub. I did all the setup in the previous video, so check that out if you're interested. We have the two models, and we also have a line for inference: in the previous video we used this code to throw in a single image, and now we'll do live inference on the webcam instead. So we create an instance of the model; we don't need to specify an image path right now, because we're going to open up a video stream with OpenCV. We also need to set up our device: if CUDA is available, meaning you have an Nvidia GPU in your computer and have installed PyTorch with CUDA support, we use CUDA for inference, otherwise we fall back to the CPU; if you're on a Mac you can also use MPS. We then print which device we're using.
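Here is a minimal setup sketch of that part, assuming the FastSAM repository is cloned locally and the checkpoint lives under ./weights/; the weight path and filename are assumptions, so adjust them to match your own clone:

```python
import time

import cv2
import numpy as np
import torch
from fastsam import FastSAM, FastSAMPrompt

# Create the model instance (the small variant would be the FastSAM-s weights).
model = FastSAM("./weights/FastSAM-x.pt")

# Pick the fastest available device: CUDA, then Apple MPS, then CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Running inference on: {device}")
```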
Then we open up a video capture, as we've done in all the previous computer vision tutorials. We create the video capture and specify the camera index; I use 1 here, but in your setup it will probably be 0, depending on how many cameras you have connected to your computer. I have a main camera and also the webcam we'll run the live inference on. Then we have a while loop running as long as the webcam is open; when we hit the quit key on the keyboard we terminate. Inside the loop we grab an image from the webcam with cap.read() and store it in the frame variable, and then we start a timer so we can measure how long a pass takes. The model itself, like the YOLO models, will actually report the pre-processing, inference and post-processing speeds, so we could take the numbers directly from there, but since we do additional processing of the results it makes more sense to time the whole pass of a frame through the while loop. We get around 50 milliseconds, which is roughly 20 frames per second for the inference alone, but we're doing a lot of other pre- and post-processing, so it will actually be slower. It's not fully real time, but it's still way faster than the SAM model, and we can move the camera around and still see what's in the frame more or less continuously, which is not possible with SAM at all.

So we start the timer, and then we do a forward pass through the model instance, specifying the source we want to pass in. We can either run it in our own while loop, or skip the loop entirely if we don't want to extract the results ourselves; the call actually returns a generator, so we could keep pulling results out of it. Here we pass in a single frame because we want our own while loop, for example if we want to pre-process the image before passing it in, or if it's coming from somewhere else. We could also just specify the video capture source directly, a 1 or a 0, and it would do live inference on the webcam by itself, although you would need a few modifications to do the plotting. Here we just pass in the frame.

After that we print some of the results. From everything that is detected we can extract the masks and look at their shape, which tells us the mask dimensions and how many objects we're detecting. We can take the bounding boxes and extract the xyxy coordinates, the top-left corner and the bottom-right corner, directly from the results. This is very similar to the YOLOv8 format and framework, because FastSAM is based on the YOLOv8 architecture with a segment-anything head on top of the backbone. We can then have a for loop running through all of our detections.
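A sketch of that capture loop, up to the forward pass and result extraction (the prompt processing and plotting that finish the loop body are sketched a bit further below); the camera index is machine-specific, and the inference arguments are the values the repository's examples use, so treat them as assumptions:

```python
cap = cv2.VideoCapture(1)  # camera index; use 0 if your webcam is the only camera

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    start = time.perf_counter()  # time the whole pass through the loop

    # Forward pass on a single frame; a NumPy array is an accepted source,
    # just like an image path or a camera index.
    everything_results = model(
        frame, device=device, retina_masks=True, imgsz=1024, conf=0.4, iou=0.9
    )

    # The results follow the YOLOv8 format: masks and boxes per detection.
    result = everything_results[0]
    if result.masks is not None:
        print(result.masks.data.shape)  # (num_objects, H, W)
        print(result.boxes.xyxy)        # top-left and bottom-right corners
```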
That is basically how to extract the results if you want to use them in your own application or project for something specific. If you don't need that and just want the output, you can go directly to FastSAMPrompt: we throw in the frame, the results and the device, and then we process a prompt. The prompt can be that we want to segment everything, or we can throw in a bounding box, a region of interest, and segment all the objects inside it. As I mentioned at the start, we can also throw in a text prompt, which we'll do when we run the live inference, and there is also a point prompt: you pick a single point in the image and it will try to segment out the whole object or class that the point lies on. Then we end the timer; at this point all the processing is done and the results are extracted. I don't time the plotting, because you wouldn't need it in an application or project, but we do plot the results as well: we pass in the frame and the annotations, print the number of frames per second, show the original image, and show the image with the results drawn on it. If we hit Q on the keyboard at any time, we terminate the program, break out of the while loop, destroy all the windows we've opened with OpenCV, and release the webcam.
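Putting the loop body together, here is a sketch of the prompt processing, timing, plotting and quit handling just described. The prompt calls follow the FastSAM repository's README examples; the plot_to_result signature assumes the small modification discussed below (passing the frame in directly), so take it as an illustration rather than the stock API:

```python
    # (still inside the while loop from the sketch above)
    prompt_process = FastSAMPrompt(frame, everything_results, device=device)

    annotations = prompt_process.everything_prompt()
    # Alternatives, one at a time (values are illustrative):
    # annotations = prompt_process.box_prompt(bbox=[200, 200, 500, 500])
    # annotations = prompt_process.text_prompt(text="a photo of a dog")
    # annotations = prompt_process.point_prompt(points=[[320, 240]], pointlabel=[1])

    end = time.perf_counter()
    print(f"FPS: {1.0 / (end - start):.1f}")

    # plot_to_result as modified for live use: draw the masks on the current frame.
    result_img = prompt_process.plot_to_result(frame, annotations)

    cv2.imshow("frame", frame)
    cv2.imshow("FastSAM", result_img)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```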
Before we run it, let's take a look at the classes and methods used for this inference, because that is pretty important when we actually run it live, and I had to modify some of the code to run this inference live and display the results. First, the FastSAM model: when we call it we can specify a source and a stream flag, and there are a number of other arguments, such as the intersection-over-union threshold and so on; you can see them in the documentation, and they are exactly the same as in the YOLO model. For the source we can pass a string, an integer, a PIL image or a NumPy array. In this example we're passing a NumPy array, but you could also pass an integer and it would use the webcam connected to your computer; it accepts all the types accepted by the YOLO model, since it's built on the YOLO framework. It returns a list with all the prediction results, and you can access all their attributes; have a look at the Ultralytics YOLOv8 documentation to see how these results objects are set up. I'll also show it when we extract the results, but on the predictions you can call .boxes for the bounding boxes, .masks for the masks, the classes, .probs for the probabilities, and so on. So you can extract all the results from what the predict function returns. Let's close that, hit save, and go down to the prompt class, FastSAMPrompt.

FastSAMPrompt is what we use to prompt the FastSAM results, for example with a region of interest, a point, or a text prompt. This is basically the code for that; you can go in and extract the results here as well, which isn't that important, but we are using its plot_to_result method later on. So we create our prompt process, specify that we want to segment everything in the image, and then call plot_to_result, the function I just showed you. We pass the image in here, and this is actually the only thing you need to modify: by default the method just takes a path to an image, so to run real-time inference and extract the results live, so we can see the webcam view with all the masks visualized, we need to change this function. Instead of an image path we pass in an image, and inside the function we set the image variable equal to the frame we hand to plot_to_result. That is pretty much everything you need to do; it was just loading the image from a specified path, and we don't want that, we want to plot the results on the image we load from the webcam. The method then does all the plotting with OpenCV and Matplotlib, which we don't care too much about. When you clone these GitHub repositories, as I showed in the last video, you can run their examples on a single image, but when you want to use them in your own application or project you often need to go into the source code and make small modifications here and there so it fits your use case. That is what I've done here to be able to run inference on a webcam and visualize the results on each frame we pass through the model.
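The exact body of plot_to_result differs between versions of the repository, so instead of reproducing the patch, here is a rough, version-independent alternative: overlay the predicted masks on the frame yourself with NumPy. This is my own sketch, not the repository's plotting code, and it assumes the masks come back at frame resolution (retina_masks=True):

```python
def overlay_masks(frame: np.ndarray, masks: torch.Tensor, alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend each predicted mask onto the frame with its own random color."""
    overlay = frame.copy()
    rng = np.random.default_rng(0)
    for mask in masks:
        m = mask.cpu().numpy().astype(bool)                   # (H, W) boolean mask
        color = rng.integers(0, 256, size=3, dtype=np.uint8)  # one color per object
        overlay[m] = (alpha * color + (1 - alpha) * overlay[m]).astype(np.uint8)
    return overlay

# Usage inside the loop, instead of plot_to_result:
# result_img = overlay_masks(frame, everything_results[0].masks.data)
```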
That should be everything, so now we're ready to run the program. It opens the webcam, takes a second or two, and then we can see the results. We're using CUDA; I have an RTX 4090 GPU from Nvidia. Right now we're printing all the bounding boxes, and just to demonstrate how to extract the results, I also plot the bounding boxes on the image. Here you can see we're segmenting out pretty much everything in the frame. I'll pick up the camera and move it around a bit so you can see the results; it actually looks pretty good and does a very good job of segmenting out the different objects. We're running around five to six frames per second, which also depends on how many other tasks I have open, but again this is still way faster, up to 50 times faster, than the original SAM model.

Here we can see the output from the model: when we do a forward pass it reports the speeds, a few milliseconds for pre-processing, around 60 milliseconds for inference, and two milliseconds for post-processing, along with the image dimensions, which we could also lower. We're detecting 14 objects, so the mask tensor has 14 masks at those image dimensions, and we can extract them and plot them. We can also extract the results directly: right now I have a for loop running through all the bounding boxes, but you could just as well take the masks if you want to extract every mask in the image and do some processing on them. Here we simply take the results, throw them into the plot function, and it does everything for us, because it's a bit harder to plot all the masks yourself with OpenCV. But you can extract everything, and it follows the same structure as the YOLOv8 results. If we stop printing the raw results, you can see the extracted bounding boxes: these are the coordinates, top-left corner and bottom-right corner, of every object we're segmenting out in the frame.

Now let's try a text prompt instead of the everything prompt. We set our annotations equal to prompt_process.text_prompt and pass in the text "a cream colored chair", and see if it can pick out my chair in the background. It's not going to segment the chair perfectly every time, there's an armrest and so on, but if you have good images and just want to segment out some simple objects in the frame, you can do that. It can be used for a lot of different things; with a box prompt for fixed objects in the scene, or with point prompts, it's really easy, and there are a lot of use cases for these fast segment-anything models. So let's run it and see how it handles the cream-colored chair. I'll pick up my webcam and point it at the chair in the background. It segments out the cream-colored chair some of the time: sometimes it misses the detection, sometimes it gets the whole chair, sometimes just the back of it, and sometimes it grabs the wall, which isn't great. If I move the camera a bit further away, there we go, we get a pretty good detection; once it can see the full object it does a pretty good job. Then we get a false prediction again, so it's a mix of good and bad predictions. You can also see that this runs significantly slower, because the text prompt slows the whole model down considerably: we now need to run the CLIP model to encode the text prompt and match it against the segmented objects, on top of still segmenting everything. All of these prompt methods add overhead, but if we just run segment-everything we get fairly good speed doing live inference on a webcam.
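For reference, the text-prompt variant of the loop body looks roughly like this; the call follows the repository's README-style usage, the chair text is just the example from the video, and the plot_to_result signature again assumes the live-inference modification:

```python
    # Inside the loop, replacing the everything prompt:
    prompt_process = FastSAMPrompt(frame, everything_results, device=device)
    annotations = prompt_process.text_prompt(text="a cream colored chair")
    result_img = prompt_process.plot_to_result(frame, annotations)
```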
Thank you guys for watching this video. I hope you've learned a ton and that you can use this in your own projects and applications, whether you want to auto-label a dataset or just segment out specific objects in a scene. With this we don't really need to train an object detection or segmentation model before segmenting objects in a scene, especially if you only have a couple of images to run inference on and you're not doing real-time inference on a low-budget device or on the edge; then you can use this model without any training at all. It does require more processing power than a standard YOLOv8 segmentation model, but it can be used for plenty of other use cases as well. So, a really cool model, way faster than the original SAM model, and I'm excited to play around with it a bit more and create some projects and applications around it. Thanks again, remember to hit the Subscribe button under the video, and I'll see you next week. Bye for now.
Info
Channel: Nicolai Nielsen
Views: 7,059
Keywords: fastsam, fast sam, sam, meta sam, segment anything, fast segment anything, segment anything model, segment anything real-time, segment anything colab, fastsam colab, yolov8 fastsam, how to segment anything, how to use segment anything, yolov8, fastsam python, how to use fastsam, how to setup fastsam, fastsam real-time, real-time sam, meta ai sam, sam meta, meta sam segmentation, real-time segmentation, opencv segmentation
Id: SslzS0AsiAw
Length: 16min 34sec (994 seconds)
Published: Mon Jul 10 2023