FastSAM: Segment Anything in Real-Time

Video Statistics and Information

Captions
Hey guys, and welcome to a new video. In this video we're going to talk about the new Fast Segment Anything Model, FastSAM. This is basically a fast version of the Segment Anything Model from Meta AI, with a significant increase in inference speed, as we're going to see in just a second. First of all we're going to open up Google Colab and see how we can use it in a Colab notebook, then we're going to see how we can set it up in our own environment on our own local machine, so we can run this segment anything model live as well. We'll test it out, set it up in our own environment, and then see how we can do inference and segment out anything in an image in a custom Python script.

Let's now jump into the GitHub repository and go through it. First we'll take a look at the model and the inference speed compared to the Segment Anything Model, the original SAM model from Meta AI. Their repository was released a couple of days ago, and it is a significant increase in inference speed: up to 50 times faster compared to the original SAM model. This is the new FastSAM, and it's really cool, because we can now do real-time inference with this fast segment anything model. We can run segmentation on a live video stream or with a live webcam, so we can have a webcam in our own Python script, move it around, and so on. We're going to create another video for that, where we take a live webcam stream and run this fast segment anything model on it, so that's going to be a really awesome video. In this video we're just going to see how we can set it up and use it in a Google Colab notebook for a single image to do inference, and then how we can do it in our own custom Python script if you want to run it on a video or something like that.

Here you can see that we basically just take an input image and throw it through either the original SAM model or FastSAM. It only takes around 40 milliseconds per image with FastSAM, whereas the original SAM from Meta AI takes over 2,000 milliseconds, so over two seconds per inference. Of course, there's always a trade-off between accuracy and inference speed: the accuracy is not as good with the FastSAM model compared to the original SAM model, but we still get some pretty good results, and now we can even run it in real time. We get up to a 50x speed increase when we're doing inference, which is really cool, and we only lose a bit of accuracy — that doesn't really matter much if you're just running live segmentation or generating segmentation masks for object trackers and so on.

FastSAM is based on the YOLOv8 model, which is really good and really fast at detecting objects and running in real time, so it uses a convolutional backbone. This FastSAM model is basically the YOLOv8 model combined with segmenting out anything in the image. So we can actually run this in real time: we're now able to achieve real-time inference with this FastSAM model, whereas the original SAM model, which is based on vision transformers, is nowhere near able to run in real time.
Here we can see the model architecture: it's based on a convolutional neural network backbone, it has a feature pyramid network, a detection branch, and a mask branch that actually does the segmentation, and we get some pretty nice results. The nice thing about this segment anything model, FastSAM, is that we have a text encoder with CLIP and also an image encoder, so we can throw in different kinds of prompts: a point prompt, a box prompt, and also a text prompt. If we want to segment out a specific object in our scene, we can basically just tell it to segment out the black dog in the image. We can use a box prompt if we're doing detections with a YOLO model, for example: we throw that bounding box into the fast segment anything model and it segments out the objects within that bounding box region. Or we can just pick a single point in the image, and it will figure out which object that specific point belongs to and try to segment out that whole object, as we can see here. So that's a quick overview of the model; now let's go into the Google Colab notebook, see how we can set it up, and then see how we can install it on our own local machine.

In the Google Colab notebook we can change the runtime to GPU, so we can actually use the GPU to increase the inference speed even more. Let's just run the blocks of code. First of all we need to clone the GitHub repository that we just saw, and then we also need to get the FastSAM model, the FastSAM.pt weights. These are PyTorch models, and we could actually optimize them even more for faster inference if we convert them to a framework like TensorRT — I think they even show that on the GitHub repository — and then we can run inference at around 15 to 20 milliseconds per image, which is a very large improvement in inference speed compared to the PyTorch model. I think they benchmarked it on an RTX 3090 graphics card from NVIDIA; I'm using a 4090 on my computer, so we might see some better results later on, especially when we're going to run in real time on a live webcam.

Now we're going to pip install the requirements from the GitHub repository and also the CLIP model for the text encoding — those are the only things we actually need to install. We're going to see how we can install it in our own conda environment in just a second, but in Colab it should just take a couple of seconds to set up.
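As a rough sketch, the setup cells described above boil down to something like the following. The repository URL and the CLIP install source are assumptions based on the public FastSAM project, and the FastSAM.pt weights still have to be downloaded separately from the links in the repo's README:

```python
# Minimal sketch of the Colab setup steps in plain Python (in a notebook you would
# typically run these as !-prefixed shell cells instead).
import subprocess

# Clone the FastSAM repository (URL assumed from the public project).
subprocess.run(["git", "clone", "https://github.com/CASIA-IVA-Lab/FastSAM.git"], check=True)

# Install the repository requirements and the CLIP package used for text prompts.
subprocess.run(["pip", "install", "-r", "FastSAM/requirements.txt"], check=True)
subprocess.run(["pip", "install", "git+https://github.com/openai/CLIP.git"], check=True)

# The FastSAM.pt weights are downloaded separately (see the links in the repo README).
```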
We can also run the next cell, which basically just downloads an example image from GitHub. Now we can see we have downloaded this image, and we should be able to go over to our folder: we have our images directory, and in there the dog image that we're going to pass through the model to segment out anything going on within the image. We're going to look at the accuracy and also see how fast it actually runs, but this is just a single example. We won't get accuracy as good as the Segment Anything Model from Meta AI, but that model is also significantly larger and slower, and in most cases we don't really need that level of accuracy. We also have this FastSAM folder, which is basically just all the files from the GitHub repository.

Now we can import matplotlib, so we can visualize the results, and also OpenCV. Then we read in the image with OpenCV from the directory and convert it from BGR to RGB, because that is the color format this model takes in. First of all we get the width and the height of our original image, then we basically just plot and show the image, so we can see what we're actually going to feed through the model — this is the input to the model.

Then we want to segment out anything with the FastSAM model, so we use the inference Python script from the FastSAM GitHub repository that we have cloned. We need to specify the model path, pointing at the FastSAM.pt weights, then we also need to specify the image path to dog.jpg, and also the image size — we use 1024 here, so still a relatively large image, but we can still run inference fairly fast. Let's run this now. It might take some time, because the Python script first needs to import the different modules, load the model, and so on before it actually does the inference, but we'll see the inference speed once it has done the forward pass of the image. After that we just read the image back from the output directory, convert it from BGR to RGB so we can visualize it with matplotlib, and then do an image show with matplotlib (there's a rough sketch of these cells further down).

It just takes a second or two, and here we can see the timing: 124 milliseconds for inference. The GPU here in Google Colab is not that good, but it's still significantly faster compared to the SAM model, and when I run it locally we're going to get way faster inference speeds as well. It also takes some time to do the pre-processing of the image and the post-processing, but this is basically what's going on: we're detecting 35 objects within the image. If we look at the results, it actually does a pretty good job. We can see some boundaries that are not segmented perfectly, but that is what we can expect from this model when we want to optimize it for speed — it's always a bit of a trade-off between speed and accuracy. Still, we segment out pretty much all of the objects here: the legs, the feet, the dog, the bowl; we also have a chair in the background, the floor, the wall. Those are some pretty nice segmentations just from this example image.
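Put together, the Colab inference cells described above look roughly like this. The Inference.py flag names follow the example command in the FastSAM README, and the weight, image, and output paths are placeholders that may differ between versions:

```python
# Sketch of the Colab inference cells: run the repo's Inference.py on the example
# image, then read the annotated result back and display it with matplotlib.
import subprocess

import cv2
import matplotlib.pyplot as plt

# Forward pass via the repo's script; paths and flags are assumptions based on
# the example command in the FastSAM README.
subprocess.run(
    ["python", "Inference.py",
     "--model_path", "./FastSAM.pt",
     "--img_path", "./images/dogs.jpg",
     "--imgsz", "1024"],
    cwd="FastSAM",
    check=True,
)

# The script writes its annotated image to an output folder; convert BGR -> RGB
# before showing it, since OpenCV loads images in BGR order.
img = cv2.imread("FastSAM/output/dogs.jpg")  # output path assumed; adjust to your setup
assert img is not None, "output image not found - check the output path"
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.axis("off")
plt.show()
```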
Let's now go back to the GitHub repository, open up an Anaconda Prompt, and install it in our own local environment, so we can see how fast it runs on my 4090 GPU and also how we can use it in our own Python script — because we might want to run it on a video, and later on also on a live webcam, so we can basically just move the camera around and segment out anything the webcam sees.

First of all we need to clone the repository, so let's do that: in the command prompt I just copy-paste the git clone command, and it clones into a FastSAM folder. After that we create a new Anaconda environment with Python 3.9, and once it is created we can activate it. In my case that environment already existed, so I'm deleting it first and overriding it — basically installing it from scratch so you guys can see how to install and set it up, but it is pretty straightforward. Now we activate the FastSAM environment and cd into the FastSAM folder to install the packages. We have the requirements.txt file with all the modules and requirements that we need, so we can just pip install that directly, and the last thing we need to do is install the CLIP module. This might take a couple of seconds, or a minute or two, before it is done installing the different modules; we can see it installs opencv-python, scipy, torch, and so on.

We might actually need to verify that we're using a CUDA version of PyTorch. Here I can see that it just downloads the default PyTorch without CUDA support, so let's go to the PyTorch website and grab the install command for torch and torchvision with CUDA 11.7. I'm just going to copy that into my environment. Right now it's still installing all the other modules, but after that we need to pip uninstall torch and torchvision and then install them again with CUDA support — it doesn't do that by default, and if you don't do it, it will only be able to run on the CPU. If you have an NVIDIA GPU you will need to do this, or else it won't be able to detect and use the GPU for inference. This should be done in just a second, so let's go back to the GitHub repository. There we go, we're now done: we pip uninstall torch and also torchvision, confirm with yes, and then pip install them with CUDA support — it has to download about 2.3 gigabytes. Once that's done, we pip install CLIP for the text encoding of our model; that is what it uses for encoding the text we throw into it with a text prompt, so we can segment out specific objects in the scene, and I'm going to show you how to do that in our custom Python script as well.

After that we also need to download the model weights. They have two different models: a default one and a faster, smaller one. The small one is based on the YOLOv8s model and the default one is based on the YOLOv8x model, so the default is significantly better than the small one, but it really depends on your hardware, the accuracy you need, and also the inference speed — the small one will be significantly faster. Again, it's a trade-off between accuracy and speed. I've downloaded both of them to my computer, and then you basically just need to copy-paste them into the directory that we're going to use.
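After the reinstall, a quick sanity check that the CUDA-enabled build of PyTorch was actually picked up might look like this:

```python
# Quick sanity check that the CUDA-enabled PyTorch build is installed; if this
# prints False, FastSAM will silently fall back to CPU inference.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```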
Now we have everything installed, so let's open up Visual Studio Code. I've opened the FastSAM repository cloned from their GitHub, and you can see that I've copy-pasted the two FastSAM models in: we have the small model and also the X model, the larger or default one. We have the inference Python script, but we can also go down and use this separate predict script. First of all we import the FastSAM model from the fastsam module, and we also need to import torch. Then we set up an instance of our model: we need to specify the path to our FastSAM model, so let's just start with the X model, the default one, and then we also need to specify the path to the image we want to run inference on. We then set up the device we're going to use: if CUDA is available we use the GPU, if MPS is available — if you're on a Mac with MPS — it will use that, or else it will just use the CPU as default.

After that we can do a forward pass with our model: we pass in the image path, so we don't need to load an image first, we just pass the path. Later on I'm going to show you how we can run this on a live video stream, but in this video we're just going to show how to set it up and run a single image, so make sure to hit the subscribe button and the bell notification under the video so you get a notification when I upload that video and we can see how to run real-time inference with this FastSAM model on a live webcam. So we do a forward pass, specifying the device, retina masks, the image size, the confidence score, and our intersection-over-union threshold. Then we can also set up a FastSAM prompt if we want to prompt it with something: it could be a bounding box prompt, a text prompt, or a point prompt. They have an example of that which we can test out, but first we're just going to segment out anything.

We should be able to just run it directly: we create an instance of our model and we have the correct directory for our dog image, so let's run it and see the results. First it complains "no module named clip", so let's make sure we actually installed that — we did install CLIP, so let me just make sure we're inside the correct environment. There we go: after conda activate we can run it. We got the inference results, and it took around 50 milliseconds to run inference, which is similar to what they had in the GitHub repository — still very fast, roughly 20 frames per second at 50-55 milliseconds, and it only takes around five milliseconds to do the pre-processing and the post-processing, so still fairly good. We detected around 21 objects in this example. Let's go into the outputs and look at the dog image — these are the results we got. Let me just double-check the script, because we should actually be segmenting out anything: looking at the prompt setup, we need to delete one line, because it is using the point prompt with a single point, and that is what was enabled — so that is how you can use the point prompt. Let's remove it and run it again, and now we should actually segment out anything. It takes a second or two; there we go, we got the inference result, around 60 milliseconds, and we're detecting 21 objects. Let's go up and look at the dogs — now we can see that we're basically segmenting out anything in the image.
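A minimal sketch of the kind of predict script walked through above, based on the example usage in the FastSAM repository — the weight and image paths are placeholders, and argument names may differ slightly between versions:

```python
# Sketch of a FastSAM predict script: build the model, pick a device, run a
# forward pass on an image path, and then apply one of the prompt types.
import torch
from fastsam import FastSAM, FastSAMPrompt

# Paths are placeholders; point them at your downloaded weights and test image.
MODEL_PATH = "./weights/FastSAM-x.pt"
IMAGE_PATH = "./images/dogs.jpg"

# Use CUDA if available, Apple MPS on a Mac, otherwise fall back to the CPU.
DEVICE = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

model = FastSAM(MODEL_PATH)

# Forward pass: FastSAM takes the image path directly, no manual loading needed.
everything_results = model(
    IMAGE_PATH, device=DEVICE, retina_masks=True, imgsz=1024, conf=0.4, iou=0.9
)

# Prompt object: segment everything, or narrow it down with a point/box prompt.
prompt_process = FastSAMPrompt(IMAGE_PATH, everything_results, device=DEVICE)
ann = prompt_process.everything_prompt()
# ann = prompt_process.point_prompt(points=[[620, 360]], pointlabel=[1])  # 1 = foreground
# ann = prompt_process.box_prompt(bbox=[200, 200, 600, 600])              # x1, y1, x2, y2

# Write the annotated image to disk.
prompt_process.plot(annotations=ann, output_path="./output/dogs.jpg")
```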
We don't really get exactly the same results as with the standard Segment Anything Model from Meta. We segment out the dogs here and they look pretty good; we also have the sand, but the sky is not segmented that well. The wooden stick they have in the mouth also looks pretty good, while the leg is not really that good. So we don't get exactly the same accuracy, but this is still fairly good, and again, this is with the large model.

We can also try the smaller model, so let's go ahead and run that and look at the inference speed; we'll pass the exact same image through it, and after that let's try the cat example they have as well. We can see this was significantly faster: now we're running at around 30 milliseconds for inference, so around 30+ frames per second. But looking at the results on the dog image, it doesn't segment out the dogs individually, it doesn't segment out the sand, and it doesn't segment out the sky, so these results are not really good — they're probably only good around specific points. Let's try it with the point prompt again, specifying the exact same point on the right dog, run it, and see the results, and then we can test the cat example again. 33-34 milliseconds, and looking at the dogs: we specified a point around here, but now it segments out both dogs, whereas the large (default) model could segment out the individual dog, which was actually pretty nice. So I don't really think the results from the small model are that good — of course it runs faster, but I don't feel the results are as good as with the large model.

So let's just go with the large model and try it out on the cat example. Let's choose the cat image and run it again; it will take a bit longer now, but then we should be able to see the results. There we go, 60 milliseconds; let's look at the cat example. Here we actually got a pretty good one — this is still the point-prompt example — and we get some really nice boundaries and shapes around our cat; this looks pretty good. Let's try to delete the point prompt, run it again, and look at the cat: very good, pretty good segmentations. We can see that it actually segments out the eyes, there's some pretty good segmentation of the table the cat sits on, and also the background with the wall. We have some problems here with the ears, but still nice boundaries around most of the objects in the scene, so I think these results look fairly good.

We can also go in and try a text prompt, so let's go ahead and do that. Here we have an annotation with a text prompt, which is also fairly nice. We should probably go back to the dogs again, so let's use the dog example: we want to pass in "a photo of a dog" as the text and segment that out. First of all we can see that it needs to download another model — this is the CLIP model used for encoding the text, which is then fed into our fast segment anything model.
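Continuing the same pattern, the text-prompt version described here might look roughly like this — again a sketch based on the repo's example usage, with placeholder paths, and the prompt string mirrors the one used in the video:

```python
# Sketch of the text-prompt variant: CLIP encodes the text and the mask whose
# embedding best matches it is returned.
from fastsam import FastSAM, FastSAMPrompt

model = FastSAM("./weights/FastSAM-x.pt")  # placeholder weight path
results = model(
    "./images/dogs.jpg", device="cuda", retina_masks=True, imgsz=1024, conf=0.4, iou=0.9
)
prompt_process = FastSAMPrompt("./images/dogs.jpg", results, device="cuda")

# Ask for the specific object we want, e.g. the black dog in the example image.
ann = prompt_process.text_prompt(text="a photo of the black dog")
prompt_process.plot(annotations=ann, output_path="./output/black_dog.jpg")
```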
There we go; now we can go and see our dog example. We can see that it segments out a dog, but we could maybe have specified that it should segment out the black dog, so let's try that: instead of "a photo of a dog", let's use "a photo of the black dog", run it again, and see if we get a different result — let's see if it segments out the right dog now. It should be done now, and yes, we can see that it now segments out the black dog compared to the other one. Again, it is really nice that we can do segmentations based on these annotations and also based on text prompts, because now we can do auto-labeling of a dataset: if you're, for example, doing segmentation tasks for specific objects and you have a whole dataset, you can basically just throw it in here, use the CLIP model to encode your text, and use that together with this segment anything model. Instead of using the original SAM model you just use this fast segment anything model, so you can also label your dataset significantly faster.

So this is a really cool model; I'm really excited to play with it, and also to play with it more in the future — we're going to try to run real-time inference on a webcam as well, so definitely make sure that you watch that video once it is out. Thank you guys for watching this video, and again, remember to hit the subscribe button and the bell notification under the video, and also like this video if you like the content and want more in the future. I'm really excited for the future videos — these are some really cool models that we can use for a lot of different domains and tasks, for object detection, object segmentation, and so on. I hope to see you in one of the next videos, guys. Bye for now.
Info
Channel: Nicolai Nielsen
Views: 6,745
Keywords: fastsam, fast sam, sam, meta sam, segment anything, fast segment anything, segment anything model, segment anything real-time, segment anything colab, fastsam colab, yolov8 fastsam, how to segment anything, how to use segment anything, yolov8, fastsam python, how to use fastsam, how to setup fastsam, fastsam real-time, real-time sam, meta ai sam, sam meta, meta sam segmentation, real-time segmentation, opencv segmentation
Id: yyqnFucIAu0
Length: 20min 49sec (1249 seconds)
Published: Thu Jul 06 2023