Fast Segment Anything (FastSAM) vs SAM | Is it 50x faster?

Captions
Segment Anything Model (SAM) is easily one of the most influential computer vision models of 2023. Given a photo, SAM can, with a high degree of precision, generate masks segmenting the objects in the image. Almost immediately after it was released, companies rushed to incorporate it into their products, developers started to build a whole ecosystem of tools and libraries around Segment Anything, and, of course, researchers tried to create models building on SAM's success. Over the last two months we got PerSAM, HQ-SAM, MobileSAM, FastSAM, and probably more than that; if I omitted any important SAM-based model, let me know in the comments. Today we are going to play with one of the most popular of those models: FastSAM. We will run it on images with different levels of complexity, test different types of prompts, and of course compare its accuracy and speed with the original SAM. We spent hours researching and testing FastSAM so you don't have to. FastSAM tries to address one of the biggest drawbacks of Segment Anything: speed. To do that, the authors completely abandoned the heavy transformer-based architecture and replaced it with YOLOv8-seg, a real-time CNN-based solution. The model was then trained on two percent of the SA-1B dataset. If you don't remember, that was the dataset used to train the original SAM; it consists of 11 million images and 1 billion high-quality masks, and it was released together with SAM. In many ways, the FastSAM paper is an attempt to prove that the dataset you use for model training is the primary factor influencing its final accuracy, probably worth more than a powerful architecture. By the way, if you would like to learn how to train an Ultralytics YOLOv8 segmentation model on a custom dataset, make sure to check our tutorial; you will find the link in the top right corner and in the description below. Okay, now, without further ado, let's dive in. As usual, we prepared a Google Colab that you can use to test FastSAM on your own data. It's exactly
the same code I'm using in this tutorial. You can find it on GitHub in the Roboflow Notebooks repository, and for your convenience I also added the link in the description of this video. At the top of the notebook you will find information about FastSAM, links to the paper and the repository, as well as a bunch of complementary materials covering Segment Anything Model. Before we put FastSAM to the test, let's run the nvidia-smi command just to confirm that we have access to a GPU. Both SAM and FastSAM can run on a CPU, but a GPU is where they can really spread their wings. If you run the code in Google Colab but the nvidia-smi command fails, follow the instructions in the cell above to solve the problem. In our case everything worked as expected. Awesome! Now we can move on and set up our Python environment. First we clone the FastSAM repository and install all packages listed in its requirements.txt file. To enable text prompts we will also need to install CLIP. If you want to learn more about CLIP, we have a separate video covering that model too, but in short, it allows you to predict the most accurate description for a given image. Because we would like to run FastSAM and SAM side by side and compare their results, we are also going to install Segment Anything Model, and lastly additional packages like roboflow, supervision, and a bounding-box widget that we are going to use along the way. Next we need to download weights for both models. We create a separate directory and use wget to download the files we need. When the process finishes, we run ls to list the directory contents; make sure to use the -h flag to see file sizes in a human-readable format. We can already notice the difference: FastSAM's weights are roughly 15 times smaller than SAM's. All we need to do now is to download a few example images. Feel free to use your own data; you can just open the file manager and drag and drop an image into Colab. In the meantime, you can see the data that I'm going to use: my standard dog image and a quite complicated car
factory scene with a lot of robots and car parts. We'll see how FastSAM handles hard cases like this. Just like SAM, FastSAM offers several inference modes. You can generate masks for every object visible in the scene, but you can also be more picky and prompt the model for specific objects. We can do that by providing the coordinates of a bounding box or of specific points on the image; FastSAM will return the masks most accurately associated with your prompt. On top of that, FastSAM offers prompting with text; that's why we installed CLIP. Now let's dive in and try each of those inference methods one by one, starting with the everything prompt. Before we load FastSAM into memory, we need to import it. Unfortunately, at the time of recording, the model lacks proper packaging, so to do it we need to be inside the FastSAM directory. Keep in mind that trying to import it from any other directory will probably end with an exception, so let's just use the cd command to make sure that we are in the right place. Now we can load the model into memory, and we are good to go. First we specify the path to our input image and run the model. Just like SAM, FastSAM divides the inference process into two stages. First we obtain general YOLOv8 results by providing our image, device, and a bunch of additional parameters. The second stage takes into account the prompting method we want to use and returns the filtered masks. You can also visualize the results by calling the plot method. We can see that the first inference takes quite a lot of time; that's because the weights of our model are being loaded into memory, creating additional overhead. You can also see that on the GPU RAM consumption chart. Now we can run the cell once again, and this time the results are produced much faster. Unfortunately, the plot method can only save results to the hard drive; it can't just return them. Let's refresh the file manager: we can see that the output directory was created, and we can click to examine the
result image. Searching for the file like that after every inference is not very convenient, so to make it a little easier to use, I created a separate utility function using supervision that draws the masks on the image and returns the result. Let's trigger it, and we see our result right in Colab. Awesome! As we can see, the quality of the result is not ideal. That is a good opportunity to use the additional inference parameters that I mentioned earlier. There are two main ones that we are going to use: the confidence threshold and the IoU threshold. We'll slightly increase the confidence threshold from 0.4 to 0.5; this way we'll discard masks with a lower level of certainty. In parallel, we'll decrease the IoU threshold from 0.9 to 0.6, consequently dropping masks that overlap with each other in a significant way. After rerunning the inference, we can see that the overall result looks much cleaner. Next up: prompting with a box. I wanted to make this demo interactive and allow you to draw the prompt box on top of the image inside the notebook; this is why we need to run those few utility functions defined at the top of the section. Now we can use the mouse to define our prompt. First let's go for the dog's tongue: run the next few cells, and you can see that the image is being correctly segmented. Now let's move back and use a different prompt. How about the nose? Delete the old box, run the cells below once again, and yes, FastSAM is once again correct. How about going for the whole dog? We make the prompt, run the cells below once again, and plot the result. Well, this time FastSAM did something unexpected: it returned not only the dog but the whole person holding the dog. This is important because it's inconsistent with SAM's behavior; SAM would never return masks extending beyond the box prompt. Now let's move on and try prompting the model with points. Unfortunately, I wasn't able to make it as interactive as the boxes, so we need to do it the boring way. We define the point as a two-element list storing the x and y coordinates and pass it as an
argument to the point prompt method. In my case I'm aiming for the middle of the nose, but surprisingly, once again the model decided to return the most general mask. I'm not saying the result is incorrect, but it probably could be a bit more precise. Last but not least, the text prompt. All we need to do is pass the desired text as an argument. Here the inference takes noticeably longer, because internally FastSAM loops over the masks and crops the original image; all the crops are passed through the CLIP model, looking for the one that most accurately fits the text we used. But after a few seconds we see the result, and it's correct. The problem is, if I go up, change the prompt to "a dog", for example, and rerun the inference, I will get the sky; and if I do it once more and ask for "a car", I'll get part of the backpack. I'm not sure if it's because of some bug in the FastSAM implementation, or if CLIP is simply not powerful enough to assign the right label to the crop. I remember that during CVPR I saw multiple papers trying to use CLIP in a similar way and actually succeeding, so I guess it should be possible. If you would like to use FastSAM with text prompting in your project, make sure to test it and confirm that it can work reliably. I think that one of the most interesting things we can do here is to run FastSAM and SAM on the same input image and compare the results. We loaded FastSAM already, so at this point all we need to do is do the same with SAM. To save time, I won't dive deep into the SAM API in this video, but we have an awesome SAM tutorial, probably one of the most popular on YouTube; feel free to use it to get up to speed. As usual, the link is in the top right corner and in the description below. In the meantime, we change our input image to the one depicting the robots in the car factory and load the SAM weights into memory. Once again we can see increased GPU RAM usage; now our GPU stores two models simultaneously. We can finally run inference with both of them
and present the results side by side. The difference is hard to notice at first glance because there are so many things happening in the image, so let's scroll a bit lower and create a separate plot, this time using a blank black background. Now we start to grasp the difference between FastSAM's and SAM's output: SAM is a lot better at segmenting the small objects in the center and on the right side of the image. We can also print the mask count for both models, and it's 172 versus 68. Another cool thing we can do is look for masks that are detected by SAM but not by FastSAM and plot them separately. To do that, we first define a utility function that filters masks based on their IoU and then run it. After a few seconds we get our plot, and as you can see, for complicated scenes the difference is quite significant. FastSAM delivers on the promise of a smaller, faster alternative to Segment Anything Model. Obviously, Google Colab is not the perfect place for precise (or even imprecise) model performance measurements, so treat the roughly 100-millisecond inference speed we obtained with a grain of salt. The FastSAM authors reported 40 milliseconds, but they performed their benchmark on a significantly better GPU, an NVIDIA RTX 3090, so all things considered, such a discrepancy in results seems quite normal. Either way, FastSAM is at least an order of magnitude faster than unoptimized SAM. In terms of accuracy, FastSAM seems to be doing quite well on simple scenes; however, the difference in quality between FastSAM and SAM increases with the complexity of the scene. In the rather difficult example with the car factory robots, FastSAM was able to find fewer than half of the masks found by SAM. Also keep in mind that FastSAM has a different API than the original model: classes and methods have different names and take different arguments, so you won't be able to just plug it in and replace the original SAM; some additional development work will probably be required. Lack of proper Python packaging can
also prove problematic, especially if you plan to plug the model into a larger application. However, yesterday, while writing the script for this video, I found out that FastSAM is now available as part of the Ultralytics pip package. I didn't verify how it works, but I still decided to let you know; it might be a potential solution to this problem. And finally, the text prompt: well, it's not there yet. In my opinion it works very unreliably, especially for more complicated scenes, so make sure to test whether it works for your particular use case. All in all, FastSAM is an interesting alternative to the mighty SAM, especially if you are willing to give up a little bit of prediction quality for much lower latency, or if the images that you are working with are simple enough; in those cases it's quite possible you won't even notice the difference. That's all for today. If you liked the video, consider subscribing and watching our other SAM-related content. Here's an interesting one where we combined Grounding DINO and SAM and ended up with a powerful zero-shot instance segmentation model, so strong it can even be used for automated data labeling. Give it a try! In the meantime, stay tuned for more computer vision content coming to this channel soon. My name is Peter, and I'll see you next time. Bye!
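The confidence and IoU threshold tuning described in the transcript (raising confidence from 0.4 to 0.5, lowering IoU from 0.9 to 0.6) boils down to a greedy filter over candidate masks. Below is a minimal, self-contained sketch of that idea in plain NumPy; it illustrates the mechanism only and is not FastSAM's actual implementation, and the function names are made up for this example.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boolean masks of the same shape."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def filter_masks(masks, scores, conf=0.5, iou=0.6):
    """Greedily keep high-confidence masks, dropping near-duplicates.

    Raising `conf` discards uncertain masks; lowering `iou` drops masks
    that overlap an already-kept mask too strongly.
    """
    keep = []
    # Visit masks from most to least confident.
    for i in sorted(range(len(masks)), key=lambda i: -scores[i]):
        if scores[i] < conf:
            continue
        if all(mask_iou(masks[i], masks[j]) < iou for j in keep):
            keep.append(i)
    return [masks[i] for i in keep]
```

Tightening both knobs at once, as done in the video, removes low-certainty masks and heavily overlapping duplicates in a single pass.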
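Box and point prompting, as the transcript describes, amount to selecting among the masks the model has already generated. Here is a rough sketch of one plausible selection policy in NumPy; these are hypothetical helper functions for illustration, not FastSAM's exact code.

```python
import numpy as np

def box_prompt(masks: np.ndarray, box: tuple) -> np.ndarray:
    """Return the mask with the highest IoU against the box region.

    masks: (N, H, W) boolean array; box: (x1, y1, x2, y2) in pixels.
    """
    x1, y1, x2, y2 = box
    box_mask = np.zeros(masks.shape[1:], dtype=bool)
    box_mask[y1:y2, x1:x2] = True
    ious = [
        np.logical_and(m, box_mask).sum() / np.logical_or(m, box_mask).sum()
        for m in masks
    ]
    return masks[int(np.argmax(ious))]

def point_prompt(masks: np.ndarray, point: tuple) -> np.ndarray:
    """Among masks containing the point, return the smallest (most specific)."""
    x, y = point
    hits = [i for i, m in enumerate(masks) if m[y, x]]
    if not hits:
        raise ValueError("no mask contains the given point")
    return masks[min(hits, key=lambda i: masks[i].sum())]
```

Note that picking the most specific mask for a point is just one reasonable policy; FastSAM's own behavior evidently differs, since the transcript shows it returning the most general mask for the point on the nose, and its masks can extend beyond the box prompt in ways SAM's never do.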
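The text prompt's extra latency comes from the loop the transcript describes: crop the image around each mask and score every crop against the text with CLIP. The mechanism can be sketched like this, with the CLIP call abstracted into a `score_fn` you would supply; the interface is hypothetical and only illustrates the idea.

```python
import numpy as np

def text_prompt(masks, image, score_fn, text):
    """Crop the image to each mask's bounding box and keep the mask whose
    crop the scoring model (e.g. CLIP) rates most similar to `text`.

    score_fn(crop, text) -> float is a stand-in for a CLIP similarity call.
    """
    best_i, best_score = None, -np.inf
    for i, m in enumerate(masks):
        ys, xs = np.nonzero(m)
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        score = score_fn(crop, text)
        if score > best_score:
            best_i, best_score = i, score
    return masks[best_i]
```

Because every crop requires a full CLIP forward pass, inference time grows with the number of candidate masks, which matches the slowdown observed in the video; it also explains why the result is only as reliable as CLIP's label for each crop.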
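The SAM-versus-FastSAM comparison at the end (172 versus 68 masks) relies on an IoU-based filter to find SAM masks with no FastSAM counterpart. A self-contained version of such a utility might look like this; it is an illustrative sketch, not the notebook's exact function.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boolean masks of the same shape."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def missing_masks(sam_masks, fastsam_masks, iou_threshold=0.5):
    """Return SAM masks with no FastSAM mask above the IoU threshold."""
    return [
        m for m in sam_masks
        if all(mask_iou(m, f) < iou_threshold for f in fastsam_masks)
    ]
```

Plotting only the returned masks, as done in the video, makes the gap between the two models visible at a glance on complicated scenes.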
Info
Channel: Roboflow
Views: 14,991
Keywords: Segment Anything, Segmentation, SAM, FastSAM, Segment Anything Model, Meta AI, Image Segmentation, Python, Computer Vision, Zero-Shot, Promptable, SA-1B, CLIP, YOLOv8, Prompt
Id: yHNPyqazYYU
Length: 16min 2sec (962 seconds)
Published: Fri Jul 07 2023