Segment Anything Paper Explained: New Foundation Model From Meta AI Is Impressive!

Video Statistics and Information

Captions
Just a few days ago, Meta published the "Segment Anything Model", or SAM. Segmentation is the process of dividing an image into smaller regions, where each region corresponds to a specific object or the background in the image. And even though SAM is an image-processing model, it is somewhat similar to models like ChatGPT and Bard. In this short video, I will cover what's interesting about SAM, how to use it, and why it matters for AI research.

Let me first show you how SAM works on Meta's demo page. I'll use an image that I previously generated in Midjourney, to make sure that SAM has never seen this image before, and I will simply upload it to the model. This step takes a second or two, and after it's done, we can prompt the model in several ways to get the segmentation we want. We have several options. We can select some points in the image, and SAM will find a segmentation that corresponds to this selection; for example, this point selected the face, and by clicking on another point, we can add the hair. The demo page also has a multi-layer view that predicts multiple mask options from a single point. Next, we can prompt the model with a bounding box; like this, I can select the region around the lightsaber, and SAM predicts what I want and correctly segments the object. And lastly, we can click on "everything" and let the model find all objects in the image automatically. Notice that SAM won't tell you what each segment is, but it can be the first step of an object-detection pipeline or some other image-processing task.

So what is SAM, and how does it work? First and foremost, it is not a standard segmentation model. The main goal of this project is to develop a "foundation model" for image segmentation. The term was coined two years ago in a Stanford paper, and it refers to any model that is trained on broad data and can be adapted to a wide range of downstream tasks. Examples of foundation models include BERT, CLIP, GPT-4, and Bard. Foundation models are usually trained on a large amount of data - we're talking about billions of data points - but no such data were available for image segmentation before this work.

So Meta built something called a "data engine": a model-in-the-loop annotation strategy that collected training data in three stages. The first stage was called Assisted-Manual. Here, Meta hired a team of professional annotators to label images with segmentation masks. The annotators labeled any object they could find, in order of prominence, and they were assisted by a segmentation model that had previously been trained on standard segmentation datasets. In this stage, the data engine collected over 4 million masks from 120 thousand images.

The second stage was called Semi-Automatic, and the goal here was to increase the diversity of the dataset. It also involved human annotators, but this time they were presented with already annotated images and were asked to label anything else they could find. After this stage, the average number of masks per image went from 44 to 72, and the data engine collected an additional 5.9 million masks.

Finally, the last stage was called Fully-Automatic, and it no longer involved human annotators. Here, they prompted SAM with a 32×32 grid of points and, for each point, predicted a set of masks that may correspond to valid objects.
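As a rough sketch of what this grid prompting looks like in practice, here is a minimal example using the segment_anything package that Meta released alongside the paper. The checkpoint file name refers to the publicly released ViT-H weights, and the image path is a placeholder you would swap for your own:

```python
# Install with: pip install git+https://github.com/facebookresearch/segment-anything.git
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load the ViT-H SAM checkpoint and move it to the GPU.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

# points_per_side=32 prompts the model with a 32x32 grid of points,
# mirroring the fully automatic data-engine stage (and the "everything"
# button on the demo page).
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)

# SAM expects an HxWx3 RGB uint8 array; "my_image.jpg" is a placeholder path.
image = cv2.cvtColor(cv2.imread("my_image.jpg"), cv2.COLOR_BGR2RGB)

masks = mask_generator.generate(image)
print(len(masks), "masks found")
# Each entry is a dict with the binary mask plus metadata such as its area,
# bounding box, and the model's own predicted IoU.
print(masks[0].keys())
```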
Meta applied this automatic process to 11 million high-resolution images that they had previously collected from a provider and obtained 1.1 billion high-quality masks. Here, you can see some complex images where SAM predicted more than 500 masks, and the examples look pretty good.

So that is the final dataset, called SA-1B, and Meta is making it publicly available under a permissive license. It has 6x more images and 400x more masks than any existing segmentation dataset, so that is a big contribution to this field. But note two things: it includes only the masks from the Fully-Automatic stage, and the images are slightly downsampled compared to what SAM was trained on.

Using this dataset, Meta trained the Segment Anything Model, and its architecture is unique because, similar to ChatGPT, it allows for prompt engineering, and it also does zero-shot learning. Let me explain. The model has three main components. The first one is an image encoder that takes an image and computes its embedding. That happens when you upload an image to SAM on the demo page, and it is the most computationally expensive step. But it happens only once, and then you can start prompting the model for segmentation masks. That is where the second component comes into play: the prompt encoder. It takes a prompt, which can be a set of points, a bounding box, another mask, or simply some text, like "cats", and outputs a prompt embedding. This embedding is then combined with the image embedding and fed into a lightweight mask decoder that predicts our segmentation masks.

So this architecture is promptable, and language models like ChatGPT show that prompting is a promising technique for zero-shot and few-shot learning. But if we want to use SAM as a general tool for segmentation, it must be able to resolve ambiguity in prompts. Similar to ChatGPT, which can write a unique poem about anything you ask for, SAM must be able to give a reasonable segmentation for any prompt. For example, when I select a point on my image here, it must consider that maybe I want a segmentation of the face, but maybe I want a segmentation of the whole body. To address this ambiguity, SAM predicts multiple valid masks with a confidence score for each mask.

Using this method and some other modeling choices, we can now solve other downstream tasks by simply engineering appropriate prompts. The authors evaluated SAM's zero-shot capabilities on several tasks. One of them is edge detection. Here, you can see an example where SAM predicts very reasonable edges from the input image, even though it wasn't trained for edge detection. Of course, it performs worse than the state-of-the-art models trained on edge-detection datasets, but it compares well with other task-specific models, and it performs far better than other zero-shot techniques.

So that is the Segment Anything Model. This foundation model definitely has a lot of potential, and I hope that it enables further research in this area. Because of its zero-shot capabilities, it can be combined with many interesting image-processing systems, such as MCC, which performs 3D reconstruction from images. Overall, it's great to see that Meta is releasing such a large dataset. They also released the code for SAM, and it is very easy to set up, so if you are interested, you can try and run SAM on your own machine. It took me only about 5 minutes to set it all up, and I will leave links to all the resources needed for that in the description below the video.
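To give an idea of how little code that takes, here is a minimal sketch of the encode-once, prompt-many workflow with the released segment_anything package, again assuming the public ViT-H checkpoint; the image path and the point and box coordinates are placeholders:

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H checkpoint and wrap it in a predictor for interactive prompts.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("my_image.jpg"), cv2.COLOR_BGR2RGB)

# The expensive step: the image encoder runs once and the embedding is cached.
predictor.set_image(image)

# A single foreground point (x, y); label 1 = foreground, 0 = background.
# multimask_output=True returns several candidate masks with confidence
# scores, which is how SAM handles an ambiguous click (face vs. whole body).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
for mask, score in zip(masks, scores):
    print("mask pixels:", int(mask.sum()), "confidence:", round(float(score), 3))

# The cached embedding can be re-prompted cheaply, e.g. with an XYXY box
# around the object of interest.
box_masks, box_scores, _ = predictor.predict(
    box=np.array([100, 200, 400, 600]),
    multimask_output=False,
)
```

The key design point this illustrates is the split between the heavy image encoder and the lightweight prompt encoder and mask decoder: one embedding per image, then near-instant predictions for every new prompt.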
If you liked the content, please consider subscribing, and thanks for watching!
Info
Channel: Bots Know Best
Views: 10,118
Keywords: AI, Meta, Segment Anything, Computer Vision, Computers, AI Research, Machine Learning, FAIR
Id: JUMmqX-EHMY
Length: 6min 15sec (375 seconds)
Published: Thu Apr 13 2023