Segment Anything! Meta's Amazing New AI

Video Statistics and Information

Captions
Segmentation is the ability to take an image and identify the objects, people, or anything of interest. It's done by identifying which image pixels belong to which object, and it's super useful for tons of applications where you need to know what's going on, like a self-driving car on the road identifying other cars and pedestrians. We also know that prompting is a new skill for communicating with AIs. What about promptable segmentation?

Promptable segmentation is a new task that was just introduced with an amazing new AI model by Meta: SAM. SAM stands for Segment Anything Model and is able to segment anything following a prompt. How cool is that? In one click, you can segment any object from any photo or video. It's the first foundation model for this task, trained to generate masks for almost any existing object. It's just like ChatGPT for segmenting images: a very general model, pretty much trained with every type of image and video, with a good understanding of every object. And similarly, it has adaptation capabilities for more complicated objects, like a very specific tool or machine. This means you can help it segment unknown objects through prompts without retraining the model, which is called zero-shot transfer; zero-shot as in it has never seen that in training. SAM is super exciting for all segmentation-related tasks, it has incredible capabilities, and it's open source. It's super promising for the research community, including myself, and has tons of applications. You've seen the results, and you can see even more using the demo linked below if you'd like.

We've also had a quick overview of what it is, but how does it work, and why is it so good? To answer the second question of why it's that good, we must go back to the root of all current AI systems: data. It's that good because they trained it with a new dataset, which, as they state, is the largest segmentation dataset ever. Indeed, the dataset, called Segment Anything 1-Billion, was built specifically for this task and is composed of 1.1 billion high-quality segmentation masks from 11 million images. That represents approximately 400 times more masks than any existing segmentation dataset to date. This is enormous and of super high quality, with really high-definition images, and that's the recipe for success: always more data and good curation.

Other than data, which most models use anyway, let's see how the model works and how it implements prompting into segmentation tasks, because this is all related. Indeed, the dataset was built using the model itself, iteratively, as you can see here on the right: they use the model to annotate the data, further train the model, and repeat. This is because we cannot simply find images with masks around objects on the internet. Instead, we start by training the model with human help to correct the predicted masks. We then repeat with less and less human involvement, primarily for the objects that the model didn't see before.

But where is prompting used? It's used to say what we want to segment from the image, as we've talked about in my recent podcast episode with Sander Schulhoff, founder of Learn Prompting, which I think you should listen to. A prompt can be anything; in this case, it's either text or spatial information, like a rough box or just a point on the image, basically asking what you want, or showing it. Then we use an image encoder, as with all segmentation tasks, and a prompt encoder. The image encoder will be similar to most I already covered on the channel, where we take the image and basically extract the most valuable information from it using a neural network. Here, the novelty is the prompt encoder. Having this prompt encoder separated from the image encoder is what makes the approach so fast and responsive, since we can simply process the image once and then iterate prompts to segment multiple objects, as you can see for yourself in their online demo.
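To make that encode-once, prompt-many-times workflow concrete, here is a minimal sketch following the usage shown in Meta's public segment-anything repository. The checkpoint filename and image path are placeholders you would substitute with your own downloads:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM checkpoint (downloaded from Meta's repository).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs only once per image.
image = np.array(Image.open("photo.jpg").convert("RGB"))
predictor.set_image(image)

# Prompt 1: a single foreground point at pixel (x=500, y=375).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,       # SAM proposes several candidate masks
)

# Prompt 2: a rough bounding box, reusing the same image embedding.
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),  # x1, y1, x2, y2
    multimask_output=False,
)
```

Each call returns binary masks you can overlay on the original image; only the lightweight prompt encoder and mask decoder re-run between prompts, which is why the demo feels so responsive.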
The image encoder is another Vision Transformer, or ViT, which you can learn more about in my Vision Transformer video if you'd like. It will produce our image embeddings, which are our extracted information. Then we will use this information along with our prompts to generate a segmentation. But how can we combine our text and spatial prompts with this image embedding? We represent the spatial prompts through the use of positional encodings, basically giving the spatial information as is (there's a rough sketch of this idea at the end of this transcript). Then, for the text, it's simple: we use CLIP, as always, a model able to encode text similarly to how images are encoded. CLIP is amazing for this application since it was trained with tons of image-caption pairs to encode both similarly, so when it gets a clear text prompt, it's a bridge for comparing text and images (a sketch of this follows as well).

And finally, we need to produce a good segmentation from all this information. This can be done using a decoder, which is, simply put, the reverse network of the image encoder, taking condensed information and recreating an image. Though now we only want to create masks that we put back over the initial image, so it's much easier than generating a completely new image as DALL-E or Midjourney do. Such models use diffusion models, but in this case, they decided to go for a similar architecture to the image encoder: a Vision Transformer-based decoder that works really well.

And voilà! This was a simple overview of how the new SAM model by Meta works. Of course, it's not perfect and has limitations, like missing fine structures or sometimes hallucinating small disconnected components. Still, it's extremely powerful and a huge step forward, introducing a new, interesting, and highly applicable task. I invite you to read Meta's great blog post and paper to learn more about the model, or try it directly with their code or demo; all the links are in the description below. I hope you've enjoyed this episode, and I will see you next time with another amazing paper!
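As promised above, here is a rough sketch of how a spatial prompt like a point can be turned into an embedding with positional encodings. SAM's prompt encoder uses a random Fourier-feature style encoding; the dimensions and normalization below are illustrative, not Meta's exact implementation:

```python
import numpy as np

embed_dim = 256
rng = np.random.default_rng(0)
# A fixed random projection matrix, shared across all points.
gaussian_matrix = rng.normal(size=(2, embed_dim // 2))

def encode_point(x: float, y: float, width: int, height: int) -> np.ndarray:
    """Map a pixel coordinate to a positional embedding."""
    # Normalize pixel coordinates to [-1, 1].
    coords = np.array([2 * x / width - 1, 2 * y / height - 1])
    # Project to many random frequencies, then take sine and cosine.
    projected = 2 * np.pi * coords @ gaussian_matrix
    return np.concatenate([np.sin(projected), np.cos(projected)])

point_embedding = encode_point(500, 375, width=1024, height=768)
print(point_embedding.shape)  # (256,)
```

The mask decoder can then attend over this embedding alongside the image embedding, which is how "a point on the image" becomes something a Transformer can reason about.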
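And here is what the text path could look like, using OpenAI's CLIP package to place a text prompt in the shared text-image embedding space. Note that the paper explores text prompts, but the publicly released SAM model and demo don't expose them, so this is purely illustrative:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode a text prompt into CLIP's joint text-image embedding space.
tokens = clip.tokenize(["a photo of a dog"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)  # shape: (1, 512)
```

Because CLIP was trained to place matching captions and images close together, an embedding like this can act as a prompt token telling the decoder what to look for in the image.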
Info
Channel: What's AI by Louis-François Bouchard
Views: 18,313
Keywords: ai, artificial intelligence, machine learning, deep learning, ml, python, data, data science, ai news, whats ai, whatsai, bouchard, ai simplified, simple ai, ai explained, ai demystified, demystify ai, explain ai, meta, segment anything, sam, sam model, sam ai, ai sam, meta ai, ai meta, meta sam model, meta sam segmentation, promptable segmentation, prompting, prompt segmentation, segment with prompt, segment with prompting, prompting segmentations, open ai, openai, metaai
Id: bx0He5eE8fE
Length: 6min 31sec (391 seconds)
Published: Thu Apr 06 2023