AI-powered image annotation with James Gallagher

Captions
Hello and welcome to the Satellite Image Deep Learning podcast. I'm Robin Cole, and it's my pleasure to present another technically focused episode in the series. In this episode I catch up with James Gallagher to learn about the latest AI innovations reshaping image annotation. Our conversation covered significant new models such as Segment Anything, Grounding DINO and RemoteCLIP, and discussed how these models can be linked together to enable new capabilities. I hope you enjoy this episode.

Hi James, how are you?

I'm doing great, thanks Robin. How are you?

Fantastic, thanks. Really looking forward to hearing about what you're up to. Do you mind giving us a quick background on where you work and what you do?

Absolutely. So I'm James, I work at Roboflow on our marketing team, and I'm responsible for not only a lot of our written content but also our open source projects. I help out with supervision, a popular library used in computer vision applications, and Autodistill, which I hope we can talk a bit more about, for automatically labeling images. I also spend a lot of time exploring the frontier of where vision is going, so playing around with foundation models like RemoteCLIP, Hugging Face's IDEFICS, SAM, really anything that I think can help accelerate us towards what I think the future of vision will be, where we don't have to label as many images anymore and we can get straight to a model, straight to value.

Fantastic. Annotating imagery is very time consuming, and for me personally I've been very excited by all these new innovations which can speed up that process while not reducing the quality of the annotation. So what's the state of play today? What is possible on a platform like Roboflow?

Absolutely. So I'm going to share my screen quickly and talk through a demo. One application of the Segment Anything Model, which Facebook released earlier this year, is automated labeling. Segment Anything allows you to, as the name suggests, segment anything in an image, and do so at scale. In this case we've got it running in the Roboflow annotation tool, so you can just go from image to image. Normally you'd be drawing bounding boxes, or clicking point after point on polygons, zooming in, zooming out, refining, and, if you're working with other labelers, trying to convey how to do that properly, which is difficult. So we added SAM within two weeks of its release. On my screen now I'm zooming into an image of aerial solar panels I have in Roboflow, and I can hover over particular parts of the image, say a swimming pool, part of a roof, or indeed a solar panel, and rather than draw a bounding box or polygon around it, I can click and keep refining from there. In this case I can just hit enter and save my label. If I wanted to start labeling trees, for example, I could hover over trees. This is one way we make labeling faster, particularly with aerial imagery, where oftentimes objects like solar panels or roofs or airplanes are very distinct from their surroundings, so we find it's really effective on the first shot, and you can also refine your annotations from there.
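Outside the Roboflow UI, the same point-prompting workflow can be reproduced with Meta's segment-anything package. A minimal sketch, assuming a downloaded ViT-H checkpoint; the image path and click coordinates are hypothetical:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM and compute the image embedding once per image.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("aerial_tile.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single positive click on, say, a solar panel (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[420, 310]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks at different granularities
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring mask as the annotation
```

Adding further positive or negative clicks to point_coords and point_labels refines the mask, which is roughly what the click-and-refine interaction in the annotation tool is doing.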
Right, so a big time saver. And Segment Anything itself, was that trained on any aerial imagery, or is it just a property of the model that it generalizes well to aerial imagery?

I've found Segment Anything generalizes well to a range of different object types. In some of my testing, aerial imagery in particular, like airplanes and some features at airports, has worked really well. The dataset was vast, and I haven't looked inside it, but from aerial imagery all the way to labeling animals, to segmenting windows for surveying and property management, we've seen excellent results.

In terms of the experience using it, it will segment anything, as you say, but it won't tell you what that thing is. You still need to say this is a tree, this is a roof, something like that?

Exactly, and that's an unfortunate drawback, but Segment Anything was so revolutionary that we're lucky to have it, and we are engineering around that. Segment Anything in particular can't tell you what something is, but other models can. So one thing we're experimenting with a lot at Roboflow, and we're going to have a product announcement coming out in the coming weeks for this, is helping you automatically label images by chaining these models together. I'll talk a bit more about the framework in a moment, but one idea is that Grounding DINO, for example, a popular zero-shot object detection model which now finds its way into the backbone of various other research projects, can be combined with Segment Anything. I can say to Grounding DINO, tell me where all the planes are in this image, and it will find the planes, and then I can say run SAM only on those regions and segment the planes. So now what I have is a segmentation mask and a bounding box for all the airplanes, and an image with a label already done. We're definitely not there yet in terms of being able to generalize across lots of objects, smaller objects in particular pose a problem, but we're definitely trending towards combining some of these models together and seeing what we can do.

Right. And as I understand it, Segment Anything can be interacted with in a couple of ways, but one is to put a box around something you care about and it will annotate the edges of that, and you can also provide multiple boxes and it will put edges around multiple objects in that case. So the idea here is that one model produces the box, and then Segment Anything comes in and says this is the boundary of the thing you've got a box around?

Exactly.
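That detect-then-segment chain can be sketched in a few lines. The example below assumes the autodistill-grounding-dino wrapper and Meta's segment-anything package, plus a downloaded SAM checkpoint; the image path and class names are hypothetical.

```python
import cv2
from autodistill.detection import CaptionOntology
from autodistill_grounding_dino import GroundingDINO
from segment_anything import sam_model_registry, SamPredictor

# Grounding DINO proposes boxes from a text prompt ("airplane" -> saved label "plane").
dino = GroundingDINO(ontology=CaptionOntology({"airplane": "plane"}))
detections = dino.predict("airport.jpg")  # supervision Detections with xyxy boxes

# SAM turns each detected box into a pixel-accurate mask.
image = cv2.cvtColor(cv2.imread("airport.jpg"), cv2.COLOR_BGR2RGB)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

masks = []
for box in detections.xyxy:
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])  # one boolean mask per detected airplane
```

This is essentially what the Grounded SAM combination discussed later in the conversation packages up behind a single call.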
One other way, as I just described, is using Grounding DINO to do object detection, but another thing you can do is combine Grounding DINO, SAM and a third model, CLIP, to refine your predictions even further. This comes down to my playing around with models a lot and figuring out which one works best for different use cases, but one combination we've found works exceptionally well is Grounding DINO, SAM and CLIP. Grounding DINO finds, say, all the airplanes, SAM segments out those airplanes, and then CLIP does a more refined classification. Let's say I have a ground view of airplanes on tarmac and I want to know how many Lufthansa planes there are, or how many Aer Lingus planes there are. I can use CLIP to do that classification because it has such general knowledge, so I get that bounding box, I get that mask, and then I replace the label from my initial prompt, like airplane, with something more specific. One example I can share, from a project I worked on recently which is trending on GitHub today, is me trying to label logos. Grounding DINO and SAM don't know what a McDonald's logo is, but they know roughly what a logo looks like, so I gave Grounding DINO the prompt "McDonald's", SAM gave me segmentation masks, and then I gave CLIP labels like Burger King, McDonald's and a couple of other brands, and it did this really well. So particularly for that level of refined classification, brand detection, or indeed analyzing the status of something, like is this food half eaten, CLIP really comes into play.

And CLIP, does that take text inputs, or can you also supply a sort of template and examples?

That's a great question. There are two ways in which I use CLIP. The first is: here are a couple of text labels, in this case McDonald's and Burger King, or if I was doing something like product detection it might be different product SKUs or brand names. With that said, CLIP doesn't know everything. We can get around that limitation by using the CLIP embeddings and working with them in a couple of different ways. If I'm doing a plain classification task, I could train a linear probe using CLIP embeddings and then use that to help classify, with which we find good results, and Joseph, our CEO, has been playing around with that for a few years now in his spare time, just excited about the possibilities. Also, one thing we haven't quite got out the door yet, but it's coming soon, is labeling images using CLIP with embeddings rather than a text label. Let's say I know what ten different vinyl records look like, for example. What I can do is calculate a CLIP embedding for each vinyl record cover and then use CLIP to label those images. So rather than just asking does this contain a vinyl record or not, I can ask does it contain Taylor Swift's Midnights, does it contain Jon Batiste's We Are. I use that example in particular because I used CLIP to automatically label some images, and I'll talk about how in a moment. I gave it 40 images, CLIP and Grounding DINO labeled them all, it took me only two minutes of human review, and I then had a trained model which could identify ten different vinyl records. So we're increasingly moving away from "here are 10,000 images, I've got to go find an external labeling team", which in the case of, say, government imagery isn't even a given, since there are high requirements in terms of the security clearances needed just to see the images, and if you're working with something like tax documents, outsourced labeling is more difficult to get going. If we can start using these models to label images, we can save a lot of time and money, and again, as I mentioned earlier, get to value quicker, and then get an active learning loop going and refine our model over time.
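The embedding-based labeling described here can be sketched with OpenAI's clip package: embed one reference image per class, then assign each crop the class of its most similar reference. The file paths and class names below are hypothetical.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalised CLIP image embedding for one image."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image)
    return features / features.norm(dim=-1, keepdim=True)

# One reference image per record cover we want to recognise.
references = {
    "midnights": "covers/midnights.jpg",
    "we_are": "covers/we_are.jpg",
}
names = list(references)
reference_embeddings = torch.cat([embed(p) for p in references.values()])

def label(crop_path: str) -> str:
    """Assign the crop the class of its nearest reference by cosine similarity."""
    similarities = embed(crop_path) @ reference_embeddings.T
    return names[int(similarities.argmax())]

print(label("crops/record_001.jpg"))
```

The same embeddings can also feed a linear probe, for example scikit-learn's LogisticRegression, when a trained classifier is preferred over nearest-reference matching.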
I want to dig a bit deeper on the embeddings side, because there are lots of different ways of creating embeddings, and you might imagine having a custom model, say on some medical data or some other kind of data that's less common than the examples you've shown. How would that fit into the sort of pipeline you've outlined here?

Absolutely. Here I'll introduce Autodistill, and then I'll talk about how it could specifically be used for that use case. Autodistill, and let me share the README, is a framework we've been working on for about nine months at Roboflow. We're continuously refining it, and we get feedback literally every week about how we can improve it and what we can add. Autodistill is helping progress us from where we are right now, which is humans doing most of the labeling for a lot of use cases, towards reducing the amount of human involvement. The ideal paradigm we're striving towards, and we see this as an intermediary step, is machines label images, humans review and QA those images, and then you train a model, and eventually maybe that QA time will be reduced, but we'll take stepping stones. What Autodistill does is take a large foundation model like SAM, Grounding DINO or CLIP and let you label images with a simple, standard API. The API works the same across CLIP as it does for MetaCLIP, Meta AI's version of CLIP, and the same with AltCLIP, which does bilingual embeddings, and indeed across object detection and classification. Here's an example on screen: we literally just fed Autodistill a video of milk bottles and now we can identify all the bottles. Not perfect, there are some false positives and some overlapping detections, but again, for no labeling, that's amazing. Then we get to this API here, and I'll zoom in, where what we can say is: I want to use Grounded SAM, I want to label the shipping containers in this image, and I want to save those in my dataset with the label "container", because that's the ontology I'm already using. Then I can pass in a whole folder of images, Grounded SAM, which is Grounding DINO and SAM combined, will iteratively run over all of them and save the results as a YOLO dataset, and I can plug that straight into a YOLO model to train on my own device using, again, a similar API. I could replace YOLOv8 with YOLOv5 and do the same thing. I can train a model, upload it to Roboflow for deployment and then use our API, or, if you need to do that QA stage instead of training on your own hardware, just upload your images to Roboflow, refine them and figure out whether it's working. We also have, powered by supervision, the ability to visualize predictions in a few lines of code. So if you're wondering whether this will actually work on, let's say, your aerial imagery, and it might not, because some of these general models don't work well on more specific, refined classes, you can test it out and see how it goes. One anecdote I can give is that we had a research project at Roboflow where we were building a logistics dataset. We wanted to identify 20 different classes, from a wooden pallet to a forklift, to people, to safety violations like not wearing a safety vest on a construction yard: 100,000 images, of which approximately 70% were auto-labeled with Autodistill. The model combination we used there was Detic, a slightly older model by Facebook that just worked really well for the classes we were targeting. We labeled those images and had a YOLO model at the end, so instead of the roughly two-to-five-second inference time for Detic, if I remember correctly, we got down to multiple frames per second with a YOLO model. That's where we want to be, and when it comes to remote sensing, that's where we want to be there too. We don't currently support RemoteCLIP, which I think is a really promising remote sensing model, but we want it to be that simple, where you say: classify these images with RemoteCLIP, save them into a classification dataset, train a model, or do the same with object detection. And as I mentioned earlier, you can combine these, maybe Grounding DINO to find the planes, then RemoteCLIP for more refined classification.
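The shipping container example described above maps onto a short script. This is a sketch based on the Autodistill README of the time, assuming the autodistill-grounded-sam and autodistill-yolov8 packages; the folder paths are hypothetical.

```python
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM
from autodistill_yolov8 import YOLOv8

# Ontology: the prompt sent to the foundation model -> the label saved in the dataset.
ontology = CaptionOntology({"shipping container": "container"})

# Grounded SAM (Grounding DINO + SAM) labels every image in the folder
# and writes a YOLO-format dataset.
base_model = GroundedSAM(ontology=ontology)
base_model.label(input_folder="./images", output_folder="./dataset")

# Distill the auto-generated labels into a small, fast target model.
target_model = YOLOv8("yolov8n.pt")
target_model.train("./dataset/data.yaml", epochs=100)
```

Swapping the target for YOLOv5, or the base model for another wrapper, is largely a change of import and class name, which is the point of the standard API.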
It's really interesting to take a step back and think about how the process of creating datasets is changing. We're moving from people manually identifying all of the pixels, which is where the time is spent, towards refining which choice of models you use to generate these predictions, and how you then update those models to improve the predictions.

Absolutely. I've mentioned Grounding DINO a lot, and of course GPT and its vision capabilities are on the horizon. We're seeing some interesting performance, and indeed we're going to be releasing a tool this week that shows GPT running on different prompts, so that you can see day by day how it's changing, whether it improves at document classification, or rather document OCR, document understanding, graph understanding, all these things. That's a side note, I just like sharing things beforehand. But these big models like Grounding DINO and SAM are amazing for more general objects, and we do see a world where you're combining these models together. It does involve a bit of understanding and exploration, and that's where I've been spending a lot of my time, just understanding what models are out there. Of course I keep mentioning RemoteCLIP: it was trained specifically on a dataset for remote sensing and achieved state-of-the-art performance across a range of benchmarks for zero-shot image classification. Do I think GPT is going to be able to identify all these very specific things in remote sensing? Probably not, because it's not their goal to refine on these very particular things, but it is someone's goal, because it has importance. So we want to help people bring these models together, label your data, do a bit of human QA, and have a model at the end.
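For readers who want to try the RemoteCLIP model mentioned here, it is distributed as OpenCLIP-compatible checkpoints. The sketch below assumes the open_clip and huggingface_hub packages and the published repository and checkpoint names, which should be verified against the RemoteCLIP project; the class prompts and image path are hypothetical.

```python
import open_clip
import torch
from huggingface_hub import hf_hub_download
from PIL import Image

# Load a standard OpenCLIP ViT-B/32 and swap in the RemoteCLIP weights.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
checkpoint = hf_hub_download("chendelong/RemoteCLIP", "RemoteCLIP-ViT-B-32.pt")
model.load_state_dict(torch.load(checkpoint, map_location="cpu"))
model.eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Zero-shot classification of one remote sensing tile against text prompts.
classes = ["an aerial photo of an airport", "an aerial photo of a solar farm"]
image = preprocess(Image.open("tile.jpg")).unsqueeze(0)
text = tokenizer(classes)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```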
You mentioned RemoteCLIP and how it's trained on remote sensing specific data. There are quite a few domains where you can imagine that happening, like medical imaging, remote sensing and other specialized domains, and many of these datasets, as you previously mentioned, are being used for training foundational models. What's the bigger context there about how foundational models and the sort of models you've just been talking about relate to each other?

Absolutely. The way things have been going for a long time in machine learning, particularly over the last five years, is that every so often we see a transformational innovation that changes the model architecture itself. Over the last ten years we're familiar with going from AlexNet to RNNs, CNNs, Transformers and so on, but what we've found now is that the data is what really matters. These architectural improvements are crucial, we want to make architectures faster and more accurate, but equally, as the old adage goes in computer science, garbage in, garbage out: if you don't have good data to start with, you're not going to get a good result. That's what I love about initiatives like RemoteCLIP, and I expect we'll see, and maybe they already exist, specific models for medical imaging. We have IBM's foundation models, which they built in collaboration with NASA, for things like flood segmentation, and even Facebook was doing aerial surveying for tree canopies, and they had some depth sensor data in there too. It's all about what data you have available. I feel like we're going to continue moving towards this paradigm where large models like, say, GPT cover a lot of the general cases, like "here's a chart, give me a number" or "here's an image, tell me if there's a person too close to a forklift". That in itself used to take so much engineering and so much time, and GPT can sort of solve it. But if it comes down to "I have this medical image and I want to identify an abrasion or a tumor", then we're not going to rely on these large models, we're going to have refined, fine-tuned models for that use case, developed by experts in that field. So I recommend just checking out the arXiv papers that come out every day, because oftentimes you'll see words like remote sensing, distillation and medical imaging, because people are exploring it. We've got the architecture; now we need the data.

That's a really good summary, thank you. How much data is required, in your opinion, to make a useful foundational model? Is there infinite scaling, where more data always makes a better model, or do you think at some point people will just say this is enough data, we're done, and move on to something else?

Absolutely. CLIP used millions of images, and not only images but also image-text pairs. These general models, like, say, GPT, have the advantage of internet-level scale, so they can crawl the web, and to develop text pairs there are many things you can do, like looking for words closely associated with an image in alt text and building labels from there. But when it comes to remote sensing or privileged data, that's not possible. You may already have a repository of data, but even in the case of CLIP there was still that human stage where you need to verify your labels, you need to make sure your labels match up to your images, so data quality becomes even more important there. In terms of the size of a foundation model, I don't think we've converged on any particular range of how many images you need, but there was an interesting revelation earlier this year that helps change the way we can think about these models. We've spoken about SAM a lot, and it's a really powerful model, but researchers took a small percentage of the data used to train Segment Anything, I can't remember the exact number, it was in the range of 10% if not less, and trained a smaller model on it. Was it as accurate, were the masks as clean as what I just showed you on that solar panel? No. But was it reasonably accurate? Yes. And can it run on your Mac, on lower-end hardware, without almost crashing the machine? Yes. So that has me thinking: CLIP, SAM, a lot of these models are trained on millions, if not billions in the case of SAM, of images or masks and annotations. Perhaps there's a world where, as these architectures improve and data quality improves, we find out that you need fewer images to train a model for your use case. And again, it's all about what you need. FastSAM is a great name, because it is faster. FastSAM, which is integrated with Autodistill too, comes from the angle of: SAM is slow, let's make it faster. Your predictions aren't going to be as accurate, but human review can be done there, and more importantly you can do active learning, and if it doesn't work out and you've only spent an hour building that model, that's okay. Maybe another model will come along later that you can just plug right in, and hopefully with Autodistill you change maybe three or four words to use the new model once we've packaged it up, run it on your dataset, and maybe you'll see something really interesting.
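To illustrate the "change three or four words" point, here is what swapping the labeling model might look like in Autodistill. The FastSAM wrapper import and class name below are assumptions; check the Autodistill documentation for the exact package names.

```python
from autodistill.detection import CaptionOntology

# Same ontology and labeling call as before; only the base model changes.
ontology = CaptionOntology({"solar panel": "solar-panel"})

# Previously: from autodistill_grounded_sam import GroundedSAM
from autodistill_fastsam import FastSAM  # assumed wrapper package/class name

base_model = FastSAM(ontology=ontology)
base_model.label(input_folder="./images", output_folder="./dataset")
```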
That's a really interesting perspective. Well, thank you so much for this summary. If people want to follow along with your updates, because there's so much happening and you seem to be really on top of this particular aspect of it, where's the best place for them to follow you?

Absolutely. For Autodistill in particular we have a GitHub organization, which is github.com/autodistill. There are a lot of repositories there; we cover models from CLIP to BLIP to Grounding DINO to SAM to GPT, so there's a lot for you to play around with. Interestingly, one person once left an issue to say that Autodistill saved them having to manually label 14,000 images, and it was like, "I got my weekend back", so that was exciting. So Autodistill on GitHub, and also just the Roboflow GitHub organization in general, where right now we've got a new repository coming out called Multimodal Maestro, which introduces new labeling techniques for working with multimodal models, with implementations of Set-of-Mark to begin with, and we're looking towards more. And as always, roboflow.com: we've got a demo on our homepage where you can play around with a model trained on the COCO dataset, and it's really fun just to feel computer vision running in your browser. From there we have a plethora of resources which will take you from labeling using SAM, to refining your labels as I showed earlier, to training models and then deploying your models at scale.

Well, thanks so much again. I'll put all the details in the show notes, and until next time, thank you.

Thank you, Robin.
Info
Channel: Robin Cole
Views: 1,141
Id: 0sbXo4auyv0
Length: 23min 58sec (1438 seconds)
Published: Sat Jan 13 2024