[Code] How to use Facebook's DETR object detection algorithm in Python (Full Tutorial)

Captions
Howdy-ho, how's it going? So today we are going to try out DETR, the End-to-End Object Detection with Transformers from Facebook AI Research. They have a GitHub repo and they pretty much give you everything: the model, the pre-trained weights, and so on. So today we're going to check out how easy it is to get started with that. In order to do that, they have a Colab, but we won't look at it too much. I've glanced at it, and we'll basically see how far we can go without looking at it too much and how easy that is.

So what I've done is I've spun up a Colab that I will share at the end, and I've imported torch and just loaded the model so you don't have to wait for that to happen. So I've loaded that up and now we have it in the cache, so now we can basically go ahead and load an image into the model and try to detect objects in the image. First of all, this is super easy, right? You simply load this from Torch Hub. It's kind of like TensorFlow Hub: you simply give the name of the model, you say "I want the pre-trained, please", chugga-boom, you now have a model. If we look at that model, this is going to be this entire DETR model right here, with all the transformer and ResNet and whatnot. Okay, this is almost a bit too much right here.

So what we want is an image. Let's go find an image, and where better to find an image than Google? Let's find an image of dogs, because dog is one of the classes in this COCO dataset. This one's nice, right? Okay, so we want the image address, and we want to load it in here somehow. So the URL is... let's make this into some sort of input thing where we can paste the URL right here. Okay, there we go. So we have this right here, and that's the URL. All right, no, that's not the URL at all, is it? Ba-da-bing, ba-da-boom. What about... cool, better.

Now we need to load this. For that we're going to use the requests library, always a pleasure. Requests, requests. So the way to load a binary file is you can put the URL here and you can say streamed
here; I glanced this from the other thing. And the raw entry will get you the eventual bytes. No? Oh sorry: get, URL, streamed... stream, yeah. So this will get you the bytes of the image, and then you just say Image.open. And of course we need Image from the PIL library, the Python Imaging Library, so import Image. We got that, and we can open that image up, and with a bit of luck... yeah, yeah.

So this model expects... I think the COCO dataset is 640 by 480 images, but if you can see right here, and we're going to take a quick glance at their transform, they resize it to 800. So we're going to steal that part right here. Last time, some people found it really funny that I called copy-pasting "to go Siraj", so from now on we'll just call it "Siraj-ing". What we also need are the class labels, because those are defined in the COCO dataset, right? So these are the class labels, let's take those.

Okay, so this T here, these are torchvision transforms, and we're going to need that. If you don't know torchvision, it's kind of an addition to PyTorch that just helps you with images and has a lot of datasets. And these transforms, they're really helpful because... so let's call this image... because you can, you know, resize, but they have much more, like random cropping and rotating images and so on, pretty much everything you need for pre-training. And this here is just the standard, I believe, ImageNet normalization, so these are the means and these are the standard deviations from the ImageNet dataset.

And let's already resize our image, actually, to this 800, and I believe if you rescale the 640 to 800 you get 600 here, right? Fairly sure. Okay, and then let's display it, just because we can. Okay, it's a bit squished, but we don't care. And let's put that up here so we only need to execute it once. Nice. Okay, so from now on it should be a breeze. So what these transforms do is they resize the image. Okay, we don't
need that anymore. They make it into a tensor and then they normalize by that. So if we run our image through this... because our image right now is this PIL image, right? But if we run it through the transforms, then we'll get a tensor, so that's pretty cool. The model, as it is a deep learning model, expects batches, so we'll unsqueeze that in the first dimension and then we get batches. So, shape, let's see... we don't have unsqueeze? No, of course we don't. So this is one image of three channels of 600 by 800. So the y and x coordinates, I guess, are shifted in PyTorch, yes: height comes before width. Cool, so we'll call this our image tensor.

Now we just need to put it into the model. So, model, we put that in there, and since we don't... let's actually, up here, put the model in eval mode. I don't know if that's already done, but you can never be sure enough that the batch norms aren't... I think it probably doesn't have batch norms. Okay, "you're not utilizing the GPU". We'll do that, we'll do that, thanks. So how do we use the GPU? We put our model on the GPU: model equals model.cuda(). Yes, yes, yes, I think so, this is going to work. Okay, we're going to come back to this later. So we forward our image; of course we also need that on the GPU. And it worked. Did this work? This worked, nice. Okay, and since this is just for evaluation, we should probably go with no_grad right here, because we don't need this whole gradient stuff. If we do that... okay, I'm dumb. There you go.

And nothing happens, of course, because we need to capture the output somehow. Let's look at that output. Wow. Wow, just wow. So the output is a dictionary, right, because we get back class labels and bounding boxes. So let's look at the pred_boxes. Let's look at that tensor. That's a tensor, very nice. Let's look at its shape; let's not print giant tensors anymore. Cool. Since this was a batch of one, we should probably go with the zeroth, and you can see right here there are a hundred bounding boxes and each one has four numbers. And if you
go with the other thing that's in there, the logits, then you'll see that there should also be a hundred logits, and each one is of size 92, because there are 92 different classes. 92? We'll see about that. Well, one is going to be the nothing class, right? By the way, how many classes do we have? We have 91 classes. Okay, cool, we can deal with that.

All right, so what are we going to do next? For each of the logit predictions, we want to find which class it corresponds to. So what we're going to do is take the argmax of the last dimension, right? You can see here almost all of these things correspond to class 91, and class 91 is not in our classes, because our classes list is only length 91, so that must be the nothing class. So what we can technically do is: for logits and boxes in... let's just zip them together, like this. Okay, class is... oops, class is the logits' argmax. If that's 92, or let's say, to be safe, if that's larger than the length of our classes, we'll just skip it for now. Okay, so that should work somehow. And if not, then our label should be the classes list indexed right here.

So let's just see what the detector detects right here. It detects nothing. Why does it detect nothing? That doesn't seem good. What are we doing wrong? We zip together the logits... oh yeah, of course, we still need the zeroth entry. We are dumb, dumb, dumb. Cool, so we can delete this, and now, finally: beautiful. Dogs, two dogs detected. Excellent.

So now for each of these dogs we want the bounding box. Okay, so now we somehow need to think about how we're going to draw this on an image. And well, let's actually make a copy of that image, because I don't really trust myself, and then at the end of this we're just going to display that image. Actually, the reason I make a copy is because in this Pillow library you can actually draw on these images, and we're going to use that to draw these bounding boxes.
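The steps walked through so far (load the model from Torch Hub, fetch the image, transform it, run a forward pass, and drop the no-object class) can be sketched roughly as follows. This is a sketch under assumptions, not the notebook from the video: the hub entry point name `detr_resnet50`, all helper names, and the choice to run on whatever device the model already lives on are assumptions. The fixed 800 by 600 resize and the ImageNet normalization follow the walkthrough. The label filter compares with `<` against the list length, because index 91 is equal to (not larger than) the length of a 91-entry label list.

```python
from typing import List, Sequence


def load_detr(name: str = "detr_resnet50"):
    """Load a pre-trained DETR from Torch Hub (entry point name is an assumption)."""
    import torch  # imported lazily so the pure-Python helper below has no dependencies

    model = torch.hub.load("facebookresearch/detr", name, pretrained=True)
    return model.eval()  # eval mode: no dropout, fixed normalisation statistics


def load_image(url: str):
    """Fetch an image over HTTP and return it as an RGB PIL image."""
    import requests
    from PIL import Image

    response = requests.get(url, stream=True)
    # convert("RGB") drops an alpha channel if the file happens to be RGBA
    return Image.open(response.raw).convert("RGB")


def detect(image, model, size=(800, 600)):
    """Run one PIL image through the model; return (logits, boxes) for that image."""
    import torch
    import torchvision.transforms as T

    transform = T.Compose([
        T.ToTensor(),
        # standard ImageNet channel means and standard deviations
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    # PIL takes (width, height); the tensor comes out as (1, 3, height, width)
    batch = transform(image.resize(size)).unsqueeze(0)
    batch = batch.to(next(model.parameters()).device)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(batch)
    # index 0 selects the single image in the batch
    return output["pred_logits"][0], output["pred_boxes"][0]


def keep_real_classes(class_ids: Sequence[int], classes: Sequence[str]) -> List[str]:
    """Drop predictions whose index falls outside the label list (the no-object slot)."""
    return [classes[i] for i in class_ids if i < len(classes)]


# Hypothetical usage (needs network access, torch, torchvision, requests, Pillow):
#   model = load_detr()
#   logits, boxes = detect(load_image(url), model)
#   labels = keep_real_classes(logits.argmax(-1).tolist(), CLASSES)
```

Keeping the label filter as a plain function makes it easy to check without downloading any weights.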
So for that we need an ImageDraw, if I remember correctly, and I think later we also want some text, so we need an ImageFont, yes. All right, so let's draw a bounding box right here. First of all, let's look at that bounding box. Let's call this box: print box.shape and break right here. What's happening? Let's not do this right now. So this is a box of size four. Now this could be two things: it could be x0, y0, x1, y1, so the two corner points or kind of the boundaries, or it could be x, y, width, height. Now, from the paper I know that they predict the center and the width and the height, so I'm going to go with that, and I'm just going to guess that it's x, y, w, h and not some other way around. If this is a bad guess, then, yeah, we'll see. We can just print out one of these boxes, and honestly that looks reasonable.

Oh, by the way, we should scale that up. Yeah, so these are normalized coordinates, probably between 0 and 1, so we should scale that up: the x coordinates should be scaled by 800 and the y by 600. So let's do it. First of all, we scale our box by 800 in the x, and here is a y, and the width is the x direction and this is the y direction. Boom. Okay, we should probably get that on the CPU; we'll just hack together a bunch of things right here. Okay, so now this isn't correct yet... so our x and y and w, h are going to be this box.

So now we need to actually draw on the image, and we're going to do that. Let's first go: x0, x1 is x minus w half, x plus w half; y0, y1 is the same for y with h: minus h half, plus h half. Coolio. Now we need an ImageDraw object, so, I think, Draw on this image. Whatever you draw on the draw object will end up on the image, so we can use that to draw a bounding box. And let's just quickly look it up: "PIL Python draw rectangle", maybe. There we go. Okay, so there's this rectangle, yeah, there's the rectangle function, and you can see you put in a shape, x, y here and width, height like this. Wait, for real? We wouldn't even need to transform it. I'm pretty sure
you can go x... I thought I remembered you could do the different thing as well, but it's called rectangle. Okay, so let's do that. So draw.rectangle, and we'll go x0... or we'll go x, y, width, height. Let's display that down here. Yeah, that looks nothing like what we want, but, you know, it's a start. Maybe we actually need the other thing here; we need x0, y0, x1, y1. Mm, yes, yes, doggy. Okay, we still have the break in here. Now we get both dogs. Nice, nice.

Okay, let's do, I think, fill... yes, red, and let's go for width five or so. Five seems like a good width. Oh god, five is a terrible width. Oh, it's not fill, I think it's outline. Yeah, yeah. Okay, let's still go with five. Cool, we got our dogs.

Now we need to put some snappy text labels. I think there's actually a PIL ImageDraw text; I think that exists, because I have this font thing. Yeah, exactly, so you need the font thing, get a font in there, and then, yeah, exactly, you can put a text like this. Okay, so you probably need the x and y coordinates of the text, so let's do that: draw.text, and let's just go with x and y right here, put it right in the middle. The text is going to be our label, of course, and we want the fill, which is now going to be the color of the text; let's go with white. And the font, we're going to load some font right here: ImageFont dot... how are we doing this? Truetype, truetype... ah, no, not cheating. Let's just go with regular fonts. It won't look as fancy, but we'll be fine. So where is our text? You see it? I don't see it. Red, let's make it red. Yes, there we go. Okay, so it wasn't red enough. This should work, no? So did I just not see it? I'm dumb.

Cool, so we have two dogs. How easy was that? Actually, we wasted the most time with bounding boxes and stuff. Absolutely cool, right? Okay, so now we can have some fun with it. I'm going to scale this down for a bit, because you don't need to see the actual code so much anymore, so you can see the image more. So we'll go to the images, and the first thing I want to
do is the dress. What does this think of the dress? Okay, so we'll copy that and we'll go into our Colab and just paste this right here. Ba-da-boom, ba-da-bing. Sounds nice. And what is wrong? "The size of a tensor must match the size of a tensor". Did we do something wrong? Transform, image... maybe this is like an RGBA image. I think if this is RGBA, we should just convert it to RGB. Pretty sure you can do something like this right here. This should work: if it has an alpha channel, then that will remove it. Yes, now it works. Okay, let's see what the model thinks of this. Yeah, okay, apparently there's a car and there's a surfboard and there's a person and there's a person. Nice. Well, we didn't figure out whether the dress was blue or white or gold; it was just a person. Now, you could actually threshold by how sure you are of a given class, but where's the fun in that?

So let's go further and let's do some Rorschach inkblots, because those are always lots and lots of fun. So which one should we go for? This one looks like fun. Okay, so we'll put this into here, and it's astonishing, right? This COCO dataset only has these 90 classes; it doesn't have anything else. So it's a cake. It's a cake, and this here, what is it? Okay, we'll have to go maybe with blue. What is it? Stop sign. Okay.

But so you might think: what if we want more, what if we want more predictions? So there is a hack, right? Right now the model can always assign mass to this not-a-class thing, like right here, this class 91, in order for it to say "I don't think there's anything there". But generally we have a hundred predictions, right? So you see where this is going. So yes, let's change it up a bit, and let's go here. Let's first extract these tensors, logits and boxes. Okay, so we have the logits and boxes, we got that. What we want to do is basically just remove the last class before we do the argmax
and thereby we want to force the model to make a prediction. Now, it won't be a very good prediction, because of course this is only the second-highest class, and it's arguable how much that counts, but still, it will do something. So this must be done in the logits, right? So we'll look at the logits, and the logits are of shape 100 by 92, so we have 100 predictions of 92 classes. Now, the first thing we want to do is just remove the last class, so let's go with everything here until the last class. All right, so now we have 91. Actually, let's make it more generic: however many classes there are. Okay, so we don't have this class anymore. So now, if we do the softmax over the last dimension, we still get 91, but now they're normalized, so they add up to one, so it's kind of a probability distribution.

Next, we want to find the max over this, and that will give us a max output. We don't want to plot all the 100 predictions, because that would just be squares all over the place and we'd have no clue what's happening. So with this max output right here, what we're trying to find is, let's say, the five best predictions or so, the five where the model is most confident. It's not a really good metric, but, you know. So these are the probability values of all of the hundred predictions, and what we want is the top k. Okay, so let's go with five, and again we'll get a top-k output, let's call that topk, and I think it also has values and indices, yes. So now we simply need to filter from the logits and the boxes where these top ones are. So we'll filter the logits by those top-k indices, and we'll also filter the... I am not very gifted today... boxes.

By the way, I'm using a Colab just because it's nice to kind of play around with a model, because if I were to use a file I'd have to reload the model over and over again, which is just not as nice. So now we have the logits and the boxes, and if
we do that right now, we always get the top five predictions. How nice is that? And you can see the top five predictions are probably still cake, cake, cake, cake. And just to verify that, we can print its shape. Yeah, see, this is what I don't like about this stuff. Yes, okay, so we just have five predictions of 92 things, and we don't want the 92, we've already said, so we just want the 91. Let's actually put that here. Okay, so now we have five by 91, and now this gives us the top five. There we go: so many cakes and many stop signs. That's fine, that's cool.

So the ultimate test right here is going to be, yes, the human adversarial example. Let's check it out. So we put in a Jackson Pollock image and we'll see what the model says. Now we're actually forcing it to make predictions, right, so it can't escape; it will need to do something. Okay, I made another mistake: I would need to copy the image address right here, like this. That's what happens when you're not an idiot: you get the actual image. So what does the model think of our pretty image? Okay, can't even read that, so let's make this white. Bird, bird, bird. Okay, lots of birds in this image. Clearly, clearly lots of birds in this image.

Let's try another one. Let's go with this one. Yes, yes, absolutely love it. Okay, so we copy the image address and, boom. Wow, there are a lot of birds in these Pollock images, just so many birds. Okay, let's try one last one. How about this one? This one is a bit more human-friendly, right? Put it in here and... okay, we get some detections. There's a clock right here, there is a... what's that? Horses? Let's print the labels, just so we know what they are: cake, horse, car, horse, and clock. Okay, so I see the clock; this here is clearly a clock. Then this rectangle on the right side must be something. Let's put this to red as well. No, that's terrible. Ah, white, back to white. Okay: clock, we got horse right here and horse, probably, and the entire image is again a cake. Yes, okay.
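The forced-prediction trick described above (drop the trailing no-object column, renormalise with a softmax, take each query's best class, keep the k most confident queries) can be sketched in plain Python so the arithmetic is visible. This is a stand-in, not the code from the video, which works on tensors; in PyTorch the same steps would use slicing, `softmax`, `max`, and `topk`. The function names and the nested-list input format are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def forced_top_k(
    logits: Sequence[Sequence[float]], num_classes: int, k: int = 5
) -> List[Tuple[int, int, float]]:
    """Return (query_index, class_index, probability) for the k most confident
    queries, ignoring every column from num_classes onward (the no-object slot)."""
    scored = []
    for query_index, row in enumerate(logits):
        # Cut off the no-object column, then renormalise over the real classes.
        probs = softmax(row[:num_classes])
        class_index = max(range(num_classes), key=probs.__getitem__)
        scored.append((query_index, class_index, probs[class_index]))
    # Highest probability first; keep only the k best queries.
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]


# Three toy queries, two real classes plus a trailing no-object column.
toy_logits = [
    [0.0, 0.0, 5.0],
    [3.0, 0.0, 9.0],
    [0.0, 1.0, 9.0],
]
for query, cls, prob in forced_top_k(toy_logits, num_classes=2, k=2):
    print(query, cls, round(prob, 3))
```

The returned query indices are what you would use to pick the matching rows out of the boxes. As noted in the walkthrough, the resulting probability is not a great confidence measure, since the model's preferred answer for most queries was "nothing there".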
So as you can see, it is a pretty, pretty good system, but of course it is only these 90 classes. But for now it's pretty cool and it works pretty well, and just the easiness with which you can get this stuff... elephants in Kruger National Park... just the easiness is astonishing. You can just load it up, have a bit of a notebook, and with very few lines of code you can put something together that detects these bounding boxes. Lots of elephants, and remember, we only have the top five elephants right here. So what happens if we go for more? Where is our top k? So here we can maybe say the top 15 predictions. And as always, if we want the model to make its own decision, we can simply revert back and add back the no-class label.

All right, with that, I hope you liked this video. If you did, then maybe tell YouTube that you liked it and share it out, and I will share this notebook in the description for you to find and play around with. All right, thanks for watching. Bye bye.
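As a recap of the box-drawing part of the walkthrough, here is a minimal sketch of turning DETR's normalized (center x, center y, width, height) boxes into pixel corner coordinates and drawing them with Pillow. The helper names are hypothetical, the colors and line width mirror the choices tried in the video, and the default Pillow font is used, as in the walkthrough.

```python
from typing import Sequence, Tuple


def cxcywh_to_corners(
    box: Sequence[float], width: int, height: int
) -> Tuple[float, float, float, float]:
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * width
    y0 = (cy - h / 2) * height
    x1 = (cx + w / 2) * width
    y1 = (cy + h / 2) * height
    return x0, y0, x1, y1


def draw_detections(image, boxes, labels, color="red", line_width=5):
    """Draw boxes and labels on a copy of a PIL image and return the copy."""
    from PIL import ImageDraw  # imported lazily; only needed for drawing

    canvas = image.copy()  # leave the original image untouched
    draw = ImageDraw.Draw(canvas)
    for box, label in zip(boxes, labels):
        # canvas.size is (width, height)
        x0, y0, x1, y1 = cxcywh_to_corners(box, *canvas.size)
        draw.rectangle((x0, y0, x1, y1), outline=color, width=line_width)
        # Put the label at the box center, using Pillow's default font.
        draw.text(((x0 + x1) / 2, (y0 + y1) / 2), label, fill="white")
    return canvas


# A box centered in an 800 by 600 image, covering half of each dimension.
print(cxcywh_to_corners((0.5, 0.5, 0.5, 0.5), 800, 600))
```

If the boxes come from the model as a tensor on the GPU, they would first need to be moved to the CPU and converted to plain numbers (for example with `.cpu().tolist()`) before being passed in.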
Info
Channel: Yannic Kilcher
Views: 27,715
Rating: 4.9773159 out of 5
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, facebook, fair, fb, facebook ai, object detection, coco, bounding boxes, hungarian, matching, bipartite, cnn, transformer, attention, encoder, decoder, images, vision, pixels, segmentation, classes, stuff, things, attention mechanism, squared, unrolled, overlap, threshold, rcnn, code, pytorch, colab, notebook, ipython, python, torch, hub, torchvision, bounding box, image, computer vision
Id: LfUsGv-ESbc
Length: 33min 30sec (2010 seconds)
Published: Sat May 30 2020