DETR: End-to-End Object Detection with Transformers (Paper Explained)

Video Statistics and Information

Captions
Hi there. Today we're going to look at "End-to-End Object Detection with Transformers" by Nicolas Carion, Francisco Massa and others at Facebook AI Research. On a high level, this paper does object detection in images using first a CNN and then a transformer, and it trains this via a bipartite matching objective. That leaves you with an architecture that is super simple compared to previous architectures, which had all kinds of engineering hurdles, thresholds and hyperparameters. So I'm really excited for this one. As always, if you like content like this, consider leaving a like, a comment, or subscribing. Let's get into it.

Let's say you have a picture like this, and you're supposed to detect all the objects in it, and also where they are and what they are. This task is called object detection. A good classifier here would say: there's a bird right here, and this here is also a bird. These bounding boxes can be overlapping, and those are the only two objects. There are a number of very difficult things here. First of all, you need to detect the objects and you need to know how many there are; it's not always the same in each image. There can be multiple objects of the same class, there can be multiple objects of different classes, they can be anywhere, of any size, overlapping, in the background, small, or spanning the entire image, and they can partially include each other. So this is a very difficult problem, and previous work has done a lot of engineering on it, like building detectors where you essentially classify every single pixel, and then when you get two detections right next to each other for the same class, you have to decide that they are probably the same instance, so there's only one thing here and not two things, and so on. There used to be very complicated architectures that solve these problems, and this paper comes up with a super simple architecture. We'll go from the high level down to the implementation of each of the parts.

So what does this paper propose, and how do we solve a task like this? First of all, we take the image, without the labels of course, and put it through a convolutional neural network encoder. Since this is an image task, it's understandable that we do this, mostly because CNNs just work so well for images. This gives us a set of image features. I don't think the vector drawn here is really representative of what's happening, so let's take the picture and draw it at an angle. What the CNN does is scale the image down spatially while adding channels: the input has three channels, red, green and blue, and we scale it down but give it many more channels. It's still sort of an image; it still has the image form. So the CNN gives us a higher-level representation of the image with many more feature channels, but it still keeps the information of where in the image those features are, and that's going to be important in a second. Now this set of image features goes into a transformer encoder-decoder, and that is sort of the magic component here; we'll look at it in a second.
What comes out of it is a set of box predictions. Each of these boxes consists of a tuple, and the tuple is the class and the bounding box. An example could be: bird at x equals two, y equals five. Another example could be: there is nothing at x equals seven, y equals nine. So the "nothing" class is a valid class here, and that's also important. Safe to say, there is this set of box predictions, and that is basically your output. If you have those, you can draw the bounding boxes and assign the labels.

The question is how you train it. What you're given is a database of images, and these images, as you see on the right, already have the bounding boxes drawn in by human annotators, together with labels. So this one would be annotated with "bird" and this one would be annotated with "bird", but the annotations don't include the nothing class and so on. So the question is how you compare the two. Can you simply say: if the first prediction matches this bird and the second matches that bird, then it's good? Not quite, because the ordering shouldn't matter; you only care whether you have the correct bounding boxes, not whether you have put them in the correct order. And also, what if your classifier outputs the two boxes we see here, but additionally outputs one that is slightly off and also says "bird"? How do you deal with all of these cases?

The way this paper deals with all of these cases is with its bipartite matching loss. How does it work? Let's say we have an image, we put it through this entire pipeline, and we get a set of predictions: class and bounding box, class and bounding box, class and bounding box. The first thing you need to know is that there is always the same number of predictions; this size here is fixed, that's the large N. It acts as a maximum number of predictions, since any slot can always predict either a real class or the nothing class; in this example you could predict anywhere from zero to five objects in the scene. The second thing is that from your database you get an image with its bounding box annotations made by human labelers, say these two, also as class and bounding box pairs. But now we only have two instances, so we just pad with the nothing class: nothing, no bounding box; nothing, no bounding box. It doesn't really matter what the bounding box for the nothing class is. So your ground truth labels are also of size N, and you always compare N things on the left, your classifier output, with N things on the right.

Now, as we already said, the question is how you compare them. You can't simply compare them one by one, because the ordering should not be important. But you also don't want to encourage your classifier, when one bird is very prominent, to just keep saying there's a bird here, there's a bird here, there's a bird right there, simply because the signal for that bird is stronger, while basically ignoring the other bird.
What you want instead is to encourage your classifier so that, once it has detected an object, it doesn't detect the same object again in a slightly different place. The way you do this is with this bipartite matching loss. At the time when you compute the loss, you compute what's called a minimum matching. What you have to provide is a loss function: there's a loss function L, and L takes two of these things, one predicted thing of your model and one of the true underlying things, and L computes a number saying how well these two agree. You can say, for example, that if both of them are the nothing class there is no loss; if the two classes agree and the two bounding boxes agree, that's very good, so the loss is zero or close to it; if the bounding boxes agree but the classes don't, or the other way around, that's bad; and if everything disagrees, that's the worst. What you're basically asking is: if these two corresponded to each other, if the thing on the left were the prediction for the thing on the right (which we don't know; it could be that the thing on the right refers to the bird on the right and the thing on the left refers to the bird on the left, in which case it would be natural that the bounding boxes differ), what would the loss be, how well would they match?

Now you compute this bipartite matching. It is a minimum matching in this case: you want to find a one-to-one assignment of things on the left to things on the right, where everything on the left is assigned exactly one thing on the right, such that the total loss is minimized. So you're saying: I'm going to align the things on the left with the things on the right such that it's maximally favorable; I give the model the maximum benefit of the doubt, and in that best possible case, what's the loss? I hope this is somewhat clear. You're trying to find the assignment from the left to the right that is the best case for this output, where you say: okay, here you output a bird very close to this bird in the ground truth label, so I'm going to connect these two, because that gives the model the most benefit of the doubt. The loss you have at the end of that matching, counted only along these connections, is going to be your training loss.

This solves the problems we had before. It is not dependent on the order, because if you reorder the predictions, the minimum matching will simply swap along with them. And if you output the same bird multiple times, only one of those outputs is going to be assigned to that bird; the other ones cannot be assigned to it, are forced to be assigned to different ground truth slots, and will incur a loss. So you encourage your model to output diverse bounding boxes, different bounding boxes for different things.
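To make the matching step concrete, here is a minimal toy sketch in PyTorch and SciPy. The cost used here is a deliberately simplified stand-in for the loss L described above (a class-mismatch penalty plus an L1 box distance; the paper's actual cost also involves class probabilities and an IoU-based term), and all names and numbers are made up for illustration. SciPy's linear_sum_assignment solves exactly this kind of one-to-one assignment problem.

```python
import torch
from scipy.optimize import linear_sum_assignment

# Toy setup: N = 4 prediction slots, 2 real ground-truth objects padded to N with "no object".
N, NO_OBJECT = 4, -1
pred_classes = torch.tensor([3, NO_OBJECT, 3, NO_OBJECT])  # predicted class per slot (3 = "bird")
pred_boxes = torch.rand(N, 4)                              # predicted boxes, e.g. (cx, cy, w, h)
gt_classes = torch.tensor([3, 3, NO_OBJECT, NO_OBJECT])    # two real birds, padded with "no object"
gt_boxes = torch.cat([torch.rand(2, 4), torch.zeros(2, 4)])

# cost[i, j] = pairwise loss L(prediction i, ground-truth slot j),
# simplified to a class-mismatch penalty plus an L1 distance between the boxes.
cost = torch.zeros(N, N)
for i in range(N):
    for j in range(N):
        if gt_classes[j] == NO_OBJECT:
            cost[i, j] = 0.0                               # matching against padding costs nothing
        else:
            class_cost = 0.0 if pred_classes[i] == gt_classes[j] else 1.0
            box_cost = (pred_boxes[i] - gt_boxes[j]).abs().sum()
            cost[i, j] = class_cost + box_cost

# One-to-one assignment of predictions to ground-truth slots with minimal total cost.
row_ind, col_ind = linear_sum_assignment(cost.numpy())
matched_loss = cost[torch.as_tensor(row_ind), torch.as_tensor(col_ind)].sum()
print(list(zip(row_ind.tolist(), col_ind.tolist())), matched_loss.item())
```

Only the pairs selected by this assignment contribute to the training loss, which is exactly the "benefit of the doubt" idea described above.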
This is very clever, and there are efficient algorithms to compute these minimum matchings; they use the Hungarian algorithm, which gives you exactly such a matching. Again, this is possible because you have N things on each side, and N is in effect the maximum number of objects you can detect at once. If there are fewer, you simply pad. The model is then also encouraged to come up with the right number of no-class predictions, because if it outputs a prediction when it shouldn't (say it already predicts two things that get assigned to the two real objects, and then outputs one more thing with a class), it is going to be penalized, since that extra slot should have been the nothing class. So this relies on having N on both sides, but you can make N so large that it covers basically all cases; you can make N something like 50, so you can detect up to 50 things in a scene.

That's the algorithm at a high level. They do show their loss here: the loss is ultimately taken over this matching, the minimum bipartite assignment that minimizes the total loss over your prediction and label pairs. The loss you have to give the matching algorithm is this one; they go into how they construct it, but I don't think it's super important. The loss on the class labels is a cross-entropy loss, as in usual classification, and the loss that says whether two bounding boxes agree is a mixture of the L1 loss, which compares the box coordinates, and the IoU loss, which is not dependent on the scale of the bounding boxes; it computes what fraction of the two bounding boxes overlaps. In any case, the loss consists of saying how much the labels agree and how much the bounding boxes agree. And again, this is only possible because you first compute the matching; otherwise you would have no clue which predictions to compare to which ground truth boxes.

So let's look at the architecture in a bit more detail. As we said, you have what they call the backbone, which is a convolutional neural network, and to its output you add positional encodings. I already said you should look at these features as just a smaller feature version of the image; they still have some image nature. Then they are flattened before they are put into the transformer encoder, because the transformer is naturally a sequence processing unit: it takes in a sequence of vectors. Since an image is not a sequence, what you do with your image features, say C channels over some height and width, is unroll and flatten the spatial grid into one sequence. This dimension becomes height times width: you unroll across the height and width axes into a single axis, the channels stay, and you end up with a sequence of height-times-width many C-dimensional feature vectors that you then feed into your encoder. The encoder then transforms this sequence into an equally long sequence of features.
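As a small sketch of that flattening step (all shapes here are illustrative, and the positional encodings are omitted, although the real model adds them so that the spatial order isn't lost):

```python
import torch
from torch import nn

# Toy backbone output: batch of 1, C = 256 feature channels on a 25 x 34 grid.
features = torch.rand(1, 256, 25, 34)        # (B, C, H, W) coming from the CNN backbone
B, C, H, W = features.shape

# Unroll the spatial grid: every (row, col) location becomes one C-dimensional token.
sequence = features.flatten(2)               # (B, C, H*W)
sequence = sequence.permute(2, 0, 1)         # (H*W, B, C): the sequence the encoder consumes

# A standard transformer encoder maps this to an equally long sequence of features.
encoder_layer = nn.TransformerEncoderLayer(d_model=C, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(sequence)                   # (H*W, B, C): same length in and out
```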
The good thing about the transformer, and the reason you use a transformer at all, is that in such a sequence (I've done videos on transformers; watch the one on "Attention Is All You Need" if you want to understand this more fully) you have attention layers, so the model can attend from each position to each position in a one-shot manner. As it transforms this representation up through the transformer layers, at each step it can aggregate information from anywhere in the sequence to anywhere else, so it's very powerful whenever you have a sequence and you need global connections across it. This is very good for language processing: take a sentence like "The input images are batched together, applying ..." and so on, and then later the word "they" appears, and you need to know that "they" refers to the input images. That is very far away in the sentence, so you need a model that makes use of long-range dependencies. They make the case that in a task like this you also need long-range dependencies, because these bounding boxes, as you see right here, can be quite large, so this part of the image needs to communicate with that part, basically with anything anywhere in the bounding box, and these boxes can span much of the image. So the transformer architecture actually makes sense here. I want to get into why I think it makes even more sense for bounding box detection a bit later, but for now let's keep going through the architecture.

So we put this flattened sequence into the transformer encoder, and we get an equally sized, equally shaped sequence out of it. You see that this output then goes as a side input into the transformer decoder. The transformer encoder is technically just a bit more feature mapping; for the architecture you could imagine skipping it, but of course it's going to work better with the encoder. The transformer decoder now does something similar, but it has the encoder output as a side input. This is not like BERT; BERT is an encoder-only transformer, whereas this is much like the original "Attention Is All You Need" transformer, which has an encoder and then a decoder that receives the encoder output as a side input, basically as conditioning information.

What does the decoder do? Again, since it's a transformer, it takes a sequence and outputs a sequence. The sequence it takes in is what they call object queries, and this is also different from the "Attention Is All You Need" setup: they don't decode autoregressively, they do it in one shot. That means you start with a sequence of, here, four things (this is the big N), and you output a sequence of four things. It's important to see where these end up: the outputs go directly through a classifier that produces the class label and bounding box outputs. So each of these things, after transformation, ends up being one of those bounding box predictions, either defining an object or saying that there isn't an object there.
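Here is a rough sketch of that decoder stage in PyTorch: N query vectors go in, N outputs come out, and each output is mapped to class logits (including a "no object" class) and box coordinates. The module choices, sizes and names here are illustrative assumptions, not the official implementation.

```python
import torch
from torch import nn

d_model, num_queries, num_classes = 256, 100, 91   # illustrative sizes

# N object queries: one vector per slot in the fixed-size prediction set (learned in DETR).
object_queries = nn.Parameter(torch.rand(num_queries, d_model))

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# Prediction heads: class logits include one extra "no object" class.
class_head = nn.Linear(d_model, num_classes + 1)
bbox_head = nn.Linear(d_model, 4)

# memory = the encoder output over the flattened image features, shape (H*W, batch, d_model).
memory = torch.rand(850, 1, d_model)
tgt = object_queries.unsqueeze(1)                  # (num_queries, 1, d_model)
hs = decoder(tgt, memory)                          # (num_queries, 1, d_model): N in, N out
class_logits = class_head(hs)                      # (num_queries, 1, num_classes + 1)
boxes = bbox_head(hs).sigmoid()                    # (num_queries, 1, 4): normalized box coords
```

Every one of the num_queries output slots becomes one (class, box) prediction, which is exactly the fixed-size set that the matching loss above consumes.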
You see here: this bounding box refers to this bird, and this bounding box refers to this bird. So each of these outputs is going to be one bounding box. Now, for these object queries, the question of course is what you input there. You want to transform the image information that comes in from the left into the bounding boxes, so what do you feed the decoder? The answer is: at the start you just input N vectors, because that gives you N outputs, and you want N outputs because you want N of these bounding box classifications. If I input N things into a transformer, it's going to give me N things as output, and then at each step I can condition on the information that comes from the image and incorporate it. It's a very deep-learning way of thinking about it: you just need the information to be in there somewhere, and you need N things.

They go into more detail on this transformer architecture in a helpful fashion in the appendix, so let's go there quickly; I think this figure makes more sense. The image features come in here, and you see this is just a transformer encoder stack of multi-head self-attention and token-wise feed-forward layers. That information is then given as conditioning information over here. Into the decoder you input these object queries, which at the beginning are just N vectors. You're going to feature-encode them and then combine them with the image information. Ultimately, if you think of one of these things, it is a vector, and as that vector is transformed up the decoder layers, it has the opportunity to look at features that come from the encoder (the arrow here is drawn in the wrong direction). You have already taken the image and transformed it into a feature representation, so you have the image features right here, and as you transform this object query q, you have the opportunity to look at the image features through an attention mechanism over the image, and what you output at the end is a bounding box and a class label. It's really hard to explain in words; I would guess you need to understand what attention mechanisms are to follow this fully.

The crucial part, of course, is what you input at the beginning, and these object queries aren't actually random, as I said: they are learned. So you learn, independent of the input image, N different object queries, and these object queries are going to specialize. It's very interesting: it's like having N different people that can each ask the input image different questions. Their N is 100, but they show 20 of these learned object queries, with a visualization of all bounding box predictions across all images. It's sort of like having N different people at your disposal, and you train these N different people to ask different questions of the input image.
Say this person up here, irrespective of what the input image is, will always ask: hey, input image, what's on your bottom left? I'm really interested in what's on your bottom left; sometimes I'm a bit interested in what's in the middle, but mainly the bottom left. Whereas this person right here is more interested in what's in the center. The different colors here refer to different sizes of bounding boxes, so the person on the top left is mainly interested in small bounding boxes on the bottom left, and this person is mostly interested in large things in the center: give me large things that are in the center. And this person right here is really interested in stuff on the right side of the image. So, in order to get a diversity of bounding box predictions, you train N different people to ask different questions of the input image, and this asking of questions is exactly what an attention mechanism is.

Let's take this one person; I keep saying person, but these are vectors, these are the learned object queries. First it will simply ask: what's on the right side? The attention mechanism attends to that part of the image features, the query gets back some signal, and it transforms that together with its own signal. Then it can ask again: now that I know more (you see that this person is interested in multiple things, these things and those things), at first it will focus on these, but then it says, oh, now I know more, there is actually something on the right side. So in the higher layers it can go back and ask the image more questions by sending these Q vectors of the attention mechanism, and it will get back the V vectors from the image features that correspond to those queries. Up and up the layers, this person can ask more and more refined questions about whatever that particular person is interested in.

And since you have these different people asking different questions, you learn the people in such a way that, across the data set, together they cover every possible image pretty well. Again, what these people are interested in initially is not dependent on the picture; you learn this in a global manner. That is the best way I have of describing it: you learn N people, each one interested in different things, different classes and different regions of the image, and each of these people is going to output their best guess of what is where, based on what they're interested in. So one person might say: I'm the person that's interested in the left side of things, so I'm going to output that there is a bird right here. Now, since this is a transformer and everything can attend to everything, these people can also communicate with each other as they incorporate information from the image. In each layer they can do both: incorporate information from the image and communicate with each other, and in the next layer they can do it again, and again. Thereby they can say: well, you already took the left side, I will take the right side.
You already took the bird class, I will take the elephant class, and so on. So you see how the architecture of the transformer is also very conducive to this bounding box prediction, in that these different things can attend to each other and thereby communicate with each other. I hope that sort of makes sense.

Now, before we get into the experiments, I want to give a third reason why the transformer, especially the encoder, might make a giant amount of sense here. You unroll the image across height and width, and you have to imagine what the transformer does. As we said, it has this notion of attention where any point in the sequence can gather information from any other point in the sequence, and this, usually one of the downsides of transformers, is done via a quadratic attention mechanism. If I just take one feature channel, this axis is height times width of the image, the entire image unrolled into one vector, and here I unroll it again, height times width. The matrix I can build right here, which is called the attention matrix, tells me which parts of the sequence attend to which other parts. Say you have an image containing the digit three, and you really want to figure out whether it is a three: then the bow up here must communicate with the bow down here. They need to share information: oh, there's a bow here, there's a bow here, and there's a spiky thing here, so that must be a three. The top bow is near the beginning of the sequence. First of all, every position will attend to itself, so you get fairly high values along the diagonal, but this part at the beginning of the sequence also needs to attend to the end, so you put a high value there, and the other way around (it doesn't have to be symmetric, by the way). In any case, this is going to be a matrix of size height-times-width by height-times-width, because everything can attend to everything, and that's the attention mechanism.

Why do I think this is so good for bounding boxes? Because if you have such a matrix, every single point in it actually defines a bounding box: a point's position along this axis corresponds to one location in the image, and its position along the other axis corresponds to another location. In the attention matrix it simply means these two points communicate, but if you pick two pixels, you have actually defined a bounding box. The fact that this happens in the exact same matrices could mean that transformers over these unrolled height-times-width sequences are uniquely well suited to bounding box prediction tasks. I'm actually a bit astounded, because when I first read the title, this immediately popped into my mind: oh yes, of course. What I thought this was going to be is that you output an actual matrix like this and then simply classify each point in it.
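To make that observation concrete (this is the video's own speculation, not something the paper does), here is a tiny bit of index bookkeeping showing how one entry of the (H*W) x (H*W) attention matrix pairs two pixel locations, which is the same information as the two corners of an axis-aligned box:

```python
import torch

H, W = 25, 34
attn = torch.rand(H * W, H * W)    # one attention matrix over the unrolled image, (H*W) x (H*W)

i, j = 100, 790                    # two arbitrary positions in the unrolled sequence
y1, x1 = divmod(i, W)              # unrolled index -> (row, col) in the feature map
y2, x2 = divmod(j, W)

# The entry (i, j) relates these two pixel positions; the same pair of pixels also
# determines an axis-aligned bounding box with these corners.
box = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
strength = attn[i, j]              # how strongly position i attends to position j
```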
So you could classify whether, going in this direction, there is a bird, and for another pair of points you also classify whether in that direction there is a bird, and that naturally defines a bounding box; or you could take this matrix and just classify individual points in it as being bounding boxes, because each point already defines one. I just think these quadratic structures are uniquely suited to this. I mean, someone must have thought of this, and if not, it would be funny if this became the first paper ever to have to cite a YouTube channel. But again, transformers seem to be a good idea for these kinds of things.

So how do they do? They are on par with these other, much more complex architectures, the Faster R-CNN models, which involve a lot more engineering, and this simple model is on par with them. They do, however, train forever: I think they train for something like six days on eight GPUs, which is not that much if you compare it to language models trained on hundreds of TPUs, but still. I don't want to go into the numbers of the experiments, but what is really cool is that they can visualize this attention. You can see right here that if they look at a particular point in the image and visualize the attention, it will actually attend to the instance itself. Overlapping and partially occluded objects are usually the problem cases for these detection algorithms, but you can see here that the attention is on the part of the image that makes up the instance in the back, and the attention here stays on this instance and doesn't spill over into the others. That is one pretty impressive thing about these architectures.

The other thing they show is that it can generalize to many, many instances: it has never seen 24 giraffes in one image, but it can absolutely handle that, giraffe after giraffe after giraffe. And some of the coolest images, I find, are these here, again attention visualizations: you can see that even within the bounding box of the front elephant, the attention on the foot of the back elephant is assigned to the blue bounding box. The blue bounding box "person" is attending to that back foot, which means these models really do learn things like occlusion. I have a hard time describing it, but you can see it visually: it clearly learns that these are two instances occluding each other, and that one instance can appear within the bounding box of the other. The same goes for the zebras that partially occlude each other; the attention is assigned correctly, and even the back foot of this zebra is attributed to the right instance. All in all, that is pretty cool.

And they take it a step further and say: with this architecture, we can actually pretty easily do pixel-wise classification. This is the COCO stuff-and-things dataset; I don't remember which one is the stuff and which one is the things, but I think things are the objects and stuff is things like sky and mountains. It's a segmentation task where you have to label every single pixel. So what they do is simply run the image through their detector, detect the instances, and take the attention maps of those instances.
Then they scale these attention maps up; this part right here is just a CNN in reverse that upscales the image, since it had been scaled down, as we said. Then they simply classify each pixel: remember we had these different people that each cared about different things in the image, and each of these people will classify their respective pixels, the pixels they feel responsible for, and then you merge all of these people's predictions together into the final segmentation. Again, this gives pretty impressive results. I mean, this is fun; it looks like it really works. They do a quantitative analysis of course, but I'm just impressed by the examples right here.

All right, that was sort of it. I really enjoyed reading this paper; the simplicity is pretty cool. Not only do they have code in the paper to show how ridiculously easy it is to get this to run (this is all you need in PyTorch), they also publish their actual code, and as I understand it they have pretrained models too. They have this model zoo right here where they give you the pretrained models, so you can play with them, you can even load them from Torch Hub yourself, and you can train them yourself; they have a Colab, everything is there. All right, again, if you enjoyed this video, consider leaving a like and subscribing, and I'll see you next time. Bye bye.
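For reference, loading one of those pretrained models from Torch Hub looks roughly like this. The entry-point name and the post-processing follow the facebookresearch/detr repository as I understand it, but check the repo and its Colab for the exact, current interface; the image path and the confidence threshold are placeholders.

```python
import torch
from PIL import Image
import torchvision.transforms as T

# Load the pretrained ResNet-50 DETR from the facebookresearch/detr repo via Torch Hub.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

# Standard ImageNet-style preprocessing.
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = transform(Image.open('birds.jpg').convert('RGB')).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    out = model(img)

# out['pred_logits']: (1, 100, num_classes + 1) class scores incl. "no object";
# out['pred_boxes']: (1, 100, 4) normalized (cx, cy, w, h) boxes.
probs = out['pred_logits'].softmax(-1)[0, :, :-1]
keep = probs.max(-1).values > 0.9          # keep only confident, non-empty predictions
boxes = out['pred_boxes'][0, keep]
labels = probs[keep].argmax(-1)
```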
Info
Channel: Yannic Kilcher
Views: 82,438
Rating: 4.9732556 out of 5
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, facebook, fair, fb, facebook ai, object detection, coco, bounding boxes, hungarian, matching, bipartite, cnn, transformer, attention, encoder, decoder, images, vision, pixels, segmentation, classes, stuff, things, attention mechanism, squared, unrolled, overlap, threshold, rcnn
Id: T35ba_VXkMY
Length: 40min 56sec (2456 seconds)
Published: Thu May 28 2020