DETR - End-to-End Object Detection with Transformers (ECCV 2020)

Video Statistics and Information

Captions
Hi, my name is Nicolas Carion and I'm a PhD student at Facebook AI Research. Today I'm going to present our latest work on object detection using transformers.

The task of object detection consists in finding and localizing the objects visible in a given input image. A common way of characterizing these objects is through an unordered set of tight bounding boxes, each associated with a category label. For the past decades, the main approach to object detection was to reduce it to the well-studied problem of image classification by classifying bounding boxes. However, the set of all possible boxes is infinite, so this formulation requires selecting an appropriate subset of boxes to operate on; in a second step, the boxes are generally refined through regression. This formulation entails a cascade of technical problems that need to be addressed, and the final pipeline is generally not end-to-end differentiable.

By contrast, we set out to tackle the set prediction problem directly. Our new approach, called DETR, incorporates almost no geometric priors and is fully differentiable. We show that it matches the performance of heavily hand-tuned baselines, and that it can also be extended to additional tasks such as panoptic segmentation. Here is an overview of our method in comparison to the popular Faster R-CNN pipeline. DETR's architecture is built upon a transformer, a popular model from the NLP community; as such, it doesn't require any detection-specific components.

Now let's take a closer look at the inner workings of the model. We first feed the image to a CNN to get image features. Since the transformer is permutation-invariant, some extra care is required to retain the 2D structure of the image; to that effect, we add fixed 2D positional embeddings made of sinusoids at different frequencies. The resulting feature set is passed to a transformer encoder. For decoding, we pass a fixed set of learned embeddings, called object queries, through a transformer decoder. The feature vectors thus obtained are fed to shared fully connected layers that predict the class and bounding box for each query.

For training, we match the set of predicted objects to the ground-truth objects using the Hungarian algorithm. Finally, for each prediction, we set as target the ground-truth object it was assigned to.
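(A minimal sketch of this matching step, assuming per-query class logits and normalized boxes as inputs; the helper name, cost weights, and tensor formats are illustrative, and the full recipe also adds a generalized IoU term to the cost, omitted here.)

    # Sketch of DETR-style bipartite matching between predictions and ground truth.
    # pred_logits: [Q, C+1] (including a "no object" class), pred_boxes: [Q, 4],
    # tgt_labels: [N] long tensor, tgt_boxes: [N, 4], boxes in normalized (cx, cy, w, h).
    import torch
    from scipy.optimize import linear_sum_assignment

    def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                        class_weight=1.0, box_weight=5.0):
        probs = pred_logits.softmax(-1)                      # [Q, C+1]
        cost_class = -probs[:, tgt_labels]                   # [Q, N]: favor confident correct classes
        cost_box = torch.cdist(pred_boxes, tgt_boxes, p=1)   # [Q, N]: L1 distance between boxes
        cost = class_weight * cost_class + box_weight * cost_box
        # Hungarian algorithm: assigns each ground-truth object to exactly one query.
        q_idx, t_idx = linear_sum_assignment(cost.detach().cpu().numpy())
        return q_idx, t_idx  # prediction q_idx[i] is supervised by object t_idx[i]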
We compare DETR to popular one- and two-stage architectures, using the same backbone and a comparable training procedure. With the same number of parameters and inference time, but using half as much computation, DETR matches the performance of these well-established baselines.

In the DC5 model, we apply dilated convolutions in the last stage of the ResNet backbone to double the feature map resolution. This incurs a high computational cost, since the transformer's attention has quadratic complexity, but allows us to gain one point.

DETR has a different performance profile than the compared methods. In particular, it is much more effective on large objects; we hypothesize that this is due to the global reasoning capability of the attention mechanism. By contrast, it is less effective on small objects. We leave it as future work to address this issue.

Now we investigate the role of all the components of the model. The transformer encoder uses self-attention to globally reason about the image, and we can visualize the attention patterns to understand what is going on. For a given point in the image, highlighted here in red, we compute the attention scores with respect to all the other pixels in the image, averaged over the attention heads. We observe high attention scores for the pixels belonging to the same object. We can then repeat this experiment using other source points. Overall, this shows that the pixels comprising each object have high pairwise similarity; in other words, the encoder's role is to separate the object instances.

We now look into the decoder. We found it helpful to add supervision at each layer of the decoder; as a side effect, we can dynamically choose how many layers to use from a trained model at test time. As expected, the performance improves as we use more layers. In the first few layers, the model tends to output duplicate high-confidence boxes for the same object. We can quantify this phenomenon by applying an NMS (non-maximum suppression) step to the model's predictions: it drastically improves the performance after a few layers, but as we go deeper, the model is able to suppress the duplicates itself and the NMS doesn't bring any more improvement.

Here we visualize the cross-attention scores for each of the object embeddings in the last decoder layer, overlaying the attention map using the same color as the corresponding object. The attention focuses on the extremities of each object, to be able to accurately predict the bounding box. Notice how it carefully avoids attending to the wrong object, even if the objects are occluding each other.

At this point I want to draw your attention to the object queries. Object queries is the name we give to the inputs of the decoder layers: they are randomly initialized embeddings that are refined through the course of training, then fixed for evaluation. In total we use 100 object queries, and this defines the upper bound on the number of objects that the model can detect. Unlike anchors, we do not manually incorporate any geometric prior in the object queries; instead, the model learns it directly from data. Here we visualize the average center of the boxes predicted by each of the object queries on all the images of the COCO validation set. We notice that the learned queries tend to tile the space uniformly, covering all the possible object locations. We can also track which query is active as we slide an object over the image. Finally, we show that DETR generalizes to high counts of objects in the image, even on samples out of the training distribution.

We now briefly present the panoptic segmentation task. It is a fusion of instance segmentation, which aims at predicting a mask for each distinct instance of a foreground object, and semantic segmentation, which aims at predicting a class label for each pixel in the background. The resulting task requires that each pixel belong to exactly one segment. Through panoptic segmentation, we aim at understanding whether DETR's object embeddings can be used for other downstream tasks.

To approach this task, we first train DETR to predict boxes around both foreground and background objects in a uniform manner; by contrast, existing methods tend to treat both kinds of entities differently. Once the detection model is trained, we freeze the weights and train a mask head for 25 epochs.

Here is an overview of the panoptic architecture. We first feed the image to the CNN and set aside the activations from the intermediate layers. After the encoder, we also set aside the encoded version of the image, and then proceed to the decoder. We end up with one object embedding for the foreground cow, and one object embedding for each segment of the background, namely the sky, grass and trees. We then use a multi-head attention layer that returns the attention scores over the encoded image for each object embedding. We proceed to upsample and clean these masks using a convolutional network that uses the intermediate activations from the backbone. As a result, we get high-resolution maps where each pixel contains a binary logit of belonging to the mask. Finally, we merge the masks by assigning each pixel to the mask with the highest logit, using a simple pixel-wise argmax.
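(A minimal sketch of this pixel-wise argmax merging, assuming the per-segment mask logits have already been upsampled to a common resolution; the tensor layout is an assumption.)

    # Merge per-segment binary mask logits into one non-overlapping panoptic map.
    # mask_logits: [num_segments, H, W], one channel per predicted segment
    # (foreground instances and background "stuff" segments alike).
    import torch

    def merge_masks(mask_logits: torch.Tensor) -> torch.Tensor:
        # Each pixel is assigned to the segment whose mask gives it the
        # highest logit, so every pixel belongs to exactly one segment.
        return mask_logits.argmax(dim=0)  # [H, W] map of segment indices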
We show that this recipe for panoptic segmentation outperforms state-of-the-art models, especially when it comes to segmenting background objects. Again, we attribute this performance to the global reasoning capability enabled by the attention mechanisms. Here are some qualitative results for the panoptic segmentation model.

As a conclusion, we presented a streamlined detection pipeline composed of a vanilla ResNet backbone and a transformer. We showed competitive performance in both object detection and panoptic segmentation, although DETR's performance on small objects is lagging behind. Thanks to the simplicity of the approach, we expect innovations on backbones as well as transformers to directly apply to DETR. Finally, we provide an open-source PyTorch implementation of the model, as well as pre-trained models and various Google Colabs to delve deeper into the model. Thank you for your attention, and see you at the Q&A session.
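(To connect the walkthrough above to code, here is a heavily simplified sketch of a DETR-style pipeline in PyTorch. Everything here is illustrative: the class name, the learned 1D positional embeddings standing in for the fixed 2D sinusoidal ones described in the talk, and the hyperparameters are assumptions, and the official open-source implementation differs in many details.)

    # Minimal DETR-style model: ResNet backbone + transformer + shared heads.
    import torch
    from torch import nn
    from torchvision.models import resnet50

    class MiniDETR(nn.Module):
        def __init__(self, num_classes, num_queries=100, d_model=256):
            super().__init__()
            backbone = resnet50()
            # Drop the average pool and classifier to keep the conv feature map.
            self.backbone = nn.Sequential(*list(backbone.children())[:-2])
            self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
            self.transformer = nn.Transformer(d_model, nhead=8,
                                              num_encoder_layers=6,
                                              num_decoder_layers=6)
            self.query_embed = nn.Parameter(torch.rand(num_queries, d_model))
            # Crude learned positional embedding (the talk uses fixed 2D sinusoids).
            self.pos_embed = nn.Parameter(torch.rand(2500, d_model))
            self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
            self.box_head = nn.Linear(d_model, 4)

        def forward(self, images):                        # images: [B, 3, H, W]
            feats = self.proj(self.backbone(images))      # [B, d_model, h, w]
            B, d, h, w = feats.shape
            src = feats.flatten(2).permute(2, 0, 1)       # [h*w, B, d_model]
            src = src + self.pos_embed[: h * w].unsqueeze(1)
            queries = self.query_embed.unsqueeze(1).repeat(1, B, 1)  # [Q, B, d_model]
            hs = self.transformer(src, queries)           # one output per object query
            return self.class_head(hs), self.box_head(hs).sigmoid()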
Info
Channel: Nicolas Carion
Views: 7,825
Keywords: Deep learning, DETR, object detection, transformer, resnet, computer vision, eccv 2020
Id: utxbUlo9CyY
Length: 9min 35sec (575 seconds)
Published: Tue Aug 04 2020