
Deformable DETR: Deformable Transformers for End-to-End Object Detection

October 2020

tl;dr: Improved DETR that trains faster and performs better on small objects.

Overall impression

Issues with DETR: it needs many training epochs to converge and performs poorly on small objects. DETR uses low-resolution feature maps to save computation, which hurts small-object detection.

Deformable DETR first reduces computation by attending to only a small set of key sampling points around a reference point. It then uses a multi-scale deformable attention module to aggregate multi-scale features (without FPN), which helps small-object detection.
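As a sketch of what this means in formula form (notation follows the paper as best I can reconstruct it: $z_q$ is the query feature, $p_q$ its 2D reference point, $x$ the feature map, $M$ the number of attention heads, $K$ the number of sampled keys per head, with $K \ll HW$):

$$
\text{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \Big]
$$

Here the sampling offsets $\Delta p_{mqk}$ and the attention weights $A_{mqk}$ (normalized so that $\sum_k A_{mqk} = 1$) are both predicted from $z_q$ by linear projections, and $x(\cdot)$ is evaluated at fractional locations by bilinear interpolation. The multi-scale version adds a sum over feature levels $l$ and normalizes the weights over all $L \cdot K$ sampled points.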

Each object query is restricted to attend to a small set of key sampling points around the reference points instead of all points in the feature map.
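The sampling step can be sketched in plain Python for a single query, single head, and single feature level. This is only an illustration under simplifying assumptions, not the paper's implementation: the function names are made up for this note, the offsets and attention logits are passed in directly (in the paper they are predicted from the query by linear layers), and the value/output projections are omitted.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample feat (H x W x C nested lists) at continuous pixel coords (x, y).
    Neighbours that fall outside the map contribute zero (zero padding)."""
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * C
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < H and 0 <= xi < W:
                for c in range(C):
                    out[c] += wy * wx * feat[yi][xi][c]
    return out

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_attention(feat, ref_xy, offsets, logits):
    """Aggregate K sampled points around a reference point.
    ref_xy:  (x, y) reference point in pixel coordinates
    offsets: K pairs (dx, dy), assumed to be predicted from the query
    logits:  K unnormalised attention scores, also assumed predicted"""
    weights = softmax(logits)  # attention weights sum to 1 over the K points
    C = len(feat[0][0])
    out = [0.0] * C
    for (dx, dy), w in zip(offsets, weights):
        sample = bilinear_sample(feat, ref_xy[0] + dx, ref_xy[1] + dy)
        for c in range(C):
            out[c] += w * sample[c]
    return out
```

The cost per query depends on the number of sampled points K rather than on H x W, which is the point of the restriction described above.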

Deformable DETR is one of the highest-scored papers at ICLR 2021.

There are several papers on improving the training speed of DETR.

- Deformable DETR: sparse attention
- TSP: sparse attention
- Sparse RCNN: sparse proposal and iterative refinement

Key ideas

Efficient Attention
- Pre-defined sparse attention patterns.
- Learned, data-dependent sparse attention -> Deformable DETR belongs to this category
- Low rank property in self-attention
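Complexity of DETR ($H \times W$ feature map, $C$ channels, $N$ object queries)
- Encoder: self-attention is $O(H^2W^2C)$, quadratic in the feature map size.
- Decoder: cross-attention is $O(HWC^2 + NHWC)$, linear in the feature map size. Self-attention is $O(2NC^2+N^2C)$, independent of the feature map size.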
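To make the scaling concrete, the dominant terms above can be plugged into a few lines of Python. The default values `C=256` and `N=100` and the feature map sizes are illustrative assumptions, and the function name is made up for this note; only the three formulas come from the bullets above.

```python
def detr_attention_costs(H, W, C=256, N=100):
    """Dominant-term operation counts for DETR attention (constants dropped)."""
    return {
        "encoder_self": H**2 * W**2 * C,                 # quadratic in H*W
        "decoder_cross": H * W * C**2 + N * H * W * C,   # linear in H*W
        "decoder_self": 2 * N * C**2 + N**2 * C,         # no H*W dependence
    }

base = detr_attention_costs(H=25, W=34)
doubled = detr_attention_costs(H=50, W=68)  # 2x resolution along each axis

for name in base:
    print(f"{name}: x{doubled[name] / base[name]:.0f} when H and W are doubled")
# encoder_self: x16 when H and W are doubled
# decoder_cross: x4 when H and W are doubled
# decoder_self: x1 when H and W are doubled
```

Doubling the resolution makes encoder self-attention 16 times more expensive, which is why DETR stays on small feature maps and why sparse sampling matters for higher-resolution, multi-scale inputs.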

Technical details

Summary of technical details

Notes

Questions and notes on how to improve/revise the current work