January 2021
tl;dr: Single-stage, anchor-free point-based lidar object detector.
Overall impression
This study brings point-based lidar object detectors (e.g., PointRCNN) into the real-time realm. Almost all previous real-time lidar object detectors are voxel-based, such as PIXOR, PointPillars and VoxelNet.
Existing point-based lidar object detectors largely have four stages:
- Set abstraction (SA): downsampling to get representative points
- Feature propagation (FP): upsampling
- RPN on each point
- Refinement stage
The feature propagation and refinement modules take more than half of the runtime and thus need to be eliminated. The largest contribution of 3DSSD is the use of feature-based furthest point sampling to eliminate the need for feature propagation. --> This reminds me of feature-metric loss (ECCV 2020) in monocular depth estimation, and BEV feat stitching.
The paper introduces quite a few acronyms, which makes it a bit hard to read.
Key ideas
- Set abstraction (SA) sampling: it takes the form of furthest point sampling (FPS) based on Euclidean distance, or D-FPS.
  - Downsampling with D-FPS discards too many points on the foreground (FG) objects. Thus an FP layer has to be used to recover the discarded points.
  - Feature-FPS (F-FPS) takes in the semantic feature distance to solve this issue.
- Fusion sampling (FS): combining F-FPS and D-FPS yields the best performance.
- Candidate generation (CG): shift the points selected by F-FPS toward the center of their corresponding instance, supervised by a dedicated shift loss. The shifted representative points are called candidate points.
- Anchor-free head: for each candidate point, predict the distance offset to the center of the corresponding instance. This is less cumbersome than anchor-based methods.
- Losses:
  - classification loss: cls conf and centerness score with CE loss
  - regression loss: distance, orientation, size, corner loss
  - shift loss: shifts the representative points towards the center
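
To make the sampling variants above concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the `lam` weight, the choice to sum spatial and feature distances in `f_fps`, and the even split in `fusion_sampling` are assumptions made for illustration. The dense pairwise distance matrix is only practical for small point counts.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a (N, C) and every row of b (M, C)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def furthest_point_sampling(dist, num_samples, start=0):
    """Greedy FPS over a precomputed (N, N) distance matrix; returns point indices."""
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = dist[start].copy()
    for _ in range(num_samples - 1):
        nxt = int(np.argmax(min_dist))  # point furthest from the selected set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return np.array(selected)

def d_fps(xyz, m):
    """D-FPS: FPS using only Euclidean distance between coordinates."""
    return furthest_point_sampling(pairwise_dist(xyz, xyz), m)

def f_fps(xyz, feats, m, lam=1.0):
    """F-FPS (assumed form): FPS on lam * spatial distance + feature distance."""
    dist = lam * pairwise_dist(xyz, xyz) + pairwise_dist(feats, feats)
    return furthest_point_sampling(dist, m)

def fusion_sampling(xyz, feats, m):
    """Fusion sampling (assumed even split): part from D-FPS, the rest from F-FPS."""
    half = m // 2
    return np.concatenate([d_fps(xyz, half), f_fps(xyz, feats, m - half)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xyz = rng.uniform(-10, 10, size=(256, 3))   # toy point coordinates
    feats = rng.normal(size=(256, 16))          # toy per-point semantic features
    print(fusion_sampling(xyz, feats, 64).shape)
```

In this sketch the two samplers run independently, so the index sets returned by `fusion_sampling` may overlap.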
 
Technical details
- Centerness assignment: inspired by FCOS, every candidate point also regresses a centerness score that describes how far it is from the center of the predicted object.
  - Although not explicitly described, this centerness score is most likely used together with cls conf in postprocessing to filter out false positives.
  - The original points before shifting are not suitable for regressing this centerness score, as almost all of them lie on the surface of the object and are thus quite far from the center. Without shifting, predicting centerness actually hurts performance.
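
The effect of shifting on centerness can be illustrated with a small sketch. It assumes an FCOS-style centerness extended to 3D: the cube root of the product of min/max ratios of a point's distances to the three pairs of opposite box faces. The function name, the box parameterization (center, `(l, w, h)` size, yaw around the z-axis) and the zero value for points outside the box are assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

def centerness(point, center, size, yaw):
    """Assumed 3D centerness in [0, 1] of a point w.r.t. an oriented box.

    point, center: (x, y, z); size: (l, w, h); yaw: rotation around the z-axis.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    d = np.asarray(point, dtype=float) - np.asarray(center, dtype=float)
    # Rotate the offset into the box's local frame.
    local = np.array([c * d[0] + s * d[1], -s * d[0] + c * d[1], d[2]])
    half = np.asarray(size, dtype=float) / 2.0
    pos = half - local  # distances to the front / left / top faces
    neg = half + local  # distances to the back / right / bottom faces
    if (pos < 0).any() or (neg < 0).any():
        return 0.0      # point lies outside the box (assumed convention)
    ratios = np.minimum(pos, neg) / np.maximum(pos, neg)
    return float(np.cbrt(ratios.prod()))

if __name__ == "__main__":
    box = dict(center=[0, 0, 0], size=(4, 2, 2), yaw=0.0)
    print(centerness([2, 0, 0], **box))  # raw lidar point on the box surface
    print(centerness([0, 0, 0], **box))  # candidate point shifted to the center
```

A point on the box surface scores 0 and a point at the box center scores 1, so unshifted surface points all receive near-zero targets, while shifted candidate points spread over a more informative range.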
 
Notes
  - Questions and notes on how to improve/revise the current work