July 2020
tl;dr: Distance and velocity estimation from monocular video.
Overall impression
Achieves better performance and is more end-to-end than monocular_velocity. It uses optical flow and RoIAligned features to regress velocity and distance. Unlike monocular_velocity, it does not use an off-the-shelf depth estimator.
3D velocity estimation can be seen as the prediction of sparse scene flow. Compare this with the 2D offset prediction in CenterTrack, which can be seen as sparse optical flow. Scene flow = optical flow + depth.
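
A sketch of why scene flow = optical flow + depth, assuming a pinhole camera. The notation is mine, not the paper's: $K$ is the intrinsic matrix, $(u, v)$ a pixel in frame $t$, $(\Delta u, \Delta v)$ its optical flow to frame $t+1$, $Z_t$ and $Z_{t+1}$ the depths at the two frames, and $\Delta t$ the frame interval.

$$
\mathbf{P}_t = Z_t \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \qquad
\mathbf{P}_{t+1} = Z_{t+1} \, K^{-1} \begin{bmatrix} u + \Delta u \\ v + \Delta v \\ 1 \end{bmatrix}
$$

$$
\text{scene flow} = \mathbf{P}_{t+1} - \mathbf{P}_t, \qquad
\mathbf{v} \approx \frac{\mathbf{P}_{t+1} - \mathbf{P}_t}{\Delta t}
$$

Optical flow supplies the pixel correspondence and depth supplies the scale, so both are needed to recover the 3D displacement. The resulting velocity is relative to the camera.
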
SOTA velocity estimation error is about 0.48 m/s.
Key ideas
- Input: two stacked images.
- Main idea: if we know two corresponding points and their depths in two neighboring frames, then we can calculate the velocity of that point.
- Uses the PWCNet encoder as the feature extractor for feature F.
- distance: feature vector F from RoIAligned current frame + geometry vectors (intrinsics + bbox)
  - Vehicle centric or not, does not matter much
- velocity: feature vector F + optical flow vector M RoIAligned from two neighboring frames + geometry vectors (intrinsics + bbox)
  - Velocity estimation needs to be vehicle centric as optical flow works much better on image patches than on the whole image.
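
The main idea above (corresponding points plus depth give velocity) can be sketched geometrically as follows. This is not the paper's network, which regresses velocity from learned features. It is only a minimal illustration assuming a pinhole camera. The function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the frame interval `dt` are hypothetical names introduced here.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole model


def backproject(pixel: Pixel, depth: float, intrinsics: Intrinsics) -> Vec3:
    """Lift a pixel (u, v) with known depth Z (metres) into camera coordinates."""
    u, v = pixel
    fx, fy, cx, cy = intrinsics
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def velocity_from_correspondence(
    pixel_t0: Pixel,
    depth_t0: float,
    pixel_t1: Pixel,
    depth_t1: float,
    intrinsics: Intrinsics,
    dt: float,
) -> Vec3:
    """Finite-difference velocity (m/s, relative to the camera) of one tracked point.

    pixel_t1 is pixel_t0 displaced by the optical flow; the two depths
    provide the metric scale that optical flow alone cannot give.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    p0 = backproject(pixel_t0, depth_t0, intrinsics)
    p1 = backproject(pixel_t1, depth_t1, intrinsics)
    return tuple((b - a) / dt for a, b in zip(p0, p1))


if __name__ == "__main__":
    # Hypothetical numbers: a point 100 px right of the principal point,
    # approaching from 20.0 m to 19.5 m over 0.05 s with zero optical flow.
    K = (1000.0, 1000.0, 640.0, 360.0)
    print(velocity_from_correspondence((740.0, 360.0), 20.0, (740.0, 360.0), 19.5, K, 0.05))
```

With these made-up numbers the point closes in at roughly 10 m/s along the optical axis and drifts about 1 m/s laterally, even though the pixel does not move. This shows why optical flow alone is not enough and depth has to enter the estimate.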
 
Technical details
- It regresses the distance to the closest point of the vehicle, and uses the bbox center as a proxy. This could be problematic for side distance estimation.
- The supervised DORN performance is about the same as vehicle centric distance estimation. The self-supervised method is much worse. This is somewhat surprising.
Notes
- Questions and notes on how to improve/revise the current work