PaperReading

Review of Monocular 3D Object Detection

October 2019

| name | time (YYMM) | venue | title | tl;dr | predecessor | backbone | 3D size | 3D shape | keypoint | 3D orientation | distance | 2D/3D tight-fit optimization | required input | drawbacks | tricks and contributions | insights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mono3D | 1512 | CVPR 2016 | Mono3D: Monocular 3D Object Detection for Autonomous Driving | The pioneering paper on monocular 3DOD, with tons of hand-crafted features | Mono3D | Faster RCNN | from 3 templates per class | None | None | scoring of dense proposals | scoring of dense proposals | None | 2D bbox, 2D seg mask, 3D bbox | | shared feature maps (mono3D) | |
| Deep3DBox | 1612 | CVPR 2017 | Deep3DBox: 3D Bounding Box Estimation Using Deep Learning and Geometry | Monocular 3D object detection (3DOD) using the 2D bbox and geometric constraints | Deep3DBox | MS-CNN | L2 loss for offset from subtype average | None | None | multi-bin for yaw (sketch below the table) | 2D/3D optimization | the original Deep3DBox optimization (sketch below the table) | 2D bbox, 3D bbox, intrinsics | locks in the error of 2D object detection | | |
| Deep MANTA | 1703 | CVPR 2017 | Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image | Predict keypoints and use 3D-to-2D projection (EPnP) to get the position and orientation of the 3D bbox | None | cascaded Faster RCNN | template classification scaled by a scaling factor | template classification scaled by a scaling factor | 36 keypoints | 6DoF pose by 2D/3D matching (EPnP) | 6DoF pose by 2D/3D matching (EPnP) | None | 2D bbox, 3D bbox, 103 3D CAD models with 36 keypoint annotations | | semi-auto labeling by fitting templates into the 3D bbox | |
| 3D-RCNN | 1712 | CVPR 2018 | 3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare | Inverse graphics: predict shape and pose, then render and compare | Deep3DBox | Faster RCNN | subtype average | TSDF encoding, PCA, 10-dim space | 2D projection of 3D center | viewpoint (azimuth, elevation, tilt) with improved weighted-average multi-bin | find distance by moving along the viewing ray until the 3D box tightly fits the 2D bbox | yes, move the 3D box along the ray until it fits tightly into the 2D bbox | 2D bbox, 3D bbox, 3D CAD | | | |
| MLF | 1712 | CVPR 2018 | MLF: Multi-Level Fusion based 3D Object Detection from Monocular Images | Estimate a depth map from monocular RGB and concatenate it to form RGBD for mono 3DOD | Deep3DBox | Faster RCNN | offset from whole-dataset average | None | None | multi-bin, and SL1 for cos and sin | MonoDepth, SL1 for depth regression | None | 2D bbox, 3D bbox, pretrained depth model | pretrained depth model | point cloud as 3-channel xyz map | |
| MonoGRNet | 1811 | AAAI 2019 | MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization | Use the same network to estimate instance depth, 2D bbox and 3D bbox | MonoGRNet | MultiNet (YOLO + RoIAlign) | regress 8 corners in allocentric coordinate system | None | 2D projection of 3D center | regress 8 corners in allocentric coordinate system | instance depth estimation (IDE) according to a grid | | 2D bbox, 3D bbox, intrinsics, depth map | requires depth map for training | 2D/3D center loss, local/global corner loss; stagewise training to start 3D after 2D | instance depth estimation: pixel-level depth estimation is not designed for object localization; IDE regresses the depth of the nearest object instance |
| OFT | 1811 | BMVC 2019 | OFT: Orthographic Feature Transform for Monocular 3D Object Detection | Learn a projection of the camera image to BEV for 3D object detection | OFT | ResNet18 + ResNet16 top-down network | L1 loss for offset from subtype average in log space | None | None | L1 on cos and sin | positional offset in BEV space from local peaks | None | 2D bbox, 3D bbox (intrinsics learned) | | top-down network to reason in BEV | |
| ROI-10D | 1812 | CVPR 2019 | ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape | Concatenate depth map and coordinate map to RGB features; 2DOD plus car shape reconstruction (6-dim latent space) for mono 3DOD | | Faster RCNN with FPN | offset from whole-dataset average | TSDF encoding, 3D autoencoder, 6-dim space | None | 4-d quaternion | regress depth z | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | | 8-corner loss; stagewise training to start 3D after 2D | |
| Pseudo-Lidar | 1812 | CVPR 2019 | Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving | Estimate a depth map from the RGB image (mono/stereo) and use it to lift RGB to a point cloud (lifting sketch below the table) | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | DORN depth estimation | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | pretrained depth model | | data representation matters |
| Mono3D++ | 1901 | AAAI 2019 | Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors | Mono 3DOD based on 3D/2D consistency, in particular landmarks and shape reconstruction | DeepMANTA | SSD for 2D bbox, stacked hourglass for keypoints, MonoDepth for depth | | N basis shapes (N=?) | 14 landmarks | CE classification over 360 bins | MonoDepth L1 loss | | 2D bbox, 3D bbox, pretrained depth model, 3D CAD model with keypoints | | | cars should stay on the ground, look like cars, and be at a reasonable distance; enforce 2D/3D consistency of the generated 3D vehicle hypotheses |
| GS3D | 1903 | CVPR 2019 | GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving | Get a 3D bbox proposal (guidance) from the 2D bbox plus prior knowledge, then refine the 3D bbox using surface features | | Faster RCNN with VGG16 (2D+O) | subtype average | None | None | from RoIAligned features (possibly multi-bin) | approximated with bbox height × 0.93 (similar-triangles sketch below the table) | None | 2D bbox, 3D bbox, intrinsics | | quality-aware loss, surface feature extraction | |
| Pseudo-Lidar Color | 1903 | ICCV 2019 | Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving | Concurrent work with Pseudo-lidar, but with color embedding | Pseudo-lidar | Frustum-PointNet | 3DOD on point cloud | None | None | 3DOD on point cloud | various pretrained depth weights | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | | | |
| BirdGAN | 1904 | IROS 2019 | BirdGAN: Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles | Learn to map the 2D perspective image to BEV with a GAN | BirdGAN | DCGAN | oriented 2DOD on BEV point cloud | None | None | oriented 2DOD on BEV point cloud | oriented 2DOD on BEV point cloud | None | 2D bbox, 3D bbox (intrinsics learned) | in the clipping case, the frontal detectable depth is only about 10 to 15 meters | | |
| FQNet | 1904 | CVPR 2019 | FQNet: Deep Fitting Degree Scoring Network for Monocular 3D Object Detection | Train a network to score the 3D IoU of a projected 3D wireframe against the GT | Deep3DBox | MS-CNN | k-means clustering and multi-bin | None | None | k-means clustering and multi-bin | approximated via optimization | similar to Deep3DBox (details in appendix) | 2D bbox, 3D bbox, intrinsics | | | |
| MonoPSR | 1904 | CVPR 2019 | MonoPSR: Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction | 3DOD by first generating 3D proposals and then reconstructing the local point cloud of dynamic objects | Deep3DBox, Pseudo-lidar | MS-CNN | L2 loss for offset from subtype average | None | None | multi-bin for yaw | approximated with bbox height, then regress the residual from RoIAligned features | None | 2D bbox, 3D bbox, intrinsics | | shared feature maps (mono3D) | |
| MonoDIS | 1905 | ICCV 2019 | MonoDIS: Disentangling Monocular 3D Object Detection | End-to-end training of 2D and 3D heads on top of RetinaNet for monocular 3D object detection | MonoGRNet | RetinaNet + 2D/3D heads | offset from whole-dataset average, learned via 3D corner loss | None | 2D projection of 3D center | learned via 3D corner loss | regressed from dataset average, learned via 3D corner loss | None | 2D bbox, 3D bbox, intrinsics | | signed IoU loss (pulls boxes together even before they intersect), disentangled learning | a disentangling transformation splits the original combined loss (e.g., over bbox size and location at the same time) into groups; each group takes the loss w.r.t. one parameter group while the remaining parameters are set to GT (sketch below the table) |
| monogrnet_russian | 1905 | | MonoGRNet 2: Monocular 3D Object Detection via Geometric Reasoning on Keypoints | Regress keypoints in the 2D image and use 3D CAD models to infer depth | DeepMANTA | Mask RCNN with FPN | SL1 loss for offset from subtype average in log space | 5 CAD models | 14 landmarks | multi-bin for yaw with 72 non-overlapping bins | approximated with windshield height | None | 2D bbox, 3D bbox, intrinsics | | semi-auto labeling by fitting templates into the 3D bbox | |
| Pseudo-Lidar end2end | 1905 | ICCV 2019 | Pseudo lidar-e2e: Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud | End-to-end pseudo-lidar training with a 2D/3D bbox consistency loss | Pseudo-Lidar | Frustum-PointNet | 3DOD on point cloud | None | None | 3DOD on point cloud | DORN depth estimation | bbox consistency loss | 2D bbox, 2D seg mask, 3D bbox, intrinsics | pretrained depth model | 2D/3D bbox consistency | |
| Shift RCNN | 1905 | IEEE ICIP 2019 | Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints | Extend Deep3DBox by regressing residual center positions | Deep3DBox | Faster RCNN | L2 loss for offset from subtype average | None | None | cos and sin, with unity constraint | approximated via optimization | slightly different from Deep3DBox | 2D bbox, 3D bbox, intrinsics | | | |
| BEV IPM OD | 1906 | IV 2019 | BEV-IPM: Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping | IPM of the pitch/roll-corrected camera image, then perform 2DOD on the IPM image (IPM sketch below the table) | | YOLOv3 | oriented 2DOD on BEV image | None | None | oriented 2DOD on BEV image | oriented 2DOD on BEV image | None | 2D bbox, BEV oriented bbox, IMU correction | up to 40 meters | motion cancellation using IMU | IPM assumptions: 1) the road is flat; 2) the mounting position of the camera is stationary (motion cancellation helps with this); 3) the vehicle to be detected is on the ground |
| Pseudo-Lidar++ | 1906 | | Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving | Improve the depth estimation of pseudo-lidar with a stereo depth network (SDN) and sparse depth measurements on landmark pixels from few-line lidars | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | PSMNet finetuned stereo depth | None | 2D bbox, 3D bbox, pretrained depth model, sparse lidar data | | use sparse lidar to correct depth, stereo depth loss | |
| SS3D | 1906 | | SS3D: Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss | CenterNet-like structure that directly regresses 26 attributes per object to fit a 3D bbox | | U-Net-like architecture | log size | None | 8 3D corners projected to 2D | cos and sin (multi-bin not suitable) | directly regressed | None | 2D bbox, 3D bbox, intrinsics | | models uncertainty, directly regresses 26 numbers, 20 fps inference | |
| TLNet | 1906 | CVPR 2019 | TLNet: Triangulation Learning Network: from Monocular to Stereo 3D Object Detection | Place 3D anchors inside the frustum subtended by the 2D object detection as the mono baseline | | Faster RCNN with two refinement stages | refined from dataset average | None | None | refined from 0° and 90° anchors | refined from 3D anchors | None | 2D bbox, 3D bbox, intrinsics | | stereo coherence score and channel reweighting | |
| M3D-RPN | 1907 | ICCV 2019 | M3D-RPN: Monocular 3D Region Proposal Network for Object Detection | Regress 2D and 3D bbox parameters simultaneously by precomputing 3D mean statistics for each 2D anchor | | Faster RCNN | log size times 3D anchor size | None | None | smooth L1 directly on angle, postprocessing to refine | | None | 2D bbox, 3D bbox, intrinsics | angle postprocessing | 2D anchors with 2D/3D properties, depth-aware conv, negative-log-IoU loss for 2D detection, directly regress 12 numbers | reliance on additional sub-networks introduces persistent noise |
| ForeSeE | 1909 | | ForeSeE: Task-Aware Monocular Depth Estimation for 3D Object Detection | Train a depth estimator focused on foreground moving objects and improve pseudo-lidar-based 3DOD | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | learn foreground/background depth separately | | 2D bbox, 3D bbox, depth map | | depth combination: take the element-wise maximum of the confidence vectors over the C depth bins, then pass through a softmax | not all pixels are equal: an estimation error on a car is very different from the same error on a building |
| CenterNet | 1904 | | Objects as Points | Object detection as detection of the object's center point plus regression of its associated properties | CenterNet | DLA (U-Net style) | L1 loss over absolute dimensions | None | None | multi-bin for global yaw in two overlapping bins | L1 loss on 1 over regressed disparity (decoding sketch below the table) | None | 2D bbox, 3D bbox, intrinsics | | | highly flexible network |
| Mono3D Track | 1811 | ICCV 2019 | Joint Monocular 3D Vehicle Detection and Tracking | Add 3D tracking with an LSTM on top of monocular 3D object detection | Deep3DBox | Faster RCNN | L1 loss for offset from subtype average | None | 2D projection of 3D center | multi-bin for local yaw in two bins | L1 loss on 1 over regressed disparity | None | 2D bbox, 3D bbox, intrinsics | | | regressing the 2D projection of the 3D center helps recover the amodal 3D bbox |
| CasGeo | 1909 | | 3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results | Extends Deep3DBox by regressing the 3D bbox center on the bottom edge and a viewpoint classification | Deep3DBox | MS-CNN | refined from subtype average | None | 2D projection of bottom surface center | multi-bin for yaw, viewpoint estimation | approximated via optimization (Gauss-Newton) | similar to Deep3DBox | 2D bbox, 3D bbox, intrinsics | | | regress the projected 3D height to help with the initial guess of distance |
| GPP | 1811 | ArXiv | GPP: Ground Plane Polling for 6DoF Pose Estimation of Objects on the Road | Regress tirelines and height, and project onto the best ground plane near the car | GPP | RetinaNet + 2D/3D heads | refined from subtype average | None | 2D projection of tirelines (observer-facing vertices) | coarse (8-way) viewpoint classification | IPM based on the best-fitting ground plane | None | 2D bbox, 3D bbox, intrinsics, fitted road planes | need to collect and fit road data | able to predict local road pose | NA |
| MVRA | 1910 | ICCV 2019 | MVRA: Multi-View Reprojection Architecture for Orientation Estimation | Build the 2D/3D constraint optimization into the neural network and use an iterative method to refine truncated (cropped) cases | Deep3DBox | Faster RCNN | refined from subtype average | None | None | multi-bin for yaw, viewpoint estimation, iterative trial-and-error for truncated objects | approximated via optimization | similar to Deep3DBox (details in appendix) | 2D bbox, 3D bbox, intrinsics | | predicts truncated bboxes better | NA |
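
Several rows above (Deep3DBox, MonoPSR, CasGeo, MVRA, CenterNet) rely on multi-bin orientation estimation: classify the yaw into one of a few angular bins and regress a (sin, cos) residual within the winning bin. Below is a minimal decoding sketch, assuming a head that outputs per-bin confidences plus per-bin (sin, cos) residuals; bin overlap and the original papers' confidence/residual losses are omitted.

```python
import numpy as np

def decode_multibin_yaw(bin_conf, bin_residual, num_bins=2):
    """bin_conf: (num_bins,) scores; bin_residual: (num_bins, 2) = (sin, cos) per bin."""
    bin_width = 2 * np.pi / num_bins
    # Bin centers spread uniformly over [-pi, pi); overlap between bins is ignored here.
    bin_centers = -np.pi + bin_width * (np.arange(num_bins) + 0.5)
    best = int(np.argmax(bin_conf))
    sin_r, cos_r = bin_residual[best]
    residual = np.arctan2(sin_r, cos_r)      # residual angle within the chosen bin
    yaw = bin_centers[best] + residual
    return (yaw + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)

# Example: two bins, the second bin wins with a small positive residual.
conf = np.array([0.2, 0.8])
resid = np.array([[0.0, 1.0], [np.sin(0.1), np.cos(0.1)]])
print(decode_multibin_yaw(conf, resid))   # ~ pi/2 + 0.1
```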
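The "original Deep3DBox optimization" referenced in the table recovers the object translation by requiring the projected 3D box to fit tightly inside the 2D box. The following is a simplified sketch, not the paper's exact procedure: it assumes known intrinsics, yaw, and dimensions, enumerates which 3D corner touches each 2D box side, and solves a small least-squares problem per hypothesis (the paper additionally prunes configurations using the observation angle).

```python
import itertools
import numpy as np

def solve_translation(K, yaw, dims, box2d):
    """Least-squares translation so the projected 3D box fits the 2D box tightly.

    K: 3x3 intrinsics, yaw: rotation around the camera Y axis, dims: (h, w, l) in meters,
    box2d: (xmin, ymin, xmax, ymax) in pixels. Returns the best 3D center (x, y, z).
    """
    h, w, l = dims
    # 8 corners of the 3D box in object coordinates, centered at the origin.
    xs = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * l / 2
    ys = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * h / 2
    zs = np.array([ 1, -1,  1, -1,  1, -1,  1, -1]) * w / 2
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    corners = np.stack([xs, ys, zs], axis=1) @ R.T          # (8, 3), rotated, untranslated
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    xmin, ymin, xmax, ymax = box2d
    sides = [(0, fx, cx, xmin), (0, fx, cx, xmax),          # u = xmin, u = xmax
             (1, fy, cy, ymin), (1, fy, cy, ymax)]          # v = ymin, v = ymax

    best_T, best_err = None, np.inf
    # Enumerate which corner touches each of the four 2D box sides (8^4 hypotheses).
    for assignment in itertools.product(range(8), repeat=4):
        A, b = [], []
        for corner_idx, (axis, f, c, side) in zip(assignment, sides):
            X = corners[corner_idx]
            row = [0.0, 0.0, c - side]
            row[axis] = f
            A.append(row)
            # From f*(X[axis]+T[axis]) + c*(X[2]+T[2]) = side*(X[2]+T[2]).
            b.append((side - c) * X[2] - f * X[axis])
        A, b = np.array(A), np.array(b)
        T, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
        err = np.sum((A @ T - b) ** 2)
        if T[2] > 0 and err < best_err:                      # keep solutions in front of the camera
            best_T, best_err = T, err
    return best_T
```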
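GS3D, MonoPSR, and MonoGRNet 2 all bootstrap distance from a similar-triangles argument: a pinhole camera maps an object of physical height H at depth z to roughly f·H/z pixels. A tiny sketch follows; the 0.93 factor GS3D uses to relate the projected 3D height to the 2D box height is taken from the table and passed in as a constant.

```python
def depth_from_box_height(fy, h3d_m, h2d_px, height_ratio=1.0):
    """Similar triangles: z ~ fy * H / h. GS3D approximates the projected 3D height
    as 0.93 * the 2D bbox height, i.e. height_ratio=0.93 in this sketch."""
    return fy * h3d_m / (height_ratio * h2d_px)

# A 1.5 m tall car spanning 54 px with fy = 720 px sits roughly 20 m away.
print(depth_from_box_height(720.0, 1.5, 54.0))          # 20.0
print(depth_from_box_height(720.0, 1.5, 54.0, 0.93))    # ~21.5 (GS3D-style correction)
```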
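The pseudo-lidar family (Pseudo-Lidar, its color-embedded and end-to-end variants, Pseudo-Lidar++, ForeSeE) shares the same lifting step: back-project a predicted depth map into a 3D point cloud using the camera intrinsics and hand it to an existing lidar-based detector. A minimal numpy sketch of that step:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud in the
    camera frame using pinhole intrinsics K. Downstream 3D detectors such as
    Frustum-PointNet or AVOD then consume the points directly."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example: a 2x2 depth map at 10 m with a made-up intrinsic matrix.
K = np.array([[700.0, 0.0, 0.5], [0.0, 700.0, 0.5], [0.0, 0.0, 1.0]])
pts = depth_to_pseudo_lidar(np.full((2, 2), 10.0), K)
print(pts.shape)  # (4, 3)
```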
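BEV-IPM's inverse perspective mapping rests on the assumptions listed in its insights cell: a flat road and a fixed, pitch/roll-corrected camera. Under those assumptions each pixel ray can be intersected with the ground plane. The sketch below assumes a camera frame with y pointing down and uses a hypothetical camera height of 1.5 m in the example.

```python
import numpy as np

def ipm_pixel_to_ground(u, v, K, cam_height):
    """Map an image pixel to the flat ground plane (y_cam = cam_height, y pointing down),
    assuming the image has already been pitch/roll-corrected as in BEV-IPM.
    Returns (X, Z) in meters in the camera frame, or None for pixels at or above the horizon."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    if ray[1] <= 1e-6:          # ray never reaches the ground in front of the camera
        return None
    s = cam_height / ray[1]
    point = s * ray
    return point[0], point[2]   # lateral X, forward Z

# Example: principal point (960, 540), f = 1000 px, camera 1.5 m above the ground.
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
print(ipm_pixel_to_ground(960.0, 640.0, K, 1.5))  # (0.0, 15.0): a ground point 15 m ahead
```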
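MonoDIS's disentangling transformation, summarized in the insights column, can be illustrated with a corner loss: for each parameter group (center, dimensions, yaw), the loss is computed with the prediction for that group and ground truth for everything else, so each group receives an isolated gradient signal. A schematic numpy sketch; the grouping and corner parametrization here are illustrative, not the paper's exact ones.

```python
import numpy as np

def corners_from_params(center, dims, yaw):
    """Hypothetical helper: 8 3D corners from center (3,), dims (h, w, l), and yaw."""
    h, w, l = dims
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * h / 2
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * w / 2
    R = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                  [0, 1, 0],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    return np.stack([x, y, z], axis=1) @ R.T + center

def disentangled_corner_loss(pred, gt):
    """pred/gt: dicts with keys 'center', 'dims', 'yaw'.
    Each term swaps in the prediction for exactly one group and GT for the rest,
    so the corner loss attributes error to one parameter group at a time."""
    gt_corners = corners_from_params(gt['center'], gt['dims'], gt['yaw'])
    total = 0.0
    for group in ('center', 'dims', 'yaw'):
        mixed = {k: (pred[k] if k == group else gt[k]) for k in gt}
        corners = corners_from_params(mixed['center'], mixed['dims'], mixed['yaw'])
        total += np.abs(corners - gt_corners).mean()   # L1 over the 8 corners
    return total
```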
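The "L1 loss on 1 over regressed disparity" entries for CenterNet and Mono3D Track refer to regressing depth through an inverse transform of the raw network output. A small sketch of CenterNet-style decoding, assuming the d = 1/sigmoid(output) - 1 transform reported in the CenterNet paper:

```python
import numpy as np

def decode_depth(raw_output):
    """CenterNet-style depth decoding: d = 1 / sigmoid(raw) - 1 (equivalently exp(-raw)).
    Training applies an L1 loss between the decoded depth and the ground-truth depth."""
    sigma = 1.0 / (1.0 + np.exp(-raw_output))
    return 1.0 / sigma - 1.0

print(decode_depth(-np.log(20.0)))   # 20.0
```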