PaperReading

Review of Monocular 3D Object Detection

October 2019

| name | time (YYMM) | venue | title | tl;dr | predecessor | backbone | 3D size | 3D shape | keypoint | 3D orientation | distance | 2D/3D tight-fit optimization | required input | drawbacks | tricks and contributions | insights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mono3D | 1512 | CVPR 2016 | Mono3D: Monocular 3D Object Detection for Autonomous Driving | The pioneering paper on monocular 3DOD, with tons of hand-crafted features | Mono3D | Faster RCNN | from 3 templates per class | None | None | scoring of dense proposals | scoring of dense proposals | None | 2D bbox, 2D seg mask, 3D bbox | | shared feature maps (mono3D) | |
| Deep3DBox | 1612 | CVPR 2017 | Deep3DBox: 3D Bounding Box Estimation Using Deep Learning and Geometry | Monocular 3D object detection (3DOD) using the 2D bbox and geometric constraints | Deep3DBox | MS-CNN | L2 loss for offset from subtype average | None | None | multi-bin for yaw (sketch below the table) | 2D/3D optimization | the original Deep3DBox optimization (sketch below the table) | 2D bbox, 3D bbox, intrinsics | locks in the error of 2D object detection | | |
| Deep MANTA | 1703 | CVPR 2017 | Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image | Predict keypoints and use 3D-to-2D projection (EPnP) to get the position and orientation of the 3D bbox | None | cascaded Faster RCNN | template classification scaled by a scaling factor | template classification scaled by a scaling factor | 36 keypoints | 6DoF pose by 2D/3D matching (EPnP) | 6DoF pose by 2D/3D matching (EPnP) | None | 2D bbox, 3D bbox, 103 3D CAD models with 36 keypoint annotations | | semi-auto labeling by fitting templates into the 3D bbox | |
| 3D-RCNN | 1712 | CVPR 2018 | 3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare | Inverse graphics: predict shape and pose, then render and compare | Deep3DBox | Faster RCNN | subtype average | TSDF encoding, PCA, 10-dim space | 2D projection of 3D center | viewpoint (azimuth, elevation, tilt) with improved weighted-average multi-bin | find distance by moving along the viewing ray until the 3D box tightly fits the 2D bbox | yes, move the 3D box along the ray until it fits tightly into the 2D bbox | 2D bbox, 3D bbox, 3D CAD | | | |
| MLF | 1712 | CVPR 2018 | MLF: Multi-Level Fusion based 3D Object Detection from Monocular Images | Estimate a depth map from monocular RGB and concatenate it to form RGBD for mono 3DOD | Deep3DBox | Faster RCNN | offset from whole-dataset average | None | None | multi-bin, and SL1 for cos and sin | MonoDepth, SL1 for depth regression | None | 2D bbox, 3D bbox, pretrained depth model | pretrained depth model | point cloud as 3-channel xyz map | |
| MonoGRNet | 1811 | AAAI 2019 | MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization | Use the same network to estimate instance depth, 2D bbox and 3D bbox | MonoGRNet | MultiNet (YOLO + RoIAlign) | regress 8 corners in allocentric coordinate system | None | 2D projection of 3D center | regress 8 corners in allocentric coordinate system | instance depth estimation (IDE) according to a grid | | 2D bbox, 3D bbox, intrinsics, depth map | requires depth map for training | 2D/3D center loss, local/global corner loss; stagewise training to start 3D after 2D | instance depth estimation: pixel-level depth estimation is not designed for object localization; IDE regresses the depth of the nearest object instance |
| OFT | 1811 | BMVC 2019 | OFT: Orthographic Feature Transform for Monocular 3D Object Detection | Learn a projection of the camera image to BEV for 3D object detection | OFT | ResNet18 + ResNet16 top-down network | L1 loss for offset from subtype average in log space | None | None | L1 on cos and sin | positional offset in BEV space from local peaks | None | 2D bbox, 3D bbox (intrinsics learned) | | top-down network to reason in BEV | |
| ROI-10D | 1812 | CVPR 2019 | ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape | Concatenate depth map and coordinate map to RGB features; 2DOD plus car shape reconstruction (6-dim latent space) for mono 3DOD | | Faster RCNN with FPN | offset from whole-dataset average | TSDF encoding, 3D autoencoder, 6-dim space | None | 4-d quaternion | regress depth z | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | | 8-corner loss; stagewise training to start 3D after 2D | |
| Pseudo-Lidar | 1812 | CVPR 2019 | Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving | Estimate a depth map from the RGB image (mono/stereo) and use it to lift RGB to a point cloud (lifting sketch below the table) | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | DORN depth estimation | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | pretrained depth model | | data representation matters |
| Mono3D++ | 1901 | AAAI 2019 | Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors | Mono 3DOD based on 3D/2D consistency, in particular landmarks and shape reconstruction | DeepMANTA | SSD for 2D bbox, stacked hourglass for keypoints, MonoDepth for depth | | N basis shapes (N=?) | 14 landmarks | CE classification over 360 bins | MonoDepth L1 loss | | 2D bbox, 3D bbox, pretrained depth model, 3D CAD model with keypoints | | | cars should stay on the ground, look like cars, and be at a reasonable distance; enforce 2D/3D consistency of the generated 3D vehicle hypotheses |
| GS3D | 1903 | CVPR 2019 | GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving | Get a 3D bbox proposal (guidance) from the 2D bbox plus prior knowledge, then refine the 3D bbox using surface features | | Faster RCNN with VGG16 (2D+O) | subtype average | None | None | from RoIAligned features (possibly multi-bin) | approximated with bbox height × 0.93 (similar-triangles sketch below the table) | None | 2D bbox, 3D bbox, intrinsics | | quality-aware loss, surface feature extraction | |
| Pseudo-Lidar Color | 1903 | ICCV 2019 | Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving | Concurrent work with Pseudo-lidar, but with color embedding | Pseudo-lidar | Frustum-PointNet | 3DOD on point cloud | None | None | 3DOD on point cloud | various pretrained depth weights | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | | | |
| BirdGAN | 1904 | IROS 2019 | BirdGAN: Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles | Learn to map the 2D perspective image to BEV with a GAN | BirdGAN | DCGAN | oriented 2DOD on BEV point cloud | None | None | oriented 2DOD on BEV point cloud | oriented 2DOD on BEV point cloud | None | 2D bbox, 3D bbox (intrinsics learned) | in the clipping case, the frontal detectable depth is only about 10 to 15 meters | | |
| FQNet | 1904 | CVPR 2019 | FQNet: Deep Fitting Degree Scoring Network for Monocular 3D Object Detection | Train a network to score the 3D IoU of a projected 3D wireframe against the GT | Deep3DBox | MS-CNN | k-means clustering and multi-bin | None | None | k-means clustering and multi-bin | approximated via optimization | similar to Deep3DBox (details in appendix) | 2D bbox, 3D bbox, intrinsics | | | |
| MonoPSR | 1904 | CVPR 2019 | MonoPSR: Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction | 3DOD by first generating 3D proposals and then reconstructing the local point cloud of dynamic objects | Deep3DBox, Pseudo-lidar | MS-CNN | L2 loss for offset from subtype average | None | None | multi-bin for yaw | approximated with bbox height, then regress the residual from RoIAligned features | None | 2D bbox, 3D bbox, intrinsics | | shared feature maps (mono3D) | |
| MonoDIS | 1905 | ICCV 2019 | MonoDIS: Disentangling Monocular 3D Object Detection | End-to-end training of 2D and 3D heads on top of RetinaNet for monocular 3D object detection | MonoGRNet | RetinaNet + 2D/3D heads | offset from whole-dataset average, learned via 3D corner loss | None | 2D projection of 3D center | learned via 3D corner loss | regressed from dataset average, learned via 3D corner loss | None | 2D bbox, 3D bbox, intrinsics | | signed IoU loss (pulls boxes together even before they intersect), disentangled learning | a disentangling transformation splits the original combined loss (e.g., over bbox size and location at the same time) into groups; each group takes the loss w.r.t. one parameter group while the remaining parameters are set to GT (sketch below the table) |
| monogrnet_russian | 1905 | | MonoGRNet 2: Monocular 3D Object Detection via Geometric Reasoning on Keypoints | Regress keypoints in the 2D image and use 3D CAD models to infer depth | DeepMANTA | Mask RCNN with FPN | SL1 loss for offset from subtype average in log space | 5 CAD models | 14 landmarks | multi-bin for yaw with 72 non-overlapping bins | approximated with windshield height | None | 2D bbox, 3D bbox, intrinsics | | semi-auto labeling by fitting templates into the 3D bbox | |
| Pseudo-Lidar end2end | 1905 | ICCV 2019 | Pseudo lidar-e2e: Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud | End-to-end pseudo-lidar training with a 2D/3D bbox consistency loss | Pseudo-Lidar | Frustum-PointNet | 3DOD on point cloud | None | None | 3DOD on point cloud | DORN depth estimation | bbox consistency loss | 2D bbox, 2D seg mask, 3D bbox, intrinsics | pretrained depth model | 2D/3D bbox consistency | |
| Shift RCNN | 1905 | IEEE ICIP 2019 | Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints | Extend Deep3DBox by regressing residual center positions | Deep3DBox | Faster RCNN | L2 loss for offset from subtype average | None | None | cos and sin, with unity constraint | approximated via optimization | slightly different from Deep3DBox | 2D bbox, 3D bbox, intrinsics | | | |
| BEV IPM OD | 1906 | IV 2019 | BEV-IPM: Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping | IPM of the pitch/roll-corrected camera image, then perform 2DOD on the IPM image (IPM sketch below the table) | | YOLOv3 | oriented 2DOD on BEV image | None | None | oriented 2DOD on BEV image | oriented 2DOD on BEV image | None | 2D bbox, BEV oriented bbox, IMU correction | up to 40 meters | motion cancellation using IMU | IPM assumptions: 1) the road is flat; 2) the mounting position of the camera is stationary (motion cancellation helps with this); 3) the vehicle to be detected is on the ground |
| Pseudo-Lidar++ | 1906 | | Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving | Improve the depth estimation of pseudo-lidar with a stereo depth network (SDN) and sparse depth measurements on landmark pixels from few-line lidars | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | PSMNet finetuned stereo depth | None | 2D bbox, 3D bbox, pretrained depth model, sparse lidar data | | use sparse lidar to correct depth, stereo depth loss | |
| SS3D | 1906 | | SS3D: Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss | CenterNet-like structure that directly regresses 26 attributes per object to fit a 3D bbox | | U-Net-like architecture | log size | None | 8 3D corners projected to 2D | cos and sin (multi-bin not suitable) | directly regressed | None | 2D bbox, 3D bbox, intrinsics | | models uncertainty, directly regresses 26 numbers, 20 fps inference | |
| TLNet | 1906 | CVPR 2019 | TLNet: Triangulation Learning Network: from Monocular to Stereo 3D Object Detection | Place 3D anchors inside the frustum subtended by the 2D object detection as the mono baseline | | Faster RCNN with two refinement stages | refined from dataset average | None | None | refined from 0° and 90° anchors | refined from 3D anchors | None | 2D bbox, 3D bbox, intrinsics | | stereo coherence score and channel reweighting | |
| M3D-RPN | 1907 | ICCV 2019 | M3D-RPN: Monocular 3D Region Proposal Network for Object Detection | Regress 2D and 3D bbox parameters simultaneously by precomputing 3D mean statistics for each 2D anchor | | Faster RCNN | log size times 3D anchor size | None | None | smooth L1 directly on angle, postprocessing to refine | | None | 2D bbox, 3D bbox, intrinsics | angle postprocessing | 2D anchors with 2D/3D properties, depth-aware conv, negative-log-IoU loss for 2D detection, directly regress 12 numbers | reliance on additional sub-networks introduces persistent noise |
| ForeSeE | 1909 | | ForeSeE: Task-Aware Monocular Depth Estimation for 3D Object Detection | Train a depth estimator focused on foreground moving objects and improve pseudo-lidar-based 3DOD | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | learn foreground/background depth separately | | 2D bbox, 3D bbox, depth map | | depth combination: take the element-wise maximum of the confidence vectors over the C depth bins, then pass through a softmax | not all pixels are equal: an estimation error on a car is very different from the same error on a building |
| CenterNet | 1904 | | Objects as Points | Object detection as detection of the object's center point plus regression of its associated properties | CenterNet | DLA (U-Net style) | L1 loss over absolute dimensions | None | None | multi-bin for global yaw in two overlapping bins | L1 loss on 1 over regressed disparity (decoding sketch below the table) | None | 2D bbox, 3D bbox, intrinsics | | | highly flexible network |
| Mono3D Track | 1811 | ICCV 2019 | Joint Monocular 3D Vehicle Detection and Tracking | Add 3D tracking with an LSTM on top of monocular 3D object detection | Deep3DBox | Faster RCNN | L1 loss for offset from subtype average | None | 2D projection of 3D center | multi-bin for local yaw in two bins | L1 loss on 1 over regressed disparity | None | 2D bbox, 3D bbox, intrinsics | | | regressing the 2D projection of the 3D center helps recover the amodal 3D bbox |
| CasGeo | 1909 | | 3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results | Extends Deep3DBox by regressing the 3D bbox center on the bottom edge and a viewpoint classification | Deep3DBox | MS-CNN | refined from subtype average | None | 2D projection of bottom surface center | multi-bin for yaw, viewpoint estimation | approximated via optimization (Gauss-Newton) | similar to Deep3DBox | 2D bbox, 3D bbox, intrinsics | | | regress the projected 3D height to help with the initial guess of distance |
| GPP | 1811 | ArXiv | GPP: Ground Plane Polling for 6DoF Pose Estimation of Objects on the Road | Regress tirelines and height, and project onto the best ground plane near the car | GPP | RetinaNet + 2D/3D heads | refined from subtype average | None | 2D projection of tirelines (observer-facing vertices) | coarse (8-way) viewpoint classification | IPM based on the best-fitting ground plane | None | 2D bbox, 3D bbox, intrinsics, fitted road planes | need to collect and fit road data | able to predict local road pose | NA |
| MVRA | 1910 | ICCV 2019 | MVRA: Multi-View Reprojection Architecture for Orientation Estimation | Build the 2D/3D constraint optimization into the neural network and use an iterative method to refine truncated (cropped) cases | Deep3DBox | Faster RCNN | refined from subtype average | None | None | multi-bin for yaw, viewpoint estimation, iterative trial-and-error for truncated objects | approximated via optimization | similar to Deep3DBox (details in appendix) | 2D bbox, 3D bbox, intrinsics | | predicts truncated bboxes better | NA |
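
Several rows above (Deep3DBox, MonoPSR, CasGeo, MVRA, CenterNet) rely on multi-bin orientation estimation: classify the yaw into one of a few angular bins and regress a (sin, cos) residual within the winning bin. Below is a minimal decoding sketch, assuming a head that outputs per-bin confidences plus per-bin (sin, cos) residuals; bin overlap and the original papers' confidence/residual losses are omitted.

```python
import numpy as np

def decode_multibin_yaw(bin_conf, bin_residual, num_bins=2):
    """bin_conf: (num_bins,) scores; bin_residual: (num_bins, 2) = (sin, cos) per bin."""
    bin_width = 2 * np.pi / num_bins
    # Bin centers spread uniformly over [-pi, pi); overlap between bins is ignored here.
    bin_centers = -np.pi + bin_width * (np.arange(num_bins) + 0.5)
    best = int(np.argmax(bin_conf))
    sin_r, cos_r = bin_residual[best]
    residual = np.arctan2(sin_r, cos_r)      # residual angle within the chosen bin
    yaw = bin_centers[best] + residual
    return (yaw + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)

# Example: two bins, the second bin wins with a small positive residual.
conf = np.array([0.2, 0.8])
resid = np.array([[0.0, 1.0], [np.sin(0.1), np.cos(0.1)]])
print(decode_multibin_yaw(conf, resid))   # ~ pi/2 + 0.1
```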
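The "original Deep3DBox optimization" referenced in the table recovers the object translation by requiring the projected 3D box to fit tightly inside the 2D box. The following is a simplified sketch, not the paper's exact procedure: it assumes known intrinsics, yaw, and dimensions, enumerates which 3D corner touches each 2D box side, and solves a small least-squares problem per hypothesis (the paper additionally prunes configurations using the observation angle).

```python
import itertools
import numpy as np

def solve_translation(K, yaw, dims, box2d):
    """Least-squares translation so the projected 3D box fits the 2D box tightly.

    K: 3x3 intrinsics, yaw: rotation around the camera Y axis, dims: (h, w, l) in meters,
    box2d: (xmin, ymin, xmax, ymax) in pixels. Returns the best 3D center (x, y, z).
    """
    h, w, l = dims
    # 8 corners of the 3D box in object coordinates, centered at the origin.
    xs = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * l / 2
    ys = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * h / 2
    zs = np.array([ 1, -1,  1, -1,  1, -1,  1, -1]) * w / 2
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    corners = np.stack([xs, ys, zs], axis=1) @ R.T          # (8, 3), rotated, untranslated
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    xmin, ymin, xmax, ymax = box2d
    sides = [(0, fx, cx, xmin), (0, fx, cx, xmax),          # u = xmin, u = xmax
             (1, fy, cy, ymin), (1, fy, cy, ymax)]          # v = ymin, v = ymax

    best_T, best_err = None, np.inf
    # Enumerate which corner touches each of the four 2D box sides (8^4 hypotheses).
    for assignment in itertools.product(range(8), repeat=4):
        A, b = [], []
        for corner_idx, (axis, f, c, side) in zip(assignment, sides):
            X = corners[corner_idx]
            row = [0.0, 0.0, c - side]
            row[axis] = f
            A.append(row)
            # From f*(X[axis]+T[axis]) + c*(X[2]+T[2]) = side*(X[2]+T[2]).
            b.append((side - c) * X[2] - f * X[axis])
        A, b = np.array(A), np.array(b)
        T, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
        err = np.sum((A @ T - b) ** 2)
        if T[2] > 0 and err < best_err:                      # keep solutions in front of the camera
            best_T, best_err = T, err
    return best_T
```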
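GS3D, MonoPSR, and MonoGRNet 2 all bootstrap distance from a similar-triangles argument: a pinhole camera maps an object of physical height H at depth z to roughly f·H/z pixels. A tiny sketch follows; the 0.93 factor GS3D uses to relate the projected 3D height to the 2D box height is taken from the table and passed in as a constant.

```python
def depth_from_box_height(fy, h3d_m, h2d_px, height_ratio=1.0):
    """Similar triangles: z ~ fy * H / h. GS3D approximates the projected 3D height
    as 0.93 * the 2D bbox height, i.e. height_ratio=0.93 in this sketch."""
    return fy * h3d_m / (height_ratio * h2d_px)

# A 1.5 m tall car spanning 54 px with fy = 720 px sits roughly 20 m away.
print(depth_from_box_height(720.0, 1.5, 54.0))          # 20.0
print(depth_from_box_height(720.0, 1.5, 54.0, 0.93))    # ~21.5 (GS3D-style correction)
```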
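The pseudo-lidar family (Pseudo-Lidar, its color-embedded and end-to-end variants, Pseudo-Lidar++, ForeSeE) shares the same lifting step: back-project a predicted depth map into a 3D point cloud using the camera intrinsics and hand it to an existing lidar-based detector. A minimal numpy sketch of that step:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud in the
    camera frame using pinhole intrinsics K. Downstream 3D detectors such as
    Frustum-PointNet or AVOD then consume the points directly."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example: a 2x2 depth map at 10 m with a made-up intrinsic matrix.
K = np.array([[700.0, 0.0, 0.5], [0.0, 700.0, 0.5], [0.0, 0.0, 1.0]])
pts = depth_to_pseudo_lidar(np.full((2, 2), 10.0), K)
print(pts.shape)  # (4, 3)
```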
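BEV-IPM's inverse perspective mapping rests on the assumptions listed in its insights cell: a flat road and a fixed, pitch/roll-corrected camera. Under those assumptions each pixel ray can be intersected with the ground plane. The sketch below assumes a camera frame with y pointing down and uses a hypothetical camera height of 1.5 m in the example.

```python
import numpy as np

def ipm_pixel_to_ground(u, v, K, cam_height):
    """Map an image pixel to the flat ground plane (y_cam = cam_height, y pointing down),
    assuming the image has already been pitch/roll-corrected as in BEV-IPM.
    Returns (X, Z) in meters in the camera frame, or None for pixels at or above the horizon."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    if ray[1] <= 1e-6:          # ray never reaches the ground in front of the camera
        return None
    s = cam_height / ray[1]
    point = s * ray
    return point[0], point[2]   # lateral X, forward Z

# Example: principal point (960, 540), f = 1000 px, camera 1.5 m above the ground.
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
print(ipm_pixel_to_ground(960.0, 640.0, K, 1.5))  # (0.0, 15.0): a ground point 15 m ahead
```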
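MonoDIS's disentangling transformation, summarized in the insights column, can be illustrated with a corner loss: for each parameter group (center, dimensions, yaw), the loss is computed with the prediction for that group and ground truth for everything else, so each group receives an isolated gradient signal. A schematic numpy sketch; the grouping and corner parametrization here are illustrative, not the paper's exact ones.

```python
import numpy as np

def corners_from_params(center, dims, yaw):
    """Hypothetical helper: 8 3D corners from center (3,), dims (h, w, l), and yaw."""
    h, w, l = dims
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * h / 2
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * w / 2
    R = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                  [0, 1, 0],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    return np.stack([x, y, z], axis=1) @ R.T + center

def disentangled_corner_loss(pred, gt):
    """pred/gt: dicts with keys 'center', 'dims', 'yaw'.
    Each term swaps in the prediction for exactly one group and GT for the rest,
    so the corner loss attributes error to one parameter group at a time."""
    gt_corners = corners_from_params(gt['center'], gt['dims'], gt['yaw'])
    total = 0.0
    for group in ('center', 'dims', 'yaw'):
        mixed = {k: (pred[k] if k == group else gt[k]) for k in gt}
        corners = corners_from_params(mixed['center'], mixed['dims'], mixed['yaw'])
        total += np.abs(corners - gt_corners).mean()   # L1 over the 8 corners
    return total
```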
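The "L1 loss on 1 over regressed disparity" entries for CenterNet and Mono3D Track refer to regressing depth through an inverse transform of the raw network output. A small sketch of CenterNet-style decoding, assuming the d = 1/sigmoid(output) - 1 transform reported in the CenterNet paper:

```python
import numpy as np

def decode_depth(raw_output):
    """CenterNet-style depth decoding: d = 1 / sigmoid(raw) - 1 (equivalently exp(-raw)).
    Training applies an L1 loss between the decoded depth and the ground-truth depth."""
    sigma = 1.0 / (1.0 + np.exp(-raw_output))
    return 1.0 / sigma - 1.0

print(decode_depth(-np.log(20.0)))   # 20.0
```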