

Poster

RecurrentBEV: A Long-term Temporal Fusion Framework for Multi-view 3D Detection

Ming Chang · Xishan Zhang · Rui Zhang · Zhipeng Zhao · Guanhua He · Shaoli Liu

Wed 2 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Long-term temporal fusion is frequently employed in camera-based Bird’s-Eye-View (BEV) 3D object detection to improve the detection of occluded objects. Existing methods can be divided into two categories: parallel fusion and recurrent fusion. Recurrent fusion reduces inference latency and memory consumption but fails to exploit long-term information as well as parallel fusion does. In this paper, we first identify two shortcomings of the recurrent fusion paradigm: (1) gradients from previous BEV features cannot directly contribute to training the fusion module, and (2) semantic ambiguity is caused by the coarse granularity of the BEV grids when aligning BEV features. Based on this analysis, we propose RecurrentBEV, a novel recurrent temporal fusion method for BEV-based 3D object detectors. By adopting RNN-style back-propagation and a newly designed inner grid transformation, RecurrentBEV improves long-term fusion ability while retaining low latency and memory consumption during inference. Extensive experiments on the nuScenes benchmark demonstrate its effectiveness, achieving a new state-of-the-art performance of 57.4 mAP and 65.1 NDS on the test set. The real-time version (25.6 FPS) achieves 44.5 mAP and 54.9 NDS without external datasets, outperforming the previous best method, StreamPETR, by 1.3 mAP and 0.9 NDS.
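To make shortcoming (1) concrete, below is a minimal PyTorch-style sketch, not the authors' code: the module name, fusion operator, and tensor shapes are hypothetical. It contrasts the common detach-based recurrence (gradients stop at the previous frame's BEV feature) with RNN-style back-propagation through time, where the loss at the current frame propagates gradients through every earlier BEV feature, as the abstract describes.

    import torch
    import torch.nn as nn

    class RecurrentBEVFusion(nn.Module):
        """Hypothetical recurrent BEV fusion cell (illustrative only).

        If the previous BEV feature is detached, gradients from the current
        frame's loss cannot reach earlier frames; without detaching, training
        back-propagates through the whole recurrence (RNN-style BPTT).
        """

        def __init__(self, channels: int):
            super().__init__()
            # Simple convolutional fusion of the concatenated
            # [previous BEV, current BEV] feature maps.
            self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

        def forward(self, prev_bev: torch.Tensor, curr_bev: torch.Tensor,
                    detach_prev: bool = False) -> torch.Tensor:
            if detach_prev:
                # Detach-based recurrence: cut the gradient path to the past.
                prev_bev = prev_bev.detach()
            return self.fuse(torch.cat([prev_bev, curr_bev], dim=1))

    if __name__ == "__main__":
        # Toy clip of T per-frame BEV features with shape (1, C, H, W).
        C, H, W, T = 64, 32, 32, 4
        fusion = RecurrentBEVFusion(C)
        clip = [torch.randn(1, C, H, W, requires_grad=True) for _ in range(T)]

        bev = clip[0]
        for t in range(1, T):
            bev = fusion(bev, clip[t], detach_prev=False)  # RNN-style BPTT

        bev.sum().backward()
        # True: with detach_prev=False, even the first frame receives gradients.
        print(clip[0].grad is not None)

Running the same loop with detach_prev=True leaves clip[0].grad as None, which is the gradient-blocking behavior the paper identifies in prior recurrent fusion methods. Shortcoming (2), sub-grid alignment, is not shown here.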
