Existing end-to-end trackers for vision-based 3D perception suffer from performance degradation caused by the conflict between the detection and tracking tasks. In this work, we get to the bottom of this conflict, which previous work vaguely attributed to incompatible task-specific object features. We find that the conflict lies in the two tasks' partially conflicting classification gradients, which stem from a subtle difference in their positive sample assignments. Based on this observation, we propose to coordinate the conflicting gradients by accurately identifying object queries whose positivity contradicts across the two tasks. We then dynamically mask all attention between such contradictory object queries and modify the tracking classification loss to suppress inaccurate predictions. With these designs, we present OneTrack, the first one-stage joint detection and tracking model, which bridges the gap between detection and tracking under a unified object feature representation. On the nuScenes camera-based object tracking benchmark, OneTrack outperforms prior methods by 6.9% AMOTA on the validation set and by 3.3% AMOTA on the test set. The code will be released.
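To make the masking and loss-suppression ideas concrete, here is a minimal PyTorch sketch. It assumes each object query carries binary positivity flags from the detection and tracking assignments (`det_pos`, `trk_pos`); these names, the XOR-based conflict test, and the exact suppression rule are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def build_conflict_attention_mask(det_pos: torch.Tensor,
                                  trk_pos: torch.Tensor) -> torch.Tensor:
    """Build an attention mask that blocks edges touching conflicted queries.

    det_pos, trk_pos: (N,) bool tensors; True if a query is a positive
    sample under the detection / tracking assignment, respectively.
    Returns an (N, N) bool mask where True entries are disallowed,
    matching the convention of torch.nn.MultiheadAttention attn_mask.
    """
    # A query is "conflicted" if it is positive in exactly one of the tasks.
    conflicted = det_pos ^ trk_pos
    # Block any attention edge whose source or target is a conflicted query.
    return conflicted.unsqueeze(0) | conflicted.unsqueeze(1)

def suppressed_tracking_cls_loss(logits: torch.Tensor,
                                 targets: torch.Tensor,
                                 det_pos: torch.Tensor,
                                 trk_pos: torch.Tensor) -> torch.Tensor:
    """Tracking classification loss with conflicted queries suppressed.

    logits, targets: (N,) per-query tracking classification inputs.
    Queries with contradictory positivity are dropped from the loss so
    their inaccurate predictions contribute no gradient (one plausible
    reading of "suppress inaccurate predictions").
    """
    conflicted = det_pos ^ trk_pos
    loss = F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none")
    keep = ~conflicted
    # Return a zero that still participates in the graph if all queries conflict.
    return loss[keep].mean() if keep.any() else loss.sum() * 0.0
```

In this sketch, the mask would be recomputed every decoder iteration from the current assignments and passed as `attn_mask` to the self-attention over object queries, so the blocking adapts dynamically as positivity changes during training.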