Stereo matching provides depth estimation from binocular images for downstream applications. These applications mostly take video streams as input and require temporally consistent depth maps. However, existing methods mainly focus on estimation at the single-frame level, which commonly leads to temporally inconsistent results, especially in ill-posed regions. In this paper, we aim to exploit temporal information to improve both the temporal consistency and the accuracy of stereo matching. To this end, we build a temporally consistent stereo matching network that consists of two stages. In the first stage, we leverage temporal information to obtain a well-initialized disparity. In the second stage, we iteratively refine the disparity based on this temporal initialization. Specifically, we propose a temporal disparity completion module, which completes a semi-dense disparity map transformed from the previous frame. Then, we use a temporal state fusion module to fuse the state of the completion module with the hidden state of the refinement from the previous frame, providing a coherent state for further refinement. Based on this coherent state, we introduce a dual-space refinement module that iteratively refines the initialized result in both the disparity space and the disparity gradient space, improving estimation in ill-posed regions. Extensive experiments demonstrate that our method effectively alleviates temporal inconsistency while enhancing accuracy and efficiency. At the time of writing, our method ranks second on the KITTI 2015 benchmark while achieving superior efficiency compared to other state-of-the-art methods.
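The following is a minimal, self-contained PyTorch sketch (not the authors' implementation) illustrating the two-stage structure described above: a temporal disparity completion step, a temporal state fusion step, and an iterative dual-space refinement step. All module names, layer choices, and helper functions (e.g. TemporalDisparityCompletion, DualSpaceRefinement, shift_right) are hypothetical stand-ins chosen for illustration under assumed tensor shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def shift_right(x):
    # x[..., i] -> x[..., i-1], zero at the left border
    return F.pad(x, (1, 0))[..., :, :-1]


def shift_down(x):
    # x[..., i, :] -> x[..., i-1, :], zero at the top border
    return F.pad(x, (0, 0, 1, 0))[..., :-1, :]


class TemporalDisparityCompletion(nn.Module):
    """Completes a semi-dense disparity map carried over from the previous frame."""

    def __init__(self, ch=32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(2, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, semi_dense_disp, valid_mask):
        state = self.encode(torch.cat([semi_dense_disp, valid_mask], dim=1))
        return self.head(state), state          # dense initialization + completion state


class TemporalStateFusion(nn.Module):
    """Fuses the completion state with the previous frame's refinement hidden state."""

    def __init__(self, ch=32):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, completion_state, prev_hidden):
        return torch.tanh(self.fuse(torch.cat([completion_state, prev_hidden], dim=1)))


class DualSpaceRefinement(nn.Module):
    """Iterative refinement driven by the disparity map and its spatial gradients."""

    def __init__(self, ch=32):
        super().__init__()
        self.update = nn.Conv2d(ch + 3, ch, 3, padding=1)   # hidden + disp + (gx, gy)
        self.delta_disp = nn.Conv2d(ch, 1, 3, padding=1)    # disparity-space residual
        self.grad_head = nn.Conv2d(ch, 2, 3, padding=1)     # refined gradients (gx, gy)

    def forward(self, disp, hidden, iters=4):
        for _ in range(iters):
            gx, gy = disp - shift_right(disp), disp - shift_down(disp)
            hidden = torch.tanh(self.update(torch.cat([hidden, disp, gx, gy], dim=1)))
            g_hat = self.grad_head(hidden)
            # Blend a disparity-space update with two gradient-space predictions
            # (left/top neighbours propagated through the refined gradients).
            from_left = shift_right(disp) + g_hat[:, :1]
            from_top = shift_down(disp) + g_hat[:, 1:]
            disp = (disp + self.delta_disp(hidden) + from_left + from_top) / 3.0
        return disp, hidden


if __name__ == "__main__":
    b, h, w, ch = 1, 64, 128, 32
    # Stand-ins for the semi-dense disparity obtained from the previous frame
    # and the previous frame's refinement hidden state.
    semi_dense = torch.rand(b, 1, h, w) * 64.0
    mask = (torch.rand(b, 1, h, w) > 0.5).float()
    prev_hidden = torch.zeros(b, ch, h, w)

    completion = TemporalDisparityCompletion(ch)
    fusion = TemporalStateFusion(ch)
    refine = DualSpaceRefinement(ch)

    init_disp, comp_state = completion(semi_dense * mask, mask)  # stage 1: temporal initialization
    hidden = fusion(comp_state, prev_hidden)                     # coherent state across frames
    disp, hidden = refine(init_disp, hidden, iters=4)            # stage 2: dual-space refinement
    print(disp.shape, hidden.shape)
```

The sketch only mirrors the data flow of the abstract (warp-and-complete, fuse temporal states, then refine in disparity and disparity-gradient space); the actual network components, update operators, and loss terms would differ in the paper.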