Summary: | CNN-based absolute camera pose estimation methods lack scene generalizability because the network is trained with scene-specific parameters. In this paper, we aim to solve the scene generalizability problem in 6-DoF camera pose estimation using a novel deep photo-geometric loss. We train a CNN-based relative pose estimation network end-to-end by jointly optimizing the proposed deep photo-geometric loss and the pose regression loss. Most traditional pose estimation methods use local keypoints to find 2D-2D correspondences, which fail under occlusion, textureless surfaces, motion blur, or repetitive structures. Given camera intrinsics, poses, and depth, our method generates uniform 2D-2D photometric correspondence pairs via epipolar geometry during training, with constraints to avoid textureless surfaces and occlusion, and without the need for manually annotated keypoints. The network is then trained with this correspondence information so that it not only learns from auxiliary photometric consistency information but also efficiently leverages scene geometry; hence we call it the photo-geometric loss. The input to the photo-geometric loss layer is taken from the activation maps of the deep network, which contain much more information than a simple 2D-2D correspondence, thus alleviating the need to choose a robust pose regression loss and its hyperparameters. With extensive experiments on three public datasets, we show that the proposed method significantly outperforms state-of-the-art relative pose estimation methods. The presented method also achieves state-of-the-art results on these datasets under cross-database evaluation settings, which demonstrates its significance in terms of scene generalization.
|