Summary: | With the rapid development of computer vision technology, depth estimation has become widely
used in autonomous driving. Combined with object detection, it enables pseudo-LiDAR
detection or three-dimensional reconstruction; combined with semantic
segmentation, it extends 2D results to 3D, yielding both semantic and depth information for each
pixel, as in lane line detection. Depth estimation can also be used for general
obstacle detection[1]. It is therefore an important visual task in autonomous
driving. Depth estimation recovers scene depth from a single visible-light image or from
several images of the same scene taken simultaneously. Approaches include
monocular vision[2], stereo matching, multi-view stereo, and 3D
reconstruction.
This dissertation first introduces the basic techniques and several commonly used
methods for depth estimation. It then presents a comprehensive study of monocular
depth estimation using the Monodepth2 model[25], explaining in detail its
network structure, components, and loss function.
The experimental section describes the environment setup and the datasets: the model is
pre-trained on the Cityscapes dataset[28] and then fine-tuned and tested on the KITTI
dataset[27]. The model is evaluated with standard depth estimation
metrics such as MSE, MAE, and absolute relative error (Abs Rel). Results are reported for
three training modes: monocular training, stereo training, and monocular plus
stereo training[25]. The reproduced results are nearly identical to the
originally reported experimental results. Based on
the experimental findings, a direction and method for improving the Monodepth2 model in
future studies are suggested. Overall, this study provides insight into
monocular depth estimation with the established Monodepth2 method.
|