RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers

Recent research in computer vision has highlighted the effectiveness of vision transformers (ViTs) in several computer vision tasks: they process an image globally, unlike convolutions, which process it locally. ViTs outperform convolutional neural networks in accuracy on many computer vision tasks, but their speed remains an issue because of the heavy use of transformer layers, each containing many fully connected layers. We therefore propose a real-time ViT-based monocular depth estimation method (depth estimation from a single RGB image) with encoder-decoder architectures for indoor and outdoor scenes. The main architecture consists of a vision transformer encoder and a convolutional neural network decoder. We started by training the base vision transformer (ViT-b16) with 12 transformer layers, then reduced the encoder to six layers (ViT-s16, the Small ViT) and to four layers (ViT-t16, the Tiny ViT) to reach real-time processing. We also tried four different configurations of the CNN decoder network. The proposed architectures learn the depth estimation task efficiently and, by taking advantage of the multi-head self-attention module, produce more accurate depth predictions than fully convolutional methods. We trained the encoder-decoder architectures end-to-end on the challenging NYU-Depth-V2 and Cityscapes benchmarks, then evaluated the trained models on the validation and test sets of the same benchmarks, showing that they outperform many state-of-the-art depth estimation methods while running in real time (∼20 fps). We also present a fast 3D reconstruction experiment (∼17 fps) based on the depth estimated by our method, as a real-world application.
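The abstract describes an encoder-decoder design in which a patch-based transformer encoder feeds a convolutional upsampling decoder, with the encoder depth cut from 12 to 6 or 4 layers for speed. Below is a minimal PyTorch sketch of that idea; the layer counts (12/6/4) and 16×16 patches follow the abstract, while the embedding width, head count, and the decoder layout are illustrative assumptions rather than the paper's exact RT-ViT configuration.

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Patch embedding + a configurable stack of transformer layers.
    layers=12 ~ ViT-b16, 6 ~ ViT-s16, 4 ~ ViT-t16 (per the abstract)."""
    def __init__(self, img_size=224, patch=16, dim=384, heads=6, layers=4):
        super().__init__()
        self.grid = img_size // patch                        # tokens per side
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # 16x16 patch embedding
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # B x N x dim
        t = self.blocks(t)                                       # global multi-head self-attention
        # fold the token sequence back into a 2-D feature map for the CNN decoder
        return t.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)

class CNNDecoder(nn.Module):
    """One of many possible upsampling decoders: four 2x deconvolutions take
    the 14x14 feature map back to 224x224, ending in a 1-channel depth map."""
    def __init__(self, dim=384):
        super().__init__()
        layers, c = [], dim
        for _ in range(4):
            layers += [nn.ConvTranspose2d(c, c // 2, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            c //= 2
        layers.append(nn.Conv2d(c, 1, 3, padding=1))  # regress depth per pixel
        self.net = nn.Sequential(*layers)

    def forward(self, f):
        return self.net(f)

model = nn.Sequential(ViTEncoder(layers=4), CNNDecoder())  # the "tiny" variant
depth = model(torch.randn(1, 3, 224, 224))                 # -> shape (1, 1, 224, 224)
```

Shrinking `layers` from 12 to 4 is the knob that trades accuracy for throughput in the Small and Tiny variants.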

Bibliographic Details
Main Authors: Hatem Ibrahem, Ahmed Salem, Hyun-Soo Kang
Affiliation: Department of Information and Communication Engineering, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si 28644, Korea
Format: Article
Language: English
Published: MDPI AG, 2022-05-01
Series: Sensors, Vol. 22, No. 10, Article 3849
ISSN: 1424-8220
DOI: 10.3390/s22103849
Collection: DOAJ (Directory of Open Access Journals)
Subjects: monocular depth estimation; convolutional neural networks; vision transformers; real-time processing
Online Access: https://www.mdpi.com/1424-8220/22/10/3849
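The abstract also mentions a fast 3D reconstruction experiment driven by the predicted depth. The record does not detail that pipeline; a common building block, sketched below, is back-projecting each pixel through a pinhole camera model. The intrinsics (fx, fy, cx, cy) used here are hypothetical placeholder values, not the paper's calibration.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project an HxW metric depth map into an (H*W)x3 point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid
    z = depth
    x = (u - cx) * z / fx   # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Hypothetical intrinsics at 640x480 resolution (placeholder values):
depth_map = np.random.uniform(0.5, 10.0, size=(480, 640))
points = depth_to_pointcloud(depth_map, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```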