Transformer-Based Optimized Multimodal Fusion for 3D Object Detection in Autonomous Driving

Accurate 3D object detection is vital for autonomous driving since it enables reliable perception of the environment through multiple sensors. Although cameras capture detailed color and texture features, they provide limited depth information and can struggle under adverse weather or lighting conditions. In contrast, LiDAR sensors offer robust depth information but lack the visual detail needed for precise object classification. To address these challenges, this work presents a multimodal fusion model that improves 3D object detection by combining the complementary strengths of LiDAR and camera sensors. The model processes camera images and LiDAR point cloud data into a voxel-based representation, which encoder networks further refine to enhance spatial interaction and reduce semantic ambiguity. A proposed multiresolution attention module, together with the integration of the discrete wavelet transform (DWT) and inverse discrete wavelet transform (IDWT) into the image backbone, improves feature extraction and strengthens the fusion of LiDAR depth information with the camera’s textural and color detail. The model also incorporates a transformer decoder with self-attention and cross-attention mechanisms, fostering robust and accurate detection through global interaction between identified objects and encoder features. Furthermore, the network is refined with optimization techniques, including pruning and Quantization-Aware Training (QAT), to maintain competitive performance while significantly reducing memory and computational requirements. Performance evaluations on the nuScenes dataset show that the optimized architecture offers competitive results while significantly improving operational efficiency in multimodal fusion 3D object detection.
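
The DWT/IDWT integration the abstract describes can be sketched in a few lines. The PyTorch snippet below is an illustrative assumption, not the authors' implementation: a single-level Haar DWT splits a backbone feature map into subbands, small convolutions refine them, and the inverse transform restores full resolution. The names haar_dwt2d, haar_idwt2d, and WaveletRefineBlock are hypothetical.

```python
import torch
import torch.nn as nn

def haar_dwt2d(x):
    # Single-level 2D Haar DWT on a (B, C, H, W) tensor with even H and W.
    # Returns the low-pass subband (LL) and the three stacked high-pass
    # subbands (LH, HL, HH), each at half the spatial resolution.
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, torch.cat([lh, hl, hh], dim=1)

def haar_idwt2d(ll, high):
    # Exact inverse of haar_dwt2d: reassembles the full-resolution map.
    lh, hl, hh = torch.chunk(high, 3, dim=1)
    out = ll.new_zeros(ll.shape[0], ll.shape[1], ll.shape[2] * 2, ll.shape[3] * 2)
    out[:, :, 0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[:, :, 0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[:, :, 1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[:, :, 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

class WaveletRefineBlock(nn.Module):
    # Hypothetical residual block: refine low- and high-frequency subbands
    # separately in the wavelet domain, then merge back into the input.
    def __init__(self, channels):
        super().__init__()
        self.low = nn.Conv2d(channels, channels, 3, padding=1)
        self.high = nn.Conv2d(3 * channels, 3 * channels, 3, padding=1)

    def forward(self, x):
        ll, hi = haar_dwt2d(x)
        return x + haar_idwt2d(self.low(ll), self.high(hi))

feats = torch.randn(2, 64, 32, 32)       # backbone feature map
refined = WaveletRefineBlock(64)(feats)  # same shape: (2, 64, 32, 32)
```

The residual wiring keeps the block drop-in compatible with an existing convolutional backbone, so it can wrap any stage whose spatial size is even.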

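The decoder the abstract mentions pairs self-attention among learned object queries with cross-attention to encoder features. A minimal sketch of one such layer using standard PyTorch modules follows; the dimensions, the post-norm wiring, and the name FusionDecoderLayer are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FusionDecoderLayer(nn.Module):
    # One decoder layer: self-attention lets object queries exchange
    # information globally; cross-attention lets each query attend to the
    # flattened fused encoder features (post-norm residual wiring).
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, memory):
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, memory, memory)[0])
        return self.norm3(q + self.ffn(q))

layer = FusionDecoderLayer()
queries = torch.randn(2, 100, 256)  # 100 learned object queries per sample
memory = torch.randn(2, 4096, 256)  # flattened fused LiDAR/camera features
out = layer(queries, memory)        # (2, 100, 256): one embedding per object
```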
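
The compression step can likewise be sketched with PyTorch's built-in pruning and quantization-aware-training utilities. The tiny TinyHead network, the 30% sparsity target, and the prune-then-fine-tune ordering below are placeholder assumptions; the paper's actual schedules and targets are not specified in this record.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
import torch.ao.quantization as tq

class TinyHead(nn.Module):
    # Placeholder stand-in for a detection head; QuantStub/DeQuantStub mark
    # where activations enter and leave the quantized region (eager mode).
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv1 = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(16, 16, 3)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.conv2(self.relu(self.conv1(self.quant(x)))))

model = TinyHead()

# 1) Magnitude pruning: zero the 30% smallest-magnitude weights per conv.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        prune.l1_unstructured(m, name="weight", amount=0.3)
        prune.remove(m, "weight")  # bake the sparsity into the weight tensor

# 2) Quantization-aware training: insert fake-quant observers, fine-tune
#    with them in place, then convert to a real int8 model.
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
qat_model = tq.prepare_qat(model)
# ... fine-tune qat_model on the detection task here ...
qat_model.eval()
int8_model = tq.convert(qat_model)
int8_model(torch.randn(1, 3, 32, 32))  # runs with int8 conv kernels
```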

Bibliographic Details
Main Authors: Simegnew Yihunie Alaba (ORCID: 0000-0002-3796-3201), John E. Ball (ORCID: 0000-0002-6774-4851)
Affiliation: Department of Electrical and Computer Engineering, James Worth Bagley College of Engineering, Mississippi State University, Starkville, MS, USA
Format: Article
Language: English
Published: IEEE, 2024-01-01
Series: IEEE Access, Vol. 12, pp. 50165–50176
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3385439
Subjects: Autonomous driving; LiDAR; multimodal fusion; network compression; pruning; quantization
Online Access: https://ieeexplore.ieee.org/document/10493018/