Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

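Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the fast development of deep learning, vision multimodal learning has gained much interest from the community. This paper reviews the types of architectures used in multimodal learning, including feature extraction, modality aggregation, and multimodal loss functions. Then, we discuss several learning paradigms such as supervised, semi-supervised, self-supervised, and transfer learning. We also introduce several practical challenges such as missing modalities and noisy modalities. Several applications and benchmarks on vision tasks are listed to help researchers gain a deeper understanding of progress in the field. Finally, we indicate that pretraining paradigm, unified multitask framework, missing and noisy modality, and multimodal task diversity could be the future trends and challenges in the deep vision multimodal learning field. Compared with existing surveys, this paper focuses on the most recent works and provides a thorough discussion of methodology, benchmarks, and future trends.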

Bibliographic Details
Main Authors:	Wenhao Chai, Gaoang Wang
Format:	Article
Language:	English
Published:	MDPI AG 2022-06-01
Series:	Applied Sciences
Subjects:	multimodal learning; computer vision; deep learning; introductory; survey
Online Access:	https://www.mdpi.com/2076-3417/12/13/6588

_version_	1797480914679234560
author	Wenhao Chai Gaoang Wang
author_facet	Wenhao Chai Gaoang Wang
author_sort	Wenhao Chai
collection	DOAJ
description	Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the fast development of deep learning, vision multimodal learning has gained much interest from the community. This paper reviews the types of architectures used in multimodal learning, including feature extraction, modality aggregation, and multimodal loss functions. Then, we discuss several learning paradigms such as supervised, semi-supervised, self-supervised, and transfer learning. We also introduce several practical challenges such as missing modalities and noisy modalities. Several applications and benchmarks on vision tasks are listed to help researchers gain a deeper understanding of progress in the field. Finally, we indicate that pretraining paradigm, unified multitask framework, missing and noisy modality, and multimodal task diversity could be the future trends and challenges in the deep vision multimodal learning field. Compared with existing surveys, this paper focuses on the most recent works and provides a thorough discussion of methodology, benchmarks, and future trends.
first_indexed	2024-03-09T22:07:02Z
format	Article
id	doaj.art-ed848531b85940b2bd30490c2740574d
institution	Directory of Open Access Journals
issn	2076-3417
language	English
last_indexed	2024-03-09T22:07:02Z
publishDate	2022-06-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-ed848531b85940b2bd30490c2740574d; 2023-11-23T19:39:26Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2022-06-01; vol. 12, no. 13, art. 6588; doi:10.3390/app12136588; Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend; Wenhao Chai, Gaoang Wang (both: Zhejiang University-University of Illinois at Urbana-Champaign Institute, Zhejiang University, Haining 314400, China); https://www.mdpi.com/2076-3417/12/13/6588; multimodal learning; computer vision; deep learning; introductory; survey
spellingShingle	Wenhao Chai Gaoang Wang Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend Applied Sciences multimodal learning computer vision deep learning introductory survey
title	Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend
title_sort	deep vision multimodal learning methodology benchmark and trend
topic	multimodal learning; computer vision; deep learning; introductory; survey
url	https://www.mdpi.com/2076-3417/12/13/6588
work_keys_str_mv	AT wenhaochai deepvisionmultimodallearningmethodologybenchmarkandtrend AT gaoangwang deepvisionmultimodallearningmethodologybenchmarkandtrend

Similar Items