Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the fast development of deep learning, vision multimodal learning has gained much interest from the community. This paper revi...

Bibliographic Details
Main Authors: Wenhao Chai, Gaoang Wang
Format: Article
Language: English
Published: MDPI AG 2022-06-01
Series: Applied Sciences
Subjects: multimodal learning, computer vision, deep learning, introductory, survey
Online Access: https://www.mdpi.com/2076-3417/12/13/6588
author Wenhao Chai
Gaoang Wang
collection DOAJ
description Deep vision multimodal learning aims to combine deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the rapid development of deep learning, vision multimodal learning has attracted considerable interest from the community. This paper reviews the types of architectures used in multimodal learning, including feature extraction, modality aggregation, and multimodal loss functions. We then discuss several learning paradigms, such as supervised, semi-supervised, self-supervised, and transfer learning, and introduce practical challenges such as missing and noisy modalities. Applications and benchmarks on vision tasks are listed to help researchers gain a deeper understanding of progress in the field. Finally, we suggest that pretraining paradigms, unified multitask frameworks, missing and noisy modalities, and multimodal task diversity are likely future trends and challenges in deep vision multimodal learning. Compared with existing surveys, this paper focuses on the most recent works and provides a thorough discussion of methodology, benchmarks, and future trends.
first_indexed 2024-03-09T22:07:02Z
format Article
id doaj.art-ed848531b85940b2bd30490c2740574d
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-09T22:07:02Z
publishDate 2022-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-ed848531b85940b2bd30490c2740574d
2023-11-23T19:39:26Z
eng
MDPI AG
Applied Sciences
2076-3417
2022-06-01
Volume 12, Issue 13, Article 6588
10.3390/app12136588
Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend
Wenhao Chai (Zhejiang University-University of Illinois at Urbana-Champaign Institute, Zhejiang University, Haining 314400, China)
Gaoang Wang (Zhejiang University-University of Illinois at Urbana-Champaign Institute, Zhejiang University, Haining 314400, China)
https://www.mdpi.com/2076-3417/12/13/6588
multimodal learning; computer vision; deep learning; introductory; survey
title Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend
topic multimodal learning
computer vision
deep learning
introductory
survey
url https://www.mdpi.com/2076-3417/12/13/6588