Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend
Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the fast development of deep learning, vision multimodal learning has gained much interest from the community. This paper revi...
Main Authors: | Wenhao Chai, Gaoang Wang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2022-06-01 |
Series: | Applied Sciences |
Subjects: | multimodal learning, computer vision, deep learning, introductory, survey |
Online Access: | https://www.mdpi.com/2076-3417/12/13/6588 |
_version_ | 1797480914679234560 |
---|---|
author | Wenhao Chai Gaoang Wang |
author_facet | Wenhao Chai Gaoang Wang |
author_sort | Wenhao Chai |
collection | DOAJ |
description | Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the fast development of deep learning, vision multimodal learning has gained much interest from the community. This paper reviews the types of architectures used in multimodal learning, including feature extraction, modality aggregation, and multimodal loss functions. Then, we discuss several learning paradigms such as supervised, semi-supervised, self-supervised, and transfer learning. We also introduce several practical challenges such as missing modalities and noisy modalities. Several applications and benchmarks on vision tasks are listed to help researchers gain a deeper understanding of progress in the field. Finally, we indicate that pretraining paradigm, unified multitask framework, missing and noisy modality, and multimodal task diversity could be the future trends and challenges in the deep vision multimodal learning field. Compared with existing surveys, this paper focuses on the most recent works and provides a thorough discussion of methodology, benchmarks, and future trends. |
first_indexed | 2024-03-09T22:07:02Z |
format | Article |
id | doaj.art-ed848531b85940b2bd30490c2740574d |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-09T22:07:02Z |
publishDate | 2022-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-ed848531b85940b2bd30490c2740574d; 2023-11-23T19:39:26Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2022-06-01; 12; 13; 6588; 10.3390/app12136588; Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend; Wenhao Chai (0); Gaoang Wang (1); (0) Zhejiang University-University of Illinois at Urbana-Champaign Institute, Zhejiang University, Haining 314400, China; (1) Zhejiang University-University of Illinois at Urbana-Champaign Institute, Zhejiang University, Haining 314400, China; Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the fast development of deep learning, vision multimodal learning has gained much interest from the community. This paper reviews the types of architectures used in multimodal learning, including feature extraction, modality aggregation, and multimodal loss functions. Then, we discuss several learning paradigms such as supervised, semi-supervised, self-supervised, and transfer learning. We also introduce several practical challenges such as missing modalities and noisy modalities. Several applications and benchmarks on vision tasks are listed to help researchers gain a deeper understanding of progress in the field. Finally, we indicate that pretraining paradigm, unified multitask framework, missing and noisy modality, and multimodal task diversity could be the future trends and challenges in the deep vision multimodal learning field. Compared with existing surveys, this paper focuses on the most recent works and provides a thorough discussion of methodology, benchmarks, and future trends.; https://www.mdpi.com/2076-3417/12/13/6588; multimodal learning; computer vision; deep learning; introductory; survey |
spellingShingle | Wenhao Chai Gaoang Wang Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend Applied Sciences multimodal learning computer vision deep learning introductory survey |
title | Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend |
title_full | Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend |
title_fullStr | Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend |
title_full_unstemmed | Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend |
title_short | Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend |
title_sort | deep vision multimodal learning methodology benchmark and trend |
topic | multimodal learning, computer vision, deep learning, introductory, survey |
url | https://www.mdpi.com/2076-3417/12/13/6588 |
work_keys_str_mv | AT wenhaochai deepvisionmultimodallearningmethodologybenchmarkandtrend AT gaoangwang deepvisionmultimodallearningmethodologybenchmarkandtrend |