DM²S²: Deep Multimodal Sequence Sets With Hierarchical Modality Attention

There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems with the mid-fusion model structure arise, such as an increase in the dimensionality of the concatenated multimodal features and missing modalities. To address these problems, we propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM²S²). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b) intra-modality residual attention (IntraMRA) to capture the importance of the elements in a modality, and (c) inter-modality residual attention (InterMRA) to further enhance the importance of elements with modality-level granularity. Our concept exhibits performance that is comparable to or better than the previous set-aware models. Furthermore, we demonstrate that the visualization of the learned InterMRA and IntraMRA weights can provide an interpretation of the prediction results.

Bibliographic Details
Main Authors: Shunsuke Kitada, Yuki Iwazaki, Riku Togashi, Hitoshi Iyatomi
Format: Article
Language:English
Published: IEEE 2022-01-01
Series:IEEE Access
Subjects: Attention mechanism; deep neural networks; multimodal learning
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3221812
Online Access: https://ieeexplore.ieee.org/document/9947020/
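The abstract above describes the architecture only at a high level. For readers who want a concrete picture of the hierarchical attention idea (element-level IntraMRA within each modality, followed by modality-level InterMRA over a set of encoded sequences), the following PyTorch sketch is a minimal, hypothetical rendering: the class names, feature dimensions, mean-pooled modality summaries, scoring layers, and task head are all assumptions made for illustration and are not taken from the paper's implementation.

```python
# Hypothetical sketch of hierarchical modality attention (IntraMRA + InterMRA).
# Shapes, pooling choices, and the residual formulation are illustrative assumptions.
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    """Scores each element of a sequence, reweights it, and keeps a residual path."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_elements, dim)
        weights = torch.softmax(self.score(x).squeeze(-1), dim=-1)  # (batch, n_elements)
        attended = x * weights.unsqueeze(-1)                        # reweighted elements
        return x + attended                                         # residual connection


class HierarchicalModalityAttention(nn.Module):
    """IntraMRA within each modality, then InterMRA across modality-level summaries."""

    def __init__(self, dim: int, n_modalities: int):
        super().__init__()
        self.intra = nn.ModuleList([ResidualAttention(dim) for _ in range(n_modalities)])
        self.inter = ResidualAttention(dim)
        self.head = nn.Linear(dim, 1)  # hypothetical task head (e.g. a relevance score)

    def forward(self, modalities):
        # modalities: list of (batch, n_elements_m, dim) tensors, e.g. sequences
        # produced by a BERT-based encoder for text, serialized image tags, etc.
        summaries = [
            self.intra[m](x).mean(dim=1)          # (batch, dim) summary per modality
            for m, x in enumerate(modalities)
        ]
        stacked = torch.stack(summaries, dim=1)   # (batch, n_modalities, dim)
        fused = self.inter(stacked).mean(dim=1)   # (batch, dim) set-level representation
        return self.head(fused)


# Toy usage with random features standing in for encoder outputs.
model = HierarchicalModalityAttention(dim=768, n_modalities=3)
text = torch.randn(2, 32, 768)
image_tags = torch.randn(2, 10, 768)
metadata = torch.randn(2, 5, 768)
prediction = model([text, image_tags, metadata])  # shape: (2, 1)
```

In this sketch a modality could simply be omitted from the input list, loosely mirroring the robustness to missing modalities that motivates the set-of-sequences view in the abstract; the actual mechanism used in the paper may differ.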