Vision-Text Cross-Modal Fusion for Accurate Video Captioning
In this paper, we introduce a novel end-to-end multimodal video captioning framework based on cross-modal fusion of visual and textual data. The proposed approach integrates a modality-attention module, which captures the visual-textual inter-modal relationships using cross-correlation. Further, we...
Main Authors: | Kaouther Ouenniche, Ruxandra Tapu, Titus Zaharia |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning |
Online Access: | https://ieeexplore.ieee.org/document/10283847/ |
_version_ | 1797649760751976448 |
author | Kaouther Ouenniche; Ruxandra Tapu; Titus Zaharia
author_facet | Kaouther Ouenniche; Ruxandra Tapu; Titus Zaharia
author_sort | Kaouther Ouenniche |
collection | DOAJ |
description | In this paper, we introduce a novel end-to-end multimodal video captioning framework based on cross-modal fusion of visual and textual data. The proposed approach integrates a modality-attention module, which captures the visual-textual inter-modal relationships using cross-correlation. Further, we integrate temporal attention into the features obtained from a 3D CNN to learn the contextual information in the video using task-oriented training. In addition, we incorporate an auxiliary task that employs a contrastive loss function to enhance the model’s generalization capability and foster a deeper understanding of the inter-modal relationships and underlying semantics. The task involves comparing the multimodal representation of the video-transcript with the caption representation, facilitating improved performance and knowledge transfer within the model. Finally, a transformer architecture is used to effectively capture and encode the interdependencies between the text and video information using attention mechanisms. During the decoding phase, the transformer allows the model to attend to relevant elements in the encoded features, effectively capturing long-range dependencies and ultimately generating semantically meaningful captions. The experimental evaluation, carried out on the MSR-VTT benchmark, validates the proposed methodology, which achieves BLEU-4, ROUGE, and METEOR scores of 0.4408, 0.6291, and 0.3082, respectively. When compared to the state-of-the-art methods, the proposed approach shows superior performance, with gains ranging from 1.21% to 1.52% across the three metrics considered. |
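The description above combines two standard building blocks: a cross-modal attention step in which one modality attends over the other, and a symmetric contrastive (InfoNCE-style) auxiliary loss that pulls paired video and caption embeddings together. The sketch below is a minimal NumPy illustration of these two ideas only, not the authors' implementation; the array shapes, temperature value, and function names are assumptions for the sake of the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, video_feats):
    """Text tokens attend to video frames via scaled dot-product attention.

    text_feats:  (T_text, d) textual token features
    video_feats: (T_video, d) visual frame features
    Returns a text-aligned video context of shape (T_text, d).
    """
    d = text_feats.shape[-1]
    scores = text_feats @ video_feats.T / np.sqrt(d)  # (T_text, T_video)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ video_feats

def info_nce(video_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Matched (video, caption) pairs sit on the diagonal of the
    similarity matrix; all other entries act as negatives.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    logits = v @ c.T / temperature                    # (B, B) cosine similarities
    idx = np.arange(len(v))
    # Cross-entropy on the diagonal, in both retrieval directions.
    log_p_v2c = np.log(softmax(logits, axis=1)[idx, idx])
    log_p_c2v = np.log(softmax(logits, axis=0)[idx, idx])
    return -0.5 * (log_p_v2c + log_p_c2v).mean()
```

In this sketch the contrastive term is computed per batch, so each caption in the batch serves as a negative for every non-matching video, which is the usual way such an auxiliary objective encourages the shared embedding space described in the abstract.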
first_indexed | 2024-03-11T15:50:40Z |
format | Article |
id | doaj.art-22e464e148304ed8b760d17ce19179ce |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-11T15:50:40Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-22e464e148304ed8b760d17ce19179ce (2023-10-25T23:01:02Z). Vision-Text Cross-Modal Fusion for Accurate Video Captioning. IEEE Access, vol. 11, pp. 115477-115492, 2023-01-01. ISSN 2169-3536. DOI: 10.1109/ACCESS.2023.3324052. IEEE document 10283847. Language: English. Authors: Kaouther Ouenniche (https://orcid.org/0009-0008-3346-713X), Ruxandra Tapu (https://orcid.org/0000-0003-3170-4150), Titus Zaharia (https://orcid.org/0000-0002-6589-1241), all with Institut Polytechnique de Paris, Télécom SudParis, Laboratoire SAMOVAR, Evry, France. Topics: Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning. URL: https://ieeexplore.ieee.org/document/10283847/. Abstract as given in the description field above. |
spellingShingle | Kaouther Ouenniche; Ruxandra Tapu; Titus Zaharia; Vision-Text Cross-Modal Fusion for Accurate Video Captioning; IEEE Access; Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning
title | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_full | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_fullStr | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_full_unstemmed | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_short | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_sort | vision text cross modal fusion for accurate video captioning |
topic | Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning
url | https://ieeexplore.ieee.org/document/10283847/ |
work_keys_str_mv | AT kaoutherouenniche visiontextcrossmodalfusionforaccuratevideocaptioning AT ruxandratapu visiontextcrossmodalfusionforaccuratevideocaptioning AT tituszaharia visiontextcrossmodalfusionforaccuratevideocaptioning |