Vision-Text Cross-Modal Fusion for Accurate Video Captioning

Bibliographic Details
Main Authors: Kaouther Ouenniche, Ruxandra Tapu, Titus Zaharia
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning
Online Access: https://ieeexplore.ieee.org/document/10283847/
_version_ 1797649760751976448
author Kaouther Ouenniche
Ruxandra Tapu
Titus Zaharia
author_facet Kaouther Ouenniche
Ruxandra Tapu
Titus Zaharia
author_sort Kaouther Ouenniche
collection DOAJ
description In this paper, we introduce a novel end-to-end multimodal video captioning framework based on cross-modal fusion of visual and textual data. The proposed approach integrates a modality-attention module, which captures the visual-textual inter-modal relationships using cross-correlation. Further, we integrate temporal attention into the features obtained from a 3D CNN to learn the contextual information in the video using task-oriented training. In addition, we incorporate an auxiliary task that employs a contrastive loss function to enhance the model’s generalization capability and foster a deeper understanding of the inter-modal relationships and underlying semantics. The task compares the multimodal representation of the video and transcript with the caption representation, facilitating improved performance and knowledge transfer within the model. Finally, a transformer architecture captures and encodes the interdependencies between the text and video information using attention mechanisms. During decoding, the transformer attends to relevant elements of the encoded features, capturing long-range dependencies and ultimately generating semantically meaningful captions. The experimental evaluation, carried out on the MSRVTT benchmark, validates the proposed methodology, which achieves BLEU-4, ROUGE, and METEOR scores of 0.4408, 0.6291, and 0.3082, respectively. Compared to state-of-the-art methods, the proposed approach shows superior performance, with gains ranging from 1.21% to 1.52% across the three metrics considered.
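The abstract's two central mechanisms can be illustrated with a minimal, hypothetical PyTorch sketch: (a) cross-modal attention fusion of video and transcript features, and (b) a symmetric InfoNCE-style contrastive auxiliary loss between the fused video-transcript embedding and the caption embedding. This is not the authors' implementation; the module names, tensor shapes, pooling choice, and temperature value are illustrative assumptions.

```python
# Minimal sketch, not the authors' code. Illustrates cross-modal attention
# fusion and a symmetric contrastive (InfoNCE-style) auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Fuses video features with transcript features via cross-attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, Tv, D), e.g. temporally attended 3D-CNN features
        # text_feats:  (B, Tt, D), transcript token embeddings
        fused, _ = self.attn(query=video_feats, key=text_feats, value=text_feats)
        fused = self.norm(video_feats + fused)   # residual connection
        return fused.mean(dim=1)                 # (B, D) pooled multimodal embedding


def contrastive_loss(fused_emb: torch.Tensor,
                     caption_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between video-transcript and caption embeddings."""
    v = F.normalize(fused_emb, dim=-1)           # (B, D)
    c = F.normalize(caption_emb, dim=-1)         # (B, D)
    logits = v @ c.t() / temperature             # (B, B); diagonal holds positive pairs
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In training, such an auxiliary term would typically be added, with a weighting factor, to the cross-entropy objective of the transformer decoder that generates the captions.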
first_indexed 2024-03-11T15:50:40Z
format Article
id doaj.art-22e464e148304ed8b760d17ce19179ce
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-11T15:50:40Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-22e464e148304ed8b760d17ce19179ce 2023-10-25T23:01:02Z
published_in IEEE Access, vol. 11, pp. 115477-115492, 2023-01-01 (IEEE, ISSN 2169-3536)
doi 10.1109/ACCESS.2023.3324052
ieee_document 10283847
author_details Kaouther Ouenniche (https://orcid.org/0009-0008-3346-713X); Ruxandra Tapu (https://orcid.org/0000-0003-3170-4150); Titus Zaharia (https://orcid.org/0000-0002-6589-1241); all affiliated with Institut Polytechnique de Paris, Télécom SudParis, Laboratoire SAMOVAR, Evry, France
title Vision-Text Cross-Modal Fusion for Accurate Video Captioning
topic Multimodal video captioning
multimodal learning
cross correlation
transformers
contrastive learning
url https://ieeexplore.ieee.org/document/10283847/