Vision-Text Cross-Modal Fusion for Accurate Video Captioning
In this paper, we introduce a novel end-to-end multimodal video captioning framework based on cross-modal fusion of visual and textual data. The proposed approach integrates a modality-attention module, which captures the visual-textual inter-modal relationships using cross-correlation. Further, we...
Main Authors: | Kaouther Ouenniche, Ruxandra Tapu, Titus Zaharia |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning |
Online Access: | https://ieeexplore.ieee.org/document/10283847/ |
_version_ | 1797649760751976448 |
author | Kaouther Ouenniche; Ruxandra Tapu; Titus Zaharia
author_facet | Kaouther Ouenniche; Ruxandra Tapu; Titus Zaharia
author_sort | Kaouther Ouenniche |
collection | DOAJ |
description | In this paper, we introduce a novel end-to-end multimodal video captioning framework based on cross-modal fusion of visual and textual data. The proposed approach integrates a modality-attention module, which captures the visual-textual inter-modal relationships using cross-correlation. Further, we integrate temporal attention into the features obtained from a 3D CNN to learn the contextual information in the video using task-oriented training. In addition, we incorporate an auxiliary task that employs a contrastive loss function to enhance the model’s generalization capability and foster a deeper understanding of the inter-modal relationships and underlying semantics. The task involves comparing the multimodal representation of the video-transcript with the caption representation, facilitating improved performance and knowledge transfer within the model. Finally, a transformer architecture is used to effectively capture and encode the interdependencies between the text and video information using attention mechanisms. During the decoding phase, the transformer allows the model to attend to relevant elements in the encoded features, effectively capturing long-range dependencies and ultimately generating semantically meaningful captions. The experimental evaluation, carried out on the MSR-VTT benchmark, validates the proposed methodology, which achieves BLEU-4, ROUGE, and METEOR scores of 0.4408, 0.6291, and 0.3082, respectively. When compared to the state-of-the-art methods, the proposed approach shows superior performance, with gains ranging from 1.21% to 1.52% across the three metrics considered. |
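The description above combines two standard building blocks: a cross-modal attention step in which one modality attends over the other, and a symmetric contrastive (InfoNCE-style) auxiliary loss that pulls paired video and caption embeddings together. The sketch below is a minimal NumPy illustration of these two ideas only, not the authors' implementation; the array shapes, temperature value, and function names are assumptions for the sake of the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, video_feats):
    """Text tokens attend to video frames via scaled dot-product attention.

    text_feats:  (T_text, d) textual token features
    video_feats: (T_video, d) visual frame features
    Returns a text-aligned video context of shape (T_text, d).
    """
    d = text_feats.shape[-1]
    scores = text_feats @ video_feats.T / np.sqrt(d)  # (T_text, T_video)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ video_feats

def info_nce(video_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Matched (video, caption) pairs sit on the diagonal of the
    similarity matrix; all other entries act as negatives.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    logits = v @ c.T / temperature                    # (B, B) cosine similarities
    idx = np.arange(len(v))
    # Cross-entropy on the diagonal, in both retrieval directions.
    log_p_v2c = np.log(softmax(logits, axis=1)[idx, idx])
    log_p_c2v = np.log(softmax(logits, axis=0)[idx, idx])
    return -0.5 * (log_p_v2c + log_p_c2v).mean()
```

In this sketch the contrastive term is computed per batch, so each caption in the batch serves as a negative for every non-matching video, which is the usual way such an auxiliary objective encourages the shared embedding space described in the abstract.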
first_indexed | 2024-03-11T15:50:40Z |
format | Article |
id | doaj.art-22e464e148304ed8b760d17ce19179ce |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-11T15:50:40Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-22e464e148304ed8b760d17ce19179ce (2023-10-25T23:01:02Z). Vision-Text Cross-Modal Fusion for Accurate Video Captioning. IEEE Access, vol. 11, pp. 115477-115492, 2023-01-01. ISSN 2169-3536. DOI: 10.1109/ACCESS.2023.3324052. IEEE document 10283847. Language: English. Authors: Kaouther Ouenniche (https://orcid.org/0009-0008-3346-713X), Ruxandra Tapu (https://orcid.org/0000-0003-3170-4150), Titus Zaharia (https://orcid.org/0000-0002-6589-1241), all with Institut Polytechnique de Paris, Télécom SudParis, Laboratoire SAMOVAR, Evry, France. Topics: Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning. URL: https://ieeexplore.ieee.org/document/10283847/. Abstract as given in the description field above. |
spellingShingle | Kaouther Ouenniche; Ruxandra Tapu; Titus Zaharia; Vision-Text Cross-Modal Fusion for Accurate Video Captioning; IEEE Access; Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning
title | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_full | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_fullStr | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_full_unstemmed | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_short | Vision-Text Cross-Modal Fusion for Accurate Video Captioning |
title_sort | vision text cross modal fusion for accurate video captioning |
topic | Multimodal video captioning; multimodal learning; cross correlation; transformers; contrastive learning
url | https://ieeexplore.ieee.org/document/10283847/ |
work_keys_str_mv | AT kaoutherouenniche visiontextcrossmodalfusionforaccuratevideocaptioning AT ruxandratapu visiontextcrossmodalfusionforaccuratevideocaptioning AT tituszaharia visiontextcrossmodalfusionforaccuratevideocaptioning |