Vision-Text Cross-Modal Fusion for Accurate Video Captioning
In this paper, we introduce a novel end-to-end multimodal video captioning framework based on cross-modal fusion of visual and textual data. The proposed approach integrates a modality-attention module, which captures the visual-textual inter-modal relationships using cross-correlation. Further, we...
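
The abstract is truncated here and does not specify the module's design, so the following is only a minimal sketch of one common way to build attention from a visual-textual cross-correlation, not the authors' implementation. It assumes per-frame visual features and per-token text features that share an embedding size, takes "cross-correlation" to mean the normalized inner product between the two sets, and fuses by concatenation. The function name `cross_modal_attention` and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_modal_attention(visual, text, eps=1e-8):
    """Fuse visual and textual features via a cross-correlation matrix.

    visual: (T, d) array, one feature vector per video frame (assumed).
    text:   (N, d) array, one feature vector per text token (assumed).
    Returns the fused (T, 2d) features and the (T, N) attention weights.
    """
    # L2-normalize each row so the inner product acts as a correlation score.
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + eps)
    t = text / (np.linalg.norm(text, axis=1, keepdims=True) + eps)

    # Cross-correlation between every frame and every token: shape (T, N).
    corr = v @ t.T

    # Turn the correlations into attention weights over tokens for each frame.
    weights = softmax(corr, axis=1)

    # Text context attended by each frame: shape (T, d).
    context = weights @ text

    # Assumed fusion step: concatenate frame features with their text context.
    fused = np.concatenate([visual, context], axis=1)
    return fused, weights
```

Calling `cross_modal_attention(visual, text)` with a `(4, 8)` visual array and a `(3, 8)` text array would return fused features of shape `(4, 16)` and a `(4, 3)` weight matrix whose rows each sum to 1.
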
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
Subjects: | |
Online Access: | https://ieeexplore.ieee.org/document/10283847/ |