Fusion of Multi-Modal Features to Enhance Dense Video Caption

Bibliographic Details
Main Authors: Xuefei Huang, Ka-Hou Chan, Weifan Wu, Hao Sheng, Wei Ke
Format: Article
Language: English
Published: MDPI AG 2023-06-01
Series: Sensors
Subjects: dense video caption; video captioning; multi-modal feature fusion; feature extraction; neural network
Online Access: https://www.mdpi.com/1424-8220/23/12/5565
collection DOAJ
description Dense video caption is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most of the existing methods only use visual features in the video and ignore the audio features that are also essential for understanding the video. In this paper, we propose a fusion model that combines the Transformer framework to integrate both visual and audio features in the video for captioning. We use multi-head attention to deal with the variations in sequence lengths between the models involved in our approach. We also introduce a Common Pool to store the generated features and align them with the time steps, thus filtering the information and eliminating redundancy based on the confidence scores. Moreover, we use LSTM as a decoder to generate the description sentences, which reduces the memory size of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
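The abstract notes that multi-head attention is used to bridge the differing sequence lengths of the visual and audio streams. As a rough, hypothetical illustration only (not the authors' code; it omits the learned query/key/value projections, and all names and shapes are assumptions), cross-attention with visual features as queries and audio features as keys/values aligns audio information to the visual time steps regardless of the two lengths:

```python
# Hypothetical sketch of cross-modal multi-head attention (simplified:
# no learned projection matrices, unlike a real Transformer layer).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, audio, n_heads=4):
    """visual: (T_v, d), audio: (T_a, d) with T_v != T_a allowed.
    Returns audio information re-aligned to the visual steps: (T_v, d)."""
    T_v, d = visual.shape
    assert d % n_heads == 0, "feature dim must split evenly across heads"
    d_h = d // n_heads
    out = np.empty_like(visual)
    for h in range(n_heads):
        q = visual[:, h * d_h:(h + 1) * d_h]      # queries: visual stream
        k = v = audio[:, h * d_h:(h + 1) * d_h]   # keys/values: audio stream
        attn = softmax(q @ k.T / np.sqrt(d_h))    # (T_v, T_a) alignment weights
        out[:, h * d_h:(h + 1) * d_h] = attn @ v  # weighted audio per visual step
    return out

rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.normal(size=(12, 32)),  # 12 visual steps
                              rng.normal(size=(7, 32)))   # 7 audio steps
print(fused.shape)  # (12, 32)
```

Because the attention matrix is (T_v, T_a), no padding or resampling of either modality is needed; the output always matches the query (visual) length, which is one common way such length mismatches are handled.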
id doaj.art-1ddb9e3ba9674d3ab83f179c055b2516
issn 1424-8220
doi 10.3390/s23125565
citation Sensors, Vol. 23, Iss. 12, Art. 5565 (2023-06-01)
affiliation Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China (all five authors)
topic dense video caption
video captioning
multi-modal feature fusion
feature extraction
neural network