Fusion of Multi-Modal Features to Enhance Dense Video Caption

Bibliographic Details
Main Authors: Xuefei Huang, Ka-Hou Chan, Weifan Wu, Hao Sheng, Wei Ke
Format: Article
Language: English
Published: MDPI AG 2023-06-01
Series: Sensors
Subjects: dense video caption; video captioning; multi-modal feature fusion; feature extraction; neural network
Online Access: https://www.mdpi.com/1424-8220/23/12/5565
collection DOAJ
description Dense video caption is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most of the existing methods only use visual features in the video and ignore the audio features that are also essential for understanding the video. In this paper, we propose a fusion model that combines the Transformer framework to integrate both visual and audio features in the video for captioning. We use multi-head attention to deal with the variations in sequence lengths between the models involved in our approach. We also introduce a Common Pool to store the generated features and align them with the time steps, thus filtering the information and eliminating redundancy based on the confidence scores. Moreover, we use LSTM as a decoder to generate the description sentences, which reduces the memory size of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
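The abstract notes that multi-head attention is used to bridge the differing sequence lengths of the visual and audio streams. As a rough, hypothetical illustration only (not the authors' code; it omits the learned query/key/value projections, and all names and shapes are assumptions), cross-attention with visual features as queries and audio features as keys/values aligns audio information to the visual time steps regardless of the two lengths:

```python
# Hypothetical sketch of cross-modal multi-head attention (simplified:
# no learned projection matrices, unlike a real Transformer layer).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, audio, n_heads=4):
    """visual: (T_v, d), audio: (T_a, d) with T_v != T_a allowed.
    Returns audio information re-aligned to the visual steps: (T_v, d)."""
    T_v, d = visual.shape
    assert d % n_heads == 0, "feature dim must split evenly across heads"
    d_h = d // n_heads
    out = np.empty_like(visual)
    for h in range(n_heads):
        q = visual[:, h * d_h:(h + 1) * d_h]      # queries: visual stream
        k = v = audio[:, h * d_h:(h + 1) * d_h]   # keys/values: audio stream
        attn = softmax(q @ k.T / np.sqrt(d_h))    # (T_v, T_a) alignment weights
        out[:, h * d_h:(h + 1) * d_h] = attn @ v  # weighted audio per visual step
    return out

rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.normal(size=(12, 32)),  # 12 visual steps
                              rng.normal(size=(7, 32)))   # 7 audio steps
print(fused.shape)  # (12, 32)
```

Because the attention matrix is (T_v, T_a), no padding or resampling of either modality is needed; the output always matches the query (visual) length, which is one common way such length mismatches are handled.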
id doaj.art-1ddb9e3ba9674d3ab83f179c055b2516
issn 1424-8220
doi 10.3390/s23125565
citation Sensors, Vol. 23, Iss. 12, Art. 5565 (2023-06-01)
affiliation Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China (all five authors)
topic dense video caption
video captioning
multi-modal feature fusion
feature extraction
neural network