Fusion of Multi-Modal Features to Enhance Dense Video Caption
Dense video captioning is a task that aims to help computers analyze video content by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of a video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both the visual and audio features of a video for captioning. We use multi-head attention to handle the differing sequence lengths produced by the models involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory footprint of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
Main Authors: | Xuefei Huang, Ka-Hou Chan, Weifan Wu, Hao Sheng, Wei Ke |
---|---|
Affiliation: | Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-06-01 |
Series: | Sensors, Vol. 23, No. 12, Article 5565 |
ISSN: | 1424-8220 |
DOI: | 10.3390/s23125565 |
Subjects: | dense video caption; video captioning; multi-modal feature fusion; feature extraction; neural network |
Online Access: | https://www.mdpi.com/1424-8220/23/12/5565 |
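The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of how such a pipeline might be wired together: audio features attend over visual features via cross-modal multi-head attention (accommodating the two streams' different sequence lengths), a learned confidence gate stands in for the paper's Common Pool filtering, and an LSTM decodes the caption tokens. All module names, dimensions, and the gating rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the pipeline the abstract describes; dimensions,
# names, and the confidence-gating rule are assumptions for illustration.
import torch
import torch.nn as nn


class MultiModalCaptioner(nn.Module):
    def __init__(self, vis_dim=1024, aud_dim=128, d_model=512,
                 vocab_size=10000, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        # Cross-modal multi-head attention: audio queries attend to visual
        # keys/values, so the two streams may have different lengths.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads,
                                                batch_first=True)
        self.confidence = nn.Linear(d_model, 1)  # per-time-step confidence
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats, aud_feats, captions):
        # vis_feats: (B, T_v, vis_dim); aud_feats: (B, T_a, aud_dim)
        v = self.vis_proj(vis_feats)
        a = self.aud_proj(aud_feats)
        fused, _ = self.cross_attn(query=a, key=v, value=v)  # (B, T_a, d)
        # Soft confidence gating: down-weight low-confidence time steps
        # (a stand-in for the Common Pool's redundancy filtering).
        conf = torch.sigmoid(self.confidence(fused))         # (B, T_a, 1)
        fused = fused * conf
        # Condition the LSTM decoder on the pooled fused representation.
        ctx = fused.mean(dim=1, keepdim=True)                # (B, 1, d)
        tok = self.embed(captions) + ctx                     # (B, L, d)
        hidden, _ = self.decoder(tok)
        return self.out(hidden)                              # (B, L, vocab)


# Example usage with random tensors of differing sequence lengths:
model = MultiModalCaptioner()
logits = model(torch.randn(2, 40, 1024),          # 40 visual frames
               torch.randn(2, 25, 128),           # 25 audio segments
               torch.randint(0, 10000, (2, 12)))  # 12 caption tokens
print(logits.shape)  # torch.Size([2, 12, 10000])
```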