Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature

Bibliographic Details
Main Authors: Xu, Yuecong, Yang, Jianfei, Mao, Kezhi
Other Authors: School of Electrical and Electronic Engineering
Format: Journal Article
Language: English
Published: 2021
Subjects:
Online Access: https://hdl.handle.net/10356/151341
_version_ 1811696729304596480
author Xu, Yuecong
Yang, Jianfei
Mao, Kezhi
author2 School of Electrical and Electronic Engineering
author_facet School of Electrical and Electronic Engineering
Xu, Yuecong
Yang, Jianfei
Mao, Kezhi
author_sort Xu, Yuecong
collection NTU
description Automatic video description, or video captioning, is a challenging yet highly attractive task that aims to bridge video and text. Multiple neural-network-based methods have been proposed, utilizing Convolutional Neural Networks (CNN) to extract features and Recurrent Neural Networks (RNN) to encode and decode videos to generate descriptions. Previously, a number of methods used for the video captioning task were motivated by image captioning approaches. However, videos carry much more information than images, which increases the difficulty of the video captioning task. Current methods commonly lack the ability to utilize the additional information provided by videos, especially their semantic and structural information. To address this shortcoming, we propose a Semantic-Filtered Soft-Split-Aware-Gated LSTM (SF-SSAG-LSTM) model, which improves video captioning quality by combining semantic concepts with audio-augmented features extracted from input videos while understanding the underlying structure of the input videos. In the experiments, we quantitatively evaluate the performance of our model, which matches other prominent methods on three benchmark datasets. We also qualitatively examine the results of our model and show that the generated descriptions are more detailed and logical.
first_indexed 2024-10-01T07:43:59Z
format Journal Article
id ntu-10356/151341
institution Nanyang Technological University
language English
last_indexed 2024-10-01T07:43:59Z
publishDate 2021
record_format dspace
spelling ntu-10356/151341 2021-07-09T01:29:56Z Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature Xu, Yuecong Yang, Jianfei Mao, Kezhi School of Electrical and Electronic Engineering Engineering::Electrical and electronic engineering Video Captioning Long Short-term Memory Automatic video description, or video captioning, is a challenging yet highly attractive task that aims to bridge video and text. Multiple neural-network-based methods have been proposed, utilizing Convolutional Neural Networks (CNN) to extract features and Recurrent Neural Networks (RNN) to encode and decode videos to generate descriptions. Previously, a number of methods used for the video captioning task were motivated by image captioning approaches. However, videos carry much more information than images, which increases the difficulty of the video captioning task. Current methods commonly lack the ability to utilize the additional information provided by videos, especially their semantic and structural information. To address this shortcoming, we propose a Semantic-Filtered Soft-Split-Aware-Gated LSTM (SF-SSAG-LSTM) model, which improves video captioning quality by combining semantic concepts with audio-augmented features extracted from input videos while understanding the underlying structure of the input videos. In the experiments, we quantitatively evaluate the performance of our model, which matches other prominent methods on three benchmark datasets. We also qualitatively examine the results of our model and show that the generated descriptions are more detailed and logical. 2021-07-09T01:29:56Z 2021-07-09T01:29:56Z 2019 Journal Article Xu, Y., Yang, J. & Mao, K. (2019). Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature. Neurocomputing, 357, 24-35. https://dx.doi.org/10.1016/j.neucom.2019.05.027 0925-2312 0000-0002-8075-0439 https://hdl.handle.net/10356/151341 10.1016/j.neucom.2019.05.027 2-s2.0-85065823631 357 24 35 en Neurocomputing © 2019 Elsevier B.V. All rights reserved.
spellingShingle Engineering::Electrical and electronic engineering
Video Captioning
Long Short-term Memory
Xu, Yuecong
Yang, Jianfei
Mao, Kezhi
Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
title Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
title_full Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
title_fullStr Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
title_full_unstemmed Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
title_short Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
title_sort semantic filtered soft split aware video captioning with audio augmented feature
topic Engineering::Electrical and electronic engineering
Video Captioning
Long Short-term Memory
url https://hdl.handle.net/10356/151341
work_keys_str_mv AT xuyuecong semanticfilteredsoftsplitawarevideocaptioningwithaudioaugmentedfeature
AT yangjianfei semanticfilteredsoftsplitawarevideocaptioningwithaudioaugmentedfeature
AT maokezhi semanticfilteredsoftsplitawarevideocaptioningwithaudioaugmentedfeature
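
The description field above outlines the general pipeline the paper builds on: CNN-extracted visual features, an audio-augmented video representation, and an LSTM decoder that generates the caption. The sketch below is a minimal, generic illustration of that CNN-feature + LSTM captioning pattern, assuming PyTorch, precomputed per-frame CNN features, and a clip-level audio feature vector. It is not the authors' SF-SSAG-LSTM: the semantic filtering and soft-split-aware gating are omitted, and all class names, dimensions, and the simple mean-pooled fusion are illustrative assumptions.

```python
# Minimal, generic sketch of a CNN-feature + LSTM video captioner.
# NOT the SF-SSAG-LSTM from the paper; names and dimensions are assumptions.
import torch
import torch.nn as nn


class SimpleVideoCaptioner(nn.Module):
    def __init__(self, visual_dim=2048, audio_dim=128,
                 embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Fuse mean-pooled CNN frame features with the audio feature vector.
        self.fuse = nn.Linear(visual_dim + audio_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, audio_feats, captions):
        # frame_feats: (B, T, visual_dim) per-frame CNN features (precomputed)
        # audio_feats: (B, audio_dim) clip-level audio features (precomputed)
        # captions:    (B, L) token ids of the ground-truth caption
        video_vec = frame_feats.mean(dim=1)                     # temporal mean pooling
        fused = torch.tanh(self.fuse(torch.cat([video_vec, audio_feats], dim=-1)))
        # Initialise the LSTM state with the fused audio-visual representation.
        h0 = fused.unsqueeze(0)                                 # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                              # (B, L, embed_dim)
        hidden, _ = self.decoder(emb, (h0, c0))
        return self.out(hidden)                                 # (B, L, vocab_size) logits


if __name__ == "__main__":
    model = SimpleVideoCaptioner()
    frames = torch.randn(2, 16, 2048)      # 2 clips, 16 frames of CNN features each
    audio = torch.randn(2, 128)            # 2 clip-level audio feature vectors
    tokens = torch.randint(0, 10000, (2, 12))
    logits = model(frames, audio, tokens)
    print(logits.shape)                    # torch.Size([2, 12, 10000])
```

In this simplified form, the fused audio-visual vector only initialises the decoder state; the paper's contribution lies in how semantic concepts filter the inputs and how soft-split-aware gating exposes the video's segment structure to the LSTM, neither of which is reproduced here.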