Video Moment Localization Network Based on Text Multi-semantic Clues Guidance
With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, so efficient processing tools are needed for the vast amount of video data. The video moment localization task aims to locate the moment in an untrimmed video that best matches a text query...
Main Authors: | WU, G., XU, T. |
---|---|
Format: | Article |
Language: | English |
Published: | Stefan cel Mare University of Suceava, 2023-08-01 |
Series: | Advances in Electrical and Computer Engineering |
Subjects: | information retrieval; machine learning; computer vision; natural language processing; pattern matching |
Online Access: | http://dx.doi.org/10.4316/AECE.2023.03010 |
_version_ | 1797725128360984576 |
---|---|
author | WU, G. XU, T. |
author_facet | WU, G. XU, T. |
author_sort | WU, G. |
collection | DOAJ |
description | With the rapid development of the Internet and information technology, people can create multimedia data
such as pictures and videos anytime and anywhere, so efficient processing tools are needed for the vast
amount of video data. The video moment localization task aims to locate the moment in an untrimmed video
that best matches a text query. Existing text-guided methods consider only single-scale text features,
which cannot fully represent the semantics of the text, and they also ignore the way text information can
mask crucial information in the video when text is used to guide the extraction of video features. To solve
these problems, we propose a video moment localization network based on text multi-semantic clue guidance.
Specifically, we first design a text encoder based on a fusion gate that better captures the semantic
information in the text through multi-semantic clues composed of word embeddings, local features, and
global features. A text guidance module then uses these text semantic features to guide the extraction of
video features, highlighting the video features related to the text semantics. Experimental results on two
datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements
over state-of-the-art methods. |
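The abstract mentions a fusion-gate text encoder that blends multiple text feature streams (word embeddings, local features, global features). The paper's exact formulation is not reproduced in this record, but the general gated-fusion idea — a learned sigmoid gate that interpolates elementwise between two feature streams — can be sketched as follows (all function and parameter names here are hypothetical, not the authors'):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fusion_gate(local_feat, global_feat, w_local, w_global, bias):
    """Hypothetical sketch of a gated fusion of two feature vectors.

    For each dimension i:
        gate_i  = sigmoid(w_local_i * local_i + w_global_i * global_i + bias_i)
        fused_i = gate_i * local_i + (1 - gate_i) * global_i

    so the gate decides, per dimension, how much of the local stream
    versus the global stream survives in the fused representation.
    """
    fused = []
    for l, g, wl, wg, b in zip(local_feat, global_feat, w_local, w_global, bias):
        gate = sigmoid(wl * l + wg * g + b)
        fused.append(gate * l + (1.0 - gate) * g)
    return fused
```

With zero weights and zero bias the gate is 0.5 everywhere, so the fused vector is the elementwise average of the two streams; a large positive bias drives the gate toward 1 and the output toward the local stream. In the actual network the weights would be learned, and the same gating pattern would extend to three streams (word, local, global) rather than two.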
first_indexed | 2024-03-12T10:25:34Z |
format | Article |
id | doaj.art-5bd3e20ea4d746e18c35f8341e428e65 |
institution | Directory Open Access Journal |
issn | 1582-7445 1844-7600 |
language | English |
last_indexed | 2024-03-12T10:25:34Z |
publishDate | 2023-08-01 |
publisher | Stefan cel Mare University of Suceava |
record_format | Article |
series | Advances in Electrical and Computer Engineering |
spelling | doaj.art-5bd3e20ea4d746e18c35f8341e428e65 | 2023-09-02T09:41:56Z | eng | Stefan cel Mare University of Suceava | Advances in Electrical and Computer Engineering | 1582-7445, 1844-7600 | 2023-08-01 | vol. 23, no. 3, pp. 85-92 | 10.4316/AECE.2023.03010 | http://dx.doi.org/10.4316/AECE.2023.03010 | (title, authors, abstract, and subject terms as in the fields above) |
title | Video Moment Localization Network Based on Text Multi-semantic Clues Guidance |
topic | information retrieval machine learning computer vision natural language processing pattern matching |
url | http://dx.doi.org/10.4316/AECE.2023.03010 |