Video Moment Localization Network Based on Text Multi-semantic Clues Guidance

Bibliographic Details
Main Authors: WU, G.; XU, T.
Format: Article
Language: English
Published: Stefan cel Mare University of Suceava, 2023-08-01
Series: Advances in Electrical and Computer Engineering
Subjects: information retrieval; machine learning; computer vision; natural language processing; pattern matching
Online Access: http://dx.doi.org/10.4316/AECE.2023.03010
Description: With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, and efficient processing tools are needed for this vast amount of video data. The video moment localization task aims to locate the moment in an untrimmed video that best matches a text query. Existing text-guided methods consider only single-scale text features, which cannot fully represent the semantics of the text, and they ignore that text information can mask crucial video information when text guides the extraction of video features. To solve these problems, we propose a video moment localization network based on text multi-semantic clue guidance. Specifically, we first design a text encoder based on a fusion gate that better captures the semantic information in the text through multi-semantic clues composed of word embeddings, local features, and global features. A text guidance module then uses the text semantic features to guide the extraction of video features, highlighting the video features related to the text semantics. Experimental results on two datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements over state-of-the-art methods.
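The fusion-gate fusion of local and global text clues mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction from the abstract alone, not the authors' implementation: the gate formula, mean pooling as the global clue, and all names, shapes, and parameters here are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_gate(local_feats, global_feat, Wz, bz):
    """Gate-fuse per-word local features with a broadcast global feature.

    local_feats: (T, d) per-word (local) text features
    global_feat: (d,)   pooled sentence-level (global) feature
    Wz, bz: gate parameters mapping the (2d,) concatenation to a (d,) gate
    """
    T = local_feats.shape[0]
    g = np.tile(global_feat, (T, 1))       # broadcast global clue to every word
    z = sigmoid(np.concatenate([local_feats, g], axis=1) @ Wz + bz)  # (T, d) gate in (0, 1)
    return z * local_feats + (1.0 - z) * g # per-dimension convex mixture of the two clues

# Toy example with random, untrained parameters (hypothetical sizes).
rng = np.random.default_rng(0)
T, d = 6, 8                                # 6 query words, 8-dim features
word_embed = rng.standard_normal((T, d))   # stand-in word embeddings
local_feats = word_embed                   # stand-in for contextual local features
global_feat = local_feats.mean(axis=0)     # mean pooling as the global clue
Wz = rng.standard_normal((2 * d, d)) * 0.1
bz = np.zeros(d)
fused = fusion_gate(local_feats, global_feat, Wz, bz)
print(fused.shape)                         # (6, 8)
```

In a trained network, Wz and bz would be learned jointly with the rest of the model; the sigmoid gate keeps each fused dimension between its local and global inputs.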
ISSN: 1582-7445, 1844-7600
Citation: Advances in Electrical and Computer Engineering, vol. 23, no. 3, pp. 85-92, 2023-08-01. DOI: 10.4316/AECE.2023.03010