Video Moment Localization Network Based on Text Multi-semantic Clues Guidance
With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, so efficient processing tools are needed for the vast amount of video data. The video moment localization task aims to locate the moment in an untrimmed video that best matches a text query...
Main Authors: | WU, G., XU, T. |
---|---|
Format: | Article |
Language: | English |
Published: | Stefan cel Mare University of Suceava, 2023-08-01 |
Series: | Advances in Electrical and Computer Engineering |
Subjects: | information retrieval; machine learning; computer vision; natural language processing; pattern matching |
Online Access: | http://dx.doi.org/10.4316/AECE.2023.03010 |
_version_ | 1797725128360984576 |
---|---|
author | WU, G. XU, T. |
author_facet | WU, G. XU, T. |
author_sort | WU, G. |
collection | DOAJ |
description | With the rapid development of the Internet and information technology, people can create multimedia data
such as pictures and videos anytime and anywhere, so efficient processing tools are needed for the vast
amount of video data. The video moment localization task aims to locate the moment in an untrimmed video
that best matches a text query. Existing text-guided methods consider only single-scale text features,
which cannot fully represent the semantics of the text, and they also ignore the way text information can
mask crucial information in the video when text is used to guide the extraction of video features. To solve
these problems, we propose a video moment localization network based on text multi-semantic clue guidance.
Specifically, we first design a text encoder based on a fusion gate that better captures the semantic
information in the text through multi-semantic clues composed of word embeddings, local features, and
global features. A text guidance module then uses these text semantic features to guide the extraction of
video features, highlighting the video features related to the text semantics. Experimental results on two
datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements
over state-of-the-art methods. |
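The abstract mentions a fusion-gate text encoder that blends multiple text feature streams (word embeddings, local features, global features). The paper's exact formulation is not reproduced in this record, but the general gated-fusion idea — a learned sigmoid gate that interpolates elementwise between two feature streams — can be sketched as follows (all function and parameter names here are hypothetical, not the authors'):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fusion_gate(local_feat, global_feat, w_local, w_global, bias):
    """Hypothetical sketch of a gated fusion of two feature vectors.

    For each dimension i:
        gate_i  = sigmoid(w_local_i * local_i + w_global_i * global_i + bias_i)
        fused_i = gate_i * local_i + (1 - gate_i) * global_i

    so the gate decides, per dimension, how much of the local stream
    versus the global stream survives in the fused representation.
    """
    fused = []
    for l, g, wl, wg, b in zip(local_feat, global_feat, w_local, w_global, bias):
        gate = sigmoid(wl * l + wg * g + b)
        fused.append(gate * l + (1.0 - gate) * g)
    return fused
```

With zero weights and zero bias the gate is 0.5 everywhere, so the fused vector is the elementwise average of the two streams; a large positive bias drives the gate toward 1 and the output toward the local stream. In the actual network the weights would be learned, and the same gating pattern would extend to three streams (word, local, global) rather than two.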
first_indexed | 2024-03-12T10:25:34Z |
format | Article |
id | doaj.art-5bd3e20ea4d746e18c35f8341e428e65 |
institution | Directory Open Access Journal |
issn | 1582-7445 1844-7600 |
language | English |
last_indexed | 2024-03-12T10:25:34Z |
publishDate | 2023-08-01 |
publisher | Stefan cel Mare University of Suceava |
record_format | Article |
series | Advances in Electrical and Computer Engineering |
spelling | doaj.art-5bd3e20ea4d746e18c35f8341e428e65 | 2023-09-02T09:41:56Z | eng | Stefan cel Mare University of Suceava | Advances in Electrical and Computer Engineering | 1582-7445, 1844-7600 | 2023-08-01 | vol. 23, no. 3, pp. 85-92 | 10.4316/AECE.2023.03010 | http://dx.doi.org/10.4316/AECE.2023.03010 | (title, authors, abstract, and subject terms as in the fields above) |
title | Video Moment Localization Network Based on Text Multi-semantic Clues Guidance |
topic | information retrieval machine learning computer vision natural language processing pattern matching |
url | http://dx.doi.org/10.4316/AECE.2023.03010 |