Level-wise aligned dual networks for text–video retrieval

Abstract The vast number of videos on the Internet makes efficient and accurate text–video retrieval increasingly important. Current methods leverage a high-dimensional space to align video and text for this task. However, a high-dimensional space cannot fully use different levels of inf...

Bibliographic Details
Main Authors: Qiubin Lin, Wenming Cao, Zhiquan He
Format: Article
Language: English
Published: SpringerOpen 2022-07-01
Series: EURASIP Journal on Advances in Signal Processing
Subjects: Text–video retrieval; Level-wise aligned mechanism; Semantic space; Latent space
Online Access: https://doi.org/10.1186/s13634-022-00887-y
_version_ 1818519170010054656
author Qiubin Lin
Wenming Cao
Zhiquan He
author_facet Qiubin Lin
Wenming Cao
Zhiquan He
author_sort Qiubin Lin
collection DOAJ
description Abstract The vast number of videos on the Internet makes efficient and accurate text–video retrieval increasingly important. Current methods leverage a high-dimensional space to align video and text for this task. However, a high-dimensional space cannot fully use different levels of information in videos and text. In this paper, we put forward a method called level-wise aligned dual networks (LADN) for text–video retrieval. LADN uses four common latent spaces to improve the performance of text–video retrieval and utilizes a semantic concept space to increase the interpretability of the model. Specifically, LADN first extracts different levels of information, including global, local, temporal, and spatial–temporal information, from videos and text. These are then mapped into four different latent spaces and one semantic space. Finally, LADN aligns the different levels of information in the various spaces. Extensive experiments conducted on three widely used datasets (MSR-VTT, VATEX, and TRECVID AVS 2016–2018) demonstrate that our proposed approach is superior to several state-of-the-art text–video retrieval approaches.
first_indexed 2024-12-11T01:20:25Z
format Article
id doaj.art-0b6bb2d6193541109ce80d6ff193ce82
institution Directory Open Access Journal
issn 1687-6180
language English
last_indexed 2024-12-11T01:20:25Z
publishDate 2022-07-01
publisher SpringerOpen
record_format Article
series EURASIP Journal on Advances in Signal Processing
spelling doaj.art-0b6bb2d6193541109ce80d6ff193ce82 | 2022-12-22T01:25:43Z | eng | SpringerOpen | EURASIP Journal on Advances in Signal Processing | 1687-6180 | 2022-07-01 | 20221120 | 10.1186/s13634-022-00887-y | Level-wise aligned dual networks for text–video retrieval | Qiubin Lin 0 | Wenming Cao 1 | Zhiquan He 2 | College of Electronics and Information Engineering, Shenzhen University | College of Electronics and Information Engineering, Shenzhen University | College of Electronics and Information Engineering, Shenzhen University | Abstract The vast number of videos on the Internet makes efficient and accurate text–video retrieval increasingly important. Current methods leverage a high-dimensional space to align video and text for this task. However, a high-dimensional space cannot fully use different levels of information in videos and text. In this paper, we put forward a method called level-wise aligned dual networks (LADN) for text–video retrieval. LADN uses four common latent spaces to improve the performance of text–video retrieval and utilizes a semantic concept space to increase the interpretability of the model. Specifically, LADN first extracts different levels of information, including global, local, temporal, and spatial–temporal information, from videos and text. These are then mapped into four different latent spaces and one semantic space. Finally, LADN aligns the different levels of information in the various spaces. Extensive experiments conducted on three widely used datasets (MSR-VTT, VATEX, and TRECVID AVS 2016–2018) demonstrate that our proposed approach is superior to several state-of-the-art text–video retrieval approaches. | https://doi.org/10.1186/s13634-022-00887-y | Text–video retrieval | Level-wise aligned mechanism | Semantic space | Latent space
spellingShingle Qiubin Lin
Wenming Cao
Zhiquan He
Level-wise aligned dual networks for text–video retrieval
EURASIP Journal on Advances in Signal Processing
Text–video retrieval
Level-wise aligned mechanism
Semantic space
Latent space
title Level-wise aligned dual networks for text–video retrieval
title_full Level-wise aligned dual networks for text–video retrieval
title_fullStr Level-wise aligned dual networks for text–video retrieval
title_full_unstemmed Level-wise aligned dual networks for text–video retrieval
title_short Level-wise aligned dual networks for text–video retrieval
title_sort level wise aligned dual networks for text video retrieval
topic Text–video retrieval
Level-wise aligned mechanism
Semantic space
Latent space
url https://doi.org/10.1186/s13634-022-00887-y
work_keys_str_mv AT qiubinlin levelwisealigneddualnetworksfortextvideoretrieval
AT wenmingcao levelwisealigneddualnetworksfortextvideoretrieval
AT zhiquanhe levelwisealigneddualnetworksfortextvideoretrieval