Level-wise aligned dual networks for text–video retrieval
Abstract The vast amount of videos on the Internet makes efficient and accurate text–video retrieval tasks increasingly important. The current methods leverage a high-dimensional space to align video and text for these tasks. However, a high-dimensional space cannot fully use different levels of inf...
Main Authors: | Qiubin Lin, Wenming Cao, Zhiquan He |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2022-07-01 |
Series: | EURASIP Journal on Advances in Signal Processing |
Subjects: | Text–video retrieval; Level-wise aligned mechanism; Semantic space; Latent space |
Online Access: | https://doi.org/10.1186/s13634-022-00887-y |
_version_ | 1818519170010054656 |
---|---|
author | Qiubin Lin; Wenming Cao; Zhiquan He |
author_facet | Qiubin Lin; Wenming Cao; Zhiquan He |
author_sort | Qiubin Lin |
collection | DOAJ |
description | Abstract The vast amount of videos on the Internet makes efficient and accurate text–video retrieval tasks increasingly important. The current methods leverage a high-dimensional space to align video and text for these tasks. However, a high-dimensional space cannot fully use different levels of information in videos and text. In this paper, we put forward a method called level-wise aligned dual networks (LADNs) for text–video retrieval. LADN uses four common latent spaces to improve the performance of text–video retrieval and utilizes the semantic concept space to increase the interpretability of the model. Specifically, LADN first extracts different levels of information, including global, local, temporal, and spatial–temporal information, from videos and text. Then, they are mapped into four different latent spaces and one semantic space. Finally, LADN aligns different levels of information in various spaces. Extensive experiments conducted on three widely used datasets, including MSR-VTT, VATEX, and TRECVID AVS 2016-2018, demonstrate that our proposed approach is superior to several state-of-the-art text–video retrieval approaches. |
first_indexed | 2024-12-11T01:20:25Z |
format | Article |
id | doaj.art-0b6bb2d6193541109ce80d6ff193ce82 |
institution | Directory of Open Access Journals |
issn | 1687-6180 |
language | English |
last_indexed | 2024-12-11T01:20:25Z |
publishDate | 2022-07-01 |
publisher | SpringerOpen |
record_format | Article |
series | EURASIP Journal on Advances in Signal Processing |
spelling | doaj.art-0b6bb2d6193541109ce80d6ff193ce82; 2022-12-22T01:25:43Z; eng; SpringerOpen; EURASIP Journal on Advances in Signal Processing; 1687-6180; 2022-07-01; 20221120; 10.1186/s13634-022-00887-y; Level-wise aligned dual networks for text–video retrieval; Qiubin Lin 0; Wenming Cao 1; Zhiquan He 2; College of Electronics and Information Engineering, Shenzhen University; College of Electronics and Information Engineering, Shenzhen University; College of Electronics and Information Engineering, Shenzhen University; Abstract The vast amount of videos on the Internet makes efficient and accurate text–video retrieval tasks increasingly important. The current methods leverage a high-dimensional space to align video and text for these tasks. However, a high-dimensional space cannot fully use different levels of information in videos and text. In this paper, we put forward a method called level-wise aligned dual networks (LADNs) for text–video retrieval. LADN uses four common latent spaces to improve the performance of text–video retrieval and utilizes the semantic concept space to increase the interpretability of the model. Specifically, LADN first extracts different levels of information, including global, local, temporal, and spatial–temporal information, from videos and text. Then, they are mapped into four different latent spaces and one semantic space. Finally, LADN aligns different levels of information in various spaces. Extensive experiments conducted on three widely used datasets, including MSR-VTT, VATEX, and TRECVID AVS 2016-2018, demonstrate that our proposed approach is superior to several state-of-the-art text–video retrieval approaches.; https://doi.org/10.1186/s13634-022-00887-y; Text–video retrieval; Level-wise aligned mechanism; Semantic space; Latent space |
spellingShingle | Qiubin Lin Wenming Cao Zhiquan He Level-wise aligned dual networks for text–video retrieval EURASIP Journal on Advances in Signal Processing Text–video retrieval Level-wise aligned mechanism Semantic space Latent space |
title | Level-wise aligned dual networks for text–video retrieval |
title_full | Level-wise aligned dual networks for text–video retrieval |
title_fullStr | Level-wise aligned dual networks for text–video retrieval |
title_full_unstemmed | Level-wise aligned dual networks for text–video retrieval |
title_short | Level-wise aligned dual networks for text–video retrieval |
title_sort | level wise aligned dual networks for text video retrieval |
topic | Text–video retrieval; Level-wise aligned mechanism; Semantic space; Latent space |
url | https://doi.org/10.1186/s13634-022-00887-y |
work_keys_str_mv | AT qiubinlin levelwisealigneddualnetworksfortextvideoretrieval AT wenmingcao levelwisealigneddualnetworksfortextvideoretrieval AT zhiquanhe levelwisealigneddualnetworksfortextvideoretrieval |
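
The abstract describes aligning text and video at four levels (global, local, temporal, spatial–temporal), each in its own latent space. The sketch below illustrates only that general idea and is not the authors' implementation: it assumes each level already has a text embedding and a video embedding, scores each level with cosine similarity, and sums the per-level scores to rank videos. The function names, the summation rule, and the toy vectors are all hypothetical, and the semantic concept space mentioned in the abstract is left out.

```python
import math

# Level names taken from the abstract; everything else here is an assumption.
LEVELS = ("global", "local", "temporal", "spatial_temporal")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def level_wise_similarity(text_emb, video_emb):
    """Sum the cosine similarity computed separately in each level's space."""
    return sum(cosine(text_emb[level], video_emb[level]) for level in LEVELS)


def rank_videos(text_emb, videos):
    """Return video ids sorted from most to least similar to the text query."""
    scores = {vid: level_wise_similarity(text_emb, emb) for vid, emb in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional embeddings, one vector per level (made-up values).
query = {level: [1.0, 0.0] for level in LEVELS}
videos = {
    "video_a": {level: [1.0, 0.0] for level in LEVELS},  # matches the query at every level
    "video_b": {level: [0.0, 1.0] for level in LEVELS},  # orthogonal to the query at every level
}

ranking = rank_videos(query, videos)
print(ranking)
```

Keeping one score per level, rather than a single score in one shared space, is what lets a mismatch at one level (say, temporal order) be separated from agreement at another (say, global content); how the published model weights or trains these levels is not stated in this record.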