Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval

This paper strives to improve the performance of video–text retrieval. To date, many algorithms have been proposed to facilitate the similarity measurement of video–text retrieval, moving from a single global semantic level to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so their semantic levels are insufficient; (2) constraining the real-valued features of different modalities to lie in the same space through feature-distance measurement alone is incomplete; (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video–text retrieval that jointly models video–text similarity at the global, entity, action, and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into these four semantic levels by carefully designed spatial–temporal semantic learning structures. Then, KLDivLoss and a cross-modal parameter-shared attribute projection layer serve as statistical constraints that ensure representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute distribution problem for video–text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video–text retrieval datasets, MSR-VTT and VATEX, show the viability of the method.
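
The statistical constraint described above pairs KLDivLoss with a cross-modal parameter-shared attribute projection layer. The PyTorch sketch below is a minimal illustration of that idea under stated assumptions, not the paper's implementation: the feature dimension, the attribute vocabulary size, and the symmetric form of the KL term are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAttributeProjection(nn.Module):
    """Hypothetical sketch: one projection layer whose parameters are
    shared across modalities, so video and text features at the same
    semantic level are mapped into a single attribute space."""

    def __init__(self, feat_dim: int = 512, num_attributes: int = 1000):
        super().__init__()
        # A single linear layer used for BOTH modalities (parameter sharing).
        self.proj = nn.Linear(feat_dim, num_attributes)
        # KLDivLoss expects log-probabilities as input and probabilities as target.
        self.kl = nn.KLDivLoss(reduction="batchmean")

    def forward(self, video_feat: torch.Tensor, text_feat: torch.Tensor):
        video_logits = self.proj(video_feat)  # (B, num_attributes)
        text_logits = self.proj(text_feat)    # (B, num_attributes)
        # Symmetric KL between the two predicted attribute distributions
        # acts as the statistical alignment constraint.
        loss = 0.5 * (
            self.kl(F.log_softmax(video_logits, dim=-1),
                    F.softmax(text_logits, dim=-1))
            + self.kl(F.log_softmax(text_logits, dim=-1),
                      F.softmax(video_logits, dim=-1))
        )
        return video_logits, text_logits, loss
```

Sharing self.proj across modalities is what ties the two feature spaces together; the KL term then matches the predicted attribute distributions rather than only measuring feature-space distance, which is the incompleteness the abstract's second limitation points at.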
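
The focal binary cross-entropy (FBCE) loss is only named here, so the following is a plausible reading rather than the paper's definition: focal down-weighting in the style of Lin et al.'s focal loss applied to multi-label binary cross-entropy over attribute labels, so that abundant, easily classified attributes contribute less to the gradient than rare ones. The values of alpha and gamma are standard focal-loss defaults, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def focal_bce_loss(logits: torch.Tensor,
                   targets: torch.Tensor,
                   alpha: float = 0.25,
                   gamma: float = 2.0) -> torch.Tensor:
    """Hypothetical FBCE sketch for imbalanced multi-label attributes.

    logits:  (B, num_attributes) raw scores from the projection layer
    targets: (B, num_attributes) multi-hot attribute labels as floats in {0, 1}
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the predicted probability of the correct label per attribute.
    p_t = targets * p + (1.0 - targets) * (1.0 - p)
    alpha_t = targets * alpha + (1.0 - targets) * (1.0 - alpha)
    # (1 - p_t)^gamma shrinks toward zero for easy examples, so the loss
    # concentrates on hard, typically rare, attribute labels.
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```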

Bibliographic Details
Main Authors: Fudong Nian, Ling Ding, Yuxia Hu, Yanhong Gu
Affiliations: School of Advanced Manufacturing Engineering, Hefei University, Hefei 230601, China (Nian, Ding, Gu); Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling, Anhui Jianzhu University, Hefei 230601, China (Hu)
Format: Article
Language: English
Published: MDPI AG, 2022-09-01
Series: Mathematics, Vol. 10, No. 18, Article 3346
DOI: 10.3390/math10183346
ISSN: 2227-7390
Subjects: video–text retrieval; multi-level space learning; cross-modal similarity calculation
Collection: DOAJ (Directory of Open Access Journals)
Online Access: https://www.mdpi.com/2227-7390/10/18/3346