Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Text-video retrieval faces a great challenge in bridging the semantic gap between cross-modal information. Some existing methods map text and video into a shared subspace to measure their similarity; however, they do not impose a semantic consistency constraint when associating the semantic encodings of the two modalities, so the learned associations are weak. In this paper, we propose a cross-modal retrieval algorithm based on semantic correlation and multi-task learning. First, multi-level features of video and text are extracted with multiple deep networks so that the information in both modalities is fully encoded. Then, in the common feature space into which both modalities are mapped, we propose a multi-task learning framework that combines semantic similarity measurement with semantic consistency classification over the text-video features. The semantic consistency classification task constrains the learning of the semantic correlation task, so multi-task learning guides the feature mapping of both modalities and improves the construction of the unified feature subspace. Finally, experimental results on the Microsoft Video Description (MSVD) and MSR-Video to Text (MSR-VTT) datasets surpass existing work, demonstrating that our algorithm improves cross-modal retrieval performance.
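
The abstract sketches the architecture only at a high level, so the following is a minimal illustrative sketch (in PyTorch) of the two-task objective it describes: a shared text-video embedding space trained jointly with a ranking loss for semantic similarity and a classification loss for semantic consistency. The layer sizes, class count, hard-negative triplet formulation, and the weighting factor alpha are assumptions for illustration, not details taken from the paper.

# Minimal sketch of the two ideas in the abstract: a shared text-video
# embedding space trained with (1) a ranking loss for semantic similarity
# and (2) a classification loss enforcing semantic consistency.
# Dimensions, the category head, and the weight `alpha` are illustrative
# assumptions, not the paper's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512, num_classes=20):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)    # text branch
        # Shared classifier: both modalities of a matching pair should
        # predict the same semantic category (the consistency constraint).
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, t, self.classifier(v), self.classifier(t)

def multi_task_loss(v, t, v_logits, t_logits, labels, margin=0.2, alpha=0.5):
    # Task 1: bidirectional triplet ranking loss on cosine similarity,
    # using the hardest negatives in the batch; matching pairs lie on
    # the diagonal of the similarity matrix.
    sim = v @ t.t()
    pos = sim.diag().unsqueeze(1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # video -> wrong text
    cost_v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text -> wrong video
    rank_loss = cost_t.max(1)[0].mean() + cost_v.max(0)[0].mean()
    # Task 2: semantic consistency classification on both branches.
    cls_loss = F.cross_entropy(v_logits, labels) + F.cross_entropy(t_logits, labels)
    return rank_loss + alpha * cls_loss

In training, video_feat would be pooled features from a video network and text_feat the output of a sentence encoder, with matching text-video pairs aligned along the batch diagonal; the classification term then penalizes embeddings whose predicted category disagrees with the pair's semantic label.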

Bibliographic Details
Main Authors: Xiaoyu Wu, Tiantian Wang, Shengjin Wang
Format: Article
Language: English
Published: MDPI AG, 2020-12-01
Series: Electronics, Volume 9, Issue 12, Article 2125
ISSN: 2079-9292
DOI: 10.3390/electronics9122125
Source: Directory of Open Access Journals (DOAJ), record doaj.art-495ebd2a62a347bb8989484e6e3ec74d
Subjects: cross-modal learning; text-video retrieval; semantic correlation; multi-task learning
Online Access: https://www.mdpi.com/2079-9292/9/12/2125

Author Affiliations:
Xiaoyu Wu: School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
Tiantian Wang: School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
Shengjin Wang: Department of Electronic Engineering, Tsinghua University, Beijing 100084, China