Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval
Text-video retrieval faces a major challenge: the semantic gap between cross-modal information. Some existing methods map text and video into a shared subspace to measure their similarity. However, such methods do not impose a semantic consistency constraint when associating the semantic encodings of the two modalities, so the learned associations are weak. In this paper, we propose a cross-modal retrieval algorithm based on semantic correlation and multi-task learning...
Main Authors: | Xiaoyu Wu, Tiantian Wang, Shengjin Wang
Format: | Article
Language: | English
Published: | MDPI AG, 2020-12-01
Series: | Electronics
Subjects: | cross-modal learning; text-video retrieval; semantic correlation; multi-task learning
Online Access: | https://www.mdpi.com/2079-9292/9/12/2125
author | Xiaoyu Wu; Tiantian Wang; Shengjin Wang
author_facet | Xiaoyu Wu; Tiantian Wang; Shengjin Wang
author_sort | Xiaoyu Wu |
collection | DOAJ |
description | Text-video retrieval faces a major challenge: the semantic gap between cross-modal information. Some existing methods map text and video into a shared subspace to measure their similarity. However, such methods do not impose a semantic consistency constraint when associating the semantic encodings of the two modalities, so the learned associations are weak. In this paper, we propose a cross-modal retrieval algorithm based on semantic correlation and multi-task learning. First, multi-level features of the video and text are extracted with multiple deep networks so that the information in both modalities is fully encoded. Then, in the common feature space into which both modalities are mapped, we build a multi-task learning framework that combines a semantic similarity measurement task with a semantic consistency classification task over the text-video features. The classification task constrains the learning of the semantic association task, so multi-task learning guides the feature mapping of the two modalities and optimizes the construction of the unified feature subspace. Finally, experiments on the Microsoft Video Description (MSVD) and MSR-Video to Text (MSR-VTT) datasets show that our algorithm outperforms existing methods, demonstrating that it improves cross-modal retrieval performance. |
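The abstract describes the method only at a high level. As a rough illustration of how such a multi-task objective can be wired up, here is a minimal PyTorch sketch. Everything concrete in it is an assumption rather than taken from the paper: the feature dimensions, the linear projection heads, the hinge-based bidirectional ranking loss standing in for the semantic similarity measurement, the shared multi-label concept classifier standing in for the semantic consistency classification, and the loss weight lam.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Maps pre-extracted video and text features into a shared subspace
    and attaches a shared concept classifier for the consistency task.
    All dimensions are illustrative assumptions, not from the paper."""
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512, num_concepts=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)      # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)        # text branch
        self.classifier = nn.Linear(embed_dim, num_concepts)   # shared consistency head

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    """Bidirectional hinge ranking loss over cosine similarities;
    matched pairs sit on the diagonal of the similarity matrix."""
    sim = v @ t.t()                                   # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarity of matched pairs
    cost_t = (margin + sim - pos).clamp(min=0)        # video -> mismatched texts
    cost_v = (margin + sim - pos.t()).clamp(min=0)    # text -> mismatched videos
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_t.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()

def multitask_loss(model, video_feat, text_feat, concept_labels, lam=0.5):
    """Semantic association (ranking) plus semantic consistency
    (multi-label concept classification) applied to both embeddings.
    concept_labels is a float multi-hot tensor of shape (B, num_concepts)."""
    v, t = model(video_feat, text_feat)
    assoc = ranking_loss(v, t)
    consist = (F.binary_cross_entropy_with_logits(model.classifier(v), concept_labels) +
               F.binary_cross_entropy_with_logits(model.classifier(t), concept_labels))
    return assoc + lam * consist
```

In a training loop one would call multitask_loss(model, video_feat, text_feat, concept_labels) on batches of pre-extracted features and backpropagate; at retrieval time only the cosine similarities between the projected video and text embeddings are needed to rank candidates.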
first_indexed | 2024-03-10T14:07:41Z |
format | Article |
id | doaj.art-495ebd2a62a347bb8989484e6e3ec74d |
institution | Directory Open Access Journal |
issn | 2079-9292 |
language | English |
last_indexed | 2024-03-10T14:07:41Z |
publishDate | 2020-12-01 |
publisher | MDPI AG |
record_format | Article |
series | Electronics |
spelling | doaj.art-495ebd2a62a347bb8989484e6e3ec74d (2023-11-21T00:27:22Z). Electronics 9(12):2125, MDPI AG, 2020-12-01, ISSN 2079-9292. DOI: 10.3390/electronics9122125. Affiliations: Xiaoyu Wu and Tiantian Wang, School of Information and Communication Engineering, Communication University of China, Beijing 100024, China; Shengjin Wang, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. |
spellingShingle | Xiaoyu Wu Tiantian Wang Shengjin Wang Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval Electronics cross-modal learning text-video retrieval semantic correlation multi-task learning |
title | Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval |
title_full | Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval |
title_fullStr | Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval |
title_full_unstemmed | Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval |
title_short | Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval |
title_sort | cross modal learning based on semantic correlation and multi task learning for text video retrieval |
topic | cross-modal learning; text-video retrieval; semantic correlation; multi-task learning |
url | https://www.mdpi.com/2079-9292/9/12/2125 |
work_keys_str_mv | AT xiaoyuwu crossmodallearningbasedonsemanticcorrelationandmultitasklearningfortextvideoretrieval AT tiantianwang crossmodallearningbasedonsemanticcorrelationandmultitasklearningfortextvideoretrieval AT shengjinwang crossmodallearningbasedonsemanticcorrelationandmultitasklearningfortextvideoretrieval |