Deep Multi-Modal Metric Learning with Multi-Scale Correlation for Image-Text Retrieval

Multi-modal retrieval is challenging due to the heterogeneous gap and the complex semantic relationships between data of different modalities. Typical research maps different modalities into a common subspace using a one-to-one correspondence or a binary similarity/dissimilarity relationship between inter-modal data, so that the distances of heterogeneous data can be compared directly; inter-modal retrieval can then be achieved by nearest-neighbor search. However, most such methods ignore intra-modal relations and the complicated semantics shared across multi-modal data. In this paper, we propose a deep multi-modal metric learning method with multi-scale semantic correlation for retrieval between the image and text modalities. A deep model with two branches is designed to nonlinearly map raw heterogeneous data into comparable representations. In contrast to binary similarity, we formulate the semantic relationship as a multi-scale similarity to learn fine-grained multi-modal distances. Inter-modal and intra-modal correlations constructed on this multi-scale semantic similarity are incorporated to train the deep model in an end-to-end way. Experiments validate the effectiveness of the proposed method on multi-modal retrieval tasks, and it outperforms state-of-the-art methods on the NUS-WIDE, MIR Flickr, and Wikipedia datasets.
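
The abstract outlines a two-branch deep network that nonlinearly maps raw image and text features into a common subspace where distances are directly comparable. The paper's exact architecture is not given above, so the following PyTorch sketch is purely illustrative: the class name `TwoBranchNet`, the input dimensions (e.g., 4096-d CNN image features, 1000-d tag/bag-of-words text features), and the layer sizes are all assumptions, not the authors' design.

```python
# A minimal sketch of a two-branch mapping network, as described in the
# abstract. Every dimension and layer choice below is an illustrative
# assumption; the paper's actual architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchNet(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=1000, embed_dim=256):
        super().__init__()
        # Image branch: nonlinear mapping from pre-extracted CNN features.
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )
        # Text branch: nonlinear mapping from raw text features (e.g., tags).
        self.txt_branch = nn.Sequential(
            nn.Linear(txt_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, img_feat, txt_feat):
        # L2-normalize so that distances in the common subspace are
        # directly comparable across modalities.
        img_emb = F.normalize(self.img_branch(img_feat), dim=1)
        txt_emb = F.normalize(self.txt_branch(txt_feat), dim=1)
        return img_emb, txt_emb
```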

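The abstract also contrasts binary similarity with a "multi-scale" (graded) semantic similarity used to build inter-modal and intra-modal correlations. The exact similarity definition and loss are not stated above, so the sketch below assumes one plausible instantiation: graded similarity as the Jaccard overlap of multi-label annotations, and a simple regression of embedding cosine similarity onto that target. The names `multi_scale_similarity`, `correlation_loss`, `total_loss`, and the weight `alpha` are hypothetical.

```python
# Hedged illustration of graded ("multi-scale") semantic similarity and a
# combined inter-/intra-modal loss. The Jaccard-based target and the
# squared-error loss are assumptions, not the paper's stated formulation.
import torch

def multi_scale_similarity(labels: torch.Tensor) -> torch.Tensor:
    """labels: (N, C) binary multi-label matrix (float) -> (N, N) graded
    similarity in [0, 1]."""
    inter = labels @ labels.t()                                   # |L_i ∩ L_j|
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter   # |L_i ∪ L_j|
    return inter / union.clamp(min=1.0)

def correlation_loss(emb_a, emb_b, sim):
    """Regress pairwise cosine similarities of (L2-normalized) embeddings
    onto the graded targets."""
    cos = emb_a @ emb_b.t()
    return ((cos - sim) ** 2).mean()

def total_loss(img_emb, txt_emb, sim, alpha=0.5):
    # Inter-modal term aligns image and text embeddings; intra-modal terms
    # preserve relations within each modality, per the abstract.
    inter_modal = correlation_loss(img_emb, txt_emb, sim)
    intra_modal = (correlation_loss(img_emb, img_emb, sim)
                   + correlation_loss(txt_emb, txt_emb, sim))
    return inter_modal + alpha * intra_modal
```

Once such a model is trained end to end, cross-modal retrieval reduces to the nearest-neighbor search mentioned in the abstract: rank all text embeddings by cosine similarity to a query image embedding, or vice versa.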

Bibliographic Details
Main Authors: Yan Hua, Yingyun Yang, Jianhe Du (School of Information and Communication Engineering, Communication University of China, Beijing 100024, China)
Format: Article
Language: English
Published: MDPI AG, 2020-03-01
Series: Electronics, vol. 9, no. 3, article 466
ISSN: 2079-9292
DOI: 10.3390/electronics9030466
Subjects: deep learning; metric learning; multi-modal correlation; cross-modal retrieval; image–text retrieval
Online Access: https://www.mdpi.com/2079-9292/9/3/466