Cross‐modal semantic correlation learning by Bi‐CNN network

Abstract Cross-modal retrieval can retrieve images from a text query and vice versa. In recent years, cross-modal retrieval has attracted extensive attention. Most existing cross-modal retrieval methods aim to find a common subspace and maximize the correlation between modalities....

Full description

Bibliographic Details
Main Authors: Chaoyi Wang, Liang Li, Chenggang Yan, Zhan Wang, Yaoqi Sun, Jiyong Zhang
Format: Article
Language:English
Published: Wiley 2021-12-01
Series:IET Image Processing
Subjects:
Online Access:https://doi.org/10.1049/ipr2.12176
author Chaoyi Wang
Liang Li
Chenggang Yan
Zhan Wang
Yaoqi Sun
Jiyong Zhang
collection DOAJ
description Abstract Cross-modal retrieval can retrieve images from a text query and vice versa. In recent years, cross-modal retrieval has attracted extensive attention. Most existing cross-modal retrieval methods aim to find a common subspace and maximize the correlation between modalities. To generate representations tailored to cross-modal tasks, this paper proposes a novel cross-modal retrieval framework that integrates feature learning and latent-space embedding. In detail, a deep CNN and a shallow CNN are proposed to extract features from the samples: the deep CNN extracts image representations, while the shallow CNN uses multi-dimensional kernels to extract multi-level semantic representations of text. Meanwhile, the semantic manifold is enhanced by constructing a cross-modal ranking loss and a within-modal discriminant loss, improving the separation of semantic representations. Moreover, the most representative samples are selected by an online sampling strategy, so that the approach can be applied to large-scale data. The approach not only increases the discriminative ability among different categories but also maximizes the correlation between modalities. Experiments on three real-world datasets show that the proposed method outperforms popular methods.
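The record does not include the paper's exact loss formulation, so the snippet below is only a minimal pure-Python sketch of the kind of cross-modal margin ranking loss the abstract describes: a matching image–text embedding pair should score higher (here, by cosine similarity) than a mismatched pair by at least a margin. The function names (`cosine`, `ranking_loss`) and the margin value are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(img_emb, matching_txt, mismatched_txt, margin=0.2):
    """Margin-based cross-modal ranking loss for one triplet:
    penalize the case where a mismatched image-text pair scores
    within `margin` of (or above) the matching pair."""
    pos = cosine(img_emb, matching_txt)
    neg = cosine(img_emb, mismatched_txt)
    return max(0.0, margin - pos + neg)

# A well-aligned pair incurs no loss; a mismatched pair ranked
# above the match is penalized by the margin plus the score gap.
print(ranking_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))  # 0.0
print(ranking_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))  # 1.2
```

In a full system this loss would be summed over mined triplets (the "online sampling strategy" the abstract mentions selects the most informative ones) and combined with a within-modal discriminant term; the sketch covers only the ranking component.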
format Article
id doaj.art-63c3d119128048deb09c7f15c31adc7e
institution Directory Open Access Journal
issn 1751-9659
1751-9667
language English
publishDate 2021-12-01
publisher Wiley
record_format Article
series IET Image Processing
spelling doaj.art-63c3d119128048deb09c7f15c31adc7e
Published: Wiley, IET Image Processing, ISSN 1751-9659 / 1751-9667, 2021-12-01, vol. 15, no. 14, pp. 3674-3684, doi:10.1049/ipr2.12176
Title: Cross‐modal semantic correlation learning by Bi‐CNN network
Author affiliations:
Chaoyi Wang: Hangzhou Dianzi University, Hangzhou, China
Liang Li: Institute of Computing Technology, CAS, Beijing, China
Chenggang Yan: Hangzhou Dianzi University, Hangzhou, China
Zhan Wang: RTInvent Technology Co., Ltd, Beijing, China
Yaoqi Sun: Hangzhou Dianzi University, Hangzhou, China
Jiyong Zhang: Hangzhou Dianzi University, Hangzhou, China
title Cross‐modal semantic correlation learning by Bi‐CNN network
topic Optical, image and video signal processing
Computer vision and image processing techniques
Information retrieval techniques
Neural nets
url https://doi.org/10.1049/ipr2.12176