Cross‐modal semantic correlation learning by Bi‐CNN network
Abstract Cross-modal retrieval returns images for a text query and vice versa, and it has attracted extensive attention in recent years. Most existing cross-modal retrieval methods aim to find a common subspace and maximize the correlation between modalities. To generate representations tailored to cross-modal tasks, this paper proposes a novel cross-modal retrieval framework that integrates feature learning and latent-space embedding. Specifically, a deep CNN and a shallow CNN extract features from the samples: the deep CNN learns image representations, while the shallow CNN uses kernels of multiple sizes to extract multi-level semantic representations of text. Meanwhile, a cross-modal ranking loss and a within-modal discriminant loss enhance the semantic manifold and sharpen the separation of semantic representations. Moreover, an online sampling strategy selects the most representative samples, so the approach scales to large datasets. The approach not only increases discriminability among categories but also maximizes the correlation between modalities. Experiments on three real-world datasets show that the proposed method outperforms popular methods.
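The record itself contains no code; as a rough illustration of the two-branch design the abstract describes, here is a minimal PyTorch sketch of a "Bi-CNN"-style encoder. The ResNet-18 backbone, 300-d word embeddings, kernel widths (2, 3, 4), and 512-d common space are all illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a Bi-CNN-style two-branch encoder (illustrative only:
# the backbone, dimensions, and kernel widths are assumptions, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class TextCNN(nn.Module):
    """Shallow CNN over word embeddings; parallel kernels of several widths
    capture multi-level (n-gram) semantics, as the abstract describes."""
    def __init__(self, vocab_size, emb_dim=300, n_filters=128,
                 kernel_sizes=(2, 3, 4), out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), out_dim)

    def forward(self, tokens):                  # tokens: (B, L) word ids
        x = self.embed(tokens).transpose(1, 2)  # -> (B, emb_dim, L)
        # Max-pool each feature map over time, then fuse all kernel widths.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))

class BiCNN(nn.Module):
    """Deep image CNN + shallow text CNN projected into one latent space."""
    def __init__(self, vocab_size, out_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)            # deep branch
        backbone.fc = nn.Linear(backbone.fc.in_features, out_dim)
        self.image_net = backbone
        self.text_net = TextCNN(vocab_size, out_dim=out_dim)

    def forward(self, images, tokens):
        v = F.normalize(self.image_net(images), dim=1)      # image embedding
        t = F.normalize(self.text_net(tokens), dim=1)       # text embedding
        return v, t
```

A forward pass `v, t = model(images, tokens)` yields one normalized embedding per modality, ready for a correlation or ranking objective such as the one sketched after the description field below.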
| Main Authors: | Chaoyi Wang; Liang Li; Chenggang Yan; Zhan Wang; Yaoqi Sun; Jiyong Zhang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Wiley, 2021-12-01 |
| Series: | IET Image Processing |
| Subjects: | Optical, image and video signal processing; Computer vision and image processing techniques; Information retrieval techniques; Neural nets |
| Online Access: | https://doi.org/10.1049/ipr2.12176 |

_version_ | 1798027600539418624
author | Chaoyi Wang; Liang Li; Chenggang Yan; Zhan Wang; Yaoqi Sun; Jiyong Zhang
author_sort | Chaoyi Wang |
collection | DOAJ |
description | Abstract Cross-modal retrieval returns images for a text query and vice versa, and it has attracted extensive attention in recent years. Most existing cross-modal retrieval methods aim to find a common subspace and maximize the correlation between modalities. To generate representations tailored to cross-modal tasks, this paper proposes a novel cross-modal retrieval framework that integrates feature learning and latent-space embedding. Specifically, a deep CNN and a shallow CNN extract features from the samples: the deep CNN learns image representations, while the shallow CNN uses kernels of multiple sizes to extract multi-level semantic representations of text. Meanwhile, a cross-modal ranking loss and a within-modal discriminant loss enhance the semantic manifold and sharpen the separation of semantic representations. Moreover, an online sampling strategy selects the most representative samples, so the approach scales to large datasets. The approach not only increases discriminability among categories but also maximizes the correlation between modalities. Experiments on three real-world datasets show that the proposed method outperforms popular methods.
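The description pairs a cross-modal ranking loss with online selection of representative samples. A common concrete form of this is in-batch hard-negative triplet ranking; the sketch below assumes that form (cosine similarity and a margin of 0.2 are illustrative choices) and omits the within-modal discriminant term, so it should be read as an approximation rather than the paper's exact objective.

```python
# Minimal sketch of a bidirectional triplet ranking loss with in-batch
# hard-negative ("online") sampling. The margin and the use of cosine
# similarity are assumptions; the paper's exact objective (and its
# within-modal discriminant term) may differ.
import torch

def cross_modal_ranking_loss(v, t, margin=0.2):
    """v, t: L2-normalised image/text embeddings, shape (B, D);
    row i of v matches row i of t."""
    sim = v @ t.t()                          # (B, B) cosine similarities
    pos = sim.diag()                         # matched-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    masked = sim.masked_fill(mask, -1.0)     # exclude the positives
    # Online sampling: keep only the hardest negative for each query.
    neg_i2t = masked.max(dim=1).values       # image query -> hardest text
    neg_t2i = masked.max(dim=0).values       # text query -> hardest image
    loss = (torch.clamp(margin + neg_i2t - pos, min=0) +
            torch.clamp(margin + neg_t2i - pos, min=0))
    return loss.mean()
```

With the Bi-CNN sketch above, one training step would compute `v, t = model(images, tokens)` and then minimise `cross_modal_ranking_loss(v, t)`.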
first_indexed | 2024-04-11T18:54:04Z |
format | Article |
id | doaj.art-63c3d119128048deb09c7f15c31adc7e |
institution | Directory Open Access Journal |
issn | 1751-9659; 1751-9667
language | English |
last_indexed | 2024-04-11T18:54:04Z |
publishDate | 2021-12-01 |
publisher | Wiley |
record_format | Article |
series | IET Image Processing |
spelling | IET Image Processing, vol. 15, no. 14, pp. 3674-3684, 2021-12-01, doi:10.1049/ipr2.12176. Author affiliations: Chaoyi Wang, Chenggang Yan, Yaoqi Sun, Jiyong Zhang (Hangzhou Dianzi University, Hangzhou, China); Liang Li (Institute of Computing Technology, CAS, Beijing, China); Zhan Wang (RTInvent Technology Co., Ltd, Beijing, China)
title | Cross‐modal semantic correlation learning by Bi‐CNN network |
topic | Optical, image and video signal processing; Computer vision and image processing techniques; Information retrieval techniques; Neural nets
url | https://doi.org/10.1049/ipr2.12176 |