Thangka Image–Text Matching Based on Adaptive Pooling Layer and Improved Transformer

Image–text matching is a research hotspot in multimodal tasks that integrate image and text processing. To address the difficult problem of associating image and text data in the multimodal knowledge graph of Thangka, we propose an image–text matching method based on the Visual Semantic Embedding (VSE) model. The method introduces an adaptive pooling layer to improve the extraction of semantic associations between Thangka images and texts. We also improve the traditional Transformer architecture, combining bidirectional residual connections and masked attention mechanisms to improve the stability of the matching process and the ability to extract semantic information. In addition, we design a multi-granularity tag alignment module that maps global and local features of images and texts into a common coding space, leveraging inter- and intra-modal semantic associations to improve image–text matching accuracy. Comparative experiments on the Thangka dataset show that our method achieves significant improvements over the VSE baseline: recall improves by 9.4% for image-to-text matching and by 10.5% for text-to-image matching. Furthermore, without any large-scale corpus pre-training, our method outperforms all non-pre-trained models on the Flickr30k public dataset and two of the four pre-trained models. Our model also executes an order of magnitude faster than the pre-trained models, highlighting its performance and efficiency in the image–text matching task.

Bibliographic Details
Main Authors: Kaijie Wang, Tiejun Wang, Xiaoran Guo, Kui Xu, Jiao Wu
Affiliation: Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China
Format: Article
Language: English
Published: MDPI AG, 2024-01-01
Series: Applied Sciences, vol. 14, no. 2, art. 807
ISSN: 2076-3417
DOI: 10.3390/app14020807
Subjects: Thangka; image–text matching; adaptive pooling layer; bidirectional residual connection; masking attention mechanisms
Online Access: https://www.mdpi.com/2076-3417/14/2/807
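
Illustrative Code Sketches

The abstract names three mechanisms (an adaptive pooling layer, an improved Transformer with bidirectional residual connections and masked attention, and a multi-granularity alignment module trained in a common coding space), but this record does not include the paper's implementation. The sketches below are therefore rough PyTorch illustrations of how such components are conventionally built, not the authors' code; every class name, parameter, and design choice in them is an assumption.

First, a minimal sketch reading "adaptive pooling" as learned, softmax-weighted pooling over a variable-length set of region or token features, in the spirit of generalized pooling operators used with VSE models:

```python
import torch
import torch.nn as nn

class AdaptivePooling(nn.Module):
    """Learned weighted pooling over a variable-length feature set.

    Instead of fixed mean/max pooling, a small scorer assigns a weight
    to each image region or text token, and the pooled embedding is the
    softmax-weighted sum. This is one plausible reading of the paper's
    "adaptive pooling layer", not the authors' implementation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one scalar score per feature

    def forward(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n, dim); mask: (batch, n), True for real items
        scores = self.scorer(feats).squeeze(-1)            # (batch, n)
        scores = scores.masked_fill(~mask, float("-inf"))  # ignore padding
        weights = torch.softmax(scores, dim=-1)            # (batch, n)
        return torch.einsum("bn,bnd->bd", weights, feats)  # (batch, dim)
```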
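Next, a conventional pre-norm Transformer encoder block with a key-padding attention mask and two residual connections. The abstract does not specify the design of its "bidirectional residual concatenation" and "mask attention mechanism", so this block is offered only as a baseline point of reference:

```python
import torch
import torch.nn as nn

class MaskedTransformerBlock(nn.Module):
    """Pre-norm Transformer encoder block with masked self-attention.

    A standard block with a key-padding mask and residual connections
    around both sub-layers; the paper's improved architecture is not
    described in the abstract, so this is a conventional baseline.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # pad_mask: (batch, n), True where the position is padding
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        x = x + attn_out                  # residual connection 1
        x = x + self.ffn(self.norm2(x))   # residual connection 2
        return x
```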
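Finally, joint image–text embedding spaces of this kind are commonly trained with a hinge-based triplet loss over the hardest in-batch negatives (the VSE++ objective). The paper's multi-granularity tag alignment module is more elaborate, so this shows only the baseline idea behind matching in a common coding space:

```python
import torch

def hard_negative_triplet_loss(img: torch.Tensor, txt: torch.Tensor,
                               margin: float = 0.2) -> torch.Tensor:
    """Hinge loss with hardest in-batch negatives (VSE++-style baseline).

    img, txt: L2-normalized embeddings of matched pairs, shape (batch, dim).
    """
    sim = img @ txt.t()             # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)   # similarity of each true pair
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(eye, 0)
    cost_t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(eye, 0)
    # keep only the hardest negative per image (rows) and per text (columns)
    return cost_i2t.max(dim=1).values.mean() + cost_t2i.max(dim=0).values.mean()
```

The recall figures reported in the abstract (Recall@K for image-to-text and text-to-image retrieval) would then be computed by ranking all candidate texts for each image, and vice versa, by this cosine similarity on a held-out set.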