Thangka Image–Text Matching Based on Adaptive Pooling Layer and Improved Transformer
Image–text matching is a research hotspot in multimodal tasks that integrate image and text processing. To address the difficulty of associating image and text data in the multimodal knowledge graph of Thangka, we propose an image–text matching method based on the Visual Semantic Embedding (VSE) model. The method introduces an adaptive pooling layer to strengthen the extraction of semantic associations between Thangka images and texts. We also improve the traditional Transformer architecture by combining bidirectional residual connections with masked attention mechanisms, improving both the stability of the matching process and the extraction of semantic information. In addition, we design a multi-granularity tag alignment module that maps global and local features of images and text into a common encoding space, leveraging inter- and intra-modal semantic associations to improve image–text matching accuracy. Comparative experiments on the Thangka dataset show that our method achieves significant improvements over the VSE baseline, raising recall by 9.4% for image-to-text matching and 10.5% for text-to-image matching. Furthermore, without any large-scale corpus pre-training, our method outperforms all non-pre-trained models and two of the four pre-trained models on the public Flickr30k dataset, while its execution efficiency is an order of magnitude higher than that of the pre-trained models, highlighting both the performance and the efficiency of our model in the image–text matching task.
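The adaptive pooling idea from the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch implementation of a learnable pooling layer that aggregates a variable number of image-region or text-token features into a single embedding; the class name, shapes, and the softmax-weighted aggregation are illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptivePooling(nn.Module):
    """Aggregate a variable-length set of features into one vector.

    Instead of fixed mean/max pooling, a small scoring network learns
    how much each image region (or text token) should contribute.
    Illustrative sketch only, not the paper's exact layer.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar weight per feature

    def forward(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n, dim) region/token features
        # mask:  (batch, n), True for real features, False for padding
        logits = self.score(feats).squeeze(-1)             # (batch, n)
        logits = logits.masked_fill(~mask, float("-inf"))  # ignore padding
        weights = torch.softmax(logits, dim=-1)            # adaptive weights
        pooled = torch.einsum("bn,bnd->bd", weights, feats)
        return F.normalize(pooled, dim=-1)  # unit-norm embedding for matching


# Usage: pool 36 region features of width 1024 into one image embedding.
pool = AdaptivePooling(dim=1024)
feats = torch.randn(2, 36, 1024)
mask = torch.ones(2, 36, dtype=torch.bool)
img_emb = pool(feats, mask)  # (2, 1024)
```

In a VSE-style pipeline, the same kind of pooling is applied on both the image and the text branch, and matching is scored by cosine similarity between the two pooled, unit-normalized embeddings.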
Main Authors: | Kaijie Wang, Tiejun Wang, Xiaoran Guo, Kui Xu, Jiao Wu |
---|---|
Author Affiliation: | Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-01-01 |
Series: | Applied Sciences (ISSN 2076-3417), Vol. 14, No. 2, Article 807 |
DOI: | 10.3390/app14020807 |
Subjects: | Thangka; image–text matching; adaptive pooling layer; bidirectional residual connection; masking attention mechanisms |
Online Access: | https://www.mdpi.com/2076-3417/14/2/807 |
Collection: | DOAJ (record doaj.art-3a372d9b4fe0486aa542d3b74810f0ef) |
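For the improved Transformer the abstract describes, only the building blocks are public here: masked attention plus residual connections. The sketch below shows a conventional pre-norm encoder block with a padding mask and the two standard residual paths; the paper's "bidirectional residual connection" variant is not reproduced, and all names and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MaskedEncoderBlock(nn.Module):
    """Conventional Transformer encoder block with masked self-attention.

    The paper modifies this structure with a bidirectional residual
    connection; since its exact form is not given here, this sketch
    only shows the standard block (residuals around the attention and
    feed-forward sub-layers) with a key-padding mask.
    """

    def __init__(self, dim: int = 512, heads: int = 8, hidden: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # pad_mask: (batch, n), True at padding positions to be ignored.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        x = x + attn_out                 # residual connection 1
        x = x + self.ff(self.norm2(x))   # residual connection 2
        return x


# Usage: encode 20 text tokens of width 512, masking the last 5 as padding.
block = MaskedEncoderBlock()
tokens = torch.randn(2, 20, 512)
pad = torch.zeros(2, 20, dtype=torch.bool)
pad[:, 15:] = True
out = block(tokens, pad)  # (2, 20, 512)
```

A stack of such blocks would encode the token or region sequence before the adaptive pooling step sketched above.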