Bi-LS-AttM: A Bidirectional LSTM and Attention Mechanism Model for Improving Image Captioning

Bibliographic Details
Main Authors: Tian Xie, Weiping Ding, Jinbao Zhang, Xusen Wan, Jiehua Wang
Format: Article
Language: English
Published: MDPI AG, 2023-07-01
Series: Applied Sciences
Subjects: image captioning; bidirectional long short-term memory; attention mechanism; fast region-based convolutional network; common space
Online Access: https://www.mdpi.com/2076-3417/13/13/7916
author Tian Xie
Weiping Ding
Jinbao Zhang
Xusen Wan
Jiehua Wang
collection DOAJ
description Automatic image captioning sits at the intersection of two pivotal branches of artificial intelligence: computer vision (CV) and natural language processing (NLP). Its core function is to translate extracted visual features into higher-order semantic information. The bidirectional long short-term memory (Bi-LSTM) network has been widely adopted for image captioning. Recent work has focused on adapting models to produce novel and accurate captions, yet tuning model parameters alone does not always yield optimal results. This research therefore proposes a model that combines a bidirectional LSTM with an attention mechanism (Bi-LS-AttM) for image captioning. The model exploits contextual information from both the forward and backward directions of the input sequence and couples it with the attention mechanism, thereby improving the accuracy of visual-language interpretation. The distinctive contribution of this work is the integration of Bi-LSTM and attention to generate sentences that are both structurally novel and faithful to the image content. To improve speed and accuracy, the study replaces plain convolutional neural networks (CNNs) with a fast region-based convolutional network (Fast R-CNN) for feature extraction, and it refines the generation and evaluation of the common space for greater efficiency. The model was evaluated on the Flickr30k and MSCOCO datasets (80 object categories). Comparative analysis of the performance metrics shows that the Bi-LS-AttM model outperforms both unidirectional LSTM and plain Bi-LSTM models. On caption generation and image-sentence retrieval tasks, it achieves time savings of approximately 36.5% and 26.3% relative to the Bi-LSTM model and the deep Bi-LSTM model, respectively.
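The abstract describes a decoder that pairs a bidirectional LSTM with an attention mechanism over region-level features of the kind a Fast R-CNN-style detector produces. As a rough illustration only, not the authors' code, a minimal sketch of such a decoder, assuming PyTorch and hypothetical layer names and sizes, might look like the following:

import torch
import torch.nn as nn

class BiLSTMAttentionCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM reads the (teacher-forced) caption in both directions.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Additive attention that scores every image region against every decoder state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(2 * hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim + feat_dim, vocab_size)

    def forward(self, region_feats, captions):
        # region_feats: (B, R, feat_dim) pre-extracted region features; captions: (B, T) token ids
        states, _ = self.bilstm(self.embed(captions))               # (B, T, 2*hidden_dim)
        f = self.att_feat(region_feats).unsqueeze(1)                # (B, 1, R, hidden_dim)
        h = self.att_hid(states).unsqueeze(2)                       # (B, T, 1, hidden_dim)
        alpha = torch.softmax(self.att_score(torch.tanh(f + h)).squeeze(-1), dim=-1)  # (B, T, R)
        context = torch.bmm(alpha, region_feats)                    # (B, T, feat_dim) attended features
        return self.out(torch.cat([states, context], dim=-1))       # (B, T, vocab_size) logits

# Example with random tensors: 8 images, 36 regions each, captions of length 12, 10k-word vocabulary.
model = BiLSTMAttentionCaptioner(vocab_size=10000)
logits = model(torch.randn(8, 36, 2048), torch.randint(0, 10000, (8, 12)))

The attended region context is concatenated with the bidirectional decoder state before the vocabulary projection; the paper's common-space retrieval component and training details are not reproduced here.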
format Article
id doaj.art-3ff52bf0a76e4213803eadc97014c86a
institution Directory Open Access Journal
issn 2076-3417
language English
publishDate 2023-07-01
publisher MDPI AG
series Applied Sciences
doi 10.3390/app13137916
affiliation School of Information Science and Technology, Nantong University, Nantong 226019, China (all five authors)
title Bi-LS-AttM: A Bidirectional LSTM and Attention Mechanism Model for Improving Image Captioning
topic image captioning
bidirectional long short-term memory
attention mechanism
fast region-based convolutional network
common space
url https://www.mdpi.com/2076-3417/13/13/7916