A Unified Visual and Linguistic Semantics Method for Enhanced Image Captioning

Image captioning, also recognized as the challenge of transforming visual data into coherent natural language descriptions, has persisted as a complex problem. Traditional approaches often suffer from semantic gaps, wherein the generated textual descriptions lack depth, context, or the nuanced relationships contained within the images. In an effort to overcome these limitations, we introduce a novel encoder–decoder framework called A Unified Visual and Linguistic Semantics Method. Our method comprises three key components: an encoder, a mapping network, and a decoder. The encoder employs a fusion of CLIP (Contrastive Language–Image Pre-training) and SegmentCLIP to process and extract salient image features. SegmentCLIP builds upon CLIP’s foundational architecture by employing a clustering mechanism, thereby enhancing the semantic relationships between textual and visual elements in the image. The extracted features are then transformed by a mapping network into a fixed-length prefix. A GPT-2-based decoder subsequently generates a corresponding Chinese language description for the image. This framework aims to harmonize feature extraction and semantic enrichment, thereby producing more contextually accurate and comprehensive image descriptions. Our quantitative assessment reveals that our model exhibits notable enhancements across the intricate AIC-ICC, Flickr8k-CN, and COCO-CN datasets, evidenced by a 2% improvement in BLEU@4 and a 10% uplift in CIDEr scores. Additionally, it demonstrates acceptable efficiency in terms of simplicity, speed, and reduction in computational burden.
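Note: the pipeline described in the abstract (CLIP/SegmentCLIP image features, a mapping network producing a fixed-length prefix, and a GPT-2-based decoder) can be illustrated with a minimal PyTorch sketch. This is not the authors' code: the class name MappingNetwork, the dimensions (512 for CLIP features, 768 for GPT-2 embeddings), and the prefix length of 10 are assumptions chosen only to show how a fixed-length prefix would be produced and spliced into the decoder's input embeddings; the SegmentCLIP clustering step and the GPT-2 decoder itself are not implemented here.

    import torch
    import torch.nn as nn

    class MappingNetwork(nn.Module):
        """Hypothetical MLP that maps a single CLIP image embedding to a
        fixed-length prefix of `prefix_len` vectors in the decoder's
        embedding space (the mapping-network role from the abstract)."""
        def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
            super().__init__()
            self.prefix_len = prefix_len
            self.gpt_dim = gpt_dim
            hidden = (clip_dim + gpt_dim * prefix_len) // 2
            self.mlp = nn.Sequential(
                nn.Linear(clip_dim, hidden),
                nn.Tanh(),
                nn.Linear(hidden, gpt_dim * prefix_len),
            )

        def forward(self, clip_features):            # (batch, clip_dim)
            prefix = self.mlp(clip_features)          # (batch, prefix_len * gpt_dim)
            return prefix.view(-1, self.prefix_len, self.gpt_dim)

    if __name__ == "__main__":
        batch = 4
        clip_features = torch.randn(batch, 512)       # stand-in for (Segment)CLIP encoder output
        mapper = MappingNetwork()
        prefix_embeds = mapper(clip_features)          # (4, 10, 768)
        caption_embeds = torch.randn(batch, 20, 768)   # stand-in for caption token embeddings
        # The prefix is prepended to the token embeddings and fed to a
        # GPT-2-style decoder, which then generates the Chinese caption.
        decoder_input = torch.cat([prefix_embeds, caption_embeds], dim=1)
        print(decoder_input.shape)                     # torch.Size([4, 30, 768])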

Bibliographic Details
Main Authors: Jiajia Peng, Tianbing Tang
Format: Article
Language: English
Published: MDPI AG, 2024-03-01
Series: Applied Sciences, Vol. 14, No. 6, Article 2657
ISSN: 2076-3417
DOI: 10.3390/app14062657
Author Affiliation: School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
Subjects: image captioning; image features; clustering mechanism; Chinese language description
Online Access: https://www.mdpi.com/2076-3417/14/6/2657