Style-Enhanced Transformer for Image Captioning in Construction Scenes

Image captioning is important for improving the intelligence of construction projects and assisting managers in mastering construction site activities. However, there are few image-captioning models for construction scenes at present, and the existing methods do not perform well in complex construction scenes. According to the characteristics of construction scenes, we label a text description dataset based on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, called SETCAP for short. Specifically, we extract grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detailed semantic features but also extract style information with a style encoder. In addition, in the decoder, we integrate the style information into the text features. The interaction between the image semantic information and the text features is carried out to generate content-appropriate sentences word by word. Finally, we add a sentence style loss to the total loss function to make the style of the generated sentences closer to that of the training set. The experimental results show that the proposed method achieves encouraging results on both the MSCOCO and MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% in CIDEr score on the MOCS dataset and by 3.9% in CIDEr score on the MSCOCO dataset.

Bibliographic Details
Main Authors: Kani Song, Linlin Chen, Hengyou Wang (School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China)
Format: Article
Language: English
Published: MDPI AG, 2024-03-01
Series: Entropy, Vol. 26, No. 3, Article 224
ISSN: 1099-4300
DOI: 10.3390/e26030224
Subjects: image captioning; construction scene; style feature; transformer
Online Access: https://www.mdpi.com/1099-4300/26/3/224
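The abstract describes training with a total loss that combines the usual captioning objective with a sentence style loss that pulls generated sentences toward the style of the training set. A minimal NumPy sketch of that idea follows; the squared-distance form of the style term, the weighting factor lam, and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def caption_loss(logits, target_ids):
    """Standard per-word cross-entropy captioning loss.

    logits: [T, V] unnormalized scores over a vocabulary of size V,
    target_ids: [T] ground-truth word indices.
    """
    # Numerically stable softmax over the vocabulary axis.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked + 1e-12)))

def style_loss(pred_style, ref_style):
    """Hypothetical sentence style term: mean squared distance between the
    generated sentence's style embedding and a training-set reference."""
    return float(np.mean((pred_style - ref_style) ** 2))

def total_loss(logits, target_ids, pred_style, ref_style, lam=0.1):
    """Total objective as described in the abstract: caption loss plus a
    weighted sentence style loss. lam is an assumed hyperparameter."""
    return caption_loss(logits, target_ids) + lam * style_loss(pred_style, ref_style)
```

With uniform logits the caption loss reduces to log(V), which makes the sketch easy to sanity-check by hand.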