Style-Enhanced Transformer for Image Captioning in Construction Scenes

Image captioning is important for improving the intelligence of construction projects and assisting managers in mastering construction site activities. However, there are few image-captioning models for construction scenes at present, and the existing methods do not perform well in complex construction scenes. According to the characteristics of construction scenes, we label a text description dataset based on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, called SETCAP for short. Specifically, we extract grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detailed semantic features but also extract style information with a style encoder. In addition, in the decoder, we integrate the style information into the text features. The interaction between the image semantic information and the text features is carried out to generate content-appropriate sentences word by word. Finally, we add a sentence style loss to the total loss function to make the style of the generated sentences closer to that of the training set. The experimental results show that the proposed method achieves encouraging results on both the MSCOCO and MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% in CIDEr score on the MOCS dataset and by 3.9% in CIDEr score on the MSCOCO dataset.

Bibliographic Details
Main Authors: Kani Song, Linlin Chen, Hengyou Wang (School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China)
Format: Article
Language: English
Published: MDPI AG, 2024-03-01
Series: Entropy, Vol. 26, No. 3, Article 224
ISSN: 1099-4300
DOI: 10.3390/e26030224
Subjects: image captioning; construction scene; style feature; transformer
Online Access: https://www.mdpi.com/1099-4300/26/3/224
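The abstract describes training with a total loss that combines the usual captioning objective with a sentence style loss that pulls generated sentences toward the style of the training set. A minimal NumPy sketch of that idea follows; the squared-distance form of the style term, the weighting factor lam, and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def caption_loss(logits, target_ids):
    """Standard per-word cross-entropy captioning loss.

    logits: [T, V] unnormalized scores over a vocabulary of size V,
    target_ids: [T] ground-truth word indices.
    """
    # Numerically stable softmax over the vocabulary axis.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked + 1e-12)))

def style_loss(pred_style, ref_style):
    """Hypothetical sentence style term: mean squared distance between the
    generated sentence's style embedding and a training-set reference."""
    return float(np.mean((pred_style - ref_style) ** 2))

def total_loss(logits, target_ids, pred_style, ref_style, lam=0.1):
    """Total objective as described in the abstract: caption loss plus a
    weighted sentence style loss. lam is an assumed hyperparameter."""
    return caption_loss(logits, target_ids) + lam * style_loss(pred_style, ref_style)
```

With uniform logits the caption loss reduces to log(V), which makes the sketch easy to sanity-check by hand.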