Style-Enhanced Transformer for Image Captioning in Construction Scenes
Image captioning is important for improving the intelligence of construction projects and assisting managers in mastering construction site activities. However, there are few image-captioning models for construction scenes at present, and the existing methods do not perform well in complex construct...
Main Authors: | Kani Song, Linlin Chen, Hengyou Wang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-03-01 |
Series: | Entropy |
Subjects: | image captioning; construction scene; style feature; transformer |
Online Access: | https://www.mdpi.com/1099-4300/26/3/224 |
_version_ | 1797241149366206464 |
---|---|
author | Kani Song; Linlin Chen; Hengyou Wang
author_facet | Kani Song; Linlin Chen; Hengyou Wang
author_sort | Kani Song |
collection | DOAJ |
description | Image captioning is important for improving the intelligence of construction projects and for assisting managers in monitoring construction site activities. However, few image-captioning models currently target construction scenes, and existing methods do not perform well in complex construction scenes. Based on the characteristics of construction scenes, we annotate a text description dataset built on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, called SETCAP for short. Specifically, we extract grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detailed semantic features but also extract style information with a style encoder. In the decoder, we integrate the style information into the text features; the image semantic information then interacts with the text features to generate content-appropriate sentences word by word. Finally, we add a sentence style loss to the total loss function to bring the style of the generated sentences closer to that of the training set. Experimental results show that the proposed method achieves encouraging results on both the MSCOCO and MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% CIDEr on the MOCS dataset and by 3.9% CIDEr on the MSCOCO dataset. |
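The abstract's training objective, a word-level caption loss plus a weighted sentence style loss, can be sketched as follows. This is not the paper's implementation: the function names, the mean-squared style distance, and the weight `lam` are all assumptions used purely for illustration.

```python
import math

def cross_entropy(probs, target_ids):
    """Average negative log-likelihood of the target words under the
    decoder's per-step probability distributions."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)

def style_loss(pred_style, ref_style):
    """Hypothetical sentence style loss: mean squared distance between the
    generated sentence's style vector and a training-set reference style."""
    return sum((a - b) ** 2 for a, b in zip(pred_style, ref_style)) / len(ref_style)

def total_loss(probs, target_ids, pred_style, ref_style, lam=0.1):
    """Caption cross-entropy plus a weighted sentence style term, as the
    abstract describes adding the style loss into the total loss."""
    return cross_entropy(probs, target_ids) + lam * style_loss(pred_style, ref_style)

# Toy example: two decoding steps over a 3-word vocabulary.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = total_loss(probs, [0, 1], pred_style=[0.5, 0.5], ref_style=[0.4, 0.6], lam=0.1)
```

In the paper the style vector would come from the proposed style encoder; here it is just a placeholder list, and `lam` stands in for whatever weighting the authors actually use.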
first_indexed | 2024-04-24T18:18:44Z |
format | Article |
id | doaj.art-0e1c4c09135d4118ac5cd7fbbb93198b |
institution | Directory Open Access Journal |
issn | 1099-4300 |
language | English |
last_indexed | 2024-04-24T18:18:44Z |
publishDate | 2024-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Entropy |
spelling | doaj.art-0e1c4c09135d4118ac5cd7fbbb93198b; 2024-03-27T13:36:54Z; eng; MDPI AG; Entropy; ISSN 1099-4300; 2024-03-01; vol. 26, no. 3, art. 224; doi:10.3390/e26030224; Style-Enhanced Transformer for Image Captioning in Construction Scenes; Kani Song, Linlin Chen, Hengyou Wang (School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China); Image captioning is important for improving the intelligence of construction projects and assisting managers in mastering construction site activities. However, there are few image-captioning models for construction scenes at present, and the existing methods do not perform well in complex construction scenes. According to the characteristics of construction scenes, we label a text description dataset based on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, simply called SETCAP. Specifically, we extract the grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detail semantic features but also extract style information by style encoder. In addition, in the decoder, we integrate the style information into the text features. The interaction between the image semantic information and the text features is carried out to generate content-appropriate sentences word by word. Finally, we add the sentence style loss into the total loss function to make the style of generated sentences closer to the training set. The experimental results show that the proposed method achieves encouraging results on both the MSCOCO and the MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% CIDEr scores on the MOCS dataset and 3.9% CIDEr scores on the MSCOCO dataset, respectively. https://www.mdpi.com/1099-4300/26/3/224; image captioning; construction scene; style feature; transformer |
spellingShingle | Kani Song; Linlin Chen; Hengyou Wang; Style-Enhanced Transformer for Image Captioning in Construction Scenes; Entropy; image captioning; construction scene; style feature; transformer
title | Style-Enhanced Transformer for Image Captioning in Construction Scenes |
title_full | Style-Enhanced Transformer for Image Captioning in Construction Scenes |
title_fullStr | Style-Enhanced Transformer for Image Captioning in Construction Scenes |
title_full_unstemmed | Style-Enhanced Transformer for Image Captioning in Construction Scenes |
title_short | Style-Enhanced Transformer for Image Captioning in Construction Scenes |
title_sort | style enhanced transformer for image captioning in construction scenes |
topic | image captioning; construction scene; style feature; transformer
url | https://www.mdpi.com/1099-4300/26/3/224 |
work_keys_str_mv | AT kanisong styleenhancedtransformerforimagecaptioninginconstructionscenes AT linlinchen styleenhancedtransformerforimagecaptioninginconstructionscenes AT hengyouwang styleenhancedtransformerforimagecaptioninginconstructionscenes |