Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning

Transformer-based approaches have shown good results in image captioning tasks. However, current approaches are limited to generating text from global features of an entire image. Therefore, we propose two novel methods for generating better image captions: (1) the Global-Local Visual...

Bibliographic Details
Main Authors:	Hojun Lee, Hyunjun Cho, Jieun Park, Jinyeong Chae, Jihie Kim
Format:	Article
Language:	English
Published:	MDPI AG 2022-02-01
Series:	Sensors
Subjects:	medical image captioning deep learning transformer
Online Access:	https://www.mdpi.com/1424-8220/22/4/1429

_version_	1797476726816636928
author	Hojun Lee Hyunjun Cho Jieun Park Jinyeong Chae Jihie Kim
author_facet	Hojun Lee Hyunjun Cho Jieun Park Jinyeong Chae Jihie Kim
author_sort	Hojun Lee
collection	DOAJ
description	Transformer-based approaches have shown good results in image captioning tasks. However, current approaches are limited to generating text from global features of an entire image. Therefore, we propose two novel methods for generating better image captions: (1) the Global-Local Visual Extractor (GLVE), which captures both global and local features, and (2) the Cross Encoder-Decoder Transformer (CEDT), which injects multi-level encoder features into the decoding process. GLVE extracts not only global visual features obtained from an entire image, such as organ size or bone structure, but also local visual features obtained from a local region, such as a lesion area. Given an image, CEDT can create a detailed description of the overall features by injecting both low-level and high-level encoder outputs into the decoder. Each method contributes to the performance improvement and helps generate descriptions of features such as organ size and bone structure. The proposed model was evaluated on the IU X-ray dataset and outperformed the transformer-based baseline by 5.6% in BLEU score, 0.56% in METEOR, and 1.98% in ROUGE-L.
first_indexed	2024-03-09T21:06:45Z
format	Article
id	doaj.art-04b970dd6f314f269a9796220032b5e0
institution	Directory Open Access Journal
issn	1424-8220
language	English
last_indexed	2024-03-09T21:06:45Z
publishDate	2022-02-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj.art-04b970dd6f314f269a9796220032b5e0 | 2023-11-23T21:59:22Z | eng | MDPI AG | Sensors | 1424-8220 | 2022-02-01 | 22 | 4 | 1429 | 10.3390/s22041429 | Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning | Hojun Lee (0) | Hyunjun Cho (1) | Jieun Park (2) | Jinyeong Chae (3) | Jihie Kim (4) | (0) Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Korea | (1) Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Korea | (2) Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Korea | (3) Department of Artificial Intelligence, Dongguk University, Seoul 04620, Korea | (4) Department of Artificial Intelligence, Dongguk University, Seoul 04620, Korea | https://www.mdpi.com/1424-8220/22/4/1429 | medical image captioning | deep learning | transformer
spellingShingle	Hojun Lee Hyunjun Cho Jieun Park Jinyeong Chae Jihie Kim Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning Sensors medical image captioning deep learning transformer
title	Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
title_full	Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
title_fullStr	Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
title_full_unstemmed	Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
title_short	Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
title_sort	cross encoder decoder transformer with global local visual extractor for medical image captioning
topic	medical image captioning deep learning transformer
url	https://www.mdpi.com/1424-8220/22/4/1429
work_keys_str_mv	AT hojunlee crossencoderdecodertransformerwithgloballocalvisualextractorformedicalimagecaptioning AT hyunjuncho crossencoderdecodertransformerwithgloballocalvisualextractorformedicalimagecaptioning AT jieunpark crossencoderdecodertransformerwithgloballocalvisualextractorformedicalimagecaptioning AT jinyeongchae crossencoderdecodertransformerwithgloballocalvisualextractorformedicalimagecaptioning AT jihiekim crossencoderdecodertransformerwithgloballocalvisualextractorformedicalimagecaptioning

Similar Items