Image Captioning Using Motion-CNN with Object Detection

Automatic image captioning has many important applications, such as describing visual content for visually impaired people or indexing images on the internet. Recently, deep learning-based image captioning models have been researched extensively; for caption generation, they learn the relation between image features and the words included in captions. However, image features might not be relevant to certain words, such as verbs. Our earlier reported method therefore used motion features alongside image features to generate captions containing verbs. That method, however, used all available motion features, and since not all of them contributed positively to captioning, the unnecessary ones decreased captioning accuracy. Here, we use experiments with motion features to analyze thoroughly why this decline in accuracy occurs, and we propose a novel, end-to-end trainable image caption generation method that alleviates it. The proposed model was evaluated on three datasets: MSR-VTT2016-Image, MSCOCO, and several copyright-free images. The results demonstrate that the proposed method improves caption generation performance.

Bibliographic Details
Main Authors: Kiyohiko Iwamura, Jun Younes Louhi Kasahara, Alessandro Moro, Atsushi Yamashita, Hajime Asama
Format: Article
Language: English
Published: MDPI AG, 2021-02-01
Series: Sensors, Vol. 21, No. 4, Article 1270
ISSN: 1424-8220
DOI: 10.3390/s21041270
Affiliation: Department of Precision Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan (all authors)
Collection: DOAJ (Directory of Open Access Journals)
Subjects: deep learning; image captioning; motion estimation; object detection
Online Access: https://www.mdpi.com/1424-8220/21/4/1270
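
The abstract describes fusing conventional image features with motion features, with object detection used to keep only the motion features that actually help captioning. As a rough illustration of that idea only, the following is a minimal PyTorch sketch; the class name, feature dimensions, and the learned sigmoid gate standing in for the paper's detection-based feature selection are all assumptions inferred from the abstract, not the authors' published architecture.

```python
# Minimal sketch of the fusion idea from the abstract. All names,
# dimensions, and the gating scheme are illustrative assumptions,
# not the authors' actual Motion-CNN architecture.
import torch
import torch.nn as nn

class MotionAwareCaptioner(nn.Module):
    def __init__(self, vocab_size, img_dim=2048, motion_dim=1024, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)        # image-feature stream
        self.motion_proj = nn.Linear(motion_dim, hidden)  # motion-feature stream
        # Learned gate: lets the model down-weight motion features that do not
        # help captioning, the failure mode the abstract attributes to using
        # all motion features indiscriminately. In the paper this selection
        # reportedly involves object detection; a sigmoid gate stands in here.
        self.gate = nn.Linear(img_dim + motion_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_feat, motion_feat, captions):
        # img_feat: (B, img_dim); motion_feat: (B, motion_dim);
        # captions: (B, T) token ids of the ground-truth caption.
        g = torch.sigmoid(self.gate(torch.cat([img_feat, motion_feat], dim=1)))
        fused = self.img_proj(img_feat) + g * self.motion_proj(motion_feat)
        # Condition the LSTM decoder by prepending the fused feature vector
        # to the embedded caption tokens.
        seq = torch.cat([fused.unsqueeze(1), self.embed(captions)], dim=1)
        hidden_states, _ = self.lstm(seq)
        return self.out(hidden_states)  # (B, T + 1, vocab_size) logits

# Smoke test with random tensors standing in for CNN / Motion-CNN outputs.
model = MotionAwareCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 2048), torch.randn(2, 1024),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```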