Generating textual captions for ultrasound visuals in an automated fashion


Bibliographic Details
Main Author: Alsharid, M
Other Authors: Noble, A; Rittscher, J; Yaqub, M
Format: Thesis
Language: English
Published: 2021
Institution: University of Oxford
Subjects: Imaging; Fetus--Ultrasonic imaging; Computer vision; Natural language processing (Computer science)
Description

Generating captions for ultrasound images and videos is an area that has yet to be fully studied and explored. The aim of the work in this thesis is to learn joint image-text representations that describe ultrasound images with a rich vocabulary of nouns, verbs, and adjectives. Preparing medical image captioning benchmarks is challenging for two reasons: (a) describing medical images with specific terminology requires the expert knowledge of medical professionals; and (b) the sensitive nature of medical images prevents wide-scale annotation through, for instance, crowd-sourcing services (e.g. Amazon Mechanical Turk) and similar methods. Consequently, automatic image captioning has not been widely studied on ultrasound images, a challenge compounded by the lack of readily available large datasets of ultrasound images with captions.

First, the thesis explores different combinations of recurrent neural networks, concatenation techniques, and word embedding vectors across model architecture configurations. Through this process, we identify the configuration most suitable for the fetal ultrasound image captioning task and the dataset at hand, and show that a configuration incorporating an LSTM-RNN with word2vec embeddings and a merge-by-concatenation operation performed best. The thesis then explores three solutions to the challenges of working with real-world datasets. We introduce a curriculum learning-based strategy that uses the complexities of the image and text information to order the data for training, and show that training captioning models with the sample order determined by the curriculum achieves higher scores on the evaluation metrics with the same amount of data. We also augment the data by creating pseudo-captions to pair with caption-less images. Finally, we explore leveraging data from a different modality, specifically eye-gaze points, to supplement the available image-text data. We find that eye-gaze data can help train models that score somewhat higher on the evaluation metrics; however, since the improvements are small and the pre-training steps involved are considerable, we recommend that improving base models take precedence over relying on data from other modalities to improve the performance of captioning models.

To the best of our knowledge, the work in this thesis is the first attempt to perform automatic image captioning on fetal ultrasound images (video frames) using sonographer spoken words to describe their scanning experience. The thesis can serve as a blueprint for future work on fetal ultrasound captioning, providing guidelines to follow and pitfalls to avoid, and as an aid for those attempting medical image captioning more generally.
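The best-performing configuration described above pairs an image branch with an LSTM-RNN text branch and joins them by concatenation before predicting the next word. The snippet below is a minimal illustrative sketch of that general merge-by-concatenation architecture in Keras, not the thesis's actual model; the vocabulary size, caption length, feature dimension, and layer widths are placeholder assumptions, and pre-trained word2vec weights would be loaded into the embedding layer separately.

    from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, Concatenate
    from tensorflow.keras.models import Model

    VOCAB_SIZE = 5000   # assumed vocabulary size
    MAX_LEN = 20        # assumed maximum caption length
    FEAT_DIM = 2048     # assumed dimension of pre-extracted CNN image features
    EMBED_DIM = 300     # word2vec-style embedding dimension

    # Image branch: project pre-extracted image features to a common size.
    img_in = Input(shape=(FEAT_DIM,), name="image_features")
    img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embed the partial caption (the embedding matrix could be
    # initialised from pre-trained word2vec vectors) and summarise it with an LSTM.
    txt_in = Input(shape=(MAX_LEN,), name="partial_caption")
    txt_emb = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_in)
    txt_vec = LSTM(256)(txt_emb)

    # Merge by concatenation, then predict the next word of the caption.
    merged = Concatenate()([img_vec, txt_vec])
    hidden = Dense(256, activation="relu")(merged)
    next_word = Dense(VOCAB_SIZE, activation="softmax")(hidden)

    model = Model(inputs=[img_in, txt_in], outputs=next_word)
    model.compile(loss="categorical_crossentropy", optimizer="adam")

The curriculum learning strategy orders training samples from easy to hard using measures of image and text complexity. The following toy sketch assumes such per-sample scoring functions exist; the functions and weighting shown are hypothetical placeholders rather than the measures defined in the thesis.

    def curriculum_order(samples, image_complexity, text_complexity, alpha=0.5):
        """Sort samples by a weighted mix of image and text complexity (easiest first)."""
        def score(sample):
            return alpha * image_complexity(sample) + (1 - alpha) * text_complexity(sample)
        return sorted(samples, key=score)

    # Example with dummy measures: caption length as a crude proxy for text complexity.
    ordered = curriculum_order(
        samples=[{"caption": "four chamber view of the heart"}, {"caption": "heart"}],
        image_complexity=lambda s: 0.0,                       # placeholder
        text_complexity=lambda s: len(s["caption"].split()),
    )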