Generating textual captions for ultrasound visuals in an automated fashion


Bibliographic Details
Main Author: Alsharid, M
Other Authors: Noble, A; Rittscher, J; Yaqub, M
Format: Thesis
Language: English
Published: 2021
Institution: University of Oxford
Subjects: Imaging; Fetus--Ultrasonic imaging; Computer vision; Natural language processing (Computer science)
Description

Generating captions for ultrasound images and videos is an area that has yet to be fully studied and explored. The aim of the work in this thesis is to learn joint image-text representations that describe ultrasound images with a rich vocabulary of nouns, verbs, and adjectives. Preparing medical image captioning benchmarks is challenging for two reasons: (a) describing medical images with specific terminology requires the expert knowledge of medical professionals; and (b) the sensitive nature of medical images prevents wide-scale annotation through, for instance, crowd-sourcing services (e.g. Amazon Mechanical Turk) and similar methods. Consequently, automatic image captioning has not been widely studied on ultrasound images, a challenge compounded by the lack of readily available large datasets of ultrasound images with captions.

First, the thesis explores different combinations of recurrent neural networks, concatenation techniques, and word embedding vectors across model architecture configurations. Through this process, we identify the configuration most suitable for the fetal ultrasound image captioning task and the dataset at hand, and show that a configuration incorporating an LSTM-RNN with word2vec embeddings and a merge-by-concatenation operation performed best. The thesis then explores three solutions to the challenges of working with real-world datasets. We introduce a curriculum learning-based strategy that uses the complexities of the image and text information to order the data for training, and show that training captioning models with the sample order determined by the curriculum achieves higher scores on the evaluation metrics with the same amount of data. We also augment the data by creating pseudo-captions to pair with caption-less images. Finally, we explore leveraging data from a different modality, specifically eye-gaze points, to supplement the available image-text data. We find that eye-gaze data can help train models that score somewhat higher on the evaluation metrics; however, since the improvements are small and the pre-training steps involved are considerable, we recommend that improving base models take precedence over relying on data from other modalities to improve the performance of captioning models.

To the best of our knowledge, the work in this thesis is the first attempt to perform automatic image captioning on fetal ultrasound images (video frames) using sonographer spoken words to describe their scanning experience. The thesis can serve as a blueprint for future work on fetal ultrasound captioning, providing guidelines to follow and pitfalls to avoid, and as an aid for those attempting medical image captioning more generally.
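The best-performing configuration described above pairs an image branch with an LSTM-RNN text branch and joins them by concatenation before predicting the next word. The snippet below is a minimal illustrative sketch of that general merge-by-concatenation architecture in Keras, not the thesis's actual model; the vocabulary size, caption length, feature dimension, and layer widths are placeholder assumptions, and pre-trained word2vec weights would be loaded into the embedding layer separately.

    from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, Concatenate
    from tensorflow.keras.models import Model

    VOCAB_SIZE = 5000   # assumed vocabulary size
    MAX_LEN = 20        # assumed maximum caption length
    FEAT_DIM = 2048     # assumed dimension of pre-extracted CNN image features
    EMBED_DIM = 300     # word2vec-style embedding dimension

    # Image branch: project pre-extracted image features to a common size.
    img_in = Input(shape=(FEAT_DIM,), name="image_features")
    img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embed the partial caption (the embedding matrix could be
    # initialised from pre-trained word2vec vectors) and summarise it with an LSTM.
    txt_in = Input(shape=(MAX_LEN,), name="partial_caption")
    txt_emb = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_in)
    txt_vec = LSTM(256)(txt_emb)

    # Merge by concatenation, then predict the next word of the caption.
    merged = Concatenate()([img_vec, txt_vec])
    hidden = Dense(256, activation="relu")(merged)
    next_word = Dense(VOCAB_SIZE, activation="softmax")(hidden)

    model = Model(inputs=[img_in, txt_in], outputs=next_word)
    model.compile(loss="categorical_crossentropy", optimizer="adam")

The curriculum learning strategy orders training samples from easy to hard using measures of image and text complexity. The following toy sketch assumes such per-sample scoring functions exist; the functions and weighting shown are hypothetical placeholders rather than the measures defined in the thesis.

    def curriculum_order(samples, image_complexity, text_complexity, alpha=0.5):
        """Sort samples by a weighted mix of image and text complexity (easiest first)."""
        def score(sample):
            return alpha * image_complexity(sample) + (1 - alpha) * text_complexity(sample)
        return sorted(samples, key=score)

    # Example with dummy measures: caption length as a crude proxy for text complexity.
    ordered = curriculum_order(
        samples=[{"caption": "four chamber view of the heart"}, {"caption": "heart"}],
        image_complexity=lambda s: 0.0,                       # placeholder
        text_complexity=lambda s: len(s["caption"].split()),
    )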