Ultrasound image representation learning by modeling sonographer visual attention

Image representations are commonly learned from class labels, which are a simplistic approximation of human image understanding. In this paper we demonstrate that transferable representations of images can be learned without manual annotations by modeling human visual attention. The basis of our analyses is a unique gaze tracking dataset of sonographers performing routine clinical fetal anomaly screenings. Models of sonographer visual attention are learned by training a convolutional neural network (CNN) to predict gaze on ultrasound video frames through visual saliency prediction or gaze-point regression. We evaluate the transferability of the learned representations to the task of ultrasound standard plane detection in two contexts. Firstly, we perform transfer learning by fine-tuning the CNN with a limited number of labeled standard plane images. We find that fine-tuning the saliency predictor is superior to training from random initialization, with an average F1-score improvement of 9.6% overall and 15.3% for the cardiac planes. Secondly, we train a simple softmax regression on the feature activations of each CNN layer in order to evaluate the representations independently of transfer learning hyper-parameters. We find that the attention models derive strong representations, approaching the precision of a fully-supervised baseline model for all but the last layer.
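
The layer-wise evaluation described in the abstract (a softmax regression fitted to the frozen activations of each CNN layer) can be illustrated with a short sketch. This is not the authors' code: the `SaliencyCNN` backbone, its convolutional blocks, the grayscale input assumption, and the loader of labeled standard-plane frames are hypothetical stand-ins; only the probing procedure follows the description above.

```python
# Minimal sketch of a layer-wise linear probe, assuming a hypothetical
# gaze/saliency backbone and a DataLoader yielding (frames, labels).
import torch
import torch.nn as nn

class SaliencyCNN(nn.Module):
    """Stand-in saliency backbone; the paper's architecture may differ."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def features(self, x):
        """Return the activation of every block for layer-wise probing."""
        acts = []
        for block in self.blocks:
            x = block(x)
            acts.append(x)
        return acts

def fit_linear_probe(backbone, loader, layer_idx, num_classes, epochs=5, device="cpu"):
    """Train a softmax regression on the frozen activations of one layer."""
    backbone.eval().to(device)
    for p in backbone.parameters():
        p.requires_grad_(False)  # the learned representation stays fixed

    # Infer the channel dimension of the probed layer from one batch.
    frames, _ = next(iter(loader))
    with torch.no_grad():
        feat = backbone.features(frames.to(device))[layer_idx]
    dim = feat.shape[1]

    # Softmax regression = a single linear layer trained with cross-entropy.
    probe = nn.Linear(dim, num_classes).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    for _ in range(epochs):
        for frames, labels in loader:
            frames, labels = frames.to(device), labels.to(device)
            with torch.no_grad():
                feat = backbone.features(frames)[layer_idx]
            pooled = feat.mean(dim=(2, 3))  # global average pooling over space
            loss = nn.functional.cross_entropy(probe(pooled), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

Running `fit_linear_probe` once per layer index and comparing the resulting validation scores gives the kind of layer-by-layer comparison against a fully-supervised baseline that the abstract reports.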

Bibliographic Details
Main Authors: Droste, R, Cai, Y, Sharma, H, Chatelain, P, Drukker, L, Papageorghiou, A, Noble, J
Format: Conference item
Published: Springer 2019
Institution: University of Oxford