Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks

In this work, we present a novel gaze-assisted natural language processing (NLP)-based video captioning model to describe routine second-trimester fetal ultrasound scan videos in a vocabulary of spoken sonography. The primary novelty of our multi-modal approach is that the learned video captioning model is built using a combination of ultrasound video, tracked gaze and textual transcriptions from speech recordings.

Detailed description

Bibliographic details
Main authors: Alsharid, M, Cai, Y, Sharma, H, Drukker, L, Noble, JA, Papageorghiou, AT
Format: Journal article
Language: English
Published: Elsevier 2022
_version_ 1826309541595185152
author Alsharid, M
Cai, Y
Sharma, H
Drukker, L
Noble, JA
Papageorghiou, AT
author_facet Alsharid, M
Cai, Y
Sharma, H
Drukker, L
Noble, JA
Papageorghiou, AT
author_sort Alsharid, M
collection OXFORD
description In this work, we present a novel gaze-assisted natural language processing (NLP)-based video captioning model to describe routine second-trimester fetal ultrasound scan videos in a vocabulary of spoken sonography. The primary novelty of our multi-modal approach is that the learned video captioning model is built using a combination of ultrasound video, tracked gaze and textual transcriptions from speech recordings. The textual captions that describe the spatio-temporal scan video content are learnt from sonographer speech recordings. The generation of captions is assisted by sonographer gaze-tracking information reflecting their visual attention while performing live-imaging and interpreting a frozen image. To evaluate the effect of adding, or withholding, different forms of gaze on the video model, we compare spatio-temporal deep networks trained using three multi-modal configurations, namely: (1) a gaze-less neural network with only text and video as input, (2) a neural network additionally using real sonographer gaze in the form of attention maps, and (3) a neural network using automatically-predicted gaze in the form of saliency maps instead. We assess algorithm performance through established general text-based metrics (BLEU, ROUGE-L, F1 score), a domain-specific metric (ARS), and metrics that consider the richness and efficiency of the generated captions with respect to the scan video. Results show that the proposed gaze-assisted models can generate richer and more diverse captions for clinical fetal ultrasound scan videos than those without gaze at the expense of the perceived sentence structure. The results also show that the generated captions are similar to sonographer speech in terms of discussing the visual content and the scanning actions performed.
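To make the three configurations described in the abstract concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how a gaze map might be fused with ultrasound video frames by elementwise gating before a recurrent caption decoder. The class name, layer sizes, the gating-style fusion, and the LSTM decoder are all assumptions introduced purely for illustration: configuration (1), gaze-less, corresponds to passing a uniform map; (2) to a tracked-gaze attention map; and (3) to a model-predicted saliency map.

```python
# Illustrative sketch only; module names, dimensions, and the fusion strategy are assumptions.
import torch
import torch.nn as nn


class GazeGatedCaptioner(nn.Module):
    """Toy three-way multi-modal captioner: video frames + gaze map -> per-step word logits."""

    def __init__(self, vocab_size: int, feat_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # Small convolutional encoder standing in for a video frame encoder.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Recurrent decoder standing in for the text-generation branch.
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames: torch.Tensor, gaze_maps: torch.Tensor) -> torch.Tensor:
        # frames:    (batch, time, 1, H, W) grayscale ultrasound frames
        # gaze_maps: (batch, time, 1, H, W) attention/saliency maps in [0, 1]
        b, t, c, h, w = frames.shape
        # Gate each frame by its gaze map (elementwise weighting), then encode.
        gated = (frames * gaze_maps).view(b * t, c, h, w)
        feats = self.frame_encoder(gated).mean(dim=(2, 3)).view(b, t, -1)
        # Decode one word-logit vector per time step.
        hidden, _ = self.decoder(feats)
        return self.classifier(hidden)


if __name__ == "__main__":
    model = GazeGatedCaptioner(vocab_size=100)
    frames = torch.rand(2, 8, 1, 64, 64)
    # Configuration (1), gaze-less: a uniform map leaves the frames unchanged.
    uniform = torch.ones_like(frames)
    # Configurations (2)/(3): tracked-gaze attention maps or predicted saliency
    # maps would be supplied here; random maps serve only as placeholders.
    gaze = torch.rand_like(frames)
    print(model(frames, uniform).shape)  # torch.Size([2, 8, 100])
    print(model(frames, gaze).shape)     # torch.Size([2, 8, 100])
```

Elementwise gating is only one plausible fusion strategy (channel concatenation or attention-weighted pooling are alternatives); the record above does not specify which mechanism the authors used.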
first_indexed 2024-03-07T07:35:46Z
format Journal article
id oxford-uuid:fd0b3a29-bf34-4f6c-a3e2-54ebfdc14618
institution University of Oxford
language English
last_indexed 2024-03-07T07:35:46Z
publishDate 2022
publisher Elsevier
record_format dspace
spelling oxford-uuid:fd0b3a29-bf34-4f6c-a3e2-54ebfdc14618 2023-03-07T11:13:16Z Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks Journal article http://purl.org/coar/resource_type/c_dcae04bc uuid:fd0b3a29-bf34-4f6c-a3e2-54ebfdc14618 English Symplectic Elements Elsevier 2022 Alsharid, M; Cai, Y; Sharma, H; Drukker, L; Noble, JA; Papageorghiou, AT In this work, we present a novel gaze-assisted natural language processing (NLP)-based video captioning model to describe routine second-trimester fetal ultrasound scan videos in a vocabulary of spoken sonography. The primary novelty of our multi-modal approach is that the learned video captioning model is built using a combination of ultrasound video, tracked gaze and textual transcriptions from speech recordings. The textual captions that describe the spatio-temporal scan video content are learnt from sonographer speech recordings. The generation of captions is assisted by sonographer gaze-tracking information reflecting their visual attention while performing live-imaging and interpreting a frozen image. To evaluate the effect of adding, or withholding, different forms of gaze on the video model, we compare spatio-temporal deep networks trained using three multi-modal configurations, namely: (1) a gaze-less neural network with only text and video as input, (2) a neural network additionally using real sonographer gaze in the form of attention maps, and (3) a neural network using automatically-predicted gaze in the form of saliency maps instead. We assess algorithm performance through established general text-based metrics (BLEU, ROUGE-L, F1 score), a domain-specific metric (ARS), and metrics that consider the richness and efficiency of the generated captions with respect to the scan video. Results show that the proposed gaze-assisted models can generate richer and more diverse captions for clinical fetal ultrasound scan videos than those without gaze at the expense of the perceived sentence structure. The results also show that the generated captions are similar to sonographer speech in terms of discussing the visual content and the scanning actions performed.
spellingShingle Alsharid, M
Cai, Y
Sharma, H
Drukker, L
Noble, JA
Papageorghiou, AT
Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks
title Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks
title_full Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks
title_fullStr Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks
title_full_unstemmed Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks
title_short Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks
title_sort gaze assisted automatic captioning of fetal ultrasound videos using three way multi modal deep neural networks
work_keys_str_mv AT alsharidm gazeassistedautomaticcaptioningoffetalultrasoundvideosusingthreewaymultimodaldeepneuralnetworks
AT caiy gazeassistedautomaticcaptioningoffetalultrasoundvideosusingthreewaymultimodaldeepneuralnetworks
AT sharmah gazeassistedautomaticcaptioningoffetalultrasoundvideosusingthreewaymultimodaldeepneuralnetworks
AT drukkerl gazeassistedautomaticcaptioningoffetalultrasoundvideosusingthreewaymultimodaldeepneuralnetworks
AT nobleja gazeassistedautomaticcaptioningoffetalultrasoundvideosusingthreewaymultimodaldeepneuralnetworks
AT papageorghiouat gazeassistedautomaticcaptioningoffetalultrasoundvideosusingthreewaymultimodaldeepneuralnetworks