Improving automatic speech recognition through head pose driven visual grounding

In this paper, we present a multimodal speech recognition system for real-world scene-description tasks. Given a visual scene, the system dynamically biases its language model based on the content of the scene and the visual attention of the speaker; visual attention is used to focus on the objects most likely to be described. Given a spoken description, the system then uses the visually biased language model to process the speech. Head pose serves as a proxy for the speaker's visual attention. Readily available, standard computer-vision algorithms recognize the objects in the scene, and automatic, real-time head-pose estimation is performed on depth data captured with a Microsoft Kinect. The system was evaluated on multiple participants; overall, incorporating visual information into the speech recognizer substantially improved recognition accuracy. The rapidly decreasing cost of 3D sensing technologies such as the Kinect allows systems with similar underlying principles to be used for many speech recognition tasks where visual information is available.
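The abstract describes a two-step architecture: head pose selects the attended objects in the scene, and those objects in turn reweight the recognizer's language model. The Python sketch below illustrates one way that idea could work; the function names, the angular-distance attention model, and the boost constant are illustrative assumptions, not the implementation reported in the paper.

import math

def attention_weights(head_yaw_pitch, object_directions, sharpness=8.0):
    """Map a head-pose estimate to a soft attention distribution over objects.

    head_yaw_pitch    -- (yaw, pitch) of the head in radians, e.g. from
                         Kinect depth-based head-pose estimation
    object_directions -- dict: object label -> (yaw, pitch) of its direction
    sharpness         -- how strongly attention concentrates on the nearest object
    """
    scores = {}
    for label, (yaw, pitch) in object_directions.items():
        # Angular distance between the head-pose direction (a gaze proxy)
        # and the direction of each detected object.
        d = math.hypot(head_yaw_pitch[0] - yaw, head_yaw_pitch[1] - pitch)
        scores[label] = math.exp(-sharpness * d)
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def bias_language_model(base_lm, attention, boost=5.0):
    """Boost unigram probabilities of attended object words, then renormalize."""
    biased = {}
    for word, p in base_lm.items():
        w = attention.get(word, 0.0)
        # Interpolate between no boost (attention 0) and full boost (attention 1).
        biased[word] = p * (1.0 + (boost - 1.0) * w)
    total = sum(biased.values())
    return {word: p / total for word, p in biased.items()}

# Example: the head is turned toward the cup, so "cup" gains probability mass
# that a downstream recognizer could use when scoring hypotheses.
objects = {"cup": (0.30, -0.10), "ball": (-0.50, 0.05)}
attn = attention_weights((0.25, -0.05), objects)
lm = {"the": 0.30, "cup": 0.05, "ball": 0.05, "is": 0.30, "red": 0.30}
print(bias_language_model(lm, attn))

A real system would apply such biasing when rescoring recognizer hypotheses rather than to a toy unigram table, but the renormalization step is the same.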

Bibliographic Details
Main Author: Vosoughi, Soroush
Other Authors: Massachusetts Institute of Technology. Media Laboratory; Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Format: Article
Language: en_US
Published: Association for Computing Machinery, 2014
Published in: Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems (CHI '14), April 26–May 1, 2014, Toronto, ON, Canada
ISBN: 9781450324731
DOI: http://dx.doi.org/10.1145/2556288.2556957
Online Access: http://hdl.handle.net/1721.1/86943
ORCID: https://orcid.org/0000-0002-2564-8909
Citation: Vosoughi, Soroush. "Improving Automatic Speech Recognition through Head Pose Driven Visual Grounding." Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems - CHI '14 (2014).
Rights: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.