Improving automatic speech recognition through head pose driven visual grounding

In this paper, we present a multimodal speech recognition system for real-world scene description tasks. Given a visual scene, the system dynamically biases its language model based on the content of the visual scene and the visual attention of the speaker. Visual attention is used to focus on likely ob...


Bibliographic details
Main author: Vosoughi, Soroush
Other authors: Massachusetts Institute of Technology. Media Laboratory
Format: Article
Language: en_US
Published: Association for Computing Machinery 2014
Links: http://hdl.handle.net/1721.1/86943
https://orcid.org/0000-0002-2564-8909