Unsupervised Phonetic Category Learning from Audio and Visual Input

Understanding how children learn the phonetic categories of their native language is an open area of research in cognitive science and child language development. However, despite experimental evidence that phonetic processing is very often a multimodal phenomenon (involving both auditory and visual cues), computational research has primarily modeled phonetic category learning as a function of only auditory input. In this thesis, I investigate whether multimodal information benefits phonetic category learning under a clustering model. Due to the lack of an appropriate dataset, I also introduce a method for creating a high-quality dataset of synthetic videos of speakers’ faces for an existing audio corpus. A model trained and tested on audiovisual data achieves up to a 9.1% improvement over the random baseline on a phoneme discrimination battery, compared to a model trained and tested on audio data alone. The audiovisual model also outperforms the audio model by up to 4.7% over the baseline when both are tested on audio-only data, suggesting that visual information guides the learner towards better clusters. Further analysis indicates that visual information benefits most, but not all, phonemic contrasts. In follow-up analyses, I investigate the learned audiovisual clusters and their relationship to auditory gestures and phones, finding that the clusters capture a unit of speech smaller than the phoneme. This work demonstrates the benefit of visual information to a computational model of phonetic category learning, suggesting that children may benefit substantively from using visual cues while learning phonetic categories.
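
The record describes the approach only at a high level (an unsupervised clustering model over audio versus audiovisual input, scored on a phoneme discrimination battery), so the sketch below is purely illustrative rather than the thesis's method. It assumes k-means over concatenated per-frame audio and visual features, synthetic data with latent phone categories standing in for the corpus, and a toy ABX-style discrimination check in which a test frame should sit closer to a same-phone frame than to a different-phone frame; none of these choices are taken from the thesis itself.

```python
# Illustrative sketch only -- the thesis's actual model, features, and
# evaluation battery are not specified in this record. Assumptions:
# k-means clustering, synthetic per-frame features with latent phone
# categories, and a toy ABX-style discrimination test.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in data: each frame belongs to one of n_phones latent
# categories, reflected in both audio (13-dim) and visual (8-dim) features.
n_phones, n_frames = 10, 3000
audio_means = rng.normal(scale=2.0, size=(n_phones, 13))
visual_means = rng.normal(scale=2.0, size=(n_phones, 8))
labels = rng.integers(n_phones, size=n_frames)
audio_feats = audio_means[labels] + rng.normal(size=(n_frames, 13))
visual_feats = visual_means[labels] + rng.normal(size=(n_frames, 8))
av_feats = np.concatenate([audio_feats, visual_feats], axis=1)

# Audio-only learner vs. audiovisual learner: same clustering procedure,
# different input features.
audio_model = KMeans(n_clusters=n_phones, n_init=10, random_state=0).fit(audio_feats)
av_model = KMeans(n_clusters=n_phones, n_init=10, random_state=0).fit(av_feats)


def abx_accuracy(model, feats, labels, n_trials=2000):
    """Toy ABX-style discrimination: X should be closer to A (same phone)
    than to B (different phone) in the model's cluster-distance space."""
    reps = model.transform(feats)  # distances to the learned cluster centers
    correct = 0
    for _ in range(n_trials):
        a = rng.integers(n_frames)
        same = np.flatnonzero(labels == labels[a])
        same = same[same != a]
        diff = np.flatnonzero(labels != labels[a])
        x, b = rng.choice(same), rng.choice(diff)
        correct += np.linalg.norm(reps[x] - reps[a]) < np.linalg.norm(reps[x] - reps[b])
    return correct / n_trials


print("audio-only ABX accuracy:  ", abx_accuracy(audio_model, audio_feats, labels))
print("audiovisual ABX accuracy: ", abx_accuracy(av_model, av_feats, labels))
```

Because the latent categories here are reflected in both modalities, the concatenated features typically score higher on the toy ABX check, which mirrors the qualitative pattern the abstract reports; the thesis's actual figures (9.1% and 4.7% over baseline) come from its real audiovisual corpus and evaluation, not from anything like this toy setup.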

Bibliographic Details
Main Author: Zhi, Sophia
Other Authors: Levy, Roger
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Rights: In Copyright - Educational Use Permitted; copyright retained by author(s) (https://rightsstatements.org/page/InC-EDU/1.0/)
Online Access: https://hdl.handle.net/1721.1/151659