Spoken ObjectNet: Creating a Bias-Controlled Spoken Caption Dataset
Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a large-scale image dataset that features controls for biases encoded into many other common image datasets.

We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. We also present an analysis of the vocabulary of our collected captions. Lastly, we show baseline results on several audio-visual machine learning tasks, including retrieval and machine captioning. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in other datasets that the models have learned. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting. We intend to make our dataset openly available to the general public to encourage new lines of work in training models that are better equipped to operate in the real world.
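The record does not describe how the "automated language model checks" on caption quality were implemented. A common approach is to score each spoken caption's ASR transcript with a pretrained language model and reject disfluent or degenerate submissions whose perplexity is too high. The sketch below is an illustrative assumption only, not the thesis's actual pipeline; the choice of GPT-2 and the threshold value are hypothetical.

```python
# Minimal sketch of a perplexity-based caption quality filter.
# Assumptions (not from the thesis): GPT-2 as the scoring model,
# and a fixed perplexity cutoff chosen on held-out data.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def transcript_perplexity(text: str) -> float:
    """Lower perplexity ~ more fluent, natural-sounding transcript."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

PPL_THRESHOLD = 200.0  # hypothetical cutoff; the thesis does not specify one

def keep_caption(transcript: str) -> bool:
    return transcript_perplexity(transcript) < PPL_THRESHOLD
```

In a collection pipeline, a caption failing such a check would typically be sent back to the annotator for re-recording rather than silently dropped.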
Main Author: | Palmer, Ian A. |
---|---|
Other Authors: | Glass, James R. |
Format: | Thesis |
Degree: | M.Eng. |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/139030 |
Rights: | In Copyright - Educational Use Permitted (http://rightsstatements.org/page/InC-EDU/1.0/) |
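The retrieval baselines mentioned in the abstract are conventionally scored with recall@k: given an audio caption, rank all images by embedding similarity and check whether the matching image appears in the top k. The following is a minimal sketch of that protocol under assumed precomputed, L2-normalized embeddings; the function and synthetic data are illustrative, not the thesis's implementation.

```python
# Standard recall@k evaluation for audio-to-image retrieval.
import numpy as np

def recall_at_k(audio_emb: np.ndarray, image_emb: np.ndarray, k: int) -> float:
    """audio_emb, image_emb: (N, D) arrays; row i of each is a matched pair."""
    sims = audio_emb @ image_emb.T            # (N, N) cosine similarities
    ranks = np.argsort(-sims, axis=1)         # images ranked per caption, best first
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()

# Sanity check with random embeddings: R@10 over 100 items should be ~0.10.
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 512)); a /= np.linalg.norm(a, axis=1, keepdims=True)
v = rng.normal(size=(100, 512)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(f"R@10 (random embeddings): {recall_at_k(a, v, 10):.2f}")
```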