Spoken ObjectNet: Creating a Bias-Controlled Spoken Caption Dataset


Bibliographic Details
Main Author: Palmer, Ian A.
Other Authors: Glass, James R.
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/139030
Description: Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a large-scale image dataset that features controls for biases encoded into many other common image datasets. We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. We also present an analysis of the vocabulary of our collected captions. Lastly, we show baseline results on several audio-visual machine learning tasks, including retrieval and machine captioning. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in other datasets that the models have learned. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting. We intend to make our dataset openly available to the general public to encourage new lines of work in training models that are better equipped to operate in the real world.
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Thesis Date: 2021-06
Rights: In Copyright - Educational Use Permitted. Copyright MIT (http://rightsstatements.org/page/InC-EDU/1.0/)