Spoken ObjectNet: Creating a Bias-Controlled Spoken Caption Dataset
Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a large-scale image dataset that features controls for biases encoded into many other common image datasets.

We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. We also present an analysis of the vocabulary of our collected captions. Lastly, we show baseline results on several audio-visual machine learning tasks, including retrieval and machine captioning. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in other datasets that the models have learned. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting. We intend to make our dataset openly available to the general public to encourage new lines of work in training models that are better equipped to operate in the real world.
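The record does not describe how the "automated language model checks" on caption quality were implemented. A common approach is to score each spoken caption's ASR transcript with a pretrained language model and reject disfluent or degenerate submissions whose perplexity is too high. The sketch below is an illustrative assumption only, not the thesis's actual pipeline; the choice of GPT-2 and the threshold value are hypothetical.

```python
# Minimal sketch of a perplexity-based caption quality filter.
# Assumptions (not from the thesis): GPT-2 as the scoring model,
# and a fixed perplexity cutoff chosen on held-out data.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def transcript_perplexity(text: str) -> float:
    """Lower perplexity ~ more fluent, natural-sounding transcript."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

PPL_THRESHOLD = 200.0  # hypothetical cutoff; the thesis does not specify one

def keep_caption(transcript: str) -> bool:
    return transcript_perplexity(transcript) < PPL_THRESHOLD
```

In a collection pipeline, a caption failing such a check would typically be sent back to the annotator for re-recording rather than silently dropped.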
Main Author: | Palmer, Ian A. |
---|---|
Other Authors: | Glass, James R. |
Format: | Thesis |
Degree: | M.Eng. |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/139030 |
Rights: | In Copyright - Educational Use Permitted (http://rightsstatements.org/page/InC-EDU/1.0/) |
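The retrieval baselines mentioned in the abstract are conventionally scored with recall@k: given an audio caption, rank all images by embedding similarity and check whether the matching image appears in the top k. The following is a minimal sketch of that protocol under assumed precomputed, L2-normalized embeddings; the function and synthetic data are illustrative, not the thesis's implementation.

```python
# Standard recall@k evaluation for audio-to-image retrieval.
import numpy as np

def recall_at_k(audio_emb: np.ndarray, image_emb: np.ndarray, k: int) -> float:
    """audio_emb, image_emb: (N, D) arrays; row i of each is a matched pair."""
    sims = audio_emb @ image_emb.T            # (N, N) cosine similarities
    ranks = np.argsort(-sims, axis=1)         # images ranked per caption, best first
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()

# Sanity check with random embeddings: R@10 over 100 items should be ~0.10.
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 512)); a /= np.linalg.norm(a, axis=1, keepdims=True)
v = rng.normal(size=(100, 512)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(f"R@10 (random embeddings): {recall_at_k(a, v, 10):.2f}")
```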