Text-Free Audio Captions of Short Videos from Latent Space Representation

In this thesis, we re-implement prior work on image-to-speech captioning and extend it to video-to-speech captioning. Specifically, we implement a text-free image-to-speech captioning pipeline that integrates four distinct machine learning models, adapt those models to process video data rather than image data, and analyze the resulting speech captions. We also conduct experiments on the Wav2Vec2 and HuBERT Automatic Speech Recognition (ASR) models and identify which works best with synthesized speech.
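
The record does not include the thesis code. As a rough illustration only, the sketch below shows one way the Wav2Vec2 vs. HuBERT comparison on synthesized speech could be run using the Hugging Face transformers ASR pipeline; the checkpoint names and input file are illustrative assumptions, not details taken from the thesis.

    # Hypothetical sketch: transcribe one synthesized speech clip with both
    # Wav2Vec2 and HuBERT ASR checkpoints and print the transcripts.
    # Checkpoints and file path are assumptions for illustration.
    from transformers import pipeline

    CHECKPOINTS = {
        "wav2vec2": "facebook/wav2vec2-base-960h",
        "hubert": "facebook/hubert-large-ls960-ft",
    }

    def transcribe_all(audio_path: str) -> dict:
        """Run each ASR checkpoint on one audio file and collect transcripts."""
        results = {}
        for name, checkpoint in CHECKPOINTS.items():
            asr = pipeline("automatic-speech-recognition", model=checkpoint)
            results[name] = asr(audio_path)["text"]
        return results

    if __name__ == "__main__":
        # "synthesized_caption.wav" is a placeholder for a generated speech caption.
        for model_name, text in transcribe_all("synthesized_caption.wav").items():
            print(f"{model_name}: {text}")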

Bibliographic Details
Main Author: Agarwal, Anisha
Other Authors: Oliva, Aude
Format: Thesis (M.Eng.)
Published: Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2022
Online Access: https://hdl.handle.net/1721.1/144873