Text-Free Audio Captions of Short Videos from Latent Space Representation
In this thesis, we re-implement previous work exploring image-to-speech captioning and expand upon it to implement video-to-speech captioning. Specifically, we implement a text-free image-to-speech captioning pipeline that integrates four distinct machine learning models. We alter the models to process video data rather than image data and analyze the resulting speech captions. We conduct experiments on the Wav2Vec2 and HuBERT Automatic Speech Recognition models, and identify which works best with synthesized speech.
Main Author: | Agarwal, Anisha |
---|---|
Other Authors: | Oliva, Aude |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/144873 |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Degree: | M.Eng. |
Date Issued: | 2022-05 |
Rights: | In Copyright - Educational Use Permitted; Copyright MIT (http://rightsstatements.org/page/InC-EDU/1.0/) |