Text-Free Audio Captions of Short Videos from Latent Space Representation
In this thesis, we re-implement previous work exploring image-to-speech captioning and expand upon it to implement video-to-speech captioning. Specifically, we implement a text-free image-to-speech captioning pipeline that integrates four distinct machine learning models. We alter the models to process video data rather than image data and analyze the resulting speech captions. We conduct experiments on the Wav2Vec2 and HuBERT Automatic Speech Recognition models, and identify which works best with synthesized speech.
Main Author: | Agarwal, Anisha |
---|---|
Other Authors: | Oliva, Aude |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/144873 |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Degree: | M.Eng. |
Date Issued: | 2022-05 |
Rights: | In Copyright - Educational Use Permitted; Copyright MIT (http://rightsstatements.org/page/InC-EDU/1.0/) |