Foley Music: Learning to Generate Music from Videos

In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip of people playing musical instruments. We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem and present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI events can then be converted to realistic music using an off-the-shelf music synthesizer. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, enabling flexible music editing. We encourage readers to watch the supplementary video with audio turned on to experience the results.
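The motion-to-MIDI formulation depends on representing audio as a discrete sequence of MIDI events that a decoder can predict one token at a time. As a rough illustration only (this is not the authors' code; the event names and the `notes_to_events` helper are hypothetical), a minimal sketch of flattening notes into such an event stream, assuming a Performance-style vocabulary of NOTE_ON, NOTE_OFF, and capped TIME_SHIFT tokens:

```python
def notes_to_events(notes, max_shift=100):
    """Flatten notes into a causal MIDI event stream.

    notes: list of (pitch, onset_tick, offset_tick) tuples.
    Returns a list of ('NOTE_ON', pitch), ('NOTE_OFF', pitch),
    and ('TIME_SHIFT', ticks) events in temporal order.
    """
    # Collect every note boundary as a timed on/off event.
    boundaries = []
    for pitch, on, off in notes:
        boundaries.append((on, 'NOTE_ON', pitch))
        boundaries.append((off, 'NOTE_OFF', pitch))
    boundaries.sort()  # order by time so the stream is causal

    events, clock = [], 0
    for t, kind, pitch in boundaries:
        # Advance the clock with TIME_SHIFT tokens, capped at max_shift,
        # so the vocabulary stays finite.
        gap = t - clock
        while gap > 0:
            step = min(gap, max_shift)
            events.append(('TIME_SHIFT', step))
            gap -= step
        clock = t
        events.append((kind, pitch))
    return events

# Two overlapping notes: C4 (60) over ticks 0-100, E4 (64) over ticks 50-150.
events = notes_to_events([(60, 0, 100), (64, 50, 150)])
# → [('NOTE_ON', 60), ('TIME_SHIFT', 50), ('NOTE_ON', 64),
#    ('TIME_SHIFT', 50), ('NOTE_OFF', 60), ('TIME_SHIFT', 50),
#    ('NOTE_OFF', 64)]
```

A sequence model trained on such streams predicts the next event given the past events (and, in this paper's setting, the body-keypoint features); the predicted stream is then rendered to audio by a standard MIDI synthesizer.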


Bibliographic Details
Main Authors: Gan, Chuang, Huang, Deng, Chen, Peihao, Tenenbaum, Joshua B, Torralba, Antonio
Other Authors: MIT-IBM Watson AI Lab
Format: Conference paper
Language: English
Published: Springer International Publishing, 2021
Online Access: https://hdl.handle.net/1721.1/130350
Departments: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Funding: ONR MURI (N00014-16-1-2007)
Type: Conference paper (http://purl.org/eprint/type/ConferencePaper)
Series: Lecture Notes in Computer Science
ISBN: 9783030586201; 9783030586218
ISSN: 0302-9743; 1611-3349
DOI: http://dx.doi.org/10.1007/978-3-030-58621-8_44
Citation: Gan, Chuang, et al. "Foley Music: Learning to Generate Music from Videos." ECCV: European Conference on Computer Vision, Lecture Notes in Computer Science, vol. 12356, Springer International Publishing, 2020, pp. 758-775.
Rights: © 2020 Springer Nature Switzerland AG; Creative Commons Attribution-NonCommercial-ShareAlike (http://creativecommons.org/licenses/by-nc-sa/4.0/)
Source: arXiv (application/pdf)