Foley Music: Learning to Generate Music from Videos
In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip of people playing musical instruments. We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI events can then be converted to realistic music using an off-the-shelf music synthesizer tool.
Main Authors: | Gan, Chuang; Huang, Deng; Chen, Peihao; Tenenbaum, Joshua B; Torralba, Antonio |
---|---|
Other Authors: | MIT-IBM Watson AI Lab |
Format: | Book |
Language: | English |
Published: | Springer International Publishing, 2021 |
Online Access: | https://hdl.handle.net/1721.1/130350 |
_version_ | 1811078488654872576 |
---|---|
author | Gan, Chuang Huang, Deng Chen, Peihao Tenenbaum, Joshua B Torralba, Antonio |
author2 | MIT-IBM Watson AI Lab |
author_facet | MIT-IBM Watson AI Lab Gan, Chuang Huang, Deng Chen, Peihao Tenenbaum, Joshua B Torralba, Antonio |
author_sort | Gan, Chuang |
collection | MIT |
description | In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip of people playing musical instruments. We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI events can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage the readers to watch the supplementary video with audio turned on to experience the results. |
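The abstract describes a pipeline in which the model predicts a sequence of MIDI events, which are then rendered to audio. To make the intermediate representation concrete, here is a minimal sketch of decoding such an event sequence into timed notes. This follows a common MIDI-event vocabulary (note-on, note-off, time-shift), not necessarily the paper's exact encoding; the function name and tick resolution are illustrative assumptions.

```python
# Sketch: decode a sequence of (event_type, value) MIDI-style tokens
# into (pitch, start_time, end_time) notes. The event vocabulary here
# (note_on / note_off / time_shift) is an assumption modeled on common
# MIDI-event encodings, not the paper's exact scheme.

def decode_events(events, ticks_per_second=100):
    """Turn (kind, value) events into a list of (pitch, start, end) notes."""
    time = 0.0
    active = {}   # pitch -> start time of the currently sounding note
    notes = []
    for kind, value in events:
        if kind == "time_shift":      # advance the running clock by `value` ticks
            time += value / ticks_per_second
        elif kind == "note_on":       # this pitch starts sounding now
            active[value] = time
        elif kind == "note_off":      # close the note if it is sounding
            if value in active:
                notes.append((value, active.pop(value), time))
    return notes

events = [
    ("note_on", 60), ("time_shift", 50),   # C4 starts, 0.5 s passes
    ("note_on", 64), ("time_shift", 50),   # E4 starts, 0.5 s passes
    ("note_off", 60), ("note_off", 64),
]
print(decode_events(events))
# [(60, 0.0, 1.0), (64, 0.5, 1.0)]
```

Once decoded, such note lists can be written to a standard MIDI file and rendered with any off-the-shelf synthesizer, which is what makes the representation editable and interpretable as the abstract claims.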
first_indexed | 2024-09-23T11:00:45Z |
format | Book |
id | mit-1721.1/130350 |
institution | Massachusetts Institute of Technology |
language | English |
last_indexed | 2024-09-23T11:00:45Z |
publishDate | 2021 |
publisher | Springer International Publishing |
record_format | dspace |
spelling | mit-1721.1/1303502022-09-27T16:31:28Z Foley Music: Learning to Generate Music from Videos Gan, Chuang Huang, Deng Chen, Peihao Tenenbaum, Joshua B Torralba, Antonio MIT-IBM Watson AI Lab Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip of people playing musical instruments. We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI events can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage the readers to watch the supplementary video with audio turned on to experience the results. ONR MURI (N00014-16-1-2007) 2021-04-02T14:22:06Z 2021-04-02T14:22:06Z 2020-11 2021-01-28T15:39:50Z Book http://purl.org/eprint/type/ConferencePaper 9783030586201 9783030586218 0302-9743 1611-3349 https://hdl.handle.net/1721.1/130350 Gan, Chuang et al. "Foley Music: Learning to Generate Music from Videos." ECCV: European Conference on Computer Vision, Lecture Notes in Computer Science, 12356, Springer International Publishing, 2020, 758-775.
© 2020 Springer Nature Switzerland AG en http://dx.doi.org/10.1007/978-3-030-58621-8_44 Lecture Notes in Computer Science Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Springer International Publishing arXiv |
spellingShingle | Gan, Chuang Huang, Deng Chen, Peihao Tenenbaum, Joshua B Torralba, Antonio Foley Music: Learning to Generate Music from Videos |
title | Foley Music: Learning to Generate Music from Videos |
title_full | Foley Music: Learning to Generate Music from Videos |
title_fullStr | Foley Music: Learning to Generate Music from Videos |
title_full_unstemmed | Foley Music: Learning to Generate Music from Videos |
title_short | Foley Music: Learning to Generate Music from Videos |
title_sort | foley music learning to generate music from videos |
url | https://hdl.handle.net/1721.1/130350 |
work_keys_str_mv | AT ganchuang foleymusiclearningtogeneratemusicfromvideos AT huangdeng foleymusiclearningtogeneratemusicfromvideos AT chenpeihao foleymusiclearningtogeneratemusicfromvideos AT tenenbaumjoshuab foleymusiclearningtogeneratemusicfromvideos AT torralbaantonio foleymusiclearningtogeneratemusicfromvideos |