Global Structure-Aware Drum Transcription Based on Self-Attention Mechanisms

This paper describes an automatic drum transcription (ADT) method that directly estimates a tatum-level drum score from a music signal, in contrast to most conventional ADT methods, which estimate the frame-level onset probabilities of drums. To estimate a tatum-level score, we propose a deep transcription model that consists of a frame-level encoder for extracting latent features from a music signal and a tatum-level decoder for estimating a drum score from the latent features pooled at the tatum level. To capture the global repetitive structure of drum scores, which is difficult to learn with a recurrent neural network (RNN), we introduce a self-attention mechanism with tatum-synchronous positional encoding into the decoder. To mitigate the difficulty of training the self-attention-based model from an insufficient amount of paired data, and to improve the musical naturalness of the estimated scores, we propose a regularized training method that uses a global structure-aware masked language (score) model with a self-attention mechanism, pretrained on an extensive collection of drum scores. The experimental results showed that the proposed regularized model outperformed the conventional RNN-based model in terms of the tatum-level error rate and the frame-level F-measure, even when only a limited amount of paired data was available, a setting in which the non-regularized model underperformed the RNN-based model.
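
The abstract outlines a concrete pipeline: frame-level latent features are pooled at the tatum level, combined with a tatum-synchronous positional encoding, and decoded by a self-attention stack into per-tatum drum onsets. The PyTorch sketch below illustrates that pipeline; it is not the authors' code, and the pooling operator, layer sizes, drum vocabulary, and all names are illustrative assumptions.

import math
import torch
import torch.nn as nn


def tatum_positional_encoding(num_tatums: int, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding indexed by tatum position rather than frame
    index, so the encoding aligns with the metrical grid regardless of
    tempo (d_model assumed even)."""
    pos = torch.arange(num_tatums, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(num_tatums, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # shape: (num_tatums, d_model)


def pool_frames_to_tatums(frame_feats, tatum_bounds):
    """Pool frame-level latent features within each tatum interval.
    frame_feats: (num_frames, d_model); tatum_bounds: list of (start, end)
    frame indices. Max-pooling is an assumption; the abstract only says
    the features are 'pooled at the tatum level'."""
    pooled = [frame_feats[s:e].max(dim=0).values for s, e in tatum_bounds]
    return torch.stack(pooled)  # shape: (num_tatums, d_model)


class TatumDecoder(nn.Module):
    """Self-attention decoder mapping pooled tatum features to per-tatum
    onset probabilities for each drum (e.g., bass drum, snare, hi-hat)."""

    def __init__(self, d_model=256, n_heads=4, n_layers=3, n_drums=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_drums)

    def forward(self, tatum_feats):
        # tatum_feats: (batch, num_tatums, d_model), with the tatum
        # positional encoding already added to the pooled features
        return torch.sigmoid(self.head(self.attn(tatum_feats)))

The abstract also describes regularized training with a pretrained masked language (score) model. A minimal sketch of that idea follows, assuming a hypothetical callable mlm_nll that returns the pretrained score model's negative log-likelihood for a predicted score; the weight alpha and this interface are assumptions, not the paper's API.

import torch.nn.functional as F

def regularized_loss(pred_score, target_score, mlm_nll, alpha=0.1):
    """Transcription loss plus a musical-naturalness penalty scored by a
    pretrained masked language model of drum scores (hypothetical API)."""
    transcription = F.binary_cross_entropy(pred_score, target_score)
    return transcription + alpha * mlm_nll(pred_score)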

Bibliographic Details
Main Authors: Ryoto Ishizuka, Ryo Nishikimi, Kazuyoshi Yoshii
Author Affiliation: Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
Format: Article
Language: English
Published: MDPI AG, 2021-08-01
Series: Signals, Vol. 2, No. 3, pp. 508-526
DOI: 10.3390/signals2030031
ISSN: 2624-6120
Subjects: automatic drum transcription; self-attention mechanism; transformer; positional encoding; masked language model
Online Access: https://www.mdpi.com/2624-6120/2/3/31