Variational autoencoder for prosody-based speaker recognition

This paper describes a novel end-to-end deep generative model-based speaker recognition system using prosodic features. The usefulness of variational autoencoders (VAE) in learning the speaker-specific prosody representations for the speaker recognition task is examined herein for the first time. Th...


Bibliographic Details
Main Authors: Starlet Ben Alex, Leena Mary
Format: Article
Language: English
Published: Electronics and Telecommunications Research Institute (ETRI) 2023-08-01
Series: ETRI Journal
Subjects: deep neural networks, prosodic features, speaker recognition, syllables, VAE
Online Access: https://doi.org/10.4218/etrij.2021-0377
author Starlet Ben Alex
Leena Mary
collection DOAJ
description This paper describes a novel end-to-end deep generative model-based speaker recognition system using prosodic features. The usefulness of variational autoencoders (VAE) in learning the speaker-specific prosody representations for the speaker recognition task is examined herein for the first time. The speech signal is first automatically segmented into syllable-like units using vowel onset points (VOP) and energy valleys. Prosodic features, such as the dynamics of duration, energy, and fundamental frequency (F0), are then extracted at the syllable level and used to train/adapt a speaker-dependent VAE from a universal VAE. The initial comparative studies on VAEs and traditional autoencoders (AE) suggest that the former can efficiently learn speaker representations. Investigations on the impact of gender information in speaker recognition also point out that gender-dependent impostor banks lead to higher accuracies. Finally, the evaluation on the NIST SRE 2010 dataset demonstrates the usefulness of the proposed approach for speaker recognition.
first_indexed 2024-03-12T02:33:36Z
format Article
id doaj.art-385e84b5cef94c2a97de5507e46ef1f5
institution Directory Open Access Journal
issn 1225-6463
language English
last_indexed 2024-03-12T02:33:36Z
publishDate 2023-08-01
publisher Electronics and Telecommunications Research Institute (ETRI)
record_format Article
series ETRI Journal
spelling doaj.art-385e84b5cef94c2a97de5507e46ef1f5 | 2023-09-05T01:46:15Z | eng | Electronics and Telecommunications Research Institute (ETRI) | ETRI Journal | 1225-6463 | 2023-08-01 | 45 | 4 | 678 | 689 | 10.4218/etrij.2021-0377 | Variational autoencoder for prosody-based speaker recognition | Starlet Ben Alex; Leena Mary | https://doi.org/10.4218/etrij.2021-0377 | deep neural networks; prosodic features; speaker recognition; syllables; vae
title Variational autoencoder for prosody-based speaker recognition
topic deep neural networks
prosodic features
speaker recognition
syllables
vae
url https://doi.org/10.4218/etrij.2021-0377
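The description above says that syllable-level prosodic features are used to train a VAE. As a rough illustration only, the sketch below shows the standard VAE training objective (negative evidence lower bound) and the reparameterization step in NumPy. It is not the authors' implementation: the Gaussian encoder, the standard normal prior, the squared-error reconstruction term, and all function names are assumptions chosen for illustration, and the record does not specify the network architecture or the adaptation procedure.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def negative_elbo(x, x_recon, mu, log_var):
    """Squared-error reconstruction term plus KL term, averaged over the batch.

    Squared error is an assumed reconstruction loss; the paper may use another.
    """
    recon = np.sum((x - x_recon) ** 2, axis=1)
    return float(np.mean(recon + kl_to_standard_normal(mu, log_var)))

# Hypothetical batch: 5 syllables, 3 prosodic features each
# (for example duration, energy, and F0 dynamics), 2-dimensional latent space.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
mu = np.zeros((5, 2))        # stand-in for encoder mean output
log_var = np.zeros((5, 2))   # stand-in for encoder log-variance output
z = reparameterize(mu, log_var, rng)
x_recon = x.copy()           # stand-in for decoder output
loss = negative_elbo(x, x_recon, mu, log_var)
```

In a real system the stand-in arrays would come from encoder and decoder networks, and the loss would be minimized with a gradient-based optimizer.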