Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings
Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short, randomized pass-phrases drawn from a constrained vocabulary. The conventional i-vector framework, which has been proven to be a state-of-the-art utterance-level feature extraction technique for speaker verification, is not considered an optimal method for this task since it is known to suffer from severe performance degradation when dealing with short-duration speech utterances. More recent approaches that implement deep-learning techniques for embedding the speaker variability in a non-linear fashion have shown impressive performance in various speaker verification tasks. However, since most of these techniques are trained in a supervised manner, which requires speaker labels for the training data, they are difficult to use when only a small amount of labeled data is available for training. In this paper, we propose a novel technique for extracting an i-vector-like feature based on the variational autoencoder (VAE), which is trained in an unsupervised manner to obtain a latent variable representing the variability within a Gaussian mixture model (GMM) distribution. The proposed framework is compared with the conventional i-vector method using the TIDIGITS dataset. Experimental results showed that the proposed method could cope with the performance deterioration caused by short utterance duration. Furthermore, the performance of the proposed approach improved significantly when applied in conjunction with the conventional i-vector framework.
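The abstract describes the approach only at a high level. As a rough illustration of how a variational autoencoder can yield an utterance-level embedding without speaker labels, the following is a minimal, hypothetical PyTorch sketch. It assumes the encoder input is a fixed-length vector of utterance-level statistics (e.g., GMM Baum-Welch statistics), uses a plain Gaussian reconstruction loss, and takes the posterior mean as the i-vector-like embedding; the layer sizes, input representation, and loss weighting are illustrative assumptions, not the configuration used in the paper.

```python
# Illustrative sketch only: a generic VAE that learns an utterance-level
# embedding without speaker labels. The input representation, layer sizes,
# and reconstruction loss are assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEEmbedder(nn.Module):
    def __init__(self, input_dim=2048, latent_dim=200, hidden_dim=512):
        super().__init__()
        # Encoder: utterance-level statistics -> parameters of a Gaussian
        # posterior over the latent (i-vector-like) variable.
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: latent variable -> reconstruction of the input statistics.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# After unsupervised training, the posterior mean serves as the utterance
# embedding, analogous to an i-vector:
#   model = VAEEmbedder()
#   embedding, _ = model.encode(stats)  # stats: (batch, input_dim) tensor
```

At verification time, such embeddings are typically scored with cosine similarity or PLDA, as is common practice for i-vectors.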
Main Authors: | Woo Hyun Kang, Nam Soo Kim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2019-04-01 |
Series: | Applied Sciences |
Subjects: | speech embedding; deep learning; speaker recognition |
Online Access: | https://www.mdpi.com/2076-3417/9/8/1597 |
author | Woo Hyun Kang; Nam Soo Kim |
collection | DOAJ |
description | Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short, randomized pass-phrases drawn from a constrained vocabulary. The conventional i-vector framework, which has been proven to be a state-of-the-art utterance-level feature extraction technique for speaker verification, is not considered an optimal method for this task since it is known to suffer from severe performance degradation when dealing with short-duration speech utterances. More recent approaches that implement deep-learning techniques for embedding the speaker variability in a non-linear fashion have shown impressive performance in various speaker verification tasks. However, since most of these techniques are trained in a supervised manner, which requires speaker labels for the training data, they are difficult to use when only a small amount of labeled data is available for training. In this paper, we propose a novel technique for extracting an i-vector-like feature based on the variational autoencoder (VAE), which is trained in an unsupervised manner to obtain a latent variable representing the variability within a Gaussian mixture model (GMM) distribution. The proposed framework is compared with the conventional i-vector method using the TIDIGITS dataset. Experimental results showed that the proposed method could cope with the performance deterioration caused by short utterance duration. Furthermore, the performance of the proposed approach improved significantly when applied in conjunction with the conventional i-vector framework. |
format | Article |
id | doaj.art-40c3e9350722437f8c4980bf31e8f41f |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
publishDate | 2019-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
doi | 10.3390/app9081597 |
volume | 9 |
issue | 8 |
article_number | 1597 |
author_affiliations | Woo Hyun Kang and Nam Soo Kim: Department of Electrical and Computer Engineering and the Institute of New Media and Communications, Seoul National University, Seoul 08826, Korea |
title | Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings |
topic | speech embedding; deep learning; speaker recognition |
url | https://www.mdpi.com/2076-3417/9/8/1597 |