Few-shot short utterance speaker verification using meta-learning

Short-utterance speaker verification (SV) in practical applications is the task of accepting or rejecting a speaker's identity claim based on a few enrollment utterances. Traditional methods use deep neural networks to extract speaker representations for verification. Recently, several me...


Bibliographic Details
Main Authors:	Weijie Wang, Hong Zhao, Yikun Yang, YouKang Chang, Haojie You
Format:	Article
Language:	English
Published:	PeerJ Inc. 2023-04-01
Series:	PeerJ Computer Science
Subjects:	Speaker verification Meta-learning Support set Prototypical network Global classification Episodic training strategy
Online Access:	https://peerj.com/articles/cs-1276.pdf

collection	DOAJ
description	Short-utterance speaker verification (SV) in practical applications is the task of accepting or rejecting a speaker's identity claim based on a few enrollment utterances. Traditional methods use deep neural networks to extract speaker representations for verification. Recently, several meta-learning approaches have learned a deep distance metric to distinguish speakers within meta-tasks. Among them, a prototypical network learns a metric space in which the distance to each speaker's prototype center can be computed in order to classify speaker identity. We use emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN) to implement the embedding function required by the prototypical network, which is a nonlinear mapping from the input space to the metric space for the few-shot SV task. In addition, optimizing only for the speakers in given meta-tasks is not sufficient to learn distinctive speaker features. Thus, we use an episodic training strategy in which the classes of the support and query sets correspond to the classes of the entire training set, further improving model performance. The proposed model outperforms comparison models on the VoxCeleb1 dataset and has a wide range of practical applications.
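The prototype-distance idea in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D vectors stand in for ECAPA-TDNN embeddings, the speaker labels are made up, and the helper names (`mean_vector`, `squared_euclidean`, `prototype_posteriors`) are hypothetical. The choice of squared Euclidean distance is also an assumption, since the record does not state which metric the paper uses.

```python
import math


def mean_vector(vectors):
    """Average equal-length embedding vectors to form a speaker prototype."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_euclidean(a, b):
    """Squared Euclidean distance (assumed metric, not confirmed by the record)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototype_posteriors(query, prototypes):
    """Softmax over negative distances from the query to every prototype."""
    logits = {spk: -squared_euclidean(query, p) for spk, p in prototypes.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {spk: math.exp(value - top) for spk, value in logits.items()}
    total = sum(exps.values())
    return {spk: e / total for spk, e in exps.items()}


# Toy support set: two hypothetical speakers, two enrollment embeddings each.
# In the paper these embeddings would be produced by ECAPA-TDNN.
support = {
    "spk_a": [[1.0, 0.0], [0.8, 0.2]],
    "spk_b": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = {spk: mean_vector(vectors) for spk, vectors in support.items()}

# A query embedding is assigned to the speaker with the nearest prototype.
query = [0.9, 0.1]
posteriors = prototype_posteriors(query, prototypes)
predicted = max(posteriors, key=posteriors.get)
```

For verification rather than classification, the same distance could be compared against a tuned threshold to accept or reject an identity claim; that threshold would have to be chosen on held-out data.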
id	doaj.art-e8c5426d5a28405c89ed4ffda40c8d30
institution	Directory Open Access Journal
issn	2376-5992
spelling	doaj.art-e8c5426d5a28405c89ed4ffda40c8d30 | 2023-04-23T15:05:06Z | eng | PeerJ Inc. | PeerJ Computer Science | ISSN 2376-5992 | 2023-04-01 | vol. 9, e1276 | DOI 10.7717/peerj-cs.1276 | Few-shot short utterance speaker verification using meta-learning | Weijie Wang (School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China); Hong Zhao (School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China); Yikun Yang (School of Information Science & Engineering, Lanzhou University, Lanzhou, China); YouKang Chang (School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China); Haojie You (School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China)


Similar Items