Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System

One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut-connection-based multi-layer aggregation improves the representational power of a speaker embedding system. However, the model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we set a ResNet with scaled channel width and layer depth as the baseline. To control the variability in training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularization and batch normalization. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully connected layers and nonlinear activation functions. Further, deep length normalization is applied to the recalibrated feature during training. Experimental results on the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rates of 4.95% and 2.86% using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).
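
The abstract compresses the whole embedding pipeline into a few sentences, so a minimal PyTorch sketch of one plausible reading of its three components follows: self-attentive pooling over aggregated multi-layer features, a squeeze-and-excitation-style feature recalibration layer built from fully connected layers and nonlinear activations, and deep length normalization with a fixed rescaling constant. All module names, layer sizes, the dropout rate, and the scale value are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the embedding head described in the abstract; hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentivePooling(nn.Module):
    """Collapse a (batch, time, dim) feature sequence into one utterance-level
    vector using learned attention weights over the time axis."""

    def __init__(self, dim: int, att_dim: int = 128):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(dim, att_dim),
            nn.Tanh(),
            nn.Linear(att_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.att(x), dim=1)      # (batch, time, 1)
        return torch.sum(w * x, dim=1)             # (batch, dim)


class FeatureRecalibration(nn.Module):
    """Squeeze-and-excitation-style recalibration: two fully connected layers
    with nonlinear activations produce per-dimension weights for the
    aggregated feature."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)


class DeepLengthNorm(nn.Module):
    """L2-normalize the recalibrated embedding inside the network and rescale
    it with a constant so it stays trainable under a softmax-based loss."""

    def __init__(self, scale: float = 12.0):
        super().__init__()
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * F.normalize(x, p=2, dim=-1)


class EmbeddingHead(nn.Module):
    """Aggregate frame-level features tapped from several ResNet stages
    (already projected to a common dimension), then recalibrate and
    length-normalize the resulting utterance-level embedding."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.dropout = nn.Dropout(0.2)
        self.pool = SelfAttentivePooling(dim)
        self.bn = nn.BatchNorm1d(dim)
        self.recalib = FeatureRecalibration(dim)
        self.norm = DeepLengthNorm()

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: list of (batch, time, dim) tensors, one per tapped layer
        x = torch.cat(layer_feats, dim=1)          # aggregate along the time axis
        x = self.pool(self.dropout(x))             # self-attentive aggregation
        x = self.bn(x)
        return self.norm(self.recalib(x))


if __name__ == "__main__":
    head = EmbeddingHead(dim=256)
    feats = [torch.randn(8, 50, 256) for _ in range(4)]
    print(head(feats).shape)                       # torch.Size([8, 256])
```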

Bibliographic Details
Main Authors: Soonshin Seo, Ji-Hwan Kim (Department of Computer Science and Engineering, Sogang University, Seoul 04107, Korea)
Format: Article
Language: English
Published: MDPI AG 2020-10-01
Series: Electronics
ISSN: 2079-9292
DOI: 10.3390/electronics9101706
Subjects: text-independent speaker verification system; self-attentive pooling; multi-layer aggregation; feature recalibration; deep length normalization; speaker embedding
Online Access: https://www.mdpi.com/2079-9292/9/10/1706