Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System

One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut-connection-based multi-layer aggregation improves the representational power of a speaker embedding system. However, the model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we set a ResNet with scaled channel width and layer depth as the baseline. To control the variability in training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularization and batch normalization. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully connected layers and nonlinear activation functions. Further, deep length normalization is applied to the recalibrated feature during training. Experimental results on the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rates of 4.95% and 2.86% using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).
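
The abstract compresses the whole embedding pipeline into a few sentences, so a minimal PyTorch sketch of one plausible reading of its three components follows: self-attentive pooling over aggregated multi-layer features, a squeeze-and-excitation-style feature recalibration layer built from fully connected layers and nonlinear activations, and deep length normalization with a fixed rescaling constant. All module names, layer sizes, the dropout rate, and the scale value are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the embedding head described in the abstract; hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentivePooling(nn.Module):
    """Collapse a (batch, time, dim) feature sequence into one utterance-level
    vector using learned attention weights over the time axis."""

    def __init__(self, dim: int, att_dim: int = 128):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(dim, att_dim),
            nn.Tanh(),
            nn.Linear(att_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.att(x), dim=1)      # (batch, time, 1)
        return torch.sum(w * x, dim=1)             # (batch, dim)


class FeatureRecalibration(nn.Module):
    """Squeeze-and-excitation-style recalibration: two fully connected layers
    with nonlinear activations produce per-dimension weights for the
    aggregated feature."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)


class DeepLengthNorm(nn.Module):
    """L2-normalize the recalibrated embedding inside the network and rescale
    it with a constant so it stays trainable under a softmax-based loss."""

    def __init__(self, scale: float = 12.0):
        super().__init__()
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * F.normalize(x, p=2, dim=-1)


class EmbeddingHead(nn.Module):
    """Aggregate frame-level features tapped from several ResNet stages
    (already projected to a common dimension), then recalibrate and
    length-normalize the resulting utterance-level embedding."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.dropout = nn.Dropout(0.2)
        self.pool = SelfAttentivePooling(dim)
        self.bn = nn.BatchNorm1d(dim)
        self.recalib = FeatureRecalibration(dim)
        self.norm = DeepLengthNorm()

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: list of (batch, time, dim) tensors, one per tapped layer
        x = torch.cat(layer_feats, dim=1)          # aggregate along the time axis
        x = self.pool(self.dropout(x))             # self-attentive aggregation
        x = self.bn(x)
        return self.norm(self.recalib(x))


if __name__ == "__main__":
    head = EmbeddingHead(dim=256)
    feats = [torch.randn(8, 50, 256) for _ in range(4)]
    print(head(feats).shape)                       # torch.Size([8, 256])
```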

Bibliographic Details
Main Authors: Soonshin Seo, Ji-Hwan Kim (Department of Computer Science and Engineering, Sogang University, Seoul 04107, Korea)
Format: Article
Language: English
Published: MDPI AG 2020-10-01
Series: Electronics
ISSN: 2079-9292
DOI: 10.3390/electronics9101706
Subjects: text-independent speaker verification system; self-attentive pooling; multi-layer aggregation; feature recalibration; deep length normalization; speaker embedding
Online Access: https://www.mdpi.com/2079-9292/9/10/1706