Attention-Based Temporal-Frequency Aggregation for Speaker Verification
Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is a key component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most existing aggregation methods pool the extracted features only across time and cannot capture the speaker-dependent information contained in the frequency domain. To address this problem, this paper proposes a novel attention-based frequency aggregation method that focuses on the key frequency bands contributing most to the utterance-level representation. In addition, two more effective temporal-frequency aggregation methods are proposed by combining it with existing temporal aggregation methods. The two proposed methods capture the speaker-dependent information in both the time and frequency domains of the frame-level features, thus improving the discriminability of the speaker embedding. Finally, a powerful CNN-based SV system is developed and evaluated on the TIMIT and VoxCeleb datasets. The experimental results show that the CNN-based SV system with the temporal-frequency aggregation method achieves an equal error rate of 5.96% on VoxCeleb, outperforming the state-of-the-art baseline models.
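To make the idea concrete, below is a minimal sketch (in PyTorch) of what attention-based frequency aggregation can look like; it is an illustration based on the abstract, not the authors' implementation. The module name `FrequencyAttentionPooling`, the scoring MLP, and all hyperparameters are assumptions, and the time axis is collapsed here by plain averaging for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyAttentionPooling(nn.Module):
    """Softmax-attention pooling over the frequency axis of a CNN feature map.

    Illustrative sketch only; layer sizes and structure are assumptions,
    not taken from the paper.
    """

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        # Hypothetical scoring MLP that assigns a relevance score to each band.
        self.score = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) frame-level features from the CNN frontend.
        x = x.mean(dim=3)                    # average over time -> (B, C, F)
        x = x.transpose(1, 2)                # (B, F, C): one descriptor per frequency band
        w = F.softmax(self.score(x), dim=1)  # (B, F, 1): attention weight per band
        return (w * x).sum(dim=1)            # (B, C): utterance-level embedding

# Toy usage: 4 utterances, 256 channels, 8 frequency bands, 100 frames.
feats = torch.randn(4, 256, 8, 100)
embedding = FrequencyAttentionPooling(channels=256)(feats)
print(embedding.shape)  # torch.Size([4, 256])
```

In the paper's temporal-frequency variants, the plain time average above would presumably be replaced by an attentive or statistics-based temporal pooling, so that both axes are weighted rather than only the frequency axis.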
| Main Authors: | Meng Wang, Dazheng Feng, Tingting Su, Mohan Chen (National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China) |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2022-03-01 |
| Series: | Sensors, Volume 22, Issue 6, Article 2147 |
| ISSN: | 1424-8220 |
| DOI: | 10.3390/s22062147 |
| Subjects: | convolutional neural networks; speaker verification; temporal-frequency aggregation; self-attention |
| Online Access: | https://www.mdpi.com/1424-8220/22/6/2147 |