Robust Feature Extraction Using Temporal Context Averaging for Speaker Identification in Diverse Acoustic Environments
Speaker identification in challenging acoustic environments, influenced by noise, reverberation, and emotional fluctuations, requires improved feature extraction techniques. Although existing methods effectively extract distinct acoustic features, they show limitations in these adverse settings. To overcome these limitations, we propose the Temporal Context-Enhanced Features (TCEF) approach, which provides a consistent audio representation for better performance under various acoustic conditions. TCEF leverages a context window to average features across adjacent frames, effectively reducing short-term variations caused by noise, reverberation, and fluctuations in both emotional and neutral speech. This approach enhances the distinctive characteristics of a speaker's voice, improving speaker identification in both challenging and neutral acoustic environments. To evaluate the performance of TCEF against conventional features, a One-Dimensional Convolutional Neural Network (1D-CNN) was used for detailed frame-level analysis and a Long Short-Term Memory (LSTM) network for comprehensive sequence-level analysis. We used four datasets to assess the effectiveness of the TCEF approach. The GRID and RAVDESS datasets represent neutral and emotional speech, respectively. To test the robustness of our system under adverse acoustic conditions, we created two additional datasets, GRID-NR and RAVDESS-NR: modified versions of the original GRID and RAVDESS incorporating added noise and reverberation. Performance evaluation results showed that TCEF significantly outperformed existing feature extraction methods in identifying speakers in diverse acoustic environments.
Main Authors: | Yassin Terraf, Youssef Iraqi |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Access |
Subjects: | Speaker identification; feature extraction; challenging acoustic environments; temporal context-enhanced features; convolutional neural networks; long short-term memory |
Online Access: | https://ieeexplore.ieee.org/document/10410836/ |
---|---|
author | Yassin Terraf Youssef Iraqi |
collection | DOAJ |
description | Speaker identification in challenging acoustic environments, influenced by noise, reverberation, and emotional fluctuations, requires improved feature extraction techniques. Although existing methods effectively extract distinct acoustic features, they show limitations in these adverse settings. To overcome these limitations, we propose the Temporal Context-Enhanced Features (TCEF) approach, which provides a consistent audio representation for better performance under various acoustic conditions. TCEF leverages a context window to average features across adjacent frames, effectively reducing short-term variations caused by noise, reverberation, and fluctuations in both emotional and neutral speech. This approach enhances the distinctive characteristics of a speaker's voice, improving speaker identification in both challenging and neutral acoustic environments. To evaluate the performance of TCEF against conventional features, a One-Dimensional Convolutional Neural Network (1D-CNN) was used for detailed frame-level analysis and a Long Short-Term Memory (LSTM) network for comprehensive sequence-level analysis. We used four datasets to assess the effectiveness of the TCEF approach. The GRID and RAVDESS datasets represent neutral and emotional speech, respectively. To test the robustness of our system under adverse acoustic conditions, we created two additional datasets, GRID-NR and RAVDESS-NR: modified versions of the original GRID and RAVDESS incorporating added noise and reverberation. Performance evaluation results showed that TCEF significantly outperformed existing feature extraction methods in identifying speakers in diverse acoustic environments. |
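The description above outlines the core TCEF operation: averaging frame-level features over a context window of adjacent frames to smooth out short-term variations. The paper's exact formulation is not reproduced in this record, so the following is only a minimal sketch of that idea; the window half-width `context` and the use of MFCC-like inputs are illustrative assumptions, not details from the article.

```python
import numpy as np

def temporal_context_average(features, context=5):
    """Smooth frame-level features by averaging each frame with its
    neighbors within a +/- `context` window (a sketch of the
    context-window averaging described in the abstract; the window
    size is an assumed parameter). `features` is (n_frames, n_dims)."""
    n_frames, _ = features.shape
    out = np.empty_like(features, dtype=float)
    for t in range(n_frames):
        lo = max(0, t - context)           # window is truncated at the edges
        hi = min(n_frames, t + context + 1)
        out[t] = features[lo:hi].mean(axis=0)
    return out

# Illustrative use: smooth 100 frames of 13-dimensional features
# (e.g. MFCCs) before feeding them to a frame-level classifier.
frames = np.random.randn(100, 13)
smoothed = temporal_context_average(frames, context=5)
```

Averaging over neighboring frames acts as a low-pass filter along the time axis, which is consistent with the abstract's claim that it suppresses short-term variations from noise and reverberation while keeping slower-varying speaker characteristics.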
first_indexed | 2024-03-08T09:31:50Z |
format | Article |
id | doaj.art-5f6d67e3757e41a3ab47285af60a2905 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-08T09:31:50Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-5f6d67e3757e41a3ab47285af60a2905 (indexed 2024-01-31T00:01:18Z). Yassin Terraf (ORCID: 0009-0004-4026-5887) and Youssef Iraqi (ORCID: 0000-0003-0112-2600), College of Computing, University Mohammed VI Polytechnic, Ben Guerir, Morocco. "Robust Feature Extraction Using Temporal Context Averaging for Speaker Identification in Diverse Acoustic Environments," IEEE Access, vol. 12, pp. 14094-14115, 2024-01-01. ISSN 2169-3536. DOI: 10.1109/ACCESS.2024.3356730. Article 10410836. https://ieeexplore.ieee.org/document/10410836/ |
title | Robust Feature Extraction Using Temporal Context Averaging for Speaker Identification in Diverse Acoustic Environments |
topic | Speaker identification feature extraction challenging acoustic environments temporal context-enhanced features convolutional neural networks long short-term memory |
url | https://ieeexplore.ieee.org/document/10410836/ |