Automatic Voice Disorder Detection Using Self-Supervised Representations

Many speech features and models, including Deep Neural Networks (DNN), are used for classification tasks between healthy and pathological speech with the Saarbruecken Voice Database (SVD). However, accuracy values of 80.71% for phrases or 82.8% for vowels /aiu/ are the highest...

Bibliographic Details
Main Authors:	Dayana Ribas, Miguel A. Pastor, Antonio Miguel, David Martinez, Alfonso Ortega, Eduardo Lleida
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Voice disorder, pathological speech, Saarbruecken voice database, advanced voice function assessment database, self-supervised, class token
Online Access:	https://ieeexplore.ieee.org/document/10041907/

_version_	1828013822654283776
author	Dayana Ribas, Miguel A. Pastor, Antonio Miguel, David Martinez, Alfonso Ortega, Eduardo Lleida
collection	DOAJ
description	Many speech features and models, including Deep Neural Networks (DNN), are used to classify healthy versus pathological speech with the Saarbruecken Voice Database (SVD). However, accuracy values of 80.71% for phrases or 82.8% for vowels /aiu/ are the highest reported for audio samples in SVD when the evaluation includes the wide range of pathologies in the database, instead of a selection of some pathologies. This paper targets this top performance in state-of-the-art Automatic Voice Disorder Detection (AVDD) systems. In the framework of a DNN-based AVDD system, we study the capability of Self-Supervised (SS) representation learning to describe discriminative cues between healthy and pathological speech. The system processes the SS temporal sequence of features with a single feed-forward layer and a Class-Token (CT) Transformer to classify speech as healthy or pathological. Furthermore, a suitable extension of the training set with out-of-domain data is evaluated to deal with the low availability of data for DNN-based models in voice pathology detection. Experimental results using audio samples corresponding to phrases in the SVD dataset, including all pathologies available, show classification accuracy values of up to 93.36%. This means that the proposed AVDD system achieved accuracy improvements of 4.1% without the training data extension, and 15.62% after the training data extension, compared to the baseline system. Beyond the novelty of using SS representations for AVDD, obtaining accuracies over 90% in these conditions, using the whole set of pathologies in the SVD, is a milestone for voice disorder-related research. Furthermore, the study of how the amount of in-domain data in the training set relates to system performance provides guidance for the data preparation stage. Lessons learned in this work suggest guidelines for taking advantage of DNNs to boost performance when developing automatic systems for diagnosis, treatment, and monitoring of voice pathologies.
first_indexed	2024-04-10T09:51:58Z
format	Article
id	doaj.art-53616cb4fc3a49809ecae4af0f0a370a
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-04-10T09:51:58Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-53616cb4fc3a49809ecae4af0f0a370a; 2023-02-17T00:00:33Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11, pp. 14915-14927; DOI 10.1109/ACCESS.2023.3243986; document 10041907
	Automatic Voice Disorder Detection Using Self-Supervised Representations
	Dayana Ribas (https://orcid.org/0000-0003-3813-4998), ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain
	Miguel A. Pastor, ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain
	Antonio Miguel (https://orcid.org/0000-0001-5803-4316), ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain
	David Martinez (https://orcid.org/0000-0001-7593-1377), Lumenvox, Munich, Germany
	Alfonso Ortega (https://orcid.org/0000-0002-3886-7748), ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain
	Eduardo Lleida (https://orcid.org/0000-0001-9137-4013), ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain
	https://ieeexplore.ieee.org/document/10041907/
	Voice disorder, pathological speech, Saarbruecken voice database, advanced voice function assessment database, self-supervised, class token
title	Automatic Voice Disorder Detection Using Self-Supervised Representations
topic	Voice disorder, pathological speech, Saarbruecken voice database, advanced voice function assessment database, self-supervised, class token
url	https://ieeexplore.ieee.org/document/10041907/
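
The description above outlines a pipeline in which a sequence of self-supervised (SS) feature frames passes through a single feed-forward layer and a Class-Token (CT) Transformer to yield a healthy/pathological decision. The following is a minimal sketch of that idea, not the authors' implementation: it reduces the Transformer to one single-head attention step evaluated only for the class-token query, uses random untrained weights, and all names (`class_token_classify`, `init_params`, the dimensions 8 and 4) are hypothetical choices made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random matrix; stands in for trained weights (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(feat_dim, model_dim, seed=0):
    """Hypothetical parameter set for the sketch (untrained)."""
    rng = random.Random(seed)
    return {
        "W_ff": rand_matrix(model_dim, feat_dim, rng),   # feed-forward layer
        "cls": [rng.uniform(-1, 1) for _ in range(model_dim)],  # class token
        "W_q": rand_matrix(model_dim, model_dim, rng),
        "W_k": rand_matrix(model_dim, model_dim, rng),
        "W_v": rand_matrix(model_dim, model_dim, rng),
        "w_out": [rng.uniform(-1, 1) for _ in range(model_dim)],
        "b_out": 0.0,
    }


def class_token_classify(frames, params):
    """Return (P(pathological), attention weights) for one utterance."""
    # 1. Single feed-forward layer with ReLU, applied to every feature frame.
    hidden = [[max(0.0, h) for h in matvec(params["W_ff"], f)] for f in frames]
    # 2. Prepend the class token to the projected sequence.
    tokens = [params["cls"]] + hidden
    # 3. Single-head attention, computed only for the class-token query,
    #    since only the class-token output is used for classification.
    q = matvec(params["W_q"], tokens[0])
    keys = [matvec(params["W_k"], t) for t in tokens]
    values = [matvec(params["W_v"], t) for t in tokens]
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]
    # 4. Linear read-out on the class-token output, squashed to a probability.
    logit = sum(w * p for w, p in zip(params["w_out"], pooled)) + params["b_out"]
    return 1.0 / (1.0 + math.exp(-logit)), weights


if __name__ == "__main__":
    rng = random.Random(1)
    # 20 frames of 8-dimensional stand-in features (real SS features are far larger).
    frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
    prob, attn = class_token_classify(frames, init_params(8, 4))
    print(f"P(pathological) = {prob:.3f}, attention over {len(attn)} tokens")
```

The attention weights cover the class token plus every frame, so an utterance of any length is summarized into one fixed-size vector before the final decision; a full Transformer would add multiple heads, residual connections, normalization, and stacked layers.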

Similar Items