Statistical Depth for Text Data: An Application to the Classification of Healthcare Data

This manuscript introduces a new concept of statistical depth function: the compositional D-depth. It is the first data depth developed exclusively for text data, in particular, for those data vectorized according to a frequency-based criterion, such as the tf-idf...

Bibliographic Details
Main Authors:	Sergio Bolívar, Alicia Nieto-Reyes, Heather L. Rogers
Format:	Article
Language:	English
Published:	MDPI AG 2023-01-01
Series:	Mathematics
Subjects:	compositional depth; multivariate data; natural language processing; qualitative data; statistical depth; supervised classification
Online Access:	https://www.mdpi.com/2227-7390/11/1/228

author	Sergio Bolívar; Alicia Nieto-Reyes; Heather L. Rogers
collection	DOAJ
description	This manuscript introduces a new concept of statistical depth function: the compositional D-depth. It is the first data depth developed exclusively for text data, in particular for data vectorized according to a frequency-based criterion, such as the tf-idf (term frequency–inverse document frequency) statistic, which results in most vector entries taking a value of zero. The proposed data depth takes the inverse discrete Fourier transform of the vectorized text fragments and then applies a statistical depth for functional data, D. This depth is intended to address the sparsity of the numerical features that results from transforming qualitative text data into quantitative data, a common procedure in most natural language processing frameworks. This sparsity hinders the use of traditional statistical depths and machine learning techniques for classification. To demonstrate the potential value of the proposal, it is applied to a real-world case study that involves mapping Consolidated Framework for Implementation Research (CFIR) constructs to qualitative healthcare data. It is shown that the $DD^G$-classifier yields competitive results and outperforms all studied traditional machine learning techniques (logistic regression with LASSO regularization, artificial neural networks, decision trees, and support vector machines) when used in combination with the newly defined compositional D-depth.
first_indexed	2024-03-09T12:08:06Z
format	Article
id	doaj.art-db5370ce129c4662bf50941419a5a280
institution	Directory Open Access Journal
issn	2227-7390
language	English
last_indexed	2024-03-09T12:08:06Z
publishDate	2023-01-01
publisher	MDPI AG
record_format	Article
series	Mathematics
spelling	doaj.art-db5370ce129c4662bf50941419a5a280; 2023-11-30T22:55:45Z; eng; MDPI AG; Mathematics; 2227-7390; 2023-01-01; vol. 11, no. 1, art. 228; 10.3390/math11010228; Statistical Depth for Text Data: An Application to the Classification of Healthcare Data; Sergio Bolívar (Department of Mathematics, Statistics and Computer Science, Universidad de Cantabria, 39005 Santander, Spain); Alicia Nieto-Reyes (Department of Mathematics, Statistics and Computer Science, Universidad de Cantabria, 39005 Santander, Spain); Heather L. Rogers (Biocruces Bizkaia Health Research Institute, 48903 Barakaldo, Spain); https://www.mdpi.com/2227-7390/11/1/228
title	Statistical Depth for Text Data: An Application to the Classification of Healthcare Data
topic	compositional depth; multivariate data; natural language processing; qualitative data; statistical depth; supervised classification
url	https://www.mdpi.com/2227-7390/11/1/228
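
The pipeline outlined in the abstract (frequency-based vectorization, inverse discrete Fourier transform, then a functional depth) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy documents, the smoothed idf formula, the stacking of real and imaginary parts into one curve, and the integrated (Fraiman-Muniz-style) depth standing in for D are all assumptions made for illustration, and the final maximum-depth rule is a simpler stand-in for the $DD^G$-classifier, which the record does not specify.

```python
import numpy as np

def tfidf(docs):
    """Toy tf-idf vectorizer; the smoothed idf used here is an assumption."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    col = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            tf[i, col[w]] += 1.0
    df = (tf > 0).sum(axis=0)                      # document frequency per term
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0
    return tf * idf                                # sparse: most entries are 0

def to_curves(X):
    """Inverse DFT of each row; real and imaginary parts are stacked
    side by side so each document becomes one real-valued curve
    (the stacking is an illustrative choice, not taken from the paper)."""
    Z = np.fft.ifft(X, axis=1)
    return np.hstack([Z.real, Z.imag])

def integrated_depth(curves, sample):
    """Integrated depth: average over grid points of 1 - |1/2 - F_t(x(t))|,
    where F_t is the empirical CDF of the sample at grid point t.
    Used here as a hypothetical choice of the functional depth D."""
    F = (sample[None, :, :] <= curves[:, None, :]).mean(axis=1)
    return (1.0 - np.abs(0.5 - F)).mean(axis=1)

# Hypothetical text fragments with two made-up class labels.
docs = [
    "staff lacked time for training",
    "training sessions were too short for staff",
    "leadership supported the new programme",
    "managers and leadership endorsed the programme",
]
labels = np.array([0, 0, 1, 1])

X = tfidf(docs)
curves = to_curves(X)

# Depth of every document with respect to each class sample.
depth_by_class = np.column_stack(
    [integrated_depth(curves, curves[labels == k]) for k in (0, 1)]
)
# Maximum-depth rule: assign each document to the class in which it is deepest.
predicted = depth_by_class.argmax(axis=1)
```

The two columns of `depth_by_class` are the coordinates a DD-plot would use; a DD-type classifier would learn a separating rule in that depth-versus-depth space rather than simply taking the larger value.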

Similar Items