Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature

The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology encompasses using two different supervised learning classification approaches of feat...

Bibliographic Details
Main Authors:	Jose Dixon, Md Rahman
Format:	Article
Language:	English
Published:	MDPI AG 2023-12-01
Series:	Machine Learning and Knowledge Extraction
Subjects:	text retrieval; text classification; imbalanced sampling; feature engineering; statistical analysis; data preprocessing
Online Access:	https://www.mdpi.com/2504-4990/5/4/95

author	Jose Dixon Md Rahman
collection	DOAJ
description	This paper demonstrates how data preprocessing, training size variation, and subsampling can change the performance metrics of imbalanced text classification. The methodology compares two supervised learning approaches to feature engineering and data preprocessing, using five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, and statistical analysis in R with tidyverse. The dataset comprises 1000 portable document format (PDF) files divided into five labels, drawn from the World Health Organization Coronavirus Research Downloadable Articles database (COVID-19 papers) and PubMed Central (non-COVID-19 papers) for binary classification. Performance is measured by precision, recall, receiver operating characteristic area under the curve (ROC AUC), and accuracy. One approach, which labels rows of sentences using regular expressions, significantly improved the performance of imbalanced sampling techniques compared with the other approach, which labels sentences automatically according to how the documents are organized into positive and negative classes; this was verified with a t-test on the performance metrics recorded across iterations. The study demonstrates the effectiveness of ML classifiers and sampling techniques on text classification datasets, with different performance levels and class imbalance issues observed between the manual and automatic methods of data processing.
first_indexed	2024-03-08T20:34:56Z
format	Article
id	doaj.art-83a5456082754e36883f86dcd470ce2d
institution	Directory Open Access Journal
issn	2504-4990
language	English
last_indexed	2024-03-08T20:34:56Z
publishDate	2023-12-01
publisher	MDPI AG
record_format	Article
series	Machine Learning and Knowledge Extraction
spelling	doaj.art-83a5456082754e36883f86dcd470ce2d; 2023-12-22T14:22:16Z; eng; MDPI AG; Machine Learning and Knowledge Extraction; 2504-4990; 2023-12-01; vol. 5, no. 4, pp. 1953-1978; doi:10.3390/make5040095; Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature; Jose Dixon, Md Rahman (both: Computer Science Department, Morgan State University, Baltimore, MD 21251, USA); https://www.mdpi.com/2504-4990/5/4/95; text retrieval; text classification; imbalanced sampling; feature engineering; statistical analysis; data preprocessing
title	Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature
topic	text retrieval; text classification; imbalanced sampling; feature engineering; statistical analysis; data preprocessing
url	https://www.mdpi.com/2504-4990/5/4/95

Similar Items