A Biomedical Case Study Showing That Tuning Random Forests Can Fundamentally Change the Interpretation of Supervised Data Structure Exploration Aimed at Knowledge Discovery

Knowledge discovery in biomedical data using supervised methods assumes that the data contain structure relevant to the class structure if a classifier can be trained to assign a case to the correct class better than by guessing. In this setting, acceptance or rejection of a scientific hypothesis ma...

Bibliographic Details
Main Authors:	Jörn Lötsch, Benjamin Mayer
Format:	Article
Language:	English
Published:	MDPI AG 2022-10-01
Series:	BioMedInformatics
Subjects:	data science; artificial intelligence; machine-learning; digital medicine
Online Access:	https://www.mdpi.com/2673-7426/2/4/34

author	Jörn Lötsch; Benjamin Mayer
collection	DOAJ
description	Knowledge discovery in biomedical data using supervised methods assumes that the data contain structure relevant to the class structure if a classifier can be trained to assign a case to the correct class better than by guessing. In this setting, acceptance or rejection of a scientific hypothesis may depend critically on the ability to classify cases better than randomly, without high classification performance being the primary goal. Random forests are often chosen for knowledge-discovery tasks because they are considered a powerful classifier that does not require sophisticated data transformation or hyperparameter tuning and can be regarded as a reference classifier for tabular numerical data. Here, we report a case where the failure of random forests using the default hyperparameter settings in the standard implementations of R and Python would have led to the rejection of the hypothesis that the data contained structure relevant to the class structure. After tuning the hyperparameters, classification performance increased from 56% to 65% balanced accuracy in R, and from 55% to 67% balanced accuracy in Python. More importantly, the 95% confidence intervals in the tuned versions lay above the value of 50% that characterizes guessing-level classification. Thus, tuning provided the desired evidence that the data structure supported the class structure of the data set. In this case, tuning did more than yield a quantitative difference in the form of slightly better classification accuracy; it significantly changed the interpretation of the data set. This is especially relevant when classification performance is low and a small improvement raises the balanced accuracy above the 50% guessing level.
first_indexed	2024-03-11T09:06:07Z
format	Article
id	doaj.art-93bf94a8c1714895a4aff0953f8bfc4c
institution	Directory Open Access Journal
issn	2673-7426
language	English
last_indexed	2024-03-11T09:06:07Z
publishDate	2022-10-01
publisher	MDPI AG
record_format	Article
series	BioMedInformatics
spelling	doaj.art-93bf94a8c1714895a4aff0953f8bfc4c 2023-11-16T19:21:21Z eng MDPI AG BioMedInformatics 2673-7426 2022-10-01 2 4 544 552 10.3390/biomedinformatics2040034 A Biomedical Case Study Showing That Tuning Random Forests Can Fundamentally Change the Interpretation of Supervised Data Structure Exploration Aimed at Knowledge Discovery Jörn Lötsch Benjamin Mayer Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany https://www.mdpi.com/2673-7426/2/4/34 data science artificial intelligence machine-learning digital medicine
title	A Biomedical Case Study Showing That Tuning Random Forests Can Fundamentally Change the Interpretation of Supervised Data Structure Exploration Aimed at Knowledge Discovery
topic	data science; artificial intelligence; machine-learning; digital medicine
url	https://www.mdpi.com/2673-7426/2/4/34

Similar Items