Validation of machine learning ridge regression models using Monte Carlo, bootstrap, and variations in cross-validation

In recent years, there have been several calls by practitioners of machine learning to provide more guidelines on how to use its methods and techniques. For example, the current literature on resampling methods is confusing and sometimes contradictory; worse, there are sometimes no practical guidelines offered at all. To address this shortcoming, a simulation study was conducted that evaluated ridge regression models fitted on five real-world datasets. The study compared the performance of four resampling methods, namely, Monte Carlo resampling, bootstrap, k-fold cross-validation, and repeated k-fold cross-validation. The goal was to find the best-fitting λ (regularization) parameter that would minimize mean squared error, by using nine variations of these resampling methods. For each of the nine resampling variations, 1,000 runs were performed to see how often a good fit, average fit, and poor fit λ value would be chosen. The resampling method that chose good fit values the greatest number of times was deemed the best method. Based on the results of the investigation, three general recommendations are made: (1) repeated k-fold cross-validation is the best method to select as a general-purpose resampling method; (2) k = 10 folds is a good choice in k-fold cross-validation; (3) Monte Carlo and bootstrap are underperformers, so they are not recommended as general-purpose resampling methods. At the same time, no resampling method was found to be uniformly better than the others.
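The λ-selection procedure the abstract describes (repeated k-fold cross-validation over a grid of regularization values, picking the λ that minimizes held-out mean squared error) can be sketched in a minimal pure-Python example. This is an illustration only, not the paper's actual protocol: it uses a hypothetical one-feature, no-intercept ridge model and an arbitrary λ grid, rather than the study's nine resampling variations or its five real-world datasets.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    # Closed-form ridge estimate for a single-feature, no-intercept model:
    # beta = sum(x*y) / (sum(x^2) + lambda)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def repeated_kfold_mse(xs, ys, lam, k=10, repeats=5, seed=0):
    # Repeated k-fold cross-validation: reshuffle the indices, split them
    # into k folds, fit on k-1 folds, and average the held-out squared
    # error over every fold of every repeat.
    rng = random.Random(seed)
    n = len(xs)
    total, count = 0.0, 0
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for fold in folds:
            held_out = set(fold)
            train_x = [xs[i] for i in range(n) if i not in held_out]
            train_y = [ys[i] for i in range(n) if i not in held_out]
            beta = fit_ridge_1d(train_x, train_y, lam)
            for i in fold:
                total += (ys[i] - beta * xs[i]) ** 2
                count += 1
    return total / count

def select_lambda(xs, ys, grid, k=10, repeats=5):
    # Return the lambda in the grid with the lowest cross-validated MSE.
    return min(grid, key=lambda lam: repeated_kfold_mse(xs, ys, lam, k, repeats))
```

On synthetic data such as `ys = [2 * x + noise for x in xs]`, `select_lambda(xs, ys, [0.0, 0.01, 0.1, 1.0, 10.0])` returns whichever grid value gives the smallest averaged held-out error; the paper's k = 10 recommendation corresponds to the default `k=10` here.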

Bibliographic Details
Main Author: Nakatsu, Robbie T.
Format: Article
Language: English
Published: De Gruyter, 2023-07-01
Series: Journal of Intelligent Systems
Subjects: ridge regression; machine learning; model validation; cross validation; resampling methods
Online Access: https://doi.org/10.1515/jisys-2022-0224
Citation: Nakatsu, Robbie T. "Validation of machine learning ridge regression models using Monte Carlo, bootstrap, and variations in cross-validation." Journal of Intelligent Systems 32, no. 1 (2023): 55-67. De Gruyter. ISSN 2191-026X. https://doi.org/10.1515/jisys-2022-0224
Author affiliation: Department of Information Systems and Business Analytics, Loyola Marymount University, Los Angeles, CA 90045, USA