A New Formula for Faster Computation of the K-Fold Cross-Validation and Good Regularisation Parameter Values in Ridge Regression

Bibliographic Details
Main Authors: Kristian Hovde Liland, Joakim Skogholt, Ulf Geir Indahl
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10411898/
Description
Summary: In the present paper, we prove a new theorem resulting in an update formula for linear regression model residuals that yields the exact k-fold cross-validation residuals for any choice of cross-validation strategy, without model refitting. The required matrix inversions are limited in size by the cross-validation segment sizes and can be executed with high efficiency in parallel. The well-known formula for leave-one-out cross-validation follows as a special case of the theorem. In situations where the cross-validation segments consist of small groups of repeated measurements, we suggest a heuristic strategy for fast serial approximations of the cross-validated residuals and the associated Predicted Residual Sum of Squares (PRESS) statistic. We also suggest strategies for efficient estimation of the minimum PRESS value and of the full PRESS function over a selected interval of regularisation values. The computational effectiveness of the resulting parameter selection for Ridge and Tikhonov regression modelling is demonstrated in several applications with real, highly multivariate datasets.
ISSN: 2169-3536
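
Illustrative sketch: The summary above notes that the well-known leave-one-out formula follows as a special case of the theorem. As a point of reference only, and not a reproduction of the paper's new update formula, the Python sketch below illustrates that classical leave-one-out shortcut for ridge regression, together with its standard segment-wise generalisation in which each cross-validation fold S requires solving only a |S| x |S| system involving the block (I - H[S,S]) of the hat matrix. All function names and the synthetic data are illustrative assumptions, not taken from the article.

import numpy as np

def ridge_hat_and_residuals(X, y, lam):
    # Ridge hat matrix H = X (X'X + lam*I)^{-1} X' and training residuals e = y - H y.
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    e = y - H @ y
    return H, e

def loo_press(X, y, lam):
    # Leave-one-out PRESS via the classical shortcut e_i / (1 - h_ii), with no refitting.
    H, e = ridge_hat_and_residuals(X, y, lam)
    e_cv = e / (1.0 - np.diag(H))
    return np.sum(e_cv ** 2)

def kfold_press(X, y, lam, folds):
    # Segment-wise PRESS: for each held-out fold S, the cross-validated residuals are
    # (I - H[S, S])^{-1} e[S], so only a |S| x |S| linear system is solved per fold.
    H, e = ridge_hat_and_residuals(X, y, lam)
    press = 0.0
    for S in folds:
        S = np.asarray(S)
        block = np.eye(len(S)) - H[np.ix_(S, S)]
        e_cv = np.linalg.solve(block, e[S])
        press += np.sum(e_cv ** 2)
    return press

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((40, 200))   # "wide" data, as in highly multivariate settings
    y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(40)
    folds = np.array_split(rng.permutation(40), 8)
    for lam in (0.1, 1.0, 10.0):
        print(lam, loo_press(X, y, lam), kfold_press(X, y, lam, folds))

For a fixed regularisation value, the leave-one-out and segment-wise quantities computed here agree with refitting the ridge model on each training split; the efficiency gains described in the abstract, across folds and across an interval of regularisation values, depend on the paper's theorem and heuristics, which this sketch does not attempt to reproduce.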