Comprehensive credit scoring datasets for robust testing: Out-of-sample, out-of-time, and out-of-universe evaluation

This data article curates datasets from Freddie Mac's Single-Family Loan-Level Dataset (SFLLD) quarterly snapshots. The SFLLD tracks loan originations in the USA along with the ensuing repayment trends. This live dataset undergoes quarterly updates. The current work is based on over 50 million...

Bibliographic Details
Main Authors:	Jonah Mushava, Michael Murray
Format:	Article
Language:	English
Published:	Elsevier 2024-06-01
Series:	Data in Brief
Subjects:	Credit risk; Classification techniques; Machine learning; Freddie Mac
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340924002312

Collection:	Directory of Open Access Journals (DOAJ)
Record ID:	doaj.art-a730d4bda82d4251a6f2c1cc421db137
ISSN:	2352-3409
Volume:	54
Article number:	110262
Indexed:	2024-04-25
Affiliations:	Jonah Mushava (corresponding author) and Michael Murray: School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Westville Campus, Private Bag X54001, Durban, 4000, South Africa
Description:	This data article curates datasets from Freddie Mac's Single-Family Loan-Level Dataset (SFLLD) quarterly snapshots. The SFLLD tracks loan originations in the USA along with the ensuing repayment trends. This live dataset undergoes quarterly updates. The current work is based on over 50 million fully amortized fixed-rate mortgage loans, which were initiated from 1999 through June 2022. Monthly performance metrics for these loans span from 1999 to September 30, 2022. Loan origination and repayment data were integrated using a unique loan ID, with defaults being identified when three payments were missed within specific performance windows (12-, 24-, 36-, 48-, and 60-months). To ensure rigorous model evaluation, only loans initiated post-2008 and their performance up to 2019 were considered, intentionally sidestepping external influences from the 2007 to 2008 financial crisis and the COVID-19 pandemic. The data was stratified by credit scores, leading to 10 folders with three distinct datasets for model training, out-of-sample testing, and out-of-time testing. We designed the out-of-time testing dataset to mimic real-life conditions as closely as possible. A unique “out-of-universe” test dataset was further constructed from 2019-originated loans, capturing their performance throughout the pandemic. In each dataset, there are 1464 covariates and a binary target label. With the release of these datasets, we hope to empower researchers to utilize common datasets, especially in the credit-scoring area, where access to proprietary datasets is limited.
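
The description in this record outlines a join on loan ID, a default flag based on three missed payments within a performance window, and a separate out-of-time test set. The following is a minimal sketch of that idea on toy data, not the authors' pipeline. The field names, the running `payments_missed` count, and the origination-year cutoff used for the out-of-time split are all assumptions made for illustration; the record does not state the SFLLD schema or how the split was drawn.

```python
from collections import defaultdict

# Hypothetical toy records; field names are illustrative, not the SFLLD schema.
originations = [
    {"loan_id": "A", "orig_year": 2010, "credit_score": 720},
    {"loan_id": "B", "orig_year": 2010, "credit_score": 640},
    {"loan_id": "C", "orig_year": 2013, "credit_score": 700},
]
# (loan_id, loan_age_months, payments_missed); payments_missed is an assumed
# running count of missed payments as of that month.
performance = [
    ("A", 6, 0), ("A", 12, 1), ("A", 30, 3),
    ("B", 5, 1), ("B", 9, 3),
    ("C", 10, 0), ("C", 24, 2),
]

WINDOWS = (12, 24, 36, 48, 60)   # performance windows named in the description
MISSED_PAYMENTS_FOR_DEFAULT = 3  # three missed payments count as a default


def label_defaults(originations, performance, windows=WINDOWS):
    """Join the two tables on loan_id and attach one binary label per window."""
    history = defaultdict(list)
    for loan_id, age, missed in performance:
        history[loan_id].append((age, missed))

    labelled = []
    for loan in originations:
        row = dict(loan)
        for w in windows:
            # Default within window w: any month at or before w with >= 3 missed.
            row[f"default_{w}m"] = int(any(
                age <= w and missed >= MISSED_PAYMENTS_FOR_DEFAULT
                for age, missed in history[loan["loan_id"]]
            ))
        labelled.append(row)
    return labelled


def split_by_origination_year(rows, cutoff_year):
    """Assumed split rule: loans originated after cutoff_year form the out-of-time set."""
    development = [r for r in rows if r["orig_year"] <= cutoff_year]
    out_of_time = [r for r in rows if r["orig_year"] > cutoff_year]
    return development, out_of_time


labelled = label_defaults(originations, performance)
development, out_of_time = split_by_origination_year(labelled, cutoff_year=2011)
```

In this sketch an out-of-sample test set would be a random holdout drawn from `development`, while `out_of_time` contains only loans originated after the cutoff, so a model is scored on loans from a later period than any it was trained on.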

Similar Items