Autoencoders for sample size estimation for fully connected neural network classifiers

Abstract: Sample size estimation is a crucial step in experimental design but is understudied in the context of deep learning. Currently, estimating the quantity of labeled data needed to train a classifier to a desired performance is largely based on prior experience with similar models and problem...


Bibliographic Details
Main Authors:	Faris F. Gulamali, Ashwin S. Sawant, Patricia Kovatch, Benjamin Glicksberg, Alexander Charney, Girish N. Nadkarni, Eric Oermann
Format:	Article
Language:	English
Published:	Nature Portfolio 2022-12-01
Series:	npj Digital Medicine
Online Access:	https://doi.org/10.1038/s41746-022-00728-0

_version_	1797641544927281152
author	Faris F. Gulamali, Ashwin S. Sawant, Patricia Kovatch, Benjamin Glicksberg, Alexander Charney, Girish N. Nadkarni, Eric Oermann
author_sort	Faris F. Gulamali
collection	DOAJ
description	Abstract: Sample size estimation is a crucial step in experimental design but is understudied in the context of deep learning. Currently, estimating the quantity of labeled data needed to train a classifier to a desired performance is largely based on prior experience with similar models and problems or on untested heuristics. In many supervised machine learning applications, data labeling can be expensive and time-consuming and would benefit from a more rigorous means of estimating labeling requirements. Here, we study the problem of estimating the minimum sample size of labeled training data necessary for training computer vision models as an exemplar for other deep learning problems. We consider the problem of identifying the minimal number of labeled data points to achieve a generalizable representation of the data, a minimum converging sample (MCS). We use autoencoder loss to estimate the MCS for fully connected neural network classifiers. At sample sizes smaller than the MCS estimate, fully connected networks fail to distinguish classes, and at sample sizes above the MCS estimate, generalizability strongly correlates with the loss function of the autoencoder. We provide an easily accessible, code-free, and dataset-agnostic tool to estimate sample sizes for fully connected networks. Taken together, our findings suggest that MCS and convergence estimation are promising methods to guide sample size estimates for data collection and labeling prior to training deep learning models in computer vision.
first_indexed	2024-03-11T13:47:12Z
format	Article
id	doaj.art-58170488a06d42cbb95fa02c45949816
institution	Directory of Open Access Journals
issn	2398-6352
language	English
last_indexed	2024-03-11T13:47:12Z
publishDate	2022-12-01
publisher	Nature Portfolio
record_format	Article
series	npj Digital Medicine
spelling	doaj.art-58170488a06d42cbb95fa02c45949816; 2023-11-02T10:12:11Z; eng; Nature Portfolio; npj Digital Medicine; 2398-6352; 2022-12-01; 5118; 10.1038/s41746-022-00728-0; Autoencoders for sample size estimation for fully connected neural network classifiers; Faris F. Gulamali (Icahn School of Medicine), Ashwin S. Sawant (Icahn School of Medicine), Patricia Kovatch (Icahn School of Medicine), Benjamin Glicksberg (Icahn School of Medicine), Alexander Charney (Icahn School of Medicine), Girish N. Nadkarni (Icahn School of Medicine), Eric Oermann (New York University); https://doi.org/10.1038/s41746-022-00728-0
title	Autoencoders for sample size estimation for fully connected neural network classifiers
url	https://doi.org/10.1038/s41746-022-00728-0


Similar Items