Last Layer Retraining of Selectively Sampled Wild Data Improves Performance
While AI models perform well in the lab, where training and testing data come from similar domains, they experience significant drops in performance in the wild, where data can lie in domains outside the training distribution. Out-of-distribution (OOD) generalization is difficult because these domains are underrepresented or absent in the training data. The pursuit of bridging the performance gap between in-distribution and out-of-distribution data has led to the development of various generalization algorithms that target finding invariant/"good" features. Recent results have highlighted the possibility that poorly generalized classification layers are the main contributor to the performance difference, while the featurizer is already able to produce sufficiently good features.

This thesis verifies this possibility over a combination of datasets, generalization algorithms, and classifier training methods. We show that we can significantly improve OOD performance compared to the original models, when evaluated on natural OOD domains, by simply retraining a new classification layer on a small number of labeled examples. We further study methods for efficiently selecting which OOD examples to label for training the classifier by applying clustering techniques to featurized unlabeled OOD data.
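The abstract describes two ingredients: choosing a small set of wild (OOD) examples to label by clustering their frozen features, and retraining only the final classification layer on those examples. The following is a minimal sketch of that idea under assumed placeholders (`featurizer`, `ood_inputs`, `annotate`, and the labeling `budget` are illustrative names, not identifiers from the thesis), not the thesis's actual implementation:

```python
# Hypothetical sketch: cluster-based selection of OOD examples to label,
# followed by last-layer (linear head) retraining on frozen features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def select_by_clustering(features: np.ndarray, budget: int) -> np.ndarray:
    """Pick `budget` indices of unlabeled OOD points to send for labeling:
    cluster the features and take the point closest to each centroid."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(features)
    chosen = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    return np.array(chosen)


def retrain_last_layer(features: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit a fresh linear classification head on the frozen features of the
    few labeled OOD examples; the featurizer itself is left untouched."""
    return LogisticRegression(max_iter=1000).fit(features, labels)


# Example usage (assumes `featurizer` maps raw inputs to feature vectors and
# `annotate` returns ground-truth labels for the selected indices):
# ood_feats = featurizer(ood_inputs)                # features of unlabeled wild data
# idx = select_by_clustering(ood_feats, budget=50)  # choose 50 examples to label
# head = retrain_last_layer(ood_feats[idx], annotate(idx))
# ood_preds = head.predict(ood_feats)
```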
Main Author: | Yang, Hao Bang |
Other Authors: | Solomon, Justin; Yurochkin, Mikhail |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Degree: | M.Eng. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Rights: | In Copyright - Educational Use Permitted (https://rightsstatements.org/page/InC-EDU/1.0/) |
Online Access: | https://hdl.handle.net/1721.1/151358 |