How early can we average Neural Networks?
There is a recurring observation in deep learning that neural networks can be combined simply with arithmetic averages over their parameters. This observation has led to many new research directions in model ensembling, meta-learning, federated learning, and optimization. We investigate the evolution of this phenomenon during the training trajectory of neural network models initialized from a common set of parameters (the parent). Surprisingly, the benefit of averaging the parameters persists over long child trajectories branching from parent parameters with minimal training. Furthermore, we find that the parent can be merged with a single child with significant improvement in both training and test loss. Through analysis of the loss landscape, we find that the loss becomes sufficiently convex early in training and, as a consequence, models obtained by averaging multiple children often outperform any individual child.
Main Author: | Nasimov, Umarbek |
---|---|
Other Authors: | Poggio, Tomaso |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Degree: | M.Eng. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Rights: | In Copyright - Educational Use Permitted; copyright retained by author(s); https://rightsstatements.org/page/InC-EDU/1.0/ |
Online Access: | https://hdl.handle.net/1721.1/151660 |
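The core operation the abstract describes is a plain arithmetic mean over model parameters. A minimal sketch, assuming PyTorch modules (the record does not name a framework, and the function names below are illustrative, not taken from the thesis):

```python
# Minimal sketch of the parent/child averaging the abstract describes: several
# child models are trained from a shared parent checkpoint and then merged by a
# plain arithmetic mean of their parameters.
import copy
import torch
import torch.nn as nn


def average_state_dicts(state_dicts):
    """Element-wise arithmetic mean of state_dicts that share the same keys and shapes."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        # Only average floating-point tensors; integer buffers are kept from the first model.
        if torch.is_floating_point(avg[key]):
            avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg


def merge_children(parent: nn.Module, children):
    """Return a copy of `parent` whose weights are the mean of the children's weights.

    The children are assumed to have been initialized from `parent`, which keeps
    them in a shared region of the loss landscape where averaging can help.
    """
    merged = copy.deepcopy(parent)
    merged.load_state_dict(average_state_dicts([c.state_dict() for c in children]))
    return merged


# Merging the parent with a single child is just the two-element case:
# merged = merge_children(parent, [parent, child])
```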