Investigating system resilience in distributed evolutionary GAN training

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021

Bibliographic Details
Main Author:	Mustafi, Urmi.
Other Authors:	Erik Hemberg and Jamal Toutouh.
Format:	Thesis
Language:	eng
Published:	Massachusetts Institute of Technology 2021
Subjects:	Electrical Engineering and Computer Science.
Online Access:	https://hdl.handle.net/1721.1/130707

_version_	1826196975244017664
author	Mustafi, Urmi.
author2	Erik Hemberg and Jamal Toutouh.
author_facet	Erik Hemberg and Jamal Toutouh. Mustafi, Urmi.
author_sort	Mustafi, Urmi.
collection	MIT
description	Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021
first_indexed	2024-09-23T10:40:48Z
format	Thesis
id	mit-1721.1/130707
institution	Massachusetts Institute of Technology
language	eng
last_indexed	2024-09-23T10:40:48Z
publishDate	2021
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/130707 2021-05-25T03:32:42Z Investigating system resilience in distributed evolutionary GAN training Mustafi, Urmi. Erik Hemberg and Jamal Toutouh. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Electrical Engineering and Computer Science. Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021 Cataloged from the official PDF of thesis. Includes bibliographical references (pages 57-58). Generative Adversarial Networks (GANs) provide a useful approach to new data generation with a few common problems of mode collapsing and oscillating behavior. Lipizzaner improves the performance of distributed GAN training with the use of a spatially distributed coevolutionary algorithm and gradient-based optimizers. However, in its current state the use of Lipizzaner is limited by its vulnerabilities on systems that encounter frequent node failures. When faced with a single node failure, Lipizzaner's entire experiment comes to a halt and must be restarted. We see a need for increasing Lipizzaner's resilience to such failures and do the following. We apply a combination of uncoordinated checkpointing, attempted reconnecting, and restarting nodes to form a simple and efficient solution for system resilience in Lipizzaner. We find that checkpointing and reconnecting are essential and simple solutions to failure recovery in Lipizzaner, while restarting nodes requires a more nuanced approach that shows promising results when used correctly to address node failures. by Urmi Mustafi. M. Eng. M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science 2021-05-24T19:52:31Z 2021-05-24T19:52:31Z 2021 2021 Thesis https://hdl.handle.net/1721.1/130707 1251801498 eng MIT theses may be protected by copyright. 
Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582 58 pages application/pdf Massachusetts Institute of Technology
spellingShingle	Electrical Engineering and Computer Science. Mustafi, Urmi. Investigating system resilience in distributed evolutionary GAN training
title	Investigating system resilience in distributed evolutionary GAN training
title_full	Investigating system resilience in distributed evolutionary GAN training
title_fullStr	Investigating system resilience in distributed evolutionary GAN training
title_full_unstemmed	Investigating system resilience in distributed evolutionary GAN training
title_short	Investigating system resilience in distributed evolutionary GAN training
title_sort	investigating system resilience in distributed evolutionary gan training
topic	Electrical Engineering and Computer Science.
url	https://hdl.handle.net/1721.1/130707
work_keys_str_mv	AT mustafiurmi investigatingsystemresilienceindistributedevolutionarygantraining
