Monkey: Platform-Agnostic Hybrid-Cloud Cluster Compute Orchestration Designed for AI/ML
As AI/ML research progresses, the amount of compute needed to train and evaluate state-of-the-art AI algorithms consistently increases. With increasing needs for compute, researchers spend time designing distributed systems to scalably train and hyper-parameter optimize their latest model rather tha...
Main Author: | Lamp, Avery |
---|---|
Other Authors: | Agrawal, Pulkit |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/139258 |
_version_ | 1826200957358178304 |
---|---|
author | Lamp, Avery |
author2 | Agrawal, Pulkit |
collection | MIT |
description | As AI/ML research progresses, the amount of compute needed to train and evaluate state-of-the-art AI algorithms consistently increases. With increasing needs for compute, researchers spend time designing distributed systems to scalably train and hyper-parameter optimize their latest model rather than focusing on their core research. We aim to build a fault-tolerant distributed system capable of cheaply and flexibly scheduling reproducible research training jobs on heterogeneous hybrid-cloud compute clusters, including local machines and provider-agnostic cloud machines. Our system focuses on ML researchers with two main goals: minimizing costs (using preemptible/spot instances) and user-friendliness. The system aims to require minimal user setup and configuration, allowing researchers to quickly get started training models. The Monkey System includes a web console and visualization dashboard to track, evaluate, and compare multiple jobs’ progress and results. |
first_indexed | 2024-09-23T11:44:23Z |
format | Thesis |
id | mit-1721.1/139258 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T11:44:23Z |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/139258 2022-01-15T03:38:43Z Monkey: Platform-Agnostic Hybrid-Cloud Cluster Compute Orchestration Designed for AI/ML Lamp, Avery Agrawal, Pulkit Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science As AI/ML research progresses, the amount of compute needed to train and evaluate state-of-the-art AI algorithms consistently increases. With increasing needs for compute, researchers spend time designing distributed systems to scalably train and hyper-parameter optimize their latest model rather than focusing on their core research. We aim to build a fault-tolerant distributed system capable of cheaply and flexibly scheduling reproducible research training jobs on heterogeneous hybrid-cloud compute clusters including local machines and provider agnostic cloud machines. Our system focuses on ML researchers with two main goals, minimizing costs (using preemptible/spot-instances) and user friendliness. The system aims to require minimal user setup and configuration, allowing researchers to quickly get started training models. The Monkey System includes a web console and visualization dashboard to track, evaluate, and compare multiple jobs’ progress and results. M.Eng. 2022-01-14T14:59:56Z 2022-01-14T14:59:56Z 2021-06 2021-06-17T20:13:33.943Z Thesis https://hdl.handle.net/1721.1/139258 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
title | Monkey: Platform-Agnostic Hybrid-Cloud Cluster Compute Orchestration Designed for AI/ML |
url | https://hdl.handle.net/1721.1/139258 |