Congestion Control in Machine Learning Clusters

This paper argues that fair-sharing, the holy grail of congestion control algorithms for decades, is not necessarily a desirable property in Machine Learning (ML) training clusters. We demonstrate that for a specific combination of jobs, introducing unfairness improves the training time for all comp...

Bibliographic Details
Main Author: Rajasekaran, Sudarsanan
Other Authors: Ghobadi, Manya
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156313