"Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" by Aurick Qiao

Article

Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021

Aurick Qiao, Petuum, Inc.
Sang Keun Choe, Carnegie Mellon University
Suhas Jayaram Subramanya, Carnegie Mellon University
Willie Neiswanger, Petuum, Inc.
Qirong Ho, Petuum, Inc.
Hao Zhang, Petuum, Inc.
Gregory R. Ganger, Carnegie Mellon University
Eric P. Xing, Petuum, Inc. & Mohamed bin Zayed University of Artificial Intelligence & Carnegie Mellon University

Document Type

Conference Proceeding

Abstract

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number of resources for each job, often leading to inefficient resource use. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize the provided resources. Pollux simultaneously considers both aspects. By monitoring the status of each job during training, Pollux models how their goodput (a metric we introduce to combine system throughput with statistical efficiency) would change by adding or removing resources. Pollux dynamically (re-)assigns resources to improve cluster-wide goodput, while respecting fairness and continually optimizing each DL job to better utilize those resources. In experiments with real DL jobs and with trace-driven simulations, Pollux reduces average job completion times by 37–50% relative to state-of-the-art DL schedulers, even when they are provided with ideal resource and training configurations for every job. Pollux promotes fairness among DL jobs competing for resources, based on a more meaningful measure of useful job progress, and reveals a new opportunity for reducing DL cost in cloud environments. Pollux is implemented and publicly available as part of an open-source project at https://github.com/petuum/adaptdl.
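As a rough illustration of the idea in the abstract, the sketch below treats goodput as the product of system throughput and statistical efficiency, then hands out GPUs one at a time to whichever job gains the most goodput. The product form, the greedy loop, the function names (`goodput`, `greedy_allocate`), and the two toy job models are assumptions made for this example. They are not taken from the Pollux implementation, whose actual optimizer and fairness handling are described in the paper.

```python
from typing import Callable, Dict

# Hypothetical per-job model: maps a GPU count to predicted goodput.
GoodputModel = Callable[[int], float]


def goodput(throughput: float, efficiency: float) -> float:
    """Assumed combination: examples/sec scaled by an efficiency factor in (0, 1]."""
    return throughput * efficiency


def greedy_allocate(models: Dict[str, GoodputModel], total_gpus: int) -> Dict[str, int]:
    """Give each GPU to the job with the largest marginal goodput gain.

    This is an illustrative heuristic only; it ignores fairness and placement.
    """
    alloc = {name: 0 for name in models}
    if not models:
        return alloc
    for _ in range(total_gpus):
        best = max(
            models,
            key=lambda j: models[j](alloc[j] + 1) - models[j](alloc[j]),
        )
        alloc[best] += 1
    return alloc


# Toy models (made up): job_a scales throughput linearly but loses statistical
# efficiency as it grows; job_b is slower per GPU but keeps full efficiency.
def job_a(n: int) -> float:
    return goodput(100.0 * n, 1.0 / (1.0 + 0.1 * max(n - 1, 0)))


def job_b(n: int) -> float:
    return goodput(60.0 * n, 1.0)


allocation = greedy_allocate({"job_a": job_a, "job_b": job_b}, total_gpus=4)
print(allocation)
```

In this toy setup the faster job stops receiving GPUs once its efficiency loss makes the next GPU worth less than it would be to the slower job, which is the kind of trade-off a throughput-only scheduler would miss.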

Publication Date

1-1-2021

Keywords

Adaptive clusters; Cloud environments; Dependent factors; Open source projects; Scheduling performance; Statistical efficiency; System throughput; Trace driven simulation

Disciplines

Computer Sciences

Comments

IR deposit conditions: none described

Open access to the Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX

Citation Information

A. Qiao et al., "Pollux: Co-adaptive cluster scheduling for Goodput-optimized deep learning," in Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation, July 14–16, 2021, pp. 1–18. https://www.usenix.org/system/files/osdi21-qiao.pdf