Beamer: Stage-Aware Coflow Scheduling to Accelerate Hyper-Parameter Tuning in Deep Learning Clusters
IEEE Transactions on Network and Service Management
  • Yihong He, Key Laboratory of Optical Fiber Sensing and Communications, Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 611731, China
  • Weibo Cai, Key Laboratory of Optical Fiber Sensing and Communications, Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 611731, China
  • Pan Zhou, Key Laboratory of Optical Fiber Sensing and Communications, Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 611731, China
  • Gang Sun, Key Laboratory of Optical Fiber Sensing and Communications, Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 611731, China & Agile and Intelligent Computing Key Laboratory of Sichuan Province, Chengdu, 611731, China
  • Shouxi Luo, School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, 611756, China
  • Hongfang Yu, Key Laboratory of Optical Fiber Sensing and Communications, Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 611731, China & Peng Cheng Laboratory, Shenzhen, 518066, China
  • Mohsen Guizani, Mohamed Bin Zayed University of Artificial Intelligence
Document Type
Article
Abstract

Training a neural network typically requires retraining the same model many times to search for the hyper-parameter configuration that yields the best training result. It is common to launch multiple training jobs and evaluate them in stages: at the end of each stage, jobs with unpromising configurations are terminated and jobs with new configurations are launched. Each job typically performs distributed training across multiple GPUs, which periodically synchronize their models over the network. However, the model synchronizations of concurrently running jobs cause severe network congestion, significantly increasing the stage completion time (SCT) and thus the time needed to find the desired configuration. Existing flow schedulers are ineffective at reducing SCT because they are agnostic to training stages. In this paper, we propose a stage-aware coflow scheduling method to minimize the average SCT. The method uses a dedicated algorithm to order coflows based on stage information and then schedules coflows according to that order. Mathematical analysis shows that the method achieves an average SCT within 20/3 of optimal. We implement the method in a real system called Beamer. Extensive testbed experiments and simulations show that Beamer significantly outperforms advanced network designs such as Sincronia, FIFO-LM, and per-flow fair sharing.
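To make the abstract's central idea concrete, the sketch below illustrates one plausible way to order coflows with stage information: since a stage completes only when all of its coflows complete, lighter stages are drained first, and coflows with smaller bottlenecks go first within a stage. This is a hypothetical Python illustration under assumed names (`Coflow`, `stage_id`, `bottleneck_bytes`), not the paper's actual algorithm.

from dataclasses import dataclass

# Hypothetical model of a coflow; field names are assumptions, not the paper's API.
@dataclass
class Coflow:
    coflow_id: int
    stage_id: int          # hyper-parameter search stage this coflow belongs to
    bottleneck_bytes: int  # bytes on the coflow's most-loaded port

def stage_aware_order(coflows):
    """Order coflows so that whole stages finish early, reducing average SCT.

    Stages with less total bottleneck load are scheduled first; within a
    stage, coflows with smaller bottlenecks come first. This mirrors the
    stage-then-coflow ordering idea described in the abstract only in
    spirit, not the paper's exact algorithm or its 20/3 guarantee.
    """
    # A stage completes only when all of its coflows complete, so its
    # remaining load is the sum of its coflows' bottleneck sizes.
    stage_load = {}
    for c in coflows:
        stage_load[c.stage_id] = stage_load.get(c.stage_id, 0) + c.bottleneck_bytes
    return sorted(coflows, key=lambda c: (stage_load[c.stage_id], c.bottleneck_bytes))

if __name__ == "__main__":
    demo = [Coflow(1, 0, 800), Coflow(2, 1, 100), Coflow(3, 1, 50), Coflow(4, 0, 200)]
    for c in stage_aware_order(demo):
        print(f"schedule coflow {c.coflow_id} (stage {c.stage_id})")

Running the sketch schedules stage 1's two small coflows before stage 0's heavier ones, so the first stage finishes early and the average stage completion time drops, which is the intuition behind stage awareness.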

DOI
10.1109/TNSM.2021.3132361
Publication Date
June 1, 2022
Keywords
  • Computer graphics,
  • Graphics processing unit,
  • Program processors,
  • Scheduling,
  • Completion time,
  • Deep learning,
  • Flow scheduling,
  • Hyper-parameter tuning/search,
  • Neural networks,
  • Parameter tuning,
  • Stage completion time,
  • Tuning
Comments

IR Deposit conditions:
  • OA version (pathway a): Accepted version
  • No embargo
  • When accepted for publication, set statement to accompany deposit (see policy)
  • Must link to publisher version with DOI
  • Publisher copyright and source must be acknowledged

Citation Information
Y. He et al., "Beamer: Stage-Aware Coflow Scheduling to Accelerate Hyper-Parameter Tuning in Deep Learning Clusters," IEEE Transactions on Network and Service Management, vol. 19, no. 2, pp. 1083-1097, Jun. 2022, doi: 10.1109/TNSM.2021.3132361.