Article
QoS-Aware Scheduling of Heterogeneous Servers for Inference in Deep Neural Networks
Proc. of the 2017 ACM on Conference on Information and Knowledge Management (CIKM-17) (2017)
  • Zhou Fang
  • Tong Yu
  • Ole J Mengshoel
  • Rajesh Gupta
Abstract
Deep neural networks (DNNs) are popular in diverse fields such as computer vision and natural language processing. DNN inference tasks are emerging as a service provided by cloud computing environments. However, cloud-hosted DNN inference faces new challenges in workload scheduling to achieve the best Quality of Service (QoS), because QoS depends on batch size, model complexity, and resource allocation. This paper represents the QoS metric as a utility function of response delay and inference accuracy. We first propose a simple and effective heuristic approach that keeps response delay low while satisfying the processing-throughput requirement. We then describe an advanced deep reinforcement learning (RL) approach that learns to schedule from experience. The RL scheduler is trained to maximize QoS, using a set of system statuses as input to the RL policy model. Our approach performs scheduling actions only when a GPU becomes free, thus reducing scheduling overhead compared to common RL schedulers that act at every time step. We evaluate the schedulers on a simulation platform and demonstrate the advantages of RL over the heuristics.
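The abstract does not give the exact form of the utility function or the scheduling rule, so the following is a minimal illustrative sketch under assumed specifics: a step-penalty utility (full accuracy within a deadline, a fixed penalty after it) and a greedy, event-driven scheduler that picks a model variant only when a GPU frees up. The `Model` class, numeric weights, and deadline are hypothetical, not taken from the paper.

```python
# Hedged sketch of QoS-aware, event-driven scheduling. The utility form,
# model parameters, and deadline below are ASSUMPTIONS for illustration;
# the paper's actual utility of delay and accuracy may differ.
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    accuracy: float   # expected inference accuracy in [0, 1]
    latency_s: float  # expected per-batch processing time on this GPU


def qos_utility(accuracy: float, delay_s: float, deadline_s: float = 0.5) -> float:
    """Assumed utility: full credit for accuracy within the deadline,
    a fixed penalty once the response delay exceeds it."""
    return accuracy if delay_s <= deadline_s else accuracy - 0.5


def schedule_on_free_gpu(queued_wait_s: float, candidates: list[Model]) -> Model:
    """Invoked only when a GPU becomes free (event-driven, as in the paper's
    design). Total response delay = time already spent queued + model latency;
    pick the candidate model maximizing estimated QoS."""
    return max(
        candidates,
        key=lambda m: qos_utility(m.accuracy, queued_wait_s + m.latency_s),
    )


models = [
    Model("small", accuracy=0.88, latency_s=0.05),
    Model("large", accuracy=0.95, latency_s=0.40),
]
# A fresh request can afford the accurate model; a request that has already
# waited 0.3 s would blow the deadline on the large model, so it gets the
# fast one. This captures the delay/accuracy trade-off the paper schedules.
print(schedule_on_free_gpu(0.0, models).name)  # large
print(schedule_on_free_gpu(0.3, models).name)  # small
```

Triggering the decision only at GPU-free events, rather than at every time step, is the overhead reduction the abstract highlights over conventional RL schedulers.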
Keywords
  • neural networks
  • deep learning
  • machine learning
  • scheduling
  • QoS
  • reinforcement learning
Publication Date
November 7, 2017
Citation Information
Zhou Fang, Tong Yu, Ole J Mengshoel, and Rajesh Gupta. "QoS-Aware Scheduling of Heterogeneous Servers for Inference in Deep Neural Networks." Proc. of the 2017 ACM on Conference on Information and Knowledge Management (CIKM-17) (2017), pp. 2067–2070.
Available at: http://works.bepress.com/ole_mengshoel/79/