Accelerating Big Data Applications Using Lightweight Virtualization Framework on Enterprise Cloud
21st IEEE High Performance Extreme Computing Conference (HPEC 2017) (2017)
  • Janki Bhimani, Northeastern University
  • Zhengyu Yang, Northeastern University
  • Miriam Leeser, Northeastern University
  • Ningfang Mi, Northeastern University
Hypervisor-based virtualization has been used successfully to deploy high-performance, scalable infrastructure for Hadoop and, more recently, Spark applications. Container-based virtualization is becoming an important alternative and is increasingly used because containers are lightweight and scale better than Virtual Machines (VMs). As containerization technologies such as Docker mature and promise better performance, Docker can be used to speed up big data applications. However, because applications differ in behavior and resource requirements, it is important to analyze and compare the performance of applications running in the cloud on VMs and on Docker containers before replacing traditional hypervisor-based virtual machines with Docker. VM-based virtualization provides distributed resource management across virtual machines, each running with its own allocated resources, while Docker relies on a pool of resources shared among all containers. Here, we investigate the performance of different Apache Spark applications using both VMs and Docker containers. While others have looked at Docker’s performance, this is the first study that compares these virtualization frameworks for a big data enterprise cloud environment using Apache Spark. In addition to makespan and execution time, we analyze the utilization of different resources (CPU, disk, memory, etc.) by Spark applications. Our results show that Spark on Docker can achieve a speed-up of over 10 times compared to Spark on VMs. However, we observe that this may not hold for all applications because of different workload patterns and the different resource management schemes used by virtual machines and containers. Our work can guide application developers, system administrators, and researchers in designing and deploying big data applications on their platforms to improve overall performance.
  • Virtual Machine (VM)
  • Container
  • Docker
  • Apache Spark
  • Big Data
  • Cloud Computing
  • Resource Management
  • Task Assignment
  • Workload Evaluation & Estimation
Publication Date
Citation Information
Janki Bhimani, Zhengyu Yang, Miriam Leeser and Ningfang Mi. "Accelerating Big Data Applications Using Lightweight Virtualization Framework on Enterprise Cloud" 21st IEEE High Performance Extreme Computing Conference (HPEC 2017) (2017)
Available at: