Article
Workload-Dependent Relative Fault Sensitivity and Error Contribution Factor of GPU Onchip Memory Structures
Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (2013, Agios Konstantinos, Greece)
  • Ronak Shah
  • Minsu Choi, Missouri University of Science and Technology
  • Byunghyun Jang
Abstract

The GPU (Graphics Processing Unit) is emerging as an efficient and scalable accelerator for data-parallel workloads in applications ranging from tablet PCs to HPC (High Performance Computing) mainframes. Unlike traditional 3D graphics rendering, general-purpose compute applications demand stringent reliability assurance. Therefore, single-error tolerance schemes such as SECDED (Single Error Correcting, Double Error Detecting) codes are being rapidly introduced into high-end GPUs targeting high-performance general-purpose computing. However, the relative fault sensitivity and error contribution of critical on-chip memory structures such as the active mask stack (AMS), register file (REG) and local memory (MEM) are yet to be studied. Also, the implications of single-error tolerance for various GPGPU (General Purpose computing on GPU) workloads have not been quantitatively analyzed to reveal its relative cost/fault-tolerance efficiency. To address these issues, this work explores a novel Monte Carlo simulation framework to enumerate and analyze well-converged fault injection data. Instead of estimating the AVF (Architectural Vulnerability Factor) of each structure individually, we inject faults into the whole memory (AMS, REG and MEM combined) in a structure-oblivious fashion. We then categorize and analyze each structure's relative fault sensitivity and error contribution factor. Finally, we study the implications of single-error tolerance for the memory structures by considering eight different possible ECC (Error Correction Code) profiles. Results show that the relative fault sensitivity and error contribution of REG are the highest among the considered memory structures; therefore, ECC protection of REG is the most critical and cost-effective.
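
The structure-oblivious injection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' framework: the structure sizes, the per-structure error probabilities (`P_ERROR`), and all function names are hypothetical placeholders, and the two metrics follow one plausible reading of the terms (errors per injected fault for sensitivity, share of all errors for contribution). The eight ECC profiles are assumed here to be the 2^3 on/off combinations over AMS, REG and MEM, which the abstract does not state explicitly.

```python
import random
from itertools import product

# Hypothetical structure sizes in bits (not taken from the paper).
SIZES = {"AMS": 1_024, "REG": 16_384, "MEM": 32_768}
# Hypothetical chance that one bit flip in a structure corrupts the output.
# A real study would get this by running each workload on a GPU simulator.
P_ERROR = {"AMS": 0.30, "REG": 0.20, "MEM": 0.05}


def locate(bit, sizes=SIZES):
    """Map a flat bit index over the combined memory to its structure."""
    for name, size in sizes.items():
        if bit < size:
            return name
        bit -= size
    raise ValueError("bit index out of range")


def run_trials(n_trials, sizes=SIZES, p_error=P_ERROR, seed=0):
    """Inject one fault per trial, uniformly over all bits, then categorize."""
    rng = random.Random(seed)
    total_bits = sum(sizes.values())
    faults = dict.fromkeys(sizes, 0)
    errors = dict.fromkeys(sizes, 0)
    for _ in range(n_trials):
        # Structure-oblivious: the target bit is chosen without regard
        # to which structure it belongs to.
        structure = locate(rng.randrange(total_bits), sizes)
        faults[structure] += 1
        if rng.random() < p_error[structure]:
            errors[structure] += 1
    return faults, errors


def summarize(faults, errors):
    """Assumed metric definitions: errors per fault, and share of all errors."""
    total_errors = sum(errors.values())
    sensitivity = {s: errors[s] / faults[s] if faults[s] else 0.0 for s in faults}
    contribution = {s: errors[s] / total_errors if total_errors else 0.0 for s in errors}
    return sensitivity, contribution


def ecc_profiles(structures=tuple(SIZES)):
    """Every protected/unprotected combination (2**3 = 8 for three structures)."""
    return [dict(zip(structures, bits))
            for bits in product((False, True), repeat=len(structures))]


def residual_errors(errors, profile):
    """Errors left when single-error correction masks faults in protected structures."""
    return sum(count for s, count in errors.items() if not profile[s])
```

Calling `run_trials(100_000)` followed by `summarize(...)` yields per-structure estimates, and looping `residual_errors` over `ecc_profiles()` compares the profiles. Because the probabilities are placeholders, the numbers illustrate the bookkeeping only and do not reproduce the paper's results.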

Meeting Name
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation: SAMOS XIII (2013: Jul. 15-18, Agios Konstantinos, Greece)
Department(s)
Electrical and Computer Engineering
Keywords and Phrases
  • Computer Graphics
  • Computer Simulation
  • Monte Carlo Methods
  • Personal Computers
  • Program Processors
  • Architectural Vulnerability Factor
  • Error Correction Codes
  • Fault Sensitivity
  • General Purpose Computing on GPU
  • General-Purpose Computing
  • Graphics Processing Unit
  • High Performance Computing
  • Monte-Carlo Simulations
  • Computer Graphics Equipment
International Standard Book Number (ISBN)
978-1479901036
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2013 Institute of Electrical and Electronics Engineers (IEEE), All rights reserved.
Publication Date
01 Jul 2013
Citation Information
Ronak Shah, Minsu Choi and Byunghyun Jang. "Workload-Dependent Relative Fault Sensitivity and Error Contribution Factor of GPU Onchip Memory Structures" Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (2013, Agios Konstantinos, Greece) (2013) pp. 271-278
Available at: http://works.bepress.com/minsu-choi/17/