Analysis and solution of memory failure phenomenon in Server systems

  • Shin, Hyunsung (Dept. of Computer Science Engineering, Seoul National University)
  • Yoo, Sungjoo (Dept. of Computer Science Engineering, Seoul National University)
  • Received : 2017.12.08
  • Accepted : 2017.12.21
  • Published : 2017.12.31

Abstract

The most important requirement for maintaining the many server systems used in enterprise and data center environments is to prevent uncorrectable errors (UEs) in each system. With the recent growth of cloud services, more and higher-capacity memory modules are in use than ever before, while server operating frequencies have risen and memory process technology has continued to shrink, making memory failures considerably more likely. Methods for repairing memory defects directly in the server system are available, but no guideline currently exists for using them effectively. In this paper, we propose a method to effectively prevent and handle memory failures in server systems, based on the observation and analysis of memory failure phenomena in existing systems.
