Design of a Dispatch Unit & Operand Selection Unit for Improving the SIMT Based GP-GPU Instruction Performance

  • Received : 2015.09.09
  • Accepted : 2015.09.23
  • Published : 2015.09.30

Abstract

This paper proposes a dispatch unit for a GP-GPU with SIMT architecture that supports the acceleration of general-purpose computation as well as graphics processing. If all of the operand information used by instructions issued from the warp scheduler is decoded, unnecessary operand loads occur, which places load on the registers. To resolve this problem, this paper proposes a pre-decoding method that decodes only the operand information first, reducing both the operand loads and the load on the registers. The operand information from the dispatch unit is passed to an operand selection unit that prevents register bank conflicts, so overall performance is improved. In the simulation test, ModelSim SE 10.0b was used to measure the total clock cycles required to process 10,000 arbitrary instructions issued from the warp scheduler. The results show that applying the dispatch unit equipped with the proposed pre-decoding function improves processing performance by about 12% compared to the conventional method.

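This paper proposes a Dispatch Unit and an Operand Selection Unit for a SIMT-architecture GP-GPU to support the acceleration of general-purpose computation as well as graphics processing. Decoding all of the information about the operands used by an instruction issued from the Warp Scheduler causes unnecessary operand loads, which places load on the registers. To solve this problem, we propose a method that uses pre-decoding to decode only the operand information first, reducing operand loads and the load on the registers. The operand information produced by the proposed Dispatch Unit is passed to an Operand Selection Unit that applies a method for preventing register bank conflicts, which improves overall processing performance. Using ModelSim 10.0b, we measured the total clock cycles required to process 10,000 arbitrary instructions issued from the Warp Scheduler. Applying the proposed Dispatch Unit with the pre-decoding function and the proposed Operand Selection Unit increased processing efficiency by about 11% and 24%, respectively, compared with existing methods.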
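
The two ideas in the abstract, decoding only the operand fields an instruction actually uses and keeping register reads from colliding on a bank, can be illustrated with a small software model. This is only a sketch under stated assumptions and not the paper's hardware design: the 32-bit instruction layout, the bank count `NUM_BANKS`, the `reg % NUM_BANKS` bank mapping, and the function names `encode`, `predecode_sources`, and `schedule_reads` are all hypothetical.

```python
from __future__ import annotations

NUM_BANKS = 4  # assumed number of register banks (not taken from the paper)

# Hypothetical 32-bit layout, chosen only for illustration:
# [31:26] opcode | [25:24] number of sources used | [23:18] dst
# [17:12] src0   | [11:6] src1                    | [5:0] src2


def encode(opcode: int, n_src: int, dst: int,
           src0: int = 0, src1: int = 0, src2: int = 0) -> int:
    """Pack the fields into one instruction word (hypothetical encoding)."""
    return ((opcode << 26) | (n_src << 24) | (dst << 18)
            | (src0 << 12) | (src1 << 6) | src2)


def decode_all_sources(instr: int) -> list[int]:
    """Baseline: decode every source field, whether or not it is used."""
    return [(instr >> 12) & 0x3F, (instr >> 6) & 0x3F, instr & 0x3F]


def predecode_sources(instr: int) -> list[int]:
    """Pre-decode: return only the source registers the instruction uses."""
    n_src = (instr >> 24) & 0x3
    return decode_all_sources(instr)[:n_src]


def schedule_reads(regs: list[int]) -> list[list[int]]:
    """Greedily group reads into cycles with at most one read per bank."""
    cycles: list[dict[int, int]] = []  # each cycle maps bank -> register
    for reg in regs:
        bank = reg % NUM_BANKS  # assumed interleaved bank mapping
        for cycle in cycles:
            if bank not in cycle:
                cycle[bank] = reg
                break
        else:
            # Every existing cycle already reads this bank: add a cycle.
            cycles.append({bank: reg})
    return [list(cycle.values()) for cycle in cycles]


if __name__ == "__main__":
    # Two-source instruction; the src2 field holds a stale value (9).
    instr = encode(opcode=1, n_src=2, dst=5, src0=1, src1=5, src2=9)
    print(len(schedule_reads(decode_all_sources(instr))))  # cycles, baseline
    print(len(schedule_reads(predecode_sources(instr))))   # cycles, pre-decoded
```

In this toy model, skipping the unused source field removes one register read, and the greedy grouping serializes only the reads that map to the same bank. The actual cycle savings in hardware depend on details that the abstract does not give.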
