XEM: Tensor accelerator for AB21 supercomputing artificial intelligence processor

  • Won Jeon (Hyperscale AI SoC Research Section, Electronics and Telecommunications Research Institute) ;
  • Mi Young Lee (Hyperscale AI SoC Research Section, Electronics and Telecommunications Research Institute) ;
  • Joo Hyun Lee (Hyperscale AI SoC Research Section, Electronics and Telecommunications Research Institute) ;
  • Chun-Gi Lyuh (Hyperscale AI SoC Research Section, Electronics and Telecommunications Research Institute)
  • Received: 2024.03.24
  • Accepted: 2024.08.14
  • Published: 2024.10.10

Abstract

As computing systems grow ever larger, high-performance computing (HPC) is gaining importance. In particular, with the emergence of hyperscale artificial intelligence (AI) applications such as large language models, HPC has become important even in the field of AI. The key operations in hyperscale AI and HPC are mainly tensor-based linear algebraic operations. The AB21 supercomputing AI processor has been proposed to accelerate such applications. This study proposes the XEM accelerator to effectively accelerate linear algebraic operations in the AB21 processor. The XEM accelerator has outer product-based parallel floating-point units that can efficiently process tensor operations. We provide hardware details of the XEM architecture and introduce new instructions for controlling the XEM accelerator. Additionally, we present hardware characteristic analyses based on chip fabrication and simulator-based functional verification. In the future, the performance and functionality of the XEM accelerator will be verified on the AB21 processor.
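The outer-product formulation mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is an algorithmic illustration only, not the XEM datapath itself: it shows how a matrix multiplication decomposes into rank-1 outer-product updates, the form of parallelism such hardware exploits.

```python
import numpy as np

def matmul_outer_product(A, B):
    """Compute C = A @ B as a sum of rank-1 outer products.

    In the inner-product view, each C[i, j] is a dot product of a row
    of A and a column of B. In the outer-product view used here, one
    rank-1 update is accumulated per index t of the shared dimension:
    C += outer(A[:, t], B[t, :]). Every element of C is updated in
    parallel within each step, which is what makes this formulation
    attractive for a grid of parallel floating-point units.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for t in range(k):
        C += np.outer(A[:, t], B[t, :])  # one rank-1 update
    return C

# Sanity check against NumPy's reference matmul.
A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(matmul_outer_product(A, B), A @ B)
```

Note that each of the `k` rank-1 updates touches all `m * n` output elements, so a hardware implementation can keep the full output tile stationary in accumulator registers while streaming in one column of A and one row of B per step.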

Acknowledgement

This work was supported by the Supercomputer Development Leading Program of the National Research Foundation (NRF) funded by the Korea government (MSIT) (2021M3H6A1017683, Supercomputer Processor Research and Development).
