Acknowledgement
This work was supported by the Supercomputer Development Leading Program of the National Research Foundation (NRF) funded by the Korea government (MSIT) (2021M3H6A1017683, Supercomputer Processor Research and Development).
References
- W. Jeon and C.-G. Lyuh, Technical trends in hyper-scale artificial intelligence processors, Electron. Telecommun. Trends 38 (2023), no. 5, 1-11.
- J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK users' guide, Vol. 8, SIAM, 1979.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, Adv. Neural Inform. Process. Syst. 30 (2017), 1-11.
- C. G. Lyuh, B. J. Kim, C. Kim, H. Kim, K. H. Park, J. H. Suk, K. Shin, M. Y. Lee, J. H. Lee, and W. Jeon, Supercomputer SoC design and verification/FPGA platform development, (Summer Annu. Conf. IEIE, Jeju, Republic of Korea), 2023, pp. 2732-2734.
- C. Kim, J. H. Suk, S. Jun, and C.-G. Lyuh, Porting linux on an FPGA board for ARM64 SoC test, (Fall Annu. Conf. IEIE, Gwangju, Republic of Korea), 2022, pp. 125-127.
- M. Y. Lee, J. H. Lee, and C.-G. Lyuh, Intrinsic functions, libraries, and test application environment for accelerated parallel computing in matrix and vector operations, (Fall Annu. Conf. IEIE, Seoul, Republic of Korea), 2023, pp. 363-365.
- W. Jeon, Y. C. P. Cho, H. M. Kim, H. Kim, J. Chung, J. Kim, M. Lee, C.-G. Lyuh, J. Han, and Y. Kwon, M3FPU: Multi-format matrix multiplication FPU architectures for neural network computations, (IEEE 4th Int. Conf. Artificial Intelligence Circuits and Systems, Incheon, Republic of Korea), 2022, pp. 150-153.
- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, and S. Anadkat, GPT-4 technical report, 2023. https://doi.org/10.48550/arXiv.2303.08774
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, and A. Askell, Language models are few-shot learners, Adv. Neural Inform. Process. Syst. 33 (2020), 1877-1901.
- Y. C. P. Cho, J. Chung, J. Yang, C.-G. Lyuh, H. Kim, C. Kim, J. Ham, M. Choi, K. Shin, J. Han, and Y. Kwon, AB9: A neural processor for inference acceleration, ETRI J. 42 (2020), no. 4, 491-504.
- J. Chung, H. Kim, K. Shin, C.-G. Lyuh, Y. C. P. Cho, J. Han, Y. Kwon, Y.-H. Gong, and S. W. Chung, A layer-wise frequency scaling for a neural processing unit, ETRI J. 44 (2022), no. 5, 849-858.
- ARM, Arm Neoverse V1 core technical reference manual, 2023.
- ARM, Arm Neoverse CMN-700 coherent mesh network technical reference manual, 2023.
- NVIDIA, NVIDIA H100 tensor core GPU architecture: exceptional performance, scalability, and security for the data center, 2023.
- AMD, AMD CDNA3 architecture: the all-new AMD GPU architecture for the modern era of HPC and AI, 2023.
- A. Waterman and K. Asanovic, The RISC-V instruction set manual volume I: user-level ISA v2.2, 2017.
- H. Kaul, M. Anders, S. Mathew, S. Hsu, A. Agarwal, F. Sheikh, R. Krishnamurthy, and S. Borkar, A 1.45 GHz 52-to-162 GFLOPS/W variable-precision floating-point fused multiply-add unit with certainty tracking in 32 nm CMOS, (IEEE Int. Solid-State Circuits Conf., San Francisco, CA, USA), 2012, pp. 182-184.
- S. Mach, F. Schuiki, F. Zaruba, and L. Benini, FPnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing, IEEE Trans. Very Large Scale Integr. Syst. 29 (2020), no. 4, 774-787.
- H. Zhang, D. Chen, and S.-B. Ko, Efficient multiple-precision floating-point fused multiply-add with mixed-precision support, IEEE Trans. Comput. 68 (2019), no. 7, 1035-1048.
- N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, Training deep neural networks with 8-bit floating point numbers, Adv. Neural Inform. Process. Syst. 31 (2018), 1-10.
- R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, Multi2Sim: A simulation framework for CPU-GPU computing, (Proc. 21st Int. Conf. Parallel Architectures and Compilation Techniques, Minneapolis, MN, USA), 2012, pp. 335-344.
- RISC-V International, Spike RISC-V ISA simulator, 2019.
- NVIDIA, NVIDIA Blackwell architecture technical brief: Powering the new era of generative AI and accelerated computing, 2024.
- A. Vahdat and M. Lohmeyer, Enabling next-generation AI workloads: announcing TPU v5p and AI hypercomputer, 2023.
- Graphcore, Graphcore documents. Available from: https://docs.graphcore.ai/en/latest/
- A. Firoozshahian, J. Coburn, R. Levenstein, R. Nattoji, A. Kamath, O. Wu, G. Grewal, H. Aepala, B. Jakka, and B. Dreyer, MTIA: First generation silicon targeting Meta's recommendation systems, (Proc. 50th Annu. Int. Symp. Computer Architecture, Orlando, FL, USA), 2023, pp. 1-13.
- Intel, Intel Gaudi 3 AI accelerator, 2024.
- E. Talpes, D. D. Sarma, D. Williams, S. Arora, T. Kunjan, B. Floering, A. Jalote, C. Hsiong, C. Poorna, and V. Samant, The microarchitecture of DOJO, Tesla's exa-scale computer, IEEE Micro 43 (2023), no. 3, 31-39.
- FuriosaAI, RNGD: The most efficient data center accelerator for high-performance LLM and multimodal deployment. Available from: https://furiosa.ai/renegade-spec
- Rebellions, REBEL: shaping the future of gen AI. Available from: https://rebellions.ai/products/
- SAPEON, SAPEON X330 product brief, 2024.
- Cerebras, Wafer-scale engine 3: the largest chip ever built, 2024.