PF-GEMV: Utilization maximizing architecture in fast matrix-vector multiplication for GPT-2 inference

  • Hyeji Kim (Hyperscale AI SoC Research Section, Electronics and Telecommunications Research Institute) ;
  • Yeongmin Lee (Hyperscale AI SoC Research Section, Electronics and Telecommunications Research Institute) ;
  • Chun-Gi Lyuh (Hyperscale AI SoC Research Section, Electronics and Telecommunications Research Institute)
  • Received : 2024.03.13
  • Accepted : 2024.08.16
  • Published : 2024.10.10

Abstract

Owing to the widespread adoption of transformer-based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix-vector multiplication in addition to the conventional matrix-matrix multiplication. However, current AI processor architectures are optimized for general matrix-matrix multiplications (GEMMs), which causes significant throughput degradation when processing general matrix-vector multiplications (GEMVs). In this study, we propose a port-folding GEMV (PF-GEMV) scheme that employs multiformat and low-precision techniques while reusing an outer-product-based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8-bit format on an 8 × 8 processor, a 7.5× throughput increase over the original scheme. Furthermore, when applied to the matrix operations of the GPT-2 large model, it yields a 7× speedup in single-batch inference.
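For intuition on the utilization figures above, the following is a minimal Python sketch, not the authors' implementation (the PF-GEMV dataflow is not detailed in this abstract), of why a single-token GEMV occupies only one row of an 8 × 8 outer-product array, roughly 1/8 ≈ 12.5% of its multiply-accumulate slots, and how folding several narrow 8-bit operands onto each input port could in principle refill the idle rows. The fold factor of 8 and the cycle model are assumptions for illustration.

```python
# Sketch of the utilization argument, assuming a simple outer-product
# cycle model; this is not the PF-GEMV hardware itself.

PE = 8  # the abstract's 8 x 8 processing-element array


def outer_product_cycles(M: int, K: int, N: int) -> int:
    """Cycles for C[M,N] = A[M,K] @ B[K,N] on a PE x PE outer-product array.

    Assumption: each cycle the array consumes a PE-element column of A and a
    PE-element row of B and accumulates their outer product, so each
    PE x PE output tile takes K cycles.
    """
    tiles = -(-M // PE) * -(-N // PE)  # ceil-divide output into PE x PE tiles
    return tiles * K


def utilization(M: int, K: int, N: int, cycles: int) -> float:
    """Fraction of the PE*PE MAC slots doing useful work."""
    return (M * K * N) / (cycles * PE * PE)


# Single-token GEMV shape typical of autoregressive GPT-2 decoding
# (a hypothetical 4096-wide layer, chosen only for illustration).
M, K, N = 1, 4096, 4096

base = outer_product_cycles(M, K, N)
print(f"plain GEMV utilization:  {utilization(M, K, N, base):.1%}")  # ~12.5%

# Port folding (our reading, an assumption): with 8-bit operands, each port
# that carried one wide operand now carries FOLD narrow ones, so FOLD rows of
# the array are fed per cycle instead of one.
FOLD = 8
folded = -(-base // FOLD)
print(f"port-folded utilization: {utilization(M, K, N, folded):.1%}")  # ~100% ideal
```

The ideal model saturates at 100%, whereas the paper reports 93.7% measured utilization; note that 93.7% / 12.5% ≈ 7.5, consistent with the reported 7.5× throughput gain over the unfolded scheme.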

Acknowledgement

This work was supported by the Next-generation Intelligence Semiconductor Foundation grant funded by the Korea government (MSIT) (no. 2020-0-01308, Intelligent Mobile Processor based on Deep-Learning Micro Core Array).
