Acknowledgement
This work was supported by the Next-generation Intelligence Semiconductor Foundation grant funded by the Korea government (MSIT) (no. 2020-0-01308, Intelligent Mobile Processor based on Deep-Learning Micro Core Array).
References
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017), 261-272.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, and S. Gelly, An image is worth 16 × 16 words: transformers for image recognition at scale, arXiv preprint, 2020. https://doi.org/10.48550/arXiv.2010.11929
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, Swin transformer: hierarchical vision transformer using shifted windows, (Proceedings of the IEEE/CVF Int. Conf. on Computer Vision, Montreal, Canada), 2021, pp. 10012-10022.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, Language models are unsupervised multitask learners, OpenAI blog 1 (2019), no. 8, 9.
- A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, Zero-shot text-to-image generation, (Int. Conf. Mach. Learn., Virtual Only), 2021, pp. 8821-8831.
- Y. C. P. Cho, J. Chung, J. Yang, C.-G. Lyuh, H. Kim, C. Kim, J. Ham, M. Choi, K. Shin, J. Han, and Y. Kwon, AB9: A neural processor for inference acceleration, ETRI J. 42 (2020), no. 4, 491-504.
- J. Chung, H. Kim, K. Shin, C.-G. Lyuh, Y. C. P. Cho, J. Han, Y. Kwon, Y.-H. Gong, and S. W. Chung, A layer-wise frequency scaling for a neural processing unit, ETRI J. 44 (2022), no. 5, 849-858.
- W. Jeon, Y. C. P. Cho, H. M. Kim, H. Kim, J. Chung, J. Kim, M. Lee, C.-G. Lyuh, J. Han, and Y. Kwon, M3FPU: multiformat matrix multiplication FPU architectures for neural network computations, (IEEE 4th Int. Conf. Artif. Intell. Circuits Syst., Incheon, Rep. of Korea), 2022, pp. 150-153.
- N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, and C. Young, TPU v4: an optically reconfigurable supercomputer for machine learning with hardware support for embeddings, (Proc. 50th Annual Int. Symp. Comput. Archit., Orlando, FL, USA), 2023, pp. 1-14.
- H. Kim, C.-G. Lyuh, and Y. Kwon, Automated optimization for memory-efficient high-performance deep neural network accelerators, ETRI J. 42 (2020), no. 4, 505-517.
- Hugging Face, GPT-2, 2024. Available from: https://huggingface.co/openai-community/gpt2 [last accessed March 2024].
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, GPTQ: accurate post-training quantization for generative pre-trained transformers, arXiv preprint, 2022. https://doi.org/10.48550/arXiv.2210.17323
- J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, AWQ: activation-aware weight quantization for on-device LLM compression and acceleration, Proc. Mach. Learn. Syst. 6 (2024), 87-100.
- B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, and S. Dusan, Microscaling data formats for deep learning, arXiv preprint, 2023. https://doi.org/10.48550/arXiv.2310.10537
- X. Sun, J. Choi, C.-Y. Chen, N. Wang, S. Venkataramani, V. V. Srinivasan, X. Cui, W. Zhang, and K. Gopalakrishnan, Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks, Adv. Neural Inf. Process. Syst. 32 (2019), 4900-4909.
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Roziere, N. Goyal, E. Hambro, F. Azhar, and A. Rodriguez, LLaMA: open and efficient foundation language models, arXiv preprint, 2023. https://doi.org/10.48550/arXiv.2302.13971
- Hugging Face, GPT-2 medium, 2024. Available from: https://huggingface.co/openai-community/gpt2-medium [last accessed March 2024].
- Hugging Face, GPT-2 large, 2024. Available from: https://huggingface.co/openai-community/gpt2-large [last accessed March 2024].
- D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. Vooturi, N. Jammalamadaka, J. Huang, H. Yuen, and J. Yang, A study of BFLOAT16 for deep learning training, arXiv preprint, 2019. https://doi.org/10.48550/arXiv.1905.12322
- P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, and N. Mellempudi, FP8 formats for deep learning, arXiv preprint, 2022. https://doi.org/10.48550/arXiv.2209.05433
- X. Sun, N. Wang, C.-Y. Chen, J. Ni, A. Agrawal, X. Cui, S. Venkataramani, K. El Maghraoui, V. V. Srinivasan, and K. Gopalakrishnan, Ultra-low precision 4-bit training of deep neural networks, Adv. Neural Inf. Process. Syst. 33 (2020), 1796-1807.