Funding Information
This work was supported by the faculty research fund of Sejong University in 2023.
References
- Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018. DOI: 10.48550/arXiv.1810.04805
- Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol.21, no.1, pp.5485-5551, 2020. DOI: 10.48550/arXiv.1910.10683
- Dura, Davide, "Design and analysis of VLSI architectures for Transformers," Diss. Politecnico di Torino, pp.1-2, 2022.
- Vaswani, A., et al. "Attention is all you need," Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, pp.6000-6010, 2017. DOI: 10.48550/arXiv.1706.03762
- Lan, Zhenzhong, et al. "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019. DOI: 10.48550/arXiv.1909.11942
- Lu, Siyuan, et al. "Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer," 2020 IEEE 33rd International System-on-Chip Conference (SOCC), IEEE, pp.2-3, 2020. DOI: 10.48550/arXiv.2009.08605
- Ye, Wenhua, et al. "Accelerating attention mechanism on FPGAs based on efficient reconfigurable systolic array," ACM Transactions on Embedded Computing Systems, vol.22, no.6, pp.1-22, 2023. DOI: 10.1145/3549937
- Fang, Chao, et al. "An efficient hardware accelerator for sparse transformer neural networks," 2022 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, pp.2670-2674, 2022. DOI: 10.1109/ISCAS48785.2022.9937659
- Fang, Chao et al. "An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.30, pp.1573-1586, 2022. DOI: 10.1109/TVLSI.2022.3197282
- Tuli, Shikhar, and Niraj Kumar Jha, "AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.42, pp.4038-4051, 2023. DOI: 10.1109/TCAD.2023.3273992
- H. T. Kung, B. McDanel, et al. "Maestro: A Memory-on-Logic Architecture for Coordinated Parallel Use of Many Systolic Arrays," 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), New York, NY, USA, pp.42-50, 2019. DOI: 10.1109/ASAP.2019.00-31
- Bansal, Himanshu, et al. "Wallace tree multiplier designs: a performance comparison," Innovative Systems Design and Engineering, vol.5, no.5, pp.67, 2014.
- Tiwari, Shivangi, et al. "FPGA design and implementation of matrix multiplication architecture by PPI-MO techniques," International Journal of Computer Applications, vol.80, no.1, pp.19-22, 2013. DOI: 10.5120/13825-1414
- Elliott, Desmond, et al. "Multi30K: Multilingual English-German image descriptions," arXiv preprint arXiv:1605.00459, pp.73, 2016. DOI: 10.18653/v1/W16-3210
- Cettolo, Mauro, et al. "Overview of the IWSLT 2017 evaluation campaign," Proceedings of the 14th International Workshop on Spoken Language Translation, pp.4, 2017.
- Wang, Longyue, et al. "Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs," arXiv preprint arXiv:2311.03127, pp.58, 2023. DOI: 10.48550/arXiv.2311.03127