• Title/Summary/Keyword: Systolic multiplier

Search Result 43, Processing Time 0.016 seconds

An Efficient Matrix Multiplier Available in Multi-Head Attention and Feed-Forward Network of Transformer Algorithms (트랜스포머 알고리즘의 멀티 헤드 어텐션과 피드포워드 네트워크에서 활용 가능한 효율적인 행렬 곱셈기)

  • Seok-Woo Chang;Dong-Sun Kim
    • Journal of IKEEE
    • /
    • v.28 no.1
    • /
    • pp.53-64
    • /
    • 2024
  • With the advancement of NLP(Natural Language Processing) models, conversational AI such as ChatGPT is becoming increasingly popular. To enhance processing speed and reduce power consumption, it is important to implement the Transformer algorithm, which forms the basis of the latest natural language processing models, in hardware. In particular, the multi-head attention and feed-forward network, which analyze the relationships between different words in a sentence through matrix multiplication, are the most computationally intensive core algorithms in the Transformer. In this paper, we propose a new variable systolic array based on the number of input words to enhance matrix multiplication speed. Quantization maintains Transformer accuracy, boosting memory efficiency and speed. For evaluation purposes, this paper verifies the clock cycles required in multi-head attention and feed-forward network and compares the performance with other multipliers.

Characteristic analysis of Modular Multipliers and Squarers for GF($2^m$) (유한 필드 GF($2^m$)상의 모듈러 곱셈기 및 제곱기 특성 분석)

  • 한상덕;김창훈;홍춘표
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.7 no.5
    • /
    • pp.167-174
    • /
    • 2002
  • This paper analyzes the characteristics of three multipliers and squarers in finite fields GF(2/sup m/) from the point of view of processing time and area complexity. First, we analyze structures of three multipliers and squarers: 1) Systolic array structure, 2), LFSR structure, and 3) CA structure. To make performance analysis, each multiplier and squarer was modeled in VHDL and was synthesized for FPGA implementation. The simulation results show that CA structure is the best from the point view of processing time, and LFSR structure is the best from the point of view of area complexity.

  • PDF

Implementation of RSA modular exponentiator using Division Chain (나눗셈 체인을 이용한 RSA 모듈로 멱승기의 구현)

  • 김성두;정용진
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.12 no.2
    • /
    • pp.21-34
    • /
    • 2002
  • In this paper we propos a new hardware architecture of modular exponentiation using a division chain method which has been proposed in (2). Modular exponentiation using the division chain is performed by receding an exponent E as a mixed form of multiplication and addition with divisors d=2 or $d=2^I +1$ and respective remainders r. This calculates the modular exponentiation in about $1.4log_2$E multiplications on average which is much less iterations than $2log_2$E of conventional Binary Method. We designed a linear systolic array multiplier with pipelining and used a horizontal projection on its data dependence graph. So, for k-bit key, two k-bit data frames can be inputted simultaneously and two modular multipliers, each consisting of k/2+3 PE(Processing Element)s, can operate in parallel to accomplish 100% throughput. We propose a new encoding scheme to represent divisors and remainders of the division chain to keep regularity of the data path. When it is synthesized to ASIC using Samsung 0.5 um CMOS standard cell library, the critical path delay is 4.24ns, and resulting performance is estimated to be abort 140 Kbps for a 1024-bit data frame at 200Mhz clock In decryption process, the speed can be enhanced to 560kbps by using CRT(Chinese Remainder Theorem). Futhermore, to satisfy real time requirements we can choose small public exponent E, such as 3,17 or $2^{16} +1$, in encryption and verification process. in which case the performance can reach 7.3Mbps.