• Title/Summary/Keyword: matrix-vector multiplication

Search Result 35

PF-GEMV: Utilization maximizing architecture in fast matrix-vector multiplication for GPT-2 inference

  • Hyeji Kim;Yeongmin Lee;Chun-Gi Lyuh
    • ETRI Journal
    • /
    • v.46 no.5
    • /
    • pp.817-828
    • /
    • 2024
  • Owing to the widespread advancement of transformer-based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix-vector multiplication in addition to the conventional matrix-matrix multiplication. However, current AI processor architectures are optimized for general matrix-matrix multiplications (GEMMs), which causes significant throughput degradation when processing general matrix-vector multiplications (GEMVs). In this study, we propose a port-folding GEMV (PF-GEMV) scheme that employs multiformat and low-precision techniques while reusing an outer product-based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8-bit format on an 8 × 8 processor, resulting in a 7.5× increase in throughput compared with that of the original scheme. Furthermore, when applied to the matrix operations of the GPT-2 large model, it achieves a 7× speedup in single-batch inference.

A Parallel-Architecture Processor Design for the Fast Multiplication of Homogeneous Transformation Matrices (Homogeneous Transformation Matrix의 곱셈을 위한 병렬구조 프로세서의 설계)

  • Kwon Do-All;Chung Tae-Sang
    • The Transactions of the Korean Institute of Electrical Engineers D
    • /
    • v.54 no.12
    • /
    • pp.723-731
    • /
    • 2005
  • The $4{\times}4$ homogeneous transformation matrix is a compact representation of the orientation and position of an object in robotics and computer graphics. A coordinate transformation is accomplished through successive multiplications of homogeneous matrices, each of which represents the orientation and position of the corresponding link. Thus, real-time control applications in robotics and animation in computer graphics demand fast multiplication of homogeneous matrices. In this paper, a parallel-architecture vector processor is designed for this purpose. The processor has several key features. For computational accuracy in real applications, the operands of the processor are floating-point numbers based on the IEEE Standard 754. For parallelism and reduced hardware redundancy, the processor takes column vectors of homogeneous matrices as the multiplication unit. To further improve throughput, the processor structure and its control are pipelined. Since the designed processor can be used as a special-purpose coprocessor in robotics and computer graphics, several other useful instructions for various transformation algorithms are included, in addition to the special matrix/matrix and matrix/vector multiplications, to widen the applicability of the new design. The suggested instruction set is intended to serve as a standard in future processor design for robotics and computer graphics. The design is verified using an FPGA implementation. The performance improvement of the proposed design over a uniprocessor approach is also studied to assess its suitability for real-time applications.
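The column-vector formulation in the abstract above can be illustrated in software. The following Python sketch is not the processor's design or instruction set; it assumes row-major nested lists and uses hypothetical helper names (`mat_vec`, `mat_mul_by_columns`, `translation`) only to show that a $4{\times}4$ product decomposes into four independent matrix-vector products, one per column of the right operand.

```python
def mat_vec(a, v):
    """Multiply a 4x4 matrix (list of rows) by a length-4 column vector."""
    return [sum(a[i][k] * v[k] for k in range(4)) for i in range(4)]


def mat_mul_by_columns(a, b):
    """Compute a @ b one column at a time.

    Column j of the result is a @ (column j of b), so the four
    matrix-vector products are independent of each other.
    """
    cols = [mat_vec(a, [b[k][j] for k in range(4)]) for j in range(4)]
    # cols[j][i] holds element (i, j); transpose back to row-major order.
    return [[cols[j][i] for j in range(4)] for i in range(4)]


def translation(tx, ty, tz):
    """Hypothetical helper: homogeneous matrix for a pure translation."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


# Chain two link transformations, then map the origin through the result.
composed = mat_mul_by_columns(translation(1, 2, 3), translation(4, 5, 6))
point = mat_vec(composed, [0, 0, 0, 1])
```

Because each column is computed independently, the four `mat_vec` calls could in principle run in parallel, which is the property a column-oriented design can exploit.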

An Implementation of Digital Neural Network Using Systolic Array Processor (영어 수계를 이용한 디지털 신경망회로의 실현)

  • 윤현식;조원경
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.30B no.2
    • /
    • pp.44-50
    • /
    • 1993
  • In this paper, we present an array processor for implementing digital neural networks. The back-propagation model can be formulated as a consecutive matrix-vector multiplication problem with a prespecified thresholding operation. This procedure is well suited to an array processor because it can be executed recursively and repeatedly. A systolic array architecture based on the Residue Number System is suggested to realize an efficient arithmetic circuit for matrix-vector multiplication and for computing the sigmoid function. The proposed design method is expected to be adopted in neural network applications because it can be realized with currently available VLSI technology.

NEW ALGORITHMS FOR SOLVING ODES BY PSEUDOSPECTRAL METHOD

  • Darvishi, M.T.
    • Journal of applied mathematics & informatics
    • /
    • v.7 no.2
    • /
    • pp.439-451
    • /
    • 2000
  • New algorithms for computing derivatives by the matrix-vector multiplication method were introduced in [1, 2]. These algorithms reduce the roundoff error in computing derivatives with Chebyshev collocation methods (CCM). In this paper, some applications of these algorithms are presented.
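For readers unfamiliar with the matrix-vector multiplication method mentioned above, here is a minimal Python sketch of the textbook Chebyshev collocation differentiation matrix. It is not the algorithms of [1, 2]; the function name `cheb` and the choice of Chebyshev-Gauss-Lobatto points are assumptions made for illustration. The diagonal is set to the negative sum of the off-diagonal entries in each row, a commonly used way to limit roundoff error.

```python
import math


def cheb(n):
    """Return (D, x): the (n+1)x(n+1) Chebyshev differentiation matrix
    and the collocation points x_j = cos(pi * j / n)."""
    x = [math.cos(math.pi * j / n) for j in range(n + 1)]
    # Weights: 2 at the endpoints, 1 elsewhere, with alternating sign.
    c = [(2.0 if j in (0, n) else 1.0) * (-1) ** j for j in range(n + 1)]
    D = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if i != j:
                D[i][j] = (c[i] / c[j]) / (x[i] - x[j])
        # Diagonal entry is still 0.0 here, so this is minus the
        # sum of the off-diagonal entries of row i.
        D[i][i] = -sum(D[i])
    return D, x


# Differentiate u(x) = x^2 with one matrix-vector product.
D, x = cheb(8)
u = [xi ** 2 for xi in x]
du = [sum(D[i][j] * u[j] for j in range(len(x))) for i in range(len(x))]
max_err = max(abs(du[i] - 2 * x[i]) for i in range(len(x)))
```

Since $x^2$ is a polynomial of degree below the number of collocation points, the computed derivative should match $2x$ up to roundoff, so `max_err` gives a direct view of the accumulated rounding error.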

GPU-Based ECC Decode Unit for Efficient Massive Data Reception Acceleration

  • Kwon, Jisu;Seok, Moon Gi;Park, Daejin
    • Journal of Information Processing Systems
    • /
    • v.16 no.6
    • /
    • pp.1359-1371
    • /
    • 2020
  • In transmitting and receiving a large amount of data, reliable data communication is crucial for normal operation of a device and for preventing abnormal operations caused by errors. Therefore, this paper assumes that an error correction code (ECC), which can detect and correct errors by itself, is used in an environment where massive data is sequentially received. Because an embedded system has limited resources, such as a low-performance processor or a small memory, it requires efficient operation of applications. In this paper, we propose an accelerated ECC-decoding technique that uses a graphics processing unit (GPU) built into the embedded system when receiving a large amount of data. In the matrix-vector multiplication that forms the Hamming code used in the ECC operation, the matrix is expressed in compressed sparse row (CSR) format, and a sparse matrix-vector product is used. The multiplication is performed in a GPU kernel, and the Hamming code computation is also accelerated so that the ECC operation can be performed in parallel. The proposed technique is implemented with CUDA on a GPU-embedded target board, the NVIDIA Jetson TX2, and its execution time is compared with that of the CPU.
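The CSR-based sparse matrix-vector product described above can be sketched on the CPU as follows. This is not the paper's CUDA kernel: the (7,4) Hamming code, the particular parity-check matrix `H`, and the helper names `dense_to_csr` and `csr_matvec_gf2` are assumptions chosen to keep the example small. Each output row is independent of the others, which is what would make a per-row parallel mapping possible.

```python
def dense_to_csr(rows):
    """Return (indices, indptr) for a binary matrix.

    All stored values are 1, so no separate data array is kept.
    """
    indices, indptr = [], [0]
    for row in rows:
        indices.extend(j for j, v in enumerate(row) if v)
        indptr.append(len(indices))
    return indices, indptr


def csr_matvec_gf2(indices, indptr, x):
    """Sparse matrix-vector product over GF(2): each output bit is the
    XOR of the entries of x selected by that row's column indices."""
    y = []
    for i in range(len(indptr) - 1):
        acc = 0
        for j in indices[indptr[i]:indptr[i + 1]]:
            acc ^= x[j]
        y.append(acc)
    return y


# Assumed parity-check matrix: column p (1-based) is p written in binary.
H = [[0, 0, 0, 1, 1, 1, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]
indices, indptr = dense_to_csr(H)

codeword = [1, 1, 1, 0, 0, 0, 0]   # satisfies H @ codeword = 0 (mod 2)
received = codeword[:]
received[5] ^= 1                   # inject a single-bit error

syndrome = csr_matvec_gf2(indices, indptr, received)
position = int("".join(map(str, syndrome)), 2)  # 0 means no error detected
corrected = received[:]
if position:
    corrected[position - 1] ^= 1   # syndrome names the 1-based error position
```

With this choice of `H`, the syndrome read as a binary number gives the position of a single flipped bit, so correction is one XOR.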

A Study on the Incoherent Optical Vector-Matrix Multiplier(IOVMM)using a LED array (LED배열을 이용한 인코히어런트광벡터매트릭스 곱셈기〈IOVMM〉에 관한 연구)

  • 최평석;박한규
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.9 no.3
    • /
    • pp.127-131
    • /
    • 1984
  • An IOVMM (Incoherent Optical Vector-Matrix Multiplier), which can process a large amount of information very quickly using an incoherent light source, is constructed, and its experimental results are compared with the theoretical values. In this paper, the input vector and matrix elements are limited to positive numbers. The input vector is produced by the LED array, and the matrix is encoded on film by the area modulation method. The result of the vector-matrix multiplication is detected by a photodiode array through the lens system. An analog multiplexer is used to view the output signal on a single channel.

Design of Low Complexity and High Throughput Encoder for Structured LDPC Codes (구조적 LDPC 부호의 저복잡도 및 고속 부호화기 설계)

  • Jung, Yong-Min;Jung, Yun-Ho;Kim, Jae-Seok
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.46 no.10
    • /
    • pp.61-69
    • /
    • 2009
  • This paper presents the design of a low-complexity, high-throughput LDPC encoder structure. To solve the high-complexity problem of the LDPC encoder, a simplified matrix-vector multiplier is proposed in place of the conventional complex matrix-vector multiplier. The proposed encoder also adopts a partially parallel structure and performs column-wise operations in matrix-vector multiplication to achieve high throughput. Implementation results show that the proposed architecture reduces the number of logic gates and memory elements by 37.4% and 56.7%, respectively, compared with the existing five-stage pipelined architecture. The proposed encoder also supports a throughput of 800 Mbps at a 40 MHz clock frequency, about three times that of the existing architecture.

Algorithm for Efficient D-Class Computation (효율적인 D-클래스 계산을 위한 알고리즘)

  • Han, Jae-Il
    • Journal of Information Technology Services
    • /
    • v.6 no.1
    • /
    • pp.151-158
    • /
    • 2007
  • D-class computation requires the multiplication of three Boolean matrices for every possible triple of $n{\times}n$ Boolean matrices and a search for equivalent $n{\times}n$ Boolean matrices according to a specific equivalence relation. It is easy to see that even multiplying all $n{\times}n$ Boolean matrices with themselves takes exponential time, and D-class computation has remained an unsolved problem because of its computational complexity. The vector-based multiplication theory shows that the multiplication of three Boolean matrices for every possible triple of $n{\times}n$ Boolean matrices can be done much more efficiently. However, D-class computation requires the computation of equivalence classes in addition to efficient multiplication. The paper discusses a theory and an algorithm for efficient D-class computation and shows execution results of the algorithm.

Rendering of Sweep Surfaces using Programmable Graphics Hardware (그래픽스 하드웨어를 이용한 스윕 곡면의 렌더링)

  • Ko, Dae-Hyun;Yoon, Seung-Hyun;Lee, Ji-Eun
    • Journal of the Korea Computer Graphics Society
    • /
    • v.16 no.4
    • /
    • pp.11-16
    • /
    • 2010
  • We present an efficient algorithm for rendering sweep surfaces using programmable graphics hardware. A sweep surface can be represented by a cross-section curve undergoing a spline motion. This representation has a simple matrix-vector multiplication structure that can easily be adapted to programmable graphics hardware. The data for the motion and cross-section curves are stored in texture memory. The vertex processor considers a pair of surface parameters as a vertex and evaluates its coordinates and normal vector with a single matrix multiplication. Using the GPU in this way is between 10 and 40 times as fast as CPU-based rendering.

Strain Decomposition Method in Hull Stress Monitoring System for Container Ship

  • Park, Jae-Woong;Kang, Yun-Tae
    • Journal of Ship and Ocean Technology
    • /
    • v.7 no.3
    • /
    • pp.56-65
    • /
    • 2003
  • The hull monitoring systems of container ships with four long-base gauges give enough information to identify hull girder loads such as bending and torsional moments. However, no such load-identification method for container ships has been known. In this paper, a load-identification method is suggested in terms of a linear matrix equation in which the measured strain vector equals the product of the transformation matrix and the desired strain component vector. The equation is proved to be mathematically complete by the property of the positive-definite determinant of the transformation matrix. The method is applied to a hull stress monitoring system for an 8100 TEU container ship during sea trial, and the estimated external loads are reasonable in comparison with the pre-estimated results. This moment decomposition concept has also been tested in real operating conditions. The typical phenomena observed over the Suez Canal agree well with physical understanding. Hence, the proposed concept can be used effectively to monitor hull girder loads such as bending and torsional moments.