• Title/Summary/Keyword: Sparse Matrix Multiplication

Search Result 14, Processing Time 0.024 seconds

GPU-Based ECC Decode Unit for Efficient Massive Data Reception Acceleration

  • Kwon, Jisu;Seok, Moon Gi;Park, Daejin
    • Journal of Information Processing Systems / v.16 no.6 / pp.1359-1371 / 2020
  • When transmitting and receiving large amounts of data, reliable communication is crucial for the normal operation of a device and for preventing abnormal behavior caused by errors. This paper therefore assumes an environment in which massive data is received sequentially and protected by an error correction code (ECC) that can detect and correct errors by itself. Because an embedded system has limited resources, such as a low-performance processor or a small memory, its applications must run efficiently. We propose accelerating ECC decoding with the graphics processing unit (GPU) built into the embedded system when a large amount of data is received. The matrix-vector multiplication at the core of the Hamming code used for the ECC operation is computed as a sparse matrix-vector product, with the matrix stored in compressed sparse row (CSR) format. The multiplication is performed in a GPU kernel, and the Hamming code computation is also accelerated so that the ECC operation can run in parallel. The proposed technique is implemented with CUDA on a GPU-embedded target board, the NVIDIA Jetson TX2, and its execution time is compared with that of the CPU.
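As a rough illustration of the syndrome computation described in this abstract, the sketch below stores a Hamming(7,4) parity-check matrix in CSR form and multiplies it by a received word over GF(2). The matrix layout, the names `ROW_PTR`, `COL_IDX`, `csr_spmv_gf2`, and `decode`, and the use of plain Python on the CPU are assumptions made for illustration; the paper's CUDA kernel is not reproduced here.

```python
# Hamming(7,4) parity-check matrix H in CSR form. Assumed layout: column j
# (counted from 1) holds the binary representation of j, so the syndrome read
# as an integer gives the position of a single flipped bit.
ROW_PTR = [0, 4, 8, 12]                          # start of each row in COL_IDX
COL_IDX = [0, 2, 4, 6, 1, 2, 5, 6, 3, 4, 5, 6]   # column of each stored 1
VALUES = [1] * 12                                # all nonzeros of H are 1


def csr_spmv_gf2(row_ptr, col_idx, values, x):
    """Return y = A x over GF(2), visiting only the stored nonzeros."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc ^= values[k] & x[col_idx[k]]     # multiply = AND, add = XOR
        y.append(acc)
    return y


def decode(received):
    """Correct at most one bit error; return (corrected word, syndrome)."""
    syndrome = csr_spmv_gf2(ROW_PTR, COL_IDX, VALUES, received)
    position = syndrome[0] + 2 * syndrome[1] + 4 * syndrome[2]
    corrected = list(received)
    if position:                                 # nonzero syndrome: flip that bit
        corrected[position - 1] ^= 1
    return corrected, syndrome


codeword = [1, 1, 1, 0, 0, 0, 0]                 # satisfies H c = 0
noisy = list(codeword)
noisy[4] ^= 1                                    # inject an error at position 5
fixed, syn = decode(noisy)
```

Each output row of the product depends only on its own slice of `COL_IDX`, which is why a row-per-thread mapping onto a GPU kernel is a natural fit for this kind of computation.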

Acceleration of ECC Computation for Robust Massive Data Reception under GPU-based Embedded Systems (GPU 기반 임베디드 시스템에서 대용량 데이터의 안정적 수신을 위한 ECC 연산의 가속화)

  • Kwon, Jisu;Park, Daejin
    • Journal of the Korea Institute of Information and Communication Engineering / v.24 no.7 / pp.956-962 / 2020
  • As the size of data used in embedded systems increases, ECC decoding is increasingly needed to receive massive data robustly. In this paper, we propose a method to accelerate the computation of syndrome vectors when ECC decoding with a Hamming code is performed in an embedded system with a built-in GPU. The proposed method performs the matrix-vector multiplication of the decoding operation using the CSR format, one of the data structures for representing sparse matrices, and executes it in parallel in a CUDA kernel on the GPU. We evaluated the method on a target embedded board with a GPU, and the results show that GPU-accelerated ECC decoding takes less execution time than decoding on the CPU alone.

Efficient Sparse Matrix-Matrix Multiplication for circuit optimization (회로 최적화를 위한 효율적인 희소 행렬 간 곱셈 연산에 관한 연구)

  • 임은진;김경훈
    • Proceedings of the Korea Multimedia Society Conference / 2003.11b / pp.994-997 / 2003
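  • Matrix operations are widely used in fields that rely on computational science, such as engineering, physics, chemistry, the life sciences, and economics, and the matrices involved are often large and sparse, with most elements equal to zero. Among sparse matrix operations, this paper examines methods for efficiently computing $AHA^{T}$ for a sparse matrix $A$ and a block diagonal matrix $H$, an operation that becomes a bottleneck in the optimization step of circuit design. The methods are compared in terms of execution speed and memory usage by modeling the number of memory accesses.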

A Study on Circular Filtering in Orthogonal Transform Domain

  • Song, Bong-Seop;Lee, Sang-Uk
    • Journal of Electrical Engineering and information Science / v.1 no.2 / pp.125-133 / 1996
  • In this paper, we discuss the properties of circular filtering in the orthogonal transform domain. Efficient filtering schemes in six orthogonal transform domains are presented by generalizing the convolution-multiplication property of the DFT. In brief, circular filtering can be accomplished by multiplying by the transform-domain filtering matrix W, which is shown to be very sparse, yielding computational gains over time-domain processing. As an application, decimation and interpolation techniques in orthogonal transform domains are also investigated.
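The convolution-multiplication property that this abstract generalizes can be checked numerically for the DFT case, where the filtering matrix W reduces to a diagonal matrix. The sketch below covers only that special case; the function names are illustrative, and the paper's other five transforms are not reproduced.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; the inverse divides by n."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_convolve(x, h):
    """Time-domain circular filtering: y[i] = sum_m x[m] * h[(i - m) mod n]."""
    n = len(x)
    return [sum(x[m] * h[(i - m) % n] for m in range(n)) for i in range(n)]


def filter_in_dft_domain(x, h):
    """Same filter applied as a pointwise (diagonal-matrix) product of spectra."""
    spectrum = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in dft(spectrum, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]
taps = [1.0, 0.0, 0.0, 1.0]
time_domain = circular_convolve(signal, taps)
transform_domain = filter_in_dft_domain(signal, taps)
```

For real inputs the two results agree up to floating-point rounding, so the comparison should use a tolerance rather than exact equality.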

GPU-based Sparse Matrix-Vector Multiplication Schemes for Random Walk with Restart: A Performance Study (랜덤워크 기법을 위한 GPU 기반 희소행렬 벡터 곱셈 방안에 대한 성능 평가)

  • Yu, Jae-Seo;Bae, Hong-Kyun;Kang, Seokwon;Yu, Yongseung;Park, Yongjun;Kim, Sang-Wook
    • Proceedings of the Korea Information Processing Society Conference / 2020.11a / pp.96-97 / 2020
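  • RWR (Random Walk with Restart), a random-walk-based node ranking method, repeatedly performs sparse matrix-vector multiplication and vector addition, and its execution time depends heavily on how the sparse matrix-vector multiplication is carried out. In this paper, we experimentally compare GPU-based RWR implementations that adopt CSR5 (Compressed Sparse Row 5)-based sparse matrix-vector multiplication and CSR-vector-based sparse matrix multiplication. Through the experiments, we analyze how RWR performance varies with the characteristics of the dataset and propose guidelines for selecting a suitable sparse matrix-vector multiplication scheme.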
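The loop of sparse matrix-vector products and vector additions mentioned in this abstract can be sketched as follows. The update rule r ← (1 − c)·W·r + c·q with a column-normalized matrix W is the commonly used RWR formulation and is assumed here, as are the function names, the restart probability, and the toy graph. The multiplication is a plain scalar CSR loop on the CPU, not the CSR5 or CSR-vector GPU kernels the paper compares.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Return y = W x for a matrix stored in CSR form."""
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def rwr(row_ptr, col_idx, values, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Iterate r <- (1 - c) W r + c q until the L1 change drops below tol."""
    n = len(row_ptr) - 1
    q = [0.0] * n
    q[seed] = 1.0                      # restart vector: all mass on the seed
    r = list(q)
    for _ in range(max_iter):
        wr = csr_spmv(row_ptr, col_idx, values, r)
        nxt = [(1.0 - restart) * wr[i] + restart * q[i] for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


# Toy graph (assumed): undirected path 0 - 1 - 2, column-normalized adjacency.
ROW_PTR = [0, 1, 3, 4]
COL_IDX = [1, 0, 2, 1]
VALUES = [0.5, 1.0, 1.0, 0.5]
scores = rwr(ROW_PTR, COL_IDX, VALUES, seed=0)
```

Because W is column-stochastic in this example, the scores keep summing to one, and the seed node ends up ranked above the node on the far end of the path.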

Random Partial Haar Wavelet Transformation for Single Instruction Multiple Threads (단일 명령 다중 스레드 병렬 플랫폼을 위한 무작위 부분적 Haar 웨이블릿 변환)

  • Park, Taejung
    • Journal of Digital Contents Society / v.16 no.5 / pp.805-813 / 2015
  • Many researchers expect that compressive sensing and sparse recovery can overcome the limitations of conventional digital techniques. However, these new approaches require solving l1-norm optimization problems for signal reconstruction. In the reconstruction process, the transform computation, which multiplies a random matrix by a vector, consumes considerable computing power. To address this issue, parallel processing is applied to the optimization problems. In particular, because the original signal is very large, it is hard to store the random matrix directly in memory, so a procedural approach to handling the random matrix is needed. This paper presents a new parallel algorithm that computes the random partial Haar wavelet transform on a Single Instruction Multiple Threads (SIMT) platform.

An Efficient Computation of Matrix Triple Products (삼중 행렬 곱셈의 효율적 연산)

  • Im, Eun-Jin
    • Journal of the Korea Society of Computer and Information / v.11 no.3 / pp.141-149 / 2006
  • In this paper, we introduce an improved algorithm for computing the matrix triple product that commonly arises in primal-dual optimization methods. In computing $P=AHA^{t}$, we devise a single-pass algorithm that exploits the block diagonal structure of the matrix $H$. This one-phase scheme requires fewer floating-point operations and roughly half the memory of the generic two-phase algorithm, in which the product is computed in two steps, first $Q=HA^{t}$ and then $P=AQ$. The one-phase scheme achieved a speed-up of 2.04 over the two-phase scheme on the Intel Itanium II platform. The performance improvement was evaluated through performance modeling based on memory latency and modeled cache miss rates. Our research contributes to the performance tuning of complex sparse matrix operations, whereas most previous work focused on tuning basic operations.
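The difference between the two-phase and one-phase schemes can be illustrated with a small sketch. It shows one plausible way to exploit a block diagonal $H$, accumulating each block's contribution to $P$ directly instead of materializing $Q=HA^{t}$; it is not claimed to be the paper's algorithm. The function names, the dense list-of-lists storage, and the example matrices are assumptions chosen for readability.

```python
def triple_product_two_phase(A, blocks):
    """Reference: assemble H, form Q = H A^t, then P = A Q."""
    n = sum(len(b) for b in blocks)
    m = len(A)
    H = [[0.0] * n for _ in range(n)]
    start = 0
    for b in blocks:                           # place each block on the diagonal
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                H[start + i][start + j] = v
        start += len(b)
    Q = [[sum(H[i][k] * A[j][k] for k in range(n)) for j in range(m)]
         for i in range(n)]
    return [[sum(A[i][k] * Q[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]


def triple_product_one_phase(A, blocks):
    """Accumulate P block by block without storing the intermediate Q."""
    m = len(A)
    P = [[0.0] * m for _ in range(m)]
    start = 0
    for b in blocks:
        size = len(b)
        for i in range(m):
            a_i = A[i][start:start + size]
            if not any(a_i):                   # row slice is all zero: skip it
                continue
            # t = (slice of row i of A) times the current block of H
            t = [sum(a_i[p] * b[p][q] for p in range(size)) for q in range(size)]
            for j in range(m):
                P[i][j] += sum(t[q] * A[j][start + q] for q in range(size))
        start += size
    return P


A = [[1, 0, 2, 0],
     [0, 3, 0, 4]]
H_BLOCKS = [[[2, 1], [1, 2]],
            [[1, 0], [0, 3]]]
P_two = triple_product_two_phase(A, H_BLOCKS)
P_one = triple_product_one_phase(A, H_BLOCKS)
```

In the one-phase version the only temporary is the short vector `t`, whose length equals the block size, whereas the two-phase version holds the full matrix `Q` in memory.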

Fast Binary Block Inverse Jacket Transform

  • Lee Moon-Ho;Zhang Xiao-Dong;Pokhrel Subash Shree;Choe Chang-Hui;Hwang Gi-Yean
    • Journal of electromagnetic engineering and science / v.6 no.4 / pp.244-252 / 2006
  • A block Jacket transform and its block inverse Jacket transform have recently been reported in the paper 'Fast block inverse Jacket transform'. However, the product of the block Jacket transform and the corresponding block inverse Jacket transform is not equal to the identity transform, which does not conform to the mathematical rule. In this paper, new binary block Jacket transforms and the corresponding binary block inverse Jacket transforms of orders $N=2^k$, $3^k$, and $5^k$ for integer values $k$ are proposed, and mathematical proofs are presented. With the aid of the Kronecker product of the lower-order Jacket matrix and the identity matrix, fast algorithms for realizing these transforms are obtained. Owing to its simple inverse, fast algorithm, and prime-based $P^k$ order, the proposed binary block inverse Jacket transform can be applied in communications, for example in space-time block code design, signal processing, LDPC coding, and information theory. An application of the circular permutation matrix (CPM) binary low-density quasi-block Jacket matrix, which is useful in coding theory, is also introduced.

Design of the Adaptive Systolic Array Architecture for Efficient Sparse Matrix Multiplication (희소 행렬 곱셈을 효율적으로 수행하기 위한 유동적 시스톨릭 어레이 구조 설계)

  • Seo, Juwon;Kong, Joonho
    • Proceedings of the Korea Information Processing Society Conference / 2022.11a / pp.24-26 / 2022
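  • Systolic arrays are widely used as a hardware structure for matrix multiplication, which accounts for most of the computation in AI workloads such as DNN training, but their efficiency drops sharply on highly sparse matrices because of unnecessary operations. The adaptive systolic array proposed in this paper reduces the number of execution cycles for highly sparse matrix multiplication through matrix condensing, weight switching, and a direct output path. We compared the conventional and adaptive systolic arrays by simulation and confirmed that, when matrices of size 8×8, 16×16, and 32×32 are processed on systolic arrays of the same size, the required cycle count can be reduced by up to 12 cycles.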

Study on Multiple sparse matrix-matrix multiplication hardware accelerator (다중 희소 행렬-행렬 곱셈 하드웨어 가속기 연구)

  • Tae-Hyoung Kim;Yeong-Pil Cho
    • Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.47-50 / 2024
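  • A sparse matrix is a matrix in which most elements are zero. When sparse matrix-matrix multiplication is performed naively, the zero-valued data are also multiplied, which results in unnecessary computation. To address this problem, research on matrix compression algorithms and on reducing the number of partial sums in the multiplication is actively under way. However, current research mainly focuses on single-matrix operations, which is inefficient on FPGAs (Field Programmable Gate Arrays) and special-purpose accelerators because their resources are not fully utilized. This study proposes an architecture that performs multiple sparse matrix multiplications using all of the resources of the FPGA.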