• Title/Summary/Keyword: Hardware Accelerators

Search Result 28, Processing Time 0.027 seconds

Multi-threaded system to support reconfigurable hardware accelerators on Zynq SoC (Zynq SoC에서 재구성 가능한 하드웨어 가속기를 지원하는 멀티쓰레딩 시스템 설계)

  • Shin, Hyeon-Jun;Lee, Joo-Heung
    • Journal of IKEEE
    • /
    • v.24 no.1
    • /
    • pp.186-193
    • /
    • 2020
  • In this paper, we propose a multi-threading system to support reconfigurable hardware accelerators on Zynq SoC. We implement high-performance JPEG decoder with reconfigurable 2D IDCT hardware accelerators to achieve maximum performance available on the platform. In this system, up to four reconfigurable hardware accelerators synchronized with SW threads can be dynamically reconfigured to provide adaptive computing capabilities according to the given image resolution and the compression ratio. JPEG decoding is operated using images with resolutions 480p, 720p, 1080p at the compression ratio of 7:1-109:1. We show that significant performance improvements are achieved as the image resolution or the compression ratio increase. For 1080p resolution, the performance improvement is up to 79.11 times with throughput speed of 99 fps at the compression ratio 17:1.

Resolving Memory Bottlenecks in Hardware Accelerators with Data Prefetch

  • Hyein Lee;Jinoo Joung
    • Journal of the Korea Society of Computer and Information
    • /
    • v.29 no.6
    • /
    • pp.1-12
    • /
    • 2024
  • Deep learning with faster and more accurate results requires large amounts of storage space and large computations. Accordingly, many studies are using hardware accelerators for quick and accurate calculations. However, the performance bottleneck is due to data movement between the hardware accelerators and the CPU. In this paper, we propose a data prefetch strategy that can efficiently reduce such operational bottlenecks. The core idea of the data prefetch strategy is to predict the data needed for the next task and upload it to local memory while the hardware accelerator (Matrix Multiplication Unit, MMU) performs a task. This strategy can be enhanced by using a dual buffer to perform read and write operations simultaneously. This reduces latency and execution time of data transfer. Through simulations, we demonstrate a 24% improvement in the performance of hardware accelerators by maximizing parallel processing with dual buffers and bottlenecks between memories with data prefetch.

Sparse Matrix Compression Technique and Hardware Design for Lightweight Deep Learning Accelerators (경량 딥러닝 가속기를 위한 희소 행렬 압축 기법 및 하드웨어 설계)

  • Kim, Sunhee;Shin, Dongyeob;Lim, Yong-Seok
    • Journal of Korea Society of Digital Industry and Information Management
    • /
    • v.17 no.4
    • /
    • pp.53-62
    • /
    • 2021
  • Deep learning models such as convolutional neural networks and recurrent neual networks process a huge amounts of data, so they require a lot of storage and consume a lot of time and power due to memory access. Recently, research is being conducted to reduce memory usage and access by compressing data using the feature that many of deep learning data are highly sparse and localized. In this paper, we propose a compression-decompression method of storing only the non-zero data and the location information of the non-zero data excluding zero data. In order to make the location information of non-zero data, the matrix data is divided into sections uniformly. And whether there is non-zero data in the corresponding section is indicated. In this case, section division is not executed only once, but repeatedly executed, and location information is stored in each step. Therefore, it can be properly compressed according to the ratio and distribution of zero data. In addition, we propose a hardware structure that enables compression and decompression without complex operations. It was designed and verified with Verilog, and it was confirmed that it can be used in hardware deep learning accelerators.

Research on the Main Memory Access Count According to the On-Chip Memory Size of an Artificial Neural Network (인공 신경망 가속기 온칩 메모리 크기에 따른 주메모리 접근 횟수 추정에 대한 연구)

  • Cho, Seok-Jae;Park, Sungkyung;Park, Chester Sungchung
    • Journal of IKEEE
    • /
    • v.25 no.1
    • /
    • pp.180-192
    • /
    • 2021
  • One widely used algorithm for image recognition and pattern detection is the convolution neural network (CNN). To efficiently handle convolution operations, which account for the majority of computations in the CNN, we use hardware accelerators to improve the performance of CNN applications. In using these hardware accelerators, the CNN fetches data from the off-chip DRAM, as the massive computational volume of data makes it difficult to derive performance improvements only from memory inside the hardware accelerator. In other words, data communication between off-chip DRAM and memory inside the accelerator has a significant impact on the performance of CNN applications. In this paper, a simulator for the CNN is developed to analyze the main memory or DRAM access with respect to the size of the on-chip memory or global buffer inside the CNN accelerator. For AlexNet, one of the CNN architectures, when simulated with increasing the size of the global buffer, we found that the global buffer of size larger than 100kB has 0.8x as low a DRAM access count as the global buffer of size smaller than 100kB.

Comparison of Nios II Core-based Accelerators (Niod II 코어기반 가속기 비교)

  • Song, Gi-Yong
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.16 no.1
    • /
    • pp.639-645
    • /
    • 2015
  • Checksum and residue checking accelerators were implemented on a Nios II core-based platform according to component method, in which the corresponding hardware was implemented with HDL coding, a custom instruction method, in which the instruction set of the processor was extended, and the C2H method, in which the corresponding logic was automatically created by the C2H compiler. The processing results from each accelerator for each algorithm were then examined and compared. The results of the comparison showed that the accelerator implemented with the C2H method is the fastest in terms of the execution time, and the accelerator with custom instruction requires the least add-on from the viewpoint of add-on hardware.

Implementation of low power BSPE Core for deep learning hardware accelerators (딥러닝을 하드웨어 가속기를 위한 저전력 BSPE Core 구현)

  • Jo, Cheol-Won;Lee, Kwang-Yeob;Nam, Ki-Hun
    • Journal of IKEEE
    • /
    • v.24 no.3
    • /
    • pp.895-900
    • /
    • 2020
  • In this paper, BSPE replaced the existing multiplication algorithm that consumes a lot of power. Hardware resources are reduced by using a bit-serial multiplier, and variable integer data is used to reduce memory usage. In addition, MOA resource usage and power usage were reduced by applying LOA (Lower-part OR Approximation) to MOA (Multi Operand Adder) used to add partial sums. Therefore, compared to the existing MBS (Multiplication by Barrel Shifter), hardware resource reduction of 44% and power consumption of 42% were reduced. Also, we propose a hardware architecture design for BSPE Core.

Design of a HMAC for a IPsec's Message Authentication Module (IPsec의 Message Authentication Module을 위한 HMAC의 설계)

  • 하진석;이광엽;곽재창
    • Proceedings of the IEEK Conference
    • /
    • 2002.06b
    • /
    • pp.117-120
    • /
    • 2002
  • In this paper, we construct cryptographic accelerators using hardware Implementations of HMACS based on a hash algorithm such as MD5.It is basically a secure version of his previous algorithm, MD4 which is a little faster than MD5 The algorithm takes as Input a message of arbitrary length and produces as output a 128-blt message digest The input is processed In 512-bit blocks In this paper, new architectures, Iterative and full loop, of MD5 have been implemented using Field Programmable Gate Arrays(FPGAS). For the full-loop design, the performance Is about 500Mbps @ 100MHz

  • PDF

Hardware and Software Co-Design Platform for Energy-Efficient FPGA Accelerator Design (에너지 효율적인 FPGA 가속기 설계를 위한 하드웨어 및 소프트웨어 공동 설계 플랫폼)

  • Lee, Dongkyu;Park, Daejin
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.1
    • /
    • pp.20-26
    • /
    • 2021
  • Recent systems contain hardware and software components together for faster execution speed and less power consumption. In conventional hardware and software co-design, the ratio of software and hardware was divided by the designer's empirical knowledge. To find optimal results, designers iteratively reconfigure accelerators and applications and simulate it. Simulating iteratively while making design change is time-consuming. In this paper, we propose a hardware and software co-design platform for energy-efficient FPGA accelerator design. The proposed platform makes it easy for designers to find an appropriate hardware ratio by automatically generating application program code and hardware code by parameterizing the components of the accelerator. The co-design platform based on the Vitis unified software platform runs on a server with Xilinx Alveo U200 FPGA card. As a result of optimizing the multiplication accelerator for two matrices with 1000 rows, execution time was reduced by 90.7% and power consumption was reduced by 56.3%.

Implementation and Performance Evaluation of PCI express on Xilinx FPGA (Xilinx FPGA용 PCI express 구현 및 성능 분석)

  • Lee, Jin
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.22 no.12
    • /
    • pp.1667-1674
    • /
    • 2018
  • Recently, speeding up real time calculation using the specialized hardware accelerator is often used in the various engineering and science area, and the accelerators are required to include PCI express interconnection between FPGA and a host computer. The implementation of the high speed PCIe for the multi-giga bytes per second transmission is one of the most difficult issue in the development of the accelerators. There are several commercialized IP solutions and research results in the literature, but these solutions are required extra cost and design period to analyze the detailed implementation method. For the hardware accelerator on Xilinx FPGA, utilizing Xilinx's XDMA PCIe IP, which is provided without extra charge, can be the best solution in terms of the development period and cost. Consequently, this paper presents the evaluation system on Zynq-7000 FPGA and Windows 10 host computer, and analyze the performance of the PCIe IP with various configuration parameters.

Implementation of SDR-based LTE-A PDSCH Decoder for Supporting Multi-Antenna Using Multi-Core DSP (멀티코어 DSP를 이용한 다중 안테나를 지원하는 SDR 기반 LTE-A PDSCH 디코더 구현)

  • Na, Yong;Ahn, Heungseop;Choi, Seungwon
    • Journal of Korea Society of Digital Industry and Information Management
    • /
    • v.15 no.4
    • /
    • pp.85-92
    • /
    • 2019
  • This paper presents a SDR-based Long Term Evolution Advanced (LTE-A) Physical Downlink Shared Channel (PDSCH) decoder using a multicore Digital Signal Processor (DSP). For decoder implementation, multicore DSP TMS320C6670 is used, which provides various hardware accelerators such as turbo decoder, fast Fourier transformer and Bit Rate Coprocessors. The TMS320C6670 is a DSP specialized in implementing base station platforms and is not an optimized platform for implementing mobile terminal platform. Accordingly, in this paper, the hardware accelerator was changed to the terminal implementation to implement the LTE-A PDSCH decoder supporting the multi-antenna and the functions not provided by the hardware accelerator were implemented through core programming. Also pipeline using multicore was implemented to meet the transmission time interval. To confirm the feasibility of the proposed implementation, we verified the real-time decoding capability of the PDSCH decoder implemented using the LTE-A Reference Measurement Channel (RMC) waveform about transmission mode 2 and 3.