• Title/Summary/Keyword: Embedded memory

Implementation of FPGA-based Accelerator for GRU Inference with Structured Compression (구조적 압축을 통한 FPGA 기반 GRU 추론 가속기 설계)

  • Chae, Byeong-Cheol
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.6 / pp.850-858 / 2022
  • To deploy Gated Recurrent Units (GRUs) on resource-constrained embedded devices, this paper presents a reconfigurable FPGA-based GRU accelerator that supports structured compression. First, a dense GRU model is significantly reduced in size by hybrid quantization and structured top-k pruning. Second, the energy spent on external memory access is greatly reduced by the proposed reuse computing pattern. Finally, the accelerator can handle a structured sparse model that benefits from the algorithm-hardware co-design workflow. Moreover, inference tasks can be performed flexibly across all functional dimensions, sequence lengths, and numbers of layers. Implemented on the Intel DE1-SoC FPGA, the proposed accelerator achieves 45.01 GOPS on a structured sparse GRU network without batching. Compared with CPU and GPU implementations, the low-cost FPGA accelerator achieves 57x and 30x improvements in latency and 300x and 23.44x improvements in energy efficiency, respectively. The proposed accelerator thus serves as an early study for real-time embedded applications and demonstrates the potential for further development.
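
The abstract does not spell out the pruning rule, so the following is only a minimal sketch of one common form of structured top-k pruning: keep the k largest-magnitude weights in every row and zero the rest, so that each row has the same number of non-zeros. The function name `topk_prune_rows` and the example matrix are hypothetical, not taken from the paper.

```python
def topk_prune_rows(weights, k):
    """Keep the k largest-magnitude entries in each row; zero the others.

    Assumption: "structured" here means a fixed non-zero count per row,
    which gives the hardware a regular, predictable sparsity pattern.
    """
    pruned = []
    for row in weights:
        # Indices of the k largest |w| in this row (ties: lower index first).
        order = sorted(range(len(row)), key=lambda i: -abs(row[i]))
        keep = set(order[:k])
        pruned.append([w if i in keep else 0.0 for i, w in enumerate(row)])
    return pruned


# Hypothetical 2x4 weight matrix, pruned to 2 non-zeros per row.
W = [[0.9, -0.1, 0.05, -0.7],
     [0.2, 0.3, -0.8, 0.01]]
W_sparse = topk_prune_rows(W, k=2)
```

Because every row ends up with exactly k non-zeros, a hardware design could allocate a fixed number of multipliers per row instead of handling irregular sparsity.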

Parallel Implementations of Digital Focus Indices Based on Minimax Search Using Multi-Core Processors

  • HyungTae, Kim;Duk-Yeon, Lee;Dongwoon, Choi;Jaehyeon, Kang;Dong-Wook, Lee
    • KSII Transactions on Internet and Information Systems (TIIS) / v.17 no.2 / pp.542-558 / 2023
  • A digital focus index (DFI) is a value used to determine image focus in scientific apparatus and smart devices. Automatic focus (AF) is an iterative and time-consuming procedure; however, its processing time can be reduced using a graphics processing unit (GPU) and a multi-core processor (MCP). In this study, parallel architectures of a minimax search algorithm (MSA) are applied to two DFIs: range algorithm (RA) and image contrast (CT). The DFIs are based on a histogram; however, parallel computation of the histogram is conventionally inefficient because of bank conflicts in shared memory. The parallel architectures of RA and CT are constructed using parallel reduction for MSA, which rates pairs of image pixels against each other in parallel and halves the number of candidates at every step. The array size thus decreases to one, and the minimax is determined at the final reduction. Kernels for the architectures are constructed using open source software to make them relatively platform independent. The kernels are tested on a hexa-core PC and an embedded device using Lenna images of various sizes based on the resolutions of industrial cameras. The performance of the kernels for the DFIs is investigated in terms of processing speed and computational acceleration; the maximum acceleration was 32.6× in the best case, and the MCP exhibited a higher performance.
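
The reduction described in this abstract can be illustrated with a sequential emulation. This is a sketch under assumptions, not the paper's kernel: the function name `minimax_reduce` is hypothetical, and the inner loop stands in for work that a GPU or multi-core kernel would distribute across threads.

```python
def minimax_reduce(pixels):
    """Return (min, max) of a non-empty sequence by pairwise reduction.

    The active length is halved at every step until one element remains.
    """
    lo = list(pixels)
    hi = list(pixels)
    n = len(lo)
    while n > 1:
        half = (n + 1) // 2  # an odd leftover element is carried over as-is
        # Every iteration of this loop is independent of the others, so a
        # parallel kernel could assign one thread to each pixel pair.
        for i in range(n // 2):
            lo[i] = min(lo[i], lo[i + half])
            hi[i] = max(hi[i], hi[i + half])
        n = half
    return lo[0], hi[0]


darkest, brightest = minimax_reduce([7, 3, 9, 1, 5])
```

No histogram is built at any point, which is how a reduction of this kind sidesteps the shared-memory bank conflicts the abstract mentions.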

Ubiquitous-Based Mobile Control and Monitoring of CNC Machines for Development of u-Machine

  • Kim Dong-Hoon;Song Jun-Yeob
    • Journal of Mechanical Science and Technology / v.20 no.4 / pp.455-466 / 2006
  • This study was an attempt to control and monitor Computerized Numerical Controller (CNC) machines anywhere and anytime for the development of a ubiquitous machine (u-machine). With a Personal Digital Assistant (PDA) phone, the machine status and machining data of CNC machines can be monitored in wired and wireless environments, including IMT2000 and wireless LAN environments. Moreover, CNC machines can be controlled anywhere and anytime. The concept of anywhere-anytime control and monitoring of a manufacturing system was implemented in this study for the purpose of u-manufacturing and u-machines. In this concept, communication between the CNC controller and the PDA phone was successfully performed anywhere and anytime for the real-time monitoring and control of CNC machines. In addition, the interface between the CNC controller and the developed application module was implemented with Object Linking and Embedding for Process Control (OPC) and shared CNC memory. For communication, the server contents module within the target CNC was designed on top of TCP/IP. Furthermore, the client contents module within the PDA phone was designed with embedded C++ programming for mobile communication. For the interface, the monitoring data, such as the machine status, the machine running state, the name of the Numerical Control (NC) program, the alarm, and the position of the stage axes, were acquired in real time from real machines through the OPC method and the shared CNC memory. The control data, such as start, hold, emergency stop, reserved start, and reserved stop, were also applied to the CNC domain of the real machine. CNC machines can therefore be controlled and monitored in real time, anywhere and anytime. Moreover, prompt notification from CNC machines to mobile phones, including cellular phones and PDA phones, can be automatically realized in emergencies.

Hardware Architecture Design and Implementation of IPM-based Curved Lane Detector (IPM기반 곡선 차선 검출기 하드웨어 구조 설계 및 구현)

  • Son, Haengseon;Lee, Seonyoung;Min, Kyoungwon;Seo, Sungjin
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.10 no.4 / pp.304-310 / 2017
  • In this paper, we propose the architecture of an inverse perspective mapping (IPM)-based lane detector that allows autonomous vehicles to detect and follow the driving route along a curved lane. The IPM image is divided into two fields, a far field and a near field, and the lane candidate region is detected using the Hough transform to perform matching for the curved lane. In autonomous vehicles, various algorithms must be embedded in the system. To reduce system resource usage, we propose a method that minimizes the number of accesses to the image and to various parameters in external memory. The proposed circuit achieves a 96% lane recognition rate and uses 16% of the LUTs, 5.9% of the FFs, and 29% of the BRAM on a Xilinx XC7Z020. It processes Full-HD images at 42 fps with a 100 MHz operating clock.
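
The throughput figures in this abstract imply a tight cycle budget per pixel. The arithmetic below is a back-of-the-envelope check only, assuming Full-HD means 1920x1080 and ignoring blanking intervals and any stalls; the variable names are illustrative.

```python
# Assumed frame geometry for "Full-HD"; the abstract does not state it.
WIDTH, HEIGHT = 1920, 1080
FPS = 42                  # frame rate reported in the abstract
CLOCK_HZ = 100_000_000    # 100 MHz operating clock from the abstract

pixels_per_second = WIDTH * HEIGHT * FPS
cycles_per_pixel = CLOCK_HZ / pixels_per_second

print(f"{pixels_per_second:,} pixels/s")
print(f"{cycles_per_pixel:.3f} clock cycles available per pixel")
```

With roughly 1.15 cycles available per pixel, the datapath has to accept close to one pixel per clock, which is consistent with the abstract's emphasis on minimizing external memory accesses.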

An Evaluation of Multimedia Data Downstream with PDA in an Infrastructure Network

  • Hong, Youn-Sik;Hur, Hye-Sun
    • Journal of Information Processing Systems / v.2 no.2 / pp.76-81 / 2006
  • A PDA is used mainly for downloading data from a stationary server, such as a desktop PC, in an infrastructure network based on wireless LAN. Thus, the overall performance depends heavily on the performance of such downloads. Unfortunately, the time a PDA takes to receive data from a PC is 53% longer than the time it takes to send the same data. We therefore used a test bed system to measure and analyze the factors that could delay the receiving time of a PDA. Three factors are crucial: the TCP window size, the file access time of the PDA, and the inter-packet delay. The window size of a PDA during downstream transfer drops dramatically from 32,581 bytes to 686 bytes. In addition, because a PDA stores files in embedded flash memory, writing data takes twice as long as reading it. To alleviate these problems, we propose three distinct remedies. First, to keep the window size at the sender constant, both the socket send buffer of the desktop PC and the socket receive buffer of the PDA should be enlarged. Second, to shorten the internal file access time, the size of the application buffer should be doubled. Finally, the inter-packet delays of the PDA and the desktop PC at the application layer should be adjusted asymmetrically to relieve the traffic bottleneck between these heterogeneous terminals.
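
The first remedy, enlarging the socket send and receive buffers, maps onto the standard `SO_SNDBUF` and `SO_RCVBUF` socket options. The sketch below uses Python's `socket` module for illustration; the helper name `make_tcp_socket` and the 64 KiB size are assumptions, not values from the paper, and the operating system may round or cap whatever is requested.

```python
import socket


def make_tcp_socket(send_buf=None, recv_buf=None):
    """Create a TCP socket, optionally requesting larger kernel buffers."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if send_buf is not None:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_buf)
    if recv_buf is not None:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_buf)
    return s


# Hypothetical receiver-side setup: request a 64 KiB receive buffer.
sock = make_tcp_socket(recv_buf=64 * 1024)
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("receive buffer granted by the OS:", granted)
sock.close()
```

Setting the options before the connection is established matters on many systems, because the buffer size influences the window the TCP stack advertises.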

Automatic Virtual Platform Generation for Fast SoC Verification (고속 SoC 검증을 위한 자동 가상 플랫폼 생성)

  • Jung, Jun-Mo
    • Journal of the Korea Academia-Industrial cooperation Society / v.9 no.5 / pp.1139-1144 / 2008
  • In this paper, we propose a method that automatically generates a transaction-level (TL) model from an algorithmic model so that a system specification can be verified quickly and effectively on a virtual platform. A TL virtual platform that captures structural properties such as timing, synchronization, and real-time behavior is an effective verification framework. However, whenever the system specification or the HW/SW mapping changes, the virtual platform must be rebuilt, which requires additional design and verification time. Moreover, describing the platform manually is time-consuming and error-prone. To solve these problems, we build a TL library consisting of the basic components of a virtual platform, such as a CPU, memory, and a timer, and we develop a set of design and verification tools that generate a virtual platform automatically. From an algorithmic model, our tools generate a virtual platform consisting of an embedded real-time operating system (RTOS) and hardware components. For communication between HW and SW, a memory map and device drivers are also generated. The effectiveness of the proposed framework has been verified with Joint Photographic Experts Group (JPEG) and H.264 algorithms. We claim that our approach generates an application-specific virtual platform 100x to 1000x faster than manual design. We can also refine an initial platform incrementally to find a better HW/SW mapping. Furthermore, the generated virtual platform allows the application software and the RTOS to be designed and optimized concurrently.

Implementation of a System for RFID Education to be based on an EPC global Network Standard (EPC global Network 표준을 따르는 RFID 교육용 시스템의 구현)

  • Kim, Dae-Hee;Chung, Joong-Soo;Kim, Hyu-Chan;Jung, Kwang-Wook;Kim, Seog-Gyu
    • The Journal of the Korea Contents Association / v.9 no.11 / pp.90-99 / 2009
  • This paper presents the implementation of an educational RFID system for the EPC global network that uses a 900 MHz air interface between the reader and the active tag. The software for the reader and the active tag is developed in an embedded environment, and the software on the PC that controls the reader runs on Windows OS, with the PC operating as the server. The ATmega128 chip is used as the processor of both the reader and the active tag. An AVR compiler is used as the development environment for the reader and the active tag, which are programmed in C. The server software on the PC is developed in Visual C++ using Visual Studio. The main functions of the system are to control a tag containing EPC global data from the PC through the reader, to obtain tag information through the Internet, and to read and write data in the tag memory. Finally, the data written to the active tag's memory is sent back to the PC via the reader in a "read" operation and compared with the data previously sent to the tag. The software of the 900 MHz EPC global RFID educational system is implemented on the basis of these functions.

Behavior of Fiber-Reinforced Smart Soft Composite Actuators According to Material Composition (섬유 강화 지능형 연성 복합재 구동기의 재료구성에 따른 거동특성 평가)

  • Han, Min-Woo;Kim, Hyung-Il;Song, Sung-Hyuk;Ahn, Sung-Hoon
    • Transactions of the Korean Society of Mechanical Engineers A / v.41 no.2 / pp.81-85 / 2017
  • Fiber-reinforced polymer composites, which are made by combining a continuous fiber that acts as reinforcement with a homogeneous polymeric material that acts as a host, are engineering materials with high strength and stiffness and a lightweight structure. In this study, a shape memory alloy (SMA)-reinforced composite actuator is presented. This actuator is used to generate large deformations in single lightweight structures and can be used in applications requiring a high degree of adaptability to various external conditions. The proposed actuator consists of numerous individual laminas of glass-fiber fabric embedded in a polymeric matrix. To characterize its deformation behavior, the composition of the actuator was varied by changing the matrix material and the number of glass-fiber fabric layers. In addition, currents of various magnitudes were applied to each actuator to study the effect of heating the SMA wires with the applied current.

An Image Coding Method by Using the Bit-Level Information of Wavelet Coefficients (웨이블릿 계수의 비트 레벨 정보를 사용한 영상 부호화 기법)

  • Park, Sung-Wook;Park, Jong-Wook
    • Journal of Korea Society of Industrial Information Systems / v.16 no.3 / pp.23-33 / 2011
  • In this paper, a wavelet image coder that encodes the bit-level information of wavelet coefficients is proposed. To reduce the memory required during coding, the proposed coder uses a modified EZW algorithm together with a significant coefficient array, a two-dimensional data structure that holds the bit-level information of the wavelet coefficients. Using this array, the proposed algorithm codes the significant coefficients and the bit-level information of the wavelet coefficients at the same time. Experimental results show that the proposed method performs as well as or better than a conventional embedded wavelet coding algorithm. In particular, by using the significant coefficient array, the proposed algorithm performs stably without image distortion at various bit rates with minimal memory usage.
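
The abstract does not define the layout of the significant coefficient array, so the following is only an illustrative guess at the idea: a two-dimensional array that records, for each integer wavelet coefficient, the index of its most significant bit. The names `msb_plane_array` and `significant_at` are hypothetical, and the real coder may store different bit-level information.

```python
def msb_plane_array(coeffs):
    """Per coefficient, the index of the highest set bit of |c| (-1 for 0)."""
    return [[abs(c).bit_length() - 1 for c in row] for row in coeffs]


def significant_at(msb, plane):
    """Mark coefficients with |c| >= 2**plane, as in an EZW threshold pass."""
    return [[m >= plane for m in row] for row in msb]


# Hypothetical 2x2 block of quantized wavelet coefficients.
block = [[34, -3],
         [0, 8]]
msb = msb_plane_array(block)
pass_at_8 = significant_at(msb, plane=3)  # threshold 2**3 = 8
```

In this sketch, a single small integer per coefficient is enough to answer the significance test for every threshold pass, which hints at how such an array could keep memory usage low.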

Optimization for H.264/AVC De-blocking Filter on the TMS320C64x+ DSP (TMS320C64x+ DSP에서의 H.264/AVC 디블록킹 필터 최적화)

  • Lee, Jin-Seop;Kang, Dae-Beom;Sim, Dong-Gyu;Lee, Soo-Youn
    • Journal of the Institute of Electronics Engineers of Korea SP / v.48 no.2 / pp.41-52 / 2011
  • Reducing the computational complexity of the de-blocking filter is important for real-time implementation because the filter accounts for a large part of the total computational complexity of the decoder. Because the decoding loop contains many conditional branches and memory accesses, the de-blocking filter is not easy to speed up. This paper therefore presents a new de-blocking filter algorithm that minimizes conditional branches and memory accesses. The proposed filter structure allows the filter operations to be parallelized by software pipelining. The proposed optimization method was implemented on a TMS320DM6467 EVM board and achieved an approximately 46% cycle reduction compared with FFmpeg.