• Title/Summary/Keyword: datapath


Performance Evaluation and Verification of MMX-type Instructions on an Embedded Parallel Processor (임베디드 병렬 프로세서 상에서 MMX타입 명령어의 성능평가 및 검증)

  • Jung, Yong-Bum;Kim, Yong-Min;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information, v.16 no.10, pp.11-21, 2011
  • This paper introduces a SIMD (Single Instruction Multiple Data) based parallel processor that efficiently processes the massive data inherent in multimedia. In addition, this paper implements MMX (MultiMedia eXtension)-type instructions on the data parallel processor and evaluates and analyzes their performance. The reference data parallel processor consists of 16 processors, each of which has a 32-bit datapath. Experimental results for a JPEG compression application with a 1280x1024 pixel image indicate that the MMX-type instructions achieve a 50% performance improvement over the baseline instructions on the same data parallel architecture. In addition, the MMX-type instructions achieve 100% and 51% improvements over the baseline instructions in energy efficiency and area efficiency, respectively. These results demonstrate that multimedia-specific instructions, including MMX-type instructions, have potential for widely used many-core GPUs (Graphics Processing Units) and other types of parallel processors.
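
The kind of operation such MMX-type instructions add is subword parallelism: several narrow values packed into one 32-bit register are processed by a single instruction. As a rough illustration only (the paper's actual instruction set and 16-processor SIMD array are not reproduced here), the C sketch below models a packed 8-bit saturating add, the sort of primitive a multimedia extension executes in one cycle; the function name and lane layout are my own assumptions.

```c
#include <stdint.h>
#include <stdio.h>

/* Add four unsigned 8-bit lanes packed in 32-bit words, saturating each lane
 * instead of letting it wrap; a hardware SIMD datapath does all lanes at once. */
static uint32_t padd_u8_sat(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFF;
        uint32_t y = (b >> (8 * lane)) & 0xFF;
        uint32_t s = x + y;
        if (s > 0xFF) s = 0xFF;          /* saturate the 8-bit lane */
        r |= s << (8 * lane);
    }
    return r;
}

int main(void)
{
    uint32_t a = 0x10F0A005, b = 0x20207003;
    printf("packed sum = 0x%08X\n", padd_u8_sat(a, b));  /* prints 0x30FFFF08 */
    return 0;
}
```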

Design of Low-Complexity 128-Bit AES-CCM* IP for IEEE 802.15.4-Compatible WPAN Devices (IEEE 802.15.4 호환 WPAN 기기를 위한 낮은 복잡도를 갖는128-bit AES-CCM* IP 설계)

  • Choi, Injun;Lee, Jong-Yeol;Kim, Ji-Hoon
    • Journal of IKEEE, v.19 no.1, pp.45-51, 2015
  • Recently, as WPAN (Wireless Personal Area Network) has become a necessary feature of IoT (Internet of Things) devices, the importance of data security has also increased greatly. In this paper, we present a low-complexity 128-bit AES-CCM* hardware IP for the IEEE 802.15.4 standard. For the low-cost and low-power implementation essential in IoT devices, we propose two optimization methods. First, a folded AES (Advanced Encryption Standard) processing core with an 8-bit datapath is presented, in which composite field arithmetic is adopted for reduced hardware complexity. In addition, to support the CCM* mode defined in IEEE 802.15.4, we propose a mode-toggling architecture that requires fewer hardware resources and less processing time. With the proposed methods, the gate count of the proposed AES-CCM* IP can be reduced by up to 57% compared to the conventional architecture.
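
A folded 8-bit datapath means the 16-byte AES state passes through a single byte-wide S-box over many cycles instead of transforming all 128 bits at once. The sketch below is only a software analogy of that scheduling: it applies SubBytes one byte per "cycle". For self-containment it derives the S-box from GF(2^8) inversion plus the affine map, whereas the paper's hardware uses composite field arithmetic over GF((2^4)^2) to build the inverter cheaply; all names here are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* GF(2^8) multiply, reducing modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        uint8_t hi = a & 0x80;
        a <<= 1;
        if (hi) a ^= 0x1B;
        b >>= 1;
    }
    return p;
}

/* Multiplicative inverse by exhaustive search (the hardware uses composite fields). */
static uint8_t gf_inv(uint8_t a)
{
    if (!a) return 0;
    for (int x = 1; x < 256; x++)
        if (gf_mul(a, (uint8_t)x) == 1) return (uint8_t)x;
    return 0;
}

/* AES S-box: inversion followed by the affine transformation. */
static uint8_t sbox(uint8_t a)
{
    uint8_t x = gf_inv(a), s = 0x63;
    for (int i = 0; i < 5; i++)
        s ^= (uint8_t)(((x << i) | (x >> (8 - i))) & 0xFF);
    return s;
}

int main(void)
{
    uint8_t state[16] = { 0 };                 /* one 128-bit AES state */
    for (int cycle = 0; cycle < 16; cycle++)   /* folded datapath: one byte per cycle */
        state[cycle] = sbox(state[cycle]);
    printf("sbox(0x00) = 0x%02X\n", state[0]); /* expected 0x63 */
    return 0;
}
```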

Design and Implementation of eBPF-based Virtual TAP for Inter-VM Traffic Monitoring (가상 네트워크 트래픽 모니터링을 위한 eBPF 기반 Virtual TAP 설계 및 구현)

  • Hong, Jibum;Jeong, Seyeon;Yoo, Jae-Hyung;Hong, James Won-Ki
    • KNOM Review, v.21 no.2, pp.26-34, 2018
  • With the proliferation of cloud computing and services, internet traffic and the demand for better quality of service are increasing. For this reason, server virtualization and network virtualization technologies, which use the resources of servers in the data center more efficiently, are receiving increased attention. However, existing hardware Test Access Port (TAP) equipment is unfit for deployment in the virtual datapaths configured for server virtualization. A virtual TAP (vTAP), which is a software version of the hardware TAP, overcomes this problem by duplicating packets in a virtual switch. However, implementing a vTAP in a virtual switch creates a performance problem because it shares the computing resources of the host machine with the virtual switch and other VMs. We propose a vTAP implementation technique based on the extended Berkeley Packet Filter (eBPF), a high-speed packet processing technology, and compare its performance with that of the existing vTAP.
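
The core mechanism of an eBPF-based vTAP is packet duplication inside the kernel datapath rather than in a user-space switch process. The fragment below is a minimal sketch, not the authors' implementation: a tc-attached eBPF program clones every packet to a monitoring interface with the standard bpf_clone_redirect() helper and lets the original continue; MIRROR_IFINDEX is a placeholder for the capture port's interface index.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define MIRROR_IFINDEX 10   /* hypothetical ifindex of the monitoring port */

SEC("tc")
int vtap_mirror(struct __sk_buff *skb)
{
    /* Clone the packet and send the copy out of the mirror interface;
     * the original packet continues along the normal virtual datapath. */
    bpf_clone_redirect(skb, MIRROR_IFINDEX, 0);
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```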

An Optimized Hardware Implementation of SHA-3 Hash Functions (SHA-3 해시 함수의 최적화된 하드웨어 구현)

  • Kim, Dong-Seong;Shin, Kyung-Wook
    • Journal of IKEEE, v.22 no.4, pp.886-895, 2018
  • This paper describes a hardware design of the Secure Hash Algorithm-3 (SHA-3) hash functions, the latest members of the SHA family of standards released by NIST, together with an ARM Cortex-M0 interface for security SoC applications. To achieve an optimized design, the tradeoff between hardware complexity and performance was analyzed for five hardware architectures, and on the basis of the analysis the datapath of the round block was set to 1600 bits. In addition, the padder, which has a 64-bit interface to the round block, was implemented in hardware. A SoC prototype that integrates the SHA-3 hash processor, the Cortex-M0, and an AHB interface was implemented on a Cyclone-V FPGA device, and hardware/software co-verification was carried out. The SHA-3 hash processor uses 1,672 slices of a Virtex-5 FPGA and has an estimated maximum clock frequency of 289 MHz, achieving a throughput of 5.04 Gbps.
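
In this organization the padder sits in front of the 1600-bit round datapath and extends the message with the SHA-3 multi-rate padding (the domain-separation bits plus pad10*1, per FIPS 202) before feeding it in rate-sized blocks. The C sketch below shows only that padding rule under stated assumptions (it is not the paper's RTL, and sha3_pad is my own name); for SHA3-256 the rate is 1088 bits, i.e. 136 bytes.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pad `len` message bytes into a whole number of rate-sized blocks.
 * rate_bytes = (1600 - 2 * digest_bits) / 8, e.g. 136 for SHA3-256. */
static size_t sha3_pad(const uint8_t *msg, size_t len,
                       uint8_t *out, size_t rate_bytes)
{
    size_t padded = ((len / rate_bytes) + 1) * rate_bytes;
    memcpy(out, msg, len);
    memset(out + len, 0, padded - len);
    out[len] = 0x06;            /* SHA-3 domain bits 01 plus the first pad bit */
    out[padded - 1] |= 0x80;    /* final pad bit of pad10*1 */
    return padded;
}

int main(void)
{
    uint8_t buf[272];
    const char *m = "abc";
    size_t n = sha3_pad((const uint8_t *)m, 3, buf, 136);
    printf("%zu padded bytes fed to the 1600-bit round datapath\n", n);
    return 0;
}
```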

A Lightweight Hardware Implementation of ECC Processor Supporting NIST Elliptic Curves over GF(2m) (GF(2m) 상의 NIST 타원곡선을 지원하는 ECC 프로세서의 경량 하드웨어 구현)

  • Lee, Sang-Hyun;Shin, Kyung-Wook
    • Journal of IKEEE, v.23 no.1, pp.58-67, 2019
  • A design of an elliptic curve cryptography (ECC) processor that supports both the pseudo-random curves and the Koblitz curves over GF(2^m) defined in the NIST standard is described in this paper. A finite field arithmetic circuit based on a word-based Montgomery multiplier was designed to support five key lengths with a fixed-size datapath and to achieve a lightweight hardware implementation. In addition, Lopez-Dahab's coordinate system was adopted to remove the finite field division operation. The ECC processor was implemented on an FPGA verification platform, and its hardware operation was verified by running the Elliptic Curve Diffie-Hellman (ECDH) key exchange protocol. The ECC processor synthesized with a 180-nm CMOS cell library occupied 10,674 gate equivalents (GEs) and a 9-kbit dual-port RAM, and its maximum clock frequency was estimated at 154 MHz. Scalar multiplication over the 233-bit pseudo-random elliptic curve takes 1,112,221 clock cycles, corresponding to a throughput of 32.3 kbps.
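
The arithmetic at the heart of such a processor is multiplication in GF(2^m): carry-free polynomial multiplication over GF(2) followed by reduction with the field polynomial. The toy C sketch below shows the bit-serial shift-and-add form of that operation for a field small enough to fit in a machine word; the paper's word-based Montgomery multiplier instead streams 163- to 571-bit NIST field elements word by word through one fixed-width datapath, which this sketch does not attempt to reproduce.

```c
#include <stdint.h>
#include <stdio.h>

/* Multiply a*b in GF(2^m) with irreducible polynomial f (bit m of f is set). */
static uint64_t gf2m_mul(uint64_t a, uint64_t b, uint64_t f, int m)
{
    uint64_t r = 0;
    for (int i = m - 1; i >= 0; i--) {
        r <<= 1;                          /* multiply the partial result by x */
        if (r & (1ULL << m)) r ^= f;      /* reduce when the degree reaches m */
        if ((b >> i) & 1) r ^= a;         /* add a if bit i of b is set */
    }
    return r;
}

int main(void)
{
    /* Demo in GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1 (0x11B). */
    printf("0x57 * 0x83 = 0x%02llX\n",
           (unsigned long long)gf2m_mul(0x57, 0x83, 0x11B, 8));  /* prints 0xC1 */
    return 0;
}
```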

Microcode based Controller for Compact CNN Accelerators Aimed at Mobile Devices (모바일 디바이스를 위한 소형 CNN 가속기의 마이크로코드 기반 컨트롤러)

  • Na, Yong-Seok;Son, Hyun-Wook;Kim, Hyung-Won
    • Journal of the Korea Institute of Information and Communication Engineering, v.26 no.3, pp.355-366, 2022
  • This paper proposes a microcode-based neural network accelerator controller for artificial intelligence accelerators that can be reconfigured through a programmable architecture while retaining the advantages of low power and very small chip size. So that the target accelerator can support various neural network models, a neural network model is converted into microcode by a microcode compiler and loaded onto the accelerator to control the accelerator's operators, such as the datapath and memory accesses. While the proposed controller and accelerator can run various CNN models, in this paper we tested them using the YOLOv2-Tiny CNN model. With a 200 MHz system clock, the controller and accelerator achieved an inference time of 137.9 ms/image on the VOC 2012 dataset for object detection and 99.5 ms/image on a mask detection dataset for detecting mask wearing. When an accelerator equipped with the proposed controller is implemented as a silicon chip, the gate count is 618,388, corresponding to a 65.5% reduction in chip area compared with an accelerator employing a CPU-based (RISC-V) controller.
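
The control scheme is essentially a tiny stored-program machine: a compiler turns the CNN model into microcode, and a small sequencer fetches each microcode word and decodes it into datapath and memory-access control actions. The sketch below is purely hypothetical; the abstract does not disclose the real microcode word format, so every field, opcode, and name here is an assumption used only to show the fetch/decode/dispatch structure.

```c
#include <stdint.h>
#include <stdio.h>

enum { LOAD_TILE, CONV, POOL, STORE_TILE, HALT };   /* hypothetical opcodes */

typedef struct {            /* hypothetical microcode word */
    uint8_t  opcode;
    uint8_t  layer;         /* CNN layer the operation belongs to */
    uint16_t tile;          /* tile / feature-map selector */
    uint32_t dram_addr;     /* external memory address for DMA operations */
} uop_t;

/* Fetch each microcode word and turn it into datapath / DMA control actions. */
static void run_microcode(const uop_t *prog)
{
    for (size_t pc = 0; ; pc++) {
        uop_t u = prog[pc];
        switch (u.opcode) {
        case LOAD_TILE:  printf("DMA in   layer %u tile %u @0x%08X\n",
                                (unsigned)u.layer, (unsigned)u.tile, (unsigned)u.dram_addr); break;
        case CONV:       printf("MAC array: convolve layer %u tile %u\n",
                                (unsigned)u.layer, (unsigned)u.tile); break;
        case POOL:       printf("pooling  layer %u tile %u\n",
                                (unsigned)u.layer, (unsigned)u.tile); break;
        case STORE_TILE: printf("DMA out  layer %u tile %u @0x%08X\n",
                                (unsigned)u.layer, (unsigned)u.tile, (unsigned)u.dram_addr); break;
        case HALT:       return;
        }
    }
}

int main(void)
{
    const uop_t prog[] = {               /* sequence a compiler might emit */
        { LOAD_TILE,  0, 0, 0x80000000u },
        { CONV,       0, 0, 0 },
        { POOL,       0, 0, 0 },
        { STORE_TILE, 0, 0, 0x80100000u },
        { HALT,       0, 0, 0 },
    };
    run_microcode(prog);
    return 0;
}
```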

Hardware Approach to Fuzzy Inference―ASIC and RISC―

  • Watanabe, Hiroyuki
    • Proceedings of the Korean Institute of Intelligent Systems Conference, 1993.06a, pp.975-976, 1993
  • This talk presents an overview of the author's research and development activities on fuzzy inference hardware. We have pursued two distinct approaches. The first approach is to use application specific integrated circuit (ASIC) technology: the fuzzy inference method is implemented directly in silicon. The second approach, which is in its preliminary stage, is to use a more conventional microprocessor architecture. Here, we use a quantitative technique employed by designers of reduced instruction set computers (RISC) to modify the architecture of a microprocessor. In the ASIC approach, we implemented the most widely used fuzzy inference mechanism directly on silicon. The mechanism is based on the max-min compositional rule of inference and Mamdani's method of fuzzy implication. Two VLSI fuzzy inference chips were designed, fabricated, and fully tested. Both used a full-custom CMOS technology. The second and more elaborate chip was designed at the University of North Carolina (UNC) in cooperation with MCNC. Both VLSI chips had multiple datapaths for rule evaluation and executed multiple fuzzy if-then rules in parallel. The AT&T chip is the first digital fuzzy inference chip in the world. It ran with a 20 MHz clock and achieved approximately 80,000 Fuzzy Logical Inferences Per Second (FLIPS). It stored and executed 16 fuzzy if-then rules. Since it was designed as a proof-of-concept prototype chip, it had a minimal amount of peripheral logic for system integration. The UNC/MCNC chip consists of 688,131 transistors, of which 476,160 are used for RAM. It ran with a 10 MHz clock. The chip has a 3-stage pipeline and initiates the computation of a new inference every 64 cycles. This chip achieved approximately 160,000 FLIPS. The new architecture has the following important improvements over the AT&T chip: programmable rule set memory (RAM); on-chip fuzzification by a table lookup method; on-chip defuzzification by a centroid method; a reconfigurable architecture for processing two rule formats; and RAM/datapath redundancy for higher yield. It can store and execute 51 if-then rules of the following format: IF A and B and C and D THEN Do E and Do F. With this format, the chip takes four inputs and produces two outputs. By software reconfiguration, it can store and execute 102 if-then rules of the following simpler format using the same datapath: IF A and B THEN Do E. With this format, the chip takes two inputs and produces one output. We have built two VME-bus board systems based on this chip for Oak Ridge National Laboratory (ORNL). The board is now installed in a robot at ORNL, where researchers use it for experiments in autonomous robot navigation. The Fuzzy Logic system board places the fuzzy chip in a VMEbus environment. High-level C language functions hide the operational details of the board from the application programmer, who treats rule memories and fuzzification function memories as local structures passed as parameters to the C functions. ASIC fuzzy inference hardware is extremely fast, but it is limited in generality: many aspects of the design are limited or fixed. We have therefore proposed designing a fuzzy information processor as an application-specific processor using a quantitative approach. The quantitative approach was developed by RISC designers.
In effect, we are interested in evaluating the effectiveness of a specialized RISC processor for fuzzy information processing. As a first step, we measured the possible speed-up of a fuzzy inference program based on if-then rules from the introduction of specialized instructions, i.e., min and max instructions. The minimum and maximum operations are heavily used in fuzzy logic applications as fuzzy intersection and union. We performed measurements using a MIPS R3000 as the base microprocessor. The initial result is encouraging: we could achieve as much as a 2.5-fold increase in inference speed if the R3000 had min and max instructions. They are also useful for speeding up other fuzzy operations such as bounded product and bounded sum. The embedded processor's main task is to control some device or process; it usually runs a single program or a few dedicated programs, so modifying an embedded processor in this way to create an embedded processor for fuzzy control is very effective. Table I shows the measured inference speed of a MIPS R3000 microprocessor, a fictitious MIPS R3000 microprocessor with min and max instructions, and the UNC/MCNC ASIC fuzzy inference chip. The software used on the microprocessors is a simulator of the ASIC chip. The first row is the computation time in seconds for 6000 inferences using 51 rules, where each fuzzy set is represented by an array of 64 elements. The second row is the time required to perform a single inference. The last row is the fuzzy logical inferences per second (FLIPS) measured for each device. There is a large gap in run time between the ASIC and software approaches even if we resort to a specialized fuzzy microprocessor. As for design time and cost, these two approaches represent two extremes: an ASIC approach is extremely expensive. It is therefore an important research topic to design a specialized computing architecture for fuzzy applications that falls between these two extremes in both run time and design time/cost.

TABLE I. INFERENCE TIME WITH 51 RULES

                     MIPS R3000 (regular)   MIPS R3000 (with min/max)   ASIC
  6000 inferences    125 s                  49 s                        0.0038 s
  1 inference        20.8 ms                8.2 ms                      6.4 us
  FLIPS              48                     122                         156,250
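
The inner loop that benefits from min and max instructions is the max-min inference step itself: each rule's firing strength is the minimum of its antecedent membership degrees, the consequent fuzzy set is clipped with another minimum, and the rule outputs are aggregated with a maximum. A small sketch of that computation is given below, assuming 8-bit fixed-point membership degrees and a 64-element output universe as mentioned in the talk; names and data are illustrative only, not the chip's or simulator's code.

```c
#include <stdint.h>
#include <stdio.h>

#define U 64                        /* points in the output universe */

static inline uint8_t fmin8(uint8_t a, uint8_t b) { return a < b ? a : b; }
static inline uint8_t fmax8(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Evaluate rules "IF A and B THEN E": mu_a[r], mu_b[r] are the antecedent
 * degrees of rule r, cons[r] is its consequent membership function. */
static void infer(int nrules, const uint8_t *mu_a, const uint8_t *mu_b,
                  uint8_t cons[][U], uint8_t out[U])
{
    for (int u = 0; u < U; u++) out[u] = 0;
    for (int r = 0; r < nrules; r++) {
        uint8_t w = fmin8(mu_a[r], mu_b[r]);               /* firing strength */
        for (int u = 0; u < U; u++)
            out[u] = fmax8(out[u], fmin8(w, cons[r][u]));  /* clip + aggregate */
    }
}

int main(void)
{
    uint8_t mu_a[2] = { 200, 60 }, mu_b[2] = { 120, 255 };
    static uint8_t cons[2][U], out[U];
    for (int u = 0; u < U; u++) {                  /* two simple consequent shapes */
        cons[0][u] = (uint8_t)(u < 32 ? u * 8 : (63 - u) * 8);
        cons[1][u] = (uint8_t)(u * 4);
    }
    infer(2, mu_a, mu_b, cons, out);
    printf("out[16]=%u out[48]=%u\n", out[16], out[48]);
    return 0;
}
```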


Design and Implementation of a Single-Chip 8-Bit Microcontroller (단일 칩 8비트 마이크로컨트롤러의 설계 및 구현)

  • Ahn, Jung-Il;Park, Sung-Hwan;Kwon, Sung-Jae
    • Journal of Korea Society of Industrial Information Systems, v.11 no.4, pp.72-81, 2006
  • In this paper, we first define a total of 64 instructions that are considered essential and frequently used, construct a datapath diagram, determine the control sequence using a finite state machine, and implement an 8-bit microcontroller in VHDL on an FPGA. In previous work, only functional simulation results of a rudimentary microcontroller were reported, the microcontroller lacked interrupt-handling capability, or it was not implemented in hardware. We have designed a self-contained 8-bit microcontroller that can perform data transfer, addition, and logical operations, as well as stack and external interrupt operations. Following timing simulation of the designed microcontroller, we implemented it on an FPGA and verified its operation successfully. The design and implementation were done in the Altera MAX+PLUS II integrated development environment using the EP1K50TC144-3 chip. The maximum operating frequency, the total number of logic elements used, and the logic utilization were found to be 9.39 MHz, 2813, and 97%, respectively. The result can be used as a microcontroller IP, and the VHDL code can be modified as needs arise.
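
The design flow described above (instruction set, datapath diagram, FSM-determined control sequence) boils down to a fetch/decode/execute state machine that steers the datapath each cycle. The C model below sketches that control sequence for a hypothetical accumulator machine with a handful of toy opcodes; it is not the paper's 64-instruction ISA or its VHDL description, only an illustration of how the FSM sequences the datapath.

```c
#include <stdint.h>
#include <stdio.h>

enum { FETCH, DECODE, EXECUTE };                     /* controller states */
enum { LDA = 0x01, LDB = 0x02, ADD = 0x03, HLT = 0xFF };  /* toy opcodes */

int main(void)
{
    uint8_t mem[16] = { LDA, 5, LDB, 7, ADD, HLT };  /* tiny program in memory */
    uint8_t pc = 0, ir = 0, a = 0, b = 0;
    int state = FETCH, running = 1;

    while (running) {
        switch (state) {
        case FETCH:  ir = mem[pc++]; state = DECODE; break;
        case DECODE: state = EXECUTE; break;         /* control signals would be set here */
        case EXECUTE:
            switch (ir) {
            case LDA: a = mem[pc++]; break;          /* load accumulator with immediate */
            case LDB: b = mem[pc++]; break;          /* load B register with immediate */
            case ADD: a = (uint8_t)(a + b); break;   /* ALU add through the datapath */
            case HLT: running = 0; break;
            }
            state = FETCH;
            break;
        }
    }
    printf("A = %u\n", a);   /* prints 12 */
    return 0;
}
```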


Simulation of YUV-Aware Instructions for High-Performance, Low-Power Embedded Video Processors (고성능, 저전력 임베디드 비디오 프로세서를 위한 YUV 인식 명령어의 시뮬레이션)

  • Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of KIISE: Computing Practices and Letters, v.13 no.5, pp.252-259, 2007
  • With the rapid development of multimedia applications and wireless communication networks, consumer demand for video-over-wireless capability on mobile computing systems is growing rapidly. In this regard, this paper introduces YUV-aware instructions that enhance performance and efficiency in the processing of color images and video. Traditional multimedia extensions (e.g., MMX, SSE, VIS, and AltiVec) depend solely on generic subword parallelism, whereas the proposed YUV-aware instructions support parallel operations on two packed 16-bit YUV values (6-bit Y, 5-bit U, 5-bit V) in a 32-bit datapath architecture, providing greater concurrency and efficiency for color image and video processing. Moreover, the reduced data format size lowers system cost. Experimental results on a representative dynamically scheduled embedded superscalar processor show that YUV-aware instructions achieve an average speedup of 3.9x over the baseline superscalar performance. This is in contrast to MMX (a representative Intel multimedia extension), which achieves a speedup of only 2.1x over the same baseline superscalar processor. In addition, YUV-aware instructions outperform MMX instructions in energy reduction (75.8% reduction with YUV-aware instructions versus only 54.8% reduction with MMX instructions over the baseline).
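
The data layout behind these instructions packs two 16-bit YUV samples, each holding a 6-bit Y and 5-bit U and V fields, into one 32-bit word so that both samples are processed per instruction. The sketch below illustrates one possible packing and a lane-wise saturating add on the Y fields; the bit layout, function names, and semantics are assumptions for illustration, not the instruction definitions from the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout of one 16-bit sample: [15:10]=Y, [9:5]=U, [4:0]=V. */
static uint16_t pack_yuv(unsigned y, unsigned u, unsigned v)
{
    return (uint16_t)(((y & 0x3F) << 10) | ((u & 0x1F) << 5) | (v & 0x1F));
}

/* Add a brightness offset to both packed samples, saturating each 6-bit Y lane. */
static uint32_t addy_sat(uint32_t two_samples, unsigned dy)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 2; lane++) {
        uint32_t s = (two_samples >> (16 * lane)) & 0xFFFF;
        uint32_t y = (s >> 10) + dy;
        if (y > 0x3F) y = 0x3F;                  /* saturate the 6-bit Y field */
        r |= ((y << 10) | (s & 0x03FF)) << (16 * lane);
    }
    return r;
}

int main(void)
{
    uint32_t px = ((uint32_t)pack_yuv(60, 3, 4) << 16) | pack_yuv(10, 7, 2);
    printf("0x%08X -> 0x%08X\n", px, addy_sat(px, 8));  /* Y lanes: 10->18, 60->63 */
    return 0;
}
```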

Implementation of Pixel Subword Parallel Processing Instructions for Embedded Parallel Processors (임베디드 병렬 프로세서를 위한 픽셀 서브워드 병렬처리 명령어 구현)

  • Jung, Yong-Bum;Kim, Jong-Myon
    • The KIPS Transactions: Part A, v.18A no.3, pp.99-108, 2011
  • Processor technology has moved toward parallel processing techniques rather than simply increasing the clock frequency of a single processor, because of the cost and power consumption involved. In this paper, a SIMD (Single Instruction Multiple Data) based parallel processor that efficiently processes the massive data inherent in multimedia is introduced. In addition, this paper proposes pixel subword parallel processing instructions for the SIMD parallel processor architecture that operate efficiently on image and video pixels. The proposed pixel subword parallel processing instructions store and process four 8-bit pixels in four partitioned 12-bit registers of a 48-bit datapath architecture. This solves the overflow problem inherent in existing multimedia extensions and reduces the use of packing/unpacking instructions. Experimental results using the same SIMD-based parallel processor architecture indicate that the proposed pixel subword parallel processing instructions achieve a speedup of 2.3x over the baseline SIMD array performance. This is in contrast to MMX-type instructions (a representative Intel multimedia extension), which achieve a speedup of only 1.4x over the same baseline SIMD array performance. In addition, the proposed instructions achieve 2.5x better energy efficiency than the baseline program, while MMX-type instructions achieve only 1.8x better energy efficiency.
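
The 12-bit lanes give each 8-bit pixel four guard bits, so short chains of additions stay inside a lane and need neither saturation nor packing/unpacking. The C sketch below models this with a 64-bit integer standing in for the 48-bit datapath (C has no 48-bit type) and a lane-wise average of two pixel vectors; the lane layout and function names are assumptions, not the paper's instruction definitions.

```c
#include <stdint.h>
#include <stdio.h>

#define LANES 4
#define LANE_BITS 12
#define LANE_MASK 0xFFFULL

/* Pack four 8-bit pixels into four 12-bit lanes of a (nominally 48-bit) word. */
static uint64_t pack4(const uint8_t p[LANES])
{
    uint64_t w = 0;
    for (int i = 0; i < LANES; i++) w |= (uint64_t)p[i] << (LANE_BITS * i);
    return w;
}

/* Lane-wise average: the sum of two 8-bit pixels always fits in a 12-bit lane,
 * so no overflow handling is needed before the shift-right by one. */
static uint64_t avg4(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < LANES; i++) {
        uint64_t s = ((a >> (LANE_BITS * i)) & LANE_MASK) +
                     ((b >> (LANE_BITS * i)) & LANE_MASK);
        r |= ((s >> 1) & LANE_MASK) << (LANE_BITS * i);
    }
    return r;
}

int main(void)
{
    uint8_t x[LANES] = { 255, 200, 16, 0 }, y[LANES] = { 255, 100, 48, 10 };
    uint64_t r = avg4(pack4(x), pack4(y));
    for (int i = 0; i < LANES; i++)
        printf("avg[%d] = %llu\n", i,
               (unsigned long long)((r >> (LANE_BITS * i)) & LANE_MASK));
    return 0;
}
```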