Search | Korea Science

Performance Evaluation and Verification of MMX-type Instructions on an Embedded Parallel Processor (임베디드 병렬 프로세서 상에서 MMX타입 명령어의 성능평가 및 검증)

Jung, Yong-Bum;Kim, Yong-Min;Kim, Cheol-Hong;Kim, Jong-Myon
- Journal of the Korea Society of Computer and Information
- /
- v.16 no.10
- /
- pp.11-21
- /
- 2011
This paper introduces an SIMD(Single Instruction Multiple Data) based parallel processor that efficiently processes massive data inherent in multimedia. In addition, this paper implements MMX(MultiMedia eXtension)-type instructions on the data parallel processor and evaluates and analyzes the performance of the MMX-type instructions. The reference data parallel processor consists of 16 processors each of which has a 32-bit datapath. Experimental results for a JPEG compression application with a 1280x1024 pixel image indicate that MMX-type instructions achieves a 50% performance improvement over the baseline instructions on the same data parallel architecture. In addition, MMX-type instructions achieves 100% and 51% improvements over the baseline instructions in energy efficiency and area efficiency, respectively. These results demonstrate that multimedia specific instructions including MMX-type have potentials for widely used many-core GPU(Graphics Processing Unit) and any types of parallel processors.
https://doi.org/10.9708/jksci.2011.16.10.011 인용 PDF KSCI

A Research about Open Source Distributed Computing System for Realtime CFD Modeling (SU2 with OpenCL and MPI) (실시간 CFD 모델링을 위한 오픈소스 분산 컴퓨팅 기술 연구)

Lee, Jun-Yeob;Oh, Jong-woo;Lee, DongHoon
- Proceedings of the Korean Society for Agricultural Machinery Conference
- /
- 2017.04a
- /
- pp.171-171
- /
- 2017
전산유체역학(CFD: Computational Fluid Dynamics)를 이용한 스마트팜 환경 내부의 정밀 제어 연구가 진행 중이다. 시계열 데이터의 난해한 동적 해석을 극복하기위해, 비선형 모델링 기법의 일종인 인공신경망을 이용하는 방안을 고려하였다. 선행 연구를 통하여 환경 데이터의 비선형 모델링을 위한 Tensorflow활용 방법이 하드웨어 가속 기능을 바탕으로 월등한 성능을 보임을 확인하였다. 그럼에도 오프라인 일괄(Offline batch)처리 방식의 한계가 있는 인공신경망 모델링 기법과 현장 보급이 불가능한 고성능 하드웨어 연산 장치에 대한 대안 마련이 필요하다고 판단되었다. CFD 해석을 위한 Solver로 SU2(http://su2.stanford.edu)를 이용하였다. 운영 체제 및 컴파일러는 1) Mac OS X Sierra 10.12.2 Apple LLVM version 8.0.0 (clang-800.0.38), 2) Windows 10 x64: Intel C++ Compiler version 16.0, update 2, 3) Linux (Ubuntu 16.04 x64): g++ 5.4.0, 4) Clustered Linux (Ubuntu 16.04 x32): MPICC 3.3.a2를 선정하였다. 4번째 개발환경인 병렬 시스템의 경우 하드웨어 가속는 OpenCL(https://www.khronos.org/opencl/) 엔진을 이용하고 저전력 ARM 프로세서의 일종인 옥타코어 Samsung Exynos5422 칩을 장착한 ODROID-XU4(Hardkernel, AnYang, Korea) SBC(Single Board Computer)를 32식 병렬 구성하였다. 분산 컴퓨팅을 위한 환경은 Gbit 로컬 네트워크 기반 NFS(Network File System)과 MPICH(http://www.mpich.org/)로 구성하였다. 공간 분해능을 계측 주기보다 작게 분할할 경우 발생하는 미지의 바운더리 정보를 정의하기 위하여 3차원 Kriging Spatial Interpolation Method를 실험적으로 적용하였다. 한편 병렬 시스템 구성이 불가능한 1,2,3번 환경의 경우 내부적으로 이미 존재하는 멀티코어를 활용하고자 OpenMP(http://www.openmp.org/) 라이브러리를 활용하였다. 64비트 병렬 8코어로 동작하는 1,2,3번 운영환경의 경우 32비트 병렬 128코어로 동작하는 환경에 비하여 근소하게 2배 내외로 연산 속도가 빨랐다. 실시간 CFD 수행을 위한 분산 컴퓨팅 기술이 프로세서의 속도 및 운영체제의 정보 분배 능력에 따라 결정된다고 판단할 수 있었다. 이를 검증하기 위하여 4번 개발환경에서 운영체제를 64비트로 개선하여 5번째 환경을 구성하여 검증하였다. 상반되는 결과로 64비트 72코어로 동작하는 분산 컴퓨팅 환경에서 단일 프로세서 기반 멀티 코어(1,2,3번) 환경보다 보다 2.5배 내외 연산속도 향상이 있었다. ARM 프로세서용 64비트 운영체제의 완성도가 낮은 시점에서 추후 성공적인 실시간 CFD 모델링을 위한 지속적인 검토가 필요하다.
PDF

Design of Software and Hardware Modules for a TCP/IP Offload Engine with Separated Transmission and Reception Paths (송수신 분리형 TCP/IP Offload Engine을 위한 소프트웨어 및 하드웨어 모듈의 설계)

Jang Hank-Kok;Chung Sang-Hwa;Choi Young-In
- Journal of KIISE:Computer Systems and Theory
- /
- v.33 no.9
- /
- pp.691-698
- /
- 2006
TCP/IP Offload Engine (TOE) is a technology that processes TCP/IP on a network adapter instead of a host CPU to reduce protocol processing overhead from the host CPU. There have been some approaches to implementing TOE: software TOE based on an embedded processor; hardware TOE based on ASIC implementation; and hybrid TOE in which software and hardware functions are combined. In this paper, we designed software modules and hardware modules for a hybrid TOE on an FPGA that had two processor cores. Software modules are based on the embedded Linux. Hardware modules are for data transmission (TX) and reception (RX). One core controls the TX path and the other controls the RX path of the Linux. This TX/RX path separation mechanism can reduce task switching overheads between processes and overcome poor performance of single embedded processor. Hardware modules deal with creating headers for outgoing packets, processing headers of incoming packets, and fetching or storing data from or to the host memory by DMA. These can make it possible to improve the performance of data transmission and reception. We proved performance of the TOE with separated transmission and reception paths by performing experiments with a TOE network adapter that was equipped with the FPGA having processor cores.
PDF KSCI

Optimal Many-core Processor Architecture for Different Ultrasonic Image Resolutions (초음파 영상선호의 크기 변화에 따른 최적의 매니코어 프로세서 구조)

Kang, Seong-Mo;Kim, Jong-Myon
- Journal of the Institute of Convergence Signal Processing
- /
- v.13 no.1
- /
- pp.50-55
- /
- 2012
This paper proposes an optima] many-core processor architecture that meets the requirements of low power and high performance for different ultrasonic image resolutions in hand-held ultrasonic devices. To identify the optimal many-core architecture, seven different PE configurations are simulated for processing ultrasonic images in terms of execution performance and energy consumption. Experimental results indicate that the highest energy efficiencies are achieved at PEs=1,024, 64, and 256 for ultrasonic images at $256{\times}256$, $320{\times}240$, and $800{\times}480$ resolutions, respectively. In addition, the maximum area efficiencies are obtained at PEs=256 (for $256{\times}256$ and $800{\times}480$ image resolutions) and 64 (for $320{\times}240$ image resolution).
PDF KSCI

Multiple Signature Comparison of LogTM-SE for Fast Conflict Detection (다중 시그니처 비교를 통한 트랜잭셔널 메모리의 충돌해소 정책의 성능향상)

Kim, Deok-Ho;Oh, Doo-Hwan;Ro, Won-W.
- The KIPS Transactions:PartA
- /
- v.18A no.1
- /
- pp.19-24
- /
- 2011
As era of multi-core processors has arrived, transactional memory has been considered as an effective method to achieve easy and fast multi-threaded programming. Various hardware transactional memory systems such as UTM, VTM, FastTM, LogTM, and LogTM-SE, have been introduced in order to implement high-performance multi-core processors. Especially, LogTM-SE has provided study performance with an efficient memory management policy and a practical thread scheduling method through conflict detection based on signatures. However, increasing number of cores on a processor imposes the hardware complexity for signature processing. This causes overall performance degradation due to the heavy workload on signature comparison. In this paper, we propose a new architecture of multiple signature comparison to improve conflict detection of signature based transactional memory systems.
https://doi.org/10.3745/KIPSTA.2011.18A.1.019 인용 PDF KSCI

A Design of Security SoC Prototype Based on Cortex-M0 (Cortex-M0 기반의 보안 SoC 프로토타입 설계)

Choi, Jun-baek;Choe, Jun-yeong;Shin, Kyung-wook
- Proceedings of the Korean Institute of Information and Commucation Sciences Conference
- /
- 2019.05a
- /
- pp.251-253
- /
- 2019
This paper describes an implementation of a security SoC (System-on-Chip) prototype that interfaces a microprocessor with a block cipher crypto-core. The Cortex-M0 was used as a microprocessor, and a crypto-core implemented by integrating ARIA and AES into a single hardware was used as an intellectual property (IP). The integrated ARIA-AES crypto-core supports five modes of operation including ECB, CBC, CFB, CTR and OFB, and two master key sizes of 128-bit and 256-bit. The integrated ARIA-AES crypto-core was interfaced to work with the AHB-light bus protocol of Cortex-M0, and the crypto-core IP was expected to operate at clock frequencies up to 50 MHz. The security SoC prototype was verified by BFM simulation, and then hardware-software co-verification was carried out with FPGA implementation.
PDF

Implementation of LTE Transport Channel on Multicore DSP Software Defined Radio Platform (멀티코어 DSP 기반 소프트웨어 정의 라디오 플랫폼을 활용한 LTE 전송 채널의 구현)

Lee, Jin
- Journal of the Korea Institute of Information and Communication Engineering
- /
- v.24 no.4
- /
- pp.508-514
- /
- 2020
To implement the continuously evolving mobile communication standards such as Long Term Evolution (LTE) and 5G, the Software Defined Radio (SDR) concept provides great flexibility and efficiency. For many years, a high-end Digital Signal Processor (DSP) System on Chip (SoC) has been developed to support multicore and various hardware coprocessors. This paper introduces the implementation of the SDR platform hardware using TI's TCI663x chip. Using the platform, LTE transport channel is implemented by interworking multicore DSP with Bit rate Coprocessor (BCP) and Turbo Decoder Coprocessor (TCP) and the performance is evaluated according to various implementation options. In order to evaluate the performance of the implemented LTE transport channel, LTE base station system was constructed by combining FPGA main board for physical channels, SDR platform board, and RF & Antenna board.
https://doi.org/10.6109/jkiice.2020.24.4.508 인용 PDF KSCI

Design of a Delayed Dual-Core Lock-Step Processor with Automatic Recovery in Soft Errors (소프트 에러 발생 시 자동 복구하는 이중 코어 지연 락스텝 프로세서의 설계)

Juho Kim;Seonghyun Yang;Seongsoo Lee
- Journal of IKEEE
- /
- v.27 no.4
- /
- pp.683-686
- /
- 2023
In this paper, we designed a Delayed Dual Core Lock-Step (D-DCLS) processor where two cores operate same instructions with delay and the result is compared to mitigate soft errors and common mode failures in automotive electronic systems. Because D-DCLS does not know which core an error occurred in, each core must be recovered to the point before the error occurred, but complex hardware modifications are required to return all intermediate values on the pipeline stage. In this paper, in order for easy hardware implementation, all register values are saved to a buffer whenever a branch instruction is executed. When an error is detected, the saved register values are automatically restored, and then 'BX LR' instruction is executed to return to the last branch point. The proposed D-DCLS processor was designed using Verilog HDL and was confirmed to continue normal operation after automatically recovering error.
https://doi.org/10.7471/ikeee.2023.27.4.683 인용 PDF

Implementation of 40 Gb/s Network Processor of Wire-Speed Flow Management (40 Gb/s 실시간 플로우 관리 네트워크 프로세서 구현)

Doo, Kyeong-Hwan;Lee, Bhum-Cheol;Kim, Whan-Woo
- The Journal of Korean Institute of Communications and Information Sciences
- /
- v.37B no.9
- /
- pp.814-821
- /
- 2012
We propose a network processor called an OmniFlow processor capable of wire-speed flow management by a hardware-based flow admission control(FAC) in this paper. Because the OmniFlow processor can set up and release a wire-speed connection for flows, the update period of flows can be set to a short time, and only active flows can be effectively managed by terminating a flow that does not have a packet transmitted within this period. Therefore, the FAC can be used to provide a reliable transmission of UDP as well as TCP applications. This processor is fabricated in 65nm CMOS technology, and total gate count is 25 million. It has 40 Gb/s throughput performance in using the 32 RISC cores when maximum operating frequency is 555MHz.
https://doi.org/10.7840/kics.2012.37B.9.814 인용 PDF KSCI

Comparison of Parallel Computation Performances for 3D Wave Propagation Modeling using a Xeon Phi x200 Processor (제온 파이 x200 프로세서를 이용한 3차원 음향 파동 전파 모델링 병렬 연산 성능 비교)

Lee, Jongwoo;Ha, Wansoo
- Geophysics and Geophysical Exploration
- /
- v.21 no.4
- /
- pp.213-219
- /
- 2018
In this study, we simulated 3D wave propagation modeling using a Xeon Phi x200 processor and compared the parallel computation performance with that using a Xeon CPU. Unlike the 1st generation Xeon Phi coprocessor codenamed Knights Corner, the 2nd generation x200 Xeon Phi processor requires no additional communication between the internal memory and the main memory since it can run an operating system directly. The Xeon Phi x200 processor can run large-scale computation independently, with the large main memory and the high-bandwidth memory. For comparison of parallel computation, we performed the modeling using the MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) libraries. Numerical examples using the SEG/EAGE salt model demonstrated that we can achieve 2.69 to 3.24 times faster modeling performance using the Xeon Phi with a large number of computational cores and high-bandwidth memory compared to that using the 12-core CPU.
https://doi.org/10.7582/GGE.2018.21.4.213 인용 PDF KSCI HTML

Search Result 312, Processing Time 0.026 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)