• 제목/요약/키워드: Parallel Implementation

검색결과 878건 처리시간 0.022초

32-bit RISC-V상에서의 PIPO 경량 블록암호 최적화 구현 (Optimized Implementation of PIPO Lightweight Block Cipher on 32-bit RISC-V Processor)

  • 엄시우;장경배;송경주;이민우;서화정
    • 정보처리학회논문지:컴퓨터 및 통신 시스템
    • /
    • 제11권6호
    • /
    • pp.167-174
    • /
    • 2022
  • PIPO 경량 블록암호는 ICISC'20에서 발표된 암호이다. 본 논문에서는 32-bit RISC-V 프로세서 상에서 PIPO 경량 블록암호 ECB, CBC, CTR 운용 모드의 단일 블록 최적화 구현과 병렬 최적화 구현을 진행한다. 단일 블록 구현에서는 32-bit 레지스터 상에서 효율적인 8-bit 단위의 Rlayer 함수 구현을 제안한다. 병렬 구현에서는 병렬 구현을 위한 레지스터 내부 정렬을 진행하며, 서로 다른 4개의 블록이 하나의 레지스터 상에서 Rlayer 함수 연산을 진행하기 위한 방법에 대해 설명한다. 또한 CBC 운용모드의 병렬 구현에서는 암호화 과정에 병렬 구현 기법 적용이 어렵기 때문에 복호화 과정에서의 병렬 구현 기법 적용을 제안하며, CTR 운용모드의 병렬 구현에서는 확장된 초기화 벡터를 사용하여 레지스터 내부 정렬 생략 기법을 제안한다. 본 논문에서는 병렬 구현 기법이 여러 블록암호 운용모드에 적용 가능함을 보여준다. 결과적으로 ECB 운용모드에서 키 스케줄 과정을 포함하고 있는 기존 연구 구현의 성능 대비 단일 블록 구현에서는 1.7배, 병렬 구현에서는 1.89배의 성능 향상을 확인하였다.

Novel Parallel Approach for SIFT Algorithm Implementation

  • Le, Tran Su;Lee, Jong-Soo
    • Journal of information and communication convergence engineering
    • /
    • 제11권4호
    • /
    • pp.298-306
    • /
    • 2013
  • The scale invariant feature transform (SIFT) is an effective algorithm used in object recognition, panorama stitching, and image matching. However, due to its complexity, real-time processing is difficult to achieve with current software approaches. The increasing availability of parallel computers makes parallelizing these tasks an attractive approach. This paper proposes a novel parallel approach for SIFT algorithm implementation using a block filtering technique in a Gaussian convolution process on the SIMD Pixel Processor. This implementation fully exposes the available parallelism of the SIFT algorithm process and exploits the processing and input/output capabilities of the processor, which results in a system that can perform real-time image and video compression. We apply this implementation to images and measure the effectiveness of such an approach. Experimental simulation results indicate that the proposed method is capable of real-time applications, and the result of our parallel approach is outstanding in terms of the processing performance.

Parallel Implementation of the Recursive Least Square for Hyperspectral Image Compression on GPUs

  • Li, Changguo
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • 제11권7호
    • /
    • pp.3543-3557
    • /
    • 2017
  • Compression is a very important technique for remotely sensed hyperspectral images. The lossless compression based on the recursive least square (RLS), which eliminates hyperspectral images' redundancy using both spatial and spectral correlations, is an extremely powerful tool for this purpose, but the relatively high computational complexity limits its application to time-critical scenarios. In order to improve the computational efficiency of the algorithm, we optimize its serial version and develop a new parallel implementation on graphics processing units (GPUs). Namely, an optimized recursive least square based on optimal number of prediction bands is introduced firstly. Then we use this approach as a case study to illustrate the advantages and potential challenges of applying GPU parallel optimization principles to the considered problem. The proposed parallel method properly exploits the low-level architecture of GPUs and has been carried out using the compute unified device architecture (CUDA). The GPU parallel implementation is compared with the serial implementation on CPU. Experimental results indicate remarkable acceleration factors and real-time performance, while retaining exactly the same bit rate with regard to the serial version of the compressor.

영역 분할에 의한 SIMPLER 모델의 병렬화와 성능 분석 (Implementation and Performance Analysis of a Parallel SIMPLER Model Based on Domain Decomposition)

  • 곽호상;이상산
    • 한국전산유체공학회지
    • /
    • 제3권1호
    • /
    • pp.22-29
    • /
    • 1998
  • Parallel implementation is conducted for a SIMPLER finite volume model. The present parallelism is based on domain decomposition and explicit message passing using MPI and SHMEM. Two parallel solvers to tridiagonal matrix equation are employed. The implementation is verified on the Cray T3E system for a benchmark problem of natural convection in a sidewall-heated cavity. The test results illustrate good scalability of the present parallel models. Performance issues are elaborated in view of convergence as well as conventional parallel overheads and single processor performance. The effectiveness of a localized matrix solution algorithm is demonstrated.

  • PDF

영역분할법에 의한 SIMPLER 기법의 병렬화 (Parallel Implementation of SIMPLER by Using Domain Decomposition Technique)

  • 곽호상
    • 한국전산유체공학회:학술대회논문집
    • /
    • 한국전산유체공학회 1997년도 추계 학술대회논문집
    • /
    • pp.23-28
    • /
    • 1997
  • A parallel implementation is made of a two-dimensional finite volume model based on the SIMPLER. The solution domain is decomposed into several subdomains and the solution at each subdomain is acquired by parallel use of multiple processors. Communications between processors are accomplished by using the standard MPI and the Cray-specific SHMEM. The parallelization method for the overall solution procedure to the Navier-Stokes equations is described in detail, The parallel implementation is validated on the Cray T3E system for a benchmark problem of natural convection in a sidewall-heated cavity. The parallel performance is assessed and the issues encountered in achieving a high-performance parallel model are elaborated.

  • PDF

GPSS 프로그램의 병렬화에 관한 연구 (A Study on the Implementation of GPSS Program on a Parallel Computer)

  • 윤정미
    • 한국시뮬레이션학회논문지
    • /
    • 제8권2호
    • /
    • pp.57-72
    • /
    • 1999
  • With the rapidly increasing complexity of decision-marking or system development in the fields of industry, management, etc., modelling techniques using simulation has become more highlighted. Particularly, the advent of parallel computer systems not only has opened a new horizon of parallel simulation, but also has greatly contributed to the speed-up of the execution of simulation. The implementation of parallel simulation, however, is not a easy job for those who accustomed to the existing computer systems. And it is also necessarily confronted with the problem of synchronization conflict in the process. Thus, how to allow a wider community of users to gain access to parallel simulation while solving synchronization conflicts has become an important issue in simulation study. As a method to solve these problems, this paper is primarily concerned with the implementation of GPSS which is a generally used simulation language for discrete event simulation, onto a parallel computer using C-LINDA. For that, this paper, is to suggest a model and algorithm and to experiment it using a case.

  • PDF

병렬 스트림암호를 위한 m-병렬 비선형 결합함수에 관한 연구 (A study on the m-Parallel Nonlinear Combine functions for the Parallel Stream Cipher)

  • 이훈재;문상재
    • 한국통신학회논문지
    • /
    • 제27권4A호
    • /
    • pp.301-309
    • /
    • 2002
  • 본 논문에서는 병렬 이동형 PS-LFSR을 활용한 여러 가지 형태의 m-병렬 비선형 결합함수에 대하여 제안하고, 이들의 효율적인 구현 방안을 검토하였다. 즉, m-병렬 비메모리-비선형 결합함수, m-병렬 메모리-비선형 결합함수, m-병렬 비선형 필터함수 및 m-병렬 클럭 조절형 결합함수 등 4가지 형태의 m-병렬 비선형 결합함수와 이들의 효율적인 병렬 구현 방안을 제안하였고, 마지막으로 클럭 조절형 LILI-128의 병렬구현 기법을 예시하여 안전성과 성능을 분석하였다.

다중컴퓨터를 이용한 신경회로망 기반 실시간 자동 표적인식시스템의 병렬구현 (Parallel implementation of a neural network-based realtime ATR system using a multicomputer)

  • 전준형;김성완;김진호;최흥문
    • 전자공학회논문지B
    • /
    • 제33B권2호
    • /
    • pp.197-208
    • /
    • 1996
  • A neural network-based PSRI(position, scale, and rotation invariant) feature extraction and ATR (automatic target recognition) system are proposed and an efficient parallel implementatio of the proposed system using multicomputer is also presented. In the proposed system, the scale and rotationinvariant features are extracted from the contour projection of the number of edge pixels on each of the concentric circles, which is input t the cooperative network. We proposed how to decide the optimum depth and the width of the parallel pipeline system for real time applications by modeling the proposed system into a parallel pipeline implementation method using transputers is also proposed. The implementation results show that we can extract PSRI features less sensitive to input variations, and the speedup of the proposed ATR system is about 7.55 for the various rotated and scaled targets using 8-node transputer system.

  • PDF

Design and Implementation of a Latency Efficient Encoder for LTE Systems

  • Hwang, Soo-Yun;Kim, Dae-Ho;Jhang, Kyoung-Son
    • ETRI Journal
    • /
    • 제32권4호
    • /
    • pp.493-502
    • /
    • 2010
  • The operation time of an encoder is one of the critical implementation issues for satisfying the timing requirements of Long Term Evolution (LTE) systems because the encoder is based on binary operations. In this paper, we propose a design and implementation of a latency efficient encoder for LTE systems. By virtue of 8-bit parallel processing of the cyclic redundancy checking attachment, code block (CB) segmentation, and a parallel processor, we are able to construct engines for turbo codings and rate matchings of each CB in a parallel fashion. Experimental results illustrate that although the total area and clock period of the proposed scheme are 19% and 6% larger than those of a conventional method based on a serial scheme, respectively, our parallel structure decreases the latency by about 32% to 65% compared with a serial structure. In particular, our approach is more latency efficient when the encoder processes a number of CBs. In addition, we apply the proposed scheme to a real system based on LTE, so that the timing requirement for ACK/NACK transmission is met by employing the encoder based on the parallel structure.

A Parallel Search Algorithm and Its Implementation for Digital k-Winners-Take-All Circuit

  • Yoon, Myungchul
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • 제15권4호
    • /
    • pp.477-483
    • /
    • 2015
  • The k-Winners-Take-All (kWTA) is an operation to find the largest k (>1) inputs among N inputs. Parallel search algorithm of kWTA for digital inputs is not invented yet, so most of digital kWTA architectures have O(N) time complexity. A parallel search algorithm for digital kWTA operation and the circuits for its VLSI implementation are presented in this paper. The proposed kWTA architecture can compare all inputs simultaneously in parallel. The time complexity of the new architecture is O(logN), so that it is scalable to a large number of digital data. The high-speed kWTA operation and its O(logN) dependency of the new architecture are verified by simulations. It takes 290 ns in searching for 5 winners among 1024 of 32 bit data, which is more than thousands of times faster than existing digital kWTA circuits, as well as existing analog kWTA circuits.