• Title/Summary/Keyword: Parallel Processing Architecture

Search Result 394, Processing Time 0.028 seconds

A Study on GPU-based Iterative ML-EM Reconstruction Algorithm for Emission Computed Tomographic Imaging Systems (방출단층촬영 시스템을 위한 GPU 기반 반복적 기댓값 최대화 재구성 알고리즘 연구)

  • Ha, Woo-Seok;Kim, Soo-Mee;Park, Min-Jae;Lee, Dong-Soo;Lee, Jae-Sung
    • Nuclear Medicine and Molecular Imaging
    • /
    • v.43 no.5
    • /
    • pp.459-467
    • /
    • 2009
  • Purpose: The maximum likelihood-expectation maximization (ML-EM) is the statistical reconstruction algorithm derived from probabilistic model of the emission and detection processes. Although the ML-EM has many advantages in accuracy and utility, the use of the ML-EM is limited due to the computational burden of iterating processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on GPU (graphic processing unit) for ML-EM algorithm. Materials and Methods: Using Geforce 9800 GTX+ graphic card and CUDA (compute unified device architecture) the projection and backprojection in ML-EM algorithm were parallelized by NVIDIA's technology. The time delay on computations for projection, errors between measured and estimated data and backprojection in an iteration were measured. Total time included the latency in data transmission between RAM and GPU memory. Results: The total computation time of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 see, respectively. In this case, the computing speed was improved about 15 times on GPU. When the number of iterations increased into 1024, the CPU- and GPU-based computing took totally 18 min and 8 see, respectively. The improvement was about 135 times and was caused by delay on CPU-based computing after certain iterations. On the other hand, the GPU-based computation provided very small variation on time delay per iteration due to use of shared memory. Conclusion: The GPU-based parallel computation for ML-EM improved significantly the computing speed and stability. The developed GPU-based ML-EM algorithm could be easily modified for some other imaging geometries.

Simulation of YUV-Aware Instructions for High-Performance, Low-Power Embedded Video Processors (고성능, 저전력 임베디드 비디오 프로세서를 위한 YUV 인식 명령어의 시뮬레이션)

  • Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.13 no.5
    • /
    • pp.252-259
    • /
    • 2007
  • With the rapid development of multimedia applications and wireless communication networks, consumer demand for video-over-wireless capability on mobile computing systems is growing rapidly. In this regard, this paper introduces YUV-aware instructions that enhance the performance and efficiency in the processing of color image and video. Traditional multimedia extensions (e.g., MMX, SSE, VIS, and AltiVec) depend solely on generic subword parallelism whereas the proposed YUV-aware instructions support parallel operations on two-packed 16-bit YUV (6-bit Y, 5-bits U, V) values in a 32-bit datapath architecture, providing greater concurrency and efficiency for color image and video processing. Moreover, the ability to reduce data format size reduces system cost. Experiment results on a representative dynamically scheduled embedded superscalar processor show that YUV-aware instructions achieve an average speedup of 3.9x over the baseline superscalar performance. This is in contrast to MMX (a representative Intel#s multimedia extension), which achieves a speedup of only 2.1x over the same baseline superscalar processor. In addition, YUV-aware instructions outperform MMX instructions in energy reduction (75.8% reduction with YUV-aware instructions, but only 54.8% reduction with MMX instructions over the baseline).

Implementation of High-radix Modular Exponentiator for RSA using CRT (CRT를 이용한 하이래딕스 RSA 모듈로 멱승 처리기의 구현)

  • 이석용;김성두;정용진
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.10 no.4
    • /
    • pp.81-93
    • /
    • 2000
  • In a methodological approach to improve the processing performance of modulo exponentiation which is the primary arithmetic in RSA crypto algorithm, we present a new RSA hardware architecture based on high-radix modulo multiplication and CRT(Chinese Remainder Theorem). By implementing the modulo multiplier using radix-16 arithmetic, we reduced the number of PE(Processing Element)s by quarter comparing to the binary arithmetic scheme. This leads to having the number of clock cycles and the delay of pipelining flip-flops be reduced by quarter respectively. Because the receiver knows p and q, factors of N, it is possible to apply the CRT to the decryption process. To use CRT, we made two s/2-bit multipliers operating in parallel at decryption, which accomplished 4 times faster performance than when not using the CRT. In encryption phase, the two s/2-bit multipliers can be connected to make a s-bit linear multiplier for the s-bit arithmetic operation. We limited the encryption exponent size up to 17-bit to maintain high speed, We implemented a linear array modulo multiplier by projecting horizontally the DG of Montgomery algorithm. The H/W proposed here performs encryption with 15Mbps bit-rate and decryption with 1.22Mbps, when estimated with reference to Samsung 0.5um CMOS Standard Cell Library, which is the fastest among the publications at present.

Highly Efficient and Low Power FIR Filter Chip for PRML Read Channel (PRML Read Channel용 고효율, 저전력 FIR 필터 칩)

  • Jin Yong, Kang;Byung Gak, Jo;Myung Hoon, Sunwoo
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.41 no.9
    • /
    • pp.115-124
    • /
    • 2004
  • This paper proposes a high efficient and low power FIR filter chip for partial-response maximum likelihood (PRML) disk drive read channels; it is a 6-bit, 8-tap digital FIR filter. The proposed filter employs a parallel processing architecture and consists of 4 pipeline stages. It uses the modified Booth algorithm for multiplication and compressor logic for addition. CMOS pass-transistor logic is used for low power consumption and single-rail logic is used to reduce the chip area. The proposed filter is actually implemented and the chip dissipates 120mV at 100MHz, uses a 3.3V power supply and occupies 1.88 ${\times}$ 1.38 $\textrm{mm}^2$. The implemented filter requires approximately 11.7% less power compared with the existing architectures that use the similar technology.

Incremental Method for Developing Software Product Family (소프트웨어 제품 군을 개발하기 위한 점진적 방법)

  • Joo, Bok-Gyu;Kim, Young-Chul
    • The KIPS Transactions:PartD
    • /
    • v.10D no.4
    • /
    • pp.697-708
    • /
    • 2003
  • In a software product line approach, developers first develop common software architecture and components by analyzing the characteristics of all software members, and then produce each application by integrating components. The approach is considered very effective means for developing and maintaining in parallel a software product family. Main disadvantage of this approach is that it requires a big up-front investment in preparing product line. Therefore, it takes time to deliver the first version. In this paper, we present an incremental method to develop software families, which requires small additional cost for initial versions and allows an organization to move smoothly to full-scale product line. We present our method by explaining how to record and upgrade the results of variations analysis, and show the application of our method by developing a family of YBS. Our method is a low-risk approach that can be effectively applied to an organization that starts developing software systems but has to deliver the first versions quickly to the market.

Low-power Hardware Design of Deblocking Filter in HEVC In-loop Filter for Mobile System (모바일 시스템을 위한 저전력 HEVC 루프 내 필터의 디블록킹 필터 하드웨어 설계)

  • Park, Seungyong;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.21 no.3
    • /
    • pp.585-593
    • /
    • 2017
  • In this paper, we propose a deblocking filter hardware architecture for low-power HEVC (High-Efficiency Video Coding) in-loop for mobile systems. HEVC performs image compression on a block-by-block basis, resulting in blockage of the image due to quantization error. The deblocking filter is used to remove the blocking phenomenon in the image. Currently, UHD video service is supported in various mobile systems, but power consumption is high. The proposed low-power deblocking filter hardware structure minimizes the power consumption by blocking the clock to the internal module when the filter is not applied. It also has four parallel filter structures for high throughput at low operating frequencies and each filter is implemented in a four-stage pipeline. The proposed deblocking filter hardware structure is designed with Verilog HDL and synthesized using TSMC 65nm CMOS standard cell library, resulting in about 52.13K gates. In addition, real-time processing of 8K@84fps video is possible at 110MHz operating frequency, and operation power is 6.7mW.

(Theoretical Performance analysis of 12Mbps, r=1/2, k=7 Viterbi deocder and its implementation using FPGA for the real time performance evaluation) (12Mbps, r=1/2, k=7 비터비 디코더의 이론적 성능분석 및 실시간 성능검증을 위한 FPGA구현)

  • Jeon, Gwang-Ho;Choe, Chang-Ho;Jeong, Hae-Won;Im, Myeong-Seop
    • Journal of the Institute of Electronics Engineers of Korea SC
    • /
    • v.39 no.1
    • /
    • pp.66-75
    • /
    • 2002
  • For the theoretical performance analysis of Viterbi Decoder for wireless LAN with data rate 12Mbps, code rate 1/2 and constraint length 7 defined in IEEE 802.11a, the transfer function is derived using Cramer's rule and the first-event error probability and bit error probability is derived under the AWGN. In the design process, input symbol is quantized into 16 steps for 4 bit soft decision and register exchange method instead of memory method is proposed for trace back, which enables the majority at the final decision stage. In the implementation, the Viterbi decoder based on parallel architecture with pipelined scheme for processing 12Mbps high speed data rate and AWGN generator are implemented using FPGA chips. And then its performance is verified in real time.

Design of Intra Prediction Circuit for HEVC and H.264 Multi-decoder Supporting UHD Images (UHD 영상을 지원하는 HEVC 및 H.264 멀티 디코더 용 인트라 예측 회로 설계)

  • Yu, Sanghyun;Cho, Kyeongsoon
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.53 no.12
    • /
    • pp.50-56
    • /
    • 2016
  • This paper proposes the architecture and design of intra prediction circuit for a multi-decoder supporting UHD images. The proposed circuit supports not only the latest video compression standard HEVC but also H.264. In addition to the basic function of performing intra prediction, this circuit has the capability of performing the reference sample filter operation defined in the H.264 standard, and the smoothing and strong sample filter operations defined in the HEVC standard. We reduced the circuit size by sharing the circuit blocks for common operations and internal storage, and improved the circuit performance by parallel processing. The proposed circuit was described at RTL using Verilog HDL and its functionality was verified by using NC-Verilog of Cadence. The RTL circuit was synthesized by using Design Compiler of Synopsys and 130nm standard cell library. The synthesized gate-level circuit consists of 69,694 gates and processes 100 ~ 280 frames per second for 4K-UHD HEVC images at the maximum operation frequency of 157MHz.

A Study on the Full-HD HEVC Encoder IP Design (고해상도 비디오 인코더 IP 설계에 대한 연구)

  • Lee, Sukho;Cho, Seunghyun;Kim, Hyunmi;Lee, Jehyun
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.52 no.12
    • /
    • pp.167-173
    • /
    • 2015
  • This paper presents a study on the Full-HD HEVC(High Efficiency Video Coding) encoder IP(Intellectual Property) design. The designed IP is for HEVC main profile 4.1, and performs encoding with a speed of 60 fps of full high definition. Before hardware and software design, overall reference model was developed with C language, and we proposed a parallel processing architecture for low-power consumption. And also we coded firmware and driver programs relating IP. The platform for verification of developed IP was developed, and we verified function and performance for various pictures under several encoding conditions by implementing designed IP to FPGA board. Compared to HM-13.0, about 35% decrease in bit-rate under same PSNR was achieved, and about 25% decrease in power consumption under low-power mode was performed.

Parallelizing Feature Point Extraction in the Multi-Core Environment for Reducing Panorama Image Generation Time (파노라마 이미지 생성시간을 단축하기 위한 멀티코어 환경에서 특징점 추출 병렬화)

  • Kim, Geon-Ho;Choi, Tai-Ho;Chung, Hee-Jin;Kwon, Bom-Jun
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.14 no.3
    • /
    • pp.331-335
    • /
    • 2008
  • In this paper, we parallelized a feature point extraction algorithm to reduce panorama image generation time in multi-core environment. While we compose a panorama image with several images, the step to extract feature points of each picture is needed to find overlapped region of pictures. To perform rapidly feature extraction stage which requires much calculation, we developed a parallel algorithm to extract feature points and examined the performance using CBE(Cell Broadband Engine) which is asymmetric multi-core architecture. As a result of the exam, the algorithm we proposed has a property of linear scalability-the performance is increased in proportion the number of processors utilized. In this paper, we will suggest how Image processing operation can make high performance result in multi-core environment.