Search | Korea Science

Bit-Level Systolic Array for Modular Multiplication (모듈러 곱셈연산을 위한 비트레벨 시스토릭 어레이)

최성욱
- Proceedings of the Korea Institutes of Information Security and Cryptology Conference / 1995.11a / pp.163-172 / 1995
In this paper, a bit-level 1-dimensional systolic array for modular multiplication is designed. First, parallel algorithms and data dependence graphs are derived from Walter's and Iwamura's methods, both based on the Montgomery algorithm for modular multiplication, and compared. Since Walter's method has fewer computational index points in its data dependence graph than Iwamura's, it is selected as the base algorithm. Following a systematic procedure for systolic array design, four 1-dimensional systolic arrays are obtained and evaluated against various criteria. An optimal 1-dimensional systolic array is then designed by modifying the array derived from the [0,1] projection direction: control logic is added and the communication paths of data A are serialized. It has a constant number of I/O channels, which makes it modularly expandable, and its unidirectional paths make it well suited to fault tolerance. It is therefore suitable for the RSA cryptosystem, which processes many consecutive message blocks of large size.
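As a point of reference for the abstract above, the following is a minimal software sketch of bit-serial (radix-2) Montgomery multiplication. It is the plain textbook loop, not Walter's or Iwamura's variant and not the systolic array itself; the function name `montgomery_multiply` and the parameter `k` (bit length of the modulus) are assumptions made for illustration.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2^(-k) mod n for an odd modulus n, with 0 <= a, b < n < 2^k.

    One bit of `a` is consumed per iteration, which is the kind of
    recurrence a bit-level array would unroll in hardware.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1      # i-th bit of a
        s += a_i * b            # conditional add of b
        if s & 1:               # make s even by adding the odd modulus
            s += n
        s >>= 1                 # exact division by 2
    if s >= n:                  # s < 2n holds here, so one subtraction suffices
        s -= n
    return s
```

Multiplying the result by 2^k modulo n recovers the ordinary product a * b mod n, which is how the sketch can be checked.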

IDEA Implementation On TMS320C54X DSP Board (TMS320C54X DSP보드를 이용한 IDEA 구현)

송종관;윤병우
- Journal of the Korea Institute of Information and Communication Engineering / v.3 no.1 / pp.69-74 / 1999
This paper describes the principles of IDEA (International Data Encryption Algorithm), which has been widely accepted as a data encryption algorithm, and addresses its implementation on the TMS320C54X DSP board. It also shows that the processing time is significantly reduced by adopting a high-speed algorithm for multiplication modulo (2^16+1). The results show data rates of about 250∼300Mbyte/sec.
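The abstract does not say which fast algorithm was used for multiplication modulo (2^16+1). A commonly used one is the "low-high" trick, sketched below under that assumption; the function name `idea_mul` is hypothetical. In IDEA the 16-bit value 0 stands for 2^16.

```python
def idea_mul(a, b):
    """Multiply two 16-bit words modulo 2^16 + 1, where 0 encodes 2^16."""
    a &= 0xFFFF
    b &= 0xFFFF
    if a == 0:                       # 2^16 is congruent to -1, so the product is -b
        return (1 - b) & 0xFFFF
    if b == 0:
        return (1 - a) & 0xFFFF
    p = a * b                        # 32-bit product
    lo = p & 0xFFFF
    hi = p >> 16
    # p = hi * 2^16 + lo is congruent to lo - hi; add the modulus once if negative
    return (lo - hi + (1 if lo < hi else 0)) & 0xFFFF
```

The point of the trick is that no division is needed: one multiplication, one subtraction and one conditional increment replace the general modular reduction.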

A Parallelising Algorithm for Matrix Arithmetics of Digital Signal Processings on VLIW Simulator (VLIW 시뮬레이터 상에서의 디지털 신호처리 행렬 연산에 대한 병렬화 알고리즘)

Song, Jin-Hee;Jun, Moon-Seog
- The Transactions of the Korea Information Processing Society / v.5 no.8 / pp.1985-1996 / 1998
This paper presents a parallelising algorithm for partitioning and mapping matrix/vector multiplication onto a linear processor array/VLIW simulator. First, we discuss methods for mapping an input matrix or vector onto processor arrays of arbitrary size. Then, we show how to partition a large computational problem to fit the size of the processor array. We execute the algorithm on the VLIW simulator and show its effectiveness. The results show that we achieved better parallelising performance on our VLIW simulator design than on the linear processor array.
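The abstract does not give the mapping itself, so the following is only a generic sketch of the partitioning idea: a matrix-vector product with more rows than processing elements is cut into bands of at most `num_pes` rows, and each band is processed in one pass. The names `partitioned_matvec` and `num_pes` are assumptions, and the inner loops are sequential stand-ins for work that would run in parallel.

```python
def partitioned_matvec(matrix, vector, num_pes):
    """Compute matrix * vector using bands of at most `num_pes` rows per pass."""
    if num_pes < 1:
        raise ValueError("need at least one processing element")
    result = []
    for start in range(0, len(matrix), num_pes):
        band = matrix[start:start + num_pes]   # rows mapped onto the PEs for this pass
        acc = [0] * len(band)                  # one accumulator per PE
        for j, v in enumerate(vector):         # vector elements streamed past the PEs
            for pe, row in enumerate(band):
                acc[pe] += row[j] * v
        result.extend(acc)
    return result
```

With 5 rows and `num_pes=2`, for example, the product is computed in three passes of 2, 2 and 1 rows.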

The Design of the Improved Adaptive Contrast Algorithm (개선된 적응형 콘트라스트 알고리즘 설계)

Choi, In-Seok;Youn, Jin-suk;Cho, Hwa-Hyun;Choi, Myung-Ryul
- Proceedings of the Korea Information Processing Society Conference / 2004.05a / pp.731-734 / 2004
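This paper proposes an improved contrast algorithm, built on the conventional stretching algorithm, to enhance the quality of input images. Each pixel of the input image is output as a new pixel after a predetermined weight is applied according to the range of its DR (Difference Range). An advantage of the method is that image quality can be improved in real time without any special user-defined settings. In addition, on the hardware side, multiplication and division are performed with barrel shifts, which reduces hardware complexity. To verify the proposed algorithm, visual verification was carried out in C, and hardware verification was confirmed through computer simulation using VHDL.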

Deep Learning-based Real-Time Super-Resolution Architecture Design (경량화된 딥러닝 구조를 이용한 실시간 초고해상도 영상 생성 기술)

Ahn, Saehyun;Kang, Suk-Ju
- Proceedings of the Korean Society of Broadcast Engineers Conference / 2020.11a / pp.228-229 / 2020
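Deep learning has recently been widely used in many computer vision applications, such as object recognition, classification and image generation. In super-resolution in particular, the use of deep learning has recently brought large performance gains. The fast super-resolution convolutional neural network (FSRCNN) is a well-known deep learning-based super-resolution algorithm that takes low-resolution input features extracted by several convolutional layers and outputs a super-resolution image from a deconvolutional layer. This paper proposes an FPGA-based convolutional neural network accelerator designed with parallel computation efficiency in mind. In particular, an energy-efficient accelerator is designed by converting the deconvolutional layer into convolutional layers. The paper also proposes Optimal-FSRCNN, a modification of the FSRCNN structure that takes FPGA resources into account. The number of multipliers used is reduced by a factor of 2.4 compared to FSRCNN, while PSNR, the metric used to evaluate super-resolution performance, is similar to that of FSRCNN. With this, a network optimized for FPGA was implemented, and a real-time image processing technique that outputs UHD images from FHD input images was developed.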
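The abstract does not describe how the deconvolutional layer is converted. One standard way to do it, shown here in 1-D and without padding as an illustrative assumption, is to split the kernel into `stride` sub-kernels and run one ordinary convolution per output phase. Both function names are hypothetical.

```python
def transposed_conv1d(x, w, stride):
    """Reference 1-D transposed convolution (deconvolution), no padding."""
    out = [0] * ((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            out[i * stride + j] += xi * wj
    return out


def transposed_conv1d_as_convs(x, w, stride):
    """Same output, computed as `stride` ordinary convolutions with sub-kernels."""
    out = [0] * ((len(x) - 1) * stride + len(w))
    for phase in range(stride):
        sub = w[phase::stride]                   # kernel taps that land on this phase
        for n in range(len(x) + len(sub) - 1):   # full convolution of x with sub
            acc = 0
            for m, wm in enumerate(sub):
                if 0 <= n - m < len(x):
                    acc += x[n - m] * wm
            out[n * stride + phase] = acc        # interleave the phase results
    return out
```

The second form never multiplies by the zeros that an upsample-then-convolve formulation would insert, which is one reason such a conversion can save multipliers.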

Exploring GEMM Optimization Techniques for PIM Architecture: A Case Study on UPMEM (PIM 아키텍처를 위한 GEMM 최적화 기법 탐구: UPMEM 사례 연구)

Chan Lee;Heelim Choi;Hanjun Kim
- Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.65-68 / 2024
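This study explores optimization techniques for General Matrix Multiplication (GEMM) on a PIM (Processing-in-Memory) architecture, using UPMEM PIM. It confirms the potential benefits of PIM for GEMM operations, obtained by overcoming the memory bandwidth limits experienced on the CPU and by exploiting a parallel processing structure. It also evaluates the efficiency of three consecutive matrix multiplications and confirms that data transfer time is the main bottleneck for performance optimization. By increasing the amount of data sent from the CPU to the UPMEM kernel in each transfer while reducing the number of transfers, performance was improved by up to 6.57 times compared to the CPU.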
https://doi.org/10.3745/PKIPS.y2024m05a.65

An Efficient Array Algorithm for VLSI Implementation of Vector-radix 2-D Fast Discrete Cosine Transform (Vector-radix 2차원 고속 DCT의 VLSI 구현을 위한 효율적인 어레이 알고리듬)

신경욱;전흥우;강용섬
- The Journal of Korean Institute of Communications and Information Sciences / v.18 no.12 / pp.1970-1982 / 1993
This paper describes an efficient array algorithm for parallel computation of the vector-radix two-dimensional (2-D) fast discrete cosine transform (VR-FCT), and its VLSI implementation. By mapping the 2-D VR-FCT onto a 2-D array of processing elements (PEs), the butterfly structure of the VR-FCT can be efficiently implemented with high concurrency and local communication geometry. The proposed array algorithm features architectural modularity, regularity and locality, so it is very suitable for VLSI realization. Also, no transposition memory is required, which is inevitable in the conventional row-column decomposition approach. It has a time complexity of $O(N + N_{nzd} \cdot \log_2 N)$ for an $(N \times N)$ 2-D DCT, where $N_{nzd}$ is the number of non-zero digits in the canonic-signed digit (CSD) code. By adopting CSD arithmetic in the circuit design, the number of additions is reduced by about 30% compared to 2's complement arithmetic. A computational accuracy analysis for finite wordlength processing is presented. From simulation results, it is estimated that an $(8 \times 8)$ 2-D DCT (with $N_{nzd}=4$) can be computed in about 0.88 $\mu$sec at a 50 MHz clock frequency, resulting in a throughput rate of about 72 Mega pixels per second.
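To make the role of the non-zero digit count concrete, here is a small sketch of CSD recoding for non-negative integers and of a multiplication driven by the recoded digits. It uses the standard non-adjacent-form recoding; the function names are assumptions and nothing here is taken from the paper's circuit.

```python
def csd_digits(value):
    """Return the CSD digits (each -1, 0 or +1) of a non-negative integer, LSB first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    while value:
        if value & 1:
            d = 2 - (value & 3)   # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def nonzero_count(digits):
    """Number of non-zero digits, i.e. the add/subtract operations a multiplier needs."""
    return sum(1 for d in digits if d)


def shift_add_multiply(x, digits):
    """Multiply integer x by the constant encoded in `digits` using shifts and adds/subtracts."""
    total = 0
    for i, d in enumerate(digits):
        if d:
            total += d * (x << i)
    return total
```

For the constant 7, binary needs three non-zero digits (111) while CSD needs two (8 - 1), so `shift_add_multiply` performs one subtraction and one addition instead of three additions.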

A Degree of Difficulty in Operations Area in Elementary Mathematics (초등수학에서 연산영역의 곤란도 분석)

Ahn, Byoung-Gon
- Journal of Elementary Mathematics Education in Korea / v.13 no.1 / pp.17-30 / 2009
This paper examines the basic skills of the four operations in the numbers and operations area from step 1 to step 3 of elementary mathematics. The results of the evaluation are as follows. First, addition and subtraction take the most time. The average difficulty rate in the operations area is 91.2%, and most students understand the textbook contents well. In particular, students easily understand step 1. However, subtraction has a lower difficulty rate than addition. Also, mixed computation with three numbers, horizontal calculation, and rounding (rounding down) are difficult areas for students. The contents of step 2 are fully understood. However, many mistakes are found in the process of rounding (rounding down), and word problems are considered difficult. Second, multiplication is first introduced in step 2-Ga. The unit 'Multiplication 99' takes 13 hours, the longest. The difficulty rate in this unit is 89.4%, so students understand it well. However, addition and subtraction errors carry over into the multiplication process, and students have difficulty converting word problems into multiplication expressions. Third, division, which starts in step 3-Ga, has a difficulty rate of 89.9%, and students understand it well. In conclusion, most students understand the four operations well, but accurate concepts, the relationship between multiplication and division, and specific instruction in the principles of division calculation and in word problems are needed. Because the amount of content and the difficulty rate depend on each school's situation, suggesting a universal standard is hard. However, broader studies with more subjects and more specific studies will help in suggesting proper contents and effective teaching.

A Search for an Alternative Articulation and Treatment on the Complex Numbers in Grade - 10 Mathematics Textbook (고등학교 10-가 교과서 복소수 단원에 관한 논리성 분석연구)

Yang, Eun-Young;Lee, Young-Ha
- School Mathematics / v.10 no.3 / pp.357-374 / 2008
The complex number system is supposed to be introduced in the first chapter of the first grade of high school. When the number system is expanded to the complex numbers, the main aim, with regard to the flow of the curriculum and textbooks, is to understand the preservation of algebraic structure. This research reviewed the overall articulation and treatment in textbooks from a logical viewpoint. Two research questions were developed. First, within the structure of the current curriculum and considering students' 'level', how do the articulation and treatment of the complex number unit in textbooks hold up from a logical point of view? Second, what would be a more logical alternative articulation and treatment? Which alternative articulation and treatment are suitable for a running goal, and what definitive improvements are there?

VLSI Design for Folded Wavelet Transform Processor using Multiple Constant Multiplication (MCM과 폴딩 방식을 적용한 웨이블릿 변환 장치의 VLSI 설계)

Kim, Ji-Won;Son, Chang-Hoon;Kim, Song-Ju;Lee, Bae-Ho;Kim, Young-Min
- Journal of Korea Multimedia Society / v.15 no.1 / pp.81-86 / 2012
This paper presents a VLSI design for a lifting-based discrete wavelet transform (DWT) 9/7 filter using a multiplierless multiple constant multiplication (MCM) architecture. The proposed design is based on the lifting scheme and uses pattern search for a folded architecture. Shift-add operations are adopted to optimize the multiplication process. The conventional serial operations of the lifting data flow can be optimized into parallel ones by employing parallelization and pipelining techniques. The optimized design has a simple hardware architecture and requires less computation without performance degradation. Furthermore, hardware utilization reaches 100%, and the number of registers required is significantly reduced. To compare our work with previous methods, we implemented the architecture in Verilog HDL. We also ran simulations based on logic synthesis using $0.18{\mu}m$ CMOS standard cells. The proposed architecture shows hardware reductions of up to 60.1% and 44.1%, respectively, at a 200 MHz clock compared to previous works. These implementation results indicate that the proposed design performs efficiently in terms of hardware cost, area, and power consumption.
https://doi.org/10.9717/kmms.2012.15.1.081
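For orientation, the lifting data flow of the 9/7 filter can be sketched in floating point as below. This is only the textbook four-step lifting structure with symmetric boundary extension on an even-length signal; the final scaling step is omitted because conventions differ, and the constant multiplications that the paper replaces with shift-add MCM networks are left as ordinary multiplications. All names are assumptions.

```python
# Commonly quoted lifting coefficients of the CDF 9/7 wavelet.
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522


def _predict(s, d, c):
    """d[i] += c * (s[i] + s[i+1]), mirroring at the right edge."""
    n = len(d)
    return [d[i] + c * (s[i] + s[min(i + 1, n - 1)]) for i in range(n)]


def _update(s, d, c):
    """s[i] += c * (d[i-1] + d[i]), mirroring at the left edge."""
    return [s[i] + c * (d[max(i - 1, 0)] + d[i]) for i in range(len(s))]


def forward_97(x):
    """Return (low-pass, high-pass) lifting outputs for an even-length signal."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("signal length must be even and at least 2")
    s, d = list(x[0::2]), list(x[1::2])   # split into even and odd samples
    d = _predict(s, d, ALPHA)
    s = _update(s, d, BETA)
    d = _predict(s, d, GAMMA)
    s = _update(s, d, DELTA)
    return s, d


def inverse_97(s, d):
    """Undo forward_97 by running the lifting steps backwards with negated coefficients."""
    s = _update(s, d, -DELTA)
    d = _predict(s, d, -GAMMA)
    s = _update(s, d, -BETA)
    d = _predict(s, d, -ALPHA)
    x = [0.0] * (2 * len(s))
    x[0::2] = s
    x[1::2] = d
    return x
```

Each lifting step multiplies by one fixed constant, which is why the structure lends itself to multiplierless constant-multiplication hardware.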
