Search | Korea Science

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
- Journal of KIISE:Computing Practices and Letters / v.16 no.7 / pp.775-787 / 2010
Unlike general-purpose CPUs, GPUs are specialized as many-core streaming processors, and thanks to their outstanding parallel computing capacity they are replacing CPUs in a growing range of computations. In response to this trend, NVIDIA recently released a parallel computing architecture called CUDA (Compute Unified Device Architecture), which offers a flexible GPU programming environment for GPGPU (General-Purpose GPU) computing. To produce efficient parallel software with the CUDA API, programmers generally need a clear understanding of many aspects of the GPU's computing architecture. In this article, we explain several optimization techniques for CUDA programming that we have verified through extensive experiments and trial and error, and review how those techniques affect execution performance. In particular, we use a specific problem as an example to analyze several factors that affect performance, such as effective access to the hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several guidelines that can be applied effectively in CUDA-based parallel programming.

Data De-duplication and Recycling Technique in SSD-based Storage System for Increasing De-duplication Rate and I/O Performance (SSD 기반 스토리지 시스템에서 중복률과 입출력 성능 향상을 위한 데이터 중복제거 및 재활용 기법)

Kim, Ju-Kyeong;Lee, Seung-Kyu;Kim, Deok-Hwan
- Journal of the Institute of Electronics and Information Engineers / v.49 no.12 / pp.149-155 / 2012
An SSD is a storage device that has a high-performance controller and cache buffer and consists of many NAND flash memories. Because NAND flash memory does not support in-place updates, valid pages are invalidated when the file system issues update and erase operations, and the invalid pages are later deleted completely through garbage collection. However, garbage collection performs many long-latency erase operations, which reduces I/O performance and accelerates wear on the SSD. In this paper, we propose a new method of de-duplicating valid data and recycling invalid data. Because the method de-duplicates valid data and then also recycles invalid data, it improves the de-duplication ratio. By reducing the number of writes and garbage collections, the method can increase I/O performance and reduce wear on the SSD. Experimental results show that it reduces the number of garbage collections by a maximum of 20% and I/O latency by 9% compared with the general case.
https://doi.org/10.5573/ieek.2012.49.12.149
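
The idea of de-duplicating valid data and recycling invalid data can be illustrated with a small sketch. This is not the authors' implementation: the `PageStore` class, its fields, and the use of SHA-256 fingerprints are all assumptions made for illustration, and a real flash translation layer would work on physical pages and blocks rather than Python dictionaries.

```python
import hashlib


class PageStore:
    """Hypothetical model of a page store with de-duplication and recycling."""

    def __init__(self):
        self.valid = {}        # fingerprint -> number of logical pages referencing it
        self.invalid = set()   # fingerprints of invalidated pages not yet erased
        self.mapping = {}      # logical address -> fingerprint
        self.physical_writes = 0

    def _release(self, lba):
        """Drop the old content of a logical page; invalidate it if unreferenced."""
        old = self.mapping.pop(lba, None)
        if old is None:
            return
        self.valid[old] -= 1
        if self.valid[old] == 0:
            del self.valid[old]
            self.invalid.add(old)

    def write(self, lba, data: bytes):
        """Write data to a logical address, avoiding a physical write when possible."""
        fingerprint = hashlib.sha256(data).hexdigest()
        self._release(lba)
        if fingerprint in self.valid:
            # De-duplication: identical valid data already exists.
            self.valid[fingerprint] += 1
        elif fingerprint in self.invalid:
            # Recycling: an invalidated page still holds this content.
            self.invalid.remove(fingerprint)
            self.valid[fingerprint] = 1
        else:
            # New content: a physical write is required.
            self.physical_writes += 1
            self.valid[fingerprint] = 1
        self.mapping[lba] = fingerprint
```

In this model, only content that is neither valid nor awaiting garbage collection costs a physical write, which is the effect the abstract attributes to the proposed method.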

A Reconfigurable Multiplier Architecture Based on Memristor-CMOS Technology (멤리스터-CMOS 기반의 재구성 가능한 곱셈기 구조)

Park, Byungsuk;Lee, Sang-Jin;Jang, Young-Jo;Eshraghian, Kamran;Cho, Kyoungrok
- Journal of the Institute of Electronics and Information Engineers / v.51 no.10 / pp.64-71 / 2014
A multiplier performs a complex arithmetic operation used in various signal processing algorithms, such as those in multimedia and communication systems. It also suffers from relatively large signal propagation delay, high power dissipation, and a large area requirement. This paper presents a memristor-CMOS based reconfigurable multiplier that reduces the area of the multiplier circuitry and increases compatibility across various applications by using an optimized bit-width. The performance of the memristor-CMOS based reconfigurable multiplier is estimated with a memristor SPICE model and a 180 nm CMOS process under a 1.8 V supply voltage. The circuit shows improvements of 61% in area, 38% in delay, and 28% in power consumption compared with conventional reconfigurable multipliers. It also achieves a 22% area reduction against a twin-precision multiplier.
https://doi.org/10.5573/ieie.2014.51.10.064

An Efficient Pitch Estimation for IMBE (Improved Multi-band Excitation) Speech Coder (개량형 다중대역 여기 (IMBE: Improved Multi-band Excitation) 음성 부호기의 피치 예측 개선)

Na, Hoon;Jeong, Dae-Gwon
- The Journal of the Acoustical Society of Korea / v.20 no.3 / pp.34-41 / 2001
In an IMBE (Improved Multi-band Excitation) speech coder, initial pitch estimation takes up most of the coder's total computing time because of its complex cost function and exhaustive search over candidate pitches. The use of future frames in initial pitch estimation causes an inevitable time delay, which makes it difficult to implement a real-time coder. Furthermore, unvoiced frames undergo the same pitch estimation as voiced frames even though it is unnecessary. In this paper, each frame is classified as voiced or unvoiced using the Dyadic Wavelet Transform (DyWT), and initial pitch estimation is then performed only for voiced frames. Voiced and unvoiced frames are therefore handled by different pitch estimation algorithms, which reduces the time delay at the transmitter and receiver. Simulation results show that the relative complexity of initial pitch estimation is reduced by 23%, and the processing time drops to between 1/10 and 1/11 of that of the IMBE coder, while speech quality is almost maintained.

Design of an USB Security Framework for Double Use Detection (이중사용 방지를 위한 USB 보안 프레임워크의 설계)

Jeong, Yoon-Su;Lee, Sang-Ho
- Journal of the Korea Society of Computer and Information / v.16 no.4 / pp.93-99 / 2011
Recently, with the development of Internet technology, users increasingly store and use personal data on USB devices. However, there is a critical issue: because no authentication is required to use the data, personal data can be exposed for malicious purposes. This paper proposes a USB security framework that prevents duplicate use of personal data in order to protect the data stored on a USB device. When a USB security product is used in a different network, the proposed framework authenticates the user with an additional 4 bytes of user identification data and the USB security token usage choice, placed before the authentication data. This increases communication overhead and service delay. In the experiment, the packet authentication delay of the proposed USB security framework is on average 7.6% higher than that of a simple USB driver and USB token, and the processing rate of the authentication server as a function of the number of USB devices also increases by 9.8% on average.
https://doi.org/10.9708/jksci.2011.16.4.093

An Efficient Bit Stream Instruction-set for Network Packet Processing Applications (네트워크 패킷 처리를 위한 효율적인 비트 스트림 명령어 세트)

Yoon, Yeo-Phil;Lee, Yong-Surk;Lee, Jung-Hee
- Journal of the Institute of Electronics Engineers of Korea SD / v.45 no.10 / pp.53-58 / 2008
This paper proposes a new set of instructions to improve the packet processing capacity of a network processor. The proposed instructions achieve more efficient packet processing by accelerating the integration of packet headers. Furthermore, a hardware configuration dedicated to processing overlay instructions was designed to reduce the additional hardware cost. For this purpose, the basic architecture of the network processor was designed using LISA, and the overlay block was optimized based on a barrel shifter. The block was synthesized to compare area and operation delay, and was mapped to a C-level macro function using the compiler known function (CKF). The improvement in performance was confirmed by comparing the execution cycles and execution time of an application program. Experiments were conducted using the Processor Designer and Compiler Designer tools from Coware. Synthesis results using Synopsys with the TSMC $0.25{\mu}m$ process indicated a 20.7% reduction in operation delay, and the proposed instruction set improved performance by 30.8% over the entire execution cycle.

Design of BCH Code Decoder using Parallel CRC Generation (병렬 CRC 생성 방식을 활용한 BCH 코드 복호기 설계)

Kal, Hong-Ju;Moon, Hyun-Chan;Lee, Won-Young
- The Journal of the Korea institute of electronic communication sciences / v.13 no.2 / pp.333-340 / 2018
This paper introduces a BCH code decoder that uses parallel CRC (Cyclic Redundancy Check) generation. A conventional parallel syndrome generator based on an LFSR (Linear Feedback Shift Register) takes up a lot of area for a short code. The proposed decoder uses the parallel CRC method that is widely used to compute checksums. Compared with the parallel LFSR, this scheme optimizes the syndrome generator in the decoder by eliminating redundant XOR operations, and thus minimizes chip area and propagation delay. In simulation, the proposed decoder achieved a propagation delay reduction of 2.01 ns compared with the conventional scheme. The proposed decoder was designed and synthesized in a $0.35-{\mu}m$ CMOS process.
https://doi.org/10.13067/JKIECS.2018.13.2.333
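
To make the contrast between serial LFSR-style and parallel CRC generation concrete, the following sketch computes the same CRC in two ways: one bit per step, as a shift register would, and eight bits per step through a precomputed table. It is only a software illustration under assumed parameters. The CRC-8 generator polynomial `0x07` and the function names are chosen for this example and are not taken from the paper, whose decoder is a hardware design for BCH syndromes.

```python
POLY = 0x07  # assumed generator x^8 + x^2 + x + 1, chosen only for illustration


def crc8_serial(data: bytes) -> int:
    """Bit-serial CRC: one shift-register step per input bit (LFSR style)."""
    crc = 0
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            msb = (crc >> 7) & 1
            crc = (crc << 1) & 0xFF
            if msb ^ bit:
                crc ^= POLY
    return crc


def _build_table() -> list:
    """Precompute the effect of eight consecutive shift steps for every byte value."""
    table = []
    for value in range(256):
        crc = value
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
        table.append(crc)
    return table


TABLE = _build_table()


def crc8_parallel(data: bytes) -> int:
    """Byte-parallel CRC: eight input bits are absorbed in a single step."""
    crc = 0
    for byte in data:
        crc = TABLE[crc ^ byte]
    return crc
```

Both functions return the same remainder for any input; the parallel form trades a lookup structure (in hardware, a fixed XOR network) for fewer sequential steps.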

Design and Implementation HDTV Relay Transmission System for Overlay Multicast (오버레이 멀티캐스트를 위한 HDTV 중계전송 시스템 설계 및 구현)

Son, Seung-Chul;Kwag, Yong-Wan;Heo, Kwon;Lee, Hyung-Ok;Nam, Ji-Seung
- The Journal of Korean Institute of Communications and Information Sciences / v.32 no.1A / pp.57-65 / 2007
Overlay multicast, recently proposed as an alternative to IP multicast, has been gaining support as the computing power of hardware and advances in network techniques make it feasible to perform routing at the application level. In overlay multicast, system resources and network bandwidth must be used efficiently to deliver real-time HDTV video. Specifically, the system must take into account the delay and jitter that can be incurred at the application level. In this paper, we implement a server and a client for broadcasting HDTV in a session formed by an existing overlay multicast protocol. The broadcasting server provides the service using a TV tuner, an HDTV camcorder, and files, and the clients constituting a multicast group relay the received data to other clients. The information that clients report periodically to the server, including their delay and network state, is used to maintain the overlay session. The implementation is based on DirectX, and its performance is evaluated on a LAN test bed.

Design of a 64×64-Bit Modified Booth Multiplier Using Current-Mode CMOS Quarternary Logic Circuits (전류모드 CMOS 4치 논리회로를 이용한 64×64-비트 변형된 Booth 곱셈기 설계)

Kim, Jeong-Beom
- The KIPS Transactions:PartA / v.14A no.4 / pp.203-208 / 2007
This paper proposes a $64{\times}64$ Modified Booth multiplier using CMOS multi-valued logic circuits. The multiplier, based on the radix-4 algorithm, is designed with current-mode CMOS quaternary logic circuits. The designed multiplier reduces the transistor count by 64.4% compared with the voltage-mode binary multiplier. It is designed in a Samsung $0.35{\mu}m$ standard CMOS process with a 3.3 V supply voltage and a unit current of $5{\mu}A$. Its validity and effectiveness are verified through HSPICE simulation. The voltage-mode binary multiplier has an occupied area of $7.5{\times}9.4mm^2$, a maximum propagation delay of 9.8 ns, and an average power consumption of 45.2 mW. The proposed multiplier has a maximum propagation delay of 11.9 ns and an average power consumption of 49.7 mW, and reduces the occupied area by 42.5% compared with the voltage-mode binary multiplier.
https://doi.org/10.3745/KIPSTA.2007.14-A.4.203
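
The radix-4 (Modified Booth) algorithm mentioned in the abstract recodes the multiplier into digits from the set {-2, -1, 0, 1, 2}, which halves the number of partial products. The sketch below shows the standard recoding in software. The function names are hypothetical, and nothing here models the paper's current-mode quaternary circuits.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list:
    """Recode a two's-complement multiplier of width `bits` into radix-4 Booth digits.

    Digits are returned least significant first; each is in {-2, -1, 0, 1, 2}.
    """
    if bits <= 0 or bits % 2:
        raise ValueError("bits must be a positive even number")
    y = multiplier & ((1 << bits) - 1)   # two's-complement bit pattern
    y_ext = y << 1                       # append the implicit bit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        triplet = (y_ext >> i) & 0b111   # bits y[i+1], y[i], y[i-1]
        high = (triplet >> 2) & 1
        mid = (triplet >> 1) & 1
        low = triplet & 1
        digits.append(-2 * high + mid + low)
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Multiply by summing one partial product per radix-4 Booth digit."""
    total = 0
    for k, digit in enumerate(booth_radix4_digits(multiplier, bits)):
        total += digit * multiplicand * (4 ** k)
    return total
```

For a 64-bit multiplier this recoding yields 32 digits, so 32 partial products are summed instead of 64.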

A Tag Response Loss Detection Scheme for RFID Group Proof (RFID 그룹증명을 위한 응답손실 감지기법)

Ham, Hyoungmin
- The Journal of the Korea Contents Association / v.19 no.9 / pp.637-645 / 2019
The RFID group proof is an extension of the yoking proof, which proves that multiple tags are scanned by a reader simultaneously. Existing group proof schemes provide only delayed tag loss detection, which detects the loss of a tag response in the verification phase. However, delayed tag loss detection is not suitable for real-time applications where tag loss must be detected immediately. In this study, I propose a tag response loss detection scheme that quickly detects the loss of a tag response during the proof generation process. In the proposed scheme, the tag responds with the sequence number assigned to the tag group, and the reader detects the loss of a tag response through the sequence number. Through an indistinguishability experiment, I show that the sequence number is secure against a message analysis attack that tries to distinguish between specific tags and tag groups. In terms of efficiency, the proposed scheme requires fewer transmissions and database operations than existing techniques to determine which tag's response is lost.
https://doi.org/10.5392/JKCA.2019.19.09.637
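
The core idea of detecting a lost response through sequence numbers can be sketched as follows. This is a deliberately simplified model and not the protocol from the paper: it assumes that tags in a group carry sequence numbers 0 to `group_size - 1`, that responses arrive in increasing order, and that all cryptographic processing is omitted. The function name `detect_losses` is hypothetical.

```python
from typing import Iterable, Iterator


def detect_losses(responses: Iterable[int], group_size: int) -> Iterator[int]:
    """Yield the sequence number of each missing response as soon as a gap is seen.

    Assumes sequence numbers 0..group_size-1 arrive in increasing order.
    """
    expected = 0
    for seq in responses:
        # Any sequence numbers skipped before `seq` belong to lost responses.
        while expected < seq:
            yield expected
            expected += 1
        if seq == expected:
            expected += 1
    # Responses missing from the tail of the group.
    while expected < group_size:
        yield expected
        expected += 1
```

Because a gap is reported when the next response arrives, the reader does not need to wait for a later verification phase to learn that a response was lost.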
