• Title/Summary/Keyword: Bit-Parallel


Low-Complexity Soft-MIMO Detection Algorithm Based on Ordered Parallel Tree-Search Using Efficient Node Insertion (효율적인 노드 삽입을 이용한 순서화된 병렬 트리-탐색 기반 저복잡도 연판정 다중 안테나 검출 알고리즘)

  • Kim, Kilhwan;Park, Jangyong;Kim, Jaeseok
    • The Journal of Korean Institute of Communications and Information Sciences / v.37A no.10 / pp.841-849 / 2012
  • This paper proposes a low-complexity soft-output multiple-input multiple-output (soft-MIMO) detection algorithm that achieves soft-output maximum-likelihood (soft-ML) performance under the max-log approximation. The proposed algorithm is based on a parallel tree-search (PTS) that applies channel ordering by a sorted-QR decomposition (SQRD) with an altered sort order. The empty-set problem that can occur when calculating the log-likelihood ratio (LLR) for each bit is solved by inserting additional nodes at each search level. Since only the closest node is inserted among the nodes whose bit value is opposite to that of a selected node, the proposed node insertion scheme is very efficient in terms of computational complexity. The computational complexity of the proposed algorithm is approximately 37-74% of that of existing algorithms, and simulation results for a $4 \times 4$ system show a performance degradation of less than 0.1 dB.
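
The empty-set problem mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the candidate-list format, the function name `max_log_llr`, the sign convention (positive favours bit 0), and the omission of noise-variance scaling are all assumptions made for illustration.

```python
def max_log_llr(candidates, num_bits):
    """Max-log LLR per bit from a candidate list.

    candidates: list of (bits, metric) pairs, where bits is a tuple of 0/1
    values and metric is a distance (smaller is better). Hypothetical format.
    Returns one value per bit; None marks an empty set, meaning no candidate
    carries one of the two bit values, so the LLR cannot be computed.
    """
    llrs = []
    for k in range(num_bits):
        # Best (smallest) metric among candidates with bit k equal to 0 / 1.
        d0 = min((m for b, m in candidates if b[k] == 0), default=None)
        d1 = min((m for b, m in candidates if b[k] == 1), default=None)
        if d0 is None or d1 is None:
            llrs.append(None)      # empty-set problem for this bit
        else:
            llrs.append(d1 - d0)   # assumed convention: positive favours bit 0
    return llrs


# Both surviving candidates have bit 0 equal to 0, so bit 0 has an empty set.
survivors = [((0, 0), 1.0), ((0, 1), 2.5)]
before = max_log_llr(survivors, 2)

# Adding one candidate with the opposite bit value fills the empty set.
after = max_log_llr(survivors + [((1, 0), 4.0)], 2)
```

In this toy case a single extra candidate with the opposite bit value is enough to make every LLR computable, which is the effect the abstract attributes to node insertion.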

Implementation of 2,048-bit RSA Based on RNS(Residue Number Systems) (RNS(Residue Number Systems) 기반의 2,048 비트 RSA 설계)

  • 권택원;최준림
    • Journal of the Institute of Electronics Engineers of Korea SD / v.41 no.4 / pp.57-66 / 2004
  • This paper proposes the design of a 2,048-bit RSA based on an RNS (residue number system) Montgomery modular multiplier. Because an RNS performs fast parallel modular multiplication on a large word partitioned into small words, we introduce a Montgomery reduction method (MRM) [1] based on a Wallace-tree modular multiplier and 33 RNS bases of 64-bit size for RNS Montgomery modular multiplication. For fast RNS modular multiplication, a modified method based on the Chinese remainder theorem (CRT) [2] is also presented. We verified the RNS-based 2,048-bit RSA using Samsung 0.35 $\mu\mathrm{m}$ technology; a 2,048-bit RSA operation is performed in 2.54 ms at 100 MHz.
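
As a rough illustration of why an RNS allows parallel multiplication, the sketch below multiplies two integers residue by residue and reconstructs the product with the textbook CRT. It does not implement the paper's Montgomery reduction or its modified CRT method; the four small moduli and the function names are assumptions chosen for readability (the paper uses 33 bases of 64-bit size). Python 3.8 or later is assumed for `math.prod` and the modular inverse via `pow`.

```python
from math import prod


def to_rns(x, moduli):
    """Represent x by its residues modulo each (pairwise coprime) base."""
    return [x % m for m in moduli]


def rns_mul(a, b, moduli):
    """Multiply channel by channel; each channel is independent of the others,
    which is what makes the operation parallel in hardware."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]


def from_rns(residues, moduli):
    """Textbook CRT reconstruction (not the paper's modified method)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the inverse mod m
    return total % M


MODULI = [13, 17, 19, 23]  # toy bases; dynamic range is 13*17*19*23 = 96577
product = from_rns(rns_mul(to_rns(123, MODULI), to_rns(456, MODULI), MODULI), MODULI)
```

The result is exact only while the true product stays below the product of the moduli, which is why a 2,048-bit design needs many wide bases.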

Parallel Implementation of the Recursive Least Square for Hyperspectral Image Compression on GPUs

  • Li, Changguo
    • KSII Transactions on Internet and Information Systems (TIIS) / v.11 no.7 / pp.3543-3557 / 2017
  • Compression is a very important technique for remotely sensed hyperspectral images. Lossless compression based on recursive least squares (RLS), which eliminates the redundancy of hyperspectral images using both spatial and spectral correlations, is an extremely powerful tool for this purpose, but its relatively high computational complexity limits its application in time-critical scenarios. To improve the computational efficiency of the algorithm, we optimize its serial version and develop a new parallel implementation on graphics processing units (GPUs). Specifically, an optimized RLS based on an optimal number of prediction bands is first introduced. We then use this approach as a case study to illustrate the advantages and potential challenges of applying GPU parallel optimization principles to the problem considered. The proposed parallel method properly exploits the low-level architecture of GPUs and is implemented using the compute unified device architecture (CUDA). The GPU parallel implementation is compared with the serial implementation on a CPU. Experimental results indicate remarkable acceleration factors and real-time performance, while retaining exactly the same bit rate as the serial version of the compressor.

HPC(High Performance Computer) Linux Clustering for UltraSPARC(64bit-RISC processor) (UltraSPARC(64bit-RISC processor)을 위한 고성능 컴퓨터 리눅스 클러스터링)

  • 김기영;조영록;장종권
    • Proceedings of the IEEK Conference / 2003.11b / pp.45-48 / 2003
  • High-performance microprocessors and network systems are now easy to buy, and progress in computer architecture has brought high bandwidth and low delay. Coupling PC-based commodity technology with distributed computing methodologies provides an important advance in the development of single-user dedicated systems. Recently, high-performance, low-cost PCs or workstations have been joined over a network, so that a cluster system increasingly resembles a supercomputer. Unix, Linux, BSD, or NT (the Windows series) can be used as the operating system (OS) of a cluster system. We chose Linux for its low cost, high performance, and open technical documentation. This paper benchmarks the performance of Beowulf clustering on the UltraSPARC-1K (64-bit RISC processor). The benchmark tools used are MPI (Message Passing Interface) and NetPIPE. Beowulf is a class of experimental parallel workstations developed to evaluate and characterize the design space of this new operating point in price-performance.


An Enhancement of the MPEG-2 Audio Encoder Using General DSPs (범용 DSP를 이용한 MPEG-2 오디오 부호화기의 성능 개선)

  • 오현오;김성윤;윤대희;차일환;이준용
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 1997.11a / pp.63-67 / 1997
  • The ISO (International Organization for Standardization) has standardized MPEG-2 audio. The MPEG-2 audio compression algorithm is based on subband analysis and exploits human auditory characteristics to achieve a low bit rate with minimal perceptual loss of audio signal quality. This thesis presents an enhanced MPEG-2 audio encoder using multiple TMS320C30 general-purpose DSPs. The developed system is made up of five slave boards and one master board. Each slave board performs subband analysis and psychoacoustic parameter calculation for one channel, and the master board manages bit allocation, quantization, and bit-stream formatting for all channels. Parallel processing and pipelining techniques are used in the hardware structure, and fast algorithms are applied in each subroutine to achieve real-time processing. The implemented system supports multichannel audio up to 5.1 and various bit rates.


Implementation of 2-D DCT/IDCT VLSI based on Fully Bit-Serial Architecture (완전 비트 순차 구조에 근거한 2차원 DCT/IDCT VLSI 구현)

  • 임호근;류근장;권용무;김형곤
    • Journal of the Korean Institute of Telematics and Electronics A / v.31A no.6 / pp.188-198 / 1994
  • The distributed arithmetic approach is commonly recognized as an efficient method for implementing inner-product computations with fixed coefficients, such as the DCT/IDCT. This paper presents a novel architecture and the implementation of a 2-D DCT/IDCT VLSI chip based on distributed arithmetic. The main feature of the proposed architecture is a fully 2-bit-serial pipelined and parallel structure with memory-based signal processing circuitry, which is well suited to the bit-serial operation of distributed arithmetic. All modules of the proposed architecture are designed with NP-dynamic circuitry to reduce power consumption and increase performance. The chip is applicable to HDTV systems working at video sampling rates of up to 75 MHz.


Research for Improving the Speed of Scrambler in the WAVE System (WAVE 시스템에서 스크램블러의 속도 향상을 위한 연구)

  • Lee, Dae-Sik;You, Young-Mo;Lee, Sang-Youn;Oh, Se-Kab
    • The Journal of Korean Institute of Communications and Information Sciences / v.37A no.9 / pp.799-808 / 2012
  • Bit operations of the scrambler in the WAVE system are inefficient because they cannot be processed in parallel in either hardware or software. In this paper, we propose an algorithm to find the starting position of the matrix table. We also compare the performance of the bit-operation scrambler algorithm, the matrix-table algorithm, and the algorithm that finds the starting position of the matrix table, using 8-bit, 16-bit, and 32-bit processing units. As a result, the number of operations processed per second was 2917.8 times greater with 8-bit units, 5432.1 times greater with 16-bit units, and 10277.8 times greater with 32-bit units. Therefore, the algorithm to find the starting position of the matrix table improves the speed of the scrambler in WAVE, as well as the speed and precision with which the variety of information gathered over Vehicle-to-Infrastructure or Vehicle-to-Vehicle links is received in Intelligent Transport Systems.
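
The contrast between bit-serial and table-driven scrambling can be sketched as follows. This is a generic lookup-table illustration, not the paper's matrix table or its starting-position algorithm, which the abstract does not describe. The generator polynomial x^7 + x^4 + 1 is assumed from the IEEE 802.11 OFDM PHY on which WAVE is based; the LSB-first bit packing, the seed value, and all function names are likewise assumptions.

```python
def scramble_bits(bits, state):
    """Bit-serial additive scrambler, assumed polynomial x^7 + x^4 + 1.

    state is a 7-bit integer; bit 6 holds x7 and bit 3 holds x4.
    Returns the scrambled bits and the final register state.
    """
    out = []
    for b in bits:
        fb = ((state >> 6) ^ (state >> 3)) & 1   # x7 XOR x4
        state = ((state << 1) | fb) & 0x7F       # shift feedback into x1
        out.append(b ^ fb)
    return out, state


def pack(bits):
    """Pack 8 bits into a byte, first bit in the LSB (assumed convention)."""
    return sum(bit << i for i, bit in enumerate(bits))


def unpack(byte):
    """Inverse of pack: LSB first."""
    return [(byte >> i) & 1 for i in range(8)]


# For every 7-bit state, precompute 8 keystream bits and the state after them.
TABLE = []
for s in range(128):
    keystream, nxt = scramble_bits([0] * 8, s)
    TABLE.append((pack(keystream), nxt))


def scramble_bytes(data, state):
    """Process one byte per lookup instead of one bit per iteration."""
    out = bytearray()
    for byte in data:
        ks, state = TABLE[state]
        out.append(byte ^ ks)
    return bytes(out), state


data = bytes(range(16))
seed = 0x5D  # arbitrary nonzero seed, chosen only for the example
serial_bits, serial_state = scramble_bits([b for x in data for b in unpack(x)], seed)
fast_bytes, fast_state = scramble_bytes(data, seed)
```

Because the keystream depends only on the register state, the table version produces the same output as the bit-serial loop, and scrambling twice with the same seed restores the input. Wider processing units would need a larger or differently organised table, which is presumably where the paper's matrix-table approach comes in.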

All Phase Discrete Sine Biorthogonal Transform and Its Application in JPEG-like Image Coding Using GPU

  • Shan, Rongyang;Zhou, Xiao;Wang, Chengyou;Jiang, Baochen
    • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.9 / pp.4467-4486 / 2016
  • The discrete cosine transform (DCT) based JPEG standard significantly improves the coding efficiency of image compression, but it suffers from serious blocking artifacts at low bit rates and from low efficiency on high-definition images. In light of all phase digital filtering theory, this paper proposes a novel transform based on the discrete sine transform (DST), called the all phase discrete sine biorthogonal transform (APDSBT). Applying APDSBT to the JPEG scheme significantly reduces the blocking artifacts. The reconstructed image of APDSBT-JPEG is better than that of DCT-JPEG in terms of both objective quality and subjective effect. To improve the efficiency of JPEG coding, the structure of JPEG is analyzed. We analyze key factors in the design and evaluation of JPEG compression on massively parallel graphics processing units (GPUs) using the compute unified device architecture (CUDA) programming model. Experimental results show that the maximum speedup ratio of the parallel APDSBT-JPEG algorithm can exceed 100 times even with a very low-end GPU. Some new parallel strategies are illustrated in this paper for improving the performance of the parallel algorithm. With the optimal strategy, the efficiency can be improved by over 10%.

New High Speed Parallel Multiplier for Real Time Multimedia Systems (실시간 멀티미디어 시스템을 위한 새로운 고속 병렬곱셈기)

  • Cho, Byung-Lok;Lee, Mike-Myung-Ok
    • The KIPS Transactions:PartA / v.10A no.6 / pp.671-676 / 2003
  • In this paper, we propose a new First Partial product Addition (FPA) architecture with a new compressor (or parallel counter) for the CSA tree that is built when adding partial products in a fast parallel multiplier. It improves the speed of calculating partial products by about 20% compared with an existing parallel counter using full adders. Using the novel FPA architecture, the new circuit reduces the number of CLA bits needed to find the final sum by N/2. A multiplication time of 5.14 ns is obtained for the $16 \times 16$ multiplier using $0.25\,\mu\mathrm{m}$ CMOS technology. The architecture of the multiplier is easily adapted to pipeline design and demonstrates high-speed performance.

PDOCM : Fast Text Compression on MasPar Machine (PDOCM : MasPar머쉰상의 새로운 압축기법과 빠른 텍스트 축약)

  • Min, Yong-Sik
    • The Journal of the Acoustical Society of Korea / v.14 no.1 / pp.40-47 / 1995
  • Due to rapid progress in data communications, we are able to acquire the information we need with ease. One means of achieving this is a parallel machine such as the MasPar. Although the parallel machine makes it possible to receive/transmit enormous quantities of data, because of the increasing volume of information that must be processed, it is necessary to transmit only a minimal amount of data bits. This paper suggests a new coding method for the parallel machine, which compresses the data by reducing redundancy. Parallel Dynamic Octal Compact Mapping (PDOCM) compresses at least 1 byte per word, compared with other coding techniques, and achieves a 54.188-fold speedup with 64 processors to transmit 10 million characters.
