• Title/Summary/Keyword: Fully-parallel architecture

Search Result 32, Processing Time 0.023 seconds

A Study on the CAM Designed by Adopting Best-Match Method using Parallel Processing Architecture (병렬 처리 구조를 이용한 최적 정합 방식 CAM 설계에 관한 연구)

  • 김상복;박노경;차균현
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.19 no.6
    • /
    • pp.1056-1063
    • /
    • 1994
  • In this paper a content addressable memory (CAM) is designed by adopting best-match method. It has a single processing element(PE) architecture with high computational efficiency and throughput. It is composed of three main functional blocks(input MUX, best-match CAM, control part). It support fully parallel processing. Logic simulation is completed by using QUICKSIM, Circuit simulation is performanced by using HSPICE. Its layout is based on the ETRI 3 m n-well process design rules. Its maximum operating frequency is 20 MHz.

  • PDF

Fully parallel low-density parity-check code-based polar decoder architecture for 5G wireless communications

  • Dinesh Kumar Devadoss;Shantha Selvakumari Ramapackiam
    • ETRI Journal
    • /
    • v.46 no.3
    • /
    • pp.485-500
    • /
    • 2024
  • A hardware architecture is presented to decode (N, K) polar codes based on a low-density parity-check code-like decoding method. By applying suitable pruning techniques to the dense graph of the polar code, the decoder architectures are optimized using fewer check nodes (CN) and variable nodes (VN). Pipelining is introduced in the CN and VN architectures, reducing the critical path delay. Latency is reduced further by a fully parallelized, single-stage architecture compared with the log N stages in the conventional belief propagation (BP) decoder. The designed decoder for short-to-intermediate code lengths was implemented using the Virtex-7 field-programmable gate array (FPGA). It achieved a throughput of 2.44 Gbps, which is four times and 1.4 times higher than those of the fast-simplified successive cancellation and combinational decoders, respectively. The proposed decoder for the (1024, 512) polar code yielded a negligible bit error rate of 10-4 at 2.7 Eb/No (dB). It converged faster than the BP decoding scheme on a dense parity-check matrix. Moreover, the proposed decoder is also implemented using the Xilinx ultra-scale FPGA and verified with the fifth generation new radio physical downlink control channel specification. The superior error-correcting performance and better hardware efficiency makes our decoder a suitable alternative to the successive cancellation list decoders used in 5G wireless communication.

Implementation of a Parallel Web Crawler for the Odysseus Large-Scale Search Engine (오디세우스 대용량 검색 엔진을 위한 병렬 웹 크롤러의 구현)

  • Shin, Eun-Jeong;Kim, Yi-Reun;Heo, Jun-Seok;Whang, Kyu-Young
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.14 no.6
    • /
    • pp.567-581
    • /
    • 2008
  • As the size of the web is growing explosively, search engines are becoming increasingly important as the primary means to retrieve information from the Internet. A search engine periodically downloads web pages and stores them in the database to provide readers with up-to-date search results. The web crawler is a program that downloads and stores web pages for this purpose. A large-scale search engines uses a parallel web crawler to retrieve the collection of web pages maximizing the download rate. However, the service architecture or experimental analysis of parallel web crawlers has not been fully discussed in the literature. In this paper, we propose an architecture of the parallel web crawler and discuss implementation issues in detail. The proposed parallel web crawler is based on the coordinator/agent model using multiple machines to download web pages in parallel. The coordinator/agent model consists of multiple agent machines to collect web pages and a single coordinator machine to manage them. The parallel web crawler consists of three components: a crawling module for collecting web pages, a converting module for transforming the web pages into a database-friendly format, a ranking module for rating web pages based on their relative importance. We explain each component of the parallel web crawler and implementation methods in detail. Finally, we conduct extensive experiments to analyze the effectiveness of the parallel web crawler. The experimental results clarify the merit of our architecture in that the proposed parallel web crawler is scalable to the number of web pages to crawl and the number of machines used.

A Study on the IC, Implementation of High Speed Multiplier for Real Time Digital Signal Processing (실시간 디지털 신호 처리용 고속 MULTIPLIER 단일칩화에 관한 연구)

  • 문대철;차균현
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.15 no.7
    • /
    • pp.628-637
    • /
    • 1990
  • In this paper we present on architecture for a high sppeed CMOS multiplier which can be used for real-time digital signal processing. And a synthesis method for designing highly parallel algorithms in VLSI is presented. A parallel multiplier design based on the modified Booth's algorithms and Ling's algorthm. This paper addresses the design of multiplier capable of accpting data in 2's complement notation and coefficients in 2's complement notation. Multiplier consists of an interative array of sequential cells, and are well suited to VLSI implementation as a results of their modularity and regularity. Booth's decoders can be fully tested using a relatively small number af test vector.

  • PDF

Accelerating 2D DCT in Multi-core and Many-core Environments (멀티코어와 매니코어 환경에서의 2 차원 DCT 가속)

  • Hong, Jin-Gun;Jung, Sung-Wook;Kim, Cheong-Ghil;Burgstaller, Bernd
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2011.04a
    • /
    • pp.250-253
    • /
    • 2011
  • Chip manufacture nowadays turned their attention from accelerating uniprocessors to integrating multiple cores on a chip. Moreover desktop graphic hardware is now starting to support general purpose computation. Desktop users are able to use multi-core CPU and GPU as a high performance computing resources these days. However exploiting parallel computing resources are still challenging because of lack of higher programming abstraction for parallel programming. The 2-dimensional discrete cosine transform (2D-DCT) algorithms are most computational intensive part of JPEG encoding. There are many fast 2D-DCT algorithms already studied. We implemented several algorithms and estimated its runtime on multi-core CPU and GPU environments. Experiments show that data parallelism can be fully exploited on CPU and GPU architecture. We expect parallelized DCT bring performance benefit towards its applications such as JPEG and MPEG.

A CDMA-Based Communication Network for a Multiprocessor SoC (다중 프로세서를 갖는 SoC 를 위한 CDMA 기술에 기반한 통신망 설계)

  • Chun, Ik-Jae;Kim, Bo-Gwan
    • Proceedings of the IEEK Conference
    • /
    • 2005.11a
    • /
    • pp.707-710
    • /
    • 2005
  • In this paper, we propose a new communication network for on-chip communication. The network is based on a direct sequence code division multiple access (DS-CDMA) technique. The new communication network is suitable for a parallel processing system and also drastically reduces the I/O pin count. Our network architecture is mainly divided into a CDMA-based network interface (CNI), a communication channel, a synchronizer. The network includes a reverse communication channel for reducing latency. The network decouples computation task from communication task by the CNI. An extreme truncation is considered to simplify the communication link. For the scalability of the network, we use a PN-code reuse method and a hierarchical structure. The network elements have a modular architecture. The communication network is done using fully synthesizable Verilog HDL to enhance the portability between process technologies.

  • PDF

Energy Efficient and Low-Cost Server Architecture for Hadoop Storage Appliance

  • Choi, Do Young;Oh, Jung Hwan;Kim, Ji Kwang;Lee, Seung Eun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.12
    • /
    • pp.4648-4663
    • /
    • 2020
  • This paper proposes the Lempel-Ziv 4(LZ4) compression accelerator optimized for scale-out servers in data centers. In order to reduce CPU loads caused by compression, we propose an accelerator solution and implement the accelerator on an Field Programmable Gate Array(FPGA) as heterogeneous computing. The LZ4 compression hardware accelerator is a fully pipelined architecture and applies 16 dictionaries to enhance the parallelism for high throughput compressor. Our hardware accelerator is based on the 20-stage pipeline and dictionary architecture, highly customized to LZ4 compression algorithm and parallel hardware implementation. Proposing dictionary architecture allows achieving high throughput by comparing input sequences in multiple dictionaries simultaneously compared to a single dictionary. The experimental results provide the high throughput with intensively optimized in the FPGA. Additionally, we compare our implementation to CPU implementation results of LZ4 to provide insights on FPGA-based data centers. The proposed accelerator achieves the compression throughput of 639MB/s with fine parallelism to be deployed into scale-out servers. This approach enables the low power Intel Atom processor to realize the Hadoop storage along with the compression accelerator.

A Multi-Scale Parallel Convolutional Neural Network Based Intelligent Human Identification Using Face Information

  • Li, Chen;Liang, Mengti;Song, Wei;Xiao, Ke
    • Journal of Information Processing Systems
    • /
    • v.14 no.6
    • /
    • pp.1494-1507
    • /
    • 2018
  • Intelligent human identification using face information has been the research hotspot ranging from Internet of Things (IoT) application, intelligent self-service bank, intelligent surveillance to public safety and intelligent access control. Since 2D face images are usually captured from a long distance in an unconstrained environment, to fully exploit this advantage and make human recognition appropriate for wider intelligent applications with higher security and convenience, the key difficulties here include gray scale change caused by illumination variance, occlusion caused by glasses, hair or scarf, self-occlusion and deformation caused by pose or expression variation. To conquer these, many solutions have been proposed. However, most of them only improve recognition performance under one influence factor, which still cannot meet the real face recognition scenario. In this paper we propose a multi-scale parallel convolutional neural network architecture to extract deep robust facial features with high discriminative ability. Abundant experiments are conducted on CMU-PIE, extended FERET and AR database. And the experiment results show that the proposed algorithm exhibits excellent discriminative ability compared with other existing algorithms.

Analog Parallel Processing Algorithm of CNN-UM for Interframe Change Detection (프레임간의 영상 변화 검출을 위한 CNN-UM의 아날로그 병렬연산처리 알고리즘)

  • 김형석;김선철;손홍락;박영수;한승조
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.40 no.1
    • /
    • pp.1-9
    • /
    • 2003
  • The CNN-UM algorithm which performs the analog parallel subtraction of images has been developed and its application study to the moving target detection has been done. The CNN-UM is the state of the art computation architecture with high computational potential of analog parallel processing. It is one of the strong candidates for the next generation of computing system which fulfills requirement of the real-time image processing. One weakness of the CNN-UM is that its analog parallel processing function is not fully utilized for the inter frame processing. If two subsequent image frames are superimposed with opposite signs on identical capacitors for short time period, the analog subtraction between them is achieved. The Principle of such temporal inter-frame processing algorithm has been described and its mathematical analysis has been done. Practical usefulness of the proposed algorithm has also been verified through the application for moving target detection.

Highly Scalable NAND Flash Memory Cell Design Embracing Backside Charge Storage

  • Kwon, Wookhyun;Park, In Jun;Shin, Changhwan
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.15 no.2
    • /
    • pp.286-291
    • /
    • 2015
  • For highly scalable NAND flash memory applications, a compact ($4F^2/cell$) nonvolatile memory architecture is proposed and investigated via three-dimensional device simulations. The back-channel program/erase is conducted independently from the front-channel read operation as information is stored in the form of charge at the backside of the channel, and hence, read disturbance is avoided. The memory cell structure is essentially equivalent to that of the fully-depleted transistor, which allows a high cell read current and a steep subthreshold slope, to enable lower voltage operation in comparison with conventional NAND flash devices. To minimize memory cell disturbance during programming, a charge depletion method using appropriate biasing of a buried back-gate line that runs parallel to the bit line is introduced. This design is a new candidate for scaling NAND flash memory to sub-20 nm lateral dimensions.