• Title/Summary/Keyword: parallel computer processing

Search Result 651, Processing Time 0.027 seconds

InterCom : Design and Implementation of an Agent-based Internet Computing Environment (InterCom : 에이전트 기반 인터넷 컴퓨팅 환경 설계 및 구현)

  • Kim, Myung-Ho;Park, Kweon
    • The KIPS Transactions:PartA
    • /
    • v.8A no.3
    • /
    • pp.235-244
    • /
    • 2001
  • Development of network and computer technology results in many studies to use physically distributed computers as a single resource. Generally, these studies have focused on developing environments based on message passing. These environments are mainly used to solve problems for scientific computation and process in parallel suing inside parallelism of the given problems. Therefore, these environments provide high parallelism generally, while it is difficult to program and use as well as it is required to have user accounts in the distributed computers. If a given problem is divided into completely independent subproblems, more efficient environment can be provided. We can find these problems in bio-informatics, 3D animatin, graphics, and etc., so the development of new environment for these problems can be considered to be very important. Therefore, we suggest new environment called InterCom based on a proxy computing, which can solve these problems efficiently, and explain the implementation of this environment. This environment consists of agent, server, and client. Merits of this environment are easy programing, no need of user accounts in the distributed computers, and easiness by compiling distributed code automatically.

  • PDF

Novel IME Instructions and their Hardware Architecture for Fast Search Algorithm (고속 탐색 알고리즘에 적합한 움직임 추정 전용 명령어 및 구조 설계)

  • Bang, Ho-Il;SunWoo, Myung-Hoon
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.48 no.12
    • /
    • pp.58-65
    • /
    • 2011
  • This paper presents an ASIP (Application-specific Instruction Processor) for motion estimation that employs specific IME instructions and its programmable and reconfigurable hardware architecture for various video codecs, such as H.264/AVC, MPEG4, etc. With the proposed specific instructions and variable point 2D SAD hardware accelerator, it can handle the real-time processing requirement of High Definition (HD) video. With the SAD unit and its parallel operations using pattern information, the proposed IME instructions support not only full search algorithms but also other fast search algorithms. The hardware size is 25.5K gates for each Processing Element Group (PEG) which has 128 SAD Processor Elements (PEs). The proposed ASIP has been verified by the Synopsys Processor Designer and implemented by the Design Compiler using the IBM 90nm process technology. The hardware size is 453K gates for the IME unit and the operating frequency is 188MHz for 1080p@30 frame in real time. The proposed ASIP can reduce the hardware size about 26% and the number of operation cycles about 18%.

GPU-Optimized BVH and R-Triangle Methods for Rapid Self-Intersection Handling in Fabrics

  • Jong-Hyun Kim
    • Journal of the Korea Society of Computer and Information
    • /
    • v.29 no.8
    • /
    • pp.59-65
    • /
    • 2024
  • In this paper, we present a GPU-based acceleration of computationally intensive self-collision processing in triangular mesh-based cloth simulation. For Compute Unified Device Architecture (CUDA)-based parallel optimization, we propose 1) an efficient way to build, update, and traverse the Bounding Volume Hierarchy (BVH) tree on the GPU, and 2) optimize the Representative-Triangle (R-Triangle) technique on the GPU to minimize primitive collision checking in triangular mesh-based cloth simulations. As a result, the proposed method can handle self-collisions and object collisions of cloth simulation in GPU environment faster and more efficiently than CPU-based algorithms, and experiments on various scenes show that it can achieve simulation results that are 5x to 10x faster. Since the proposed method is optimized for BVH on GPU, it can be easily integrated into various algorithms and fields that utilize BVH.

A Design of Fractional Motion Estimation Engine with 4×4 Block Unit of Interpolator & SAD Tree for 8K UHD H.264/AVC Encoder (8K UHD(7680×4320) H.264/AVC 부호화기를 위한 4×4블럭단위 보간 필터 및 SAD트리 기반 부화소 움직임 추정 엔진 설계)

  • Lee, Kyung-Ho;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.6
    • /
    • pp.145-155
    • /
    • 2013
  • In this paper, we proposed a $4{\times}4$ block parallel architecture of interpolation for high-performance H.264/AVC Fractional Motion Estimation in 8K UHD($7680{\times}4320$) video real time processing. To improve throughput, we design $4{\times}4$ block parallel interpolation. For supplying the $10{\times}10$ reference data for interpolation, we design 2D cache buffer which consists of the $10{\times}10$ memory arrays. We minimize redundant storage of the reference pixel by applying the Search Area Stripe Reuse scheme(SASR), and implement high-speed plane interpolator with 3-stage pipeline(Horizontal Vertical 1/2 interpolation, Diagonal 1/2 interpolation, 1/4 interpolation). The proposed architecture was simulated in 0.13um standard cell library. The gate count is 436.5Kgates. The proposed H.264/AVC Fractional Motion Estimation can support 8K UHD at 30 frames per second by running at 187MHz.

DEX2C: Translation of Dalvik Bytecodes into C Code and its Interface in a Dalvik VM

  • Kim, Minseong;Han, Youngsun;Cho, Myeongjin;Park, Chanhyun;Kim, Seon Wook
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.4 no.3
    • /
    • pp.169-172
    • /
    • 2015
  • Dalvik is a virtual machine (VM) that is designed to run Java-based Android applications. A trace-based just-in-time (JIT) compilation technique is currently employed to improve performance of the Dalvik VM. However, due to runtime compilation overhead, the trace-based JIT compiler provides only a few simple optimizations. Moreover, because each trace contains only a few instructions, the trace-based JIT compiler inherently exploits fewer optimization and parallelization opportunities than a method-based JIT compiler that compiles method-by-method. So we propose a new method-based JIT compiler, named DEX2C, in order to improve performance by finding more opportunities for both optimization and parallelization in Android applications. We employ C code as an intermediate product in order to find more optimization opportunities by using the GNU C Compiler (GCC), and we will detect parallelism by using the Intel C/C++ parallel compiler and the AESOP compiler in our future work. In this paper, we introduce our DEX2C compiler, which dynamically translates Dalvik bytecodes (DEX) into C code with method granularity. We also describe a new method-based JIT interface in the Dalvik VM for the DEX2C compiler. Our experiment results show that our compiler and its interface achieve significant performance improvement by up to 15.2 times and 3.7 times on average, in Element Benchmark, and up to 2.8 times for FFT in Smartbench.

Design and Implementation of KDSM(KAIST Distributed Shared Memory) System (KDSM(KAIST Distributed Shared Memory) 시스템의 설계 및 구현)

  • Lee, Sang-Kwon;Yun, Hee-Chul;Lee, Joon-Won;Maeng, Seung-Ryoul
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.29 no.5
    • /
    • pp.257-264
    • /
    • 2002
  • In this paper, we give a detailed description of KDSM(KAIST Distributed Shared Memory) system. KDSM is implemented as a user-level library running on Linux 2.2.13, and TCP/IP is used for communication. KDSM uses page-based invalidation protocol, multiple-writer protocol, and supports HLRC(Home-based Lazy Release Consistency) memory consistency model. To evaluate performance of KDSM, we executed 4 scientific applications and compared the result to JLAJLA. The results showed that performance of KDSM almost equal to JIAJIA for 2 applications and performance of KDSM is better than JIAJIA for 2 applications.

A Multithreaded Architecture for the Efficient Execution of Vector Computations (벡타 연산을 효율적으로 수행하기 위한 다중 스레드 구조)

  • Yun, Seong-Dae;Jeong, Gi-Dong
    • The Transactions of the Korea Information Processing Society
    • /
    • v.2 no.6
    • /
    • pp.974-984
    • /
    • 1995
  • This paper presents a design of a high performance MULVEC (MULtithreaded architecture for the VEctor Computations), as a building block of massively parallel Processing systems. The MULVEC comes from the synthesis of the dataflow model and the extant super sclar RISC microprocesso r. The MULVEC reduces, using status fields, the number of synchronizations in the case of repeated vector computations within the same thread segment, and also reduces the amount of the context switching, network traffic, etc. After be nchmark programs are simulated on the SPARC station 20(super scalar RISC microprocessor)the performance (execution time of programs and the utilization of processors) of MULVEC and the performance(execution time of a program) of *Taccording the different numbers of node are analyzed. We observed that the execution time of the program in MULVEC is faster than that in * T about 1-2 times according the number of nodes and the number of the repetitions of the loop.

  • PDF

Multiple Pipelined Hash Joins using Synchronization of Page Execution Time (페이지 실행시간 동기화를 이용한 다중 파이프라인 해쉬 결합)

  • Lee, Kyu-Ock;Weon, Young-Sun;Hong, Man-Pyo
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.27 no.7
    • /
    • pp.639-649
    • /
    • 2000
  • In the relational database systems, the join operation is one of the most time-consuming query operations. Many parallel join algorithms have been developed to reduce the execution time. Multiple hash join algorithm using allocation tree is one of most efficient ones. However, it may have some delay on the processing each node of allocation tree, which is occurred in tuple-probing phase by the difference between one page reading time of outer relation and the processing time of already read one. In this paper, to solve the performance degrading problem by the delay, we develop a join algorithm using the concept of 'synchronization of page execution time' for multiple hash joins. We reduce the processing time of each nodes in the allocation tree and improve the total system performance. In addition, we analyze the performance by building the analytical cost model and verify the validity of it by various performance comparison with previous method.

  • PDF

Reconfigurable Architecture Design for H.264 Motion Estimation and 3D Graphics Rendering of Mobile Applications (이동통신 단말기를 위한 재구성 가능한 구조의 H.264 인코더의 움직임 추정기와 3차원 그래픽 렌더링 가속기 설계)

  • Park, Jung-Ae;Yoon, Mi-Sun;Shin, Hyun-Chul
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.34 no.1
    • /
    • pp.10-18
    • /
    • 2007
  • Mobile communication devices such as PDAs, cellular phones, etc., need to perform several kinds of computation-intensive functions including H.264 encoding/decoding and 3D graphics processing. In this paper, new reconfigurable architecture is described, which can perform either motion estimation for H.264 or rendering for 3D graphics. The proposed motion estimation techniques use new efficient SAD computation ordering, DAU, and FDVS algorithms. The new approach can reduce the computation by 70% on the average than that of JM 8.2, without affecting the quality. In 3D rendering, midline traversal algorithm is used for parallel processing to increase throughput. Memories are partitioned into 8 blocks so that 2.4Mbits (47%) of memory is shared and selective power shutdown is possible during motion estimation and 3D graphics rendering. Processing elements are also shared to further reduce the chip area by 7%.

Implementation of a 3D Graphics Hardwired T&L Accelerator based on a SoC Platform for a Mobile System (SoC 플랫폼 기반 모바일용 3차원 그래픽 Hardwired T&L Accelerator 구현)

  • Lee, Kwang-Yeob;Koo, Yong-Seo
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.44 no.9
    • /
    • pp.59-70
    • /
    • 2007
  • In this paper, we proposed an effective T&L(Transform & Lighting) Processor architecture for a real time 3D graphics acceleration SoC(System on a Chip) in a mobile system. We designed Floating point arithmetic IPs for a T&L processor. And we verified IPs using a SoC Platform. Designed T&L Processor consists of 24 bit floating point data format and 16 bit fixed point data format, and supports the pipeline keeping the balance between Transform process and Lighting process using a parallel computation of 3D graphics. The delay of pipeline processing only Transform operation is almost same as the delay processing both Transform operation and Lighting operation. Designed T&L Processor is implemented and verified using a SoC Platform. The T&L Processor operates at 80MHz frequency in Xilinx-Virtex4 FPGA. The processing speed is measured at the rate of 20M Vertexes/sec.