• Title/Summary/Keyword: Multiprocessor

A 3-D Vision Sensor Implementation on Multiple DSPs TMS320C31 (다중 TMS320C31 DSP를 사용한 3-D 비젼센서 Implementation)

  • Oksenhendler, V.;Bensrhair, Abdelaziz;Miche, Pierre;Lee, Sang-Goog
    • Journal of Sensor Science and Technology / v.7 no.2 / pp.124-130 / 1998
  • High-speed 3D vision systems are essential for autonomous robot and vehicle control applications. In our study, a stereo vision process has been developed. It consists of three steps: extraction of edges in the right and left images, matching of corresponding edges, and calculation of the 3D map. This process is implemented on a VME 150/40 Imaging Technology vision system, a modular system composed of display and acquisition components, a 4-Mbyte image frame memory, and three computational cards. The programmable accelerator computational modules run at 40 MHz and are based on the TMS320C31 DSP, with a $64{\times}32$-bit instruction cache and two $1024{\times}32$-bit internal RAMs. Each module is equipped with 512 Kbytes of static RAM, 4 Mbytes of image memory, 1 Mbyte of flash EEPROM, and a serial port. Data transfers and communications between modules are provided by three 8-bit global video buses and three local configurable 8-bit pipeline video buses. The VME bus is dedicated to system management. Tasks are distributed among the DSPs as follows: two DSPs perform edge detection, one for the right image and the other for the left, and the last processor performs the matching and the 3D calculation. With $512{\times}512$-pixel images, this sensor generates dense 3D maps at a rate of about 1 Hz, depending on the scene complexity. Results can surely be improved by using specially suited multiprocessor cards.

Empirical Modeling for Cache Miss Rates in Multiprocessors (다중 프로세서에서의 캐시접근 실패율을 위한 경험적 모델링)

  • Lee, Kang-Woo;Yang, Gi-Joo;Park, Choon-Shik
    • Journal of KIISE:Computer Systems and Theory / v.33 no.1_2 / pp.15-34 / 2006
  • This paper introduces an empirical modeling technique. The technique uses a set of sample results collected from a few small-scale simulations. Empirical models are developed by applying a couple of statistical estimation techniques to these samples. We built two types of models for cache miss rates in Symmetric Multiprocessor systems. One models changes in the input data set size while the specification of the target system is fixed. The other models changes in the number of processors in the target system while the input data set size is fixed. To develop accurate models, we built an individual model for every kind of cache miss for each shared data structure in a program. The final model is then obtained by integrating them. In addition, the combined use of Least Mean Squares and Robust Estimations enhances the quality of the models by minimizing the distortion due to outliers. The empirical modeling technique produces extremely accurate models without analysis of the sample data. Moreover, since only small-scale simulations are necessary, the empirical method can be adopted in any research area once a set of samples can be collected. In 17 of 24 trials, the empirical models show extremely low prediction errors, below $1\%$. In the remaining cases, the accuracy is excellent as well. The models sustain high quality even when the behavioral characteristics of programs are irregular and the number of samples is barely sufficient.
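
The idea of fitting a model to a few simulation samples while limiting the influence of outliers can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not name the robust estimator or the model form, so the straight-line model, the Theil-Sen estimator (median of pairwise slopes), and the sample values below are all assumptions chosen for illustration.

```python
from statistics import median


def least_squares_fit(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def theil_sen_fit(xs, ys):
    """Robust line fit (assumed estimator): median of all pairwise slopes."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i in range(len(xs))
        for j in range(i + 1, len(xs))
        if xs[j] != xs[i]
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope


def predict(model, x):
    """Evaluate a fitted (intercept, slope) model at x."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical samples: four points on y = 2x plus one outlier.
sample_x = [1, 2, 3, 4, 5]
sample_y = [2.0, 4.0, 6.0, 8.0, 30.0]

plain_model = least_squares_fit(sample_x, sample_y)
robust_model = theil_sen_fit(sample_x, sample_y)
print("least squares:", plain_model, "robust:", robust_model)
```

On these made-up samples the least-squares line is pulled toward the outlier, while the robust fit stays on the trend of the other four points, which is the kind of distortion the abstract says the combined estimation is meant to limit.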

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization (GPGPU 자원 활용 개선을 위한 블록 지연시간 기반 워프 스케줄링 기법)

  • Thuan, Do Cong;Choi, Yong;Kim, Jong Myon;Kim, Cheol Hong
    • KIPS Transactions on Computer and Communication Systems / v.6 no.5 / pp.219-230 / 2017
  • General-Purpose Graphics Processing Units (GPGPUs) build on a massively parallel architecture and apply multithreading technology to exploit parallelism. With programming models such as CUDA and OpenCL, GPGPUs are becoming the best platform for exploiting the plentiful thread-level parallelism offered by parallel applications. Unfortunately, modern GPGPUs cannot efficiently utilize their available hardware resources for many general-purpose applications. One of the primary reasons is the inefficiency of existing warp/thread block schedulers in hiding long-latency instructions, which results in lost opportunities to improve performance. This paper studies the effects of the hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that can alleviate the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long latency and warps with short latency, and then schedules the warps with long latency before the warps with short latency. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that dynamically reduces the number of streaming multiprocessors to which thread blocks are assigned when a high degree of contention is encountered in the memory and interconnection network. Based on our experiments on a GPGPU platform with 15 streaming multiprocessors, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin warp scheduling policy. This paper also shows that GPGPU performance can be improved by approximately 8.9% on average when the two proposed techniques are combined.
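
The two-group ordering described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how a warp's latency is measured or where the boundary between "long" and "short" lies, so the `pending_latency` field and the `threshold` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    pending_latency: int  # hypothetical latency estimate, in cycles


def schedule_order(warps, threshold):
    """Return warp ids with long-latency warps ahead of short-latency ones.

    Within each group the original (round-robin) order is preserved.
    """
    long_group = [w for w in warps if w.pending_latency >= threshold]
    short_group = [w for w in warps if w.pending_latency < threshold]
    return [w.warp_id for w in long_group + short_group]


# Hypothetical thread block with four warps.
block = [Warp(0, 4), Warp(1, 400), Warp(2, 8), Warp(3, 300)]
print(schedule_order(block, threshold=100))
```

Issuing the long-latency warps first gives their memory requests more time to complete while the short-latency warps execute, which is one plausible reading of how the policy hides latency.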

Call-Site Tracing-based Shared Memory Allocator for False Sharing Reduction in DSM Systems (분산 공유 메모리 시스템에서 거짓 공유를 줄이는 호출지 추적 기반 공유 메모리 할당 기법)

  • Lee, Jong-Woo
    • Journal of KIISE:Computer Systems and Theory / v.32 no.7 / pp.349-358 / 2005
  • False sharing results from the co-location of unrelated data in the same unit of memory coherency, and is one source of unnecessary overhead that does nothing to maintain memory coherency in multiprocessor systems. Moreover, the damage caused by false sharing grows in proportion to the granularity of memory coherency. To reduce false sharing in a page-based DSM system, it is necessary to allocate unrelated data objects that have different access patterns to separate shared pages. In this paper we propose a call-site tracing-based shared memory allocator, CSTallocator for short. CSTallocator expects that data objects requested from different call-sites may have different access patterns in the future. It therefore places data objects requested from different call-sites into separate shared pages, and consequently data objects that have the same call-site are likely to be gathered into the same shared pages. We use execution-driven simulation of real parallel applications to evaluate the effectiveness of CSTallocator. Our observations show that CSTallocator can eliminate a considerable number of additional false sharing misses in comparison with the existing techniques.
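
The placement rule in this abstract (objects from different call-sites go to different shared pages) can be illustrated with a small Python model. It is a sketch, not the CSTallocator implementation: the page size, the class and method names, and the use of a caller-supplied string as the call-site identifier are assumptions, whereas the real allocator obtains the call-site by tracing.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class CallSiteAllocator:
    """Toy allocator that keeps one open page per call-site (hypothetical API)."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.next_page = 0
        self.open_page = {}  # call_site -> (page number, bytes used)

    def alloc(self, call_site, size):
        """Return (page, offset) for an object requested from call_site."""
        if size > self.page_size:
            raise ValueError("object larger than a page is not modeled here")
        page, used = self.open_page.get(call_site, (None, 0))
        if page is None or used + size > self.page_size:
            # Start a fresh page owned by this call-site only.
            page, used = self.next_page, 0
            self.next_page += 1
        self.open_page[call_site] = (page, used + size)
        return page, used


allocator = CallSiteAllocator()
print(allocator.alloc("site_a", 64))  # first object from site_a
print(allocator.alloc("site_b", 64))  # different call-site, different page
print(allocator.alloc("site_a", 64))  # same call-site, same page as the first
```

Because a page is never shared between call-sites in this model, two objects can only land on the same page if they were requested from the same place in the program.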

Load Balancing of Unidirectional Dual-link CC-NUMA System Using Dynamic Routing Method (단방향 이중연결 CC-NUMA 시스템의 동적 부하 대응 경로 설정 기법)

  • Suh Hyo-Joon
    • The KIPS Transactions:PartA / v.12A no.6 s.96 / pp.557-562 / 2005
  • The throughput and latency of the interconnection network are important factors in the performance of multiprocessor systems. The dual-link CC-NUMA architecture using point-to-point unidirectional links is one of the popular structures in high-end commercial systems. Because of its native multi-path structure, several paths with the optimal hop count exist between nodes. Furthermore, the transaction latency between nodes is affected by congestion of the links on the transaction path. Hence the transaction latency may get worse if the transactions create a hot spot on some links. In this paper, I propose a dynamic transaction routing algorithm that maintains balanced link utilization while keeping the optimal path length, and I compare its performance with the fixed-path method on dual-link CC-NUMA systems. With the proposed method, link competition is alleviated by real-time path selection, and consequently the dynamic transaction routing algorithm shows better performance. The program-driven simulation results show a $1{\sim}10\%$ improvement in the fluctuation of link utilization, a $1{\sim}3\%$ improvement in link acquisition, and a $1{\sim}6\%$ improvement in system performance.
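
Choosing among equal-length paths according to current link load can be sketched as follows. The abstract does not state the selection metric, so this Python example assumes one: among the candidate paths with the minimum hop count, pick the path whose most heavily loaded link carries the least load. The function names, the link labels, and the unit load added per transaction are all hypothetical.

```python
def pick_path(candidate_paths, link_load):
    """Pick a minimum-hop path whose busiest link is least loaded (assumed metric)."""
    min_hops = min(len(path) for path in candidate_paths)
    shortest = [path for path in candidate_paths if len(path) == min_hops]
    return min(shortest, key=lambda path: max(link_load[link] for link in path))


def route(candidate_paths, link_load):
    """Route one transaction and charge one unit of load to each link used."""
    path = pick_path(candidate_paths, link_load)
    for link in path:
        link_load[link] += 1
    return path


# Hypothetical topology: two 2-hop paths and one longer detour.
load = {"a": 0, "b": 0, "c": 0, "d": 0}
paths = [("a", "b"), ("c", "d"), ("a", "c", "d")]
print(route(paths, load))  # both short paths idle: the first is taken
print(route(paths, load))  # the other short path is now less loaded
```

A fixed-path method would send both transactions over the same links; the load-aware choice spreads them across the two optimal paths without ever lengthening the route.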