A Design of a Pipeline Chain Algorithm Based on Circuit Switching for MPI Broadcast Communication Systems

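  • 윤희준 (Processor Laboratory, Department of Electrical and Electronic Engineering, Yonsei University) ;
  • 정원영 (Processor Laboratory, Department of Electrical and Electronic Engineering, Yonsei University) ;
  • 이용석 (Processor Laboratory, Department of Electrical and Electronic Engineering, Yonsei University)
  • Received : 2011.12.23
  • Reviewed : 2012.07.05
  • Published : 2012.09.30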

Abstract

This paper proposes an algorithm and a hardware architecture for broadcast, the collective communication operation that causes the most severe bottleneck in multiprocessors built on a distributed-memory architecture. The conventional pipelined broadcast algorithm makes the fullest use of the available transfer bandwidth, but because it splits the data into many pieces and sends them separately, it repeats the synchronization process unnecessarily. We designed an MPI unit for a circuit-switching-based pipeline chain algorithm that removes this redundant synchronization, and evaluated it by modeling it in SystemC. Compared with the pipelined broadcast algorithm, the proposed design improved broadcast performance by up to 3.3 times and used nearly the full transfer bandwidth of the communication bus. The hardware was then designed in Verilog HDL, synthesized with Synopsys Design Compiler using a TSMC 0.18 μm process library, and fabricated as a chip. In synthesis, the hardware for the proposed architecture occupied 4,700 gates (2-input NAND gate), or 2.4% of the total area. The proposed architecture is therefore useful for raising the overall performance of an MPSoC at a small area cost.
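
The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not the authors' SystemC model. It assumes a linear chain of nodes, a fixed handshake cost paid once per chunk step in the pipelined case, and a single handshake followed by continuous streaming in the circuit-switched case. All function names, parameters, and numbers are hypothetical and chosen only for illustration.

```python
def pipelined_broadcast_cycles(msg_words, chunks, nodes, sync_cycles,
                               words_per_cycle=1.0):
    """Toy estimate for a chain-pipelined broadcast.

    Assumption: every pipeline step pays one handshake (sync_cycles)
    plus the time to move one chunk across one hop.
    """
    chunk_cycles = msg_words / chunks / words_per_cycle
    # chunks steps to inject the data, plus (nodes - 2) steps to drain
    # the pipeline across the remaining hops of the chain.
    steps = chunks + nodes - 2
    return steps * (sync_cycles + chunk_cycles)


def circuit_chain_cycles(msg_words, nodes, sync_cycles,
                         hop_cycles=1, words_per_cycle=1.0):
    """Toy estimate for a circuit-switched pipeline chain.

    Assumption: one handshake sets up the whole path, then the message
    streams through (nodes - 1) hops without further synchronization.
    """
    return sync_cycles + (nodes - 1) * hop_cycles + msg_words / words_per_cycle


if __name__ == "__main__":
    # Hypothetical workload: 1024-word message, 8 nodes, 20-cycle handshake.
    pipelined = pipelined_broadcast_cycles(1024, chunks=16, nodes=8, sync_cycles=20)
    circuit = circuit_chain_cycles(1024, nodes=8, sync_cycles=20)
    print(f"pipelined broadcast: {pipelined:.0f} cycles")
    print(f"circuit-switched chain: {circuit:.0f} cycles")
    print(f"ratio: {pipelined / circuit:.2f}x")
```

In this model the gap widens as the handshake cost grows relative to the per-chunk transfer time, which is the effect the abstract attributes to repeated synchronization. The figures it produces are not comparable to the measured 3.3 times improvement, which comes from the authors' own evaluation.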
