• Title/Summary/Keyword: Loop-tiling

4 search results

Locality-Conscious Nested-Loops Parallelization

  • Parsa, Saeed; Hamzei, Mohammad
    • ETRI Journal / v.36 no.1 / pp.124-133 / 2014
  • To speed up data-intensive programs, two complementary techniques, namely nested-loop parallelization and data-locality optimization, should be considered. Effective parallelization techniques distribute the computation and the necessary data across different processors, whereas data-locality optimization keeps the data accessed by each processor local to it. Locality and parallelization may therefore demand different loop transformations, so an integrated approach that combines the two can generate much better results than either approach alone. This paper proposes a unified approach that integrates these two techniques to obtain an appropriate loop transformation. Applying this transformation yields coarse-grain parallelism by exploiting the largest possible groups of outer permutable loops, as well as data locality by satisfying dependences at the inner loops. These groups can be further tiled to improve data locality by exploiting data reuse in multiple dimensions.
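The tiling step described in this abstract can be sketched with a classic matrix-multiply kernel; this is an illustrative example of loop tiling for data reuse, not the paper's actual transformation, and the `tile` size of 32 is an arbitrary assumption:

```python
def matmul_tiled(A, B, tile=32):
    """Tiled matrix multiply: the three outer loops walk tile-sized blocks,
    so each block of A, B, and C is reused while it is still cache-resident."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for kk in range(0, k, tile):
                # intra-tile loops: same iterations as the untiled loop nest,
                # just reordered to exploit reuse in multiple dimensions
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, m)):
                        s = C[i][j]
                        for p in range(kk, min(kk + tile, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] = s
    return C
```

Because tiling only reorders iterations of a fully permutable loop nest, the result is identical to the untiled version for any tile size.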

Extended Three Region Partitioning Method of Loops with Irregular Dependences (비규칙 종속성을 가진 루프의 확장된 세지역 분할 방법)

  • Jeong, Sam-Jin
    • Journal of the Korea Convergence Society / v.6 no.3 / pp.51-57 / 2015
  • This paper proposes an efficient method, the Extended Three-Region Partitioning Method, for maximizing parallelism in nested loops with irregular dependences. Our approach is based on Convex Hull theory, as well as on the minimum-dependence-distance tiling, unique-set-oriented partitioning, and three-region partitioning methods. In the proposed method, we first eliminate anti-dependences from the nested loop by variable renaming. After variable renaming, we present an algorithm that selects one or more appropriate lines from the four given lines LMLH, RMLH, LMLT, and RMLT. If only one line is selected, the method divides the iteration space into two parallel regions along that line. Otherwise, a second algorithm finds a serial region. The selected lines divide the iteration space into two parallel regions that are as large as possible and at most one serial region that is as small as possible. The proposed method gives much better speedup and extracts more parallelism than other existing three-region partitioning methods.
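The variable-renaming step that removes anti-dependences can be illustrated with a minimal one-dimensional sketch; the loop body below is a made-up example, not one taken from the paper:

```python
def shift_serial(a):
    """out[i] = out[i+1] + 1, executed in order: iteration i reads the
    location that iteration i+1 later overwrites (an anti-dependence),
    so the iterations cannot be freely reordered as written."""
    out = list(a)
    for i in range(len(out) - 1):
        out[i] = out[i + 1] + 1
    return out

def shift_renamed(a):
    """After renaming: all reads target a private copy and all writes target
    the output, so every iteration is independent and may run in parallel."""
    a_old = list(a)                  # the renamed (copied) array
    out = list(a)
    for i in range(len(out) - 1):    # iterations may now execute in any order
        out[i] = a_old[i + 1] + 1
    return out
```

The copy costs memory but removes the storage-related dependence entirely, which is why the paper applies renaming before partitioning the iteration space.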

3-channel Tiled-aperture Coherent-beam-combining System Based on Target-in-the-loop Monitoring and SPGD Algorithm (목표물 신호 모니터링 및 SPGD 알고리즘 기반 3 채널 타일형 결맞음 빔결합 시스템 연구)

  • Kim, Youngchan; Yun, Youngsun; Kim, Hansol; Chang, Hanbyul; Park, Jaedeok; Choe, Yunjin; Na, Jeongkyun; Yi, Joohan; Kang, Hyungu; Yeo, Minsu; Choi, Kyuhong; Noh, Young-Chul; Jeong, Yoonchan; Lee, Hyuk-Jae; Yu, Bong-Ahn; Yeom, Dong-Il; Jun, Changsu
    • Korean Journal of Optics and Photonics / v.32 no.1 / pp.1-8 / 2021
  • We have studied a tiled-aperture coherent-beam-combining system based on constructive interference, as a way to overcome the power limit of a single laser. A 1-watt-level, 3-channel coherent fiber laser and a 3-channel fiber array with triangular tiling and a tip-tilt function were developed. A monitoring system, a phase controller, and a 3-channel phase modulator formed a closed-loop control system, and the SPGD algorithm was applied. Eventually, phase locking at a rate of 5-67 kHz, with a peak-intensity efficiency comparable to the ideal case of 53.3%, was successfully realized. We were thus able to develop the essential elements of a tiled-aperture coherent-beam-combining system, which has the potential for the highest output power without any beam-combining components; a multichannel coherent-beam-combining system with higher output power and high speed is anticipated in the future.
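SPGD (stochastic parallel gradient descent) steers all channel phases in parallel from a single scalar metric, here the on-axis combined intensity. A minimal simulation sketch; the gain, perturbation size, iteration count, and channel phase offsets are all invented, not the paper's controller parameters:

```python
import cmath
import random

def spgd_phase_lock(n_ch=3, steps=600, delta=0.1, gain=2.0, seed=1):
    """Toy SPGD phase-locking loop: apply the same random perturbation with
    both signs, measure the metric difference, and step every channel's
    control in proportion to it (a stochastic estimate of the gradient)."""
    rng = random.Random(seed)
    true_offsets = [rng.uniform(-3.0, 3.0) for _ in range(n_ch)]  # unknown piston errors
    ctrl = [0.0] * n_ch                                           # phase-modulator commands

    def metric(c):
        # on-axis far-field intensity of n_ch coherently combined unit beams
        field = sum(cmath.exp(1j * (true_offsets[k] + c[k])) for k in range(n_ch))
        return abs(field) ** 2

    for _ in range(steps):
        pert = [rng.choice((-delta, delta)) for _ in range(n_ch)]
        dj = (metric([c + p for c, p in zip(ctrl, pert)])
              - metric([c - p for c, p in zip(ctrl, pert)]))
        ctrl = [c + gain * dj * p for c, p in zip(ctrl, pert)]    # parallel update
    return metric(ctrl) / n_ch ** 2  # combining efficiency: 1.0 = perfectly phased
```

The appeal of SPGD in this setting is that one photodetector behind a pinhole suffices as the metric, no matter how many channels are being locked.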

Memory data layout and DMA transfer technique research For efficient data transfer of CNN accelerator (CNN 가속기의 효율적인 데이터 전송을 위한 메모리 데이터 레이아웃 및 DMA 전송기법 연구)

  • Cho, Seok-Jae; Park, Sungkyung; Park, Chester Sungchung
    • Journal of IKEEE / v.24 no.2 / pp.559-569 / 2020
  • CNN, one of the deep-learning algorithms, is used in artificial-intelligence applications that store convolution-layer data in off-chip memory. DMA can reduce the processor load at every data transfer, and application performance degradation can be reduced by varying the order in which convolution-layer data are transferred to the accelerator's global buffer. For a basic layout with contiguous memory addresses, SG-DMA showed about a 3.4-fold improvement in DMA presetting performance compared with ordinary DMA, whereas for an ideal layout with discontiguous memory addresses, ordinary DMA was about 1,396 cycles faster than SG-DMA. Experiments showed that a suitable combination of memory data layout and DMA type can reduce the DMA preset load by about 86 percent.
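The trade-off this abstract measures can be illustrated by counting the descriptors a DMA engine must be programmed with: a contiguous layout collapses into a single plain transfer, while a discontiguous (e.g. tiled) layout needs a scatter-gather descriptor list whose per-descriptor setup cost SG-DMA amortizes. A toy model, with invented addresses and chunk sizes not taken from the paper:

```python
def dma_descriptors(chunks):
    """Coalesce address-contiguous transfer requests into DMA descriptors.

    chunks: list of (start_address, length) requests in transfer order.
    Adjacent requests whose addresses abut are merged, modeling how a
    contiguous memory layout lets one plain-DMA transfer replace a long
    scatter-gather descriptor chain.
    """
    descs = []
    for start, length in chunks:
        if descs and descs[-1][0] + descs[-1][1] == start:
            # request continues the previous one: extend, no new descriptor
            descs[-1] = (descs[-1][0], descs[-1][1] + length)
        else:
            descs.append((start, length))
    return descs
```

With a contiguous layout, `dma_descriptors` returns one descriptor, so ordinary DMA with a single preset wins; with a strided layout it returns one descriptor per chunk, which is where SG-DMA's chained descriptors pay off.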