Analysis of GPU-based Parallel Shifted Sort Algorithm by comparing with General GPU-based Tree Traversal

  • Kim, Heesu (Department of Digital Media, Duksung Women's University);
  • Park, Taejung (Department of Digital Media, Duksung Women's University)
  • Received: 2017.09.01
  • Reviewed: 2017.10.25
  • Published: 2017.10.31

Abstract

GPU-based tree traversal usually yields a smaller parallel speedup than one would expect. In this paper, we analyze the cause and show that it is warp divergence: branch instructions ("if ... else"), which appear commonly in tree traversal CUDA code, make threads diverge within a warp, the smallest physical unit of thread execution in the GPU hardware architecture. We also compare a kd-tree CUDA implementation with the parallel shifted sort algorithm, which minimizes warp divergence, to show that shifted sort has a structure that outperforms general tree traversal on the GPU. In our analysis, shifted sort search, in which no warp divergence occurred, ran more than 16 times faster than the GPU-based kd-tree search for $2^{23}$ query points and $2^{23}$ data points in $R^3$ space. The performance gap tends to widen as the number of query points and data points increases.
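The effect of "if ... else" branches on a warp can be illustrated with a toy cost model. The sketch below is not the paper's CUDA code and does not reproduce its measurements. It assumes a warp of 32 lanes, assumes that a warp whose lanes disagree on a branch executes both sides one after the other, and uses a simple 1-D interval bisection as a stand-in for a kd-tree descent. The names `branch_passes` and `descent_passes` are hypothetical.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def branch_passes(directions):
    """Serialized passes for one if/else under the toy model:
    1 if every lane takes the same side, 2 if the lanes split."""
    return len(set(directions))


def descent_passes(queries, depth):
    """Count passes while each lane bisects [0, 1) toward its own query.

    This is an assumed stand-in for a kd-tree descent: at every level a lane
    compares its query with the midpoint of its current interval and goes
    left or right, just like an if/else on the splitting plane.
    """
    lo = [0.0] * len(queries)
    hi = [1.0] * len(queries)
    total = 0
    for _ in range(depth):
        directions = []
        for lane, q in enumerate(queries):
            mid = (lo[lane] + hi[lane]) / 2
            go_left = q < mid
            directions.append(go_left)
            if go_left:
                hi[lane] = mid
            else:
                lo[lane] = mid
        total += branch_passes(directions)
    return total


# All lanes query the same point: no divergence at any level.
coherent = [0.1] * WARP_SIZE
# Lanes spread evenly over [0, 1): some lanes disagree at every level.
scattered = [lane / WARP_SIZE for lane in range(WARP_SIZE)]

print(descent_passes(coherent, 5), descent_passes(scattered, 5))
```

Under these assumptions the coherent warp needs 5 passes for 5 levels, while the scattered warp needs 10, because at every level at least one pair of lanes takes opposite sides of the branch. The model only captures the serialization of the two sides of a branch; it says nothing about memory access patterns or the actual speedups reported in the abstract.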
