Analysis of GPU-based Parallel Shifted Sort Algorithm by comparing with General GPU-based Tree Traversal

Kim, Heesu; Park, Taejung

doi:10.9728/dcs.2017.18.6.1151

Journal of Digital Contents Society (디지털콘텐츠학회 논문지)

Volume 18, Issue 6 / Pages 1151-1156 / 2017 / 1598-2009 (pISSN) / 2287-738X (eISSN)

Digital Contents Society (한국디지털콘텐츠학회)



Analysis of the GPU-based Parallel Shifted Sort Algorithm through Comparative Experiments with General GPU Tree Traversal

Kim, Heesu (Department of Digital Media, Duksung Women's University) ;
Park, Taejung (Department of Digital Media, Duksung Women's University)


Received : 2017.09.01
Accepted : 2017.10.25
Published : 2017.10.31

https://doi.org/10.9728/dcs.2017.18.6.1151

Abstract

Tree traversal on GPUs often delivers lower performance than expected. In this paper, we analyze the cause of this lower-than-expected performance and show that it is warp divergence caused by the branch instructions ("if ... else") that commonly appear in CUDA tree traversal code. We also compare the parallel shifted sort algorithm, which can reduce the number of warp divergences, with a kd-tree CUDA implementation and show that shifted sort runs faster thanks to fewer warp divergences. In our analysis, the shifted sort algorithm ran about 16 times faster than the kd-tree CUDA implementation for $2^{23}$ query points and $2^{23}$ data points in $\mathbb{R}^3$. The performance gap tends to widen as the numbers of query points and data points increase.

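In general, GPU-based tree traversal rarely achieves the parallel speedup one expects. In this paper, we analyze the cause and show that it is warp divergence triggered by branch statements (if statements) within a warp, the smallest physical unit of thread execution in the GPU parallel processing hardware architecture. We also compare against the parallel shifted sort algorithm, which can minimize such warp divergence, to show that shifted sort is structured to outperform general tree traversal on the GPU. In our analysis, shifted sort search, in which warp divergence did not occur, ran more than 16 times faster than GPU-based kd-tree search when the number of data points or queries was $2^{23}$ in three-dimensional space, and this performance gap tended to grow as the number of data points or queries increased.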
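The warp divergence described in the abstract can be illustrated with a small CPU-side model. The sketch below is not the authors' kd-tree or shifted sort code and is not CUDA. It is a simplified Python simulation, under the assumption that a warp has 32 lanes and that each lane runs a binary-search-style descent with one "if ... else" per step. The names `WARP_SIZE` and `count_divergent_steps` are hypothetical. A step is counted as divergent when active lanes disagree on the branch direction, which on real hardware would force the two paths to be executed one after the other.

```python
WARP_SIZE = 32  # assumed warp width for this model


def count_divergent_steps(sorted_data, queries):
    """Count lock-step iterations in which the lanes of one warp disagree.

    Each lane binary-searches sorted_data for its own query. At every step
    an active lane evaluates `if q < node: go left, else: go right`.
    This model ignores divergence caused by lanes finishing at different
    times; it only counts steps where both branch directions are taken.
    """
    # Half-open search interval [lo, hi) per lane.
    lanes = [(0, len(sorted_data)) for _ in queries]
    divergent = 0
    while any(lo < hi for lo, hi in lanes):
        directions = set()
        for i, (lo, hi) in enumerate(lanes):
            if lo >= hi:
                continue  # this lane has finished its search
            mid = (lo + hi) // 2
            if queries[i] < sorted_data[mid]:
                lanes[i] = (lo, mid)        # left branch
                directions.add("left")
            else:
                lanes[i] = (mid + 1, hi)    # right branch
                directions.add("right")
        if len(directions) == 2:
            divergent += 1  # both paths taken within the same warp
    return divergent


data = list(range(1024))
# All lanes ask for the same location: every lane follows the same path.
coherent_queries = [5.5] * WARP_SIZE
# Lanes ask for widely separated locations: paths split early and often.
scattered_queries = [i * 32 + 0.5 for i in range(WARP_SIZE)]

coherent_steps = count_divergent_steps(data, coherent_queries)
scattered_steps = count_divergent_steps(data, scattered_queries)
```

In this model the coherent queries never split the warp, while the scattered queries split it at several levels of the descent. This is only meant to show why data-dependent branches in traversal code can serialize a warp; the speedup figures in the abstract come from the authors' GPU measurements, not from this sketch.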

