I. Introduction
One of the fundamental problems in computational science and modeling is the system of linear equations, a core part of linear algebra and of modern mathematical problems [1]. Such a system can be expressed as Ax = B, where A is the coefficient matrix, B is the right-hand-side vector, and x is the unknown vector to be solved for. It is important to study the classical algorithms for solving these linear systems and to choose the right algorithm for the problem at hand. Two major classes of methods are normally used: direct methods and iterative methods. A direct method aims to produce the exact solution of the linear system. An iterative method, on the other hand, starts from an approximate solution and keeps refining it, based on its error relative to the exact solution, until the error is small enough [2].
Many iterative methods have been used for solving linear systems. When the linear system is symmetric and positive definite, the conjugate gradient (CG) method can be used to solve it quickly [3]. The performance of the CG method can be increased by parallel computing on a GPU, where many processors execute simultaneously; because of its many processors, a GPU-based solver can handle large linear systems faster than a CPU.
Traditionally, the GPU (Graphics Processing Unit) was used for rendering, handling common computer graphics operations [4]. Beyond that, the GPU also provides general-purpose computing functionality, known as GPGPU (General-Purpose computing on Graphics Processing Units) [5]. Since a GPU runs thousands of processing cores in massive parallelism, it offers efficient calculation and real-time processing performance. GPGPU is well suited to large-scale data with many computational operations, using a SIMD (Single Instruction, Multiple Data) architecture to produce effective results in terms of latency, bandwidth, and memory access [6]. Currently, the best-known technologies supporting GPGPU programming include NVIDIA's CUDA, OpenCL, OpenMP, and GLSL [7], all of which support GPU-based parallel processing. This makes it possible for a C/C++ program to exploit the GPU's ability to process large data in parallel.
Therefore, in this paper, we present a GPU algorithm that solves a system of linear equations by the conjugate gradient method on a parallel GPU architecture using the OpenGL compute shader.
II. Preliminaries
1. Related works
GPUs have been used to run parallel algorithms for decades and have created many research opportunities thanks to their massive computation speed [8]. In particular, the potential of the GPU has been exploited in a parallel conjugate gradient method for solving Poisson problems [9]. Helfenstein and Koko [10] proposed a parallel preconditioned conjugate gradient method on the GPU for linear systems with symmetric positive definite matrices and an SSOR preconditioner. Existing implementations of the CG algorithm have used CUDA and OpenCL, GPU languages for high-performance computing [11]. Mukunoki et al. [22] implemented the CG method on both CPU and GPU, conducted numerical experiments on the two platforms, and compared the results with existing work based on the ExBLAS library. Additionally, the conjugate gradient method has been used in the finite element method (FEM), which exhibits parallelism on a GPU [23].
Likewise, sparse matrix-vector multiplication (SpMV) has been studied in order to accelerate computation with GPU-based algorithms. Different sparse matrix formats have been used to gain advantages in storage and computation time [12-14].
2. OpenGL Compute Shader
OpenGL 4.3, released in 2012, introduced compute shaders, which are used entirely for computing, outside the regular rendering pipeline. The compute shader is a shader stage that operates differently from the other shader stages [15]. Furthermore, the SSBO (Shader Storage Buffer Object) was added as a buffer object alongside the UBO (Uniform Buffer Object). An SSBO acts as global memory on the GPU, and shaders can read and write data to it over a wide range of array sizes [16].
Executing a compute shader requires a call to glDispatchCompute(), an OpenGL API function invoked on the CPU, which also defines a three-dimensional grid of work groups that determines how the work groups are laid out in the execution space. The compute space consists of the global work group (all work items within all work groups) and local work groups. Each local work group is in turn divided into a number of invocations, or work items. Each work item in a local work group executes the compute shader, or kernel, which is programmable with a C-like syntax. Figure 1 demonstrates the scheme of the workload dimensions.
Fig. 1. Scheme of compute workload
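As a concrete illustration of these concepts, the following minimal GLSL sketch (with hypothetical buffer and variable names; it is not one of the shaders from this paper) declares a local work group size, binds two SSBOs, and indexes them by gl_GlobalInvocationID:

#version 430
layout(local_size_x = 64) in;   // each local work group holds 64 invocations

// SSBOs: buffer objects that every invocation can read from and write to
layout(std430, binding = 0) buffer InBuf  { float inData[];  };
layout(std430, binding = 1) buffer OutBuf { float outData[]; };

void main()
{
    // gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
    uint i = gl_GlobalInvocationID.x;   // unique index of this work item
    outData[i] = 2.0 * inData[i];       // each invocation processes one element
}

On the host side, glDispatchCompute(n, 1, 1) would launch n work groups, i.e. n * 64 invocations in the global work group.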
3. Matrix format
The coefficient matrix, with m rows and n columns, contains the coefficients of the variables in the set of linear equations. There are many representations for storing the coefficient matrix, each with a different storage format, computational characteristics, storage reduction, and method of fetching matrix entries. A matrix in which most of the values are zero is known as a sparse matrix. In terms of storage, sparse matrix formats keep only the non-zero values, since the zero values are meaningless to the system of linear equations.
3.1 Dense Format
Suppose we have a matrix M with five rows and five columns, as shown in Figure 2. The dense format stores every element of the matrix. This storage format is well suited when the nonzero values outnumber the zeros. In general, a dense matrix can be represented by a simple two-dimensional array. Due to the key aspect of vector architectures with SIMD (Single Instruction, Multiple Data), a single vector containing all elements of the matrix is required; the vector fetches the data from left to right and top to bottom of the matrix. Figure 3 shows the dense format of matrix M.
Fig. 2. Example Matrix M (5x5)
Fig. 3. Example of Dense Matrix format for matrix M
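Since Figures 2 and 3 are not reproduced here, the following sketch uses a hypothetical 5x5 matrix with 10 nonzero values (consistent with the count quoted in Section 3.2 below), written as a GLSL constant array; it is not the paper's actual matrix M.

// Hypothetical 5x5 matrix (10 nonzeros), for illustration only:
//     | 1  0  2  0  0 |
//     | 0  3  0  4  0 |
//     | 5  0  6  0  0 |
//     | 0  0  0  7  8 |
//     | 9  0  0  0 10 |
// Dense format: all 25 elements, fetched left to right, top to bottom:
const float denseM[25] = float[25](
    1.0, 0.0, 2.0, 0.0, 0.0,
    0.0, 3.0, 0.0, 4.0, 0.0,
    5.0, 0.0, 6.0, 0.0, 0.0,
    0.0, 0.0, 0.0, 7.0, 8.0,
    9.0, 0.0, 0.0, 0.0, 10.0);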
3.2 ELLPACK Format
Since matrix M has only 10 nonzero values among its 25 elements, the dense format wastes many entries on zeros. The ELLPACK format uses two Rows-by-K arrays to store the matrix M, where K is the maximum number of nonzero values per row [17]. ELLPACK requires a vector A to store the nonzero values and a vector Col to store the column indices of those values. Figure 4 shows an example of the ELLPACK format of matrix M.
Fig. 4. Example of ELLPACK matrix format for matrix M
In this paper, since a vector architecture is used, the above matrix format is flattened into a transposed vector; that is, the sparse matrix is stored in column-major order.
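For the hypothetical matrix used in the dense example above (K = 2 nonzeros per row), the ELLPACK arrays in the column-major layout just described would look as follows (again a sketch, not the paper's data):

// ELL format of the hypothetical matrix, stored column-major (transposed):
const float A[10]   = float[10](1.0, 3.0, 5.0, 7.0, 9.0,    // 1st nonzero of each row
                                2.0, 4.0, 6.0, 8.0, 10.0);  // 2nd nonzero of each row
const int   Col[10] = int[10](0, 1, 0, 3, 0,
                              2, 3, 2, 4, 4);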
3.3 Coordinate Format
The coordinate format, also known as COO, stores only the nonzero values of the matrix. COO is the simplest and most straightforward format in terms of memory allocation and data access, which makes it well suited to unstructured patterns of nonzero values in the matrix. It stores the nonzero elements as an array of triplets [18], where each triplet contains the row index, column index, and value of a nonzero element. For this reason, the COO format requires 3-by-nnz storage, where nnz is the number of nonzero elements. Figure 5 shows an example of sparse matrix storage in the COO format for matrix M.
Fig. 5. Example of COO Matrix format for matrix M
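In the COO layout, the hypothetical matrix from the dense example above would be stored as three parallel arrays of length nnz = 10 (a sketch, not the paper's data):

// COO format of the hypothetical matrix: one (row, column, value) triplet per nonzero
const int   row[10] = int[10](0, 0, 1, 1, 2, 2, 3, 3, 4, 4);
const int   col[10] = int[10](0, 2, 1, 3, 0, 2, 3, 4, 0, 4);
const float val[10] = float[10](1.0, 2.0, 3.0, 4.0, 5.0,
                                6.0, 7.0, 8.0, 9.0, 10.0);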
3.4 Compressed Sparse Row Format
The Compressed Sparse Row (CSR) format is very similar to the COO format [19]. To store the 5x5 matrix M, it requires three separate arrays: the first array (A) stores the nonzero values, the second array (J) stores the column indices of the nonzero elements, and the last array (I) stores, for each row of matrix M, the number of nonzero elements in that row plus the number of nonzero elements in all previous rows. Array I always starts with 0 and ends with the total number of nonzeros, so its size equals the number of rows plus one. Figure 6 illustrates the CSR format.
Fig. 6. Example of CSR Matrix format
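Applied to the same hypothetical matrix as above, the CSR arrays would be (a sketch, not the paper's data):

// CSR format of the hypothetical matrix. I[r] is the offset in A (and J)
// where row r starts; I has rows + 1 = 6 entries, from 0 up to nnz = 10.
const float A[10] = float[10](1.0, 2.0, 3.0, 4.0, 5.0,
                              6.0, 7.0, 8.0, 9.0, 10.0);
const int   J[10] = int[10](0, 2, 1, 3, 0, 2, 3, 4, 0, 4);
const int   I[6]  = int[6](0, 2, 4, 6, 8, 10);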
III. The Implementation
The conjugate gradient method [20] is an iterative method for solving the equation Ax = B with a given matrix A and vector B and an unknown vector x. The conjugate gradient method is given by the following equations.
\(r_{0}=p_{0}=B-Ax_{0}\) (1)
\(\alpha_{i}=\frac{r_{i}^{T} r_{i}}{\left(p_{i}^{T} A p_{i}\right)}\) (2)
\(x_{i+1}=x_{i}+\alpha_{i} p_{i}\) (3)
\(r_{i+1}=r_{i}-\alpha_{i} A p_{i}\) (4)
\(\beta_{i}=\frac{r_{i+1}^{T} r_{i+1}}{\left(r_{i}^{T} r_{i}\right)}\) (5)
\(p_{i+1}=r_{i+1}+\beta_{i} p_{i}\) (6)
Equation (1) is the initial step of the algorithm. Commonly, the input vector x is defined as the zero vector for the initial solution of the system, so equation (1) can be written as:
\(r_{0}=p_{0}=B\) (1.1)
Equations (2) to (6) form the routine that yields a new approximate solution at each step within a finite number of iterations. In exact arithmetic, the conjugate gradient method converges in at most n iterations.
The CPU and GPU implementations of the conjugate gradient method are largely similar. However, the GPU implementation additionally requires reading the data from a file and transferring it from the CPU to SSBOs, which we use as buffers for GPU computing, as presented in Figure 7. Each iteration of the conjugate gradient method consists of one matrix-vector multiplication, three vector dot products, and three vector operations.
Fig. 7. Conjugate Gradient GPU implementation
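The paper's shaders for these vector operations are not reproduced here, but as a hedged sketch, a dot product such as \(r_{i}^{T} r_{i}\) could be reduced with atomicAdd in the same way as the SpMV shaders described below. Core GLSL restricts atomicAdd to integer types, so this sketch assumes a vendor extension such as GL_NV_shader_atomic_float; all buffer and uniform names are illustrative.

#version 430
#extension GL_NV_shader_atomic_float : require
layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer X   { float x[]; };
layout(std430, binding = 1) buffer Y   { float y[]; };
layout(std430, binding = 2) buffer Dot { float result; };  // zeroed before dispatch
uniform uint n;   // vector length
void main()
{
    uint i = gl_GlobalInvocationID.x;
    if (i >= n) return;
    atomicAdd(result, x[i] * y[i]);   // each invocation adds one partial product
}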
The sparse matrix-vector multiplication (SpMV) operation is crucial and requires a suitable matrix format for acceleration. For each SpMV compute shader, the workload dimensions differ in terms of how each invocation accesses the SSBOs.
The dense-format SpMV algorithm is presented in Table 1. It requires a two-dimensional grid of work items (invocations), where the size of the X dimension refers to the number of rows in the matrix and the Y dimension to the number of columns. To access each matrix element correctly by invocation index, the proposed algorithm uses gl_GlobalInvocationID, whose exact value is WorkGroupID * WorkGroupSize + LocalInvocationID. Since the dense matrix is mapped to a buffer and accessed like an array, the buffer index is computed by the equation in Figure 8.
Fig. 8. Accessing the matrix by 2D work group and index
Once the product of each dense matrix element with the vector P is computed, the new data must be written concurrently to the result buffer. This writing process can suffer from a problem called races on global data. To solve it, atomicAdd is used to write the new data to the SSBO simultaneously.
Table 1. Dense format SpMV Compute Shader
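Since Table 1 is not reproduced here, the following is a minimal GLSL sketch of a dense-format SpMV compute shader along the lines described above; buffer and uniform names are assumptions, and float atomicAdd again assumes GL_NV_shader_atomic_float.

#version 430
#extension GL_NV_shader_atomic_float : require
layout(local_size_x = 8, local_size_y = 8) in;
layout(std430, binding = 0) buffer MatA   { float A[]; };       // dense matrix, row-major
layout(std430, binding = 1) buffer VecP   { float P[]; };
layout(std430, binding = 2) buffer Result { float result[]; };  // zeroed before dispatch
uniform uint rows;
uniform uint cols;
void main()
{
    uint r = gl_GlobalInvocationID.x;   // X dimension: matrix row
    uint c = gl_GlobalInvocationID.y;   // Y dimension: matrix column
    if (r >= rows || c >= cols) return;
    // buffer index r * cols + c follows the row-major layout of the dense format;
    // atomicAdd resolves the race between invocations writing to the same row
    atomicAdd(result[r], A[r * cols + c] * P[c]);
}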
Table 2 demonstrates the SpMV algorithm for the ELL format. The algorithm requires a one-dimensional grid of invocations whose global size corresponds to the length of the value array A and the column vector array ACol. Since a compute shader is used, we can send data to the shader through uniform variables (Row and densest). This algorithm runs as one thread per matrix row with coalesced memory access, which turns the memory accesses to the A and ACol arrays into single transactions.
Table 2. ELL format SpMV Compute Shader
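As Table 2 is not reproduced here, a sketch of such an ELL kernel, under the one-invocation-per-row scheme with the column-major storage from Section 3.2 (names are assumptions), could look as follows:

#version 430
layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer ValA   { float A[];    };  // Rows x K values, column-major, zero-padded
layout(std430, binding = 1) buffer ColA   { int   ACol[]; };  // matching column indices
layout(std430, binding = 2) buffer VecP   { float P[];    };
layout(std430, binding = 3) buffer Result { float result[]; };
uniform uint Rows;   // number of matrix rows
uniform uint K;      // maximum nonzeros per row
void main()
{
    uint r = gl_GlobalInvocationID.x;
    if (r >= Rows) return;
    float sum = 0.0;
    for (uint k = 0u; k < K; ++k)
        // column-major indexing: adjacent invocations read adjacent entries (coalesced)
        sum += A[k * Rows + r] * P[ACol[k * Rows + r]];
    result[r] = sum;   // one writer per row, so no atomic is needed
}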
Table 3. COO format SpMV Compute Shader
The SpMV for the COO format, shown in Table 3, processes each nonzero element: the global invocation index is used to access each nonzero value, which is multiplied with the vector, and the row and column indices of the matrix, stored explicitly in SSBOs, are used as well. After each sparse matrix element is multiplied with the vector, the result is written to the result SSBO by an atomicAdd operation.
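Since Table 3 is not reproduced here, a sketch of such a COO kernel (names are assumptions; float atomicAdd again assumes GL_NV_shader_atomic_float) is:

#version 430
#extension GL_NV_shader_atomic_float : require
layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer Val    { float val[]; };   // nonzero values
layout(std430, binding = 1) buffer RowIdx { int   row[]; };   // explicit row indices
layout(std430, binding = 2) buffer ColIdx { int   col[]; };   // explicit column indices
layout(std430, binding = 3) buffer VecP   { float P[];   };
layout(std430, binding = 4) buffer Result { float result[]; };
uniform uint nnz;   // number of nonzero elements
void main()
{
    uint i = gl_GlobalInvocationID.x;   // one invocation per nonzero element
    if (i >= nnz) return;
    // several nonzeros may share a row, so the write must be atomic
    atomicAdd(result[row[i]], val[i] * P[col[i]]);
}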
Table 4 illustrates the CSR SpMV compute shader, which is almost identical to the COO version except that it uses the compressed row array (the IA array) and one thread per matrix row for parallel computing; this is commonly called a scalar kernel.
Table 4. CSR format SpMV Compute Shader
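As Table 4 is not reproduced here, a sketch of such a scalar CSR kernel (names are assumptions) is given below; because each row has exactly one writer, no atomics are required:

#version 430
layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer Val    { float A[];  };  // nonzero values
layout(std430, binding = 1) buffer ColJ   { int   J[];  };  // column indices
layout(std430, binding = 2) buffer RowIA  { int   IA[]; };  // compressed row array, size rows + 1
layout(std430, binding = 3) buffer VecP   { float P[];  };
layout(std430, binding = 4) buffer Result { float result[]; };
uniform uint rows;
void main()
{
    uint r = gl_GlobalInvocationID.x;   // one invocation per matrix row
    if (r >= rows) return;
    float sum = 0.0;
    for (int k = IA[r]; k < IA[r + 1]; ++k)   // walk this row's slice of A and J
        sum += A[k] * P[J[k]];
    result[r] = sum;
}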
IV. Experiment Results
In our experiment, we measured the execution time of one iteration of the conjugate gradient method using a naive (CPU-based) approach running on a single core, compared with our parallelized GPU-based linear solver implemented with the OpenGL compute shader. We also provide the results for each sparse matrix format in Table 5 and Table 6. The C++ library is used to retrieve execution times for the CPU implementation, and OpenGL is likewise used to obtain the duration of the GPU algorithm. The proposed algorithm is implemented with Visual Studio 2013 and runs on the Windows platform with an Intel Core i7 CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1070 with 8 GB of VRAM.
The sparse matrix dataset from [21] is used for the proposed method. Each element of the right-hand-side vector is the sum of the corresponding row of the matrix. Table 5 shows the information for the matrices used in the experimental tests. All of the test matrices come from computational fluid dynamics problems, except lhr07, which is a chemical process simulation problem, and kineticBatchReactor_2, which is an optimal control problem. In addition, all matrices are real and structured. However, only ex15 is symmetric positive definite; kineticBatchReactor_2 is symmetric but indefinite.
Table 5. Sparse Matrices Data
The performance comparison in Table 8 shows that the GPU-based solver with the dense format is on average 233 times faster than the CPU-based solver, but it wastes a significant amount of memory compared with the other formats. The ELL format wastes less memory and is on average 174 times faster than the CPU-based solver, but it is not suitable for unstructured patterns of nonzero values. The COO format is the most widely applicable to general patterns and takes less computation time in both the CPU-based and GPU-based solvers, with an average speed-up ratio of 20. The CSR format also performs well in both implementations, with the GPU-based solver on average 24 times faster than the CPU-based one.
Furthermore, the comparison of the average computing time of each format in the CPU-based and GPU-based solvers shows that the CSR format improves the performance of the conjugate gradient algorithm. Besides that, ELL also offers acceptable performance, but this format cannot serve as a general representation for sparse matrices.
Table 6. CPU implementation of the CG algorithm: one iteration, measured in milliseconds
Table 7. GPU implementation of the CG algorithm: one iteration, measured in milliseconds
Table 8. Speed up ratio (CPU/GPU)
V. Conclusion
We have proposed a GPU-parallel algorithm for sparse linear solving by the conjugate gradient method using the OpenGL compute shader, for systems whose matrices are symmetric and positive definite. Different sparse matrix storage formats are used in order to achieve high performance while saving as much memory as possible. The proposed method also comprises a GPU library built on the OpenGL compute shader that handles data using atomic operations on SSBOs.
In conclusion, for the GPU-based parallel algorithm that solves sparse linear systems with the iterative conjugate gradient method, the CSR storage format offers the best performance, with an average computing time over all cases of 15.37 ms for the CPU-based solver and 0.64 ms for the GPU-based solver. The ELL format also achieves good GPU-based performance, with an average execution time of 0.69 ms, while the COO format averages 0.92 ms. These results show that CSR is a suitable format for the general sparse matrix case in terms of storage and parallelization, allowing the CG solver to achieve high performance.
In future work, our approach can be used for solving dynamic simulation problems, such as node-node constraints that maintain the distance of each constraint in a deformable object, using the GPU-based CG solver.
ACKNOWLEDGEMENT
This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF-2019R1F1A1062752) funded by the Ministry of Education, and was supported by the Soonchunhyang University Research Fund.
References
- S. Abbasbandy, A. Jafarian, and R. Ezzati, "Conjugate gradient method for fuzzy symmetric positive definite system of linear equations," Applied Mathematics and Computation, Vol. 171, No. 2, pp. 1184-1191, 2005. DOI: https://doi.org/10.1016/j.amc.2005.01.110
- M. Kryshchuk, and J. Lavendels, "Iterative Method for Solving a System of Linear Equations," Procedia Computer Science, Vol. 104, pp. 133-137, 2017. DOI: https://doi.org/10.1016/j.procs.2017.01.085
- A. Bunse-Gerstner, and R. Stover, "On a conjugate gradient-type method for solving complex symmetric linear systems," Linear Algebra and its Applications, Vol. 287, No. 1-3, pp. 105-123, 1999. DOI: https://doi.org/10.1016/S0024-3795(98)10091-5
- A. Cano, "A survey on graphic processing unit computing for large-scale data mining," WIREs Data Mining and Knowledge Discovery, Vol. 8, No. 1, 2018. DOI: https://doi.org/10.1002/widm.1232
- J. Sim et al., "A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications," Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Vol. 47, No. 8, pp. 11-22, 2012. DOI: https://doi.org/10.1145/2145816.2145819
- D. Gerzhoy, X. Sun, M. Zuzak, and D. Yeung, "Nested MIMD-SIMD Parallelization for Heterogeneous Microprocessors," ACM Trans. Archit, Vol. 16, No. 4, 2019. DOI: https://doi.org/10.1145/3368304
- W. Shin, K. H. Yoo and N. Baek, "Large-Scale Data Computing Performance Comparisons on SYCL Heterogeneous Parallel Processing Layer Implementations," Appl. Sci., Vol. 10, No. 5, pp. 1656, 2020. DOI: https://doi.org/10.3390/app10051656
- J. S. Kirtzic, "A parallel algorithm design model for the gpu architecture," Ph.D. Dissertation. University of Texas at Dallas, USA. Advisor(s) Ovidiu Daescu., 2012.
- M. Ament et al., "A Parallel Preconditioned Conjugate Gradient Solver for the Poisson Problem on a Multi-GPU Platform," 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, Pisa, pp. 583-592, 2010. DOI: https://doi.org/10.1109/PDP.2010.51
- R. Helfenstein and J. Koko, "Parallel preconditioned conjugate gradient algorithm on GPU," Journal of Computational and Applied Mathematics, Vol. 236, No. 15, pp. 3584-3590, 2012. DOI: https://doi.org/10.1016/j.cam.2011.04.025
- A. C. Ahamed, and F. Magoules, "Conjugate gradient method with graphics processing unit acceleration: CUDA vs OpenCL," Advances in Engineering Software, Vol. 111, pp. 32-42, 2017. DOI: https://doi.org/10.1016/j.advengsoft.2016.10.002
- M. M. Baskaran, and R. Bordawekar, "Optimizing Sparse Matrix-Vector Multiplications on GPUs," Computer Science. Vol. 8, pp. 812-47, 2009.
- A. Benatia, W. Ji, Y. Wang and F. Shi, "Sparse Matrix Format Selection with Multiclass SVM for SpMV on GPU," 2016 45th International Conference on Parallel Processing (ICPP), Philadelphia, pp. 496-505, 2016. DOI: https://doi.org/10.1109/ICPP.2016.64
- N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, pp. 1-11, 2009. DOI: https://doi.org/10.1145/1654059.1654078
- D. Shreiner, G. Sellers, J. Kessenich and B. Licea-Kane, "OpenGL programming guide: The official guide to learning OpenGL, version 4.3," Addison-Wesley, 2013.
- N. J. Sung, Y. J. Choi and M. Hong, "Parallel Structure Design Method for Mass Spring Simulation," Journal of the Korea Computer Graphics Society. Vol. 25, pp. 55-63, 2019. DOI: https://doi.org/10.15701/kcgs.2019.25.3.55
- N. Bell and M. Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, December 2008.
- W. Cao, L. Yao, Z. Li, Y. Wang and Z. Wang, "Implementing Sparse Matrix-Vector multiplication using CUDA based on a hybrid sparse matrix format," 2010 International Conference on Computer Application and System Modeling, pp. V11-161-V11-165, 2010. DOI: https://doi.org/10.1109/ICCASM.2010.5623237
- G. E. Blelloch et al., "Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors," Technical Report, Carnegie Mellon University, USA, 1993.
- M. R. Hestenes, and E. Stiefel, "Methods of conjugate gradients for solving linear systems," Journal of research of the National Bureau of Standards, Vol. 49, pp. 409-436, 1952. DOI: https://doi.org/10.6028/jres.049.044
- SuiteSparse Matrix Collection, https://sparse.tamu.edu/
- D. Mukunoki, K. Ozaki, T. Ogita, and R. Iakymchuk, "Conjugate Gradient Solvers with High Accuracy and Bit-wise Reproducibility between CPU and GPU using Ozaki scheme," The International Conference on High Performance Computing in Asia-Pacific Region, pp. 100-109, 2021. DOI: https://doi.org/10.1145/3432261.3432270
- N. K. Pikle, S. R. Sathe, and A. Y. Vyavhare, "GPGPU-based parallel computing applied in the FEM using the conjugate gradient algorithm: a review," Sadhana, Vol. 43, No. 111, 2018. DOI: https://doi.org/10.1007/s12046-018-0892-0