Search | Korea Science

KAWS: Coordinate Kernel-Aware Warp Scheduling and Warp Sharing Mechanism for Advanced GPUs

Vo, Viet Tan;Kim, Cheol Hong
- Journal of Information Processing Systems
- /
- v.17 no.6
- /
- pp.1157-1169
- /
- 2021
Modern graphics processor unit (GPU) architectures offer significant hardware resource enhancements for parallel computing. However, without software optimization, GPUs continuously exhibit hardware resource underutilization. In this paper, we indicate the need to alter different warp scheduler schemes during different kernel execution periods to improve resource utilization. Existing warp schedulers cannot be aware of the kernel progress to provide an effective scheduling policy. In addition, we identified the potential for improving resource utilization for multiple-warp-scheduler GPUs by sharing stalling warps with selected warp schedulers. To address the efficiency issue of the present GPU, we coordinated the kernel-aware warp scheduler and warp sharing mechanism (KAWS). The proposed warp scheduler acknowledges the execution progress of the running kernel to adapt to a more effective scheduling policy when the kernel progress attains a point of resource underutilization. Meanwhile, the warp-sharing mechanism distributes stalling warps to different warp schedulers wherein the execution pipeline unit is ready. Our design achieves performance that is on an average higher than that of the traditional warp scheduler by 7.97% and employs marginal additional hardware overhead.
https://doi.org/10.3745/JIPS.01.0084 인용 PDF KSCI

A Design of a High Performance Stream Processor without Superscalar Architecture (슈퍼스칼라 구조를 갖지 않는 고성능 Stream Processor 설계)

Lee, Kwan-Ho;Kim, Chi-Yong
- Journal of IKEEE
- /
- v.21 no.1
- /
- pp.77-80
- /
- 2017
In this paper, we proposed a way to improve performance of GP-GPU by deletion of superscalar issue from its original form. At first, we simplified the structure of stream processor in order to eliminate superscalar issue. Under this condition, preservation of hardware size and increasing of thread number were followed by functional improvement of GP-GPU. As the number of thread was getting larger, we proposed the new model of warp scheduler which adjusts the group of thread. This superscalar issue-deleted warp scheduler transferred the instructions to warp which was activated by Round Robin Scheduling. Performance comparison was conducted by Gaussian filtering and the results indicated that our newly designed GP-GPU showing 7.89 times better in its performance than original one.
https://doi.org/10.7471/ikeee.2017.21.1.77 인용 PDF KSCI

Latency Hiding based Warp Scheduling Policy for High Performance GPUs

Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
- Journal of the Korea Society of Computer and Information
- /
- v.24 no.4
- /
- pp.1-9
- /
- 2019
LRR(Loose Round Robin) warp scheduling policy for GPU architecture results in high warp-level parallelism and balanced loads across multiple warps. However, traditional LRR policy makes multiple warps execute long latency operations at the same time. In cases that no more warps to be issued under long latency, the throughput of GPUs may be degraded significantly. In this paper, we propose a new warp scheduling policy which utilizes latency hiding, leading to more utilized memory resources in high performance GPUs. The proposed warp scheduler prioritizes memory instruction based on GTO(Greedy Then Oldest) policy in order to provide reduced memory stalls. When no warps can execute memory instruction any more, the warp scheduler selects a warp for computation instruction by round robin manner. Furthermore, our proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, our proposed technique improves GPU performance by 12.7% and 5.6% over LRR and GTO on average, respectively.
https://doi.org/10.9708/jksci.2019.24.04.001 인용 PDF KSCI HTML

Design of a High-Performance Mobile GPGPU with SIMT Architecture based on a Small-size Warp Scheduler (작은 크기의 Warp 스케쥴러 기반 SIMT구조 고성능 모바일 GPGPU 설계)

Lee, Kwang-Yeob
- Journal of IKEEE
- /
- v.25 no.3
- /
- pp.479-484
- /
- 2021
This paper proposed and designed a structure to achieve high performance with a small number of cores in GPGPU with SIMT structure. GPGPU for application to mobile devices requires a structure to increase performance compared to power consumption. In order to reduce power consumption, the number of cores decreased, but to improve performance, the size of the warp scheduler for managing threads was set to 4, which was greatly reduced than 32 of general GPGPU. Reducing warp size can reduce the number of idle cycles in pipelines and efficiently apply memory latency to reduce miss penalty when accessing cache memory. The designed GPGPU measured computational performance using a test program that includes floating point operations and measured power consumption through a 28nm CMOS process to obtain 104.5GFlops/Watt as a performance per power. The results of this paper showed about four times better performance per power compared to Tegra K1 of Nvidia
https://doi.org/10.7471/ikeee.2021.25.3.479 인용 PDF KSCI

MSHR-Aware Dynamic Warp Scheduler for High Performance GPUs (GPU 성능 향상을 위한 MSHR 활용률 기반 동적 워프 스케줄러)

Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
- KIPS Transactions on Computer and Communication Systems
- /
- v.8 no.5
- /
- pp.111-118
- /
- 2019
Recent graphic processing units (GPUs) provide high throughput by using powerful hardware resources. However, massive memory accesses cause GPU performance degradation due to cache inefficiency. Therefore, the performance of GPU can be improved by reducing thread parallelism when cache suffers memory contention. In this paper, we propose a dynamic warp scheduler which controls thread parallelism according to degree of cache contention. Usually, the greedy then oldest (GTO) policy for issuing warp shows lower parallelism than loose round robin (LRR) policy. Therefore, the proposed warp scheduler employs the LRR warp scheduling policy when Miss Status Holding Register(MSHR) utilization is low. On the other hand, the GTO policy is employed in order to reduce thread parallelism when MSHRs utilization is high. Our proposed technique shows better performance compared with LRR and GTO policy since it selects efficient scheduling policy dynamically. According to our experimental results, our proposed technique provides IPC improvement by 12.8% and 3.5% over LRR and GTO on average, respectively.
https://doi.org/10.3745/KTCCS.2019.8.5.111 인용 PDF KSCI HTML

Thread Distribution Method of GP-GPU for Accelerating Parallel Algorithms (병렬 알고리즘의 가속화를 위한 GP-GPU의 Thread할당 기법)

Lee, Kwan-Ho;Kim, Chi-Yong
- Journal of IKEEE
- /
- v.21 no.1
- /
- pp.92-95
- /
- 2017
In this paper, we proposed a way to improve function of small scale GP-GPU. Instead of using superscalar which increase scheduling-complexity, we suggested the application of simple core to maximize GP-GPU performance. Our studies also demonstrated that simplified Stream Processor is one of the way to achieve functional improvement in GP-GPU. In addition, we found that developing of optimal thread-assigning method in Warp Scheduler for specific application improves functional performance of GP-GPU. For examination of GP-GPU functional performance, we suggested the thread-assigning way which coordinated with Deep-Learning system; a part of Neural Network. As a result, we found that functional index in algorithm of Neural Network was increased to 90%, 98% compared with Intel CPU and ARM cortex-A15 4 core respectively.
https://doi.org/10.7471/ikeee.2017.21.1.92 인용 PDF KSCI

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization (GPGPU 자원 활용 개선을 위한 블록 지연시간 기반 워프 스케줄링 기법)

Thuan, Do Cong;Choi, Yong;Kim, Jong Myon;Kim, Cheol Hong
- KIPS Transactions on Computer and Communication Systems
- /
- v.6 no.5
- /
- pp.219-230
- /
- 2017
General-Purpose Graphics Processing Units (GPGPUs) build massively parallel architecture and apply multithreading technology to explore parallelism. By using programming models like CUDA, and OpenCL, GPGPUs are becoming the best in exploiting plentiful thread-level parallelism caused by parallel applications. Unfortunately, modern GPGPU cannot efficiently utilize its available hardware resources for numerous general-purpose applications. One of the primary reasons is the inefficiency of existing warp/thread block schedulers in hiding long latency instructions, resulting in lost opportunity to improve the performance. This paper studies the effects of hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that can alleviate the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long latency and warps with short latency and then schedules the warps with long latency before the warps with short latency. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that can dynamically reduce the number of streaming multiprocessors to which will be assigned thread blocks when encountering a high contention degree at the memory and interconnection network. Based on our experiments on a 15-streaming multiprocessor GPGPU platform, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin warp scheduling policy. This paper also shows that the GPGPU performance can be improved by approximately 8.9% on average when the two proposed techniques are combined.
https://doi.org/10.3745/KTCCS.2017.6.5.219 인용 PDF KSCI

Design of a Dispatch Unit & Operand Selection Unit for Improving the SIMT Based GP-GPU Instruction Performance (SIMT구조 GP-GPU의 명령어 처리 성능 향상을 위한 Dispatch Unit과 Operand Selection Unit설계)

Kwak, Jae Chang
- Journal of IKEEE
- /
- v.19 no.3
- /
- pp.455-459
- /
- 2015
This paper proposes a dispatch unit of GP-GPU with SIMT architecture to support the acceleration of general-purpose operation as well as graphics processing. If all the information of an operand used instructions issued from the warp scheduler is decoded, an unnecessary operand load occurs, resulting in register loads. To resolve this problem, this paper proposes a method that can reduce the operand load and the load on the resister by decoding only the information of the operand using a pre-decoding method. The operand information from the dispatch unit is passed to the operand selection unit with preventing register bank collisions. Thus the overall performance are improved. In the simulation test, the total clock cycles required by processing 10,000 arbitrary instructions issued from the wrap scheduler using ModelSim SE 10.0b are measured. It shows that the application of the dispatch unit equipped with the pre-decoding function proposed in this paper can make an improvement of about 12% in processing performance compared to the conventional method.
https://doi.org/10.7471/ikeee.2015.19.3.455 인용 PDF KSCI

A new warp scheduling technique for improving the performance of GPUs by utilizing MSHR information (GPU 성능 향상을 위한 MSHR 정보 기반 워프 스케줄링 기법)

Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
- The Journal of Korean Institute of Next Generation Computing
- /
- v.13 no.3
- /
- pp.72-83
- /
- 2017
GPUs can provide high throughput with latency hiding by executing many warps in parallel. MSHR(Miss Status Holding Registers) for L1 data cache tracks cache miss requests until required data is serviced from lower level memory. In recent GPUs, excessive requests for cache resources cause underutilization problem of GPU resources due to cache resource reservation fails. In this paper, we propose a new warp scheduling technique to reduce stall cycles under MSHR resource shortage. Cache miss rates for each warp is predicted based on the observation that each warp shows similar cache miss rates for long period. The warps showing low miss rates or computation-intensive warps are given high priority to be issued when MSHR is full status. Our proposal improves GPU performance by utilizing cache resource more efficiently based on cache miss rate prediction and monitoring the MSHR entries. According to our experimental results, reservation fail cycles can be reduced by 25.7% and IPC is increased by 6.2% with the proposed scheduling technique compared to loose round robin scheduler.

An IPC-based Dynamic Cooperative Thread Array Scheduling Scheme for GPUs

Son, Dong Oh;Kim, Jong Myon;Kim, Cheol Hong
- Journal of the Korea Society of Computer and Information
- /
- v.21 no.2
- /
- pp.9-16
- /
- 2016
Recently, many research groups have focused on GPGPUs in order to improve the performance of computing systems. GPGPUs can execute general-purpose applications as well as graphics applications by using parallel GPU hardware resources. GPGPUs can process thousands of threads based on warp scheduling and CTA scheduling. In this paper, we utilize the traditional CTA scheduler to assign a various number of CTAs to SMs. According to our simulation results, increasing the number of CTAs assigned to the SM statically does not improve the performance. To solve the problem in traditional CTA scheduling schemes, we propose a new IPC-based dynamic CTA scheduling scheme. Compared to traditional CTA scheduling schemes, the proposed dynamic CTA scheduling scheme can increase the GPU performance by up to 13.1%.
https://doi.org/10.9708/jksci.2016.21.2.009 인용 PDF KSCI

Search Result 10, Processing Time 0.021 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)