Search | Korea Science

Thread Block Scheduling for GPGPU based on Fine-Grained Resource Utilization (상세 자원 이용률에 기반한 병렬 가속기용 스레드 블록 스케줄링)

Bahn, Hyokyung;Cho, Kyungwoon
- The Journal of the Institute of Internet, Broadcasting and Communication, v.22 no.5, pp.49-54, 2022
With the recent widespread adoption of general-purpose GPUs (GPGPUs) in cloud systems, maximizing resource utilization through GPGPU multitasking has become an important issue. In this article, we show that allocating resources by classifying workloads as compute-bound or memory-bound is not sufficient with respect to resource utilization, and we present a new thread block scheduling policy for GPGPUs that uses the fine-grained resource utilization of each workload. Unlike previous approaches, the proposed policy reduces scheduling overhead by separating profiling from scheduling, and maximizes resource utilization by co-locating workloads with different bottleneck resources. Through simulations under various virtual machine scenarios, we show that the proposed policy improves GPGPU throughput by 130.6% on average and by up to 161.4%.
https://doi.org/10.7236/JIIBC.2022.22.5.49
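
The co-location idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's algorithm: the resource names, the capacity limit of 1.0 per resource, and the helper names `combined_utilization` and `best_partner` are all assumptions made for illustration. Given per-workload utilization profiles collected ahead of time, it pairs a workload with the partner that fills the most capacity without oversubscribing any single resource.

```python
from typing import Dict, Optional

# Hypothetical profile: resource name -> utilization in [0, 1],
# assumed to come from a separate, earlier profiling step.
Profile = Dict[str, float]


def combined_utilization(a: Profile, b: Profile) -> Optional[float]:
    """Return the summed utilization of a and b, or None if any
    single resource would exceed its capacity (assumed to be 1.0)."""
    total = 0.0
    for resource in a.keys() | b.keys():
        used = a.get(resource, 0.0) + b.get(resource, 0.0)
        if used > 1.0:
            return None  # the two workloads share a bottleneck
        total += used
    return total


def best_partner(target: str, profiles: Dict[str, Profile]) -> Optional[str]:
    """Pick the workload that, co-located with `target`, uses the most
    total capacity without oversubscribing any resource."""
    best_name, best_score = None, -1.0
    for name in sorted(profiles):
        if name == target:
            continue
        score = combined_utilization(profiles[target], profiles[name])
        if score is not None and score > best_score:
            best_name, best_score = name, score
    return best_name


profiles = {
    "A": {"alu": 0.8, "mem": 0.1},  # compute-heavy
    "B": {"alu": 0.7, "mem": 0.3},  # compute-heavy
    "C": {"alu": 0.1, "mem": 0.8},  # memory-heavy
}
partner_of_a = best_partner("A", profiles)
```

Here workload A is paired with C rather than B, because A and B would both saturate the same (hypothetical) `alu` resource.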

A Design of a High Performance Stream Processor without Superscalar Architecture (슈퍼스칼라 구조를 갖지 않는 고성능 Stream Processor 설계)

Lee, Kwan-Ho;Kim, Chi-Yong
- Journal of IKEEE, v.21 no.1, pp.77-80, 2017
In this paper, we propose a way to improve GP-GPU performance by removing superscalar issue from the original design. First, we simplify the structure of the stream processor in order to eliminate superscalar issue. Under this condition, the hardware size is preserved while the number of threads is increased, which improves GP-GPU performance. Because the number of threads grows, we propose a new warp scheduler model that adjusts the thread groups. This warp scheduler, which has no superscalar issue, dispatches instructions to the warp activated by round-robin scheduling. We compared performance using Gaussian filtering, and the results indicate that the newly designed GP-GPU performs 7.89 times better than the original.
https://doi.org/10.7471/ikeee.2017.21.1.77

Thread Distribution Method of GP-GPU for Accelerating Parallel Algorithms (병렬 알고리즘의 가속화를 위한 GP-GPU의 Thread할당 기법)

Lee, Kwan-Ho;Kim, Chi-Yong
- Journal of IKEEE, v.21 no.1, pp.92-95, 2017
In this paper, we propose a way to improve the performance of a small-scale GP-GPU. Instead of using a superscalar design, which increases scheduling complexity, we suggest using simple cores to maximize GP-GPU performance. Our study also demonstrates that a simplified stream processor is one way to improve GP-GPU performance. In addition, we find that developing an optimal thread-assignment method in the warp scheduler for a specific application improves GP-GPU performance. To evaluate GP-GPU performance, we suggest a thread-assignment method tailored to a deep-learning system, a kind of neural network. As a result, we found that the performance index for the neural network algorithm increased to 90% and 98% compared with an Intel CPU and a 4-core ARM Cortex-A15, respectively.
https://doi.org/10.7471/ikeee.2017.21.1.92

MSHR-Aware Dynamic Warp Scheduler for High Performance GPUs (GPU 성능 향상을 위한 MSHR 활용률 기반 동적 워프 스케줄러)

Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
- KIPS Transactions on Computer and Communication Systems, v.8 no.5, pp.111-118, 2019
Recent graphics processing units (GPUs) provide high throughput by using powerful hardware resources. However, massive numbers of memory accesses degrade GPU performance because of cache inefficiency. Therefore, GPU performance can be improved by reducing thread parallelism when the cache suffers from memory contention. In this paper, we propose a dynamic warp scheduler that controls thread parallelism according to the degree of cache contention. Usually, the greedy-then-oldest (GTO) policy for issuing warps shows lower parallelism than the loose round robin (LRR) policy. Therefore, the proposed warp scheduler employs the LRR warp scheduling policy when Miss Status Holding Register (MSHR) utilization is low. On the other hand, the GTO policy is employed to reduce thread parallelism when MSHR utilization is high. Our proposed technique performs better than either the LRR or the GTO policy alone, since it selects the more efficient scheduling policy dynamically. According to our experimental results, the proposed technique improves IPC by 12.8% and 3.5% over LRR and GTO on average, respectively.
https://doi.org/10.3745/KTCCS.2019.8.5.111
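
The policy switch described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the 0.5 utilization threshold, the function names `select_policy` and `next_warp`, and the simplified LRR and GTO selection rules are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Set


def select_policy(mshr_in_use: int, mshr_total: int,
                  threshold: float = 0.5) -> str:
    """Choose LRR while MSHR utilization is low and GTO once it is high.
    The threshold value is an assumption for illustration."""
    utilization = mshr_in_use / mshr_total
    return "GTO" if utilization >= threshold else "LRR"


def next_warp(policy: str, ready: Set[int], last_issued: int,
              age_order: List[int], num_warps: int) -> Optional[int]:
    """Pick the next warp to issue under a simplified LRR or GTO rule.

    ready       -- ids of warps that can issue this cycle
    last_issued -- id of the warp issued in the previous cycle
    age_order   -- warp ids sorted from oldest to youngest
    """
    if policy == "LRR":
        # Loose round robin: first ready warp after the last issued one.
        for step in range(1, num_warps + 1):
            wid = (last_issued + step) % num_warps
            if wid in ready:
                return wid
        return None
    # Greedy then oldest: stay on the same warp while it is ready,
    # otherwise fall back to the oldest ready warp.
    if last_issued in ready:
        return last_issued
    for wid in age_order:
        if wid in ready:
            return wid
    return None


policy_low = select_policy(mshr_in_use=2, mshr_total=32)
policy_high = select_policy(mshr_in_use=30, mshr_total=32)
```

With few MSHRs in use the sketch rotates across warps (more parallelism); with most MSHRs in use it concentrates issue on one warp at a time (less parallelism), matching the direction of the trade-off the abstract describes.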

Scheduling of loop with carried dependence using thread (쓰레드를 이용한 루프 캐리 종속성을 가진 루프의 스케쥴링)

김현철;이종국;유기영
- Proceedings of the Korean Information Science Society Conference, 2000.04a, pp.627-629, 2000
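We adapt four techniques for allocating loops to a shared-memory multiprocessor for parallel execution so that they can be applied to loops with loop-carried dependences, and then compare and analyze their performance. The implementation was done in a Java thread environment. We also propose the CDSS (Carried-Dependence Self-Scheduling) allocation technique for the efficient execution of loops in which dependences arise between iterations. Simulations that varied the dependence distance, the number of threads, and the number of iterations show that the proposed CDSS maintains good load balance and is efficient, reducing loop execution time compared with the other techniques.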

Latency Hiding based Warp Scheduling Policy for High Performance GPUs

Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
- Journal of the Korea Society of Computer and Information, v.24 no.4, pp.1-9, 2019
The LRR (Loose Round Robin) warp scheduling policy for GPU architectures results in high warp-level parallelism and balanced loads across multiple warps. However, the traditional LRR policy makes multiple warps execute long-latency operations at the same time. When no more warps can be issued during these long-latency operations, GPU throughput may degrade significantly. In this paper, we propose a new warp scheduling policy that exploits latency hiding, leading to better utilization of memory resources in high-performance GPUs. The proposed warp scheduler prioritizes memory instructions based on the GTO (Greedy Then Oldest) policy in order to reduce memory stalls. When no warp can execute a memory instruction any more, the warp scheduler selects a warp for a computation instruction in a round-robin manner. Furthermore, our proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, the proposed technique improves GPU performance by 12.7% and 5.6% over LRR and GTO on average, respectively.
https://doi.org/10.9708/jksci.2019.24.04.001
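
A rough sketch of the two-level selection described in the abstract above follows. It is an assumption-laden illustration rather than the authors' design: the `Warp` record, the `pick_warp` function, and the convention that warp ids run from 0 to n-1 are hypothetical, and the abstract's use of information about recently committed warps is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Warp:
    wid: int              # warp id, assumed to be 0..n-1
    arrival: int          # smaller value means an older warp
    ready: bool           # can issue this cycle
    next_is_memory: bool  # next instruction is a memory instruction


def pick_warp(warps: List[Warp], last_mem: Optional[int],
              last_compute: int) -> Optional[int]:
    """Prefer memory instructions (greedy, then oldest); otherwise
    issue a computation instruction in round-robin order."""
    mem = [w for w in warps if w.ready and w.next_is_memory]
    if mem:
        # Greedy: keep issuing from the warp that issued memory last.
        for w in mem:
            if w.wid == last_mem:
                return w.wid
        # Then oldest: fall back to the oldest ready memory warp.
        return min(mem, key=lambda w: w.arrival).wid
    # No memory instruction can issue: round robin over ready warps.
    ready_ids = {w.wid for w in warps if w.ready}
    n = len(warps)
    for step in range(1, n + 1):
        wid = (last_compute + step) % n
        if wid in ready_ids:
            return wid
    return None


warps = [
    Warp(wid=0, arrival=5, ready=True, next_is_memory=False),
    Warp(wid=1, arrival=2, ready=True, next_is_memory=True),
    Warp(wid=2, arrival=1, ready=True, next_is_memory=True),
    Warp(wid=3, arrival=0, ready=False, next_is_memory=True),
]
chosen = pick_warp(warps, last_mem=None, last_compute=0)
```

In this example warp 2 is chosen: it is the oldest ready warp whose next instruction is a memory instruction, so its long-latency access starts early and can overlap with later computation.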

Scheduling Scheme for Compound Nodes of Hierarchical Task Graph using Thread (스레드를 이용한 계층적 태스크 그래프(HTG)의 복합 노드 스케쥴링 기법)

Kim, Hyun-Chul;Kim, Hyo-Cheol
- Journal of KIISE:Computer Systems and Theory, v.29 no.8, pp.445-455, 2002
In this paper, we present a new task scheduling scheme for the efficient execution of the tasks of compound nodes of a hierarchical task graph (HTG) on a shared-memory system. The proposed scheme for exploiting functional parallelism is autoscheduling, in which each processor performs the scheduling itself without any dedicated global scheduler. To adapt the proposed scheduling scheme to various platforms, including uni-processor systems, Java threads were used for the implementation, and the performance was analyzed in comparison with a conventional bit vector method. The experimental results showed that the proposed method was more efficient in execution time and exhibited good load balancing for the experimental parameter values used. Furthermore, the memory size could be reduced when using the proposed algorithm compared with the conventional scheme.

Non-Preemptive Fixed Priority Scheduling for Design of Real-Time Embedded Systems (실시간 내장형 시스템의 설계를 위할 비선점형 고정우선순위 스케줄링)

Park, Moon-Ju
- Journal of KIISE:Computing Practices and Letters, v.15 no.2, pp.89-97, 2009
Embedded systems widely used in ubiquitous environments usually employ an event-driven programming model instead of a thread-based programming model in order to create a more robust system that uses less memory. However, as the software for embedded systems becomes more complex, it becomes hard to program it as a single event handler using the event-driven programming model. This paper discusses the implementation of non-preemptive real-time scheduling theory for the design of embedded systems. To this end, we present an efficient schedulability test method for a given non-preemptive task set using a sufficient condition. This paper also shows that the notion of sub-tasks in embedded systems can overcome the problem of low utilization, which is a main drawback of non-preemptive scheduling.

Real-Time Characteristics Analysis and Improvement for OPRoS Component Scheduler on Windows NT Operating System (Windows NT상에서의 OPRoS 컴포넌트 스케줄러의 실시간성 분석 및 개선)

Lee, Dong-Su;Ahn, Hee-June
- Journal of Institute of Control, Robotics and Systems, v.17 no.1, pp.38-46, 2011
The OPRoS (Open Platform for Robotic Service) framework provides a uniform operating environment for service robots. Because an OPRoS-based service robot has to support real-time as well as non-real-time applications, using an operating system based on the Windows NT kernel can be restrictive. On the other hand, building service robots on Windows NT brings various benefits, such as rich library and device support and an abundant developer pool. This paper presents a user-mode component scheduler for OPRoS that can provide near-real-time scheduling service on Windows NT, based on the restricted real-time features of the Windows NT kernel. The component scheduler thread, which has the highest real-time priority in the Windows NT system, acquires control of the CPU. The component scheduler then suspends and resumes each periodic component executor based on its priority and precedence dependency, so that the component executors are scheduled in a preemptive manner. We present an experimental analysis of the performance limitations of the proposed scheduling technique. The analysis and experimental results show that the proposed scheduler guarantees highly reliable timing down to a resolution of 10 ms.
https://doi.org/10.5302/J.ICROS.2011.17.1.38

Analysis of GPU Performance and Memory Efficiency according to Task Processing Units (작업 처리 단위 변화에 따른 GPU 성능과 메모리 접근 시간의 관계 분석)

Son, Dong Oh;Sim, Gyu Yeon;Kim, Cheol Hong
- Smart Media Journal, v.4 no.4, pp.56-63, 2015
Modern GPUs can execute massively parallel computation by exploiting many GPU cores. The GPGPU architecture, one approach to exploiting the outstanding computational resources of the GPU, effectively executes general-purpose applications as well as graphics applications. In this paper, we investigate how memory efficiency and performance vary with the number of CTAs (Cooperative Thread Arrays) on an SM (Streaming Multiprocessor), since analyzing the relation between the number of CTAs on an SM and these metrics can guide researchers who study how to improve GPU performance. Our simulation results show that, for most benchmarks, increasing the number of CTAs on an SM improves performance. On the other hand, some benchmarks show no performance improvement. This is because the number of CTAs generated from the same kernel is small or the number of CTAs executed simultaneously is not sufficient. To classify the performance analysis by the number of CTAs on an SM more precisely, we also analyze the relations between performance and memory stalls, DRAM stalls due to interconnect congestion, and pipeline stalls at the memory stage. We expect that our analysis results will help studies that aim to improve parallelism and memory efficiency on the GPGPU architecture.
