Search | Korea Science

Analysis on the GPU Performance according to Hierarchical Memory Organization (계층적 메모리 구성에 따른 GPU 성능 분석)

Choi, Hongjun;Kim, Jongmyon;Kim, Cheolhong
- The Journal of the Korea Contents Association / v.14 no.3 / pp.22-32 / 2014
Recently, GPGPUs have been widely used for general-purpose processing as well as graphics processing, since they provide hardware optimized for parallel processing. The memory system has a large effect on the performance of parallel processing units such as GPUs. GPUs implement a hierarchical memory architecture to achieve high memory bandwidth. Moreover, both memory address coalescing and memory request merging techniques are widely used. This paper analyzes GPU performance under various memory organizations. According to our simulation results, GPU performance improves by 15.5%, 21.5%, 25.5%, and 30.9% when an 8KB, 16KB, 32KB, and 64KB L1 cache is added, respectively, compared to the case without an L1 cache. However, the experimental results show that performance decreases for some benchmarks because the number of memory transactions increases due to data dependency. Moreover, when cache misses occur frequently, the average memory access latency increases as the cache hierarchy becomes deeper.
https://doi.org/10.5392/JKCA.2014.14.03.022

Analysis on Memory Characteristics of Graphics Processing Units for Designing Memory System of General-Purpose Computing on Graphics Processing Units (범용 그래픽 처리 장치의 메모리 설계를 위한 그래픽 처리 장치의 메모리 특성 분석)

Choi, Hongjun;Kim, Cheolhong
- Smart Media Journal / v.3 no.1 / pp.33-38 / 2014
Even though microprocessor performance improves continuously, improving the performance of computing systems is becoming harder owing to drawbacks such as increased power consumption. To solve this problem, general-purpose computing on graphics processing units (GPGPU), which executes general-purpose applications on graphics processing units (GPUs), specialized parallel-processing devices, has attracted attention. However, the characteristics of graphics applications are substantially different from those of general-purpose applications. Therefore, GPUs cannot fully exploit their outstanding computational resources when executing general-purpose applications, due to various constraints. When designing GPUs for GPGPU, the memory system is important for exploiting the GPU effectively, since general-purpose applications typically require more memory accesses than graphics applications. In particular, external memory accesses, which have long latency, impose a large overhead on GPU performance. Therefore, GPU performance is expected to improve if a hierarchical memory architecture that reduces the number of external memory accesses is applied. For this reason, we analyze GPU performance under various hierarchical cache architectures when executing various benchmarks.

An Optimum Paged Interleaving Memory by a Hierarchical Bit Line (계층 비트라이에 의한 최적 페이지 인터리빙 메모리)

조경연;이주근
- Journal of the Korean Institute of Telematics and Electronics / v.27 no.6 / pp.901-909 / 1990
With the wide spread of 32-bit personal computers, memory systems with a simple structure and high performance are in high demand. In this paper, a memory block is constructed using a modified hierarchical bit line in which the DRAM bit line and a latch that works as an SRAM cell are integrated through an interface gate. A new memory architecture, DSRAM (Dynamic Static RAM), is proposed by interleaving 16 such memory blocks. Because the DSRAM works with 16 pages, the page miss ratio becomes small and the RAS precharge time incurred by a page miss is shortened. The DSRAM can therefore implement optimum page interleaving, and it has good compatibility with existing DRAMs. The DSRAM can be widely used in small computers as well as in high-performance memory systems.

A New VLSI Architecture of a Hierarchical Motion Estimator for Low Bit-rate Video Coding (저전송률 동영상 압축을 위한 새로운 계층적 움직임 추정기의 VLSI 구조)

이재헌;나종범
- Proceedings of the IEEK Conference / 1999.06a / pp.601-604 / 1999
We propose a new hierarchical motion estimator architecture that supports the advanced prediction mode of recent low bit-rate video coders such as H.263 and MPEG-4. In the proposed VLSI architecture, a basic searching unit (BSU) is shared by all hierarchical levels to make a systematic and small-sized motion estimator. Since the memory bank of the proposed architecture provides a scheduled data flow for calculating the 8$\times$8 block-based sum of absolute differences (SAD), both a macroblock-based motion vector (MV) and four block-based MVs are obtained simultaneously for each macroblock in the advanced prediction mode. The proposed motion estimator achieves coding performance similar to that of the full search block matching algorithm (FSBMA) while remaining small and supporting the advanced prediction mode.

A Cache Controller to Maximize Effectiveness of Hierarchical Memory Architecture (계층적 메모리 구조의 효과를 극대화하는 캐시 제어기)

Uh Bong Yong;Ju Young Kwan;Cheon Joong Nam;Kim Suk Il
- Journal of KIISE:Computer Systems and Theory / v.32 no.11_12 / pp.608-616 / 2005
This paper proposes a cache architecture that invokes prefetching on a level 1 cache miss. Existing structures prefetch only on a level 2 cache miss. In the proposed cache architecture, on a level 1 cache miss, the demand-fetch block and the prefetch block are selected from the level 2 cache and stored in the level 1 cache and the prefetch cache, respectively. According to an experimental analysis using 11 benchmark programs, the hierarchical cache architecture that employs both a level 1 cache prefetcher and a level 2 cache prefetcher improved performance by up to $19\%$ compared to the cache architecture that employs only a level 2 cache prefetcher.

An Efficient Data Distribution Method on a Distributed Shared Memory Machine (분산공유 메모리 시스템 상에서의 효율적인 자료분산 방법)

Min, Ok-Gee
- The Transactions of the Korea Information Processing Society / v.3 no.6 / pp.1433-1442 / 1996
Data distribution in the SPMD (Single Program Multiple Data) pattern is one of the main features of HPF (High Performance Fortran). This paper describes design issues for such data distribution and an efficient execution model for it on the TICOM IV computer, named SPAX (Scalable Parallel Architecture computer based on X-bar network). SPAX has a hierarchical clustering structure that uses distributed shared memory (DSM). In such a memory structure, uniformly applying either SMDD (Shared Memory Data Distribution) or DMDD (Distributed Memory Data Distribution) cannot fully utilize the system. Here we propose another data distribution model, called DSMDD (Distributed Shared Memory Data Distribution), which is based on a hierarchical masters-slaves scheme. In this model, a remote master and slaves are designated in each node; a shared address scheme is used within a node and a message passing scheme between nodes. In our simulation, assuming a node size at which system performance degradation is minimized, DSMDD is more effective than SMDD and DMDD. In particular, the larger the number of logical processors and the lower the data dependency between distributed data, the better the performance obtained.

Design of the new parallel processing architecture for commercial applications (상용 응용을 위한 병렬처리 구조 설계)

한우종;윤석한;임기욱
- Journal of the Korean Institute of Telematics and Electronics B / v.33B no.5 / pp.41-51 / 1996
In this paper, a new parallel processing system based on a cluster architecture is proposed, which provides the scalability of a parallel processing system while maintaining the characteristics of a shared-memory multiprocessor. Recently, low-cost, high-performance microprocessors have led to the construction of large-scale parallel processing systems. Such systems provide high scalability but are mainly used for scientific applications with large data parallelism. A shared-memory multiprocessor system such as TICOM is currently used as a server for commercial applications; however, shared-memory multiprocessor systems are known to have very limited scalability. The proposed architecture supports the scalability and performance of a parallel processing system while providing adaptability for commercial applications, and hence it can overcome the limitations of the shared-memory multiprocessor. The architecture and characteristics of the proposed system are described. A proprietary hierarchical crossbar network is designed for this system, with its protocol, routing and switching technique, and signal transfer technique optimized for the proposed architecture. The design trade-offs for the network are described, and simulation using SES/workbench shows that the network fits the proposed architecture.

Design and Performance Analysis of High Performance Processor-Memory Integrated Architectures (고성능 프로세서-메모리 혼합 구조의 설계 및 성능 분석)

Kim, Young-Sik;Kim, Shin-Dug;Han, Tack-Don
- The Transactions of the Korea Information Processing Society / v.5 no.10 / pp.2686-2703 / 1998
The widening performance gap between processor and memory has led to the emergence of a promising architecture, processor-memory (P-M) integration. In this paper, various design issues for P-M integration are studied. First, an analytical model of the DRAM access time is constructed considering both the bank conflict ratio and the DRAM page hit ratio. Then, the proposed model is used to identify the points of performance improvement and the performance bottlenecks in designing on-chip DRAM architectures. This paper proposes a new architecture, called the delayed precharge bank architecture, which improves memory system performance by increasing the DRAM page hit ratio. This paper also adapts an efficient bank interleaving mechanism to the proposed architecture. Execution-driven simulation verifies that this architecture performs better than both the hierarchical multi-bank architecture and the conventional bank architecture. Eight SPEC95 benchmarks are used for the simulation, varying the parameters of the cache architecture, the number of DRAM banks, and the delayed time quantum.

A Technique for Improving the Performance of Cache Memories

Cho, Doosan
- International Journal of Internet, Broadcasting and Communication / v.13 no.3 / pp.104-108 / 2021
To improve performance in IoT and edge computing systems, memory is usually organized in a hierarchical structure. Access speed decreases with distance from the CPU, in the order of registers, cache memory, main memory, and storage. Energy consumption likewise increases as the distance from the CPU increases. Therefore, it is important to develop a technique that places frequently used data in the upper levels of memory as much as possible to improve performance and energy consumption. However, such a technique must solve the problem of cache performance degradation caused by the lack of spatial locality that occurs when the data access stride is large. This study proposes a technique that selectively places data with a large access stride in a software-controlled cache. With the proposed technique, spatial locality can be improved by reducing the data access interval, and consequently cache performance can be improved.
https://doi.org/10.7236/IJIBC.2021.13.3.104
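
The stride effect mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's technique, only a toy model of why a large access stride hurts spatial locality. The 64-byte cache line, the idealized never-evicting cache, and the function names are all assumptions made for illustration.

```python
def count_line_misses(num_accesses, stride_bytes, line_bytes=64):
    """Count cold misses for accesses at byte addresses 0, stride, 2*stride, ...

    Assumption: an idealized cache that never evicts a line, so every
    miss is the first touch of a cache line.
    """
    touched_lines = set()
    misses = 0
    for i in range(num_accesses):
        line = (i * stride_bytes) // line_bytes  # cache line holding this address
        if line not in touched_lines:
            touched_lines.add(line)
            misses += 1
    return misses


def miss_ratio(num_accesses, stride_bytes, line_bytes=64):
    """Fraction of accesses that miss under the same idealized model."""
    return count_line_misses(num_accesses, stride_bytes, line_bytes) / num_accesses


if __name__ == "__main__":
    # Compare a small stride (4-byte elements read back to back)
    # with strides that skip most or all of each cache line.
    for stride in (4, 16, 64):
        print(stride, miss_ratio(1024, stride))
```

In this model, once the stride reaches the line size every access touches a new line, so none of the data fetched alongside a miss is reused. That is the loss of spatial locality the abstract refers to.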

Hierarchical Binary Search Tree (HBST) for Packet Classification (패킷 분류를 위한 계층 이진 검색 트리)

Chu, Ha-Neul;Lim, Hye-Sook
- The Journal of Korean Institute of Communications and Information Sciences / v.32 no.3B / pp.143-152 / 2007
To provide new value-added services such as policy-based routing and quality of service in next-generation networks, Internet routers need to classify packets into flows for different treatment; this is called packet classification. Since packet classification must be performed at wire speed for every packet arriving at several hundred gigabits per second, it becomes a bottleneck in Internet routers. Therefore, high-speed packet classification algorithms are required. In this paper, we propose an efficient packet classification architecture based on a hierarchical binary search tree. The proposed architecture hierarchically connects binary search trees that have no empty nodes, and hence it reduces the memory requirement and improves search performance.

