• Title/Summary/Keyword: Large-memory data processing

Performance Analysis of Interconnection Network for Multiprocessor Systems (다중프로세서 시스템을 위한 상호결합 네트워크의 성능 분석)

  • 김원섭;오재철
    • The Transactions of the Korean Institute of Electrical Engineers / v.37 no.9 / pp.663-670 / 1988
  • Advances in VLSI technology have made it possible to include a larger number of processing elements in a highly parallel processor system. A system with a large number of processing elements and memory requires a complex data path. Multistage interconnection networks (MINs) are useful for providing a programmable data path between processing elements and memory modules in a multiprocessor system. In this thesis, the performance of MINs for the star network is analyzed and compared with that of other networks, such as the generalized shuffle network, the delta network, and the referenced crossbar network.

Large-scale 3D fast Fourier transform computation on a GPU

  • Jaehong Lee;Duksu Kim
    • ETRI Journal / v.45 no.6 / pp.1035-1045 / 2023
  • We propose a novel graphics processing unit (GPU) algorithm that can handle a large-scale 3D fast Fourier transform (i.e., 3D-FFT) problem whose data size is larger than the GPU's memory. A 1D FFT-based 3D-FFT computational approach is used to solve the limited device memory issue. Moreover, to reduce the communication overhead between the CPU and GPU, we propose a 3D data-transposition method that converts the target 1D vector into a contiguous memory layout and improves data transfer efficiency. The transposed data are communicated between the host and device memories efficiently through the pinned buffer and multiple streams. We apply our method to various large-scale benchmarks and compare its performance with the state-of-the-art multicore CPU FFT library (i.e., fastest Fourier transform in the West [FFTW]) and a prior GPU-based 3D-FFT algorithm. Our method achieves a higher performance (up to 2.89 times) than FFTW, and the performance gap widens as the data size increases. The performance of the prior GPU algorithm decreases considerably in massive-scale problems, whereas our method's performance is stable.
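
The idea of building a 3D transform out of 1D transforms, mentioned in the abstract above, can be illustrated with a minimal sketch. This is not the authors' GPU algorithm: it runs on the CPU, uses a naive O(n^2) DFT in place of a real FFT routine, and the function names `dft_1d`, `dft_3d_by_axes`, and `dft_3d_direct` are hypothetical. It only shows that transforming every line along z, then y, then x gives the same result as the direct 3D definition, and that the y and x passes must gather strided lines into contiguous buffers, which is the access pattern a transposition step is meant to make cheap.

```python
import cmath


def dft_1d(values):
    """Naive O(n^2) 1D DFT; a stand-in for a real FFT routine."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dft_3d_by_axes(volume):
    """3D DFT of volume[x][y][z] computed as three passes of 1D transforms."""
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
    out = [[[complex(v) for v in row] for row in plane] for plane in volume]

    # Pass 1: lines along z are already contiguous (innermost lists).
    for x in range(nx):
        for y in range(ny):
            out[x][y] = dft_1d(out[x][y])

    # Pass 2: lines along y are strided; gather, transform, scatter back.
    for x in range(nx):
        for z in range(nz):
            line = dft_1d([out[x][y][z] for y in range(ny)])
            for y in range(ny):
                out[x][y][z] = line[y]

    # Pass 3: lines along x are strided as well.
    for y in range(ny):
        for z in range(nz):
            line = dft_1d([out[x][y][z] for x in range(nx)])
            for x in range(nx):
                out[x][y][z] = line[x]
    return out


def dft_3d_direct(volume):
    """Reference 3D DFT straight from the definition (for checking only)."""
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])

    def term(kx, ky, kz):
        return sum(
            volume[x][y][z]
            * cmath.exp(-2j * cmath.pi * (kx * x / nx + ky * y / ny + kz * z / nz))
            for x in range(nx)
            for y in range(ny)
            for z in range(nz)
        )

    return [
        [[term(kx, ky, kz) for kz in range(nz)] for ky in range(ny)]
        for kx in range(nx)
    ]
```

Because each pass touches only one line at a time, the volume can in principle be processed in pieces that fit a smaller memory, which is the property an out-of-core method relies on.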

Efficient Data Management for Finite Element Analysis with Pre-Post Processing of Large Structures (전-후 처리 과정을 포함한 거대 구조물의 유한요소 해석을 위한 효율적 데이터 구조)

  • 박시형;박진우;윤태호;김승조
    • Proceedings of the Computational Structural Engineering Institute Conference / 2004.04a / pp.389-395 / 2004
  • We consider the interface between a parallel distributed-memory multifrontal solver and the finite element method (FEM). We describe in detail the requirements and the data structure of the parallel FEM interface, which includes the element data and the node array. The full procedure for solving a large-scale structural problem is assumed to include pre- and post-processors, whose algorithms are not considered in this paper. The main advantage of the parallel FEM interface appears when a distributed-memory system with a large number of processors is used to solve a very large-scale problem. Memory efficiency and performance are examined by analyzing several examples on the Pegasus cluster system.

Parallel Multithreaded Processing for Data Set Summarization on Multicore CPUs

  • Ordonez, Carlos;Navas, Mario;Garcia-Alvarado, Carlos
    • Journal of Computing Science and Engineering / v.5 no.2 / pp.111-120 / 2011
  • Data mining algorithms should exploit new hardware technologies to accelerate computations. Such a goal is difficult to achieve in a database management system (DBMS) because of its complex internal subsystems and because data mining numeric computations on large data sets are difficult to optimize. This paper explores taking advantage of the existing multithreaded capabilities of multicore CPUs, as well as caching in RAM, to efficiently compute summaries of a large data set, a fundamental data mining problem. We introduce parallel algorithms working on multiple threads, which overcome the row aggregation processing bottleneck of accessing secondary storage while maintaining linear time complexity with respect to data set size. Our proposal is based on a combination of table scans and parallel multithreaded processing among multiple cores in the CPU. We introduce several database-style and hardware-level optimizations: caching row blocks of the input table, managing available RAM, interleaving I/O and CPU processing, and tuning the number of working threads. We experimentally benchmark our algorithms with large data sets on a DBMS running on a computer with a multicore CPU. We show that our algorithms outperform existing DBMS mechanisms in computing aggregations of multidimensional data summaries, especially as dimensionality grows. Furthermore, we show that local memory allocation (RAM block size) does not have a significant impact when the thread management algorithm distributes the workload among a fixed number of threads. Our proposal is unique in the sense that we do not modify or require access to the DBMS source code; instead, we extend the DBMS with analytic functionality by developing User-Defined Functions.
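
As a rough illustration of block-wise, multithreaded summarization in the spirit of the abstract above, the sketch below splits rows into blocks, summarizes each block on a worker thread, and merges the partial results. It is not the authors' UDF implementation. The choice of summaries (row count `n`, per-dimension sum `L`, per-dimension sum of squares `Q`) is an assumption, and the names `summarize_block`, `merge_summaries`, `summarize`, `block_size`, and `workers` are hypothetical. In CPython, threads mainly help when block loading is I/O-bound; the point here is the partition-and-merge structure, which keeps the total work linear in the number of rows.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_block(rows):
    """Return (n, L, Q) for one non-empty block of equal-length numeric rows."""
    dims = len(rows[0])
    linear = [0.0] * dims      # L: per-dimension sum
    quadratic = [0.0] * dims   # Q: per-dimension sum of squares (assumed form)
    for row in rows:
        for j, value in enumerate(row):
            linear[j] += value
            quadratic[j] += value * value
    return len(rows), linear, quadratic


def merge_summaries(a, b):
    """Combine two partial summaries; every component is additive."""
    return (
        a[0] + b[0],
        [x + y for x, y in zip(a[1], b[1])],
        [x + y for x, y in zip(a[2], b[2])],
    )


def summarize(rows, block_size=1024, workers=4):
    """Summarize a non-empty list of rows block by block on a thread pool."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize_block, blocks))
    total = partials[0]
    for partial in partials[1:]:
        total = merge_summaries(total, partial)
    return total
```

From `n`, `L`, and `Q`, the mean and variance of each dimension follow directly as `L[j] / n` and `Q[j] / n - (L[j] / n) ** 2`, so the input only needs to be scanned once.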

Algorithmic GPGPU Memory Optimization

  • Jang, Byunghyun;Choi, Minsu;Kim, Kyung Ki
    • JSTS:Journal of Semiconductor Technology and Science / v.14 no.4 / pp.391-406 / 2014
  • The performance of General-Purpose computation on Graphics Processing Units (GPGPU) is heavily dependent on memory access behavior. This sensitivity is due to a combination of the underlying Massively Parallel Processing (MPP) execution model present on GPUs and the lack of architectural support to handle irregular memory access patterns. Application performance can be significantly improved by applying memory-access-pattern-aware optimizations that exploit knowledge of the characteristics of each access pattern. In this paper, we present an algorithmic methodology to semi-automatically find the best mapping of the memory accesses present in serial loop nests to underlying data-parallel architectures, based on a comprehensive static memory access pattern analysis. To that end, we present a simple yet powerful mathematical model that captures all memory access pattern information present in serial data-parallel loop nests. We then show how this model is used in practice to select the most appropriate memory space for data and to search for an appropriate thread mapping and work group size from a large design space. To evaluate the effectiveness of our methodology, we report execution speedups using selected benchmark kernels that cover a wide range of memory access patterns commonly found in GPGPU workloads. Our experimental results are reported using the industry-standard heterogeneous programming language, OpenCL, targeting the NVIDIA GT200 architecture.

A Walsh-Based Distributed Associative Memory with Genetic Algorithm Maximization of Storage Capacity for Face Recognition

  • Kim, Kyung-A;Oh, Se-Young
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2003.09a / pp.640-643 / 2003
  • A Walsh function based associative memory is capable of storing m patterns in a single pattern storage space with Walsh encoding of each pattern. Furthermore, each stored pattern can be matched against the stored patterns extremely fast using algorithmic parallel processing. As such, this special type of memory is ideal for real-time processing of large-scale information. However, this efficiency generates a large amount of crosstalk between stored patterns, which causes misrecognition. This crosstalk is a function of the set of different sequencies [number of zero crossings] of the Walsh function associated with each pattern to be stored. This sequency set is thus optimized in this paper to minimize misrecognition as well as to maximize memory saving. In this paper, this Walsh memory is applied to the problem of face recognition, where PCA is used for dimensionality reduction. The maximum Walsh spectral component and a genetic algorithm (GA) are applied to determine the optimal Walsh function set to be associated with the data to be stored. The experimental results indicate that the proposed methods provide a novel and robust technology for error-free, real-time, and memory-saving recognition of large-scale patterns.

BIM Geometry Cache Structure for Data Streaming with Large Volume (대용량 BIM 형상 데이터 스트리밍을 위한 캐쉬 구조)

  • Kang, Tae-Wook
    • Journal of the Korea Academia-Industrial cooperation Society / v.18 no.9 / pp.1-8 / 2017
  • The purpose of this study is to propose a cache structure for processing large-volume building information modeling (BIM) geometry data where it is difficult to allocate physical memory. As the number of BIM orders has increased in the public sector, it is becoming more common to visualize and calculate large-volume BIM geometry data. Design and review collaboration can require a lot of time to download large-volume BIM data through the network. If the BIM data exceeds the physical free-memory limit, visualization and geometry computation are not possible. In order to utilize large amounts of BIM data with insufficient physical memory or a low-bandwidth network, it is advantageous to cache only the data necessary for BIM geometry rendering and calculation. This study proposes a cache structure for efficiently rendering and calculating large-volume BIM geometry data where it is difficult to allocate enough physical memory.

Technology Trends in CXL Memory and Utilization Software (CXL 메모리 및 활용 소프트웨어 기술 동향)

  • H.Y. Ahn;S.Y. Kim;Y.M. Park;W.J. Han
    • Electronics and Telecommunications Trends / v.39 no.1 / pp.62-73 / 2024
  • Artificial intelligence relies on data-driven analysis, and the data processing performance strongly depends on factors such as memory capacity, bandwidth, and latency. Fast and large-capacity memory can be achieved by composing numerous high-performance memory units connected via high-performance interconnects, such as Compute Express Link (CXL). CXL is designed to enable efficient communication between central processing units, memory, accelerators, storage, and other computing resources. By adopting CXL, a composable computing architecture can be implemented, enabling flexible server resource configuration using a pool of computing resources. Thus, manufacturers are actively developing hardware and software solutions to support CXL. We present a survey of the latest software for CXL memory utilization and the most recent CXL memory emulation software. The former supports efficient use of CXL memory, and the latter offers a development environment that allows developers to optimize their software for the hardware architecture before commercial release of CXL memory devices. Furthermore, we review key technologies for improving the performance of both the CXL memory pool and CXL-based composable computing architecture along with various use cases.

Analysis of Performance Requirement for Large-Scale InfiniBand-based DVSM System (대용량의 InfiniBand 기반 DVSM 시스템 구현을 위한 성능 요구 분석)

  • Cho, Myeong-Jin;Kim, Seon-Wook
    • The KIPS Transactions:PartA / v.14A no.4 / pp.215-226 / 2007
  • Over the past years, many distributed virtual shared-memory (DVSM) systems have been studied in order to develop a low-cost shared-memory system with a fast interconnection network. However, a DVSM system needs a lot of data and control communication between distributed processing nodes to provide memory consistency in software, and this communication overhead dominates the overall performance. In general, the communication overhead also increases as the number of processing nodes increases, so it is a very important performance factor in developing a large-scale DVSM system. In this paper, we study performance scalability quantitatively and qualitatively for developing a large-scale DVSM system based on the next-generation interconnection network InfiniBand. Based on the study, we analyze the performance requirements of upcoming interconnection networks to be used for developing a performance-scalable DVSM system in the future.

Compound Backup Technique using Hot-Cold Data Classification in the Distributed Memory System (분산메모리시스템에서의 핫콜드 데이터 분류를 이용한 복합 백업 기법)

  • Kim, Woo Chur;Min, Dong Hee;Hong, Ji Man
    • Smart Media Journal / v.4 no.3 / pp.16-23 / 2015
  • As IT advances, data processing systems are required to handle large amounts of data. However, existing on-disk systems are limited in their ability to process rapidly growing data. For that reason, in-memory systems, which store and manage data in fast memory rather than on hard disk, are being used. Although an in-memory system offers fast processing, it needs fault tolerance techniques because memory is volatile and data can be lost. These fault tolerance techniques degrade the performance of the in-memory system. In this paper, we classify data into hot and cold data according to the data usage characteristics of the in-memory system and propose a compound backup technique to ensure data persistence. The proposed technique increases persistence and reduces performance degradation.