• Title/Summary/Keyword: In-Memory Computing

Search Result 762, Processing Time 0.023 seconds

A Novel Memory Hierarchy for Flash Memory Based Storage Systems

  • Yim, Keno-Soo
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.5 no.4
    • /
    • pp.262-269
    • /
    • 2005
  • Semiconductor scientists and engineers ideally desire the faster but the cheaper non-volatile memory devices. In practice, no single device satisfies this desire because a faster device is expensive and a cheaper is slow. Therefore, in this paper, we use heterogeneous non-volatile memories and construct an efficient hierarchy for them. First, a small RAM device (e.g., MRAM, FRAM, and PRAM) is used as a write buffer of flash memory devices. Since the buffer is faster and does not have an erase operation, write can be done quickly in the buffer, making the write latency short. Also, if a write is requested to a data stored in the buffer, the write is directly processed in the buffer, reducing one write operation to flash storages. Second, we use many types of flash memories (e.g., SLC and MLC flash memories) in order to reduce the overall storage cost. Specifically, write requests are classified into two types, hot and cold, where hot data is vulnerable to be modified in the near future. Only hot data is stored in the faster SLC flash, while the cold is kept in slower MLC flash or NOR flash. The evaluation results show that the proposed hierarchy is effective at improving the access time of flash memory storages in a cost-effective manner thanks to the locality in memory accesses.

Efficient Labeling Scheme for Query Processing over XML Fragment Stream in Wireless Computing (무선 환경에서 XML 조각 스트림 질의 처리를 위한 효율적인 레이블링 기법)

  • Ko, Hye-Kyeong
    • The KIPS Transactions:PartD
    • /
    • v.17D no.5
    • /
    • pp.353-358
    • /
    • 2010
  • Unlike the traditional databases, queries on XML streams are restricted to a real time processing and memory usage. In this paper, a robust labeling scheme is proposed, which quickly identifies structural relationship between XML fragments. The proposed labeling scheme provides an effective query processing by removing many redundant operations and minimizing the number of fragments being processed. In experimental results, the proposed labeling scheme efficiently processes query processing and optimizes memory usage.

Auto Regulated Data Provisioning Scheme with Adaptive Buffer Resilience Control on Federated Clouds

  • Kim, Byungsang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.10 no.11
    • /
    • pp.5271-5289
    • /
    • 2016
  • On large-scale data analysis platforms deployed on cloud infrastructures over the Internet, the instability of the data transfer time and the dynamics of the processing rate require a more sophisticated data distribution scheme which maximizes parallel efficiency by achieving the balanced load among participated computing elements and by eliminating the idle time of each computing element. In particular, under the constraints that have the real-time and limited data buffer (in-memory storage) are given, it needs more controllable mechanism to prevent both the overflow and the underflow of the finite buffer. In this paper, we propose an auto regulated data provisioning model based on receiver-driven data pull model. On this model, we provide a synchronized data replenishment mechanism that implicitly avoids the data buffer overflow as well as explicitly regulates the data buffer underflow by adequately adjusting the buffer resilience. To estimate the optimal size of buffer resilience, we exploits an adaptive buffer resilience control scheme that minimizes both data buffer space and idle time of the processing elements based on directly measured sample path analysis. The simulation results show that the proposed scheme provides allowable approximation compared to the numerical results. Also, it is suitably efficient to apply for such a dynamic environment that cannot postulate the stochastic characteristic for the data transfer time, the data processing rate, or even an environment where the fluctuation of the both is presented.

New execution model for CAPE using multiple threads on multicore clusters

  • Do, Xuan Huyen;Ha, Viet Hai;Tran, Van Long;Renault, Eric
    • ETRI Journal
    • /
    • v.43 no.5
    • /
    • pp.825-834
    • /
    • 2021
  • Based on its simplicity and user-friendly characteristics, OpenMP has become the standard model for programming on shared-memory architectures. Checkpointing-aided parallel execution (CAPE) is an approach that utilizes the discontinuous incremental checkpointing technique (DICKPT) to translate and execute OpenMP programs on distributed-memory architectures automatically. Currently, CAPE implements the OpenMP execution model by utilizing the DICKPT to distribute parallel jobs and their data to slave machines, and then collects the results after executing these distributed jobs. Although this model has been proven to be effective in terms of performance and compatibility with OpenMP on distributed-memory systems, it cannot fully exploit the capabilities of multicore processors. This paper presents a novel execution model for CAPE that utilizes two levels of parallelism. In the proposed model, we add another level of parallelism in the form of multithreaded processes on slave machines with the goal of better exploiting their multicore CPUs. Initial experimental results presented near the end of this paper demonstrate that this model provides significantly enhanced CAPE performance.

Using the On-Package Memory of Manycore Processor for Improving Performance of MPI Intra-Node Communication (MPI 노드 내 통신 성능 향상을 위한 매니코어 프로세서의 온-패키지 메모리 활용)

  • Cho, Joong-Yeon;Jin, Hyun-Wook;Nam, Dukyun
    • Journal of KIISE
    • /
    • v.44 no.2
    • /
    • pp.124-131
    • /
    • 2017
  • The emerging next-generation manycore processors for high-performance computing are equipped with a high-bandwidth on-package memory along with the traditional host memory. The Multi-Channel DRAM (MCDRAM), for example, is the on-package memory of the Intel Xeon Phi Knights Landing (KNL) processor, and theoretically provides a four-times-higher bandwidth than the conventional DDR4 memory. In this paper, we suggest a mechanism to exploit MCDRAM for improving the performance of MPI intra-node communication. The experiment results show that the MPI intra-node communication performance can be improved by up to 272 % compared with the case where the DDR4 is utilized. Moreover, we analyze not only the performance impact of different MCDRAM-utilization mechanisms, but also that of core affinity for processes.

Quantitative comparison and analysis of next generation mobile memory technologies (차세대 모바일 메모리 기술의 정량적 비교 및 분석)

  • Yoon, Changho;Moon, Byungin;Kong, Joonho
    • The Journal of Korean Institute of Next Generation Computing
    • /
    • v.13 no.4
    • /
    • pp.40-51
    • /
    • 2017
  • Recently, as mobile workloads are becoming more data-intensive, high data bandwidth is required for mobile memory which also consumes non-negligible system energy. A variety of researches and technologies are under development to improve and optimize mobile memory technologies. However, a comprehensive study on the latest mobile memory technologies (LPDDR or Wide I/O) has not been extensively performed yet. To construct high-performance and energy-efficient mobile memory systems, quantitative and detailed analysis of these technologies is crucial. In this paper, we simulate the computer system which adopts mobile DRAM technologies (Wide I/O and LPDDR3). Based on our detailed and comprehensive results, we analyze important factors that affect performance and energy-efficiency of mobile DRAM technologies and show which part can be improved to construct better systems.

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

  • Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.7
    • /
    • pp.775-787
    • /
    • 2010
  • Unlike general-purpose CPUs, the GPUs have been specialized as many-core streaming processors, and are frequently replacing the CPUs in an increasing range of computations thanks to their outstanding parallel computing capacity. In order to respond to such trend, NVIDIA has recently issued a new parallel computing architecture called CUDA(Compute Unified Device Architecture), offering a flexible GPU programming environment for GPGPU(General Purpose GPU) computing. In general, when programmers use the CUDA API, they should clearly understand many aspects of GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through a lot of experiment and trial and error, and review how those techniques affect the performance of code execution. In particular, we use a specific problem as an example to analyze several elements that affect performances, such as effective accesses to hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that may be utilized effectively in CUDA-based parallel programming.

The Sequential GHT for the Efficient Pattern Recognition (효율적 패턴 인식을 위한 순차적 GHT)

  • 김수환;임승민;이규태;이태원
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.28B no.5
    • /
    • pp.327-334
    • /
    • 1991
  • This paper proposes an efficient method of implementing the generalized Hough transform (GHT), which has been hindered by an excessive computing load and a large memory requirement. The conventional algorithm requires a parameter space of 4 dimensions in detection a rotated, scaled, and translated object in an input image. Prior to the application of GHT to the input image, the proposed method determines the angle of rotation and the scaling factor of the test image using the proportion of the edge components between the reference image and test image. With the rotation angle and the scaling factor already determined, the parameter spaceis to be reduced to a simple array of 2 dimensions by applying the unit GHT only one time. The experiments with the image of airplanes reveal that both of the computing time and the requires memory size are reduced by 95 percent, without any degradatationof accuracy, compared with the conventional GHT algorithm.

  • PDF

Development and Implementation of Monitoring System for Management of Virtual Resource Based on Cloud Computing (클라우드 컴퓨팅 기반 가상 자원 관리를 위한 모니터링 시스템 설계 및 구현)

  • Cho, Dae-Kyun;Park, Seok-Cheon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.18 no.2
    • /
    • pp.41-47
    • /
    • 2013
  • In this paper, for this open system-based virtual resource monitoring system was designed. Virtual resources, CPU, memory, disk, network, each subdivided into parts, each modular implementation. Implementation results in real time CPU, memory, disk, network information, confirmed the results of monitoring. System designed to implement the Windows, Linux, Xen was used for the operating system, implementation language, C++ was used, the structure of the system, such as the ability to upgrade and add scalability and modularity by taking into account the features available in cloud computing environments applicable to cloud computing, virtual resource monitoring system has been implemented.