Integrated Search | Korea Science

An Asynchronous Algorithm for Balancing Unpredictable Workload on Distributed-Memory Machines

Chung, Yong-Hwa;Park, Jin-Won;Yoon, Suk-Han
- ETRI Journal / Vol. 20, No. 4 / pp. 346-360 / 1998
It is challenging to parallelize problems with irregular computation and communication. In this paper, we propose an asynchronous algorithm for balancing unpredictable workload on distributed-memory machines. By using an initial workload estimate, we first partition the computations such that the workload is distributed evenly across the processors. In addition, we perform task migrations dynamically for adapting to the evolving workload. To demonstrate the usefulness of our load balancing strategy, we conducted experiments on an IBM SP2 and a Cray T3D. Experimental results show that our task migration strategy can balance unpredictable workload with little overhead. Our code using C and MPI is portable onto other distributed-memory machines.
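The initial partitioning step mentioned in the abstract above can be illustrated with a generic greedy heuristic (longest-processing-time-first). This is only a sketch under assumptions: the abstract does not say which partitioning method the authors use, the paper's code is in C and MPI rather than Python, and the function name `partition_by_estimate` and the sample estimates are hypothetical.

```python
import heapq

def partition_by_estimate(estimates, num_procs):
    """Greedy LPT partitioning: NOT the paper's algorithm, just an illustration.

    estimates[i] is the estimated cost of task i (hypothetical input).
    Returns (assignment, loads), both keyed by processor id.
    """
    # Min-heap of (current load, processor id); the lightest processor is on top.
    heap = [(0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    # Place the heaviest tasks first so the small ones can even out the remainder.
    for task, cost in sorted(enumerate(estimates), key=lambda tc: -tc[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(task)
        heapq.heappush(heap, (load + cost, p))
    loads = {p: sum(estimates[t] for t in tasks) for p, tasks in assignment.items()}
    return assignment, loads

if __name__ == "__main__":
    assignment, loads = partition_by_estimate([7, 5, 4, 3, 3, 2], num_procs=2)
    print(assignment, loads)
```

A static partition like this only covers the first phase; the dynamic task migration that the abstract describes for adapting to evolving workload is not modeled here.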

Distributed memory access architecture and control for fully disaggregated datacenter network

Kyeong-Eun Han;Ji Wook Youn;Jongtae Song;Dae-Ub Kim;Joon Ki Lee
- ETRI Journal / Vol. 44, No. 6 / pp. 1020-1033 / 2022
In this paper, we propose a novel disaggregated memory module (dMM) architecture and memory access control schemes to solve the collision and contention problems of memory disaggregation, reducing the average memory access time to less than 1 μs. In the schemes, the distributed scheduler in each dMM determines the order of memory read/write access based on delay-sensitive priority requests in the disaggregated memory access frame (dMAF). We used the memory-intensive first (MIF) algorithm and the priority-based MIF (p-MIF) algorithm, which prioritize delay-sensitive and/or memory-intensive (MI) traffic over CPU-intensive (CI) traffic. We evaluated the performance of the proposed schemes through simulation using OPNET and through hardware implementation. Our results showed that when the offered load was below 0.7 and the payload of the dMAF was 256 bytes, the average round trip time (RTT) was the lowest, ~0.676 μs. The dMM scheduling algorithms, MIF and p-MIF, achieved a delay of less than 1 μs for all MI traffic with less than 10% transmission overhead.
https://doi.org/10.4218/etrij.2021-0335

Distributed In-Memory Caching Method for ML Workload in Kubernetes

윤동현;송석일
- Journal of Platform Technology / Vol. 11, No. 4 / pp. 71-79 / 2023
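In this paper, we analyze the characteristics of machine learning workloads and, based on this analysis, propose a distributed in-memory caching technique to improve their performance. The core of a machine learning workload is model training, which is a computation-intensive task. Running machine learning workloads in a Kubernetes-based cloud environment where the computing framework is separated from storage allows resources to be allocated effectively, but I/O must go over the network, which can introduce latency. The proposed distributed in-memory caching technique targets machine learning workloads running in such an environment. In particular, taking into account Kubeflow, a Kubernetes-based machine learning pipeline management tool, the proposed method preloads the data required by a machine learning workload into the distributed in-memory cache.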

Reducing False Sharing based on Memory Reference Patterns in Distributed Shared Memory Systems

조성제
- 한국정보처리학회논문지 / Vol. 7, No. 4 / pp. 1082-1091 / 2000
In distributed shared memory systems, false sharing occurs when two different data items, each accessed by a different processor and not actually shared, are allocated to the same block; it is an important factor in degrading system performance. The paper first analyzes shared memory allocation and reference patterns in parallel applications that allocate memory for shared data objects using a dynamic memory allocator. The shared objects are allocated sequentially and generally show different reference patterns. If objects of the same size are requested successively as many times as there are processors, each object is referenced by only one particular processor. If objects of the same size are requested successively many more times than there are processors, two or more successive objects are referenced by only particular processors. On the basis of these analyses, we propose a memory allocation scheme that allocates objects requested by different processors to different pages, and we evaluate existing memory allocation techniques for reducing false sharing faults. For some applications, our allocation scheme eliminates a considerable number of false sharing faults at the cost of a little additional memory space.
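The idea of placing objects requested by different processors on different pages can be sketched with a toy address allocator. This is an illustration under assumptions, not the paper's scheme: the 4096-byte page size, the class name `PerProcessorPageAllocator`, and the rule "one private run of pages per requesting processor" are all hypothetical, and the sketch only computes addresses rather than managing real memory.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)

class PerProcessorPageAllocator:
    """Toy allocator: objects requested by different processors never share a page."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.next_free_page = 0
        self.cursor = {}  # processor id -> (current page index, bytes used in it)

    def alloc(self, proc, size):
        """Return a byte address for an object of `size` bytes requested by `proc`."""
        if size > self.page_size:
            raise ValueError("this sketch only handles objects that fit in one page")
        page, used = self.cursor.get(proc, (None, 0))
        # Open a fresh page when the processor has none or the current one is full.
        if page is None or used + size > self.page_size:
            page, used = self.next_free_page, 0
            self.next_free_page += 1
        self.cursor[proc] = (page, used + size)
        return page * self.page_size + used

if __name__ == "__main__":
    allocator = PerProcessorPageAllocator()
    for proc in (0, 1, 0):
        addr = allocator.alloc(proc, 64)
        print(proc, addr, addr // PAGE_SIZE)
```

The trade-off matches the one the abstract names: partially filled pages cost some extra memory in exchange for fewer false sharing faults.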

Analysis of Performance Requirement for Large-Scale InfiniBand-based DVSM System

조명진;김선욱
- 정보처리학회논문지A / Vol. 14A, No. 4 / pp. 215-226 / 2007
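Over the past several years, the architecture of Distributed Virtual Shared Memory (DVSM) systems built on fast interconnection networks has been actively studied as a way to develop low-cost shared memory systems. However, because DVSM maintains memory consistency in software, it requires a large amount of data and control-signal communication between distributed processing nodes, and this communication overhead is the factor that determines overall performance improvement. Since communication overhead generally grows with the number of processing nodes, it is a critical performance factor in implementing a large-scale DVSM. In this paper, we study, quantitatively and qualitatively, the performance scalability of a large-scale DVSM system based on InfiniBand, one of the next-generation interconnection technologies. Based on this study, we also analyze the performance required of a next-generation interconnection network to develop a DVSM system with high performance scalability.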
https://doi.org/10.3745/KIPSTA.2007.14-A.4.215

A Data Transfer Method of the Sub-Cluster Group based on the Distributed and Shared Memory

이기준
- 정보처리학회논문지A / Vol. 10A, No. 6 / pp. 635-642 / 2003
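Recent rapid advances in network technology have laid the foundation for building fast, low-cost cluster systems. Such existing cluster systems are typically composed of machines of a certain performance level connected by a stable, high-speed local area network. The multiple distributed web cluster group proposed in this paper is a web cluster model that targets low-cost, low-speed system nodes on an open network and achieves high performance, high efficiency, and high availability through parallel execution of a given job, efficient job distribution via the shared memory of the SC-Server, and cooperation among system nodes. To this end, the multiple distributed web cluster group is organized into sub-cluster groups, each of which binds several system nodes into a single virtual network, and uses distributed shared memory for efficient data transfer within a sub-cluster group. Because the proposed model applies load distribution and parallel computing based on distributed shared memory to large jobs requested by users, it can improve processing efficiency.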
https://doi.org/10.3745/KIPSTA.2003.10A.6.635

Dynamic Limited Directory Scheme for Distributed Shared Memory Systems

이동광;권혁성;최성민;안병철
- 한국정보처리학회논문지 / Vol. 6, No. 4 / pp. 1098-1105 / 1999
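In distributed shared memory (DSM) systems, caches can improve performance by reducing memory access latency and communication load, but the cache coherence problem must be solved. This paper proposes a new directory protocol that solves the cache coherence problem and improves performance in DSM systems. To maintain cache coherence, processors within a certain distance are tracked with a bit vector, as in the full directory scheme, which reduces communication overhead. For processors beyond that distance, pointers are stored in a directory pool. Using the bit vector together with the directory pool prevents unnecessary cache invalidations and thus improves system performance. The proposed scheme reduces traffic by up to 66% compared with the limited directory scheme and reduces the number of directory accesses by up to 27% compared with the dynamically allocated directory scheme.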

Improving Performance of Large Sparse Linear System Solvers On Distributed Memory Systems By Asynchronous Algorithms

박필성;신순철
- 정보처리학회논문지A / Vol. 8A, No. 4 / pp. 439-446 / 2001
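Most current parallel algorithms are synchronous, so processor synchronization and load balancing are essential for correct computation. When load balancing is impossible, or when processors differ in performance as in a heterogeneous cluster, overall computation speed is determined by the slowest processor. Asynchronous iterative methods have attracted attention as one way to address this problem, but research to date has used shared memory systems, on which they are relatively easy to implement. In this paper, to solve very large linear systems in a distributed memory environment, we propose an asynchronous parallel algorithm that improves overall performance by minimizing the idle time of the faster processors, and we implement it on a cluster.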
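The general idea of an asynchronous iterative method can be illustrated with a single-threaded simulation of block Jacobi in which a slow worker updates its rows less often and the fast worker never waits for it. This is a sketch under assumptions, not the paper's algorithm: the function name `async_jacobi`, the `periods` parameter that models worker speed, and the small test matrix are hypothetical, and no real message passing takes place. The matrix is chosen to be strictly diagonally dominant, a standard sufficient condition for this kind of iteration to converge.

```python
def async_jacobi(A, b, blocks, periods, sweeps):
    """Simulate asynchronous block Jacobi for A x = b (illustration only).

    blocks[k]  : row indices owned by worker k
    periods[k] : worker k finishes an update only every periods[k] sweeps
    """
    n = len(b)
    x = [0.0] * n  # shared iterate; some entries may be stale when read
    for step in range(sweeps):
        for block, period in zip(blocks, periods):
            if step % period != 0:
                continue  # this worker is still busy; the others do not wait
            snapshot = x[:]  # the worker uses whatever values are available now
            for i in block:
                s = sum(A[i][j] * snapshot[j] for j in range(n) if j != i)
                x[i] = (b[i] - s) / A[i][i]
    return x

if __name__ == "__main__":
    A = [[4, 1, 0, 0],
         [1, 4, 1, 0],
         [0, 1, 4, 1],
         [0, 0, 1, 4]]
    b = [6, 12, 18, 19]  # chosen so that the exact solution is [1, 2, 3, 4]
    # Worker 0 updates every sweep; worker 1 is three times slower.
    print(async_jacobi(A, b, blocks=[[0, 1], [2, 3]], periods=[1, 3], sweeps=200))
```

In a synchronous version the fast worker would idle for two of every three sweeps; here it keeps iterating on stale values from the slow worker, which is the idle-time reduction the abstract refers to.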

Data Replication and Migration Scheme for Load Balancing in Distributed Memory Environments

최기태;윤상원;박재열;임종태;복경수;유재수
- 정보과학회 컴퓨팅의 실제 논문지 / Vol. 22, No. 1 / pp. 44-49 / 2016
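With the recent growth of social media and the increasing use of digital devices, data volumes are growing exponentially. Distributed memory processing systems are used to process such large volumes of data efficiently. In a distributed environment, however, node performance degrades when load concentrates on a particular node. This paper proposes a load balancing scheme that distributes node load appropriately in a distributed memory environment. To manage node load, the proposed scheme replicates hot data across multiple nodes and migrates data in consideration of node load when nodes are added or removed. Clients keep the metadata of hot data and access nodes directly, which reduces accesses to the central server. A performance evaluation shows that the proposed load balancing scheme outperforms existing cache management schemes.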
https://doi.org/10.5626/KTCP.2016.22.1.44

Parallel Computing Environment for R with on Supercomputer Systems

이상열;원중호
- 한국경영과학회지 / Vol. 39, No. 4 / pp. 19-31 / 2014
We study parallel processing techniques for the R programming language on high-performance computing systems. In this study, we used a massively parallel computing system with 25,408 CPU cores. We evaluated the performance of a distributed memory system using MPI and of a shared memory system using OpenMP. Our findings are summarized as follows. First, for some particular algorithms, parallel processing is about 150 times faster than serial processing in R. Second, the distributed memory system gets faster as the number of nodes increases, while the shared memory system is limited in performance improvement by the number of CPUs in a single system.
https://doi.org/10.7737/JKORMS.2014.39.4.019

Search results: 397 items; processing time: 0.026 s
