• Title/Summary/Keyword: Distributed memory


The Design of MPI Hardware Unit for Enhanced Broadcast Communication (효율적인 브로드캐스트 통신을 지원하는 MPI 하드웨어 유닛 설계)

  • Yun, Hee-Jun; Chung, Won-Young; Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences, v.36 no.11B, pp.1329-1338, 2011
  • This paper proposes an algorithm and a hardware architecture for broadcast communication, which is the worst bottleneck in multiprocessors that use distributed memory architectures. In conventional systems, collective communication is converted into point-to-point communications by the MPI library without considering whether the communication port of each processing node is busy or free. If conflicting point-to-point communications occur during a broadcast, the broadcast transmission speed decreases. This paper therefore proposes an algorithm that determines the order of the point-to-point communications making up a broadcast according to the state of each processing node: messages are transmitted preferentially to processing nodes whose communication ports are free, which reduces the total broadcast time. The proposed MPI unit for broadcast communication was evaluated by modeling it in SystemC and improved broadcast performance by up to 78% with 16 nodes, showing that the proposed algorithm is useful for improving the overall performance of an MPSoC. A minimal sketch of the ordering idea is shown below.
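
As a rough illustration of the ordering idea described in the abstract, the Python sketch below decomposes a broadcast into point-to-point sends and serves destinations whose communication port is free before busy ones. The function name and the `is_port_free` predicate are illustrative assumptions, not the paper's interface.

```python
# Hypothetical sketch: destinations with a free communication port are
# served first; busy destinations are deferred and retried.
from collections import deque

def order_broadcast_sends(destinations, is_port_free):
    """Return the point-to-point send order: free nodes first, busy nodes deferred."""
    pending = deque(destinations)
    order = []
    while pending:
        node = pending.popleft()
        if is_port_free(node):
            order.append(node)          # port is free: send immediately
        else:
            pending.append(node)        # port is busy: retry later
            # In hardware the unit would re-check port states every cycle;
            # here we stop once only busy nodes remain so the example terminates.
            if all(not is_port_free(n) for n in pending):
                order.extend(pending)
                break
    return order

# Example: nodes 1 and 3 are busy, so 0, 2, 4 are served first.
busy = {1, 3}
print(order_broadcast_sends([0, 1, 2, 3, 4], lambda n: n not in busy))
```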

A Design of Pipeline Chain Algorithm Based on Circuit Switching for MPI Broadcast Communication System (MPI 브로드캐스트 통신을 위한 서킷 스위칭 기반의 파이프라인 체인 알고리즘 설계)

  • Yun, Heejun; Chung, Wonyoung; Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences, v.37B no.9, pp.795-805, 2012
  • This paper proposes an algorithm and a hardware architecture for broadcast communication, which is the worst bottleneck in multiprocessors that use distributed memory architectures. The conventional pipelined broadcast algorithm exploits the maximum bandwidth of the communication bus, but because it sends the data divided into many parts, the synchronization step is repeated unnecessarily. In this paper, an MPI unit implementing a pipeline chain algorithm based on circuit switching, which removes the redundant synchronization steps, was designed, and the proposed architecture was evaluated by modeling it in SystemC. The proposed architecture improves broadcast performance by up to 3.3 times over systems using the conventional pipelined broadcast algorithm and makes almost full use of the maximum bandwidth of the transmission bus. It was then implemented in Verilog HDL, synthesized with a TSMC 0.18 µm library, and fabricated as a chip. The synthesized design occupies 4,700 gates (2-input NAND equivalent), 2.4% of the total area. The proposed architecture therefore improves the overall performance of an MPSoC while occupying a relatively small area. A sketch contrasting the two synchronization patterns follows below.
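
The sketch below contrasts, under an invented cost model, the per-chunk handshakes of a conventional pipelined broadcast with the one-time setup of a circuit-switched pipeline chain. It only illustrates why removing the repeated synchronization helps and is not the paper's hardware design.

```python
# Hypothetical comparison of synchronization counts: per-chunk handshakes
# (conventional pipelined broadcast) versus a single chain setup
# (circuit-switched pipeline chain).  Names and cost model are illustrative.

def pipelined_broadcast_syncs(num_nodes, num_chunks):
    # Each chunk forwarded along the chain triggers one handshake per hop.
    hops = num_nodes - 1
    return num_chunks * hops

def pipeline_chain_syncs(num_nodes, num_chunks):
    # The circuit-switched chain is set up once; chunks then stream
    # through without further handshakes.
    hops = num_nodes - 1
    return hops  # one setup handshake per hop

if __name__ == "__main__":
    nodes, chunks = 16, 64
    print("per-chunk sync handshakes:", pipelined_broadcast_syncs(nodes, chunks))
    print("one-time setup handshakes:", pipeline_chain_syncs(nodes, chunks))
```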

Modified TDS (Task Duplicated based Scheduling) Scheme Optimizing Task Execution Time (태스크 실행 시간을 최적화한 개선된 태스크 중복 스케줄 기법)

  • Jang, Sei-Ie; Kim, Sung-Chun
    • Journal of KIISE:Computer Systems and Theory, v.27 no.6, pp.549-557, 2000
  • A distributed memory machine (DMM) is necessary for effective computation on data that are complex and very large. Task scheduling reduces the communication time among tasks in order to reduce the total execution time of an application program, and it is very important for improving DMM performance. The task duplication based scheduling (TDS) method improves execution time by reducing the communication time of tasks; it uses a clustering method that schedules tasks with large communication times on the same processor. However, TDS cannot fully optimize the communication time between a task sending data and a task receiving data. This paper proposes a new method that solves this problem. The modified task duplication based scheduling (MTDS) method approximately optimizes the communication time between sending and receiving tasks by checking an optimality condition, and thereby minimizes task execution time by reducing inter-task communication. System modeling shows that the task execution time of MTDS is about 70% faster than that of TDS in the best case and the same as TDS in the worst case, which demonstrates that MTDS is better than TDS. A small sketch of the duplication idea is given after this entry.
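
A hedged sketch of the task-duplication idea behind TDS-style scheduling: for each task, the parent contributing the largest arrival delay is duplicated onto the task's processor, so that edge's communication cost disappears. The DAG, cost model, and function below are invented for illustration and do not reproduce MTDS's optimality check.

```python
# Toy sketch of clustering by task duplication.  comm[(p, c)] is the edge
# communication cost, compute[t] is the task's computation cost.

def cluster_with_duplication(tasks, comm, compute):
    """Return, for each task, the parent worth duplicating onto its processor."""
    favourite = {}
    for t in tasks:
        parents = [p for (p, c) in comm if c == t]
        if not parents:
            favourite[t] = None
            continue
        # Choose the parent whose data would arrive last; duplicating it on
        # t's processor removes comm[(p, t)] from the critical path.
        favourite[t] = max(parents, key=lambda p: compute[p] + comm[(p, t)])
    return favourite

# Tiny example DAG: T1 -> T3 and T2 -> T3.
compute = {"T1": 4, "T2": 2, "T3": 5}
comm = {("T1", "T3"): 6, ("T2", "T3"): 1}
print(cluster_with_duplication(["T1", "T2", "T3"], comm, compute))
# {'T1': None, 'T2': None, 'T3': 'T1'}  -> T1 is duplicated onto T3's processor
```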

The Design of Hardware MPI Units for MPSoC (MPSoC를 위한 저비용 하드웨어 MPI 유닛 설계)

  • Jeong, Ha-Young; Chung, Won-Young; Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences, v.36 no.1B, pp.86-92, 2011
  • In this paper, we propose a novel hardware MPI (Message Passing Interface) unit that supports message passing in multiprocessor systems using a distributed memory architecture. The MPI hardware unit handles data synchronization, transmission, and completion, and it supports non-blocking operation of the processors, which reduces synchronization overhead. In addition, the unit combines a ready entry, a request entry, and a reserve entry, which store and manage the messages being synchronized, and it performs multiple outstanding issue and out-of-order completion. According to BFM (Bus Functional Model) simulation results, performance increases by 25% for many-to-many communication. The MPI unit was designed in HDL, synthesized with the Synopsys Design Compiler using a MagnaChip 0.18 µm library, and fabricated as a prototype chip. The proposed message-passing hardware delivers high performance relative to its increase in area. Considering its low cost and scalability, the MPI hardware unit is useful for increasing the overall performance of an embedded MPSoC (Multi-Processor System-on-Chip). A simplified sketch of the send/receive matching performed by such queues is shown below.
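
As a loose illustration of the bookkeeping such queues perform, the sketch below matches posted sends against posted receives using two software queues. It is a generic MPI-style matching example, not the paper's ready/request/reserve hardware semantics.

```python
# Hedged sketch of generic MPI-style message matching: unmatched incoming
# sends wait in one queue, receives posted early wait in another.
from collections import deque

class MatchUnit:
    def __init__(self):
        self.pending_sends = deque()   # sends that arrived before a matching recv
        self.pending_recvs = deque()   # receives posted before a matching send

    def post_send(self, src, tag, payload):
        for i, (r_src, r_tag, deliver) in enumerate(self.pending_recvs):
            if r_src == src and r_tag == tag:
                del self.pending_recvs[i]
                deliver(payload)                       # complete the receive
                return
        self.pending_sends.append((src, tag, payload)) # no match yet: enqueue

    def post_recv(self, src, tag, deliver):
        for i, (s, t, payload) in enumerate(self.pending_sends):
            if s == src and t == tag:
                del self.pending_sends[i]
                deliver(payload)
                return
        self.pending_recvs.append((src, tag, deliver))

unit = MatchUnit()
unit.post_send(src=1, tag=7, payload=b"hello")
unit.post_recv(src=1, tag=7, deliver=lambda p: print("received:", p))
```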

Design 5Q MPI Hardware Unit Supporting Standard Mode (표준 모드를 지원하는 5Q MPI 하드웨어 유닛 설계)

  • Park, Jae-Won; Chung, Won-Young; Lee, Seung-Woo; Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences, v.37 no.1B, pp.59-66, 2012
  • The use of MPSoCs has been increasing because of the growing use of mobile devices and complex applications, and the number of processors per MPSoC has been increasing to improve performance. Standard MPI is used to send data efficiently in the distributed memory architectures that suit multiprocessors. In this paper, we propose a scalable distributed memory system with a low-cost hardware message passing interface (MPI). The proposed architecture improves the transfer rate by using buffered send for small packets. Three queues, the Ready Queue, Request Queue, and Reservation Queue, work as in the previous architecture, and two queues, the Small Ready Queue and Small Request Queue, are added to send small packets. When the threshold is set to 8 bytes, the proposed architecture achieves more than twice the performance for data below the threshold. A sketch of this threshold-based path selection is shown below.
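
A minimal sketch of the threshold-based path selection: payloads at or below a small-packet threshold take the buffered "small" path, while larger payloads take the regular path. The 8-byte threshold comes from the abstract; the queue names reused here are only labels, and the logic is illustrative.

```python
# Illustrative path selection between the small-packet buffered-send path
# and the regular path, keyed on payload size.

SMALL_PACKET_THRESHOLD = 8  # bytes, per the abstract

def select_send_path(payload: bytes) -> str:
    if len(payload) <= SMALL_PACKET_THRESHOLD:
        return "small_ready_queue"   # buffered send: copy and return immediately
    return "ready_queue"             # regular path for larger messages

for msg in (b"ack", b"x" * 64):
    print(len(msg), "bytes ->", select_send_path(msg))
```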

Scalable Ontology Reasoning Using GPU Cluster Approach (GPU 클러스터 기반 대용량 온톨로지 추론)

  • Hong, JinYung; Jeon, MyungJoong; Park, YoungTack
    • Journal of KIISE, v.43 no.1, pp.61-70, 2016
  • In recent years, techniques for large-scale ontology inference have been needed in order to infer new knowledge from existing knowledge at high speed and to support a variety of semantic services. With recent advances in distributed computing, ontology inference engines have mostly been developed on large clusters using the Hadoop or Spark frameworks. Parallel programming with GPGPU, which offers many more cores than a CPU, is also used for ontology inference. In this paper, by combining the advantages of both techniques, we propose a new method that reasons over large RDFS ontology data using the Spark in-memory framework and infers over the distributed data at high speed using GPGPU. With GPGPU, ontology reasoning over high-volume data can be performed at lower cost and with higher efficiency than conventional inference methods. In addition, we show that distributing the data through the Spark cluster reduces the workload on each GPGPU node. To evaluate our approach, we used LUBM benchmark datasets ranging from LUBM 10 to LUBM 120. Our experimental results show that the proposed reasoning engine performs 7 times faster than a conventional approach that uses a Spark in-memory inference engine. A toy example of the kind of RDFS rule being applied is given below.
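
As a toy example of the kind of RDFS entailment involved (not the paper's Spark/GPGPU engine), the sketch below applies rule rdfs9, which propagates instance types up the rdfs:subClassOf hierarchy, as a simple in-memory fixed-point join.

```python
# RDFS rule rdfs9: if (C rdfs:subClassOf D) and (x rdf:type C), then (x rdf:type D).
# In the paper this kind of rule is evaluated over triple partitions on a Spark
# cluster with the join offloaded to GPGPU; here it is a plain in-memory join.

subclass_of = {("GraduateStudent", "Student"), ("Student", "Person")}
type_of = {("alice", "GraduateStudent")}

def rdfs9_closure(type_of, subclass_of):
    inferred = set(type_of)
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for (x, c) in list(inferred):
            for (c2, d) in subclass_of:
                if c == c2 and (x, d) not in inferred:
                    inferred.add((x, d))
                    changed = True
    return inferred

print(sorted(rdfs9_closure(type_of, subclass_of)))
# [('alice', 'GraduateStudent'), ('alice', 'Person'), ('alice', 'Student')]
```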

Sparse Distributed Memory with Monotonic Decision Function (단조 결정 함수를 갖는 축약 분산 기억 장치)

  • Gwon, Hui-Yong; Jang, Jeong-U; Im, Seong-Jun; Jo, Dong-Seop; Hwang, Hui-Yung
    • The KIPS Transactions:PartB, v.8B no.1, pp.105-113, 2001
  • Sparse distributed memory (SDM) has recently been proposed as a practical neural network model because of its adaptive problem-solving ability and the ease with which it can be implemented in hardware. However, whereas each neuron of a multilayer perceptron bisects the solution space with a linear or nonlinear decision function, and general problem-solving ability arises from combining such neurons in various ways, an SDM neuron bisects the solution space into the inside and outside of a fixed-radius region centered on itself and simply sums these regions, so the model becomes inefficient when the solution space carries an order relation, as a real-valued space does. This paper identifies these characteristics of SDM and their cause, and proposes a modified SDM model that, when the solution space is split by a monotonically increasing or decreasing decision function, solves the given problem efficiently by adding a magnitude-comparison step to the conventional SDM. An example of applying the proposed model to call admission control in an ATM network is also presented. A minimal sketch of a conventional SDM read/write cycle is shown below.
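
For context on the model the paper modifies, the sketch below implements a conventional Kanerva-style sparse distributed memory read/write cycle with binary addresses and Hamming-radius activation; the proposed magnitude-comparison extension is not reproduced, and the sizes are toy values.

```python
# Conventional SDM: hard locations within a Hamming radius of the input
# address are activated; writes update their counters, reads sum them.
import random

N_BITS, N_LOCATIONS, RADIUS = 16, 200, 5
random.seed(0)

hard_addresses = [[random.randint(0, 1) for _ in range(N_BITS)]
                  for _ in range(N_LOCATIONS)]
counters = [[0] * N_BITS for _ in range(N_LOCATIONS)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def write(address, data):
    # Every hard location within the Hamming radius updates its counters.
    for loc, addr in enumerate(hard_addresses):
        if hamming(addr, address) <= RADIUS:
            for i, bit in enumerate(data):
                counters[loc][i] += 1 if bit else -1

def read(address):
    # Sum the counters of activated locations and threshold at zero.
    sums = [0] * N_BITS
    for loc, addr in enumerate(hard_addresses):
        if hamming(addr, address) <= RADIUS:
            for i in range(N_BITS):
                sums[i] += counters[loc][i]
    return [1 if s > 0 else 0 for s in sums]

pattern = [random.randint(0, 1) for _ in range(N_BITS)]
write(pattern, pattern)            # auto-associative store
print(read(pattern) == pattern)    # usually True if enough locations fire
```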

Forecasting of Iron Ore Prices using Machine Learning (머신러닝을 이용한 철광석 가격 예측에 대한 연구)

  • Lee, Woo Chang; Kim, Yang Sok; Kim, Jung Min; Lee, Choong Kwon
    • Journal of Korea Society of Industrial Information Systems, v.25 no.2, pp.57-72, 2020
  • The price of iron ore has continued to fluctuate with high demand and supply from many countries and companies, and forecasting the price of iron ore has therefore become important in this business environment. This study developed machine learning models that forecast the price of iron ore one month ahead. The forecasting models used a distributed lag model and deep learning models such as the MLP (multi-layer perceptron), RNN (recurrent neural network), and LSTM (long short-term memory). Comparing the individual models by error metrics, LSTM showed the lowest predictive error. Among the ensemble models, the ensemble of the distributed lag model and LSTM showed the lowest prediction error. A minimal sketch of a distributed lag regression is given below.
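
A minimal sketch of a distributed lag style regression, here simplified to ordinary least squares on a window of the series' own lagged values with synthetic prices; the paper's features, lag structure, deep models, and ensemble are not reproduced.

```python
# Fit next-month value as a linear function of the previous n_lags values.
import numpy as np

def fit_distributed_lag(series, n_lags):
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = np.array(series[n_lags:])
    X = np.hstack([X, np.ones((len(X), 1))])          # intercept column
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_next(series, coeffs, n_lags):
    x = np.append(series[-n_lags:], 1.0)
    return float(x @ coeffs)

prices = [88.0, 90.5, 92.0, 91.2, 94.8, 96.1, 95.0, 97.3, 99.0, 101.2]  # synthetic
coeffs = fit_distributed_lag(prices, n_lags=3)
print("one-step-ahead forecast:", round(predict_next(prices, coeffs, 3), 2))
```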

ABox Realization Reasoning in Distributed In-Memory System (분산 메모리 환경에서의 ABox 실체화 추론)

  • Lee, Wan-Gon; Park, Young-Tack
    • Journal of KIISE, v.42 no.7, pp.852-859, 2015
  • As the amount of knowledge information increases significantly, much progress has been made in studies of how to reason over large-scale ontologies effectively at the RDFS or OWL level. These reasoning methods are divided into TBox classification and ABox realization. TBox classification mainly deals with integrity and dependencies in the schema, whereas ABox realization mainly handles issues concerning instances; ABox realization is therefore very important in practical applications. In this paper, we propose a realization method that analyzes the constraints of a specified class so that the reasoning system automatically infers the classes to which instances belong. Unlike conventional methods that rely on an object-oriented-language based distributed file system, we propose a large-scale ontology reasoning method using Spark, a functional-programming based in-memory system. To verify the effectiveness of the proposed method, we used instances created from the W3C Wine ontology (120 to 600 million triples). In the largest experiment, the proposed system processed 600 million triples and generated 951 million triples in 51 minutes (696 K triples/sec). A toy illustration of realization over a value restriction is given below.
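
As a toy illustration of ABox realization (not the paper's Spark pipeline), the sketch below defines classes by a single property/value constraint and infers the classes each instance belongs to; the class and property names are loosely modeled on the Wine ontology and are assumptions.

```python
# Realization over simple value restrictions: an instance whose property
# values satisfy a class's constraint is inferred to belong to that class.

# Defined classes: name -> (required property, required value)
defined_classes = {
    "RedWine": ("hasColor", "Red"),
    "DryWine": ("hasSugar", "Dry"),
}

# ABox: instance -> {property: value}
abox = {
    "ChateauMargaux": {"hasColor": "Red", "hasSugar": "Dry"},
    "Chardonnay01":   {"hasColor": "White", "hasSugar": "Dry"},
}

def realize(abox, defined_classes):
    memberships = {}
    for instance, props in abox.items():
        memberships[instance] = {
            cls for cls, (prop, value) in defined_classes.items()
            if props.get(prop) == value
        }
    return memberships

print(realize(abox, defined_classes))
# ChateauMargaux is realized as both RedWine and DryWine; Chardonnay01 only as DryWine.
```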

Memory Efficient Parallel Ray Casting Algorithm for Unstructured Grid Volume Rendering on Multi-core CPUs (비정렬 격자 볼륨 렌더링을 위한 다중코어 CPU기반 메모리 효율적 광선 투사 병렬 알고리즘)

  • Kim, Duksu
    • Journal of KIISE, v.43 no.3, pp.304-313, 2016
  • We present a novel memory-efficient parallel ray casting algorithm for unstructured grid volume rendering on multi-core CPUs. Our method is based on the Bunyk ray casting algorithm. To solve the high memory overhead of the Bunyk algorithm, we allocate a fixed-size local buffer for each thread; the local buffers hold information about recently visited faces, and the stored information is reused by other rays or replaced by other faces' information. To improve the utilization of the local buffers, we propose an image-plane based ray grouping algorithm that makes the ray groups highly coherent. The ray groups are then distributed to computing threads, and each thread processes its groups independently. We also propose a hash function that uses the index of a face as the key for computing the buffer slot in which that face's information is stored. To assess the benefits of our method, we applied it to three unstructured grid datasets of different sizes and measured the performance. Our method requires just 6% of the memory space of the Bunyk algorithm for storing face information, while showing comparable performance despite using less memory. In addition, for a large-scale unstructured grid dataset it achieves up to 22% higher performance than the Bunyk algorithm with less memory. These results show the robustness and efficiency of our method and demonstrate that it is suitable for volume rendering of large-scale unstructured grid datasets. A minimal sketch of such a fixed-size, hash-indexed face cache is shown below.
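
A minimal sketch of the per-thread face cache idea: a fixed-size buffer indexed by a hash of the face index, where a newly visited face overwrites whatever occupies its slot. The slot count and the modulo hash are illustrative assumptions, not the paper's exact hash function.

```python
# Direct-mapped, fixed-size cache of recently visited faces, indexed by a
# hash of the face index.  A miss means the caller recomputes the face data.

class FaceCache:
    def __init__(self, n_slots=1024):
        self.n_slots = n_slots
        self.slots = [None] * n_slots          # each slot: (face_id, face_data)

    def _slot(self, face_id: int) -> int:
        return face_id % self.n_slots          # hash on the face index

    def lookup(self, face_id):
        entry = self.slots[self._slot(face_id)]
        if entry is not None and entry[0] == face_id:
            return entry[1]                    # hit: reuse stored face data
        return None                            # miss: caller recomputes and stores

    def store(self, face_id, face_data):
        self.slots[self._slot(face_id)] = (face_id, face_data)

cache = FaceCache(n_slots=8)
cache.store(42, {"plane": (0.0, 1.0, 0.0, -3.5)})
print(cache.lookup(42))   # hit: returns the stored face data
print(cache.lookup(50))   # miss: 50 maps to the same slot, which holds face 42
```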