• Title/Summary/Keyword: 분산 데이터 분석 (distributed data analysis)

Search Results: 1,177 (processing time: 0.032 seconds)

Dynamic Cluster Management of Hadoop Distributed Filesystem (하둡 분산 파일시스템의 동적 클러스터 관리 기법)

  • Ryu, Wooseok
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2016.10a
    • /
    • pp.435-437
    • /
    • 2016
  • The Hadoop Distributed File System (HDFS) is a file system for distributed processing of big data that replicates data across distributed data nodes. An HDFS cluster scales well up to thousands of nodes, but it assumes an exclusive cluster with numerous nodes dedicated to big data processing, so the various operational workstations used in offices are rarely considered as part of the cluster. This paper discusses this problem and proposes a dynamic cluster management technique to increase the storage capacity and analytic performance of a Hadoop cluster. The proposed technique can add legacy systems to the cluster and remove them dynamically depending on their availability (a minimal node-management sketch follows below).

  • PDF
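
The dynamic inclusion and exclusion of worker nodes described above can be approximated with standard HDFS administration mechanisms. The sketch below is only an illustration, not the paper's technique: the `dfs.hosts` include-file path and the `is_available` check are hypothetical, while `hdfs dfsadmin -refreshNodes` is the standard command that makes the NameNode reload its host lists.

```python
# Minimal sketch: dynamically add/remove worker nodes from an HDFS cluster.
# The include-file path and the availability check are hypothetical placeholders;
# `hdfs dfsadmin -refreshNodes` is the standard command that reloads the host lists.
import subprocess

INCLUDE_FILE = "/etc/hadoop/conf/dfs.hosts"   # hypothetical dfs.hosts location

def read_hosts(path):
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def write_hosts(path, hosts):
    with open(path, "w") as f:
        f.write("\n".join(sorted(hosts)) + "\n")

def refresh_cluster():
    # Tell the NameNode to re-read dfs.hosts / dfs.hosts.exclude.
    subprocess.run(["hdfs", "dfsadmin", "-refreshNodes"], check=True)

def update_membership(candidate_nodes, is_available):
    """Add available office machines to the cluster, drop unavailable ones."""
    hosts = read_hosts(INCLUDE_FILE)
    for node in candidate_nodes:
        if is_available(node):
            hosts.add(node)
        else:
            hosts.discard(node)
    write_hosts(INCLUDE_FILE, hosts)
    refresh_cluster()
```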

Performance Factor of Distributed Processing of Machine Learning using Spark (스파크를 이용한 머신러닝의 분산 처리 성능 요인)

  • Ryu, Woo-Seok
    • The Journal of the Korea Institute of Electronic Communication Sciences
    • /
    • v.16 no.1
    • /
    • pp.19-24
    • /
    • 2021
  • In this paper, we study the performance factors of machine learning in a distributed environment using Apache Spark and present an efficient distributed processing method through experiments. The work first identifies the performance factors of machine learning in a distributed cluster, classifying them into cluster performance, data size, and the configuration of the Spark engine. In addition, a performance study of regression analysis using Spark MLlib running on a Hadoop cluster is performed while changing the configuration of the nodes and the Spark executors. The experiments confirmed that the effective number of executors is affected by the number of data blocks, but that, depending on the cluster size, its maximum and minimum values are bounded by the number of cores and the number of worker nodes, respectively.
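
The executor-related factors discussed above can be exercised with a configuration along the lines of the PySpark sketch below. The HDFS path, column names, and resource numbers are illustrative assumptions; `spark.executor.instances`, `spark.executor.cores`, `VectorAssembler`, and `LinearRegression` are standard Spark / Spark MLlib APIs.

```python
# Minimal sketch: run a Spark MLlib regression while varying executor settings.
# The file path, column names, and resource numbers are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = (SparkSession.builder
         .appName("executor-factor-experiment")
         .config("spark.executor.instances", "4")   # varied across experiments
         .config("spark.executor.cores", "2")       # bounded by cores per worker
         .getOrCreate())

df = spark.read.csv("hdfs:///data/sample.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)
print(model.coefficients, model.intercept)

spark.stop()
```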

A Study On Korea Metadata Standard's Trend and XML Application In GIS (국내 메타데이터 표준 동향 및 XML 실응용사례 연구)

  • Kim, Myung-Gu;Cha, Jung-Sook;Park, Sun-Ho;Kim, Sung-Ryong
    • 한국공간정보시스템학회:학술대회논문집 (Korea Spatial Information System Society: Conference Proceedings)
    • /
    • 2002.03a
    • /
    • pp.29-37
    • /
    • 2002
  • Over the past several decades, the GIS (Geographic Information System) field has seen rapid advances in software together with the construction of vast amounts of spatial data. These spatial data are distributed, and the same regions and data are redundantly built in the specific formats of individual GIS applications, wasting not only cost but also the time and manpower needed for new construction. To prevent this, metadata for sharing and managing spatial data distributed across heterogeneous systems is urgently required; metadata standardization is being carried out through the international standards body ISO/TC 211, and in Korea it is led by the National Computerization Agency. In step with these international standardization trends, this study reviews the current state of domestic metadata standards and analyzes practical application cases that use the domestic standard DTD and XML built on them. The results can support the efficient construction of metadata for managing and sharing distributed spatial data in the future (a minimal XML-handling sketch follows below).

  • PDF
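
As a small illustration of handling XML-encoded spatial metadata of the kind discussed above, the sketch below parses a few ISO/TC 211-style elements with Python's standard library. The element names are assumed placeholders, not the actual Korean standard DTD.

```python
# Minimal sketch: read a few fields from an XML spatial-metadata record.
# The element names are hypothetical placeholders, not the actual
# Korean standard DTD / ISO/TC 211 schema.
import xml.etree.ElementTree as ET

sample = """
<metadata>
  <title>Sample spatial dataset</title>
  <organization>Sample GIS Center</organization>
  <extent>
    <west>126.0</west><east>130.0</east>
    <south>33.0</south><north>39.0</north>
  </extent>
</metadata>
"""

root = ET.fromstring(sample)
record = {
    "title": root.findtext("title"),
    "organization": root.findtext("organization"),
    "bbox": {side: float(root.findtext(f"extent/{side}"))
             for side in ("west", "east", "south", "north")},
}
print(record)
```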

The Compatibility Analysis on the Track Record Operation Center for the Regional Industry Upbringing (지역산업 육성을 위한 분산형 전원 트렉 레코드 실증 운영 센터 적합성 분석)

  • Kim, IL-Song
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology
    • /
    • v.9 no.4
    • /
    • pp.269-277
    • /
    • 2019
  • In this paper, a compatibility analysis of a track record operation center for fostering regional industry is presented. Renewable energy sources such as solar, wind, and battery energy storage systems have become widespread in recent years. The requirements on the efficiency, reliability, and power demand of these distributed systems are also increasing, which creates demand for a data track record center for the renewable energy industry in order to promote local industries. The size of the decentralized power system is analyzed based on the technical issues, and the necessities, requirements, and goals the center must meet are presented. A proposal for the construction and operation of a 1 MW operation center is presented as the result of the compatibility analysis, together with a list of the equipment required for the operation facilities to provide proper operational capability. The results of this research can serve as fundamental data for promoting regional renewable energy industries.

A Study of the Standardization Model for the National Knowledge and Information Resource (지식정보자원 표준화 모델 연구)

  • Lee, Chang-Yeol;Jeong, Eui-Suk
    • Journal of the Korean Society for Information Management
    • /
    • v.23 no.4 s.62
    • /
    • pp.165-177
    • /
    • 2006
  • The national knowledge and information resources of KADO (Korea Agency for Digital Opportunity and Promotion) were distributed across several data centers. The metadata for these resources followed a conceptual-level recommended standard intended for retrieval rather than integration, so it is difficult to integrate the metadata into a central metadata DB or to link metadata among the data centers. In this paper, we analyze the metadata of the data centers and provide an integrated standard model for the central metadata DB.
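
A common ingredient of such an integration model is a crosswalk from each data center's local metadata fields to a shared schema. The sketch below is only an illustration with hypothetical field names, not the standard model proposed in the paper.

```python
# Minimal sketch: map heterogeneous per-center metadata records onto one
# shared schema before loading them into a central metadata DB.
# All center and field names here are hypothetical placeholders.

CROSSWALKS = {
    "center_a": {"title": "doc_title", "creator": "author", "date": "pub_date"},
    "center_b": {"title": "name",      "creator": "maker",  "date": "issued"},
}

def to_central_schema(center, record):
    """Translate one local record into the shared {title, creator, date} schema."""
    mapping = CROSSWALKS[center]
    return {common: record.get(local) for common, local in mapping.items()}

print(to_central_schema("center_b", {"name": "Annual report", "issued": "2006"}))
```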

A Hot-Data Replication Scheme Based on Data Access Patterns for Enhancing Processing Speed of MapReduce (맵-리듀스의 처리 속도 향상을 위한 데이터 접근 패턴에 따른 핫-데이터 복제 기법)

  • Son, Ingook;Ryu, Eunkyung;Park, Junho;Bok, Kyoungsoo;Yoo, Jaesoo
    • The Journal of the Korea Contents Association
    • /
    • v.13 no.11
    • /
    • pp.21-27
    • /
    • 2013
  • In recent years, with the growth of social media and the spread of mobile devices, the volume of data has increased significantly. Hadoop has been widely adopted as a representative distributed storage and processing framework. Tasks in MapReduce jobs over the Hadoop Distributed File System are allocated to map slots as close to the data as possible by considering data locality. However, some data are requested far more frequently than others, depending on the MapReduce analysis tasks. In this paper, we propose a hot-data replication scheme that improves the processing speed of MapReduce according to data access patterns. The proposed scheme reduces task processing time and improves data locality by applying a replica optimization algorithm to hot data with high access frequency. Performance evaluation shows that the proposed scheme outperforms the existing scheme under high access-frequency loads.
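
The idea of raising the replication factor of frequently accessed data can be illustrated with the standard HDFS `setrep` command, as in the minimal sketch below. The access threshold, replication factors, and file list are assumptions, not the paper's replica optimization algorithm.

```python
# Minimal sketch: raise the HDFS replication factor of frequently read files.
# The threshold and replication factors are illustrative; `hdfs dfs -setrep`
# is the standard command for changing a file's replication factor.
import subprocess

HOT_THRESHOLD = 100     # accesses per observation window (assumed)
HOT_REPLICATION = 5     # replicas for hot files (assumed)
COLD_REPLICATION = 3    # HDFS default

def set_replication(path, factor):
    subprocess.run(["hdfs", "dfs", "-setrep", str(factor), path], check=True)

def rebalance(access_counts):
    """access_counts: dict mapping HDFS path -> observed access count."""
    for path, count in access_counts.items():
        factor = HOT_REPLICATION if count >= HOT_THRESHOLD else COLD_REPLICATION
        set_replication(path, factor)

rebalance({"/data/logs/2013-11.csv": 240, "/data/logs/2013-01.csv": 8})
```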

Application of Statistical Analysis to Analyze the Spatial Distribution of Earthquake-induced Strain Data (지진유발 변형률 데이터의 분포 특성 분석을 위한 응용통계기법의 적용)

  • Kim, Bo-Ram;Chae, Byung-Gon;Kim, Yongje;Seo, Yong-Seok
    • The Journal of Engineering Geology
    • /
    • v.23 no.4
    • /
    • pp.353-361
    • /
    • 2013
  • To analyze the distribution of earthquake-induced strain data in rock masses, statistical analysis was performed on four-directional strain data obtained from a ground movement monitoring system installed in Korea. Strain data related to the 2011 Tohoku-oki earthquake and two aftershocks of >M7.0 in 2011 were used in x-MR control chart analysis, a type of univariate statistical analysis that can detect an abnormal distribution. The analysis revealed different dispersion times for each measurement orientation. In a more comprehensive analysis, the strain data were re-evaluated using multivariate statistical analysis (MSA), considering correlations among the various data from the different measurement orientations. $T^2$ and Q-statistics, based on principal component analysis, were used to analyze the time-series strain data in real time. The procedures were performed with 99.9%, 99.0%, and 95.0% control limits. It is possible to use the MSA data to successfully detect an abnormal distribution caused by earthquakes because the dispersion time using the 99.9% control limit is concurrent with or earlier than that from the x-MR analysis. In addition, the dispersion using the 99.0% and 95.0% control limits detected an abnormal distribution in advance. This finding indicates the potential use of MSA for recognizing abnormal distributions of strain data.
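
The $T^2$ and Q statistics above follow the standard construction from a principal component model of the multivariate strain series; the sketch below shows that construction with NumPy. The synthetic data, the number of retained components, and the empirical control limits are illustrative assumptions, not the monitoring data or limits used in the paper.

```python
# Minimal sketch: Hotelling's T^2 and Q (SPE) statistics from a PCA model
# of multi-directional strain data. Data and retained components are assumed.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-in for 4-directional strain data

# PCA on mean-centered data via SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                    # retained principal components (assumed)
P = Vt[:k].T                             # loadings (4 x k)
lam = (s[:k] ** 2) / (len(X) - 1)        # component variances

T = Xc @ P                               # scores
t2 = np.sum((T ** 2) / lam, axis=1)      # Hotelling's T^2 per observation

residual = Xc - T @ P.T
q = np.sum(residual ** 2, axis=1)        # Q statistic (squared prediction error)

# Simple empirical control limits (99% quantiles here); the paper derives
# 99.9%, 99.0%, and 95.0% limits from the reference distributions.
t2_limit = np.quantile(t2, 0.99)
q_limit = np.quantile(q, 0.99)
print("out-of-control points:", np.where((t2 > t2_limit) | (q > q_limit))[0][:10])
```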

Decombined Distributed Parallel VQ Codebook Generation Based on MapReduce (맵리듀스를 사용한 디컴바인드 분산 VQ 코드북 생성 방법)

  • Lee, Hyunjin
    • Journal of Digital Contents Society
    • /
    • v.15 no.3
    • /
    • pp.365-371
    • /
    • 2014
  • In the era of big data, algorithms designed for the existing IT environment cannot run directly on a distributed architecture such as Hadoop. New distributed algorithms that use a distributed framework such as MapReduce are therefore needed. Lloyd's algorithm, commonly used for vector quantization, has recently been implemented with MapReduce. In this paper, we propose a decombined distributed VQ codebook generation algorithm, built on a MapReduce-based distributed VQ codebook generation algorithm, to obtain results faster. Applying the proposed algorithm to big data showed higher performance than the conventional method.
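
One iteration of Lloyd's algorithm maps naturally onto the map/reduce pattern mentioned above: the map step assigns each training vector to its nearest codeword, and the reduce step averages the vectors assigned to each codeword. The sketch below is a single-machine illustration of that pattern, not the decombined algorithm proposed in the paper; the data and codebook size are assumed.

```python
# Minimal sketch: one MapReduce-style iteration of Lloyd's algorithm for
# VQ codebook generation (single-machine illustration of the map/reduce pattern).
import numpy as np
from collections import defaultdict

def map_phase(vectors, codebook):
    """Map: emit (nearest codeword index, vector) pairs."""
    for v in vectors:
        idx = int(np.argmin(np.linalg.norm(codebook - v, axis=1)))
        yield idx, v

def reduce_phase(pairs, codebook):
    """Reduce: recompute each codeword as the mean of its assigned vectors."""
    groups = defaultdict(list)
    for idx, v in pairs:
        groups[idx].append(v)
    new_codebook = codebook.copy()
    for idx, vs in groups.items():
        new_codebook[idx] = np.mean(vs, axis=0)
    return new_codebook

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 8))                          # training vectors (assumed)
codebook = data[rng.choice(len(data), 16, replace=False)]  # 16-entry codebook

for _ in range(10):                                        # a few Lloyd iterations
    codebook = reduce_phase(map_phase(data, codebook), codebook)
```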

A Comparative Study of Clustering Algorithms in Data Mining (데이터 마이닝에서의 군집분석 알고리즘 비교 연구)

  • Lee, Yeong-Seop;An, Mi-Yeong
    • 한국데이터정보과학회:학술대회논문집 (Korean Data and Information Science Society: Conference Proceedings)
    • /
    • 2003.05a
    • /
    • pp.19-25
    • /
    • 2003
  • Simply describing the patterns and relationships inherent in a database can provide information needed for decision making; dividing the data into small groups with similar characteristics in order to find such patterns is called cluster analysis. Cluster analysis methods fall into partitioning methods and hierarchical methods; here we compare several partitioning algorithms, which allow observations to be reassigned. Partitioning algorithms include the k-means algorithm, which uses the mean as the cluster center, and PAM, CLARA, and CLARANS, which use medoids as centers. We describe the theory behind these algorithms along with their strengths and weaknesses, and compare them in terms of variance and the average distance between centers (a small comparison sketch follows below).

  • PDF
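
To make the comparison concrete, the sketch below contrasts a mean-based update (k-means) with a medoid-based update (the core idea behind PAM, which CLARA and CLARANS sample or search) on synthetic data. It is a simplified illustration under assumed data, not the evaluation in the paper.

```python
# Minimal sketch: mean-based (k-means) vs. medoid-based cluster centers
# on synthetic data; a simplified illustration of the compared families.
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ((0, 0), (4, 4), (0, 5))])
k = 3
centers = X[rng.choice(len(X), k, replace=False)]

def assign(X, centers):
    return np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)

for _ in range(20):
    labels = assign(X, centers)
    # k-means update: center = mean of the cluster members.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

labels = assign(X, centers)
# Medoid-based update: center = the cluster member minimizing total distance
# to the other members (the idea behind PAM; CLARA/CLARANS approximate it).
medoids = []
for j in range(k):
    members = X[labels == j]
    dists = np.linalg.norm(members[:, None] - members[None], axis=2).sum(axis=1)
    medoids.append(members[np.argmin(dists)])
print("means:\n", centers, "\nmedoids:\n", np.array(medoids))
```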

Performance Analysis of Distributed Hadoop Systems (분산 하둡 시스템의 성능 비교 분석)

  • Bae, Byoung-Jin;Kim, Young-Joo;Kim, Young-Kuk
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2014.05a
    • /
    • pp.479-482
    • /
    • 2014
  • Open-source Hadoop systems are now widely used to manage fast-growing big data efficiently. A Hadoop system consists of a distributed file system called HDFS (Hadoop Distributed File System) and a distributed parallel processing system called MapReduce. MapReduce reads big data from HDFS, processes it, and writes the results back to HDFS. The internal structure of this processing path differs between Hadoop versions. This paper therefore presents a performance analysis of Hadoop systems. To this end, we devise a way to monitor Hadoop systems and use it to measure the occurrence frequency of processes, threads, and variables generated inside the Hadoop system itself. The measured results serve as indicators for predicting the internal performance of Hadoop systems (a minimal monitoring sketch follows below).

  • PDF
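
Process- and thread-level indicators of the kind described above can be collected with a small monitoring script; the sketch below uses the third-party `psutil` package to count threads in Hadoop daemon processes. The daemon names are the usual Hadoop JVM main classes, but the script as a whole is an illustrative assumption, not the measurement tool used in the paper.

```python
# Minimal sketch: sample process/thread counts of Hadoop daemons as a crude
# performance indicator. Uses the third-party `psutil` package; the daemon
# names are the usual Hadoop JVM main classes.
import time
import psutil

DAEMONS = ("NameNode", "DataNode", "ResourceManager", "NodeManager")

def sample():
    counts = {}
    for proc in psutil.process_iter(["cmdline", "num_threads"]):
        cmd = " ".join(proc.info["cmdline"] or [])
        threads = proc.info["num_threads"] or 0
        for daemon in DAEMONS:
            if daemon in cmd:
                counts[daemon] = counts.get(daemon, 0) + threads
    return counts

for _ in range(3):                # take a few samples
    print(time.strftime("%H:%M:%S"), sample())
    time.sleep(5)
```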