• Title/Summary/Keyword: Distributed Data Analysis

Performance Analysis of Distributed Hadoop Systems (분산 하둡 시스템의 성능 비교 분석)

  • Bae, Byoung-Jin;Kim, Young-Joo;Kim, Young-Kuk
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2014.05a / pp.479-482 / 2014
  • Nowadays, open-source Hadoop systems are widely used to manage fast-growing big data efficiently. A Hadoop system consists of a distributed file system called HDFS (Hadoop Distributed File System) and a distributed parallel processing system called MapReduce. MapReduce reads big data from HDFS, processes it, and writes the results back to HDFS. The system structure behind this processing differs between Hadoop versions. This paper therefore presents a performance analysis of Hadoop systems. To this end, we devise a method for monitoring Hadoop systems and use it to measure the occurrence frequency of the processes, threads, and variables generated within the Hadoop system itself. The measured results then serve as analysis indicators for predicting the internal performance of Hadoop systems.
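
The read, process, and write-back flow described in this abstract, applied to its frequency-counting task, can be sketched in plain Python. This is only an illustration of the map, shuffle, and reduce pattern, not the authors' monitoring tool: the record layout, the sample events in `SAMPLE_EVENTS`, and all function names are assumptions, and no Hadoop API is used.

```python
from collections import defaultdict

# Hypothetical monitoring records as (kind, name) pairs; made up for illustration.
SAMPLE_EVENTS = [
    ("process", "DataNode"),
    ("thread", "IPC Server handler"),
    ("process", "NameNode"),
    ("thread", "IPC Server handler"),
    ("variable", "blockSize"),
    ("thread", "PacketResponder"),
]


def map_phase(events):
    """Emit a (key, 1) pair for every monitored event."""
    for kind, name in events:
        yield (kind, name), 1


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values per key to obtain occurrence frequencies."""
    return {key: sum(values) for key, values in groups.items()}


def occurrence_frequency(events):
    """Run the three phases in sequence on an in-memory list of events."""
    return reduce_phase(shuffle(map_phase(events)))


if __name__ == "__main__":
    for key, count in sorted(occurrence_frequency(SAMPLE_EVENTS).items()):
        print(key, count)
```

In a real cluster the input would be read from HDFS and the reduced counts written back to it; here both stay in memory.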

Performance Analysis for Maintaining Distributed Views

  • Lee, Wookey
    • Proceedings of the Korea Database Society Conference / 1997.10a / pp.515-523 / 1997
  • Maintaining materialized views and/or replicas can basically be regarded as a client-server architecture that extracts changes from the distributed source data and transfers them to the relevant target sites. View maintenance and materialized views are considered important for solving problems in areas such as decision support, active databases, data warehouses, temporal databases, and internet applications. This paper presents an analysis that formulates cost functions and evaluates them with respect to propagation subjects, propagation objects, and update policies. The propagation subject can be the client side, the server side, or a third party; the objects can be base tables, semi-base tables, or delta files; and the update policies can be immediate, deferred, or periodic.

Flood Runoff Analysis of Multi-purpose Dam Watersheds in the Han River Basin using a Grid-based Rainfall-Runoff Model (격자기반의 강우유출모형을 통한 한강수계 다목적댐의 홍수유출해석)

  • Park, In-Hyeok;Park, Jin-Hyeog;Hur, Young-Teck
    • Journal of Korean Society on Water Environment / v.27 no.5 / pp.587-596 / 2011
  • Interest in hydrological modeling has increased significantly in recent years owing to the need for watershed management. Lumped models in particular are widely used because their relatively simple algorithms require less simulation time. However, lumped models rely on empirical coefficients for hydrological analysis, and these do not account for the heterogeneity of site-specific characteristics. Distributed models were proposed as an alternative to overcome this limitation, and the number of studies on watershed management with distributed models has been increasing steadily. In this study, the feasibility of a grid-based rainfall-runoff model was therefore reviewed using the flood runoff process in the Han River basin, including the Chungju Dam, Hoengseong Dam, and Soyang Dam watersheds. Hydrological parameters based on GIS/RS were extracted from basic GIS data such as DEM, land cover, soil map, and rainfall depth. The accuracy of the runoff analysis was evaluated using EFF, NRMSE, and QER. The calculated results agreed well with the observed data. Apart from the ungauged spatial characteristics in the Soyang Dam watershed, EFF showed a good result of 0.859.
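
The abstract names its accuracy measures without defining them. The sketch below assumes that EFF is the Nash-Sutcliffe model efficiency and that NRMSE is the root-mean-square error divided by the mean observed discharge; both readings are assumptions, and the paper may normalize differently. QER is left out because its definition cannot be recovered from the abstract. The function names are hypothetical.

```python
import math


def nash_sutcliffe_efficiency(observed, simulated):
    """Assumed meaning of EFF: 1 - sum((o - s)^2) / sum((o - mean(o))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread


def nrmse(observed, simulated):
    """RMSE normalized by the mean observed value (assumed normalization)."""
    n = len(observed)
    rmse = math.sqrt(sum((o - s) ** 2 for o, s in zip(observed, simulated)) / n)
    return rmse / (sum(observed) / n)


if __name__ == "__main__":
    # Made-up hydrograph values, for illustration only.
    obs = [10.0, 20.0, 30.0, 40.0]
    sim = [12.0, 19.0, 33.0, 38.0]
    print("EFF:", nash_sutcliffe_efficiency(obs, sim))
    print("NRMSE:", nrmse(obs, sim))
```

Under this definition an EFF of 1 means a perfect match, and 0 means the model is no better than predicting the observed mean at every time step.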

Analysis on the effect of harmonic loads on other loads in a distributed generation environment (분산전원 환경에서의 인근 수용가의 특정 수용가에 대한 고조파 영향 분석)

  • Song, Chong-Suk;Byeon, Gil-Sung;Hwang, Sung-Chul;Jang, Gil-Soo;Han, Woon-Ki;Park, Chan-Eom;Go, Won-Sik
    • Proceedings of the KIEE Conference / 2011.07a / pp.398-399 / 2011
  • This paper analyzes the effect of harmonics on loads in a distributed generation environment that includes renewable energy sources such as wind farms. It assesses the limits on the harmonic content that may be present in adjacent loads while conforming to distributed generation connection standards. The analysis is performed in PSCAD/EMTDC using field measurements of wind data.

DETECTING VARIABILITY IN ASTRONOMICAL TIME SERIES DATA: APPLICATIONS OF CLUSTERING METHODS IN CLOUD COMPUTING ENVIRONMENTS

  • Shin, Min-Su;Byun, Yong-Ik;Chang, Seo-Won;Kim, Dae-Won;Kim, Myung-Jin;Lee, Dong-Wook;Ham, Jae-Gyoon;Jung, Yong-Hwan;Yoon, Jun-Weon;Kwak, Jae-Hyuck;Kim, Joo-Hyun
    • The Bulletin of The Korean Astronomical Society / v.36 no.2 / pp.131.1-131.1 / 2011
  • We present applications of clustering methods to detect variability in massive astronomical time series data. Focusing on the variability of bright stars, we use clustering methods to separate possible variable sources from other time series data, which include intrinsically non-variable sources and data with common systematic patterns. We have completed the analysis of the Northern Sky Variability Survey data, which include about 16 million light curves, and present candidate variable sources together with their associations to data at other wavelengths. We also apply our clustering method to the light curves of bright objects in the SuperWASP Data Release 1. For the analysis of the SuperWASP data, we use an elastically configurable cloud computing environment that the KISTI Supercomputing Center is deploying. Our cloud computing test bed incorporates two quite different configurations. One system uses Hadoop distributed processing with its distributed file system, which runs the processing under a data locality condition. The other adopts Condor and the Lustre network file system. We present test results on the performance of processing a large number of light curves and of finding clusters of variable and non-variable objects.

Implementation of a Real-time Data fusion Algorithm for Flight Test Computer (비행시험통제컴퓨터용 실시간 데이터 융합 알고리듬의 구현)

  • Lee, Yong-Jae;Won, Jong-Hoon;Lee, Ja-Sung
    • Journal of the Korea Institute of Military Science and Technology / v.8 no.4 s.23 / pp.24-31 / 2005
  • This paper presents an implementation of a real-time multi-sensor data fusion algorithm for a Flight Test Computer. The sensor data consist of positional information about the target from a radar, a GPS receiver, and an INS. The data fusion algorithm is designed as a 21st-order distributed Kalman filter based on the PVA model with sensor bias states. Fault detection and correction logic is included in the algorithm to handle bad measurements and sensor faults. The statistical parameters for the states are obtained from Monte Carlo simulations and covariance analysis using test tracking data. The designed filter is verified with real data in both post-processing and real-time processing.
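
The fusion idea in this abstract can be illustrated with a scalar Kalman measurement update applied to one position coordinate. This is a deliberately reduced sketch, not the paper's 21st-order distributed filter: there is no PVA dynamics model and no bias state, and the innovation gate standing in for the fault detection logic, its threshold `gate`, and all sample numbers are assumptions.

```python
def kalman_update(estimate, variance, measurement, meas_variance):
    """Scalar Kalman measurement update; returns the new (estimate, variance)."""
    gain = variance / (variance + meas_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance


def fuse(measurements, gate=9.0):
    """Fuse (value, variance) pairs sequentially, starting from the first one.

    A measurement is rejected when its squared innovation exceeds `gate`
    times the innovation variance (a hypothetical 3-sigma test).
    """
    estimate, variance = measurements[0]
    rejected = []
    for value, meas_variance in measurements[1:]:
        innovation = value - estimate
        if innovation ** 2 > gate * (variance + meas_variance):
            rejected.append((value, meas_variance))
            continue
        estimate, variance = kalman_update(estimate, variance, value, meas_variance)
    return estimate, variance, rejected


if __name__ == "__main__":
    # Made-up readings of one coordinate in metres: radar, GPS, INS, and a faulty value.
    readings = [(1002.0, 25.0), (1000.0, 4.0), (1001.0, 9.0), (5000.0, 4.0)]
    print(fuse(readings))
```

With the sample readings, the fused variance ends up below that of the most accurate single sensor, and the outlier is skipped rather than pulled into the estimate.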

A Federated Multi-Task Learning Model Based on Adaptive Distributed Data Latent Correlation Analysis

  • Wu, Shengbin;Wang, Yibai
    • Journal of Information Processing Systems / v.17 no.3 / pp.441-452 / 2021
  • Federated learning provides an efficient integrated model for distributed data, allowing different data to be trained locally. Meanwhile, the goal of multi-task learning is to build models for multiple related tasks simultaneously and to obtain the underlying main structure. However, traditional federated multi-task learning models not only impose strict requirements on the data distribution but also demand large amounts of computation and converge slowly, which has hindered their adoption in many fields. In our work, we apply a rank constraint to the weight vectors of the multi-task learning model to adaptively adjust the learning of task similarity according to the distribution of the federated node data. The proposed model has a general framework for solving optimal solutions, which can be used to deal with various data types. Experiments show that our model achieves the best results on different datasets. Notably, our model can still obtain stable results on datasets with large distribution differences. In addition, compared with traditional federated multi-task learning models, our algorithm is able to converge to a local optimal solution within a limited number of training iterations.

The Assessment of Application of the Distributed Runoff Model in accordance with Rainfall Data Form (강우 자료 형태에 따른 분포형 유출 모형의 적용성 평가)

  • Choi, Yong Joon;Kim, Joo Cheol
    • Journal of Korean Society on Water Environment / v.26 no.2 / pp.252-260 / 2010
  • Point rainfall measurements need to be converted to areal rainfall by means of mean areal precipitation (MAP) estimation methods. It is not appropriate to evaluate areal rainfall with a constant drift, because geomorphology influences the rainfall field. Non-stationarity should therefore be considered when estimating areal rainfall. Kriging methods with special functionals would be a suitable tool in this case, and generalized covariance Kriging is the most developed of the various Kriging methods. From this point of view, this study analyzes its applicability to a distributed runoff model. For this purpose, distributed rainfall was created with the Thiessen and Kriging methods, and the distributed rainfall from each method was applied to HyGIS-GRM. The results showed that the simulated runoff differed depending on the form of the rainfall data. The Kriging method, which carries physical meaning, is therefore a useful way to provide rainfall for a distributed rainfall-runoff model.
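
Of the two rainfall forms compared here, the Thiessen method is simple enough to sketch: each grid cell takes the rainfall of its nearest gauge, which is the grid equivalent of Thiessen polygons. The gauge tuple layout, grid conventions, and function names below are assumptions for illustration and are not taken from HyGIS-GRM. Kriging is not shown because it additionally requires a fitted covariance or variogram model.

```python
def thiessen_rainfall_grid(gauges, n_rows, n_cols, cell_size=1.0):
    """Build a rainfall grid by nearest-gauge assignment.

    gauges: list of (x, y, rainfall) tuples in the same units as cell_size.
    Returns grid[row][col], evaluated at cell centres with the origin at (0, 0).
    """
    grid = []
    for row in range(n_rows):
        y = (row + 0.5) * cell_size
        grid_row = []
        for col in range(n_cols):
            x = (col + 0.5) * cell_size
            # The nearest gauge by squared Euclidean distance owns this cell.
            nearest = min(gauges, key=lambda g: (g[0] - x) ** 2 + (g[1] - y) ** 2)
            grid_row.append(nearest[2])
        grid.append(grid_row)
    return grid


def mean_areal_precipitation(grid):
    """Average the gridded rainfall over all cells."""
    cells = [value for row in grid for value in row]
    return sum(cells) / len(cells)


if __name__ == "__main__":
    # Two made-up gauges with rainfall depths in mm.
    gauges = [(0.0, 0.0, 10.0), (4.0, 0.0, 30.0)]
    grid = thiessen_rainfall_grid(gauges, n_rows=1, n_cols=4)
    print(grid, mean_areal_precipitation(grid))
```

The resulting field is piecewise constant with sharp jumps at polygon boundaries, which is one reason a smoother interpolator such as Kriging can produce different runoff from the same gauges.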

A Study on Hierarchical Distributed Intrusion Detection for Secure Home Networks Service (안전한 홈네트워크 서비스를 위한 계층적 분산 침입탐지에 관한 연구)

  • Yu, Jae-Hak;Choi, Sung-Back;Yang, Sung-Hyun;Park, Dai-Hee;Chung, Yong-Wha
    • Journal of the Korea Institute of Information Security & Cryptology / v.18 no.1 / pp.49-57 / 2008
  • In this paper, we propose a novel hierarchical distributed intrusion detection system, named HNHDIDS (Home Network Hierarchical Distributed Intrusion Detection System), which is not only based on the structure of a distributed intrusion detection system but also fully considers the environment of secure home network services. The proposed system is hierarchically composed of a one-class support vector machine (support vector data description) and local agents, and it is designed to be optimized for the environment of secure home network services. We support our findings with computer experiments and analysis.

Performance Factor of Distributed Processing of Machine Learning using Spark (스파크를 이용한 머신러닝의 분산 처리 성능 요인)

  • Ryu, Woo-Seok
    • The Journal of the Korea institute of electronic communication sciences / v.16 no.1 / pp.19-24 / 2021
  • In this paper, we study the performance factors of machine learning in a distributed environment using Apache Spark and present an efficient distributed processing method through experiments. We first identify the performance factors involved in running machine learning on a distributed cluster, classifying them into cluster performance, data size, and Spark engine configuration. We then study the performance of regression analysis using Spark MLlib on a Hadoop cluster while changing the configuration of the nodes and the Spark executors. The experiments confirmed that the effective number of executors was affected by the number of data blocks, but that, depending on the cluster size, its maximum and minimum values were limited by the number of cores and the number of worker nodes, respectively.
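
One way to read the finding in this abstract is as a clamp: the executor count follows the number of data blocks but is bounded below by the number of worker nodes and above by the number of cores. The function below encodes only that reading. It is an assumption made for illustration, not a formula from the paper or a Spark API, and `cores_per_executor` is a hypothetical parameter.

```python
def effective_executors(num_blocks, total_cores, num_workers, cores_per_executor=1):
    """Clamp the block count between the worker count and the core-limited maximum.

    Hypothetical heuristic: one executor per data block, at least one executor
    per worker node, and no more executors than the cluster's cores can host.
    """
    upper = max(num_workers, total_cores // cores_per_executor)
    return max(num_workers, min(num_blocks, upper))


if __name__ == "__main__":
    # Made-up cluster: 4 worker nodes with 32 cores in total.
    for blocks in (2, 8, 100):
        print(blocks, effective_executors(blocks, total_cores=32, num_workers=4))
```

A value chosen this way would still need to be validated by measurement on the target cluster, as the paper itself does experimentally.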