• Title/Abstract/Keyword: large data sets

Search results: 506 items

Data Reduction Method in Massive Data Sets

  • Namo, Gecynth Torre; Yun, Hong-Won
    • Journal of Information and Communication Convergence Engineering / Vol. 7, No. 1 / pp.35-40 / 2009
  • Many researchers are working to improve the performance of RFID systems, and many papers have been written to address one of the major drawbacks of this potent technology: data management. Because an RFID system captures billions of data records, dirty data and sheer data volume cause problems that the RFID community is actively trying to address. Effective data management is especially important for handling such large volumes. This paper presents data reduction techniques that address these issues and introduces readers to a new data reduction algorithm that might be an alternative for reducing data in RFID systems. A process for extracting data from the reduced database is also presented. A performance study is conducted to analyze the new data reduction algorithm; our analysis shows the utility and feasibility of our categorization reduction algorithm.
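
The abstract does not spell out the categorization reduction algorithm itself, but the general flavor of RFID data reduction can be illustrated with a simple duplicate-suppression sketch in Python; the `reduce_reads` function, the time window, and the (tag, reader, timestamp) read format are illustrative assumptions, not the paper's method:

```python
from datetime import datetime, timedelta

def reduce_reads(reads, window_seconds=10):
    """Collapse repeated (tag, reader) observations that fall within a
    time window into a single representative read."""
    reduced = []
    last_seen = {}  # (tag, reader) -> timestamp of the last kept read
    for tag, reader, ts in sorted(reads, key=lambda r: r[2]):
        key = (tag, reader)
        if key not in last_seen or (ts - last_seen[key]).total_seconds() > window_seconds:
            reduced.append((tag, reader, ts))
            last_seen[key] = ts
    return reduced

t0 = datetime(2009, 1, 1, 12, 0, 0)
raw = [
    ("tagA", "R1", t0),
    ("tagA", "R1", t0 + timedelta(seconds=2)),   # duplicate within the window
    ("tagA", "R1", t0 + timedelta(seconds=30)),  # a genuinely new observation
    ("tagB", "R1", t0 + timedelta(seconds=1)),
]
print(len(reduce_reads(raw)))  # 3 of the 4 raw reads survive
```

Widening the window trades recall for compression: with `window_seconds=60` the same input reduces to two reads.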

Min-Hash를 이용한 효율적인 대용량 그래프 클러스터링 기법 (An Efficient Large Graph Clustering Technique based on Min-Hash)

  • 이석주; 민준기
    • 정보과학회 논문지 / Vol. 43, No. 3 / pp.380-388 / 2016
  • Graph clustering groups vertices with similar characteristics into the same cluster and is widely used to analyze graph data and understand its properties. Recently, very large graph data sets have been generated in diverse application areas such as social network services, the World Wide Web, and telephone networks. Accordingly, clustering techniques that process large graph data efficiently are becoming increasingly important. In this paper, we propose a clustering algorithm that efficiently generates clusters over large graph data. Our technique uses Min-Hash to estimate the similarity between clusters effectively and generates clusters according to the computed similarities. In experiments on real-world data, we demonstrate the efficiency of the proposed technique by comparing it with existing graph clustering methods.
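
The Min-Hash similarity estimate the paper builds on can be sketched as follows; the signature length and the use of Python's built-in `hash` over seeded tuples are illustrative choices, not the authors' implementation:

```python
import random

def minhash_signature(item_set, hash_seeds):
    """One min-hash value per seeded hash function."""
    return [min(hash((seed, x)) for x in item_set) for seed in hash_seeds]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching min-hash positions approximates the
    Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(42)
seeds = [random.random() for _ in range(200)]
a = set(range(0, 80))    # e.g. the member set of one cluster
b = set(range(20, 100))  # overlaps a on 60 of 100 distinct items
est = estimated_jaccard(minhash_signature(a, seeds), minhash_signature(b, seeds))
true = len(a & b) / len(a | b)  # exactly 0.6
print(abs(est - true) < 0.15)   # the estimate lands close to the truth
```

The point of the technique is that signatures are tiny compared to the sets, so cluster-to-cluster similarity can be estimated without touching the full adjacency data.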

TCP/IP 소켓통신에서 대용량 스트링 데이터의 전송 속도를 높이기 위한 송수신 모델 설계 및 구현 (A design and implementation of transmit/receive model to speed up the transmission of large string-data sets in TCP/IP socket communication)

  • 강동조; 박현주
    • 한국정보통신학회논문지 / Vol. 17, No. 4 / pp.885-892 / 2013
  • In a transmit/receive model that exchanges data over TCP/IP sockets, the communication speed between server and client matters little when the data are small and transfer requests are infrequent. Today, however, with frequent requests to transfer large volumes of data, the communication speed of the transmit/receive model has become important. This paper proposes a more efficient TCP/IP transmit/receive model that can improve data transfer speed in a multi-core (CMP: Chip Multi-Processor) environment by restructuring how the server transmits large data and how the client receives it.
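
The paper's multi-core model is not detailed in the abstract, but the baseline it improves on, a length-prefixed, chunked string transfer over a TCP-style socket, might look like the following sketch (the chunk size and 4-byte framing are assumptions for illustration):

```python
import socket
import struct
import threading

CHUNK = 64 * 1024  # stream in fixed-size chunks instead of one huge write

def send_string(sock, text):
    """Length-prefix the UTF-8 payload, then send it chunk by chunk."""
    data = text.encode("utf-8")
    sock.sendall(struct.pack("!I", len(data)))
    for i in range(0, len(data), CHUNK):
        sock.sendall(data[i:i + CHUNK])

def _recv_exact(sock, n):
    """Loop until exactly n bytes have arrived (recv may return less)."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(min(n - len(buf), CHUNK))
        if not part:
            raise ConnectionError("peer closed before the payload completed")
        buf.extend(part)
    return bytes(buf)

def recv_string(sock):
    """Read the 4-byte length header, then the full payload."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return _recv_exact(sock, length).decode("utf-8")

client, server = socket.socketpair()
payload = "x" * 1_000_000  # a large string
received = []
reader = threading.Thread(target=lambda: received.append(recv_string(server)))
reader.start()
send_string(client, payload)
reader.join()
print(received[0] == payload)  # True
```

The explicit framing matters because TCP is a byte stream: without the length prefix, the receiver cannot tell where one large string ends and the next begins.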

Constructing Reference Transcriptome Sets of Codonopsis lanceolata (Deodeok) and Ixeridium dentatum

  • Tae-Ho Lee; Yun-Ho Oh
    • 한국작물학회:학술대회논문집 / 한국작물학회 2022년도 추계학술대회 / pp.242-242 / 2022
  • As the population ages and interest in well-being grows, developing special-purpose crops is becoming increasingly important. Natural medicine based on such crops has mainly been used as an adjunct therapy for many diseases and symptoms, grounded in culture, traditional medicine, and experience. In particular, these crops are attracting attention as a new resource for developing drugs such as Artemisinin, a treatment for malaria. To use such crops efficiently, it is essential to establish omics data, including genomes, transcriptomes, and metabolites. However, many special-purpose crops have large, heterogeneous, and polyploid genomes, so reference genome sequencing requires high cost and a long time. Therefore, as a first step, we built inexpensive, fast, yet very useful reference transcriptomes. We constructed high-quality reference transcriptome sets of Codonopsis lanceolata and Ixeridium dentatum from PacBio data. Our team will continue to construct reference transcriptomes of more special-purpose crops, and the data will be released by the National Agricultural Biotechnology Information Center (NABIC) so that they can be widely used in agricultural as well as medical R&D.


데이타 마이닝에서 기존의 연관 규칙을 갱신하는 앨고리듬 개발 (An Algorithm for Updating Discovered Association Rules in Data Mining)

  • 이동명; 지영근; 황종원; 강맹규
    • 산업경영시스템학회지 / Vol. 20, No. 43 / pp.265-276 / 1997
  • There have been many studies on the efficient discovery of association rules in large databases. However, maintaining such discovered rules is nontrivial, because a database may be updated frequently or occasionally, and such updates may not only invalidate some existing strong association rules but also turn some weak rules into strong ones. The main idea of the updating algorithm is to reuse the information of the old large itemsets and to integrate the support information of the new large itemsets, thereby substantially reducing the pool of candidate sets to be re-examined. In this paper, an updating algorithm is proposed for the efficient maintenance of discovered association rules when new transaction data are added to a transaction database. The superiority of the proposed algorithm is shown by comparison with the previously proposed FUP algorithm.
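
The core idea, reusing the old large-itemset counts and scanning only the newly added transactions, can be sketched as below. This toy version handles only itemsets up to size two and omits the FUP-style rescan for itemsets that were previously small but become large; function names and data are illustrative:

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Plain mining of frequent 1- and 2-itemsets (Apriori's first passes)."""
    counts = Counter()
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
        for pair in combinations(sorted(t), 2):
            counts[frozenset(pair)] += 1
    n = len(transactions)
    return {s: c for s, c in counts.items() if c / n >= min_support}

def update_counts(old_counts, old_n, new_transactions, min_support):
    """Incremental update: reuse counts from the old database and scan
    only the newly added transactions, instead of re-mining everything."""
    merged = Counter(old_counts)
    for t in new_transactions:
        for s in list(merged):
            if s <= set(t):
                merged[s] += 1
    n = old_n + len(new_transactions)
    return {s: c for s, c in merged.items() if c / n >= min_support}

old_db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
old = frequent_itemsets(old_db, min_support=0.5)
updated = update_counts(old, len(old_db), [{"b", "c"}, {"a", "b"}], min_support=0.5)
print(frozenset({"a", "b"}) in updated)  # {a, b} stays large after the update
```

Note how `{a, c}` drops out after the update: its old support (2/3) falls below the threshold over the enlarged database (2/5), exactly the kind of invalidation the abstract describes.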


사례기반추론을 이용한 대용량 데이터의 실시간 처리 방법론 : 고혈압 고위험군 관리를 위한 자기학습 시스템 프레임워크 (Data Mining Approach for Real-Time Processing of Large Data Using Case-Based Reasoning : High-Risk Group Detection Data Warehouse for Patients with High Blood Pressure)

  • 박성혁; 양근우
    • 한국IT서비스학회지 / Vol. 10, No. 1 / pp.135-149 / 2011
  • In this paper, we propose a high-risk group detection model for patients with high blood pressure using case-based reasoning. The proposed model enables public health organizations to effectively manage knowledge related to high blood pressure and to efficiently allocate limited health care resources. In particular, the focus is on developing a model that can handle constraints such as managing a large volume of data, learning automatically to adapt to changes in the external environment, and operating in real time. Using real data collected from local public health centers, the optimal high-risk group detection model was derived with optimal parameter sets. Performance tests on test data show that the prediction accuracy of the proposed model is twice the natural risk of high blood pressure.
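
The retrieve-and-reuse steps of case-based reasoning that such a model rests on can be illustrated with a minimal nearest-neighbour sketch; the features, cases, and thresholds below are invented for illustration and are not the authors' data or parameters:

```python
import math

def retrieve(case_base, query, k=3):
    """CBR retrieval step: the k stored cases whose feature vectors are
    closest (Euclidean distance) to the query patient."""
    return sorted(case_base, key=lambda case: math.dist(case["features"], query))[:k]

def predict_risk(case_base, query, k=3):
    """CBR reuse step: majority vote over the retrieved cases' outcomes."""
    votes = [c["high_risk"] for c in retrieve(case_base, query, k)]
    return sum(votes) > k / 2

# features: (age, systolic BP, BMI) -- purely illustrative
cases = [
    {"features": (62, 155, 29), "high_risk": True},
    {"features": (58, 148, 31), "high_risk": True},
    {"features": (35, 118, 22), "high_risk": False},
    {"features": (29, 112, 20), "high_risk": False},
    {"features": (66, 160, 33), "high_risk": True},
]
print(predict_risk(cases, (60, 150, 30)))  # True: nearest cases are high-risk
```

The self-learning aspect in the abstract corresponds to simply appending newly confirmed cases to `case_base`, which is what makes CBR attractive for real-time operation over growing data.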

대용량 자료에 대한 서포트 벡터 회귀에서 모수조절 (Parameter Tuning in Support Vector Regression for Large Scale Problems)

  • 류지열; 곽민정; 윤민
    • 한국지능시스템학회논문지 / Vol. 25, No. 1 / pp.15-21 / 2015
  • Tuning kernel parameters affects the generalization ability of support vector machines, and determining appropriate values for these parameters is often a difficult task. In support vector regression, the burden of determining such values can be reduced by using ensemble learning, but that approach is generally too time-consuming to apply directly to large-scale problems. In this paper, we propose a method that decomposes the original data set into a finite number of subsets in order to reduce the burden of parameter tuning in support vector regression. We show that the proposed method is efficient for large data sets, and especially for imbalanced data sets.
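
The decompose-then-solve idea can be illustrated with a toy stand-in: partition the data, fit a simple model per subset, and combine the results. Ordinary least squares replaces SVR here purely to keep the sketch self-contained; the partitioning pattern is the point, not the model:

```python
import random

def partition(data, k):
    """Decompose the full data set into k roughly equal subsets, so each
    tuning/training run works on a small problem instead of the whole set."""
    shuffled = data[:]
    random.Random(0).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def fit_slope(subset):
    """Least-squares slope for y ~ w*x on one subset (stand-in for
    training one regressor per partition)."""
    sxx = sum(x * x for x, _ in subset)
    sxy = sum(x * y for x, y in subset)
    return sxy / sxx

# synthetic data around the line y = 2x with small deterministic noise
data = [(x, 2.0 * x + random.Random(x).uniform(-0.1, 0.1)) for x in range(1, 101)]
subsets = partition(data, k=5)
w = sum(fit_slope(s) for s in subsets) / len(subsets)
print(round(w, 2))  # close to the true slope 2.0
```

Each subset's fit is cheap, and combining the per-subset results recovers the global answer, which is why decomposition eases the tuning burden on large problems.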

유전자 알고리즘과 회귀식을 이용한 오염부하량의 예측 (Estimation of Pollutant Load Using Genetic-algorithm and Regression Model)

  • 박윤식
    • 한국환경농학회지 / Vol. 33, No. 1 / pp.37-43 / 2014
  • BACKGROUND: Water quality data are collected less frequently than flow data because of the cost of collection and analysis, yet water quality data corresponding to flow data are required to compute pollutant loads or to calibrate other hydrology models. Regression models can be used to interpolate water quality data corresponding to flow data. METHODS AND RESULTS: A regression model capable of considering flow and time variance was suggested, and its coefficients were calibrated against various measured water quality data using a genetic algorithm. Both LOADEST and the genetic-algorithm regression were evaluated on 19 water quality data sets through calibration and validation. The genetic-algorithm regression displayed model behavior similar to LOADEST. The load estimates from both models indicated that using a large proportion of the water quality data does not necessarily yield load estimates with smaller error relative to the measured load. CONCLUSION: Regression models need to be calibrated and validated before they are used to interpolate pollutant loads, by separating the water quality data into two sets for calibration and validation.
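
Calibrating regression coefficients with a genetic algorithm can be sketched as follows. The rating-curve form `load = a * flow**b`, the GA operators, and all constants are illustrative assumptions, not the paper's LOADEST-style model:

```python
import random

rng = random.Random(1)

def fitness(params, data):
    """Negative sum of squared errors of load = a * flow**b (higher is better)."""
    a, b = params
    return -sum((load - a * flow ** b) ** 2 for flow, load in data)

def genetic_calibrate(data, pop_size=40, generations=60):
    """Toy elitist GA: keep the best quarter each generation and refill the
    population with Gaussian-mutated copies of those parents."""
    pop = [(1.0, 1.0)] + [(rng.uniform(0, 5), rng.uniform(0, 3))
                          for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda p: fitness(p, data), reverse=True)
        parents = pop[: pop_size // 4]
        children = [
            (max(0.0, p[0] + rng.gauss(0, 0.1)), max(0.0, p[1] + rng.gauss(0, 0.05)))
            for p in rng.choices(parents, k=pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=lambda p: fitness(p, data))

# synthetic rating-curve data with true coefficients a = 1.5, b = 1.2
data = [(q, 1.5 * q ** 1.2) for q in range(1, 20)]
a, b = genetic_calibrate(data)
print(f"calibrated a={a:.2f}, b={b:.2f}")
```

Because the best candidates are carried over unmodified, the calibrated fit can never be worse than the starting guess, which mirrors how GA calibration steadily improves the coefficient estimates.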

MapReduce 시스템을 위한 에너지 관리 알고리즘의 성능평가 (Performance Evaluation of Energy Management Algorithms for MapReduce System)

  • 김민기; 조행래
    • 대한임베디드공학회논문지 / Vol. 9, No. 2 / pp.109-115 / 2014
  • Analyzing large-scale data has become an important activity for many organizations. Since MapReduce is a promising tool for processing massive data sets, a growing number of studies evaluate the performance of MapReduce-related algorithms. In this paper, we first develop a simulation framework that includes a MapReduce workload model, a data center model, and a model of data access patterns. We then propose two algorithms that can reduce the energy consumption of MapReduce systems. Using the simulation framework, we evaluate the performance of the proposed algorithms under different application characteristics and data center configurations.
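
The energy-management theme can be illustrated with a deliberately tiny data-center model in which idle nodes are either left idling or powered down; all power figures and the job model below are invented for illustration, not the paper's simulation framework:

```python
def simulate_energy(jobs, num_nodes, power_active=1.0, power_idle=0.6,
                    scale_down=False):
    """Toy model: each job occupies `nodes_needed` nodes for `duration`
    time units. With scale_down, unused nodes are powered off (zero
    energy) instead of drawing idle power."""
    total = 0.0
    for nodes_needed, duration in jobs:
        busy = min(nodes_needed, num_nodes)
        idle = num_nodes - busy
        idle_power = 0.0 if scale_down else power_idle
        total += duration * (busy * power_active + idle * idle_power)
    return total

jobs = [(4, 10.0), (2, 5.0), (8, 2.0)]  # (nodes needed, duration)
baseline = simulate_energy(jobs, num_nodes=10)
managed = simulate_energy(jobs, num_nodes=10, scale_down=True)
print(managed < baseline)  # powering down idle nodes saves energy
```

Real energy managers must also model the latency and energy cost of waking nodes back up, which this sketch deliberately omits.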

PERFORMANCE EVALUATION OF INFORMATION CRITERIA FOR THE NAIVE-BAYES MODEL IN THE CASE OF LATENT CLASS ANALYSIS: A MONTE CARLO STUDY

  • Dias, Jose G.
    • Journal of the Korean Statistical Society / Vol. 36, No. 3 / pp.435-445 / 2007
  • This paper addresses for the first time the use of complete-data information criteria in unsupervised learning of the Naive-Bayes model. A Monte Carlo study with a large experimental design, unusual in the Bayesian network literature, assesses these criteria. The simulation results show that the complete-data information criteria underperform the Bayesian information criterion (BIC) for these Bayesian networks.
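
BIC itself is straightforward to compute. The sketch below compares latent-class Naive-Bayes models of different sizes; the log-likelihood values are hypothetical, and the parameter count assumes K classes and D binary indicators (K-1 class weights plus K*D conditional probabilities):

```python
import math

def bic(log_likelihood, num_params, n):
    """Bayesian information criterion: -2*logL + p*log(n); lower is better."""
    return -2.0 * log_likelihood + num_params * math.log(n)

def lc_num_params(k, d):
    """Free parameters of a latent-class model with k classes and
    d binary indicator variables."""
    return (k - 1) + k * d

n = 500                # sample size
ll_2class = -1480.0    # hypothetical maximized log-likelihoods
ll_3class = -1465.0
bic2 = bic(ll_2class, lc_num_params(2, 6), n)
bic3 = bic(ll_3class, lc_num_params(3, 6), n)
print(bic2 < bic3)  # True: the extra class isn't worth the likelihood gain
```

With these numbers the three-class model raises the log-likelihood by 15 but pays a penalty of 7 extra parameters times log(500) ≈ 6.2, so BIC prefers the two-class model, exactly the kind of trade-off the Monte Carlo study evaluates.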