• Title/Summary/Keyword: Data stream mining

Discovering Frequent Itemsets Reflected User Characteristics Using Weighted Batch based on Data Stream (스트림 데이터 환경에서 배치 가중치를 이용하여 사용자 특성을 반영한 빈발항목 집합 탐사)

  • Seo, Bok-Il;Kim, Jae-In;Hwang, Bu-Hyun
    • The Journal of the Korea Contents Association / v.11 no.1 / pp.56-64 / 2011
  • Because a data stream is unbounded and continuous, it is difficult to discover frequent itemsets over the whole stream. A specialized mining method that reflects the properties of the data and the requirements of users is therefore required. In this paper, we propose FIMWB, a method that discovers frequent itemsets while reflecting the property that recent events are more important than old ones. The data stream is split into batches according to a given time interval, and the method assigns a weight to each batch to reflect the user's interest in recent events. FP-Digraph then discovers the frequent itemsets from the result of FIMWB. Experimental results show that FIMWB reduces the generation of useless items and that FP-Digraph is better suited to real-time environments than a tree-based method (FP-Tree).
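
The batch-weighting idea in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the FIMWB weighting formula, so the geometric weight `decay ** age`, the function name `weighted_frequent_itemsets`, and all parameters below are assumptions chosen for illustration; the sketch also enumerates small itemsets by brute force rather than using FP-Digraph.

```python
from collections import Counter
from itertools import combinations


def weighted_frequent_itemsets(batches, decay=0.5, min_support=0.5, max_size=2):
    """Return itemsets whose batch-weighted support is at least min_support.

    batches: list of batches, oldest first; each batch is a list of
    transactions (iterables of sortable, hashable items).
    The weight decay**age is an illustrative assumption, not the weighting
    defined by FIMWB.
    """
    newest = len(batches) - 1
    weighted_counts = Counter()
    total_weight = 0.0
    for index, batch in enumerate(batches):
        weight = decay ** (newest - index)  # the newest batch gets weight 1.0
        total_weight += weight * len(batch)
        for transaction in batch:
            items = sorted(set(transaction))
            # Brute-force enumeration of itemsets up to max_size items.
            for size in range(1, max_size + 1):
                for itemset in combinations(items, size):
                    weighted_counts[itemset] += weight
    if total_weight == 0:
        return {}
    return {
        itemset: count / total_weight
        for itemset, count in weighted_counts.items()
        if count / total_weight >= min_support
    }


# Two batches of two transactions each; the second batch is the most recent.
example_batches = [[["a", "b"], ["a"]], [["b", "c"], ["b"]]]
result = weighted_frequent_itemsets(example_batches)
```

With these weights, an item that appears mostly in old batches needs more occurrences to stay frequent than one that appears in recent batches, which is the behavior the abstract describes.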

Dynamic Subspace Clustering for Online Data Streams (온라인 데이터 스트림에서의 동적 부분 공간 클러스터링 기법)

  • Park, Nam Hun
    • Journal of Digital Convergence / v.20 no.2 / pp.217-223 / 2022
  • Subspace clustering for online data streams requires a large amount of memory because all subsets of the data dimensions must be examined. To track the continuous change of clusters in a data stream within a finite memory space, we propose a grid-based subspace clustering algorithm that uses memory efficiently. Given an n-dimensional data stream, the distribution of data items in the data space is monitored by a grid-cell list. When a grid cell in the first-level list becomes dense enough to be a unit grid-cell, a grid-cell list of the next level is created as its child node in order to find clusters in all possible subspaces that contain that cell. In this way, a subspace grid-cell tree of at most n levels is constructed, and a k-dimensional subspace cluster can be found at the k-th level of the tree. Experiments confirmed that the proposed method uses computing resources more efficiently by expanding only the dense regions while maintaining the same accuracy as the existing method.

Sediment discharge assessment and stable channel analysis using Model Tree of data mining for Naesung Stream (데이터 마이닝의 Model Tree를 활용한 내성천의 유사량 산정 및 안정하도 평가)

  • Jang, Eun-Kyung;Ji, Un;Ahn, Myeonghui
    • Journal of Korea Water Resources Association / v.51 no.11 / pp.999-1009 / 2018
  • The Model Tree technique of data mining was applied to derive optimal equations for sediment discharge assessment from measured sediment data and then to evaluate stable channel design for Naesung Stream. The sediment discharge formula developed by the Model Tree technique from data measured in Korean rivers, expressed as a function of channel width, velocity, depth, slope, and median grain diameter, showed a high goodness-of-fit between measured and calculated results. The formula developed from data measured only in Naesung Stream, expressed as a function of channel width, velocity, depth, and median grain diameter, showed the highest goodness-of-fit. Both sediment discharge formulas were applied to the stable channel analysis for Yonghyeol Station on Naesung Stream. The results indicated that bed erosion is expected in the study section relative to the current section, and that the bed slope may become milder than the current slope in order to reach an equilibrium condition in the long term.

Analysis and Evaluation of Frequent Pattern Mining Technique based on Landmark Window (랜드마크 윈도우 기반의 빈발 패턴 마이닝 기법의 분석 및 성능평가)

  • Pyun, Gwangbum;Yun, Unil
    • Journal of Internet Computing and Services / v.15 no.3 / pp.101-107 / 2014
  • With the development of online services, databases have shifted from static structures to dynamic stream structures. Earlier data mining techniques have served as decision-making tools in tasks such as establishing marketing strategies and analyzing DNA. However, areas of recent interest such as sensor networks, robotics, and artificial intelligence require the ability to analyze real-time data more quickly. Landmark window-based frequent pattern mining, one of the stream mining approaches, performs mining operations on parts of a database or on each transaction rather than on all the data. In this paper, we analyze and evaluate two well-known landmark window-based frequent pattern mining algorithms, Lossy counting and hMiner. When Lossy counting mines frequent patterns from a set of new transactions, it performs union operations between the previous and current mining results. hMiner, a state-of-the-art algorithm based on the landmark window model, conducts mining operations whenever a new transaction arrives. Because hMiner extracts frequent patterns as soon as a new transaction is entered, it yields the latest mining results reflecting real-time information; such algorithms are therefore also called online mining approaches. We evaluate and compare the performance of the earlier algorithm, Lossy counting, and the more recent one, hMiner. As criteria, we first consider each algorithm's total runtime and average processing time per transaction. To compare the efficiency of their storage structures, we also evaluate maximum memory usage. Lastly, we show how stably the two algorithms mine databases in which the number of items gradually increases.
    In terms of mining time and transaction processing, hMiner is faster than Lossy counting. Because hMiner stores candidate frequent patterns in a hash structure, it can access them directly, whereas Lossy counting stores them in a lattice and must therefore traverse multiple nodes to reach them. On the other hand, hMiner performs worse than Lossy counting in terms of maximum memory usage. hMiner must keep all of the information for each candidate frequent pattern in order to store it in a hash bucket, while Lossy counting reduces the stored information by using the lattice. Because the lattice lets patterns share the items they have in common, Lossy counting uses memory more efficiently than hMiner. In the scalability evaluation, however, hMiner is more efficient than Lossy counting, for the following reasons. As the number of items increases, fewer items are shared, which weakens Lossy counting's memory efficiency; furthermore, as the number of transactions grows, its pruning becomes less effective. From the experimental results, we conclude that landmark window-based frequent pattern mining algorithms are suitable for real-time systems, although they require a significant amount of memory. Their data structures therefore need to be made more efficient before they can also be used in resource-constrained environments such as wireless sensor networks (WSNs).
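
As a rough illustration of how Lossy counting bounds memory through periodic pruning, here is a minimal Python sketch of the single-item form of the algorithm. It is a simplification: the version evaluated in this paper mines itemsets and stores them in a lattice, which is not shown, and the class and method names (`LossyCounter`, `add`, `frequent`) are assumptions for illustration.

```python
import math


class LossyCounter:
    """Single-item Lossy Counting sketch (the itemset/lattice form is not shown)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon                # maximum relative undercount
        self.width = math.ceil(1 / epsilon)   # bucket width
        self.n = 0                            # number of items seen so far
        self.entries = {}                     # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # A new entry may have been missed in up to bucket - 1 earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # At each bucket boundary, drop entries that cannot be frequent.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Return items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


counter = LossyCounter(epsilon=0.25)
for symbol in ["a", "b", "a", "c", "a", "d", "a", "e"]:
    counter.add(symbol)
frequent_items = counter.frequent(support=0.5)
```

The pruning step at each bucket boundary is what keeps the table small; as the abstract notes, its effectiveness depends on the data, so memory behavior on real streams may differ from this toy example.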

A Study of Heavy Metal Equilibria in Acid Mine Drainage Receiving Stream (광산배수 수용하천의 중금속이온 평형에 관한 연구)

  • Kim, Jin-Beom;Jun, Sang-Ho;Kim, Hee-Jong
    • Economic and Environmental Geology / v.29 no.6 / pp.733-738 / 1996
  • Heavy metal equilibria in the Dongnam stream, which receives wastewater from mining activities, are investigated to provide basic data for the management of small streams affected by acid mine drainage. Saturation, undersaturation, and supersaturation of several heavy metal ions with respect to mineral phases are evaluated using the saturation index, $\log(IAP/K_{sp})$. The $Al^{3+}$ activities showed equilibrium with the $AlOHSO_4$ solid phase below a pH of 6.0. The $Fe^{3+}$ activities appeared to be controlled by the $Fe(OH)_{3(amorphous)}$ solid phase below a pH of 4.0. The $Zn^{2+}$ activities appeared to be regulated by the $ZnCO_3$ solid phase above a pH of 6.8. Some heavy metal activities appeared to depend on pH.
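
The saturation index used in this abstract is the base-10 logarithm of the ratio of the ion activity product (IAP) to the solubility product (Ksp): zero indicates equilibrium, a negative value undersaturation, and a positive value supersaturation. A minimal Python sketch follows; the function names, the `tolerance` parameter, and the numeric inputs are hypothetical placeholders, not values from the paper.

```python
import math


def saturation_index(iap, ksp):
    """Return SI = log10(IAP / Ksp) for one mineral phase."""
    return math.log10(iap / ksp)


def classify(si, tolerance=0.0):
    """Label a saturation index; tolerance is an assumed equilibrium band."""
    if si > tolerance:
        return "supersaturated"
    if si < -tolerance:
        return "undersaturated"
    return "saturated"


# Placeholder inputs chosen only to exercise the function.
si_example = saturation_index(iap=1e-8, ksp=1e-10)
label_example = classify(si_example)
```

In practice a nonzero `tolerance` is often used, because measured activities and tabulated solubility products both carry uncertainty.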

Intrusion Detection based on Clustering a Data Stream (데이터 스트림 클러스터링을 이용한 침임탐지)

  • Oh Sang-Hyun;Kang Jin-Suk;Byun Yung-Cheol
    • Proceedings of the Korea Contents Association Conference / 2005.11a / pp.529-532 / 2005
  • In anomaly intrusion detection, how to model the normal behavior of a user's activities is an important issue. To extract normal behavior as a profile, conventional data mining techniques are widely applied to a finite audit data set. However, these approaches can only model the static behavior of a user in the audit data set. This drawback can be overcome by viewing the continuous activities of a user as an audit data stream. This paper proposes a new clustering algorithm that continuously models a data stream. A set of features is used to represent the characteristics of an activity. For each feature, the proposed algorithm identifies clusters of the feature values that correspond to the activities observed so far in the audit data stream. As a result, new activities of the user can be continuously reflected in the ongoing clustering result without physically maintaining any of the user's historical activities.

A Study for Filter and Signature on IDS (Feature Construction with Data Mining) (IDS에서 Filter 와 Signature 의 역할에 대한 연구 (Date Mining을 이용한 Feature Construction))

  • Lee, Jung-Hyun;Weon, Ill-Young;Lee, Chang-Hun
    • Proceedings of the Korea Information Processing Society Conference / 2001.04b / pp.1089-1092 / 2001
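  • The most important task in an IDS is to model intrusions logically and to develop a Filter that can sense them. At its core is research on methods that automatically recognize Signatures in the event stream, where a Signature is defined as the signal by which a specific attack behavior can be recognized among the events generated by the Filter. This paper studies Feature Construction: extracting features from raw audit data such as network and host input streams, and generating profiles from the raw data in a format in which the features are defined, so that the Filter and Signature can use them and detection modules can be applied.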

Real-Time Ransomware Infection Detection System Based on Social Big Data Mining (소셜 빅데이터 마이닝 기반 실시간 랜섬웨어 전파 감지 시스템)

  • Kim, Mihui;Yun, Junhyeok
    • KIPS Transactions on Computer and Communication Systems / v.7 no.10 / pp.251-258 / 2018
  • Ransomware, malicious software that encrypts files and demands a ransom, is becoming more threatening as it propagates more rapidly and grows more sophisticated. Rapid detection and risk analysis are required, but real-time analysis and reporting are lacking. In this paper, we propose a ransomware infection detection system that uses social big data mining to enable real-time analysis. The system analyzes the Twitter stream in real time and crawls tweets containing ransomware-related keywords. It also extracts ransomware-related keywords by crawling news servers through a news feed parser, and extracts news or statistical data from the servers of security companies or search engines. The collected data are analyzed by data mining algorithms. By comparing the number of related tweets, Google Trends statistics, and articles related to the WannaCry and Locky ransomware infections that spread in 2017, we show that the system has the potential to detect ransomware infection using tweets. Moreover, the performance of the proposed system is shown through entropy and chi-square analysis.

An Adaptive Grid-based Clustering Algorithm over Multi-dimensional Data Streams (적응적 격자기반 다차원 데이터 스트림 클러스터링 방법)

  • Park, Nam-Hun;Lee, Won-Suk
    • The KIPS Transactions:PartD / v.14D no.7 / pp.733-742 / 2007
  • A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. For this reason, memory usage for data stream analysis must remain bounded even though new data elements are continuously generated. To satisfy this requirement, data stream processing sacrifices the correctness of its analysis result by allowing some error. Old distribution statistics are diminished by a predefined decay rate as time goes by, so that the effect of obsolete information on the current clustering result can be eliminated without physically maintaining any data element. This paper proposes a grid-based clustering algorithm for a data stream. Given a set of initial grid cells, the dense range of a grid cell is recursively partitioned into smaller cells in a top-down manner, based on the distribution statistics of data elements, until the smallest cell, called a unit cell, is identified. Since only the distribution statistics of data elements are maintained by the dynamically partitioned grid cells, the clusters of a data stream can be found effectively without physically maintaining the data elements. Furthermore, the memory usage of the proposed algorithm adapts to the size of the confined memory space by flexibly resizing the unit cell. As a result, the confined memory space can be fully utilized to generate the clustering result as accurately as possible. The proposed algorithm is analyzed through a series of experiments to identify its various characteristics.
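
The decay mechanism described here, in which old statistics fade without storing individual elements, can be sketched in Python as follows. This covers only the decayed count of a single grid cell; the recursive partitioning and adaptive unit-cell sizing are not shown, and the class name `DecayedCell`, the integer time stamps, and the exact update rule are assumptions for illustration rather than the paper's definitions.

```python
class DecayedCell:
    """Decayed element count for one grid cell (illustrative sketch only)."""

    def __init__(self, decay_rate):
        self.decay_rate = decay_rate  # fraction of the count kept per time step
        self.count = 0.0              # decayed count as of last_update
        self.last_update = 0          # time of the most recent element

    def add(self, now):
        """Record one element arriving at time `now`."""
        elapsed = now - self.last_update
        self.count = self.count * self.decay_rate ** elapsed + 1.0
        self.last_update = now

    def value(self, now):
        """Return the decayed count as seen at time `now`, without updating."""
        return self.count * self.decay_rate ** (now - self.last_update)


cell = DecayedCell(decay_rate=0.5)
cell.add(now=1)
cell.add(now=2)
current = cell.value(now=4)
```

Because each cell keeps only a count and a time stamp, memory grows with the number of cells rather than with the number of elements seen, which is the property the abstract relies on.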

Developing a semi-automatic data conversion tool for Korean ecological data standardization

  • Lee, Hyeonjeong;Jung, Hoseok;Shin, Miyoung;Kwon, Ohseok
    • Journal of Ecology and Environment / v.41 no.3 / pp.78-84 / 2017
  • Demand for monitoring and studying long-term ecological change has recently been rising around the globe. In line with this trend, many researchers in South Korea have attempted to share and integrate ecological data for practical use. Although some progress has been made, a major obstacle remains: existing ecological data in South Korea are mostly scattered across the country in various computer file formats. In this study, we address this situation by developing a semi-automatic data conversion tool for Korean ecological data standardization, based on predefined protocols for ecological data collection and management. The current implementation of the tool works on only five taxa (Libythea celtis, spittle bugs, mosquitoes, Pinus, and Quercus mongolica), helping data managers quickly and efficiently obtain ecological data in a standardized format from raw collection data. With this tool, data conversion is divided into four steps: data file and protocol selection, species selection, attribute mapping, and data standardization. To assess the usability of the tool, we used it to standardize raw data for the five taxa collected from six observatory sites in Korean National Parks. As a result, we obtained standardized data in a common form in a relatively short time. With the help of this tool, various ecological data could be easily integrated into a nationwide common platform, providing broad applicability to many issues in ecological and environmental systems.