Finding Weighted Sequential Patterns over Data Streams via a Gap-based Weighting Approach

Finding Weighted Sequential Patterns in Data Streams Using an Occurrence-Gap-Based Weighting Technique

  • Chang, Joong-Hyuk (Department of Computer and Information Technology, Daegu University)
  • 장중혁 (School of Computer and IT Engineering, Daegu University)
  • Received : 2010.05.13
  • Accepted : 2010.07.21
  • Published : 2010.09.30

Abstract

Sequential pattern mining aims to discover interesting sequential patterns in a sequence database. It is one of the essential data mining tasks and is widely used in application fields such as Web access pattern analysis, customer purchase pattern analysis, and DNA sequence analysis. General sequential pattern mining considers only the generation order of the data elements in a sequence, so it can easily find simple sequential patterns but is limited in finding the more interesting sequential patterns that are useful in real-world applications. One of the essential research topics that compensates for this limitation is weighted sequential pattern mining, which considers not only the generation order of data elements but also their weights to obtain more interesting sequential patterns.

Recently, data has increasingly taken the form of continuous data streams rather than finite stored data sets in various application fields, and the database research community has begun to focus on processing data streams. A data stream is a massive, unbounded sequence of data elements continuously generated at a rapid rate. In data stream processing, each data element should be examined at most once, and the memory used for the analysis should remain bounded even though new data elements are generated continuously. Moreover, newly generated data elements should be processed as fast as possible so that an up-to-date analysis result is available immediately upon request. To satisfy these requirements, data stream processing sacrifices some correctness of its analysis result by allowing a certain amount of error.

Given these changes in the form of data generated in real-world applications, much research has been actively carried out to find the various kinds of knowledge embedded in data streams. This research mainly focuses on the efficient mining of frequent itemsets and sequential patterns over data streams, which have proven useful in conventional data mining for finite data sets. Mining algorithms have also been proposed that efficiently reflect the changes of a data stream over time in their mining results. However, these studies target naively interesting patterns, such as frequent patterns and simple sequential patterns, which are found intuitively, and take no interest in mining novel interesting patterns that better express the characteristics of the target data streams. Defining novel interesting patterns and developing a mining method that finds them is therefore a valuable research topic in data stream mining, and such patterns can be used effectively to analyze recent data streams.

This paper proposes a gap-based weighting approach for sequential patterns and a method for mining weighted sequential patterns over sequence data streams via this approach. A gap-based weight of a sequential pattern can be computed from the gaps between the data elements in the sequential pattern without any pre-defined weight information. That is, the approach uses the gaps between the data elements in each sequential pattern, as well as their generation order, to obtain the weight of the pattern, which helps to find more interesting and useful sequential patterns. Most recent computer application fields generate data in the form of data streams rather than as finite data sets. Considering this change, the proposed method mainly focuses on sequence data streams.
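
The abstract does not state the weighting formula, so the following Python sketch only illustrates the general idea of deriving a weight from gaps alone, with no pre-defined item weights. The names `gap_weight` and `weighted_support`, the `unit` parameter, and the per-gap score 1 / (1 + gap) are all assumptions made for illustration and are not the paper's definitions.

```python
from typing import Sequence


def gap_weight(timestamps: Sequence[float], unit: float = 1.0) -> float:
    """Hypothetical gap-based weight of one occurrence of a pattern.

    timestamps: generation times of the pattern's elements, in pattern order.
    unit: assumed time scale; a gap equal to `unit` scores 0.5.
    """
    if len(timestamps) < 2:
        return 1.0  # a single element has no gap to penalize
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if any(g < 0 for g in gaps):
        raise ValueError("timestamps must be non-decreasing")
    # Each adjacent pair scores 1 / (1 + gap / unit): 1.0 for a zero gap,
    # falling toward 0 as the gap grows. The weight is the mean score.
    return sum(1.0 / (1.0 + g / unit) for g in gaps) / len(gaps)


def weighted_support(
    occurrences: Sequence[Sequence[float]],
    total_sequences: int,
    unit: float = 1.0,
) -> float:
    """Assumed weighted support: sum of occurrence weights over sequence count."""
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    return sum(gap_weight(ts, unit) for ts in occurrences) / total_sequences


# Pattern <a, b> occurs in 2 of 4 sequences: once with gap 1, once with gap 3.
occurrences = [[0, 1], [0, 3]]
print(weighted_support(occurrences, total_sequences=4))
```

Under these assumptions, plain support treats both occurrences alike (2 / 4 = 0.5), whereas the weighted support is (0.5 + 0.25) / 4 = 0.1875, so the occurrence whose elements are generated close together contributes more.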

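General sequential pattern mining considers only the order in which the elements of the target data set occur. It can therefore find simple sequential patterns easily, but it is limited in finding the more interesting sequential patterns that can be widely used in real-world applications. One representative research topic that addresses this shortcoming is weighted sequential pattern mining, which considers the weights of the elements in addition to their simple order of occurrence in order to obtain more interesting sequential patterns. This paper proposes a gap-based weighting technique for sequential patterns and a method that uses it to find weighted sequential patterns over sequence data streams. A gap-based weight requires no separately pre-defined weight information and is derived from the occurrence gaps of the elements that make up the sequence. That is, computing the weight of a sequential pattern takes into account the occurrence gaps of its elements along with their order of occurrence, which helps to obtain more interesting and useful sequential patterns. Meanwhile, most recent computer application fields generate information as data streams rather than as finite data sets. In view of this change in the data generation environment, this paper takes sequence data streams as its mining target.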
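
The stream-processing requirements named in the abstract (one look at each element, bounded memory, and an approximate result with some allowed error) can be illustrated with a well-known technique that is not the paper's method: Lossy Counting by Manku and Motwani, applied here to single items rather than sequential patterns. The class name `LossyCounter` and its method names are hypothetical; this is a minimal sketch under those assumptions.

```python
import math
from typing import Dict, Hashable, List


class LossyCounter:
    """One-pass approximate frequency counter (Lossy Counting).

    Each stored count undercounts the true frequency by at most epsilon * n.
    """

    def __init__(self, epsilon: float) -> None:
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be in (0, 1)")
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # elements per bucket
        self.n = 0  # elements seen so far
        # item -> [count, maximum possible undercount]
        self.entries: Dict[Hashable, List[int]] = {}

    def add(self, item: Hashable) -> None:
        """Process one stream element; it is never revisited."""
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent,
            # which keeps memory bounded.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, min_support: float) -> Dict[Hashable, int]:
        """Items whose stored count reaches (min_support - epsilon) * n."""
        threshold = (min_support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }


counter = LossyCounter(epsilon=0.25)
for element in "abacabad":
    counter.add(element)
print(counter.frequent(min_support=0.5))
```

The trade-off is the one the abstract describes: infrequent items are discarded at bucket boundaries, so memory stays bounded and answers are available at any time, at the cost of counts that may be too low by up to `epsilon * n`.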
