An Efficient Mining Algorithm for Generating Probabilistic Multidimensional Sequential Patterns

  • Chang-Hwan Lee (Department of Information and Communications Engineering, Dongguk University)
  • Published: 2005.02.01

Abstract

Sequential pattern mining is an important data mining problem with broad applications. Current methods, however, detect sequential patterns only within a single attribute and cannot generate patterns that span different attributes. Multidimensional sequential patterns, which incorporate these additional attributes, are richer and considerably more informative to the user than one-dimensional ones. This paper proposes a new method for generating multidimensional sequential patterns using the Hellinger entropy measure. Unlike previous methods, the proposed method automatically calculates the significance of each sequential pattern. Two theorems are proposed to reduce the computational complexity of the method, and the method is tested on several synthesized purchase transaction databases.
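The abstract names a Hellinger-based measure for scoring patterns but does not give its formula. The following is only a minimal sketch of the general idea, under the assumption that a pattern's significance is the standard Hellinger distance between the prior distribution of the next item and its distribution among transactions that match the pattern's condition. The toy data, the attribute layout, and the function names (`hellinger`, `distribution`, `pattern_significance`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions.

    p and q are dicts mapping outcomes to probabilities. The result lies
    in [0, 1]: 0 for identical distributions, 1 for disjoint supports.
    """
    keys = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(total / 2.0)


def distribution(values):
    """Empirical probability distribution of an iterable of outcomes."""
    counts = Counter(values)
    n = sum(counts.values())
    return {value: count / n for value, count in counts.items()}


def pattern_significance(rows, condition):
    """Assumed scoring rule (not the paper's exact measure).

    Compares the distribution of the next item (last field of each row)
    over all rows with its distribution over rows satisfying `condition`.
    """
    prior = distribution(row[-1] for row in rows)
    matched = [row[-1] for row in rows if condition(row)]
    if not matched:
        return 0.0  # no supporting rows, so no evidence of a pattern
    return hellinger(prior, distribution(matched))


# Hypothetical purchase records: (age group, first item, next item).
transactions = [
    ("young", "phone", "case"),
    ("young", "phone", "case"),
    ("young", "laptop", "mouse"),
    ("senior", "phone", "charger"),
    ("senior", "laptop", "mouse"),
    ("senior", "laptop", "mouse"),
]

# A multidimensional condition: it combines a customer attribute
# with the preceding purchase.
score = pattern_significance(
    transactions, lambda r: r[0] == "young" and r[1] == "phone"
)
print(f"significance of (young, phone) -> next item: {score:.3f}")
```

A score near 0 means the condition tells us little about the next purchase, while a larger score marks a condition that shifts the distribution noticeably. Ranking candidate patterns by such a score is one plausible way to realize the automatic significance calculation the abstract describes.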
