A Scalable Clustering Method for Categorical Sequences

  • 오승준 (Department of Industrial Engineering, Hanyang University)
  • 김재련 (Department of Industrial Engineering, Hanyang University)
  • Published: 2004.04.01


Commercial and scientific data, such as retail transactions, protein sequences, and web logs, have grown enormously in volume. Such datasets are sequence data: the order of their elements is inherent. However, few clustering algorithms take this ordering into account. In this paper, we study how to cluster sequence datasets. We propose a new measure of similarity between two sequences, an efficient method for computing it, and a clustering algorithm based on it. Because hierarchical clustering algorithms are computationally too expensive for large datasets, we also propose a scalable clustering method that uses sampling and a k-nearest-neighbor method. Experiments on a real dataset and a synthetic dataset show that the proposed approach produces higher-quality clusters than traditional algorithms.
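The abstract does not spell out the similarity measure or the clustering steps, so the following is only a rough sketch of the general idea of combining sampling with k-nearest-neighbor assignment. Everything in it is an assumption made for illustration: the similarity is a stand-in (longest-common-subsequence length divided by the longer sequence's length), the sample is clustered with plain average-link agglomerative merging, and the remaining sequences are assigned by a similarity-weighted vote among their k nearest sampled sequences. The function names and the `threshold`, `k`, and `seed` parameters are hypothetical and are not taken from the paper.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT the measure proposed in the paper."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def cluster_sample(seqs, sample_idx, threshold):
    """Average-link agglomerative clustering over the sampled indices.

    Merging stops once no pair of clusters has average similarity >= threshold.
    """
    clusters = [[i] for i in sample_idx]

    def avg_sim(c1, c2):
        total = sum(similarity(seqs[a], seqs[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = avg_sim(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair  # i < j, so popping j leaves index i valid
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters


def scalable_cluster(seqs, sample_size, threshold=0.5, k=3, seed=0):
    """Cluster a random sample, then label the rest by weighted k-NN vote."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(seqs)), sample_size)
    clusters = cluster_sample(seqs, sample_idx, threshold)
    labels = {i: c for c, members in enumerate(clusters) for i in members}
    labeled = list(labels.items())  # snapshot of (index, label) for the sample
    for i in range(len(seqs)):
        if i in labels:
            continue
        nearest = sorted(
            labeled,
            key=lambda p: similarity(seqs[i], seqs[p[0]]),
            reverse=True,
        )[:k]
        votes = {}
        for j, lab in nearest:
            votes[lab] = votes.get(lab, 0.0) + similarity(seqs[i], seqs[j])
        labels[i] = max(votes, key=votes.get)
    return [labels[i] for i in range(len(seqs))]


if __name__ == "__main__":
    data = ["abcde", "abcdf", "abgde", "vwxyz", "vwxyq", "vwpyz"]
    print(scalable_cluster(data, sample_size=4))
```

In this sketch only the sample pays the quadratic cost of hierarchical merging; every other sequence needs one similarity computation per sampled sequence. Quality therefore depends on the sample covering every cluster, which is a property of this illustration and not a claim about the paper's results.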


