Proceedings of the Korea Intelligent Information System Society Conference (한국지능정보시스템학회:학술대회논문집)
- 2004.11a
- Pages 221-229
- 2004
Mining Clusters of Sequence Data using Sequence Element-based Similarity Measure
시퀀스 요소 기반의 유사도를 이용한 시퀀스 데이터 클러스터링
Abstract
Recent years have seen enormous growth in commercial and scientific data such as protein sequences, retail transactions, and web logs. Such datasets consist of sequence data, in which the order of elements is inherent. However, few existing clustering algorithms take this sequential order into account. This study presents a method for clustering such sequence datasets. Because the similarity between sequences must be defined before they can be clustered, we propose a new similarity measure that computes the similarity between two sequences based on sequence elements. We then present two clustering algorithms built on this measure: a hierarchical clustering algorithm, and a scalable clustering algorithm that uses sampling and a k-nearest-neighbor method. Experiments on a splice dataset and synthetic datasets show that the proposed algorithms produce higher-quality clusters than traditional clustering algorithms.
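The abstract does not define the similarity measure itself, so the following is only an illustrative sketch of the general pipeline it describes: compute pairwise similarities between sequences, then cluster hierarchically. It assumes, purely for illustration, that a "sequence element" is a contiguous length-n subsequence and that similarity is the Jaccard overlap of the two element sets. The names `elements`, `similarity`, and `hierarchical_clusters` are hypothetical, and the average-link merging rule is a stand-in rather than the paper's algorithm.

```python
from itertools import combinations


def elements(seq, n=2):
    """Contiguous length-n subsequences (an ASSUMED notion of 'sequence element')."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def similarity(a, b, n=2):
    """Jaccard overlap of the element sets, in [0, 1] (a stand-in measure)."""
    ea, eb = elements(a, n), elements(b, n)
    if not ea and not eb:
        return 1.0  # two sequences too short to have elements count as identical
    return len(ea & eb) / len(ea | eb)


def hierarchical_clusters(seqs, k, n=2):
    """Average-link agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(seqs))]
    # Precompute similarity for every unordered pair of sequence indices.
    sim = {(i, j): similarity(seqs[i], seqs[j], n)
           for i, j in combinations(range(len(seqs)), 2)}

    def avg_sim(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(clusters) > k:
        # Pick the pair of clusters with the highest average similarity.
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations guarantees x < y
    return [sorted(seqs[i] for i in c) for c in clusters]


if __name__ == "__main__":
    toy = ["ACGTAC", "ACGTAA", "TTGGCC", "TTGGCA"]
    print(hierarchical_clusters(toy, k=2))
```

On the toy input, the two sequences beginning with `ACGT` share most of their length-2 elements and none with the `TTGG` sequences, so they end up in separate clusters. The exhaustive pairwise computation is quadratic in the number of sequences, which is the kind of cost that a sampling-based scalable variant, as mentioned in the abstract, would aim to avoid.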