Comparison of the Cluster Validation Methods for High-dimensional (Gene Expression) Data

Korean title (translated): Performance Comparison of Cluster Validity Analysis Techniques for High-dimensional (Gene Expression) Data

  • Published : 2007.03.31

Abstract

Many clustering algorithms and cluster validation techniques have been proposed for high-dimensional gene expression data. However, the validation techniques themselves have seldom been evaluated. In this paper we compare several cluster validity indices on low-dimensional simulated data and real gene expression data. Among the internal measures, Dunn's index was the most effective and robust, followed by the Silhouette index, with the Davies-Bouldin index performing worst. Among the external measures, the Jaccard index was much more effective than the Goodman-Kruskal index and the adjusted Rand index.
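To make the two best-performing measures concrete, here is a minimal sketch of Dunn's index (internal) and the pair-counting Jaccard index (external). It assumes the common textbook formulations (minimum between-cluster point distance over maximum cluster diameter, and agreeing pairs over all pairs grouped together by either partition); the paper may use different variants. The function names and the toy data are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Dunn's index: smallest between-cluster distance / largest cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    groups = list(clusters.values())
    # Largest distance between two points of the same cluster (diameter).
    max_diameter = max(
        dist(p, q) for group in groups for p, q in combinations(group, 2)
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist(p, q)
        for g, h in combinations(groups, 2)
        for p in g
        for q in h
    )
    return min_separation / max_diameter


def jaccard_index(labels_pred, labels_true):
    """Pair-counting Jaccard: pairs grouped together in both partitions,
    divided by pairs grouped together in at least one partition."""
    both = only_pred = only_true = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            both += 1
        elif same_pred:
            only_pred += 1
        elif same_true:
            only_true += 1
    return both / (both + only_pred + only_true)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
truth = [0, 0, 0, 1, 1, 1]
good = [0, 0, 0, 1, 1, 1]  # clustering that recovers the true groups
bad = [0, 0, 1, 1, 1, 0]   # clustering that mixes the two groups

print(dunn_index(points, good), dunn_index(points, bad))
print(jaccard_index(good, truth), jaccard_index(bad, truth))
```

Both indices are larger for better partitions: the clustering that recovers the true groups scores higher on Dunn's index (which needs no reference labels) and on the Jaccard index (which compares against the known labels). This sketch assumes at least two clusters and at least one cluster with two or more points, so neither denominator is zero.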

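Korean abstract (translated): Gene expression data are typical high-dimensional data. Various clustering algorithms have been proposed for analyzing such data, along with cluster validation techniques for verifying the clustering results, but comparisons and evaluations of the performance of these validation techniques are very rare. In this paper we compared the performance of cluster validation techniques on low-dimensional simulated data and real gene expression data. Among the internal measures, Dunn's index performed best, followed by the Silhouette index; among the external measures, the Jaccard index was rated the best performer.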

