A Divisive Clustering for Mixed Feature-Type Symbolic Data

  • Kim, Jaejik (Department of Statistics, Sungkyunkwan University)
  • Received : 2015.09.14
  • Reviewed : 2015.11.03
  • Published : 2015.12.31

Abstract

Data analysis today goes beyond classical data expressed as points in p-dimensional Euclidean space to include new types of data such as signals, functions, images, and shapes. Symbolic data can be regarded as one of these new types. Symbolic data can take various formats, such as intervals, histograms, lists, tables, distributions, and models. To date, studies of symbolic data have mainly focused on individual formats. This study extends that work to datasets that mix histogram-valued and multimodal-valued data. It introduces a hierarchical divisive clustering method for such mixed feature-type symbolic data and applies it to the analysis of industrial accident data by industry type.
