Double monothetic clustering for histogram-valued data

  • Kim, Jaejik (Department of Statistics, Sungkyunkwan University)
  • Billard, L. (Department of Statistics, University of Georgia)
  • Received : 2018.01.05
  • Accepted : 2018.04.26
  • Published : 2018.05.31

Abstract

A common task in the analysis of large datasets is to detect and construct homogeneous groups of objects, typically by some form of clustering. In this study, we present a divisive hierarchical clustering method based on two monothetic characteristics of histogram-valued data. Unlike a classical data point, a histogram carries its own internal variation as well as location information. However, when searching for the optimal bipartition, existing divisive monothetic clustering methods for histogram data use only location as the monothetic characteristic, so they cannot distinguish histograms that share the same location but differ in internal variation. We therefore propose a divisive clustering method that considers both the location and the internal variation of histograms. Because each split is expressed as a binary question, the clustering outcomes are easy to interpret. The proposed method is verified through a simulation study and applied to a large U.S. house property value dataset.
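The idea of a single "double monothetic" split can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not state the dissimilarity measure or the splitting criterion, so the sketch assumes that location is summarized by the histogram mean, that internal variation is summarized by the histogram standard deviation (with values uniform within each bin), and that the best bipartition minimizes the within-cluster sum of squares of these two summaries. The function names and the (low, high, probability) bin format are hypothetical.

```python
from math import sqrt

def hist_mean_sd(hist):
    """Mean and standard deviation of a histogram given as (low, high, prob) bins.

    Assumption: values are uniformly distributed within each bin.
    """
    mean = sum(p * (a + b) / 2 for a, b, p in hist)
    second_moment = sum(p * (a * a + a * b + b * b) / 3 for a, b, p in hist)
    return mean, sqrt(max(second_moment - mean * mean, 0.0))

def within_ss(points):
    """Sum of squared Euclidean deviations of the points from their centroid."""
    if not points:
        return 0.0
    dims = range(len(points[0]))
    centroid = [sum(pt[d] for pt in points) / len(points) for d in dims]
    return sum((pt[d] - centroid[d]) ** 2 for pt in points for d in dims)

def best_bipartition(histograms):
    """Try every cut on the mean and on the sd; keep the cut with the lowest cost.

    Assumption: cost is the total within-cluster sum of squares of (mean, sd),
    a stand-in for whatever criterion the paper actually uses.
    """
    feats = [hist_mean_sd(h) for h in histograms]
    best = None
    for d, name in enumerate(("mean", "sd")):
        order = sorted(range(len(feats)), key=lambda i: feats[i][d])
        for k in range(1, len(order)):
            lo, hi = feats[order[k - 1]][d], feats[order[k]][d]
            if lo == hi:
                continue  # no cut point separates identical values
            left, right = order[:k], order[k:]
            cost = (within_ss([feats[i] for i in left])
                    + within_ss([feats[i] for i in right]))
            if best is None or cost < best["cost"]:
                best = {"cost": cost, "feature": name, "cut": (lo + hi) / 2,
                        "left": sorted(left), "right": sorted(right)}
    return best

# Four single-bin histograms with the same location (mean 5) but different spread.
hists = [[(4, 6, 1.0)], [(0, 10, 1.0)], [(3, 7, 1.0)], [(1, 9, 1.0)]]
split = best_bipartition(hists)
print(f"Is {split['feature']} <= {split['cut']:.3f}?", split["left"], split["right"])
```

In this toy example all four histograms have the same mean, so a location-only split cannot separate them, while a cut on the standard deviation can. The chosen cut reads directly as a binary question of the form "Is sd <= c?", which is the kind of interpretability the abstract describes.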
