Clustering Algorithm for Data Mining using Posterior Probability-based Information Entropy

  • Park, In-Kyoo (박인규) (Dept. of Computer.Game, College of Engineering, Joongbu University)
  • Received : 2014.09.06
  • Accepted : 2014.11.20
  • Published : 2014.12.28

Abstract

In this paper, we propose a new measure based on the confidence of the Bayesian posterior probability in order to reduce unimportant information in the clustering process. Because clustering performance depends on how the importance of the attributes in a database is assessed, the concept of information entropy is applied to the posterior probability to improve attribute discernibility. As a result, the redundancy of identical attribute values in the confidence of the proposed measure is considerably reduced by the natural logarithm. The posterior probability-based clustering algorithm therefore selects a minimal attribute reduct and improves the efficiency of clustering. A validation analysis comparing the proposed algorithm with existing ones on the ACME categorical data demonstrates its discernibility and its ability to handle uncertainty in clustering.

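This paper proposes a new measure that uses the confidence of the Bayesian posterior probability to reduce unnecessary information in the clustering process required for data mining. Because the importance of attributes for data reduction dominates the clustering result, information entropy is applied to the confidence of the posterior probability to improve the discernibility of the many attributes. With the proposed posterior probability-based rough entropy measure, the redundancy in attribute confidence is considerably reduced by the natural logarithm of the entropy. The clustering algorithm built on the proposed measure thus improves the discernibility of attribute values and minimizes the existing reducts, which in turn improves partitioning efficiency. To validate the proposed algorithm, it was compared with existing algorithms on the ACME data used in pattern classification problems, in terms of the discernibility among attributes and the purity of the resulting partitions.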
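
The abstract does not give the formula of the proposed measure, so the following is only a minimal sketch of the general idea of scoring categorical attributes by the entropy of a posterior distribution. It assumes a plain conditional entropy with the natural logarithm, H(b | a), averaged over all other attributes b, and treats the attribute with the lowest score as the most discerning one. The function names `conditional_entropy` and `attribute_scores` and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(rows, given, target):
    """Entropy (natural log) of P(target | given), weighted by P(given).

    Assumption: this stands in for a posterior-probability entropy;
    it is not the measure defined in the paper.
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[given]][row[target]] += 1

    n = len(rows)
    total = 0.0
    for counts in groups.values():
        size = sum(counts.values())
        # Entropy of the posterior distribution inside this group.
        h = -sum((c / size) * math.log(c / size) for c in counts.values())
        total += (size / n) * h
    return total


def attribute_scores(rows):
    """Mean conditional entropy of every other attribute given each attribute."""
    attrs = list(rows[0].keys())
    return {
        a: sum(conditional_entropy(rows, a, b) for b in attrs if b != a)
        / (len(attrs) - 1)
        for a in attrs
    }


# Hypothetical categorical records, for illustration only.
rows = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "large"},
    {"color": "blue", "shape": "square", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
]

scores = attribute_scores(rows)
best = min(scores, key=scores.get)  # lowest entropy = most discerning
print(scores, best)
```

In this toy data, `color` and `shape` determine each other exactly, so they tie with a lower score than `size`, which says nothing about the other two attributes. A clustering step in this spirit would split on one of the low-entropy attributes first.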

