A Clustering Approach for Feature Selection in Microarray Data Classification Using Random Forest

  • Received : 2018.04.18
  • Accepted : 2018.07.14
  • Published : 2018.10.31

Abstract

Microarray data play an essential role in diagnosing and detecting cancer. Microarray analysis allows the examination of gene expression levels in specific cell samples, where thousands of genes can be analyzed simultaneously. However, microarray datasets contain very few samples and have very high dimensionality. Therefore, a dimensionality reduction process is required before microarray data can be classified. Dimensionality reduction can eliminate redundancy in the data, so that only features highly correlated with their class are used in classification. There are two types of dimensionality reduction, namely feature selection and feature extraction. In this paper, we use the k-means algorithm as a clustering approach for feature selection. The proposed approach groups features that have similar characteristics into one cluster, so that redundancy in the microarray data is removed. The features in each cluster are then ranked using the Relief algorithm, and the best-scoring feature of each cluster is obtained. The best features of all clusters are selected and used as input to the classification process, which is performed with the Random Forest algorithm. In our simulations, the proposed approach achieved accuracies of 85.87%, 98.9%, and 89% on the Colon, Lung Cancer, and Prostate Tumor datasets, respectively. These accuracies are higher than those obtained using Random Forest without clustering.
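The pipeline described in the abstract (cluster the features with k-means, rank them with Relief, keep the best-scoring feature of each cluster, then classify with Random Forest) can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes scikit-learn and NumPy are available. The helper names `relief_scores` and `select_by_cluster`, the simplified two-class Relief variant (every sample visited once, Manhattan distance), the synthetic data, and the choice of 10 clusters are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def relief_scores(X, y):
    """Simplified two-class Relief (hypothetical helper): a feature gains
    weight when it differs on the nearest miss and loses weight when it
    differs on the nearest hit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    weights = np.zeros(n_features)
    for i in range(n_samples):
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to sample i
        dist[i] = np.inf                       # never pick the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        weights += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return weights / n_samples


def select_by_cluster(X, y, n_clusters, seed=0):
    """Cluster the features (columns of X) with k-means, then keep the
    feature with the highest Relief score in each cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X.T)  # each feature is one point to cluster
    scores = relief_scores(X, y)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size:
            selected.append(members[np.argmax(scores[members])])
    return np.array(sorted(selected))


# Synthetic stand-in for a microarray matrix: few samples, many features.
rng = np.random.default_rng(0)
n_samples, n_features = 60, 200
y = np.arange(n_samples) % 2                 # two balanced classes
X = rng.normal(size=(n_samples, n_features))
X[:, :5] += 2.0 * y[:, None]                 # make the first five features informative

selected = select_by_cluster(X, y, n_clusters=10)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(clf, X[:, selected], y, cv=5).mean()
print("selected features:", selected)
print("cross-validated accuracy:", round(acc, 3))
```

For brevity, the sketch selects features on the full dataset before cross-validation, which can inflate the accuracy estimate. A careful evaluation would repeat the clustering and ranking inside each training fold.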
