Classification for Imbalanced Breast Cancer Dataset Using Resampling Methods

  • Received : 2023.01.05
  • Published : 2023.01.30

Abstract

Analyzing breast cancer patient files is becoming an active area of medical information analysis, especially as the number of patient files grows. In this paper, breast cancer data collected from Khartoum state hospital is classified into two classes: recurrence and no recurrence. The data is imbalanced, meaning that one of the two classes has more samples than the other. To classify this imbalanced data, several pre-processing techniques are applied (resampling, attribute selection, and handling of missing values), and then different classifier models are built. In the first experiment, five classifiers (including ANN, REP Tree, SVM, and J48) are used; in the second experiment, meta-learning algorithms (Bagging, Boosting, and Random Subspace) are used. Finally, an ensemble model is used. The best result was obtained from the ensemble model (Boosting with J48), which achieved the highest accuracy among all the algorithms (95.2797%), followed by Bagging with J48 (90.559%) and Random Subspace with J48 (84.2657%).
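
The abstract names resampling as a pre-processing step but does not say which variant was used. As an illustration only, the following minimal sketch shows random oversampling, one common resampling approach, in plain Python. The function name `random_oversample`, the toy records, and the class counts are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many samples as the largest class.

    Assumption: this is plain random oversampling with replacement; the
    paper may have used a different resampling method.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (target - n) extra rows with replacement; none for the largest class.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data: 6 "no-recurrence" records and 2 "recurrence" records.
rows = [[i] for i in range(8)]
labels = ["no-recurrence"] * 6 + ["recurrence"] * 2

bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)
print(balanced_counts)
```

In practice, resampling of this kind is usually applied only to the training portion of the data, so that duplicated records do not also appear in the evaluation set.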
