Re-SSS: Rebalancing Imbalanced Data Using Safe Sample Screening

Shi, Hongbo; Chen, Xin; Guo, Min

doi:10.3745/JIPS.01.0065

Journal of Information Processing Systems

Volume 17, Issue 1 / Pages 89-106 / 2021 / 1976-913X (pISSN) / 2092-805X (eISSN)

Korea Information Processing Society

Shi, Hongbo (School of Information, Shanxi University of Finance and Economics) ;
Chen, Xin (School of Information, Shanxi University of Finance and Economics) ;
Guo, Min (School of Information, Shanxi University of Finance and Economics)

Received: 2019.12.26
Reviewed: 2020.06.07
Published: 2021.02.28

Abstract

Different samples can affect the learning of support vector machine (SVM) classifiers differently. To rebalance an imbalanced dataset, it is therefore reasonable to remove non-informative samples and add informative ones. Safe sample screening can identify some of the non-informative samples while retaining the informative ones. This study develops a resampling algorithm, Rebalancing imbalanced data using Safe Sample Screening (Re-SSS), which consists of two parts: selecting Informative Samples (Re-SSS-IS) and rebalancing via a Weighted SMOTE (Re-SSS-WSMOTE). Re-SSS-IS selects informative samples from the majority class and determines a suitable regularization parameter for the SVM, while Re-SSS-WSMOTE generates informative minority samples. Both Re-SSS-IS and Re-SSS-WSMOTE are based on safe sample screening. Experimental results show that Re-SSS can effectively improve classification performance on imbalanced classification problems.
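
The abstract does not give the details of Re-SSS-WSMOTE, so the following is only a minimal sketch of the general idea of a weighted SMOTE, not the authors' algorithm. It assumes each minority sample already has a non-negative weight (a hypothetical informativeness score; the paper derives its own from safe sample screening), draws seed samples with probability proportional to that weight, and interpolates each seed toward one of its k nearest minority neighbours as in standard SMOTE. The function name, parameters, and example data are all illustrative.

```python
import math
import random


def weighted_smote(minority, weights, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples (illustrative sketch).

    minority: list of feature vectors (lists of floats), at least two.
    weights:  hypothetical informativeness score per minority sample;
              a sample with weight 0 is never chosen as a seed.
    """
    rng = random.Random(seed)
    n = len(minority)
    k = min(k, n - 1)

    # k nearest minority neighbours of every sample, excluding itself.
    neighbours = []
    for i, x in enumerate(minority):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours.append(others[:k])

    synthetic = []
    # Seeds are drawn with probability proportional to their weight.
    for i in rng.choices(range(n), weights=weights, k=n_new):
        j = rng.choice(neighbours[i])
        gap = rng.random()  # position along the segment seed -> neighbour
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic


# Toy data: the last point is far from the rest and is given weight 0,
# so it is never used as a seed.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
weights = [1.0, 1.0, 1.0, 0.0]
new_points = weighted_smote(minority, weights, n_new=6, k=2)
```

In this toy example the zero-weight point is also not among the two nearest neighbours of any seed, so every synthetic point stays inside the unit square spanned by the three weighted samples. How the weights are computed is the part the abstract attributes to safe sample screening and is deliberately left out here.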

Funding Information

This work was supported by the National Natural Science Foundation of China (No. 61801279), the Key Research and Development Project of Shanxi Province (No. 201903D121160), and the Natural Science Foundation of Shanxi Province (No. 201801D121115 and 201901D111318).

