GIR-based canonical forest: An ensemble method for imbalanced big data (Korean title: An undersampling canonical forest based on the generalized imbalance ratio (GIR) for improving classification performance on imbalanced data (GC-Forest))

  • Solji Han;Jaesung Myung;Hyunjoong Kim
    • The Korean Journal of Applied Statistics / v.37 no.5 / pp.615-629 / 2024
  • In big data mining, the imbalanced classification problem has been actively researched for decades. Although imbalanced data issues take various forms, past research focused mainly on sample size imbalance between classes. Recent studies, however, have shown that classification performance degrades far more severely when sample size imbalance is combined with class overlap than when it occurs alone. In response, this study introduces GC-Forest (GIR-based canonical forest), an ensemble classification method that uses a weighted resampling technique accounting for the degree of overlap between classes. At each stage of the ensemble, the method measures the imbalance ratio in terms of class overlap and balances the classes by increasing the representativeness of the minority class. To improve overall classification performance, GC-Forest adopts the canonical forest as its ensemble classifier, which is designed to enhance both the accuracy and the diversity of the individual classifiers. The proposed method was evaluated in experiments on 14 real imbalanced datasets, where it showed very competitive classification performance in terms of AUC, PR-AUC, G-mean, and F1-score compared with 7 other ensemble methods.
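
As an illustration of two of the evaluation metrics named in the abstract, the following minimal Python sketch computes G-mean and F1-score from binary labels using their standard definitions. It is not taken from the paper: the function names, the `positive` parameter, and the toy labels are assumptions made for this example, and the minority class is assumed to be the positive class.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class (assumption)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity (majority recall)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy imbalanced sample (hypothetical): 2 minority cases among 10 observations.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(round(g_mean(y_true, y_pred), 4))
print(f1_score(y_true, y_pred))
```

In this toy sample 8 of 10 predictions are correct, yet only one of the two minority cases is detected. Both metrics reflect that gap, which is why measures of this kind are commonly preferred over plain accuracy on imbalanced data.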