• Title/Abstract/Keywords: Imbalanced dataset handling techniques

Classification for Imbalanced Breast Cancer Dataset Using Resampling Methods

  • Hana Babiker, Nassar
    • International Journal of Computer Science & Network Security / Vol. 23, No. 1 / pp. 89-95 / 2023
  • Analyzing breast cancer patient files is becoming an exciting area of medical information analysis, especially as the number of patient files grows. In this paper, breast cancer data is collected from Khartoum state hospital, and the dataset is classified into recurrence and no recurrence. The data is imbalanced, meaning that one of the two classes has more samples than the other. Several pre-processing techniques are applied before classifying this imbalanced data (resampling, attribute selection, and handling of missing values), and then different classifier models are built. In the first experiment, five classifiers (ANN, REP TREE, SVM, and J48) are used, and in the second experiment, meta-learning algorithms (Bagging, Boosting, and Random Subspace) are used. Finally, the ensemble model is used. The best result was obtained from the ensemble model (Boosting with J48), with the highest accuracy of 95.2797% among all the algorithms, followed by Bagging with J48 (90.559%) and Random Subspace with J48 (84.2657%). The imbalanced breast cancer dataset was classified into recurrence and no recurrence with different classification algorithms, and the best result was obtained from the ensemble model.
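
The abstract names resampling as a pre-processing step but does not say which variant was used. As a minimal sketch, assuming plain random oversampling of the minority class, the idea could look like the following. The function name `random_oversample` and the toy rows are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class.

    Assumption: this is one common resampling variant; the paper's
    abstract does not state which resampling method it applied.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (hypothetical): four "no-recurrence" rows, two "recurrence" rows.
rows = [[1], [2], [3], [4], [5], [6]]
labels = ["no-recurrence"] * 4 + ["recurrence"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only; duplicating rows before splitting would leak copies of training rows into the test set and inflate accuracy.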

Securing SCADA Systems: A Comprehensive Machine Learning Approach for Detecting Reconnaissance Attacks

  • Ezaz Aldahasi;Talal Alkharobi
    • International Journal of Computer Science & Network Security / Vol. 23, No. 12 / pp. 1-12 / 2023
  • Ensuring the security of Supervisory Control and Data Acquisition (SCADA) and Industrial Control Systems (ICS) is paramount to safeguarding the reliability and safety of critical infrastructure. This paper addresses the significant threat posed by reconnaissance attacks on SCADA/ICS networks and presents a methodology for enhancing their protection. The proposed approach employs imbalanced-dataset handling techniques, ensemble methods, and feature engineering to enhance the resilience of SCADA/ICS systems. Experimentation and analysis demonstrate the efficacy of our strategy, as evidenced by model performance characterized by good precision, good recall, and a low false negative (FN) count. The practical utility of our approach is underscored through evaluation on real-world SCADA/ICS datasets, showing superior performance compared to existing methods in a comparative analysis. Moreover, the integration of feature augmentation is shown to significantly enhance detection capabilities. This research contributes to advancing the security posture of SCADA/ICS environments, addressing a critical imperative in the face of evolving cyber threats.
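
The abstract reports precision, recall, and false negatives without giving the numbers. As a small sketch of how these quantities relate on an imbalanced detection task, the following computes them from true and predicted labels. The function name `precision_recall_fn` and the toy labels are hypothetical and do not come from the paper.

```python
def precision_recall_fn(y_true, y_pred, positive=1):
    """Return (precision, recall, false_negative_count) for one class.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    A ratio with a zero denominator is reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, fn


# Toy labels (hypothetical): 1 = reconnaissance attack, 0 = normal traffic.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, fn = precision_recall_fn(y_true, y_pred)
print(precision, recall, fn)
```

In attack detection a false negative is a missed attack, which is why a low FN count (equivalently, high recall) is usually weighted more heavily than overall accuracy when the attack class is rare.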

A Study on Improving Performance of Software Requirements Classification Models by Handling Imbalanced Data

  • 최종우;이영준;임채균;최호진
    • KIPS Transactions on Software and Data Engineering / Vol. 12, No. 7 / pp. 295-302 / 2023
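  • Software requirements written in natural language can carry different meanings depending on the stakeholder's point of view. When an architecture is designed around quality attributes, efficient design requires selecting a suitable design tactic for each quality attribute, so quality attribute requirements must be classified accurately. Accordingly, many natural language processing models have been studied for requirements classification, which is a costly task, but few studies address improving classification performance by handling the imbalance in quality attribute datasets. In this study, we first show through experiments that a classification model can automatically classify a Korean requirements dataset. Building on this result, we explain that data augmentation with the EDA (Easy Data Augmentation) technique and an undersampling strategy can reduce the imbalance of the quality attribute dataset, and we show that they are effective for classifying requirements into categories. The experiments show an improvement of up to 5.24 percentage points in F1-score, confirming that imbalanced-data handling techniques help classification models classify Korean requirements. In addition, through detailed experiments on EDA, we explain which data augmentation operations help improve classification performance.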
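
EDA is generally described as four token-level operations: synonym replacement, random insertion, random swap, and random deletion. The abstract does not say which of these the authors applied or how they undersampled, so the following is only a sketch of two EDA operations plus random undersampling. All function names and the example sentence are hypothetical, and the example uses English tokens rather than the paper's Korean data.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """EDA random swap: exchange two randomly chosen positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """EDA random deletion: drop each token with probability p,
    but always keep at least one token of a non-empty input."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_undersample(samples, labels, seed=0):
    """Randomly cut every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = min(len(items) for items in by_class.values())
    return [(s, y) for y, items in by_class.items()
            for s in rng.sample(items, target)]


rng = random.Random(0)
tokens = "the system shall respond within two seconds".split()
swapped = random_swap(tokens, 1, rng)
shortened = random_deletion(tokens, 0.2, rng)

# Toy labels (hypothetical): three "performance" items, one "security" item.
samples = ["r1", "r2", "r3", "r4"]
labels = ["performance", "performance", "performance", "security"]
reduced = random_undersample(samples, labels)
print(swapped, shortened, reduced)
```

One plausible way to combine the two ideas is to augment minority categories with EDA and undersample majority categories; whether the paper combines them this way is not stated in the abstract.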