• Title/Summary/Keyword: Imbalanced dataset

Search Result 54, Processing Time 0.022 seconds

Similarity Measurement Between Titles and Abstracts Using Bijection Mapping and Phi-Correlation Coefficient

  • John N. Mlyahilu;Jong-Nam Kim
    • Journal of the Institute of Convergence Signal Processing
    • /
    • v.23 no.3
    • /
    • pp.143-149
    • /
    • 2022
  • This excerpt delineates a quantitative measure of relationship between a research title and its respective abstract extracted from different journal articles documented through a Korean Citation Index (KCI) database published through various journals. In this paper, we propose a machine learning-based similarity metric that does not assume normality on dataset, realizes the imbalanced dataset problem, and zero-variance problem that affects most of the rule-based algorithms. The advantage of using this algorithm is that, it eliminates the limitations experienced by Pearson correlation coefficient (r) and additionally, it solves imbalanced dataset problem. A total of 107 journal articles collected from the database were used to develop a corpus with authors, year of publication, title, and an abstract per each. Based on the experimental results, the proposed algorithm achieved high correlation coefficient values compared to others which are cosine similarity, euclidean, and pearson correlation coefficients by scoring a maximum correlation of 1, whereas others had obtained non-a-number value to some experiments. With these results, we found that an effective title must have high correlation coefficient with the respective abstract.

Boosting the Performance of the Predictive Model on the Imbalanced Dataset Using SVM Based Bagging and Out-of-Distribution Detection (SVM 기반 Bagging과 OoD 탐색을 활용한 제조공정의 불균형 Dataset에 대한 예측모델의 성능향상)

  • Kim, Jong Hoon;Oh, Hayoung
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.11
    • /
    • pp.455-464
    • /
    • 2022
  • There are two unique characteristics of the datasets from a manufacturing process. They are the severe class imbalance and lots of Out-of-Distribution samples. Some good strategies such as the oversampling over the minority class, and the down-sampling over the majority class, are well known to handle the class imbalance. In addition, SMOTE has been chosen to address the issue recently. But, Out-of-Distribution samples have been studied just with neural networks. It seems to be hardly shown that Out-of-Distribution detection is applied to the predictive model using conventional machine learning algorithms such as SVM, Random Forest and KNN. It is known that conventional machine learning algorithms are much better than neural networks in prediction performance, because neural networks are vulnerable to over-fitting and requires much bigger dataset than conventional machine learning algorithms does. So, we suggests a new approach to utilize Out-of-Distribution detection based on SVM algorithm. In addition to that, bagging technique will be adopted to improve the precision of the model.

Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets

  • Mehmet F. Karaca
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.3
    • /
    • pp.591-609
    • /
    • 2024
  • In this study, preprocessings with all combinations were examined in terms of the effects on decreasing word number, shortening the duration of the process and the classification success in balanced and imbalanced datasets which were unbalanced in different ratios. The decreases in the word number and the processing time provided by preprocessings were interrelated. It was seen that more successful classifications were made with Turkish datasets and English datasets were affected more from the situation of whether the dataset is balanced or not. It was found out that the incorrect classifications, which are in the classes having few documents in highly imbalanced datasets, were made by assigning to the class close to the related class in terms of topic in Turkish datasets and to the class which have many documents in English datasets. In terms of average scores, the highest classification was obtained in Turkish datasets as follows: with not applying lowercase, applying stemming and removing stop words, and in English datasets as follows: with applying lowercase and stemming, removing stop words. Applying stemming was the most important preprocessing method which increases the success in Turkish datasets, whereas removing stop words in English datasets. The maximum scores revealed that feature selection, feature size and classifier are more effective than preprocessing in classification success. It was concluded that preprocessing is necessary for text classification because it shortens the processing time and can achieve high classification success, a preprocessing method does not have the same effect in all languages, and different preprocessing methods are more successful for different languages.

A Study of a Method for Maintaining Accuracy Uniformity When Using Long-tailed Dataset (불균형 데이터세트 학습에서 정확도 균일화를 위한 학습 방법에 관한 연구)

  • Geun-pyo Park;XinYu Piao;Jong-Kook Kim
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.05a
    • /
    • pp.585-587
    • /
    • 2023
  • Long-tailed datasets have an imbalanced distribution because they consist of a different number of data samples for each class. However, there are problems of the performance degradation in tail-classes and class-accuracy imbalance for all classes. To address these problems, this paper suggests a learning method for training of long-tailed dataset. The proposed method uses and combines two methods; one is a resampling method to generate a uniform mini-batch to prevent the performance degradation in tail-classes, and the other is a reweighting method to address the accuracy imbalance problem. The purpose of our proposed method is to train the learning models to have uniform accuracy for each class in a long-tailed dataset.

A study on intrusion detection performance improvement through imbalanced data processing (불균형 데이터 처리를 통한 침입탐지 성능향상에 관한 연구)

  • Jung, Il Ok;Ji, Jae-Won;Lee, Gyu-Hwan;Kim, Myo-Jeong
    • Convergence Security Journal
    • /
    • v.21 no.3
    • /
    • pp.57-66
    • /
    • 2021
  • As the detection performance using deep learning and machine learning of the intrusion detection field has been verified, the cases of using it are increasing day by day. However, it is difficult to collect the data required for learning, and it is difficult to apply the machine learning performance to reality due to the imbalance of the collected data. Therefore, in this paper, A mixed sampling technique using t-SNE visualization for imbalanced data processing is proposed as a solution to this problem. To do this, separate fields according to characteristics for intrusion detection events, including payload. Extracts TF-IDF-based features for separated fields. After applying the mixed sampling technique based on the extracted features, a data set optimized for intrusion detection with imbalanced data is obtained through data visualization using t-SNE. Nine sampling techniques were applied through the open intrusion detection dataset CSIC2012, and it was verified that the proposed sampling technique improves detection performance through F-score and G-mean evaluation indicators.

A Study on Improving Performance of Software Requirements Classification Models by Handling Imbalanced Data (불균형 데이터 처리를 통한 소프트웨어 요구사항 분류 모델의 성능 개선에 관한 연구)

  • Jong-Woo Choi;Young-Jun Lee;Chae-Gyun Lim;Ho-Jin Choi
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.12 no.7
    • /
    • pp.295-302
    • /
    • 2023
  • Software requirements written in natural language may have different meanings from the stakeholders' viewpoint. When designing an architecture based on quality attributes, it is necessary to accurately classify quality attribute requirements because the efficient design is possible only when appropriate architectural tactics for each quality attribute are selected. As a result, although many natural language processing models have been studied for the classification of requirements, which is a high-cost task, few topics improve classification performance with the imbalanced quality attribute datasets. In this study, we first show that the classification model can automatically classify the Korean requirement dataset through experiments. Based on these results, we explain that data augmentation through EDA(Easy Data Augmentation) techniques and undersampling strategies can improve the imbalance of quality attribute datasets, and show that they are effective in classifying requirements. The results improved by 5.24%p on F1-score, indicating that handling imbalanced data helps classify Korean requirements of classification models. Furthermore, detailed experiments of EDA illustrate operations that help improve classification performance.

Detection of Anomaly Lung Sound using Deep Temporal Feature Extraction (깊은 시계열 특성 추출을 이용한 폐 음성 이상 탐지)

  • Kim-Ngoc T. Le;Gyurin Byun;Hyunseung Choo
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.11a
    • /
    • pp.605-607
    • /
    • 2023
  • Recent research has highlighted the effectiveness of Deep Learning (DL) techniques in automating the detection of lung sound anomalies. However, the available lung sound datasets often suffer from limitations in both size and balance, prompting DL methods to employ data preprocessing such as augmentation and transfer learning techniques. These strategies, while valuable, contribute to the increased complexity of DL models and necessitate substantial training memory. In this study, we proposed a streamlined and lightweight DL method but effectively detects lung sound anomalies from small and imbalanced dataset. The utilization of 1D dilated convolutional neural networks enhances sensitivity to lung sound anomalies by efficiently capturing deep temporal features and small variations. We conducted a comprehensive evaluation of the ICBHI dataset and achieved a notable improvement over state-of-the-art results, increasing the average score of sensitivity and specificity metrics by 2.7%.

Securing SCADA Systems: A Comprehensive Machine Learning Approach for Detecting Reconnaissance Attacks

  • Ezaz Aldahasi;Talal Alkharobi
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.12
    • /
    • pp.1-12
    • /
    • 2023
  • Ensuring the security of Supervisory Control and Data Acquisition (SCADA) and Industrial Control Systems (ICS) is paramount to safeguarding the reliability and safety of critical infrastructure. This paper addresses the significant threat posed by reconnaissance attacks on SCADA/ICS networks and presents an innovative methodology for enhancing their protection. The proposed approach strategically employs imbalance dataset handling techniques, ensemble methods, and feature engineering to enhance the resilience of SCADA/ICS systems. Experimentation and analysis demonstrate the compelling efficacy of our strategy, as evidenced by excellent model performance characterized by good precision, recall, and a commendably low false negative (FN). The practical utility of our approach is underscored through the evaluation of real-world SCADA/ICS datasets, showcasing superior performance compared to existing methods in a comparative analysis. Moreover, the integration of feature augmentation is revealed to significantly enhance detection capabilities. This research contributes to advancing the security posture of SCADA/ICS environments, addressing a critical imperative in the face of evolving cyber threats.

Facial Age Classification and Synthesis using Feature Decomposition (특징 분해를 이용한 얼굴 나이 분류 및 합성)

  • Chanho Kim;In Kyu Park
    • Journal of Broadcast Engineering
    • /
    • v.28 no.2
    • /
    • pp.238-241
    • /
    • 2023
  • Recently deep learning models are widely used for various tasks such as facial recognition and face editing. Their training process often involves a dataset with imbalanced age distribution. It is because some age groups (teenagers and middle age) are more socially active and tends to have more data compared to the less socially active age groups (children and elderly). This imbalanced age distribution may negatively impact the deep learning training process or the model performance when tested against those age groups with less data. To this end, we propose an age-controllable face synthesis technique using a feature decomposition to classify age from facial images which can be utilized to synthesize novel data to balance out the age distribution. We perform extensive qualitative and quantitative evaluation on our proposed technique using the FFHQ dataset and we show that our method has better performance than existing method.

A Study on Visual Emotion Classification using Balanced Data Augmentation (균형 잡힌 데이터 증강 기반 영상 감정 분류에 관한 연구)

  • Jeong, Chi Yoon;Kim, Mooseop
    • Journal of Korea Multimedia Society
    • /
    • v.24 no.7
    • /
    • pp.880-889
    • /
    • 2021
  • In everyday life, recognizing people's emotions from their frames is essential and is a popular research domain in the area of computer vision. Visual emotion has a severe class imbalance in which most of the data are distributed in specific categories. The existing methods do not consider class imbalance and used accuracy as the performance metric, which is not suitable for evaluating the performance of the imbalanced dataset. Therefore, we proposed a method for recognizing visual emotion using balanced data augmentation to address the class imbalance. The proposed method generates a balanced dataset by adopting the random over-sampling and image transformation methods. Also, the proposed method uses the Focal loss as a loss function, which can mitigate the class imbalance by down weighting the well-classified samples. EfficientNet, which is the state-of-the-art method for image classification is used to recognize visual emotion. We compare the performance of the proposed method with that of conventional methods by using a public dataset. The experimental results show that the proposed method increases the F1 score by 40% compared with the method without data augmentation, mitigating class imbalance without loss of classification accuracy.