• Title/Summary/Keyword: imbalanced binary data

On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

  • Kim, HanYong;Lee, Woojoo
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.681-690 / 2017
  • Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and prediction of defective products. Several sampling methods, such as over-sampling, under-sampling, and SMOTE, have been developed to overcome the poor prediction performance of binary classifiers when one group is dominant. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machines in combination with these sampling methods for imbalanced binary data. Four real data sets are analyzed to see whether there is a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.
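As a rough illustration of the two simplest sampling methods named in this abstract, the sketch below implements random under-sampling and random over-sampling with the Python standard library only. The function names and the 0/1 label convention are assumptions made for this example and are not taken from the paper.

```python
import random

def _split_by_class(y):
    """Return (minority_indices, majority_indices) for 0/1 labels."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def random_undersample(X, y, seed=0):
    """Drop majority-class rows at random until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(y)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[i] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_under, y_under = random_undersample(X, y)  # 2 positives, 2 negatives
X_over, y_over = random_oversample(X, y)     # 8 positives, 8 negatives
```

One general caution, which is not drawn from the abstract itself: resampling is usually applied to the training split only, so that evaluation data keep the original class proportions.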

Classification of Imbalanced Data Based on MTS-CBPSO Method: A Case Study of Financial Distress Prediction

  • Gu, Yuping;Cheng, Longsheng;Chang, Zhipeng
    • Journal of Information Processing Systems / v.15 no.3 / pp.682-693 / 2019
  • Traditional classification methods mostly assume that the class distribution of the data is balanced, whereas imbalanced data are common in the real world. It is therefore important to solve the problem of classification with imbalanced data. In the Mahalanobis-Taguchi system (MTS) algorithm, the classification model is constructed from a reference space and a measurement scale derived from a single normal group, so it is well suited to the imbalanced data problem. In this paper, an improved method, MTS-CBPSO, is constructed by introducing chaotic mapping and the binary particle swarm optimization algorithm in place of the orthogonal array and signal-to-noise ratio (SNR) to select the valid variables, with G-means, F-measure, and dimensionality reduction as the optimization targets. The proposed method is also applied to financial distress prediction for Chinese listed companies. Compared with traditional MTS and common classification methods such as SVM, C4.5, and k-NN, the MTS-CBPSO method is shown to give better prediction accuracy and dimensionality reduction.
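The G-means and F-measure criteria mentioned in this abstract can be computed from a confusion matrix. The following is a minimal sketch using the usual textbook definitions (G-mean as the geometric mean of sensitivity and specificity, F-measure as the weighted harmonic mean of precision and recall); the helper names are hypothetical and the paper may define its targets slightly differently.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for 0/1 labels, treating 1 as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

def f_measure(y_true, y_pred, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
print(g_mean(y_true, always_negative))  # 0.0, although accuracy is 0.8
```

The example shows why accuracy alone is a poor target on imbalanced data: a classifier that never predicts the minority class reaches 80% accuracy here but scores zero on both criteria.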

Experimental Analysis of Equilibrization in Binary Classification for Non-Image Imbalanced Data Using Wasserstein GAN

  • Wang, Zhi-Yong;Kang, Dae-Ki
    • International Journal of Internet, Broadcasting and Communication / v.11 no.4 / pp.37-42 / 2019
  • In this paper, we explore the details of three classic data augmentation methods and two generative-model-based oversampling methods. The three classic data augmentation methods are random sampling (RANDOM), Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive Synthetic Sampling (ADASYN). The two generative-model-based oversampling methods are Conditional Generative Adversarial Network (CGAN) and Wasserstein Generative Adversarial Network (WGAN). In imbalanced data, the instances are divided into a majority class, which accounts for most of the instances in the training set, and a minority class, which includes only a few. Generative models have an advantage when used to generate plausible samples that follow the distribution of the minority class. We also adopt CGAN to compare its data augmentation performance with that of the other methods. The experimental results show that the WGAN-based oversampling technique is more stable than the other approaches (RANDOM, SMOTE, ADASYN, and CGAN), even with very limited training data. However, when the imbalance ratio is too small, generative-model-based approaches cannot match the performance of the conventional data augmentation techniques. These results suggest a direction for future research.
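To make the SMOTE baseline in this comparison concrete, here is a simplified sketch of its core idea: a synthetic minority point is placed at a random position on the line segment between a minority point and one of its nearest minority neighbours. This is an illustrative stdlib-only version with an assumed function name, not the implementation used in the paper, and it omits details of the published algorithm such as per-sample generation quotas.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours.

    minority: list of feature vectors (lists of floats), at least two points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point lies between two existing minority points, this approach cannot produce samples outside the region already covered by the minority class, which is one motivation for the generative models studied in the paper.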

Parameter estimation for the imbalanced credit scoring data using AUC maximization (AUC 최적화를 이용한 낮은 부도율 자료의 모수추정)

  • Hong, C.S.;Won, C.H.
    • The Korean Journal of Applied Statistics / v.29 no.2 / pp.309-319 / 2016
  • For binary classification models, we consider a risk score that is a function of linear scores and estimate the coefficients of the linear scores. There are two estimation methods: one obtains MLEs using logistic models, and the other estimates the coefficients by maximizing AUC. Estimates from the AUC approach are better than MLEs from logistic models in general situations where the logistic assumptions do not hold. This paper considers imbalanced data for credit assessment models, in which the default class contains fewer observations than the non-default class, and applies the AUC approach to such data. Various logit link functions are used to generate imbalanced data. It is found that the coefficients obtained by the AUC approach are equivalent to (or better than) those from logistic models for imbalanced data with a low default probability.
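The quantity being maximized in the AUC approach can be illustrated with the empirical AUC, which equals the fraction of (default, non-default) pairs that the score ranks correctly, with ties counted as one half. The sketch below evaluates this for a linear score with a given coefficient vector; the function names are hypothetical, and the optimization procedure the authors use to search over coefficients is not shown because the abstract does not describe it.

```python
def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auc_of_linear_score(beta, X, labels):
    """Empirical AUC of the linear score beta . x for each row x of X."""
    scores = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return empirical_auc(scores, labels)

scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
print(empirical_auc(scores, labels))  # 5 of 6 pairs are ordered correctly
```

Since the empirical AUC depends only on the ordering of the scores, it is unchanged when all coefficients are multiplied by the same positive constant, so an AUC-based estimator typically needs a normalization constraint on the coefficients.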

Comparison of resampling methods for dealing with imbalanced data in binary classification problem (이분형 자료의 분류문제에서 불균형을 다루기 위한 표본재추출 방법 비교)

  • Park, Geun U;Jung, Inkyung
    • The Korean Journal of Applied Statistics / v.32 no.3 / pp.349-374 / 2019
  • A class imbalance problem arises when one class outnumbers the other by a large proportion in binary data. Approaches such as transforming the training data have been studied to solve this problem. Among the methods for dealing with imbalance in classification problems, this study compares resampling methods, seeking a way to detect the minority class more effectively. Through simulation, a total of 20 over-sampling, under-sampling, and combined over- and under-sampling methods were compared. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, were used as classifiers. The simulation results showed that the random under-sampling (RUS) method had the highest sensitivity while keeping accuracy above 0.5. The next most sensitive method was the adaptive synthetic sampling over-sampling approach. This indicates that the RUS method is suitable for finding minority class values. Results from applying the methods to some real data sets were similar to those of the simulation.

Abnormal signal detection based on parallel autoencoders (병렬 오토인코더 기반의 비정상 신호 탐지)

  • Lee, Kibae;Lee, Chong Hyun
    • The Journal of the Acoustical Society of Korea / v.40 no.4 / pp.337-346 / 2021
  • Because of data imbalance, abnormal signals are generally detected using features of normal signals as the main information. This paper proposes an efficient method for abnormal signal detection using a parallel AutoEncoder (AE) that can also use features of abnormal signals. The proposed Parallel AE (PAE) consists of a normal reconstructor and an abnormal reconstructor with identical AE structures, which learn the features of normal and abnormal signals, respectively. The PAE can effectively address the imbalanced data problem by training on normal and abnormal data sequentially. To further improve detection performance, an additional binary classifier can be added to the PAE. In experiments using public acoustic data, the proposed PAE improves the Area Under Curve (AUC) by at least 22 %, at the expense of a training time 1.31 to 1.61 times that of a single AE. Furthermore, the PAE shows a 93 % AUC improvement in detecting abnormal underwater acoustic signals when a pre-trained PAE is transferred and trained on open underwater acoustic data.

Ensemble Composition Methods for Binary Classification of Imbalanced Data (불균형 데이터의 이진 분류를 위한 앙상블 구성 방법)

  • Yeong-Hun Kim;Ju-Hing Lee
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.689-691 / 2023
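  • We study ensemble composition methods for improving classification performance on imbalanced data. The performance of an ensemble is strongly affected by the mutual diversity among the machine learning models that make it up. Existing methods use feature engineering to create a variety of models in order to increase the mutual diversity of the models in the ensemble. Even so, similar models remain among those generated, which lowers mutual diversity and degrades ensemble performance. For imbalanced data, the existing diversity measures used to identify similar models are unsuitable because they yield values biased toward the majority class. In this paper, we propose a method that improves the existing diversity measures and combines them with a pruning scheme to identify similar models and to include candidate models with high mutual diversity in the ensemble. Experimental results confirm that ensembles composed with the proposed method improve classification performance on severely imbalanced data.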

A study on the improvement ransomware detection performance using combine sampling methods (혼합샘플링 기법을 사용한 랜섬웨어탐지 성능향상에 관한 연구)

  • Kim Soo Chul;Lee Hyung Dong;Byun Kyung Keun;Shin Yong Tae
    • Convergence Security Journal / v.23 no.1 / pp.69-77 / 2023
  • Recently, ransomware damage has been increasing rapidly around the world, with incidents affecting Irish health authorities and U.S. oil pipelines, and is causing harm in all sectors of society. In particular, research on ransomware detection and response that uses machine learning alongside existing detection methods is increasing. However, traditional machine learning has difficulty producing accurate predictions on imbalanced data because the model tends to predict the class for which there is more data. Accordingly, for an imbalanced setting consisting of a large number of non-ransomware samples (normal code or malware) and a small number of ransomware samples, a technique for resolving the imbalance and improving ransomware detection performance is proposed. In this experiment, we use two scenarios (binary and multi-class classification) to confirm that the sampling technique improves detection performance for the minority classes while maintaining detection performance for the majority classes. In particular, the proposed mixed sampling technique (SMOTE+ENN) improved performance (G-mean, F1-score) by more than 10%.
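The cleaning half of the SMOTE+ENN combination can be sketched as follows. Edited Nearest Neighbours (ENN) removes every sample whose label disagrees with the majority of its k nearest neighbours; in a mixed sampling pipeline it would typically be run after an over-sampling step such as SMOTE. This is a simplified stdlib-only illustration with an assumed function name, not the authors' code, and the choice of k = 3 is an assumption.

```python
def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority of their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Squared Euclidean distance from sample i to every other sample.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X)
            if j != i
        )
        neighbour_labels = [y[j] for _, j in dists[:k]]
        agree = sum(1 for label in neighbour_labels if label == yi)
        if agree * 2 > len(neighbour_labels):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# The last point is labelled 1 but sits inside the cluster of class 0.
X = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [0.15]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
X_clean, y_clean = edited_nearest_neighbours(X, y)
```

In this toy data set only the stray point at 0.15 is removed, which illustrates how the cleaning step can discard samples that fall deep inside the other class's region.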

Predicting defects of EBM-based additive manufacturing through XGBoost (XGBoost를 활용한 EBM 3D 프린터의 결함 예측)

  • Jeong, Jahoon
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.5 / pp.641-648 / 2022
  • This paper uses data analysis to identify the factors that affect defects occurring during Electron Beam Melting (EBM), one of the 3D printer output methods. Drawing on factors identified as major causes of defects in previous studies, log files generated between processes were analyzed and related variables were extracted. In addition, because the data are time series data, the concept of a window was introduced to compose variables that include data from all three layers. The dependent variable is binary (presence or absence of defects), and because the proportion of defect layers is low (about 4%), balanced training data were created through the SMOTE technique. For the analysis, I use XGBoost with GridSearchCV and evaluate classification performance based on the confusion matrix. I conclude the study by analyzing the importance of variables through SHAP values.
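The windowing idea in this abstract can be illustrated with a small sketch that concatenates the per-layer variables of three consecutive layers into one feature row. The abstract does not say which three layers form a window, so the choice below (the current layer plus the two preceding ones) and the function name are assumptions for illustration only.

```python
def window_features(layers, width=3):
    """Concatenate the feature vectors of `width` consecutive layers into one row.

    layers: list of per-layer feature vectors, ordered by build sequence.
    Returns one row per window, ending at layers width-1, width, and so on.
    """
    rows = []
    for end in range(width - 1, len(layers)):
        window = layers[end - width + 1 : end + 1]
        rows.append([value for layer in window for value in layer])
    return rows

# Four layers with two hypothetical log-derived variables each.
layers = [[1, 10], [2, 20], [3, 30], [4, 40]]
rows = window_features(layers)
# rows == [[1, 10, 2, 20, 3, 30], [2, 20, 3, 30, 4, 40]]
```

Under this convention the first two layers have no complete window and produce no row, so the defect label for each row would be that of the window's last layer.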

Conditional Generative Adversarial Network based Collaborative Filtering Recommendation System (Conditional Generative Adversarial Network(CGAN) 기반 협업 필터링 추천 시스템)

  • Kang, Soyi;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.157-173 / 2021
  • With the development of information technology, the amount of available information increases daily. However, having access to so much information makes it difficult for users to easily find the information they seek. Users want a visualized system that reduces information retrieval and learning time, saving them from personally reading and judging all available information. As a result, recommendation systems are an increasingly important technology that is essential to business. Collaborative filtering is used in various fields with excellent performance because recommendations are made based on similar user interests and preferences. However, limitations do exist. Sparsity, which occurs when user-item preference information is insufficient, is the main limitation of collaborative filtering. Values in the user-item matrix may be distorted by product popularity, or there may be new users who have not yet rated any items. The lack of historical data to identify consumer preferences is referred to as data sparsity, and various methods have been studied to address it. However, most attempts to solve the sparsity problem are not optimal because they can only be applied when additional data such as users' personal information, social networks, or characteristics of items are available. Another problem is that real-world rating data are mostly biased toward high scores, resulting in severe imbalance. One cause of this imbalanced distribution is purchasing bias: users who rate a product highly purchase it, whereas those who rate it poorly are less likely to purchase it and thus do not leave negative reviews. Because of this, reviews by users who purchase products are more likely to be positive than most users' actual preferences would suggest. Therefore, the actual rating data are over-learned in the high-incidence classes because of this bias, distorting the market. Applying collaborative filtering to such imbalanced data leads to poor recommendation performance due to excessive learning of the biased classes. Traditional oversampling techniques for this problem are likely to cause overfitting because they repeat the same data, which acts as noise in learning and reduces recommendation performance. In addition, most existing pre-processing methods for data imbalance are designed and used for binary classes. Binary class imbalance techniques are difficult to apply to multi-class problems because they cannot model multi-class situations such as objects at cross-class boundaries or objects overlapping multiple classes. To solve this problem, research has been conducted on converting multi-class problems into binary class problems. However, simplifying multi-class problems can cause classification errors when the results of classifiers learned from the different sub-problems are combined, resulting in loss of important information about relationships beyond the selected items. Therefore, more effective methods are needed to address multi-class imbalance problems. We propose a collaborative filtering model that uses CGAN to generate realistic virtual data to populate the empty user-item matrix. The conditional vector y identifies the distributions of the minority classes, and data reflecting their characteristics are generated. Collaborative filtering then maximizes the performance of the recommendation system via hyperparameter tuning. This process should improve the accuracy of the model by addressing the sparsity problem of collaborative filtering while mitigating the data imbalance that arises in real data. Our model shows better recommendation performance than existing oversampling techniques and than the original sparse real-world data. SMOTE, Borderline SMOTE, SVM-SMOTE, ADASYN, and GAN were used as comparative models, and our model achieved the highest prediction accuracy on the RMSE and MAE evaluation measures. Through this study, oversampling based on deep learning can further refine the performance of recommendation systems that use actual data and can be used to build business recommendation systems.
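For reference, the RMSE and MAE measures used in this comparison can be computed as in the sketch below, using their standard definitions over pairs of actual and predicted ratings. The function names and the sample ratings are made up for illustration and do not come from the paper.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

actual = [5, 4, 3, 1]
predicted = [4, 4, 5, 1]
print(mae(actual, predicted))   # 0.75
print(rmse(actual, predicted))  # sqrt(1.25), roughly 1.118
```

Lower values are better for both measures; RMSE penalizes large individual errors more heavily than MAE, which is why the two are often reported together.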