• Title/Summary/Keyword: 언더샘플링

Search Result 12, Processing Time 0.03 seconds

A Clustering-based Undersampling Method to Prevent Information Loss from Text Data (텍스트 데이터의 정보 손실을 방지하기 위한 군집화 기반 언더샘플링 기법)

  • Jong-Hwi Kim;Saim Shin;Jin Yea Jang
    • Annual Conference on Human and Language Technology
    • /
    • 2022.10a
    • /
    • pp.251-256
    • /
    • 2022
  • 범주 불균형은 분류 모델이 다수 범주에 편향되게 학습되어 소수 범주에 대한 분류 성능을 떨어뜨리는 문제를 야기한다. 언더 샘플링 기법은 다수 범주 데이터의 수를 줄여 소수 범주와 균형을 이루게하는 대표적인 불균형 해결 방법으로, 텍스트 도메인에서의 기존 언더 샘플링 연구에서는 단어 임베딩과 랜덤 샘플링과 같은 비교적 간단한 기법만이 적용되었다. 본 논문에서는 트랜스포머 기반 문장 임베딩과 군집화 기반 샘플링 방법을 통해 텍스트 데이터의 정보 손실을 최소화하는 언더샘플링 방법을 제안한다. 제안 방법의 검증을 위해, 감성 분석 실험에서 제안 방법과 랜덤 샘플링으로 추출한 훈련 세트로 모델을 학습하고 성능을 비교 평가하였다. 제안 방법을 활용한 모델이 랜덤 샘플링을 활용한 모델에 비해 적게는 0.2%, 많게는 2.0% 높은 분류 정확도를 보였고, 이를 통해 제안하는 군집화 기반 언더 샘플링 기법의 효과를 확인하였다.

  • PDF

Comparison of resampling methods for dealing with imbalanced data in binary classification problem (이분형 자료의 분류문제에서 불균형을 다루기 위한 표본재추출 방법 비교)

  • Park, Geun U;Jung, Inkyung
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.3
    • /
    • pp.349-374
    • /
    • 2019
  • A class imbalance problem arises when one class outnumbers the other class by a large proportion in binary data. Studies such as transforming the learning data have been conducted to solve this imbalance problem. In this study, we compared resampling methods among methods to deal with an imbalance in the classification problem. We sought to find a way to more effectively detect the minority class in the data. Through simulation, a total of 20 methods of over-sampling, under-sampling, and combined method of over- and under-sampling were compared. The logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, were used as classifiers. The simulation results showed that the random under sampling (RUS) method had the highest sensitivity with an accuracy over 0.5. The next most sensitive method was an over-sampling adaptive synthetic sampling approach. This revealed that the RUS method was suitable for finding minority class values. The results of applying to some real data sets were similar to those of the simulation.

Geometry Image Optimization using a Normal Vector (정점의 법선벡터를 이용한 기하이미지의 최적화)

  • Park Jong-Lae;Yang Sung-Bong
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2004.11a
    • /
    • pp.241-244
    • /
    • 2004
  • 일반적으로 메쉬(mesh)는 비정규 연결 형태(irregular connectivity)로 되어 있다. 리메싱(remeshing)은 비정규 연결 형태의 메쉬를 정규 연결 형태(regular connectivity)로 바꾸어 주는 작업이다. 메쉬의 기하 정보가 2D 그리드에 저장이 되어 있는 기하이미지(geometry Images)는 비정규 연결 형태의 메쉬를 완전 정규 형태(completely regular connectivity)로 리메싱하는 데 사용된다. 원본 메쉬를 기하 이미지로 생성하는 방법은 변형되는 크기를 최소화 하는 스트레치 메트릭(stretch metric)을 기반으로 이루어 졌다. 이 방법은 리메싱된 메쉬의 언더샘플링(undersampling)을 줄여 주게 된다. 하지만 리메싱 과정에서 생기는 오버샘플링(oversampling)은 줄여 주지 못한다. 본 논문에서는 정점(vertex)의 법선 벡터(normal vector)를 이용하여 기하이미지의 오버샘플링을 줄이는 방법을 제시한다.

  • PDF

A Hybrid Under-sampling Approach for Better Bankruptcy Prediction (부도예측 개선을 위한 하이브리드 언더샘플링 접근법)

  • Kim, Taehoon;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.2
    • /
    • pp.173-190
    • /
    • 2015
  • The purpose of this study is to improve bankruptcy prediction models by using a novel hybrid under-sampling approach. Most prior studies have tried to enhance the accuracy of bankruptcy prediction models by improving the classification methods involved. In contrast, we focus on appropriate data preprocessing as a means of enhancing accuracy. In particular, we aim to develop an effective sampling approach for bankruptcy prediction, since most prediction models suffer from class imbalance problems. The approach proposed in this study is a hybrid under-sampling method that combines the k-Reverse Nearest Neighbor (k-RNN) and one-class support vector machine (OCSVM) approaches. k-RNN can effectively eliminate outliers, while OCSVM contributes to the selection of informative training samples from majority class data. To validate our proposed approach, we have applied it to data from H Bank's non-external auditing companies in Korea, and compared the performances of the classifiers with the proposed under-sampling and random sampling data. The empirical results show that the proposed under-sampling approach generally improves the accuracy of classifiers, such as logistic regression, discriminant analysis, decision tree, and support vector machines. They also show that the proposed under-sampling approach reduces the risk of false negative errors, which lead to higher misclassification costs.

Handling Method of Imbalance Data for Machine Learning : Focused on Sampling (머신러닝을 위한 불균형 데이터 처리 방법 : 샘플링을 위주로)

  • Lee, Kyunam;Lim, Jongtae;Bok, Kyoungsoo;Yoo, Jaesoo
    • The Journal of the Korea Contents Association
    • /
    • v.19 no.11
    • /
    • pp.567-577
    • /
    • 2019
  • Recently, more and more attempts have been made to solve the problems faced by academia and industry through machine learning. Accordingly, various attempts are being made to solve non-general situations through machine learning, such as deviance, fraud detection and disability detection. A variety of attempts have been made to resolve the non-normal situation in which data is distributed disproportionately, generally resulting in errors. In this paper, we propose handling method of imbalance data for machine learning. The proposed method to such problem of an imbalance in data by verifying that the population distribution of major class is well extracted. Performance Evaluations have proven the proposed method to be better than the existing methods.

A Study on Fraud Detection in the C2C Used Trade Market Using Doc2vec

  • Lim, Do Hyun;Ahn, Hyunchul
    • Journal of the Korea Society of Computer and Information
    • /
    • v.27 no.3
    • /
    • pp.173-182
    • /
    • 2022
  • In this paper, we propose a machine learning model that can prevent fraudulent transactions in advance and interpret them using the XAI approach. For the experiment, we collected a real data set of 12,258 mobile phone sales posts from Joonggonara, a major domestic online C2C resale trading platform. Characteristics of the text corresponding to the post body were extracted using Doc2vec, dimensionality was reduced through PCA, and various derived variables were created based on previous research. To mitigate the data imbalance problem in the preprocessing stage, a complex sampling method that combines oversampling and undersampling was applied. Then, various machine learning models were built to detect fraudulent postings. As a result of the analysis, LightGBM showed the best performance compared to other machine learning models. And as a result of SHAP, if the price is unreasonably low compared to the market price and if there is no indication of the transaction area, there was a high probability that it was a fraudulent post. Also, high price, no safe transaction, the more the courier transaction, and the higher the ratio of 0 in the price also led to fraud.

On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

  • Kim, HanYong;Lee, Woojoo
    • The Korean Journal of Applied Statistics
    • /
    • v.30 no.5
    • /
    • pp.681-690
    • /
    • 2017
  • Various imbalanced binary classification problems exist such as fraud detection in banking operations, detecting spam mail and predicting defective products. Several sampling methods such as over sampling, under sampling, SMOTE have been developed to overcome the poor prediction performance of binary classifiers when the proportion of one group is dominant. In order to overcome this problem, several sampling methods such as over-sampling, under-sampling, SMOTE have been developed. In this study, we investigate prediction performance of logistic regression, Lasso, random forest, boosting and support vector machine in combination with the sampling methods for binary imbalanced data. Four real data sets are analyzed to see if there is a substantial improvement in prediction performance. We also emphasize some precautions when the sampling methods are implemented.

A Study on Improving Performance of Software Requirements Classification Models by Handling Imbalanced Data (불균형 데이터 처리를 통한 소프트웨어 요구사항 분류 모델의 성능 개선에 관한 연구)

  • Jong-Woo Choi;Young-Jun Lee;Chae-Gyun Lim;Ho-Jin Choi
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.12 no.7
    • /
    • pp.295-302
    • /
    • 2023
  • Software requirements written in natural language may have different meanings from the stakeholders' viewpoint. When designing an architecture based on quality attributes, it is necessary to accurately classify quality attribute requirements because the efficient design is possible only when appropriate architectural tactics for each quality attribute are selected. As a result, although many natural language processing models have been studied for the classification of requirements, which is a high-cost task, few topics improve classification performance with the imbalanced quality attribute datasets. In this study, we first show that the classification model can automatically classify the Korean requirement dataset through experiments. Based on these results, we explain that data augmentation through EDA(Easy Data Augmentation) techniques and undersampling strategies can improve the imbalance of quality attribute datasets, and show that they are effective in classifying requirements. The results improved by 5.24%p on F1-score, indicating that handling imbalanced data helps classify Korean requirements of classification models. Furthermore, detailed experiments of EDA illustrate operations that help improve classification performance.

Classification Performance Improvement of UNSW-NB15 Dataset Based on Feature Selection (특징선택 기법에 기반한 UNSW-NB15 데이터셋의 분류 성능 개선)

  • Lee, Dae-Bum;Seo, Jae-Hyun
    • Journal of the Korea Convergence Society
    • /
    • v.10 no.5
    • /
    • pp.35-42
    • /
    • 2019
  • Recently, as the Internet and various wearable devices have appeared, Internet technology has contributed to obtaining more convenient information and doing business. However, as the internet is used in various parts, the attack surface points that are exposed to attacks are increasing, Attempts to invade networks aimed at taking unfair advantage, such as cyber terrorism, are also increasing. In this paper, we propose a feature selection method to improve the classification performance of the class to classify the abnormal behavior in the network traffic. The UNSW-NB15 dataset has a rare class imbalance problem with relatively few instances compared to other classes, and an undersampling method is used to eliminate it. We use the SVM, k-NN, and decision tree algorithms and extract a subset of combinations with superior detection accuracy and RMSE through training and verification. The subset has recall values of more than 98% through the wrapper based experiments and the DT_PSO showed the best performance.

An Adaptive System for Effective Fur rendering (효과적인 Fur 렌더링을 위한 적응적 시스템 -혼합 렌더링을 이용한 빠른 Fur 렌더링 방법-)

  • Kim, Hye-Sun;Ban, Yun-Ji;Lee, Chung-Hwan;Nam, Seung-Woo;Choi, Jin-Sung;Oh, Jun-Kyu
    • 한국HCI학회:학술대회논문집
    • /
    • 2009.02a
    • /
    • pp.719-724
    • /
    • 2009
  • Fur rendering is difficult in that there are huge numbers of objects and it takes so much time. The previous method considers fur as cylinder, transforms it into 2D ribbon, triangulates and commits rendering. But this method has problem like under sampling and takes rendering time so long. To resolve these shortcuts we proposed new algorithm. We divide fur into thick and thin fur and we applied adaptive rendering methods for each type of fur. Also we can perform an effective rendering according to the proposed rendering framework.

  • PDF