• Title/Summary/Keyword: Imbalanced data

Search Result 161, Processing Time 0.021 seconds

Image Classification using Class-Balanced Loss (Class-Balanced Loss를 이용한 이미지 분류)

  • Jihee Park;Wonjun Hwang
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2022.11a
    • /
    • pp.164-166
    • /
    • 2022
  • Long-tail problem은 class 별로 sample의 개수에 차이가 있어 성능에 안 좋은 영향을 미치는 것을 말한다. 본 논문에서는 cost-sensitive learning 중 Class-Balanced Loss를 이용해 성능을 개선하여 Long-tail problem을 해결하려고 한다. 먼저, balanced data set과 imbalanced data set의 성능 차이를 살펴보도록 할 것이다. 그 후, Class-Balanced Loss를 3가지 버전으로 이용해 그 성능을 측정하고 분석해 볼 것이다.

  • PDF

Study on Detection Technique for Cochlodinium polykrikoides Red tide using Logistic Regression Model under Imbalanced Data (불균형 데이터 환경에서 로지스틱 회귀모형을 이용한 Cochlodinium polykrikoides 적조 탐지 기법 연구)

  • Bak, Su-Ho;Kim, Heung-Min;Kim, Bum-Kyu;Hwang, Do-Hyun;Enkhjargal, Unuzaya;Yoon, Hong-Joo
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.13 no.6
    • /
    • pp.1353-1364
    • /
    • 2018
  • This study proposed a method to detect Cochlodinium polykrikoides red tide pixels in satellite images using a logistic regression model of machine learning technique under Imbalanced data. The spectral profiles extracted from red tide, clear water, and turbid water were used as training dataset. 70% of the entire data set was extracted and used for as model training, and the classification accuracy of the model was evaluated using the remaining 30%. At this time, the white noise was added to the spectral profile of the red tide, which has a relatively small number of data compared to the clear water and the turbid water, and over-sampling was performed to solve the unbalanced data problem. As a result of the accuracy evaluation, the proposed algorithm showed about 94% classification accuracy.

Comparison of Machine Learning Methodology in COPD Cohort Data (COPD 코호트 자료에서의 Machine Learning 방법론 비교)

  • Jeong, Hyeon-Myeong;Park, Heon-Jin;Rhee, Chin-Kook;Lee, Jong-min
    • The Journal of Bigdata
    • /
    • v.2 no.2
    • /
    • pp.115-128
    • /
    • 2017
  • Recently, Machine Learning Methods are widely used with high prediction performance. But if the limit of the data is solved by the statistical technique, It can, lead to higher prediction performance than the existing one. In this study, the SMOTE method is used to solve the imbalance problem in the longitudinal and imbalanced data. As a result, It, was confirmed that the prediction performance increases. Additionally, Although, studies on COPD have been actively conducted, only studies that are related to acute exacerbation have been conducted. So there are no studies on the prediction of acute exacerbation through multiple perspectives and predictive models for various factors. In this study, We examined the factors related to acute exacerbation of COPD and constructed a personalized specific disease prediction model.

  • PDF

Extraction Method of Significant Clinical Tests Based on Data Discretization and Rough Set Approximation Techniques: Application to Differential Diagnosis of Cholecystitis and Cholelithiasis Diseases (데이터 이산화와 러프 근사화 기술에 기반한 중요 임상검사항목의 추출방법: 담낭 및 담석증 질환의 감별진단에의 응용)

  • Son, Chang-Sik;Kim, Min-Soo;Seo, Suk-Tae;Cho, Yun-Kyeong;Kim, Yoon-Nyun
    • Journal of Biomedical Engineering Research
    • /
    • v.32 no.2
    • /
    • pp.134-143
    • /
    • 2011
  • The selection of meaningful clinical tests and its reference values from a high-dimensional clinical data with imbalanced class distribution, one class is represented by a large number of examples while the other is represented by only a few, is an important issue for differential diagnosis between similar diseases, but difficult. For this purpose, this study introduces methods based on the concepts of both discernibility matrix and function in rough set theory (RST) with two discretization approaches, equal width and frequency discretization. Here these discretization approaches are used to define the reference values for clinical tests, and the discernibility matrix and function are used to extract a subset of significant clinical tests from the translated nominal attribute values. To show its applicability in the differential diagnosis problem, we have applied it to extract the significant clinical tests and its reference values between normal (N = 351) and abnormal group (N = 101) with either cholecystitis or cholelithiasis disease. In addition, we investigated not only the selected significant clinical tests and the variations of its reference values, but also the average predictive accuracies on four evaluation criteria, i.e., accuracy, sensitivity, specificity, and geometric mean, during l0-fold cross validation. From the experimental results, we confirmed that two discretization approaches based rough set approximation methods with relative frequency give better results than those with absolute frequency, in the evaluation criteria (i.e., average geometric mean). Thus it shows that the prediction model using relative frequency can be used effectively in classification and prediction problems of the clinical data with imbalanced class distribution.

Imbalanced Data Improvement Techniques Based on SMOTE and Light GBM (SMOTE와 Light GBM 기반의 불균형 데이터 개선 기법)

  • Young-Jin, Han;In-Whee, Joe
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.11 no.12
    • /
    • pp.445-452
    • /
    • 2022
  • Class distribution of unbalanced data is an important part of the digital world and is a significant part of cybersecurity. Abnormal activity of unbalanced data should be found and problems solved. Although a system capable of tracking patterns in all transactions is needed, machine learning with disproportionate data, which typically has abnormal patterns, can ignore and degrade performance for minority layers, and predictive models can be inaccurately biased. In this paper, we predict target variables and improve accuracy by combining estimates using Synthetic Minority Oversampling Technique (SMOTE) and Light GBM algorithms as an approach to address unbalanced datasets. Experimental results were compared with logistic regression, decision tree, KNN, Random Forest, and XGBoost algorithms. The performance was similar in accuracy and reproduction rate, but in precision, two algorithms performed at Random Forest 80.76% and Light GBM 97.16%, and in F1-score, Random Forest 84.67% and Light GBM 91.96%. As a result of this experiment, it was confirmed that Light GBM's performance was similar without deviation or improved by up to 16% compared to five algorithms.

Facial Age Classification and Synthesis using Feature Decomposition (특징 분해를 이용한 얼굴 나이 분류 및 합성)

  • Chanho Kim;In Kyu Park
    • Journal of Broadcast Engineering
    • /
    • v.28 no.2
    • /
    • pp.238-241
    • /
    • 2023
  • Recently deep learning models are widely used for various tasks such as facial recognition and face editing. Their training process often involves a dataset with imbalanced age distribution. It is because some age groups (teenagers and middle age) are more socially active and tends to have more data compared to the less socially active age groups (children and elderly). This imbalanced age distribution may negatively impact the deep learning training process or the model performance when tested against those age groups with less data. To this end, we propose an age-controllable face synthesis technique using a feature decomposition to classify age from facial images which can be utilized to synthesize novel data to balance out the age distribution. We perform extensive qualitative and quantitative evaluation on our proposed technique using the FFHQ dataset and we show that our method has better performance than existing method.

Prediction of Protein-Protein Interaction Sites Based on 3D Surface Patches Using SVM (SVM 모델을 이용한 3차원 패치 기반 단백질 상호작용 사이트 예측기법)

  • Park, Sung-Hee;Hansen, Bjorn
    • The KIPS Transactions:PartD
    • /
    • v.19D no.1
    • /
    • pp.21-28
    • /
    • 2012
  • Predication of protein interaction sites for monomer structures can reduce the search space for protein docking and has been regarded as very significant for predicting unknown functions of proteins from their interacting proteins whose functions are known. In the other hand, the prediction of interaction sites has been limited in crystallizing weakly interacting complexes which are transient and do not form the complexes stable enough for obtaining experimental structures by crystallization or even NMR for the most important protein-protein interactions. This work reports the calculation of 3D surface patches of complex structures and their properties and a machine learning approach to build a predictive model for the 3D surface patches in interaction and non-interaction sites using support vector machine. To overcome classification problems for class imbalanced data, we employed an under-sampling technique. 9 properties of the patches were calculated from amino acid compositions and secondary structure elements. With 10 fold cross validation, the predictive model built from SVM achieved an accuracy of 92.7% for classification of 3D patches in interaction and non-interaction sites from 147 complexes.

Development of Prediction Models for Fatal Accidents using Proactive Information in Construction Sites (건설현장의 공사사전정보를 활용한 사망재해 예측 모델 개발)

  • Choi, Seung Ju;Kim, Jin Hyun;Jung, Kihyo
    • Journal of the Korean Society of Safety
    • /
    • v.36 no.3
    • /
    • pp.31-39
    • /
    • 2021
  • In Korea, more than half of work-related fatalities have occurred on construction sites. To reduce such occupational accidents, safety inspection by government agencies is essential in construction sites that present a high risk of serious accidents. To address this issue, this study developed risk prediction models of serious accidents in construction sites using five machine learning methods: support vector machine, random forest, XGBoost, LightGBM, and AutoML. To this end, 15 proactive information (e.g., number of stories and period of construction) that are usually available prior to construction were considered and two over-sampling techniques (SMOTE and ADASYN) were used to address the problem of class-imbalanced data. The results showed that all machine learning methods achieved 0.876~0.941 in the F1-score with the adoption of over-sampling techniques. LightGBM with ADASYN yielded the best prediction performance in both the F1-score (0.941) and the area under the ROC curve (0.941). The prediction models revealed four major features: number of stories, period of construction, excavation depth, and height. The prediction models developed in this study can be useful both for government agencies in prioritizing construction sites for safety inspection and for construction companies in establishing pre-construction preventive measures.

A Hybrid Oversampling Technique for Imbalanced Structured Data based on SMOTE and Adapted CycleGAN (불균형 정형 데이터를 위한 SMOTE와 변형 CycleGAN 기반 하이브리드 오버샘플링 기법)

  • Jung-Dam Noh;Byounggu Choi
    • Information Systems Review
    • /
    • v.24 no.4
    • /
    • pp.97-118
    • /
    • 2022
  • As generative adversarial network (GAN) based oversampling techniques have achieved impressive results in class imbalance of unstructured dataset such as image, many studies have begun to apply it to solving the problem of imbalance in structured dataset. However, these studies have failed to reflect the characteristics of structured data due to changing the data structure into an unstructured data format. In order to overcome the limitation, this study adapted CycleGAN to reflect the characteristics of structured data, and proposed hybridization of synthetic minority oversampling technique (SMOTE) and the adapted CycleGAN. In particular, this study tried to overcome the limitations of existing studies by using a one-dimensional convolutional neural network unlike previous studies that used two-dimensional convolutional neural network. Oversampling based on the method proposed have been experimented using various datasets and compared the performance of the method with existing oversampling methods such as SMOTE and adaptive synthetic sampling (ADASYN). The results indicated the proposed hybrid oversampling method showed superior performance compared to the existing methods when data have more dimensions or higher degree of imbalance. This study implied that the classification performance of oversampling structured data can be improved using the proposed hybrid oversampling method that considers the characteristic of structured data.

Polynomial Fuzzy Radial Basis Function Neural Network Classifiers Realized with the Aid of Boundary Area Decision

  • Roh, Seok-Beom;Oh, Sung-Kwun
    • Journal of Electrical Engineering and Technology
    • /
    • v.9 no.6
    • /
    • pp.2098-2106
    • /
    • 2014
  • In the area of clustering, there are numerous approaches to construct clusters in the input space. For regression problem, when forming clusters being a part of the overall model, the relationships between the input space and the output space are essential and have to be taken into consideration. Conditional Fuzzy C-Means (c-FCM) clustering offers an opportunity to analyze the structure in the input space with the mechanism of supervision implied by the distribution of data present in the output space. However, like other clustering methods, c-FCM focuses on the distribution of the data. In this paper, we introduce a new method, which by making use of the ambiguity index focuses on the boundaries of the clusters whose determination is essential to the quality of the ensuing classification procedures. The introduced design is illustrated with the aid of numeric examples that provide a detailed insight into the performance of the fuzzy classifiers and quantify several essentials design aspects.