• Title/Summary/Keyword: imbalanced binary data (불균형적인 이항 자료)

On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

  • Kim, HanYong; Lee, Woojoo
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.681-690 / 2017
  • Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and prediction of defective products. Several sampling methods, such as over-sampling, under-sampling, and SMOTE, have been developed to overcome the poor prediction performance of binary classifiers when one group is dominant. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machines in combination with these sampling methods for imbalanced binary data. Four real data sets are analyzed to see whether there is a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.
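
    As a rough illustration of the two simplest sampling methods named in this abstract, the sketch below implements random over-sampling and random under-sampling with the Python standard library only. The function names, the `minority_label` argument, and the toy data are assumptions made for this example and do not come from the paper; SMOTE, which synthesizes new minority points rather than duplicating existing ones, is not shown.

```python
import random


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Draw (with replacement) as many extra minority rows as are missing.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, minority_label, seed=0):
    """Keep all minority rows and a random subset of majority rows of the same size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Sample without replacement so no majority row is kept twice.
    kept = rng.sample(majority, min(len(minority), len(majority)))
    idx = kept + minority
    return [X[i] for i in idx], [y[i] for i in idx]


# Hypothetical toy data: 2 minority rows (label 1) and 8 majority rows (label 0).
X = [[i] for i in range(10)]
y = [1, 1] + [0] * 8
X_over, y_over = random_oversample(X, y, minority_label=1)
X_under, y_under = random_undersample(X, y, minority_label=1)
```

    In common practice such resampling is applied to the training split only, so that the evaluation data keep the original class proportions; whether this is among the precautions the authors discuss is not stated in the abstract.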

Heterogeneity Analysis of the Male Birth Ratio Data (남아 출생률 자료에 대한 이질성 분석)

  • Lim, Hwa-Kyung; Song, Seuck-Heun; Song, Ju-Won
    • The Korean Journal of Applied Statistics / v.22 no.2 / pp.365-373 / 2009
  • Since 1990, fetal sex identification and illegal abortion have produced a sex ratio imbalance at birth in Korea, owing to a preference for sons over daughters, socio-economic development, population policy, and so forth. Although many studies, such as time series analyses and regional difference analyses, have monitored this sex ratio imbalance, they share the defect that time and space could not be included in the analysis simultaneously. This study analyzes the sex ratio imbalance at birth, taking into account time and region at the same time. The analysis considers the numbers of male and female babies born third or later in their families in 2000 and 2001 in 234 Gu / Si / Goon administrative districts. Here, we suggest a mixture model of binomial distributions, assuming heterogeneous populations. The location parameters, weights, and correlation coefficient of the mixture model are estimated by the EM algorithm, and the heterogeneity of the regions is visualized using ArcView GIS.
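
    The EM estimation mentioned in this abstract can be sketched in a simplified form. The example below fits a finite mixture of binomial distributions to per-district counts, estimating only the component probabilities and weights; it ignores the correlation coefficient that the authors also estimate, and it is not their implementation. The function names, the starting values, and the toy counts are assumptions for illustration.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)


def em_binomial_mixture(counts, p_init, w_init, n_iter=100, eps=1e-9):
    """Fit a binomial mixture by EM.

    counts: list of (k, n) pairs, e.g. k male births among n births in a district.
    p_init, w_init: starting success probabilities and mixture weights.
    Returns (p, w), the estimated component probabilities and weights.
    """
    p, w = list(p_init), list(w_init)
    for _ in range(n_iter):
        # E-step: posterior probability that each district belongs to each component.
        resp = []
        for k, n in counts:
            logs = [math.log(wj) + binom_logpmf(k, n, pj) for pj, wj in zip(p, w)]
            top = max(logs)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: responsibility-weighted proportions and average responsibilities.
        for j in range(len(p)):
            rj = [r[j] for r in resp]
            trials = sum(r * n for r, (k, n) in zip(rj, counts))
            if trials > 0:
                successes = sum(r * k for r, (k, n) in zip(rj, counts))
                # Clamp away from 0 and 1 so the logarithms stay finite.
                p[j] = min(max(successes / trials, eps), 1.0 - eps)
            w[j] = max(sum(rj) / len(counts), eps)
    return p, w


# Hypothetical toy data: five districts near 30% and five near 70%.
counts = [(30, 100)] * 5 + [(70, 100)] * 5
p_hat, w_hat = em_binomial_mixture(counts, p_init=[0.4, 0.6], w_init=[0.5, 0.5])
```

    The number of components is fixed in advance here (two, set by the length of `p_init`); how the authors choose it is not stated in the abstract.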

Parameter estimation for the imbalanced credit scoring data using AUC maximization (AUC 최적화를 이용한 낮은 부도율 자료의 모수추정)

  • Hong, C.S.; Won, C.H.
    • The Korean Journal of Applied Statistics / v.29 no.2 / pp.309-319 / 2016
  • For binary classification models, we consider a risk score that is a function of linear scores and estimate the coefficients of the linear scores. There are two estimation methods: one obtains MLEs using logistic models, and the other maximizes the AUC. Estimates from the AUC approach are better than MLEs from logistic models in general situations where the logistic assumptions do not hold. This paper considers imbalanced data for credit assessment models, in which the default class contains fewer observations than the non-default class, and applies the AUC approach to such data. Various logit link functions are used to generate imbalanced data. The coefficients estimated by the AUC approach are found to be equivalent to (or better than) those from logistic models for imbalanced data with low default probability.
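
    To make the objective in this abstract concrete, the sketch below computes the empirical AUC of a linear score as the fraction of (default, non-default) pairs in which the default observation receives the higher score, with ties counted as one half. The coding of default as label 1, the function names, and the toy values are assumptions for this example; the search over coefficients that the paper performs is not shown.

```python
def linear_score(x, beta):
    """Linear score: inner product of the covariates x and the coefficients beta."""
    return sum(b * xi for b, xi in zip(beta, x))


def empirical_auc(scores, labels):
    """Empirical AUC as the Mann-Whitney pair-counting statistic.

    labels: 1 for the default class, 0 for the non-default class (assumed coding).
    """
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # default scored above non-default
            elif sp == sn:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two covariates per observation, candidate coefficients beta.
X = [[0.1, 0.0], [0.2, 0.2], [0.3, 0.05], [0.5, 0.3]]
labels = [0, 0, 1, 1]
beta = [1.0, 1.0]
scores = [linear_score(x, beta) for x in X]
print(empirical_auc(scores, labels))  # 0.75
```

    Because the AUC depends only on the ranking of the scores, multiplying all coefficients by a positive constant leaves it unchanged, so an AUC-based estimate identifies the coefficients only up to scale.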

A comparative study of feature screening methods for ultrahigh dimensional multiclass classification (초고차원 다범주분류를 위한 변수선별 방법 비교 연구)

  • Lee, Kyungeun; Kim, Kyoung Hee; Shin, Seung Jun
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.793-808 / 2017
  • We compare various variable screening methods on multiclass classification problems when the data are ultrahigh-dimensional. Two different approaches were considered: (1) pairwise extension from binary classification via one-versus-one or one-versus-rest comparisons and (2) direct classification of multiclass responses. We conducted extensive simulation studies under different conditions: heavy-tailed explanatory variables, correlated signal and noise variables, correlated joint distributions but uncorrelated marginals, and unbalanced response variables. We then analyzed real data to examine the performance of the methods. The results showed that model-free methods perform better for multiclass classification problems as well as binary ones.

Procyclicality of Buffer Capital and Its Implications for Basel II: A Cross Country Analysis (은행 자기자본의 경기순응성에 대한 국제비교분석과 Basel II에 대한 시사점)

  • Kim, Hyeon-Wook; Lee, Hangyong
    • KDI Journal of Economic Policy / v.29 no.1 / pp.177-196 / 2007
  • This paper investigates the cyclical patterns of buffer capital using unbalanced panel data for banks in 30 OECD countries and 7 non-OECD Asian countries. We test whether the relationship between buffer capital and the business cycle differs systematically across country groups, controlling for other potential determinants of bank capital. We find that the correlation is positive for developed countries while it is negative for Asian developing countries. These findings suggest that, once Basel II is implemented, developing countries are more likely to observe an increase in output volatility. We then review the policy recommendations to mitigate the procyclicality problem of Basel II.