• Title/Abstract/Keywords: Misclassification


Undecided inference using logistic regression for credit evaluation

  • 홍종선;정민섭
    • Journal of the Korean Data and Information Science Society / Vol. 22, No. 2 / pp. 149-157 / 2011
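  • This study treats the undecided applicants that arise in the credit evaluation process as a missing-data problem and infers their status under the MAR and MNAR assumptions. Under MAR, the default probability of each undecided applicant is computed from the regression coefficient vector of a logistic regression model fitted to the decided applicants, and is then compared with the default probabilities of the decided applicants to judge the undecided applicant's future status. Under MNAR, we propose a method that obtains the default probability of the undecided applicants from a logistic model with an additional characteristic variable and uses it to predict their status. Simulation experiments on two real data sets show that, under MAR, the misclassification rate of the inferred results does not differ from that of the original data even as the proportion of undecided applicants increases. Under MNAR, because an additional variable is considered when estimating the undecided applicants, their misclassification rate is lower than under MAR, and the overall misclassification rate decreases further as the share of undecided applicants in the whole data set grows.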
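
The MAR step described in this abstract can be illustrated with a minimal Python sketch. It assumes the logistic coefficients have already been fitted on the decided applicants; the coefficient values, the toy feature vectors, and the choice of the decided applicants' mean default probability as the cutoff are all hypothetical and are not taken from the paper.

```python
import math

def default_probability(beta, x):
    """Logistic model: P(default | x) = 1 / (1 + exp(-(b0 + b . x)))."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def infer_undecided(beta, decided_X, undecided_X):
    """Label each undecided case 1 (default) or 0 (non-default).

    Assumed rule: compare the case's default probability with the mean
    default probability of the decided cases.
    """
    cutoff = sum(default_probability(beta, x) for x in decided_X) / len(decided_X)
    return [int(default_probability(beta, x) > cutoff) for x in undecided_X]

# Hypothetical coefficients (intercept plus two features) and toy data.
beta = [-1.0, 2.0, -0.5]
decided_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
undecided_X = [[2.0, 0.0], [0.0, 4.0]]
labels = infer_undecided(beta, decided_X, undecided_X)
```

Under MNAR the abstract adds a further characteristic variable to the model; in this sketch that would amount to extending `beta` and each feature vector by one entry.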

Design and Performance Measurement of a Genetic Algorithm-based Group Classification Method: The Case of Bond Rating

  • 민재형;정철우
    • Journal of the Korean Operations Research and Management Science Society / Vol. 32, No. 1 / pp. 61-75 / 2007
  • The purpose of this paper is to develop a new group classification method based on a genetic algorithm and to compare its prediction performance with that of existing methods in the area of bond rating. To this end, we conduct various experiments with pilot and general models. Specifically, we first experiment with two pilot models: one searches for the cluster center of each group, and the other searches for both the cluster centers and the attribute weights in order to maximize classification accuracy. The pilot results show that the latter achieves a higher classification accuracy ratio than the former, which provides the rationale for searching for both the cluster center of each group and the attribute weights to improve classification accuracy. With this lesson in mind, we design two generalized models employing a genetic algorithm: one maximizes classification accuracy and the other minimizes the total misclassification cost. We compare the performance of these two models with that of existing statistical and artificial intelligence models such as MDA, ANN, and Decision Tree, and conclude that the genetic algorithm-based group classification method proposed in this paper significantly outperforms the other methods in both classification accuracy ratio and misclassification cost.
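
The objective that such a search would evaluate can be sketched in Python as follows. This is only an illustration under stated assumptions: the weighted squared Euclidean distance, the cost matrix, and the toy data are hypothetical, and the genetic algorithm itself (which would propose candidate centers and weights) is not shown.

```python
def classify(x, centers, weights):
    """Assign x to the group whose center is nearest in weighted squared distance."""
    def dist(c):
        return sum(w * (xi - ci) ** 2 for w, xi, ci in zip(weights, x, c))
    return min(range(len(centers)), key=lambda g: dist(centers[g]))

def total_cost(X, y, centers, weights, cost):
    """Sum cost[true][predicted] over all samples (zero on the diagonal)."""
    return sum(cost[t][classify(x, centers, weights)] for x, t in zip(X, y))

# Hypothetical two-group example with asymmetric misclassification costs.
centers = [[0.0, 0.0], [4.0, 4.0]]
cost = [[0, 1], [5, 0]]
X = [[1.0, 0.0], [3.0, 5.0], [1.0, 2.5]]
y = [0, 1, 1]

equal_weights_cost = total_cost(X, y, centers, [1.0, 1.0], cost)
tuned_weights_cost = total_cost(X, y, centers, [0.1, 1.0], cost)
```

In this toy case, down-weighting the first attribute removes the one costly error, which mirrors the abstract's point that searching over attribute weights as well as cluster centers can improve the result.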

Cost-Sensitive Case Based Reasoning using Genetic Algorithm: Application to Diagnose for Diabetes

  • Park Yoon-Joo;Kim Byung-Chun
    • Korea Intelligent Information Systems Society: Conference Proceedings / Korea Intelligent Information Systems Society 2006 Spring Conference / pp. 327-335 / 2006
  • Case Based Reasoning (CBR) has come to be considered an appropriate technique for diagnosis, prognosis and prescription in medicine. However, conventional CBR has a limitation in that it cannot incorporate asymmetric misclassification costs. It assumes that the costs of type 1 and type 2 errors are the same, so it cannot be adjusted according to the error cost of each type. This is a major disincentive to applying conventional CBR to the many real-world cases in which different types of error carry different costs. Medical diagnosis is an important example. In this paper we propose a new knowledge extraction technique called Cost-Sensitive Case Based Reasoning (CSCBR) that can incorporate unequal misclassification costs. The main idea is a dynamic adaptation of the optimal classification boundary point and the number of neighbors so as to minimize the total misclassification cost according to the error costs. Our technique uses a genetic algorithm (GA) to find these two feature vectors of CSCBR. We apply the new method to diabetes datasets and compare the results with those of the cost-sensitive methods C5.0 and CART. The results show that the proposed technique outperforms the other methods and overcomes the limitation of conventional CBR.
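
The idea of tuning the classification boundary and the number of neighbors against a total misclassification cost can be sketched in Python. This is a simplified stand-in, not the CSCBR method itself: a plain k-nearest-neighbor vote replaces the case-based reasoner, an exhaustive grid search replaces the genetic algorithm, and the data, costs, and function names are hypothetical.

```python
def knn_positive_fraction(x, train_X, train_y, k):
    """Fraction of the k nearest training cases labelled positive (1)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_X[i])))
    return sum(train_y[i] for i in order[:k]) / k

def total_cost(params, train_X, train_y, val_X, val_y, c_fn, c_fp):
    """Total misclassification cost on validation cases for (k, boundary)."""
    k, boundary = params
    cost = 0.0
    for x, t in zip(val_X, val_y):
        pred = int(knn_positive_fraction(x, train_X, train_y, k) >= boundary)
        if pred == 0 and t == 1:
            cost += c_fn  # missed positive (false negative)
        elif pred == 1 and t == 0:
            cost += c_fp  # false alarm (false positive)
    return cost

def grid_search(train_X, train_y, val_X, val_y, c_fn, c_fp, ks, boundaries):
    """Exhaustive search, used here in place of the GA from the abstract."""
    grid = [(k, b) for k in ks for b in boundaries]
    return min(grid, key=lambda p: total_cost(p, train_X, train_y,
                                              val_X, val_y, c_fn, c_fp))

# Hypothetical one-feature data; a missed positive costs five times a false alarm.
train_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
train_y = [0, 0, 1, 0, 1, 1]
val_X = [[1.4], [0.2]]
val_y = [1, 0]
best = grid_search(train_X, train_y, val_X, val_y,
                   c_fn=5.0, c_fp=1.0, ks=[1, 3], boundaries=[0.3, 0.5])
```

With these costs the search prefers a lower boundary, accepting a cheap false alarm to avoid an expensive missed positive.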


A Study on the Efficiency of the MCMC Multiple Imputation in LDA

  • 유희경;김명철
    • Journal of the Korea Safety Management & Science / Vol. 11, No. 3 / pp. 189-198 / 2009
  • This thesis studies two imputation methods for handling missing data, the MCMC method and the EM algorithm. The performance of the two methods in linear (or quadratic) discriminant analysis is evaluated under various types of incomplete observations. Based on simulation experiments, the effects of imputation using the EM algorithm and the MCMC method are evaluated and compared in terms of the probability of misclassification and the RMSE. The cases of incomplete observations are differentiated by missing rate, sample size, and distance between the two classification groups. The results show that the probability of misclassification and the RMSE of the EM algorithm are lower than those of the MCMC method, so imputation using the EM algorithm is more efficient than the MCMC method. In addition, when the sample size is small and the missing rate is extremely high, omitting all observation vectors with missing values from the analysis gives a lower probability of misclassification than either the EM algorithm or the MCMC method.

Derivation and Application of Influence Function in Discriminant Analysis for Three Groups

  • 이혜정;김홍기
    • The Korean Journal of Applied Statistics / Vol. 24, No. 5 / pp. 941-949 / 2011
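  • This paper aims to identify outliers that affect the misclassification probability computed when discriminant analysis is applied to exactly three groups, and presents a simple, easily applicable influence function formula. Using the proposed formula, we classify the three Sasang constitutional types from facial data and compute the influence function of each observation on the misclassification probability. We confirm that using the influence function for the misclassification probability is an efficient way to remove outliers and rerun the discriminant analysis.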

Analyzing the Effect of Lexical and Conceptual Information in Spam-mail Filtering System

  • Kang Sin-Jae;Kim Jong-Wan
    • International Journal of Fuzzy Logic and Intelligent Systems / Vol. 6, No. 2 / pp. 105-109 / 2006
  • In this paper, we constructed a two-phase spam-mail filtering system based on lexical and conceptual information. Two kinds of information can distinguish spam mail from ham (non-spam) mail. The definite information consists of the mail sender's information, URLs, and a list of known spam keywords; the less definite information consists of the word list and concept codes extracted from the mail body. We first classified spam mail using the definite information, and then used the less definite information. In the second phase, we used the lexical information and concept codes contained in the email body for SVM learning. According to our results, the ham misclassification rate was reduced when more lexical information was used as features, and the spam misclassification rate was reduced when the concept codes were included as features as well.

Comparison of Multiway Discretization Algorithms for Data Mining

  • Kim, Jeong-Suk;Jang, Young-Mi;Na, Jong-Hwa
    • Journal of the Korean Data and Information Science Society / Vol. 16, No. 4 / pp. 801-813 / 2005
  • Discretization algorithms for continuous data have been actively studied in the area of data mining. Discretization is very important in data analysis, especially for efficient model selection in data mining. In this paper, we introduce the principles of several multiway discretization algorithms, including the KEX, 1R and CN4 algorithms, and investigate their efficiency through a numerical study. For various underlying distributions, we compare these algorithms in terms of misclassification rate.


Comparison of Binary Discretization Algorithms for Data Mining

  • Na, Jong-Hwa;Kim, Jeong-Mi;Cho, Wan-Sup
    • Journal of the Korean Data and Information Science Society / Vol. 16, No. 4 / pp. 769-780 / 2005
  • Recently, discretization algorithms for continuous data have been actively studied, but few articles compare the efficiency of these algorithms. In this paper we introduce the principles of several binary discretization algorithms, including C4.5, CART and QUEST, and investigate their efficiency through a numerical study. For various underlying distributions, we compare these algorithms in terms of misclassification rate and MSE. Real data examples are also included.
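
As a rough illustration of what binary discretization involves, the Python sketch below picks a single cut point on a continuous feature by directly minimizing the misclassification rate. This is a generic simplification and an assumption of this example: it is not the splitting criterion of C4.5 (information gain), CART (Gini index) or QUEST (statistical tests), and the data are hypothetical.

```python
def best_binary_cut(values, labels):
    """Return (cut, error_rate) for the cut point that minimizes the
    misclassification rate when each side predicts its majority class."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, 1.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        errors = 0
        for side in (pairs[:i], pairs[i:]):
            counts = {}
            for _, lab in side:
                counts[lab] = counts.get(lab, 0) + 1
            errors += len(side) - max(counts.values())
        if errors / n < best[1]:
            best = (cut, errors / n)
    return best

# Hypothetical feature values and class labels.
cut, rate = best_binary_cut([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 0])
```

Comparing algorithms by misclassification rate, as the abstract describes, would mean applying each algorithm's own cut to held-out data and counting errors in the same way.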


On a Balanced Classification Rule

  • Kim, Hea-Jung
    • Journal of the Korean Statistical Society / Vol. 24, No. 2 / pp. 453-470 / 1995
  • We describe a constrained optimal classification rule for the case when the prior probability of an observation belonging to one of the two populations is unknown. This is done by suggesting a balanced design for the classification experiment and constructing the optimal rule under the balanced design condition. The rule is characterized by a constrained minimization of the total risk of misclassification; the constraint is constructed by equating the Kullback-Leibler directed divergence measures obtained from the two population conditional densities. The efficacy of the suggested rule is examined through two-group normal classification. The results indicate that, when little is known about the relative population sizes, dramatic gains in classification accuracy can be achieved.
