• Title/Abstract/Keywords: unbalanced data

324 search results, processing time 0.028 seconds

Classification Analysis for Unbalanced Data

  • 김동아;강수연;송종우
    • 응용통계연구
    • /
    • Vol. 28, No. 3
    • /
    • pp.495-509
    • /
    • 2015
  • In typical two-class classification, the proportions of the two classes often do not differ much. This paper addresses the classification of unbalanced data, in which the proportions of the two classes differ greatly. Unbalanced data are often harder to classify than balanced data. When an ordinary classification model is applied to such data, most observations tend to be assigned to the larger class, yet in practical applications this kind of misclassification usually incurs the greater loss. We use sampling techniques to compare the performance of various classification methods. We also compare which method yields the smallest loss when an asymmetric loss is assumed. Performance is compared using the misclassification rate, G-mean, ROC, and AUC (area under the curve).
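The evaluation measures named in this abstract can be illustrated with a minimal sketch. The function names, the label coding (1 for the minority class), and the cost values below are assumptions for illustration only; the paper's own loss settings are not given in the abstract.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the minority class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn

def misclassification_rate(y_true, y_pred):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return (fn + fp) / (tp + fn + fp + tn)

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity.
    Assumes both classes occur in y_true."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def asymmetric_loss(y_true, y_pred, cost_fn=10.0, cost_fp=1.0):
    """Total loss when missing a minority case costs more than a false alarm.
    The default costs are hypothetical values chosen for this example."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return cost_fn * fn + cost_fp * fp

# 95 majority cases, 5 minority cases, and a classifier that
# assigns every observation to the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(misclassification_rate(y_true, y_pred))  # 0.05
print(g_mean(y_true, y_pred))                  # 0.0
print(asymmetric_loss(y_true, y_pred))         # 50.0
```

The example shows why the misclassification rate alone can mislead on unbalanced data: the rate looks low while the G-mean is zero and the asymmetric loss is driven entirely by the missed minority cases.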

Fault Detection of Unbalanced Cycle Signal Data Using SOM-based Feature Signal Extraction Method

  • 김송이;강지훈;박종혁;김성식;백준걸
    • 한국시뮬레이션학회논문지
    • /
    • Vol. 21, No. 2
    • /
    • pp.79-90
    • /
    • 2012
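  • This study proposes a feature signal extraction technique to improve the performance of fault detection algorithms when the process signals form unbalanced data. Unbalanced data refers to the case in a classification problem where the proportion of data belonging to one class differs greatly from that of the other class, so that fault detection performance degrades substantially. Because the number of fault signals obtainable during process operation is far smaller than the number of normal signals, resolving this problem before applying a fault detection technique is very important. To resolve the imbalance, the SOM (Self-Organizing Map) algorithm is used: the weights corresponding to each node are treated as feature signals, which balances the ratio of normal to fault data. For fault detection on the feature signal data, the classifiers kNN (k-Nearest Neighbor) and SVM (Support Vector Machine) are applied, and their performance is compared with that of Hotelling's $T^2$ control chart, which is commonly used for fault detection in process signals. The superiority of the proposed algorithm is verified by simulating process signals known to occur in semiconductor processes.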
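As a rough illustration of the baseline mentioned in this abstract, the sketch below computes Hotelling's $T^2$ statistic for a single observation against the mean and covariance of normal data. It assumes a two-variable signal summary so that the covariance matrix can be inverted by hand; the function names are hypothetical, and the control limit used in the paper is not stated in the abstract, so none is applied here.

```python
def mean_and_cov_2d(points):
    """Sample mean and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def hotelling_t2(x, mean, cov):
    """T^2 = (x - mean)^T S^{-1} (x - mean) for a 2x2 covariance S.
    Assumes S is non-singular."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is (1/det) * [[syy, -sxy], [-sxy, sxx]].
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical in-control reference data.
normal = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = mean_and_cov_2d(normal)
print(hotelling_t2((1.0, 1.0), mean, cov))  # at the mean the statistic is 0
print(hotelling_t2((5.0, 5.0), mean, cov))  # grows with distance from the mean
```

An observation would be flagged as a fault when its statistic exceeds a chosen control limit; selecting that limit is outside the scope of this sketch.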

A Study on the Unbalanced Current Distribution of HTS Power Cable

  • 김재호;박충화
    • 한국안전학회지
    • /
    • Vol. 27, No. 6
    • /
    • pp.43-47
    • /
    • 2012
  • Unbalanced currents flow in a High Temperature Superconducting (HTS) power cable as a result of asymmetrical faults, harmonic distortion, and unbalanced loads. This problem causes additional loss and leakage fields in the HTS power cable and degrades electric power quality and stability. In addition, large unbalanced currents can cause negative-sequence and ground relays to operate. This paper presents an analysis of the unbalanced three-phase current distribution in an HTS power cable caused by unbalanced load conditions and grounding methods, using PSCAD/EMTDC. The results of the analysis provide important data for the design of HTS power cables and useful information for their installation in power systems.
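The negative-sequence current mentioned in this abstract can be illustrated with the standard symmetrical-component decomposition of three phase currents. This is a generic sketch and not the PSCAD/EMTDC model used in the paper; the function names, the a-b-c phase sequence, and the example current values are assumptions.

```python
import cmath
import math

# 120-degree rotation operator used in the symmetrical-component transform.
A = cmath.exp(2j * math.pi / 3)

def symmetrical_components(ia, ib, ic):
    """Return (zero, positive, negative) sequence phasors for phase
    currents ia, ib, ic, assuming an a-b-c phase sequence."""
    i0 = (ia + ib + ic) / 3
    i1 = (ia + A * ib + A * A * ic) / 3
    i2 = (ia + A * A * ib + A * ic) / 3
    return i0, i1, i2

def unbalance_factor(ia, ib, ic):
    """Ratio of negative-sequence to positive-sequence magnitude."""
    _, i1, i2 = symmetrical_components(ia, ib, ic)
    return abs(i2) / abs(i1)

# Balanced set: equal magnitudes, phases 0, -120 and +120 degrees.
balanced = (1 + 0j, A * A, A)
# Hypothetical unbalanced load: phase a carries 20% more current.
unbalanced = (1.2 + 0j, A * A, A)

print(unbalance_factor(*balanced))    # close to zero
print(unbalance_factor(*unbalanced))  # non-zero negative-sequence share
```

A balanced set yields essentially no zero-sequence or negative-sequence current, while any load unbalance shows up in those two components, which is what negative-sequence and ground relays respond to.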

A Data Mining Procedure for Unbalanced Binary Classification

  • 정한나;이정화;전치혁
    • 대한산업공학회지
    • /
    • Vol. 36, No. 1
    • /
    • pp.13-21
    • /
    • 2010
  • Predicting customers' contract cancellations is essential for insurance companies, but it is a difficult problem because the customer database is large and the target (cancelled) customers make up a small proportion of it. This paper proposes a new data mining approach to binary classification that handles large-scale unbalanced data. Over-sampling, clustering, regularized logistic regression, and boosting are incorporated in the proposed approach. The proposed approach was applied to a real insurance data set, and the results were compared with those of other classification techniques.

Discriminant analysis for unbalanced data using HDBSCAN

  • 이보희;김태헌;최용석
    • 응용통계연구
    • /
    • Vol. 34, No. 4
    • /
    • pp.599-609
    • /
    • 2021
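  • Data in which the number of observations differs greatly between groups are called unbalanced data. In discriminant analysis of unbalanced data, classifying minority-class observations well is more important than classifying majority-class observations well. However, observations from the relatively small minority class are often misclassified into the relatively large majority class. To address this, this study proposes a method that combines HDBSCAN and SMOTE. HDBSCAN is used to remove noise from both the minority and majority classes, and SMOTE is then applied to generate new data. AUC and the F1 score were used to compare performance with existing methods. In most cases the combined HDBSCAN and SMOTE method achieved higher performance measures, indicating that it is an excellent method for classifying unbalanced data.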
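The SMOTE step described in this abstract can be sketched as interpolation between a minority observation and one of its nearest minority neighbors. The sketch below covers only that step; the HDBSCAN noise-removal stage is omitted, and the function names, parameter defaults, and sample points are assumptions for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points by interpolating between
    a randomly chosen minority point and one of its k nearest minority
    neighbors. Assumes at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbors of the chosen point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class observations with two features.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for point in smote(minority, n_new=3, k=2, seed=42):
    print(point)
```

Because every synthetic point lies on a segment between two existing minority points, noisy minority observations would produce noisy synthetic ones, which is one motivation for removing noise before oversampling.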

Dietary Habit and Unbalanced Diet Status of Young Children by Age

  • 정유미
    • 한국식생활문화학회지
    • /
    • Vol. 34, No. 5
    • /
    • pp.587-594
    • /
    • 2019
  • This study investigated the general information, unbalanced diet, and dietary habits of 86 children in Daegu. The research was undertaken to analyze the current state of diet and dietary habits of children, and to provide basic data for nutrition education. The results reveal that younger children have a more unbalanced diet. Children dislike side-dishes the most. Furthermore, due to the longer time taken to consume food, parents persuade children to eat quickly. Children were also determined to have a high intake of foods and drinks containing sugar; beverages containing sugar are consumed 1-2 times a week by 5-year-olds, and once daily by 6- and 7-year-olds. The results of this study can be applied to provide basic data for nutritional education, and assist in the development of dietary programs for young children.

Noninformative Priors for Fieller-Creasy Problem using Unbalanced Data

  • Kim, Dal-Ho;Lee, Woo-Dong;Kang, Sang-Gil
    • 한국데이터정보과학회:학술대회논문집
    • /
    • 한국데이터정보과학회 2005 Autumn Conference
    • /
    • pp.71-84
    • /
    • 2005
  • The Fieller-Creasy problem involves statistical inference about the ratio of two independent normal means. It is a difficult problem from either a frequentist or a likelihood perspective. As an alternative, a Bayesian analysis with noninformative priors may provide a solution to this problem. In this paper, we extend the results of Yin and Ghosh (2001) to the unbalanced sample case. We find various noninformative priors, such as first- and second-order matching priors, reference priors, and Jeffreys' prior. The posterior propriety under the proposed noninformative priors is given. Using real data, we provide illustrative examples. Through a simulation study, we compute the frequentist coverage probabilities for the probability matching and reference priors. Some simulation results are given.

Integrated Partial Sufficient Dimension Reduction with Heavily Unbalanced Categorical Predictors

  • Yoo, Jae-Keun
    • 응용통계연구
    • /
    • Vol. 23, No. 5
    • /
    • pp.977-985
    • /
    • 2010
  • In this paper, we propose an approach to partial sufficient dimension reduction with heavily unbalanced categorical predictors. To this end, we consider integrated categorical predictors and investigate conditions under which the integrated categorical predictor is fully informative for partial sufficient dimension reduction. For illustration, the proposed approach is implemented with optimal partial sliced inverse regression in simulation and data analysis.

RDP: A storage-tier-aware Robust Data Placement strategy for Hadoop in a Cloud-based Heterogeneous Environment

  • Muhammad Faseeh Qureshi, Nawab;Shin, Dong Ryeol
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • Vol. 10, No. 9
    • /
    • pp.4063-4086
    • /
    • 2016
  • Cloud computing is a robust technology that helps resolve many parallel distributed computing issues in the modern Big Data environment. Hadoop is an ecosystem that processes large data sets in a distributed computing environment. HDFS is the filesystem of Hadoop, which processes data blocks across the cluster nodes. Data block placement has become a bottleneck to overall performance in a Hadoop cluster. The current placement policy assumes that all Datanodes have equal computing capacity to process data blocks. This computing capacity includes the availability of the same storage media and the same processing performance on each node. As a result, Hadoop cluster performance is affected by unbalanced workloads, inefficient use of storage tiers, network traffic congestion, and HDFS integrity issues. This paper proposes a storage-tier-aware Robust Data Placement (RDP) scheme, which systematically resolves unbalanced workloads, reduces network congestion to an optimal state, utilizes storage tiers in a useful manner, and minimizes HDFS integrity issues. The experimental results show that the proposed approach reduced the unbalanced workload issue to 72%. Moreover, the presented approach resolves the storage-tier compatibility problem to 81% by predicting storage for block jobs, and improves overall data block placement by 78% through pre-calculated computing capacity allocations and execution of map files over the respective Namenode and Datanodes.

Empirical Statistical Power for Testing Multilocus Genotypic Effects under Unbalanced Designs Using a Gibbs Sampler

  • Lee, Chae-Young
    • Asian-Australasian Journal of Animal Sciences
    • /
    • Vol. 25, No. 11
    • /
    • pp.1511-1514
    • /
    • 2012
  • Epistasis, which may explain a large portion of the phenotypic variation for complex economic traits of animals, has been ignored in many genetic association studies. A Bayesian method was introduced to draw inferences about multilocus genotypic effects based on their marginal posterior distributions obtained by a Gibbs sampler. A simulation study was conducted to estimate statistical power under various unbalanced designs using this method. Data were simulated under combinations of number of loci, within-genotype variance, and sample size in unbalanced designs with or without null combined genotype cells. Mean empirical statistical power was estimated for testing the posterior mean estimate of the combined genotype effect. A practical example of obtaining empirical statistical power estimates for a given sample size was provided under unbalanced designs. The empirical statistical power estimates would be useful for determining an optimal design when interactive associations of multiple loci with complex phenotypes are examined.