• Title/Abstract/Keywords: Classification Variables

Search results: 920 (processing time 0.024 s)

Classification of Imbalanced Data Based on MTS-CBPSO Method: A Case Study of Financial Distress Prediction

  • Gu, Yuping;Cheng, Longsheng;Chang, Zhipeng
    • Journal of Information Processing Systems / Vol. 15, No. 3 / pp. 682-693 / 2019
  • Traditional classification methods mostly assume that the class distribution is balanced, whereas imbalanced data are common in the real world, so classification with imbalanced data is an important problem. In the Mahalanobis-Taguchi system (MTS) algorithm, the classification model is built from a reference space and a measurement scale that come from a single normal group, which makes it suitable for imbalanced data. In this paper, an improved method, MTS-CBPSO, is constructed by using chaotic mapping and a binary particle swarm optimization algorithm, instead of an orthogonal array and the signal-to-noise ratio (SNR), to select the valid variables, with G-means, F-measure, and dimensionality reduction as the optimization targets. The proposed method is applied to financial distress prediction for Chinese listed companies. Compared with the traditional MTS and common classification methods such as SVM, C4.5, and k-NN, the MTS-CBPSO method achieves better prediction accuracy and dimensionality reduction.
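
The G-means and F-measure targets mentioned in this abstract can be computed from a binary confusion matrix. The following is a minimal Python sketch using the usual definitions (G-mean as the geometric mean of sensitivity and specificity, F-measure as the harmonic mean of precision and recall), which are assumed here rather than taken from the paper; the function name `imbalance_metrics` and the example counts are hypothetical.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Return (g_mean, f_measure) for a binary confusion matrix.

    tp/fn count the minority (positive) class, fp/tn the majority class.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return g_mean, f_measure

# Hypothetical counts: 10 distressed firms (positives), 90 healthy firms.
g, f = imbalance_metrics(tp=8, fn=2, fp=9, tn=81)

# A classifier that labels every firm as healthy scores 90% accuracy here,
# yet both imbalance-aware metrics drop to zero.
g_naive, f_naive = imbalance_metrics(tp=0, fn=10, fp=0, tn=90)
```

Both metrics penalize a model that ignores the minority class, which is why they tend to be preferred over plain accuracy on imbalanced data.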

A Classification Model for Predicting the Injured Body Part in Construction Accidents in Korea

  • Lim, Jiseon;Cho, Sungjin;Kang, Sanghyeok
    • 국제학술발표논문집 (International Conference Proceedings) / The 9th International Conference on Construction Engineering and Project Management / pp. 230-237 / 2022
  • Industrial accidents in the construction industry are difficult to predict because many factors, both human-related and environment-related, affect them. Many studies have analyzed the severity of injuries and the types of accidents; however, few have addressed the prediction of injured body parts. This study aims to develop a classification model that predicts the injured body part from accident-related factors. Construction accident cases from June 2018 to July 2021, provided by the Korea Construction Safety Management Integrated Information, were collected through web crawling and then preprocessed. A naïve Bayes classifier, a supervised learning algorithm, was used to build a classification model of the injured body part with four categories: 1) torso, 2) upper extremity, 3) head, and 4) lower extremity. The predictor variables are accident type, type of work, facility type, injury source, and activity type. The average accuracy across injured body parts was 50.4%. Accuracy for the upper extremity and lower extremity was higher than for the torso and head. Unlike in other classification tasks, such as spam mail filtering, a naïve Bayes classifier does not perform well on construction accidents. The reasons are discussed in the study. Based on the results of this study, more detailed guidelines for construction safety management can be provided, which help establish safety measures at the construction site.
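
A naïve Bayes classifier over categorical predictors, as described in this abstract, can be sketched in a few lines. This is an illustrative sketch only: the function names, the Laplace smoothing parameter `alpha`, and the five toy records are assumptions, and only two of the five predictors (accident type, type of work) are used.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model with Laplace smoothing (alpha)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            vocab[j].add(value)
    return class_counts, value_counts, vocab, alpha

def predict(model, row):
    """Return the label with the highest posterior log-probability."""
    class_counts, value_counts, vocab, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for j, value in enumerate(row):
            count = value_counts[(j, label)][value]
            # Smoothed conditional probability P(value | label)
            score += math.log((count + alpha) / (n + alpha * len(vocab[j])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical records: (accident type, type of work) -> injured body part.
rows = [("fall", "scaffolding"), ("fall", "scaffolding"),
        ("struck", "rebar work"), ("struck", "rebar work"),
        ("fall", "rebar work")]
labels = ["lower extremity", "lower extremity", "head", "head", "lower extremity"]

model = train_naive_bayes(rows, labels)
pred_struck = predict(model, ("struck", "rebar work"))
pred_fall = predict(model, ("fall", "scaffolding"))
```

The independence assumption between predictors is what keeps the model this small; it is also one plausible reason for weak performance when accident factors are strongly interrelated.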

Finding the Optimal Data Classification Method Using LDA and QDA Discriminant Analysis

  • Kim, SeungJae;Kim, SungHwan
    • 통합자연과학논문집 (Journal of Integrative Natural Science) / Vol. 13, No. 4 / pp. 132-140 / 2020
  • With the recent introduction of artificial intelligence (AI) technology, the use of data and the volume of newly generated data are both increasing rapidly. To obtain meaningful analysis results from these data, the first step is to classify the data well. However, if only a single machine learning classification technique is applied, overfitting errors can occur. To reduce or minimize problems caused by misclassification, such as overfitting, an optimal classification should be derived by applying several classification techniques and comparing their results. Interpreting data with only one classification technique leads to poor reasoning and poor predictions. This study seeks a method for optimally classifying data, as a step preceding data analysis, by viewing the data from various perspectives and applying linear and nonlinear classification techniques such as LDA and QDA. To obtain reliable and refined statistics from big data analysis, the meaning of each variable and the correlations between variables must be analyzed. If the data are classified in a way that does not match the hypothesis being tested, the results will be unreliable even if the analysis itself is performed well. In other words, before big data analysis, the data must be classified to suit the purpose of the analysis. This step must be performed before the data are analyzed, and it may serve as a method of optimal data classification.

산지 경계 추출을 위한 지형학적 변수 선정과 알고리즘 개발 (A Study on the Development of Topographical Variables and Algorithm for Mountain Classification)

  • 최정선;장효진;심우진;안유순;신혜섭;이승진;박수진
    • 한국지형학회지 (Journal of the Korean Geomorphological Association) / Vol. 25, No. 3 / pp. 1-18 / 2018
  • In Korea, 64% of the land is said to be mountainous, but the definition and classification standard of a mountain are not clear. Demand for the use and development of mountain areas is increasing. Under these conditions, the unclear definition and extent of mountain areas can lead to the destruction of mountains and to more disasters through indiscriminate approval of forestland conversion. This study therefore used a questionnaire survey and terrain analysis to identify the variables and criteria for extracting mountain boundaries, and developed a mountain boundary extraction algorithm that classifies topographic mountains using the selected variables. As a result, 72.1% of the total land was classified as mountain area. For three catchment areas with different mountain area ratios, we compared the results with existing data, namely the forestland map and the cadastral map, and confirmed differences in the boundaries and distribution of mountains. In the catchment dominated by mountainous terrain, the algorithm-based classification yielded a wider mountain area than the mountain or forest areas of the two maps. In the catchment dominated by non-mountainous terrain, on the other hand, the algorithm-based results yielded a lower mountain area ratio than the two maps. The two maps showed a distribution of fragmented mountains, but these areas were classified as non-mountain in the algorithm-based results. We concluded that this result stems from the algorithm, so the algorithm needs further refinement. Nevertheless, the algorithm can determine, for each watershed, the topographic variables and optimal values that distinguish mountain areas. The results are significant in that mountain boundaries were extracted with regional differences in mountain topography taken into account. This study will help establish policies for stable mountain management.

은하벌지에서 발견된 OGLE 변광성의 분류 (CLASSIFICATION OF OGLE VARIABLES IN GALACTIC BULGE)

  • 강영운
    • Journal of Astronomy and Space Sciences / Vol. 14, No. 2 / pp. 207-215 / 1997
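  • To characterize the variables that the OGLE project, which searches for dark matter in our Galaxy, discovered in Baade's Window and left unclassified as miscellaneous variables, we examined their evolutionary status, the shapes of their light curves, and the amplitude of light variation as a function of orbital period. Excluding the miscellaneous variables, most of the stars have left the main sequence and are subgiants or giants; the amplitude of light variation decreases sharply for periods shorter than 30 days and gradually for periods longer than 30 days. After comparing the light-curve shapes with those of RS CVn-type stars in the solar neighborhood, we classified the OGLE miscellaneous variables as RS CVn-type stars.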

나무구조의 분류분석에서 변수 중요도에 대한 고찰 (Comparison of Variable Importance Measures in Tree-based Classification)

  • 김나영;이은경
    • 응용통계연구 (The Korean Journal of Applied Statistics) / Vol. 27, No. 5 / pp. 717-729 / 2014
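  • This study examines variable importance, an issue of growing significance in tree-structured classification as data sets become larger, with a focus on the projection pursuit classification tree. At each node, the projection pursuit classification tree uses a linear combination of variables, found by projection pursuit, that separates the groups well; the projection coefficients therefore carry information about the classification at each node. Aggregating them gives the importance of each variable for classification. We first compute variable importance for variable selection from the projection pursuit coefficients obtained during the classification process and examine its properties, and then compare the results with those of CART and random forests, which are tree-based methods of the same type. For most data sets, the projection pursuit classification tree performed much better; in particular, when highly correlated variables were included, it classified well with a relatively small number of variables. The variable importance provided by random forests differs greatly from that of the projection pursuit classification tree when the variables are highly correlated, and the variable importance of the projection pursuit classification tree performs somewhat better.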

Assessing Misdiagnosis of Relapse in Patients with Gastric Cancer in Iran Cancer Institute Based on a Hidden Markov Multi-state Model

  • Zare, Ali;Mahmoodi, Mahmood;Mohammad, Kazem;Zeraati, Hojjat;Hosseini, Mostafa;Naieni, Kourosh Holakouie
    • Asian Pacific Journal of Cancer Prevention / Vol. 15, No. 9 / pp. 4109-4115 / 2014
  • Background: Accurate assessment of disease progression requires a proper understanding of the natural disease process, which is often hidden and unobservable. For this purpose, disease status should be clearly detected, but in most diseases this is not possible. This study therefore aims to present a model that both investigates the unobservable disease process and accounts for the probability of error in diagnosing disease states. Materials and Methods: Data from 330 patients with gastric cancer who underwent surgery at the Iran Cancer Institute from 1995 to 1999 were analyzed. A hidden Markov multi-state model was employed to estimate and assess the effect of demographic, diagnostic, and clinical factors, as well as medical and post-surgical variables, on transition rates and on the probability of misdiagnosis of relapse. Results: Classification errors of patients in the alive state without a relapse ($e_{21}$) and with a relapse ($e_{12}$) were 0.22 (95% CI: 0.04-0.63) and 0.02 (95% CI: 0.00-0.09), respectively. Only age and number of renewed treatments affected misdiagnosis of relapse. In addition, patient age and distant metastasis were among the factors affecting the occurrence of relapse (state 1 $\rightarrow$ state 2), while the number of renewed treatments and the type and extent of surgery had a significant effect on death hazard without relapse (state 1 $\rightarrow$ state 3) and death hazard with relapse (state 2 $\rightarrow$ state 3). Conclusions: A hidden Markov multi-state model makes it possible to estimate classification error between different disease states. Moreover, based on this model, factors affecting the probability of this error can be identified, helping researchers understand the mechanisms of classification error.
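
To illustrate what a classification error probability does in a hidden Markov setting, the sketch below propagates a true relapse rate through a two-state misclassification model. The reading of the subscripts is an assumption made here, not something the abstract states: `e12` is taken as the probability of recording a relapse when there is none, and `e21` as the probability of recording no relapse when there is one. The function name and the 50% true relapse rate are hypothetical.

```python
def observed_relapse_rate(true_relapse_rate, e12=0.02, e21=0.22):
    """Share of patients recorded as relapsed, given misclassification.

    Assumed reading of the subscripts (not confirmed by the abstract):
      e12 = P(recorded relapse    | truly no relapse)
      e21 = P(recorded no relapse | truly relapsed)
    """
    detected = true_relapse_rate * (1 - e21)       # true relapses recorded
    false_alarms = (1 - true_relapse_rate) * e12   # non-relapses misrecorded
    return detected + false_alarms

# Hypothetical cohort in which half the patients have truly relapsed.
observed = observed_relapse_rate(0.5)
```

Under these assumptions, the recorded relapse rate understates the true one, which is the kind of bias a hidden Markov multi-state model is meant to correct.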

기상레이더를 이용한 최적화된 Type-2 퍼지 RBFNN 에코 패턴분류기 설계 (Design of Optimized Type-2 Fuzzy RBFNN Echo Pattern Classifier Using Meterological Radar Data)

  • 송찬석;이승철;오성권
    • 전기학회논문지 (The Transactions of the Korean Institute of Electrical Engineers) / Vol. 64, No. 6 / pp. 922-934 / 2015
  • In this paper, precipitation echo (PRE) and non-precipitation echo (N-PRE), which includes ground echo and clear echo, are classified from weather radar data using a neuro-fuzzy algorithm. To distinguish PRE from N-PRE, input variables are constructed through characteristic analysis of the radar data. First, an event classifier, the first classification step, is designed to separate precipitation events from non-precipitation events using RBFNN input variables such as DZ, DZ of Frequency (DZ_FR), SDZ, SDZ of Frequency (SDZ_FR), VGZ, and VGZ of Frequency (VGZ_FR). After event classification, the non-precipitation echo remaining within precipitation events is completely removed by the echo classifier, the second classification step, which is built as Type-2 FCM based RBFNNs. The parameters of the classification system are tuned with PSO (Particle Swarm Optimization) for effective performance. The results of the proposed echo classifier are compared with CZ. The proposed architecture, which combines the event classifier with the echo classifier based on Interval Type-2 FCM based RBFNN, outperforms the conventional RBFNN-based echo classifier.

요인 및 군집분석을 이용한 지상 라이다 자료의 분류 (Classification of Terrestrial LiDAR Data Using Factor and Cluster Analysis)

  • 최승필;조지현;김열;김준성
    • 대한공간정보학회지 (Journal of the Korean Society for Geospatial Information Science) / Vol. 19, No. 4 / pp. 139-144 / 2011
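  • This study presents a classification method for terrestrial LiDAR data that uses the color information (R, G, B) and the reflectance intensity information (I) obtained from the data together, analyzing their interrelationship with statistical classification techniques. To this end, factors that maximize the variance were first extracted from the variables R, G, B, and I, and a factor matrix relating the principal factors to each variable was computed. Although the factor matrix presents the raw data in reduced form, it does not show clearly which variables are strongly related to which factors, so a rotated factor matrix was obtained with the Varimax method, an orthogonal rotation, and factor scores were computed. Cluster analysis was then performed on these factor scores using K-means, a non-hierarchical clustering method, and the classification accuracy of the terrestrial LiDAR data was evaluated.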
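
The final step of this abstract, K-means clustering applied to factor scores, can be sketched as follows. This is a plain implementation of Lloyd's algorithm and not the authors' code; the function name `k_means`, the seeding with the first `k` points (chosen here only to keep the run deterministic), and the toy factor scores are assumptions.

```python
def k_means(points, k, iterations=100):
    """Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members; keep it if empty.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical two-factor scores for five LiDAR points.
scores = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
cluster_ids, centers = k_means(scores, k=2)
```

Clustering the rotated factor scores rather than the raw R, G, B, and I values means that correlated channels do not dominate the distance computation.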

불균형 자료의 분류분석을 위한 가중 L1-norm SVM (Weighted L1-Norm Support Vector Machine for the Classification of Highly Imbalanced Data)

  • 김은경;전명식;방성완
    • 응용통계연구 (The Korean Journal of Applied Statistics) / Vol. 28, No. 1 / pp. 9-21 / 2015
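  • SVMs are widely used for classification in many fields because of their high classification accuracy and flexibility. In the classification of imbalanced data, however, where group sizes differ, the SVM estimates a classification function that is biased toward the majority group, so classification accuracy for the minority group drops severely. A weighted $L_2$-norm SVM that applies different misclassification costs to each group has been developed for imbalanced data, but because it uses a ridge-type penalty function, it is not efficient at removing unnecessary noise variables when estimating the classification function. This paper therefore proposes a weighted $L_1$-norm SVM that uses a lasso-type penalty function and assigns differential misclassification costs to the training observations, which gives it a variable selection capability in the classification of imbalanced data. Simulations and real-data analysis confirmed the efficient performance and usefulness of the proposed method.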
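
For orientation, a weighted $L_1$-norm SVM of the kind described in this abstract is commonly written as a cost-weighted hinge loss plus a lasso penalty. The formulation below is a generic sketch under that assumption; the symbols ($w_i$, $\lambda$, $\beta_0$, $\beta$) and the exact weighting scheme are not taken from the paper and may differ from it.

```latex
\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n} w_i \,\bigl[\,1 - y_i\,(\beta_0 + x_i^{\top}\beta)\,\bigr]_{+}
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
\qquad y_i \in \{-1, +1\},\quad [u]_{+} = \max(u, 0)
```

Here $w_i$ is a misclassification cost that is larger for minority-class observations, and the $L_1$ penalty can shrink some coefficients $\beta_j$ exactly to zero, which is what yields variable selection. Replacing the penalty with $\lambda \sum_j \beta_j^2$ gives the ridge-type ($L_2$-norm) counterpart, which shrinks coefficients but does not set them to zero.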