• Title/Summary/Keyword: one-class support vector machines

A Hybrid Under-sampling Approach for Better Bankruptcy Prediction (부도예측 개선을 위한 하이브리드 언더샘플링 접근법)

  • Kim, Taehoon;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems / v.21 no.2 / pp.173-190 / 2015
  • The purpose of this study is to improve bankruptcy prediction models by using a novel hybrid under-sampling approach. Most prior studies have tried to enhance the accuracy of bankruptcy prediction models by improving the classification methods involved. In contrast, we focus on appropriate data preprocessing as a means of enhancing accuracy. In particular, we aim to develop an effective sampling approach for bankruptcy prediction, since most prediction models suffer from class imbalance problems. The approach proposed in this study is a hybrid under-sampling method that combines the k-Reverse Nearest Neighbor (k-RNN) and one-class support vector machine (OCSVM) approaches. k-RNN can effectively eliminate outliers, while OCSVM contributes to the selection of informative training samples from majority class data. To validate our proposed approach, we have applied it to data from H Bank's non-external auditing companies in Korea, and compared the performances of the classifiers with the proposed under-sampling and random sampling data. The empirical results show that the proposed under-sampling approach generally improves the accuracy of classifiers, such as logistic regression, discriminant analysis, decision tree, and support vector machines. They also show that the proposed under-sampling approach reduces the risk of false negative errors, which lead to higher misclassification costs.
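The outlier-elimination idea behind k-RNN can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not give the exact criterion, so the rule used here (drop majority-class points that appear in fewer than `min_reverse_neighbors` of the other points' k-nearest-neighbor lists) and all names and data are assumptions. The OCSVM selection step is not shown.

```python
import math

def k_nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding itself."""
    distances = [
        (math.dist(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]

def reverse_neighbor_counts(points, k):
    """For each point, count how many other points list it among their k nearest."""
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in k_nearest(i, points, k):
            counts[j] += 1
    return counts

def drop_outliers(points, k=2, min_reverse_neighbors=1):
    """Keep only points that are a k-nearest neighbor of enough other points.
    The threshold is an assumed, illustrative rule."""
    counts = reverse_neighbor_counts(points, k)
    return [p for p, c in zip(points, counts) if c >= min_reverse_neighbors]

# Made-up majority-class samples: a tight cluster plus one isolated point.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
kept = drop_outliers(majority, k=2)
print(kept)
```

The isolated point has neighbors of its own, but no cluster point counts it among its two nearest, so its reverse-neighbor count is zero and it is dropped.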

Emotion Classification Using EEG Spectrum Analysis and Bayesian Approach (뇌파 스펙트럼 분석과 베이지안 접근법을 이용한 정서 분류)

  • Chung, Seong Youb;Yoon, Hyun Joong
    • Journal of Korean Society of Industrial and Systems Engineering / v.37 no.1 / pp.1-8 / 2014
  • This paper proposes an emotion classifier for EEG signals based on Bayes' theorem and machine learning with a perceptron convergence algorithm. Emotions are represented on the valence and arousal dimensions. Fast Fourier transform spectrum analysis is used to extract features from the EEG signals. To verify the proposed method, we use an open database for emotion analysis using physiological signals (DEAP) and compare the method with C-SVC, which is one type of support vector machine. An emotion is defined as a two-level class and as a three-level class in both the valence and arousal dimensions. For the two-level class case, the accuracy of the valence and arousal estimation is 67% and 66%, respectively. For the three-level class case, the accuracy is 53% and 51%, respectively. Compared with the best case of the C-SVC, the proposed classifier gave 4% and 8% more accurate estimations of valence and arousal for the two-level class. In the estimation of the three-level class, the proposed method showed performance similar to the best case of the C-SVC.

Performance Evaluation of One Class Classification to detect anomalies of NIDS (NIDS의 비정상 행위 탐지를 위한 단일 클래스 분류성능 평가)

  • Seo, Jae-Hyun
    • Journal of the Korea Convergence Society / v.9 no.11 / pp.15-21 / 2018
  • In this study, we try to detect anomalies in a network intrusion detection system by learning only one class. We use the KDD CUP 1999 dataset, an intrusion detection dataset, to evaluate classification performance. One-class classification is an unsupervised learning method that identifies the attack class by learning only the normal class. With unsupervised learning, it is difficult to achieve relatively high classification efficiency because negative instances are not used for learning. However, unsupervised learning has the advantage of being able to classify unlabeled data. In this study, we use one-class classifiers based on support vector machines and density estimation to detect new unknown attacks. In the tests, the classifier based on density estimation showed relatively better performance, with a detection rate of about 96% while maintaining a low FPR for the new attacks.
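A density-based one-class classifier of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions, not the classifier used in the study: it fits an independent (diagonal) Gaussian to normal-class rows only and flags a row as anomalous when its log-density falls below a threshold taken from the training scores. The feature values, the `false_positive_rate` parameter, and all function names are hypothetical.

```python
import math
import statistics

def fit_diagonal_gaussian(rows):
    """Estimate per-feature mean and standard deviation from normal-class rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against zero variance so the log-density stays finite.
    stdevs = [statistics.pstdev(col) or 1e-9 for col in columns]
    return means, stdevs

def log_density(row, means, stdevs):
    """Log-density of one row under an independent Gaussian per feature."""
    total = 0.0
    for x, mu, sigma in zip(row, means, stdevs):
        z = (x - mu) / sigma
        total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)
    return total

def fit_one_class(normal_rows, false_positive_rate=0.05):
    """Fit on normal data only; roughly `false_positive_rate` of the
    training rows score below the chosen threshold."""
    means, stdevs = fit_diagonal_gaussian(normal_rows)
    scores = sorted(log_density(r, means, stdevs) for r in normal_rows)
    threshold = scores[int(false_positive_rate * len(scores))]
    return means, stdevs, threshold

def is_anomaly(row, model):
    means, stdevs, threshold = model
    return log_density(row, means, stdevs) < threshold

# Made-up two-feature "normal traffic" rows, for illustration only.
normal = [(1.0 + 0.1 * (i % 5), 10.0 + 0.5 * (i % 7)) for i in range(100)]
model = fit_one_class(normal)
print(is_anomaly((1.2, 11.0), model))  # a row close to the training data
print(is_anomaly((9.0, 90.0), model))  # a row far from the training data
```

No attack examples are needed for training, which is the point of the one-class setting; the cost is that the threshold has to be chosen from normal data alone.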

The Unified Framework for AUC Maximizer

  • Jun, Jong-Jun;Kim, Yong-Dai;Han, Sang-Tae;Kang, Hyun-Cheol;Choi, Ho-Sik
    • Communications for Statistical Applications and Methods / v.16 no.6 / pp.1005-1012 / 2009
  • The area under the curve (AUC) is commonly used as a summary measure of the receiver operating characteristic (ROC) curve, which displays the performance of a set of binary classifiers for all feasible ratios of the costs associated with the true positive rate (TPR) and false positive rate (FPR). In the bipartite ranking problem, where one has to compare two different observations and decide which one is "better", the AUC measures the probability that the ranking score of a randomly chosen sample in one class is larger than that of a randomly chosen sample in the other class. Hence, the function that maximizes the AUC of a bipartite ranking problem differs from the function that maximizes the accuracy (minimizes the misclassification error rate) of a binary classification problem. In this paper, we develop a unified framework for AUC maximizers that includes support vector machines, based on large-margin maximization, and logistic regression, based on estimating the posterior probability. Moreover, we develop an efficient algorithm for the proposed unified framework. Numerical results show that the proposed unified framework can treat various methodologies successfully.
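The pairwise reading of the AUC, and the way it differs from accuracy, can be illustrated with a small sketch. The scores and labels are made up, and the function names are illustrative; this is not the paper's framework or algorithm.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Share of samples classified correctly when scoring >= threshold means class 1."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
halved = [s / 2 for s in scores]  # same ranking, different scale

print(pairwise_auc(scores, labels), accuracy(scores, labels))
print(pairwise_auc(halved, labels), accuracy(halved, labels))
```

Halving every score leaves the ranking, and therefore the AUC, unchanged, while accuracy at a fixed threshold drops. This is one way to see why a scoring function that is optimal for one criterion need not be optimal for the other.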

A STUDY ON SPATIAL FEATURE EXTRACTION IN THE CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGERY

  • Han, You-Kyung;Kim, Hye-Jin;Choi, Jae-Wan;Kim, Yong-Il
    • Proceedings of the KSRS Conference / 2008.10a / pp.361-364 / 2008
  • It is well known that combining spatial and spectral information can improve land use classification from satellite imagery. High spatial resolution classification has a limitation when only using the spectral information due to the complex spatial arrangement of features and spectral heterogeneity within each class. Therefore, extracting the spatial information is one of the most important steps in high resolution satellite image classification. In this paper, we propose a new spatial feature extraction method. The extracted features are integrated with spectral bands to improve overall classification accuracy. The classification is achieved by applying a Support Vector Machines classifier. In order to evaluate the proposed feature extraction method, we applied our approach to KOMPSAT-2 data and compared the result with the other methods.

One-class Least Square Support Vector Machines (단일부류 최소제곱 서포트 벡터 머신)

  • 우상호;이성환
    • Proceedings of the Korean Information Science Society Conference / 2002.10d / pp.559-561 / 2002
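  • Support vector machines perform well on a variety of pattern recognition problems, such as face recognition and character recognition. However, they have several drawbacks related to the quadratic programming (QP) problem. In general, solving a large-scale QP problem requires a high computational cost, and implementing a QP-based system effectively is not easy. In addition, handling input and output is difficult when processing large-scale data. To overcome these drawbacks, this paper solves the one-class problem on the basis of least squares support vector machines. The proposed method is an algorithm that formulates the one-class problem without solving a QP problem and uses the least squares method instead. The proposed method is simple and reduces the computational cost. We also extended it to the support vector domain description and implemented it as a system of linear equations. To demonstrate the efficiency of the proposed method, we applied it to face verification in the pattern recognition field and to prostate cancer classification in the bioinformatics field. Our experimental results show adequate performance and a good Equal Error Rate (EER). The proposed method improves the efficiency of classifying unknown objects and is expected to be directly applicable to real-time applications.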

The Prediction of DEA based Efficiency Rating for Venture Business Using Multi-class SVM (다분류 SVM을 이용한 DEA기반 벤처기업 효율성등급 예측모형)

  • Park, Ji-Young;Hong, Tae-Ho
    • Asia pacific journal of information systems / v.19 no.2 / pp.139-155 / 2009
  • For the last few decades, many studies have tried to explore and unveil venture companies' success factors and unique features in order to identify the sources of such companies' competitive advantages over their rivals. Such venture companies have tended to give investors high returns, generally by making the best use of information technology. For this reason, many venture companies are keen on attracting avid investors' attention. Investors generally make their investment decisions by carefully examining the evaluation criteria of the alternatives. To them, credit rating information provided by international rating agencies, such as Standard and Poor's, Moody's and Fitch, is a crucial source on such pivotal concerns as companies' stability, growth, and risk status. But this type of information is generated only for companies issuing corporate bonds, not for venture companies. Therefore, this study proposes a method for evaluating venture businesses by presenting our recent empirical results using financial data of Korean venture companies listed on KOSDAQ in the Korea Exchange. In addition, this paper uses multi-class SVM to predict the DEA-based efficiency rating for venture businesses, which is derived from our proposed method. Our approach sheds light on ways to locate efficient companies generating high levels of profit. Above all, in determining effective ways to evaluate a venture firm's efficiency, it is important to understand the major contributing factors of such efficiency. Therefore, this paper is built on the following two ideas for classifying which venture companies are more efficient: i) making a DEA-based multi-class rating for sample companies and ii) developing a multi-class SVM-based efficiency prediction model for classifying all companies.
First, Data Envelopment Analysis (DEA) is a non-parametric multiple input-output efficiency technique that measures the relative efficiency of decision making units (DMUs) using a linear programming based model. It is non-parametric because it requires no assumption on the shape or parameters of the underlying production function. DEA has already been widely applied for evaluating the relative efficiency of DMUs. Recently, a number of DEA-based studies have evaluated the efficiency of various types of companies, such as internet companies and venture companies. It has also been applied to corporate credit ratings. In this study we utilized DEA for sorting venture companies by efficiency-based ratings. The Support Vector Machine (SVM), on the other hand, is a popular technique for solving data classification problems. In this paper, we employed SVM to classify the efficiency ratings of IT venture companies according to the results of DEA. The SVM method was first developed by Vapnik (1995). As one of many machine learning techniques, SVM is based on statistical theory. Thus far, the method has shown good performance, especially in generalization capacity in classification tasks, resulting in numerous applications in many areas of business. SVM is basically an algorithm that finds the maximum margin hyperplane, which gives the maximum separation between classes. The support vectors are the training points closest to the maximum margin hyperplane. If the classes cannot be separated linearly, a kernel function can be used. In the case of nonlinear class boundaries, the inputs are transformed into a high-dimensional feature space; that is, the original input space is mapped into a high-dimensional dot-product space. Many studies have applied SVM to the prediction of bankruptcy, the forecasting of financial time series, and the problem of estimating credit ratings. In this study we employed SVM to develop a data mining-based efficiency prediction model.
We used the Gaussian radial basis function as the kernel function of SVM. For multi-class SVM, we adopted the one-against-one approach, a binary classification method, and two all-together methods, proposed by Weston and Watkins (1999) and Crammer and Singer (2000), respectively. In this research, we used corporate information of 154 companies listed on the KOSDAQ market in the Korea Exchange. We obtained the companies' financial information for 2005 from KIS (Korea Information Service, Inc.). Using these data, we made a multi-class rating with DEA efficiency and built a multi-class prediction model based on data mining. Among the three multi-classification methods, the hit ratio of the Weston and Watkins method was the best in the test data set. In multi-classification problems such as efficiency ratings of venture businesses, it is very useful for investors to know the class within a one-class error when it is difficult to find the exact class in the actual market. So we presented accuracy results within 1-class errors, and the Weston and Watkins method showed 85.7% accuracy in our test samples. We conclude that the DEA-based multi-class approach in venture business generates more information than the binary classification problem, notwithstanding its efficiency level. We believe this model can help investors in decision making as it provides a reliable tool to evaluate venture companies in the financial domain. For future research, we perceive the need to enhance such areas as the variable selection process, the parameter selection of the kernel function, the generalization, and the sample size of the multi-class data.
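The "accuracy within 1-class errors" measure mentioned above can be computed as in the following sketch. It assumes that the efficiency ratings are ordered and encoded as consecutive integers; the ratings shown are made up and are not the study's data.

```python
def hit_ratio(actual, predicted, tolerance=0):
    """Share of cases whose predicted rating is within `tolerance` grades
    of the actual rating (tolerance=0 gives the exact hit ratio)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)

# Hypothetical ordered ratings (1 = least efficient, 4 = most efficient).
actual = [1, 2, 3, 4, 2, 3, 1, 4]
predicted = [1, 3, 3, 2, 2, 4, 1, 4]

print(hit_ratio(actual, predicted))               # exact matches only
print(hit_ratio(actual, predicted, tolerance=1))  # allows a one-grade miss
```

A one-grade tolerance only makes sense when the classes are ordinal, as ratings are; for unordered classes the difference between labels carries no meaning.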

Ensemble Learning with Support Vector Machines for Bond Rating (회사채 신용등급 예측을 위한 SVM 앙상블학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems / v.18 no.2 / pp.29-45 / 2012
  • Bond rating is regarded as an important event for measuring the financial risk of companies and for determining the investment returns of investors. As a result, predicting companies' credit ratings by applying statistical and machine learning techniques has been a popular research topic. Statistical techniques, including multiple regression, multiple discriminant analysis (MDA), logistic models (LOGIT), and probit analysis, have traditionally been used in bond rating. However, one major drawback is that they rely on strict assumptions. Such strict assumptions include linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables and the predictor variables. These strict assumptions of traditional statistics have limited their application to the real world. Machine learning techniques used in bond rating prediction models include decision trees (DT), neural networks (NN), and the Support Vector Machine (SVM). In particular, SVM is recognized as a new and promising classification and regression analysis method. SVM learns a separating hyperplane that maximizes the margin between two categories. SVM is simple enough to be analyzed mathematically, and leads to high performance in practical applications. SVM implements the structural risk minimization principle and seeks to minimize an upper bound of the generalization error. In addition, the solution of SVM may be a global optimum, and thus overfitting is unlikely to occur with SVM. Furthermore, SVM does not require many data samples for training since it builds prediction models using only some representative samples near the boundaries, called support vectors. A number of experimental studies have indicated that SVM has been successfully applied in a variety of pattern recognition fields. However, there are three major drawbacks that can be potential causes of degraded SVM performance.
First, SVM was originally proposed for solving binary-class classification problems. Methods for combining SVMs for multi-class classification, such as One-Against-One and One-Against-All, have been proposed, but they do not improve performance in multi-class classification problems as much as SVM does for binary-class classification. Second, approximation algorithms (e.g. decomposition methods, the sequential minimal optimization algorithm) could be used to reduce computation time for multi-class problems, but they could deteriorate classification performance. Third, a difficulty in multi-class prediction problems is the data imbalance problem, which can occur when the number of instances in one class greatly outnumbers the number of instances in another class. Such data sets often cause a default classifier to be built due to a skewed boundary, which reduces the classification accuracy of the classifier. SVM ensemble learning is one of the machine learning methods for coping with the above drawbacks. Ensemble learning is a method for improving the performance of classification and prediction algorithms. AdaBoost is one of the most widely used ensemble learning techniques. It constructs a composite classifier by sequentially training classifiers while increasing the weight on the misclassified observations through iterations. The observations that are incorrectly predicted by previous classifiers are chosen more often than examples that are correctly predicted. Thus boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. In this way, it can reinforce the training of the misclassified observations of the minority class. This paper proposes a multiclass Geometric Mean-based Boosting (MGM-Boost) to resolve the multiclass prediction problem.
Since MGM-Boost introduces the notion of the geometric mean into AdaBoost, its learning process can take into account the geometric mean-based accuracy and errors across multiple classes. This study applies MGM-Boost to a real-world bond rating case for Korean companies to examine the feasibility of MGM-Boost. 10-fold cross-validation is performed three times with different random seeds in order to ensure that the comparison among the three different classifiers does not happen by chance. For each 10-fold cross-validation, the entire data set is first partitioned into ten equal-sized sets, and then each set is in turn used as the test set while the classifier trains on the other nine sets. That is, the cross-validated folds have been tested independently for each algorithm. Through these steps, we obtained results for the classifiers on each of the 30 experiments. In the comparison of arithmetic mean-based prediction accuracy between individual classifiers, MGM-Boost (52.95%) shows higher prediction accuracy than both AdaBoost (51.69%) and SVM (49.47%). MGM-Boost (28.12%) also shows higher prediction accuracy than AdaBoost (24.65%) and SVM (15.42%) in terms of geometric mean-based prediction accuracy. A t-test is used to examine whether the performance of each classifier over the 30 folds is significantly different. The results indicate that the performance of MGM-Boost is significantly different from that of the AdaBoost and SVM classifiers at the 1% level. These results mean that MGM-Boost can provide robust and stable solutions to multi-class problems such as bond rating.
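The gap between arithmetic and geometric mean-based accuracy on imbalanced classes can be shown with a short sketch. The abstract does not define the geometric measure, so this example assumes a common definition, the geometric mean of the per-class recalls; the labels and predictions are made up, and this is not the MGM-Boost algorithm itself.

```python
import math
from collections import Counter

def per_class_recall(actual, predicted):
    """Fraction of each class's samples that were predicted correctly."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}

def arithmetic_accuracy(actual, predicted):
    """Plain accuracy: share of all samples predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def geometric_mean_accuracy(actual, predicted):
    """Geometric mean of per-class recalls (an assumed definition)."""
    recalls = list(per_class_recall(actual, predicted).values())
    return math.prod(recalls) ** (1.0 / len(recalls))

# Imbalanced toy data: eight samples of class "A", two of class "B".
actual = ["A"] * 8 + ["B"] * 2
partly_right = ["A"] * 8 + ["A", "B"]  # finds one of the two "B" samples
always_a = ["A"] * 10                  # ignores the minority class entirely

print(arithmetic_accuracy(actual, partly_right), geometric_mean_accuracy(actual, partly_right))
print(arithmetic_accuracy(actual, always_a), geometric_mean_accuracy(actual, always_a))
```

A classifier that ignores the minority class still reaches 80% arithmetic accuracy here, but its geometric mean-based accuracy is zero, which is why the geometric measure is more sensitive to class imbalance.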

Feature selection for text data via topic modeling (토픽 모형을 이용한 텍스트 데이터의 단어 선택)

  • Woosol, Jang;Ye Eun, Kim;Won, Son
    • The Korean Journal of Applied Statistics / v.35 no.6 / pp.739-754 / 2022
  • Usually, text data consists of many variables, and some of them are closely correlated. Such multi-collinearity often results in inefficient or inaccurate statistical analysis. For supervised learning, one can select features by examining the relationship between target variables and explanatory variables. On the other hand, for unsupervised learning, since target variables are absent, one cannot use such a feature selection procedure as in supervised learning. In this study, we propose a word selection procedure that employs topic models to find latent topics. We substitute topics for the target variables and select terms which show high relevance for each topic. Applying the procedure to real data, we found that the proposed word selection procedure can give clear topic interpretation by removing high-frequency words prevalent in various topics. In addition, we observed that, by applying the selected variables to the classifiers such as naïve Bayes classifiers and support vector machines, the proposed feature selection procedure gives results comparable to those obtained by using class label information.

A Study on Optimal Shape-Size Index Extraction for Classification of High Resolution Satellite Imagery (고해상도 영상의 분류결과 개선을 위한 최적의 Shape-Size Index 추출에 관한 연구)

  • Han, You-Kyung;Kim, Hye-Jin;Choi, Jae-Wan;Kim, Yong-Il
    • Korean Journal of Remote Sensing / v.25 no.2 / pp.145-154 / 2009
  • High spatial resolution satellite image classification is limited when only spectral information is used, due to the complex spatial arrangement of features and the spectral heterogeneity within each class. Therefore, the extraction of spatial information is one of the most important steps in high resolution satellite image classification. This study proposes a new spatial feature extraction method, named SSI (Shape-Size Index). SSI uses a simple region-growing based image segmentation and allocates a spatial property value to each segment. The extracted feature is integrated with spectral bands to improve overall classification accuracy. The classification is achieved by applying an SVM (Support Vector Machine) classifier. In order to evaluate the proposed feature extraction method, KOMPSAT-2 and QuickBird-2 data are used for experiments. It is demonstrated that the proposed SSI algorithm leads to a notable increase in classification accuracy.