• Title/Summary/Keyword: wrapper based feature selection

Search Result 21, Processing Time 0.027 seconds

The Credit Information Feature Selection Method in Default Rate Prediction Model for Individual Businesses (개인사업자 부도율 예측 모델에서 신용정보 특성 선택 방법)

  • Hong, Dongsuk;Baek, Hanjong;Shin, Hyunjoon
    • Journal of the Korea Society for Simulation
    • /
    • v.30 no.1
    • /
    • pp.75-85
    • /
    • 2021
  • In this paper, we present a deep neural network-based prediction model that processes and analyzes the corporate credit and personal credit information of individual business owners as a new method to predict the default rate of individual business more accurately. In modeling research in various fields, feature selection techniques have been actively studied as a method for improving performance, especially in predictive models including many features. In this paper, after statistical verification of macroeconomic indicators (macro variables) and credit information (micro variables), which are input variables used in the default rate prediction model, additionally, through the credit information feature selection method, the final feature set that improves prediction performance was identified. The proposed credit information feature selection method as an iterative & hybrid method that combines the filter-based and wrapper-based method builds submodels, constructs subsets by extracting important variables of the maximum performance submodels, and determines the final feature set through prediction performance analysis of the subset and the subset combined set.

Prediction Model of Hypertension Using Sociodemographic Characteristics Based on Machine Learning (머신러닝 기반 사회인구학적 특징을 이용한 고혈압 예측모델)

  • Lee, Bum Ju
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.11
    • /
    • pp.541-546
    • /
    • 2021
  • Recently, there is a trend of developing various identification and prediction models for hypertension using clinical information based on artificial intelligence and machine learning around the world. However, most previous studies on identification or prediction models of hypertension lack the consideration of the ideas of non-invasive and cost-effective variables, race, region, and countries. Therefore, the objective of this study is to present hypertension prediction model that is easily understood using only general and simple sociodemographic variables. Data used in this study was based on the Korea National Health and Nutrition Examination Survey (2018). In men, the model using the naive Bayes with the wrapper-based feature subset selection method showed the highest predictive performance (ROC = 0.790, kappa = 0.396). In women, the model using the naive Bayes with correlation-based feature subset selection method showed the strongest predictive performance (ROC = 0.850, kappa = 0.495). We found that the predictive performance of hypertension based on only sociodemographic variables was higher in women than in men. We think that our models based on machine leaning may be readily used in the field of public health and epidemiology in the future because of the use of simple sociodemographic characteristics.

A Pre-processing Study to Solve the Problem of Rare Class Classification of Network Traffic Data (네트워크 트래픽 데이터의 희소 클래스 분류 문제 해결을 위한 전처리 연구)

  • Ryu, Kyung Joon;Shin, DongIl;Shin, DongKyoo;Park, JeongChan;Kim, JinGoog
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.9 no.12
    • /
    • pp.411-418
    • /
    • 2020
  • In the field of information security, IDS(Intrusion Detection System) is normally classified in two different categories: signature-based IDS and anomaly-based IDS. Many studies in anomaly-based IDS have been conducted that analyze network traffic data generated in cyberspace by machine learning algorithms. In this paper, we studied pre-processing methods to overcome performance degradation problems cashed by rare classes. We experimented classification performance of a Machine Learning algorithm by reconstructing data set based on rare classes and semi rare classes. After reconstructing data into three different sets, wrapper and filter feature selection methods are applied continuously. Each data set is regularized by a quantile scaler. Depp neural network model is used for learning and validation. The evaluation results are compared by true positive values and false negative values. We acquired improved classification performances on all of three data sets.

Combined Feature Set and Hybrid Feature Selection Method for Effective Document Classification (효율적인 문서 분류를 위한 혼합 특징 집합과 하이브리드 특징 선택 기법)

  • In, Joo-Ho;Kim, Jung-Ho;Chae, Soo-Hoan
    • Journal of Internet Computing and Services
    • /
    • v.14 no.5
    • /
    • pp.49-57
    • /
    • 2013
  • A novel approach for the feature selection is proposed, which is the important preprocessing task of on-line document classification. In previous researches, the features based on information from their single population for feature selection task have been selected. In this paper, a mixed feature set is constructed by selecting features from multi-population as well as single population based on various information. The mixed feature set consists of two feature sets: the original feature set that is made up of words on documents and the transformed feature set that is made up of features generated by LSA. The hybrid feature selection method using both filter and wrapper method is used to obtain optimal features set from the mixed feature set. We performed classification experiments using the obtained optimal feature sets. As a result of the experiments, our expectation that our approach makes better performance of classification is verified, which is over 90% accuracy. In particular, it is confirmed that our approach has over 90% recall and precision that have a low deviation between categories.

Classification Performance Improvement of UNSW-NB15 Dataset Based on Feature Selection (특징선택 기법에 기반한 UNSW-NB15 데이터셋의 분류 성능 개선)

  • Lee, Dae-Bum;Seo, Jae-Hyun
    • Journal of the Korea Convergence Society
    • /
    • v.10 no.5
    • /
    • pp.35-42
    • /
    • 2019
  • Recently, as the Internet and various wearable devices have appeared, Internet technology has contributed to obtaining more convenient information and doing business. However, as the internet is used in various parts, the attack surface points that are exposed to attacks are increasing, Attempts to invade networks aimed at taking unfair advantage, such as cyber terrorism, are also increasing. In this paper, we propose a feature selection method to improve the classification performance of the class to classify the abnormal behavior in the network traffic. The UNSW-NB15 dataset has a rare class imbalance problem with relatively few instances compared to other classes, and an undersampling method is used to eliminate it. We use the SVM, k-NN, and decision tree algorithms and extract a subset of combinations with superior detection accuracy and RMSE through training and verification. The subset has recall values of more than 98% through the wrapper based experiments and the DT_PSO showed the best performance.

Prediction model of osteoporosis using nutritional components based on association (연관성 규칙 기반 영양소를 이용한 골다공증 예측 모델)

  • Yoo, JungHun;Lee, Bum Ju
    • The Journal of the Convergence on Culture Technology
    • /
    • v.6 no.3
    • /
    • pp.457-462
    • /
    • 2020
  • Osteoporosis is a disease that occurs mainly in the elderly and increases the risk of fractures due to structural deterioration of bone mass and tissues. The purpose of this study are to assess the relationship between nutritional components and osteoporosis and to evaluate models for predicting osteoporosis based on nutrient components. In experimental method, association was performed using binary logistic regression, and predictive models were generated using the naive Bayes algorithm and variable subset selection methods. The analysis results for single variables indicated that food intake and vitamin B2 showed the highest value of the area under the receiver operating characteristic curve (AUC) for predicting osteoporosis in men. In women, monounsaturated fatty acids showed the highest AUC value. In prediction model of female osteoporosis, the models generated by the correlation based feature subset and wrapper based variable subset methods showed an AUC value of 0.662. In men, the model by the full variable obtained an AUC of 0.626, and in other male models, the predictive performance was very low in sensitivity and 1-specificity. The results of these studies are expected to be used as the basic information for the treatment and prevention of osteoporosis.

Feature Based Decision Tree Model for Fault Detection and Classification of Semiconductor Process (반도체 공정의 이상 탐지와 분류를 위한 특징 기반 의사결정 트리)

  • Son, Ji-Hun;Ko, Jong-Myoung;Kim, Chang-Ouk
    • IE interfaces
    • /
    • v.22 no.2
    • /
    • pp.126-134
    • /
    • 2009
  • As product quality and yield are essential factors in semiconductor manufacturing, monitoring the main manufacturing steps is a critical task. For the purpose, FDC(Fault detection and classification) is used for diagnosing fault states in the processes by monitoring data stream collected by equipment sensors. This paper proposes an FDC model based on decision tree which provides if-then classification rules for causal analysis of the processing results. Unlike previous decision tree approaches, we reflect the structural aspect of the data stream to FDC. For this, we segment the data stream into multiple subregions, define structural features for each subregion, and select the features which have high relevance to results of the process and low redundancy to other features. As the result, we can construct simple, but highly accurate FDC model. Experiments using the data stream collected from etching process show that the proposed method is able to classify normal/abnormal states with high accuracy.

A Study on Machine Learning-Based Ransomware Classification methods using Optimized Feature Selection (최적화 특징 선택을 활용한 머신러닝 기반 랜섬웨어 분류 방법 연구)

  • Hye-Min Jeon;Doo-Seop Choi;Eul Gyu Im
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2024.05a
    • /
    • pp.341-344
    • /
    • 2024
  • 최근 랜섬웨어의 유포 증가로 인한 금전적 피해가 전세계적으로 급증하고 있다. 랜섬웨어는 사용자의 데이터를 암호화하여 금전을 요구하거나, 사용자의 중요하고 민감한 데이터를 파괴하여 사용하지 못하도록 피해를 입힌다. 이러한 피해를 막기 위해 파일의 API calls 이나, opcode 를 이용하는 탐지 및 분류 연구가 활발하게 진행되고 있다. 본 논문에서는 랜섬웨어를 효과적으로 탐지하기 위해 파일 PE 기능 값을 PCA 와 Wrapper 방법으로 데이터 전처리 후 머신러닝으로 학습하고, 학습한 모델을 활용하여 랜섬웨어를 정상과 악성으로 분류하는 방법을 제안한다. 제안한 방법으로 실험 결과 RF 는 98.25%, DT 96.25%, SVM 95%, NB 83%의 분류 정확도를 보였으며, RF 모델에서 가장 높은 분류 정확도를 달성하였다.

Partial AUC maximization for essential gene prediction using genetic algorithms

  • Hwang, Kyu-Baek;Ha, Beom-Yong;Ju, Sanghun;Kim, Sangsoo
    • BMB Reports
    • /
    • v.46 no.1
    • /
    • pp.41-46
    • /
    • 2013
  • Identifying genes indispensable for an organism's life and their characteristics is one of the central questions in current biological research, and hence it would be helpful to develop computational approaches towards the prediction of essential genes. The performance of a predictor is usually measured by the area under the receiver operating characteristic curve (AUC). We propose a novel method by implementing genetic algorithms to maximize the partial AUC that is restricted to a specific interval of lower false positive rate (FPR), the region relevant to follow-up experimental validation. Our predictor uses various features based on sequence information, protein-protein interaction network topology, and gene expression profiles. A feature selection wrapper was developed to alleviate the over-fitting problem and to weigh each feature's relevance to prediction. We evaluated our method using the proteome of budding yeast. Our implementation of genetic algorithms maximizing the partial AUC below 0.05 or 0.10 of FPR outperformed other popular classification methods.

A Study on the Prediction Model of Stock Price Index Trend based on GA-MSVM that Simultaneously Optimizes Feature and Instance Selection (입력변수 및 학습사례 선정을 동시에 최적화하는 GA-MSVM 기반 주가지수 추세 예측 모형에 관한 연구)

  • Lee, Jong-sik;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.4
    • /
    • pp.147-168
    • /
    • 2017
  • There have been many studies on accurate stock market forecasting in academia for a long time, and now there are also various forecasting models using various techniques. Recently, many attempts have been made to predict the stock index using various machine learning methods including Deep Learning. Although the fundamental analysis and the technical analysis method are used for the analysis of the traditional stock investment transaction, the technical analysis method is more useful for the application of the short-term transaction prediction or statistical and mathematical techniques. Most of the studies that have been conducted using these technical indicators have studied the model of predicting stock prices by binary classification - rising or falling - of stock market fluctuations in the future market (usually next trading day). However, it is also true that this binary classification has many unfavorable aspects in predicting trends, identifying trading signals, or signaling portfolio rebalancing. In this study, we try to predict the stock index by expanding the stock index trend (upward trend, boxed, downward trend) to the multiple classification system in the existing binary index method. In order to solve this multi-classification problem, a technique such as Multinomial Logistic Regression Analysis (MLOGIT), Multiple Discriminant Analysis (MDA) or Artificial Neural Networks (ANN) we propose an optimization model using Genetic Algorithm as a wrapper for improving the performance of this model using Multi-classification Support Vector Machines (MSVM), which has proved to be superior in prediction performance. In particular, the proposed model named GA-MSVM is designed to maximize model performance by optimizing not only the kernel function parameters of MSVM, but also the optimal selection of input variables (feature selection) as well as instance selection. In order to verify the performance of the proposed model, we applied the proposed method to the real data. The results show that the proposed method is more effective than the conventional multivariate SVM, which has been known to show the best prediction performance up to now, as well as existing artificial intelligence / data mining techniques such as MDA, MLOGIT, CBR, and it is confirmed that the prediction performance is better than this. Especially, it has been confirmed that the 'instance selection' plays a very important role in predicting the stock index trend, and it is confirmed that the improvement effect of the model is more important than other factors. To verify the usefulness of GA-MSVM, we applied it to Korea's real KOSPI200 stock index trend forecast. Our research is primarily aimed at predicting trend segments to capture signal acquisition or short-term trend transition points. The experimental data set includes technical indicators such as the price and volatility index (2004 ~ 2017) and macroeconomic data (interest rate, exchange rate, S&P 500, etc.) of KOSPI200 stock index in Korea. Using a variety of statistical methods including one-way ANOVA and stepwise MDA, 15 indicators were selected as candidate independent variables. The dependent variable, trend classification, was classified into three states: 1 (upward trend), 0 (boxed), and -1 (downward trend). 70% of the total data for each class was used for training and the remaining 30% was used for verifying. To verify the performance of the proposed model, several comparative model experiments such as MDA, MLOGIT, CBR, ANN and MSVM were conducted. MSVM has adopted the One-Against-One (OAO) approach, which is known as the most accurate approach among the various MSVM approaches. Although there are some limitations, the final experimental results demonstrate that the proposed model, GA-MSVM, performs at a significantly higher level than all comparative models.