• Title/Summary/Keyword: Decision Tree(DT)

Classification Performance Improvement of UNSW-NB15 Dataset Based on Feature Selection (특징선택 기법에 기반한 UNSW-NB15 데이터셋의 분류 성능 개선)

  • Lee, Dae-Bum;Seo, Jae-Hyun
    • Journal of the Korea Convergence Society / v.10 no.5 / pp.35-42 / 2019
  • With the spread of the Internet and various wearable devices, Internet technology has made it more convenient to obtain information and conduct business. However, as the Internet is used in more areas, the attack surface exposed to attackers is growing, and attempts to intrude into networks for unfair advantage, such as cyber terrorism, are also increasing. In this paper, we propose a feature selection method to improve the classification performance for classes of abnormal behavior in network traffic. The UNSW-NB15 dataset has a class imbalance problem in which rare classes have relatively few instances compared to other classes, and an undersampling method is used to address it. We use the SVM, k-NN, and decision tree algorithms and extract feature subsets with superior detection accuracy and RMSE through training and verification. In the wrapper-based experiments, the subsets achieved recall values of more than 98%, and DT_PSO showed the best performance.
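
The undersampling step mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain random undersampling down to the size of the rarest class; the abstract does not say which undersampling variant the authors used, and the function name `undersample` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    # Size of the rarest class sets the per-class quota.
    target = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return balanced

# Toy data (made up): 8 "normal" records and 2 "attack" records.
rows = [[i] for i in range(10)]
labels = ["normal"] * 8 + ["attack"] * 2
balanced = undersample(rows, labels)
```

After the call, `balanced` holds two records per class, so a classifier trained on it no longer sees the majority class four times as often.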

Linear interpolation and Machine Learning Methods for Gas Leakage Prediction Base on Multi-source Data Integration (다중소스 데이터 융합 기반의 가스 누출 예측을 위한 선형 보간 및 머신러닝 기법)

  • Dashdondov, Khongorzul;Jo, Kyuri;Kim, Mi-Hye
    • Journal of the Korea Convergence Society / v.13 no.3 / pp.33-41 / 2022
  • In this article, we propose predicting natural gas (NG) leakage levels through feature selection based on a factor analysis (FA) of integrated Korean Meteorological Agency data and natural gas leakage data, in order to account for complex factors. The approach is divided into three modules. First, we fill in missing data in the integrated dataset using linear interpolation and select essential features using FA with OrdinalEncoder (OE)-based normalization. Second, the dataset is labeled by K-means clustering. The final module uses four algorithms, K-nearest neighbors (KNN), decision tree (DT), random forest (RF), and Naive Bayes (NB), to predict gas leakage levels. The proposed method is evaluated by accuracy, area under the ROC curve (AUC), and mean standard error (MSE). The test results indicate that the OrdinalEncoder-Factor analysis (OE-F)-based classification method improves performance. Moreover, OE-F-based KNN (OE-F-KNN) showed the best performance, with 95.20% accuracy, an AUC of 96.13%, and an MSE of 0.031.
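
The linear interpolation step for missing values can be sketched as follows. This is a minimal illustration that assumes evenly spaced readings with gaps marked as `None`; the helper name `fill_linear` and the sample series are hypothetical and not taken from the paper.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours.

    Gaps at the start or end of the series have no neighbour on one side
    and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

# Made-up sensor series with three missing readings.
series = [10.0, None, None, 16.0, None, 20.0]
filled = fill_linear(series)
```

Each missing reading is placed on the straight line between its two nearest known neighbours.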

Optimal Selection of Classifier Ensemble Using Genetic Algorithms (유전자 알고리즘을 이용한 분류자 앙상블의 최적 선택)

  • Kim, Myung-Jong
    • Journal of Intelligence and Information Systems / v.16 no.4 / pp.99-112 / 2010
  • Ensemble learning is a method for improving the performance of classification and prediction algorithms. It finds a highly accurate classifier on the training set by constructing and combining an ensemble of weak classifiers, each of which needs only to be moderately accurate on the training set. Ensemble learning has received considerable attention in the machine learning and artificial intelligence fields because of its remarkable performance improvement and flexible integration with traditional learning algorithms such as decision trees (DT), neural networks (NN), and SVM. In this research, DT ensemble studies have consistently demonstrated impressive improvements in the generalization behavior of DT, while NN and SVM ensemble studies have not shown improvements as remarkable as those of DT ensembles. Recently, several works have reported that the performance of an ensemble can degrade when its classifiers are highly correlated with one another, which results in a multicollinearity problem. They have also proposed differentiated learning strategies to cope with this performance degradation. Hansen and Salamon (1990) argued that it is necessary and sufficient for the performance enhancement of an ensemble that the ensemble contain diverse classifiers. Breiman (1996) showed that ensemble learning can increase the performance of unstable learning algorithms but does not yield remarkable improvement for stable learning algorithms. Unstable learning algorithms such as decision tree learners are sensitive to changes in the training data, so small changes in the training data can yield large changes in the generated classifiers. Therefore, an ensemble of unstable learning algorithms can guarantee some diversity among its classifiers.
In contrast, stable learning algorithms such as NN and SVM generate similar classifiers in spite of small changes in the training data, so the correlation among the resulting classifiers is very high. This high correlation results in a multicollinearity problem, which leads to performance degradation of the ensemble. Kim's work (2009) compared the performance of traditional prediction algorithms such as NN, DT, and SVM in bankruptcy prediction for Korean firms. It reports that stable learning algorithms such as NN and SVM have higher predictability than the unstable DT. With respect to ensemble learning, however, the DT ensemble shows a greater performance improvement than the NN and SVM ensembles. Further analysis with the variance inflation factor (VIF) provides empirical evidence that the performance degradation of an ensemble is due to the multicollinearity problem. It also proposes that optimization of the ensemble is needed to cope with this problem. This paper proposes a hybrid system for coverage optimization of NN ensembles (CO-NN) in order to improve the performance of NN ensembles. Coverage optimization is a technique for choosing a sub-ensemble from an original ensemble so as to guarantee the diversity of its classifiers. CO-NN uses a genetic algorithm (GA), which has been widely used for various optimization problems, to deal with the coverage optimization problem. The GA chromosomes for coverage optimization are encoded as binary strings, each bit of which indicates an individual classifier. The fitness function is defined as maximization of error reduction, and a constraint on the variance inflation factor (VIF), one of the commonly used measures of multicollinearity, is added to ensure the diversity of classifiers by removing high correlation among them. We use Microsoft Excel and the GA software package Evolver.
Experiments on company failure prediction show that CO-NN stably enhances the performance of NN ensembles by choosing classifiers with the correlations within the ensemble taken into account. Classifiers with a potential multicollinearity problem are removed by the coverage optimization process of CO-NN, and CO-NN thereby shows higher performance than a single NN classifier and the NN ensemble at the 1% significance level, and than the DT ensemble at the 5% significance level. However, further research issues remain. First, a decision optimization process to find the optimal combination function should be considered. Second, various learning strategies for dealing with data noise should be introduced in future research.
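
The binary chromosome encoding described in this abstract can be illustrated with a small sketch. It only decodes a chromosome into a sub-ensemble and scores it by majority-vote error; the GA search loop and the VIF constraint from the paper are not reproduced, and the function name, the tie-breaking rule, and the toy predictions are all assumptions made for illustration.

```python
def majority_vote_error(chromosome, predictions, labels):
    """Error rate of the sub-ensemble selected by a binary chromosome.

    chromosome[i] == 1 keeps classifier i; predictions[i][j] is classifier i's
    0/1 prediction for sample j. Tied votes are resolved as class 0 (assumption).
    """
    selected = [p for bit, p in zip(chromosome, predictions) if bit]
    if not selected:
        return 1.0  # assumption: an empty ensemble is treated as always wrong
    errors = 0
    for j, truth in enumerate(labels):
        votes = sum(p[j] for p in selected)
        predicted = 1 if 2 * votes > len(selected) else 0
        errors += predicted != truth
    return errors / len(labels)

# Made-up labels and per-classifier predictions for four samples.
labels = [1, 0, 1, 1]
predictions = [
    [1, 0, 1, 0],  # classifier 0
    [1, 0, 0, 1],  # classifier 1
    [0, 0, 1, 1],  # classifier 2
    [0, 1, 0, 0],  # classifier 3
]
full = majority_vote_error([1, 1, 1, 1], predictions, labels)    # all four classifiers
pruned = majority_vote_error([1, 1, 1, 0], predictions, labels)  # classifier 3 removed
```

In this toy case, dropping classifier 3 lowers the majority-vote error from 0.75 to 0.0, which is the kind of error reduction a GA fitness function over such chromosomes would reward.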

Exploration of Optimal Product Innovation Strategy Using Decision Tree Analysis: A Data-mining Approach

  • Cho, Insu
    • STI Policy Review / v.8 no.2 / pp.75-93 / 2017
  • Global competition is driving firms in the manufacturing sector to conduct product innovation projects to maintain their competitive edge. The key points of a product innovation project are 1) what the purpose of the project is and 2) what results can be expected in the target market from implementing the innovation. Therefore, this study focuses on the performance of innovation projects from a business viewpoint. In this respect, this study proposes the "achievement rate" of product innovation projects as a measure of project performance. It then identifies the best strategies among various innovation activities to optimize the achievement rate of product innovation projects. The innovation activities considered include three types of R&D activities (internal, joint, and external R&D) and five types of non-R&D activities (acquisition of machines, equipment, and software; purchasing external knowledge; job education and training; market research; and design). This study applies decision tree modeling, a data-mining methodology, to explore effective innovation activities. This study employs data from the 'Korean Innovation Survey (KIS) 2014: Manufacturing Sector.' The KIS 2014 gathered information about innovation activities in the manufacturing sector over three years (2011-2013). This study offers some practical implications for managing these activities. First, the innovation activities that increased the achievement rate of product diversification projects included a combination of market research, new product design, and job training. Second, our results show that a combination of internal R&D, job education and training, and market research increases project achievement most for the replacement of outdated products. Third, for new market creation or extension of market share, launching replacement products and continuously upgrading products are most important.

A study for Desertification Monitoring and Assessment based on satellite imagery in Tunisia (위성영상기반 튀니지 사막화 모니터링 및 평가에 관한 연구)

  • KIM, Ji-Won;SONG, Chol-Ho;PARK, Eun-Been;LEE, Jong-Yeol;CHOI, Sol-E;LEE, Eun-Jung;LEE, Woo-Kyun
    • Journal of the Korean Association of Geographic Information Studies / v.21 no.4 / pp.91-107 / 2018
  • Desertification in Tunisia needs to be monitored and assessed, as the Sahara Desert in the southern part of the country has recently been expanding northward. In this study, land cover changes were examined using remotely sensed data, and the Normalized Difference Vegetation Index (NDVI), Topsoil Grain Size Index (TGSI), and Albedo were used to monitor and assess desertification in Tunisia. A Decision Tree was constructed, and the frequencies and trends of each assessment indicator, desertification degree, and land cover were identified. In addition, we analyzed the correlation between the assessment indicators and precipitation. The results show that desertification is generally intensifying northward, especially in areas with high levels of desertification. Bivariate correlation analysis also showed that Albedo, NDVI, and TGSI were all highly correlated with precipitation, indicating that changes in precipitation affect Tunisian desertification. In conclusion, this study has improved the usability of methodologies that combine satellite-imagery-based assessment indicators, a Decision Tree for evaluating them jointly, and trends in land cover change.
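
Of the indicators named in this abstract, NDVI has a simple standard definition, (NIR - Red) / (NIR + Red), which the sketch below computes for a single pixel. The reflectance values are made up, the zero-signal handling is an assumption, and the paper's TGSI, Albedo, and decision tree thresholds are not shown because the abstract does not give them.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat zero-signal pixels as 0 rather than undefined
    return (nir - red) / total

# Made-up reflectances: a vegetated pixel and a bare-soil-like pixel.
vegetated = ndvi(0.5, 0.1)
bare = ndvi(0.2, 0.2)
```

Higher values indicate denser green vegetation, so a falling NDVI over time is one signal a desertification assessment can use.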

A Halal Food Classification Framework Using Machine Learning Method for Enhancing Muslim Tourists (무슬림 관광객 증대를 위한 머신러닝 기반의 할랄푸드 분류 프레임워크)

  • Kim, Sun-A;Kim, Jeong-Won;Won, Dong-Yeon;Choi, Yerim
    • The Journal of Information Systems / v.26 no.3 / pp.273-293 / 2017
  • Purpose: The purpose of this study is to introduce a framework that helps Muslims determine whether a food can be consumed. It can complement existing Halal food classification services, which have difficulty constructing a Halal food database. Design/methodology/approach: The proposed framework includes two components. First, an OCR (Optical Character Recognition) technique is used to read the food additive information. Second, machine learning methods are trained on the extracted information and used to predict whether a food can be consumed. Findings: Among the compared machine learning methods, SVM (Support Vector Machine), DT (Decision Tree), and NB (Naive Bayes), SVM with a linear kernel and DT showed excellent performance in Halal food classification. Services that adopt the proposed framework will enhance the tourism experiences of Muslim tourists who place the highest importance on keeping Islamic law. Furthermore, it can eventually contribute to the enhancement of the smart tourism ecosystem.

Comparative Study of Machine learning Techniques for Spammer Detection in Social Bookmarking Systems (소셜 복마킹 시스템의 스패머 탐지를 위한 기계학습 기술의 성능 비교)

  • Kim, Chan-Ju;Hwang, Kyu-Baek
    • Journal of KIISE:Computing Practices and Letters / v.15 no.5 / pp.345-349 / 2009
  • Social bookmarking systems are a typical Web 2.0 service based on folksonomy, providing a platform for storing and sharing bookmarking information. Spammers in social bookmarking systems are users who abuse the system improperly for their own interests. They can make the entire set of resources in a social bookmarking system useless by posting large amounts of wrong information. Hence, it is important to detect spammers as early as possible and protect social bookmarking systems from their attacks. In this paper, we applied a diverse set of machine learning approaches, i.e., decision tables, decision trees (ID3), naïve Bayes classifiers, TAN (tree-augmented naïve Bayes) classifiers, and artificial neural networks, to this task. In our experiments, naïve Bayes classifiers performed significantly better than the other methods with respect to the AUC (area under the ROC curve) score as well as the model building time. Plausible explanations for this result are as follows. First, naïve Bayes classifiers are known to usually perform better than decision trees in terms of the AUC score. Second, the spammer detection problem in our experiments is likely to be linearly separable.
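
The AUC score used for comparison in this abstract can be computed directly from its pairwise-ranking interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below assumes binary 0/1 labels; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up spammer scores: label 1 = spammer, 0 = legitimate user.
result = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

Because AUC depends only on the ranking of scores, a classifier whose probability estimates are poorly calibrated can still score well, which is relevant when comparing naïve Bayes against decision trees.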

Comparison of machine learning algorithms for regression and classification of ultimate load-carrying capacity of steel frames

  • Kim, Seung-Eock;Vu, Quang-Viet;Papazafeiropoulos, George;Kong, Zhengyi;Truong, Viet-Hung
    • Steel and Composite Structures / v.37 no.2 / pp.193-209 / 2020
  • In this paper, the efficiency of five Machine Learning (ML) methods, namely Deep Learning (DL), Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and Gradient Tree Boosting (GTB), is compared for regression and classification of the Ultimate Load Factor (ULF) of nonlinear inelastic steel frames. For this purpose, a two-story, a six-story, and a twenty-story space frame are considered. An advanced nonlinear inelastic analysis is carried out for the steel frames to generate datasets for training the considered ML methods. In each dataset, the input variables are the geometric features of W-sections and the output variable is the ULF of the frame. The five ML methods are compared in terms of the mean-squared-error (MSE) for the regression models and the accuracy for the classification models. Moreover, the ULF distribution curve is calculated for each frame and the strength failure probability is estimated. It is found that the GTB method has the best efficiency in both regression and classification of ULF regardless of the number of training samples and the space frames considered.

A Comparative Study on Collision Detection Algorithms based on Joint Torque Sensor using Machine Learning (기계학습을 이용한 Joint Torque Sensor 기반의 충돌 감지 알고리즘 비교 연구)

  • Jo, Seonghyeon;Kwon, Wookyong
    • The Journal of Korea Robotics Society / v.15 no.2 / pp.169-176 / 2020
  • This paper studies collision detection for robot manipulators for safe collaboration in human-robot interaction. In sensor-based collision detection, the external torque is separated by subtracting the robot dynamics. To detect collisions using joint torque sensor data, a comparative study was conducted using data-driven machine learning algorithms. Data were collected from an actual 3 degree-of-freedom (DOF) robot manipulator and labeled by thresholding and by hand. Using support vector machine (SVM), decision tree, and k-nearest neighbors (KNN) methods, we derive the optimal parameters of each algorithm and compare the collision classification performance. The simulation results are analyzed for each method, and we confirm that an optimal collision status detection model with high prediction accuracy is obtained.
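
The idea of isolating external torque by subtracting the modeled robot dynamics, and then labeling by threshold, can be sketched as below. This is only an illustration of a threshold rule of the kind used for labeling, not the paper's SVM, decision tree, or KNN classifiers; the function name, the torque values, and the threshold are all made up.

```python
def detect_collision(measured, model, threshold):
    """Flag a collision when any joint's external torque exceeds the threshold.

    measured: joint torque sensor readings, one per joint
    model:    torques predicted by the robot dynamics model for the same joints
    """
    # External torque is what remains after removing the modeled dynamics.
    external = [m - d for m, d in zip(measured, model)]
    collided = any(abs(t) > threshold for t in external)
    return collided, external

# Made-up readings for a 3-DOF manipulator.
collided, external = detect_collision(
    measured=[1.0, 2.5, 0.5],
    model=[1.0, 2.0, 0.5],
    threshold=0.25,
)
```

A fixed threshold like this is sensitive to model error and sensor noise, which is one motivation for learning the decision boundary from labeled data instead.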

Performance Analysis of Opinion Mining using Word2vec (Word2vec을 이용한 오피니언 마이닝 성과분석 연구)

  • Eo, Kyun Sun;Lee, Kun Chang
    • Proceedings of the Korea Contents Association Conference / 2018.05a / pp.7-8 / 2018
  • This study analyzes Word2vec-based machine learning classifiers for opinion mining tasks. BOW (Bag-of-Words) was adopted as the benchmark method. Using Word2vec and BOW as feature extraction methods, we applied the LR, DT, SVM, and RF classifiers to the Laptop and Restaurant datasets. The results showed that Word2vec feature extraction yields better performance.
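
The Bag-of-Words baseline mentioned in this abstract can be illustrated with a minimal sketch that turns each document into a vector of word counts. Whitespace tokenization and lowercasing are simplifying assumptions, the function name and example reviews are hypothetical, and the Word2vec side of the comparison is not shown because it requires a trained embedding model.

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    tokenized = [d.lower().split() for d in documents]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Made-up review snippets.
vocab, vectors = bag_of_words(["great screen great battery", "battery died"])
```

BOW vectors ignore word order and word similarity, whereas Word2vec maps related words to nearby vectors, which is the difference the study evaluates.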