• Title/Summary/Keyword: tree classification

A Study on Detection of Small Size Malicious Code using Data Mining Method (데이터 마이닝 기법을 이용한 소규모 악성코드 탐지에 관한 연구)

  • Lee, Taek-Hyun; Kook, Kwang-Ho
    • Convergence Security Journal, v.19 no.1, pp.11-17, 2019
  • Recently, the abuse of Internet technology has caused economic and psychological harm to society as a whole. In particular, newly created or modified malicious code bypasses existing information protection systems and serves as a basic means for application hacking and cyber security threats. However, research on the small executable files that account for a large portion of actual malicious code remains limited. In this paper, we propose a model that analyzes the characteristics of known small executable files using data mining techniques and applies them to the detection of unknown malicious code. Various data mining techniques, including Naive Bayes, SVM, decision tree, random forest, and artificial neural network, were applied, and their accuracy was compared against the detection level reported by VirusTotal. As a result, a classification accuracy of more than 80% was verified on 34,646 analyzed files.
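
  • A minimal Python/scikit-learn sketch of the kind of classifier comparison this abstract describes; the feature matrix is a synthetic placeholder, since the executable-file features used in the paper are not given here.

    # Compare the five classifiers named in the abstract on stand-in data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=40, random_state=0)  # placeholder features
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    models = {
        "Naive Bayes": GaussianNB(),
        "SVM": SVC(),
        "Decision tree": DecisionTreeClassifier(random_state=0),
        "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "Neural network": MLPClassifier(max_iter=500, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f"{name}: {accuracy_score(y_te, model.predict(X_te)):.3f}")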

An Application of Support Vector Machines to Personal Credit Scoring: Focusing on Financial Institutions in China (Support Vector Machines을 이용한 개인신용평가 : 중국 금융기관을 중심으로)

  • Ding, Xuan-Ze; Lee, Young-Chan
    • Journal of Industrial Convergence, v.16 no.4, pp.33-46, 2018
  • Personal credit scoring is an effective tool that helps banks make profitable lending decisions. Recently, many classification algorithms and models have been applied to personal credit scoring. Credit scoring techniques are usually divided into statistical and non-statistical methods: statistical methods include linear regression, discriminant analysis, logistic regression, and decision trees, while non-statistical methods include linear programming, neural networks, genetic algorithms, and support vector machines. However, no consistent conclusion has been reached on which method is best for developing credit scoring models. In this paper, we compare the performance of the most common scoring techniques, namely logistic regression, neural networks, and support vector machines, using personal credit data from a financial institution in China. Specifically, we build the three models, classify the customers, and compare the results. According to the results, the support vector machine performs better than logistic regression and the neural network.
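
  • For illustration only, a hedged sketch of an SVM credit scoring model with the feature scaling and hyperparameter search such models typically need; the credit data below are synthetic, not the institution's records.

    # RBF-kernel SVM with feature scaling and a small grid search over C/gamma.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Stand-in for good/bad borrowers (imbalanced, as credit data usually is).
    X, y = make_classification(n_samples=1000, n_features=15, weights=[0.8], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

    pipe = make_pipeline(StandardScaler(), SVC())
    param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
    grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_tr, y_tr)
    print("best params:", grid.best_params_)
    print("test accuracy:", grid.score(X_te, y_te))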

A Study on the Walkability Scores in Jeonju City Using Multiple Regression Models (다중 회귀 모델을 이용한 전주시 보행 환경 점수 예측에 관한 연구)

  • Lee, KiChun; Nam, KwangWoo; Lee, ChangWoo
    • Journal of Korea Society of Industrial Information Systems, v.27 no.4, pp.1-10, 2022
  • Attempts to interpret the human perspective using computer vision have been developed in various fields. In this paper, we propose a method for evaluating the walking environment from the semantic segmentation results of road images. First, the Kakao Map API was used to collect four-way road images from about 50,000 points in Jeonju. A dataset was built from 20% of the collected images through crowdsourcing-based paired comparisons, and various regression models were trained on the paired comparison data. To derive a walkability score for each image, a ranking score is computed with the TrueSkill rating algorithm, and walkability analysis with various regression models is then performed on the constructed data. This study shows that the walkability of Jeonju can be evaluated, and scores derived, from the pixel-level segmentation class distribution rather than from human vision alone.
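
  • The TrueSkill ranking step can be sketched with the trueskill Python package; the image names and comparison outcomes below are invented, standing in for the crowdsourced judgments.

    # Update per-image TrueSkill ratings from (winner, loser) comparisons.
    import trueskill

    ratings = {img: trueskill.Rating() for img in ["img_a", "img_b", "img_c"]}
    comparisons = [("img_a", "img_b"), ("img_a", "img_c"), ("img_c", "img_b")]

    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

    # A conservative score (mu - 3*sigma) can serve as the regression target.
    for img, r in sorted(ratings.items(), key=lambda kv: kv[1].mu, reverse=True):
        print(f"{img}: mu={r.mu:.2f} sigma={r.sigma:.2f} score={r.mu - 3 * r.sigma:.2f}")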

Probability Estimation Method for Imputing Missing Values in Data Expansion Technique (데이터 확장 기법에서 손실값을 대치하는 확률 추정 방법)

  • Lee, Jong Chan
    • Journal of the Korea Convergence Society, v.12 no.11, pp.91-97, 2021
  • This paper applies a data expansion technique, originally designed for the rule refinement problem, to the handling of incomplete data. The technique is characterized by the fact that each event can carry a weight indicating its importance and each variable can be expressed as a probability value. Since the key problem is to find the probability closest to the missing value and substitute it for that value, three different algorithms are used to estimate the probability for each missing value, which is then stored in this data structure format. To evaluate each probability structure, an SVM classifier is trained on each information area, and the classification results are compared with the original information to measure how closely they match. The three imputation algorithms share the same data structure but take different approaches, so they are expected to be useful for different purposes depending on the application field.
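
  • The abstract does not spell out the three algorithms, so the sketch below shows just one plausible probability-based imputation: an incomplete row is expanded into weighted candidate rows using the observed value distribution, mirroring the weight-per-event idea. The column and values are toy placeholders.

    # Replace each missing value by the distribution of observed values,
    # expanding the row into weighted candidates (weights sum to 1).
    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red", None, "red", "blue"]})  # toy column
    probs = df["color"].value_counts(normalize=True).to_dict()  # P(color=v) from observed rows

    expanded = []
    for _, row in df.iterrows():
        if pd.isna(row["color"]):
            expanded += [{"color": v, "weight": p} for v, p in probs.items()]
        else:
            expanded.append({"color": row["color"], "weight": 1.0})
    print(pd.DataFrame(expanded))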

Development of a Model for Calculating the Negligence Ratio Using Traffic Accident Information (교통사고 정보를 이용한 과실비율 산정 모델 개발)

  • Eum Han; Giok Park; Heejin Kang; Yoseph Lee; Ilsoo Yun
    • The Journal of The Korea Institute of Intelligent Transport Systems, v.21 no.6, pp.36-56, 2022
  • For traffic accidents that occur in Korea, negligence ratios are calculated according to the 「Automobile Accident Negligence Ratio Certification Standard」 prepared by the General Insurance Association of Korea, and insurance companies reach agreements or judgments on that basis. However, disputes frequently arise over the calculated negligence ratio. A more effective response would therefore be possible if the accident type defined in the standard could be quickly identified from the traffic accident information prepared by the police. This study aims to develop a model that learns from the accident information prepared by the police and classifies each case into the matching accident type in the standard. In particular, keywords needed to classify the standard's accident types were extracted from the police accident data through data mining, and models that derive the accident type were developed by learning the extracted keywords with decision tree and random forest models.
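
  • A hedged sketch of the keyword-based classification step: TF-IDF features from made-up report text fed to the two tree models named in the abstract; the reports and accident-type labels are illustrative only.

    # TF-IDF keyword features from toy report text, fed to both tree models.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.tree import DecisionTreeClassifier

    reports = [
        "signal violation straight left turn intersection",
        "lane change side collision highway",
        "rear end collision stopped vehicle",
        "left turn intersection opposing straight",
    ]
    accident_types = ["crossing", "lane-change", "rear-end", "crossing"]  # invented labels

    X = TfidfVectorizer().fit_transform(reports)
    for model in (DecisionTreeClassifier(random_state=0),
                  RandomForestClassifier(n_estimators=100, random_state=0)):
        model.fit(X, accident_types)
        print(type(model).__name__, model.predict(X))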

Estimation of unused forest biomass potential resource amount in Korea

  • Sangho Yun; Sung-Min Choi; Joon-Woo Lee; Sung-Min Park
    • Korean Journal of Agricultural Science, v.49 no.2, pp.317-330, 2022
  • Recently, climate change policy in Korea and overseas has promoted the use of forest biomass to achieve net-zero emissions. In addition, with the implementation of the unused forest biomass system in 2018, the Korean market for manufacturing wood pellets and wood chips from unused forest biomass is expanding rapidly. It is therefore necessary to estimate the total amount of unused forest biomass available as an energy source and to identify the capacity that can be produced sustainably each year. In this study, we estimated the forest area from which logging residue can actually be produced and the potential amount of unused forest biomass resources in green tons (GT). Using a forest functions classification map (1:25,000), the 5th digital forest type map (1:25,000), and a digital elevation model (DEM), we estimated the forest area with a slope of 30° or less, located at 70% or less of the way to the mountain ridge, restricted to production forests of age class IV or higher. The total forest area where unused forest biomass can be produced was estimated at 1,453,047 ha, and the total unused forest biomass potential resource in Korea at 117,741,436 tons on a GT basis. By forest type, coniferous forests were estimated at 48,513,580 tons (41.2%), broad-leaved forests at 27,419,391 tons (23.3%), and mixed forests at 41,808,465 tons (35.5%). These results can serve as basic data for estimating the commercial use of unused forest biomass.
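
  • As a rough illustration of the DEM screening step only (not the study's actual GIS workflow), the sketch below derives slope from a synthetic elevation grid and masks cells steeper than 30°; the 30 m cell size and the elevation values are assumptions.

    # Slope from a synthetic DEM; keep cells at 30 degrees or less.
    import numpy as np

    rng = np.random.default_rng(0)
    dem = rng.normal(500, 5, size=(200, 200)).cumsum(axis=0) / 10  # synthetic elevation (m)
    cell = 30.0  # assumed raster resolution in metres

    dz_dy, dz_dx = np.gradient(dem, cell)
    slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

    mask = slope_deg <= 30.0  # the study also filters by ridge position, function, and age class
    print(f"area passing the slope rule: {mask.sum() * cell * cell / 10_000:,.0f} ha")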

Feasibility on Statistical Process Control Analysis of Delivery Quality Assurance in Helical Tomotherapy (토모테라피에서 선량품질보증 분석을 위한 통계적공정관리의 타당성)

  • Chang, Kyung Hwan
    • Journal of radiological science and technology, v.45 no.6, pp.491-502, 2022
  • The purpose of this study was to retrospectively investigate the upper and lower control limits of treatment planning parameters using EBT film based delivery quality assurance (DQA) results and to analyze the results of statistical process control (SPC) in helical tomotherapy (HT). A total of 152 patients with passing or failing DQA results were retrospectively included: prostate (n = 66), rectal (n = 51), and large-field cancer patients including lymph nodes (n = 35) were randomly selected. The absolute point dose difference (DD) and global gamma passing rate (GPR) were analyzed for all patients. Control charts were used to evaluate the upper and lower control limits (UCL and LCL) for all assessed treatment planning parameters. Planning parameters such as gantry period, leaf open time (LOT), pitch, field width, actual and planned modulation factor, treatment time, couch speed, and couch travel were analyzed to provide optimal ranges based on the DQA results. Classification and regression trees (CART) were used to predict the relative importance of the planning parameters in the DQA results. We confirmed that the proportion of patients with an LOT below 100 ms was relatively higher in the failure group than in the passing group. SPC can detect QA failures before dosimetric QA tolerance levels are exceeded, and the acceptable tolerance range of each planning parameter may assist in predicting DQA failures with the SPC tool in the future.
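
  • A short sketch of a standard SPC individuals (I-MR) chart, one common way to obtain the UCL/LCL the abstract mentions; the point dose differences are invented, and the paper's exact chart type is not stated here.

    # Individuals control chart limits from the average moving range.
    import numpy as np

    dd = np.array([0.5, -0.3, 1.1, 0.2, -0.8, 0.9, 0.1, -0.4, 1.3, 0.0])  # invented DD values (%)
    center = dd.mean()
    mr_bar = np.abs(np.diff(dd)).mean()                        # average moving range
    ucl, lcl = center + 2.66 * mr_bar, center - 2.66 * mr_bar  # 3-sigma limits for individuals

    print(f"CL={center:.2f}%  UCL={ucl:.2f}%  LCL={lcl:.2f}%")
    print("out-of-control points:", np.where((dd > ucl) | (dd < lcl))[0])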

Exploring On-line Consumption Tendency of Sports 4.0 Market Consumer: Focused on Sports Goods Consumption by Generation of Working Age Population (스포츠 4.0 시장 소비자의 온라인 소비성향 탐색: 생산 가능인구의 세대별 스포츠 용품 소비를 중심으로)

  • Jin-Ho Shin
    • Journal of the Korean Applied Science and Technology, v.40 no.1, pp.24-34, 2023
  • This study sought to explore the online consumption propensity for sports goods by generation of the working-age population and to provide basic data for predicting the future consumption market by segmenting online consumers in the sports 4.0 market. A survey was conducted among sports goods consumers in the generational groups of the working-age population (Generation Y and older, and Generation Z), and data from a total of 478 respondents were used in the final analysis. Data processing was conducted with SPSS Statistics (ver. 21.0) using frequency analysis, exploratory factor analysis, test-retest reliability and correlation analysis, reliability analysis, and decision tree analysis. According to the online consumption propensity for sports goods by generation, respondents scoring high on the leisure, joy, and environment factors have a high probability of being classified into the Generation Z group. The classification accuracy of this model was 69.7%.
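
  • A minimal sketch of a decision tree segmentation like the one described; the factor scores and generation labels are synthetic stand-ins for the survey data.

    # Fit a shallow decision tree and print its split rules per factor.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=478, n_features=3, n_informative=3,
                               n_redundant=0, random_state=0)  # stand-in factor scores / generations
    factors = ["leisure", "joy", "environment"]

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print("classification accuracy:", tree.score(X, y))
    print(export_text(tree, feature_names=factors))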

A Comparative Study of Prediction Models for College Student Dropout Risk Using Machine Learning: Focusing on the case of N university (머신러닝을 활용한 대학생 중도탈락 위험군의 예측모델 비교 연구 : N대학 사례를 중심으로)

  • So-Hyun Kim; Sung-Hyoun Cho
    • Journal of The Korean Society of Integrative Medicine, v.12 no.2, pp.155-166, 2024
  • Purpose: This study aims to identify key factors for predicting dropout risk at the university level and to provide a foundation for developing dropout prevention policies. It explores the optimal machine learning algorithm by comparing the performance of various algorithms on data about college students' dropout risk. Methods: Data on factors influencing dropout risk and propensity were collected from N University and applied to several machine learning algorithms: random forest, decision tree, artificial neural network, logistic regression, support vector machine (SVM), k-nearest neighbor (k-NN) classification, and Naive Bayes. The performance of these models was compared and evaluated, focusing on predictive validity and on identifying significant dropout factors through the information gain index. Results: Binary logistic regression analysis showed that program year, department, grades, and year of entry had a statistically significant effect on dropout risk. Among the machine learning algorithms, random forest performed best. The relative importance of the predictor variables was highest for department, followed by age, grade, residence, and whether the residence matched the school location. Conclusion: Machine learning based prediction of dropout risk focuses on the early identification of at-risk students. Because the types and causes of dropout crises vary significantly among students, identifying them is essential for taking appropriate action and providing support that removes risk factors and strengthens protective factors. The relative importance of the factors found in this study can help guide educational prescriptions for preventing college student dropout.
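
  • A hedged sketch of ranking predictors by information gain (here approximated with mutual information) and fitting the best-performing model from the paper, a random forest; the student records and column names are hypothetical.

    # Rank stand-in predictors by mutual information, then fit a random forest.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import mutual_info_classif

    X, y = make_classification(n_samples=1500, n_features=5, n_informative=4,
                               n_redundant=1, random_state=42)  # stand-in student records
    columns = ["department", "age", "grade", "residence", "location_match"]  # hypothetical names

    gains = mutual_info_classif(X, y, random_state=42)
    for name, gain in sorted(zip(columns, gains), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: information gain ~ {gain:.3f}")

    rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X, y)
    print("training accuracy:", rf.score(X, y))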

Prediction of Correct Answer Rate and Identification of Significant Factors for CSAT English Test Based on Data Mining Techniques (데이터마이닝 기법을 활용한 대학수학능력시험 영어영역 정답률 예측 및 주요 요인 분석)

  • Park, Hee Jin; Jang, Kyoung Ye; Lee, Youn Ho; Kim, Woo Je; Kang, Pil Sung
    • KIPS Transactions on Software and Data Engineering, v.4 no.11, pp.509-520, 2015
  • The College Scholastic Ability Test (CSAT) is the primary test for evaluating the academic achievement of high school students and is used by most universities in South Korea for admission decisions. Because its level of difficulty is a significant issue for both students and universities, the government makes a huge effort to keep the difficulty consistent every year. However, the actual difficulty has fluctuated significantly, causing many problems with university admissions. In this paper, unlike traditional methods that depend on experts' judgments, we build two types of data-driven prediction models from accumulated test data to predict the correct answer rate and to identify significant factors for the CSAT English test. We first derive candidate question-specific factors that can influence the correct answer rate, such as position, EBS relation, and readability, from 10 years of annual CSAT practice tests and actual CSATs. In addition, we derive context-specific factors by employing topic modeling, which identifies the underlying topics in the text. The correct answer rate is then predicted by multiple linear regression, and the level of difficulty by a classification tree. The experimental results show that the difficulty (difficult/easy) classification model achieves 90% accuracy, while the error rate for the correct answer rate is below 16%. Points and problem category are found to be critical for predicting the correct answer rate, which is also influenced by some of the topics discovered through topic modeling. Based on our study, it will be possible to predict the expected correct answer rate at both the question level and the test level, helping CSAT examiners control the level of difficulty.
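
  • A hedged sketch of the two-model pipeline outlined above: LDA topic proportions as context features, multiple linear regression for the correct answer rate, and a classification tree for difficulty; the passages and answer rates are invented.

    # LDA topic proportions -> linear regression (rate) and tree (difficulty).
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeClassifier

    passages = ["science experiment results discussion", "novel character emotion story",
                "economy trade market policy", "history war empire culture"] * 5
    correct_rate = np.array([0.62, 0.48, 0.71, 0.55] * 5)  # invented answer rates

    counts = CountVectorizer().fit_transform(passages)
    topics = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(counts)

    reg = LinearRegression().fit(topics, correct_rate)                            # answer-rate model
    clf = DecisionTreeClassifier(random_state=0).fit(topics, correct_rate < 0.6)  # difficult/easy
    print("predicted rate:", reg.predict(topics[:1]))
    print("difficult?", clf.predict(topics[:1]))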