• Title/Summary/Keyword: decision tree regression

A Comparative Study of Phishing Websites Classification Based on Classifier Ensembles

  • Tama, Bayu Adhi;Rhee, Kyung-Hyune
    • Journal of Multimedia Information System / v.5 no.2 / pp.99-104 / 2018
  • Phishing websites have become a crucial concern in cyber security. Phishing is carried out by fraudulently deceiving users in order to obtain sensitive information such as bank account details, credit card numbers, usernames, and passwords. The threat has led to huge losses for online retailers, e-business platforms, and financial institutions, to name but a few. One way to build an anti-phishing detection mechanism is to construct a classification algorithm based on machine learning techniques. The objective of this paper is to compare different classifier ensemble approaches, i.e., random forest, rotation forest, gradient boosted machine, and extreme gradient boosting, against single classifiers, i.e., decision tree, classification and regression tree, and credal decision tree, in the case of website phishing. Area under the ROC curve (AUC) is employed as the performance metric, whilst statistical tests are used as a baseline indicator for evaluating the significance of differences among classifiers. The paper contributes to the existing literature by providing a benchmark of classifier ensembles for web phishing detection.
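
    The AUC metric named in this abstract can be illustrated with a short sketch. The code below is not taken from the paper: it is a minimal, standard-library Python version of the rank-sum (Mann-Whitney) identity for AUC, and the function name, the 0/1 label convention (1 = phishing), and the sample scores are assumptions made for illustration only.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 values (1 = positive class; assumed convention)
    scores: classifier scores, where a higher score means "more likely positive"
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank shared by the ties
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    # U statistic of the positives, normalised by the number of (pos, neg) pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical scores from some classifier on four websites.
example_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

    An AUC of 1.0 means every positive example is scored above every negative one, and 0.5 corresponds to random ranking, which is why the metric is convenient for comparing classifiers without fixing a decision threshold.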

A Study on Quality Control Using Data Mining in Steel Continuous Casting Process (철강 연주공정에서 데이터마이닝을 이용한 품질제어 방법에 관한 연구)

  • Kim, Jae-Kyeong;Kwon, Taeck-Sung;Choi, Il-Young;Kim, Hyea-Kyeong;Kim, Min-Yong
    • Journal of Information Technology Services / v.10 no.3 / pp.113-126 / 2011
  • The smelting and continuous casting of steel are important processes that determine the quality of steel products. In particular, most quality defects occur during solidification in the continuous casting process. Although quality control techniques such as six sigma, SQC, and TQM can be applied to the continuous casting process to improve the quality of steel products, these techniques do not provide real-time analysis to identify the causes of defects. To solve this problem, we developed a detection model using a decision tree that identifies abnormal transactions with a coarse grain structure. We then compared the proposed model with models using a neural network and logistic regression. Experiments on steel data showed that the proposed model outperformed both the neural network model and the logistic regression model. Thus, we expect that the suggested model will help control the quality of steel products in real time in the continuous casting process.

Evaluation of Ultrasound for Prediction of Carcass Meat Yield and Meat Quality in Korean Native Cattle (Hanwoo)

  • Song, Y.H.;Kim, S.J.;Lee, S.K.
    • Asian-Australasian Journal of Animal Sciences / v.15 no.4 / pp.591-595 / 2002
  • Three hundred thirty-five progeny-testing steers of Korean beef cattle were evaluated ultrasonically for back fat thickness (BFT), longissimus muscle area (LMA) and intramuscular fat (IF) before slaughter. Class measurements associated with the Korean yield grade and quality grade were also obtained. The residual standard deviations between ultrasonic estimates and carcass measurements of BFT and LMA were 1.49 mm and $0.96\ \mathrm{cm}^2$, respectively. The linear correlation coefficients (p<0.01) between ultrasonic estimates and carcass measurements of BFT, LMA and IF were 0.75, 0.57 and 0.67, respectively. The yield grade prediction accuracies of four methods (the Korean yield grade index equation, fat depth alone, regression, and decision tree) were 75.4%, 79.6%, 64.3% and 81.4%, respectively. We conclude that the decision tree method can easily predict yield grade and is also useful for increasing the prediction accuracy rate.

Iowa Liquor Sales Data Predictive Analysis Using Spark

  • Ankita Paul;Shuvadeep Kundu;Jongwook Woo
    • Asia Pacific Journal of Information Systems / v.31 no.2 / pp.185-196 / 2021
  • The paper aims to analyze and predict sales of liquor in the state of Iowa by applying machine learning algorithms to build predictive models. We used Azure ML and Spark ML for our predictive analysis, which are a legacy machine learning (ML) system and a Big Data ML system, respectively. We worked on the Iowa liquor sales dataset, comprising records from 2012 to 2019 with 24 columns and approximately 1.8 million rows. We conclude by comparing the models built with different algorithms and their accuracy in predicting sales using both Azure ML and Spark ML. We find that the Linear Regression model has the highest precision and Decision Forest Regression has the fastest computing time on the sample data set using the legacy Azure ML system. The Decision Tree Regression model in Spark ML has the highest accuracy and the quickest computing time on the entire data set using the Big Data Spark system.
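
    For readers unfamiliar with decision tree regression, the idea behind a single tree node can be sketched in a few lines. The code below is not Spark ML's implementation and is not taken from the paper: it is a minimal, standard-library Python illustration of choosing one split on one feature by minimising the summed squared error of the two leaf means. The function names and the toy data are hypothetical.

```python
def best_split(xs, ys):
    """Fit a one-split regression tree (a "stump") on a single numeric feature.

    Returns (threshold, left_mean, right_mean), where the threshold minimises
    the summed squared error around the mean of each side.
    """
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    order = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(order)):
        if order[i][0] == order[i - 1][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in order[:i]]
        right = [y for _, y in order[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = (order[i - 1][0] + order[i][0]) / 2
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct feature values to split")
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict(stump, x):
    """Predict with a fitted stump: the mean target of the side x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Toy data (hypothetical): one feature and a numeric target.
stump = best_split([1, 2, 3, 10, 11, 12], [5, 6, 7, 20, 21, 22])
```

    A full regression tree applies the same search recursively to each side and across all features; libraries such as Spark ML add binning, depth limits and distributed execution on top of this basic idea.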

Decision-making system for the resource forecasting and risk management using regression algorithms (회귀알고리즘을 이용한 자원예측 및 위험관리를 위한 의사결정 시스템)

  • Han, Hyung-Chul;Jung, Jae-Hun;Kim, Sin-Ryeong;Kim, Young-Gon
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.15 no.6 / pp.311-319 / 2015
  • In this paper, to increase the production efficiency of industrial plants by predicting the resources of the manufacturing process, we propose a decision-making system for resource forecasting and risk management. Because a variety of information arises at each step, it is difficult to create detailed process steps for the scenario to be managed, and the conditions of manufacturing facilities change frequently in order to produce various products even within the same process. In addition, the data are not contiguous and the production cycle is not constant, so variation must be checked from a small amount of data. To solve these problems, manufacturing process data must be centralized, and process resource prediction, risk prediction, and monitoring of the current process status must allow immediate action when a problem occurs. In this paper, we propose a three-stage decision system that uses a regression algorithm to derive formulas for the range of change in design drawings, resource prediction, and the process completion date, together with a classification tree technique and boundary value analysis.

A Study on Factors of the Academic Achievement in Computer Training Courses as the Liberal Arts in University (대학 컴퓨터 실습 교양과목에서의 학업성취 요인에 대한 연구)

  • Kim, Wanseop
    • Journal of The Korean Association of Information Education / v.17 no.4 / pp.433-447 / 2013
  • The purpose of this study is to identify the factors behind students' achievement in computer training courses that are based on computer practice. To improve students' academic achievement, it is necessary to analyze the factors affecting it and apply the results of the analysis to education. In particular, the factors of academic achievement in practical computer training courses need to be studied, because these courses differ from courses that focus on theory. In this study, logistic regression analysis and decision tree analysis, a data mining technique, were performed to identify the factors. The experimental data were the MOS certification test results of S University in Seoul. Logistic regression analysis found that the professor, class size, lecture time, and group (lecture period) are important, in that order. Decision tree analysis found some additional factors: entrance year, whether the course is retaken, and the classroom environment. As identified through the model tree, these various factors affect academic achievement in combination. The tree model is presented as a result of the analysis, and the importance of the factors is expressed numerically from multiple tree models by using the proposed mathematical formula.

Sequential prediction of TBM penetration rate using a gradient boosted regression tree during tunneling

  • Lee, Hang-Lo;Song, Ki-Il;Qi, Chongchong;Kim, Kyoung-Yul
    • Geomechanics and Engineering / v.29 no.5 / pp.523-533 / 2022
  • Several prediction models of the penetration rate (PR) of tunnel boring machines (TBMs) have focused on application at the design stage. In the construction stage, however, the expected PR and its trends change during tunneling owing to TBM excavation skills and the gap between the investigated and actual geological conditions. Monitoring the PR during tunneling is crucial to rescheduling the excavation plan in real time. This study proposes a sequential prediction method applicable in the construction stage. Geological and TBM operating data were collected from the Gunpo cable tunnel in Korea and preprocessed through normalization and augmentation. The results show that sequential prediction with a unit prediction distance (UPD) of 1 ring achieves $R^2 \geq 0.79$, whereas one-step prediction achieves $R^2 \leq 0.30$. As a modeling algorithm, a gradient boosted regression tree (GBRT) outperformed least-squares linear regression in the sequential prediction method. For practical use, a simple equation relating $R^2$ and UPD is proposed. As UPD increases, $R^2$ decreases exponentially; in particular, the UPD at $R^2 = 0.60$ is calculated as 28 rings using the equation. Such an interval will provide enough time for decision-making. The UPD can be adjusted depending on the project and the $R^2$ value targeted by an operator. Therefore, the calculation process for the equation relating $R^2$ and UPD is also described.
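
    The abstract states that R2 (the coefficient of determination) decreases exponentially with UPD, but it does not give the equation or its coefficients. The sketch below is therefore only an assumed illustration: it supposes a model of the form R2 = a * exp(-b * UPD), fits a and b by least squares on ln(R2), and inverts the model to find the UPD for a target R2. The functional form, the function names, and the data points are hypothetical and are not the paper's values.

```python
import math


def fit_exponential(upd, r2):
    """Fit R2 = a * exp(-b * UPD) by linear least squares on ln(R2).

    Returns (a, b). Assumes every R2 value is strictly positive.
    """
    n = len(upd)
    log_r2 = [math.log(v) for v in r2]
    mean_x = sum(upd) / n
    mean_y = sum(log_r2) / n
    sxx = sum((x - mean_x) ** 2 for x in upd)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(upd, log_r2))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def upd_for_target(a, b, target_r2):
    """Invert R2 = a * exp(-b * UPD) to get the UPD that yields target_r2."""
    return math.log(a / target_r2) / b


# Hypothetical (UPD in rings, R2) observations, generated from a = 0.9, b = 0.02.
upd_rings = [1, 5, 10, 20, 40]
r2_values = [0.9 * math.exp(-0.02 * u) for u in upd_rings]

a_hat, b_hat = fit_exponential(upd_rings, r2_values)
upd_at_060 = upd_for_target(a_hat, b_hat, 0.60)
```

    With real project data, an operator would replace the hypothetical observations with measured (UPD, R2) pairs and pick the target R2 that matches the required prediction quality.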

Development of Prediction Model for Prevalence of Metabolic Syndrome Using Data Mining: Korea National Health and Nutrition Examination Study (국민건강영양조사를 활용한 대사증후군 유병 예측모형 개발을 위한 융복합 연구: 데이터마이닝을 활용하여)

  • Kim, Han-Kyoul;Choi, Keun-Ho;Lim, Sung-Won;Rhee, Hyun-Sill
    • Journal of Digital Convergence / v.14 no.2 / pp.325-332 / 2016
  • The purpose of this study is to investigate the attributes influencing the prevalence of metabolic syndrome and to develop a prediction model for metabolic syndrome in people aged over 40, using the Korea Health and Nutrition Examination Study 2012. The researchers chose the attributes for the prediction model through a literature review. We applied decision tree, logistic regression, and artificial neural network data mining algorithms using Weka 3.6. Among the input attributes, socioeconomic status factors ranked higher than health-related factors. Additionally, the prediction model using the decision tree algorithm showed the highest accuracy. This study suggests that prevention and management of metabolic syndrome should be approached from the perspective of both socioeconomic status and health-related factors. Also, decision tree algorithms, known from other research, are useful in the field of public health because they are easy to interpret.

A Study on University Big Data-based Student Employment Roadmap Recommendation (대학 빅데이터 기반 학생 취업 로드맵 추천에 관한 연구)

  • Park, Sangsung
    • Journal of Korea Society of Digital Industry and Information Management / v.17 no.3 / pp.1-7 / 2021
  • The number of new students at many domestic universities is declining. In particular, private universities, which are highly dependent on tuition, are facing an existential crisis. Amid the declining school-age population, universities are striving to fill their enrollment by improving the quality of education and increasing the student employment rate. Recently, universities have increasingly used their accumulated big data to prepare measures for filling enrollment. A representative example is the analysis of factors that affect student employment. Existing studies of employment-influencing factors have applied quantitative models such as regression analysis to university big data. However, since the factors affecting employment differ by major, this difference needs to be reflected. In this paper, the factors affecting employment by major are analyzed using the data of University C and the decision tree model. In addition, based on the analysis results, a roadmap for student employment by major is recommended. As a result of the experiment, four decision tree models were constructed for each major, and factors affecting employment by major and roadmap were derived.

A Study on Regional Variations for Disease-specific Cardiac Arrest (질환성 심정지 발생의 지역별 변이에 관한 연구)

  • Park, Il-Su;Kim, Eun-Ju;Kim, Yoo-Mi;Hong, Sung-Ok;Kim, Young-Taek;Kang, Sung-Hong
    • Journal of Digital Convergence / v.13 no.1 / pp.353-366 / 2015
  • The purpose of this study was to examine how region-specific characteristics affect the occurrence of cardiac arrest. For the analysis, we assembled a unique data set including key indicators of health conditions and cardiac arrest occurrence in 244 small administrative districts. Our data came from two main sources at the Korea Centers for Disease Control and Prevention (KCDC): the 2010 Out-of-Hospital Cardiac Arrest Surveillance and the Community Health Survey. We analyzed the data using multiple regression, geographically weighted regression, and a decision tree. The decision tree model was selected as the final model to explain regional variations in cardiac arrest. The factors behind regional variations in cardiac arrest occurrence are population density, diagnosis rate of hypertension, stress level, screening participation level, high drinking rate, and smoking rate. Taken as a whole, accounting for geographical variations in health conditions, health behaviors, and other socioeconomic factors is important when implementing regionally customized health policy to decrease the occurrence of cardiac arrest.