• Title/Summary/Keyword: Decision Tree Regression


Major gene identification for FASN gene in Korean cattles by data mining (데이터마이닝을 이용한 한우의 우수 지방산합성효소 유전자 조합 선별)

  • Kim, Byung-Doo;Kim, Hyun-Ji;Lee, Seong-Won;Lee, Jea-Young
    • Journal of the Korean Data and Information Science Society / v.25 no.6 / pp.1385-1395 / 2014
  • Economic traits of livestock are affected by both environmental and genetic factors. Moreover, they are affected not by a single gene but by interactions among genes. We used a linear regression model to adjust for environmental factors. To identify gene-gene interaction effects, we applied data mining techniques such as neural networks, logistic regression, CART and C5.0 to five SNPs (single nucleotide polymorphisms) of FASN (fatty acid synthase). We divided the total data into training (60%) and testing (40%) sets, and applied the model built on the training data to the testing data. Comparing prediction accuracy, C5.0 was identified as the best model. Superior genotypes were then selected using the decision tree.
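
The 60%/40% split and the accuracy comparison described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' procedure: the data, the two toy models (`fit_majority`, `fit_stump`) and all function names are hypothetical stand-ins for the neural network, logistic regression, CART and C5.0 models the study compared.

```python
import random


def train_test_split(rows, train_fraction=0.6, seed=0):
    """Shuffle a copy of the rows and split it into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    hits = sum(1 for features, label in rows if model(features) == label)
    return hits / len(rows)


def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    labels = [label for _, label in train]
    top = max(set(labels), key=labels.count)
    return lambda features: top


def fit_stump(train):
    """One-level tree: majority label for each observed value of feature 0."""
    by_value = {}
    for features, label in train:
        by_value.setdefault(features[0], []).append(label)
    table = {v: max(set(ls), key=ls.count) for v, ls in by_value.items()}
    fallback = fit_majority(train)
    return lambda features: table.get(features[0], fallback(features))


# Hypothetical data: genotype codes (0, 1, 2) at two SNPs and a trait label
# that, by construction, depends only on the first SNP.
data = [((g1, g2), "high" if g1 == 2 else "low")
        for g1 in range(3) for g2 in range(3)] * 10

train, test = train_test_split(data, train_fraction=0.6)
scores = {name: accuracy(fit(train), test)
          for name, fit in [("majority", fit_majority), ("stump", fit_stump)]}
```

The model with the highest held-out accuracy in `scores` would be the one retained, mirroring how the abstract selects its best model.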

Forecasting of the COVID-19 pandemic situation of Korea

  • Goo, Taewan;Apio, Catherine;Heo, Gyujin;Lee, Doeun;Lee, Jong Hyeok;Lim, Jisun;Han, Kyulhee;Park, Taesung
    • Genomics & Informatics / v.19 no.1 / pp.11.1-11.8 / 2021
  • For the novel coronavirus disease 2019 (COVID-19), predictive modeling in the literature broadly uses susceptible-exposed-infected-recovered (SEIR)/SIR, agent-based, and curve-fitting models. Governments and legislative bodies rely on insights from prediction models to suggest new policies and to assess the effectiveness of enforced policies. Therefore, access to accurate outbreak prediction models is essential to obtain insights into the likely spread and consequences of infectious diseases. The objective of this study is to predict the future COVID-19 situation of Korea. Here, we employed 5 models for this analysis: SEIR, local linear regression (LLR), negative binomial (NB) regression, segmented Poisson, deep-learning based long short-term memory models (LSTM) and tree based gradient boosting machine (GBM). After prediction, model performance was evaluated using relative mean squared errors (RMSE) for two sets of training data (January 20, 2020-December 31, 2020 and January 20, 2020-January 31, 2021) and testing data (January 1, 2021-February 28, 2021 and February 1, 2021-February 28, 2021). Except for the segmented Poisson model, the models predicted a decline in the daily confirmed cases in the country for the near future. A comparison of RMSE values showed that LLR, GBM, SEIR, NB, and LSTM, respectively, performed well in forecasting the pandemic situation of the country. A good understanding of the epidemic dynamics would greatly enhance the control and prevention of COVID-19 and other infectious diseases. Therefore, with daily confirmed cases increasing since the beginning of this year, these results could help in the pandemic response by informing decisions about planning, resource allocation, and social distancing policies.
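
For readers unfamiliar with the SEIR model named in this abstract, the sketch below integrates the standard SEIR equations with forward Euler steps. It is a generic textbook formulation, not the fitted model from the paper; the function name and the parameter values (`beta`, `sigma`, `gamma`, population size) are illustrative assumptions.

```python
def simulate_seir(population, infected0, beta, sigma, gamma, days, dt=0.1):
    """Integrate the SEIR equations with forward Euler steps.

    beta  : transmission rate per day (assumed value, not from the paper)
    sigma : rate at which exposed people become infectious (1 / latent period)
    gamma : recovery rate (1 / infectious period)
    Returns a list of daily (S, E, I, R) tuples, including day 0.
    """
    s = float(population - infected0)
    e = 0.0
    i = float(infected0)
    r = 0.0
    steps_per_day = int(round(1 / dt))
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical parameters chosen only to make the example run.
history = simulate_seir(population=1_000_000, infected0=10,
                        beta=0.3, sigma=1 / 5, gamma=1 / 10, days=60)
```

In a forecasting setting, the rate parameters would be estimated from the training period and the simulation continued over the testing period.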

A Comparative Study of Prediction Models for College Student Dropout Risk Using Machine Learning: Focusing on the case of N university (머신러닝을 활용한 대학생 중도탈락 위험군의 예측모델 비교 연구 : N대학 사례를 중심으로)

  • So-Hyun Kim;Sung-Hyoun Cho
    • Journal of The Korean Society of Integrative Medicine / v.12 no.2 / pp.155-166 / 2024
  • Purpose : This study aims to identify key factors for predicting dropout risk at the university level and to provide a foundation for policy development aimed at dropout prevention. This study explores the optimal machine learning algorithm by comparing the performance of various algorithms using data on college students' dropout risks. Methods : Data on factors influencing dropout risk and propensity were collected from N University. The collected data were applied to several machine learning algorithms, including random forest, decision tree, artificial neural network, logistic regression, support vector machine (SVM), k-nearest neighbor (k-NN) classification, and Naive Bayes. The performance of these models was compared and evaluated, with a focus on predictive validity and the identification of significant dropout factors through the information gain index of machine learning. Results : The binary logistic regression analysis showed that the year of the program, department, grades, and year of entry had a statistically significant effect on the dropout risk. Among the machine learning algorithms, random forest performed the best. The relative importance of the predictor variables was highest for department, followed by age, grade, and whether the student's residence matched the school location. Conclusion : Machine learning-based prediction of dropout risk focuses on the early identification of students at risk. The types and causes of dropout crises vary significantly among students. It is important to identify the types and causes of dropout crises so that appropriate actions and support can be taken to remove risk factors and increase protective factors. The relative importance of the factors affecting dropout risk found in this study will help guide educational prescriptions for preventing college student dropout.
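
The "information gain index" mentioned in this abstract is usually defined as the reduction in label entropy obtained by splitting on a feature. The following is a minimal sketch of that standard definition; the toy student records and variable names are hypothetical and are not data from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical records: dropout outcome and two candidate predictors.
outcome = ["drop", "drop", "stay", "stay"]
department = ["A", "A", "B", "B"]      # separates the outcomes perfectly
residence = ["near", "far", "near", "far"]  # carries no information here

gain_department = information_gain(department, outcome)
gain_residence = information_gain(residence, outcome)
```

Ranking predictors by this quantity gives an importance ordering of the kind the abstract reports.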

Development and application of prediction model of hyperlipidemia using SVM and meta-learning algorithm (SVM과 meta-learning algorithm을 이용한 고지혈증 유병 예측모형 개발과 활용)

  • Lee, Seulki;Shin, Taeksoo
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.111-124 / 2018
  • This study aims to develop a classification model for predicting the occurrence of hyperlipidemia, one of the chronic diseases. Prior studies applying data mining techniques to disease prediction can be classified into model design studies for predicting cardiovascular disease and studies comparing disease prediction results. In the foreign literature, studies predicting cardiovascular disease were predominant among those predicting disease with data mining techniques. Domestic studies were not much different from those of foreign countries, although they mainly focused on hypertension and diabetes. Since hyperlipidemia, a chronic disease like hypertension and diabetes, is also of high importance, this study selected hyperlipidemia as the disease to be analyzed. We developed a model for predicting hyperlipidemia using SVM and meta-learning algorithms, which are already known to have excellent predictive power. To achieve the purpose of this study, we used the 2012 Korea Health Panel data set. The Korea Health Panel produces basic data on the level of health expenditure, health level and health behavior, and has conducted an annual survey since 2008. In this study, 1,088 patients with hyperlipidemia were randomly selected from the hospitalized, outpatient, emergency, and chronic disease data of the Korea Health Panel in 2012, and 1,088 nonpatients were also randomly extracted. A total of 2,176 people were selected for the study. Three methods were used to select input variables for predicting hyperlipidemia. First, a stepwise method was performed using logistic regression. Among the 17 variables, the categorical variables (except for length of smoking) were expressed as dummy variables, which are treated as separate variables relative to the reference group, and these variables were analyzed.
Six variables (age, BMI, education level, marital status, smoking status, gender), excluding income level and smoking period, were selected at a significance level of 0.1. Second, C4.5, a decision tree algorithm, was used. The significant input variables were age, smoking status, and education level. Finally, genetic algorithms were used. In SVM, the input variables selected by genetic algorithms consisted of 6 variables (age, marital status, education level, economic activity, smoking period, and physical activity status), and the input variables selected by genetic algorithms in the artificial neural network consisted of 3 variables (age, marital status, and education level). Based on the selected variables, we compared SVM, meta-learning algorithms and other prediction models for hyperlipidemia patients, and compared the classification performances using TP rate and precision. The main results of the analysis are as follows. First, the accuracy of the SVM was 88.4% and the accuracy of the artificial neural network was 86.7%. Second, the accuracy of classification models using the input variables selected through the stepwise method was slightly higher than that of classification models using all variables. Third, the precision of the artificial neural network was higher than that of SVM when only the three input variables selected by decision trees were used. For classification models based on the input variables selected through the genetic algorithm, the classification accuracy of SVM was 88.5% and that of the artificial neural network was 87.9%. Finally, this study indicated that stacking, the meta-learning algorithm proposed in this study, has the best performance when it uses the predicted outputs of SVM and MLP as input variables of SVM, which is the meta classifier. The purpose of this study was to predict hyperlipidemia, one of the representative chronic diseases.
To do this, we used SVM and meta-learning algorithms, which are known to have high accuracy. As a result, the classification accuracy for hyperlipidemia of stacking as a meta learner was higher than that of other meta-learning algorithms. However, the predictive performance of the meta-learning algorithm proposed in this study is the same as that of SVM, which had the best performance (88.6%) among the single models. The limitations of this study are as follows. First, various variable selection methods were tried, but most variables used in the study were categorical dummy variables. With a large number of categorical variables, models such as decision trees may be better suited than general models such as neural networks, so the results may differ if continuous variables are used. Despite these limitations, this study is significant in predicting hyperlipidemia with hybrid models such as meta-learning algorithms, which have not been studied previously. The improvement in model accuracy obtained by applying various variable selection techniques is also meaningful. In addition, it is expected that our proposed model will be effective for the prevention and management of hyperlipidemia.
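
The abstract compares classifiers by accuracy, TP rate and precision. As a reminder of how these are conventionally derived from a confusion matrix, here is a small sketch; the label vectors are made up for illustration and the function name is an assumption, not something taken from the study.

```python
def classification_metrics(actual, predicted, positive=1):
    """Accuracy, TP rate (recall) and precision for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # TP rate: share of actual positives that were detected.
        "tp_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        # Precision: share of predicted positives that were correct.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical labels: 1 = hyperlipidemia patient, 0 = nonpatient.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(actual, predicted)
```

With balanced classes, as in the 1,088 versus 1,088 sample described above, accuracy is less likely to be dominated by the majority class, though TP rate and precision still show which kind of error a model makes.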

Convergence-based analysis on geographical variations of the smoking rates (융복합 기반의 지역간 흡연율의 변이 분석)

  • Lim, Ji-Hye;Kang, Sung-Hong
    • Journal of Digital Convergence / v.13 no.8 / pp.375-385 / 2015
  • This study aims to identify geographical variations in smoking rates and the factors that affect them. The data were collected from the Community Health Survey conducted between 2009 and 2011 by the Korea Centers for Disease Control and Prevention and from other government organizations. Correlation and multiple regression analysis were used to examine the factors influencing smoking rates. To investigate regional variations, we employed a decision tree model. The study found that the significant factors associated with geographical variations in smoking rates were the rate of hazardous drinking, the completion rate of hypertension education, the experience rate of anti-smoking campaigns, stress awareness rate, hypertension prevalence, health insurance cost, diabetes prevalence, obesity rate, and strength training rate. Convergence-based analysis of geographical variations in smoking rates is highly important when regionally customized healthcare programs are implemented. In the future, it will be necessary to develop effective programs and customized approaches for regions with high smoking rates. Our study is expected to provide meaningful data for the design and assessment of effective health care programs, leading to effective non-smoking programs.

A Convergence Study in the Severity-adjusted Mortality Ratio on inpatients with multiple chronic conditions (복합만성질환 입원환자의 중증도 보정 사망비에 대한 융복합 연구)

  • Seo, Young-Suk;Kang, Sung-Hong
    • Journal of Digital Convergence / v.13 no.12 / pp.245-257 / 2015
  • This study aimed to develop a predictive model for severity-adjusted mortality of inpatients with multiple chronic conditions and to analyze the factors behind variation in the hospital standardized mortality ratio (HSMR), in order to propose a plan to reduce that variation. We collected data from the "Korean National Hospital Discharge In-depth Injury Survey" from 2008 to 2010 and selected a final 110,700 study subjects who had a chronic disease as the principal diagnosis and who were over the age of 30 with more than 2 chronic diseases including the principal diagnosis. We designed a severity-adjusted mortality predictive model using data mining methods (logistic regression analysis, decision tree and neural network). In this study, we used the decision tree model with the Elixhauser comorbidity index to predict the severity-adjusted mortality ratio. The HSMR of inpatients with multiple chronic conditions showed statistically significant differences by insurance type, number of hospital beds, and hospital location. Based on these results, methods should be found to manage the mortality ratio of inpatients with multiple chronic conditions efficiently at the national level. Efforts should also be made to increase the quality of medical treatment for inpatients with multiple chronic diseases and to reduce growing medical expenses.
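
A hospital standardized mortality ratio is conventionally computed as observed deaths divided by the deaths expected under a risk model, times 100. The sketch below assumes that conventional definition, with expected deaths taken as the sum of per-patient predicted risks; the abstract does not spell out its exact formula, and the records and function name here are hypothetical.

```python
from collections import defaultdict


def hsmr_by_hospital(records):
    """Compute 100 * observed deaths / expected deaths for each hospital.

    records: iterable of (hospital_id, died, predicted_risk) tuples, where
    died is 0 or 1 and predicted_risk is a model's mortality probability
    for that patient (for example, from a severity-adjustment model).
    """
    observed = defaultdict(float)
    expected = defaultdict(float)
    for hospital_id, died, predicted_risk in records:
        observed[hospital_id] += died
        expected[hospital_id] += predicted_risk
    # Assumes every hospital has a non-zero expected death count.
    return {h: 100.0 * observed[h] / expected[h] for h in expected}


# Hypothetical discharge records.
records = [
    ("A", 1, 0.5), ("A", 0, 0.25), ("A", 0, 0.25), ("A", 1, 0.5),
    ("B", 0, 0.5), ("B", 1, 0.5),
]
ratios = hsmr_by_hospital(records)
```

A value above 100 indicates more deaths than the risk model expected for that hospital's case mix, and a value below 100 indicates fewer.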

A Prediction Model for the Development of Cataract Using Random Forests (Random Forests 기법을 이용한 백내장 예측모형 - 일개 대학병원 건강검진 수검자료에서 -)

  • Han, Eun-Jeong;Song, Ki-Jun;Kim, Dong-Geon
    • The Korean Journal of Applied Statistics / v.22 no.4 / pp.771-780 / 2009
  • Cataract is the main cause of blindness and visual impairment; in particular, age-related cataract accounts for about half of the 32 million cases of blindness worldwide. As life expectancy increases and the elderly population expands, cases of cataract increase as well, causing serious economic and social problems throughout the country. However, the incidence of cataract can be reduced dramatically through early diagnosis and prevention. In this study, we developed a prediction model of cataracts for early diagnosis using hospital data of 3,237 subjects who first received the screening test and later visited the medical center for cataract check-ups between 1994 and 2005. To develop the prediction model, we used random forests and compared the predictive performance of this model with other common discriminant models such as logistic regression, discriminant model, decision tree, naive Bayes, and two popular ensemble models, bagging and arcing. The accuracy of random forests was 67.16%, the sensitivity was 72.28%, and the main factors included in the model were age, diabetes, WBC, platelet, triglyceride, BMI and so on. The results showed that the model could predict about 70% of cataract cases from the screening test alone, without any information from a direct eye examination by an ophthalmologist. We expect that our model may contribute to diagnosing cataract and help prevent it in its early stages.

Prediction of Carcass Yield by Ultrasound in Hanwoo (초음파 측정에 의한 한우의 도체육량 예측)

  • Rhee, Y. J.;Jeon, K. J.;Choi, S. B.;Seok, H. K.;Kim, S. J.;Lee, S. K.;Song, Y. H.
    • Journal of Animal Science and Technology / v.45 no.2 / pp.335-342 / 2003
  • This study was conducted to predict carcass yield traits using ultrasound before slaughter and to enhance the prediction accuracy of carcass yield grade by applying various strategies. For this experiment, 573 Hanwoo steers of 24 months of age were used. The differences between the ultrasound result and the carcass measure of backfat thickness (BFT) and longissimus muscle area (LMA) were 0.6 ± 1.65 mm and 0.7 ± 5.56 cm², respectively. The correlation coefficients between the ultrasound result and the carcass measure of BFT and LMA were 0.86 and 0.82, respectively (p<0.001). The prediction accuracies for yield grade by four methods (the Korean yield grade index equation, fat depth alone, regression, and decision tree) were 80.3%, 81.3%, 80.1% and 81.8%, respectively. We conclude that the decision tree method can easily predict yield grade and is also useful for increasing the prediction accuracy rate.
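
The two agreement statistics reported in this abstract, a mean difference with its standard deviation and a Pearson correlation coefficient, can be computed as in the sketch below. The measurements are invented for illustration, and the sign convention (ultrasound minus carcass) is an assumption, since the abstract does not state the direction of the difference.

```python
from math import sqrt
from statistics import mean, stdev


def agreement(ultrasound, carcass):
    """Return (mean difference, SD of differences, Pearson r)."""
    diffs = [u - c for u, c in zip(ultrasound, carcass)]
    mu, mc = mean(ultrasound), mean(carcass)
    cov = sum((u - mu) * (c - mc) for u, c in zip(ultrasound, carcass))
    var_u = sum((u - mu) ** 2 for u in ultrasound)
    var_c = sum((c - mc) ** 2 for c in carcass)
    r = cov / sqrt(var_u * var_c)
    return mean(diffs), stdev(diffs), r


# Hypothetical backfat thickness readings in mm (not data from the study).
ultrasound_bft = [8.0, 10.0, 12.0, 14.0]
carcass_bft = [7.0, 10.0, 11.0, 14.0]
mean_diff, sd_diff, r = agreement(ultrasound_bft, carcass_bft)
```

A high correlation alone does not rule out a systematic offset between the two measurements, which is why the mean difference is reported alongside it.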

Study on Detection for Cochlodinium polykrikoides Red Tide using the GOCI image and Machine Learning Technique (GOCI 영상과 기계학습 기법을 이용한 Cochlodinium polykrikoides 적조 탐지 기법 연구)

  • Unuzaya, Enkhjargal;Bak, Su-Ho;Hwang, Do-Hyun;Jeong, Min-Ji;Kim, Na-Kyeong;Yoon, Hong-Joo
    • The Journal of the Korea institute of electronic communication sciences / v.15 no.6 / pp.1089-1098 / 2020
  • In this study, we propose a method to detect Cochlodinium polykrikoides red tide using machine learning and geostationary marine satellite images. To train the machine learning models, GOCI Level 2 data and the red tide location data of the National Fisheries Research and Development Institute were used. The machine learning models used were a logistic regression model, a decision tree model, and a random forest model. The performance evaluation confirmed that, compared to the traditional GOCI image-based red tide detection algorithm without machine learning (Son et al., 2012) (75%), the accuracy was improved by about 13~22%p (88~98%). In addition, a comparison of detection performance between the machine learning models showed that the random forest model (98%) had the highest detection accuracy. It is believed that this machine learning-based red tide detection algorithm can be used to detect red tide early and to track and monitor its movement and spread.

The Study on Hypertension Cure Rate Management Centering around Wellness Local Community : With GwangJu as a Central Figure (웰니스 지역사회 중심의 고혈압 치료율 관리 방안에 관한 연구 : 광주광역시 중심으로)

  • Yang, Yu-Jeong;Park, Jong-Ho
    • Journal of Korea Entertainment Industry Association / v.15 no.8 / pp.351-361 / 2021
  • This study was conducted to identify the factors of hypertension treatment in Gwangju using local community health surveys and to propose a hypertension cure rate management plan centered on the wellness local community. The research collected 13,714 Gwangju records from a total of 685,820 local community health survey records of the KDCA (Korea Disease Control and Prevention Agency) from 2017 to 2019. Among these, 2,941 subjects aged over 30 with diagnosed hypertension were selected and analyzed using SAS 9.4 and SAS Enterprise Miner 15.1. The results are as follows. By socioeconomic characteristics, the hypertension cure rate in Gwangju differed by gender, age, marital status, level of educational attainment, economic activity status, and monthly income. By health behavior characteristics, significant differences in the hypertension cure rate were found for current smoking, monthly alcohol consumption, high-risk drinking, breakfast, recognition of good health level, diabetes and treatment, annual unmet medical needs, and annual health center use. The logistic regression analysis and interactive decision tree analysis conducted to identify the factors affecting hypertension treatment found that these factors were age, marital status, diabetes and treatment, and annual unmet medical needs. Accordingly, a public health education effort with an effective preparation plan is needed in Gwangju to raise awareness of the importance of hypertension treatment among young people and to prevent complications.