• Title/Summary/Keyword: stepwise variable selection

Climatic Influence on Seed Protein Content in Soybean(Glycine max) (기상요인이 콩 단백질 함량에 미치는 영향)

  • M. H. Yang;J. W. Burton
    • KOREAN JOURNAL OF CROP SCIENCE / v.42 no.5 / pp.539-547 / 1997
  • This study was carried out to identify how soybean seed protein concentration is influenced by climatic factors. Twelve lines selected for seed protein concentration were studied in 13 environments in North Carolina. Sensitivity of seed protein concentration, total seed protein, and seed yield to climatic variables was investigated using a linear regression model. Best response models were determined using two stepwise selection methods, Maximum R-square and Stepwise Selection. There were wide climatic effects on seed protein concentration, total protein, and seed yield. The highest protein concentration environment was characterized by the most high temperature days (HTD) and the smallest variance of average daily temperature range (VADTRg), while the lowest protein concentration environment was distinguished by the fewest HTD and the largest VADTRg. For protein concentration, all lines responded positively to average maximum daily temperature (MxDT), HTD, and average daily temperature range (ADTRg) and negatively to ADRa, while they responded positively or negatively to average daily temperature (ADT), variance of average minimum daily temperature (VMnDT), and VADTRg, indicating that genotypes may differ greatly in their degree of sensitivity to each climatic variable. Eleven lines seemed to have best response models with 2 or 3 variables. As an exception, NC106 did not show significant sensitivity to any climatic variable and thus did not have a best response model. This indicates that it may be considered phenotypically more stable. For total seed protein and seed yield, all the lines responded negatively to both ADTRg and VADRa, suggesting that synthesis of seed components may increase with a smaller daily temperature range and less variation in daily rainfall.

Development of Variable Selection Technique using Stepwise Regression and Data Envelopment Analysis (단계적 회귀법과 자료봉합분석을 이용한 변수선택기법의 개발)

  • Jeong, Min-Eui;Yu, Song-Jin
    • Journal of KIISE: Software and Applications / v.41 no.8 / pp.598-604 / 2014
  • In this paper, we develop a stepwise regression data envelopment model to select important variables. We formulate a null hypothesis to assess the importance of each variable and use the Kruskal-Wallis test for this purpose. If the Kruskal-Wallis test rejects the null hypothesis, this implies a significant fluctuation in the efficiency score relative to the base model. We must then use the Conover-Inman test to check which pair of variables causes the fluctuation and so determine its importance. The proposed model helps clarify the extent to which decision-making units are misclassified as efficient or inefficient when variables are retained or discarded, and it provides useful managerial prescriptions for improvement strategies.

Prediction of Quantitative Traits Using Common Genetic Variants: Application to Body Mass Index

  • Bae, Sunghwan;Choi, Sungkyoung;Kim, Sung Min;Park, Taesung
    • Genomics & Informatics / v.14 no.4 / pp.149-159 / 2016
  • With the success of the genome-wide association studies (GWASs), many candidate loci for complex human diseases have been reported in the GWAS catalog. Recently, many disease prediction models based on penalized regression or statistical learning methods were proposed using candidate causal variants from significant single-nucleotide polymorphisms of GWASs. However, there have been only a few systematic studies comparing existing methods. In this study, we first constructed risk prediction models, such as stepwise linear regression (SLR), least absolute shrinkage and selection operator (LASSO), and Elastic-Net (EN), using a GWAS chip and GWAS catalog. We then compared the prediction accuracy by calculating the mean square error (MSE) value on data from the Korea Association Resource (KARE) with body mass index. Our results show that SLR provides a smaller MSE value than the other methods, while the numbers of selected variables in each model were similar.

Pliable regression spline estimator using auxiliary variables

  • Oh, Jae-Kwon;Jhong, Jae-Hwan
    • Communications for Statistical Applications and Methods / v.28 no.5 / pp.537-551 / 2021
  • We conducted a study on a regression spline estimator with a few pre-specified auxiliary variables. For the implementation of the proposed estimators, we adapted a coordinate descent algorithm. This was implemented by considering a structure of the sum of the residuals squared objective function determined by the B-spline and the auxiliary coefficients. We also considered an efficient stepwise knot selection algorithm based on the Bayesian information criterion. This was to adaptively select smoothly functioning estimator data. Numerical studies using both simulated and real data sets were conducted to illustrate the proposed method's performance. An R software package psav is available.

A Survival Prediction Model of Rats in Uncontrolled Acute Hemorrhagic Shock Using the Random Forest Classifier (랜덤 포리스트를 이용한 비제어 급성 출혈성 쇼크의 흰쥐에서의 생존 예측)

  • Choi, J.Y.;Kim, S.K.;Koo, J.M.;Kim, D.W.
    • Journal of Biomedical Engineering Research / v.33 no.3 / pp.148-154 / 2012
  • Hemorrhagic shock is a primary cause of death from injury worldwide. Although many studies have tried to diagnose hemorrhagic shock accurately at an early stage, such attempts have not been successful because of compensatory mechanisms in humans. The objective of this study was to construct a survival prediction model of rats in acute hemorrhagic shock using a random forest (RF) model. Heart rate (HR), mean arterial pressure (MAP), respiration rate (RR), lactate concentration (LC), and peripheral perfusion (PP) measured in rats were used as input variables for the RF model, and its performance was compared with that of a logistic regression (LR) model. Before constructing the models, we performed 5-fold cross validation for RF variable selection and forward stepwise variable selection for the LR model to examine which variables were important for the models. For the LR model, sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (ROC-AUC) were 0.83, 0.95, 0.88, and 0.96, respectively. For the RF model, sensitivity, specificity, accuracy, and AUC were 0.97, 0.95, 0.96, and 0.99, respectively. In conclusion, the RF model was superior to the LR model for survival prediction in the rat model.
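
The sensitivity, specificity, and accuracy reported in the abstract above follow the standard confusion-matrix definitions. The sketch below only illustrates those definitions; the function name `classification_metrics` and the counts passed to it are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: positives predicted correctly/incorrectly
    tn/fp: negatives predicted correctly/incorrectly
    """
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # share of all cases classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts chosen only to illustrate the arithmetic.
sens, spec, acc = classification_metrics(tp=8, fn=2, tn=9, fp=1)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

ROC-AUC is not covered here because it depends on the full ranking of predicted scores rather than on a single set of counts.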

Determination of optimal order for the full-logged I-D-F polynomial equation and significance test of regression coefficients (전대수 다항식형 확률강우강도식의 최적차수 결정 및 회귀계수에 대한 유의성 검정)

  • Park, Jin Hee;Lee, Jae Joon
    • Journal of Korea Water Resources Association / v.55 no.10 / pp.775-784 / 2022
  • In this study, to determine the optimal order of the full-logged I-D-F polynomial equation, which is mainly used to calculate the probable rainfall over a temporal rainfall duration, the probable rainfall was calculated and the regression coefficients of the full-logged I-D-F polynomial equation were estimated. The optimal variables of the polynomial equation for each station were selected using a stepwise selection method, and statistical significance tests were performed through ANOVA. Using these results, a statistically appropriate rainfall intensity equation was presented for each station. Analysis of the variable selection outputs of the full-logged I-D-F polynomial equation at 9 stations in Gyeongbuk determined the 1st to 3rd order equations at 6 stations and the incomplete 3rd order equation at 1 station to be the optimal equations. Since the 1st order equation is similar to the Sherman type equation and the 2nd order one is similar to the general type equation, it was presented as a unified form of rainfall intensity equation for convenience of use by increasing the number of independent variables. Therefore, it is judged that there is no statistical problem in considering only the 3rd order polynomial regression equation for the full-logged I-D-F.

Apartment Price Prediction Using Deep Learning and Machine Learning (딥러닝과 머신러닝을 이용한 아파트 실거래가 예측)

  • Hakhyun Kim;Hwankyu Yoo;Hayoung Oh
    • KIPS Transactions on Software and Data Engineering / v.12 no.2 / pp.59-76 / 2023
  • Since the COVID-19 era, the rise in apartment prices has been unconventional. In this uncertain real estate market, price prediction research is very important. In this paper, a model is created to predict the actual transaction price of future apartments after building a vast data set of 870,000 from 2015 to 2020 through data collection and crawling on various real estate sites and collecting as many variables as possible. This study first solved the multicollinearity problem by removing and combining variables. After that, a total of five variable selection algorithms were used to extract meaningful independent variables: Forward Selection, Backward Elimination, Stepwise Selection, L1 regularization, and Principal Component Analysis (PCA). In addition, a total of four machine learning and deep learning algorithms, a deep neural network (DNN), XGBoost, CatBoost, and Linear Regression, were trained after hyperparameter optimization, and predictive power was compared between models. In an additional experiment, the number of nodes and layers of the DNN was varied to find the most appropriate configuration. Finally, the best-performing model was used to predict the actual transaction price of apartments in 2021, and the predictions were compared with the actual 2021 data. Through this, we are confident that machine learning and deep learning will help investors make the right decisions when purchasing homes in various economic situations.
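
Forward Selection, one of the methods named in the abstract above, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the stopping rule (stop when the Bayesian information criterion no longer decreases), the helper names `solve_linear_system`, `rss`, `bic`, and `forward_select`, and the small synthetic data set are all assumptions made for the example.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(design, y)) for j in range(p)]
    beta = solve_linear_system(xtx, xty)  # normal equations
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(design, y))


def bic(columns, y):
    """Gaussian BIC up to a constant: n*log(RSS/n) + k*log(n)."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + (len(columns) + 1) * math.log(n)


def forward_select(features, y):
    """Greedily add the feature that lowers BIC most; stop when none does."""
    selected, best = [], bic([], y)
    while len(selected) < len(features):
        scored = [(bic([features[s] for s in selected + [name]], y), name)
                  for name in features if name not in selected]
        score, name = min(scored)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected


# Synthetic data (assumption): y = 3*x1 + 2*x2 + small noise; x3 is irrelevant.
features = {
    "x1": [1, 1, 1, 1, -1, -1, -1, -1],
    "x2": [1, -1, 1, -1, 1, -1, 1, -1],
    "x3": [1, 1, -1, -1, 1, 1, -1, -1],
}
noise = [0.5, -0.5, -0.5, 0.5, -0.5, 0.5, 0.5, -0.5]
y = [3 * a + 2 * b + e for a, b, e in zip(features["x1"], features["x2"], noise)]
print(forward_select(features, y))
```

Backward Elimination starts from the full model and removes variables instead, and Stepwise Selection alternates between the two moves; other entry criteria, such as an F-test threshold, could replace BIC in `forward_select`.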

Risk Prediction Using Genome-Wide Association Studies on Type 2 Diabetes

  • Choi, Sungkyoung;Bae, Sunghwan;Park, Taesung
    • Genomics & Informatics / v.14 no.4 / pp.138-148 / 2016
  • The success of genome-wide association studies (GWASs) has enabled us to improve risk assessment and provide novel genetic variants for diagnosis, prevention, and treatment. However, most variants discovered by GWASs have been reported to have very small effect sizes on complex human diseases, which has been a big hurdle in building risk prediction models. Recently, many statistical approaches based on penalized regression have been developed to solve the "large p and small n" problem. In this report, we evaluated the performance of several statistical methods for predicting a binary trait: stepwise logistic regression (SLR), least absolute shrinkage and selection operator (LASSO), and Elastic-Net (EN). We first built a prediction model by combining variable selection and prediction methods for type 2 diabetes using Affymetrix Genome-Wide Human SNP Array 5.0 from the Korean Association Resource project. We assessed the risk prediction performance using area under the receiver operating characteristic curve (AUC) for the internal and external validation datasets. In the internal validation, SLR-LASSO and SLR-EN tended to yield more accurate predictions than other combinations. During the external validation, the SLR-SLR and SLR-EN combinations achieved the highest AUC of 0.726. We propose these combinations as a potentially powerful risk prediction model for type 2 diabetes.

Effect of Somatic Cell Score on Protein Yield in Holsteins

  • Khan, M.S.;Shook, G.E.
    • Asian-Australasian Journal of Animal Sciences / v.11 no.5 / pp.580-585 / 1998
  • The study was conducted to determine if variation in protein yield can be explained by expressions of early lactation somatic cell score (SCS) and if prediction can be improved by including SCS among the predictors. A data set was prepared (n = 663,438) from Wisconsin Dairy Improvement Association (USA) records for protein yield with sample days near 20. Stepwise regression was used, requiring a significant F statistic (p < .01) for any variable to stay in the model. Separate analyses were run for 12 combinations of four seasons and the first three parities. Selection of SCS variables was not consistent across seasons or lactations. Coefficients of determination ($R^2$) ranged from 51 to 61%, with higher values for earlier lactations. Including any expression of SCS in the prediction equations improved $R^2$ by < 1%. SCS was associated with milk yield on the sample day, but the association was not strong enough to improve the prediction of future yield when other expressions of milk yield were in the model.

A Model for the Estimation of Progression Adjustment: Factors on a Signal-Controlled Street Network (신호등이 있는 가로망에서의 신호 연동화보정계수 산정모형)

  • 김원창;오영태;이승환
    • Journal of Korean Society of Transportation / v.10 no.2 / pp.25-42 / 1992
  • The purpose of this paper is to construct a model to compute a progression adjustment factor on a signalized network. To construct the model, a simulation method is introduced and TRAF-NETSIM is used as the simulation tool. An urban arterial network is chosen as the network structure so as to measure the effect of progression and compute the average stopped delay on each link. A regression model is constructed from the results of the simulation, using stepwise variable selection. The findings of this paper are as follows: i) The secondary queue and platoon ratio are sensitive to the values of the progression adjustment factor. ii) The continuous model can practically reflect various situations in the real world. The platoon adjustment factor can be computed by this model, and the data required for this model can be easily obtained in the field.