• Title/Summary/Keyword: stepwise variable selection

Reliability of Covariates in Baseline Survey of a Cohort Study: Epidemiological Investigation on Cancer Risk Among Residents Who Reside Near the Nuclear Power Plants in Korea (코호트 기반 조사 공변수 자료의 신뢰도 평가 연구: 원전주변지역주민 역학조사연구)

  • Bae, Sang-Hyuk;Park, Bo-Young;Li, Zhong-Min;Ahn, Yoon-Ok
    • Journal of Preventive Medicine and Public Health / v.43 no.2 / pp.159-165 / 2010
  • Objectives: We evaluated the reliability of the possible covariates in the baseline survey data collected for the Epidemiological Investigation on Cancer Risk Among Residents Who Reside Near the Nuclear Power Plants in Korea. Methods: Follow-up surveys were conducted for 477 participants of the cohort less than 1 year after the initial survey. The mean interval between the initial and follow-up surveys was 282.5 days. Possible covariates were identified by analyzing, for all variables, the correlations with the exposure variable and the associations with the outcome variables. Logistic regression analysis with stepwise selection was then conducted among the possible covariates to select variables that covary with other variables, on the assumption that these variables can represent the others. Seven variables for the males and 3 variables for the females, which had covariance with other possible covariates, were selected as representative variables. The Kappa index of each variable was calculated. Results: For the males, the Kappa indexes were as follows: family history of cancer, 0.64; family history of liver diseases in parents and siblings, 0.56; family history of hypertension in parents and siblings, 0.51; family history of liver diseases, 0.50; family history of hypertension, 0.44; history of chronic liver diseases, 0.53; and history of pulmonary tuberculosis, 0.36. For the females, the Kappa indexes were as follows: family history of cancer, 0.58; family history of hypertension in parents and siblings, 0.56; and family history of hypertension, 0.47. Conclusions: Most of the possible covariates showed good to moderate agreement.
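
The Kappa index reported above compares observed agreement between two surveys with the agreement expected by chance. A minimal sketch of Cohen's kappa for one categorical item follows; the function name and the sample answers are hypothetical and are not data from the study.

```python
from collections import Counter

def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length lists of categorical answers."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    n = len(first)
    # Observed agreement: share of participants who gave the same answer twice.
    p_observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement, computed from the marginal frequencies of each survey.
    counts_1, counts_2 = Counter(first), Counter(second)
    p_chance = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical answers to one yes/no item in the initial and follow-up surveys.
initial = ["yes", "yes", "no", "no", "yes", "no", "no", "no", "yes", "no"]
followup = ["yes", "no", "no", "no", "yes", "no", "yes", "no", "yes", "no"]
kappa = cohen_kappa(initial, followup)
```

Here 8 of 10 answers agree and the chance agreement is 0.52, so kappa is 0.28 / 0.48, roughly 0.58.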

Categorical data analysis of sensory evaluation data with Hanwoo bull beef (한우 수소 고기 관능평가 데이터에 대한 범주형 자료 분석)

  • Lee, Hye-Jung;Cho, Soo-Hyun;Kim, Jae-Hee
    • Journal of the Korean Data and Information Science Society / v.20 no.5 / pp.819-827 / 2009
  • This study was conducted to investigate the relationship between sociodemographic factors and Korean consumers' palatability evaluation grades using Hanwoo sensory evaluation data. A dichotomous logistic regression model and a multinomial logistic regression model are fitted with independent variables such as the consumer's living location, age, gender, occupation, monthly income, and beef cut, and with the palatability grade as the dependent variable. A stepwise variable selection procedure is used to find the final model, and odds ratios are calculated to find the associations between categories.

Statistical Prediction of Used Tablet PC Transaction Price among Consumers (소비자 사이의 중고 태블릿PC 거래 가격의 통계적 예측)

  • Younghee Go;Sohyung Kim;Yujin Chung
    • Journal of Industrial Convergence / v.20 no.12 / pp.179-186 / 2022
  • This study aims to develop a predictive model to suggest a used sales price to sellers and buyers when trading used tablet PCs. For model development, we analyzed the real used tablet PC transaction data and additionally collected detailed product information. We developed several predictive models and selected the best predictive model among them. Specifically, we considered a multiple linear regression model using the used sales price as a dependent variable and other variables in the integrated data as independent variables, a multiple linear regression model including interactions, and the models from stepwise variable selection in each model. The model with the best predictive performance was finally selected through cross-validation. Through this study, we can predict the sales price of used tablet PCs and suggest appropriate used sales prices to sellers and buyers.
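
As a rough illustration of the variable selection step described above, the sketch below runs a forward-only selection for a linear regression in plain Python. It is a simplification under stated assumptions: full stepwise procedures also drop variables and usually rely on F-tests or information criteria, whereas this version stops when the gain in R-squared falls below a hypothetical threshold (`min_r2_gain`). The data are invented, and the study's actual software and criteria are not given in the abstract.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(rows, y, cols):
    """Residual sum of squares of least squares with an intercept on the given columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2 for d, t in zip(design, y))

def forward_select(rows, y, min_r2_gain=0.01):
    """Repeatedly add the predictor that lowers RSS most; stop when R^2 gain is small."""
    selected, remaining = [], list(range(len(rows[0])))
    total_ss = rss(rows, y, [])  # intercept-only model, i.e. total sum of squares
    current = total_ss
    while remaining:
        best = min(remaining, key=lambda c: rss(rows, y, selected + [c]))
        new = rss(rows, y, selected + [best])
        if (current - new) / total_ss < min_r2_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = new
    return selected

# Hypothetical data: y depends on columns 0 and 2 plus small noise; column 1 is irrelevant.
rows = [[1, 2, 3], [2, 7, 1], [3, 1, 4], [4, 8, 1], [5, 2, 5], [6, 8, 9], [7, 1, 2], [8, 8, 6]]
y = [0.6, 5.4, 3.1, 9.4, 5.4, 1.6, 13.9, 10.1]
selected = forward_select(rows, y)
```

On this toy data the procedure keeps columns 0 and 2 and leaves out column 1. Candidate models chosen this way can then be compared by cross-validation, as the abstract describes.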

The Selection of the Suitable Site for Forest Tree(Pinus thunbergii) (임목(林木)((해송(海松)) 적지선정(適地選定)에 관한 연구(硏究))

  • Chung, Young Gwan;Park, Nam Chang;Son, Yeong Mo
    • Journal of Korean Society of Forest Science / v.82 no.4 / pp.420-430 / 1993
  • This study was conducted to investigate the effect of forest environmental factors (5 items) and physico-chemical properties of soil (13 items) on the growth of Pinus thunbergii stands. A total of 218 plots were sampled over the coastal districts of the whole country. In the statistical analysis, the explanatory variables were the soil and environmental factors (18 items), and the response variable was the site index of Pinus thunbergii stands. Data were processed in the following order: preparation of the original data, computation of the inner correlation matrix by correlation analysis, calculation of partial correlation coefficients and coefficients of determination, estimation of the regression equation by stepwise regression analysis, and stepwise regression analysis on the factor scores from factor analysis. The main results are summarized as follows: 1. The site index of Pinus thunbergii stands was highly correlated with effective soil depth (r=0.8668), slope percentage, organic matter, and total nitrogen. 2. According to the partial correlation coefficients, effective soil depth (r=0.6270), slope percentage (r=-0.5423), and base saturation (r=0.3278) among the environmental factors had a great effect on tree growth. 3. In the stepwise regression analysis, the factors affecting the growth of Pinus thunbergii stands were effective soil depth, slope percentage, organic matter, base saturation, soil pH, silt content, exchangeable Ca, and others. 4. The estimation equation for the site index of Pinus thunbergii stands was $Y=13.2691+0.0242\;X_2-1.2244\;X_4+0.6142\;X_5-0.3472\;X_{11}+0.0355\;X_{13}+0.1552\;X_{15}-0.1002\;X_{17}$. The coefficient of determination for the estimation model was 0.77, which was significant at the 1 percent level. 5. In the factor analysis of the environmental factors, six factors were extracted as principal components, and the communality contribution percentage was 71.1 percent. 6. In the stepwise regression analysis between the factor scores and the site index of Pinus thunbergii stands, five principal components affected the site index. The coefficient of determination was 85 percent, which was significant at the 1 percent level. In conclusion, stepwise regression analysis proved highly useful for identifying which factors affect tree height growth in Pinus thunbergii stands. The management of Pinus thunbergii stands should also take the selected growth factors into account.
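
The estimation equation in result 4 can be evaluated directly. The sketch below copies the intercept and coefficients from the abstract and keys them by the X subscripts; the abstract does not state which factor each subscript denotes or in which units, so the function name and the input values are purely illustrative assumptions.

```python
# Coefficients copied from the estimation equation in the abstract.
# Keys are the X subscripts; the abstract does not say which factor each X is.
INTERCEPT = 13.2691
COEFFICIENTS = {
    2: 0.0242,
    4: -1.2244,
    5: 0.6142,
    11: -0.3472,
    13: 0.0355,
    15: 0.1552,
    17: -0.1002,
}

def site_index(x):
    """Evaluate the equation for a dict mapping X subscript -> measured value."""
    missing = set(COEFFICIENTS) - set(x)
    if missing:
        raise KeyError(f"missing predictors: {sorted(missing)}")
    return INTERCEPT + sum(coef * x[i] for i, coef in COEFFICIENTS.items())

# Illustrative inputs only: all predictors zero, then X_2 set to 100.
zeros = {i: 0.0 for i in COEFFICIENTS}
baseline = site_index(zeros)
with_x2 = site_index({**zeros, 2: 100.0})
```

With every predictor at zero the function returns the intercept, and raising X_2 to 100 adds 0.0242 × 100 = 2.42 to it.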

Assessment of Potential Distribution Possibility of the Warm-Temperate Woody Plants of East Asia in Korea (한국에서 동아시아 난대 목본식물의 잠재분포 가능성 평가)

  • Cheolho, Lee;Hwirae, Kim;Kang-Hyun, Cho;Byeongki, Choi;Bora, Lee
    • Ecology and Resilient Infrastructure / v.9 no.4 / pp.269-281 / 2022
  • Predicting changes in the distribution of vegetation and plant species under climate change is important for ecosystem management. In this study, we attempted to develop an assessment method to evaluate the possibility of the potential distribution of warm-temperate woody plant species of East Asia in Korea. To begin with, a list of warm-temperate woody plants distributed in China and Japan, but not in Korea, was prepared, and a database consisting of their global distribution and bioclimatic variables was constructed. In addition, the warm-temperate vegetation zone in Korea was delineated using the coldness index, and relevant bioclimatic data were collected. After multicollinearity among the bioclimatic variables was excluded using correlation analysis, the mean temperature of the coldest quarter, the mean diurnal temperature range, and annual precipitation were selected as the major variables that influence the distribution of warm-temperate plants. A multivariate environment similarity surfaces (MESS) analysis was conducted to calculate the similarity scores between the distribution of these three bioclimatic variables in the global distribution sites of the East Asian warm-temperate woody plants and in the Korean warm-temperate vegetation zone. Finally, using stepwise variable-selection regression, the mean temperature of the coldest quarter and annual precipitation were selected as the main bioclimatic variables that affect the MESS similarity index. The mean temperature of the coldest quarter accounted for 88% of the total variance. For a total of 319 East Asian warm-temperate woody plant species, the possibility of their potential distribution in Korea was evaluated by applying the constructed multivariate regression model that calculates the MESS similarity index.

Development and application of prediction model of hyperlipidemia using SVM and meta-learning algorithm (SVM과 meta-learning algorithm을 이용한 고지혈증 유병 예측모형 개발과 활용)

  • Lee, Seulki;Shin, Taeksoo
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.111-124 / 2018
  • This study aims to develop a classification model for predicting the occurrence of hyperlipidemia, a chronic disease. Prior studies applying data mining techniques to disease prediction can be classified into model design studies for predicting cardiovascular disease and studies comparing disease prediction results. In the foreign literature, studies predicting cardiovascular disease were predominant among those using data mining techniques. Domestic studies were not much different from foreign ones, but mainly focused on hypertension and diabetes. Since hyperlipidemia is a chronic disease of similarly high importance to hypertension and diabetes, this study selected hyperlipidemia as the disease to be analyzed. We developed a model for predicting hyperlipidemia using SVM and meta-learning algorithms, which are already known to have excellent predictive power. To achieve the purpose of this study, we used the Korea Health Panel 2012 data set. The Korea Health Panel produces basic data on the level of health expenditure, health level, and health behavior, and has conducted an annual survey since 2008. In this study, 1,088 patients with hyperlipidemia were randomly selected from the hospitalized, outpatient, emergency, and chronic disease data of the Korea Health Panel in 2012, and 1,088 nonpatients were also randomly extracted. A total of 2,176 people were selected for the study. Three methods were used to select input variables for predicting hyperlipidemia. First, a stepwise method was performed using logistic regression. Among the 17 variables, the categorical variables (except for length of smoking) were expressed as dummy variables, each treated as a separate variable relative to the reference group, and these variables were analyzed. 
Six variables (age, BMI, education level, marital status, smoking status, gender), excluding income level and smoking period, were selected at a significance level of 0.1. Second, C4.5, a decision tree algorithm, was used. The significant input variables were age, smoking status, and education level. Finally, genetic algorithms were used. In SVM, the input variables selected by genetic algorithms consisted of 6 variables (age, marital status, education level, economic activity, smoking period, and physical activity status), and the input variables selected by genetic algorithms in the artificial neural network consisted of 3 variables (age, marital status, and education level). Based on the selected variables, we compared SVM, the meta-learning algorithm, and other prediction models for hyperlipidemia patients, and compared the classification performances using TP rate and precision. The main results of the analysis are as follows. First, the accuracy of the SVM was 88.4% and the accuracy of the artificial neural network was 86.7%. Second, the accuracy of classification models using the input variables selected through the stepwise method was slightly higher than that of classification models using all the variables. Third, the precision of the artificial neural network was higher than that of SVM when only the three input variables selected by decision trees were used. For the classification models based on the input variables selected through the genetic algorithm, the classification accuracy of SVM was 88.5% and that of the artificial neural network was 87.9%. Finally, this study indicated that stacking, the meta-learning algorithm proposed in this study, has the best performance when it uses the predicted outputs of SVM and MLP as input variables of SVM, which is the meta classifier. The purpose of this study was to predict hyperlipidemia, one of the representative chronic diseases. 
To do this, we used SVM and meta-learning algorithms, which are known to have high accuracy. As a result, the classification accuracy for hyperlipidemia with stacking as a meta learner was higher than with other meta-learning algorithms. However, the predictive performance of the meta-learning algorithm proposed in this study is the same as that of SVM, which had the best performance (88.6%) among the single models. The limitations of this study are as follows. First, various variable selection methods were tried, but most variables used in the study were categorical dummy variables. With a large number of categorical variables, the results may differ from those obtained with continuous variables, because models such as decision trees can be better suited to categorical variables than general models such as neural networks. Despite these limitations, this study is significant in predicting hyperlipidemia with hybrid models such as meta-learning algorithms, which have not been studied previously. Improving the model accuracy by applying various variable selection techniques is a meaningful result. In addition, it is expected that our proposed model will be effective for the prevention and management of hyperlipidemia.
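
The abstract compares classifiers by TP rate and precision. A minimal sketch of both metrics from actual and predicted labels follows; the function name and the labels are hypothetical and do not reproduce the study's results.

```python
def tp_rate_and_precision(actual, predicted, positive=1):
    """Return (TP rate, precision) for the class labelled `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # missed positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false alarms
    # TP rate (recall): share of actual positives that the model found.
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    # Precision: share of predicted positives that are actually positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    return tp_rate, precision

# Hypothetical labels: 1 = hyperlipidemia, 0 = no hyperlipidemia.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]
tp_rate, precision = tp_rate_and_precision(actual, predicted)
```

In this toy case there are 3 true positives, 1 false negative, and 1 false positive, so both metrics equal 0.75.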

Factors Affecting Selection of Delivery Facilities Pregnant Women (산모의 분만기관 선택관련 요인)

  • Lee, Choong-Wan;Yu, Seung-Hum;Oh, Hee-Choul
    • Journal of Preventive Medicine and Public Health / v.23 no.4 s.32 / pp.436-450 / 1990
  • This study was designed to investigate the major factors affecting the selection of delivery facilities by pregnant women. Five hundred women hospitalized at 23 Seoul-area delivery facilities, including university hospitals, general hospitals, hospitals, and clinics, were selected and given questionnaires from April 24 to May 7, 1990. A total of 350 questionnaires were collected and analysed for the study. The results are as follows: 1. Among general characteristics, the variables that significantly affected the choice of delivery facilities were the age of the women, their educational level, the educational level of their husbands, monthly average income, and residential area. 2. Among the obstetrical characteristics of the women, the variables significantly affecting the choice of delivery facilities were the gestational period, the facilities for prenatal care, the frequency of prenatal care, the type of delivery, the frequency of miscarriage, previous delivery experiences, and awareness of prenatal care. 3. In comparing the motivation factors for selecting the delivery facilities, all the factors except convenience and need for hospitalization differed significantly among delivery facilities. 4. Factor analysis was performed on twenty possible factors motivating the choice of delivery facilities. Six factors, namely personal service, scale of the facility, reputation, urgency, convenience, and experience, were identified, explaining 57.7% of the variance. 5. In the discriminant analysis used to clarify the major factors affecting the selection of delivery facilities, the 16 significant variables were treated as independent variables, and the type of delivery facility was the dependent variable. The stepwise method was applied to the analysis. 
The discriminant variables detected were the facilities for prenatal care, scale factor, personal service factor, urgency factor, convenience factor, reputation factor, experience factor, gestational period, type of delivery, frequency of miscarriage, age, and income. These 12 discriminant variables were tested, using the discriminant function, for their importance in predicting the choice of delivery facility. The test showed a hit rate of 67.7%. The results suggest that general characteristics, obstetrical characteristics, and motivations for selecting the delivery facilities differ significantly according to the type of delivery facility. This study implies that all types of delivery facilities should attempt to accommodate the characteristics and motivations of pregnant women. The facilities should be prepared to increase their patients' satisfaction with required medical conditions by improving service and responding to pregnant women's preferences.

Fault-Causing Process and Equipment Analysis of PCB Manufacturing Lines Using Data Mining Techniques (데이터마이닝 기법을 이용한 PCB 제조라인의 불량 혐의 공정 및 설비 분석)

  • Sim, Hyun Sik;Kim, Chang Ouk
    • KIPS Transactions on Software and Data Engineering / v.4 no.2 / pp.65-70 / 2015
  • In the PCB (Printed Circuit Board) manufacturing industry, yield is an important management factor because it significantly affects product cost and quality. In practice, it is very hard to ensure a high yield in a manufacturing shop because products called chips are made through hundreds of nano-scale manufacturing processes. Therefore, in order to improve the yield, it is necessary to identify the main fault-causing processes and equipment behind low PCB yield. This paper proposes a systematic approach to discovering fault-causing processes and equipment by using logistic regression and a stepwise variable selection procedure. We tested our approach with lot trace records from a real work site. A lot trace record consists of the equipment sequence that the lot passed through and the number of faults for each fault type in the lot. We demonstrated that the test results reflected the real situation of a PCB manufacturing line.

Factors of Predicting Difficulty of Mathematics Test Items in College Scholastic Ability Test (고등학교 수리영역 시험의 난이도 예측 요인 분석)

  • Ko, Ho-Kyoung;Yi, Hyun-Sook
    • Journal of the Korean School Mathematics Society / v.10 no.1 / pp.113-127 / 2007
  • This study explored the possibility of building a statistical model for predicting the difficulty of mathematics test items through the analysis of nationwide scholastic ability test results for the past 5 years. Multiple linear regression analysis was conducted to predict the difficulty of mathematics test items. We adopted three major areas for independent variables: the content area, the behavior area, and the test item format area, each of which was categorized into more detailed sub-areas. For the dependent variable, the proportion of correct answers was used to represent the item difficulty. Statistically significant independent variables were included in the regression model based on the stepwise selection method. Several important factors affecting the difficulty of mathematics test items were identified for each area. R-squared values for the final regression model were fairly high, implying that the regression equation can be used to predict the difficulty of test items at an acceptable level. Lastly, the regression model was cross-validated using independently collected data. We believe that this study will provide basic but very critical information for predicting the proportion of correct answers by showing the factors that should be considered when developing mathematics test items for the college entrance examination or high school classroom tests.

A Study on the Factors Affecting the Arson (방화 발생에 영향을 미치는 요인에 관한 연구)

  • Kim, Young-Chul;Bak, Woo-Sung;Lee, Su-Kyung
    • Fire Science and Engineering / v.28 no.2 / pp.69-75 / 2014
  • This study derives the factors that affect the occurrence of arson from statistical data (population, economic, and social factors) by multiple regression analysis. The multiple regression analysis was applied to four functional forms: linear, semi-log, inverse log, and dual log functions. Each function was analyzed using a stepwise procedure that considers the selection and deletion of independent variables at each step. To address the problems of autocorrelation and multicollinearity in multiple regression analysis, the Durbin-Watson coefficient and the Variance Inflation Factor (VIF) were considered. The optimal model was determined by the adjusted R-squared (the adjusted coefficient of determination). The adjusted R-squared of the linear function was 0.935 (93.5%), the highest of the four functional forms, so the linear function is the optimal model in this study. The optimal model was then interpreted. As a result of the analysis, the factors affecting arson were, in order, the incidence of crime (0.829), the general divorce rate (0.151), the financial autonomy rate (0.149), and the consumer price index (0.099).
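
The two diagnostics named above can be sketched in a few lines. The Durbin-Watson statistic is the sum of squared successive residual differences divided by the sum of squared residuals; the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch covers only the two-predictor case, where that R² is the squared Pearson correlation. Function names and data are hypothetical and not taken from the study.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest little first-order autocorrelation."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two predictors."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    ss1 = sum((a - mean1) ** 2 for a in x1)
    ss2 = sum((b - mean2) ** 2 for b in x2)
    r_squared = cov * cov / (ss1 * ss2)  # squared Pearson correlation
    return 1.0 / (1.0 - r_squared)

# Hypothetical residuals that alternate in sign (negative autocorrelation).
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Hypothetical predictors with correlation 0.6.
vif = vif_two_predictors([1, 2, 3, 4], [2, 1, 4, 3])
```

The alternating residuals give a Durbin-Watson value of 3.0, and a correlation of 0.6 gives a VIF of 1 / (1 - 0.36) = 1.5625.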