• Title/Summary/Keyword: Variance Inflation Factors

Tests for homogeneity of proportions in clustered binomial data

  • Jeong, Kwang Mo
    • Communications for Statistical Applications and Methods / v.23 no.5 / pp.433-444 / 2016
  • When we observe binary responses in a cluster (such as rat lab-subjects), they are usually correlated with each other. In clustered binomial counts, the independence assumption is violated and we encounter extra-variation. In the presence of extra-variation, the ordinary statistical analyses of binomial data are inappropriate. In testing the homogeneity of proportions between several treatment groups, the classical Pearson chi-squared test has a severe flaw in the control of Type I error rates. We focus on modifying the chi-squared statistic by incorporating variance inflation factors. We suggest a method to adjust data in terms of a dispersion estimate based on a quasi-likelihood model. We explain the testing procedure via an illustrative example and compare the performance of the modified chi-squared test with competing statistics through a Monte Carlo study.
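
The adjustment described in this abstract can be illustrated with a minimal sketch. It assumes a single common dispersion estimate `phi` supplied by the caller; the paper may estimate it differently or use group-specific factors. The data and the value of `phi` are hypothetical. Dividing every count by `phi` before computing the Pearson statistic is equivalent to dividing the unadjusted statistic by `phi`.

```python
def pearson_homogeneity_chi2(successes, totals):
    """Pearson chi-squared statistic for equal proportions across groups."""
    pooled = sum(successes) / sum(totals)
    return sum(
        (x - n * pooled) ** 2 / (n * pooled * (1.0 - pooled))
        for x, n in zip(successes, totals)
    )


def adjusted_chi2(successes, totals, phi):
    """Divide every count by the dispersion estimate phi, then recompute."""
    scaled_x = [x / phi for x in successes]
    scaled_n = [n / phi for n in totals]
    return pearson_homogeneity_chi2(scaled_x, scaled_n)


# Hypothetical data: responders and subjects per treatment group (pooled over clusters).
successes = [12, 20, 31]
totals = [60, 62, 58]
phi = 2.0  # hypothetical dispersion estimate (variance inflation factor)

naive = pearson_homogeneity_chi2(successes, totals)
adjusted = adjusted_chi2(successes, totals, phi)
print(f"naive X^2 = {naive:.3f}, adjusted X^2 = {adjusted:.3f}")
```

With `phi` greater than 1 the adjusted statistic is smaller than the naive one, which is how the correction pulls the Type I error rate back toward its nominal level.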

Machine Learning-based Phishing Website Detection Model (머신러닝 기반 피싱 사이트 탐지 모델)

  • Sumin Oh;Minseo Park
    • The Journal of the Convergence on Culture Technology / v.10 no.4 / pp.575-580 / 2024
  • Detecting whether a website is normal or phishing is necessary to defend against intelligent phishing attacks. We propose a machine learning-based classification approach to predict the status of websites. First, we collect URL information, convert it into numerical data, and remove outliers. Second, we apply VIF (Variance Inflation Factors) to understand the correlation and independence between variables. Finally, we develop a phishing website detection model with machine learning-based classifiers, which predicts website status. On the test datasets, Random Forest showed the best performance, with precision of 93.74%, recall of 92.26%, and accuracy of 93.14%. In the future, we expect to apply our model to detect various phishing crimes.
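
As a rough illustration of the VIF step, the sketch below computes VIFs from the diagonal of the inverse correlation matrix using only the standard library. The feature names and values are hypothetical and are not taken from the paper; a statistics package would normally do this in practice.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for small matrices)."""
    size = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(size)] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical numeric URL features.
features = {
    "url_length": [20, 35, 50, 28, 64, 41, 33, 57],
    "path_length": [8, 20, 33, 15, 49, 27, 18, 40],
    "num_dots": [1, 3, 2, 4, 2, 1, 3, 2],
}
for name, value in zip(features, vif(list(features.values()))):
    print(f"{name:<12} VIF = {value:.2f}")
```

A VIF near 1 indicates a variable that is nearly uncorrelated with the others; common rules of thumb flag values above 5 or 10.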

A Study on the Selection of Pricing Factors for Used Bulk Carriers (중고 벌크선의 가격결정요인 선정에 관한 연구)

  • Yang, Yun-Ok
    • Journal of Navigation and Port Research / v.41 no.4 / pp.181-188 / 2017
  • In the existing ship sales market, prices are determined based on the prices of similar ship types that were recently traded. Since the 2008 financial crisis, ship prices have fluctuated, and ship price criteria have become ever more necessary to determine the imminent value of the ship. Therefore, this research used the hedonic price model to estimate imminent values of ships. In this study, the influence on ship prices was analyzed by the value of each characteristic and an estimated functional formula was derived. Out of the four models suggested by the hedonic price model, an optimal model was selected with variance inflation factors and a stepwise selection. For this, the influence of determinants of ship prices was analyzed based on actually traded ships and characteristic data. The selected model is the Log-Line model; as a result of regression analysis, eight variables, including DWT, Age, Market Value, Short-Term Charter, Long-Term Charter, Enbloc, Special Survey Due, and Builder, were found to affect the ship price model. This model is expected to be useful for objective and balanced ship price evaluation.

Prediction of Food Franchise Success and Failure Based on Machine Learning (머신러닝 기반 외식업 프랜차이즈 가맹점 성패 예측)

  • Ahn, Yelyn;Ryu, Sungmin;Lee, Hyunhee;Park, Minseo
    • The Journal of the Convergence on Culture Technology / v.8 no.4 / pp.347-353 / 2022
  • In the restaurant industry, start-ups are active due to high demand from consumers and low entry barriers. However, the restaurant industry has a high closure rate, and in the case of franchises, there is a large deviation in sales within the same brand. Thus, research is needed to prevent the closure of food franchises. Therefore, this study examines the factors affecting franchise sales and uses machine learning techniques to predict the success and failure of franchises. Various factors that affect franchise sales are extracted using Point of Sale (PoS) data of food franchises and public data in Gangnam-gu, Seoul. For more valid variable selection, multicollinearity is removed using the Variance Inflation Factor (VIF). Finally, classification models are used to predict the success and failure of food franchise stores. Through this method, we propose a success and failure prediction model for food franchise stores with an accuracy of 0.92.

A Study on Work-Related Musculoskeletal Disorders Related to Sonographer's (진단 초음파 검사자의 작업 관련 근골격계질환 연구)

  • An, Hyun
    • Journal of radiological science and technology / v.45 no.4 / pp.355-363 / 2022
  • This study investigated the prevalence rate of musculoskeletal disorders in relation to general characteristic factors, living environment factors, and work environment factors for sonographers. For the response questions, the guidelines for musculoskeletal burden work were used. For statistical analysis, SPSS version 26.0 was used. For the common body parts of the sonographers who responded, the prevalence was investigated by dividing respondents into a group with high pain or discomfort and a group with low pain or discomfort according to the degree to which they experienced symptoms during the past 12 months. Multiple logistic regression analysis was used to determine the variance inflation factor (VIF), odds ratio (OR), and corresponding 95% confidence interval (CI). A p-value of <0.05 was considered statistically significant. As a result, housework hours, examination history, regular physical activity, number of patient examinations per day, and sitting posture were identified as variables related to the rate of musculoskeletal disorders. The sonographer occupational group was found to have a high prevalence rate of musculoskeletal disorders, like various other occupational groups. Based on the results of this study, it is judged that musculoskeletal disorders can be reduced by recognizing musculoskeletal disorders and improving work environment factors.
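
For readers unfamiliar with how an odds ratio and its 95% confidence interval follow from a logistic regression coefficient, here is a minimal sketch of the usual Wald interval. The coefficient and standard error are hypothetical and are not taken from the study; `odds_ratio_with_ci` is an illustrative helper name.

```python
import math


def odds_ratio_with_ci(beta, std_err, z=1.96):
    """Return (OR, lower, upper) for a logistic coefficient and its standard error."""
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )


# Hypothetical coefficient and standard error, not taken from the study.
odds_ratio, lower, upper = odds_ratio_with_ci(beta=0.69, std_err=0.25)

# A 95% Wald interval that excludes 1 corresponds to a two-sided Wald p-value below 0.05.
significant = lower > 1.0 or upper < 1.0
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), significant: {significant}")
```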

Using Ridge Regression to Improve the Accuracy and Interpretation of the Hedonic Pricing Model : Focusing on apartments in Guro-gu, Seoul (능형회귀분석을 활용한 부동산 헤도닉 가격모형의 정확성 및 해석력 향상에 관한 연구 - 서울시 구로구 아파트를 대상으로 -)

  • Koo, Bonsang;Shin, Byungjin
    • Korean Journal of Construction Engineering and Management / v.16 no.5 / pp.77-85 / 2015
  • The Hedonic Pricing model is the predominant approach used today to model the effect of relevant factors on real estate prices. These factors include intrinsic elements of a property such as floor areas, number of rooms, and parking spaces. The model also accounts for the impact of amenities or undesirable facilities on a property's value. In the latter case, Euclidean distances are typically used as the parameter to represent the proximity and its impact on prices. However, in situations where multiple facilities exist, multicollinearity may exist between these parameters, which can result in multiple regression models with erroneous coefficients. This research uses Variance Inflation Factors (VIF) and Ridge Regression to identify these errors and thus create more accurate and stable models. The techniques were applied to apartments in Guro-gu of Seoul, whose prices are impacted by subway stations as well as a public prison, a railway terminal, and a digital complex. The VIF identified collinearity between variables representing the terminal and the digital complex as well as the latitudinal coordinates. The ridge regression showed the need to remove two of these variables. The case study demonstrated that the application of these techniques was critical in developing accurate and robust Hedonic Pricing models.
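
To show why ridge regression stabilizes a model with collinear predictors, the sketch below uses one common definition of the ridge VIF, the diagonal of (R + kI)^-1 R (R + kI)^-1 for a correlation matrix R, specialized to two standardized predictors. The correlation value and the penalties `k` are hypothetical, and this is not the paper's own procedure.

```python
def ridge_vif_two_predictors(r, k):
    """Ridge VIF for two standardized predictors with correlation r and penalty k.

    This is the diagonal of (R + kI)^-1 R (R + kI)^-1 with R = [[1, r], [r, 1]],
    which simplifies to the closed form below. k = 0 gives the OLS VIF 1 / (1 - r^2).
    """
    numerator = (1.0 + k) ** 2 - r * r * (1.0 + 2.0 * k)
    denominator = ((1.0 + k) ** 2 - r * r) ** 2
    return numerator / denominator


r = 0.95  # hypothetical correlation between two distance variables
for k in (0.0, 0.01, 0.05, 0.1):
    print(f"k = {k:<4}  VIF = {ridge_vif_two_predictors(r, k):.2f}")
```

Even a small penalty brings the VIF down sharply when the two predictors are highly correlated, at the cost of some bias in the coefficients.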

A Study on the Factors Affecting the Arson (방화 발생에 영향을 미치는 요인에 관한 연구)

  • Kim, Young-Chul;Bak, Woo-Sung;Lee, Su-Kyung
    • Fire Science and Engineering / v.28 no.2 / pp.69-75 / 2014
  • This study derives the factors that affect the occurrence of arson from statistical data (population, economic, and social factors) by multiple regression analysis. Multiple regression analysis is applied to four functional forms: linear, semi-log, inverse log, and dual log functions. Each function is analyzed using a stepwise procedure that considers the selection and deletion of independent variables at each step. To address two problems of multiple regression analysis, multicollinearity and autocorrelation, the Variance Inflation Factor (VIF) and the Durbin-Watson coefficient were considered. The optimal model was determined by the adjusted R-squared. The adjusted R-squared of the linear function was 0.935 (93.5%), the highest of the four functional forms, so the linear function is the optimal model in this study. The optimal model was then interpreted. As a result of the analysis, the factors affecting arson were found to be the incidence of crime (0.829), the general divorce rate (0.151), the financial autonomy rate (0.149), and the consumer price index (0.099).
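
The two diagnostics named in this abstract have simple closed forms, sketched below with the standard library only. The residuals and the sample sizes are hypothetical and are not taken from the study.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest little first-order autocorrelation."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R^2 for the number of predictors in the model."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical residuals in time order, not taken from the study.
residuals = [0.5, -0.3, 0.4, -0.6, 0.2, -0.1, 0.3, -0.4]
print(f"DW = {durbin_watson(residuals):.2f}")
print(f"adjusted R^2 = {adjusted_r_squared(0.95, n_obs=30, n_predictors=4):.3f}")
```

The Durbin-Watson statistic ranges from 0 to 4; values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.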

An Analysis of Factors Relating to Agricultural Machinery Farm-Work Accidents Using Logistic Regression

  • Kim, Byounggap;Yum, Sunghyun;Kim, Yu-Yong;Yun, Namkyu;Shin, Seung-Yeoub;You, Seokcheol
    • Journal of Biosystems Engineering / v.39 no.3 / pp.151-157 / 2014
  • Purpose: In order to develop strategies to prevent farm-work accidents relating to agricultural machinery, influential factors were examined in this paper. The effects of these factors were quantified using logistic regression. Methods: Based on the results of a survey on farm-work accidents conducted by the National Academy of Agricultural Science, 21 tentative independent variables were selected. To apply these variables to regression, the presence of multicollinearity was examined by comparing correlation coefficients, checking the statistical significance of the coefficients in a simple linear regression model, and calculating the variance inflation factor. A logistic regression model and a method for determining its goodness of fit were defined. Results: Among the 21 independent variables, 13 variables were not collinear with each other. The results of a logistic regression analysis using these variables showed that the model was significant and acceptable, with deviance of 714.053. Parameter estimation results showed that four variables (age, power tiller ownership, cognizance of the government's safety policy, and consciousness of safety) were significant. The logistic regression model predicted that the former two increased accident odds by 1.027 and 8.506 times, respectively, while the latter two decreased the odds by 0.243 and 0.545 times, respectively. Conclusions: Prevention strategies against factors causing an accident, such as the age of farmers and the use of a power tiller, are necessary. In addition, more efficient training to elevate farmers' consciousness about safety must be provided.
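
The reported odds multipliers can be read more easily as coefficients and percentage changes. The sketch below converts the four values from the abstract. It assumes they are odds ratios per one-unit increase, and that age enters the model linearly in years; neither assumption is confirmed by the abstract.

```python
import math

# Odds multipliers reported in the abstract, treated here as odds ratios.
odds_ratios = {
    "age": 1.027,
    "power tiller ownership": 8.506,
    "cognizance of safety policy": 0.243,
    "consciousness of safety": 0.545,
}

for name, odds_ratio in odds_ratios.items():
    beta = math.log(odds_ratio)                # logistic coefficient implied by the OR
    percent_change = (odds_ratio - 1.0) * 100  # change in odds per one-unit increase
    print(f"{name:<28} beta = {beta:+.3f}  odds change = {percent_change:+.1f}%")

# Assumption: age is measured in years, so a 10-year gap multiplies the odds by 1.027 ** 10.
ten_year_ratio = odds_ratios["age"] ** 10
print(f"10-year age difference: odds x {ten_year_ratio:.2f}")
```

Under these assumptions, an odds ratio of 0.243 means the odds fall to about a quarter of their baseline value, not that they fall by 24.3%.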

Development of Ridge Regression Model of Pollutant Load Using Runoff Weighted Value Based on Distributed Curve-Number (분포형 CN 기반 토지피복별 유출가중치를 이용한 오염부하량 능형회귀모형 개발)

  • Song, Chul Min;Kim, Jin Soo
    • Journal of The Korean Society of Agricultural Engineers / v.60 no.1 / pp.111-120 / 2018
  • The purpose of this study was to develop a ridge regression (RR) model to estimate BOD and TP load using a runoff weighted value. The concept of the runoff weighted value, based on the distributed curve-number (CN), was introduced to reflect the impact of land covers on runoff. The runoff depths estimated by distributed CN were closer to the observed values than those estimated by area weighted mean CN. RR is a technique used when the data suffer from multicollinearity. The RR model was developed for five flow duration intervals with the daily runoff discharge of seven land covers as independent variables and the daily pollutant load as the dependent variable. The RR model was applied to the Heuk river watershed, a subwatershed of the Han river watershed. The variance inflation factors of the RR model decreased to values below 10. The RR model showed good performance, with Nash-Sutcliffe efficiency (NSE) of 0.73 and 0.87 and Pearson correlation coefficients of 0.88 and 0.93 for BOD and TP, respectively. The results suggest that the methods used in the study can be applied to estimate the pollutant load of watersheds with different land covers using limited data.
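
The Nash-Sutcliffe efficiency used to score the model compares the squared prediction error with the variance of the observations. A minimal sketch follows; the load values are hypothetical and are not taken from the study.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - (sum of squared errors) / (variation of observations about their mean)."""
    mean_observed = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    total_variation = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - squared_error / total_variation


# Hypothetical daily pollutant loads (kg/day), not taken from the study.
observed = [12.0, 18.5, 9.3, 25.1, 14.7]
simulated = [11.2, 20.0, 10.1, 22.8, 15.5]
print(f"NSE = {nash_sutcliffe_efficiency(observed, simulated):.2f}")
```

An NSE of 1 is a perfect fit, and an NSE of 0 means the model predicts no better than the mean of the observations.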

HIV-related Perceptions, Knowledge, Professional Ethics, Institutional Support, and HIV/AIDS-related Stigma in Health Services in West Sumatra, Indonesia: An Empirical Evaluation Using PLS-SEM

  • Vivi Triana;Nursyirwan Effendi;Brian Sri Pra Hastuti;Cimi Ilmiawati;Dodi Devianto;Afrizal Afrizal;Adang Bachtiar;Rima Semiarty;Raveinal Raveinal
    • Journal of Preventive Medicine and Public Health / v.57 no.5 / pp.435-442 / 2024
  • Objectives: The aim of this study was to investigate the significance of associations between knowledge, professional ethics, institutional support, perceptions regarding HIV/AIDS, and HIV/AIDS-related stigma among health workers in West Sumatra, Indonesia. Methods: We conducted a cross-sectional study involving health workers at public hospitals and health centers in West Sumatra in June 2022. The Health Care Provider HIV/AIDS Stigma Scale was employed to assess the stigma associated with HIV/AIDS. To estimate and evaluate the model's ability to explain the proposed constructs, we utilized the standardized partial least squares structural equation model (PLS-SEM). Results: In total, 283 individuals participated in this study (average age, 39 years). The majority were female (91.2%), nearly half were nurses (49.5%), and 59.4% had been working for more than 10 years. The study revealed that HIV/AIDS-related stigma persisted among health workers. The PLS-SEM results indicated that all latent variables had variance inflation factors below 5, confirming that they could be retained in the model. Knowledge and professional ethics significantly contributed to human immunodeficiency virus (HIV)-related stigma, with an effect size (f²) of 0.15 or greater. In contrast, perceived and institutional support had a smaller impact on HIV-related stigma, with an effect size (f²) of at least 0.02. The R² value for health worker stigma was 0.408, suggesting that knowledge, professional ethics, institutional support, and perceived support collectively explain 40.8% of the variance in stigma. Conclusions: Improving health workers' understanding of HIV, fostering professional ethics, and strengthening institutional support are essential for reducing HIV-related stigma in this population.
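
The effect sizes quoted here follow Cohen's convention, where the effect size of a construct compares the model's R-squared with and without that construct. The sketch below uses the reported R-squared of 0.408; the values obtained after dropping a construct are hypothetical, since the abstract does not report them.

```python
def f_squared(r2_included, r2_excluded):
    """Cohen's effect size: R^2 lost by dropping a predictor, scaled by unexplained variance."""
    return (r2_included - r2_excluded) / (1.0 - r2_included)


def effect_label(f2):
    """Conventional cut-offs: 0.02 small, 0.15 medium, 0.35 large."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "negligible"


r2_full = 0.408  # R^2 for stigma reported in the abstract
for r2_without in (0.30, 0.39):  # hypothetical R^2 values after dropping one construct
    f2 = f_squared(r2_full, r2_without)
    print(f"R^2 without construct = {r2_without:.2f}: effect size = {f2:.3f} ({effect_label(f2)})")
```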