• Title/Summary/Keyword: Extreme distribution


Leukocyte count and hypertension in the health screening data of some rural and urban residents (일부 농촌과 도시의 건강선별조사 자료로 본 백혈구수와 고혈압과의 관계)

  • Lee, Choong-Won;Yoon, Nung-Ki;Lee, Sung-Kwan
    • Journal of Preventive Medicine and Public Health / v.24 no.3 s.35 / pp.363-372 / 1991
  • We used health screening data from rural and urban residents to examine the cross-sectional association between leukocyte count and hypertension. In 1985, 206 male and 203 female rural residents were selected by multi-stage cluster sampling in the Kyungsan-Kun area of Kyungbuk province. In 1986, 600 urban residents were selected by the same sampling method in Daegu city, in the same province, to match the age-sex distribution of Daegu city in the 1985 census, but only 384 of them actually responded. The remainder of the 600 were replaced, matched by age and sex, with members of the medical insurance plan who visited the health management department of the university hospital for biannual preventive medical checkups. Excluded from the analysis were those with a history of hypertension, diseases, or extreme outlying values on the screening tests, leaving 373 rural and 571 urban residents. Leukocyte count was measured by the ELT-8 laser shadow method in units of $cells/mm^3$. Blood pressure was measured with an aneroid sphygmomanometer using a pre-standardized method, and hypertensives were defined as those with systolic blood pressure above 140 mmHg and/or diastolic blood pressure above 90 mmHg. In the pooled sample (N=944), leukocyte count differed significantly between hypertensives and normotensives ($6965.93{\pm}1997.01\;vs\;6490.61{\pm}1941.32,\;P=0.00$), and a similar significant difference was noted in rural residents (P=0.03). No significant differences were noted in any stratum defined by residency and sex. Compared with the lowest quintile of WBC, in the total residents the second quintile showed an odds ratio of 0.99 (95% confidence interval [CI] 0.62-1.59), the third 1.41 (95% CI 0.90-2.21), the fourth 1.76 (95% CI 1.14-2.72), and the highest 1.80 (95% CI 1.15-2.82). The likelihood ratio test for linear trend was significant ($\chi^2_{trend}=5.53,\;df=1,\;P<0.05$).
There were no other significant odds ratios relative to the lowest quintile of WBC in the strata defined by residency and sex. The odds ratios that had been significant in the total residents became nonsignificant and smaller in magnitude after controlling for age and the frequency of smoking and drinking with multiple logistic regression. Within each stratum, the adjustment changed the magnitudes of the odds ratios slightly and unstably. None of the trend tests showed a significant trend. These results suggest that the association between leukocyte count and hypertension reported by Friedman et al. may be due to a statistical type I error resulting from data dredging in an exploratory study, in which more than 800 variables were screened as possible predictors of hypertension.
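As a reading aid for the quintile odds ratios quoted above, the following is a minimal sketch of how an odds ratio and an approximate 95% confidence interval can be computed from a 2x2 table. It assumes the Wald (log-odds) interval, which the abstract does not confirm as the authors' method, and the counts are hypothetical, not taken from the study. The function name `odds_ratio_wald_ci` is likewise an illustrative choice.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    a: hypertensives in the comparison quintile
    b: normotensives in the comparison quintile
    c: hypertensives in the reference (lowest) quintile
    d: normotensives in the reference (lowest) quintile
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Wald approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only (not the study's data).
estimate, lower, upper = odds_ratio_wald_ci(30, 70, 20, 80)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

An interval that includes 1.0 corresponds to a nonsignificant odds ratio at the 5% level, which is how the second and third quintiles above should be read.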


The Prediction of Export Credit Guarantee Accident using Machine Learning (기계학습을 이용한 수출신용보증 사고예측)

  • Cho, Jaeyoung;Joo, Jihwan;Han, Ingoo
    • Journal of Intelligence and Information Systems / v.27 no.1 / pp.83-102 / 2021
  • The government recently announced various policies for developing the big-data and artificial intelligence fields, giving the public a great opportunity through the disclosure of high-quality data held by public institutions. KSURE (Korea Trade Insurance Corporation) is a major public institution for financial policy in Korea and is strongly committed to backing export companies through various systems. Nevertheless, there are still few cases of realized business models based on big-data analyses. In this situation, this paper aims to develop a new business model that can be applied to ex-ante prediction of the likelihood of an insurance accident in credit guarantees. We use internal data from KSURE, which supports export companies in Korea, and apply machine learning models. We then compare the performance of predictive models including Logistic Regression, Random Forest, XGBoost, LightGBM, and DNN (Deep Neural Network). For decades, many researchers have tried to find better models for predicting bankruptcy, since ex-ante prediction is crucial for corporate managers, investors, creditors, and other stakeholders. The prediction of financial distress or bankruptcy originated with Smith (1930), Fitzpatrick (1932), and Merwin (1942). One of the most famous models is Altman's Z-score model (Altman, 1968), which is based on multiple discriminant analysis and is still widely used in both research and practice. Altman's score model uses five key financial ratios to predict the probability of bankruptcy in the next two years. Ohlson (1980) introduces the logit model to address some limitations of previous models. Furthermore, Elmer and Borowski (1988) develop and examine a rule-based, automated system that conducts the financial analysis of savings and loans.
Since the 1980s, researchers in Korea have studied the prediction of financial distress or bankruptcy. Kim (1987) analyzes financial ratios and develops a prediction model. Han et al. (1995, 1996, 1997, 2003, 2005, 2006) construct prediction models using various techniques, including artificial neural networks. Yang (1996) introduces multiple discriminant analysis and the logit model. Kim and Kim (2001) use artificial neural network techniques for ex-ante prediction of insolvent enterprises. Since then, many scholars have tried to predict financial distress or bankruptcy more precisely with diverse models such as Random Forest or SVM. One major distinction between our research and previous research is that we examine the predicted probability of default for each sample case, rather than only the classification accuracy of each model over the entire sample. Most predictive models in this paper show a classification accuracy of about 70% on the entire sample. Specifically, the LightGBM model shows the highest accuracy, 71.1%, and the Logit model the lowest, 69%. However, we confirm that these results are open to multiple interpretations. In the business context, more emphasis must be placed on minimizing type 2 error, which causes more harmful operating losses for the guaranty company. Thus, we also compare classification accuracy after splitting the predicted probability of default into ten equal intervals. Examined by interval, the Logit model has the highest accuracy, 100%, for predicted default probabilities of 0~10%, but a relatively low accuracy of 61.5% for predicted default probabilities of 90~100%.
On the other hand, Random Forest, XGBoost, LightGBM, and DNN give more desirable results: they show higher accuracy for both the 0~10% and 90~100% intervals of the predicted probability of default, and lower accuracy around a predicted probability of 50%. As for the distribution of samples across the intervals, both the LightGBM and XGBoost models place a relatively large number of samples in the 0~10% and 90~100% intervals. Although the Random Forest model has an advantage in classification accuracy on a small number of cases, LightGBM or XGBoost could be a more desirable model because they classify a large number of cases into the two extreme intervals of the predicted probability of default, even allowing for their relatively low classification accuracy. Considering the importance of type 2 error and total prediction accuracy, XGBoost and DNN show superior performance. Random Forest and LightGBM follow with good results, and logistic regression shows the worst performance. However, each predictive model has a comparative advantage under some evaluation standard. For instance, the Random Forest model shows almost 100% accuracy for samples expected to have a high probability of default. Collectively, more comprehensive ensemble models can be constructed that combine multiple machine learning classifiers and use majority voting to maximize overall performance.
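The interval-wise evaluation and the majority-voting ensemble described in this abstract can be sketched roughly as follows. This is an illustrative sketch only: the function names, the 0.5 decision threshold, and the toy inputs are assumptions, not details taken from the paper or from KSURE data.

```python
from collections import defaultdict


def accuracy_by_probability_bin(probs, labels, n_bins=10, threshold=0.5):
    """Classification accuracy within equal-width bins of predicted probability.

    probs:  predicted probabilities of default, each in [0, 1]
    labels: true outcomes (1 = default, 0 = no default)
    Returns {bin_index: (accuracy, sample_count)} for non-empty bins,
    where bin 0 covers 0~10% and bin 9 covers 90~100% when n_bins=10.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        predicted = 1 if p >= threshold else 0  # threshold is an assumption
        totals[b] += 1
        hits[b] += int(predicted == y)
    return {b: (hits[b] / totals[b], totals[b]) for b in sorted(totals)}


def majority_vote(predictions_per_model):
    """Combine 0/1 predictions from several models by strict majority."""
    n_models = len(predictions_per_model)
    return [
        1 if 2 * sum(votes) > n_models else 0
        for votes in zip(*predictions_per_model)
    ]


# Toy inputs, for illustration only.
probs = [0.05, 0.08, 0.55, 0.45, 0.95, 0.92]
labels = [0, 0, 0, 0, 1, 0]
per_bin = accuracy_by_probability_bin(probs, labels)
votes = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])  # [1, 0, 1]
```

Reporting the sample count next to each bin's accuracy reflects the trade-off discussed above: a model may be very accurate in an extreme interval while placing few cases there.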