• Title/Summary/Keyword: Cross-Validation Approach

Ordinary kriging approach to predicting long-term particulate matter concentrations in seven major Korean cities

  • Kim, Sun-Young;Yi, Seon-Ju;Eum, Young Seob;Choi, Hae-Jin;Shin, Hyesop;Ryou, Hyoung Gon;Kim, Ho
    • Environmental Analysis Health and Toxicology / v.29 / pp.12.1-12.8 / 2014
  • Objectives: Cohort studies of associations between air pollution and health have used exposure prediction approaches to estimate individual-level concentrations. A common prediction method used in Korean cohort studies is ordinary kriging. In this study, performance of ordinary kriging models for long-term particulate matter less than or equal to $10{\mu}m$ in diameter ($PM_{10}$) concentrations in seven major Korean cities was investigated with a focus on spatial prediction ability. Methods: We obtained hourly $PM_{10}$ data for 2010 at 226 urban-ambient monitoring sites in South Korea and computed annual average $PM_{10}$ concentrations at each site. Given the annual averages, we developed ordinary kriging prediction models for each of the seven major cities and for the entire country by using an exponential covariance reference model and a maximum likelihood estimation method. For model evaluation, cross-validation was performed and mean square error and R-squared ($R^2$) statistics were computed. Results: Mean annual average $PM_{10}$ concentrations in the seven major cities ranged between 45.5 and $66.0{\mu}g/m^3$ (standard deviation=2.40 and $9.51{\mu}g/m^3$, respectively). Cross-validated $R^2$ values in Seoul and Busan were 0.31 and 0.23, respectively, whereas the other five cities had $R^2$ values of zero. The national model produced a higher cross-validated $R^2$ (0.36) than those for the city-specific models. Conclusions: In general, the ordinary kriging models performed poorly for the seven major cities and the entire country of South Korea, but the model performance was better in the national model. To improve model performance, future studies should examine different prediction approaches that incorporate $PM_{10}$ source characteristics.
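
The evaluation described in this abstract (ordinary kriging with an exponential covariance, scored by cross-validated mean square error and $R^2$) can be illustrated with a minimal sketch. It is not the authors' implementation. The covariance parameters are fixed by hand rather than estimated by maximum likelihood, the cross-validation is assumed to be leave-one-out, the $R^2$ is assumed to be MSE-based and truncated at zero, and the site coordinates and concentrations are made up.

```python
import math

def exp_cov(h, sill, range_, nugget):
    """Exponential covariance: sill * exp(-h / range), plus the nugget at h = 0."""
    return sill * math.exp(-h / range_) + (nugget if h == 0 else 0.0)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ordinary_krige(sites, values, target, sill=1.0, range_=10.0, nugget=0.1):
    """Predict the value at `target` as a weighted mean of `values` (weights sum to 1)."""
    n = len(sites)
    # Covariances among sites, bordered by the unbiasedness constraint (Lagrange row).
    a = [[exp_cov(math.dist(sites[i], sites[j]), sill, range_, nugget)
          for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [exp_cov(math.dist(s, target), sill, range_, nugget) for s in sites] + [1.0]
    weights = solve(a, b)[:n]
    return sum(w * v for w, v in zip(weights, values))

def loo_cv(sites, values, **cov):
    """Leave-one-out cross-validation: returns (MSE, MSE-based R^2 truncated at 0)."""
    preds = [ordinary_krige(sites[:i] + sites[i + 1:], values[:i] + values[i + 1:],
                            sites[i], **cov) for i in range(len(sites))]
    mse = sum((p - v) ** 2 for p, v in zip(preds, values)) / len(values)
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mse, max(0.0, 1.0 - mse / var)

# Hypothetical monitoring sites (km coordinates) and annual averages (ug/m^3).
sites = [(0, 0), (0, 5), (5, 0), (5, 5), (2, 3), (8, 1)]
values = [48.0, 52.5, 50.1, 55.3, 51.0, 60.2]
mse, r2 = loo_cv(sites, values)
print(f"cross-validated MSE = {mse:.2f}, R^2 = {r2:.2f}")
```

With a truncated, MSE-based $R^2$ of this kind, a value of zero means the kriging predictions were no better than predicting the overall mean at every held-out site.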

Machine Learning Approach to Blood Stasis Pattern Identification Based on Self-reported Symptoms (기계학습을 적용한 자기보고 증상 기반의 어혈 변증 모델 구축)

  • Kim, Hyunho;Yang, Seung-Bum;Kang, Yeonseok;Park, Young-Bae;Kim, Jae-Hyo
    • Korean Journal of Acupuncture / v.33 no.3 / pp.102-113 / 2016
  • Objectives: This study aimed to develop and discuss a prediction model for the blood stasis (BS) pattern of traditional Korean medicine (TKM) using two machine learning algorithms: multiple logistic regression and a decision tree model. Methods: First, we reviewed the Korean, Chinese, and Japanese versions of BS questionnaires to build an integrated BS questionnaire of patient-reported outcomes. Patient-reported BS symptom data were then acquired through human subject research. Next, the expert decisions of five Korean medicine doctors were acquired, and supervised learning models were developed using multiple logistic regression and a decision tree. Results: An integrated BS questionnaire with 24 items was developed. Multiple logistic regression models with accuracies of 0.92 (male) and 0.95 (female), validated by 10-fold cross-validation, were constructed. With decision tree modeling, a male model with 8 decision nodes and a female model with 6 decision nodes were built. In both models, the symptoms 'recent physical trauma', 'chest pain', 'numbness', and 'menstrual disorder' (female only) were identified as important factors. Conclusions: Because machine learning, especially supervised learning, can reveal and suggest important or essential factors among the many varied symptoms that make up a pattern identification, it can be a very useful tool for research on TKM diagnostics. With proper patient-reported outcomes or a well-structured database, it can also be applied to pre-screening solutions for the healthcare system at the Mibyoung stage.
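
The 10-fold cross-validation of a logistic regression model mentioned in this abstract can be sketched as follows. This is an illustration only, under stated assumptions: the data are synthetic binary "symptom" items, the label rule is invented, and the model is fitted by plain stochastic gradient ascent rather than by whatever software the authors used. All function names are hypothetical.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Fit logistic regression by stochastic gradient ascent; last weight is the intercept."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                w[j] += lr * (yi - p) * xj
            w[-1] += lr * (yi - p)
    return w

def predict(w, xi):
    """Classify as 1 when the linear score is non-negative (probability >= 0.5)."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
    return 1 if z >= 0 else 0

def cv_accuracy(X, y, k=10):
    """Train on k-1 folds, test on the held-out fold, and pool the correct predictions."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k):
        held = set(test_idx)
        train = [i for i in range(len(X)) if i not in held]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(w, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

# Synthetic data: 100 subjects, 5 yes/no items; the label rule is an invented stand-in
# for the expert decision.
rng = random.Random(1)
X = [[rng.randint(0, 1) for _ in range(5)] for _ in range(100)]
y = [1 if x[0] + x[1] + x[2] >= 2 else 0 for x in X]
acc = cv_accuracy(X, y, k=10)
print(f"10-fold cross-validated accuracy = {acc:.2f}")
```

Every subject is used for testing exactly once, so the pooled accuracy is an estimate of performance on unseen subjects rather than a measure of fit to the training data.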

QSAR Approach for Toxicity Prediction of Chemicals Used in Electronics Industries (전자산업에서 사용하는 화학물질의 독성예측을 위한 QSAR 접근법)

  • Kim, Jiyoung;Choi, Kwangmin;Kim, Kwansick;Kim, Dongil
    • Journal of Environmental Health Sciences / v.40 no.2 / pp.105-113 / 2014
  • Objectives: Based on the precautionary principle, quantitative structure-activity relationship (QSAR) methods need to be applied to the various workplace chemicals for which toxicity data are insufficient. This study aims to establish a plan for applying a QSAR software tool to predict health hazards, such as genetic toxicity and carcinogenicity, for some chemicals used in the electronics industries. Methods: The toxicity of 21 chemicals used in electronics industries, such as 5-aminotetrazole, ethyl lactate, and digallium trioxide, was predicted with Toxicity Prediction by Komputer Assisted Technology (TOPKAT). To assess the suitability and reliability of the carcinogenicity prediction, 25 chemicals classified as Group 1 carcinogens by the International Agency for Research on Cancer (IARC), such as 4-aminobiphenyl and ethylene oxide, were selected. Results: Among the 21 chemicals, 5 were predicted to be carcinogens, 8 were predicted to be non-carcinogens, and 8 could not be predicted. However, relevant test data and the OncoLogic™ tool indicated that the carcinogenic potential of the 5 predicted carcinogens was low. Seven of the 25 carcinogens (IARC Group 1) were wrongly predicted as non-carcinogens (false negative rate: 36.8%). We confirmed that the prediction error could be reduced by combining genetic toxicity information such as mutagenicity. Conclusions: The toxicity prediction program still had limited applicability to some compounds, including inorganic chemicals and polymers. Carcinogenicity prediction may be further improved by cross-validating various toxicity prediction programs or by applying theoretical molecular descriptors.

A Challenging Study to Identify Target Proteins by a Proteomics Approach and Their Validation by Raising Polyclonal Antibody

  • Jeong, Da-Woon;Park, Beom-Young;Kim, Jin-Hyoung;Hwang, In-Ho
    • Food Science of Animal Resources / v.31 no.4 / pp.506-512 / 2011
  • This study was conducted to validate the theoretical feasibility of a technique to identify biomarkers in Korean native black pig (KNP) and a commercial Landrace breed. Using two-dimensional electrophoresis, we found six proteins (NADH dehydrogenase Fe-S protein 1, an unnamed protein product, similar to T-complex protein 1, annexin V = CaBP33 isoform, fatty acid-binding protein, and catechol O-methyltransferase), which appeared in KNP alone. We raised polyclonal antibodies (used as the primary antibody) for Western blotting to confirm the characteristics of the six KNP proteins. As a result, catechol O-methyltransferase, annexin V = CaBP33 isoform, and the unnamed protein product presented thicker bands in KNP than those in Landrace. Moreover, catechol O-methyltransferase was shown to be more feasible as a biomarker for KNP. However, cross-reactivity was observed with the polyclonal antibodies for KNP and the other three proteins (NADH dehydrogenase, a protein similar to T-complex protein 1, and fatty acid-binding protein). This study only showed limited results from a limited number of animals; however, our research suggests possibilities for future studies.

MODIFIED CONVOLUTIONAL NEURAL NETWORK WITH TRANSFER LEARNING FOR SOLAR FLARE PREDICTION

  • Zheng, Yanfang;Li, Xuebao;Wang, Xinshuo;Zhou, Ta
    • Journal of The Korean Astronomical Society / v.52 no.6 / pp.217-225 / 2019
  • We apply a modified Convolutional Neural Network (CNN) model in conjunction with transfer learning to predict whether an active region (AR) will produce a ≥C-class or ≥M-class flare within the next 24 hours. We collect line-of-sight magnetogram samples of ARs from May 2010 to September 2018 provided by SHARP, a new data product from the HMI onboard the SDO. Based on these AR samples, we adopt shuffle-and-split cross-validation (CV) to build a database that includes 10 separate data sets. Each of the 10 data sets is segregated by NOAA AR number into a training and a testing data set. After training, validating, and testing our model, we compare the results with previous studies using predictive performance metrics, with a focus on the true skill statistic (TSS). The main results of this study are summarized as follows. First, to the best of our knowledge, this is the first time that a CNN model with transfer learning has been used in solar physics to make binary class predictions for both ≥C-class and ≥M-class flares without manually engineered features extracted from the observational data. Second, our model achieves relatively high scores of TSS = 0.640±0.075 and TSS = 0.526±0.052 for ≥M-class and ≥C-class prediction, respectively, which are comparable to those of previous models. Third, our model also obtains quite good scores in five other metrics for both ≥C-class and ≥M-class flare prediction. Our results demonstrate that our modified CNN model with transfer learning is an effective method for flare forecasting with reasonable prediction performance.
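
Two ideas in this abstract lend themselves to a short sketch: splitting samples by NOAA AR number, so that no active region contributes to both the training and the testing set, and scoring binary predictions with the true skill statistic, TSS = TP/(TP+FN) − FP/(FP+TN). The code below is a minimal illustration, not the authors' pipeline. The 20% test fraction, the AR numbers, and the labels are all hypothetical.

```python
import random

def group_shuffle_split(groups, test_fraction=0.2, seed=0):
    """Split sample indices so that each group lands wholly in train or wholly in test."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test_groups = set(unique[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test

def true_skill_statistic(y_true, y_pred):
    """TSS = recall on flaring samples minus false alarm rate on quiet samples.

    Assumes both classes are present in y_true (otherwise a denominator is zero).
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) - fp / (fp + tn)

# Hypothetical AR number for each magnetogram sample (several samples per region).
ar_numbers = [11158, 11158, 11166, 11166, 11166, 11283, 11429, 11429, 11520, 11520]

# Ten independent shuffle-and-split data sets, one per random seed.
splits = [group_shuffle_split(ar_numbers, seed=s) for s in range(10)]

# Hypothetical labels and predictions for one test set.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
tss = true_skill_statistic(y_true, y_pred)
print(f"TSS = {tss:.3f}")
```

Splitting by region rather than by sample matters because magnetograms of the same active region are highly correlated, so a sample-level split would let near-duplicates of test images leak into training. TSS is also insensitive to the ratio of flaring to non-flaring samples, which makes it convenient for comparing studies with different class balances.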

Analytical, Numerical, and Experimental Comparison of the Performance of Semicircular Cooling Plates (반원형 구조의 냉각판 성능에 관한 해석적/수치해석적/실험적 비교)

  • Cho, Kee-Hyeon;Kim, Moo-Hwan
    • Transactions of the Korean Society of Mechanical Engineers B / v.35 no.12 / pp.1325-1333 / 2011
  • An analytical, numerical, and experimental comparison of the hydraulic and thermal performance of new vascular channels with semicircular cross sections was conducted. The following conditions were employed in the study: Reynolds number, 30-2000; volume fraction of the cooling channels, 0.04; and pressure drop, $30$-$10^5$ Pa. Three flow configurations were considered: first, second, and third constructal structures with diameters optimized for hydraulic operations. To validate the proposed vascular designs by an analytical approach, 3-D numerical analysis was performed. The numerical model was also validated by the experimental data, and the comparison results were in excellent agreement in all cases. The validation study against the experimental data showed that, compared to traditional channels, the optimized structure of the cooling plates could significantly enhance heat transfer and decrease pumping power.

Optimal number of dimensions in linear discriminant analysis for sparse data (희박한 데이터에 대한 선형판별분석에서 최적의 차원 수 결정)

  • Shin, Ga In;Kim, Jaejik
    • The Korean Journal of Applied Statistics / v.30 no.6 / pp.867-876 / 2017
  • Datasets with small n and large p are found in various fields, and their analysis is still a challenge in statistics. Discriminant analysis models for such datasets were recently developed for classification problems. One class of these models tries to detect dimensions that distinguish well between groups, and the number of detected dimensions is typically smaller than p. In such models, the number of dimensions is important for the prediction and visualization of data and can usually be determined by K-fold cross-validation (CV). However, in sparse data scenarios, CV is not reliable for determining the optimal number of dimensions since there can be only a few observations in each fold. Thus, we propose a method to determine the number of dimensions using a measure based on the standardized distance between the mean values of each group in the reduced dimensions. The proposed method is verified through simulations.

Quantitative Detection of Residual E. coli Host Cell DNA by Real-Time PCR

  • Lee, Dong-Hyuck;Bae, Jung-Eun;Lee, Jung-Hee;Shin, Jeong-Sup;Kim, In-Seop
    • Journal of Microbiology and Biotechnology / v.20 no.10 / pp.1463-1470 / 2010
  • E. coli has long been widely used as a host system for the manufacture of recombinant proteins intended for human therapeutic use. When considering the impurities to be eliminated during the downstream process, residual host cell DNA is a major safety concern. The presence of residual E. coli host cell DNA in the final products is typically determined using a conventional slot blot hybridization assay or total DNA Threshold assay. However, both methods are time consuming, expensive, and relatively insensitive. This study thus attempted to develop a more sensitive real-time PCR assay for the specific detection of residual E. coli DNA. This novel method was then compared with the slot blot hybridization assay and total DNA Threshold assay in order to determine its effectiveness and overall capabilities. The novel approach involved the selection of a specific primer pair for amplification of the E. coli 16S rRNA gene in an effort to improve sensitivity, whereas the E. coli host cell DNA was quantified using SYBR Green I. The detection limit of the real-time PCR assay, under these optimized conditions, was calculated to be 0.042 pg genomic DNA, which was much lower than those of the slot blot hybridization assay and total DNA Threshold assay, where the detection limits were 2.42 and 3.73 pg genomic DNA, respectively. Hence, the real-time PCR assay can be said to be more reproducible, more accurate, and more precise than either the slot blot hybridization assay or total DNA Threshold assay. The real-time PCR assay may thus be a promising new tool for the quantitative detection and clearance validation of residual E. coli host cell DNA during the manufacturing process for recombinant therapeutics.

Comparison of Univariate Kriging Algorithms for GIS-based Thematic Mapping with Ground Survey Data (현장 조사 자료를 이용한 GIS 기반 주제도 작성을 위한 단변량 크리깅 기법의 비교)

  • Park, No-Wook
    • Korean Journal of Remote Sensing / v.25 no.4 / pp.321-338 / 2009
  • The objective of this paper is to compare the spatial prediction capabilities of univariate kriging algorithms for generating GIS-based thematic maps from ground survey data with asymmetric distributions. Four univariate kriging algorithms, namely traditional ordinary kriging and three non-linear transform-based kriging algorithms (log-normal kriging, multi-Gaussian kriging, and indicator kriging), are applied to the spatial interpolation of the geochemical elements As and Pb. Cross-validation based on a leave-one-out approach is applied, and prediction errors are then computed. The impact of the sampling density of the ground survey data on the prediction errors is also investigated. In the case study, indicator kriging showed the smallest prediction errors and superior prediction capabilities for very low and very high values. The other non-linear transform-based kriging algorithms yielded better prediction capabilities than traditional ordinary kriging. Log-normal kriging, which has been widely applied, however, produced biased estimation results (overall, overestimation). Such quantitative comparison results are expected to be useful for selecting an optimal kriging algorithm for the spatial interpolation of ground survey data with asymmetric distributions.

Development of a Type 2 Diabetes Prediction Algorithm Based on Big Data (빅데이터 기반 2형 당뇨 예측 알고리즘 개발)

  • Hyun Sim;HyunWook Kim
    • The Journal of the Korea institute of electronic communication sciences / v.18 no.5 / pp.999-1008 / 2023
  • Early prediction of chronic diseases such as diabetes is an important issue, and improving the accuracy of diabetes prediction is especially important. Various machine learning and deep learning-based methodologies are being introduced for diabetes prediction, but these techniques require large amounts of data to outperform other methodologies, and their training cost is high because of complex models. In this study, we aim to verify the claim that a DNN trained on the Pima dataset with k-fold cross-validation reduces the efficiency of diabetes diagnosis models. Machine learning classification methods such as decision trees, SVM, random forests, logistic regression, KNN, and various ensemble techniques were used to determine which algorithm produces the best prediction results. After training and testing all classification models, the proposed system gave the best results with an XGBoost classifier combined with the ADASYN method, with an accuracy of 81%, an F1 score of 0.81, and an AUC of 0.84. Additionally, a domain adaptation method was implemented to demonstrate the versatility of the proposed system. An explainable AI approach using the LIME and SHAP frameworks was implemented to understand how the model predicts the final outcome.
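
The three metrics reported in this abstract (accuracy, F1 score, and AUC) can be computed from a classifier's outputs as in the sketch below. It only shows how such numbers are defined and does not reproduce the reported results. The labels and scores are made up, and the 0.5 decision threshold is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)

def auc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    outscores a random negative (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = diabetic) and predicted probabilities for four patients.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(accuracy(y_true, y_pred))   # 0.75
print(f1_score(y_true, y_pred))   # 2/3
print(auc(y_true, scores))        # 0.75
```

Accuracy and F1 depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three are usually reported together on imbalanced clinical data.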