• Title/Summary/Keyword: random forests model


Comparison of survival prediction models for pancreatic cancer: Cox model versus machine learning models

  • Kim, Hyunsuk;Park, Taesung;Jang, Jinyoung;Lee, Seungyeoun
    • Genomics & Informatics
    • /
    • v.20 no.2
    • /
    • pp.23.1-23.9
    • /
    • 2022
  • A survival prediction model has recently been developed to evaluate the prognosis of resected nonmetastatic pancreatic ductal adenocarcinoma based on a Cox model using two nationwide databases: Surveillance, Epidemiology and End Results (SEER) and Korea Tumor Registry System-Biliary Pancreas (KOTUS-BP). In this study, we applied two machine learning methods for survival analysis, random survival forests (RSF) and support vector machines (SVM), and compared their prediction performance using the SEER and KOTUS-BP datasets. Three schemes were used for model development and evaluation. First, we utilized data from SEER for model development and used data from KOTUS-BP for external evaluation. Second, these two datasets were swapped by taking data from KOTUS-BP for model development and data from SEER for external evaluation. Finally, we mixed these two datasets half and half and utilized the mixed datasets for model development and validation. We used 9,624 patients from SEER and 3,281 patients from KOTUS-BP to construct a prediction model with seven covariates: age, sex, histologic differentiation, adjuvant treatment, resection margin status, and the American Joint Committee on Cancer 8th edition T-stage and N-stage. Comparing the three schemes, the performance of the Cox model, RSF, and SVM was better when using the mixed datasets than when using the unmixed datasets. When using the mixed datasets, the C-index and the 1-year, 2-year, and 3-year time-dependent areas under the curve for the Cox model were 0.644, 0.698, 0.680, and 0.687, respectively. The Cox model performed slightly better than RSF and SVM.
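
The C-index reported in this abstract can be illustrated with a minimal sketch of Harrell's concordance index. This is not the authors' code: the function name, the argument names, and the choice to skip pairs with tied times are assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs that are ranked correctly.

    times       -- observed follow-up times
    events      -- 1 if the event was observed, 0 if the record is censored
    risk_scores -- higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification (assumption): pairs with tied times are skipped.
        # A pair is comparable only if the shorter time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk failed first: correct ordering
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.644 in context.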

Coreference Resolution for Korean Using Random Forests (랜덤 포레스트를 이용한 한국어 상호참조 해결)

  • Jeong, Seok-Won;Choi, MaengSik;Kim, HarkSoo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.5 no.11
    • /
    • pp.535-540
    • /
    • 2016
  • Coreference resolution identifies mentions in documents and groups the mentions that refer to the same entity. It is an essential step for natural language processing applications such as information extraction, event tracking, and question answering. Recently, various coreference resolution models based on machine learning (ML) have been proposed. As is well known, these ML-based models need large training data that are manually annotated with coreferred mention tags. Unfortunately, we cannot find usable open data for training ML-based models in Korean. Therefore, we propose an efficient coreference resolution model that needs less training data than other ML-based models. The proposed model identifies co-referred mentions using random forests based on sieve-guided features. In experiments with baseball news articles, the proposed model achieved a CoNLL F1-score of 0.6678, better than other ML-based models.

Covariance-based Recognition Using Machine Learning Model

  • Osman, Hassab Elgawi
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2009.01a
    • /
    • pp.223-228
    • /
    • 2009
  • We propose an on-line machine learning approach for object recognition, where new images are continuously added and the recognition decision is made without delay. The random forest (RF) classifier has been extensively used as a generative model for classification and regression applications. We extend this technique to the task of building an incremental component-based detector. First, we employ an object descriptor model based on a bag of covariance matrices to represent an object region; then we run our on-line RF learner to select object descriptors and to learn an object classifier. Object recognition experiments are provided to verify the effectiveness of the proposed approach. Results demonstrate that the proposed model yields object recognition performance comparable to the benchmark standard RF, AdaBoost, and SVM classifiers.

Forest Vertical Structure Mapping from Bi-Seasonal Sentinel-2 Images and UAV-Derived DSM Using Random Forest, Support Vector Machine, and XGBoost

  • Young-Woong Yoon;Hyung-Sup Jung
    • Korean Journal of Remote Sensing
    • /
    • v.40 no.2
    • /
    • pp.123-139
    • /
    • 2024
  • Forest vertical structure is vital for comprehending ecosystems and biodiversity, in addition to fundamental forest information. Currently, the forest vertical structure is predominantly assessed via an in-situ method, which is not only difficult to apply to inaccessible locations or large areas but also costly and requires substantial human resources. Therefore, mapping systems based on remote sensing data have been actively explored. Recently, research on analyzing and classifying images using machine learning techniques has been actively conducted and applied to map the vertical structure of forests accurately. In this study, Sentinel-2 and digital surface model images were obtained on two different dates separated by approximately one month, and the spectral index and tree height maps were generated separately. Furthermore, according to the acquisition time, the input data were separated into Cases 1 and 2, which were then combined to generate Case 3. Using these data, forest vertical structure mapping models based on random forest, support vector machine, and extreme gradient boost (XGBoost) were generated. Consequently, nine models were generated, with the XGBoost model in Case 3 performing the best, with an average precision of 0.99 and an F1 score of 0.91. We confirmed that generating a forest vertical structure mapping model utilizing bi-seasonal data and an appropriate model can result in an accuracy of 90% or higher.

Predicting Corporate Bankruptcy using Simulated Annealing-based Random Forests (시뮬레이티드 어니일링 기반의 랜덤 포레스트를 이용한 기업부도예측)

  • Park, Hoyeon;Kim, Kyoung-jae
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.4
    • /
    • pp.155-170
    • /
    • 2018
  • Predicting corporate bankruptcy has traditionally been one of the most crucial forecasting problems in business analytics. In previous studies, prediction models have been proposed by applying or combining statistical and machine learning-based techniques. In this paper, we propose a novel intelligent prediction model based on simulated annealing, one of the well-known optimization techniques. Simulated annealing is known to have optimization performance comparable to that of genetic algorithms. Nevertheless, since there has been little research on prediction and classification for business decision-making problems using simulated annealing, it is meaningful to confirm the usefulness of the proposed model in business analytics. In this study, we use a combined model of simulated annealing and machine learning to select the input features of the bankruptcy prediction model. Typical ways of combining optimization and machine learning techniques are feature selection, feature weighting, and instance selection. This study proposes a combined model for feature selection, the most studied of the three. To confirm the superiority of the proposed model, we apply it to real-world financial data of Korean companies and analyze the results. The results show that the predictive accuracy of the proposed model is better than that of the naïve model. Notably, the performance is significantly improved compared with the traditional decision tree, random forests, artificial neural network, SVM, and logistic regression analysis.
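
The feature-selection idea in this abstract can be sketched as simulated annealing over a boolean feature mask. The sketch below is illustrative only and is not the authors' implementation: `anneal_feature_subset`, its parameters, the cooling schedule, and `toy_score` are all hypothetical. In practice the score would be something like the validation accuracy of a random forest trained on the selected columns.

```python
import math
import random


def anneal_feature_subset(n_features, score, n_iter=500, t0=1.0,
                          cooling=0.99, seed=0):
    """Search for a feature mask that maximizes `score(mask)`.

    `score` is any callable taking a list of booleans (assumed interface).
    """
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    best, best_score = current[:], current_score
    temp = t0
    for _ in range(n_iter):
        # Neighbor move: flip one randomly chosen feature in or out.
        candidate = current[:]
        k = rng.randrange(n_features)
        candidate[k] = not candidate[k]
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept a worse mask with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current[:], current_score
        temp *= cooling  # geometric cooling schedule (assumption)
    return best, best_score


def toy_score(mask):
    """Hypothetical stand-in for model accuracy: features 0 and 2 help,
    every other selected feature costs a small penalty."""
    useful = {0, 2}
    return sum(1.0 if i in useful else -0.5
               for i, on in enumerate(mask) if on)


best_mask, best_value = anneal_feature_subset(5, toy_score)
```

Accepting occasional worse masks early on is what distinguishes this from greedy forward or backward selection, and lets the search escape poor local optima.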

Novel two-stage hybrid paradigm combining data pre-processing approaches to predict biochemical oxygen demand concentration (생물화학적 산소요구량 농도예측을 위하여 데이터 전처리 접근법을 결합한 새로운 이단계 하이브리드 패러다임)

  • Kim, Sungwon;Seo, Youngmin;Zakhrouf, Mousaab;Malik, Anurag
    • Journal of Korea Water Resources Association
    • /
    • v.54 no.spc1
    • /
    • pp.1037-1051
    • /
    • 2021
  • Biochemical oxygen demand (BOD) concentration, one of the important water quality indicators, is treated as the measuring item for the ecological chapter in lakes and rivers. This investigation employed a novel two-stage hybrid paradigm (i.e., wavelet-based gated recurrent unit, wavelet-based generalized regression neural networks, and wavelet-based random forests) to predict BOD concentration at the Dosan and Hwangji stations, South Korea. These models were assessed against the corresponding independent models (i.e., gated recurrent unit, generalized regression neural networks, and random forests). Diverse water quality and quantity indicators were used to develop the independent and two-stage hybrid models based on several input combinations (i.e., Divisions 1-5). The models were evaluated using three statistical indices: the root mean square error (RMSE), Nash-Sutcliffe efficiency (NSE), and correlation coefficient (CC). The results show that the two-stage hybrid models do not always improve the predictive precision of the independent models. The DWT-RF5 model (RMSE = 0.108 mg/L) provided more accurate predictions of BOD concentration than the other optimal models at Dosan station, and the DWT-GRNN4 model (RMSE = 0.132 mg/L) was the best for predicting BOD concentration at Hwangji station, South Korea.
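
The three indices named in this abstract have standard definitions, sketched below in plain Python. The function and argument names are chosen here for illustration, and CC is assumed to be the Pearson correlation coefficient.

```python
import math


def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 means the model
    is no better than always predicting the observed mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


def cc(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs))
    sd_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred))
    return cov / (sd_obs * sd_pred)
```

Note that CC ignores constant bias: predictions that are uniformly 1 mg/L too high still give CC close to 1, while RMSE and NSE both penalize the offset.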

Classification Abnormal temperatures based on Meteorological Environment using Random forests (랜덤포레스트를 이용한 기상 환경에 따른 이상기온 분류)

  • Youn Su Kim;Kwang Yoon Song;In Hong Chang
    • Journal of Integrative Natural Science
    • /
    • v.17 no.1
    • /
    • pp.1-12
    • /
    • 2024
  • Many abnormal climate events are occurring around the world. The cause of abnormal climate is related to temperature. Factors that affect temperature include excessive emissions of carbon and greenhouse gases from a global perspective, and air circulation from a local perspective. Due to air circulation, many abnormal climate phenomena such as abnormally high and abnormally low temperatures are occurring in certain areas, which can cause very serious human damage. Therefore, the problem of abnormal temperature should not be approached only as a case of climate change, but should be studied as a new category of climate crisis. In this study, we propose a model for classifying abnormal temperature using random forests based on various meteorological data, such as longitudinal observations, yellow dust, and ultraviolet radiation, from 2018 to 2022 for each region in Korea. The meteorological data were imbalanced, so the imbalance was addressed by oversampling. As a result, we found that the variables affecting abnormal temperature differ between regions. In particular, the central and southern regions are influenced by high pressure (Mainland China, Siberian high pressure, and North Pacific high pressure) due to their regional characteristics, so pressure-related variables had a significant impact on the classification of abnormal temperature. This suggests that a regional approach can be taken to predict abnormal temperatures from the surrounding meteorological environment. In addition, when an abnormal temperature is expected, it seems possible to take preventive measures in advance according to regional characteristics.
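
The abstract does not say which oversampling method was used. As one possibility, the sketch below assumes plain random oversampling, which duplicates minority-class rows until every class matches the largest one. The function name and the data layout are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows of this class to draw duplicates from.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels
```

Oversampling would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.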

A Novel Network Anomaly Detection Method based on Data Balancing and Recursive Feature Addition

  • Liu, Xinqian;Ren, Jiadong;He, Haitao;Wang, Qian;Sun, Shengting
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.7
    • /
    • pp.3093-3115
    • /
    • 2020
  • Network anomaly detection systems play an essential role in detecting network anomalies and ensuring network security. Anomaly detection based on machine learning has become an increasingly popular solution. However, because network traffic is unbalanced and high-dimensional, existing methods are unable to achieve both high accuracy and a low false alarm rate. To address this problem, a new network anomaly detection method based on data balancing and recursive feature addition is proposed. First, a data balancing algorithm based on improved KNN outlier detection is designed to select a representative part of the data in each category. The parameter combination of the improved KNN outlier detection is optimized by a genetic algorithm. Next, a recursive feature addition algorithm based on correlation analysis is proposed to select effective features, in which a cross contingency test is used to analyze correlation and obtain a strongly correlated feature subset. Then, a random forests model is used as the classification model to detect anomalies. Finally, the proposed algorithm is evaluated on the benchmark datasets KDD Cup 1999 and UNSW_NB15. The results show that the proposed strategies improve accuracy and recall and decrease the false alarm rate. Compared with other algorithms, this algorithm still achieves significant gains, especially in recall for the small categories.
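
The two metrics emphasized in this abstract, per-category recall and false alarm rate, can be sketched as follows. The false alarm rate is assumed here to be the share of normal traffic flagged as any attack, a common definition in intrusion detection that the abstract itself does not spell out. The function names and the label strings in the usage example are illustrative.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each category: correctly predicted / actual members."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / total[cls] for cls in total}


def false_alarm_rate(y_true, y_pred, normal="normal"):
    """Share of truly normal records that were flagged as an attack."""
    preds_for_normal = [p for t, p in zip(y_true, y_pred) if t == normal]
    return sum(p != normal for p in preds_for_normal) / len(preds_for_normal)


# Illustrative labels only; the category names are placeholders.
y_true = ["normal"] * 4 + ["dos"] * 3 + ["u2r"]
y_pred = ["normal", "normal", "normal", "dos",
          "dos", "dos", "normal", "u2r"]
recalls = per_class_recall(y_true, y_pred)
far = false_alarm_rate(y_true, y_pred)
```

Reporting recall per category matters on unbalanced traffic because overall accuracy can stay high even when a rare attack category is missed entirely.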

A descriptive study of on-farm biosecurity and management practices during the incursion of porcine epidemic diarrhea into Canadian swine herds, 2014

  • Perri, Amanda M.;Poljak, Zvonimir;Dewey, Cate;Harding, John CS.;O'Sullivan, Terri L.
    • Journal of Veterinary Science
    • /
    • v.21 no.2
    • /
    • pp.25.1-25.16
    • /
    • 2020
  • Porcine epidemic diarrhea virus (PEDV) emerged in Canada in January 2014, primarily affecting sow herds. Subsequent epidemiological analyses suggested contaminated feed was the most likely transmission pathway. The primary objective of this study was to describe general biosecurity and management practices implemented in PEDV-positive sow herds and matched control herds at the time the virus emerged. The secondary objective was to determine if any of these general biosecurity and farm management practices were important in explaining PEDV infection status from January 22, 2014 to March 1, 2014. A case herd was defined as a swine herd with clinical signs and a positive test result for PEDV. A questionnaire was used to gather a 30-day history of herd management practices, animal movements on/off site, feed management practices, semen deliveries and biosecurity practices for case (n = 8) and control (n = 12) herds, primarily located in Ontario. Data were analyzed using descriptive statistics and random forests (RFs). Case herds were larger in size than control herds. Case herds had more animal movements and non-staff movements onto the site. Also, case herds had higher quantities of pigs delivered, feed deliveries and semen deliveries on-site. The biosecurity practices of case herds were considered more rigorous than those of control herds based on herd management, feed deliveries, transportation and truck driver practices. The RF model found that the most important variables for predicting herd status were related to herd size and feed management variables. Nonetheless, predictive accuracy of the final RF model was 72%.

Object Classification Method Using Dynamic Random Forests and Genetic Optimization

  • Kim, Jae Hyup;Kim, Hun Ki;Jang, Kyung Hyun;Lee, Jong Min;Moon, Young Shik
    • Journal of the Korea Society of Computer and Information
    • /
    • v.21 no.5
    • /
    • pp.79-89
    • /
    • 2016
  • In this paper, we propose an object classification method using a genetic algorithm and a dynamic random forest consisting of an optimal combination of unit trees. The random forest is an ensemble classification model that combines unit decision trees based on bagging; by randomizing the training samples and the feature selection allocated to each decision tree, it can ensure good generalization performance when a large number of trees are combined. However, because the random forest is composed of randomly built unit trees, it shows excellent classification performance only when a sufficient number of trees are combined. There is no quantitative method for determining the number of trees, so there is no choice but to keep generating random trees repeatedly. The proposed algorithm composes a random forest from an optimal combination of trees while maintaining the generalization performance of the random forest. To achieve this, the problem of improving classification performance is cast as an optimization problem of finding the optimal tree combination, which is solved with a genetic algorithm. In experiments, the proposed algorithm improved classification performance by about 3~5% over the existing random forest in specific cases such as a common database and our own infrared database. In addition, we show that the optimal tree combination is reached at about 55~60% of the maximum number of trees.