• Title/Summary/Keyword: random forest (RF)

Improved Prediction of Coreceptor Usage and Phenotype of HIV-1 Based on Combined Features of V3 Loop Sequence Using Random Forest

  • Xu, Shungao;Huang, Xinxiang;Xu, Huaxi;Zhang, Chiyu
    • Journal of Microbiology / v.45 no.5 / pp.441-446 / 2007
  • HIV-1 coreceptor usage and phenotype, which are mainly determined by the V3 loop, are associated with the disease progression of AIDS. Predicting HIV-1 coreceptor usage and phenotype facilitates the monitoring of the R5-to-X4 switch and treatment decision-making. In this study, we employed random forest to predict HIV-1 biological phenotype, based on 37 random features of the V3 loop. In comparison with the PSSM method, our RF predictor obtained higher prediction accuracy (95.1% for coreceptor usage and 92.1% for phenotype), especially for non-B non-C HIV-1 subtypes (96.6% for coreceptor usage and 95.3% for phenotype). The net charge and polarity of the V3 loop and five V3 sites are the seven most important features for predicting HIV-1 coreceptor usage or phenotype. Among these features, V3 polarity and four V3 sites (22, 12, 18 and 13) are reported for the first time to contribute strongly to HIV-1 biological phenotype prediction.
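
The net charge feature named in the abstract above can be illustrated with a minimal sketch. The abstract does not say how charge was computed, so the convention here (K and R count as +1, D and E as -1, histidine ignored) is an assumption, and `net_charge` and the example sequence are hypothetical names and data chosen only for illustration.

```python
# Assumed convention, not taken from the paper:
# lysine (K) and arginine (R) are +1, aspartate (D) and glutamate (E) are -1.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def net_charge(sequence: str) -> int:
    """Return the net charge of an amino-acid sequence under the assumed convention."""
    seq = sequence.upper()
    positives = sum(ch in POSITIVE for ch in seq)
    negatives = sum(ch in NEGATIVE for ch in seq)
    return positives - negatives


# Illustrative 35-residue V3-like sequence (not data from the study).
example_v3 = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
print(net_charge(example_v3))
```

A feature like this would be one column of the input table handed to a random forest, alongside polarity and per-site residues.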

Unveiling the mysteries of flood risk: A machine learning approach to understanding flood-influencing factors for accurate mapping

  • Roya Narimani;Shabbir Ahmed Osmani;Seunghyun Hwang;Changhyun Jun
    • Proceedings of the Korea Water Resources Association Conference / 2023.05a / pp.164-164 / 2023
  • This study investigates the importance of flood-influencing factors for the accuracy of flood risk mapping by integrating remote sensing and machine learning techniques. Here, the Extreme Gradient Boosting (XGBoost) and Random Forest (RF) algorithms, integrated with GIS-based techniques, were used to generate flood risk maps. For the study area of Napa County in the United States, rainfall data from 12 stations, Sentinel-1 SAR images, and Sentinel-2 optical images were used to extract 13 flood-influencing factors: altitude, aspect, slope, topographic wetness index, normalized difference vegetation index, stream power index, sediment transport index, land use/land cover, terrain roughness index, distance from the river, soil, rainfall, and geology. These 13 raster maps were used as input data for the XGBoost and RF algorithms to model flood-prone areas using ArcGIS, Python, and R. The results indicate that XGBoost performed better than RF in modeling flood-prone areas, with an ROC of 97.45%, Kappa of 93.65%, and accuracy score of 96.83%, compared to RF's 82.21%, 70.54%, and 88%, respectively. In conclusion, XGBoost is more efficient than RF for flood risk mapping and can potentially be utilized for flood mitigation strategies. It should be noted that all flood-influencing factors had a positive effect, but altitude, slope, and rainfall were the most influential features in modeling flood risk maps using XGBoost.
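
For readers unfamiliar with the Kappa and accuracy scores quoted above, the following minimal sketch computes both from a binary confusion matrix. The function name `accuracy_and_kappa` and the counts in the usage line are hypothetical; they are not the study's data.

```python
def accuracy_and_kappa(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total  # observed agreement, i.e. accuracy

    # Agreement expected by chance, from the row and column marginals.
    p_flood = ((tp + fp) / total) * ((tp + fn) / total)
    p_no_flood = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_flood + p_no_flood

    if expected == 1.0:  # degenerate case: only one class present
        return observed, 0.0
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical counts for illustration only.
acc, kappa = accuracy_and_kappa(tp=40, fp=10, fn=5, tn=45)
print(f"accuracy={acc:.3f}, kappa={kappa:.3f}")
```

Kappa discounts the agreement a classifier would reach by chance, which is why it is typically lower than the raw accuracy reported for the same model.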

Utilizing the GOA-RF hybrid model, predicting the CPT-based pile set-up parameters

  • Zhao, Zhilong;Chen, Simin;Zhang, Dengke;Peng, Bin;Li, Xuyang;Zheng, Qian
    • Geomechanics and Engineering / v.31 no.1 / pp.113-127 / 2022
  • The undrained shear strength of soil is considered one of the most significant engineering parameters in geotechnical design methods. In-situ experiments such as cone penetration tests (CPT) have been used in recent years to estimate the undrained shear strength from the characteristics of the soil. Nevertheless, the majority of these techniques rely on correlation assumptions, which may lead to uneven accuracy. The general aim of this research is to develop a new hybrid soft computing model, combining random forest (RF) with the grasshopper optimization algorithm (GOA), to better approximate the pile set-up parameters from CPT, based on two different types of input data. Data type 1 contains pile parameters, and data type 2 consists of soil properties. The contribution of this article is that a hybrid GOA-RF model is proposed for the first time to forecast the pile set-up parameter from CPT. To do this, CPT data and related bore log data were gathered from 70 locations across Louisiana. With an R² greater than 0.9098, which denotes acceptable agreement between measured and predicted values, the results demonstrated that both models perform well in forecasting the set-up parameter. In both the training and testing steps, the model with data type 2 performs better than the model using data type 1, with R² and RMSE of 0.9272 and 0.0305 for the training step and 0.9182 and 0.0415 for the testing step. Overall, the results show that the A parameter can be forecast with adequate precision from the CPT data using hybrid GOA-RF models. However, the RF model with soil features as input parameters gives a better description of the pile set-up parameters.
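
The two scores quoted in this abstract can be computed as in the sketch below. It assumes R² means the coefficient of determination (the abstract does not say whether it is that or a squared correlation), and `r2_and_rmse` and the sample values are hypothetical.

```python
import math


def r2_and_rmse(measured: list[float], predicted: list[float]) -> tuple[float, float]:
    """Return (R^2, RMSE), taking R^2 as the coefficient of determination (an assumption)."""
    n = len(measured)
    mean_measured = sum(measured) / n
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))  # residual sum of squares
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)         # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Hypothetical measured and predicted set-up parameters, for illustration only.
r2, rmse = r2_and_rmse([0.20, 0.35, 0.50], [0.22, 0.33, 0.49])
print(f"R2={r2:.4f}, RMSE={rmse:.4f}")
```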

Risk Assessment of Pine Tree Dieback in Sogwang-Ri, Uljin (울진 소광리 금강소나무 고사발생 특성 분석 및 위험지역 평가)

  • Kim, Eun-Sook;Lee, Bora;Kim, Jaebeom;Cho, Nanghyun;Lim, Jong-Hwan
    • Journal of Korean Society of Forest Science / v.109 no.3 / pp.259-270 / 2020
  • Extreme weather events, such as heat and drought, have occurred frequently over the past two decades. This has led to continuous reports of forest damage caused by physiological stress rather than pests. In 2014, pine trees were collectively damaged in the forest genetic resources reserve of Sogwang-ri, Uljin, South Korea. An investigation was launched to determine the causes of the dieback, so that a forest management plan could be prepared to deal with the current dieback and to prevent future damage. This study aimed to 1) understand the topographic and structural characteristics of the area that experienced pine tree dieback, 2) identify the main causes of the dieback, and 3) predict future risk areas using machine-learning techniques. A model for identifying risk areas was developed using 14 explanatory variables, including location, elevation, slope, and age class. When three machine-learning techniques (Decision Tree, Random Forest (RF), and Support Vector Machine (SVM)) were applied to the model, RF and SVM showed higher predictability scores, with accuracies over 93%. Our analysis of the variable set showed that the topographical areas most vulnerable to pine dieback were those with high altitudes, high daily solar radiation, and limited water availability. We also found that, in terms of forest stand characteristics, pine trees with high vertical stand densities (5-15 m high) and higher age classes experienced a higher risk of dieback. The RF and SVM models predicted that 9.5%, or 115 ha, of the Geumgang Pine Forest is at high risk of pine dieback. Our study suggests the need for further investigation into the vulnerable areas of the Geumgang Pine Forest, and for climate change adaptive forest management steps to protect those areas that remain undamaged.

Applications of Machine Learning Models for the Estimation of Reservoir CO2 Emissions (저수지 CO2 배출량 산정을 위한 기계학습 모델의 적용)

  • Yoo, Jisu;Chung, Se-Woong;Park, Hyung-Seok
    • Journal of Korean Society on Water Environment / v.33 no.3 / pp.326-333 / 2017
  • Lakes and reservoirs have been reported as important sources of carbon emissions to the atmosphere in many countries. Although field experiments and theoretical investigations based on fundamental gas exchange theory have quantified the Net Atmospheric Flux (NAF) in various climate regions, there are still large uncertainties in global-scale estimates. Mechanistic models can be used for understanding and estimating the temporal and spatial variations of the NAFs, considering the complicated hydrodynamic and biogeochemical processes in a reservoir, but these models require extensive and expensive datasets and model parameters. On the other hand, data-driven machine learning (ML) algorithms are likely to be alternative tools for estimating the NAFs in response to independent environmental variables. The objective of this study was to develop random forest (RF) and multi-layer artificial neural network (ANN) models for estimating the daily $CO_2$ NAFs in Daecheong Reservoir, located on the Geum River in Korea, and to compare the models' performance against the multiple linear regression (MLR) model proposed in a previous study (Chung et al., 2016). As a result, the RF and ANN models showed much improved performance in estimating the high NAF values, while the MLR model significantly underestimated them. Cross-validation with 10-fold random sampling was applied to evaluate the performance of the three models, and indicated that the ANN model performed best, followed by the RF and MLR models.
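
The 10-fold cross-validation with random sampling mentioned above can be sketched as follows. This only shows how sample indices might be partitioned; `k_fold_indices`, the seed, and the sample count are hypothetical and not taken from the study.

```python
import random
from typing import Iterator


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for each of k randomly sampled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random sampling of fold membership
    folds = [indices[i::k] for i in range(k)]     # k disjoint, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each sample is held out exactly once across the 10 folds.
for fold_no, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
    print(fold_no, len(train), len(test))
```

Each model (RF, ANN, MLR) would be fitted on `train` and scored on `test` for every fold, with the fold scores averaged before the models are ranked.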

Clustering and classification to characterize daily electricity demand (시간단위 전력사용량 시계열 패턴의 군집 및 분류분석)

  • Park, Dain;Yoon, Sanghoo
    • Journal of the Korean Data and Information Science Society / v.28 no.2 / pp.395-406 / 2017
  • The purpose of this study is to identify patterns of daily electricity demand through clustering and classification. The hourly data were collected by KPX (Korea Power Exchange) between 2008 and 2012. Because electricity demand data are time series data, the time trend was removed before extracting the daily demand patterns. We considered k-means clustering, Gaussian mixture model clustering, and functional clustering in order to find the optimal clustering method. The classification analysis was conducted to understand the relationship between the clusters and external factors (day of the week, holiday, and weather). The data were divided into training data and test data. The training data consisted of the external factors and cluster numbers between 2008 and 2011. The test data were the daily external factors in 2012. Decision tree, random forest, support vector machine, and naive Bayes classifiers were used. As a result, Gaussian model-based clustering and random forest showed the best prediction performance when the number of clusters was 8.

ECG-based Biometric Authentication Using Random Forest (랜덤 포레스트를 이용한 심전도 기반 생체 인증)

  • Kim, JeongKyun;Lee, Kang Bok;Hong, Sang Gi
    • Journal of the Institute of Electronics and Information Engineers / v.54 no.6 / pp.100-105 / 2017
  • This work presents an ECG biometric recognition system for the purpose of biometric authentication. ECG biometric approaches are divided into two major categories: fiducial-based and non-fiducial-based methods. This paper proposes a new non-fiducial framework using the discrete cosine transform (DCT) and a Random Forest (RF) classifier. When using the DCT, most of the signal information tends to be concentrated in a few low-frequency components. To form the input feature vectors for the Random Forest, DCT feature vectors of ECG heartbeats are constructed from the first 40 DCT coefficients. RF is based on the computation of a large number of decision trees. It is relatively fast, robust, and inherently suitable for multi-class problems. Furthermore, a threshold inside the RF classifier sets the trade-off between admission and rejection of an ID. As a result, the proposed method achieves a 99.9% recognition rate when tested on the MIT-BIH NSRDB.
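
The feature-extraction step described above, keeping the first 40 DCT coefficients of each heartbeat, can be sketched in plain Python. The abstract does not state which DCT variant or scaling was used, so the unnormalized DCT-II here is an assumption, and `dct_features` and the synthetic beat are hypothetical.

```python
import math


def dct_features(beat: list[float], n_coeffs: int = 40) -> list[float]:
    """Return the first n_coeffs unnormalized DCT-II coefficients of one heartbeat segment."""
    n = len(beat)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(beat))
        for k in range(min(n_coeffs, n))
    ]


# Synthetic 128-sample "beat" used only to exercise the function.
beat = [math.exp(-((i - 64) / 6.0) ** 2) for i in range(128)]
features = dct_features(beat)
print(len(features))
```

Each resulting 40-element vector would be one row of the training matrix for the classifier; the low-order coefficients carry most of the signal energy, which is the property the abstract relies on.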

A Study on the Prediction of CNC Tool Wear Using Machine Learning Technique (기계학습 기법을 이용한 CNC 공구 마모도 예측에 관한 연구)

  • Lee, Kangbae;Park, Sungho;Sung, Sangha;Park, Domyoung
    • Journal of the Korea Convergence Society / v.10 no.11 / pp.15-21 / 2019
  • The fourth industrial revolution and the smart factory are drawing attention. At present, research on CNC (Computerized Numeric Controller) is actively underway in the manufacturing field, covering domestic CNC equipment, acoustic sensors, vibration sensors, etc. This study aims to improve efficiency through CNC. Various data, such as X-axis, Y-axis, and Z-axis force and moving speed, were collected, and the characteristics of the collected data were explored. The data were then used with Random Forest (RF), Extreme Gradient Boost (XGB), and Support Vector Machine (SVM) models. The results of this study concern CNC equipment.

Detection of Cropland in Reservoir Area by Using Supervised Classification of UAV Imagery Based on GLCM (GLCM 기반 UAV 영상의 감독분류를 이용한 저수구역 내 농경지 탐지)

  • Kim, Gyu Mun;Choi, Jae Wan
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.36 no.6 / pp.433-442 / 2018
  • The reservoir area is defined as the area surrounded by the planned flood level of the dam, or the land under the planned flood level of the dam. In this study, supervised classification based on RF (Random Forest), a representative machine learning technique, was performed to detect cropland in the reservoir area. To classify the cropland in the reservoir area efficiently, the GLCM (Gray Level Co-occurrence Matrix), a representative technique for quantifying texture information, the NDWI (Normalized Difference Water Index), and the NDVI (Normalized Difference Vegetation Index) were used as additional features during classification. In particular, we analyzed the effect of texture information according to the window size used to generate the GLCM, and proposed a methodology for detecting croplands in the reservoir area. The experimental results showed that cropland in the reservoir area could be detected efficiently using the multispectral, NDVI, NDWI, and GLCM images from the UAV. In particular, the GLCM window size was an important parameter for increasing the classification accuracy.
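
The three additional features named above can be sketched as follows. The NDVI formula is standard; for NDWI the green/NIR form is assumed, since the abstract does not say which NDWI variant was used. The GLCM offset, the contrast statistic, and all function names are likewise illustrative assumptions.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    return (a - b) / (a + b) if (a + b) != 0 else 0.0


def ndvi(nir: float, red: float) -> float:
    return normalized_difference(nir, red)


def ndwi(green: float, nir: float) -> float:
    # Assumed green/NIR variant of NDWI.
    return normalized_difference(green, nir)


def glcm(window: list[list[int]], levels: int, dx: int = 1, dy: int = 0) -> list[list[float]]:
    """Normalized gray-level co-occurrence matrix of one window for offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(window), len(window[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[window[r][c]][window[r2][c2]] += 1
    total = sum(map(sum, counts))
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[v / total for v in row] for row in counts]


def glcm_contrast(p: list[list[float]]) -> float:
    """Contrast texture statistic: sum of p(i, j) * (i - j)^2."""
    return sum(p[i][j] * (i - j) ** 2 for i in range(len(p)) for j in range(len(p)))


# Hypothetical 3x3 window quantized to two gray levels.
window = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
print(glcm_contrast(glcm(window, levels=2)))
```

Sliding `window` across the image with different sizes is how the window-size effect reported in the abstract would be explored; a larger window smooths the texture measure but blurs field boundaries.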

Predicting the Invasion Potential of Pink Muhly (Muhlenbergia capillaris) in South Korea

  • Park, Jeong Soo;Choi, Donghui;Kim, Youngha
    • Proceedings of the National Institute of Ecology of the Republic of Korea / v.1 no.1 / pp.74-82 / 2020
  • Predictions of suitable habitat areas can provide important information for the risk assessment and management of alien plants at an early stage of their establishment. Here, we predict the invasion potential of Muhlenbergia capillaris (pink muhly) in South Korea using five bioclimatic variables. We adopt four models (generalized linear model, generalized additive model, random forest (RF), and artificial neural network) for projection, based on 630 presence and 600 pseudo-absence data points. The RF model yielded the highest performance. The presence probability of M. capillaris was highest within an annual temperature range of 12 to 24℃ and with precipitation from 800 to 1,300 mm. The occurrence of M. capillaris was positively associated with the precipitation of the driest quarter. The projection map showed that suitable areas for M. capillaris are mainly concentrated in the southern coastal regions of South Korea, where temperatures and precipitation are higher than in other regions, especially in the winter season. Based on the habitat suitability map, we conclude that M. capillaris is not considered to be invasive. However, rising temperatures and increasing winter precipitation could accelerate the expansion of this plant on the Korean Peninsula.