• Title/Summary/Keyword: random forest (RF)

Search Result 185, Processing Time 0.022 seconds

Light-weight Classification Model for Android Malware through the Dimensional Reduction of API Call Sequence using PCA

  • Jeon, Dong-Ha;Lee, Soo-Jin
    • Journal of the Korea Society of Computer and Information
    • /
    • v.27 no.11
    • /
    • pp.123-130
    • /
    • 2022
  • Recently, studies on the detection and classification of Android malware based on API Call sequence have been actively carried out. However, API Call sequence based malware classification has serious limitations such as excessive time and resource consumption in terms of malware analysis and learning model construction due to the vast amount of data and high-dimensional characteristic of features. In this study, we analyzed various classification models such as LightGBM, Random Forest, and k-Nearest Neighbors after significantly reducing the dimension of features using PCA(Principal Component Analysis) for CICAndMal2020 dataset containing vast API Call information. The experimental result shows that PCA significantly reduces the dimension of features while maintaining the characteristics of the original data and achieves efficient malware classification performance. Both binary classification and multi-class classification achieve higher levels of accuracy than previous studies, even if the data characteristics were reduced to less than 1% of the total size.

Method for Assessing Landslide Susceptibility Using SMOTE and Classification Algorithms (SMOTE와 분류 기법을 활용한 산사태 위험 지역 결정 방법)

  • Yoon, Hyung-Koo
    • Journal of the Korean Geotechnical Society
    • /
    • v.39 no.6
    • /
    • pp.5-12
    • /
    • 2023
  • Proactive assessment of landslide susceptibility is necessary for minimizing casualties. This study proposes a methodology for classifying the landslide safety factor using a classification algorithm based on machine learning techniques. The high-risk area model is adopted to perform the classification and eight geotechnical parameters are adopted as inputs. Four classification algorithms-namely decision tree, k-nearest neighbor, logistic regression, and random forest-are employed for comparing classification accuracy for the safety factors ranging between 1.2 and 2.0. Notably, a high accuracy is demonstrated in the safety factor range of 1.2~1.7, but a relatively low accuracy is obtained in the range of 1.8~2.0. To overcome this issue, the synthetic minority over-sampling technique (SMOTE) is adopted to generate additional data. The application of SMOTE improves the average accuracy by ~250% in the safety factor range of 1.8~2.0. The results demonstrate that SMOTE algorithm improves the accuracy of classification algorithms when applied to geotechnical data.

A study of glass and carbon fibers in FRAC utilizing machine learning approach

  • Ankita Upadhya;M. S. Thakur;Nitisha Sharma;Fadi H. Almohammed;Parveen Sihag
    • Advances in materials Research
    • /
    • v.13 no.1
    • /
    • pp.63-86
    • /
    • 2024
  • Asphalt concrete (AC), is a mixture of bitumen and aggregates, which is very sensitive in the design of flexible pavement. In this study, the Marshall stability of the glass and carbon fiber bituminous concrete was predicted by using Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest (RF), and M5P Tree machine learning algorithms. To predict the Marshall stability, nine inputs parameters i.e., Bitumen, Glass and Carbon fibers mixed in 100:0, 75:25, 50:50, 25:75, 0:100 percentage (designated as 100GF:0CF, 75GF:25CF, 50GF:50 CF, 25GF:75CF, 0GF:100CF), Bitumen grade (VG), Fiber length (FL), and Fiber diameter (FD) were utilized from the experimental and literary data. Seven statistical indices i.e., coefficient of correlation (CC), mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), root relative squared error (RRSE), Scattering index (SI), and BIAS were applied to assess the effectiveness of the developed models. According to the performance evaluation results, Artificial neural network (ANN) was outperforming among other models with CC values as 0.9147 and 0.8648, MAE values as 1.3757 and 1.978, RMSE values as 1.843 and 2.6951, RAE values as 39.88 and 49.31, RRSE values as 40.62 and 50.50, SI values as 0.1379 and 0.2027 and BIAS value as -0.1 290 and -0.2357 in training and testing stage respectively. The Taylor diagram (testing stage) also confirmed that the ANN-based model outperforms the other models. Results of sensitivity analysis showed that the fiber length is the most influential in all nine input parameters whereas the fiber combination of 25GF:75CF was the most effective among all the fiber mixes in Marshall stability.

Improved prediction of soil liquefaction susceptibility using ensemble learning algorithms

  • Satyam Tiwari;Sarat K. Das;Madhumita Mohanty;Prakhar
    • Geomechanics and Engineering
    • /
    • v.37 no.5
    • /
    • pp.475-498
    • /
    • 2024
  • The prediction of the susceptibility of soil to liquefaction using a limited set of parameters, particularly when dealing with highly unbalanced databases is a challenging problem. The current study focuses on different ensemble learning classification algorithms using highly unbalanced databases of results from in-situ tests; standard penetration test (SPT), shear wave velocity (Vs) test, and cone penetration test (CPT). The input parameters for these datasets consist of earthquake intensity parameters, strong ground motion parameters, and in-situ soil testing parameters. liquefaction index serving as the binary output parameter. After a rigorous comparison with existing literature, extreme gradient boosting (XGBoost), bagging, and random forest (RF) emerge as the most efficient models for liquefaction instance classification across different datasets. Notably, for SPT and Vs-based models, XGBoost exhibits superior performance, followed by Light gradient boosting machine (LightGBM) and Bagging, while for CPT-based models, Bagging ranks highest, followed by Gradient boosting and random forest, with CPT-based models demonstrating lower Gmean(error), rendering them preferable for soil liquefaction susceptibility prediction. Key parameters influencing model performance include internal friction angle of soil (ϕ) and percentage of fines less than 75 µ (F75) for SPT and Vs data and normalized average cone tip resistance (qc) and peak horizontal ground acceleration (amax) for CPT data. It was also observed that the addition of Vs measurement to SPT data increased the efficiency of the prediction in comparison to only SPT data. Furthermore, to enhance usability, a graphical user interface (GUI) for seamless classification operations based on provided input parameters was proposed.

Quantitative Structure Activity Relationship Prediction of Oral Bioavailabilities Using Support Vector Machine

  • Fatemi, Mohammad Hossein;Fadaei, Fatemeh
    • Journal of the Korean Chemical Society
    • /
    • v.58 no.6
    • /
    • pp.543-552
    • /
    • 2014
  • A quantitative structure activity relationship (QSAR) study is performed for modeling and prediction of oral bioavailabilities of 216 diverse set of drugs. After calculation and screening of molecular descriptors, linear and nonlinear models were developed by using multiple linear regression (MLR), artificial neural network (ANN), support vector machine (SVM) and random forest (RF) techniques. Comparison between statistical parameters of these models indicates the suitability of SVM over other models. The root mean square errors of SVM model were 5.933 and 4.934 for training and test sets, respectively. Robustness and reliability of the developed SVM model was evaluated by performing of leave many out cross validation test, which produces the statistic of $Q^2_{SVM}=0.603$ and SPRESS = 7.902. Moreover, the chemical applicability domains of model were determined via leverage approach. The results of this study revealed the applicability of QSAR approach by using SVM in prediction of oral bioavailability of drugs.

A Detailed Analysis of Classifier Ensembles for Intrusion Detection in Wireless Network

  • Tama, Bayu Adhi;Rhee, Kyung-Hyune
    • Journal of Information Processing Systems
    • /
    • v.13 no.5
    • /
    • pp.1203-1212
    • /
    • 2017
  • Intrusion detection systems (IDSs) are crucial in this overwhelming increase of attacks on the computing infrastructure. It intelligently detects malicious and predicts future attack patterns based on the classification analysis using machine learning and data mining techniques. This paper is devoted to thoroughly evaluate classifier ensembles for IDSs in IEEE 802.11 wireless network. Two ensemble techniques, i.e. voting and stacking are employed to combine the three base classifiers, i.e. decision tree (DT), random forest (RF), and support vector machine (SVM). We use area under ROC curve (AUC) value as a performance metric. Finally, we conduct two statistical significance tests to evaluate the performance differences among classifiers.

RESEARCH ON THE DEVELOPMENT OF COLLEGE STUDENT EDUCATION BASED ON MACHINE LEARNING - TAKE THE PHYSICAL EDUCATION OF YANBIAN UNIVERSITY AS AN EXAMPLE

  • Quan, Yu;Guo, Wei-Jie;He, Lin;Jin, Zhe-Zhi
    • East Asian mathematical journal
    • /
    • v.38 no.1
    • /
    • pp.65-84
    • /
    • 2022
  • This paper is based on Yanbian University's physical test data, and uses statistical analysis methods to study the relationship between college students' physical test scores to promote college physical education. Firstly, using gender as categorical variables, we conduct a general analysis of students in different majors and different grades, and obtain the advantages and disadvantages of male and female college students; then we use Decision Trees and Random Forest algorithms to conduct modeling analysis to provide valuable suggestions for relevant departments of the university. the aiming of this research analyzing about the undergraduates physical test is that giving universities the targeted suggestions to improve the college graduate rate and promote the overall development of higher education, lay the foundation for achieving universal health.

The Investigation of Employing Supervised Machine Learning Models to Predict Type 2 Diabetes Among Adults

  • Alhmiedat, Tareq;Alotaibi, Mohammed
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.16 no.9
    • /
    • pp.2904-2926
    • /
    • 2022
  • Currently, diabetes is the most common chronic disease in the world, affecting 23.7% of the population in the Kingdom of Saudi Arabia. Diabetes may be the cause of lower-limb amputations, kidney failure and blindness among adults. Therefore, diagnosing the disease in its early stages is essential in order to save human lives. With the revolution in technology, Artificial Intelligence (AI) could play a central role in the early prediction of diabetes by employing Machine Learning (ML) technology. In this paper, we developed a diagnosis system using machine learning models for the detection of type 2 diabetes among adults, through the adoption of two different diabetes datasets: one for training and the other for the testing, to analyze and enhance the prediction accuracy. This work offers an enhanced classification accuracy as a result of employing several pre-processing methods before applying the ML models. According to the obtained results, the implemented Random Forest (RF) classifier offers the best classification accuracy with a classification score of 98.95%.

LSTM Model-based Prediction of the Variations in Load Power Data from Industrial Manufacturing Machines

  • Rita, Rijayanti;Kyohong, Jin;Mintae, Hwang
    • Journal of information and communication convergence engineering
    • /
    • v.20 no.4
    • /
    • pp.295-302
    • /
    • 2022
  • This paper contains the development of a smart power device designed to collect load power data from industrial manufacturing machines, predict future variations in load power data, and detect abnormal data in advance by applying a machine learning-based prediction algorithm. The proposed load power data prediction model is implemented using a Long Short-Term Memory (LSTM) algorithm with high accuracy and relatively low complexity. The Flask and REST API are used to provide prediction results to users in a graphical interface. In addition, we present the results of experiments conducted to evaluate the performance of the proposed approach, which show that our model exhibited the highest accuracy compared with Multilayer Perceptron (MLP), Random Forest (RF), and Support Vector Machine (SVM) models. Moreover, we expect our method's accuracy could be improved by further optimizing the hyperparameter values and training the model for a longer period of time using a larger amount of data.

Application of data mining and statistical measurement of agricultural high-quality development

  • Yan Zhou
    • Advances in nano research
    • /
    • v.14 no.3
    • /
    • pp.225-234
    • /
    • 2023
  • In this study, we aim to use big data resources and statistical analysis to obtain a reliable instruction to reach high-quality and high yield agricultural yields. In this regard, soil type data, raining and temperature data as well as wheat production in each year are collected for a specific region. Using statistical methodology, the acquired data was cleaned to remove incomplete and defective data. Afterwards, using several classification methods in machine learning we tried to distinguish between different factors and their influence on the final crop yields. Comparing the proposed models' prediction using statistical quantities correlation factor and mean squared error between predicted values of the crop yield and actual values the efficacy of machine learning methods is discussed. The results of the analysis show high accuracy of machine learning methods in the prediction of the crop yields. Moreover, it is indicated that the random forest (RF) classification approach provides best results among other classification methods utilized in this study.