• Title/Summary/Keyword: random forest classification

Developing a Predictive Model of Young Job Seekers' Preference for Hidden Champions Using Machine Learning and Analyzing the Relative Importance of Preference Factors (머신러닝을 활용한 청년 구직자의 강소기업 선호 예측모형 개발 및 요인별 상대적 중요도 분석)

  • Cho, Yoon Ju;Kim, Jin Soo;Bae, Hwan seok;Yang, Sung-Byung;Yoon, Sang-Hyeak
    • The Journal of Information Systems / v.32 no.4 / pp.229-245 / 2023
  • Purpose: This study aims to understand the inclinations of young job seekers towards "hidden champions" - small but competitive companies that are emerging as potential solutions to the growing disparity between youth-targeted job vacancies and job seekers. We utilize machine learning techniques to discern the appeal of these hidden champions. Design/methodology/approach: We examined the characteristics of small and medium-sized enterprises using data sourced from the Ministry of Employment and Labor and Youth Worknet. By comparing the efficacy of five machine learning classification models (i.e., Logistic Regression, Random Forest Classifier, Gradient Boosting Classifier, LGBM Classifier, and XGB Classifier), we discovered that the predictive model utilizing the LGBM Classifier yielded the most consistent performance. Findings: Our analysis of the relative significance of preference determinants revealed that industry type, geographical location, and employee count are pivotal factors influencing preference. Drawing from these insights, we propose targeted strategic interventions for policymakers, hidden champions, and young job seekers.

Comparative analysis of model performance for predicting the customer of cafeteria using unstructured data

  • Seungsik Kim;Nami Gu;Jeongin Moon;Keunwook Kim;Yeongeun Hwang;Kyeongjun Lee
    • Communications for Statistical Applications and Methods / v.30 no.5 / pp.485-499 / 2023
  • This study aimed to predict the number of meals served in a group cafeteria using machine learning methodology. Features of the menu were created through the Word2Vec methodology and clustering, and a stacking ensemble model was constructed using Random Forest, Gradient Boosting, and CatBoost as sub-models. Results showed that CatBoost had the best performance with the ensemble model showing an 8% improvement in performance. The study also found that the date variable had the greatest influence on the number of diners in a cafeteria, followed by menu characteristics and other variables. The implications of the study include the potential for machine learning methodology to improve predictive performance and reduce food waste, as well as the removal of subjective elements in menu classification. Limitations of the research include limited data cases and a weak model structure when new menus or foreign words are not included in the learning data. Future studies should aim to address these limitations.

Predictive Model for Evaluating Startup Technology Efficiency: A Data Envelopment Analysis (DEA) Approach Focusing on Companies Selected by TIPS, a Private-led Technology Startup Support Program

  • Jeongho Kim;Hyunmin Park;JooHee Oh
    • International Journal of Advanced Culture Technology / v.12 no.2 / pp.167-179 / 2024
  • This study addresses the challenge of objectively evaluating the performance of early-stage startups amidst limited information and uncertainty. Focusing on companies selected by TIPS, a leading private sector-driven startup support policy in Korea, the research develops a new indicator to assess technological efficiency. By analyzing various input and output variables collected from Crunchbase and KIND (Korea Investor's Network for Disclosure System) databases, including technology use metrics, patents, and Crunchbase rankings, the study derives technological efficiency for TIPS-selected startups. A prediction model is then developed utilizing machine learning techniques such as Random Forest and boosting (XGBoost) to classify startups into efficiency percentiles (10th, 30th, and 50th). The results indicate that prediction accuracy improves with higher percentiles based on the technical efficiency index, providing valuable insights for evaluating and predicting startup performance in early markets characterized by information scarcity and uncertainty. Future research directions should focus on assessing growth potential and sustainability using the developed classification and prediction models, aiding investors in making data-driven investment decisions and contributing to the development of the early startup ecosystem.

Sentiment Analysis on 'HelloTalk' App Reviews Using NRC Emotion Lexicon and GoEmotions Dataset

  • Simay Akar;Yang Sok Kim;Mi Jin Noh
    • Smart Media Journal / v.13 no.6 / pp.35-43 / 2024
  • During the post-pandemic period, interest in foreign language learning surged, leading to increased usage of language-learning apps. With the rising demand for these apps, analyzing app reviews becomes essential, as they provide valuable insights into user experiences and suggestions for improvement. This research focuses on extracting insights into users' opinions, sentiments, and overall satisfaction from reviews of HelloTalk, one of the most renowned language-learning apps. We employed topic modeling and emotion analysis approaches to analyze reviews collected from the Google Play Store. Several experiments were conducted to evaluate the performance of sentiment classification models with different settings. In addition, we identified dominant emotions and topics within the app reviews using feature importance analysis. The experimental results show that the Random Forest model with topics and emotions outperforms other approaches in accuracy, recall, and F1 score. The findings reveal that topics emphasizing language learning and community interactions, as well as the use of language learning tools and the learning experience, are prominent. Moreover, the emotions of 'admiration' and 'annoyance' emerge as significant factors across all models. This research highlights that incorporating emotion scores into the model and utilizing a broader range of emotion labels enhances model performance.

Study on failure mode prediction of reinforced concrete columns based on class imbalanced dataset

  • Mingyi Cai;Guangjun Sun;Bo Chen
    • Earthquakes and Structures / v.27 no.3 / pp.177-189 / 2024
  • Accurately predicting the failure modes of reinforced concrete (RC) columns is essential for structural design and assessment. In this study, the challenges of imbalanced datasets and complex feature selection in machine learning (ML) methods were addressed through an optimized ML approach. By combining feature selection and oversampling techniques, the prediction of seismic failure modes in rectangular RC columns was improved. Two feature selection methods were used to identify six input parameters. To tackle class imbalance, the Borderline-SMOTE1 algorithm was employed, enhancing the learning capabilities of the models for minority classes. Eight ML algorithms were trained and fine-tuned using k-fold shuffle split cross-validation and grid search. The results showed that the artificial neural network model achieved 96.77% accuracy, while k-nearest neighbor, support vector machine, and random forest models each achieved 95.16% accuracy. The balanced dataset led to significant improvements, particularly in predicting the flexure-shear failure mode, with accuracy increasing by 6%, recall by 8%, and F1 scores by 7%. The use of the Borderline-SMOTE1 algorithm significantly improved the recognition of samples at failure mode boundaries, enhancing the classification performance of models like k-nearest neighbor and decision tree, which are highly sensitive to data distribution and decision boundaries. This method effectively addressed class imbalance and selected relevant features without requiring complex simulations like traditional methods, proving applicable for discerning failure modes in various concrete members under seismic action.
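
The oversampling step named in the abstract above can be sketched in plain Python. This is a simplified illustration that follows the commonly described Borderline-SMOTE1 procedure (interpolating only from minority samples whose neighbourhood is mostly, but not entirely, made up of other classes). It is not the authors' implementation. The function names, the parameters `m`, `k` and `n_new`, and the toy coordinates are all assumptions made for this example.

```python
import math
import random

def nearest(point, candidates, k):
    """Return the k (features, label) pairs closest to point (Euclidean)."""
    return sorted(candidates, key=lambda c: math.dist(point, c[0]))[:k]

def borderline_smote1(X, y, minority, m=5, k=5, n_new=1, seed=0):
    """Sketch: synthesize minority samples near the class boundary."""
    rng = random.Random(seed)
    data = list(zip(X, y))
    minority_pts = [x for x, label in data if label == minority]
    synthetic = []
    for p in minority_pts:
        # m nearest neighbours among all other samples, any class
        neigh = nearest(p, [d for d in data if d[0] is not p], m)
        n_major = sum(1 for _, label in neigh if label != minority)
        # Keep only "danger" samples: at least half, but not all, of the
        # neighbours belong to other classes (all-other is treated as noise).
        if not (m / 2 <= n_major < m):
            continue
        # k nearest neighbours within the minority class only
        min_neigh = nearest(
            p, [(x, minority) for x in minority_pts if x is not p], k)
        if not min_neigh:
            continue
        for _ in range(n_new):
            q = rng.choice(min_neigh)[0]
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic

# Toy data (hypothetical): label 1 is the minority class.
X = [(2.0, 2.0), (2.2, 2.0), (5.0, 5.0), (5.2, 5.0),
     (1.8, 2.0), (2.0, 1.8), (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
new_points = borderline_smote1(X, y, minority=1, m=3, k=1)
print(new_points)
```

In this toy set only the two minority points next to the majority cluster are treated as borderline, so the synthetic points fall on the segment between them. The isolated minority pair around (5, 5) is left untouched.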

Investigating Factors Contributing to Inadequate Facility Safety Inspections and Diagnosis Services: A Machine Learning Approach (머신러닝 기반 시설물 안전 점검·진단용역 부실 판정 요인에 대한 연구)

  • Junyong Park;Chie Hoon Song
    • Journal of the Korean Society of Industry Convergence / v.27 no.4_2 / pp.897-908 / 2024
  • Evaluating the adequacy of facility safety inspection and diagnosis services performed by private enterprises is a time-consuming and administratively complex process. This study aims to analyze the determinants that could influence the rating of these safety inspection and diagnosis services using a data analytics approach. Through a comparative analysis of several machine learning algorithms suitable for multi-class classification, we selected the model with the best performance (Random Forest) and identified the main determinants using the permutation importance technique. Among the variables examined, "contract value," "days of service performed," and "adherence to fair market value" were found to be strongly correlated with the rating assessments. Furthermore, we discovered that the skills and expertise of service-performing personnel significantly impacted the rating. The results of this study can contribute to the enhancement of the current post-evaluation administrative processes and offer valuable insights into rating assessments by incorporating previously unexplored variables pertaining to both service providers and the services themselves.
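
The permutation importance technique mentioned in the abstract above can be illustrated with a short pure-Python sketch: shuffle one feature column at a time and record how much the model's accuracy drops. The rule-based `model`, the helper names, and the toy rows are hypothetical stand-ins chosen for this example. The study itself applied the technique to a trained Random Forest on its own data.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted
            X_perm = [row[:j] + (v,) + row[j + 1:]
                      for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model that looks only at feature 0; feature 1 is noise.
model = lambda row: int(row[0] > 0.5)
X = [(0.9, 3), (0.8, 1), (0.7, 4), (0.6, 1),
     (0.4, 5), (0.3, 9), (0.2, 2), (0.1, 6)]
y = [1, 1, 1, 1, 0, 0, 0, 0]
scores = permutation_importance(model, X, y)
print(scores)
```

Because the toy model ignores feature 1, shuffling that column never changes a prediction and its importance is zero, while shuffling feature 0 degrades accuracy.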

Feature Engineering and Evaluation for Android Malware Detection Scheme

  • Jaemin Jung;Jihyeon Park;Seong-je Cho;Sangchul Han;Minkyu Park;Hsin-Hung Cho
    • Journal of Internet Technology / v.22 no.2 / pp.423-439 / 2021
  • Android is one of the most popular platforms for mobile and Internet of Things (IoT) devices. This popularity has made Android-based devices a valuable target for malicious apps. Thus, it is essential to devise automatic and portable malware detection approaches for the Android platform. There are many studies on detecting mobile malware using machine learning techniques. In these studies, however, the dataset is imbalanced or is not large enough to generalize the machine learning model, or the dimensionality of features is too high to apply nonlinear classifiers. In this article, we propose a machine learning-based Android malware detection scheme that uses API calls and permissions as features. To restrict the dimensionality of features, we propose minimal domain knowledge-based and Gini importance-based feature selection. We construct large and balanced real-world datasets to build a generalized and non-skewed model and verify our model through experiments. We achieve 96.51% classification accuracy using a Random Forest classifier with low overhead. In addition, we provide a detailed analysis of misclassified samples. The analysis results show that API hiding can degrade the performance of API call information-based malware detection systems.
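
As a rough illustration of the quantity behind Gini importance, the sketch below computes the Gini impurity decrease obtained by splitting on a single binary feature (for example, whether an app requests a given permission). This is a one-split simplification: in a Random Forest the importance accumulates such weighted decreases over every node and tree that uses the feature. The feature vectors and labels are invented toy values, not data from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(feature, labels):
    """Impurity decrease from splitting on a 0/1 feature."""
    left = [y for f, y in zip(feature, labels) if f == 0]
    right = [y for f, y in zip(feature, labels) if f == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Toy labels: 1 = malware, 0 = benign (hypothetical).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 1, 0, 0, 0, 0]    # separates the classes perfectly
uninformative = [1, 1, 0, 0, 1, 1, 0, 0]  # independent of the label
print(gini_decrease(informative, labels))
print(gini_decrease(uninformative, labels))
```

Ranking features by this kind of score and keeping the top ones is one way to limit feature dimensionality. The threshold or number of features to keep is a design choice that the abstract does not specify.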

Phytosociological Community Classification of Mountain Ridge from Guryongryeong to Mt. Yaksu in the Baekdudaegan, Korea (백두대간의 구룡령에서 약수산 마루금의 식생구조 특성에 관한 연구)

  • An, Hyun-Chul;Choo, Gab-Chul;Park, Sam-Bong;Cho, Hyun-Seo;An, Jong-Bin;Park, Jeong-Geun;Ha, Hyoun Woo;Kim, Jin Joong;Kim, Bong-Gyu
    • Korean Journal of Environment and Ecology / v.28 no.6 / pp.741-750 / 2014
  • To investigate the vegetation structure of the mountain ridge from Guryongryeong to Mt. Yaksu, 22 plots ($100\,\mathrm{m}^2$) established by random sampling were surveyed. Cluster analysis classified the plots into three groups: the Quercus mongolica-Acer pseudosieboldianum community, the Q. mongolica community, and the Cornus controversa-Q. mongolica community. Q. mongolica was the major woody plant species in the ridge area from Guryongryeong to Yaksusan, while Carpinus cordata and C. controversa partly occupied some areas. High positive correlations were found between Q. mongolica and Symplocos chinensis for. pilosa, Rhododendron schlippenbachii; Tilia amurensis and Tilia mandshurica, Symplocos chinensis for. pilosa; Tilia mandshurica and S. chinensis for. pilosa, R. schlippenbachii; Betula costata and Acer mono; Symplocos chinensis for. pilosa and Rhododendron schlippenbachii, and relatively high negative correlations were found between A. pseudosieboldianum and S. chinensis for. pilosa, R. schlippenbachii. Species diversity (H') of the investigated groups ranged from 0.8170 to 1.1446, which was lower than that of the ridge areas of the national parks in the Baekdudaegan.
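
The species diversity index H' reported above is conventionally the Shannon index, H' = -Σ p_i log p_i, where p_i is the proportion of individuals (or importance value) belonging to species i. The sketch below computes it from raw counts. The abstract does not state the logarithm base, so `base` is left as a parameter, and the counts are invented for illustration.

```python
import math

def shannon_diversity(counts, base=math.e):
    """Shannon index H' = -sum(p_i * log(p_i)) over species with count > 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts if c > 0)

# Hypothetical plot with four equally abundant species
even_plot = [10, 10, 10, 10]
print(shannon_diversity(even_plot))           # natural log
print(shannon_diversity(even_plot, base=10))  # base-10 log
```

For S equally abundant species the index reaches its maximum, log S, and it falls towards zero as a single species dominates.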

A Machine Learning Approach for Mechanical Motor Fault Diagnosis (기계적 모터 고장진단을 위한 머신러닝 기법)

  • Jung, Hoon;Kim, Ju-Won
    • Journal of Korean Society of Industrial and Systems Engineering / v.40 no.1 / pp.57-64 / 2017
  • To reduce damage to major railroad components, which can interrupt railroad services, cause safety accidents, and generate unnecessary maintenance costs, the development of rolling stock maintenance technology is shifting, led by advanced countries, from preventive maintenance based on inspection periods to predictive maintenance. Furthermore, the demand for fault diagnosis and prognostic health management technology is increasing, both to maintain trust as system speeds increase and to reduce maintenance costs. The objective of this paper is to propose a highly reliable learning model using various machine learning algorithms that can be applied to critical rolling stock components. This paper presents a model for railway rolling stock component fault diagnosis and conducts a mechanical failure diagnosis of motor components by applying machine learning techniques, in order to ensure efficient maintenance support, along with a data preprocessing plan for component fault diagnosis. This paper first defines a failure diagnosis model for rolling stock components. The function-based algorithms ANFIS and SMO were used as machine learning techniques for generating the failure diagnosis model. Two tree-based algorithms, RandomForest and CART, were also employed. To evaluate the performance of the algorithms for diagnosing failures in motors as a critical railroad component, an experiment was carried out on 2 data sets with different classes (6 classes and 3 class levels). According to the results of the experiment, the random forest algorithm, a tree-based machine learning technique, showed the best performance.

Estimation of Fractional Vegetation Cover in Sand Dunes Using Multi-spectral Images from Fixed-wing UAV

  • Choi, Seok Keun;Lee, Soung Ki;Jung, Sung Heuk;Choi, Jae Wan;Choi, Do Yoen;Chun, Sook Jin
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.34 no.4 / pp.431-441 / 2016
  • Since UAVs (Unmanned Aerial Vehicles) are convenient for acquiring data over broad or inaccessible regions, they are now used to build spatial information in various fields, such as the environment, ecosystems, forestry, and military applications. This study suggests a process for estimating FVC (Fractional Vegetation Cover) from multi-spectral UAV imagery to overcome the limitations of conventional methods. Specifically, we propose generating the FVC map from multi-spectral images. First, two classification results were obtained with RF (Random Forest), one using RGB images and one using RGB images with NDVI (Normalized Difference Vegetation Index). Then, the resulting maps were reclassified into vegetation and non-vegetation. Finally, an RF-based FVC map was generated by pixel calculation, and a GI (Gutman and Ignatov) model-based FVC map was derived indirectly using fixed parameters. Adding NDVI yields relatively higher accuracy than using RGB alone, and the GI model in particular shows a lower RMSE (Root Mean Square Error), 0.182, than RF. In this regard, the applicability of the GI model, which uses only NDVI values, is higher than that of RF, whose accuracy varies with the classification results. Our results showed that the GI model ensures the quality of the FVC if the NDVI is maintained at a uniform level. This can be easily achieved by using a UAV, which can provide vegetation data to improve the estimation of FVC.
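
The two FVC estimates compared in the abstract above can be sketched as follows, under stated assumptions. NDVI is computed as (NIR - Red) / (NIR + Red), and the GI-style estimate scales NDVI linearly between a bare-soil value and a full-vegetation value. The default `ndvi_soil` and `ndvi_veg` constants are hypothetical placeholders, not the fixed parameters used in the paper. `fvc_from_classes` stands in for the pixel-counting step applied to a vegetation / non-vegetation classification map.

```python
import math

def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0

def fvc_gi(ndvi_value, ndvi_soil=0.05, ndvi_veg=0.85):
    """Linear NDVI scaling between soil and full vegetation, clipped to [0, 1].

    The default endpoints are illustrative assumptions only.
    """
    f = (ndvi_value - ndvi_soil) / (ndvi_veg - ndvi_soil)
    return min(1.0, max(0.0, f))

def fvc_from_classes(labels):
    """Share of pixels labelled vegetation (1) in a classified cell."""
    return sum(labels) / len(labels)

def rmse(predicted, reference):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical reflectances for a single pixel
value = ndvi(nir=0.5, red=0.1)
print(value, fvc_gi(value))
print(fvc_from_classes([1, 1, 0, 0]))
```

The GI-style estimate depends only on NDVI and two constants, whereas the pixel-counting estimate inherits any errors made by the classifier. That is consistent with the abstract's remark that RF accuracy varies with the classification results.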