• Title/Summary/Keyword: SHapley Additive exPlanations (SHAP)

Search Result 23, Processing Time 0.026 seconds

Crime Prediction and Factor Analysis of Incheon Metropolitan City Using Explainable Artificial Intelligence (설명 가능 인공지능 기술을 적용한 인천광역시 범죄 예측 및 요인 분석)

  • Kim, Da-Hyun;Kim, You-Kyung;Kim, Hyon-Hee
    • Annual Conference of KIPS
    • /
    • 2022.11a
    • /
    • pp.513-515
    • /
    • 2022
  • 본 연구는 범죄를 발생시키는데 관련된 여러가지 요인들을 기반으로 범죄 예측 모델을 생성하고 설명 가능 인공지능 기술을 적용하여 인천 광역시를 대상으로 범죄 발생에 영향을 미치는 요인들을 분석하였다. 범죄 예측 모델 생성을 위해 XG Boost 알고리즘을 적용하였으며, 설명 가능 인공지능 기술로는 Shapley Additive exPlanations (SHAP)을 사용하였다. 기존 관련 사례들을 참고하여 범죄 예측에 사용된 변수를 선정하였고 변수에 대한 데이터는 공공 데이터를 수집하였다. 실험 결과 성매매단속 현황과 청소년 실종 가출 신고 현황이 범죄 발생에 큰 영향을 미치는 주요 요인으로 나타났다. 제안하는 모델은 범죄 발생 지역, 요인들을 미리 예측하여 제시함으로써 범죄 예방에 사용되는 인력자원, 물적자원 등을 용이하게 쓸 수 있도록 활용할 수 있다.

A Data-Driven Causal Analysis on Fatal Accidents in Construction Industry (건설 사고사례 데이터 기반 건설업 사망사고 요인분석)

  • Jiyoon Choi;Sihyeon Kim;Songe Lee;Kyunghun Kim;Sudong Lee
    • Journal of the Korea Safety Management & Science
    • /
    • v.25 no.3
    • /
    • pp.63-71
    • /
    • 2023
  • The construction industry stands out for its higher incidence of accidents in comparison to other sectors. A causal analysis of the accidents is necessary for effective prevention. In this study, we propose a data-driven causal analysis to find significant factors of fatal construction accidents. We collected 14,318 cases of structured and text data of construction accidents from the Construction Safety Management Integrated Information (CSI). For the variables in the collected dataset, we first analyze their patterns and correlations with fatal construction accidents by statistical analysis. In addition, machine learning algorithms are employed to develop a classification model for fatal accidents. The integration of SHAP (SHapley Additive exPlanations) allows for the identification of root causes driving fatal incidents. As a result, the outcome reveals the significant factors and keywords wielding notable influence over fatal accidents within construction contexts.

Explainable Photovoltaic Power Forecasting Scheme Using BiLSTM (BiLSTM 기반의 설명 가능한 태양광 발전량 예측 기법)

  • Park, Sungwoo;Jung, Seungmin;Moon, Jaeuk;Hwang, Eenjun
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.8
    • /
    • pp.339-346
    • /
    • 2022
  • Recently, the resource depletion and climate change problem caused by the massive usage of fossil fuels for electric power generation has become a critical issue worldwide. According to this issue, interest in renewable energy resources that can replace fossil fuels is increasing. Especially, photovoltaic power has gaining much attention because there is no risk of resource exhaustion compared to other energy resources and there are low restrictions on installation of photovoltaic system. In order to use the power generated by the photovoltaic system efficiently, a more accurate photovoltaic power forecasting model is required. So far, even though many machine learning and deep learning-based photovoltaic power forecasting models have been proposed, they showed limited success in terms of interpretability. Deep learning-based forecasting models have the disadvantage of being difficult to explain how the forecasting results are derived. To solve this problem, many studies are being conducted on explainable artificial intelligence technique. The reliability of the model can be secured if it is possible to interpret how the model derives the results. Also, the model can be improved to increase the forecasting accuracy based on the analysis results. Therefore, in this paper, we propose an explainable photovoltaic power forecasting scheme based on BiLSTM (Bidirectional Long Short-Term Memory) and SHAP (SHapley Additive exPlanations).

A Study on Efficient AI Model Drift Detection Methods for MLOps (MLOps를 위한 효율적인 AI 모델 드리프트 탐지방안 연구)

  • Ye-eun Lee;Tae-jin Lee
    • Journal of Internet Computing and Services
    • /
    • v.24 no.5
    • /
    • pp.17-27
    • /
    • 2023
  • Today, as AI (Artificial Intelligence) technology develops and its practicality increases, it is widely used in various application fields in real life. At this time, the AI model is basically learned based on various statistical properties of the learning data and then distributed to the system, but unexpected changes in the data in a rapidly changing data situation cause a decrease in the model's performance. In particular, as it becomes important to find drift signals of deployed models in order to respond to new and unknown attacks that are constantly created in the security field, the need for lifecycle management of the entire model is gradually emerging. In general, it can be detected through performance changes in the model's accuracy and error rate (loss), but there are limitations in the usage environment in that an actual label for the model prediction result is required, and the detection of the point where the actual drift occurs is uncertain. there is. This is because the model's error rate is greatly influenced by various external environmental factors, model selection and parameter settings, and new input data, so it is necessary to precisely determine when actual drift in the data occurs based only on the corresponding value. There are limits to this. Therefore, this paper proposes a method to detect when actual drift occurs through an Anomaly analysis technique based on XAI (eXplainable Artificial Intelligence). As a result of testing a classification model that detects DGA (Domain Generation Algorithm), anomaly scores were extracted through the SHAP(Shapley Additive exPlanations) Value of the data after distribution, and as a result, it was confirmed that efficient drift point detection was possible.

A Study on the Employee Turnover Prediction using XGBoost and SHAP (XGBoost와 SHAP 기법을 활용한 근로자 이직 예측에 관한 연구)

  • Lee, Jae Jun;Lee, Yu Rin;Lim, Do Hyun;Ahn, Hyun Chul
    • The Journal of Information Systems
    • /
    • v.30 no.4
    • /
    • pp.21-42
    • /
    • 2021
  • Purpose In order for companies to continue to grow, they should properly manage human resources, which are the core of corporate competitiveness. Employee turnover means the loss of talent in the workforce. When an employee voluntarily leaves his or her company, it will lose hiring and training cost and lead to the withdrawal of key personnel and new costs to train a new employee. From an employee's viewpoint, moving to another company is also risky because it can be time consuming and costly. Therefore, in order to reduce the social and economic costs caused by employee turnover, it is necessary to accurately predict employee turnover intention, identify the factors affecting employee turnover, and manage them appropriately in the company. Design/methodology/approach Prior studies have mainly used logistic regression and decision trees, which have explanatory power but poor predictive accuracy. In order to develop a more accurate prediction model, XGBoost is proposed as the classification technique. Then, to compensate for the lack of explainability, SHAP, one of the XAI techniques, is applied. As a result, the prediction accuracy of the proposed model is improved compared to the conventional methods such as LOGIT and Decision Trees. By applying SHAP to the proposed model, the factors affecting the overall employee turnover intention as well as a specific sample's turnover intention are identified. Findings Experimental results show that the prediction accuracy of XGBoost is superior to that of logistic regression and decision trees. Using SHAP, we find that jobseeking, annuity, eng_test, comm_temp, seti_dev, seti_money, equl_ablt, and sati_safe significantly affect overall employee turnover intention. In addition, it is confirmed that the factors affecting an individual's turnover intention are more diverse. Our research findings imply that companies should adopt a personalized approach for each employee in order to effectively prevent his or her turnover.

Machine learning-based probabilistic predictions of shear resistance of welded studs in deck slab ribs transverse to beams

  • Vitaliy V. Degtyarev;Stephen J. Hicks
    • Steel and Composite Structures
    • /
    • v.49 no.1
    • /
    • pp.109-123
    • /
    • 2023
  • Headed studs welded to steel beams and embedded within the concrete of deck slabs are vital components of modern composite floor systems, where safety and economy depend on the accurate predictions of the stud shear resistance. The multitude of existing deck profiles and the complex behavior of studs in deck slab ribs makes developing accurate and reliable mechanical or empirical design models challenging. The paper addresses this issue by presenting a machine learning (ML) model developed from the natural gradient boosting (NGBoost) algorithm capable of producing probabilistic predictions and a database of 464 push-out tests, which is considerably larger than the databases used for developing existing design models. The proposed model outperforms models based on other ML algorithms and existing descriptive equations, including those in EC4 and AISC 360, while offering probabilistic predictions unavailable from other models and producing higher shear resistances for many cases. The present study also showed that the stud shear resistance is insensitive to the concrete elastic modulus, stud welding type, location of slab reinforcement, and other parameters considered important by existing models. The NGBoost model was interpreted by evaluating the feature importance and dependence determined with the SHapley Additive exPlanations (SHAP) method. The model was calibrated via reliability analyses in accordance with the Eurocodes to ensure that its predictions meet the required reliability level and facilitate its use in design. An interactive open-source web application was created and deployed to the cloud to allow for convenient and rapid stud shear resistance predictions with the developed model.

Data-centric XAI-driven Data Imputation of Molecular Structure and QSAR Model for Toxicity Prediction of 3D Printing Chemicals (3D 프린팅 소재 화학물질의 독성 예측을 위한 Data-centric XAI 기반 분자 구조 Data Imputation과 QSAR 모델 개발)

  • ChanHyeok Jeong;SangYoun Kim;SungKu Heo;Shahzeb Tariq;MinHyeok Shin;ChangKyoo Yoo
    • Korean Chemical Engineering Research
    • /
    • v.61 no.4
    • /
    • pp.523-541
    • /
    • 2023
  • As accessibility to 3D printers increases, there is a growing frequency of exposure to chemicals associated with 3D printing. However, research on the toxicity and harmfulness of chemicals generated by 3D printing is insufficient, and the performance of toxicity prediction using in silico techniques is limited due to missing molecular structure data. In this study, quantitative structure-activity relationship (QSAR) model based on data-centric AI approach was developed to predict the toxicity of new 3D printing materials by imputing missing values in molecular descriptors. First, MissForest algorithm was utilized to impute missing values in molecular descriptors of hazardous 3D printing materials. Then, based on four different machine learning models (decision tree, random forest, XGBoost, SVM), a machine learning (ML)-based QSAR model was developed to predict the bioconcentration factor (Log BCF), octanol-air partition coefficient (Log Koa), and partition coefficient (Log P). Furthermore, the reliability of the data-centric QSAR model was validated through the Tree-SHAP (SHapley Additive exPlanations) method, which is one of explainable artificial intelligence (XAI) techniques. The proposed imputation method based on the MissForest enlarged approximately 2.5 times more molecular structure data compared to the existing data. Based on the imputed dataset of molecular descriptor, the developed data-centric QSAR model achieved approximately 73%, 76% and 92% of prediction performance for Log BCF, Log Koa, and Log P, respectively. Lastly, Tree-SHAP analysis demonstrated that the data-centric-based QSAR model achieved high prediction performance for toxicity information by identifying key molecular descriptors highly correlated with toxicity indices. Therefore, the proposed QSAR model based on the data-centric XAI approach can be extended to predict the toxicity of potential pollutants in emerging printing chemicals, chemical process, semiconductor or display process.

Performance Comparison and SHAP Interpretation of Movie Box Office Prediction Models Based on CatBoost and PyCaret (CatBoost와 PyCaret을 기반한 영화 박스오피스 예측 모델의 성능 비교 및 SHAP 해석)

  • Huiseong Kim;Jihoon Moon
    • Journal of Internet of Things and Convergence
    • /
    • v.10 no.5
    • /
    • pp.213-226
    • /
    • 2024
  • This study uses box office data collected by the Korean Film Council (KOFIC) to develop and compare predictive models for cinema attendance and revenue. Data preprocessing removed irrelevant variables and handled missing values separately for categorical and numerical data to ensure consistency. Exploratory data analysis identified key variables, including Seoul audience size, revenue, total number of screens, film genre, rating, and month of release, which revealed a strong correlation between Seoul audience size and revenue with box office performance. Based on this analysis, predictive models were developed using CatBoost and PyCaret AutoML. CatBoost was chosen for its effectiveness in handling categorical variables such as director name, production company, and genre, while PyCaret AutoML was chosen for its ability to automate the modeling process, making it easy for non-experts to compare different models. The performance of the models was evaluated using mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R2), with CatBoost demonstrating superior accuracy. In addition, the SHAP technique was used to interpret the models, identifying Seoul's audience size and revenue as the most significant predictors. This research presents reliable box office prediction models that will improve decision-making in the film industry and support the development of data-driven strategies.

Prediction of patent lifespan and analysis of influencing factors using machine learning (기계학습을 활용한 특허수명 예측 및 영향요인 분석)

  • Kim, Yongwoo;Kim, Min Gu;Kim, Young-Min
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.2
    • /
    • pp.147-170
    • /
    • 2022
  • Although the number of patent which is one of the core outputs of technological innovation continues to increase, the number of low-value patents also hugely increased. Therefore, efficient evaluation of patents has become important. Estimation of patent lifespan which represents private value of a patent, has been studied for a long time, but in most cases it relied on a linear model. Even if machine learning methods were used, interpretation or explanation of the relationship between explanatory variables and patent lifespan was insufficient. In this study, patent lifespan (number of renewals) is predicted based on the idea that patent lifespan represents the value of the patent. For the research, 4,033,414 patents applied between 1996 and 2017 and finally granted were collected from USPTO (US Patent and Trademark Office). To predict the patent lifespan, we use variables that can reflect the characteristics of the patent, the patent owner's characteristics, and the inventor's characteristics. We build four different models (Ridge Regression, Random Forest, Feed Forward Neural Network, Gradient Boosting Models) and perform hyperparameter tuning through 5-fold Cross Validation. Then, the performance of the generated models are evaluated, and the relative importance of predictors is also presented. In addition, based on the Gradient Boosting Model which have excellent performance, Accumulated Local Effects Plot is presented to visualize the relationship between predictors and patent lifespan. Finally, we apply Kernal SHAP (SHapley Additive exPlanations) to present the evaluation reason of individual patents, and discuss applicability to the patent evaluation system. This study has academic significance in that it cumulatively contributes to the existing patent life estimation research and supplements the limitations of existing patent life estimation studies based on linearity. It is academically meaningful that this study contributes cumulatively to the existing studies which estimate patent lifespan, and that it supplements the limitations of linear models. Also, it is practically meaningful to suggest a method for deriving the evaluation basis for individual patent value and examine the applicability to patent evaluation systems.

Preliminary Inspection Prediction Model to select the on-Site Inspected Foreign Food Facility using Multiple Correspondence Analysis (차원축소를 활용한 해외제조업체 대상 사전점검 예측 모형에 관한 연구)

  • Hae Jin Park;Jae Suk Choi;Sang Goo Cho
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.1
    • /
    • pp.121-142
    • /
    • 2023
  • As the number and weight of imported food are steadily increasing, safety management of imported food to prevent food safety accidents is becoming more important. The Ministry of Food and Drug Safety conducts on-site inspections of foreign food facilities before customs clearance as well as import inspection at the customs clearance stage. However, a data-based safety management plan for imported food is needed due to time, cost, and limited resources. In this study, we tried to increase the efficiency of the on-site inspection by preparing a machine learning prediction model that pre-selects the companies that are expected to fail before the on-site inspection. Basic information of 303,272 foreign food facilities and processing businesses collected in the Integrated Food Safety Information Network and 1,689 cases of on-site inspection information data collected from 2019 to April 2022 were collected. After preprocessing the data of foreign food facilities, only the data subject to on-site inspection were extracted using the foreign food facility_code. As a result, it consisted of a total of 1,689 data and 103 variables. For 103 variables, variables that were '0' were removed based on the Theil-U index, and after reducing by applying Multiple Correspondence Analysis, 49 characteristic variables were finally derived. We build eight different models and perform hyperparameter tuning through 5-fold cross validation. Then, the performance of the generated models are evaluated. The research purpose of selecting companies subject to on-site inspection is to maximize the recall, which is the probability of judging nonconforming companies as nonconforming. As a result of applying various algorithms of machine learning, the Random Forest model with the highest Recall_macro, AUROC, Average PR, F1-score, and Balanced Accuracy was evaluated as the best model. Finally, we apply Kernal SHAP (SHapley Additive exPlanations) to present the selection reason for nonconforming facilities of individual instances, and discuss applicability to the on-site inspection facility selection system. Based on the results of this study, it is expected that it will contribute to the efficient operation of limited resources such as manpower and budget by establishing an imported food management system through a data-based scientific risk management model.