• Title/Summary/Keyword: machine learning models


Semi-supervised learning for sentiment analysis in mass social media (대용량 소셜 미디어 감성분석을 위한 반감독 학습 기법)

  • Hong, Sola;Chung, Yeounoh;Lee, Jee-Hyong
    • Journal of the Korean Institute of Intelligent Systems / v.24 no.5 / pp.482-488 / 2014
  • This paper aims to analyze users' emotions automatically by analyzing Twitter, a representative social network service (SNS). Building sentiment analysis models with machine learning techniques requires sentiment labels that represent positive or negative emotions, but obtaining sentiment labels for tweets is very expensive. We therefore propose a sentiment analysis model based on self-training, which uses "data without sentiment labels" as well as "data with sentiment labels". In self-training, the labels of "data without sentiment labels" are determined using a model trained on "data with sentiment labels", and the model is then updated using both the "data with sentiment labels" and the newly labeled data. This technique improves sentiment analysis performance gradually. However, misclassifications of unlabeled data at an early stage affect model updates throughout the whole learning process, because the labels of unlabeled data never change once they are determined. Thus, the labels of "data without sentiment labels" need to be determined carefully. To obtain high performance with self-training, we propose three policies for updating the "data with sentiment labels" and conduct a comparative analysis. The first policy selects, among the newly labeled data, only the examples whose confidence is higher than a given threshold. The second policy chooses the same number of positive and negative examples from the newly labeled data in order to avoid the class imbalance problem. The third policy caps the number of newly labeled examples added at one time at a given maximum, so that the model is updated gradually. Experiments are conducted on the Stanford data set, which is classified into positive and negative classes. As a result, the learned model performs better than both the model learned using "data with sentiment labels" only and self-training with a regular model update policy.
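
The three update policies described above can be sketched as a single selection step inside one self-training round. This is a minimal illustration, not the authors' implementation: the function name `select_pseudo_labels`, the default values, and the decision to apply all three policies together are assumptions (the abstract says the policies are compared, not how they are combined).

```python
def select_pseudo_labels(pos_probs, threshold=0.9, max_new=4):
    """Pick unlabeled examples to move into the labeled set in one round.

    pos_probs: P(positive) for each unlabeled example, from the current model.
    Returns (index, label) pairs, where label 1 = positive and 0 = negative.
    All names and defaults here are hypothetical.
    """
    scored = []
    for i, p in enumerate(pos_probs):
        label = 1 if p >= 0.5 else 0
        confidence = p if label == 1 else 1.0 - p
        # Policy 1: keep only predictions above the confidence threshold.
        if confidence > threshold:
            scored.append((confidence, i, label))

    # Most confident predictions first (ties broken by index).
    scored.sort(key=lambda item: (-item[0], item[1]))
    positives = [(i, 1) for _, i, label in scored if label == 1]
    negatives = [(i, 0) for _, i, label in scored if label == 0]

    # Policy 2: take the same number of positive and negative examples.
    # Policy 3: never add more than max_new examples in one round.
    per_class = min(len(positives), len(negatives), max_new // 2)
    return positives[:per_class] + negatives[:per_class]


picked = select_pseudo_labels([0.99, 0.95, 0.97, 0.02, 0.6, 0.08, 0.93])
print(picked)  # indices and labels chosen for this round
```

Once selected, these labels stay fixed in later rounds, which is why the abstract stresses choosing them carefully.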

Development of New Variables Affecting Movie Success and Prediction of Weekly Box Office Using Them Based on Machine Learning (영화 흥행에 영향을 미치는 새로운 변수 개발과 이를 이용한 머신러닝 기반의 주간 박스오피스 예측)

  • Song, Junga;Choi, Keunho;Kim, Gunwoo
    • Journal of Intelligence and Information Systems / v.24 no.4 / pp.67-83 / 2018
  • The Korean film industry grew significantly every year and finally exceeded 200 million cumulative admissions in 2013. However, from 2015 the industry entered a period of low growth, and in 2016 it experienced negative growth. To overcome this difficulty, stakeholders such as production companies, distribution companies, and multiplexes have tried to maximize market returns by predicting market changes and responding to them immediately. Because a film is an experiential product, it is not easy to predict its box office record and initial audience size before release, and the number of audiences also fluctuates with a variety of factors after release. Production and distribution companies therefore try to secure a guaranteed number of screens from multiplex chains at the opening of a newly released film. However, multiplex chains tend to open the screening schedule for only one week at a time and then determine the number of screenings for the following week based on the box office record and audience evaluations. Many previous studies have dealt with predicting the box office records of films. Early studies attempted to identify factors affecting the box office record. More recent studies have applied various analytic techniques to the previously identified factors in order to improve prediction accuracy and to explain the effect of each factor, rather than identifying new factors. However, most previous studies are limited in that they used the total number of audiences from opening to the end of the run as the target variable, which makes it difficult to predict and respond to dynamically changing market demand. Therefore, the purpose of this study is to predict the weekly number of audiences of a newly released film so that stakeholders can respond flexibly to changes in the film's audience numbers. To that end, we considered the factors affecting box office that were used in previous studies and developed new factors not used before, such as the order of opening of movies and the dynamics of sales. Using these comprehensive factors, we applied machine learning methods such as Random Forest, Multilayer Perceptron, Support Vector Machine, and Naive Bayes to predict the cumulative number of visitors from the first to the third week after release. At the first and second weeks, we predicted the cumulative number of visitors in the forthcoming week for a released film, and at the third week, we predicted the total number of visitors of the film. In addition, we also predicted the total number of cumulative visitors at both the first and second weeks using the same factors. As a result, we found that the accuracy of predicting the number of visitors in the forthcoming week was higher than that of predicting the total number in all three weeks, and that Random Forest had the highest accuracy among the machine learning methods used. This study has implications in that it 1) comprehensively considered various factors that affect the box office record but were rarely addressed by previous studies, such as the weekly audience rating after release, the weekly rank of the film after release, and the weekly sales share after release, and 2) tried to predict and respond to dynamically changing market demand by suggesting models that predict the weekly number of audiences of newly released films.

Development of disaster severity classification model using machine learning technique (머신러닝 기법을 이용한 재해강도 분류모형 개발)

  • Lee, Seungmin;Baek, Seonuk;Lee, Junhak;Kim, Kyungtak;Kim, Soojun;Kim, Hung Soo
    • Journal of Korea Water Resources Association / v.56 no.4 / pp.261-272 / 2023
  • In recent years, natural disasters such as heavy rainfall and typhoons have occurred more frequently, and their severity has increased due to climate change. To reduce damage, the Korea Meteorological Administration (KMA) currently applies the same watch and warning criteria to all regions in Korea, based on the maximum cumulative rainfall over 3-hour and 12-hour durations. However, KMA's criteria do not consider the regional characteristics of damage caused by heavy rainfall and typhoon events. It is therefore necessary to develop new criteria that consider the regional characteristics of damage and the cumulative rainfall over each duration, establishing four stages: blue, yellow, orange, and red. A classification model for this four-stage disaster severity, called DSCM (Disaster Severity Classification Model), was developed using four machine learning models (Decision Tree, Support Vector Machine, Random Forest, and XGBoost). This study applied DSCM to the local governments of Seoul, Incheon, and Gyeonggi Province. To develop DSCM, we used rainfall, cumulative rainfall, maximum rainfall for durations of 3 and 12 hours, and antecedent rainfall as independent variables, and a 4-class damage scale for heavy rain and typhoon damage for each local government as the dependent variable. As a result, the Decision Tree model performed best, with an F1-Score of 0.56. We believe that the developed DSCM can help identify disaster risk at each stage and contribute to reducing damage through efficient, event-specific disaster management by local governments.
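
For readers unfamiliar with how a single F1-Score is obtained for a four-stage classifier, the sketch below computes a macro-averaged F1 over the blue, yellow, orange, and red stages. The abstract does not state which averaging the authors used, so macro-averaging is an assumption here, and the function name and sample labels are purely illustrative.

```python
def macro_f1(y_true, y_pred, classes=("blue", "yellow", "orange", "red")):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 if the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only (not data from the study).
observed = ["blue", "blue", "yellow", "orange", "red", "red"]
predicted = ["blue", "yellow", "yellow", "orange", "orange", "red"]
print(round(macro_f1(observed, predicted), 3))
```

Macro-averaging weights each stage equally, so rare severe stages such as red influence the score as much as common ones.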

Predicting Highway Concrete Pavement Damage using XGBoost (XGBoost를 활용한 고속도로 콘크리트 포장 파손 예측)

  • Lee, Yongjun;Sun, Jongwan
    • Korean Journal of Construction Engineering and Management / v.21 no.6 / pp.46-55 / 2020
  • The maintenance cost of highway pavement is gradually increasing due to the continuous growth in total road length and the increasing number of old routes that have exceeded their service period. As a result, there is a need for a method of minimizing costs through preventive maintenance. Preventive maintenance requires a strategic plan based on accurate prediction of damage to aging highway pavement. Therefore, in this study, XGBoost, a machine learning classification model, was used to develop a highway pavement damage prediction model. First, we addressed the imbalanced data issue through data sampling, and then developed a predictive model using XGBoost. The model was evaluated using performance indicators such as accuracy and F1 score. As a result, the over-sampling method showed the best performance. The main variables affecting road damage were, in order of importance, the number of years in service, ESAL (equivalent single axle load), and the number of days on which the minimum temperature falls below -2 degrees Celsius. If the performance of the prediction model is improved through further data accumulation and more detailed data pre-processing, more accurate prediction of sections requiring maintenance is expected to become possible. The model is also expected to provide important basic information for estimating the highway pavement maintenance budget.
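
The over-sampling step mentioned above can be illustrated with plain random over-sampling, which duplicates minority-class rows until the classes are balanced. The abstract does not say which over-sampling variant was used, so this choice, the function name `random_oversample`, and the toy data are assumptions for illustration only.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) to reach the majority-class size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy example: four undamaged sections (0) and one damaged section (1).
X = [[3, 1.2], [5, 0.8], [7, 2.1], [4, 1.0], [21, 6.5]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count(0), y_bal.count(1))
```

In practice the resampling would be applied only to the training split, so that the evaluation data keeps the original class ratio.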

Analysis of Important Indicators of TCB Using GBM (일반화가속모형을 이용한 기술신용평가 주요 지표 분석)

  • Jeon, Woo-Jeong(Michael);Seo, Young-Wook
    • The Journal of Society for e-Business Studies / v.22 no.4 / pp.159-173 / 2017
  • To provide technology financing to technology-based small and medium-sized venture companies, the government implemented the TCB evaluation, a kind of technology rating evaluation, carried out by Kibo and qualified private TCBs. In this paper, we briefly review the current state of TCB evaluation and the available technology evaluation indicators accumulated in the Korea Credit Information Services (TDB), and then use multiple regression to explore the indicators that have a significant effect on the technology rating score. The relative importance of the indicators and the classification accuracy were then calculated by applying the key indicators as independent features to the generalized boosting model (GBM), a representative machine learning classifier, in terms of class influence and the fit of each model. The analysis showed that the relative importance of the indicators did not differ significantly between the two models. However, the GBM model placed more weight on the InnoBiz certification, R&D department, patent registration, and venture confirmation indicators than the regression model did.

Calculation of Stability Number of Tetrapods Using Weights and Biases of ANN Model (인공신경망 모델의 가중치와 편의를 이용한 테트라포드의 안정수 계산 방법)

  • Lee, Jae Sung;Suh, Kyung-Duck
    • Journal of Korean Society of Coastal and Ocean Engineers / v.28 no.5 / pp.277-283 / 2016
  • The Tetrapod is one of the most widely used concrete armor units for rubble mound breakwaters. Calculating the stability number of Tetrapods is necessary to determine their optimal weight. Many empirical formulas have been developed for this purpose, from the Hudson formula in the 1950s to the recent one developed by Suh and Kang. They were developed by using regression analysis on experimental data to determine the coefficients of an assumed formula. Recently, soft computing (or machine learning) methods have been introduced as large amounts of experimental data have become available, e.g. artificial neural network (ANN) models for rock armors. However, these methods are seldom used, probably because they did not significantly improve accuracy compared with the empirical formulas and/or because engineers are not familiar with them. In this study, we propose an explicit method to calculate the stability number of Tetrapods using the weights and biases of an ANN model. This method can be used by an engineer with basic knowledge of matrix operations, without requiring knowledge of ANNs, and it is more accurate than previous empirical formulas.
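
The idea of evaluating a trained ANN explicitly from its weights and biases can be sketched as follows. This is only an illustration of the general technique: it assumes a single hidden layer with a tanh activation and a scalar output, i.e. y = W2 * tanh(W1 * x + b1) + b2, and it omits any input or output scaling. The network shape, the activation, and the numbers below are placeholders, not values from the paper.

```python
import math


def ann_output(x, W1, b1, W2, b2):
    """Evaluate y = W2 * tanh(W1 * x + b1) + b2 for one input vector x.

    W1: hidden-layer weight matrix (one row per hidden neuron)
    b1: hidden-layer biases, W2: output weights, b2: output bias
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(W2, hidden)) + b2


# Placeholder weights for a network with 2 inputs and 2 hidden neurons.
W1 = [[0.4, -0.2], [0.1, 0.3]]
b1 = [0.0, -0.1]
W2 = [1.5, -0.7]
b2 = 0.2
print(ann_output([1.0, 2.0], W1, b1, W2, b2))
```

Written this way, the calculation needs only matrix-vector products and an elementwise function, which matches the abstract's point that no ANN-specific knowledge is required to apply it.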

Traffic Congestion Estimation by Adopting Recurrent Neural Network (순환인공신경망(RNN)을 이용한 대도시 도심부 교통혼잡 예측)

  • Jung, Hee jin;Yoon, Jin su;Bae, Sang hoon
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.16 no.6 / pp.67-78 / 2017
  • The cost of traffic congestion is increasing annually. In particular, congestion caused by central business district (CBD) traffic accounts for more than half of the total congestion cost. Recent advances in Big Data and AI have paved the way for the Fourth Industrial Revolution, and these new technologies are bringing tremendous changes to traffic information dissemination. Accurate and timely traffic information should ultimately help decrease traffic congestion cost. This study therefore focused on developing both recurrent and non-recurrent congestion prediction models for urban roads by adopting a Recurrent Neural Network (RNN), a branch of machine learning. Two hidden layers with the scaled conjugate gradient backpropagation algorithm were selected and tested. The analysis led the authors to 25 meaningful links, out of 33 total links, that have appropriate mean square errors. The authors concluded that the RNN model is a feasible model for predicting congestion.

Management Automation Technique for Maintaining Performance of Machine Learning-Based Power Grid Condition Prediction Model (기계학습 기반 전력망 상태예측 모델 성능 유지관리 자동화 기법)

  • Lee, Haesung;Lee, Byunsung;Moon, Sangun;Kim, Junhyuk;Lee, Heysun
    • KEPCO Journal on Electric Power and Energy / v.6 no.4 / pp.413-418 / 2020
  • The prediction accuracy of a machine learning model must be managed in order to prevent the performance of a power grid condition prediction model from degrading due to overfitting to the initial training data, and to keep the model usable in the field by maintaining its accuracy. In this paper, we propose an automation technique for maintaining model performance. It increases the accuracy and reliability of the prediction model by considering the characteristics of power grid state data, which change constantly due to various factors, and enables quality maintenance at a level applicable to the field. The proposed technique uses workflow management technology to model the series of tasks for maintaining the performance of the power grid condition prediction model as a workflow, and then automates it to make the work more efficient. In addition, the reliability of the performance results is secured by evaluating the prediction model with respect to both the degree of change in the statistical characteristics of the data and the level of generalization of the prediction, which has not been attempted in existing technology. In this way, the accuracy of the prediction model is maintained at a certain level, and new predictive models with excellent performance can be developed further. As a result, the proposed technique not only solves the problem of performance degradation of the predictive model, but also improves the field utilization of the condition prediction model in a complex power grid system.
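
The two checks named above, change in the statistical characteristics of the data and the level of generalization, could feed a retraining decision along the following lines. This is a hypothetical sketch, not the proposed technique itself: the function name `needs_retraining`, the mean-shift drift measure, the accuracy gap, and both threshold values are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def needs_retraining(train_values, recent_values, train_acc, recent_acc,
                     drift_limit=2.0, gap_limit=0.10):
    """Flag the model when the data drifts or generalization degrades.

    drift_limit and gap_limit are illustrative thresholds, not published values.
    """
    # Drift: shift of the recent mean, in units of the training standard deviation.
    spread = pstdev(train_values) or 1.0
    drift = abs(mean(recent_values) - mean(train_values)) / spread
    # Generalization: drop from training accuracy to accuracy on recent field data.
    gap = train_acc - recent_acc
    return drift > drift_limit or gap > gap_limit


training_load = [10, 12, 11, 9, 8]
print(needs_retraining(training_load, [10, 11, 9], 0.95, 0.93))   # stable data
print(needs_retraining(training_load, [20, 21, 19], 0.95, 0.93))  # shifted data
```

A workflow engine could run such a check on a schedule and start the retraining tasks only when it returns true.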

AutoML and Artificial Neural Network Modeling of Process Dynamics of LNG Regasification Using Seawater (해수 이용 LNG 재기화 공정의 딥러닝과 AutoML을 이용한 동적모델링)

  • Shin, Yongbeom;Yoo, Sangwoo;Kwak, Dongho;Lee, Nagyeong;Shin, Dongil
    • Korean Chemical Engineering Research / v.59 no.2 / pp.209-218 / 2021
  • First principle-based modeling studies have been performed to improve the heat exchange efficiency of the ORV and optimize its operation, but the heat transfer coefficient of the ORV varies irregularly with time and location, which makes the modeling process complex. In this study, FNN, LSTM, and AutoML-based modeling were performed to confirm the effectiveness of data-based modeling for complex systems. In terms of MSE, prediction accuracy was highest for LSTM, followed by AutoML and then FNN. AutoML, an automatic design method for machine learning models, outperformed the developed FNN, and its total model development time was 1/15 of that of LSTM, showing the potential of AutoML. The NG and seawater discharge temperatures predicted using LSTM and AutoML showed an error of less than 0.5 K. Using the predictive model, real-time optimization of the amount of LNG that can be vaporized with the ORV in winter was performed, confirming that up to 23.5% more LNG can be processed, and an ORV optimal operation guideline based on the developed dynamic prediction model was presented.

Analysis and estimation of species distribution of Mythimna seperata and Cnaphalocrocis medinalis with land-cover data under climate change scenario using MaxEnt (MaxEnt를 활용한 기후변화와 토지 피복 변화에 따른 멸강나방 및 혹명나방의 한국 내 분포 변화 분석과 예측)

  • Taechul Park;Hojung Jang;SoEun Eom;Kimoon Son;Jung-Joon Park
    • Korean Journal of Environmental Biology / v.40 no.2 / pp.214-223 / 2022
  • Among migratory insect pests, Mythimna seperata and Cnaphalocrocis medinalis are invasive pests carried into South Korea from southern China by the westerlies. Both use rice as a host; they injure rice leaves and inhibit rice growth. To understand the distribution of M. seperata and C. medinalis, it is important to understand environmental factors of their habitat such as temperature and humidity. This study built current and future habitat suitability models to understand the distribution of M. seperata and C. medinalis. Occurrence data, SSP (Shared Socio-economic Pathways) scenarios, and RCP (Representative Concentration Pathway) were applied to MaxEnt (Maximum Entropy), a machine learning model among SDMs (Species Distribution Models). The results show that M. seperata and C. medinalis aggregate on the west and south coasts, where host plants are available after migration from China. In the MaxEnt analysis, land-cover data contributed most, followed by the DEM (Digital Elevation Model). Among bioclimatic variables, BIO_4 (Temperature Seasonality) contributed most for M. seperata and BIO_2 (Mean Diurnal Range) for C. medinalis. The habitat suitability model predicted that M. seperata and C. medinalis could inhabit most rice paddies.