• Title/Summary/Keyword: 베이지안 회귀분석 (Bayesian regression analysis)


Investigating Opinion Mining Performance by Combining Feature Selection Methods with Word Embedding and BOW (Bag-of-Words) (속성선택방법과 워드임베딩 및 BOW (Bag-of-Words)를 결합한 오피니언 마이닝 성과에 관한 연구)

  • Eo, Kyun Sun;Lee, Kun Chang
    • Journal of Digital Convergence / v.17 no.2 / pp.163-170 / 2019
  • Over the past decade, the growth of the Web has explosively increased the volume of data. Feature selection is an important step in extracting valuable information from a large amount of data. This study proposes a novel opinion mining model that combines feature selection (FS) methods with word-embedding vectors (Word2vec) and BOW (bag-of-words) features. The FS methods adopted for this study are CFS (correlation-based FS) and IG (information gain). To select an optimal FS method, a number of classifiers were compared, ranging from LR (logistic regression), NN (neural network), and NBN (naive Bayesian network) to RF (random forest), RS (random subspace), and ST (stacking). Empirical results with electronics and kitchen datasets showed that the LR and ST classifiers combined with IG applied to BOW features yielded the best performance in opinion mining. Results with laptop and restaurant datasets revealed that the RF classifier using IG applied to Word2vec features achieved the best performance.
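
As a rough illustration of the IG (information gain) criterion named in the abstract above, the sketch below scores each word of a bag-of-words representation by how much it reduces class-label entropy. The toy reviews, labels, and function names are hypothetical and are not taken from the paper; the paper's actual pipeline is not described in the abstract.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Weighted entropy of the labels after splitting on presence/absence of the term.
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_term, without_term) if part)
    return entropy(labels) - conditional

# Made-up reviews, each represented as a bag (set) of words.
docs = [{"great", "battery"}, {"great", "screen"},
        {"poor", "battery"}, {"poor", "screen"}]
labels = ["pos", "pos", "neg", "neg"]

vocabulary = {w for d in docs for w in d}
ranked = sorted(vocabulary, key=lambda w: information_gain(docs, labels, w), reverse=True)
print(ranked[:2])  # the two highest-scoring words
```

In this toy data the sentiment words separate the classes perfectly, so they rank above the product-aspect words; a selection step would keep only the top-ranked features before training a classifier.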

Bayesian Model Selection for Linkage Analyses: Considering Collinear Predictors (연관분석을 위한 베이지안 모형 선택: 상호상관성 변수를 중심으로)

  • Suh, Young-Ju
    • The Korean Journal of Applied Statistics / v.18 no.3 / pp.533-541 / 2005
  • We identify the correct chromosome and locate the corresponding markers close to the QTL in the linkage analysis of a quantitative trait by using the SSVS method. We consider several markers linked to the QTL as well as to each other; the i.b.d. values at these loci therefore generate collinear predictors to be evaluated when using the SSVS approach. The results from considering only markers closely linked to two QTL simultaneously showed clear evidence in favor of the marker closest to the QTL considered over the other markers. The results of the analysis of collinear markers with SSVS showed high concordance with those obtained using traditional multiple regression. Based on this simulation study, we conclude that SSVS is quite useful for identifying linkage with multiple linked markers simultaneously for a complex quantitative trait.
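
For orientation, SSVS (stochastic search variable selection) is usually formulated with a two-component normal mixture prior on each regression coefficient. The block below gives that standard textbook form; the symbols $\gamma_j$, $\tau_j$, $c_j$, and $p_j$ are the conventional ones and are an assumption here, since the abstract does not state the exact prior used in the paper.

```latex
\begin{aligned}
\beta_j \mid \gamma_j &\sim (1-\gamma_j)\,\mathcal{N}\!\left(0,\ \tau_j^{2}\right)
                         + \gamma_j\,\mathcal{N}\!\left(0,\ c_j^{2}\tau_j^{2}\right),
                         \qquad c_j > 1, \\
\Pr(\gamma_j = 1) &= p_j .
\end{aligned}
```

With small $\tau_j$ the first component concentrates $\beta_j$ near zero, so the posterior probability of $\gamma_j = 1$ can be read as evidence that predictor $j$ (here, a marker) belongs in the model.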

Cancer incidence and mortality estimations in Busan by using spatial multi-level model (공간 다수준 분석을 이용한 부산지역 암발생 및 암사망 추정)

  • Ko, Younggyu;Han, Junhee;Yoon, Taeho;Kim, Changhoon;Noh, Maengseok
    • Journal of the Korean Data and Information Science Society / v.27 no.5 / pp.1169-1182 / 2016
  • Cancer is a leading cause of death in Korea and has become a major issue in health care. According to the Cause of Death Statistics (2014) by the National Statistical Office, the SMR (standardized mortality rate) in Busan was the highest among all cities. In this paper, we used data from the Busan Regional Cancer Center to estimate the cancer incidence rate and the cancer mortality rate. The data cover small areas of administrative units such as Gu/Dong from 2003 to 2009. All cancers, including the four major cancers (stomach, colorectal, lung, and liver cancer), were analyzed. We carried out model selection and parameter estimation using a spatial multi-level model that incorporates spatial correlation. For the spatial effects, a CAR (conditional autoregressive) model was assumed.

Forecasting of Customer's Purchasing Intention Using Support Vector Machine (Support Vector Machine 기법을 이용한 고객의 구매의도 예측)

  • Kim, Jin-Hwa;Nam, Ki-Chan;Lee, Sang-Jong
    • Information Systems Review / v.10 no.2 / pp.137-158 / 2008
  • The rapid development of various information technologies creates new opportunities in online and offline markets. In this changing market environment, customers have various demands for new products and services, and their power and influence on the markets grow stronger each year. Companies have therefore paid great attention to customer relationship management (CRM). In particular, personalized product recommendation systems, which recommend products and services based on a customer's private information or in-store purchasing behavior, are an important asset to most companies. CRM is one of the important business processes in which reliable information is mined from customer databases. Data mining techniques such as artificial intelligence are popular tools for extracting useful information and knowledge from these databases. In this research, we propose a recommendation system that predicts a customer's purchase intention. The customer's intention to purchase a specific product is predicted by applying data mining techniques to a receipt data set. The performance of the suggested method is compared with that of other data mining techniques.

Analysis of Elderly Drivers' Accident Models Considering Operations and Physical Characteristics (고령운전자 운전 및 신체특성을 반영한 교통사고 분석 연구)

  • Lim, Sam Jin;Park, Jun Tae;Kim, Young Il;Kim, Tae Ho
    • Journal of Korean Society of Transportation / v.30 no.6 / pp.37-46 / 2012
  • The number of traffic accidents caused by elderly drivers over the age of 65 has surged over the past ten years from 37,000 to 274,000 cases. The proportion of elderly drivers' accidents has jumped 3.1 times from 1.2% to 3.7% out of all traffic accidents, and traffic safety organizations are pursuing diverse measures to address the situation. Above all, connecting safety measures with in-depth research on the behavioral and physical characteristics of elderly drivers will prove vital. This study conducted empirical research linking driving characteristics and traffic accidents of elderly drivers based on the Driving Aptitude Test items and traffic accident data, which enabled the measurement of behavioral characteristics of elderly drivers. In developing the Influence Model, we applied the zero-inflated Poisson (ZIP) regression model and selected an accident prediction model based on the Bayesian Influence with regard to the ZIP regression model and the zero-inflated negative binomial (ZINB) regression model. According to the results of the AAE analysis, the ZIP regression model was more appropriate, and three variables (prediction of velocity, diversion, and cognitive ability) were found to influence traffic accidents caused by elderly drivers.
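
To make the zero-inflated Poisson (ZIP) model mentioned above concrete, here is a minimal sketch of its probability mass function and log-likelihood without covariates. The counts and parameter values are invented for illustration; the paper's regression version would additionally link `lam` and `pi` to driver characteristics, which is not shown here.

```python
import math

def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson.

    lam: Poisson rate of the count process.
    pi:  probability of a structural (extra) zero.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    # Zeros come either from the structural-zero part or from the Poisson part.
    return (pi if k == 0 else 0.0) + (1.0 - pi) * poisson

def zip_log_likelihood(counts, lam, pi):
    """Log-likelihood of independent counts under a ZIP(lam, pi) model."""
    return sum(math.log(zip_pmf(k, lam, pi)) for k in counts)

# Hypothetical accident counts per driver (mostly zeros, as is typical).
counts = [0, 0, 0, 1, 0, 2, 0, 0, 1, 0]
print(zip_log_likelihood(counts, lam=1.0, pi=0.5))
```

A ZINB model replaces the Poisson part with a negative binomial to allow extra dispersion; comparing the two fitted likelihoods (or a criterion derived from them) is one common way to choose between the models.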

Crime Incident Prediction Model based on Bayesian Probability (베이지안 확률 기반 범죄위험지역 예측 모델 개발)

  • HEO, Sun-Young;KIM, Ju-Young;MOON, Tae-Heon
    • Journal of the Korean Association of Geographic Information Studies / v.20 no.4 / pp.89-101 / 2017
  • Crime patterns differ according to not only locations and building uses but also the characteristics of the people who use a place and the spatial structures of the buildings and locations. Therefore, if spatial big data, which contain spatial and regional properties, can be utilized, proper crime prevention measures can be enacted. Recently, with the advent of big data and the revolutionary intelligent information era, predictive policing has emerged as a new paradigm for police activities. Based on 7,420 actual crime incidents occurring over three years in a typical provincial city, "J city," this study identified the areas in which crimes occurred and predicted risky areas. Spatial regression analysis was performed using spatial big data on physical and environmental variables only. Based on the results, using the street width, average number of building floors, building coverage ratio, and the type of use of the first floor (Type II neighborhood living facility, commercial facility, pleasure use, or residential use), this study established a Crime Incident Prediction Model (CIPM) based on Bayesian probability theory. The model was found to be suitable for crime prediction: in addition to an overlap analysis with the actual crime areas, the receiver operating characteristic (ROC) curve, which evaluated the accuracy of the model, showed an area under the curve (AUC) value of 0.8. It was also found that blocks where commercial and entertainment facilities are concentrated, blocks where the number of building floors is high, and blocks where commercial, entertainment, and residential facilities are mixed are high-risk areas. Unlike previous studies that explored the spatial distribution of crime and the factors influencing crime occurrence, this study provides a meaningful step toward the development of a crime prediction model.
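
The AUC figure quoted above can be read as the probability that a randomly chosen crime location receives a higher predicted risk than a randomly chosen non-crime location. The sketch below computes AUC directly from that pairwise definition; the scores and labels are made up, and the function name is hypothetical rather than anything used in the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented risk scores for five blocks; label 1 means a crime was recorded.
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
print(auc(scores, labels))
```

The pairwise form is quadratic in the number of examples, which is fine for a small illustration; for large data a rank-based computation gives the same value more efficiently.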

Identification of major risk factors association with respiratory diseases by data mining (데이터마이닝 모형을 활용한 호흡기질환의 주요인 선별)

  • Lee, Jea-Young;Kim, Hyun-Ji
    • Journal of the Korean Data and Information Science Society / v.25 no.2 / pp.373-384 / 2014
  • Data mining identifies patterns or correlations in large volumes of complexly structured data and predicts diverse outcomes. The technique is used in fields such as finance, telecommunication, circulation, and medicine. In this paper, we selected risk factors of respiratory diseases in the field of medicine. The data, taken from the Gyeongsangbuk-do database of the Community Health Survey conducted in 2012, were divided into a respiratory disease group and a healthy group. To select the major risk factors, we applied data mining techniques such as neural networks, logistic regression, Bayesian networks, C5.0, and CART. We divided the data into training and testing sets and applied the models built on the training data to the testing data. In a comparison of prediction accuracy, CART was identified as the best model. Depression, smoking, and stress were identified as the major risk factors of respiratory disease.

A case study of small area estimation about charter and monthly rent price index (소지역모형 추정기법을 활용한 전·월세 추정)

  • Lee, Seung Soo;Park, Won Ran;Chung, Sung Suk
    • Journal of the Korean Data and Information Science Society / v.28 no.2 / pp.327-337 / 2017
  • In this study we compared three models for small area estimation of the charter and monthly rent price index: the Fay-Herriot model, the hierarchical Bayes model, and a spatio-temporal model. Charter and monthly rent prices in Korea are an important issue these days because housing tenure is rapidly shifting from owner occupation to charter and monthly rent. The accuracy of the estimation was checked on four measures: ARB, ASRB, AAB, and ASD. Among the models applied, the spatio-temporal model performed best on these measures for small area estimation of the charter and monthly rent index.

A Development of Nonstationary Frequency Analysis Model using a Bayesian Multiple Non-crossing Quantile Regression Approach (베이지안 다중 비교차 분위회귀 분석 기법을 이용한 비정상성 빈도해석 모형 개발)

  • Uranchimeg, Sumiya;Kim, Yong-Tak;Kwon, Young-Jun;Kwon, Hyun-Han
    • Journal of Coastal Disaster Prevention / v.4 no.3 / pp.119-131 / 2017
  • Global warming under the influence of climate change and its direct impact on glaciers and sea level are well-known issues. However, there is a lack of research on indirect impacts of climate change, such as on coastal structure design, which is mainly based on a frequency analysis of water level under the stationary assumption, meaning that the maximum sea level will not vary significantly over time. In general, the stationary assumption may not be valid under a changing climate. Therefore, this study aims to develop a novel approach to explore possible distributional changes in annual maximum sea levels (AMSLs) and to provide estimates of the design water level for coastal structures using a multiple non-crossing quantile regression based nonstationary frequency analysis within a Bayesian framework. In this study, 20 tide gauge stations with more than 30 years of hourly records are considered. First, possible distributional changes in the AMSLs are explored, focusing on changes in the scale and location parameters of the probability distributions. Most of the AMSLs are found to show an upward-convergent/divergent pattern in the distribution, and a significance test on the distributional changes is then performed. We confirm that a stationary assumption under the current climate characteristics may lead to underestimation of the design sea level, which increases the failure risk of coastal structures. A detailed discussion of the role of distributional changes in the design water level is provided.
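
Two ingredients of the approach named above can be sketched briefly: the check (pinball) loss that defines a quantile regression fit, and the non-crossing condition that fitted quantile lines must stay ordered. The linear-in-time form, the coefficients, and the function names are assumptions made for illustration; the paper's Bayesian estimation procedure is not reproduced.

```python
def pinball_loss(y, q, tau):
    """Check loss for quantile level tau, observation y, and predicted quantile q."""
    u = y - q
    # Under-prediction (u >= 0) is weighted by tau, over-prediction by (1 - tau).
    return u * (tau - (1.0 if u < 0 else 0.0))

def quantiles_cross(lines, xs):
    """Return True if any lower quantile line exceeds a higher one on xs.

    lines: list of (intercept, slope) pairs ordered by increasing quantile level.
    xs:    covariate values (e.g. years) at which to check the ordering.
    """
    for x in xs:
        fitted = [a + b * x for a, b in lines]
        if any(low > high for low, high in zip(fitted, fitted[1:])):
            return True
    return False

# Hypothetical trend lines for the 0.5 and 0.9 quantiles of annual maximum sea level.
diverging = [(100.0, 0.2), (120.0, 0.5)]   # spread widens over time, never crosses
crossing = [(100.0, 1.0), (101.0, 0.5)]    # lower quantile overtakes the upper one
print(quantiles_cross(diverging, range(0, 50)), quantiles_cross(crossing, range(0, 50)))
```

Fitting each quantile level separately gives no guarantee against crossing, which is why joint (multiple) estimation with an ordering constraint is used when several quantiles are needed for design levels.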

Exploring Feature Selection Methods for Effective Emotion Mining (효과적 이모션마이닝을 위한 속성선택 방법에 관한 연구)

  • Eo, Kyun Sun;Lee, Kun Chang
    • Journal of Digital Convergence / v.17 no.3 / pp.107-117 / 2019
  • In the era of SNS, many people rely on it to express their emotions about various kinds of products and services. Therefore, for companies eager to investigate how their products and services are perceived in the market, emotion mining tasks using datasets from SNSs have become more important than ever. Basically, emotion mining is a branch of sentiment analysis that is based on BOW (bag-of-words) and TF-IDF. However, few studies on emotion mining adopt feature selection (FS) methods to look for an optimal set of features that ensures better results. This study therefore aims to propose FS methods for conducting emotion mining tasks more effectively. The study uses the Twitter and SemEval2007 datasets for its emotion mining experiments. We applied three FS methods: CFS (correlation-based FS), IG (information gain), and ReliefF. Emotion mining results were obtained by applying the selected features to nine classifiers. When DT (decision tree) is applied to the Twitter dataset, accuracy increases with the CFS, IG, and ReliefF methods. When LR (logistic regression) is applied to the SemEval2007 dataset, accuracy increases with the ReliefF method.