• Title/Summary/Keyword: 온라인 마이닝 (online mining)


Consumer behavior prediction using Airbnb web log data (에어비앤비(Airbnb) 웹 로그 데이터를 이용한 고객 행동 예측)

  • An, Hyoin;Choi, Yuri;Oh, Raeeun;Song, Jongwoo
    • The Korean Journal of Applied Statistics, v.32 no.3, pp.391-404, 2019
  • Customers' fixed characteristics have often been used to predict customer behavior. As customer activity has moved from offline to online, it has become possible to track and collect large amounts of customer web log data; however, previous research has focused mainly on organizing the log data or describing its technical characteristics. In this study, we predict the decision-making time until each customer makes a first reservation, using Airbnb customer data provided on the Kaggle website. The data set includes basic customer information, such as gender and age, as well as web logs. We use various methodologies to find the optimal model and compare prediction errors with and without web log data. We consider six models, including Lasso, SVM, Random Forest, and XGBoost, to explore the effectiveness of the web log data. As a result, we choose Random Forest as our optimal model, with a misclassification rate of about 20%. In addition, we confirm that using web log data doubles the accuracy of customer behavior prediction compared to not using it.

Text Mining of Online News, Social Media, and Consumer Review on Artificial Intelligence Service (인공지능 서비스에 대한 온라인뉴스, 소셜미디어, 소비자리뷰 텍스트마이닝)

  • Li, Xu;Lim, Hyewon;Yeo, Harim;Hwang, Hyesun
    • Human Ecology Research, v.59 no.1, pp.23-43, 2021
  • This study used text mining to examine the status of virtual assistant services, explore the needs of consumers, and present consumer-oriented directions. Trendup 4.0 was used to analyze keywords about AI services in online news and social media from 2016 to 2020. The R program was used to collect consumer comment data and perform topic modeling. According to the analysis, the number of mentions of AI services in mass media and social media has steadily increased. The sentiment analysis showed that consumers felt positive about AI services in terms of functional aspects, such as usefulness and convenience, and emotional aspects, such as pleasure and interest. However, consumers also experienced complexity and difficulty with AI services and had concerns and fears about using them in the early stages of their introduction. The consumer review analysis yielded topics related to the technology and access process required for AI services to be provided (Technical Requirements), topics expressing negative feelings about AI services (Consumer Request), and topics about specific functions in the use of AI services (Consumer Life Support Area). Text mining enabled this study to confirm consumer expectations and concerns about AI services and to examine the areas of service support that consumers experienced. The review data on each platform also revealed that the potential needs of consumers could be met by expanding the scope of support services and applying platform-specific strengths to provide differentiated services.

A Study on the Quantitative Evaluation of Initial Coin Offering (ICO) Using Unstructured Data (비정형 데이터를 이용한 ICO(Initial Coin Offering) 정량적 평가 방법에 대한 연구)

  • Lee, Han Sol;Ahn, Sangho;Kang, Juyoung
    • Smart Media Journal, v.11 no.5, pp.63-74, 2022
  • Initial public offerings (IPOs) have a legal framework for investor protection, and because various quantitative evaluation factors exist, objective analysis is possible and many studies have been conducted. Crowdfunding likewise has several legal devices for investor protection that prevent indiscriminate funding. On the other hand, the blockchain-based initial coin offering (ICO), which has recently been in the spotlight, has only ambiguous legal means and standards for protecting investors and lacks quantitative methods for evaluating ICOs objectively. Therefore, this study collects ICO white papers published online, predicts ICO fraud using BERT, a text embedding technique, compares the results with those of the existing Random Forest machine learning technique, and shows the feasibility of fraud detection. This study is expected to contribute to quantitative research on ICO fraud detection by demonstrating that unstructured data can be used to identify fraudulent ICOs.

A Study on the Analysis of Influx Factors in Urban Parks Using Data Mining - Focus on Yangjae Citizens' Forest Park - (데이터 마이닝을 활용한 도시공원 유입 요인 분석 연구 - 양재시민의 숲 공원을 대상으로 -)

  • Park Sang Hun
    • Journal of the Korean Regional Science Association, v.39 no.3, pp.35-48, 2023
  • This study analyzed the inflow factors of Yangjae Citizens' Forest Park using social big data generated online. To this end, it examines the applicability of the emotional information analysis method as a way of analyzing perceptions of the city park and identifying differences in the park's characteristics and use. The analysis is based on big data, and because keyword network analysis is the core of the study, the 'emotional information analysis method' patented by the author was applied. The analysis showed that, among the influx factors of Yangjae Citizens' Forest recognized by citizens, the most positive emotional factor was related to 'park contents' and the negative emotional factor was related to 'park management'. These results suggest that more in-depth program development and operation are needed to discover 'park contents' when implementing urban park revitalization support projects in the future.

A CF-based Health Functional Recommender System using Extended User Similarity Measure (확장된 사용자 유사도를 이용한 CF-기반 건강기능식품 추천 시스템)

  • Sein Hong;Euiju Jeong;Jaekyeong Kim
    • Journal of Intelligence and Information Systems, v.29 no.3, pp.1-17, 2023
  • With the recent rapid development of information and communication technology (ICT) and the popularization of digital devices, the online market continues to grow, and we live in a flood of information. Customers therefore face an information overload problem: selecting products requires a lot of time and money. A personalized recommender system has become an essential methodology for addressing this issue, and collaborative filtering (CF) is the most widely used approach. Traditional recommender systems mainly use quantitative data such as rating values, which cannot fully reflect user preferences and results in poor recommendation accuracy. To solve this problem, studies that incorporate qualitative data, such as review contents, are being actively conducted. In this study, text mining was used to quantify user review contents. General CF consists of three steps: user-item matrix generation, Top-N neighborhood group search, and Top-K recommendation list generation. We propose a recommendation algorithm that applies an extended similarity measure, which uses quantified review contents in addition to user rating values. After review similarity is calculated by applying the TF-IDF, Word2Vec, and Doc2Vec techniques to the review contents, the extended similarity is created by combining user rating similarity with the quantified review contents. To verify the approach, we used user ratings and review data from the "Health and Personal Care" category of the e-commerce site Amazon. The proposed recommendation model using the extended similarity measure outperformed the traditional recommendation model using only a rating-based similarity measure. In addition, among the text mining techniques, the similarity obtained with TF-IDF showed the best performance when used in the neighbor group search and recommendation list generation steps.
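
As a rough illustration of the extended similarity idea in this abstract, the sketch below combines a rating-based cosine similarity with a TF-IDF review similarity. The abstract does not state how the two are combined, so the weighted average (`alpha`), the function names, and the toy users and reviews are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def tfidf_vectors(docs):
    """Turn {user: [tokens]} into {user: {term: tf-idf weight}}."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for user, tokens in docs.items():
        counts = Counter(tokens)
        vectors[user] = {
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def extended_similarity(rating_sim, review_sim, alpha=0.5):
    """Assumed combination rule: weighted average of the two similarities."""
    return alpha * rating_sim + (1.0 - alpha) * review_sim


# Hypothetical toy data: item ratings and tokenized review text per user.
ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 2},
    "u3": {"A": 1, "C": 5},
}
reviews = {
    "u1": ["joint", "pain", "relief"],
    "u2": ["joint", "relief", "fast"],
    "u3": ["sweet", "taste"],
}
review_vecs = tfidf_vectors(reviews)


def user_similarity(a, b, alpha=0.5):
    """Extended similarity between two users from ratings and reviews."""
    rating_sim = cosine(ratings[a], ratings[b])
    review_sim = cosine(review_vecs[a], review_vecs[b])
    return extended_similarity(rating_sim, review_sim, alpha)


# Neighbor search step: rank the other users by extended similarity to u1.
neighbors = sorted(
    (u for u in ratings if u != "u1"),
    key=lambda u: user_similarity("u1", u),
    reverse=True,
)
print(neighbors)
```

Setting `alpha=1.0` reduces this to the rating-only baseline, which makes it easy to compare the two variants on the same data.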

Predicting stock movements based on financial news with systematic group identification (시스템적인 군집 확인과 뉴스를 이용한 주가 예측)

  • Seong, NohYoon;Nam, Kihwan
    • Journal of Intelligence and Information Systems, v.25 no.3, pp.1-17, 2019
  • Because stock price forecasting is an important issue both academically and practically, research on stock price prediction has been actively conducted. Stock price forecasting research is classified into studies using structured data and studies using unstructured data. With structured data such as historical stock prices and financial statements, past studies usually used technical analysis and fundamental analysis. In the big data era, the amount of information has increased rapidly, and artificial intelligence methodologies that find meaning by quantifying text, an unstructured form of data that accounts for a large share of that information, have developed rapidly. With these developments, many attempts are being made to predict stock prices from online news by applying text mining. The methodology adopted in many papers forecasts a stock price using news about the target company. However, according to previous research, a company's stock price is affected not only by its own news but also by news about related companies. Finding highly relevant companies is not easy because of market-wide effects and random signals. Thus, existing studies have identified highly relevant companies based primarily on predetermined international industry classification standards. However, according to recent research, the degree of homogeneity differs across sectors of the global industry classification standard, so forecasting stock prices with all companies in a sector taken together, rather than with only the relevant companies, can adversely affect predictive performance. To overcome this limitation, we first use random matrix theory together with text mining for stock prediction. When the dimension of the data is large, the classical limit theorems are no longer suitable because statistical efficiency is reduced.
Therefore, a simple correlation analysis in the financial market does not reveal the true correlation. To solve this issue, we adopt random matrix theory, which is mainly used in econophysics, to remove market-wide effects and random signals and find the true correlation between companies. With the true correlation, we perform cluster analysis to find relevant companies. Based on the clustering results, we then use a multiple kernel learning algorithm, an ensemble of support vector machines, to incorporate the effects of the target firm and its relevant firms simultaneously. Each kernel is assigned to predict stock prices from features of the financial news of the target firm and its relevant firms. The results of this study are as follows. (1) Consistent with existing research, we confirmed that using news from relevant companies is an effective way to forecast stock prices. (2) Identifying relevant companies in the wrong way can lower AI prediction performance. (3) The proposed approach with random matrix theory outperforms previous studies when cluster analysis is performed on the true correlation obtained by removing market-wide effects and random signals. The contributions of this study are as follows. First, this study shows that random matrix theory, which is used mainly in econophysics, can be combined with artificial intelligence to produce good methodologies. This suggests that it is important not only to develop AI algorithms but also to adopt theory from physics. It extends existing research that integrated artificial intelligence with complex system theory through transfer entropy. Second, this study stresses that finding the right companies in the stock market is an important issue. This suggests that it is important to study not only artificial intelligence algorithms but also how to adjust the input values on theoretical grounds.
Third, we confirmed that firms classified together under the Global Industry Classification Standard (GICS) might have low relevance to one another and suggested that relevance needs to be defined theoretically rather than simply taken from the GICS.
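
The abstract does not say how random matrix theory separates market-wide effects and random signals from genuine correlation. A common recipe in the econophysics literature, assumed here, compares the eigenvalues of the empirical correlation matrix with the Marchenko-Pastur range expected for pure noise. For N series and T observations, with q = N/T, that range is [(1 - sqrt(q))^2, (1 + sqrt(q))^2]. The sketch below uses made-up eigenvalues, and the function names are hypothetical.

```python
import math


def marchenko_pastur_bounds(n_assets, n_observations):
    """Eigenvalue range expected for a correlation matrix of pure noise."""
    if n_observations < n_assets:
        raise ValueError("this sketch assumes at least as many observations as assets")
    q = n_assets / n_observations
    lower = (1.0 - math.sqrt(q)) ** 2
    upper = (1.0 + math.sqrt(q)) ** 2
    return lower, upper


def split_eigenvalues(eigenvalues, n_assets, n_observations):
    """Split eigenvalues into market mode, informative modes, and noise."""
    _, upper = marchenko_pastur_bounds(n_assets, n_observations)
    ordered = sorted(eigenvalues, reverse=True)
    # Assumption: the largest eigenvalue is read as the market-wide mode.
    market = ordered[0]
    informative = [ev for ev in ordered[1:] if ev > upper]
    noise = [ev for ev in ordered[1:] if ev <= upper]
    return market, informative, noise


# Hypothetical setting: 100 stocks observed over 400 trading days.
bounds = marchenko_pastur_bounds(100, 400)
print(bounds)

# Made-up eigenvalues standing in for an empirical correlation spectrum.
market, informative, noise = split_eigenvalues(
    [25.0, 4.1, 3.0, 1.9, 0.8, 0.3], n_assets=100, n_observations=400
)
print(market, informative, noise)
```

Under this assumed recipe, only the informative modes would be kept when rebuilding the correlation matrix used for clustering; the market mode and the noise band would be discarded.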

Perception and Appraisal of Urban Park Users Using Text Mining of Google Maps Review - Cases of Seoul Forest, Boramae Park, Olympic Park - (구글맵리뷰 텍스트마이닝을 활용한 공원 이용자의 인식 및 평가 - 서울숲, 보라매공원, 올림픽공원을 대상으로 -)

  • Lee, Ju-Kyung;Son, Yong-Hoon
    • Journal of the Korean Institute of Landscape Architecture, v.49 no.4, pp.15-29, 2021
  • The study aims to grasp the perception and appraisal of urban park users through text analysis, using review data provided by Google Maps. Google Maps Review is an online review platform that provides information evaluating locations through social media and offers an understanding of locations from the perspective of general reviewers and regional guides who are registered as members of Google Maps. The study examined whether Google Maps reviews are useful for extracting meaningful information about user perceptions and appraisals for park management plans. Three urban parks in Seoul, South Korea, were chosen: Seoul Forest, Boramae Park, and Olympic Park. Review data for each park were collected via web crawling using Python. Through text analysis, the keywords and network structure characteristics of each park were analyzed. Park ratings were analyzed along with the text, and the reviews of residents and foreign tourists were compared. The common keywords found in the review comments for the three parks were "walking", "bicycle", "rest" and "picnic" for activities, "family", "child" and "dogs" for accompanying types, and "playground" and "walking trail" for park facilities. Looking at the characteristics of each park, reviews of Seoul Forest mention many nature-based outdoor activities, while the lack of parking spaces and weekend congestion negatively affected users. Boramae Park has the appearance of a city park, with various facilities providing numerous activities, but reviewers often cited the park's complexity and negative aspects related to dog-walking groups. At Olympic Park, large-scale complex facilities and cultural events were frequently mentioned, emphasizing its entertainment functions. Google Maps Review can serve as useful data for identifying park users' overall experiences and general feelings.
Compared to data from other social media sites, Google Maps Review data provide ratings along with factors that help explain user satisfaction and dissatisfaction.

Construction of Event Networks from Large News Data Using Text Mining Techniques (텍스트 마이닝 기법을 적용한 뉴스 데이터에서의 사건 네트워크 구축)

  • Lee, Minchul;Kim, Hea-Jin
    • Journal of Intelligence and Information Systems, v.24 no.1, pp.183-203, 2018
  • News articles are the most suitable medium for examining events occurring at home and abroad. In particular, as the development of information and communication technology has given rise to various kinds of online news media, the volume of news about events occurring in society has increased greatly. Automatically summarizing key events from massive amounts of news data therefore helps users survey many events at a glance. In addition, building and providing an event network based on the relevance of events can greatly help readers understand current events. In this study, we propose a method for extracting event networks from large news text data. To this end, we first collected Korean political and social articles from March 2016 to March 2017, kept only meaningful words through preprocessing, and merged synonyms using NPMI and Word2Vec. Latent Dirichlet allocation (LDA) topic modeling was used to calculate the topic distribution by date, and events were detected by finding peaks in that distribution. A total of 32 topics were extracted, and the time at which each event occurred was inferred from the point at which the corresponding topic distribution surged. As a result, a total of 85 events were detected, of which 16 were retained after filtering with a Gaussian smoothing technique and presented as the final events. We also calculated relevance scores between the detected events to construct the event network. Using the cosine coefficient between co-occurring events, we calculated the relevance between events and connected them accordingly. Finally, we built the event network by treating each event as a vertex and the relevance score between events as the weight of the edge connecting the vertices.
The event network constructed with our method helped us sort the major political and social events that occurred in Korea over the past year into chronological order and, at the same time, identify which events are related to one another. Our approach differs from existing event detection methods in that LDA topic modeling makes it easy to analyze large amounts of data and to identify relevance between events that was difficult to detect with existing methods. In text preprocessing, we applied various text mining techniques together with Word2Vec to improve the accuracy of extracting proper nouns and compound nouns, which has been difficult in the analysis of Korean texts. The event detection and network construction techniques in this study have the following advantages in practical application. First, LDA topic modeling, an unsupervised learning method, can easily extract topics, topic words, and their distributions from huge amounts of data. In addition, by using the date information of the collected news articles, the distribution of each topic can be expressed as a time series. Second, by calculating relevance scores and constructing an event network from the co-occurrence of topics, we can present the connections between events in a summarized form, which is difficult to grasp with existing event detection. This is supported by the fact that the inter-event relevance-based event network proposed in this study was actually constructed in order of occurrence time. The event network also makes it possible to identify the event that served as the starting point for a series of events. A limitation of this study is that LDA topic modeling yields different results depending on the initial parameters and the number of topics, and that the topic and event names in the analysis results must be assigned by the subjective judgment of the researcher.
Also, since each topic is assumed to be exclusive and independent, the relevance between topics is not taken into account. Subsequent studies need to calculate the relevance between events not covered in this study and between events that belong to the same topic.
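
The network construction step described in this abstract (events as vertices, relevance scores on the connections between them) can be sketched as follows. The abstract does not specify how events are represented before the cosine coefficient is computed, so the per-event intensity vectors, the event names, and the 0.5 threshold below are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine coefficient between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_event_network(event_vectors, threshold=0.5):
    """Return {(event_a, event_b): relevance} for pairs above the threshold.

    Each event is a vertex; each returned pair is a weighted edge.
    """
    edges = {}
    for a, b in combinations(sorted(event_vectors), 2):
        score = cosine(event_vectors[a], event_vectors[b])
        if score >= threshold:
            edges[(a, b)] = score
    return edges


# Hypothetical events, each described by its intensity over four time windows.
events = {
    "E1": [0.9, 0.8, 0.1, 0.0],
    "E2": [0.7, 0.9, 0.2, 0.0],
    "E3": [0.0, 0.1, 0.8, 0.9],
}
edges = build_event_network(events)
for (a, b), score in edges.items():
    print(a, b, round(score, 3))
```

In this toy input only E1 and E2 overlap strongly in time, so they are the only pair connected; lowering the threshold would add weaker edges.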

Information types and characteristics within the Wireless Emergency Alert in COVID-19: Focusing on Wireless Emergency Alerts in Seoul (코로나 19 하에서 재난문자 내의 정보유형 및 특성: 서울특별시 재난문자를 중심으로)

  • Yoon, Sungwook;Nam, Kihwan
    • Journal of Intelligence and Information Systems, v.28 no.1, pp.45-68, 2022
  • The central and local governments of the Republic of Korea provided information necessary for disaster response through wireless emergency alerts (WEAs) in order to overcome the rapidly spreading COVID-19 pandemic. Among all channels for delivering disaster information, the wireless emergency alert is the most efficient: because it uses the Cell Broadcast Service (CBS) method, which broadcasts directly to mobile phones, recipients can easily access disaster information on their phones without having to search for it. In this study, the characteristics of the wireless emergency alerts sent to Seoul over a period of one year and one month (January 2020 to January 2021) were derived through various text mining methodologies, and the types of information contained in the alerts were analyzed. In addition, the influence of the alerts on people's movement behavior was examined using population mobility by age in the districts of Seoul. After the keywords and information included in each message were classified, text analysis was performed with individual messages as the unit of analysis by applying a document cluster analysis technique based on the words they contain. The number of WEAs sent to Seoul has grown dramatically since the spread of COVID-19. In January 2020, only 10 WEAs were sent to Seoul, but the number of WEAs increased 5 times in March, and 7.7 times over the previous months. Because basic and regional local governments were authorized to send wireless emergency alerts independently, sending behavior differs for each local government.
Although most basic local governments sent more WEAs as the number of confirmed COVID-19 cases increased, the rate of increase relative to confirmed cases differed by region. Using a structured econometric model, the effect of the disaster information included in wireless emergency alerts on population mobility was measured by dividing it into a baseline effect and an accumulating effect. Six types of disaster information (date, order, online URL, symptom, location, and normative guidance) were identified in WEAs and analyzed through econometric modeling. The types of information that significantly change population mobility were found to differ by age. The population mobility of people in their 60s and 70s decreased when wireless emergency alerts included information related to date and order. Because date and order information appears in WEAs that report confirmed COVID-19 cases, these results show that the mobility of older people decreased as they reacted to messages reporting confirmed cases. Online information (URLs) decreased the population mobility of people in their 20s, and information related to symptoms reduced the population mobility of people in their 30s. On the other hand, normative words encouraging compliance with quarantine policies did not cause significant changes in population mobility at any age. This means that only meaningful information that is useful for disaster response should be included in wireless emergency alerts. Repeated sending of wireless emergency alerts reduces the magnitude of the impact of disaster information on population mobility. This indirectly shows that, under the prolonged pandemic, people grew tired of receiving repetitive WEAs with similar content and began to react less.
To use WEAs effectively for quarantine and for overcoming disaster situations, it is necessary to reduce the fatigue of recipients by sending WEAs only when needed and to raise awareness of WEAs.

Product Community Analysis Using Opinion Mining and Network Analysis: Movie Performance Prediction Case (오피니언 마이닝과 네트워크 분석을 활용한 상품 커뮤니티 분석: 영화 흥행성과 예측 사례)

  • Jin, Yu;Kim, Jungsoo;Kim, Jongwoo
    • Journal of Intelligence and Information Systems, v.20 no.1, pp.49-65, 2014
  • Word of Mouth (WOM) is a behavior used by consumers to transfer or communicate their product or service experience to other consumers. Due to the popularity of social media such as Facebook, Twitter, blogs, and online communities, electronic WOM (e-WOM) has become important to the success of products or services. As a result, most enterprises pay close attention to e-WOM for their products or services. This is especially important for movies, as these are experiential products. This paper aims to identify the network factors of an online movie community that impact box office revenue using social network analysis. In addition to traditional WOM factors (volume and valence of WOM), network centrality measures of the online community are included as influential factors in box office revenue. Based on previous research results, we develop five hypotheses on the relationships between potential influential factors (WOM volume, WOM valence, degree centrality, betweenness centrality, closeness centrality) and box office revenue. The first hypothesis is that the accumulated volume of WOM in online product communities is positively related to the total revenue of movies. The second hypothesis is that the accumulated valence of WOM in online product communities is positively related to the total revenue of movies. The third hypothesis is that the average of degree centralities of reviewers in online product communities is positively related to the total revenue of movies. The fourth hypothesis is that the average of betweenness centralities of reviewers in online product communities is positively related to the total revenue of movies. The fifth hypothesis is that the average of closeness centralities of reviewers in online product communities is positively related to the total revenue of movies.
To verify our research model, we collect movie review data from the Internet Movie Database (IMDb), which is a representative online movie community, and movie revenue data from the Box-Office-Mojo website. The movies in this analysis include weekly top-10 movies from September 1, 2012, to September 1, 2013, with in total. We collect movie metadata such as screening periods and user ratings; and community data in IMDb including reviewer identification, review content, review times, responder identification, reply content, reply times, and reply relationships. For the same period, the revenue data from Box-Office-Mojo is collected on a weekly basis. Movie community networks are constructed based on reply relationships between reviewers. Using a social network analysis tool, NodeXL, we calculate the averages of three centralities including degree, betweenness, and closeness centrality for each movie. Correlation analysis of focal variables and the dependent variable (final revenue) shows that three centrality measures are highly correlated, prompting us to perform multiple regressions separately with each centrality measure. Consistent with previous research results, our regression analysis results show that the volume and valence of WOM are positively related to the final box office revenue of movies. Moreover, the averages of betweenness centralities from initial community networks impact the final movie revenues. However, both of the averages of degree centralities and closeness centralities do not influence final movie performance. Based on the regression results, three hypotheses, 1, 2, and 4, are accepted, and two hypotheses, 3 and 5, are rejected. This study tries to link the network structure of e-WOM on online product communities with the product's performance. Based on the analysis of a real online movie community, the results show that online community network structures can work as a predictor of movie performance. 
The results show that the betweenness centralities of the reviewer community are critical for the prediction of movie performance. However, degree centralities and closeness centralities do not influence movie performance. As future research topics, similar analyses are required for other product categories such as electronic goods and online content to generalize the study results.
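
The study computed centralities with NodeXL. As a stand-in, the sketch below shows how an average betweenness centrality could be derived from reply relationships in plain Python using Brandes' algorithm. The reviewer names and reply pairs are invented, the graph is treated as undirected and unweighted, and the scores are left unnormalized; none of these choices is confirmed by the abstract.

```python
from collections import deque


def build_graph(reply_pairs):
    """Undirected adjacency sets from (responder, reviewer) reply pairs."""
    adj = {}
    for a, b in reply_pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def betweenness_centrality(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    scores = {v: 0.0 for v in adj}
    for source in adj:
        order = []                       # vertices in order of discovery
        preds = {v: [] for v in adj}     # shortest-path predecessors
        sigma = {v: 0 for v in adj}      # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices back to the source.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2.0 for v, s in scores.items()}


# Hypothetical reply relationships for one movie's review community.
replies = [("ann", "bob"), ("cat", "bob"), ("dan", "bob"), ("dan", "eve")]
graph = build_graph(replies)
betweenness = betweenness_centrality(graph)
average_betweenness = sum(betweenness.values()) / len(betweenness)
print(betweenness["bob"], betweenness["dan"], average_betweenness)
```

A per-movie average like `average_betweenness` is the kind of value that would enter the regression as a predictor alongside WOM volume and valence.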