• Title/Summary/Keyword: Text mining analysis


Methodology for Identifying Issues of User Reviews from the Perspective of Evaluation Criteria: Focus on a Hotel Information Site (사용자 리뷰의 평가기준 별 이슈 식별 방법론: 호텔 리뷰 사이트를 중심으로)

  • Byun, Sungho;Lee, Donghoon;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.22 no.3 / pp.23-43 / 2016
  • As a result of the growth of Internet data and the rapid development of Internet technology, "big data" analysis has gained prominence as a major approach for evaluating and mining enormous data for various purposes. Especially, in recent years, people tend to share their experiences related to their leisure activities while also reviewing others' inputs concerning their activities. Therefore, by referring to others' leisure activity-related experiences, they are able to gather information that might guarantee them better leisure activities in the future. This phenomenon has appeared throughout many aspects of leisure activities such as movies, traveling, accommodation, and dining. Apart from blogs and social networking sites, many other websites provide a wealth of information related to leisure activities. Most of these websites provide information of each product in various formats depending on different purposes and perspectives. Generally, most of the websites provide the average ratings and detailed reviews of users who actually used products/services, and these ratings and reviews can actually support the decision of potential customers in purchasing the same products/services. However, the existing websites offering information on leisure activities only provide the rating and review based on one stage of a set of evaluation criteria. Therefore, to identify the main issue for each evaluation criterion as well as the characteristics of specific elements comprising each criterion, users have to read a large number of reviews. In particular, as most of the users search for the characteristics of the detailed elements for one or more specific evaluation criteria based on their priorities, they must spend a great deal of time and effort to obtain the desired information by reading more reviews and understanding the contents of such reviews. 
Although some websites break down the evaluation criteria and direct users to enter their reviews according to different levels of criteria, the excessive number of input sections makes the whole process inconvenient for users. Further, problems may arise if a user does not follow the instructions for the input sections or fills in the wrong section. Finally, treating the evaluation criteria breakdown as a realistic alternative is difficult, because identifying all the detailed criteria for each evaluation criterion is a challenging task. For example, when reviewing a hotel, people tend to write only one-stage reviews for components such as accessibility, rooms, services, or food. These reviews may touch on the most frequently asked questions, such as the distance to the nearest subway station or the condition of the bathroom, but they still lack detailed information on these questions. In addition, when a breakdown of the evaluation criteria is provided along with various input sections, the user might fill in only the accessibility criterion, or enter the wrong information, such as information about rooms under the accessibility criterion. Thus, the reliability of the segmented review is greatly reduced. In this study, we propose an approach to overcome two limitations of the existing leisure activity information websites: (1) the low reliability of reviews for each evaluation criterion and (2) the difficulty of identifying the detailed contents that make up the evaluation criteria. In our proposed methodology, we first identify the review content and construct a lexicon for each evaluation criterion from the terms that are frequently used for that criterion. Next, the sentences in the review documents containing terms in the constructed lexicon are decomposed into review units, which are then regrouped by evaluation criterion.
Finally, the issues in the review units for each evaluation criterion are derived and summarized. The review units themselves are provided along with the derived issues. This approach therefore aims to save users time and effort, because they read only the information relevant to each evaluation criterion rather than the entire text of every review. Our proposed methodology is based on topic modeling, which is actively used in text analysis. Each review is decomposed into sentence units rather than treated as a single document. The resulting review units are reorganized according to each evaluation criterion and then used in the subsequent analysis, which distinguishes this work from existing topic modeling-based studies. In this paper, we collected 423 reviews from hotel information websites and decomposed them into 4,860 review units. We then reorganized the review units according to six evaluation criteria. By applying our methodology to these review units, we present the analysis results and demonstrate the utility of the proposed methodology.
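
The decomposition step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the lexicon contents, the sentence-splitting rule, and the function names `split_into_units` and `group_by_criterion` are all assumptions made for the example, and the topic modeling that the paper applies afterwards is not shown.

```python
import re
from collections import defaultdict

# Hypothetical lexicon (criterion -> frequent terms); not the paper's actual lexicon.
LEXICON = {
    "accessibility": {"subway", "station", "airport", "location"},
    "rooms": {"room", "bed", "bathroom"},
    "service": {"staff", "service", "reception"},
}


def split_into_units(review):
    """Split a review into sentence-level units at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]


def group_by_criterion(reviews, lexicon):
    """Assign each sentence unit to every criterion whose lexicon terms it contains."""
    grouped = defaultdict(list)
    for review in reviews:
        for unit in split_into_units(review):
            tokens = set(re.findall(r"[a-z]+", unit.lower()))
            for criterion, terms in lexicon.items():
                if tokens & terms:
                    grouped[criterion].append(unit)
    return dict(grouped)


reviews = [
    "The subway station is a five minute walk. The bathroom was spotless!",
    "Friendly staff at reception. Breakfast was average.",
]
units_by_criterion = group_by_criterion(reviews, LEXICON)
for criterion, units in units_by_criterion.items():
    print(criterion, units)
```

Sentences that match no lexicon term (here, the breakfast sentence) are simply dropped in this sketch; a sentence that matches several criteria is listed under each of them.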

How to improve the accuracy of recommendation systems: Combining ratings and review texts sentiment scores (평점과 리뷰 텍스트 감성분석을 결합한 추천시스템 향상 방안 연구)

  • Hyun, Jiyeon;Ryu, Sangyi;Lee, Sang-Yong Tom
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.219-239 / 2019
  • As providing customized services to individuals becomes more important, research on personalized recommendation systems is constantly being carried out. Collaborative filtering is one of the most popular approaches in academia and industry. However, it is limited in that recommendations are mostly based on quantitative information such as users' ratings, which lowers accuracy. To solve this problem, many studies have attempted to improve the performance of recommendation systems by using information beyond the quantitative ratings. A good example is sentiment analysis of customer review text. Nevertheless, existing research has not directly combined the results of sentiment analysis with quantitative rating scores in the recommendation system. Therefore, this study aims to reflect the sentiments expressed in reviews in the rating scores. In other words, we propose a new algorithm that converts a user's own review into quantitative information and reflects it directly in the recommendation system. To do this, we needed to quantify users' reviews, which are originally qualitative information. In this study, the sentiment score was calculated through sentiment analysis, a text mining technique. Movie review data were used. Based on the data, a domain-specific sentiment dictionary was constructed for movie reviews. Regression analysis was used to construct the sentiment dictionary: positive and negative dictionaries were each constructed using Lasso regression, Ridge regression, and ElasticNet. The accuracy of each dictionary was verified through a confusion matrix. The accuracy of the Lasso-based dictionary was 70%, that of the Ridge-based dictionary was 79%, and that of the ElasticNet (${\alpha}=0.3$) dictionary was 83%.
Therefore, in this study, the sentiment score of each review was calculated using the ElasticNet-based dictionary and combined with the rating to create a new rating. In this paper, we show that collaborative filtering that reflects the sentiment scores of user reviews is superior to the traditional method that considers only the existing rating. To show this, the proposed approach was applied to memory-based user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF), and the model-based matrix factorization methods SVD and SVD++. For each algorithm, the mean absolute error (MAE) and the root mean square error (RMSE) were calculated to compare a recommendation system that uses ratings combined with sentiment scores against a system that considers ratings only. In terms of MAE, performance improved by 0.059 for UBCF, 0.0862 for IBCF, 0.1012 for SVD, and 0.188 for SVD++. In terms of RMSE, it improved by 0.0431 for UBCF, 0.0882 for IBCF, 0.1103 for SVD, and 0.1756 for SVD++. These results confirm that collaborative filtering that reflects the sentiment score of user reviews predicts more accurately than conventional collaborative filtering that considers only the quantitative score. A paired t-test was then conducted to validate that the proposed model is the better approach, and it supported that conclusion. In this study, to overcome the limitation of previous research that judges a user's sentiment only by the quantitative rating score, the review was quantified so that the user's opinion is reflected in the recommendation system in a more refined way, improving accuracy.
The findings of this study are expected to have managerial implications for recommendation system developers, who need to consider both quantitative and qualitative information. The method of constructing the combined system in this paper might be used directly by developers.
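
The two error measures named in this abstract can be illustrated with a short sketch. The `mae` and `rmse` functions follow the standard definitions. The `blend` function is purely hypothetical: the abstract does not state how the sentiment score is combined with the rating, so a simple weighted average on a shared scale is assumed here, and the sample numbers are invented.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def blend(rating, sentiment, weight=0.5):
    """Hypothetical combination rule: linear mix of a rating and a sentiment
    score that are assumed to be on the same scale. Not taken from the paper."""
    return (1 - weight) * rating + weight * sentiment


# Invented example data (ratings on a 1-5 scale).
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print("MAE:", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("Blended rating:", blend(4.0, 2.0, weight=0.25))
```

Lower MAE and RMSE indicate better predictions, which is why the abstract reports its gains as reductions in these two values.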

A Study on the Changes in Perspectives on Unwed Mothers in S.Korea and the Direction of Government Polices: 1995~2020 Social Media Big Data Analysis (한국미혼모에 대한 관점 변화와 정부정책의 방향: 1995년~2020년 소셜미디어 빅데이터 분석)

  • Seo, Donghee;Jun, Boksun
    • Journal of the Korea Convergence Society / v.12 no.12 / pp.305-313 / 2021
  • This study collected and analyzed big data from 1995 to 2020, focusing on the keywords "unwed mother", "single mother," and "single mom" to present appropriate government support policy directions according to changes in perspectives on unwed mothers. The big data collection platform Textom was used to collect data from the portal search sites Naver and Daum and to refine the data. The refined data were subjected to the word frequency analysis, TF-IDF analysis, and N-gram analysis provided by Textom. In addition, network analysis and CONCOR analysis were conducted using the UCINET6 program. Similar words appeared in the word frequency analysis and the TF-IDF analysis, but they differed by year. In the N-gram analysis, there were similarities in which words appeared, but there were many differences in the frequency and form of the word sequences. The CONCOR analysis showed that different clusters were formed by year. This study confirms the change in perspectives on unwed mothers through big data analysis and suggests the need for policies that give unwed mothers, as independent women, a range of options, as well as policies that embrace pregnancy, childbirth, and parenting without discrimination within new family forms.
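
For readers unfamiliar with the measures mentioned above, the following is a minimal sketch of N-gram extraction (shown for bigrams) and TF-IDF scoring in plain Python. It uses one common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency); the exact formula Textom applies is not stated in the abstract, so this is an assumption, and the token lists are invented.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs (N-grams with N = 2)."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(docs):
    """Score each term in each tokenized document as tf * ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {term: (count / len(doc)) * math.log(n_docs / df[term])
             for term, count in tf.items()}
        )
    return scores


# Invented, already-tokenized documents.
docs = [
    ["unwed", "mother", "policy"],
    ["single", "mother", "support"],
    ["mother", "support", "policy"],
]
scores = tf_idf(docs)
print(bigrams(docs[0]))
print(scores[0])
```

A term that occurs in every document, such as "mother" here, receives a score of zero under this variant, which is why TF-IDF rankings can differ from raw frequency rankings.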

Trend of Research and Industry-Related Analysis in Data Quality Using Time Series Network Analysis (시계열 네트워크분석을 통한 데이터품질 연구경향 및 산업연관 분석)

  • Jang, Kyoung-Ae;Lee, Kwang-Suk;Kim, Woo-Je
    • KIPS Transactions on Software and Data Engineering / v.5 no.6 / pp.295-306 / 2016
  • The purpose of this paper is both to analyze research trends and to predict industrial flows using the metadata from previous studies on data quality. There have been many recent attempts to analyze research trends in various fields. However, analyses of previous studies on data quality have produced poor results because of the field's vast scope and volume of data. Therefore, in this paper, we used text mining and social network analysis for time series network analysis of data quality papers collected from the Web of Science index database, covering papers published in international data quality journals over 10 years. The analysis results are as follows. Research decreased in Mathematical & Computational Biology, Chemistry, Health Care Sciences & Services, Biochemistry & Molecular Biology, and Medical Information Science. In contrast, it increased in Environmental Sciences, Water Resources, Geology, and Instruments & Instrumentation. In addition, the social network analysis results show that the subjects with high centrality are analysis, algorithm, and network, and that image, model, sensor, and optimization are growing subjects in the data quality field. Furthermore, the industrial connection analysis on data quality shows a high correlation among technique, industry, health, infrastructure, and customer service. It also predicts that the Environmental Sciences, Biotechnology, and Health industries will continue to develop. This paper will be useful not only for people in the data quality industry but also for researchers who analyze research patterns and investigate industry connections in data quality.

Analysis of the Importance and Satisfaction of Viewing Quality Factors among Non-Audience in Professional Baseball According to Corona 19 (코로나 19에 따른 프로야구 무관중 시청품질요인의 중요도, 만족도 분석)

  • Baek, Seung-Heon;Kim, Gi-Tak
    • Journal of Korea Entertainment Industry Association / v.15 no.2 / pp.123-135 / 2021
  • This study used text mining and social network analysis in the Textom program, focusing on the keywords 'Corona 19 and professional baseball' and 'Corona 19 and professional baseball no spectators', to identify problems and to set the viewing quality variables. For quantitative analysis, a questionnaire on viewing quality was constructed, and out of 270 survey respondents, 250 questionnaires were used for the final study. To secure the validity and reliability of the questionnaire, exploratory factor analysis and reliability analysis were conducted. IPA analysis (importance-satisfaction) was then conducted on the validated questionnaire, and the results and strategies were presented. In the IPA analysis, factors related to the image (image composition, image coloration, image clarity, image enlargement and composition, high-quality image) were found in the first quadrant. The second quadrant contained the game situation (support team game level, support player game level, star player discovery, competition with rival teams), game information (match schedule information, player information check, team performance and player performance, game information), and interaction (consensus with the supporting team) factors. The commentator factors (baseball-related knowledge, communication ability, pronunciation and voice, use of standard language, introduction of game-related information) and interaction factors (real-time communication with the front desk, sympathy with viewers, information exchange such as chatting) also appeared.
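
The quadrant assignment behind an importance-satisfaction (IPA) analysis can be sketched as below. This is only an illustration under stated assumptions: the cut-offs are taken to be the mean importance and mean satisfaction across factors, the quadrants are labeled with the conventional Martilla and James names rather than quadrant numbers (numbering conventions vary between studies), and the scores are invented rather than taken from the survey.

```python
from statistics import mean


def ipa_quadrants(scores):
    """Classify each factor by comparing its (importance, satisfaction)
    pair with the mean importance and mean satisfaction of all factors."""
    imp_cut = mean(imp for imp, _ in scores.values())
    sat_cut = mean(sat for _, sat in scores.values())
    result = {}
    for factor, (imp, sat) in scores.items():
        if imp >= imp_cut and sat >= sat_cut:
            result[factor] = "keep up the good work"
        elif imp >= imp_cut:
            result[factor] = "concentrate here"
        elif sat < sat_cut:
            result[factor] = "low priority"
        else:
            result[factor] = "possible overkill"
    return result


# Invented 5-point scores: factor -> (importance, satisfaction).
survey = {
    "image clarity": (4.5, 4.4),
    "game information": (4.3, 3.2),
    "chatting": (3.0, 3.0),
    "commentator voice": (3.2, 4.2),
}
quadrants = ipa_quadrants(survey)
for factor, label in quadrants.items():
    print(f"{factor}: {label}")
```

Some studies use the scale midpoint or the median instead of the mean as the cut-off, which can move factors that sit near the boundary into a different quadrant.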

A Gap Analysis Using Spatial Data and Social Media Big Data Analysis Results of Island Tourism Resources for Sustainable Resource Management (지속가능한 자원관리를 위한 섬 지역 관광자원의 공간정보와 소셜미디어 빅데이터 분석 결과를 활용한 격차분석)

  • Lee, Sung-Hee;Lee, Ju-Kyung;Son, Yong-Hoon;Kim, Young-Jin
    • Journal of Korean Society of Rural Planning / v.30 no.2 / pp.13-24 / 2024
  • This study conducts an analysis of social media big data pertaining to island tourism resources, aiming to discern the diverse forms and categories of island tourism favored by consumers, ascertain predominant resources, and facilitate objective decision-making grounded in scientific methodologies. To achieve this objective, an examination of blog posts published on Naver from 2022 to 2023 was undertaken, utilizing keywords such as 'Island tourism', 'Island travel', and 'Island backpacking' as focal points for analysis. Text mining techniques were applied to sift through the data. Among the resources identified, the port emerged as a significant asset, serving as a pivotal conduit linking the island and mainland and holding substantial importance as a focal point and resource for tourist access to the island. Furthermore, an analysis of the disparity between existing island tourism resources and those acknowledged by tourists who actively engage with and appreciate island destinations led to the identification of 186 newly emerging resources. These nascent resources predominantly clustered within five regions: Incheon Metropolitan City, Tongyeong/Geoje City, Jeju Island, Ulleung-gun, and Shinan-gun. A scrutiny of these resources, categorized according to the tourism resource classification system, revealed a notable presence of new resources, chiefly in the domains of 'rural landscape', 'tourist resort/training facility', 'transportation facility', and 'natural resource'. Notably, many of these emerging resources were previously overlooked in official management targets or resource inventories pertaining to existing island tourism resources. Noteworthy examples include ports, beaches, and mountains, which, despite constituting a substantial proportion of the newly identified tourist resources, were not accorded prominence in spatial information datasets. 
This study holds significance in its ability to unearth novel tourism resources recognized by island tourism consumers through a gap analysis approach that juxtaposes the existing status of island tourism resource data with techniques utilizing social media big data. Furthermore, the methodology delineated in this research offers a valuable framework for domestic local governments to gauge local tourism demand and embark on initiatives for tourism development or regional revitalization.

Automated Development of Rank-Based Concept Hierarchical Structures using Wikipedia Links (위키피디아 링크를 이용한 랭크 기반 개념 계층구조의 자동 구축)

  • Lee, Ga-hee;Kim, Han-joon
    • The Journal of Society for e-Business Studies / v.20 no.4 / pp.61-76 / 2015
  • The hierarchical concept tree is widely used as a crucial data structure for indexing huge amounts of textual data. This paper proposes a generality rank-based method that automatically develops hierarchical concept structures from Wikipedia data. The method regards each Wikipedia article as a concept and generates hierarchical relationships among concepts. To estimate the generality of concepts, we devised a ranking function that mainly uses the number of hyperlinks among Wikipedia articles. The ranking function is used to compute the probabilistic subsumption among concepts, which makes it possible to generate relatively more stable hierarchical structures. Finally, the set of concept pairs with a hierarchical relationship is visualized as a DAG (directed acyclic graph). Through an empirical analysis using the concept hierarchy of the Open Directory Project, we showed that the proposed method outperforms a representative baseline method and can automatically extract concept hierarchies with high accuracy.

Self Introduction Essay Classification Using Doc2Vec for Efficient Job Matching (Doc2Vec 모형에 기반한 자기소개서 분류 모형 구축 및 실험)

  • Kim, Young Soo;Moon, Hyun Sil;Kim, Jae Kyeong
    • Journal of Information Technology Services / v.19 no.1 / pp.103-112 / 2020
  • Job seekers make various efforts to find a good company, and companies attempt to recruit good people. Job search activities through self-introduction essays are nowadays one of the most active processes. Companies spend time and money reviewing the numerous self-introduction essays of job seekers. Job seekers, in turn, worry about whether their self-introduction essays will be accepted by companies. This research builds a classification model and conducts experiments to classify self-introduction essays as pass or fail using deep learning and decision tree techniques. Stratified sampling was applied to the real-world data to alleviate the data imbalance between passed and failed self-introduction essays. Documents were embedded using the Doc2Vec method, which was developed from Word2Vec, and were classified using logistic regression analysis. The decision tree model was chosen as a benchmark model, and K-fold cross-validation was conducted for the performance evaluation. Across several experiments, the area under the curve (AUC) of PV-DM was better than that of the other Doc2Vec models, i.e., PV-DBOW and Concatenate. Furthermore, PV-DM classifies both passed and failed essays, whereas PV-DBOW cannot classify passed essays even though it classifies failed essays well. In addition, the classification performance of the logistic regression model using PV-DM embeddings is better than that of the decision tree-based classification model. The implication of the experimental results is that companies can reduce the cost of recruiting good job seekers. In addition, our suggested model can help job candidates pre-evaluate their self-introduction essays.

Analysis on Status and Trends of SIAM Journal Papers using Text Mining (텍스트마이닝 기법을 활용한 미국산업응용수학 학회지의 연구 현황 및 동향 분석)

  • Kim, Sung-Yeun
    • The Journal of the Korea Contents Association / v.20 no.7 / pp.212-222 / 2020
  • The purpose of this study is to understand the current status and trends of the research published by the Society for Industrial and Applied Mathematics, a worldwide leader in the field of industrial mathematics. To this end, titles and abstracts were collected from 6,255 research articles published between 2016 and 2019, and the R program was used to fit an LDA topic model and a regression model. The results are as follows. First, a variety of studies have been conducted in the fields of industrial mathematics, such as algebra, discrete mathematics, geometry, topological mathematics, and probability and statistics. Second, the ascending research subjects were fluid mechanics, graph theory, and stochastic differential equations, and the descending research subjects were computational theory and classical geometry. By clarifying the overall flows and changes of the intellectual structure in the fields of industrial mathematics, the results are expected to give researchers implications for the future direction of research and for building an industrial mathematics curriculum that reflects the zeitgeist in the field of education.

Comparison of Topic Modeling Methods for Analyzing Research Trends of Archives Management in Korea: focused on LDA and HDP (국내 기록관리학 연구동향 분석을 위한 토픽모델링 기법 비교 - LDA와 HDP를 중심으로 -)

  • Park, JunHyeong;Oh, Hyo-Jung
    • Journal of Korean Library and Information Science Society / v.48 no.4 / pp.235-258 / 2017
  • The purpose of this study is to analyze research trends of archives management in Korea by comparing LDA (Latent Dirichlet Allocation) topic modeling, one of the best-known methods in text mining, with HDP (Hierarchical Dirichlet Process) topic modeling, an extension of LDA topic modeling. First, we collected 1,027 articles related to archives management published from 1997 to 2016 in two journals on archives management and four journals on library and information science in Korea, and performed several preprocessing steps. We then conducted LDA and HDP topic modeling. For a more in-depth comparison, we used LDAvis as a topic modeling visualization tool. As a result, LDA topic modeling was influenced by keywords that occur frequently across all topics, whereas HDP topic modeling produced specific keywords that make the characteristics of each topic easy to identify.