• Title/Summary/Keyword: SNS Information

Search Result 1,473, Processing Time 0.028 seconds

Text Mining-Based Emerging Trend Analysis for e-Learning Contents Targeting for CEO (텍스트마이닝을 통한 최고경영자 대상 이러닝 콘텐츠 트렌드 분석)

  • Kyung-Hoon Kim;Myungsin Chae;Byungtae Lee
    • Information Systems Review
    • /
    • v.19 no.2
    • /
    • pp.1-19
    • /
    • 2017
  • Original scripts of e-learning lectures for the CEOs of corporation S were analyzed using topic analysis, which is a text mining method. Twenty-two topics were extracted based on the keywords chosen from five-year records that ranged from 2011 to 2015. Research analysis was then conducted on various issues. Promising topics were selected through evaluation and element analysis of the members of each topic. In management and economics, members demonstrated high satisfaction and interest toward topics in marketing strategy, human resource management, and communication. Philosophy, history of war, and history demonstrated high interest and satisfaction in the field of humanities, whereas mind health showed high interest and satisfaction in the field of in lifestyle. Studies were also conducted to identify topics on the proportion of content, but these studies failed to increase member satisfaction. In the field of IT, educational content responds sensitively to change of the times, but it may not increase the interest and satisfaction of members. The present study found that content production for CEOs should draw out deep implications for value innovation through technology application instead of simply ending the technical aspect of information delivery. Previous studies classified contents superficially based on the name of content program when analyzing the status of content operation. However, text mining can derive deep content and subject classification based on the contents of unstructured data script. This approach can examine current shortages and necessary fields if the service contents of the themes are displayed by year. This study was based on data obtained from influential e-learning companies in Korea. Obtaining practical results was difficult because data were not acquired from portal sites or social networking service. The content of e-learning trends of CEOs were analyzed. Data analysis was also conducted on the intellectual interests of CEOs in each field.

Clustering Method based on Genre Interest for Cold-Start Problem in Movie Recommendation (영화 추천 시스템의 초기 사용자 문제를 위한 장르 선호 기반의 클러스터링 기법)

  • You, Tithrottanak;Rosli, Ahmad Nurzid;Ha, Inay;Jo, Geun-Sik
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.1
    • /
    • pp.57-77
    • /
    • 2013
  • Social media has become one of the most popular media in web and mobile application. In 2011, social networks and blogs are still the top destination of online users, according to a study from Nielsen Company. In their studies, nearly 4 in 5active users visit social network and blog. Social Networks and Blogs sites rule Americans' Internet time, accounting to 23 percent of time spent online. Facebook is the main social network that the U.S internet users spend time more than the other social network services such as Yahoo, Google, AOL Media Network, Twitter, Linked In and so on. In recent trend, most of the companies promote their products in the Facebook by creating the "Facebook Page" that refers to specific product. The "Like" option allows user to subscribed and received updates their interested on from the page. The film makers which produce a lot of films around the world also take part to market and promote their films by exploiting the advantages of using the "Facebook Page". In addition, a great number of streaming service providers allows users to subscribe their service to watch and enjoy movies and TV program. They can instantly watch movies and TV program over the internet to PCs, Macs and TVs. Netflix alone as the world's leading subscription service have more than 30 million streaming members in the United States, Latin America, the United Kingdom and the Nordics. As the matter of facts, a million of movies and TV program with different of genres are offered to the subscriber. In contrast, users need spend a lot time to find the right movies which are related to their interest genre. Recent years there are many researchers who have been propose a method to improve prediction the rating or preference that would give the most related items such as books, music or movies to the garget user or the group of users that have the same interest in the particular items. One of the most popular methods to build recommendation system is traditional Collaborative Filtering (CF). The method compute the similarity of the target user and other users, which then are cluster in the same interest on items according which items that users have been rated. The method then predicts other items from the same group of users to recommend to a group of users. Moreover, There are many items that need to study for suggesting to users such as books, music, movies, news, videos and so on. However, in this paper we only focus on movie as item to recommend to users. In addition, there are many challenges for CF task. Firstly, the "sparsity problem"; it occurs when user information preference is not enough. The recommendation accuracies result is lower compared to the neighbor who composed with a large amount of ratings. The second problem is "cold-start problem"; it occurs whenever new users or items are added into the system, which each has norating or a few rating. For instance, no personalized predictions can be made for a new user without any ratings on the record. In this research we propose a clustering method according to the users' genre interest extracted from social network service (SNS) and user's movies rating information system to solve the "cold-start problem." Our proposed method will clusters the target user together with the other users by combining the user genre interest and the rating information. It is important to realize a huge amount of interesting and useful user's information from Facebook Graph, we can extract information from the "Facebook Page" which "Like" by them. Moreover, we use the Internet Movie Database(IMDb) as the main dataset. The IMDbis online databases that consist of a large amount of information related to movies, TV programs and including actors. This dataset not only used to provide movie information in our Movie Rating Systems, but also as resources to provide movie genre information which extracted from the "Facebook Page". Formerly, the user must login with their Facebook account to login to the Movie Rating System, at the same time our system will collect the genre interest from the "Facebook Page". We conduct many experiments with other methods to see how our method performs and we also compare to the other methods. First, we compared our proposed method in the case of the normal recommendation to see how our system improves the recommendation result. Then we experiment method in case of cold-start problem. Our experiment show that our method is outperform than the other methods. In these two cases of our experimentation, we see that our proposed method produces better result in case both cases.

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.1
    • /
    • pp.23-45
    • /
    • 2020
  • Big data is creating in a wide variety of fields such as medical care, manufacturing, logistics, sales site, SNS, and the dataset characteristics are also diverse. In order to secure the competitiveness of companies, it is necessary to improve decision-making capacity using a classification algorithm. However, most of them do not have sufficient knowledge on what kind of classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm is appropriate depending on the characteristics of the dataset was has been a task that required expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features reflecting the characteristics of multi-class. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were identified into two factors, (the data structure and the data complexity,) and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally a market concentration measurement index, in the meta-features to replace IR(Imbalanced Ratio). Also, we developed a new index called Reverse ReLU Silhouette Score into the meta-feature set. Among the UCI Machine Learning Repository data, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality(red), Contraceptive Method Choice) were selected. The class of each dataset was classified by using the classification algorithms (KNN, Logistic Regression, Nave Bayes, Random Forest, and SVM) selected in the study. For each dataset, we applied 10-fold cross validation method. 10% to 100% oversampling method is applied for each fold and meta-features of the dataset is measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, Hub Score. F1-score was selected as the dependent variable. As a result, the results of this study showed that the six meta-features including Reverse ReLU Silhouette Score and HHI proposed in this study have a significant effect on the classification performance. (1) The meta-features HHI proposed in this study was significant in the classification performance. (2) The number of variables has a significant effect on the classification performance, unlike the number of classes, but it has a positive effect. (3) The number of classes has a negative effect on the performance of classification. (4) Entropy has a significant effect on the performance of classification. (5) The Reverse ReLU Silhouette Score also significantly affects the classification performance at a significant level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by the classification algorithms were also consistent. In the regression analysis by classification algorithm, Naïve Bayes algorithm does not have a significant effect on the number of variables unlike other classification algorithms. This study has two theoretical contributions: (1) two new meta-features (HHI, Reverse ReLU Silhouette score) was proved to be significant. (2) The effects of data characteristics on the performance of classification were investigated using meta-features. The practical contribution points (1) can be utilized in the development of classification algorithm recommendation system according to the characteristics of datasets. (2) Many data scientists are often testing by adjusting the parameters of the algorithm to find the optimal algorithm for the situation because the characteristics of the data are different. In this process, excessive waste of resources occurs due to hardware, cost, time, and manpower. This study is expected to be useful for machine learning, data mining researchers, practitioners, and machine learning-based system developers. The composition of this study consists of introduction, related research, research model, experiment, conclusion and discussion.

Perception and Appraisal of Urban Park Users Using Text Mining of Google Maps Review - Cases of Seoul Forest, Boramae Park, Olympic Park - (구글맵리뷰 텍스트마이닝을 활용한 공원 이용자의 인식 및 평가 - 서울숲, 보라매공원, 올림픽공원을 대상으로 -)

  • Lee, Ju-Kyung;Son, Yong-Hoon
    • Journal of the Korean Institute of Landscape Architecture
    • /
    • v.49 no.4
    • /
    • pp.15-29
    • /
    • 2021
  • The study aims to grasp the perception and appraisal of urban park users through text analysis. This study used Google review data provided by Google Maps. Google Maps Review is an online review platform that provides information evaluating locations through social media and provides an understanding of locations from the perspective of general reviewers and regional guides who are registered as members of Google Maps. The study determined if the Google Maps Reviews were useful for extracting meaningful information about the user perceptions and appraisals for parks management plans. The study chose three urban parks in Seoul, South Korea; Seoul Forest, Boramae Park, and Olympic Park. Review data for each of these three parks were collected via web crawling using Python. Through text analysis, the keywords and network structure characteristics for each park were analyzed. The text was analyzed, as were park ratings, and the analysis compared the reviews of residents and foreign tourists. The common keywords found in the review comments for the three parks were "walking", "bicycle", "rest" and "picnic" for activities, "family", "child" and "dogs" for accompanying types, and "playground" and "walking trail" for park facilities. Looking at the characteristics of each park, Seoul Forest shows many outdoor activities based on nature, while the lack of parking spaces and congestion on weekends negatively impacted users. Boramae Park has the appearance of a city park, with various facilities providing numerous activities, but reviewers often cited the park's complexity and the negative aspects in terms of dog walking groups. At Olympic Park, large-scale complex facilities and cultural events were frequently mentioned, emphasizing its entertainment functions. Google Maps Review can function as useful data to identify parks' overall users' experiences and general feelings. Compared to data from other social media sites, Google Maps Review's data provides ratings and understanding factors, including user satisfaction and dissatisfaction.

Analyzing Topic Trends and the Relationship between Changes in Public Opinion and Stock Price based on Sentiment of Discourse in Different Industry Fields using Comments of Naver News (네이버 뉴스 댓글을 이용한 산업 분야별 담론의 감성에 기반한 주제 트렌드 및 여론의 변화와 주가 흐름의 연관성 분석)

  • Oh, Chanhee;Kim, Kyuli;Zhu, Yongjun
    • Journal of the Korean Society for information Management
    • /
    • v.39 no.1
    • /
    • pp.257-280
    • /
    • 2022
  • In this study, we analyzed comments on news articles of representative companies of the three industries (i.e., semiconductor, secondary battery, and bio industries) that had been listed as national strategic technology projects of South Korea to identify public opinions towards them. In addition, we analyzed the relationship between changes in public opinion and stock price. 'Samsung Electronics' and 'SK Hynix' in the semiconductor industry, 'Samsung SDI' and 'LG Chem' in the secondary battery industry, and 'Samsung Biologics' and 'Celltrion' in the bio-industry were selected as the representative companies and 47,452 comments of news articles about the companies that had been published from January 1, 2020, to December 31, 2020, were collected from Naver News. The comments were grouped into positive, neutral, and negative emotions, and the dynamic topics of comments over time in each group were analyzed to identify the trends of public opinion in each industry. As a result, in the case of the semiconductor industry, investment, COVID-19 related issues, trust in large companies such as Samsung Electronics, and mention of the damage caused by changes in government policy were the topics. In the case of secondary battery industries, references to investment, battery, and corporate issues were the topics. In the case of bio-industries, references to investment, COVID-19 related issues, and corporate issues were the topics. Next, to understand whether the sentiment of the comments is related to the actual stock price, for each company, the changes in the stock price and the sentiment values of the comments were compared and analyzed using visual analytics. As a result, we found a clear relationship between the changes in the sentiment value of public opinion and the stock price through the similar patterns shown in the change graphs. This study analyzed comments on news articles that are highly related to stock price, identified changes in public opinion trends in the COVID-19 era, and provided objective feedback to government agencies' policymaking.

A Study on the Revitalization of Tourism Industry through Big Data Analysis (한국관광 실태조사 빅 데이터 분석을 통한 관광산업 활성화 방안 연구)

  • Lee, Jungmi;Liu, Meina;Lim, Gyoo Gun
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.2
    • /
    • pp.149-169
    • /
    • 2018
  • Korea is currently accumulating a large amount of data in public institutions based on the public data open policy and the "Government 3.0". Especially, a lot of data is accumulated in the tourism field. However, the academic discussions utilizing the tourism data are still limited. Moreover, the openness of the data of restaurants, hotels, and online tourism information, and how to use SNS Big Data in tourism are still limited. Therefore, utilization through tourism big data analysis is still low. In this paper, we tried to analyze influencing factors on foreign tourists' satisfaction in Korea through numerical data using data mining technique and R programming technique. In this study, we tried to find ways to revitalize the tourism industry by analyzing about 36,000 big data of the "Survey on the actual situation of foreign tourists from 2013 to 2015" surveyed by the Korea Culture & Tourism Research Institute. To do this, we analyzed the factors that have high influence on the 'Satisfaction', 'Revisit intention', and 'Recommendation' variables of foreign tourists. Furthermore, we analyzed the practical influences of the variables that are mentioned above. As a procedure of this study, we first integrated survey data of foreign tourists conducted by Korea Culture & Tourism Research Institute, which is stored in the tourist information system from 2013 to 2015, and eliminate unnecessary variables that are inconsistent with the research purpose among the integrated data. Some variables were modified to improve the accuracy of the analysis. And we analyzed the factors affecting the dependent variables by using data-mining methods: decision tree(C5.0, CART, CHAID, QUEST), artificial neural network, and logistic regression analysis of SPSS IBM Modeler 16.0. The seven variables that have the greatest effect on each dependent variable were derived. As a result of data analysis, it was found that seven major variables influencing 'overall satisfaction' were sightseeing spot attraction, food satisfaction, accommodation satisfaction, traffic satisfaction, guide service satisfaction, number of visiting places, and country. Variables that had a great influence appeared food satisfaction and sightseeing spot attraction. The seven variables that had the greatest influence on 'revisit intention' were the country, travel motivation, activity, food satisfaction, best activity, guide service satisfaction and sightseeing spot attraction. The most influential variables were food satisfaction and travel motivation for Korean style. Lastly, the seven variables that have the greatest influence on the 'recommendation intention' were the country, sightseeing spot attraction, number of visiting places, food satisfaction, activity, tour guide service satisfaction and cost. And then the variables that had the greatest influence were the country, sightseeing spot attraction, and food satisfaction. In addition, in order to grasp the influence of each independent variables more deeply, we used R programming to identify the influence of independent variables. As a result, it was found that the food satisfaction and sightseeing spot attraction were higher than other variables in overall satisfaction and had a greater effect than other influential variables. Revisit intention had a higher ${\beta}$ value in the travel motive as the purpose of Korean Wave than other variables. It will be necessary to have a policy that will lead to a substantial revisit of tourists by enhancing tourist attractions for the purpose of Korean Wave. Lastly, the recommendation had the same result of satisfaction as the sightseeing spot attraction and food satisfaction have higher ${\beta}$ value than other variables. From this analysis, we found that 'food satisfaction' and 'sightseeing spot attraction' variables were the common factors to influence three dependent variables that are mentioned above('Overall satisfaction', 'Revisit intention' and 'Recommendation'), and that those factors affected the satisfaction of travel in Korea significantly. The purpose of this study is to examine how to activate foreign tourists in Korea through big data analysis. It is expected to be used as basic data for analyzing tourism data and establishing effective tourism policy. It is expected to be used as a material to establish an activation plan that can contribute to tourism development in Korea in the future.

Structural features and Diffusion Patterns of Gartner Hype Cycle for Artificial Intelligence using Social Network analysis (인공지능 기술에 관한 가트너 하이프사이클의 네트워크 집단구조 특성 및 확산패턴에 관한 연구)

  • Shin, Sunah;Kang, Juyoung
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.1
    • /
    • pp.107-129
    • /
    • 2022
  • It is important to preempt new technology because the technology competition is getting much tougher. Stakeholders conduct exploration activities continuously for new technology preoccupancy at the right time. Gartner's Hype Cycle has significant implications for stakeholders. The Hype Cycle is a expectation graph for new technologies which is combining the technology life cycle (S-curve) with the Hype Level. Stakeholders such as R&D investor, CTO(Chef of Technology Officer) and technical personnel are very interested in Gartner's Hype Cycle for new technologies. Because high expectation for new technologies can bring opportunities to maintain investment by securing the legitimacy of R&D investment. However, contrary to the high interest of the industry, the preceding researches faced with limitations aspect of empirical method and source data(news, academic papers, search traffic, patent etc.). In this study, we focused on two research questions. The first research question was 'Is there a difference in the characteristics of the network structure at each stage of the hype cycle?'. To confirm the first research question, the structural characteristics of each stage were confirmed through the component cohesion size. The second research question is 'Is there a pattern of diffusion at each stage of the hype cycle?'. This research question was to be solved through centralization index and network density. The centralization index is a concept of variance, and a higher centralization index means that a small number of nodes are centered in the network. Concentration of a small number of nodes means a star network structure. In the network structure, the star network structure is a centralized structure and shows better diffusion performance than a decentralized network (circle structure). Because the nodes which are the center of information transfer can judge useful information and deliver it to other nodes the fastest. So we confirmed the out-degree centralization index and in-degree centralization index for each stage. For this purpose, we confirmed the structural features of the community and the expectation diffusion patterns using Social Network Serice(SNS) data in 'Gartner Hype Cycle for Artificial Intelligence, 2021'. Twitter data for 30 technologies (excluding four technologies) listed in 'Gartner Hype Cycle for Artificial Intelligence, 2021' were analyzed. Analysis was performed using R program (4.1.1 ver) and Cyram Netminer. From October 31, 2021 to November 9, 2021, 6,766 tweets were searched through the Twitter API, and converting the relationship user's tweet(Source) and user's retweets (Target). As a result, 4,124 edgelists were analyzed. As a reult of the study, we confirmed the structural features and diffusion patterns through analyze the component cohesion size and degree centralization and density. Through this study, we confirmed that the groups of each stage increased number of components as time passed and the density decreased. Also 'Innovation Trigger' which is a group interested in new technologies as a early adopter in the innovation diffusion theory had high out-degree centralization index and the others had higher in-degree centralization index than out-degree. It can be inferred that 'Innovation Trigger' group has the biggest influence, and the diffusion will gradually slow down from the subsequent groups. In this study, network analysis was conducted using social network service data unlike methods of the precedent researches. This is significant in that it provided an idea to expand the method of analysis when analyzing Gartner's hype cycle in the future. In addition, the fact that the innovation diffusion theory was applied to the Gartner's hype cycle's stage in artificial intelligence can be evaluated positively because the Gartner hype cycle has been repeatedly discussed as a theoretical weakness. Also it is expected that this study will provide a new perspective on decision-making on technology investment to stakeholdes.

Structural Properties of Social Network and Diffusion of Product WOM: A Sociocultural Approach (사회적 네트워크 구조특성과 제품구전의 확산: 사회문화적 접근)

  • Yoon, Sung-Joon;Han, Hee-Eun
    • Journal of Distribution Research
    • /
    • v.16 no.1
    • /
    • pp.141-177
    • /
    • 2011
  • I. Research Objectives: Most of the previous studies on diffusion have concentrated on efficacy of WOM communication with the use of variables at individual level (Iacobucci 1996; Midgley et al. 1992). However, there is a paucity of studies which investigated network's structural properties as antecedents of WOM from the perspective of consumers' sociocultural propensities. Against this research backbone, this study attempted to link the network's structural properties and consumer' WOM behavior on cross-national basis. The major research objective of this study was to examine the relationship between network properties and WOM by comparing Korean and Chinese consumers. Specific objectives of this research are threefold; firstly, it sought to examine whether network properties (i.e., tie strength, centrality, range) affect WOM (WOM intention and quality of WOM). Secondly, it aimed to explore the moderating effects of cutural orientation (uncertainty avoidance and individuality) on the relationship between network properties and WOM. Thirdly, it substantiates the role of innovativeness as antecedents to both network properties and WOM. II. Research Hypotheses: Based on the above research objectives, the study put forth the following research hypotheses to validate. ${\cdot}$ H 1-1 : The Strength of tie between two counterparts within network will positively influence WOM effectivenes ${\cdot}$ H 1-2 : The network centrality will positively influence the WOM effectiveness ${\cdot}$ H 1-3 : The network range will positively influence the WOM effectiveness ${\cdot}$ H 2-1 : The consumer's uncertainty avoidance tendency will moderate the relationship between network properties and WOM effectiveness ${\cdot}$ H 2-2 : The consumer's individualism tendency will moderate the relationship between network properties and WOM effectiveness ${\cdot}$ H 3-1 : The consumer's innovativeness will positively influence the social network properties ${\cdot}$ H 3-2 : The consumer's innovativeness will positively influence WOM effectiveness III. Methodology: Through a pilot study and back-translation, two versions of questionnaire were prepared, one in Korean and the other in Chinese. The chinese data were collected from the chinese students enrolled in language schools in Suwon city in Korea, while Korean data were collected from students taking classes in a major university in Seoul. A total of 277 questionnaire were used for analysis of Korean data and 212 for Chinese data. The reason why Chinese students living in Korea rather than in China were selected was based on two factors: one was to neutralize the differences (ie, retail channel availability) that may arise from living in separate countries and the second was to minimize the difference in communication venues such as internet accessibility and cell phone usability. SPSS 12.0 and AMOS 7.0 were used for analysis. IV. Results: Prior to hypothesis verification, mean differences between the two countries in terms of major constructs were performed with the following result; As for network properties (tie strength, centrality and range), Koreans showed higher scores in all three constructs. For cultural orientation traits, Koreans scored higher only on uncertainty avoidance trait than Chinese. As a result of verifying the first research objective, confirming the relationship between network properties and WOM effectiveness, on Korean side, tie strength(Beta=.116; t=1.785) and centrality (Beta=.499; t=6.776) significantly influenced on WOM intention, and similar finding was obtained for Chinese side, with tie strength (Beta=.246; t=3.544) and centrality (Beta=.247; t=3.538) being significant. However, with regard to WOM argument quality, Korean data yielded only centrality (Beta=.82; t=7.600) having a significant impact on WOM, whereas China showed both tie strength(Beat=.142; t=2.052) and centrality(Beta=.348; t=5.031) being influential. To answer for the second research objective addressing the moderating role of cultural orientation, moderated regression anaylsis was performed and the result showed that uncertainty avoidance moderated between network range and WOM intention for both Korea and China, But for Korea, the uncertainty avoidance moderated between tie strength and WOM quality, while for China it moderated between network range and WOM intention. And innovativeness moderated between tie strength and WOM intention for Korea but it moderated between network range and WOM intention for China. As a result of analysing for third research objective, we found that for Korea, innovativeness positively influenced centrality only (Beta=.546; t=10.808), while for China it influenced both tie strength (Beta=.203; t=2.998) and centrality(Beta=.518; t=8.782). But for both countries alike, the innovativeness influenced positively on WOM (WOM intention and WOM quality). V. Implications: The study yields the two practical implications. Firstly, the result suggests that companies targeting multinational customers need to identify segments which are susceptible to the positive WOM and WOM information based on individual traits such as uncertainty avoidance and individualism and based on that, develop marketing communication strategy. Secondly, the companies need to divide the market on Roger's five innovation stages and based on this information, enforce marketing strategy which utilizes social networking tools such as public media and WOM. For instance, innovator and early adopters, if provided with new product information, will be able to capitalize upon the network advantages and thus add informational value to network operations using SNS or corporate blog.

  • PDF

Subject-Balanced Intelligent Text Summarization Scheme (주제 균형 지능형 텍스트 요약 기법)

  • Yun, Yeoil;Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.2
    • /
    • pp.141-166
    • /
    • 2019
  • Recently, channels like social media and SNS create enormous amount of data. In all kinds of data, portions of unstructured data which represented as text data has increased geometrically. But there are some difficulties to check all text data, so it is important to access those data rapidly and grasp key points of text. Due to needs of efficient understanding, many studies about text summarization for handling and using tremendous amounts of text data have been proposed. Especially, a lot of summarization methods using machine learning and artificial intelligence algorithms have been proposed lately to generate summary objectively and effectively which called "automatic summarization". However almost text summarization methods proposed up to date construct summary focused on frequency of contents in original documents. Those summaries have a limitation for contain small-weight subjects that mentioned less in original text. If summaries include contents with only major subject, bias occurs and it causes loss of information so that it is hard to ascertain every subject documents have. To avoid those bias, it is possible to summarize in point of balance between topics document have so all subject in document can be ascertained, but still unbalance of distribution between those subjects remains. To retain balance of subjects in summary, it is necessary to consider proportion of every subject documents originally have and also allocate the portion of subjects equally so that even sentences of minor subjects can be included in summary sufficiently. In this study, we propose "subject-balanced" text summarization method that procure balance between all subjects and minimize omission of low-frequency subjects. For subject-balanced summary, we use two concept of summary evaluation metrics "completeness" and "succinctness". Completeness is the feature that summary should include contents of original documents fully and succinctness means summary has minimum duplication with contents in itself. Proposed method has 3-phases for summarization. First phase is constructing subject term dictionaries. Topic modeling is used for calculating topic-term weight which indicates degrees that each terms are related to each topic. From derived weight, it is possible to figure out highly related terms for every topic and subjects of documents can be found from various topic composed similar meaning terms. And then, few terms are selected which represent subject well. In this method, it is called "seed terms". However, those terms are too small to explain each subject enough, so sufficient similar terms with seed terms are needed for well-constructed subject dictionary. Word2Vec is used for word expansion, finds similar terms with seed terms. Word vectors are created after Word2Vec modeling, and from those vectors, similarity between all terms can be derived by using cosine-similarity. Higher cosine similarity between two terms calculated, higher relationship between two terms defined. So terms that have high similarity values with seed terms for each subjects are selected and filtering those expanded terms subject dictionary is finally constructed. Next phase is allocating subjects to every sentences which original documents have. To grasp contents of all sentences first, frequency analysis is conducted with specific terms that subject dictionaries compose. TF-IDF weight of each subjects are calculated after frequency analysis, and it is possible to figure out how much sentences are explaining about each subjects. However, TF-IDF weight has limitation that the weight can be increased infinitely, so by normalizing TF-IDF weights for every subject sentences have, all values are changed to 0 to 1 values. Then allocating subject for every sentences with maximum TF-IDF weight between all subjects, sentence group are constructed for each subjects finally. Last phase is summary generation parts. Sen2Vec is used to figure out similarity between subject-sentences, and similarity matrix can be formed. By repetitive sentences selecting, it is possible to generate summary that include contents of original documents fully and minimize duplication in summary itself. For evaluation of proposed method, 50,000 reviews of TripAdvisor are used for constructing subject dictionaries and 23,087 reviews are used for generating summary. Also comparison between proposed method summary and frequency-based summary is performed and as a result, it is verified that summary from proposed method can retain balance of all subject more which documents originally have.

Accessibility Analysis in Mapping Cultural Ecosystem Service of Namyangju-si (접근성 개념을 적용한 문화서비스 평가 -남양주시를 대상으로-)

  • Jun, Baysok;Kang, Wanmo;Lee, Jaehyuck;Kim, Sunghoon;Kim, Byeori;Kim, Ilkwon;Lee, Jooeun;Kwon, Hyuksoo
    • Journal of Environmental Impact Assessment
    • /
    • v.27 no.4
    • /
    • pp.367-377
    • /
    • 2018
  • A cultural ecosystem service(CES), which is non-material benefit that human gains from ecosystem, has been recently further recognized as gross national income increases. Previous researches proposed to quantify the value of CES, which still remains as a challenging issue today due to its social and cultural subjectivity. This study proposes new way of assessing CES which is called Cultural Service Opportunity Spectrum(CSOS). CSOS is accessibility based CES assessment methodology for regional scale and it is designed to be applicable for any regions in Korea for supporting decision making process. CSOS employed public spatial data which are road network and population density map. In addition, the results of 'Rapid Assessment of Natural Assets' implemented by National Institute of Ecology, Korea were used as a complementary data. CSOS was applied to Namyangju-si and the methodology resulted in revealing specific areas with great accessibility to 'Natural Assets' in the region. Based on the results, the advantages and limitations of the methodology were discussed with regard to weighting three main factors and in contrast to Scenic Quality model and Recreation model of InVEST which have been commonly used for assessing CES today due to its convenience today.