• Title/Summary/Keyword: 토픽 추출

Search Result 211, Processing Time 0.025 seconds

Analyzing the Effect of Characteristics of Dictionary on the Accuracy of Document Classifiers (용어 사전의 특성이 문서 분류 정확도에 미치는 영향 연구)

  • Jung, Haegang;Kim, Namgyu
    • Management & Information Systems Review
    • /
    • v.37 no.4
    • /
    • pp.41-62
    • /
    • 2018
  • As the volume of unstructured data increases through various social media, Internet news articles, and blogs, the importance of text analysis and the studies are increasing. Since text analysis is mostly performed on a specific domain or topic, the importance of constructing and applying a domain-specific dictionary has been increased. The quality of dictionary has a direct impact on the results of the unstructured data analysis and it is much more important since it present a perspective of analysis. In the literature, most studies on text analysis has emphasized the importance of dictionaries to acquire clean and high quality results. However, unfortunately, a rigorous verification of the effects of dictionaries has not been studied, even if it is already known as the most essential factor of text analysis. In this paper, we generate three dictionaries in various ways from 39,800 news articles and analyze and verify the effect each dictionary on the accuracy of document classification by defining the concept of Intrinsic Rate. 1) A batch construction method which is building a dictionary based on the frequency of terms in the entire documents 2) A method of extracting the terms by category and integrating the terms 3) A method of extracting the features according to each category and integrating them. We compared accuracy of three artificial neural network-based document classifiers to evaluate the quality of dictionaries. As a result of the experiment, the accuracy tend to increase when the "Intrinsic Rate" is high and we found the possibility to improve accuracy of document classification by increasing the intrinsic rate of the dictionary.

Text Mining and Association Rules Analysis to a Self-Introduction Letter of Freshman at Korea National College of Agricultural and Fisheries (1) (한국농수산대학 신입생 자기소개서의 텍스트 마이닝과 연관규칙 분석 (1))

  • Joo, J.S.;Lee, S.Y.;Kim, J.S.;Shin, Y.K.;Park, N.B.
    • Journal of Practical Agriculture & Fisheries Research
    • /
    • v.22 no.1
    • /
    • pp.113-129
    • /
    • 2020
  • In this study we examined the topic analysis and correlation analysis by text mining to extract meaningful information or rules from the self introduction letter of freshman at Korea National College of Agriculture and Fisheries in 2020. The analysis items are described in items related to 'academic' and 'in-school activities' during high school. In the text mining results, the keywords of 'academic' items were 'study', 'thought', 'effort', 'problem', 'friend', and the key words of 'in-school activities' were 'activity', 'thought', 'friend', 'club', 'school' in order. As a result of the correlation analysis, the key words of 'thinking', 'studying', 'effort', and 'time' played a central role in the 'academic' item. And the key words of 'in-school activities' were 'thought', 'activity', 'school', 'time', and 'friend'. The results of frequency analysis and association analysis were visualized with word cloud and correlation graphs to make it easier to understand all the results. In the next study, TF-IDF(Term Frequency-Inverse Document Frequency) analysis using 'frequency of keywords' and 'reverse of document frequency' will be performed as a method of extracting key words from a large amount of documents.

Latent mobility pattern analysis of bus passengers with LDA (LDA 기법을 이용한 버스 승객의 잠재적 이동패턴 분석)

  • Cho, Ah;Lee, Kyung Hee;Cho, Wan Sup
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.5
    • /
    • pp.1061-1069
    • /
    • 2015
  • Recently, transportation big data generated in the transportation sector has been widely used in the transportation policies making and efficient system management. Bus passengers' mobility patterns are useful insight for transportation policy maker to optimize bus lines and time intervals in a city. We propose a new methodology to discover mobility patterns by using transportation card data. We first estimate the bus stations where the passengers get-off because the transportation card data don't have the get-off information in most cities. We then applies LDA (Latent Dirichlet Allocation), the most representative topic modeling technique, to discover mobility patterns of bus passengers in Cheong-Ju city. To understand discovered patterns, we construct a data warehouse and perform multi-dimensional analysis by bus-route, region, time-period, and the mobility patterns (get-on/get-off station). In the case of Cheong Ju, we discovered mobility pattern 1 from suburban area to Cheong-Ju terminal, mobility pattern 2 from residential area to commercial area, mobility pattern 3 from school areas to commercial area.

Accessibility Analysis Method based on Public Facility Attraction Index Using SNS Data (SNS 데이터를 이용한 공공시설 매력도지수에 따른 접근성 분석기법)

  • Lee, Ji Won;Yu, Ki Yun;Kim, Ji Young
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.37 no.1
    • /
    • pp.29-42
    • /
    • 2019
  • In order to expand the qualitative aspects of public facility, this study used SNS data to derive user-oriented preference factors for public facilities and then were quantified in terms of supply side and demand side. To derive preference factor, LDA, one of topic modeling, was used and attraction index was calculated for each facility. In addition we analyzed spatial accessibility to measure the degree of service experience of users by using 2SFCA model. The study area covered public libraries of Seoul, Korea. As a result of study, five topics were extracted as preference factors for the public library: Circumstance, Scale of facility, Cultural program, Parenting, Books and materials. In particular topic of circumstance and parenting were newly derived preference factors unknown in previous studies. As a result of calculating attraction index for each library, the index of Songpa Library, Jungdok Library, and Namsan Library was high. Songpa library has received good evaluation in parenting factor, and Jungdok & Namsan library in circumstance factor. The accessibility of each region seems to better in center of Seoul where public libraries are crowded, but shrinking toward the outskirts. We expect that the proposed method will contribute to user-oriented public facility evaluation and policy decision making.

A study on trends and predictions through analysis of linkage analysis based on big data between autonomous driving and spatial information (자율주행과 공간정보의 빅데이터 기반 연계성 분석을 통한 동향 및 예측에 관한 연구)

  • Cho, Kuk;Lee, Jong-Min;Kim, Jong Seo;Min, Guy Sik
    • Journal of Cadastre & Land InformatiX
    • /
    • v.50 no.2
    • /
    • pp.101-115
    • /
    • 2020
  • In this paper, big data analysis method was used to find out global trends in autonomous driving and to derive activate spatial information services. The applied big data was used in conjunction with news articles and patent document in order to analysis trend in news article and patents document data in spatial information. In this paper, big data was created and key words were extracted by using LDA (Latent Dirichlet Allocation) based on the topic model in major news on autonomous driving. In addition, Analysis of spatial information and connectivity, global technology trend analysis, and trend analysis and prediction in the spatial information field were conducted by using WordNet applied based on key words of patent information. This paper was proposed a big data analysis method for predicting a trend and future through the analysis of the connection between the autonomous driving field and spatial information. In future, as a global trend of spatial information in autonomous driving, platform alliances, business partnerships, mergers and acquisitions, joint venture establishment, standardization and technology development were derived through big data analysis.

A study on the method of deriving the cause of social issues based on causal sentences (인과관계문형 기반 사회이슈 발생원인 도출 방법 연구)

  • Lee, Namyeon;Lee, Jae Hyung
    • Journal of Digital Convergence
    • /
    • v.19 no.3
    • /
    • pp.167-176
    • /
    • 2021
  • With development of big data analysis technology, many studies to find social issues using texts mining techniques have been conducted. In order to derive social issues, previous studies performed in a way that collects a large amount of text data from news or SNS, and then analyzes issues based on text mining techniques such as topic modeling and terms network analysis. Social issues are the results of various social phenomena and factors. However, since previous studies focused on deriving social issues that are results of various causes, there are limitations to revealing the cause of the issues. In order to effectively respond to social issues, it is necessary not only to derive social issues, but also to be able to identify the causes of social issues. In this study, in order to overcome these limitations, we proposed a method of deriving the factors that cause social issues from texts related to social issues based on the theory of part of Korean linguistics. To do this, we collected news data related to social issues for three years from 2017 to 2019 and proposed a methodology to find causes based causal sentences based on text mining techniques.

An Analysis on Media Trends in Public Agency for Social Service Applying Text Mining (텍스트 마이닝을 적용한 사회서비스원 언론보도기사 분석)

  • Park, Hae-Keung;Youn, Ki-Hyok
    • Journal of Internet of Things and Convergence
    • /
    • v.8 no.2
    • /
    • pp.41-48
    • /
    • 2022
  • This study tried to empirically explore which issues related to the social service agency for public(as below SSA), that is, social perceptions were formed, by using mess media related to the SSA. This study is meaningful in that it identifies the overall social perception and trend of SSA through public opinion. In order to extract media trend data, the search used the big data analysis system, Textom, to collect data from the representative portals Naver News and Daum News. The collected texts were 1,299 in 2020 and 1,410 in 2021, for a total of 2,709. As a result of the analysis, first, the most derived words in relation to the frequency of text appearance were 'SSA', 'establishment', and 'operation'. Second, as a result of the N-gram analysis, the pairs of words directly related to the SSA 'SSA and public', 'SSA and opening', 'SSA and launch', and 'SSA and Department Director', 'SSA and Staff', 'SSA and Caregiver' etc. Third, in the results of TF-IDF analysis and word network analysis, similar to the word occurrence frequency and N-gram results, 'establishment', 'operation', 'public', 'launch', 'provided', 'opened', ' 'Holding' and 'Care' were derived. Based on the above analysis results, it was suggested to strengthen the emergency care support group, to commercialize it in detail, and to stabilize jobs.

Movie Recommended System base on Analysis for the User Review utilizing Ontology Visualization (온톨로지 시각화를 활용한 사용자 리뷰 분석 기반 영화 추천 시스템)

  • Mun, Seong Min;Kim, Gi Nam;Choi, Gyeong cheol;Lee, Kyung Won
    • Design Convergence Study
    • /
    • v.15 no.2
    • /
    • pp.347-368
    • /
    • 2016
  • Recently, researches for the word of mouth(WOM) imply that consumers use WOM informations of products in their purchase process. This study suggests methods using opinion mining and visualization to understand consumers' opinion of each goods and each markets. For this study we conduct research that includes developing domain ontology based on reviews confined to "movie" category because people who want to have watching movie refer other's movie reviews recently, and it is analyzed by opinion mining and visualization. It has differences comparing other researches as conducting attribution classification of evaluation factors and comprising verbal dictionary about evaluation factors when we conduct ontology process for analyzing. We want to prove through the result if research method will be valid. Results derived from this study can be largely divided into three. First, This research explains methods of developing domain ontology using keyword extraction and topic modeling. Second, We visualize reviews of each movie to understand overall audiences' opinion about specific movies. Third, We find clusters that consist of products which evaluated similar assessments in accordance with the evaluation results for the product. Case study of this research largely shows three clusters containing 130 movies that are used according to audiences'opinion.

Prediction of Customer Satisfaction Using RFE-SHAP Feature Selection Method (RFE-SHAP을 활용한 온라인 리뷰를 통한 고객 만족도 예측)

  • Olga Chernyaeva;Taeho Hong
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.4
    • /
    • pp.325-345
    • /
    • 2023
  • In the rapidly evolving domain of e-commerce, our study presents a cohesive approach to enhance customer satisfaction prediction from online reviews, aligning methodological innovation with practical insights. We integrate the RFE-SHAP feature selection with LDA topic modeling to streamline predictive analytics in e-commerce. This integration facilitates the identification of key features-specifically, narrowing down from an initial set of 28 to an optimal subset of 14 features for the Random Forest algorithm. Our approach strategically mitigates the common issue of overfitting in models with an excess of features, leading to an improved accuracy rate of 84% in our Random Forest model. Central to our analysis is the understanding that certain aspects in review content, such as quality, fit, and durability, play a pivotal role in influencing customer satisfaction, especially in the clothing sector. We delve into explaining how each of these selected features impacts customer satisfaction, providing a comprehensive view of the elements most appreciated by customers. Our research makes significant contributions in two key areas. First, it enhances predictive modeling within the realm of e-commerce analytics by introducing a streamlined, feature-centric approach. This refinement in methodology not only bolsters the accuracy of customer satisfaction predictions but also sets a new standard for handling feature selection in predictive models. Second, the study provides actionable insights for e-commerce platforms, especially those in the clothing sector. By highlighting which aspects of customer reviews-like quality, fit, and durability-most influence satisfaction, we offer a strategic direction for businesses to tailor their products and services.

Methodology for Issue-related R&D Keywords Packaging Using Text Mining (텍스트 마이닝 기반의 이슈 관련 R&D 키워드 패키징 방법론)

  • Hyun, Yoonjin;Shun, William Wong Xiu;Kim, Namgyu
    • Journal of Internet Computing and Services
    • /
    • v.16 no.2
    • /
    • pp.57-66
    • /
    • 2015
  • Considerable research efforts are being directed towards analyzing unstructured data such as text files and log files using commercial and noncommercial analytical tools. In particular, researchers are trying to extract meaningful knowledge through text mining in not only business but also many other areas such as politics, economics, and cultural studies. For instance, several studies have examined national pending issues by analyzing large volumes of text on various social issues. However, it is difficult to provide successful information services that can identify R&D documents on specific national pending issues. While users may specify certain keywords relating to national pending issues, they usually fail to retrieve appropriate R&D information primarily due to discrepancies between these terms and the corresponding terms actually used in the R&D documents. Thus, we need an intermediate logic to overcome these discrepancies, also to identify and package appropriate R&D information on specific national pending issues. To address this requirement, three methodologies are proposed in this study-a hybrid methodology for extracting and integrating keywords pertaining to national pending issues, a methodology for packaging R&D information that corresponds to national pending issues, and a methodology for constructing an associative issue network based on relevant R&D information. Data analysis techniques such as text mining, social network analysis, and association rules mining are utilized for establishing these methodologies. As the experiment result, the keyword enhancement rate by the proposed integration methodology reveals to be about 42.8%. For the second objective, three key analyses were conducted and a number of association rules between national pending issue keywords and R&D keywords were derived. The experiment regarding to the third objective, which is issue clustering based on R&D keywords is still in progress and expected to give tangible results in the future.