• Title/Summary/Keyword: text mining (텍스트마이닝)

Comparison of Term-Weighting Schemes for Environmental Big Data Analysis (환경 빅데이터 이슈 분석을 위한 용어 가중치 기법 비교)

  • Kim, JungJin;Jeong, Hanseok
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2021.06a
    • /
    • pp.236-236
    • /
    • 2021
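  • As unstructured data such as text is generated at a rapidly increasing rate, the need for techniques to analyze it is growing. Text mining is one such technique: it uses natural language processing to structure unstructured text and extract valuable information from documents. Text mining generally represents term importance with a document-term frequency matrix, which records how often each term is used in each document, and this matrix is used across many research fields. However, the raw frequencies in a document-term matrix do not readily capture how documents differ from one another or how important each term is as a result, so it is essential to apply term weighting when classifying the characteristics of documents. Although various term-weighting methods have been developed and applied, studies evaluating the effectiveness of term-weighting schemes in the environmental field are lacking. Moreover, in environmental issue analysis it is relatively important to reflect the characteristics of each document along its temporal distribution, rather than simply identifying document features and classifying the given documents. This study therefore applied text mining to environmental news data for the Seoul area from 2015 to 2020 to compare term-weighting schemes suitable for environmental issue analysis. The schemes applied were TF-IDF (term frequency-inverse document frequency), BM25, TF-IGM (TF-inverse gravity moment), and TF-IDF-ICSDF (TF-IDF-inverse class space density frequency). Through this study, we aim to propose an optimized term-weighting scheme for classifying environmental documents and entities, and to provide extracted keyword information related to environmental issues in the Seoul area.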
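
To make the difference between raw frequency and term weighting concrete, the following is a minimal sketch of two of the four schemes named in the abstract, TF-IDF and BM25, using only the Python standard library. The toy corpus, the function names, the defaults `k1=1.5` and `b=0.75`, and the "+1" smoothed IDF used for BM25 are all illustrative assumptions, not details taken from the paper. TF-IGM and TF-IDF-ICSDF are left out because they require class labels that the abstract does not describe.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document: raw TF times log(N / df)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def bm25(docs, k1=1.5, b=0.75):
    """Return one {term: weight} dict per document using the BM25 formula.

    k1 and b are commonly used defaults (an assumption here); the IDF uses
    the smoothed "+1 inside the log" variant so weights stay non-negative.
    """
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        # Length normalisation: longer-than-average documents are damped.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        weights.append({
            term: math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            * tf * (k1 + 1) / (tf + norm)
            for term, tf in Counter(doc).items()
        })
    return weights


# Hypothetical tokenised news snippets; "seoul" occurs in every document.
docs = [
    ["air", "dust", "seoul", "dust"],
    ["water", "river", "seoul"],
    ["waste", "recycling", "seoul"],
]
tfidf_weights = tf_idf(docs)
bm25_weights = bm25(docs)
```

In this toy corpus "seoul" appears in every document, so its TF-IDF weight is exactly zero, while "dust" is weighted up in the first document under both schemes.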

Application Development for Text Mining: KoALA (텍스트 마이닝 통합 애플리케이션 개발: KoALA)

  • Byeong-Jin Jeon;Yoon-Jin Choi;Hee-Woong Kim
    • Information Systems Review
    • /
    • v.21 no.2
    • /
    • pp.117-137
    • /
    • 2019
  • In the Big Data era, data science has grown in popularity as large volumes of data are produced across domains, and data has become a source of competitive power. There is growing interest in unstructured data, which accounts for more than 80% of the world's data. With the everyday use of social media, most unstructured data takes the form of text and plays an important role in areas such as marketing, finance, and distribution. However, social media text mining is harder to access and use than data mining on numerical data. This study therefore develops the Korean Natural Language Application (KoALA), an integrated application for easy social media text mining that does not rely on programming languages, high-end hardware, or solutions. KoALA is specialized for social media text mining, can analyze both Korean and English, and handles the entire process from data collection through preprocessing, analysis, and visualization. This paper describes the design, implementation, and application of KoALA using the design science methodology. Finally, we discuss the practical use of KoALA through a blockchain business case. Through this paper, we hope to popularize social media text mining and promote its practical and academic use in various domains.

Improvement Plan of Web Site FAQ using Text Mining : Focused on the S University Case (텍스트마이닝을 활용한 웹사이트 FAQ 개선방안: S대학교 사례를 중심으로)

  • Ahn, su-hyun;Jo, jeong-hyun;Lee, sang-jun
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2018.05a
    • /
    • pp.361-362
    • /
    • 2018
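  • This study collects unstructured data posted on the Q&A board of a university website and uses text mining and network analysis to identify association patterns among frequently occurring keywords. If the FAQ (frequently asked questions) board is organized on the basis of the analysis results, the handling of repeated inquiries can be streamlined, which is expected to improve user convenience and administrative efficiency and, further, to enable smooth two-way communication.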

Analysis of Prevention Methods by Type of Construction Disaster Using Text Mining Techniques (텍스트마이닝을 활용한 건설현장 재해 유형별 예방 대책 분석)

  • Gyu Pil Jo;Myungdo Lee;Yoon-seok Shin;Baek-Joong Kim
    • Journal of the Society of Disaster Information
    • /
    • v.20 no.1
    • /
    • pp.13-19
    • /
    • 2024
  • Purpose: This study provides prevention methods by type of construction disaster using text mining techniques. Method: Based on a database of analyzed critical disaster cases in the domestic construction sector, preventive measures and causes are analyzed with text mining techniques, and the results of the analysis are presented visually. Result: The visual data present the measures for preventing critical disasters in each process in order of importance. Conclusion: The results are expected to help identify factors to consider when preparing preventive measures for serious accidents in construction.

A Study on the Effect of Using Sentiment Lexicon in Opinion Classification (오피니언 분류의 감성사전 활용효과에 대한 연구)

  • Kim, Seungwoo;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.1
    • /
    • pp.133-148
    • /
    • 2014
  • Recently, with the advent of various information channels, the amount of data has continued to grow. The main cause of this phenomenon can be found in the significant increase of unstructured data, as the use of smart devices enables users to create data in the form of text, audio, images, and video. Among the various types of unstructured data, users' opinions and a variety of information are clearly expressed in text data such as news, reports, papers, and various articles. Thus, active attempts have been made to create new value by analyzing these texts. The representative techniques used in text analysis are text mining and opinion mining. These share certain important characteristics; for example, they not only use text documents as input data, but also use many natural language processing techniques such as filtering and parsing. Therefore, opinion mining is usually recognized as a sub-concept of text mining, or, in many cases, the two terms are used interchangeably in the literature. Suppose that the purpose of a certain classification analysis is to predict a positive or negative opinion contained in some documents. If we focus on the classification process, the analysis can be regarded as a traditional text mining case. However, if we observe that the target of the analysis is a positive or negative opinion, the analysis can be regarded as a typical example of opinion mining. In other words, two methods (i.e., text mining and opinion mining) are available for opinion classification. Thus, in order to distinguish between the two, a precise definition of each method is needed. In this paper, we found that it is very difficult to distinguish between the two methods clearly with respect to the purpose of analysis and the type of results. We conclude that the most definitive criterion to distinguish text mining from opinion mining is whether an analysis utilizes any kind of sentiment lexicon.
We first established two prediction models, one based on opinion mining and the other on text mining. Next, we compared the main processes used by the two prediction models. Finally, we compared their prediction accuracy. We then analyzed 2,000 movie reviews. The results revealed that the prediction model based on opinion mining showed higher average prediction accuracy than the text mining model. Moreover, in the lift chart generated by the opinion mining based model, the prediction accuracy for documents with strong certainty was higher than that for documents with weak certainty. Above all, opinion mining has a meaningful advantage in that it can reduce learning time dramatically, because a sentiment lexicon generated once can be reused in a similar application domain. Additionally, the classification results can be clearly explained by using a sentiment lexicon. This study has two limitations. First, the results of the experiments cannot be generalized, mainly because the experiment is limited to a small number of movie reviews. Second, various parameters in the parsing and filtering steps of the text mining may have affected the accuracy of the prediction models. Nevertheless, this research contributes a performance comparison of text mining analysis and opinion mining analysis for opinion classification. In future research, a more precise evaluation of the two methods should be made through intensive experiments.
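
The abstract's distinguishing criterion, whether an analysis uses a sentiment lexicon, can be illustrated with a small sketch. This is not the authors' prediction model: the six-word lexicon, the function names, and the simple sum-of-polarities rule are hypothetical choices made only to show why a lexicon-based result is easy to reuse and explain.

```python
# Hypothetical miniature sentiment lexicon: word -> polarity (+1 or -1).
LEXICON = {
    "great": 1, "moving": 1, "brilliant": 1,
    "boring": -1, "dull": -1, "awful": -1,
}


def lexicon_score(review, lexicon=LEXICON):
    """Sum the polarities of all lexicon words found in the review."""
    tokens = review.lower().split()
    # Strip trailing/leading punctuation so "dull," still matches "dull".
    return sum(lexicon.get(token.strip(".,!?"), 0) for token in tokens)


def classify(review, lexicon=LEXICON):
    """Label a review from the sign of its lexicon score."""
    score = lexicon_score(review, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `classify("A brilliant and moving film!")` returns `"positive"` because two lexicon words with positive polarity match, and those matched words are themselves the explanation for the label. No training step is involved, which is one way to read the abstract's point that a lexicon built once can be reused in a similar domain.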

Keyword Analysis of Two SCI Journals on Rock Engineering by using Text Mining (텍스트 마이닝을 이용한 암반공학분야 SCI논문의 주제어 분석)

  • Jung, Yong-Bok;Park, Eui-Seob
    • Tunnel and Underground Space
    • /
    • v.25 no.4
    • /
    • pp.303-319
    • /
    • 2015
  • Text mining is a branch of data mining used to find meaningful information in large amounts of text. In this study, we analyzed the titles and keywords of two SCI journals on rock engineering using text mining to identify major research areas, trends, and associations among research fields. The results were also visualized for intuitive understanding. The two journals showed similar research fields but different patterns in the associations among them. IJRMMS showed a simple network, that is, one large group based on the keyword 'rock' together with a few small groups. RMRE, on the other hand, showed a complex network among various medium-sized groups. Trend analysis by clustering and linear regression of the keyword-year frequency matrix showed that most keywords increased in frequency over time, with only a few declining.
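
The trend step described above, linear regression on a keyword-year frequency matrix, can be sketched as an ordinary least-squares slope per keyword. The years, the keyword counts, and the names `slope` and `trends` below are invented for illustration and are not data or code from the paper; the clustering step is not shown.

```python
def slope(years, counts):
    """Ordinary least-squares slope of counts regressed on years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    return sxy / sxx


# Hypothetical keyword-year frequency matrix: one row of counts per keyword.
years = [2010, 2011, 2012, 2013]
frequency = {
    "rock": [5, 7, 9, 11],
    "blasting": [8, 6, 5, 3],
}

# A positive slope marks a rising keyword, a negative slope a declining one.
trends = {keyword: slope(years, counts) for keyword, counts in frequency.items()}
```

With these made-up counts, "rock" gains two occurrences per year while "blasting" declines, which mirrors the abstract's split between increasing and descending keywords.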

Text Mining for Korean: Characteristics and Application to 2011 Korean Economic Census Data (한국어 텍스트 마이닝의 특성과 2011 한국 경제총조사 자료에의 응용)

  • Goo, Juna;Kim, Kyunga
    • The Korean Journal of Applied Statistics
    • /
    • v.27 no.7
    • /
    • pp.1207-1217
    • /
    • 2014
  • The 2011 Korean Economic Census is the first economic census in Korea. It contains text data on the menus served by Korean-food restaurants as well as structured data on restaurant characteristics, including area, opening year, and total sales. In this paper, we applied text mining to the text data and investigated the statistical and technical issues and characteristics of Korean text mining. Pork belly roast was the most popular menu item across provinces and/or restaurant types in 2010, and the number of restaurants per 10,000 people was especially high in Kangwon-do and Daejeon metropolitan city. Beef tartare and fried pork cutlet were popular menu items in start-up restaurants, while whole chicken soup and maeuntang (spicy fish stew) were popular in long-lived restaurants. These results can serve as a guideline for menu development by restaurant owners and for government policy-making that helps small restaurants choose suitable menus for a successful business.

Stock Prediction Using News Text Mining and Time Series Analysis (뉴스 텍스트 마이닝과 시계열 분석을 이용한 주가예측)

  • Ahn, Sung-Won;Cho, Sung-Bae
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2010.06c
    • /
    • pp.364-369
    • /
    • 2010
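  • This paper proposes an algorithm that performs news text mining to learn, from four years of news data from January 2005 to December 2008, whether each news item is good or bad news for stock prices, and on that basis predicts whether newly published news will contribute to a rise or fall in stock prices. A modified Bag of Words model and a Naive Bayesian classifier were used for news text mining. In particular, unlike previous related studies that relied solely on news mining for stock prediction, RSI, a technique for analyzing stock price time series data, was additionally applied to improve prediction accuracy. In experiments on 42,355 news items over the four months from November 2009 to February 2010, a prediction success rate of 55.01% was obtained, a meaningful result compared with previous studies.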
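
The abstract names RSI (relative strength index) but does not say which variant was used or how it was combined with the classifier output. As a hedged illustration only, the sketch below computes the simple-average form of RSI over the most recent `period` price changes; the function name `rsi`, the default `period=14`, and the choice of the simple-average variant (rather than Wilder's smoothed version) are assumptions, not details from the paper.

```python
def rsi(prices, period=14):
    """Relative strength index over the last `period` price changes.

    Simple-average variant (an assumption): RSI = 100 - 100 / (1 + RS),
    where RS is the average gain divided by the average loss.
    """
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    window = prices[-(period + 1):]
    changes = [later - earlier for earlier, later in zip(window, window[1:])]
    avg_gain = sum(change for change in changes if change > 0) / period
    avg_loss = sum(-change for change in changes if change < 0) / period
    if avg_loss == 0:
        # No losing step in the window: RSI is at its maximum by convention.
        return 100.0
    relative_strength = avg_gain / avg_loss
    return 100 - 100 / (1 + relative_strength)
```

A series that only rises gives 100, a series that only falls gives 0, and equal average gains and losses give 50, so the value can be read as a bounded momentum signal alongside a news-based prediction.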

Establishment of ITS Policy Issues Investigation Method in the Road Section applied Textmining (텍스트마이닝을 활용한 도로분야 ITS 정책이슈 탐색기법 정립)

  • Oh, Chang-Seok;Lee, Yong-taeck;Ko, Minsu
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.15 no.6
    • /
    • pp.10-23
    • /
    • 2016
  • Given the need for careful scrutiny using big data, this study develops and applies a method for searching for audit issues related to ITS policies or programs. To this end, the audit process of the Board of Audit and Inspection was combined with the theoretical framework of boundary analysis proposed by William Dunn as an analysis tool for audit issues. Moreover, we apply text mining to computerize the analysis tool, since it is similar to boundary analysis in its approach to meta-problems. As the specific text mining model, we applied the antisymmetry-symmetry compound lexeme-based LDA model, which builds on the Latent Dirichlet Allocation (LDA) methodology proposed by David Blei. A case analysis identified several major issues: a lack of traffic information collection by the Urban Traffic Information System operated by the National Police Agency, overlapping problems between the Ministry of Land, Infrastructure and Transport and the Advanced Traffic Management System, and fabrication of mileage on digital tachographs.