• Title/Summary/Keyword: text frequency analysis (텍스트 빈도 분석)

Properties and Quantitative Analysis of Bias in Korean Language Models: A Comparison with English Language Models and Improvement Suggestions (한국어 언어모델의 속성 및 정량적 편향 분석: 영어 언어모델과의 비교 및 개선 제안)

  • Jaemin Kim;Dong-Kyu Chae
    • Annual Conference on Human and Language Technology / 2023.10a / pp.558-562 / 2023
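  • With the recent emergence of ChatGPT, interest in text generation models has grown, and research on metrics for evaluating text generation tasks is active. Because traditional metrics based on word frequency cannot account for semantic similarity, BERTScore, a metric built on pre-trained language models, has been widely used. However, this approach raises fairness concerns because of biases present in the data on which the pre-trained language models were trained. Analysis of bias in Korean pre-trained language models is therefore needed, but existing studies are limited in that they did not consider bias across the various attributes that arise in society. Studies comparing the attribute-level biases of pre-trained language models built on different languages have also been lacking. Accordingly, this paper analyzes and compares the attribute-level biases of Korean pre-trained language models, compares them with those of English pre-trained language models, and constructs a dataset that makes the comparison possible. In addition, by analyzing bias according to the type and size of Korean pre-trained language models, it offers a guide for selecting a suitable model.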

Ranking Contribution of Star in Each Domain Using Association Text Mining News Articles on the Web (뉴스기사의 연관 단어 텍스트 마이닝을 이용한 스타의 분야별 기여도순위 비교기법)

  • Kang, Yoonjeong;Yoon, Jaeyeol;Lim, JiYeon;Kim, Ung-mo
    • Proceedings of the Korea Information Processing Society Conference / 2011.11a / pp.1191-1194 / 2011
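  • Star marketing is a marketing strategy in which a star's popularity with the public is used to enhance brand image and generate commercial impact. Today's stars are active not only in broadcasting and entertainment but also in sports, politics, social contribution, and other fields, and a star's image is shaped by those activities. Because a star's image is directly linked to the image of a brand or company, analyzing it in advance is an important element of marketing. We therefore propose a method that classifies the domains in which stars are generally active and, when a star is searched for, ranks the star's contribution by domain. Using text mining on news articles, words related to the star's name and to the activity domains are extracted according to frequency. The related words are then used to count, among the news about the star, the articles related to each domain; when the coverage of a domain is positive or negative, a polarity is assigned and the weight is adjusted accordingly. The fields to which the star contributes are ranked by a score that takes both frequency and polarity into account.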
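
The scoring step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the domain keyword lists, the polarity labels, and the weight values are all hypothetical, and the abstract does not say how polarity is detected or how strongly it is weighted.

```python
from collections import defaultdict

# Hypothetical domain keyword lists and polarity weights (not from the paper).
DOMAIN_KEYWORDS = {
    "entertainment": {"drama", "album", "film"},
    "sports": {"match", "league", "goal"},
    "social_contribution": {"donation", "charity", "volunteer"},
}
POLARITY_WEIGHT = {"positive": 1.5, "neutral": 1.0, "negative": -1.0}


def rank_domains(articles):
    """Score each domain by polarity-weighted article counts, highest first.

    `articles` is a list of (tokens, polarity) pairs, where polarity is one
    of the keys of POLARITY_WEIGHT.
    """
    scores = defaultdict(float)
    for tokens, polarity in articles:
        token_set = set(tokens)
        for domain, keywords in DOMAIN_KEYWORDS.items():
            if token_set & keywords:  # the article mentions this domain
                scores[domain] += POLARITY_WEIGHT[polarity]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Domains that no article mentions are simply absent from the ranking; a real system would also need a polarity classifier, which is assumed to exist upstream here.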

A Study on Text Mining Methods to Analyze Civil Complaints: Structured Association Analysis (민원 분석을 위한 텍스트 마이닝 기법 연구: 계층적 연관성 분석)

  • Kim, HyunJong;Lee, TaiHun;Ryu, SeungEui;Kim, NaRang
    • Journal of Korea Society of Industrial Information Systems / v.23 no.3 / pp.13-24 / 2018
  • For government and public institutions, civil complaints that contain citizens' direct requirements can serve as important data for policy development. However, because complaint text is unstructured, it is difficult to derive accurate requirements with existing text mining methods. This study proposes a new method that improves on previous text mining approaches to civil complaint data and derives citizens' requirements more precisely. The method is based on the principle of the Co-Occurrences Structure Map and is structured as a two-step association analysis, consisting of first-order related words and second-order related words built around a core subject word. The analysis uses 3,004 cases posted on the electronic bulletin board of Busan City in 2016. The academic contribution of this study is a method for deriving citizens' requirements from civil complaint data. As a practical contribution, it also enables policy development using civil complaint data.
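
One way to picture the two-step association structure is the sketch below. It is an assumption-laden illustration rather than the paper's Co-Occurrences Structure Map: it ranks related words by raw document-level co-occurrence counts, and the function names and the `top_n` cutoff are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents each unordered word pair appears in together."""
    pairs = Counter()
    for tokens in documents:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


def related_words(pairs, word, top_n, exclude=()):
    """Return the top_n words that co-occur most often with `word`."""
    scores = Counter()
    for (a, b), count in pairs.items():
        if a == word and b not in exclude:
            scores[b] += count
        elif b == word and a not in exclude:
            scores[a] += count
    # Sort by count (descending), then alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:top_n]]


def two_step_map(documents, core_word, top_n=2):
    """Map each first-order related word to its second-order related words."""
    pairs = cooccurrence_counts(documents)
    first_order = related_words(pairs, core_word, top_n)
    seen = {core_word, *first_order}
    return {
        w: related_words(pairs, w, top_n, exclude=seen) for w in first_order
    }
```

Excluding the core word and the first-order words from the second step keeps the map hierarchical instead of circular.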

Investigations on Techniques and Applications of Text Analytics (텍스트 분석 기술 및 활용 동향)

  • Kim, Namgyu;Lee, Donghoon;Choi, Hochang;Wong, William Xiu Shun
    • The Journal of Korean Institute of Communications and Information Sciences / v.42 no.2 / pp.471-492 / 2017
  • Demand for and interest in big data analytics are increasing rapidly. The concept of big data covers not only existing structured data but also various kinds of unstructured data such as text, images, videos, and logs. Among these, text data have gained particular attention because text is the most representative means of describing and delivering information. Text analysis is generally performed in the following order: document collection, parsing and filtering, structuring, frequency analysis, and similarity analysis. The results can be presented through word clouds, word networks, topic modeling, document classification, and semantic analysis. Notably, there is increasing demand to identify trending topics in the rapidly growing volume of text generated through social media. Research on and applications of topic modeling have therefore been actively pursued in various fields, since topic modeling can extract the core topics from a huge number of unstructured text documents and group the documents by topic. In this paper, we review the major techniques and research trends of text analysis. We also introduce application cases in which topic modeling is used to solve problems in various fields.
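
The middle stages of the pipeline listed in this abstract (parsing and filtering, structuring, frequency analysis, similarity analysis) can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: tokenization is a lowercase ASCII regular expression, and the stopword list is a tiny placeholder.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}  # illustrative only


def term_frequencies(text):
    """Parse, filter, and structure one document as a term-frequency Counter."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Korean text would need a morphological analyzer in place of the regular expression; that step is outside this sketch.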

Social perception of the Arduino lecture as seen in big data (빅데이터 분석을 통한 아두이노 강의에 대한 사회적 인식)

  • Lee, Eunsang
    • Journal of The Korean Association of Information Education / v.25 no.6 / pp.935-945 / 2021
  • The purpose of this study is to analyze the social perception of Arduino lectures using big data analysis. To this end, data from January 2012 to May 2021 were collected through the Textom website by searching for the keyword 'arduino + lecture' in the blog, cafe, and news channels of NAVER. The collected data were refined with Textom, and text mining analysis and semantic network analysis were performed using Textom, Ucinet 6, and Netdraw. Text mining analyses such as frequency analysis, TF-IDF analysis, and degree centrality analysis confirmed that 'education' and 'coding' were the top keywords. CONCOR analysis, performed as part of the semantic network analysis, identified four clusters: 'Arduino-related education', 'Physical computing-related lecture', 'Arduino special lecture', and 'GUI programming'. Through this study, various meaningful social perceptions that the general public holds about Arduino lectures on the Internet were confirmed. The results can serve as data with meaningful implications for instructors preparing Arduino lectures, researchers studying the subject, and policy makers who establish software or coding education policies.

The Study on the patient safety culture convergence research topics through text mining and CONCOR analysis (텍스트마이닝 및 CONCOR 분석을 활용한 환자안전문화 융복합 연구주제 분석)

  • Baek, Su Mi;Moon, Inn Oh
    • Journal of Digital Convergence / v.19 no.12 / pp.359-367 / 2021
  • The purpose of this study is to analyze domestic patient safety culture research topics using text mining and CONCOR analysis. The research proceeded in the stages of data collection, data preprocessing, text mining and social network analysis, and CONCOR analysis. A total of 136 articles were analyzed, excluding papers that were not published. Data analysis was performed using the Textom and UCINET programs. In the results, TF (frequency) across patient safety culture studies was highest for 'patient safety', while TF-IDF (importance in documents) was highest for 'nursing'. The CONCOR analysis derived a total of seven clusters that constitute the patient safety culture: knowledge and attitude, communication, medical service, team, work environment, structure, and organization and management. Future research should examine the relationship between the establishment of a patient safety culture and patient outcomes.
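
The contrast between the top TF term and the top TF-IDF term reported here can be reproduced on toy data. The sketch below uses one common TF-IDF variant (corpus term count times the natural log of N over document frequency); this is an assumption, since the abstract does not state which weighting Textom applies.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (tf, tfidf) dictionaries aggregated over tokenized documents.

    tf[t]    = total occurrences of term t in the corpus
    tfidf[t] = tf[t] * log(N / df[t]), where df[t] is the number of
               documents containing t (one common variant among several).
    """
    n_docs = len(documents)
    tf = Counter()
    df = Counter()
    for tokens in documents:
        tf.update(tokens)
        df.update(set(tokens))
    tfidf = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return tf, tfidf
```

A term that appears in every document gets a TF-IDF of zero under this variant, which is why the most frequent word is often not the most distinctive one.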

A Trend Analysis and Policy proposal for the Work Permit System through Text Mining: Focusing on Text Mining and Social Network analysis (텍스트마이닝을 통한 고용허가제 트렌드 분석과 정책 제안 : 텍스트마이닝과 소셜네트워크 분석을 중심으로)

  • Ha, Jae-Been;Lee, Do-Eun
    • Journal of Convergence for Information Technology / v.11 no.9 / pp.17-27 / 2021
  • The aim of this research was to identify issues surrounding the work permit system and public awareness of it, and to suggest ideas for related government policy. To this end, the research applied text mining to social data. Using Textom, it collected 1,453,272 texts from 6,217 online documents that contained 'work permit system' and were published from January to December 2020, and then performed text mining and social network analysis. From keyword frequency analysis and degree centrality analysis, it extracted the 100 most frequently mentioned keywords and identified job problems, the importance of the policy process, industrial competitiveness, and improvement of living conditions for foreign workers as the major keywords. In addition, semantic network analysis revealed a core perception centered on 'employment policy' and various peripheral perceptions such as 'international cooperation', 'workers' human rights', 'law', 'recruitment of foreigners', 'corporate competitiveness', 'immigrant culture', and 'foreign workforce management'. Finally, the research suggested ideas worth considering when establishing government policy on the work permit system and conducting related research.
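
Degree centrality on a keyword network, as mentioned in this abstract, can be sketched as follows. The network construction is an assumption: two words are linked when they appear in the same document, which may differ from how Textom or UCINET builds the network.

```python
from collections import defaultdict
from itertools import combinations


def degree_centrality(documents):
    """Normalized degree centrality of each word in a co-occurrence network.

    Two words are linked if they appear in the same document. Centrality is
    the number of distinct neighbors divided by (number of nodes - 1).
    """
    neighbors = defaultdict(set)
    for tokens in documents:
        for a, b in combinations(set(tokens), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    n_nodes = len(neighbors)
    if n_nodes < 2:
        return {word: 0.0 for word in neighbors}
    return {word: len(adj) / (n_nodes - 1) for word, adj in neighbors.items()}
```

Words that never co-occur with another word do not enter the network at all in this sketch.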

Correlation Analysis of Social Sentiment and Stock Prices (사회적 감성과 주가의 상관성 분석)

  • Yun, Hongwon
    • Journal of the Korea Institute of Information and Communication Engineering / v.19 no.7 / pp.1593-1598 / 2015
  • In this paper, we analyze the correlation between social sentiment and stock prices. Polarity analysis is conducted for periods in which stock prices plunge or soar, and for the periods preceding them. Using these results, we analyze the relationship between social sentiment and stock prices. We collected historical data for the Dow Jones Industrial Average and detected the plunging and soaring periods. Based on the detected periods, New York Times articles were collected and polarity analysis was conducted. While stock prices soar, the frequency of negative terms decreases and that of positive terms increases. In the periods preceding a plunge or a surge, there is only a small difference between the frequencies of negative and positive terms. The correlation analysis shows a positive correlation between social sentiment and stock prices during plunging and soaring periods. No significant correlation appears in the periods preceding a plunge or a surge.
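
The two computations this abstract relies on, a term-frequency polarity score and a correlation coefficient, can be sketched as below. The polarity formula and the use of Pearson's coefficient are assumptions; the abstract names neither the sentiment lexicon nor the correlation measure.

```python
import math


def sentiment_score(tokens, positive, negative):
    """Net polarity: (positive hits - negative hits) / total tokens."""
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return (pos - neg) / len(tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

In use, `xs` would hold one sentiment score per day (or per article batch) and `ys` the matching index values or returns for the same dates.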

Analysis of Descriptive Course Evaluation of University Chemistry Laboratory Class using Text Mining (텍스트 마이닝을 활용한 대학 화학 실험 수업의 서술형 강의 평가 내용 분석)

  • Yun, Jeonghyun;Park, Geumju
    • Journal of the Korean Chemical Society / v.66 no.3 / pp.218-227 / 2022
  • The purpose of this study is to analyze students' opinions by applying text mining to the good points and suggested improvements in descriptive course evaluations written by students who took a university chemistry laboratory class, and to derive improvements for the class. We analyzed the frequency of occurrence, co-occurrence, and network of keywords. In the network of good points, the most frequent co-mentions were between 'class' and 'professor', along with explanation, understanding, student, passion, fun, TA, experiment, help, etc. In the network of improvements, the most frequent co-mentions were between 'class' and 'student', along with professor, content, explanation, exam, wish, experiment, understanding, difficult, thought, problem, etc. In other words, as good points, students reported that they understood the class content well and found the experimental process enjoyable and satisfying, owing to 'easy and detailed explanation' and 'TA's assistance'. On the other hand, as points for improvement, students reported that their understanding and concentration in class decreased because of 'difficulty of content and exam', 'excessive assignments', and 'class environment'.

Analysis of the Yearbook from the Korea Meteorological Administration using a text-mining algorithm (텍스트 마이닝 알고리즘을 이용한 기상청 기상연감 자료 분석)

  • Sun, Hyunseok;Lim, Changwon;Lee, YungSeop
    • The Korean Journal of Applied Statistics / v.30 no.4 / pp.603-613 / 2017
  • Many people have recently posted about personal interests on social media. The development of the Internet and computer technology has enabled documents to be stored in digital form, which has caused an explosion in the amount of textual data generated; consequently, there is increased demand for technology that extracts valuable information from large numbers of documents. Text mining techniques are often used because text-based data are mostly unstructured and therefore not suited to statistical analysis or data mining techniques. This study analyzed the Meteorological Yearbook data of the Korea Meteorological Administration (KMA) with a text mining technique. First, a term dictionary was constructed through preprocessing and a term-document matrix was generated. The term dictionary was then used to calculate the annual frequency of each term and to observe changes in relative frequency for frequently appearing words. We also used regression analysis to identify terms with increasing and decreasing trends. We analyzed the trends in the Meteorological Yearbook of the KMA, including trends in weather-related news, weather status, and the work that the KMA focused on. This study aims to provide useful information that can help analyze and improve meteorological services and inform meteorological policy.
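
The annual relative frequency and trend steps can be sketched as follows. This is a minimal illustration under stated assumptions: each year's yearbook is reduced to one flat token list, and the trend is the slope of a simple least-squares line, which may be simpler than the regression model the authors fitted.

```python
from collections import Counter


def relative_frequency_by_year(docs_by_year, term):
    """Share of all tokens in each year's text that are `term`."""
    result = {}
    for year, tokens in sorted(docs_by_year.items()):
        counts = Counter(tokens)
        result[year] = counts[term] / len(tokens) if tokens else 0.0
    return result


def trend_slope(series):
    """Ordinary least-squares slope of value against year."""
    years = list(series)
    values = [series[y] for y in years]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("need at least two distinct years")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx
```

A positive slope marks a term whose share of the yearbook text is growing over time, and a negative slope marks a declining one.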