• Title/Summary/Keyword: 텍스트 데이터 분석

Search Result 1,095, Processing Time 0.025 seconds

A Clustering-based Undersampling Method to Prevent Information Loss from Text Data (텍스트 데이터의 정보 손실을 방지하기 위한 군집화 기반 언더샘플링 기법)

  • Jong-Hwi Kim;Saim Shin;Jin Yea Jang
    • Annual Conference on Human and Language Technology
    • /
    • 2022.10a
    • /
    • pp.251-256
    • /
    • 2022
  • 범주 불균형은 분류 모델이 다수 범주에 편향되게 학습되어 소수 범주에 대한 분류 성능을 떨어뜨리는 문제를 야기한다. 언더 샘플링 기법은 다수 범주 데이터의 수를 줄여 소수 범주와 균형을 이루게하는 대표적인 불균형 해결 방법으로, 텍스트 도메인에서의 기존 언더 샘플링 연구에서는 단어 임베딩과 랜덤 샘플링과 같은 비교적 간단한 기법만이 적용되었다. 본 논문에서는 트랜스포머 기반 문장 임베딩과 군집화 기반 샘플링 방법을 통해 텍스트 데이터의 정보 손실을 최소화하는 언더샘플링 방법을 제안한다. 제안 방법의 검증을 위해, 감성 분석 실험에서 제안 방법과 랜덤 샘플링으로 추출한 훈련 세트로 모델을 학습하고 성능을 비교 평가하였다. 제안 방법을 활용한 모델이 랜덤 샘플링을 활용한 모델에 비해 적게는 0.2%, 많게는 2.0% 높은 분류 정확도를 보였고, 이를 통해 제안하는 군집화 기반 언더 샘플링 기법의 효과를 확인하였다.

  • PDF

Extracting User-Specific Advertising Keywords Based on Textual Data Mining from KakaoTalk (카카오톡에서의 텍스트 데이터 마이닝 기반의 사용자별 적합 광고 키워드 도출 )

  • Yerim Jeon;Dayeong So;Jimin Lee;Eunjin (Jinny) Jo;Jihoon Moon
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.05a
    • /
    • pp.368-369
    • /
    • 2023
  • 대화 데이터 기반 광고 추천은 광고 마케팅에서 고객 맞춤형 광고 제공, 마케팅 효과 극대화 등을 위한 중요한 기술로 주목받고 있다. 본 논문에서는 모바일 인스턴스 메신저인 카카오톡 대화창에서 발생한 텍스트 데이터를 기반으로 대화 내용을 분석하여 대화 주제별 적절한 광고 키워드를 제안한다. 이를 위해 주제별 대화 내용을 미용, 식음료, 상거래로 세분하고 KoNLPy 의 Okt 를 이용하여 텍스트 전처리를 수행하고 키워드별로 빈도수를 뽑아 워드 클라우드를 제시한다. 또한, 잠재 디리클레 할당(Latent Dirichlet Allocation, LDA)을 기반으로 대화 주제를 세분화한 뒤 라벨링을 통해 주제별 대화 키워드를 분석한다. 실험 결과, 대화 주제를 온라인 쇼핑, 헤어, 뷰티 관리, 음식으로 나눌 수 있었으며, 토픽별 상위 키워드를 Word2Vec 을 통해 특정 단어와 유사한 키워드를 도출하여 적절한 광고 키워드를 제시할 수 있었다.

A Collaborative Framework for Discovering the Organizational Structure of Social Networks Using NER Based on NLP (NLP기반 NER을 이용해 소셜 네트워크의 조직 구조 탐색을 위한 협력 프레임 워크)

  • Elijorde, Frank I.;Yang, Hyun-Ho;Lee, Jae-Wan
    • Journal of Internet Computing and Services
    • /
    • v.13 no.2
    • /
    • pp.99-108
    • /
    • 2012
  • Many methods had been developed to improve the accuracy of extracting information from a vast amount of data. This paper combined a number of natural language processing methods such as NER (named entity recognition), sentence extraction, and part of speech tagging to carry out text analysis. The data source is comprised of texts obtained from the web using a domain-specific data extraction agent. A framework for the extraction of information from unstructured data was developed using the aforementioned natural language processing methods. We simulated the performance of our work in the extraction and analysis of texts for the detection of organizational structures. Simulation shows that our study outperformed other NER classifiers such as MUC and CoNLL on information extraction.

Analysis of English abstracts in Journal of the Korean Data & Information Science Society using topic models and social network analysis (토픽 모형 및 사회연결망 분석을 이용한 한국데이터정보과학회지 영문초록 분석)

  • Kim, Gyuha;Park, Cheolyong
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.1
    • /
    • pp.151-159
    • /
    • 2015
  • This article analyzes English abstracts of the articles published in Journal of the Korean Data & Information Science Society using text mining techniques. At first, term-document matrices are formed by various methods and then visualized by social network analysis. LDA (latent Dirichlet allocation) and CTM (correlated topic model) are also employed in order to extract topics from the abstracts. Performances of the topic models are compared via entropy for several numbers of topics and weighting methods to form term-document matrices.

Analysis of Real Estate Market Trend Using Text Mining and Big Data (빅데이터와 텍스트마이닝을 이용한 부동산시장 동향분석)

  • Chun, Hae-Jung
    • Journal of Digital Convergence
    • /
    • v.17 no.4
    • /
    • pp.49-55
    • /
    • 2019
  • This study is on the trend of real estate market using text mining and big data. The data were collected through internet news posted on Naver from August 2016 to August 2017. As a result of TF-IDF analysis, the frequency was high in the order of housing, sale, household, real estate market, and region. Many words related to policies such as loan, government, countermeasures, and regulations were extracted, and the region - related words appeared the most frequently in Seoul. The combination of the words related to the region showed that the frequencies of 'Seoul - Gangnam', 'Seoul - Metropolitan area', 'Gangnam - reconstruction' and 'Seoul - reconstruction' appeared frequently. It can be seen that the people's interest and expectation about the reconstruction of Gangnam area is high.

Spatial Clustering Analysis based on Text Mining of Location-Based Social Media Data (위치기반 소셜 미디어 데이터의 텍스트 마이닝 기반 공간적 클러스터링 분석 연구)

  • Park, Woo Jin;Yu, Ki Yun
    • Journal of Korean Society for Geospatial Information Science
    • /
    • v.23 no.2
    • /
    • pp.89-96
    • /
    • 2015
  • Location-based social media data have high potential to be used in various area such as big data, location based services and so on. In this study, we applied a series of analysis methodology to figure out how the important keywords in location-based social media are spatially distributed by analyzing text information. For this purpose, we collected tweet data with geo-tag in Gangnam district and its environs in Seoul for a month of August 2013. From this tweet data, principle keywords are extracted. Among these, keywords of three categories such as food, entertainment and work and study are selected and classified by category. The spatial clustering is conducted to the tweet data which contains keywords in each category. Clusters of each category are compared with buildings and benchmark POIs in the same position. As a result of comparison, clusters of food category showed high consistency with commercial areas of large scale. Clusters of entertainment category corresponded with theaters and sports complex. Clusters of work and study showed high consistency with areas where private institutes and office buildings are concentrated.

An Analysis of the Discourse Topics of Users who Exhibit Symptoms of Depression on Social Media (소셜미디어를 통한 우울 경향 이용자 담론 주제 분석)

  • Seo, Harim;Song, Min
    • Journal of the Korean Society for information Management
    • /
    • v.36 no.4
    • /
    • pp.207-226
    • /
    • 2019
  • Depression is a serious psychological disease that is expected to afflict an increasing number of people. And studies on depression have been conducted in the context of social media because social media is a platform through which users often frankly express their emotions and often reveal their mental states. In this study, large amounts of Korean text were collected and analyzed to determine whether such data could be used to detect depression in users. This study analyzed data collected from Twitter users who had and did not have depressive tendencies between January 2016 and February 2019. The data for each user was separately analyzed before and after the appearance of depressive tendencies to see how their expression changed. In this study the data were analyzed through co-occurrence word analysis, topic modeling, and sentiment analysis. This study's automated data collection method enabled analyses of data collected over a relatively long period of time. Also it compared the textual characteristics of users with depressive tendencies to those without depressive tendencies.

Trends Analysis on Research Articles of the Sharing Economy through a Meta Study Based on Big Data Analytics (빅데이터 분석 기반의 메타스터디를 통해 본 공유경제에 대한 학술연구 동향 분석)

  • Kim, Ki-youn
    • Journal of Internet Computing and Services
    • /
    • v.21 no.4
    • /
    • pp.97-107
    • /
    • 2020
  • This study aims to conduct a comprehensive meta-study from the perspective of content analysis to explore trends in Korean academic research on the sharing economy by using the big data analytics. Comprehensive meta-analysis methodology can examine the entire set of research results historically and wholly to illuminate the tendency or properties of the overall research trend. Academic research related to the sharing economy first appeared in the year in which Professor Lawrence Lessig introduced the concept of the sharing economy to the world in 2008, but research began in earnest in 2013. In particular, between 2006 and 2008, research improved dramatically. In order to grasp the overall flow of domestic academic research of trends, 8 years of papers from 2013 to the present have been selected as target analysis papers, focusing on titles, keywords, and abstracts using database of electronic journals. Big data analysis was performed in the order of cleaning, analysis, and visualization of the collected data to derive research trends and insights by year and type of literature. We used Python3.7 and Textom analysis tools for data preprocessing, text mining, and metrics frequency analysis for key word extraction, and N-gram chart, centrality and social network analysis and CONCOR clustering visualization based on UCINET6/NetDraw, Textom program, the keywords clustered into 8 groups were used to derive the typologies of each research trend. The outcomes of this study will provide useful theoretical insights and guideline to future studies.

How the Title of Investment Strategy Report Affects Stock Price Forecast: Using Text Mining Method (투자전략 보고서의 제목이 주가 예측에 미치는 영향: 텍스트마이닝 중심으로)

  • Jang, Joon-Kyu;Lee, Kyu Hyun;Lee, Zoonky
    • The Journal of Bigdata
    • /
    • v.1 no.2
    • /
    • pp.21-34
    • /
    • 2016
  • There are various investment strategy reports available online, prepared by many financial analysts. If the correlation between the title of the report and analyst forecast can be found, we can tell from the title whether analyst' forecast will be reliable or not. The objective of this study is to see the correlation between the title of analyst investment strategy report and the actual result of forecast by using the Text Mining technique. The result of actual analysis showed that "strong buy and sell call" appeared in the title lead the higher accuracy of analyst forecast and fulfillment ratio. The results that potential investors can get better information by reading the title of the analyst report. We hope that this study could be the basis for new methodologies in this area.

  • PDF

Design and Implementation of a Text Mining System using Intelligent Miner (인텔리전트마이너를 이용한 텍스트마이닝 시스템의 설계 및 구현)

  • 최윤정;박승수
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2000.04b
    • /
    • pp.316-318
    • /
    • 2000
  • 데이터마이닝 기능은 문서의 구조화되지 않은 텍스트보다는 테이블과 일반적인 DB에 있는 구조화된 자료에 초점이 맞춰져 있다. 정보화의 과정속에서 많은 기업이나 조직들은 과거의 시스템을 DB로 구축하여 어느 정도 형태를 갖추게 되었지만, E-business, E-commerce가 활발해지면서 보유하고 있는 DB기반이 아닌 무작위의 새로운 데이터가 사용자들에 의해 생성되기도 한다. 본 논문에서는 이러한 텍스트 문서에 숨어있는 정보들을 발견하기 위한 텍스트마이닝 과정을 시나리오로 설정하고, 문서와 문서집합에 대해 분석도구를 적용하는 어플리케이션을 구현해 보았다. 대규모의 문서집합에 분석도구를 이용함으로써 빠른 문서처리가 가능하고 이는 사용자가 많은 양의 문서들을 다룰 때의 시간비용을 최소화시킬 수 있는 방법이 될 수 있다. 또한 마이닝과정을 통해 발견한 지식과 특징들을 기반으로 반구조화된 파일로 변환하여, 규칙발견, 데이터마이닝기법을 적용하여 의미있는 새로운 결론을 얻을 수 있을 것이다.

  • PDF