• Title/Summary/Keyword: Text data

A study on Korean language processing using TF-IDF (TF-IDF를 활용한 한글 자연어 처리 연구)

  • Lee, Jong-Hwa;Lee, MoonBong;Kim, Jong-Weon
    • The Journal of Information Systems
    • /
    • v.28 no.3
    • /
    • pp.105-121
    • /
    • 2019
  • Purpose One of the reasons for the expansion of information systems in the enterprise is the increased efficiency of data analysis. In particular, complex and unstructured data types such as video, voice, images, and conversations in and out of social networks are increasing rapidly. The purpose of this study is to analyze customer needs from customer voices, i.e., text data, in the web environment. Design/methodology/approach Previous studies have shown that extracting the words that interpret a sentence performs better than simple frequency analysis. In this study, we applied the TF-IDF method, which extracts the keywords that are important in actual sentences, to Korean language research, rather than the TF method, which represents sentences by simple frequency alone. We visualized both techniques through cluster analysis and describe the differences. Findings Applying the TF and TF-IDF techniques to Korean natural language processing, the study demonstrated the shift in value from frequency analysis to semantic analysis, and Korean language processing researchers are expected to adopt the latter technique.
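The TF versus TF-IDF distinction the abstract describes can be sketched on a toy, pre-tokenized corpus (a hypothetical illustration, not the paper's implementation; real Korean text would first need a morphological analyzer):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """TF-IDF weights: term frequency scaled by inverse document frequency,
    which down-weights words that appear in nearly every document."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in corpus]

# Toy pre-tokenized "customer voice" corpus (hypothetical tokens)
docs = [["배송", "빠르다", "만족"],
        ["배송", "느리다", "불만"],
        ["품질", "만족", "배송"]]

weights = tf_idf(docs)
# "배송" (delivery) occurs in all three documents, so its IDF is log(3/3) = 0:
# raw TF would rank it highest, but TF-IDF discards it as uninformative.
```

This is exactly the contrast the study draws: a pure TF ranking surfaces ubiquitous words, while TF-IDF promotes the terms that distinguish one document from the rest.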

A Study on the Changes in Consumer Perceptions of the Relationship between Ethical Consumption and Consumption Value: Focusing on Analyzing Ethical Consumption and Consumption Value Keyword Changes Using Big Data (윤리적 소비와 소비가치의 관계에 대한 소비자 인식 변화: 소셜 빅데이터를 활용한 윤리적 소비와 소비가치의 키워드 변화 분석을 중심으로)

  • Shin, Eunjung;Koh, Ae-Ran
    • Human Ecology Research
    • /
    • v.59 no.2
    • /
    • pp.245-259
    • /
    • 2021
  • The purpose of this study was to analyze big data to identify the sub-dimensions of ethical consumption, as well as the consumption value associated with ethical consumption as it changes over time. For this study, data were collected from Naver and Daum using the keyword 'ethical consumption', and frequency and matrix data were extracted through Textom for the period January 1, 2016, to December 31, 2018. In addition, a two-mode network analysis was conducted using the UCINET 6.0 program and visualized using the NetDraw function. The results of text mining show keyword frequency increasing year on year, indicating that interest in ethical consumption has grown. The sub-dimensions derived for 2014 and 2015 are fair trade, ethical consumption, eco-friendly products, and cooperatives, and for 2016 are fair trade, ethical consumption, eco-friendly products, and animal welfare. The consumption value keywords derived were classified into emotional value, social value, functional value, and conditional value, with the influence of functional value found to grow over time. Through network analysis, the relationship between the sub-dimensions of ethical consumption and the consumption values derived each year from 2014 to 2018 showed a significantly strong correlation between eco-friendly product consumption and emotional, social, functional, and conditional value.
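The two-mode (keyword-by-value) structure the study analyzes with UCINET can be illustrated with a minimal co-occurrence matrix over invented post data (illustrative only, not the study's dataset):

```python
from collections import defaultdict

# Hypothetical posts: each pairs ethical-consumption sub-dimension keywords
# with consumption-value keywords extracted from the same post
posts = [
    (["fair trade"], ["social value"]),
    (["eco-friendly products"], ["emotional value", "functional value"]),
    (["eco-friendly products"], ["social value"]),
]

# Two-mode (bipartite) co-occurrence matrix: sub-dimension x value keyword
matrix = defaultdict(int)
for dims, values in posts:
    for d in dims:
        for v in values:
            matrix[(d, v)] += 1
# "eco-friendly products" co-occurs with three distinct value keywords here,
# mirroring the strong eco-friendly/value correlations the study reports.
```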

English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

  • Jeong-Uk Bang;Joon-Gyu Maeng;Jun Park;Seung Yun;Sang-Hun Kim
    • ETRI Journal
    • /
    • v.45 no.1
    • /
    • pp.18-27
    • /
    • 2023
  • We present an English-Korean speech translation corpus, named EnKoST-C. End-to-end model training for speech translation tasks often suffers from a lack of parallel data, such as speech data in the source language paired with equivalent text data in the target language. Most publicly available speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. We therefore created EnKoST-C, centered on TED Talks. In this process, we enhanced the sentence alignment approach using subtitle time information and bilingual sentence embeddings. As a result, we built a 559-h English-Korean speech translation corpus. The proposed sentence alignment approach achieved an excellent F-measure score of 0.96. We also report the baseline performance of an English-Korean speech translation model trained on EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.
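The combined use of subtitle timing and bilingual embedding similarity for sentence alignment can be sketched as a simple scoring function (a hypothetical blend with an assumed weight `w`; the paper's actual alignment algorithm is not specified in this abstract):

```python
import math

def overlap_ratio(a, b):
    """Temporal overlap of two subtitle intervals (start, end), in [0, 1]."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align_score(seg_en, seg_ko, w=0.5):
    """Blend subtitle-time overlap with bilingual embedding similarity."""
    return (w * overlap_ratio(seg_en["time"], seg_ko["time"])
            + (1 - w) * cosine(seg_en["emb"], seg_ko["emb"]))

# Toy 2-d "bilingual embeddings" standing in for real sentence encoders
en      = {"time": (10.0, 14.0), "emb": [0.9, 0.1]}
ko_good = {"time": (10.2, 14.1), "emb": [0.8, 0.2]}   # same talk segment
ko_bad  = {"time": (30.0, 33.0), "emb": [0.1, 0.9]}   # unrelated segment
```

A correctly aligned pair scores high on both signals, so candidate Korean subtitles can be ranked per English segment.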

Big Data Analysis of the Annals of the Joseon Dynasty Using Jsoup (Jsoup를 이용한 조선왕조실록의 빅 데이터 분석)

  • Bong, Young-Il;Lee, Choong-Ho
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2021.10a
    • /
    • pp.131-133
    • /
    • 2021
  • The Annals of the Joseon Dynasty are important records registered with UNESCO. This paper proposes a method for big data analysis based on the frequency of words in the Annals of the Joseon Dynasty as translated into Korean. When the Annals are accessed through an Internet site, reading the page source directly yields HTML markup mixed in with the text, which makes word-frequency analysis of the body text difficult. In this paper, we propose a method to extract and analyze the text of the Annals of the Joseon Dynasty using Java's Jsoup crawling library. In the experiment, only the Taejo portion of the Annals was extracted to verify the validity of the method.
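The paper uses Java's Jsoup; the same idea of stripping HTML markup before counting word frequencies can be sketched with Python's standard-library `html.parser` (an analogue of Jsoup's `.text()`, not the authors' code):

```python
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text only, skipping <script>/<style> content,
    similar in spirit to Jsoup's Element.text()."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

# Toy page: markup and script source must not leak into the word counts
html = "<html><script>var x=1;</script><body><p>태조 이성계 태조</p></body></html>"
p = TextExtractor()
p.feed(html)
p.close()
words = Counter(" ".join(p.chunks).split())
# words counts "태조" twice; "var" from the script never appears
```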

Vulnerability Threat Classification Based on XLNet and ST5-XXL Models

  • Chae-Rim Hong;Jin-Keun Hong
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.16 no.3
    • /
    • pp.262-273
    • /
    • 2024
  • We provide a detailed analysis of the data processing and model training process for vulnerability classification using Transformer-based language models, specifically the sentence text-to-text transformer (ST5)-XXL and XLNet. The main purpose of this study is to compare the performance of the two models, identify the strengths and weaknesses of each, and determine the optimal learning rate to increase the efficiency and stability of model training. We performed data preprocessing, constructed and trained the models, and evaluated performance on datasets with various characteristics. We confirmed that the XLNet model performed well at learning rates of 1e-05 and 1e-04 and had a significantly lower loss value than the ST5-XXL model, indicating that XLNet is more efficient to train. We also confirmed that the learning rate has a significant impact on model performance. The results highlight the usefulness of the ST5-XXL and XLNet models for classifying security vulnerabilities and the importance of setting an appropriate learning rate. Future research should include more comprehensive analyses using diverse datasets and additional models.

A Big Data Preprocessing using Statistical Text Mining (통계적 텍스트 마이닝을 이용한 빅 데이터 전처리)

  • Jun, Sunghae
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.25 no.5
    • /
    • pp.470-476
    • /
    • 2015
  • Big data has been used in diverse areas. Computer science and sociology, for example, approach big data with different concerns, but both analyze it and draw implications from the results, so meaningful analysis and interpretation of big data are needed in most areas. Statistics and machine learning provide various methods for big data analysis. In this paper, we study a process for big data analysis and propose an efficient methodology covering the entire process, from collecting big data to interpreting the analysis results. In addition, because patent documents have the characteristics of big data, we propose an approach that applies big data analysis to patent data and uses the results to build R&D strategy. To illustrate how to apply the proposed methodology to a real problem, we perform a case study using applied and registered patent documents retrieved from patent databases worldwide.

Reinforcement Method for Automated Text Classification using Post-processing and Training with Definition Criteria (학습방법개선과 후처리 분석을 이용한 자동문서분류의 성능향상 방법)

  • Choi, Yun-Jeong;Park, Seung-Soo
    • The KIPS Transactions:PartB
    • /
    • v.12B no.7 s.103
    • /
    • pp.811-822
    • /
    • 2005
  • Automated text categorization classifies free-text documents into predefined categories automatically; its main goal is to reduce the considerable manual effort the task otherwise requires. Recent research on improving text categorization performance has focused on enhancing existing classification models and algorithms themselves, but its scope has been limited by feature-based statistical methodology. In this paper, we propose the RTPost system, which differs from traditional methods by taking a fault-tolerant system approach and a data mining strategy. The two key parts of the RTPost system are reinforcement training and post-processing. First, the training method addresses the problem of defining the categories to be classified before selecting training sample documents. The post-processing method then addresses the problem of assigning categories, rather than the performance of the classification algorithms. In experiments, we applied our system to documents with low classification accuracy that lay near a decision boundary. The experiments show that our system has high accuracy and stability under realistic conditions, and that it does not depend on variables that strongly influence classification power, such as the number of training documents, the selection problem, or the performance of the classification algorithms. In addition, we can expect a self-learning effect that decreases training cost and increases training power by exploiting the advantages of active learning.
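The post-processing idea, re-examining documents that land near a decision boundary instead of trusting the classifier's top label, can be sketched with a margin rule (a hypothetical illustration of the concept, not the actual RTPost system):

```python
def margin(scores):
    """Difference between the top two class scores for one document."""
    top = sorted(scores.values(), reverse=True)
    return top[0] - top[1]

def post_process(doc_scores, threshold=0.1):
    """Keep confident assignments; route boundary cases (small margin)
    to a separate post-processing / re-examination step."""
    confident, boundary = {}, []
    for doc, scores in doc_scores.items():
        if margin(scores) >= threshold:
            confident[doc] = max(scores, key=scores.get)
        else:
            boundary.append(doc)
    return confident, boundary

docs = {
    "d1": {"sports": 0.80, "politics": 0.15},
    "d2": {"sports": 0.48, "politics": 0.45},  # near the decision boundary
}
confident, boundary = post_process(docs)
```

Boundary documents like `d2` are exactly the low-accuracy cases the abstract says the system targets; re-labeling them separately also feeds the active-learning loop mentioned at the end.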

The Effect of Text Consistency between the Review Title and Content on Review Helpfulness (온라인 리뷰의 제목과 내용의 일치성이 리뷰 유용성에 미치는 영향)

  • Li, Qinglong;Kim, Jaekyeong
    • Knowledge Management Research
    • /
    • v.23 no.3
    • /
    • pp.193-212
    • /
    • 2022
  • Many studies have proposed factors that affect review helpfulness. Previous studies have investigated the effects of quantitative factors (e.g., star ratings) and affective factors (e.g., sentiment scores) on review helpfulness. Online reviews contain both titles and content, but existing studies focus on the review content, and investigating helpfulness factors from the content alone, without the title, is limited. Prior work that did consider both examined the effects of review content and title on helpfulness independently, which may ignore the potential impact of similarity between the review title and content. Based on mere exposure effect theory, this study examines how text consistency between review titles and content affects review helpfulness, also considering the roles of information clearness, review length, and source reliability. The results show that text consistency between the review title and content negatively affects review helpfulness. Furthermore, information clearness and source reliability weaken this negative effect.
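Text consistency between a review title and its content can be approximated by the cosine similarity of their bag-of-words vectors (a minimal sketch; the study's actual consistency measure may differ):

```python
import math
from collections import Counter

def consistency(title, content):
    """Cosine similarity between title and content bag-of-words vectors."""
    a = Counter(title.lower().split())
    b = Counter(content.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy reviews (hypothetical): a title that restates the content vs. one that
# adds no overlapping information
high = consistency("great battery life", "great battery life overall")
low = consistency("great battery life", "shipping was slow and box damaged")
```

Under the study's finding, a higher score on such a measure (title merely restating the content) would predict lower review helpfulness.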