• Title/Summary/Keyword: Korean text classification

A comparative study of filter methods based on information entropy

  • Kim, Jung-Tae;Kum, Ho-Yeun;Kim, Jae-Hwan
    • Journal of Advanced Marine Engineering and Technology
    • /
    • v.40 no.5
    • /
    • pp.437-446
    • /
    • 2016
  • Feature selection has become an essential technique to reduce the dimensionality of data sets. Many features are frequently irrelevant or redundant for classification tasks. The purpose of feature selection is to select relevant features and remove irrelevant and redundant ones. Applications of feature selection range from text processing, face recognition, bioinformatics, speaker verification, and medical diagnosis to financial domains. In this study, we focus on filter methods based on information entropy: IG (Information Gain), FCBF (Fast Correlation Based Filter), and mRMR (minimum Redundancy Maximum Relevance). FCBF has the advantage of reducing the computational burden by eliminating redundant features that satisfy the condition of an approximate Markov blanket. However, FCBF considers only the relevance between a feature and the class when selecting the best features, and thus fails to take the interaction between features into consideration. In this paper, we propose an improved FCBF to overcome this shortcoming, and we perform a comparative study to evaluate the performance of the proposed method.
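
A minimal sketch of the filter idea surveyed in this entry, assuming scikit-learn and its bundled breast-cancer dataset as stand-ins for the study's own data: features are ranked by mutual information with the class (an IG-style relevance score), and a simple correlation penalty approximates the mRMR redundancy term. This is an illustration only, not the authors' implementation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
n_select = 10  # illustrative target size

# Relevance: mutual information between each feature and the class label.
relevance = mutual_info_classif(X, y, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < n_select:
    best, best_score = None, -np.inf
    for j in remaining:
        # Redundancy: mean absolute correlation with already-selected features
        # (a cheap stand-in for feature-feature mutual information).
        if selected:
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, k])[0, 1])
                                  for k in selected])
        else:
            redundancy = 0.0
        score = relevance[j] - redundancy  # mRMR-style criterion
        if score > best_score:
            best, best_score = j, score
    selected.append(best)
    remaining.remove(best)

print("selected feature indices:", selected)
```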

Generation of Korean Intonation using Vector Quantization (벡터 양자화를 이용한 한국어 억양 곡선 생성)

  • An, Hye-Sun;Kim, Hyung-Soon
    • Annual Conference on Human and Language Technology
    • /
    • 2001.10d
    • /
    • pp.209-212
    • /
    • 2001
  • In this paper, we use vector quantization for the intonation model of a text-to-speech system. Word-phrase boundary strength (break index) is classified into three levels, and prediction rules for break indices are generated with CART (Classification And Regression Tree). Prosodic phrases are predicted from the predicted break indices and classified into five intonation patterns. Each prosodic phrase is simplified into four parameters: the time-axis and frequency-axis values of the peak and the slopes before and after it. Prosodic phrases are first divided according to whether or not they are sentence-final and then into the five intonation patterns, giving ten prosodic-phrase sets in total. The intonation patterns of the prosodic phrases, each represented by the four parameters, are then clustered by vector quantization. For particles and endings, where prosodic variation is most pronounced, 12-point fundamental-frequency values are extracted and vector-quantized. The codebook indices of the prosodic phrases and of the particles and endings are predicted with CART from feature variables extracted from the sentence. At synthesis time, the intonation parameters of the prosodic phrases are estimated for the input text, then the 12-point fundamental-frequency values of the particles and endings are estimated to generate the overall intonation contour, which is synthesized through the speech synthesizer built in our laboratory.
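
A minimal sketch of the vector-quantization step described in this abstract: each prosodic phrase is reduced to four parameters and mapped to a codebook index by clustering. The random data and the codebook size of 16 are illustrative assumptions, not the paper's settings, and scikit-learn's KMeans stands in for a classical VQ codebook trainer.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Columns: [peak_time, peak_f0, slope_before, slope_after] per prosodic phrase.
phrase_params = rng.normal(size=(500, 4))

codebook_size = 16  # assumed, not the paper's value
vq = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(phrase_params)

# Each phrase is now represented by a codebook index that a CART model
# could be trained to predict from textual features.
codebook_indices = vq.predict(phrase_params)
print(codebook_indices[:10])
```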

Adoption of Virtual Technology to the Development of a BIM based PMIS

  • Suh, Bong-Gyo;Lee, Ghang;Yun, Seok-Heon
    • Journal of the Korea Institute of Building Construction
    • /
    • v.13 no.4
    • /
    • pp.333-340
    • /
    • 2013
  • As construction projects become bigger, PMIS is used as a collaboration tool for project participants: owners, designers, inspectors, and contractors. Because the data in PMIS are usually plain text and most PMIS have no standard information classification system, data usability, such as the capacity for data search and analysis, is limited. BIM uses objects and properties, and this information can be used to relate BIM data to other construction information; BIM technologies can therefore be combined with PMIS to enhance data usability. The web environment is very convenient for multiple users, but data transfer is slow for big files such as BIM model files. In this study, we suggest applying Virtual Technology (VT) to enhance the performance of BIM data exchange in PMIS, and we test and analyze its efficiency when it is used to integrate BIM and PMIS in the web environment. The results show that VT can be used to enhance the efficiency of BIM data exchange in the web environment.

Extracting of Interest Issues Related to Patient Medical Services for Small and Medium Hospital by SNS Big Data Text Mining and Social Networking (중소병원 환자의료서비스에 관한 관심 이슈 도출을 위한 SNS 빅 데이터 텍스트 마이닝과 사회적 연결망 적용)

  • Hwang, Sang Won
    • Korea Journal of Hospital Management
    • /
    • v.23 no.4
    • /
    • pp.26-39
    • /
    • 2018
  • Purposes: The purpose of this study is to analyze issues of interest in the patient medical services of small and medium hospitals using big data. Methods: The study applied text mining and social network analysis to SNS big data; key keywords were extracted and their correlations analyzed with the Textom, UCINET 6, and NetDraw programs. Findings: The frequency, degree centrality, and closeness centrality analyses showed that interest centered on explanations and evaluations of the technology, information, security, safety, cost, and problems of small and medium hospitals, on coping with infections, and on payment and settlement. Keywords on care areas such as pediatrics, dentistry, obstetrics and gynecology, dementia, nursing, care of the elderly, and rehabilitation were also extracted. Practical Implications: Future studies will be more useful if they analyze how the medical-service needs of customers of small and medium hospitals differ between the metropolitan area and the provinces, together with further classification studies.
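
A minimal sketch of the network step described in this entry, assuming Python's networkx in place of the UCINET 6/NetDraw tools actually used: keywords that co-occur in the same post are linked, and degree and closeness centrality are computed on the resulting graph. The sample posts and keywords are placeholders.

```python
from itertools import combinations
import networkx as nx

posts = [
    ["hospital", "infection", "safety"],
    ["hospital", "cost", "information"],
    ["infection", "safety", "cost"],
]

# Build a weighted co-occurrence graph: an edge for each keyword pair per post.
G = nx.Graph()
for keywords in posts:
    for a, b in combinations(sorted(set(keywords)), 2):
        w = G.get_edge_data(a, b, {}).get("weight", 0)
        G.add_edge(a, b, weight=w + 1)

print("degree centrality:   ", nx.degree_centrality(G))
print("closeness centrality:", nx.closeness_centrality(G))
```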

Incremental SVM for Online Product Review Spam Detection (온라인 제품 리뷰 스팸 판별을 위한 점증적 SVM)

  • Ji, Chengzhang;Zhang, Jinhong;Kang, Dae-Ki
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2014.05a
    • /
    • pp.89-93
    • /
    • 2014
  • Reviews are very important for potential consumers making choices; they are also used by manufacturers to find problems with their products and to collect competitors' business information. However, some people write fake reviews to mislead readers into making wrong choices, so detecting fake reviews is an important problem for e-commerce sites. Support Vector Machines (SVMs) are important text classification algorithms with excellent performance. In this paper, we propose a new incremental algorithm for online review spam detection based on weighting, an extension of the Karush-Kuhn-Tucker (KKT) conditions, and the convex hull. Finally, we analyze its performance theoretically.
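
A rough sketch of the incremental idea, assuming scikit-learn's SVC and synthetic data: after each batch, only the current support vectors are kept and combined with the new examples before retraining. This approximates the spirit of the KKT/convex-hull selection described in the abstract rather than reproducing it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
batches = np.array_split(np.arange(len(y)), 5)  # simulate reviews arriving in batches

model = SVC(kernel="linear")
X_keep, y_keep = X[batches[0]], y[batches[0]]
model.fit(X_keep, y_keep)

for idx in batches[1:]:
    # Retain only the support vectors from the previous round, then add new data.
    sv = model.support_
    X_keep = np.vstack([X_keep[sv], X[idx]])
    y_keep = np.concatenate([y_keep[sv], y[idx]])
    model = SVC(kernel="linear").fit(X_keep, y_keep)

print("final number of support vectors:", len(model.support_))
```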

A Method of Classifying Tweet by subject using features (특징추출을 이용한 트위터 메시지 주제 분류 방법)

  • Song, Ji-min;Kim, Han-woo;Kim, Dong-joo;Jung, Sung-hoon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2014.05a
    • /
    • pp.905-907
    • /
    • 2014
  • Twitter is a special place where people around the world can freely share their information and opinions, and there have been many attempts to utilize the vast amount of information produced on Twitter; studies on classifying tweets by subject are actively conducted. Twitter is a service for sharing information in short 140-character text messages, and such brief messages make it hard to extract a variety of information. In this paper, we suggest a method to classify tweets by subject that uses both tweet features and subject features. To verify the proposed method, we collected 10,000 tweet messages with the Twitter API. Through the experimental results, we show that the performance of our proposed method is better than that of previous methods.
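
A minimal sketch of short-message subject classification, assuming plain TF-IDF features and a naive Bayes classifier from scikit-learn; the toy tweets and labels below replace the paper's own tweet/subject feature set and 10,000-tweet corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["new phone released today", "great match last night",
          "stock market falls again", "team wins the championship"]
subjects = ["tech", "sports", "finance", "sports"]

# Bag-of-words pipeline: unigrams and bigrams feed a multinomial naive Bayes model.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(tweets, subjects)

print(clf.predict(["the market rallied after earnings"]))
```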

Robust Sentiment Classification of Metaverse Services Using a Pre-trained Language Model with Soft Voting

  • Haein Lee;Hae Sun Jung;Seon Hong Lee;Jang Hyun Kim
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.9
    • /
    • pp.2334-2347
    • /
    • 2023
  • Metaverse services generate text data, a form of ubiquitous-computing data, in real time, and analyzing user emotions from these data is an important task for metaverse services. This study aims to classify user sentiments using deep learning and pre-trained language models based on the transformer structure. Previous studies collected data from a single platform, whereas the current study incorporated review data retrieved with the "Metaverse" keyword from both the YouTube and Google Play Store platforms for more general utilization. As a result, the Bidirectional Encoder Representations from Transformers (BERT) and Robustly optimized BERT approach (RoBERTa) models combined through a soft voting mechanism achieved the highest accuracy of 88.57%. In addition, the area under the curve (AUC) score of the ensemble model comprising RoBERTa, BERT, and A Lite BERT (ALBERT) was 0.9458. The results demonstrate that the ensemble combined with the RoBERTa model exhibits good performance, so the RoBERTa model can be applied on platforms that provide metaverse services. The findings contribute to the advancement of natural language processing techniques in metaverse services, which are increasingly important in digital platforms and virtual environments. Overall, this study provides empirical evidence that sentiment analysis using deep learning and pre-trained language models is a promising approach to improving user experiences in metaverse services.
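
A minimal sketch of the soft-voting mechanism described in this abstract, using the Hugging Face transformers API: each model's predicted class probabilities are averaged and the argmax is taken. The checkpoint paths are placeholders for BERT- and RoBERTa-family models fine-tuned on the same sentiment label set; they are not the study's actual models.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder paths: both checkpoints must share the same sentiment label set.
checkpoints = ["path/to/finetuned-bert", "path/to/finetuned-roberta"]
text = "The metaverse concert was amazing!"

probs = []
for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    with torch.no_grad():
        logits = model(**tokenizer(text, return_tensors="pt")).logits
    probs.append(torch.softmax(logits, dim=-1))

# Soft voting: average the class probabilities, then pick the most likely class.
avg_probs = torch.mean(torch.stack(probs), dim=0)
print("predicted class id:", int(avg_probs.argmax()))
```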

CORRECT? CORECT!: Classification of ESG Ratings with Earnings Call Transcript

  • Haein Lee;Hae Sun Jung;Heungju Park;Jang Hyun Kim
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.4
    • /
    • pp.1090-1100
    • /
    • 2024
  • While incorporating ESG indicators is recognized as crucial for sustainability and increased firm value, inconsistent disclosure of ESG data and vague assessment standards have been key challenges. To address these issues, this study proposes an automated ESG rating strategy based on ambiguous text. Earnings Call Transcript data were classified as E, S, or G using the more than 450 metrics of the Refinitiv Sustainable Leadership Monitor. The study employed advanced natural language processing models such as BERT, RoBERTa, ALBERT, FinBERT, and ELECTRA to precisely classify ESG documents. In addition, the authors computed the average predicted probability for each label, providing a means to identify the relative significance of different ESG factors. The experimental results demonstrated the capability of the proposed methodology to enhance the ESG assessment criteria established by various rating agencies and highlighted that companies primarily focus on governance factors; in other words, companies were making efforts to strengthen their governance frameworks. In conclusion, this framework enables sustainable and responsible business by providing insight into the ESG information contained in Earnings Call Transcript data.
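
A minimal sketch of the probability-averaging step mentioned in this abstract: given each passage's predicted probabilities over the E, S, and G labels, the per-label averages indicate which pillar dominates. The numbers below are fabricated for illustration and are not model outputs from the study.

```python
import numpy as np

labels = ["E", "S", "G"]
# Rows = transcript passages, columns = predicted P(E), P(S), P(G).
pred_probs = np.array([
    [0.10, 0.25, 0.65],
    [0.05, 0.30, 0.65],
    [0.20, 0.20, 0.60],
])

# Average the predicted probabilities per label across all passages.
avg_per_label = pred_probs.mean(axis=0)
for label, p in zip(labels, avg_per_label):
    print(f"{label}: {p:.2f}")
# With these toy numbers, governance (G) receives the highest average probability,
# mirroring the paper's finding that firms emphasize governance factors.
```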

Applying NIST AI Risk Management Framework: Case Study on NTIS Database Analysis Using MAP, MEASURE, MANAGE Approaches (NIST AI 위험 관리 프레임워크 적용: NTIS 데이터베이스 분석의 MAP, MEASURE, MANAGE 접근 사례 연구)

  • Jung Sun Lim;Seoung Hun Bae;Taehoon Kwon
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.47 no.2
    • /
    • pp.21-29
    • /
    • 2024
  • Fueled by international efforts towards AI standardization, including those by the European Commission, the United States, and international organizations, this study introduces an AI-driven framework for analyzing advancements in drone technology. Utilizing project data retrieved from the NTIS DB via the "drone" keyword, the framework employs a diverse toolkit of supervised learning methods (Keras MLP, XGBoost, LightGBM, and CatBoost) enhanced by BERTopic, a natural language analysis tool. This multifaceted approach ensures both comprehensive data-quality evaluation and in-depth structural analysis of the documents. Furthermore, a 6T-based classification method refines non-applicable data for year-on-year AI analysis, demonstrably improving classification accuracy. Harnessing the power of AI, including GPT-4, this research unveils year-on-year trends in emerging keywords and uses them to generate detailed summaries, enabling efficient processing of large text datasets and offering an AI analysis system applicable to policy domains. Notably, this study not only advances methodologies aligned with AI Act standards but also lays the groundwork for responsible AI implementation through analysis of government research and development investments.
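
A minimal sketch of the BERTopic step named in this entry, using its standard fit_transform interface; the 20 newsgroups posts are only a publicly available stand-in for the NTIS project records retrieved with the "drone" keyword.

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Stand-in corpus: the study's NTIS records are not publicly bundled.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)  # assign a topic to each document
print(topic_model.get_topic_info().head())       # inspect the discovered topics
```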

Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities (문헌간 유사도를 이용한 자동분류에서 미분류 문헌의 활용에 관한 연구)

  • Kim, Pan-Jun;Lee, Jae-Yun
    • Journal of the Korean Society for Information Management
    • /
    • v.24 no.1 s.63
    • /
    • pp.251-271
    • /
    • 2007
  • This paper studies the problem of classifying documents with labeled and unlabeled learning data, especially with regard to using document similarity features. The problem of using unlabeled data is practically important because in many information systems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. A general semi-supervised learning algorithm has two steps. First, it trains a classifier using the available labeled documents and classifies the unlabeled documents. Then, it trains a new classifier using all the training documents, whether they were labeled manually or automatically. We suggest two types of semi-supervised learning algorithm with regard to using document similarity features: one-step semi-supervised learning, which uses unlabeled documents only to generate document similarity features, and two-step semi-supervised learning, which uses unlabeled documents as learning examples as well as for similarity features. Experimental results, obtained using support vector machines and a naive Bayes classifier, show that with a small labeled set and a large unlabeled set we can obtain better performance than supervised learning that uses labeled data only. When the efficiency of a classifier system is considered, the one-step semi-supervised learning algorithm suggested in this study could be a good solution for improving classification performance with unlabeled documents.
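
A minimal sketch of the two-step procedure described above, assuming scikit-learn, TF-IDF features, and the 20 newsgroups corpus in place of the paper's inter-document similarity features and test collection: a classifier trained on the labeled documents pseudo-labels the unlabeled ones, and a new classifier is then trained on both.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(stop_words="english").fit_transform(data.data)
y = np.array(data.target)

# Pretend only 5% of the documents carry manual labels.
rng = np.random.default_rng(0)
labeled = rng.random(len(y)) < 0.05

# Step 1: train on the labeled documents and pseudo-label the rest.
clf = MultinomialNB().fit(X[labeled], y[labeled])
pseudo = clf.predict(X[~labeled])

# Step 2: retrain on manually labeled plus automatically labeled documents.
y_all = y.copy()
y_all[~labeled] = pseudo
final_clf = MultinomialNB().fit(X, y_all)

print(f"{labeled.sum()} manually labeled, {len(y) - labeled.sum()} pseudo-labeled documents")
```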