• Title/Summary/Keyword: Korean text classification

Search Result 413, Processing Time 0.025 seconds

Group-wise Keyword Extraction of the External Audit using Text Mining and Association Rules (텍스트마이닝과 연관규칙을 이용한 외부감사 실시내용의 그룹별 핵심어 추출)

  • Seong, Yoonseok;Lee, Donghee;Jung, Uk
    • Journal of Korean Society for Quality Management
    • /
    • v.50 no.1
    • /
    • pp.77-89
    • /
    • 2022
  • Purpose: In order to improve the audit quality of a company, an in-depth analysis is required to categorize the audit report in the form of a text document containing the details of the external audit. This study introduces a systematic methodology to extract keywords for each group that determines the differences between groups such as 'audit plan' and 'interim audit' using audit reports collected in the form of text documents. Methods: The first step of the proposed methodology is to preprocess the document through text mining. In the second step, the documents are classified into groups using machine learning techniques and based on this, important vocabularies that have a dominant influence on the performance of classification are extracted. In the third step, the association rules for each group's documents are found. In the last step, the final keywords for each group representing the characteristics of each group are extracted by comparing the important vocabulary for classification with the important vocabulary representing the association rules of each group. Results: This study quantitatively calculates the importance value of the vocabulary used in the audit report based on machine learning rather than the qualitative research method such as the existing literature search, expert evaluation, and Delphi technique. From the case study of this study, it was found that the extracted keywords describe the characteristics of each group well. Conclusion: This study is meaningful in that it has laid the foundation for quantitatively conducting follow-up studies related to key vocabulary in each stage of auditing.

An Improved Text Classification (향상된 텍스트 분류)

  • Wang, Guangxing;Shin, Seong-Yoon;Shin, Kwang-Weong;Lee, Hyun-Chang
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2019.01a
    • /
    • pp.125-126
    • /
    • 2019
  • In this paper, we propose an improved kNN classification method. Through improved the mothed and normalizing the data, the purpose of improving the accuracy is achieved. Then we compared the three classification algorithms and the improved algorithm by experimental data.

  • PDF

The Application way on Semiotic Structure of Knowledge Classification (지식 분류의 기호학적 체계 응용 방안)

  • Yoon, Jeng-Giy
    • Journal of Korean Library and Information Science Society
    • /
    • v.43 no.2
    • /
    • pp.273-292
    • /
    • 2012
  • This study unpackes semiotic character of knowledge classification and wants to know how sign structure of classification effects on canon and banned book etc, and by this impact stems from the semoitic structure structurally, discusses coidentity between banned book and internet in social and cultural structure aspect. and proposes way for understanding and interpretation text like mass media using structuralism theory.

The Color Polarity Method for Binarization of Text Region in Digital Video (디지털 비디오에서 문자 영역 이진화를 위한 색상 극화 기법)

  • Jeong, Jong-Myeon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.14 no.9
    • /
    • pp.21-28
    • /
    • 2009
  • Color polarity classification is a process to determine whether the color of text is bright or dark and it is prerequisite task for text extraction. In this paper we propose a color polarity method to extract text region. Based on the observation for the text and background regions, the proposed method uses the ratios of sizes and standard deviations of bright and dark regions. At first, we employ Otsu's method for binarization for gray scale input region. The two largest segments among the bright and the dark regions are selected and the ratio of their sizes is defined as the first measure for color polarity classification. Again, we select the segments that have the smallest standard deviation of the distance from the center among two groups of regions and evaluate the ratio of their standard deviation as the second measure. We use these two ratio features to determine the text color polarity. The proposed method robustly classify color polarity of the text. which has shown by experimental result for the various font and size.

Classification Performance Analysis of Cross-Language Text Categorization using Machine Translation (기계번역을 이용한 교차언어 문서 범주화의 분류 성능 분석)

  • Lee, Yong-Gu
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.43 no.1
    • /
    • pp.313-332
    • /
    • 2009
  • Cross-language text categorization(CLTC) can classify documents automatically using training set from other language. In this study, collections appropriated for CLTC were extracted from KTSET. Classification performance of various CLTC methods were compared by SVM classifier using machine translation. Results showed that the classification performance in the order of poly-lingual training method, training-set translation and test-set translation. However, training-set translation could be regarded as the most useful method among CLTC, because it was efficient for machine translation and easily adapted to general environment. On the other hand, low performance was shown to be due to the feature reduction or features with no subject characteristics, which occurred in the process of machine translation of CLTC.

A Study on Fine-Tuning and Transfer Learning to Construct Binary Sentiment Classification Model in Korean Text (한글 텍스트 감정 이진 분류 모델 생성을 위한 미세 조정과 전이학습에 관한 연구)

  • JongSoo Kim
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.28 no.5
    • /
    • pp.15-30
    • /
    • 2023
  • Recently, generative models based on the Transformer architecture, such as ChatGPT, have been gaining significant attention. The Transformer architecture has been applied to various neural network models, including Google's BERT(Bidirectional Encoder Representations from Transformers) sentence generation model. In this paper, a method is proposed to create a text binary classification model for determining whether a comment on Korean movie review is positive or negative. To accomplish this, a pre-trained multilingual BERT sentence generation model is fine-tuned and transfer learned using a new Korean training dataset. To achieve this, a pre-trained BERT-Base model for multilingual sentence generation with 104 languages, 12 layers, 768 hidden, 12 attention heads, and 110M parameters is used. To change the pre-trained BERT-Base model into a text classification model, the input and output layers were fine-tuned, resulting in the creation of a new model with 178 million parameters. Using the fine-tuned model, with a maximum word count of 128, a batch size of 16, and 5 epochs, transfer learning is conducted with 10,000 training data and 5,000 testing data. A text sentiment binary classification model for Korean movie review with an accuracy of 0.9582, a loss of 0.1177, and an F1 score of 0.81 has been created. As a result of performing transfer learning with a dataset five times larger, a model with an accuracy of 0.9562, a loss of 0.1202, and an F1 score of 0.86 has been generated.

Use of Word Clustering to Improve Emotion Recognition from Short Text

  • Yuan, Shuai;Huang, Huan;Wu, Linjing
    • Journal of Computing Science and Engineering
    • /
    • v.10 no.4
    • /
    • pp.103-110
    • /
    • 2016
  • Emotion recognition is an important component of affective computing, and is significant in the implementation of natural and friendly human-computer interaction. An effective approach to recognizing emotion from text is based on a machine learning technique, which deals with emotion recognition as a classification problem. However, in emotion recognition, the texts involved are usually very short, leaving a very large, sparse feature space, which decreases the performance of emotion classification. This paper proposes to resolve the problem of feature sparseness, and largely improve the emotion recognition performance from short texts by doing the following: representing short texts with word cluster features, offering a novel word clustering algorithm, and using a new feature weighting scheme. Emotion classification experiments were performed with different features and weighting schemes on a publicly available dataset. The experimental results suggest that the word cluster features and the proposed weighting scheme can partly resolve problems with feature sparseness and emotion recognition performance.

An Efficient Block Segmentation and Classification of a Document Image Using Edge Information (문서영상의 에지 정보를 이용한 효과적인 블록분할 및 유형분류)

  • 박창준;전준형;최형문
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.33B no.10
    • /
    • pp.120-129
    • /
    • 1996
  • This paper presents an efficient block segmentation and classification using the edge information of the document image. We extract four prominent features form the edge gradient and orientaton, all of which, and thereby the block clssifications, are insensitive to the background noise and the brightness variation of of the image. Using these four features, we can efficiently classify a document image into the seven categrories of blocks of small-size letters, large-size letters, tables, equations, flow-charts, graphs, and photographs, the first five of which are text blocks which are character-recognizable, and the last two are non-character blocks. By introducing the clumn interval and text line intervals of the document in the determination of th erun length of CRLA (constrained run length algorithm), we can obtain an efficient block segmentation with reduced memory size. The simulation results show that the proposed algorithm can rigidly segment and classify the blocks of the documents into the above mentioned seven categories and classification performance is high enough for all the categories except for the graphs with too much variations.

  • PDF

Implementation of Annotation and Thesaurus for Remote Sensing

  • Chae, Gee-Ju;Yun, Young-Bo;Park, Jong-Hyun
    • Proceedings of the KSRS Conference
    • /
    • 2003.11a
    • /
    • pp.222-224
    • /
    • 2003
  • Many users want to add some their own information to data which was on the web and computer without actually needing to touch data. In remote sensing, the result data for image classification consist of image and text file in general. To overcome these inconvenience problems, we suggest the annotation method using XML language. We give the efficient annotation method which can be applied to web and viewing of image classification. We can apply the annotation for web and image classification with image and text file. The need for thesaurus construction is the lack of information for remote sensing and GIS on search engine like Empas, Naver and Google. In search engine, we can’t search the information for word which has many different names simultaneously. We select the remote sensing data from different sources and make the relation between many terms. For this process, we analyze the meaning for different terms which has similar meaning.

  • PDF

Machine Printed and Handwritten Text Discrimination in Korean Document Images

  • Trieu, Son Tung;Lee, Guee Sang
    • Smart Media Journal
    • /
    • v.5 no.3
    • /
    • pp.30-34
    • /
    • 2016
  • Nowadays, there are a lot of Korean documents, which often need to be identified in one of printed or handwritten text. Early methods for the identification use structural features, which can be simple and easy to apply to text of a specific font, but its performance depends on the font type and characteristics of the text. Recently, the bag-of-words model has been used for the identification, which can be invariant to changes in font size, distortions or modifications to the text. The method based on bag-of-words model includes three steps: word segmentation using connected component grouping, feature extraction, and finally classification using SVM(Support Vector Machine). In this paper, bag-of-words model based method is proposed using SURF(Speeded Up Robust Feature) for the identification of machine printed and handwritten text in Korean documents. The experiment shows that the proposed method outperforms methods based on structural features.