• Title/Abstract/Keywords: text classification

720 search results, processing time 0.024 seconds

N-gram Feature Selection for Text Classification Based on Symmetrical Conditional Probability and TF-IDF

  • 최우식;김성범
    • Journal of the Korean Institute of Industrial Engineers / Vol. 41, No. 4 / pp.381-388 / 2015
  • The rapid growth of the World Wide Web and online information services has generated a huge number of text documents and made them accessible. Selecting important keywords is an essential step in analyzing texts. In this paper, we propose a feature selection method that combines term frequency-inverse document frequency (TF-IDF) with symmetrical conditional probability. The proposed method can identify N-gram features, that is, sequences of multiple consecutive words. The effectiveness of the proposed method is demonstrated on real text data from the machine learning repository of the University of California, Irvine.
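
The abstract does not say how the TF-IDF score and the symmetrical conditional probability (SCP) are combined, so the following is only a minimal sketch of the SCP part, assuming the commonly used definition SCP(x, y) = p(x|y) · p(y|x) estimated from raw counts. The function name `scp_scores` and the toy token list are illustrative and do not come from the paper.

```python
from collections import Counter

def scp_scores(tokens):
    """Return SCP(x, y) = p(y|x) * p(x|y) for every adjacent word pair.

    Assumption: conditional probabilities are estimated from raw counts,
    p(y|x) = count(x, y) / count(x) and p(x|y) = count(x, y) / count(y).
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), c_xy in bigram_counts.items():
        scores[(x, y)] = (c_xy / unigram_counts[x]) * (c_xy / unigram_counts[y])
    return scores

# Toy example: "machine learning" always occurs together, so it scores highest.
tokens = ["machine", "learning", "machine", "learning", "text"]
scores = scp_scores(tokens)
```

A bigram whose words almost always appear together gets a score near 1 and is a candidate multiword feature; how such a score would be merged with TF-IDF weights is left open here.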

Text Categorization with Improved Deep Learning Methods

  • Wang, Xingfeng;Kim, Hee-Cheol
    • Journal of information and communication convergence engineering / Vol. 16, No. 2 / pp.106-113 / 2018
  • Although the deep learning methods of convolutional neural networks (CNNs) and long short-term memory (LSTM) are widely used for text categorization, they still have certain shortcomings. CNNs require that the text retain some order and that the pooling lengths be identical, and they make collateral analysis impossible; LSTMs require unidirectional operation, and their inputs and outputs are very complex. To address these problems, we improved these traditional deep learning methods in the following ways: we created collateral CNNs that accept disorder and variable-length pooling, and we removed the input/output gates when creating bidirectional LSTMs. We evaluated the proposed methods on four benchmark datasets for topic and sentiment classification. The best results were obtained by combining LSTM regional embeddings with data convolution. Our method is better than all previous methods (including deep learning methods) in terms of topic and sentiment classification.

RESEARCH ON SENTIMENT ANALYSIS METHOD BASED ON WEIBO COMMENTS

  • Li, Zhong-Shi;He, Lin;Guo, Wei-Jie;Jin, Zhe-Zhi
    • East Asian mathematical journal / Vol. 37, No. 5 / pp.599-612 / 2021
  • In China, Weibo is one of the social platforms with the most users. It is characterized by fast information transmission and wide coverage. People can comment on an event on Weibo to express their emotions and attitudes. Judging the emotional tendency of users' comments not only helps the management department with monitoring but also has very high application value for rumor suppression, public opinion guidance, and marketing. This paper proposes a two-input Adaboost model based on TextCNN and BiLSTM. The TextCNN model, which performs local feature extraction, and the BiLSTM model, which performs global feature extraction, process the comment data in parallel. Finally, the classification results of the two models are fused through an improved Adaboost algorithm to improve the accuracy of text classification.

New Feature Selection Method for Text Categorization

  • Wang, Xingfeng;Kim, Hee-Cheol
    • Journal of information and communication convergence engineering / Vol. 15, No. 1 / pp.53-61 / 2017
  • The preferred feature selection methods for text classification are filter-based. In a common filter-based feature selection scheme, a score is assigned to each feature; the features are then sorted according to their scores. The last step is to add the top-N features to the feature set. In this paper, we propose an improved global feature selection scheme whose last step is modified to obtain a more representative feature set. The proposed method aims to improve the classification performance of global feature selection methods by creating a feature set that represents all classes almost equally. For this purpose, a local feature selection method is used to label features according to their discriminative power on classes; these labels are used while producing the feature sets. Experimental results obtained using the well-known 20 Newsgroups and Reuters-21578 datasets with the k-nearest neighbor algorithm and a support vector machine indicate that the proposed method improves the classification performance in terms of a widely known metric ($F_1$).
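
The common filter-based scheme described in this abstract (score every feature, sort, keep the top N) can be sketched as follows. This illustrates only the baseline scheme, not the improved method the paper proposes. The scoring function is a hypothetical stand-in (the spread of a term's per-class document rate), and the names `score_terms` and `select_top_n` are illustrative.

```python
from collections import defaultdict

def score_terms(docs, labels):
    """Step 1: assign each term a score.

    Assumed score (not from the paper): the gap between the highest and
    lowest fraction of documents per class that contain the term.
    """
    class_sizes = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))  # term -> class -> doc count
    for tokens, label in zip(docs, labels):
        class_sizes[label] += 1
        for term in set(tokens):
            doc_freq[term][label] += 1
    scores = {}
    for term, per_class in doc_freq.items():
        rates = [per_class.get(c, 0) / class_sizes[c] for c in class_sizes]
        scores[term] = max(rates) - min(rates)
    return scores

def select_top_n(scores, n):
    """Steps 2 and 3: sort by score (ties broken alphabetically), keep the top n."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:n]

docs = [["cheap", "pills", "now"], ["cheap", "offer"],
        ["meeting", "now"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
scores = score_terms(docs, labels)
selected = select_top_n(scores, 2)
```

In this toy run the term "now" appears equally often in both classes and scores zero, so it is never selected, whereas "cheap" and "meeting" each occur in only one class and rank first.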

Construction of an Internet of Things Industry Chain Classification Model Based on IRFA and Text Analysis

  • Zhimin Wang
    • Journal of Information Processing Systems / Vol. 20, No. 2 / pp.215-225 / 2024
  • With the rapid development of Internet of Things (IoT) and big data technology, a large amount of data is generated during the operation of related industries. How to classify the generated data accurately has become the core of research on data mining and processing in the IoT industry chain. This study constructs a classification model of the IoT industry chain based on an improved random forest algorithm and text analysis, aiming to classify IoT industry chain big data efficiently and accurately by improving traditional algorithms. The accuracy, precision, recall, and AUC of the traditional random forest algorithm and the algorithm used in the paper are compared on different datasets. The experimental results show that the model used in this paper performs better on different datasets: its accuracy and recall on four datasets are better than those of the traditional algorithm, its accuracy on two datasets, P-I Diabetes and Loan Default, is better than that of the random forest model, and its final data classification results are better. With this model, the massive data generated in the IoT industry chain can be classified accurately, which provides more research value for the data mining and processing technology of the IoT industry chain.

Design and Implementation of Text Classification System based on ETOM+RPost

  • 최윤정
    • Journal of the Korea Academia-Industrial cooperation Society / Vol. 11, No. 2 / pp.517-524 / 2010
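  • With recent advances in computer and Internet technology, the volume of data to be analyzed has grown rapidly, and demand for automatic classification systems to handle it is high. Because text classification systems require supervised learning, there is strong demand for automated systems that guarantee high accuracy with minimal expert intervention. At the same time, the data to be classified are becoming more complex in both form and content, so general classification methods struggle to produce good results. In particular, data that are processed or altered with some intent, such as spam, make analysis even harder. In this paper, we implement the ETOM and RPost methods, which were proposed to improve the performance of classification algorithms. We applied the implemented system to spam documents lying on the classification boundary and analyzed the process. The experiments confirmed that the proposed method raised accuracy from 0.795 to 0.93, an increase of about 16%.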

Improving Multinomial Naive Bayes Text Classifier

  • 김상범;임해창
    • Journal of KISS: Software and Applications / Vol. 30, No. 3_4 / pp.259-267 / 2003
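  • The naive Bayes classifier is simple to implement and efficient, which makes it suitable for practical use. However, although this classifier performs well in many machine learning domains, it is known to perform very poorly when applied to text classification. In this paper, we propose three methods for improving the multinomial naive Bayes classifier, which is known to perform best among naive Bayes classifiers. The first improves the estimation of word probabilities for a category based on a document model. The second normalizes for document length to suppress the linear increase in relevance to a category as documents grow longer. The last uses a mutual-information-weighted naive Bayes classification method to increase the influence of words that play an important role in category decisions. The proposed methods yielded substantial performance improvements over existing methods on Reuters21578 and 20Newsgroup, the benchmark document collections for evaluating text classifiers.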
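
To make the length-normalization idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier in which term counts are divided by document length, so the log-likelihood no longer grows linearly with the number of words. The abstract does not give the paper's exact normalization or probability estimates; dividing by length and using Laplace smoothing are assumptions, and the names `train_mnb` and `classify` are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels):
    """Collect per-class word counts, class document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_docs = Counter(labels)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_docs, vocab

def classify(tokens, model, normalize_length=True):
    """Pick the class with the highest log prior plus weighted log-likelihood.

    Assumption: length normalization divides each term count by the document
    length; Laplace (add-one) smoothing is used for word probabilities.
    """
    word_counts, class_docs, vocab = model
    total_docs = sum(class_docs.values())
    length = len(tokens) or 1
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for term, tf in Counter(tokens).items():
            weight = tf / length if normalize_length else tf
            prob = (word_counts[label][term] + 1) / (total_words + len(vocab))
            score += weight * math.log(prob)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["goal", "team", "goal"], ["team", "match"],
        ["stock", "price"], ["price", "market", "stock"]]
labels = ["sports", "sports", "economy", "economy"]
model = train_mnb(docs, labels)
predicted = classify(["goal", "match"], model)
```

With normalization on, repeating a document's text several times leaves its likelihood term unchanged, which is the effect the second improvement in the abstract aims for.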

A Study on the Effectiveness of Bigrams in Text Categorization

  • 이찬도;최준영
    • Journal of Information Technology Applications and Management / Vol. 12, No. 2 / pp.15-27 / 2005
  • Text categorization systems generally use single words (unigrams) as features. A deceptively simple algorithm for improving text categorization is investigated here, an idea previously shown not to work: identifying useful word pairs (bigrams) made up of adjacent unigrams. The bigrams it found, while few in number, can substantially raise the quality of feature sets. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that the algorithm succeeded in extracting high-quality bigrams and increased the overall quality of the features. To find out the role of bigrams, we trained Naïve Bayes classifiers using both unigrams and bigrams as features. The results show that recall values were higher than those obtained with unigrams alone. Break-even points and F1 values improved in most documents, especially when documents were classified along the large classes. In Reuters-21578, break-even points increased by 2.1%, with the highest at 18.8%, and F1 improved by 1.5%, with the highest at 3.2%. In Korea-web, break-even points increased by 1.0%, with the highest at 4.5%, and F1 improved by 0.4%, with the highest at 4.2%. We conclude that text classification using unigrams and bigrams together is more effective than using unigrams alone.
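
The idea of adding bigrams built from adjacent unigrams to the feature set can be sketched as below. The abstract does not describe how the paper decides which bigrams are useful, so the minimum-count threshold `min_count` is a hypothetical stand-in for that selection step, and `unigram_bigram_features` is an illustrative name.

```python
from collections import Counter

def unigram_bigram_features(tokens_per_doc, min_count=2):
    """Return, per document, its unigrams plus the bigrams that pass the filter.

    Assumption: a bigram is kept if it occurs at least `min_count` times
    across the whole collection (the paper's own criterion is not given).
    """
    bigram_counts = Counter()
    for tokens in tokens_per_doc:
        bigram_counts.update(zip(tokens, tokens[1:]))
    kept = {bigram for bigram, count in bigram_counts.items() if count >= min_count}

    features = []
    for tokens in tokens_per_doc:
        doc_features = list(tokens)
        doc_features += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:]) if (a, b) in kept]
        features.append(doc_features)
    return features

docs = [["interest", "rate", "rises"],
        ["interest", "rate", "falls"],
        ["rate", "of", "change"]]
features = unigram_bigram_features(docs)
```

Only "interest_rate" survives the filter in this toy collection, which mirrors the abstract's observation that the useful bigrams are few in number. The resulting feature lists can be fed to any classifier that accepts unigram features.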


Automation of Expert Classification in Knowledge Management Systems Using Text Categorization Technique

  • 양근우;허순영
    • Asia pacific journal of information systems / Vol. 14, No. 2 / pp.115-130 / 2004
  • This paper proposes how to build an expert profile database in a knowledge management system (KMS) that records the expertise each expert in the organization possesses. Recent research on managing tacit knowledge in a KMS has shown that it is often more practical to provide expert search mechanisms that pinpoint the experts in the organization who have the expertise being sought, so that users can contact them for help. In this paper, we develop a framework that automates expert classification using a text categorization technique called the Vector Space Model, through which an expert database composed of all the compiled profile information is built. This approach minimizes the maintenance cost of manual expert profiling while eliminating the incorrectness and obsolescence that result from subjective manual processing. We also define the structure of expertise so that the expert classification framework can be implemented to build an expert database in a KMS. The developed prototype system, "Knowledge Portal for Researchers in Science and Technology," is introduced to show the applicability of the proposed framework.

Improving of kNN-based Korean text classifier by using heuristic information

  • 임희석;남기춘
    • The Journal of Korean Association of Computer Education / Vol. 5, No. 3 / pp.37-44 / 2002
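  • Automatic text classification is the task of assigning predefined categories to an input document, and the need for it is growing as a means of managing documents efficiently and systematically. Research on automatic text classification using machine learning is active both in Korea and abroad, but most of it concentrates on proposing new learning models to improve classifier performance and on comparing learning models with one another, while research on optimizing or improving a classification system built on a particular learning model is somewhat lacking. This paper therefore defines the parameters that play an important role in improving the performance of a text classification system based on kNN learning and proposes ways to improve a Korean text classifier using heuristic information obtained through experiments. In the experiments, a classification function that uses similarity weights between neighboring documents, a feature selection method that uses classification information, and a global classification method achieved high performance. A local method using a k value chosen carefully for each classification domain was also found to achieve performance similar to that of the global method, which requires much more computation.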
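
A classification function that weights each neighbor's vote by its similarity to the input document can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's system: cosine similarity over raw term counts, the value of `k`, and the names `cosine` and `knn_classify` are all assumptions.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(query_tokens, train, k=3):
    """Label the query by a similarity-weighted vote of its k nearest neighbors."""
    query = Counter(query_tokens)
    neighbors = sorted(
        ((cosine(query, Counter(tokens)), label) for tokens, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, label in neighbors[:k]:
        votes[label] += similarity  # each neighbor votes with its similarity
    return max(votes, key=votes.get)

train = [
    (["stock", "market", "price"], "economy"),
    (["market", "trade"], "economy"),
    (["soccer", "match", "goal"], "sports"),
    (["goal", "score", "team"], "sports"),
]
predicted = knn_classify(["market", "price", "goal"], train, k=3)
```

Compared with a plain majority vote, the weighted vote lets one very similar neighbor outweigh several marginal ones, which matters when k is large relative to the size of a category.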
