Reinforcement Method for Automated Text Classification using Post-processing and Training with Definition Criteria

A Method for Improving the Performance of Automated Document Classification Using Improved Training and Post-processing Analysis

  • 최윤정 (Department of Computer Science, Ewha Womans University) ;
  • 박승수 (Department of Computer Science, Ewha Womans University)
  • Published : 2005.12.01

Abstract

Automated text categorization assigns free-text documents to predefined categories automatically; its main goal is to reduce the considerable manual effort the task requires. Recent research on improving text categorization performance has focused on enhancing existing classification models and algorithms themselves, but its scope has been limited by feature-based statistical methodology. In this paper, we propose the RTPost system, which departs from traditional methods by taking a fault-tolerant system approach and a data mining strategy. The two main parts of the RTPost system are reinforcement training and post-processing. The training method addresses the problem of defining the categories to be classified before training sample documents are selected. The post-processing method addresses the problem of assigning categories rather than the performance of the classification algorithm. In experiments, we applied our system to documents that lie near a decision boundary and therefore receive low classification accuracy. The experiments show that our system achieves high accuracy and stability under realistic conditions. Its results did not depend on variables that strongly influence classification power, such as the number of training documents, the method of selecting them, and the performance of the classification algorithm. In addition, by adopting the advantages of active learning, we can expect a self-learning effect that decreases training cost and increases training effectiveness.

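Automated document classification assigns documents to predefined categories automatically on the basis of their content, and it is an essential task for efficient information management and retrieval. Previous studies aimed at improving classification performance have mostly concentrated on improving the classification model itself, and their scope has been limited to statistical methods. In this study, we propose the RTPost system, which improves the performance of automated document classification through an improved training algorithm and a post-processing method that use data mining techniques and a fault-tolerance approach. The RTPost system addresses the problem of defining categories before training documents are selected, and it seeks to improve the training and post-classification processes by taking into account the problems of the assignment method rather than the performance of the classification function. As a result, stable classification was possible without depending on the number of training documents, the way they are selected, or the performance of the classification model, all of which have strongly influenced classification results; applying the system to documents located near the decision boundary, where the classification error rate is high, yielded high accuracy. Furthermore, by adopting the advantages of active learning during the RTPost process, a self-learning effect that raises training effectiveness while reducing cost can be expected.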
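
The abstract does not specify how RTPost identifies documents near a decision boundary, so the following is only an illustrative sketch of one common way to do it, not the authors' method. It assumes each document already has one score per category from some classifier, and it treats a small gap between the top two scores as a sign that the document lies near the boundary. The function names `top_two_margin` and `split_by_margin` and the `threshold` value are hypothetical.

```python
from typing import Dict, List, Tuple


def top_two_margin(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the best-scoring category and its margin over the runner-up."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ranked[0]
    # With a single category there is no runner-up; treat its score as 0.
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, best_score - runner_up_score


def split_by_margin(
    doc_scores: Dict[str, Dict[str, float]],
    threshold: float = 0.1,  # hypothetical cut-off, not taken from the paper
) -> Tuple[Dict[str, str], List[str]]:
    """Separate confidently classified documents from boundary documents.

    Documents whose top-two margin is below `threshold` are returned in the
    second value so that a separate post-processing step can assign them.
    """
    confident: Dict[str, str] = {}
    boundary: List[str] = []
    for doc_id, scores in doc_scores.items():
        label, margin = top_two_margin(scores)
        if margin < threshold:
            boundary.append(doc_id)
        else:
            confident[doc_id] = label
    return confident, boundary


# Made-up scores for two documents, for illustration only.
example_scores = {
    "d1": {"sports": 0.90, "politics": 0.10},
    "d2": {"sports": 0.52, "politics": 0.48},
}
confident_docs, boundary_docs = split_by_margin(example_scores)
```

In this sketch `d1` is accepted with its top category, while `d2` is set aside because its two scores are nearly tied; what happens to the set-aside documents would be the job of the post-processing stage the abstract describes.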
