An Active Learning-based Method for Composing Training Document Set in Bayesian Text Classification Systems

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 29, Issue 12, Pages 966-978, 2002, pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

베이지언 문서분류시스템을 위한 능동적 학습 기반의 학습문서집합 구성방법

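김제욱 (Technology Research Institute, Daewoo Information Systems)
김한준 (School of Computer Science and Engineering, College of Engineering, Seoul National University)
이상구 (School of Computer Science and Engineering, College of Engineering, Seoul National University)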

Published : 2002.12.01

Abstract

There are two important problems in improving text classification systems based on the machine learning approach. The first, called the "selection problem", is how to select a minimum number of informative documents from a given document collection. The second, called the "composition problem", is how to reorganize the selected training documents so that they fit the adopted learning method. The former problem is addressed by "active learning" algorithms, and the latter by "boosting" algorithms. This paper proposes a new learning method, called AdaBUS, which proactively solves both problems in the context of Naive Bayes classification systems. The proposed method constructs a more accurate classification hypothesis by increasing the variance among the "weak" hypotheses that determine the final classification hypothesis. Consequently, the proposed algorithm yields the perturbation effect that makes the boosting algorithm work properly. Through an empirical experiment using the Reuters-21578 document collection, we show that the AdaBUS algorithm improves the Naive Bayes-based classification system more significantly than other conventional learning methods.
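The "selection problem" can be illustrated with uncertainty sampling on a Naive Bayes classifier. The following is a minimal sketch, not the paper's AdaBUS algorithm: it assumes documents are lists of word tokens, uses a multinomial Naive Bayes model with Laplace smoothing, and picks the unlabeled documents whose highest class posterior is lowest. The function names `train_nb`, `posterior`, and `most_uncertain` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes (illustrative helper)."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d}
    counts = {c: Counter() for c in classes}
    n_docs = Counter(labels)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    return classes, vocab, counts, n_docs


def posterior(model, doc):
    """Return P(class | doc) for every class, using Laplace smoothing."""
    classes, vocab, counts, n_docs = model
    total_docs = sum(n_docs.values())
    log_scores = {}
    for c in classes:
        denom = sum(counts[c].values()) + len(vocab)
        score = math.log(n_docs[c] / total_docs)
        for w in doc:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[c][w] + 1) / denom)
        log_scores[c] = score
    # Normalize in log space to avoid underflow on long documents.
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def most_uncertain(model, pool, k=1):
    """Pick the k pool documents with the lowest maximum class posterior."""
    def confidence(doc):
        return max(posterior(model, doc).values())
    return sorted(pool, key=confidence)[:k]
```

For example, after training on two "fin" documents (`["stock", "market"]`, `["stock", "price"]`) and two "sport" documents (`["goal", "match"]`, `["match", "team"]`), the pool document `["stock", "match"]` receives a posterior of 0.5 for each class and is therefore selected first for labeling.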

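Among the factors that determine the accuracy of a text classification system built with machine learning techniques, the most important are the selection of the training document set and the way that set is composed. The selection problem is that of choosing, from an arbitrary document space, a small set of highly informative documents to adopt as training documents. The composition problem is that of reorganizing the selected training document set to build a more accurate classification function. The representative algorithm for the former is active learning, and for the latter it is boosting. In this paper, we apply both algorithms to the Naive Bayes text classification algorithm, analyze the characteristics that emerge, and propose AdaBUS, a new method for composing the training document set. Using the idea behind active learning, this algorithm increases the variance among the temporary classification functions (weak hypotheses) that are built to produce the final classification function. This realizes the effect of perturbation, the key concept that boosting needs in order to work effectively, and thereby improves classification accuracy. Through empirical experiments on the Reuters-21578 document collection, we demonstrate that AdaBUS improves the accuracy of a Naive Bayes-based text classification system more than existing algorithms do.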
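The role of boosting in the "composition problem" can be sketched with the standard AdaBoost.M1 reweighting step. This is an illustration under stated assumptions, not the AdaBUS procedure itself: the function name `adaboost_round` is hypothetical, and returning `None` when the weighted error is 0 or at least 0.5 is a simplification chosen here. Documents that the current weak hypothesis misclassifies gain relative weight, so the next weak hypothesis is trained on a perturbed distribution.

```python
import math


def adaboost_round(weights, correct):
    """One AdaBoost.M1 reweighting step (illustrative sketch).

    weights: current distribution over training documents, summing to 1.
    correct: correct[i] is True if the weak hypothesis classified document i correctly.
    Returns (new_weights, vote), or None if the weak hypothesis is unusable.
    """
    # Weighted error of the weak hypothesis.
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    if eps == 0 or eps >= 0.5:
        return None  # assumption: stop boosting in these degenerate cases
    beta = eps / (1 - eps)
    # Shrink the weight of correctly classified documents, then renormalize.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    z = sum(scaled)
    new_weights = [w / z for w in scaled]
    # The hypothesis votes in the final classifier with weight log(1 / beta).
    return new_weights, math.log(1 / beta)
```

With four equally weighted documents and one misclassification, the error is 0.25, so the misclassified document ends up carrying half of the total weight in the next round.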
