An Experimental Study on Feature Selection Using Wikipedia for Text Categorization

위키피디아를 이용한 분류자질 선정에 관한 연구

  • Received : 2012.05.21
  • Accepted : 2012.06.16
  • Published : 2012.06.30


In text categorization, even the core terms of an input document are not selected as classification features if they do not occur in the training document set. In addition, synonymous terms that express the same concept are usually treated as different features. This study aims to improve text categorization performance by using Wikipedia to merge synonyms into a single feature and to replace input terms that are absent from the training document set with the most similar term that occurs in the training documents. Feature selection experiments were run in settings that combine three conditions: whether category information is used when extracting non-training terms, which part of Wikipedia is used to measure term-term similarity, and which similarity measure is applied. When each non-training term was replaced by the training term with the highest similarity above a threshold, the performance of a kNN classifier improved by 0.35% to 1.85% in $F_1$ in every experimental setting. Although the improvement is smaller than expected, the results suggest that several semantic and structural features of Wikipedia can be used to select more effective classification features.

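A common problem in text categorization is that even a core term that represents a document is not selected as a classification feature if it does not appear in the training document set, and that synonyms with different surface forms are used as separate features. Using Wikipedia, this study sought to improve categorization performance by converting the synonyms that appear in documents into a single classification feature and by replacing terms of an input document that do not occur in the training document set with the most similar term from the training documents. The feature selection experiments combined three conditions: (1) whether category information is used when extracting non-training terms, (2) the method of measuring term similarity (the titles and body text of Wikipedia articles, category information, or link information), and (3) the similarity measure (simple co-occurrence frequency or normalized co-occurrence frequency). When non-training terms were replaced by the training term with the highest similarity at or above the similarity threshold and the documents were classified with a kNN classifier, categorization performance improved by 0.35% to 1.85% in every combination of conditions. Although categorization performance did not improve greatly, the experiments confirmed that selecting classification features with Wikipedia is effective.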
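The replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy corpus standing in for Wikipedia text, the threshold of 0.5, and the Dice-style normalization are all assumptions made for illustration, since the abstract names "simple" and "normalized" co-occurrence frequency without giving formulas. Synonym merging and the kNN classifier are left out.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count how many documents contain each term and each unordered term pair."""
    term_df = Counter()
    pair_df = Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))
        term_df.update(vocab)
        # Pairs are stored in sorted order so lookups can use one canonical key.
        pair_df.update(combinations(vocab, 2))
    return term_df, pair_df


def similarity(a, b, term_df, pair_df, normalized=True):
    """Simple co-occurrence frequency, or a Dice-style normalized variant.

    The Dice normalization is an assumption; the abstract gives no formula.
    """
    co = pair_df[tuple(sorted((a, b)))]
    if not normalized:
        return float(co)
    denom = term_df[a] + term_df[b]
    return 2.0 * co / denom if denom else 0.0


def replace_non_training_terms(tokens, training_vocab, term_df, pair_df,
                               threshold=0.5, normalized=True):
    """Replace each non-training term with its most similar training term.

    A term is replaced only if the best similarity reaches the threshold;
    otherwise it is kept unchanged (a choice made for this sketch).
    """
    result = []
    for term in tokens:
        if term in training_vocab:
            result.append(term)
            continue
        best_term, best_score = None, 0.0
        for candidate in sorted(training_vocab):
            score = similarity(term, candidate, term_df, pair_df, normalized)
            if score > best_score:
                best_term, best_score = candidate, score
        if best_term is not None and best_score >= threshold:
            result.append(best_term)
        else:
            result.append(term)
    return result


# Hypothetical toy corpus standing in for Wikipedia article text.
wiki_docs = [
    ["automobile", "car", "engine"],
    ["automobile", "car", "road"],
    ["car", "engine"],
    ["banana", "fruit"],
]
training_vocab = {"car", "fruit"}
term_df, pair_df = cooccurrence_counts(wiki_docs)
mapped = replace_non_training_terms(
    ["automobile", "banana", "zebra"], training_vocab, term_df, pair_df
)
print(mapped)  # ['car', 'fruit', 'zebra']
```

In this toy run, "automobile" co-occurs with "car" in two documents (Dice score 0.8), so it is mapped to "car", while "zebra" has no co-occurrence with any training term and is left as it is. Passing `normalized=False` switches to the raw co-occurrence count, in which case the threshold would need to be chosen on that scale.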

