Journal of Information Technology Applications and Management
- Volume 12 Issue 2
- Pages 15-27
- 2005
- 1598-6284 (pISSN)
- 2508-1209 (eISSN)
A Study on the Effectiveness of Bigrams in Text Categorization
(Korean title, translated: A Study on the Effect of Bigrams on Document Categorization Performance)
- Published: 2005.06.01
Abstract
Text categorization systems generally use single words (unigrams) as features. This study investigates a deceptively simple algorithm for improving text categorization, based on an idea previously shown not to work: identifying useful word pairs (bigrams) made up of adjacent unigrams. The bigrams the algorithm finds, although few in number, can substantially raise the quality of a feature set. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that it succeeded in extracting high-quality bigrams and improved the overall quality of the features. To find out the role of bigrams, we trained the Na