Proceedings of the Korean Society for Language and Information Conference (한국언어정보학회:학술대회논문집)
- 2007.11a
- Pages 146-154
- 2007
AutoCor: A Query Based Automatic Acquisition of Corpora of Closely-related Languages
- Dimalen, Davis Muhajereen D. (Information Technology Department, School of Computer Studies, Mindanao State University-Iligan Institute of Technology)
- Roxas, Rachel Edita O. (College of Computer Studies, De La Salle University-Manila)
- Published : 2007.11.01
Abstract
AutoCor is a method for the automatic acquisition and classification of corpora of documents in closely-related languages. It extends and enhances CorpusBuilder, a system that automatically builds specific minority language corpora from a closed corpus. The extension is motivated by the observation that some documents retrieved by CorpusBuilder as Tagalog are actually written in other closely-related Philippine languages. AutoCor uses the odds ratio query generation method and introduces the concept of common word pruning to differentiate between documents in Tagalog and documents in closely-related Philippine languages. The performance of the system with and without pruning is compared, and common word pruning is found to improve the precision of the system.
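As a rough illustration of the two ideas named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the commonly used log odds ratio score, log[p(1 - q) / (q(1 - p))], where p and q are the fractions of relevant and non-relevant documents containing a word, and it assumes that common word pruning means discarding candidate query words that also occur in the non-relevant (closely-related language) documents. The function names, the smoothing constant `eps`, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Return the fraction of documents in which each word occurs."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {word: n / len(docs) for word, n in counts.items()}


def odds_ratio_terms(relevant, non_relevant, k=2, prune_common=False, eps=0.01):
    """Pick the k words with the highest log odds ratio as query terms.

    Assumption: with prune_common=True, any word that also appears in the
    non-relevant documents is removed before ranking.
    """
    p_rel = doc_freq(relevant)
    p_non = doc_freq(non_relevant)
    candidates = set(p_rel)
    if prune_common:
        candidates -= set(p_non)

    def score(word):
        # Clamp probabilities away from 0 and 1 so the logarithm is defined.
        p = min(max(p_rel.get(word, 0.0), eps), 1 - eps)
        q = min(max(p_non.get(word, 0.0), eps), 1 - eps)
        return math.log(p * (1 - q) / (q * (1 - p)))

    # Highest score first; ties are broken alphabetically.
    return sorted(candidates, key=lambda w: (-score(w), w))[:k]


# Toy data for illustration only, not drawn from the paper's corpora.
target_docs = [
    "ang mga bata ay kumain",
    "ang mga aso ay natulog",
    "ang mga tao ay hindi umalis",
]
related_docs = [
    "ang mga bata nikaon",
    "ang iro dili molakaw",
    "ang tawo ug ang iro",
]

print(odds_ratio_terms(target_docs, related_docs))
print(odds_ratio_terms(target_docs, related_docs, prune_common=True))
```

In this toy setting the unpruned query can still contain a word such as "mga" that occurs in both document sets, whereas the pruned query is restricted to words seen only in the target documents. That is one plausible reading of why pruning would raise precision when the languages share much of their vocabulary.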