AutoCor: A Query Based Automatic Acquisition of Corpora of Closely-related Languages

  • Dimalen, Davis Muhajereen D. (Information Technology Department, School of Computer Studies, Mindanao State University-Iligan Institute of Technology)
  • Roxas, Rachel Edita O. (College of Computer Studies, De La Salle University-Manila)
  • Published : 2007.11.01

Abstract

AutoCor is a method for the automatic acquisition and classification of corpora of documents in closely-related languages. It extends and enhances CorpusBuilder, a system that automatically builds specific minority language corpora from a closed corpus. The extension is motivated by the observation that some Tagalog documents retrieved by CorpusBuilder are actually documents in other closely-related Philippine languages. AutoCor uses the odds ratio query generation method and introduces the concept of common word pruning to differentiate between documents in Tagalog and documents in closely-related Philippine languages. The performance of the system with and without pruning is compared, and common word pruning is found to improve the precision of the system.
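
As a rough illustration of the two ideas named in the abstract, the sketch below scores candidate query terms with a smoothed log odds ratio and optionally prunes words shared with the other language. It is not the authors' implementation: the function names, the 0.5 smoothing constant, the whitespace tokenizer, and the pruning rule (drop any candidate that also occurs in the non-target documents) are all assumptions made for this example.

```python
import math
from collections import Counter


def doc_frequencies(docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def odds_ratio_terms(relevant, nonrelevant, k=3, prune_common=False):
    """Return the k highest-scoring query terms for the relevant documents.

    Score is log(p(1 - q) / ((1 - p) q)), where p and q are smoothed
    document-level probabilities of the word in the relevant and
    non-relevant sets (smoothing is an assumption of this sketch).
    """
    df_rel = doc_frequencies(relevant)
    df_non = doc_frequencies(nonrelevant)
    scores = {}
    for word, count in df_rel.items():
        # Hypothetical pruning rule: skip words seen in both document sets.
        if prune_common and word in df_non:
            continue
        p = (count + 0.5) / (len(relevant) + 1.0)
        q = (df_non.get(word, 0) + 0.5) / (len(nonrelevant) + 1.0)
        scores[word] = math.log((p * (1 - q)) / ((1 - p) * q))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    # Toy strings standing in for documents; not data from the paper.
    target_docs = ["ang bata ay masaya", "ang mga bata ay nasa bahay"]
    other_docs = ["ang bata nalipay", "ang mga bata naa sa balay"]
    print(odds_ratio_terms(target_docs, other_docs, prune_common=True))
```

With pruning enabled, words that occur in both document sets can never be selected as query terms, which is one plausible way such a step could reduce the retrieval of documents in the wrong language.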

Keywords