HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research

  • Kim, Jin-Suk (Department of Information Technology Research, KISTI)
  • Choe, Ho-Seop (Department of Information Technology Research, KISTI)
  • You, Beom-Jong (Department of Information Technology Research, KISTI)
  • Seo, Jeong-Hyun (Department of Cyber Environment Development, KISTI)
  • Lee, Suk-Hoon (Department of Information & Statistics, Chungnam National University)
  • Ra, Dong-Yul (Computer & Telecommunication Engineering Division, Yonsei University)

Published: 2009.09.30

Abstract

The HKIB, or Hankookilbo, test collections are two archives of Korean newswire stories that have been manually categorized using semi-hierarchical or hierarchical category taxonomies. The base newswire stories were made available by the Hankook Ilbo (The Korea Daily) for research purposes. First, Chungnam National University and KISTI collaborated to manually tag 40,075 news stories using a semi-hierarchical, balanced three-level classification scheme in which each news story has exactly one level-3 category (single-labeling). We refer to this original data set as the HKIB-40075 test collection. Subsequently, Yonsei University and KISTI collaborated to select 20,000 newswire stories from the HKIB-40075 test collection, to rearrange the classification scheme so that it is fully hierarchical but unbalanced, and to assign one or more categories to each news story (multi-labeling). We refer to this modified data set as the HKIB-20000 test collection. We benchmark a k-NN categorization algorithm on both HKIB-20000 and HKIB-40075, illustrating properties of the collections, providing baseline results for future studies, and suggesting new directions for further research on the Korean text categorization problem.
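
The difference between the single-labeling and multi-labeling settings mentioned above can be illustrated with a minimal k-NN sketch. This is not the KRISTAL classifier used in the paper: the cosine similarity over raw term counts, the similarity-weighted voting, the threshold value, the function names, and the toy English training stories are all assumptions made for illustration. Real Korean text would additionally need proper tokenization.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(query: str, training: list, k: int = 3) -> dict:
    """Sum the similarities of the k nearest stories per category."""
    q = Counter(query.split())
    ranked = sorted(
        ((cosine(q, Counter(text.split())), labels) for text, labels in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = Counter()
    for sim, labels in ranked[:k]:
        for label in labels:
            scores[label] += sim
    return dict(scores)


def single_label(scores: dict) -> str:
    """Single-labeling: keep only the best-scoring category."""
    return max(scores, key=scores.get)


def multi_label(scores: dict, threshold: float) -> list:
    """Multi-labeling: keep every category whose score reaches the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)


# Hypothetical toy data, not taken from the HKIB collections.
training = [
    ("stock market shares fall", ["economy"]),
    ("bank interest rate market", ["economy"]),
    ("election vote parliament", ["politics"]),
    ("government budget tax market", ["economy", "politics"]),
    ("football match goal", ["sports"]),
]
query = "government tax vote market"

scores = knn_scores(query, training, k=3)
print(single_label(scores))
print(multi_label(scores, threshold=0.5))
```

On this toy input the single-label decision keeps one category, while the thresholded multi-label decision can return several. In practice both `k` and the threshold would be tuned on held-out data.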

Cited by

  1. A Study on Feature Selection for kNN Classifier using Document Frequency and Collection Frequency vol.44, pp.1, 2013, https://doi.org/10.16981/kliss.44.1.201303.27