
Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0

  • Kim, Sunho (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Royoung (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Nam, Hee-Jo (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Ryeo-Gyeong (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Ko, Enjin (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Han-Su (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Shin, Jihye (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Cho, Daeun (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Jin, Yurhee (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Bae, Soyeon (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Jo, Ye Won (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Jeong, San Ah (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Yena (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Ahn, Seoyeon (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Jang, Bomi (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Seong, Jiheyon (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Lee, Yujin (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Seo, Si Eun (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Yujin (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Ha-Jeong (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Hyeji (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Sung, Hye-Lynn (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Lho, Hyoyoung (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Koo, Jaywon (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Chu, Jion (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Lim, Juwon (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Youngju (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Lee, Kyungyeon (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Lim, Yuri (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Meongeun (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Hwang, Seonjeong (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Han, Shinhye (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Bae, Sohyeun (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Sua (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Yoo, Suhyeon (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Seo, Yeonjeong (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Shin, Yerim (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Yonsoo (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Ko, You-Jung (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Baek, Jihee (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Hyun, Hyejin (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Choi, Hyemin (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Oh, Ji-Hye (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Kim, Da-Young (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Park, Hyun-Seok (Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University)
  • Received : 2020.08.12
  • Accepted : 2020.09.10
  • Published : 2020.09.30

Abstract

This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first hackathon of the Genomics & Informatics Annotation Hackathon (GIAH) event. Extracting text from multi-column biomedical documents such as Genomics & Informatics is notoriously difficult. The hackathon was piloted as part of a coding competition of the ELTEC College of Engineering at Ewha Womans University to enable researchers and students to create or annotate their own versions of the Genomics & Informatics corpus, to gain and create knowledge about corpus linguistics, and to acquire tangible and transferable skills along the way. The projects proposed during the hackathon draw on an internal database containing different versions of the corpus and annotations.
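As an illustration of what semi-automatic correction of PDF-to-text conversion errors can look like, the following minimal Python sketch repairs two common extraction artifacts: typographic ligatures and words hyphenated across line breaks. It is not the method used in the hackathon. The function name `repair_text`, the ligature table, and the vocabulary set are all assumptions introduced for this example. A hyphenated word is joined automatically only when the joined form appears in the supplied vocabulary; otherwise the hyphen is kept and the word is flagged for human review.

```python
import re

# Hypothetical ligature table for illustration; a real corpus may need more entries.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# A word fragment, a hyphen at the end of a line, and the rest of the word.
HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")


def repair_text(text, vocabulary):
    """Return (repaired_text, flagged).

    `vocabulary` is an assumed set of known lowercase words.
    `flagged` lists the line-break joins that were not resolved automatically.
    """
    # Step 1: replace single-character ligatures with their plain letters.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)

    flagged = []

    def join(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in vocabulary:
            # Known word: the hyphen was introduced by line wrapping.
            return joined
        # Unknown word: keep the hyphen, drop the line break, flag for review.
        hyphenated = match.group(1) + "-" + match.group(2)
        flagged.append(hyphenated)
        return hyphenated

    # Step 2: resolve words split across line breaks.
    return HYPHEN_BREAK.sub(join, text), flagged


if __name__ == "__main__":
    sample = "The ampli\ufb01cation of ge-\nnomic DNA was cell-\nbased."
    repaired, flagged = repair_text(sample, {"genomic"})
    print(repaired)
    print(flagged)
```

Keeping the flagged list separate from the repaired text reflects the semi-automatic setting: unambiguous cases are fixed by the script, while uncertain ones are left for a person to decide.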

Acknowledgement

This work was supported by a National Research Foundation of Korea grant (NRF-2019R1F1A1058858) funded by the Korean government (MSIT).
