A Study on Recognition of Citation Metadata using Bidirectional GRU-CRF Model based on Pre-trained Language Model

A Study on Recognizing Reference Metadata Using a Bidirectional Gated Recurrent Unit Model and a Conditional Random Field Model Based on a Pre-trained Language Model

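  • 지선영 (Department of Library and Information Science, Graduate School, Kyonggi University);
  • 최성필 (Department of Library and Information Science, Kyonggi University)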
  • Received : 2021.02.25
  • Accepted : 2021.03.17
  • Published : 2021.03.30

Abstract

This study investigates the automatic recognition of the metadata that make up bibliographic references, using a bidirectional GRU-CRF model (a bidirectional gated recurrent unit network combined with a conditional random field) built on a pre-trained language model. The experimental set consists of 161,315 references extracted by rule-based analysis from 53,562 academic documents in PDF format, collected from 40 journals published in 2018. To construct this set, the study also covered the automatic extraction of references and their metadata from academic literature in PDF format. The language model with the highest performance was identified, additional experiments were run on that model to compare recognition performance across training set sizes, and finally the performance for each metadata type was examined.
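As a rough illustration of what "recognizing reference metadata" means at the output end of a sequence labeler such as a bidirectional GRU-CRF, the sketch below groups per-token BIO tags into metadata fields. It is a minimal sketch under assumptions, not the authors' implementation: the function name `group_bio_tags`, the whitespace tokenization, and the label set (`AUTHOR`, `YEAR`, `TITLE`, `JOURNAL`, `VOLUME`, `PAGES`) are all hypothetical, and the tags are written by hand to stand in for model predictions.

```python
from typing import List, Optional, Tuple


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (field, text) spans, in reading order."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current: Optional[Tuple[str, List[str]]] = None  # the span being built

    for token, tag in zip(tokens, tags):
        prefix, _, field = tag.partition("-")
        # An I- tag that matches the open span extends it.
        if prefix == "I" and current is not None and current[0] == field:
            current[1].append(token)
            continue
        # Any other tag closes the open span.
        if current is not None:
            spans.append((current[0], " ".join(current[1])))
            current = None
        # B- opens a new span; a stray I- is also treated as a new span.
        if prefix in ("B", "I") and field:
            current = (field, [token])

    if current is not None:
        spans.append((current[0], " ".join(current[1])))
    return spans


# One reference string, split on whitespace (an assumption; a pre-trained
# language model would normally use its own subword tokenizer).
TOKENS = (
    "Kim, J. H. (2003). A study on automatic extraction of citation "
    "information for reference linking. Journal of the Korean Society for "
    "Library and Information Sciences, 37(1), 247-268."
).split()

# Hand-written tags standing in for model output (hypothetical label set).
TAGS = (
    ["B-AUTHOR"] + ["I-AUTHOR"] * 2
    + ["B-YEAR"]
    + ["B-TITLE"] + ["I-TITLE"] * 10
    + ["B-JOURNAL"] + ["I-JOURNAL"] * 9
    + ["B-VOLUME"]
    + ["B-PAGES"]
)

if __name__ == "__main__":
    for field, text in group_bio_tags(TOKENS, TAGS):
        print(f"{field}: {text}")
```

In a setup like the one the abstract describes, per-field spans of this kind would be the unit on which recognition performance for each metadata type is measured.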
