Evaluation of English Term Extraction based on Inner/Outer Term Statistics

  • Kang, In-Su (Dept. of Computer Science, Kyungsung University)
  • Received: 2020.03.09
  • Reviewed: 2020.04.20
  • Published: 2020.04.29

Abstract

Automatic term extraction recognizes domain-specific terms in a collection of domain-specific text. Effective existing methods are unsupervised and consist of two main steps: extracting a set of candidate terms, and assigning an importance score to each candidate. For the scoring step, this study uses the sets of inner and outer terms of a candidate term. The inner terms of a candidate are the shorter terms contained in it as components, and its outer terms are the longer terms that contain it as a component. This work presents several strength functions that compute the term strength of a candidate from either its inner-term set or its outer-term set, and introduces a term importance score that combines these strength values with the C-value score. Experiments on two English datasets, GENIA (biology) and ACL RD-TEC 2.0 (computational linguistics), compare and analyze the term extraction performance of the proposed methods. The proposed method outperformed the baseline by up to 1% on GENIA and by up to 3% on the ACL dataset.
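The inner and outer term sets and their combination with C-value can be illustrated with a minimal Python sketch. It is not the paper's implementation. The C-value part follows the formula of Frantzi et al., using the common log2(length + 1) variant so that single-word terms do not score zero. The functions `strength` and `importance`, the weight `alpha`, and the toy frequencies are hypothetical stand-ins, since the abstract does not specify the actual strength functions or how they are combined.

```python
import math
from collections import Counter


def contains(longer, shorter):
    """True if token tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def inner_outer_sets(candidates):
    """Map each candidate to its inner and outer terms among the candidates."""
    inner = {c: set() for c in candidates}
    outer = {c: set() for c in candidates}
    for a in candidates:
        for b in candidates:
            if contains(a, b):  # b is a shorter component of a
                inner[a].add(b)
                outer[b].add(a)
    return inner, outer


def c_value(term, freq, outer):
    """C-value (Frantzi et al.), with the log2(length + 1) variant."""
    weight = math.log2(len(term) + 1)
    nested_in = outer[term]
    if not nested_in:
        return weight * freq[term]
    # Discount by the mean frequency of the longer terms containing `term`.
    penalty = sum(freq[b] for b in nested_in) / len(nested_in)
    return weight * (freq[term] - penalty)


def strength(related, freq):
    """Hypothetical strength function: mean frequency of the related terms."""
    return sum(freq[t] for t in related) / len(related) if related else 0.0


def importance(term, freq, inner, outer, alpha=0.1):
    """Hypothetical combination: C-value plus weighted inner/outer strengths."""
    return (c_value(term, freq, outer)
            + alpha * strength(inner[term], freq)
            + alpha * strength(outer[term], freq))


# Toy candidate terms (token tuples) with made-up corpus frequencies.
freq = Counter({
    ("cell",): 20,
    ("receptor",): 6,
    ("t", "cell"): 10,
    ("t", "cell", "receptor"): 4,
})
inner, outer = inner_outer_sets(list(freq))
ranked = sorted(freq, key=lambda t: importance(t, freq, inner, outer),
                reverse=True)
for term in ranked:
    print(" ".join(term), round(importance(term, freq, inner, outer), 3))
```

In this toy example, "t cell receptor" has three inner terms and no outer terms, while "cell" has two outer terms and no inner terms. Replacing `strength` with a different function changes only the last two terms of `importance`, which is the kind of variation the abstract describes comparing.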
