Terminology Recognition System based on Machine Learning for Scientific Document Analysis

Choi, Yun-Soo; Song, Sa-Kwang; Chun, Hong-Woo; Jeong, Chang-Hoo; Choi, Sung-Pil

doi:10.3745/KIPSTD.2011.18D.5.329

The KIPS Transactions: Part D (정보처리학회논문지D)

Volume 18D, Issue 5 / Pages 329-338 / 2011 / pISSN 1598-2866

Korea Information Processing Society (한국정보처리학회)



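Choi, Yun-Soo (Korea Institute of Science and Technology Information);
Song, Sa-Kwang (Korea Institute of Science and Technology Information);
Chun, Hong-Woo (Korea Institute of Science and Technology Information);
Jeong, Chang-Hoo (Korea Institute of Science and Technology Information);
Choi, Sung-Pil (Korea Institute of Science and Technology Information)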

Received : 2011.06.27
Accepted : 2011.08.17
Published : 2011.10.31

https://doi.org/10.3745/KIPSTD.2011.18D.5.329

Abstract

Terminology recognition is a prerequisite for text mining, information extraction, information retrieval, the semantic web, and question answering, but it has been studied intensively in only a limited range of domains, most notably the biomedical domain. Previous approaches generalize poorly to other domains because they rely on domain-specific resources or on statistics drawn from within the documents themselves. We therefore propose a domain-independent, machine-learning-based terminology recognition method that uses a dictionary, syntactic features of candidate terms, and Web search results. Compared with C-value, a widely used method based on local domain frequencies, the proposed approach achieved an F-score of 80.8, a 6.5% improvement. In a second experiment with various combinations of unithood features, the combination with NGD (Normalized Google Distance) performed best, with an F-score of 81.8. We applied three machine learning methods: logistic regression, C4.5, and SVMs. The decision tree method, C4.5, performed best overall, ahead of SVMs and logistic regression, which generally perform well on binary classification.
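The abstract names NGD as the best-performing unithood feature but does not give its formula. As a rough illustration, the sketch below computes NGD as defined by Cilibrasi and Vitanyi from search hit counts. The function name `ngd`, its parameters, and all counts are hypothetical. How the paper obtains counts or turns the distance into a classifier feature is not stated on this page.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance between two terms x and y.

    fx, fy -- number of pages containing x and y respectively
    fxy    -- number of pages containing both x and y
    n      -- assumed total number of indexed pages (a normalizing constant)

    Returns infinity when any count is zero, since the logarithm is
    undefined there (an assumption of this sketch, not of the paper).
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_x, log_y)
    if denominator <= 0:
        return math.inf
    return (max(log_x, log_y) - log_xy) / denominator


# Hypothetical hit counts for a two-word candidate term: a low distance
# suggests the words co-occur often enough to form a single unit.
tight = ngd(fx=1_000, fy=1_000, fxy=1_000, n=1_000_000)
loose = ngd(fx=1_000, fy=1_000, fxy=10, n=1_000_000)
```

With these made-up counts, words that always co-occur get a distance of zero, and the distance grows as joint hits become rare relative to the individual hits.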

References

Beatrice Daille, Eric Gaussier, and Jean-Marc Lange, "Towards Automatic Extraction of Monolingual and Bilingual Terminology," COLING-94, 1994.
Church, K. and Hanks, P., "Word association norms, mutual information, and lexicography," Computational Linguistics, Vol.16, No.1, pp.22-29, 1990.
Corinna Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, Vol.20, No.3, pp.273-297, 1995.
Dunning, T., "Accurate methods for the statistics of surprise and coincidence," Computational Linguistics, Vol.19, No.1, pp.61-74, 1993.
F. Smadja, K. R. McKeown, and V. Hatzivassiloglou, "Translating collocations for bilingual lexicons: A statistical approach," Computational Linguistics, Vol.22, No.1, pp.1-38, 1996.
G. Zhou, J. Zhang, J. Su, D. Shen, and C. Tan, "Recognizing names in biomedical texts: a machine learning approach," Bioinformatics, Vol.20, No.7, pp.1178-1190, 2004. https://doi.org/10.1093/bioinformatics/bth060
Ido Dagan and Kenneth W. Church, "Termight: Identifying and translating technical terminology," ANLP, pp.34-40, 1994.
J. Kazama, T. Makino, Y. Ohta, and J. Tsujii, "Tuning support vector machines for biomedical named entity recognition," Proceedings of the ACL-02 workshop on NLP in the biomedical domain, Vol.3, pp.1-8, 2002. https://doi.org/10.3115/1118149.1118150
Justeson, J. S. and Katz, S. M., "Technical terminology: some linguistic properties and an algorithm for identification in text," Natural Language Engineering, Vol.1, No.1, pp.9-27, 1995.
Joachim Wermter and Udo Hahn, "Paradigmatic Modifiability Statistics for the Extraction of Complex Multi-Word Terms," HLT'05 Proceedings of the conference on Human Language Technology and Empirical Methods in NLP, 2005.
K. Frantzi, S. Ananiadou, and Hideki Mima, "Automatic recognition of multi-word terms: the C-value/NC-value method," International Journal on Digital Libraries, Vol.3, No.2, pp.115-130, 2000. https://doi.org/10.1007/s007999900023
LIBSVM - A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Nakagawa, Hiroshi and Tatsunori Mori, "Automatic term recognition based on statistics of compound nouns and their components," Terminology, Vol.9, No.2, pp.201-219, 2003. https://doi.org/10.1075/term.9.2.04nak
Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
Rudi Cilibrasi and Paul Vitanyi, "The Google Similarity Distance," IEEE Trans. Knowledge and Data Engineering, Vol.19, No.3, pp.370-383, 2007. https://doi.org/10.1109/TKDE.2007.48
Qing T. Zeng, Tony Tse, et al., "Term identification methods for consumer health vocabulary development," Journal of Medical Internet Research, Vol.9, No.1, 2007.
WEKA - Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/
Y. Tseng, C. Lin, and Y. Lin, "Text mining techniques for patent analysis," Information Processing and Management, Vol.43, No.5, pp.1216-1247, 2007. https://doi.org/10.1016/j.ipm.2006.11.011

Cited by

"Machine Learning Process for the Prediction of the IT Asset Fault Recovery," Vol.2, No.4, 2013. https://doi.org/10.3745/KTSDE.2013.2.4.281
