Application of Machine Learning Techniques for Resolving Korean Author Names

Kang, In-Su

doi:10.3743/KOSIM.2008.25.3.027

Journal of the Korean Society for Information Management (정보관리학회지)

Volume 25, Issue 3, Pages 27-39, 2008. pISSN 1013-0799, eISSN 2586-2073

Korean Society for Information Management (한국정보관리학회)


Korean title: 한글 저자명 중의성 해소를 위한 기계학습기법의 적용

Kang, In-Su (강인수), Division of Computer and Information, Kyungsung University (경성대학교 컴퓨터정보학부)

Published : 2008.09.30

https://doi.org/10.3743/KOSIM.2008.25.3.027

Abstract

In bibliographic data, authors are identified by personal names, which makes it difficult to pin down a particular author because many distinct authors share the same name. Resolving same-name author instances into different individuals is called author resolution. It consists of two steps: calculating author similarities, and then clustering same-name author instances into groups that each correspond to one person. Author similarities are computed from the similarities of author-related bibliographic features, such as coauthors, paper titles, and publication information, using either supervised or unsupervised methods. Supervised approaches employ machine learning techniques to automatically learn the author similarity function from author-resolved training samples. So far, however, only a few machine learning methods have been investigated for author resolution. This paper provides a comparative evaluation of a variety of recent high-performing machine learning techniques on author disambiguation, and compares several methods of processing author disambiguation features such as coauthors and paper titles.
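The two steps described above (pairwise author similarity, then clustering) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the sample records, the use of Jaccard overlap for both the coauthor and title features, the equal feature weights, the 0.3 threshold, and the single-link grouping.

```python
from itertools import combinations

# Hypothetical bibliographic records that all carry the same author name.
records = [
    {"id": 0, "coauthors": {"Lee, S.", "Park, H."},
     "title": "Topic models for name disambiguation"},
    {"id": 1, "coauthors": {"Lee, S."},
     "title": "Name disambiguation with topic models"},
    {"id": 2, "coauthors": {"Choi, M."},
     "title": "Protein folding simulation on clusters"},
]

def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as no evidence (0.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def title_tokens(title):
    """Lowercased whitespace tokens of a title (an assumed, naive tokenizer)."""
    return set(title.lower().split())

def author_similarity(r1, r2, w_coauthor=0.5, w_title=0.5):
    """Step 1: combine feature similarities into one author similarity.

    The fixed weights are an unsupervised stand-in; a supervised method
    would learn this combination from author-resolved training samples.
    """
    s_coauthor = jaccard(r1["coauthors"], r2["coauthors"])
    s_title = jaccard(title_tokens(r1["title"]), title_tokens(r2["title"]))
    return w_coauthor * s_coauthor + w_title * s_title

def cluster(recs, threshold=0.3):
    """Step 2: single-link clustering, merging pairs at or above the threshold."""
    parent = list(range(len(recs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(recs)), 2):
        if author_similarity(recs[i], recs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(recs):
        groups.setdefault(find(i), []).append(rec["id"])
    return sorted(groups.values())

print(cluster(records))  # [[0, 1], [2]]
```

Records 0 and 1 share a coauthor and most title words, so they end up in one person group, while record 2 stays on its own. The threshold controls how aggressively instances are merged and would normally be tuned on held-out data.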

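Because different real-world people share the same personal name, entities that are expressed by personal names on the Internet must have their identities resolved. When this problem is restricted to author-name entities in scholarly information, it is called author resolution. Author resolution consists of a step that computes the similarity between the author-name entities to be resolved, that is, the author similarity, followed by a step that clusters the author-name entities. Author similarity is computed from the feature similarities of author resolution features such as coauthors, paper titles, and publication venue information, and both supervised and unsupervised methods have been used for this purpose. Supervised methods, which use author-resolved training samples, have the advantage over unsupervised methods that they can automatically learn an optimal author similarity function combining diverse author resolution features. However, previous studies of supervised methods have tried only a few machine learning techniques, such as SVM and MEM. This paper compares the performance, errors, and efficiency of a variety of machine learning techniques on author resolution, and provides a comparative analysis of several techniques for extracting feature values and computing feature similarities for the coauthor and paper-title features.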
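As a rough illustration of the supervised idea, the sketch below learns an author similarity function from labeled record pairs. It uses a small hand-written logistic regression as a stand-in learner; this is an assumption for illustration only and is not one of the techniques the paper reports on. The training pairs, feature values, learning rate, and epoch count are all hypothetical.

```python
import math

# Hypothetical author-resolved training pairs.
# Each item: ((coauthor_similarity, title_similarity), label),
# where label 1 means the two records belong to the same person.
train = [
    ((0.8, 0.6), 1),
    ((0.5, 0.7), 1),
    ((0.6, 0.2), 1),
    ((0.0, 0.3), 0),
    ((0.1, 0.1), 0),
    ((0.0, 0.0), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(pairs, lr=0.5, epochs=2000):
    """Batch gradient descent for logistic regression (assumed stand-in learner)."""
    n_features = len(pairs[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in pairs:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k in range(n_features):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(pairs)
    return w, b

def learned_similarity(x, w, b):
    """Learned author similarity in (0, 1) for one feature-similarity vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = fit(train)
print(learned_similarity((0.9, 0.8), w, b))  # high: likely the same person
print(learned_similarity((0.0, 0.0), w, b))  # low: likely different people
```

The learned weights play the role of the feature combination that an unsupervised method would have to fix by hand; the resulting scores can then feed a clustering step.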


Cited by

Exploration of Hierarchical Techniques for Clustering Korean Author Names, vol.40, no.2, 2009, https://doi.org/10.1633/JIM.2009.40.2.095
A Comparative Study on Authority Records for Japanese Writers in Japan and the United States of America, vol.48, no.1, 2014, https://doi.org/10.4275/KSLIS.2014.48.1.149
