Automatic Mapping Between Large-Scale Heterogeneous Language Resources for NLP Applications: A Case of Sejong Semantic Classes and KorLexNoun for Korean

Park, Heum;Yoon, Ae-Sun;

Language and Information (한국언어정보학회지:언어와정보)

Volume 15 Issue 2
/
Pages.23-45
/
2011
/
1226-7430(pISSN)

Korean Society for Language and Information (한국언어정보학회)

Automatic Mapping Between Large-Scale Heterogeneous Language Resources for NLP Applications: A Case of Sejong Semantic Classes and KorLexNoun for Korean

Park, Heum (Pusan National University, Center for U-Port IT Research and Education) ;
Yoon, Ae-Sun (Pusan National University, Department of French)

Received : 2011.10.28
Accepted : 2011.12.15
Published : 2011.12.31

PDF

Download PDF

⟨ Previous Next ⟩

Abstract

This paper proposes a statistical-based linguistic methodology for automatic mapping between large-scale heterogeneous languages resources for NLP applications in general. As a particular case, it treats automatic mapping between two large-scale heterogeneous Korean language resources: Sejong Semantic Classes (SJSC) in the Sejong Electronic Dictionary (SJD) and nouns in KorLex. KorLex is a large-scale Korean WordNet, but it lacks syntactic information. SJD contains refined semantic-syntactic information, with semantic labels depending on SJSC, but the list of its entry words is much smaller than that of KorLex. The goal of our study is to build a rich language resource by integrating useful information within SJD into KorLex. In this paper, we use both linguistic and statistical methods for constructing an automatic mapping methodology. The linguistic aspect of the methodology focuses on the following three linguistic clues: monosemy/polysemy of word forms, instances (example words), and semantically related words. The statistical aspect of the methodology uses the three statistical formulae ${\chi}^2$, Mutual Information and Information Gain to obtain candidate synsets. Compared with the performance of manual mapping, the automatic mapping based on our proposed statistical linguistic methods shows good performance rates in terms of correctness, specifically giving recall 0.838, precision 0.718, and F1 0.774.

Keywords

Acknowledgement

Supported by : Korea Science and Engineering Foundation (KOSEF)

Language and Information (한국언어정보학회지:언어와정보)

Automatic Mapping Between Large-Scale Heterogeneous Language Resources for NLP Applications: A Case of Sejong Semantic Classes and KorLexNoun for Korean

Abstract

Keywords

Acknowledgement

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)