A Probabilistic Context Sensitive Rewriting Method for Effective Transliteration Variants Generation

Lee, Jae-Sung

doi:10.5392/JKCA.2007.7.2.073

The Journal of the Korea Contents Association (한국콘텐츠학회논문지)

Volume 7, Issue 2 / Pages 73-83 / 2007 / 1598-4877 (pISSN) / 2508-6723 (eISSN)

The Korea Contents Association (한국콘텐츠학회)


효과적인 외래어 이형태 생성을 위한 확률 문맥 의존 치환 방법


Lee, Jae-Sung (Department of Computer Education, College of Education, Chungbuk National University)

Published: 2007.02.28

https://doi.org/10.5392/JKCA.2007.7.2.073


Abstract

An information retrieval system that relies on exact matching needs preprocessing or query expansion that generates transliteration variants in order to find the different transliterations of a foreign word in documents. This paper proposes an effective method for generating other transliteration variants from a given transliteration. Because simply rewriting confusable characters produces too many false variants, the proposed method learns confusion patterns from real usage, calculates their probabilities, and uses them to control the generation priority. In particular, it considers the left and right context of each pattern and calculates a local rewriting probability and a global rewriting probability so that the more probable variants are produced at an earlier stage. In an experiment on a set of transliteration variants collected from KT SET 2.0, the method proved very effective, achieving more than 80% recall with the top 20 generated variants.

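For an information retrieval system that mainly uses exact matching to retrieve transliteration variants of foreign words, preprocessing or query expansion that automatically generates those variants is necessary. Given a single transliterated foreign word as input, this study proposes a method for effectively generating the variants that are likely to be used in practice. Because simply substituting confusable graphemes generates an excessive number of unnecessary variants, this study learns confusion patterns from the transliteration variants used in real documents and converts them into probabilities to control the generation order. In particular, it takes the left and right context of each confusion pattern into account and calculates local and global substitution probabilities so that frequently used variants are generated early. An experiment on variant data extracted from KT SET 2.0 showed that the method is very effective, finding more than 80% of the variants on average with only the top 20 generated candidates.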
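
The generation scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the formulas for the local and global rewriting probabilities, so the sketch assumes a single hypothetical probability per context-sensitive rule and scores a variant by the product of the rules applied to reach it. The function name `generate_variants`, the rule table layout, and the toy romanized rules are all assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical rule table: (left context, confused segment, right context)
# maps to candidate replacements, each with an assumed rewriting probability.
Rules = Dict[Tuple[str, str, str], List[Tuple[str, float]]]


def generate_variants(word: str, rules: Rules, top_n: int) -> List[Tuple[str, float]]:
    """Return up to top_n variants of word, most probable first (best-first search)."""
    heap = [(-1.0, word)]  # min-heap on negated score, so the best score pops first
    done = set()           # strings that have already been popped and expanded
    results: List[Tuple[str, float]] = []
    while heap and len(results) < top_n:
        neg_score, current = heapq.heappop(heap)
        if current in done:
            continue  # a higher-scoring path to this string was handled earlier
        done.add(current)
        if current != word:
            results.append((current, -neg_score))
        for (left, src, right), targets in rules.items():
            pattern = left + src + right
            pos = current.find(pattern)
            while pos != -1:
                begin = pos + len(left)   # start of the segment to rewrite
                end = begin + len(src)    # end of the segment to rewrite
                for target, prob in targets:
                    candidate = current[:begin] + target + current[end:]
                    if candidate not in done:
                        heapq.heappush(heap, (neg_score * prob, candidate))
                pos = current.find(pattern, pos + 1)
    return results


# Toy rules (assumed values): "c" before "a" may become "k"; "f" after "a" may become "p".
toy_rules: Rules = {
    ("", "c", "a"): [("k", 0.6)],
    ("a", "f", ""): [("p", 0.5)],
}
for variant, score in generate_variants("cafe", toy_rules, 3):
    print(variant, score)
```

With these toy rules, the variants for `cafe` come out in the order `kafe`, `cape`, `kape`. Because every rule probability is at most 1, scores never increase along a rewriting path, so popping from the heap yields variants in non-increasing score order, which is what allows a small cutoff such as the top 20 to capture the most likely variants first.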


