Applying Randomization Tests to Collocation Analyses in Large Corpora

Yang Kyung-Sook; Kim HeeYoung

doi:10.5351/KJAS.2005.18.3.583

The Korean Journal of Applied Statistics (응용통계연구)

Volume 18, Issue 3 / Pages 583-595 / 2005 / 1225-066X (pISSN) / 2383-5818 (eISSN)

The Korean Statistical Society (한국통계학회)

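Korean title: 언어의 공기관계 분석을 위한 임의화검증의 응용 (An Application of Randomization Tests to the Analysis of Linguistic Co-occurrence Relations)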

Yang Kyung-Sook (양경숙), Brain Korea 21 The Education and Research Group for Korea Studies, Korea University;
Kim HeeYoung (김희영), Institute of Statistics, Korea University

Published: 2005.11.01

https://doi.org/10.5351/KJAS.2005.18.3.583

Abstract

Contingency tables of n-gram counts are used to decide whether an n-gram is a true collocation, meaning that the words making up the n-gram are highly associated in the text. Several statistical measures are used to identify collocations, including the Kulczinsky coefficient, the Ochiai coefficient, the Fager and McGowan coefficient, the Yule coefficient, mutual information, and the chi-square statistic. The main problem is that these measures are based on the assumption of a normal or approximately normal distribution of the variables being sampled. While this assumption is valid in most instances, it is not valid when comparing the rates of occurrence of rare events, and texts are composed mostly of rare events. In this paper we briefly review some statistics for testing the association of two words, and we propose randomization tests for evaluating significance levels when analyzing collocations in large corpora. A related graph can be used to compare different test statistics that can be applied to the same contingency table.

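Various association statistics are used to identify co-occurrence relations in language. Apart from a few of them, however, the distributions of these statistics are unknown, so that even after the value of a statistic has been computed it often cannot be interpreted clearly. For this reason, co-occurrence relations are frequently analyzed with hypothesis tests based on the normal approximation or the t statistic. Because the frequency of co-occurring words makes up only a very small percentage of the total frequency, however, the normal approximation appears questionable. This paper therefore uses randomization tests to examine the properties of the association statistics most often mentioned in the literature, with the aim of supporting a more accurate understanding of linguistic co-occurrence relations in the collocation analysis of quantitative linguistics, taking the characteristics of the data into account.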
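The randomization idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure, which this page does not describe in detail. It assumes a 2x2 contingency table for a bigram (w1, w2) with hypothetical cell names a (w1 followed by w2), b (w1 followed by another word), c (another word followed by w2) and d (neither), and it uses the commonly cited forms of the Ochiai coefficient and the Pearson chi-square statistic. The function names, the number of rounds, and the example counts are all illustrative assumptions.

```python
import math
import random


def ochiai(a, b, c, d):
    """Ochiai coefficient: a / sqrt((a + b) * (a + c))."""
    denom = math.sqrt((a + b) * (a + c))
    return a / denom if denom else 0.0


def chi_square(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def randomization_p_value(a, b, c, d, statistic, n_rounds=2000, seed=0):
    """Monte Carlo randomization test with both margins held fixed.

    Each round draws, without replacement, the positions occupied by w1
    and counts how many of them fall among the positions followed by w2.
    This simulates the table under the hypothesis of no association.
    """
    rng = random.Random(seed)
    n = a + b + c + d
    row1 = a + b  # bigrams whose first word is w1
    col1 = a + c  # bigrams whose second word is w2
    observed = statistic(a, b, c, d)
    hits = 0
    for _ in range(n_rounds):
        # Positions 0 .. col1-1 stand for the bigrams whose second word is w2.
        a_sim = sum(1 for i in rng.sample(range(n), row1) if i < col1)
        b_sim = row1 - a_sim
        c_sim = col1 - a_sim
        d_sim = n - row1 - col1 + a_sim
        if statistic(a_sim, b_sim, c_sim, d_sim) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_rounds + 1)


if __name__ == "__main__":
    # Hypothetical counts: 20 co-occurrences where about 3 would be expected.
    table = (20, 30, 40, 910)
    for stat in (ochiai, chi_square):
        print(stat.__name__, randomization_p_value(*table, stat, n_rounds=1000))
```

Because the reference distribution is simulated from the table itself, the same routine can be applied to any association measure whose sampling distribution is unknown, which is the situation the abstract describes. Note that the chi-square statistic is two-sided in this sketch (it also grows when the words co-occur less often than expected), whereas the Ochiai coefficient only grows with the co-occurrence count.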
