The Statistical Relationship between Linguistic Items and Corpus Size

한국언어정보학회지:언어와정보 (Language and Information)

Volume 7, Issue 2 / Pages 103-115 / 2003 / pISSN 1226-7430

한국언어정보학회 (Korean Society for Language and Information)

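코퍼스 빈도 정보 활용을 위한 적정 통계 모형 연구: 코퍼스 규모에 따른 타입/토큰의 함수관계 중심으로 (A Study of an Appropriate Statistical Model for Using Corpus Frequency Information: Focusing on the Functional Relationship between Types and Tokens by Corpus Size)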

양경숙 (Korea University); 박병선 (Korea University)

Published: 2003.12.01

Abstract

In recent years, many organizations have been building large corpora of their own in pursuit of corpus representativeness. However, there is no reliable guideline on how large a corpus should be, especially for Korean. In this study, we propose a new statistical model, ARIMA (Autoregressive Integrated Moving Average), for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous studies on this issue. Finally, we show that the presented ARIMA model is valid, accurate, and highly reliable. We are confident that this study can help solve some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness, and linguistic comprehensiveness.
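
To make the type/token relationship concrete, the following is a minimal sketch, not the authors' model. It counts distinct types as a token stream grows, then forecasts the next type count with an ARIMA(1,1,0)-style step: difference the series once, fit an AR(1) term by least squares, and integrate back. The abstract does not state the model order or the estimation method, so both are assumptions here, and all function names are hypothetical.

```python
from typing import List, Tuple


def type_growth(tokens: List[str], step: int) -> List[int]:
    """Return the number of distinct types seen after every `step` tokens."""
    seen = set()
    counts = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            counts.append(len(seen))
    return counts


def difference(series: List[float]) -> List[float]:
    """First difference: new types gained between consecutive checkpoints."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(series: List[float]) -> Tuple[float, float]:
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    if var_x == 0:
        # Constant predictor: fall back to the mean, with no AR term.
        return mean_y, 0.0
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = cov / var_x
    return mean_y - phi * mean_x, phi


def forecast_next_types(counts: List[int]) -> float:
    """One-step forecast of the type count (assumed order: ARIMA(1,1,0))."""
    if len(counts) < 3:
        raise ValueError("need at least three checkpoints")
    diffs = difference(counts)
    c, phi = fit_ar1(diffs)
    # Integrate back: last observed level plus the predicted increment.
    return counts[-1] + c + phi * diffs[-1]


if __name__ == "__main__":
    toy_counts = [10, 18, 22, 24]  # made-up type counts at equal token steps
    print(forecast_next_types(toy_counts))
```

The toy counts are invented for illustration only. A real study would use counts from an actual corpus and an established ARIMA implementation with proper order selection and diagnostics.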
