• Title/Summary/Keyword: Word Corpus

A Comparison of Frequency Effects among Korean Corpus (한국어 코퍼스에서 나타나는 빈도효과 비교)

  • Jung Jaebum;Lim Huisoek;Nam Kichun
    • Proceedings of the KSPS conference / 2002.11a / pp.93-96 / 2002
  • This research studied the correlation of word frequency effects across Korean corpora. Experiment 1 showed that the word frequencies of the corpora were significantly correlated with one another. Experiment 2 showed a significant correlation between the word frequency of each corpus and the lexical decision times of participants. These results suggest that the four corpora used in this research are stable with respect to the word frequency effect observed in participants.

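The cross-corpus comparison described in Experiment 1 can be illustrated with a minimal sketch. The abstract does not state which correlation statistic or frequency transformation was used, so the Pearson coefficient over log-transformed counts is an assumption here, and the word labels and counts are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_freq_correlation(corpus_a, corpus_b):
    """Correlate the log frequencies of words that occur in both tables."""
    shared = sorted(set(corpus_a) & set(corpus_b))
    log_a = [math.log(corpus_a[w]) for w in shared]
    log_b = [math.log(corpus_b[w]) for w in shared]
    return pearson(log_a, log_b)

# Hypothetical frequency tables (word -> count); not data from the paper.
corpus_a = {"w1": 1200, "w2": 300, "w3": 45, "w4": 7}
corpus_b = {"w1": 950, "w2": 410, "w3": 30, "w4": 12, "w5": 3}

r = log_freq_correlation(corpus_a, corpus_b)
print(f"correlation over shared words: {r:.3f}")
```

Only words present in both tables enter the comparison, which is one of several possible ways to handle words missing from a corpus.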

Word Order and Cliticization in Sakizaya: A Corpus-based Approach

  • Lin, Chihkai
    • Asia Pacific Journal of Corpus Research / v.1 no.2 / pp.41-56 / 2020
  • This paper investigates how word order interacts with cliticization in Sakizaya, a Formosan language. It examines nominative and genitive case markers using a corpus-based approach. The data are collected from an online dictionary of Sakizaya and are classified into two word orders: nominative case marker preceding genitive case marker, and vice versa. The data are also divided into three categories according to the demarcation of the case markers: right, left, or no demarcation. The corpus includes 700 sentences in the construction predicate + noun phrase + noun phrase. The results suggest that the two case markers tend to be parsed into the preceding word and show right demarcation. The results also reveal a type difference and a distance effect of the case markers on cliticization. Nominative case markers show more right demarcation than genitive case markers do in the corpus. In addition, the closer the case markers are to the predicate, the more likely they are to undergo cliticization.

An Analysis of the Vowel Formants of the Young Males in the Buckeye Corpus (벅아이 코퍼스에서의 젊은 성인 남성의 모음 포먼트 분석)

  • Yoon, Kyu-Chul;Noh, Hye-Uk
    • Phonetics and Speech Sciences / v.4 no.2 / pp.41-49 / 2012
  • The purpose of this paper is to extract the vowel formants of the ten young male speakers from the Buckeye Corpus of Conversational Speech [1] and to analyze them in comparison to earlier works in terms of various phonetic factors that are expected to affect the realization of the formant distribution. The first two formant frequency values were automatically extracted with a Praat script along with such factors as the place of articulation, the content versus function word information, the syllabic stress information, the location in a word, the location in an utterance, the speech rate of the three consecutive words, and the word frequency in the corpus. The results indicated that the formant patterns from the corpus were very different from those of earlier works although the overall pattern was similar, and that the factors were strongly responsible for the realization of the two formants.

An Attempt to Measure the Familiarity of Specialized Japanese in the Nursing Care Field

  • Haihong Huang;Hiroyuki Muto;Toshiyuki Kanamaru
    • Asia Pacific Journal of Corpus Research / v.4 no.2 / pp.57-74 / 2023
  • Having a firm grasp of technical terms is essential for learners of Japanese for Specific Purposes (JSP). This research aims to analyze Japanese nursing care vocabulary based on objective corpus-based frequency and subjectively rated word familiarity. For this purpose, we constructed a text corpus centered on the National Examination for Certified Care Workers to extract nursing care keywords. The Log-Likelihood Ratio (LLR) was used as the statistical criterion for keyword identification, giving a list of 300 keywords as target words for a further word recognition survey. The survey involved 115 participants of whom 51 were certified care workers (CW group) and 64 were individuals from the general public (GP group). These participants rated the familiarity of the target keywords through crowdsourcing. Given the limited sample size, Bayesian linear mixed models were utilized to determine word familiarity rates. Our study conducted a comparative analysis of word familiarity between the CW group and the GP group, revealing key terms that are crucial for professionals but potentially unfamiliar to the general public. By focusing on these terms, instructors can bridge the knowledge gap more efficiently.
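The keyword-identification step above can be sketched with the log-likelihood statistic commonly used for corpus keyness. The abstract does not say which formulation of the LLR was applied, so the two-term form below (target corpus versus reference corpus) is an assumption, and the function and parameter names are illustrative only.

```python
import math

def log_likelihood_ratio(freq_target, freq_ref, size_target, size_ref):
    """Keyness of a word: observed counts in a target and a reference corpus
    compared with the counts expected if the word were equally frequent in both."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    llr = 0.0
    # A zero count contributes nothing (the limit of x * log(x) as x -> 0 is 0).
    if freq_target > 0:
        llr += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        llr += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * llr

def top_keywords(target_counts, ref_counts, size_target, size_ref, n):
    """Rank words that are relatively more frequent in the target corpus by LLR."""
    scored = []
    for word, freq in target_counts.items():
        ref_freq = ref_counts.get(word, 0)
        if freq / size_target > ref_freq / size_ref:
            score = log_likelihood_ratio(freq, ref_freq, size_target, size_ref)
            scored.append((score, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:n]]
```

With this form, a word that occurs at the same rate in both corpora scores 0, and larger values indicate a stronger association with the target corpus. Calling `top_keywords(..., n=300)` would yield a list of the size mentioned in the abstract.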

An Analysis of the Vowel Formants of the Young Females in the Buckeye Corpus (벅아이 코퍼스에서의 젊은 성인 여성의 모음 포먼트 분석)

  • Yoon, Kyuchul
    • Phonetics and Speech Sciences / v.4 no.4 / pp.45-52 / 2012
  • The purpose of this paper is to measure the first two vowel formants of the ten young female speakers from the Buckeye Corpus of Conversational Speech [1] automatically and then to analyze various potential factors that may affect the formant distribution of the eight peripheral vowels of English. The factors that were analyzed included the place of articulation, the content versus function word information, the syllabic stress information, the location in a word, the location in an utterance, the speech rate of the three consecutive words, and the word frequency in the corpus. The results indicate that the overall formant patterns of the female speakers were similar to those of earlier works. The effects of the factors on the realization of the two formants were also similar to those from the male speakers with minor differences.

Input Dimension Reduction based on Continuous Word Vector for Deep Neural Network Language Model (Deep Neural Network 언어모델을 위한 Continuous Word Vector 기반의 입력 차원 감소)

  • Kim, Kwang-Ho;Lee, Donghyun;Lim, Minkyu;Kim, Ji-Hwan
    • Phonetics and Speech Sciences / v.7 no.4 / pp.3-8 / 2015
  • In this paper, we investigate an input dimension reduction method using continuous word vectors in a deep neural network language model. In the proposed method, continuous word vectors were generated from a large training corpus using Google's Word2Vec so as to satisfy the distributional hypothesis. The discrete word vectors in 1-of-${\left|V\right|}$ coding were replaced with their corresponding continuous word vectors. In our implementation, the input dimension was successfully reduced from 20,000 to 600 when a tri-gram language model was used with a vocabulary of 20,000 words. The total training time was reduced from 30 days to 14 days for the Wall Street Journal training corpus (corpus length: 37M words).
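The replacement of 1-of-|V| vectors by continuous vectors can be sketched as follows. This is a toy illustration, not the authors' implementation: the vocabulary size, the vector size, and the random vectors standing in for Word2Vec output are all assumptions, and the abstract does not specify how the input dimension was counted across context words.

```python
import random

def one_hot(index, vocab_size):
    """1-of-|V| coding: a vector of zeros with a single 1 at the word index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def build_input(context_ids, embeddings):
    """Concatenate the continuous vectors of the context words."""
    features = []
    for word_id in context_ids:
        features.extend(embeddings[word_id])
    return features

vocab_size = 1000     # toy value (assumption)
embedding_dim = 30    # toy value (assumption)

# Random stand-in for vectors that would normally be trained with Word2Vec.
rng = random.Random(0)
embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
              for _ in range(vocab_size)]

# The two preceding words that condition a tri-gram prediction.
context_ids = [12, 847]

discrete_input = one_hot(context_ids[0], vocab_size) + one_hot(context_ids[1], vocab_size)
continuous_input = build_input(context_ids, embeddings)

print(len(discrete_input), len(continuous_input))
```

The network's first layer then has far fewer input weights to train, which is consistent with the shorter training time reported above.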

Vocabulary Coverage Improvement for Embedded Continuous Speech Recognition Using Part-of-Speech Tagged Corpus (품사 부착 말뭉치를 이용한 임베디드용 연속음성인식의 어휘 적용률 개선)

  • Lim, Min-Kyu;Kim, Kwang-Ho;Kim, Ji-Hwan
    • MALSORI / no.67 / pp.181-193 / 2008
  • In this paper, we propose a vocabulary coverage improvement method for embedded continuous speech recognition (CSR) using a part-of-speech (POS) tagged corpus. We investigate the 152 POS tags defined in the Lancaster-Oslo-Bergen (LOB) corpus and the word-POS tag pairs. We derive a new vocabulary through word addition. Words paired with some POS tags have to be included in vocabularies of any size, but the inclusion of words paired with other POS tags varies with the target size of the vocabulary. The 152 POS tags are categorized according to whether the word addition depends on the size of the vocabulary. Using expert knowledge, we first classify the POS tags and then apply different ways of word addition based on the POS tags paired with the words. The performance of the proposed method is measured in terms of coverage and is compared with that of vocabularies of the same size (5,000 words) derived from frequency lists. The coverage of the proposed method is 95.18% for the test short message service (SMS) text corpus, while the conventional vocabularies cover only 93.19% and 91.82% of the words appearing in the same SMS text corpus.

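The coverage measure and the frequency-list baseline mentioned in the abstract above can be sketched as follows. This assumes coverage is counted over running word tokens of the test corpus, which the abstract does not state explicitly; the function names are illustrative, and the POS-based word addition itself is not reproduced here.

```python
from collections import Counter

def frequency_vocabulary(training_tokens, size):
    """Baseline vocabulary: the `size` most frequent words of a training corpus."""
    return {word for word, _ in Counter(training_tokens).most_common(size)}

def coverage(vocabulary, test_tokens):
    """Percentage of test tokens that are contained in the vocabulary."""
    if not test_tokens:
        return 0.0
    covered = sum(1 for token in test_tokens if token in vocabulary)
    return 100.0 * covered / len(test_tokens)

# Toy example; the sentences are made up for illustration.
training_tokens = "the cat sat on the mat the cat".split()
test_tokens = "the dog sat on the cat".split()

vocabulary = frequency_vocabulary(training_tokens, size=2)
print(f"coverage: {coverage(vocabulary, test_tokens):.2f}%")
```

A vocabulary built by any other method, such as the POS-based word addition described above, can be passed to `coverage` in the same way for a like-for-like comparison at a fixed vocabulary size.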

A Study on the Vowel Duration of the Buckeye Corpus (벅아이 코퍼스의 모음 길이 연구)

  • Chung, Hyejung;Yoon, Kyuchul
    • Phonetics and Speech Sciences / v.7 no.4 / pp.103-110 / 2015
  • The purpose of this study is to assess vowel properties by examining the duration of the American English vowels found in the Buckeye corpus [6]. The vowel durations were analyzed in terms of various linguistic factors, including the number of syllables of the word containing the vowel, the location of the vowel in a word, the type of stress, function versus content word, the word frequency in the corpus, and the speech rate calculated from three consecutive words. The findings from this work agreed mostly with those from earlier studies, with some exceptions. The relationship between the speech rate and the vowel duration proved non-linear.

Comparison Thai Word Sense Disambiguation Method

  • Modhiran, Teerapong;Kruatrachue, Boontee;Supnithi, Thepchai
    • Proceedings of the Institute of Control, Robotics and Systems Conference (제어로봇시스템학회:학술대회논문집) / 2004.08a / pp.1307-1312 / 2004
  • Word sense disambiguation (WSD) is one of the most important problems in natural language processing research areas such as information retrieval and machine translation. Several strategies can be employed to resolve word ambiguity with a reasonable degree of accuracy: knowledge-based, corpus-based, and hybrid. This paper focuses on the corpus-based strategy. Its purpose is to compare three well-known machine learning techniques, SNoW, SVM, and Naive Bayes, for word sense disambiguation in Thai. Ten ambiguous words are selected for testing with word and POS features. The results show that the SVM algorithm gives the best results for Thai WSD, with an accuracy of approximately 83-96%.

A Study on the Voice Onset Time of English Voiceless Stops in the Buckeye Corpus (벅아이 코퍼스를 이용한 영어 무성파열음의 VOT 연구)

  • Yoon, Kyu-Chul
    • Phonetics and Speech Sciences / v.4 no.2 / pp.33-40 / 2012
  • The purpose of this paper is to investigate the voice onset time (VOT) of the English voiceless stops [p, t, k] found in the Buckeye Corpus of Conversational Speech [1]. Three young female speakers were chosen for this study and their VOT values were semi-automatically extracted along with other factors. The factors used for the analysis were place of articulation, location in word, syllabic stress, content word or not, word frequency calculated from the corpus, and the speech rate expressed in syllables per second. Results showed that, for the three places of articulation of each speaker, all the factors had a statistically significant effect on the VOT values. This paper has significance in that the materials used for the analysis were from a corpus of spontaneous natural English speech.