• Title/Abstract/Keyword: experimental phonetics

Search results: 89

L1-norm regularization을 통한 SGMM의 state vector 적응 (L1-norm Regularization for State Vector Adaptation of Subspace Gaussian Mixture Model)

  • 구자현;김영관;김회린
    • 말소리와 음성과학 / Vol.7 No.3 / pp.131-138 / 2015
  • In this paper, we propose L1-norm regularization for state vector adaptation of the subspace Gaussian mixture model (SGMM). When designing a speaker adaptation system with a GMM-HMM acoustic model, MAP is the most typical technique to consider. However, in the MAP adaptation procedure, a large number of parameters must be updated simultaneously. Sparse adaptation, such as L1-norm regularization or sparse MAP, can cope with this, but its performance is not as good as that of MAP adaptation. The SGMM, however, does not suffer from sparse adaptation as much as the GMM-HMM does, because each Gaussian mean vector in the SGMM is defined as a weighted sum of basis vectors, which is much more robust to parameter fluctuation. Since only a few adaptation techniques are appropriate for the SGMM, the proposed method can be powerful, especially when the amount of adaptation data is limited. Experimental results show that the error reduction rate of the proposed method is better than that of MAP adaptation of the SGMM, even with little adaptation data.
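The core computation described above, adapting a state vector under an L1 penalty on its deviation from the speaker-independent value, can be sketched with a small proximal-gradient (ISTA) solver. This is an illustrative reconstruction under assumed choices (objective, step size, penalty weight), not the paper's implementation:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the L1 norm: shrink each coordinate toward zero.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def adapt_state_vector(M, mu_target, v0, lam=0.01, lr=0.005, n_iter=2000):
    """Solve min_d ||mu_target - M (v0 + d)||^2 + lam * ||d||_1 by ISTA.

    M         : basis (projection) matrix defining the Gaussian means
    mu_target : mean statistic estimated from the adaptation data
    v0        : speaker-independent state vector; only the delta d is penalized
    """
    d = np.zeros_like(v0)
    for _ in range(n_iter):
        grad = -2.0 * M.T @ (mu_target - M @ (v0 + d))  # gradient of the quadratic term
        d = soft_threshold(d - lr * grad, lr * lam)      # gradient step + L1 prox
    return v0 + d
```

Because the prox step drives small coordinates of the delta exactly to zero, only a few entries of the state vector move when adaptation data are scarce, which is the sparsity the abstract relies on.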

화자확인에서 일정한 결과를 얻기 위한 빠른 순시 확률비 테스트 방법 (Fast Sequential Probability Ratio Test Method to Obtain Consistent Results in Speaker Verification)

  • 김은영;서창우;전성채
    • 말소리와 음성과학 / Vol.2 No.2 / pp.63-68 / 2010
  • A new version of the sequential probability ratio test (SPRT), investigated for utterance-length control, is proposed to obtain uniform response results in speaker verification (SV). Although SPRTs yield fast responses in SV tests, performance may differ depending on the composition of consonants and vowels in the sentences used. This paper proposes a fast sequential probability ratio test (FSPRT) method that shows consistent performance regardless of the composition of the vocalized sentences. In generating frames, the FSPRT first conducts the SV test with frames generated without any overlap; if the results do not satisfy the discrimination criteria, it then sequentially uses frames with overlap applied. In this way, the test is not affected by the composition of the sentences, so fast responses and consistent performance can both be obtained. Experimental results show that the FSPRT performs better than the SPRT method while requiring less complexity at equal error rates (EER).
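Both the SPRT and the FSPRT build on Wald's sequential decision rule: accumulate per-frame log-likelihood ratios and stop as soon as an acceptance or rejection threshold is crossed. A minimal sketch, with thresholds from the standard Wald approximations; the per-frame scores are assumed to come from the verification model:

```python
import math

def sprt(llr_frames, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test.

    llr_frames : iterable of per-frame log-likelihood ratios,
                 log p(frame | target speaker) - log p(frame | impostor)
    alpha/beta : tolerated false-acceptance / false-rejection rates
    Returns (decision, number of frames consumed).
    """
    upper = math.log((1.0 - beta) / alpha)  # crossing it -> accept the claimed speaker
    lower = math.log(beta / (1.0 - alpha))  # crossing it -> reject
    s, n = 0.0, 0
    for llr in llr_frames:
        n += 1
        s += llr
        if s >= upper:
            return "accept", n
        if s <= lower:
            return "reject", n
    return "undecided", n
```

The FSPRT's refinement, per the abstract, lies in how frames are generated: the test first runs on non-overlapping frames and falls back to overlapped frames only when neither threshold is reached, so the decision no longer depends on the sentence's segmental composition.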


Relationship between executive function and cue weighting in Korean stop perception across different dialects and ages

  • Kong, Eun Jong;Lee, Hyunjung
    • 말소리와 음성과학 / Vol.13 No.3 / pp.21-29 / 2021
  • The present study investigated how cognitive resources are related to speech perception by examining Korean speakers' executive function (EF) capacity and its association with voice onset time (VOT) and f0 sensitivity in identifying Korean stop laryngeal categories (/t'/ vs. /t/ vs. /th/). Previously, Kong et al. (under revision) reported that Korean listeners (N = 154) in Seoul and Changwon (Gyeongsang) showed differential group patterns in dialect-specific cue weightings across educational institutions (college, high school, and elementary school). We followed up on that study by relating the listeners' EF control (working memory, mental flexibility, and inhibition) to their speech perception patterns, to examine whether better cognitive ability supports controlled attention to multiple acoustic dimensions. Partial correlation analyses revealed that better EFs in Korean listeners were associated with greater sensitivity to available acoustic details and with greater suppression of irrelevant acoustic information across subgroups, although only a small set of EF components turned out to be relevant. Unlike Seoul participants, Gyeongsang listeners' f0 use was not correlated with any EF task scores, reflecting dialect-specific cue primacy in which f0 serves as a secondary cue. The findings confirm the link between speech perception and general cognitive ability, providing experimental evidence from Korean listeners.
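The partial correlation used in the analysis (relating an EF score to a cue-sensitivity measure while holding a covariate constant) can be computed from regression residuals. A generic sketch, not the authors' analysis code:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after partialling out the covariate z:
    regress each on z (with intercept) and correlate the residuals."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])
```

If two measures correlate only because both track the covariate, the partial correlation collapses toward zero while the plain correlation stays high, which is what makes it suitable for separating EF effects from, say, age.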

Pitch trajectories of English vowels produced by American men, women, and children

  • Yang, Byunggon
    • 말소리와 음성과학 / Vol.10 No.4 / pp.31-37 / 2018
  • Pitch trajectories reflect the continuous variation of vocal fold movements over time. This study examined the pitch trajectories of English vowels produced by 139 American English speakers, analyzing the trajectories statistically with generalized additive mixed models (GAMMs). First, Praat was used to read the sound data of Hillenbrand et al. (1995). A pitch analysis script was then prepared, and six pitch values at corresponding time points within each vowel segment were collected and checked. The results showed that the men produced the lowest pitch trajectories, followed by the women, the boys, and then the girls. The density line showed a bimodal distribution. The pitch values at the six corresponding time points formed a single dip, changing gradually across the vowel segment from 204 to 193 to 196 Hz. Normality tests on the pitch data rejected the null hypothesis, so nonparametric tests were conducted to identify significant differences in the values among the four groups. The GAMMs fitted to all the pitch data found significant differences among the pitch values at the six corresponding time points but not between the groups of boys and girls; the two groups differed significantly only at the first and second time points. Accordingly, the methodology and findings of this study may be applicable to future studies comparing curvilinear data sets elicited under experimental conditions.
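The measurement the script performs, six pitch values at corresponding time points within each vowel segment, amounts to sampling the pitch track at fixed relative positions. Praat was used in the study; this numpy version is only an illustration of that step:

```python
import numpy as np

def sample_pitch(times, f0, seg_start, seg_end, n_points=6):
    """Linearly interpolate a pitch track (times in s, f0 in Hz) at
    n_points equidistant time points spanning one vowel segment."""
    targets = np.linspace(seg_start, seg_end, n_points)
    return np.interp(targets, times, f0)
```

Sampling every vowel at the same relative positions is what makes the time points "corresponding" across tokens, so trajectories of different durations can be compared within one GAMM.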

Hyperparameter experiments on end-to-end automatic speech recognition

  • Yang, Hyungwon;Nam, Hosung
    • 말소리와 음성과학 / Vol.13 No.1 / pp.45-51 / 2021
  • End-to-end (E2E) automatic speech recognition (ASR) has achieved promising performance gains with the introduction of the self-attention network, the Transformer. However, because of long training times and the number of hyperparameters, finding the optimal hyperparameter set is computationally expensive. This paper investigates the impact of the hyperparameters of the Transformer network to answer two questions: which hyperparameters play a critical role in task performance, and which in training speed. The network consists of encoder and decoder networks combined with connectionist temporal classification (CTC). We trained the model on Wall Street Journal (WSJ) SI-284 and tested on dev93 and eval92. Seventeen hyperparameters were selected from the ESPnet training configuration, and varying ranges of values were used in the experiments. The results show that the "num blocks" and "linear units" hyperparameters in the encoder and decoder networks reduce word error rate (WER) significantly, and the performance gain is more prominent when they are altered in the encoder network. Training duration also increased linearly as the values of "num blocks" and "linear units" grew. Based on the experimental results, we collected the optimal value of each hyperparameter and reduced the WER by up to 2.9 and 1.9 on dev93 and eval92, respectively.
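A sweep over the two hyperparameters the study found most influential could be organized as below. The key names follow the "num blocks" and "linear units" labels quoted above; the base config and value ranges are illustrative assumptions, not the study's settings:

```python
from itertools import product

def make_configs(num_blocks_opts, linear_units_opts, base=None):
    """Enumerate one training configuration per (num_blocks, linear_units)
    pair, copying any shared settings from an optional base dict."""
    for nb, lu in product(num_blocks_opts, linear_units_opts):
        cfg = dict(base or {})
        cfg["num_blocks"] = nb
        cfg["linear_units"] = lu
        yield cfg
```

Each generated dict would be handed to the training recipe; given the reported linear growth of training time in both values, ordering the grid from small to large lets cheap runs finish (or fail) first.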

배경소음상황에 따른 성인 말더듬화자의 발화 관련 변수 비교 (Effects of Background Noises on Speech-related Variables of Adults who Stutter)

  • 박진;오선영;전제표;강진석
    • 말소리와 음성과학 / Vol.7 No.1 / pp.27-37 / 2015
  • This study investigated the effects of background noises (i.e., white noise and multi-speaker conversational babble) on stuttering rate and other speech-related measures (i.e., articulation rate and speech effort). Nine Korean-speaking adults who stutter participated. Each participant read a series of passages under each of four experimental conditions: typical solo reading (TR), choral reading (CR), reading with white noise presented (WR), and reading with multi-speaker conversational babble presented (BR). Stuttering rate was computed as the percentage of syllables stuttered (%SS), and articulation rate was assessed as another speech-related measure under each condition. To examine the amount of physical effort needed to read, speech effort was measured using the 9-point Speech Effort Self-Rating Scale originally employed by Ingham et al. (2006). The results showed no significant differences among the passage reading conditions in stuttering rate, articulation rate, or speech effort. In conclusion, the two types of background noise (white noise and multi-speaker conversational babble) do not differ in the extent to which they enhance the fluency of adults who stutter. Self-ratings of speech effort may also be useful for measuring speech-related variables associated with vocal changes induced under fluency-enhancing conditions.
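The two quantitative measures named above are simple ratios; a minimal sketch (counting conventions, such as how pauses and stuttered syllables are excluded from the articulation-rate count, follow the study's protocol and are not reproduced here):

```python
def percent_syllables_stuttered(stuttered, total):
    """Stuttering rate as a percentage of syllables stuttered (%SS)."""
    if total <= 0:
        raise ValueError("total syllable count must be positive")
    return 100.0 * stuttered / total

def articulation_rate(fluent_syllables, speaking_time_s):
    """Articulation rate in syllables per second of speaking time."""
    if speaking_time_s <= 0:
        raise ValueError("speaking time must be positive")
    return fluent_syllables / speaking_time_s
```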

중국어 상성이 중국인의 한자어 발음에 미치는 영향 연구: 부분이형동의어를 중심으로 (The Influence of Chinese Falling-Rising Tone on the Pitch of Sino-Korean Words Pronounced by Chinese Learners: Focusing on the Partly-Different-Form-Same-Meaning Words)

  • 유사양;김영주
    • 말소리와 음성과학 / Vol.4 No.2 / pp.21-31 / 2012
  • The purpose of this study is to determine how the Chinese falling-rising tone influences the pitch patterns of corresponding partly-different-form-same-meaning Sino-Korean words produced by Chinese learners of Korean. The scope of the research is limited to Chinese learners of Korean and to two groups of Sino-Korean words, the AB:CB and AB:AC types, which are the second-most frequently occurring different-form-same-meaning Sino-Korean words. In this study, Chinese learners pronounced both the Chinese words and the corresponding Sino-Korean words. The learners' pitch patterns were recorded, analyzed with speech analysis software, and compared with the tones of the corresponding Chinese words. The results showed that the AB:CB type Sino-Korean words were not affected by the Chinese 'falling-rising tone - high and level tone' pattern. Likewise, there was no significant influence of the Chinese falling-rising tone on the pitch patterns of the AB:AC type Sino-Korean words. It was clear, however, that the Chinese learners made pitch errors on both the AB:CB and AB:AC types. In conclusion, the Chinese learners' pitch patterns for partly-different-form-same-meaning Sino-Korean words differ from those of native Korean speakers, but their pitch errors cannot be attributed to the Chinese falling-rising tone.

외래어 어두경음화 발음의 원인과 사회계층 (Causes and Hierarchy of Loanwords Word-initial Glottalization)

  • 박지윤
    • 한국콘텐츠학회논문지 / Vol.21 No.2 / pp.421-430 / 2021
  • The way word-initial glottalization varies by social class deserves attention. The higher the speaker's education level, and the more official and formal the setting, the stronger the tendency to avoid word-initial glottalization in order to pronounce loanwords closer to the original English. The purpose of this study is to demonstrate this weakening of loanword word-initial glottalization across social classes through an experimental survey and the Praat speech analysis program. Previous accounts of loanword word-initial glottalization have variously invoked expressive reinforcement, the pressures of competition in modern society, unconditional glottalization, Korean historical-linguistic analysis, phonetic and phonological differences between Korean and English, and attempts to regularize loanword pronunciation, but analyses grounded in social class have been lacking. Using an experimental survey and Praat, this paper shows that word-initial glottalization of loanwords appears more weakly the higher the speaker's education level and the more official and formal the setting. Whether a speaker glottalizes loanword-initial consonants thus operates as a psychological mechanism, serving as a means of displaying one's status and class.

음성 훈련에 따른 영어 모음의 인지와 발화 관계 (Relation between Perception and Production of English Vowels by Phonetic Training)

  • 정순용;초미희
    • 한국콘텐츠학회논문지 / Vol.15 No.2 / pp.542-551 / 2015
  • To examine how Korean students perceive and produce English vowels, this study administered perception and production tests twice (pre- and post-test) to 42 university students enrolled in an English major course, using English words containing the 11 target vowels /i, ɪ, eɪ, ε, æ, ɑ, ɔ, oʊ, ʊ, u, ʌ/. The control group received only classroom instruction in an English phonetics course, whereas the experimental group received four weeks of perception and production training in addition to that instruction. The specific aims were to determine how much perception and production accuracy improved after the four weeks of training and practice, and to examine the correlation between perception and production in the two groups on the pre- and post-tests. The results showed that both groups exhibited a strong correlation between perception and production on the pre-test, but neither group showed a correlation on the post-test. This outcome was mainly attributable to changes in post-test perception and production accuracy: the control group showed larger changes in perception, whereas the experimental group showed larger changes in production, and these changes affected the correlation. Pedagogical implications are also discussed on the basis of these results.

음성인식 기반 응급상황관제 (Emergency dispatching based on automatic speech recognition)

  • 이규환;정지오;신대진;정민화;강경희;장윤희;장경호
    • 말소리와 음성과학 / Vol.8 No.2 / pp.31-39 / 2016
  • In emergency dispatching at the 119 Command & Dispatch Center, inconsistencies between the 'standard emergency aid system' and the 'dispatch protocol,' both of which are mandatory to follow, cause inefficiency in the dispatcher's performance. If an emergency dispatch system uses automatic speech recognition (ASR) to process the dispatcher's protocol speech during case registration, it can instantly extract and provide the information required by the 'standard emergency aid system,' making the rescue command more efficient. For this purpose, we developed a Korean large-vocabulary continuous speech recognition system with a 400,000-word vocabulary for the emergency dispatch system, covering the news, SNS, blog, and emergency rescue domains. The acoustic model was trained on 1,300 hours of telephone (8 kHz) speech, and the language model on a 13 GB text corpus. From the transcribed corpus of 6,600 real telephone calls, call logs with the emergency rescue command class and the identified major symptom were extracted and linked to the rescue activity log and the National Emergency Department Information System (NEDIS). ASR is applied to the dispatcher's repetition utterances about the patient information, and the emergency patient information is extracted based on the Levenshtein distance between the ASR result and the template information. Experimental results show a speech recognition word error rate of 9.15% and an emergency response detection rate of 95.8% for the emergency dispatch system.
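The matching step, scoring the ASR hypothesis against template fields by Levenshtein distance, uses the classic dynamic-programming edit distance. A compact sketch; the field-selection helper around it is a hypothetical illustration, not the system's code:

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]

def closest_field(hypothesis, template_fields):
    # Hypothetical selection step: pick the template entry nearest the ASR output.
    return min(template_fields, key=lambda f: levenshtein(hypothesis, f))
```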