• Title/Summary/Keyword: Phoneme Error

Combining multi-task autoencoder with Wasserstein generative adversarial networks for improving speech recognition performance (음성인식 성능 개선을 위한 다중작업 오토인코더와 와설스타인식 생성적 적대 신경망의 결합)

  • Kao, Chao Yuan;Ko, Hanseok
    • The Journal of the Acoustical Society of Korea / v.38 no.6 / pp.670-677 / 2019
  • Background noise in an acoustic signal degrades the performance of speech and acoustic event recognition, and extracting noise-robust acoustic features from noisy signals remains challenging. In this paper, we propose a deep learning architecture that combines a Wasserstein Generative Adversarial Network (WGAN) with a MultiTask AutoEncoder (MTAE), integrating the strengths of both so that it estimates not only noise but also speech features from a noisy acoustic source. The proposed MTAE-WGAN structure estimates the speech signal and the residual noise by employing a gradient penalty and a weight initialization method for Leaky Rectified Linear Unit (LReLU) and Parametric ReLU (PReLU). With the adopted gradient penalty loss function, the proposed MTAE-WGAN structure enhances the speech features and subsequently achieves substantial Phoneme Error Rate (PER) improvements over the stand-alone Deep Denoising Autoencoder (DDAE), MTAE, Redundant Convolutional Encoder-Decoder (R-CED), and Recurrent MTAE (RMTAE) models for robust speech recognition.

Pronunciation Variation Patterns of Loanwords Produced by Korean and Grapheme-to-Phoneme Conversion Using Syllable-based Segmentation and Phonological Knowledge (한국인 화자의 외래어 발음 변이 양상과 음절 기반 외래어 자소-음소 변환)

  • Ryu, Hyuksu;Na, Minsu;Chung, Minhwa
    • Phonetics and Speech Sciences / v.7 no.3 / pp.139-149 / 2015
  • This paper aims to analyze pronunciation variations of loanwords produced by Korean speakers and to improve the performance of loanword pronunciation modeling in Korean by using syllable-based segmentation and phonological knowledge. The loanword text corpus used for our experiment consists of 14.5k words extracted from frequently used words in the set-top box, music, and point-of-interest (POI) domains. First, pronunciations of loanwords in Korean are obtained by manual transcription and used as target pronunciations. The target pronunciations are compared with the standard pronunciation using confusion matrices to analyze pronunciation variation patterns of loanwords. Based on the confusion matrices, three salient pronunciation variations of loanwords are identified, including tensification of the fricative [s] and derounding of the rounded vowels [ɥi] and [wɛ]. In addition, a syllable-based segmentation method that considers phonological knowledge is proposed for loanword pronunciation modeling. Performance of the baseline and the proposed method is measured using phone error rate (PER), word error rate (WER), and F-score at various context spans. Experimental results show that the proposed method outperforms the baseline. We also observe that performance degrades when training and test sets come from different domains, which implies that loanword pronunciations are influenced by data domain. Notably, pronunciation modeling for loanwords is enhanced by reflecting phonological knowledge. The loanword pronunciation modeling in Korean proposed in this paper can be used for automatic speech recognition in application interfaces such as navigation systems and set-top boxes, and for computer-assisted pronunciation training for Korean learners of English.
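
    The phone error rate mentioned in the abstract above is conventionally computed as the edit distance between a reference and a hypothesized phone sequence, divided by the reference length. The following is a minimal sketch of that conventional definition, not the authors' evaluation code; the function names `edit_distance` and `phone_error_rate` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone labels."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j symbols of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

    For example, `phone_error_rate(["s", "a", "l", "a", "d"], ["s", "a", "l", "d"])` counts one deletion against five reference phones and returns 0.2. WER follows the same pattern with word tokens in place of phone labels.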

N-gram Based Robust Spoken Document Retrievals for Phoneme Recognition Errors (음소인식 오류에 강인한 N-gram 기반 음성 문서 검색)

  • Lee, Su-Jang;Park, Kyung-Mi;Oh, Yung-Hwan
    • MALSORI / no.67 / pp.149-166 / 2008
  • In spoken document retrieval (SDR), subword indexing terms (typically phonemes) are used to avoid the out-of-vocabulary (OOV) problem. This makes the indexing and retrieval process independent of any vocabulary, and only a small corpus is required to train the acoustic model. However, the subword indexing approach has a major drawback: it shows higher word error rates than a large vocabulary continuous speech recognition (LVCSR) system. In this paper, we propose a probabilistic slot detection and n-gram based string matching method for phone-based spoken document retrieval to overcome the high error rates of the phone recognizer. Experimental results show a 9.25% relative improvement in mean average precision (mAP) with a 1.7-fold speedup in comparison with the baseline system.
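
    To illustrate why n-gram matching tolerates phone recognition errors, the sketch below scores a query against a document by the fraction of query phone n-grams that also occur in the document. It is a generic overlap measure written for illustration only; it does not reproduce the paper's probabilistic slot detection or its exact matching method, and the names `phone_ngrams` and `ngram_match_score` are hypothetical.

```python
from collections import Counter


def phone_ngrams(phones, n=3):
    """Count all contiguous phone n-grams in a sequence."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))


def ngram_match_score(query, document, n=3):
    """Fraction of query n-grams (with multiplicity) found in the document."""
    q = phone_ngrams(query, n)
    d = phone_ngrams(document, n)
    total = sum(q.values())
    if total == 0:
        # Query is shorter than n, so there is nothing to match.
        return 0.0
    matched = sum(min(count, d[gram]) for gram, count in q.items())
    return matched / total
```

    A single misrecognized phone corrupts at most n of the query's n-grams, so the remaining n-grams can still match and the score degrades gradually instead of dropping to zero as an exact string match would.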

Consonant Inventories of the Better Cochlear Implant Children in Korea (말지각 능력이 우수한 인공와우 착용 아동들의 조음 능력;음소의 정밀 전사)

  • Chang, Son-A;Kim, Su-Jin;Sin, Ji-Yeong
    • Proceedings of the KSPS conference / 2007.05a / pp.274-277 / 2007
  • The purpose of this study is 1) to describe the phoneme inventories of children with cochlear implants (CI) and 2) to describe their utterances using narrow phonetic transcription. All subjects had more than 2 years of experience with a CI and showed open-set sentence perception scores above 87%. Average consonant accuracy was 81.36%, rising to 87.41% when distortion errors were not counted. They showed error patterns different from those of hearing aid users. The most prominent error pattern was weakening of consonants.

Alveolar Fricative Sound Errors by the Type of Morpheme in the Spontaneous Speech of 3- and 4-Year-Old Children (자발화에 나타난 형태소 유형에 따른 3-4세 아동의 치경마찰음 오류)

  • Kim, Soo-Jin;Kim, Jung-Mee;Yoon, Mi-Sun;Chang, Moon-Soo;Cha, Jae-Eun
    • Phonetics and Speech Sciences / v.4 no.3 / pp.129-136 / 2012
  • Korean alveolar fricatives are late-developing speech sounds. Most previous research on phonemes used individual words or pseudo words to elicit sounds, but word-level phonological analysis does not always reflect a child's practical articulation ability. In addition, there has been limited research on articulation development that looks at speech production by grammatical morpheme, despite its importance in the Korean language. Therefore, this research examines the articulation development and phonological patterns of the /s/ phoneme in terms of the morphological types produced in children's spontaneous conversational speech. The subjects were twenty-two typically developing 3- and 4-year-old Korean children. All children showed normal levels in three screening tests: hearing, vocabulary, and articulation. Spontaneous conversational samples were recorded at the children's homes. The results are as follows. Error rates decreased with increasing age in all morphological contexts. Also, error percentages within an age group were significantly lower in lexical morphemes than in grammatical morphemes. Stopping of fricative sounds was the main error pattern in all morphological contexts and decreased as age increased. This research shows that articulation performance can differ significantly by morphological context. The present study provides data that can be used to identify difficult contexts for articulatory evaluation and therapy of alveolar fricative sounds.

A Study on the Characteristics of Errors Type for Wellness of Alzheimer's Dementia Patients in the Naming Task (알츠하이머성 치매환자의 웰니스를 위한 명명하기 과제에서의 오류유형 특성 연구)

  • Kang, Min-Gu
    • Journal of Korea Entertainment Industry Association / v.14 no.8 / pp.213-219 / 2020
  • The purpose of this study was to investigate the characteristics of error types in a naming task for 8 questionable dementia groups, 9 definite dementia groups, and 10 normal groups. The items of naming error analysis were classified into visual perception errors, semantic association errors, semantic non-correlation errors, phoneme errors, Don't Know, and No Response. For the analysis, descriptive statistics, analysis of variance, and multivariate analysis of variance were conducted using SPSS 21.0. As a result, there was a significant difference in the error rate between groups according to the error type. The errors that showed significant differences between the normal group and the other two groups were visual perception errors and semantic non-correlation errors. No Response errors differed significantly between the normal group and the definite dementia group, but not between the normal group and the questionable dementia group. These results showed that Alzheimer's patients had a deficit in confrontation naming ability. It was also found that it is appropriate to provide other clues when the deficits caused by the degeneration of a specific step in information processing become severe.

Analysis of Error Characteristics and Usabilities for Korean Consonant Perception Test (한국자음지각검사의 오류특성 및 유용성 분석)

  • Kim, Dong Chang;Kim, Jin Sook;Lee, Kyoung Won
    • 재활복지 / v.18 no.4 / pp.295-314 / 2014
  • The purpose of this study was to provide baseline data for auditory rehabilitation in the field by examining the error types and error rates of phonemes that hearing-impaired listeners find difficult to discriminate. Thirty participants with sensorineural hearing loss listened to the Korean Consonant Perception Test (KCPT) lists recorded by male and female talkers, yielding data on error types and KCPT scores according to the talker's gender. In the initial consonant test list, /ㄷ/, /ㅂ/, /ㅃ/, /ㅉ/, and /ㅌ/ showed error rates above 30%, while /ㄱ/ and /ㄷ/ did so in the final consonant test list. The most common error type was initial consonant substitution in the initial consonant test list and final consonant substitution in the final consonant test list. The effect of talker's gender was not significant: scores obtained with the male and female voices showed no statistical difference. This means that the KCPT can be used in clinics regardless of the talker's gender.

Perceptual Characteristics of Korean Consonants Distorted by the Frequency Band Limitation (주파수 대역 제한에 의한 한국어 자음의 지각 특성 분석)

  • Kim, YeonWhoa;Choi, DaeLim;Lee, Sook-Hyang;Lee, YongJu
    • Phonetics and Speech Sciences / v.6 no.1 / pp.95-101 / 2014
  • This paper investigated the effects of frequency band limitation on the perceptual characteristics of Korean consonants. Monosyllabic speech (144 syllables of CV type, 56 syllables of VC type, 8 syllables of V type) produced by two announcers was low- and high-pass filtered with cutoff frequencies ranging from 300 to 5000 Hz. Six listeners with normal hearing performed a perception test for each filter type and cutoff frequency. We report phoneme recognition rates and types of perception error for band-limited Korean consonants to examine how frequency distortion in the process of speech transmission affects listeners' perception. The results showed that recognition rates varied with the following factors: position in the syllable, manner of articulation, place of articulation, and phonation type. Consonants in the final position were more robust to frequency band limitation than those in the initial position. Fricatives and affricates were more robust than stops. Fortis consonants were less robust than their lenis or aspirated counterparts. Types of perception error also varied with factors such as the consonant's place of articulation: bilabial stops were perceived as alveolar stops, while alveolar and velar stops changed in phonation type without any change in place of articulation.

Effective Syllable Modeling for Korean Speech Recognition Using Continuous HMM (연속 은닉 마코프 모델을 이용한 한국어 음성 인식을 위한 효율적 음절 모델링)

  • 김봉완;이용주
    • The Journal of the Acoustical Society of Korea / v.22 no.1 / pp.23-27 / 2003
  • Recently, attempts to use the syllable as the recognition unit to enhance performance in continuous speech recognition have been reported. However, syllables are less trainable than phones, and they have the disadvantage that context-dependent modeling across the syllable boundary is difficult, since the number of models is much larger for syllables than for phones. In this paper, we propose a method to enhance the trainability of syllables in Korean, together with phoneme-context-dependent syllable modeling across the syllable boundary. An experiment in which the proposed method is applied to word recognition shows an average 46.23% error reduction in comparison with common syllable modeling. The right-phone-dependent syllable model showed a 16.7% error reduction compared with a triphone model.

Context-adaptive Phoneme Segmentation for a TTS Database (문자-음성 합성기의 데이터 베이스를 위한 문맥 적응 음소 분할)

  • 이기승;김정수
    • The Journal of the Acoustical Society of Korea / v.22 no.2 / pp.135-144 / 2003
  • A method for the automatic segmentation of speech signals is described. The method is dedicated to the construction of a large database for a Text-To-Speech (TTS) synthesis system. The main issue of the work is the refinement of an initial estimate of phone boundaries provided by an alignment based on a Hidden Markov Model (HMM). A multi-layer perceptron (MLP) was used as a phone boundary detector. To increase segmentation performance, a technique that trains a separate MLP for each class of phonetic transition is proposed. The optimum partitioning of the entire phonetic transition space is constructed from the standpoint of minimizing the overall deviation from hand-labelled positions. With single-speaker stimuli, the experimental results showed that more than 95% of all phone boundaries deviate from the reference position by less than 20 ms, and that the refinement of the boundaries reduces the root mean square error by about 25%.
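
    The two segmentation figures reported above (share of boundaries within 20 ms of the reference, and root mean square error) can be computed from paired boundary times as in the sketch below. It assumes the automatic and hand-labelled boundaries are already matched one-to-one and given in milliseconds; the function name `boundary_stats` and its parameters are hypothetical and not taken from the paper.

```python
import math


def boundary_stats(reference_ms, predicted_ms, tolerance_ms=20.0):
    """Return (fraction of boundaries within tolerance, RMSE in ms)."""
    if len(reference_ms) != len(predicted_ms) or not reference_ms:
        raise ValueError("need two non-empty boundary lists of equal length")
    # Signed deviation of each automatic boundary from its hand label.
    deviations = [p - r for r, p in zip(reference_ms, predicted_ms)]
    within = sum(abs(d) < tolerance_ms for d in deviations) / len(deviations)
    rmse = math.sqrt(sum(d * d for d in deviations) / len(deviations))
    return within, rmse
```

    Running it once on the initial HMM alignment and once on the refined boundaries gives the two RMSE values whose ratio corresponds to the kind of relative reduction the abstract reports.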