• Title/Summary/Keyword: Korean speech

Search Results: 5,325

One-shot multi-speaker text-to-speech using RawNet3 speaker representation (RawNet3를 통해 추출한 화자 특성 기반 원샷 다화자 음성합성 시스템)

  • Sohee Han;Jisub Um;Hoirin Kim
    • Phonetics and Speech Sciences / v.16 no.1 / pp.67-76 / 2024
  • Recent advances in text-to-speech (TTS) technology have significantly improved the quality of synthesized speech, reaching a level where it can closely imitate natural human speech. In particular, TTS models offering various voice characteristics and personalized speech are widely used in fields such as artificial intelligence (AI) tutors, advertising, and video dubbing. Accordingly, in this paper, we propose a one-shot multi-speaker TTS system that can ensure acoustic diversity and synthesize personalized voices by generating speech from unseen target speakers' utterances. The proposed model integrates a speaker encoder into a TTS model consisting of the FastSpeech2 acoustic model and the HiFi-GAN vocoder. The speaker encoder, based on the pre-trained RawNet3, extracts speaker-specific voice features. Furthermore, the proposed approach includes not only an English one-shot multi-speaker TTS but also a Korean one. We evaluate the naturalness and speaker similarity of the generated speech using objective and subjective metrics. In the subjective evaluation, the proposed Korean one-shot multi-speaker TTS obtained a naturalness mean opinion score (NMOS) of 3.36 and a similarity MOS (SMOS) of 3.16. The objective evaluation of the proposed English and Korean one-shot multi-speaker TTS showed a predicted MOS (P-MOS) of 2.54 and 3.74, respectively. These results indicate that our proposed model improves over the baseline models in terms of both naturalness and speaker similarity.
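The conditioning scheme the abstract describes, an utterance-level speaker vector injected into a FastSpeech2-style encoder, can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the projection matrix, the dimensions, and the simple broadcast-add are assumptions (RawNet3 speaker embeddings are commonly 256-dimensional).

```python
import numpy as np

def condition_on_speaker(encoder_out: np.ndarray, spk_emb: np.ndarray,
                         proj: np.ndarray) -> np.ndarray:
    """Broadcast-add a projected speaker embedding to every encoder frame.

    encoder_out: (T, d_model) phoneme encoder outputs
    spk_emb:     (d_spk,) utterance-level speaker vector (e.g. from RawNet3)
    proj:        (d_spk, d_model) learned projection matrix (here: random)
    """
    projected = spk_emb @ proj          # (d_model,)
    return encoder_out + projected      # broadcasts over the T axis

# Toy shapes only; real models learn `proj` jointly with the TTS network.
rng = np.random.default_rng(0)
enc = rng.standard_normal((50, 384))    # 50 phoneme frames
emb = rng.standard_normal(256)          # assumed 256-dim speaker embedding
W = rng.standard_normal((256, 384)) * 0.01
out = condition_on_speaker(enc, emb, W)
print(out.shape)                        # same shape as the encoder output
```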

Eight-step Continuum Treatment for Korean Apraxia of Speech Patient: A Case Study (한국어 구어 실행증 환자에 대한 점진적 8단계 치료 기법의 임상적 효과: 사례연구)

  • Lee, Mu-Kyung;Jeong, Ok-Ran
    • Speech Sciences / v.12 no.4 / pp.247-254 / 2005
  • This study aimed at clarifying the clinical effects of the eight-step continuum treatment in a patient who showed apraxia of speech after a stroke. The clinical efficacy of the eight-step continuum treatment has been proven with American apraxic patients, but it has not yet been clinically shown to be effective in Korean patients with apraxia of speech. Therefore, this study was conducted to provide preliminary clinical evidence of its effectiveness despite the linguistic differences between Korean and English. The therapy took place twice a week for 6 months, a total of 48 sessions. The results showed that the patient's receptive language improved from 83% to 89% in accuracy, and expressive language from 15% to 37%. Spontaneous recovery does not appear to have played a role in his improvement, since the study was conducted 2 years after the stroke. In addition, the improvement in expressive language was much greater (22%) than that in receptive language (6%), which implies that the therapy was effective for apraxia of speech, as apraxia of speech is relatively confined to expressive ability, more specifically motor programming and sequencing.

A Study on Exceptional Pronunciations For Automatic Korean Pronunciation Generator (한국어 자동 발음열 생성 시스템을 위한 예외 발음 연구)

  • Kim Sunhee
    • MALSORI / no.48 / pp.57-67 / 2003
  • This paper presents a systematic description of exceptional pronunciations for automatic Korean pronunciation generation. An automatic pronunciation generator is an essential part of Korean speech recognition and TTS (Text-To-Speech) systems. It is composed of a set of regular rules and an exceptional pronunciation dictionary. The exceptional pronunciation dictionary is created by extracting words with exceptional pronunciations, identified through phonological research and a systematic analysis of the entries of Korean dictionaries. The method thus contributes to improving the performance of the automatic pronunciation generator, as well as that of Korean speech recognition and TTS systems.
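The dictionary-plus-rules architecture described above can be sketched as follows. The romanized entries and the two toy rules are invented for illustration and are not from the paper; the point is only the lookup order, with the exception dictionary taking precedence over the regular rules.

```python
import re

# Hypothetical exception dictionary: word -> pronunciation (romanized
# placeholders, not the paper's actual entries).
EXCEPTIONS = {
    "mas-it-da": "ma-sit-da",
}

# Hypothetical regular rules, applied in order: (pattern, replacement).
RULES = [
    (re.compile(r"t-d"), "t-t"),   # toy obstruent assimilation rule
    (re.compile(r"s-i"), "si-"),   # toy liaison rule
]

def generate_pronunciation(word: str) -> str:
    # 1. The exception dictionary takes precedence over the regular rules.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # 2. Otherwise, fall back to the ordered regular rules.
    out = word
    for pattern, repl in RULES:
        out = pattern.sub(repl, out)
    return out

print(generate_pronunciation("mas-it-da"))   # dictionary hit: "ma-sit-da"
print(generate_pronunciation("kat-da"))      # rule applied: "kat-ta"
```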

Speech Evaluation Variables Related to Speech Intelligibility in Children with Spastic Cerebral Palsy (경직형 뇌성마비아동의 말명료도 및 말명료도와 관련된 말 평가 변인)

  • Park, Ji-Eun;Kim, Hyang-Hee;Shin, Ji-Cheol;Choi, Hong-Shik;Sim, Hyun-Sub;Park, Eun-Sook
    • Phonetics and Speech Sciences / v.2 no.4 / pp.193-212 / 2010
  • The purpose of our study was to provide effective speech evaluation items examining the variables of speech that successfully predict speech intelligibility in children with cerebral palsy (CP). The subjects were 55 children with the spastic type of cerebral palsy. For the speech evaluation, we performed a speech subsystem evaluation and a speech intelligibility test. The results of the study are as follows. First, the evaluation task for the speech subsystems consisted of 48 items divided into an observational evaluation stage and three levels of severity; the levels correlated with gross motor function, fine motor function, and age. Second, the evaluation items for the speech subsystems were rearranged into seven factors. Third, 34 of the 48 items correlated positively with the syllable intelligibility rating: four items in the observational evaluation stage and, among the nonverbal articulatory function evaluation items, 11 items in level one, 12 in level two, and eight in level three. Fourth, 23 of the 48 items correlated with the sentence intelligibility rating: one item in the observational evaluation stage (in the articulatory structure evaluation task), six items in level one, eight in level two, and eight in level three. Fifth, a total of 14 items influenced the syllable intelligibility rating. Sixth, a total of 13 items influenced the sentence intelligibility rating. According to these results, the variables that influenced the speech intelligibility of children with CP among the articulatory function tasks were in the respiratory function, phonatory function, and lip- and chin-related tasks; we found no correlation for tongue function. 
The results of our study can be applied to speech evaluation, setting therapy goals, and evaluating the degree of progression in children with CP. We only studied children with the spastic type of cerebral palsy, and there were fewer children with severe CP than with moderate CP. Therefore, when evaluating children with other degrees of severity, their characteristics may have to be taken more into account. Further study on speech evaluation variables in relation to the severity of speech intelligibility and different types of cerebral palsy may be necessary.
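The item-selection step described above, correlating each evaluation item's scores with an intelligibility rating and keeping the items that correlate, might be sketched like this. The item names, scores, and the 0.5 threshold are invented, and the rank correlation below omits tie correction for brevity.

```python
import numpy as np

def rank(x):
    # Simple ranking (no tie correction) for a Spearman-style correlation.
    order = np.argsort(x)
    r = np.empty_like(order, dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(x, y):
    rx, ry = rank(np.asarray(x)), rank(np.asarray(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Invented data: one intelligibility rating per child, plus per-item scores.
intelligibility = [20, 35, 50, 65, 80, 90]
items = {
    "lip_rounding": [1, 2, 2, 3, 3, 3],
    "jaw_control":  [1, 1, 2, 2, 3, 3],
    "random_item":  [2, 1, 3, 1, 2, 1],
}

# Keep items whose rank correlation with intelligibility exceeds 0.5.
selected = {name: round(spearman(scores, intelligibility), 2)
            for name, scores in items.items()
            if spearman(scores, intelligibility) > 0.5}
print(selected)
```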

Research on Construction of the Korean Speech Corpus in Patient with Velopharyngeal Insufficiency (구개인두부전증 환자의 한국어 음성 코퍼스 구축 방안 연구)

  • Lee, Ji-Eun;Kim, Wook-Eun;Kim, Kwang Hyun;Sung, Myung-Whun;Kwon, Tack-Kyun
    • Korean Journal of Otorhinolaryngology-Head and Neck Surgery / v.55 no.8 / pp.498-507 / 2012
  • Background and Objectives We aimed to develop a Korean version of the velopharyngeal insufficiency (VPI) speech corpus system. Subjects and Method After developing a 3-channel simultaneous recording device capable of recording nasal, oral, and normal compound speech separately, voice data were collected from VPI patients aged over 10 years, with or without a history of operation or prior speech therapy. These were compared to a control group in which VPI was simulated using a French no. 3 Nelaton tube inserted via both nostrils through the nasopharynx and pulling the soft palate anteriorly to varying degrees. Three transcriptors took part: a speech therapist transcribed the voice files into text, a second transcriptor graded speech intelligibility and severity, and a third tagged the types and onset times of misarticulations. The database was composed of three main tables covering (1) the speaker's demographics, (2) the condition of the recording system, and (3) the transcripts. All of these were interfaced with the Praat voice analysis program, which enables the user to extract the exact transcribed phrases for analysis. Results In the simulated VPI group, the higher the severity of VPI, the higher the nasalance score obtained. In addition, we could verify the vocal energy that characterizes hypernasality and compensation in the nasal, oral, and compound sounds spoken by VPI patients, as opposed to that of the normal control group. Conclusion With the Korean version of the VPI speech corpus system, patients' common difficulties and speech tendencies in articulation can be objectively evaluated. By comparing these data with those of normal voices, the mispronunciations and dysarticulations of patients with VPI can be corrected.
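The three-table layout the abstract describes could be sketched with SQLite as below. All column names and sample values are assumptions for illustration, not taken from the actual corpus.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE speaker (
    speaker_id   INTEGER PRIMARY KEY,
    age          INTEGER,
    sex          TEXT,
    vpi_status   TEXT          -- 'patient' or 'simulated control'
);
CREATE TABLE recording (
    recording_id INTEGER PRIMARY KEY,
    speaker_id   INTEGER REFERENCES speaker(speaker_id),
    channel      TEXT,         -- 'nasal', 'oral', or 'compound'
    sample_rate  INTEGER
);
CREATE TABLE transcript (
    recording_id    INTEGER REFERENCES recording(recording_id),
    text            TEXT,
    intelligibility INTEGER,   -- grade from the second transcriptor
    error_tag       TEXT,      -- misarticulation type, third transcriptor
    onset_s         REAL       -- onset time, usable for Praat extraction
);
""")
conn.execute("INSERT INTO speaker VALUES (1, 12, 'F', 'patient')")
conn.execute("INSERT INTO recording VALUES (1, 1, 'nasal', 44100)")
conn.execute(
    "INSERT INTO transcript VALUES (1, 'example phrase', 3, 'hypernasal', 0.42)")

# Join the three tables to pull a transcribed phrase with its context.
row = conn.execute("""
    SELECT s.vpi_status, r.channel, t.text
    FROM transcript t
    JOIN recording r USING (recording_id)
    JOIN speaker s USING (speaker_id)
""").fetchone()
print(row)
```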

A Method of Intonation Modeling for Corpus-Based Korean Speech Synthesizer (코퍼스 기반 한국어 합성기의 억양 구현 방안)

  • Kim, Jin-Young;Park, Sang-Eon;Eom, Ki-Wan;Choi, Seung-Ho
    • Speech Sciences / v.7 no.2 / pp.193-208 / 2000
  • This paper describes a multi-step method of intonation modeling for a corpus-based Korean speech synthesizer. We selected 1,833 sentences considering various syntactic structures and built a corresponding speech corpus uttered by a female announcer. We detected the pitch using laryngograph signals, manually marked the prosodic boundaries on the recorded speech, and carried out part-of-speech tagging and syntactic analysis on the text. The detected pitch was separated into three frequency bands of low-, mid-, and high-frequency components, which correspond to the baseline, the word tone, and the syllable tone. We predicted them using the CART method and the Viterbi search algorithm with a word-tone dictionary. Of the collected sentences, 1,500 were used for training and 333 for testing. In the word tone modeling layer, we compared two methods: one predicts the word tone corresponding to the mid-frequency components directly, and the other predicts it by multiplying the baseline by the ratio of the word tone to the baseline. The former method resulted in a mean error of 12.37 Hz and the latter 12.41 Hz, similar to each other. The syllable tone modeling layer resulted in a mean error rate of less than 8.3% compared with the announcer's mean pitch of 193.56 Hz, so its performance was relatively good.
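The three-band decomposition described above can be illustrated by smoothing an F0 contour at two scales and taking residuals: heavy smoothing yields the baseline, smoothing the residual yields the word-tone layer, and what remains is the syllable-tone layer. The synthetic contour and window lengths below are invented, not the paper's actual filter design.

```python
import numpy as np

def moving_average(x, win):
    kernel = np.ones(win) / win
    return np.convolve(x, kernel, mode="same")

# Synthetic F0 contour: slow declination plus two oscillation rates.
t = np.linspace(0, 2, 200)
f0 = (190 - 10 * t
      + 8 * np.sin(2 * np.pi * 1.5 * t)    # slower, word-scale movement
      + 4 * np.sin(2 * np.pi * 6 * t))     # faster, syllable-scale movement

baseline = moving_average(f0, 101)             # slow declination component
word_tone = moving_average(f0 - baseline, 31)  # mid-frequency layer
syllable_tone = f0 - baseline - word_tone      # fast residual layer

# By construction, the three layers sum back to the original contour.
recon = baseline + word_tone + syllable_tone
print(float(np.max(np.abs(recon - f0))))
```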

'Hanmal' Korean Language Diphone Database for Speech Synthesis

  • Chung, Hyun-Song
    • Speech Sciences / v.12 no.1 / pp.55-63 / 2005
  • This paper introduces the 'Hanmal' Korean-language diphone database for speech synthesis, which has been publicly available on the MBROLA web site since 1999 but has never been properly published in a journal. The diphone database is compatible with the MBROLA programme of high-quality multilingual speech synthesis systems, and its usefulness is described in the paper. The paper also describes the phonetic and phonological structure of the database, showing the process of creating a text corpus. A machine-readable Korean SAMPA convention for the control data input to the MBROLA application is also suggested. Diphone concatenation and prosody manipulation are performed using the MBR-PSOLA algorithm, and a set of segment duration models can be applied to the diphone synthesis of Korean.
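MBROLA is driven by a plain-text .pho file with one segment per line: a phoneme label, a duration in milliseconds, and optional (position %, pitch Hz) targets. A minimal generator for that format might look like this; the segment labels and numbers are illustrative placeholders, not the paper's actual Korean SAMPA convention.

```python
def to_pho(segments):
    """Render (phone, duration_ms, [(pos_pct, pitch_hz), ...]) tuples
    as MBROLA .pho lines: "phoneme duration [pos pitch pos pitch ...]"."""
    lines = []
    for phone, dur_ms, pitch_points in segments:
        targets = " ".join(f"{pos} {hz}" for pos, hz in pitch_points)
        lines.append(f"{phone} {dur_ms} {targets}".rstrip())
    return "\n".join(lines)

# "hanmal" as a toy segment sequence with a falling pitch contour.
utterance = [
    ("h", 60, []),
    ("a", 120, [(0, 220), (100, 200)]),
    ("n", 70, []),
    ("m", 65, []),
    ("a", 130, [(50, 180)]),
    ("l", 80, [(100, 150)]),
]
print(to_pho(utterance))
```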

Analysis of the Timing of Spoken Korean Using a Classification and Regression Tree (CART) Model

  • Chung, Hyun-Song;Huckvale, Mark
    • Speech Sciences / v.8 no.1 / pp.77-91 / 2001
  • This paper investigates the timing of Korean spoken in a news-reading style in order to improve the naturalness of durations used in Korean speech synthesis. Each segment in a corpus of 671 read sentences was annotated with 69 segmental and prosodic features so that the measured duration could be correlated with the context in which it occurred. A CART model based on these features showed a correlation coefficient of 0.79, with an RMSE (root mean squared prediction error) of 23 ms between actual and predicted durations on held-out test data. These results are comparable with recently published results for Korean and similar to results found for other languages. An analysis of the classification tree shows that phrasal structure has the greatest effect on segment duration, followed by syllable structure and the manner features of surrounding segments; the place features of surrounding segments have only small effects. The model has applications in Korean speech synthesis systems.
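The core CART step, choosing the feature split that minimizes the squared error of segment durations, can be shown with a one-split regression stump. The feature and duration values are toy numbers, invented to echo the finding above that phrasal structure (here, phrase-final lengthening) dominates duration.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing total SSE."""
    best = (None, None, None, float("inf"))
    for thr in np.unique(x)[:-1]:
        left, right = y[x <= thr], y[x > thr]
        sse = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if sse < best[3]:
            best = (thr, left.mean(), right.mean(), sse)
    return best[:3]

# Toy data: a phrase-final flag strongly lengthens segments.
phrase_final = np.array([0, 0, 0, 0, 1, 1, 1, 1])
duration_ms  = np.array([62, 70, 65, 68, 110, 118, 122, 105])

thr, mean_left, mean_right = best_split(phrase_final, duration_ms)
pred = np.where(phrase_final <= thr, mean_left, mean_right)
rmse = float(np.sqrt(((pred - duration_ms) ** 2).mean()))
print(thr, round(mean_left, 1), round(mean_right, 1), round(rmse, 2))
```

A full CART model applies this split search recursively over all 69 features; the stump shows only the criterion.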

Some effects of audio-visual speech in perceiving Korean

  • Kim, Jee-Sun;Davis, Chris
    • Annual Conference on Human and Language Technology / 1999.10e / pp.335-342 / 1999
  • The experiments reported here investigated whether seeing a speaker's face (visible speech) affects the perception and memory of Korean speech sounds. To exclude the possibility of top-down, knowledge-based influences on perception and memory, the experiments tested people with no knowledge of Korean. The first experiment examined whether visible speech (auditory-visual, AV) assists native English speakers with no knowledge of Korean in detecting a syllable within a Korean phrase. A syllable was more likely to be detected within a phrase when the participants could see the speaker's face. The second experiment investigated whether native English speakers' judgments about the duration of a Korean phrase would be affected by visible speech. In the AV condition, participants' estimates of phrase duration were highly correlated with the actual durations, whereas those in the auditory-only (AO) condition were not. The results are discussed with respect to the benefits of communication with multimodal information and future applications.

The Interlanguage Speech Intelligibility Benefit (ISIB) of English Prosody: The Case of Focal Prominence for Korean Learners of English and Natives

  • Lee, Joo-Kyeong;Han, Jeong-Im;Choi, Tae-Hwan;Lim, Injae
    • Phonetics and Speech Sciences / v.4 no.4 / pp.53-68 / 2012
  • This study investigated the speech intelligibility of Korean-accented and native English focus speech for Korean and native English listeners. Three types of focus in English, broad, narrow, and contrastive, were naturally induced in semantically optimal dialogues. Seven high-proficiency and seven low-proficiency Korean speakers and seven native speakers recorded the stimuli with another native speaker. Fifteen listeners from each of the Korean high-proficiency, Korean low-proficiency, and native groups judged audio recordings of the focus sentences. Results showed that Korean listeners were more accurate at identifying focal prominence in Korean speakers' narrow focus speech than in that of native speakers, suggesting that the interlanguage speech intelligibility benefit for talkers (ISIB-T) held for narrow focus regardless of the Korean speakers' and listeners' proficiency. However, Korean listeners did not outperform native listeners for Korean speakers' production of narrow focus, which does not support the ISIB for listeners (ISIB-L). Broad and contrastive focus speech provided evidence for neither the ISIB-T nor the ISIB-L. These findings are explained by the interlanguage shared by Korean speakers and listeners, in which more L1-like common phonetic features and phonological representations have been established. Once narrow focus was semantically and syntactically interpreted at a higher level of processing, it was phonetically realized in a way more intelligible to Korean listeners because of this interlanguage, which may elicit the ISIB. However, Korean speakers did not appear to achieve complete semantic/syntactic access to either broad or contrastive focus, which might have had detrimental effects on lower-level phonetic outputs in top-down processing. This would explain why Korean listeners had no advantage over native listeners for Korean talkers, and vice versa.