• Title/Summary/Keyword: Speech speed


End-to-end non-autoregressive fast text-to-speech (End-to-end 비자기회귀식 가속 음성합성기)

  • Kim, Wiback; Nam, Hosung
    • Phonetics and Speech Sciences / v.13 no.4 / pp.47-53 / 2021
  • Autoregressive text-to-speech (TTS) models suffer from inference instability and slow inference speed. Inference instability occurs when a poorly predicted sample at time step t affects all subsequent predictions. Slow inference speed arises from a model structure in which the samples predicted at time steps 1 to t-1 are required to predict the sample at time step t. In this study, an end-to-end non-autoregressive fast text-to-speech model is proposed as a solution to these problems. The results show that the model's Mean Opinion Score (MOS) is close to that of Tacotron 2 - WaveNet, while its inference speed and stability are higher. Further, this study aims to offer insight into the improvement of non-autoregressive models.
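The autoregressive bottleneck described in this abstract can be illustrated with a minimal sketch. The toy `predict_next`/`predict_all` callables below are stand-ins for a neural network, not the paper's model; they only show why sequential decoding is slow and error-propagating while parallel decoding is not:

```python
# Illustrative sketch (not the paper's model): why non-autoregressive (NAR)
# decoding is faster and more stable than autoregressive (AR) decoding.

def ar_decode(predict_next, first, length):
    """AR: each frame depends on all previously *predicted* frames,
    so decoding is sequential and one bad prediction propagates."""
    frames = [first]
    for _ in range(length - 1):
        frames.append(predict_next(frames))  # step t needs steps 1..t-1
    return frames

def nar_decode(predict_all, length):
    """NAR: all frames are predicted in one parallel pass,
    so an error at one step cannot corrupt the others."""
    return predict_all(length)

# Toy "models" standing in for neural networks:
ar = ar_decode(lambda hist: hist[-1] + 1, first=0, length=5)
nar = nar_decode(lambda n: list(range(n)), length=5)
print(ar)   # [0, 1, 2, 3, 4]
print(nar)  # [0, 1, 2, 3, 4]
```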

A Study of Korean TTS Listening Speed for the Blind Using a Screen Reader (스크린리더를 사용하는 시각장애인의 한국어 합성음 청취속도 연구)

  • Lee, Heeyeon; Hong, Ki-Hyung
    • Phonetics and Speech Sciences / v.5 no.3 / pp.63-69 / 2013
  • The purpose of this study was to evaluate the maximum and optimal listening speeds of Korean TTS for the blind. Five blind participants took part. The instruments were 17 sentence sets (2 for an exercise, 10 for a repeated test, and 5 for a random test) of short meaningful sentences (the same sentences for the repeated test, different sentences for the random test) at 15 differentiated speeds (range = 0.8-3.6, SD = 0.2). Each participant's maximum and optimal listening speeds were calculated from objective recall accuracy (the number of correctly recalled syllables / the total number of syllables in the sentence × 100) and subjective recall accuracy (recall accuracy judged by each participant's own evaluation). The results showed that recall accuracy tended to increase as TTS speed decreased. Subjective recall accuracy was higher than objective recall accuracy in the repeated tests, and vice versa in the random tests. The results also revealed that sentence familiarity influenced the participants' Korean TTS listening speed.
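The study's objective recall-accuracy measure is a simple percentage; a direct sketch of the formula as stated in the abstract (the example counts are made up):

```python
def recall_accuracy(recalled_syllables: int, total_syllables: int) -> float:
    """Objective recall accuracy as defined in the study:
    correctly recalled syllables / total syllables in the sentence x 100."""
    if total_syllables <= 0:
        raise ValueError("sentence must contain at least one syllable")
    return recalled_syllables / total_syllables * 100

# e.g. 18 of 20 syllables recalled correctly:
print(recall_accuracy(18, 20))  # 90.0
```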

Comparison of Speech Intelligibility & Performance of Speech Recognition in Real Driving Environments (자동차 주행 환경에서의 음성 전달 명료도와 음성 인식 성능 비교)

  • Lee Kwang-Hyun; Choi Dae-Lim; Kim Young-Il; Kim Bong-Wan; Lee Yong-Ju
    • MALSORI / no.50 / pp.99-110 / 2004
  • In a moving car, normal sound transmission characteristics are rarely obtained because of various noises and structural factors. The resulting channel distortion of the source sound recorded by the microphones seriously degrades speech recognition performance in real driving environments. In this paper, we analyze the degree of intelligibility under various channel-induced sound distortions at different driving speeds in terms of the speech transmission index (STI), and compare the STI with speech recognition rates. We examine the correlation between intelligibility measures for different sound pick-up patterns and speech recognition performance, and thereby consider the optimal microphone location in a single-channel environment. Our experiments show a high correlation between STI and speech recognition rates.
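The STI-versus-recognition-rate comparison reported here comes down to a correlation between two measurement series. A minimal Pearson-correlation sketch; the STI values and recognition rates below are invented for illustration only, not the paper's data:

```python
# Hedged sketch: the paper reports a high correlation between STI and
# recognition rate; the numbers below are hypothetical illustrations.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical STI values and recognition rates at increasing driving speeds:
sti  = [0.75, 0.68, 0.60, 0.52, 0.45]
rate = [95.0, 91.0, 84.0, 77.0, 70.0]
print(pearson(sti, rate))  # close to 1.0: intelligibility tracks recognition
```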


Effects of Speech Rate on the Sentence Perception of Adults with Cochlear Implantation (말속도가 인공와우 청각장애인의 문장지각에 미치는 영향)

  • Shin, Su-Jin; Shin, Ji-Cheol; Yoon, Mi-Sun; Kim, Duk-Young
    • Speech Sciences / v.13 no.2 / pp.47-58 / 2006
  • People tend to adjust their speech rate to help listeners with hearing problems, such as hearing-impaired people. The aim of this study was to investigate the effects of speech rate on sentence perception by 10 adults with cochlear implants. The stimuli comprised 42 sentences at normal, slow, and very slow speeds, varying overall duration and vowel or pause duration. The subjects listened to the speech and wrote down what they heard. Each correctly perceived syllable of the content words in a sentence was counted toward the score, with partial points given for incomplete syllables. The results were as follows: 1. Changes in speech rate had some influence on the sentence perception scores of the cochlear-implanted listeners. 2. In the slow-pause condition, the controlled speech rate had a positive effect on the perception score.


PASS: A Parallel Speech Understanding System

  • Chung, Sang-Hwa
    • Journal of Electrical Engineering and Information Science / v.1 no.1 / pp.1-9 / 1996
  • A key issue in spoken language processing has become the integration of speech understanding and natural language processing (NLP). This paper presents a parallel computational model for the integration of speech and NLP. The model adopts a hierarchically structured knowledge base and memory-based parsing techniques. Processing is carried out by passing multiple markers in parallel through the knowledge base. Speech-specific problems such as insertion, deletion, and substitution have been analyzed and their parallel solutions are provided. The complete system has been implemented on the Semantic Network Array Processor (SNAP) and is operational. Results show an 80% sentence recognition rate for the Air Traffic Control domain. Moreover, a 15-fold speed-up is obtained over an identical sequential implementation, with the speed advantage increasing as the knowledge base grows.


Automatic Detection of Intonational and Accentual Phrases in Korean Standard Continuous Speech (한국 표준어 연속음성에서의 억양구와 강세구 자동 검출)

  • Lee, Ki-Young; Song, Min-Suck
    • Speech Sciences / v.7 no.2 / pp.209-224 / 2000
  • This paper proposes an automatic detection method for intonational and accentual phrases in standard Korean continuous speech. We use pauses longer than 150 msec to detect intonational phrases, and extract accentual phrases from the intonational phrases by analyzing syllables and pitch contours. The speech data for the experiment comprise seven male voices and two female voices reading the fable 'the ant and the grasshopper' and a newspaper article 'manmulsang' at normal speed in the standard Korean variety. The results show that the detection rate is 95% on average for intonational phrases and 73% for accentual phrases. This implies that prosodic information can be used to segment continuous speech into smaller units (i.e., prosodic phrases), so that the targets of speech recognition can be narrowed down to words or phrases within continuous speech.
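The pause-based segmentation idea can be sketched in a few lines. The 150 ms pause threshold comes from the abstract; the frame length and silence-energy floor below are illustrative assumptions, not the paper's settings:

```python
# Minimal sketch of pause-based intonational-phrase detection: frames whose
# energy falls below a floor are treated as silence, and any silent run
# longer than 150 ms ends an intonational phrase.

FRAME_MS = 10          # assumed analysis frame length
PAUSE_MS = 150         # pause threshold stated in the abstract
SILENCE_ENERGY = 0.01  # assumed energy floor for "silence"

def phrase_boundaries(frame_energies):
    """Return frame indices where a pause longer than 150 ms ends."""
    min_frames = PAUSE_MS // FRAME_MS
    boundaries, run = [], 0
    for i, e in enumerate(frame_energies):
        if e < SILENCE_ENERGY:
            run += 1
        else:
            if run > min_frames:
                boundaries.append(i)   # phrase boundary at end of the pause
            run = 0
    return boundaries

# 200 ms of speech, a 200 ms pause, then 100 ms more speech:
energies = [0.5] * 20 + [0.0] * 20 + [0.5] * 10
print(phrase_boundaries(energies))  # [40]
```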


The Pitch Detection Using Variable LPF (Variable LPF에 의한 피치검출)

  • 백금란
    • Proceedings of the Acoustical Society of Korea Conference / 1993.06a / pp.88-92 / 1993
  • In speech signal processing, it is necessary to detect the pitch accurately. Previously proposed pitch-extraction algorithms have difficulty detecting pitch over a wide range of speech signals. We therefore propose a new algorithm based on G-peak extraction: it finds the maximum zero-crossing interval (MZI) in each frame and convolves it with the speech signal, which is equivalent to passing the speech signal through a variable LPF. The method improves the accuracy of pitch detection and extracts the pitch at high speed.
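A hedged sketch of the MZI idea as described in the abstract: the longest interval between zero crossings in a frame sets the length of a rectangular window, and convolving the frame with that window acts as a moving-average low-pass filter whose cutoff adapts to the frame (a "variable LPF"). The window shape and frame values are assumptions for illustration:

```python
# Sketch of the MZI / variable-LPF idea, not the paper's implementation.

def max_zero_crossing_interval(frame):
    """Length (in samples) of the longest run between sign changes."""
    longest, last = 1, 0
    for i in range(1, len(frame)):
        if frame[i - 1] * frame[i] < 0:     # sign change = zero crossing
            longest = max(longest, i - last)
            last = i
    return max(longest, len(frame) - last)

def variable_lpf(frame):
    """Convolve the frame with a rectangular window of MZI length,
    i.e. a moving average whose span adapts to the frame."""
    n = max_zero_crossing_interval(frame)
    window = [1.0 / n] * n                  # moving-average kernel
    return [sum(frame[i + j] * window[j] for j in range(n))
            for i in range(len(frame) - n + 1)]

frame = [0.2, 0.6, 0.3, -0.4, -0.7, -0.2, 0.5, 0.8, 0.1, -0.3]
smooth = variable_lpf(frame)
print(max_zero_crossing_interval(frame), len(smooth))  # 3 8
```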


Preliminary Study on Synthesis of Emotional Speech (정서음성 합성을 위한 예비연구)

  • Han Youngho; Yi Sopae; Lee Jung-Chul; Kim Hyung Soon
    • Proceedings of the KSPS conference / 2003.10a / pp.181-184 / 2003
  • This paper explores the perceptual relevance of acoustic correlates of emotional speech using a formant synthesizer. The focus is on the roles of mean pitch, pitch range, speech rate, and phonation type in synthesizing emotional speech. The results back up traditional impressionistic observations, but suggest that some phonation types need further refinement in synthesis.


Text-to-speech with linear spectrogram prediction for quality and speed improvement (음질 및 속도 향상을 위한 선형 스펙트로그램 활용 Text-to-speech)

  • Yoon, Hyebin
    • Phonetics and Speech Sciences / v.13 no.3 / pp.71-78 / 2021
  • Most neural-network-based speech synthesis models use neural vocoders to convert mel-scaled spectrograms into high-quality, human-like voices. However, neural vocoders combined with mel-scaled spectrogram prediction models demand considerable computer memory and time during training and suffer from slow inference in environments where a GPU is not used. This problem does not arise in linear spectrogram prediction models, which do not use neural vocoders, but those models suffer from low voice quality. As a solution, this paper proposes a Tacotron 2 and Transformer-based linear spectrogram prediction model that produces high-quality speech without a neural vocoder. Experiments suggest that this model can serve as the foundation of a high-quality text-to-speech model with fast inference speed.

Implementation of a Single-chip Speech Recognizer Using the TMS320C2000 DSPs (TMS320C2000계열 DSP를 이용한 단일칩 음성인식기 구현)

  • Chung, Ik-Joo
    • Speech Sciences / v.14 no.4 / pp.157-167 / 2007
  • In this paper, we implemented a single-chip speech recognizer using the TMS320C2000 DSPs. For this implementation, we developed a very small speaker-dependent recognition engine based on dynamic time warping, especially suited for embedded systems where system resources are severely limited. We carried out several optimizations, including speed optimization by writing time-critical functions in assembly language, code-size optimization, and efficient memory allocation. On the TMS320F2801 DSP, which has 12 Kbyte SRAM and 32 Kbyte flash ROM, the recognizer can recognize 10 commands. On the TMS320F2808 DSP, which has 36 Kbyte SRAM and 128 Kbyte flash ROM, it can additionally output the speech sound corresponding to the recognition result. The response sounds, captured when the user trains the commands, are encoded with ADPCM and stored in flash ROM. The single-chip recognizer requires few parts other than the DSP itself and an op amp for amplifying the microphone output and anti-aliasing, so it can play a role similar to that of dedicated speech recognition chips.
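The dynamic time warping (DTW) matching at the core of such a recognizer can be sketched briefly. This is the classic textbook algorithm, not the paper's code; the 1-D "feature" sequences and command names are hypothetical (a real recognizer would compare e.g. MFCC frames in fixed point on the DSP):

```python
# Minimal DTW template matcher, the matching principle behind
# small speaker-dependent recognizers.

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW over 1-D feature sequences."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: insertion, deletion, match
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

# Recognize by nearest stored template (templates are hypothetical):
templates = {"on": [1, 3, 4, 3], "off": [2, 2, 1, 0]}
utterance = [1, 3, 3, 4, 3]          # same shape as "on", slightly stretched
best = min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
print(best)  # on
```

DTW tolerates the timing differences between a training utterance and a test utterance, which is why it works well for small command vocabularies on resource-limited hardware.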
