• Title/Summary/Keyword: Speech speed

238 results

ETRI small-sized dialog style TTS system (ETRI 소용량 대화체 음성합성시스템)

  • Kim, Jong-Jin;Kim, Jeong-Se;Kim, Sang-Hun;Park, Jun;Lee, Yun-Keun;Hahn, Min-Soo
    • Proceedings of the KSPS conference
    • /
    • 2007.05a
    • /
    • pp.217-220
    • /
    • 2007
  • This study outlines a small-sized dialog-style ETRI Korean TTS system that applies HMM-based speech synthesis techniques. To build the VoiceFont, 500 dialog-style sentences were used to train the HMMs, and context information about phonemes, syllables, words, phrases, and sentences was extracted fully automatically to build context-dependent HMMs. In training the acoustic model, acoustic features such as mel-cepstra, log F0, and their delta and delta-delta coefficients were used. The size of the VoiceFont built through training is 0.93 MB. The developed HMM-based TTS system was installed on an ARM720T processor operating at 60 MHz. To reduce computation time, the MLSA inverse filtering module was implemented in assembly language. The fully implemented system runs 1.73 times faster than real time.

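The delta and delta-delta features mentioned in the abstract above can be illustrated with a minimal sketch. The simple first-difference definition below is an assumption for illustration; real HMM systems typically compute deltas over a regression window.

```python
# Minimal sketch of delta / delta-delta feature computation for an
# HMM acoustic model, assuming a simple first-difference definition
# (real systems usually use a multi-frame regression window).

def delta(seq):
    """First-order difference; the first frame's delta is set to 0."""
    return [0.0] + [b - a for a, b in zip(seq, seq[1:])]

def make_features(logf0):
    """Stack a static log-F0 track with its delta and delta-delta."""
    d = delta(logf0)
    dd = delta(d)
    return list(zip(logf0, d, dd))

# Each frame becomes a (static, delta, delta-delta) triple.
feats = make_features([4.0, 4.5, 5.5, 5.5])
```

In an HMM-based synthesizer, the delta features let the parameter-generation step produce smooth trajectories rather than stepwise means.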

Adaptive echo canceller combined with speech coder for mobile communication systems (이동통신 시스템을 위한 음성 부호화기와 결합된 적응 반향제거기에 관한 연구)

  • 이인성;박영남
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.23 no.7
    • /
    • pp.1650-1658
    • /
    • 1998
  • This paper describes how to remove echoes effectively using speech parameter information provided by the speech coder. More specifically, the proposed adaptive echo canceller uses the excitation signal or the linear prediction error signal, instead of the output speech signal of the vocoder, as the input to the adaptation algorithm. The normalized least mean square (NLMS) algorithm is used for the adaptive echo canceller. In simulations, the proposed algorithm showed faster convergence characteristics than the conventional method. In particular, the echo canceller using the excitation signal of the speech coder converged about four times faster than the one using the output speech signal of the speech coder as the adaptation input.

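The NLMS adaptation named in the abstract above can be sketched in a few lines. The reference signal, echo path, and step size below are illustrative assumptions, not the paper's setup.

```python
# Minimal NLMS adaptive echo canceller sketch. The far-end reference x,
# the simulated echo path (one-sample delay, gain 0.5), and the step
# size mu are illustrative assumptions.

def nlms_cancel(x, d, taps=4, mu=0.5, eps=1e-8):
    """Adapt a FIR filter w so that w*x tracks the echo in d.
    Returns the error (echo-cancelled) signal."""
    w = [0.0] * taps
    err = []
    for n in range(len(d)):
        # Most recent `taps` reference samples, newest first.
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wi * ui for wi, ui in zip(w, u))   # echo estimate
        e = d[n] - y                               # residual echo
        norm = sum(ui * ui for ui in u) + eps      # input-power normalization
        w = [wi + mu * e * ui / norm for wi, ui in zip(w, u)]
        err.append(e)
    return err

# Simulated echo = reference delayed by one sample and halved.
x = [1.0, 0.0, 1.0, 0.0] * 50
d = [0.0] + [0.5 * v for v in x[:-1]]
residual = nlms_cancel(x, d)
```

The power normalization is what distinguishes NLMS from plain LMS: the effective step size adapts to the input level, which matters for speech, whose power varies widely.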

Implementation of Chip and Algorithm of a Speech Enhancement for an Automatic Speech Recognition Applied to Telematics Device (텔레메틱스 단말용 음성 인식을 위한 음성향상 알고리듬 및 칩 구현)

  • Kim, Hyoung-Gook
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.7 no.5
    • /
    • pp.90-96
    • /
    • 2008
  • This paper presents a single-chip acoustic speech enhancement algorithm for telematics devices. The algorithm consists of two stages: noise reduction and echo cancellation. An adaptive filter based on cross-spectral estimation is used to cancel echo. External background noise is eliminated and clean speech is estimated using MMSE log-spectral magnitude estimation. To be suitable for use in consumer electronics, a low-cost, high-speed, and flexible hardware architecture was also designed. The performance of the proposed speech enhancement algorithm was measured both by the signal-to-noise ratio (SNR) and by the recognition accuracy of an automatic speech recognition (ASR) system, and it yields better results than conventional methods.

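The SNR measurement used above to evaluate enhancement quality can be sketched as follows. The toy signals and the constant-offset "noise" are illustrative stand-ins, not the paper's test data.

```python
import math

# Sketch of an SNR measurement for evaluating speech enhancement:
# clean-signal power over residual-noise power, in dB. The signals
# below are toy stand-ins, not the paper's data.

def snr_db(clean, noisy):
    sig = sum(s * s for s in clean)
    noise = sum((n - s) ** 2 for s, n in zip(clean, noisy))
    return 10.0 * math.log10(sig / noise)

clean = [math.sin(0.1 * i) for i in range(1000)]
noisy = [s + 0.1 for s in clean]     # before enhancement
better = [s + 0.01 for s in clean]   # after enhancement (10x less noise)
```

Reducing the residual noise amplitude by a factor of 10 raises this SNR by 20 dB, which is the kind of gap the paper's SNR comparison quantifies.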

A study on the change of prosodic units by speech rate and frequency of turn-taking (발화 속도와 말차례 교체 빈도에 따른 운율 단위 변화에 관한 연구)

  • Won, Yugwon
    • Phonetics and Speech Sciences
    • /
    • v.14 no.2
    • /
    • pp.29-38
    • /
    • 2022
  • This study analyzed the speech in the National Institute of Korean Language's Daily Conversation Speech Corpus (2020) to reveal how speech rate and the frequency of turn-taking affect changes in prosodic units. The analysis showed a positive correlation between intonation-phrase frequency, word-phrase frequency, and speaking duration as the speech rate increased; however, the correlation was low, and the fit of the regression model on speech rate was 3%-11%, which is weak in explanatory power. There was a significant difference in mean speech rate according to the frequency of turn-taking: the speech rate decreased as the frequency of turn-taking increased. In addition, as the frequency of turn-taking increased, the frequency of intonation phrases, the frequency of word phrases, and the speaking duration decreased, showing a high negative correlation. The fit of the regression model on turn-taking frequency was 27%-32%. The frequency of turn-taking thus functions as a factor that changes the speech rate and prosodic units. This is presumably influenced by the disfluency of the dialogue, the characteristics of turn-taking, and the active interaction between the speakers.
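The "fit of the regression model" figures above are coefficients of determination (R²). A minimal sketch of how such a value is computed for a simple least-squares line is below; the sample numbers are made-up illustrations, not the corpus data.

```python
# Sketch of R^2 for a simple least-squares line, the statistic behind
# the 3%-11% and 27%-32% model-fit figures. Sample data are invented.

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot   # share of variance explained

# Perfectly linear data give R^2 = 1; weakly related data give
# small values like the 3%-11% reported for speech rate.
```

An R² of 0.03-0.11 means speech rate alone explains only a few percent of the variance in prosodic-unit measures, which is why the study calls its explanatory power weak.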

Comparison of Korean Real-time Text-to-Speech Technology Based on Deep Learning (딥러닝 기반 한국어 실시간 TTS 기술 비교)

  • Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology
    • /
    • v.7 no.1
    • /
    • pp.640-645
    • /
    • 2021
  • A deep learning based end-to-end TTS system consists of a Text2Mel module that generates a spectrogram from text and a vocoder module that synthesizes speech signals from the spectrogram. By applying deep learning technology to TTS, the intelligibility and naturalness of synthesized speech have improved to near human levels. However, the inference speed for synthesizing speech is very slow compared to conventional methods. The inference speed can be improved by applying non-autoregressive methods, which generate speech samples in parallel, independent of previously generated samples. This paper introduces FastSpeech, FastSpeech 2, and FastPitch as Text2Mel technologies, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as non-autoregressive vocoder technologies, and implements them to verify whether they can run in real time. The measured real-time factors (RTF) show that all the presented methods are fully capable of real-time processing. The size of each trained model is about tens to hundreds of megabytes, except for WaveGlow, so they can be applied in embedded environments where memory is limited.
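The RTF measurement referred to above is simply synthesis time divided by the duration of the audio produced; RTF < 1 means faster than real time. A minimal sketch follows; the "synthesizer" is a fake stand-in, and the sample rate and per-character duration are assumptions.

```python
import time

# Sketch of a real-time factor (RTF) check:
#   RTF = wall-clock synthesis time / duration of generated audio.
# RTF < 1 means the system synthesizes faster than real time.

def real_time_factor(synthesize, text, sample_rate=22050):
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds

def fake_synthesizer(text):
    # Stand-in model: pretend each character yields 0.1 s of silence.
    return [0.0] * (int(0.1 * 22050) * len(text))

rtf = real_time_factor(fake_synthesizer, "hello world")
```

In practice the same harness would wrap a real Text2Mel + vocoder pipeline, and the RTF would be averaged over many utterances rather than measured once.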

Continuous Digit Recognition Using the Weight Initialization and LR Parser

  • Choi, Ki-Hoon;Lee, Seong-Kwon;Kim, Soon-Hyob
    • The Journal of the Acoustical Society of Korea
    • /
    • v.15 no.2E
    • /
    • pp.14-23
    • /
    • 1996
  • This paper presents a neural network that recognizes phonemes, a weight-initialization method that reduces learning time, and an LR parser for continuous speech recognition. The neural network spots phonemes in continuous speech, and the LR parser parses the output of the neural network. The phonemes recognized by the neural network are divided into several groups by phoneme similarity, and each group has its own neural network, consisting of networks that recognize the phonemes of that group and a VGNN (Verify Group Neural Network) that judges whether an input belongs to the group. The weights of the neural networks are initialized not with random values but from the learning data, to reduce learning time. The LR parsing method applied in this paper does not trace a unique path but several possible paths, because the output of the neural network is not exact. The parser processes continuous speech frame by frame, accumulating the neural network outputs along the possible paths; if an accumulated path value drops below a threshold, that path is deleted from the set of possible parsing paths. The system is applied to continuous Korean digit recognition. The recognition rate for isolated digits is 97% in the speaker-dependent case and 75% in the speaker-independent case, and the recognition rate for continuous digits is 74% in the speaker-dependent case.

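The threshold-based path pruning described above can be sketched as a single per-frame update over a set of open paths. The path names, scores, and threshold below are illustrative assumptions.

```python
# Sketch of the per-frame path pruning described above: the parser
# accumulates the network's score along every open path and deletes
# paths whose accumulated value falls below a threshold. The names,
# scores (treated as log-probabilities), and threshold are invented.

def prune_paths(paths, frame_scores, threshold):
    """paths: {path_name: accumulated_score};
    frame_scores: this frame's score for each path."""
    survivors = {}
    for name, acc in paths.items():
        acc += frame_scores.get(name, float("-inf"))
        if acc >= threshold:
            survivors[name] = acc   # path stays open
    return survivors

paths = {"il-i": -1.0, "il-o": -1.5, "sam-i": -4.0}
frame = {"il-i": -0.5, "il-o": -0.5, "sam-i": -0.5}
paths = prune_paths(paths, frame, threshold=-3.0)
```

Pruning bounds the number of paths the parser must carry per frame, which is what makes tracing several possible paths (rather than a unique one) affordable.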

An Introduction to Energy-Based Blind Separating Algorithm for Speech Signals

  • Mahdikhani, Mahdi;Kahaei, Mohammad Hossein
    • ETRI Journal
    • /
    • v.36 no.1
    • /
    • pp.175-178
    • /
    • 2014
  • We introduce the Energy-Based Blind Separating (EBS) algorithm for extremely fast separation of mixed speech signals without loss of quality, which is performed in two stages: iterative-form separation and closed-form separation. This algorithm significantly improves the separation speed simply due to incorporating only some specific frequency bins into computations. Simulation results show that, on average, the proposed algorithm is 43 times faster than the independent component analysis (ICA) for speech signals, while preserving the separation quality. Also, it outperforms the fast independent component analysis (FastICA), the joint approximate diagonalization of eigenmatrices (JADE), and the second-order blind identification (SOBI) algorithm in terms of separation quality.
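The core idea summarized above, restricting computation to the frequency bins that carry most of the energy, can be sketched as follows. The DFT implementation and the 90% energy threshold are illustrative assumptions, not the EBS algorithm itself.

```python
import cmath
import math

# Sketch of the bin-selection idea behind EBS: keep only the strongest
# frequency bins (here, enough to cover 90% of the energy) and skip
# the rest in later computation. DFT and threshold are assumptions.

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def high_energy_bins(x, keep=0.9):
    """Return indices of the strongest bins covering `keep` of the energy."""
    energy = [abs(c) ** 2 for c in dft(x)]
    order = sorted(range(len(x)), key=lambda k: -energy[k])
    total, acc, chosen = sum(energy), 0.0, []
    for k in order:
        chosen.append(k)
        acc += energy[k]
        if acc >= keep * total:
            break
    return sorted(chosen)

# A pure tone at bin 2 of a 16-point frame concentrates its energy
# in bins 2 and 14 (the conjugate-symmetric pair).
x = [math.sin(2 * math.pi * 2 * n / 16) for n in range(16)]
bins = high_energy_bins(x)
```

For speech, where energy clusters in relatively few bins, processing only the selected bins is what yields the large speedups the paper reports over full-spectrum ICA.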

A Study of Speech Coding for the Transmission on Network by the Wavelet Packets (Wavelet Packet을 이용한 Network 상의 음성 코드에 관한 연구)

  • Baek, Han-Wook;Chung, Chin-Hyun
    • Proceedings of the KIEE Conference
    • /
    • 2000.07d
    • /
    • pp.3028-3030
    • /
    • 2000
  • In general, speech coding is dedicated to compression performance or speech quality. The speech coding in this paper, however, focuses on transmission that adapts flexibly to the network speed. This requires subband coding, which here uses the wavelet packet concept for signal analysis. Extracting each frequency band is difficult with general signal analysis methods, and after coding each band, reconstructing them is also a difficult problem. With the wavelet packet concept (perfect reconstruction) and its fast computation algorithm, however, the extraction of each band and its reconstruction are more natural. This paper also describes a direct solution for voice transmission over a network and implements the algorithm in a TCP/IP network environment on a PC.

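The perfect-reconstruction property the paper relies on can be shown in the simplest case: a one-level Haar split into a low band and a high band, followed by exact resynthesis. This is a minimal sketch; a wavelet packet applies such a split recursively to both bands with longer filters.

```python
# One-level Haar analysis/synthesis as a minimal perfect-reconstruction
# example. A wavelet packet tree applies this split recursively to
# both the low and high bands.

def haar_split(x):
    """Split x (even length) into half-rate low and high bands."""
    low = [(a + b) / 2 for a, b in zip(x[::2], x[1::2])]
    high = [(a - b) / 2 for a, b in zip(x[::2], x[1::2])]
    return low, high

def haar_merge(low, high):
    """Invert haar_split exactly: (l+h, l-h) recovers each sample pair."""
    out = []
    for l, h in zip(low, high):
        out += [l + h, l - h]
    return out

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
low, high = haar_split(x)
```

Because each band is at half the rate, bands can be coded and transmitted independently, and low-priority (high-frequency) bands can be dropped when the network is slow, which is the flexibility the paper targets.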

Voice Similarities between Brothers

  • Ko, Do-Heung;Kang, Sun-Mee
    • Speech Sciences
    • /
    • v.9 no.2
    • /
    • pp.1-11
    • /
    • 2002
  • This paper aims to provide a guideline for modelling speaker identification and speaker verification by comparing voice similarities between brothers. Five pairs of brothers who are believed to have similar voices participated in this experiment. Before the experiment, perceptual tests were conducted to determine whether the voices of the brothers were similar. The words were measured both in isolation and in context, and the subjects were asked to read five times with an interval of about three seconds between readings. Recordings were made at natural speed in a quiet room. The data were analyzed for pitch and formant frequencies using CSL (Computerized Speech Lab), PCQuirer, and MDVP (Multi-Dimensional Voice Program). It was found that the data for initial vowels are much more similar and homogeneous than those for vowels in other positions. The acoustic data showed that voice similarities are strikingly high in both pitch and formant frequencies. It was also found that the correlation coefficient between the above parameters was not significant.

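The correlation coefficient mentioned above is the standard Pearson r between acoustic parameters. A minimal sketch follows; the sample F0 values are made-up illustrations, not the study's measurements.

```python
import math

# Sketch of the Pearson correlation coefficient between two sets of
# acoustic measurements (e.g. one brother's F0 per test word vs. the
# other's). The sample values are invented for illustration.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-word mean F0 (Hz) for each brother in a pair:
r = pearson_r([120.0, 125.0, 130.0, 135.0], [118.0, 126.0, 131.0, 137.0])
```

High similarity between voices shows up as r near 1 for matched parameters; significance then depends on the number of measurement pairs, not on r alone.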

Voice Similarities between Sisters

  • Ko, Do-Heung
    • Speech Sciences
    • /
    • v.8 no.3
    • /
    • pp.43-50
    • /
    • 2001
  • This paper deals with voice similarities between sisters, who are supposed to share common physiological characteristics from a single biological mother. Nine pairs of sisters who are believed to have similar voices participated in this experiment. The speech samples from one pair of sisters were excluded from the analysis because their perceptual score was relatively low. The words were measured both in isolation and in context, and the subjects were asked to read the text five times with an interval of about three seconds between readings. Recordings were made at natural speed in a quiet room. The data were analyzed for pitch and formant frequencies using CSL (Computerized Speech Lab) and PCQuirer. It was found that the data for initial vowels are much more similar and homogeneous than those for vowels in other positions. The acoustic data showed that voice similarities are strikingly high in both pitch and formant frequencies. It is assumed that the statistical data obtained from this experiment can be used as a guideline for modelling speaker identification and speaker verification.
