• Title/Summary/Keyword: Speech Data

Adaptive Speech Streaming Based on Packet Loss Prediction Using Support Vector Machine for Software-Based Multipoint Control Unit over IP Networks

  • Kang, Jin Ah;Han, Mikyong;Jang, Jong-Hyun;Kim, Hong Kook
    • ETRI Journal / v.38 no.6 / pp.1064-1073 / 2016
  • An adaptive speech streaming method to improve the perceived speech quality of a software-based multipoint control unit (SW-based MCU) over IP networks is proposed. First, the proposed method predicts whether the speech packet to be transmitted will be lost: it learns the pattern of packet losses in the IP network and then predicts the loss of each packet to be transmitted over that network. The method also classifies each speech frame as silence, unvoiced, speech onset, or voiced. Based on the results of packet loss prediction and speech classification, it determines the proper amount and bitrate of redundant speech data (RSD) sent along with the primary speech data (PSD) to help the speech decoder restore the speech signals of lost packets. Specifically, when a packet is predicted to be lost, the amount and bitrate of the RSD are increased by reducing the bitrate of the PSD. The effectiveness of the proposed method for learning the packet loss pattern and assigning different speech coding rates is then demonstrated using a support vector machine and adaptive multirate-narrowband (AMR-NB) coding, respectively. The results show that, compared with conventional methods that restore lost speech signals, the proposed method remarkably improves the perceived speech quality of an SW-based MCU under various packet loss conditions in an IP network.
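
The abstract leaves the SVM feature design unspecified; the sketch below illustrates the general idea of learning a packet loss pattern with an SVM, assuming the binary loss history of the last few packets is the feature vector. The window size, the synthetic trace, and the PSD/RSD bitrate split are assumptions, not the paper's settings.

```python
# Minimal sketch of SVM-based packet loss prediction, assuming the
# binary loss history of the last W packets is the feature vector.
import numpy as np
from sklearn.svm import SVC

W = 8  # history window length (assumption)

def make_dataset(loss_trace):
    """Build (history, next-loss) training pairs from a 0/1 loss trace."""
    X = [loss_trace[i:i + W] for i in range(len(loss_trace) - W)]
    y = [loss_trace[i + W] for i in range(len(loss_trace) - W)]
    return np.array(X), np.array(y)

trace = (np.random.rand(2000) < 0.1).astype(int)  # synthetic 10% loss trace
X, y = make_dataset(trace)
clf = SVC(kernel="rbf").fit(X[:1500], y[:1500])

def choose_bitrates(history):
    """Pick (PSD, RSD) AMR-NB modes in kbps: lower the primary rate to
    make room for redundancy when the next packet is predicted lost.
    The particular mode pair is illustrative only."""
    will_lose = clf.predict([history])[0]
    return (7.40, 4.75) if will_lose else (12.2, 0.0)
```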

Design and Implementation of a Speech Synthesis Engine and a Plug-in for Internet Web Page (인터넷 웹페이지의 음성합성을 위한 엔진 및 플러그-인 설계 및 구현)

  • Lee, Hee-Man;Kim, Ji-Yeong
    • The Transactions of the Korea Information Processing Society / v.7 no.2 / pp.461-469 / 2000
  • This paper describes the design and implementation of a Netscape plug-in and a speech synthesis engine that generate speech from the text of web pages. The steps for generating speech from a web page are as follows: the speech synthesis plug-in is activated when Netscape finds the audio/xesp MIME data type embedded in the browsed web page; the HTML file referenced in the EMBED HTML tag is downloaded from the referenced URL and sent to the commander object located in the plug-in; the commander object extracts the speech synthesis engine control tags and the text characters from the downloaded HTML document; and the synthesized speech is generated by the speech synthesis engine. The speech synthesis engine interprets the command streams from the commander object and calls the member functions that process the speech segment data in the data banks. The commander object and the speech synthesis engine are designed as independent objects to enhance flexibility and portability.
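
The commander object's core job is separating engine control tags from plain text in the downloaded HTML. The sketch below shows that separation step in modern Python; the control tag names (`rate`, `pitch`) and the command-stream format are invented for illustration, not taken from the paper.

```python
# Toy sketch of the commander object's tag/text separation step.
# The control tag names and command tuples are hypothetical.
from html.parser import HTMLParser

class CommanderParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.commands = []  # command stream for the synthesis engine

    def handle_starttag(self, tag, attrs):
        if tag in ("rate", "pitch"):  # hypothetical engine control tags
            self.commands.append(("SET", tag, dict(attrs).get("value")))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.commands.append(("SPEAK", text))

parser = CommanderParser()
parser.feed("<rate value='fast'/>Hello <b>web</b> page.")
print(parser.commands)
# [('SET', 'rate', 'fast'), ('SPEAK', 'Hello'), ('SPEAK', 'web'), ('SPEAK', 'page.')]
```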

Improving transformer-based speech recognition performance using data augmentation by local frame rate changes (로컬 프레임 속도 변경에 의한 데이터 증강을 이용한 트랜스포머 기반 음성 인식 성능 향상)

  • Lim, Seong Su;Kang, Byung Ok;Kwon, Oh-Wook
    • The Journal of the Acoustical Society of Korea / v.41 no.2 / pp.122-129 / 2022
  • In this paper, we propose a method to improve the performance of Transformer-based speech recognizers using data augmentation that locally adjusts the frame rate. First, the start time and length of the part of the original speech data to be augmented are randomly selected. Then, the frame rate of the selected part is changed to a new frame rate by linear interpolation. Experimental results on the Wall Street Journal and LibriSpeech speech databases showed that training took longer to converge than the baseline, but recognition accuracy improved in most cases. To further improve performance, parameters such as the length and speed of the selected parts were optimized. The proposed method achieves relative performance improvements of 11.8% and 14.9% over the baseline on the Wall Street Journal and LibriSpeech databases, respectively.
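
A minimal sketch of the augmentation idea follows: a randomly chosen segment of the waveform is resampled by linear interpolation. The segment-length and speed-factor ranges are assumptions; the paper optimizes these parameters.

```python
# Minimal sketch of local frame-rate-change augmentation: a randomly
# chosen segment is time-stretched by linear interpolation. The length
# and speed ranges below are assumptions, not the paper's values.
import numpy as np

def local_speed_perturb(wave, min_len=1600, max_len=8000,
                        min_speed=0.8, max_speed=1.2):
    start = np.random.randint(0, len(wave) - max_len)
    length = np.random.randint(min_len, max_len)
    speed = np.random.uniform(min_speed, max_speed)
    seg = wave[start:start + length]
    # Resample the segment: new length = old length / speed factor.
    new_len = int(round(length / speed))
    idx_old = np.arange(length)
    idx_new = np.linspace(0, length - 1, new_len)
    seg_new = np.interp(idx_new, idx_old, seg)
    return np.concatenate([wave[:start], seg_new, wave[start + length:]])

augmented = local_speed_perturb(np.random.randn(16000))  # 1 s at 16 kHz
```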

Rhythmic Differences between Spontaneous and Read Speech of English

  • Kim, Sul-Ki;Jang, Tae-Yeoub
    • Phonetics and Speech Sciences / v.1 no.3 / pp.49-55 / 2009
  • This study investigates whether rhythm metrics can capture the rhythmic differences between spontaneous and read English speech. Transcriptions of spontaneous speech tokens extracted from a corpus were read by three native English speakers to generate corresponding read-speech tokens. The two data sets are compared in terms of seven rhythm measures suggested by previous studies. Results show a significant difference in the vowel-based metrics (VarcoV and nPVI-V) between spontaneous and read speech, reflecting greater variability in vocalic intervals in spontaneous speech than in read speech. The study is especially meaningful in that it demonstrates a way in which speech styles can be differentiated and parameterized in numerical terms.
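
The two metrics that differ significantly have standard definitions in the rhythm-metrics literature; the sketch below computes both from a list of vocalic interval durations. The toy durations are made up for illustration.

```python
# VarcoV and nPVI-V computed from vocalic interval durations,
# using their standard definitions.
import numpy as np

def varco_v(durations):
    """VarcoV: std. dev. of vocalic durations normalized by the mean (x100)."""
    d = np.asarray(durations, dtype=float)
    return 100.0 * d.std() / d.mean()

def npvi_v(durations):
    """nPVI-V: mean normalized difference between successive vocalic intervals."""
    d = np.asarray(durations, dtype=float)
    diffs = np.abs(np.diff(d)) / ((d[:-1] + d[1:]) / 2.0)
    return 100.0 * diffs.mean()

vowels_ms = [85, 120, 60, 140, 95]  # toy vocalic interval durations (ms)
print(varco_v(vowels_ms), npvi_v(vowels_ms))
```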

An Efficient Transmission Coding Technique of Digitized Speech Data

  • Shimamura, Tetsuya;Yaguchi, Katsuaki
    • Proceedings of the IEEK Conference / 2002.07c / pp.1796-1798 / 2002
  • Speech transmission is common in many communication systems. In this paper, a technique to reduce the total number of bits required to represent speech data is proposed for packet transmission. A novel coding method is derived based on the concept of finding common information in sequential speech samples. Computer simulations demonstrate that the proposed scheme reduces the total bits required by PCM approximately by half.
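
The abstract does not detail the coding method. As one illustration of exploiting redundancy between successive speech samples, the sketch below shows simple delta (difference) coding of PCM samples, a related but not identical idea; it is not the paper's scheme.

```python
# Illustrative only: delta coding of PCM samples, a classic way to
# exploit similarity between successive speech samples. The paper's
# actual "common information" scheme is not specified in this abstract.
import numpy as np

def delta_encode(samples):
    samples = np.asarray(samples, dtype=np.int32)
    return np.concatenate([samples[:1], np.diff(samples)])

def delta_decode(deltas):
    return np.cumsum(deltas)

pcm = np.array([1000, 1010, 1015, 1012, 1008])
assert np.array_equal(delta_decode(delta_encode(pcm)), pcm)
# The small differences need far fewer bits than the raw 16-bit samples.
```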

Korean Speech Recognition Based on Syllable (음절을 기반으로한 한국어 음성인식)

  • Lee, Young-Ho;Jeong, Hong
    • Journal of the Korean Institute of Telematics and Electronics B / v.31B no.1 / pp.11-22 / 1994
  • For conventional word-based systems, it is very difficult to enlarge the vocabulary. To cope with this problem, more fundamental units of speech, such as syllables and phonemes, must be used. Korean speech consists of initial consonants, medial vowels, and final consonants, and has the characteristic that syllables can be extracted from speech easily. In this paper, we present a speech recognition system that takes advantage of these syllable characteristics peculiar to Korean. The recognition algorithm is the Time Delay Neural Network (TDNN). To handle the many recognition units, the system consists of separate neural networks for recognizing initial consonants, medial vowels, and final consonants. The system first recognizes initial consonants, medial vowels, and final consonants, and then uses these results to recognize isolated words. Through experiments, we obtained recognition rates of 85.12% on 2,735 initial-consonant tokens, 86.95% on 3,110 medial-vowel tokens, 90.58% on 1,615 final-consonant tokens, and 71.2% on 250 isolated words.
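
Combining the three network outputs into syllables has a well-known arithmetic form: the standard Unicode formula for composing a Hangul syllable from initial/medial/final indices, shown below. That the paper's recognizers output exactly these jamo indices is an assumption for illustration.

```python
# Composing a Hangul syllable from the indices of its initial consonant,
# medial vowel, and final consonant, using the standard Unicode formula.
# Assumes the three recognizers output these jamo indices (illustrative).
def compose_syllable(initial, medial, final=0):
    # 19 initials x 21 medials x 28 finals (final 0 = no final consonant)
    return chr(0xAC00 + (initial * 21 + medial) * 28 + final)

# Initial 'ㅎ' (18), medial 'ㅏ' (0), final 'ㄴ' (4) -> '한'
print(compose_syllable(18, 0, 4))  # 한
```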

Performance of speech recognition unit considering morphological pronunciation variation (형태소 발음변이를 고려한 음성인식 단위의 성능)

  • Bang, Jeong-Uk;Kim, Sang-Hun;Kwon, Oh-Wook
    • Phonetics and Speech Sciences / v.10 no.4 / pp.111-119 / 2018
  • This paper proposes a method to improve speech recognition performance by extracting the various pronunciations of pseudo-morpheme units from an eojeol-unit corpus and generating new recognition units that take pronunciation variation into account. In the proposed method, we first align the pronunciations of the eojeol units with those of the pseudo-morpheme units, and then expand the pronunciation dictionary by extracting new pseudo-morpheme pronunciations from the eojeol-level pronunciations. We then propose new pronunciation-dependent recognition units, obtained by tagging the extracted phoneme symbols onto the pseudo-morpheme units. The proposed units and their expanded pronunciations are incorporated into the lexicon and language model of the speech recognizer. Performance is evaluated using a Korean speech recognizer with a trigram language model built from a 100-million-pseudo-morpheme corpus and an acoustic model trained on 445 hours of multi-genre broadcast speech data. The proposed method is shown to relatively reduce the word error rate by 13.8% on the news-genre evaluation data and by 4.5% on the total evaluation data.
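
A toy sketch of the dictionary-expansion step follows, assuming pronunciations are romanized phoneme strings and the eojeol-to-morpheme alignment is already available; the alignment itself, which the paper performs automatically over phoneme sequences, is elided, and all strings are hypothetical.

```python
# Toy sketch of expanding a pseudo-morpheme pronunciation dictionary
# from eojeol-level pronunciations. Phoneme strings and the alignment
# are hypothetical; the paper derives the alignment automatically.
from collections import defaultdict

lexicon = defaultdict(set)  # recognition unit -> set of pronunciations

def add_variants(morphemes, aligned_prons):
    """morphemes: pseudo-morpheme units of one eojeol;
    aligned_prons: their realized pronunciations inside that eojeol."""
    for unit, pron in zip(morphemes, aligned_prons):
        # Tag the unit with its realized pronunciation, so each variant
        # becomes a distinct pronunciation-dependent recognition unit.
        lexicon[f"{unit}/{pron}"].add(pron)

# e.g. an eojeol pronounced [hangugeo] split into two pseudo-morphemes
add_variants(["hanguk", "eo"], ["hangug", "eo"])
print(dict(lexicon))  # {'hanguk/hangug': {'hangug'}, 'eo/eo': {'eo'}}
```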

Semi-supervised learning of speech recognizers based on variational autoencoder and unsupervised data augmentation (변분 오토인코더와 비교사 데이터 증강을 이용한 음성인식기 준지도 학습)

  • Jo, Hyeon Ho;Kang, Byung Ok;Kwon, Oh-Wook
    • The Journal of the Acoustical Society of Korea / v.40 no.6 / pp.578-586 / 2021
  • We propose a semi-supervised learning method based on a Variational AutoEncoder (VAE) and Unsupervised Data Augmentation (UDA) to improve the performance of an end-to-end speech recognizer. In the proposed method, the VAE-based augmentation model and the baseline end-to-end speech recognizer are first trained on the original speech data. The baseline recognizer is then trained again using data generated by the learned augmentation model. Finally, the augmentation model and the end-to-end speech recognizer are retrained using the UDA-based semi-supervised learning method. Computer simulations show that the augmentation model reduces the word error rate (WER) of the baseline end-to-end speech recognizer, and that combining it with the UDA-based learning method improves performance further.
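
The UDA part of such a setup combines a supervised loss on labeled data with a consistency loss between the model's predictions on original and augmented unlabeled data. The schematic PyTorch sketch below shows that objective; `model`, `augment`, and the batch variables are placeholders, and the exact loss used in the paper may differ.

```python
# Schematic UDA-style semi-supervised objective: supervised loss on
# labeled speech plus a consistency (KL) loss between outputs on
# original and augmented unlabeled speech. All names are placeholders.
import torch
import torch.nn.functional as F

def uda_loss(model, augment, x_lab, y_lab, x_unlab, lam=1.0):
    sup = F.cross_entropy(model(x_lab), y_lab)  # supervised term
    with torch.no_grad():
        target = F.softmax(model(x_unlab), dim=-1)  # fixed teacher output
    student = F.log_softmax(model(augment(x_unlab)), dim=-1)
    consistency = F.kl_div(student, target, reduction="batchmean")
    return sup + lam * consistency
```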

A Study on Realization of Continuous Speech Recognition System of Speaker Adaptation (화자적응화 연속음성 인식 시스템의 구현에 관한 연구)

  • 김상범;김수훈;허강인;고시영
    • The Journal of the Acoustical Society of Korea / v.18 no.3 / pp.10-16 / 1999
  • In this paper, we study a speaker-adaptive continuous speech recognition system using MAPE (Maximum A Posteriori Probability Estimation), which can adapt with any small amount of adaptation speech data. Speaker adaptation is performed by MAPE after concatenation training, in which sentence-unit HMMs are built by linking syllable-unit HMMs, and Viterbi segmentation automatically partitions the adaptation speech data into syllable-unit segments without hand labelling. For car-control speech, the recognition rate of the adapted HMM was 77.18%, approximately a 6% improvement over that of the unadapted HMM (in the case of O(n)DP).
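
MAP adaptation of an HMM Gaussian mean has a standard closed form: the adapted mean interpolates between the speaker-independent prior mean and the sample mean of the adaptation frames, weighted by a relevance factor. A small numpy sketch of that formula follows; the feature dimension, frame counts, and tau value are illustrative.

```python
# MAP adaptation of a Gaussian mean, the standard closed form used in
# HMM speaker adaptation: interpolate between the prior mean and the
# sample mean of the adaptation frames assigned to this state.
import numpy as np

def map_adapt_mean(prior_mean, frames, tau=10.0):
    """frames: adaptation feature vectors assigned to this state by
    Viterbi segmentation; tau: prior weight (relevance factor)."""
    n = len(frames)
    sample_mean = np.mean(frames, axis=0)
    return (tau * prior_mean + n * sample_mean) / (tau + n)

prior = np.zeros(13)                   # speaker-independent mean (toy)
adapt = np.random.randn(50, 13) + 0.5  # 50 adaptation frames (toy)
print(map_adapt_mean(prior, adapt)[:3])  # shifted toward the new speaker
```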

A Study on Syntactic Development in Spontaneous Speech (자발화에 나타난 구문구조 발달 양상)

  • Chang, Jin-A;Kim, Su-Jin;Shin, Ji-Young;Yi, Bong-Won
    • MALSORI / v.68 / pp.17-32 / 2008
  • The purpose of the present study is to investigate the syntactic development of Korean by analysing spontaneous speech data. Thirty children (aged 3, 5, and 7 years; 10 per age group) and 10 adults participated as subjects. Speech data were recorded and transcribed orthographically. The transcribed data were analysed syntactically in terms of sentence patterns (simple vs. complex) and clause patterns (four basic types according to the predicate). The results are as follows: 1) simple sentences show higher frequency in the upper age groups, and 2) complex sentences with conjunctive and embedded clauses show higher frequency in the upper age groups.
