• Title/Summary/Keyword: Korean speech

Search Results: 5,325

Speech Feature Extraction Based on the Human Hearing Model

  • Chung, Kwang-Woo;Kim, Paul;Hong, Kwang-Seok
    • Proceedings of the KSPS conference / 1996.10a / pp.435-447 / 1996
  • In this paper, we propose a method that extracts speech features using a hearing model through signal-processing techniques. The proposed method consists of the following steps: normalization of the short-time speech block by its maximum value, multi-resolution analysis using the discrete wavelet transform and re-synthesis using the inverse discrete wavelet transform, differentiation after analysis and synthesis, and full-wave rectification followed by integration. To verify the performance of the proposed speech feature in speech recognition, Korean digit recognition experiments were carried out using both DTW and VQ-HMM. With DTW, the recognition rates were 99.79% and 90.33% for the speaker-dependent and speaker-independent tasks respectively; with VQ-HMM, the rates were 96.5% and 81.5%. This indicates that the proposed speech feature has potential as a simple and efficient feature for recognition tasks. (A minimal sketch of this pipeline follows below.)

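A minimal sketch of the kind of pipeline this abstract describes, assuming PyWavelets for the wavelet analysis and re-synthesis; the wavelet family, decomposition depth, and block length are illustrative assumptions, not taken from the paper.

```python
# Sketch of the hearing-model feature pipeline (assumptions: 'db4' wavelet,
# 3-level decomposition, 25 ms blocks; the abstract does not fix these).
import numpy as np
import pywt

def hearing_model_features(block, wavelet="db4", level=3):
    # 1) Normalize the short-time block by its maximum absolute value.
    block = block / (np.max(np.abs(block)) + 1e-12)
    # 2) Multi-resolution analysis via the discrete wavelet transform,
    #    then re-synthesis with the inverse transform, band by band.
    coeffs = pywt.wavedec(block, wavelet, level=level)
    bands = []
    for i in range(len(coeffs)):
        kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
        bands.append(pywt.waverec(kept, wavelet)[: len(block)])
    # 3) Differentiate each re-synthesized band.
    diffed = [np.diff(b, prepend=b[0]) for b in bands]
    # 4) Full-wave rectification and integration -> one energy value per band.
    return np.array([np.sum(np.abs(d)) for d in diffed])

# Example: features for one 25 ms block at 16 kHz.
feat = hearing_model_features(np.random.randn(400))
```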

Multi-stage Speech Recognition Using Confidence Vector (신뢰도 벡터 기반의 다단계 음성인식)

  • Jeon, Hyung-Bae;Hwang, Kyu-Woong;Chung, Hoon;Kim, Seung-Hi;Park, Jun;Lee, Yun-Keun
    • MALSORI / no.63 / pp.113-124 / 2007
  • In this paper, we propose using a confidence vector as an intermediate input feature in a multi-stage speech recognition architecture to improve recognition accuracy. The multi-stage structure is introduced to reduce the computational complexity of decoding and thereby achieve faster speech recognition. A conventional multi-stage recognizer is usually composed of three stages: acoustic search, lexical search, and acoustic re-scoring. Here we focus on improving the accuracy of the lexical decoding by using a confidence vector as the input feature instead of the phoneme sequence that is typically used. Experiments on a 220K-entry Korean Point-of-Interest (POI) domain show that the proposed method improves accuracy. (A sketch of a confidence vector appears below.)

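The abstract does not define its confidence vector precisely; one common reading is a per-segment vector of phoneme posterior probabilities handed to the lexical search instead of the single best phoneme. A hedged sketch under that assumption:

```python
import numpy as np

def confidence_vector(acoustic_scores):
    """Turn raw per-phoneme acoustic log-scores for one segment into a
    posterior-like confidence vector via a softmax (an assumption; the
    paper's exact confidence measure may differ)."""
    s = np.asarray(acoustic_scores, dtype=float)
    e = np.exp(s - s.max())          # subtract max for numerical stability
    return e / e.sum()

# A 1-best phoneme decision would keep only the argmax and discard the rest;
# the confidence vector keeps the full distribution for the lexical stage.
scores = {"k": -3.1, "g": -3.4, "t": -7.2}
vec = confidence_vector(list(scores.values()))
print(dict(zip(scores, vec.round(3))))
```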

SPEECH SYNTHESIS USING LARGE SPEECH DATA-BASE

  • Lee, Kyu-Keon;Mochida, Takemi;Sakurai, Naohiro;Shirai, Katsuhiko
    • Proceedings of the Acoustical Society of Korea Conference / 1994.06a / pp.949-956 / 1994
  • In this paper, we introduce a new speech synthesis method for arbitrary Japanese and Korean sentences using a natural-speech database, and discuss its application to a CAI system. In our method, a basic sentence and basic accent phrases are selected from the database to match a target sentence. The selection factors are the phrase dependency structure (separation degree), the number of morae, the accent type, and the phonemic labels. The target pitch pattern and phonemic parameter series are then generated from the selected basic units. Because the pitch pattern is built from patterns directly extracted from real speech, it is expected to be more natural than any pattern estimated by a model. So far we have tested this method on Japanese sentence speech and confirmed that the synthesized sound preserves human-like features fairly well. We are now extending the method to Korean sentence speech synthesis and applying the synthesis units to a CAI system. (A sketch of the unit-selection step follows below.)

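A toy sketch of the unit-selection idea: pick the stored accent phrase whose attributes best match the target. The weights, attribute names, and cost function below are illustrative assumptions, not the paper's actual selection criteria.

```python
from dataclasses import dataclass

@dataclass
class AccentPhrase:
    morae: int          # number of morae
    accent_type: int    # accent nucleus position
    phonemes: str       # phonemic label string

def selection_cost(target: AccentPhrase, cand: AccentPhrase) -> float:
    cost = 2.0 * abs(target.morae - cand.morae)              # mora-count mismatch
    cost += 5.0 * (target.accent_type != cand.accent_type)   # accent-type mismatch
    # Crude phonemic-label distance: count of differing positions.
    cost += sum(a != b for a, b in zip(target.phonemes, cand.phonemes))
    return cost

def select_unit(target, database):
    return min(database, key=lambda cand: selection_cost(target, cand))

db = [AccentPhrase(3, 1, "aki"), AccentPhrase(3, 0, "asa"), AccentPhrase(4, 1, "akai")]
best = select_unit(AccentPhrase(3, 1, "ame"), db)
```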

Implementation of Formant Speech Analysis/Synthesis System (포만트 분석/합성 시스템 구현)

  • Lee, Joon-Woo;Son, Ill-Kwon;Bae, Keun-Sung
    • Speech Sciences / v.1 / pp.295-314 / 1997
  • In this study, we implement a flexible formant analysis and synthesis system. In the analysis part, a two-channel approach (speech and EGG signals) is investigated for accurate estimation of formant information. The EGG signal is used to extract the exact pitch information needed for pitch-synchronous LPC analysis and closed-phase LPC analysis. In the synthesis part, the Klatt formant synthesizer is modified so that the user can change synthesis parameters arbitrarily. Experimental results demonstrate the superiority of the two-channel analysis method over the one-channel (speech-only) method in both analysis and synthesis. The implemented system is expected to be very helpful for studying the effects of synthesis parameters on the quality of synthetic speech and for developing a Korean text-to-speech (TTS) system based on formant synthesis. (A sketch of pitch-synchronous LPC follows below.)

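A minimal sketch of pitch-synchronous LPC analysis as described in the abstract. The pitch marks are assumed to come from the EGG channel (here they are simply passed in as sample indices); LPC order 12 and the autocorrelation method are assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=12):
    # Autocorrelation-method LPC: solve the Toeplitz normal equations.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    return solve_toeplitz(r[:order], r[1 : order + 1])  # prediction coeffs

def pitch_synchronous_lpc(speech, pitch_marks, order=12):
    # One LPC analysis per pitch period, bounded by consecutive EGG marks.
    coeffs = []
    for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
        frame = speech[start:end] * np.hamming(end - start)
        coeffs.append(lpc(frame, order))
    return coeffs

speech = np.random.randn(1600)        # stand-in for one speech chunk @ 16 kHz
marks = list(range(0, 1600, 160))     # fake 100 Hz pitch marks from the EGG
per_period = pitch_synchronous_lpc(speech, marks)
```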

A Study of F0 Realization in Emotional Speech (감정에 따른 음성의 기본주파수 실현 연구)

  • Park, Mi-Young;Park, Mi-Kyoung
    • Proceedings of the KSPS conference / 2005.11a / pp.79-85 / 2005
  • In this paper, we compare normal speech with emotional speech (happy, sad, and angry states) through changes in the fundamental frequency. The distribution charts of normal and emotional speech show distinctive cues such as the range of the distribution, the average, the maximum, and the minimum. On the whole, the range of the fundamental frequency is extended in the happy and angry states, whereas the sad state narrows it; nevertheless, the F0 ranges in the sad state are still wider than in normal speech. In addition, we verify that ending boundary tones reflect information about the whole utterance. (A sketch of the F0 statistics appears below.)

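A sketch of the F0 comparison the abstract describes: per-emotion distribution statistics computed from F0 contours. The contours below are fabricated stand-ins; in practice they would come from a pitch tracker.

```python
import numpy as np

def f0_stats(f0_contour):
    f0 = np.asarray(f0_contour, dtype=float)
    f0 = f0[f0 > 0]                     # drop unvoiced frames (F0 == 0)
    return {"mean": f0.mean(), "min": f0.min(),
            "max": f0.max(), "range": f0.max() - f0.min()}

contours = {
    "normal": 200 + 20 * np.random.randn(100),
    "happy":  240 + 50 * np.random.randn(100),
    "sad":    180 + 30 * np.random.randn(100),
    "angry":  250 + 55 * np.random.randn(100),
}
for emotion, contour in contours.items():
    print(emotion, f0_stats(contour))
```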

Speech Database for 3-5 years old Korean Children (만 3-5세 유아의 한국어 음성 데이터베이스 구축)

  • Yoo, Jae-Kwon;Lee, Kyung-Ok;Lee, Kyoung-Mi
    • The Journal of the Korea Contents Association / v.12 no.4 / pp.52-59 / 2012
  • Children develop their language skills rapidly between ages 3 and 5. To support a child's language development through a variety of experiences, age-appropriate content is needed, including content with a speech interface; however, no speech database of Korean children has been available. In this paper, we build a speech database of Korean children aged 3 to 5. To collect accurate children's speech, child-education experts supervised the database construction process. The words for the database were selected from the MCDI-K in two stages, and each child spoke each word three times. The collected recordings are segmented by child and word and stored in the database. The database will be distributed over the web and will, we hope, form a foundation for developing child-oriented content. (A sketch of such a storage layout appears below.)
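A sketch of one possible storage layout for the child/word/repetition structure the abstract describes, using sqlite3 from the standard library; the table and column names are assumptions, not the authors' schema.

```python
import sqlite3

conn = sqlite3.connect("child_speech.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS child (
    child_id  INTEGER PRIMARY KEY,
    age_years INTEGER CHECK (age_years BETWEEN 3 AND 5)
);
CREATE TABLE IF NOT EXISTS utterance (
    utt_id     INTEGER PRIMARY KEY,
    child_id   INTEGER REFERENCES child(child_id),
    word       TEXT NOT NULL,          -- MCDI-K word prompted
    repetition INTEGER CHECK (repetition IN (1, 2, 3)),
    wav_path   TEXT NOT NULL           -- path to the segmented recording
);
""")
conn.execute("INSERT INTO child (child_id, age_years) VALUES (1, 4)")
conn.execute(
    "INSERT INTO utterance (child_id, word, repetition, wav_path) "
    "VALUES (1, '강아지', 1, 'wav/c001_강아지_1.wav')"
)
conn.commit()
```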

The Study of Comparison between RPE-LTP and VSELP Speech Coder (RPE-LTP와 VSELP 음성부호화기의 비교에 관한 연구)

  • 박대덕;김화준;심재훈;유재희;정하봉;서정하
    • The Journal of Korean Institute of Communications and Information Sciences / v.19 no.9 / pp.1838-1847 / 1994
  • Standards for digital mobile communication speech coding have recently been decided in North America, Europe, and Japan, with increasingly refined techniques developed competitively, but Korea has not yet chosen one. In this paper, we compare the RPE-LTP speech coding algorithm, the European standard, with the VSELP speech coding algorithm, the North American standard, with respect to source coding. We give a comprehensive description and comparison of each speech coder and discuss possible improvements. We also compare the computational load, which strongly affects real-time processing, and run simulations of each coder on Korean speech data, implementing each algorithm concretely. Finally, we compare the performance of the coders using segmental SNR and 5-point MOS. The multiplication count of the VSELP encoder was the largest. Over 26 speech samples, the segmental SNR of VSELP was higher than that of RPE-LTP, and the 5-point MOS test showed that the basic speech quality of VSELP was equivalent to or better than that of RPE-LTP. (A sketch of segmental SNR follows below.)

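A sketch of the segmental SNR measure used above to compare coders: average the per-frame SNR between the original signal and the coded/decoded signal. The 20 ms frame and the [-10, 35] dB clamp are common conventions assumed here, not values from the paper.

```python
import numpy as np

def segmental_snr(clean, coded, frame_len=320):  # 320 samples = 20 ms @ 16 kHz
    snrs = []
    for i in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[i : i + frame_len]
        e = s - coded[i : i + frame_len]
        snr = 10 * np.log10(np.sum(s**2) / (np.sum(e**2) + 1e-12))
        snrs.append(np.clip(snr, -10.0, 35.0))   # clamp silent/noisy frames
    return float(np.mean(snrs))

clean = np.random.randn(16000)
coded = clean + 0.05 * np.random.randn(16000)    # stand-in for codec output
print(f"segSNR = {segmental_snr(clean, coded):.1f} dB")
```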

The Relationship between Age and Speech Improvement in the Patients Performed Pharyngeal Flap for Correction of Velopharyngeal Dysfunction (구개인두기능부전의 교정을 위한 인두피판술의 나이에 따른 발음 개선 효과)

  • Kim, Kyoung-Hoon;Bae, Yong-Chan;Nam, Su-Bong;Choi, Soo-Jong;Kang, Cheol-Uk
    • Archives of Plastic Surgery / v.36 no.3 / pp.294-298 / 2009
  • Purpose: The pharyngeal flap is a popular surgical method for treating velopharyngeal dysfunction. This study evaluated the speech outcomes of patients who underwent superiorly based pharyngeal flap surgery according to the timing of surgery. Methods: A retrospective review of 50 patients who underwent pharyngeal flap surgery for velopharyngeal insufficiency between September 1996 and January 2008 was undertaken. Thirty patients with available preoperative and postoperative speech assessments and at least 6 months of follow-up were included in this study. We assessed the significance of postoperative speech improvement by analyzing preoperative and postoperative speech-assessment scores, and investigated both the direct relationship between age at surgery and degree of speech improvement and the improvement scores in different age groups. Results: The mean preoperative speech score was 52.6±7.4 points and the mean postoperative score was 58.6±6.5 points, a significant postoperative improvement averaging 5.9 points (p<0.01). There was a significant inverse relationship between age at operation and degree of speech improvement (p<0.01, r=-0.54). Among the age groups, the group aged 4 to 5 years showed statistically significant speech improvement (p<0.01). Conclusion: We propose that all indicated patients undergo pharyngeal flap surgery irrespective of age. In this study, the younger the age at surgery, the greater the speech improvement, so we suggest that surgery be undertaken as early as possible, especially before age 5. (A sketch of the correlation analysis appears below.)
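A sketch of the age-versus-improvement correlation analysis reported above, using scipy.stats.pearsonr. The data below are fabricated stand-ins; the study itself reported r = -0.54, p < 0.01 on its 30 patients.

```python
import numpy as np
from scipy.stats import pearsonr

age_at_surgery = np.array([4, 5, 5, 6, 7, 9, 11, 14, 16, 20], dtype=float)
improvement = np.array([9, 8, 9, 7, 6, 5, 4, 4, 3, 2], dtype=float)

r, p = pearsonr(age_at_surgery, improvement)
print(f"Pearson r = {r:.2f}, p = {p:.4f}")  # expect a negative r here
```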

Effects of Age and Type of Stimulus on the Cortical Auditory Evoked Potential in Healthy Malaysian Children

  • Mukari, Siti Zamratol-Mai Sarah;Umat, Cila;Chan, Soon Chien;Ali, Akmaliza;Maamor, Nashrah;Zakaria, Mohd Normani
    • Journal of Audiology & Otology / v.24 no.1 / pp.35-39 / 2020
  • Background and Objectives: The cortical auditory evoked potential (CAEP) is a useful objective test for diagnosing hearing loss and auditory disorders. Prior to its clinical application in the pediatric population, the possible influences of fundamental variables on the CAEP should be studied. The aim of the present study was to determine the effects of age and type of stimulus on the CAEP waveforms. Subjects and Methods: Thirty-five healthy Malaysian children aged 4 to 12 years participated in this repeated-measures study. The CAEP waveforms were recorded from each child using a 1 kHz tone burst and the speech syllable /ba/. Latencies and amplitudes of the P1, N1, and P2 peaks were analyzed accordingly. Results: Significant negative correlations were found between age and speech-evoked CAEP latency for each peak (p<0.05). However, no significant correlations were found between age and tone-evoked CAEP amplitudes and latencies (p>0.05). The speech syllable /ba/ produced a higher mean P1 amplitude than the 1 kHz tone burst (p=0.001). Conclusions: The CAEP latencies recorded with the speech syllable became shorter with age. While both tone-burst and speech stimuli were appropriate for recording the CAEP, significantly larger amplitudes were found in the speech-evoked CAEP. The preliminary normative CAEP data provided in the present study may be beneficial for clinical and research applications in Malaysian children. (A sketch of the peak analysis appears below.)
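A sketch of extracting P1/N1/P2 latencies and amplitudes from an averaged CAEP waveform, roughly as analyzed above. The search windows, polarity conventions, and synthetic waveform are assumptions; clinical pipelines typically mark peaks by hand or with vendor software.

```python
import numpy as np

def peak_in_window(t_ms, wave, lo, hi, positive=True):
    # Find the extremum of the waveform inside a latency window.
    mask = (t_ms >= lo) & (t_ms <= hi)
    idx = np.argmax(wave[mask]) if positive else np.argmin(wave[mask])
    i = np.flatnonzero(mask)[idx]
    return t_ms[i], wave[i]          # (latency in ms, amplitude in uV)

fs = 1000.0                                      # 1 kHz EEG sampling rate
t = np.arange(0, 500) / fs * 1000                # 0-500 ms epoch
wave = (np.exp(-((t - 100) / 25) ** 2)           # synthetic P1-N1-P2 shape
        - 1.2 * np.exp(-((t - 180) / 30) ** 2)
        + 0.8 * np.exp(-((t - 280) / 40) ** 2))

p1 = peak_in_window(t, wave, 80, 150, positive=True)
n1 = peak_in_window(t, wave, 150, 230, positive=False)
p2 = peak_in_window(t, wave, 230, 350, positive=True)
print(f"P1 {p1}, N1 {n1}, P2 {p2}")
```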

A Study on the Impact of Speech Data Quality on Speech Recognition Models

  • Yeong-Jin Kim;Hyun-Jong Cha;Ah Reum Kang
    • Journal of the Korea Society of Computer and Information / v.29 no.1 / pp.41-49 / 2024
  • Speech recognition technology continues to advance and is widely used in many fields. In this study, we investigated the impact of speech data quality on speech recognition models by comparing the entire dataset with its top 70% ranked by signal-to-noise ratio (SNR). Using Seamless M4T and Google Cloud Speech-to-Text, we examined the transcription results of each model and evaluated them with the Levenshtein distance. Experimental results showed that Seamless M4T scored 13.6 on the high-SNR data, lower (i.e., better) than its score of 16.6 on the entire dataset, while Google Cloud Speech-to-Text scored 8.3 on the entire dataset, performing worse there than on the high-SNR data. This suggests that using high-SNR data when training a new speech recognition model can affect performance, and that the Levenshtein distance can serve as a metric for evaluating speech recognition models. (A sketch of the evaluation metric appears below.)
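A sketch of the Levenshtein-distance evaluation used above: the standard dynamic-programming edit distance between a reference transcript and an ASR hypothesis. The example strings are illustrative, not from the study's data.

```python
def levenshtein(ref: str, hyp: str) -> int:
    # Classic row-by-row DP over insertions, deletions, and substitutions.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

# Example: score one hypothesis against its reference; lower is better.
print(levenshtein("음성 인식 기술", "음성 인식 기수"))  # -> 1
```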