• Title/Summary/Keyword: Speech recognition model

An Implementation of the Vocabulary Independent Speech Recognition System Using VCCV Unit (VCCV단위를 이용한 어휘독립 음성인식 시스템의 구현)

  • 윤재선;홍광석
    • The Journal of the Acoustical Society of Korea / v.21 no.2 / pp.160-166 / 2002
  • In this paper, we implement a new vocabulary-independent speech recognition system that uses CV, VCCV, and VC recognition units. Since these units are extracted in the vowel region of a syllable, segmentation is easy and robust. When a VCCV unit does not exist, it is replaced by a combination of VC and CV semi-syllable models (a back-off sketch follows below). Clustering the vowel groups and applying a combination rule to the substitution model when no VCCV model exists improve first-candidate recognition performance by 5.2%, from 90.4% (Model A) to 95.6% (Model C). A recognition rate of 98.8% for the second candidate confirms the effectiveness of the proposed method.
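
A minimal sketch of that back-off rule, assuming a hypothetical "V-CC-V" label notation and unit inventory (not the authors' actual model set):

```python
def vccv_split(vccv):
    """Split a VCCV label of the form 'V-CC-V' into VC and CV parts."""
    v1, cc, v2 = vccv.split("-")
    return f"{v1}-{cc[0]}", f"{cc[1]}-{v2}"

def units_for_boundary(vccv, inventory):
    """Return the unit(s) covering a V-CC-V syllable boundary, backing off
    to a VC + CV semi-syllable pair when the exact VCCV model is missing."""
    if vccv in inventory:
        return [vccv]                    # VCCV modeled directly
    vc, cv = vccv_split(vccv)
    return [vc, cv]                      # substitute semi-syllable pair

inventory = {"a-ks-o"}                   # hypothetical trained VCCV models
print(units_for_boundary("a-ks-o", inventory))   # ['a-ks-o']
print(units_for_boundary("a-nt-e", inventory))   # ['a-n', 't-e']
```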

Statistical Model-Based Noise Reduction Approach for Car Interior Applications to Speech Recognition

  • Lee, Sung-Joo;Kang, Byung-Ok;Jung, Ho-Young;Lee, Yun-Keun;Kim, Hyung-Soon
    • ETRI Journal / v.32 no.5 / pp.801-809 / 2010
  • This paper presents a statistical model-based noise suppression approach for voice recognition in a car environment. To alleviate the spectral whitening and signal distortion problems of the traditional decision-directed Wiener filter (sketched below), we combine the decision-directed method with an original spectrum reconstruction method and develop a new two-stage noise reduction filter estimation scheme. Considering the tradeoff between performance and computational efficiency on resource-constrained automotive devices, the ETSI standard advanced front-end for distributed speech recognition (ETSI-AFE), which is also based on the decision-directed Wiener filter, can be an effective solution. Thus, a series of voice recognition and computational complexity tests are conducted comparing the proposed approach with ETSI-AFE. The experimental results show that the proposed approach is superior to the conventional method in terms of speech recognition accuracy, while the computational cost and frame latency are significantly reduced.
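
For context, a minimal sketch of the classic decision-directed Wiener gain estimation that both the proposed scheme and ETSI-AFE build on; the smoothing factor alpha and the fixed noise PSD are illustrative assumptions, and the paper's two-stage spectrum reconstruction is not reproduced here.

```python
import numpy as np

def dd_wiener_gains(power_spec, noise_psd, alpha=0.98, floor=1e-3):
    """power_spec: (frames, bins) noisy power spectrum.
    noise_psd: (bins,) estimated noise power spectrum."""
    gains = np.empty_like(power_spec)
    prev_clean = np.zeros(power_spec.shape[1])    # |S|^2 of previous frame
    for m, frame in enumerate(power_spec):
        gamma = frame / noise_psd                 # a posteriori SNR
        xi = alpha * prev_clean / noise_psd \
             + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)   # a priori SNR
        g = np.maximum(xi / (1.0 + xi), floor)    # Wiener gain with floor
        gains[m] = g
        prev_clean = (g ** 2) * frame             # decision-directed feedback
    return gains

noisy = np.abs(np.random.randn(50, 129)) ** 2     # dummy noisy power spectra
noise = np.full(129, 0.5)                         # dummy flat noise PSD
g = dd_wiener_gains(noisy, noise)                 # per-bin suppression gains
```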

Adaptive Speech Emotion Recognition Framework Using Prompted Labeling Technique (프롬프트 레이블링을 이용한 적응형 음성기반 감정인식 프레임워크)

  • Bang, Jae Hun;Lee, Sungyoung
    • KIISE Transactions on Computing Practices / v.21 no.2 / pp.160-165 / 2015
  • Traditional speech emotion recognition techniques recognize emotions with a general training model built from the voices of many people. Such techniques cannot accurately account for individual speech characteristics, so recognition results vary considerably from person to person. This paper proposes an adaptive speech emotion recognition framework that builds a personalized recognition model from each user's immediate feedback, collected with a prompted labeling technique, and applies it to each user in a mobile device environment (a minimal sketch of the feedback loop follows). The proposed framework was shown to outperform traditional techniques in three comparative experiments. It can be applied to healthcare, emotion monitoring, and personalized services.
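
A toy sketch of the prompted-labeling loop, assuming a simple k-NN classifier as the personal model; the class, the classifier choice, and the feature vectors are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

class AdaptiveEmotionRecognizer:
    def __init__(self, base_X, base_y):
        self.X, self.y = list(base_X), list(base_y)   # general training data
        self.model = KNeighborsClassifier(n_neighbors=3).fit(base_X, base_y)

    def recognize_and_adapt(self, features, prompt_user):
        predicted = self.model.predict([features])[0]
        label = prompt_user(predicted)            # prompted labeling step
        self.X.append(features)
        self.y.append(label)
        self.model.fit(np.array(self.X), self.y)  # personalized re-training
        return label

X0 = np.array([[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]])   # dummy features
y0 = ["neutral", "happy", "neutral"]
rec = AdaptiveEmotionRecognizer(X0, y0)
# the prompt callback stands in for the on-device user dialog
rec.recognize_and_adapt([0.85, 0.9], prompt_user=lambda guess: "happy")
```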

Performance of speech recognition unit considering morphological pronunciation variation (형태소 발음변이를 고려한 음성인식 단위의 성능)

  • Bang, Jeong-Uk;Kim, Sang-Hun;Kwon, Oh-Wook
    • Phonetics and Speech Sciences / v.10 no.4 / pp.111-119 / 2018
  • This paper proposes a method to improve speech recognition performance by extracting the various pronunciations of pseudo-morpheme units from an eojeol-unit corpus and generating new recognition units that account for pronunciation variation. In the proposed method, we first align the pronunciations of the eojeol units and the pseudo-morpheme units, and then expand the pronunciation dictionary by extracting new pronunciations of the pseudo-morpheme units from the pronunciations of the eojeol units. We then propose a new pronunciation-dependent recognition unit obtained by tagging the extracted phoneme symbols according to their pseudo-morpheme units. The proposed units and their expanded pronunciations are incorporated into the lexicon and language model of the speech recognizer. Performance is evaluated with a Korean speech recognizer using a trigram language model built from a 100-million pseudo-morpheme corpus and an acoustic model trained on 445 hours of multi-genre broadcast speech. The proposed method relatively reduces the word error rate by 13.8% on the news-genre evaluation data and by 4.5% on the total evaluation data (see the worked example below).
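
As a quick check of what "relative reduction" means here, with a made-up baseline WER (the paper reports only the relative figures):

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0

baseline = 10.0                       # hypothetical baseline WER (%)
improved = baseline * (1 - 0.138)     # a 13.8% relative reduction
print(f"improved WER: {improved:.2f}%")                            # 8.62%
print(f"relative: {relative_reduction(baseline, improved):.1f}%")  # 13.8%
```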

A Study on the Impact of Speech Data Quality on Speech Recognition Models

  • Yeong-Jin Kim;Hyun-Jong Cha;Ah Reum Kang
    • Journal of the Korea Society of Computer and Information / v.29 no.1 / pp.41-49 / 2024
  • Speech recognition technology is continuously advancing and is widely used in various fields. In this study, we investigated the impact of speech data quality on speech recognition models by comparing the entire dataset with its top 70% ranked by signal-to-noise ratio (SNR). Using Seamless M4T and Google Cloud Speech-to-Text, we examined the transcription results of each model and evaluated them with the Levenshtein distance, where lower scores are better (a minimal implementation follows). Seamless M4T scored 13.6 on the high-SNR data, better than its score of 16.6 on the entire dataset. Likewise, Google Cloud Speech-to-Text scored 8.3 on the entire dataset, performing worse there than on the high-SNR data. This suggests that using high-SNR data when training a new speech recognition model can make a difference, and that the Levenshtein distance can serve as a metric for evaluating speech recognition models.
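
A self-contained sketch of the two ingredients of this evaluation: a plain Levenshtein distance and a top-70%-by-SNR subset selection. The sample strings and the (snr, utterance) pairing are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters here)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def top_snr_subset(utterances, frac=0.7):
    """Keep the highest-SNR fraction of (snr_db, utterance_id) pairs."""
    ranked = sorted(utterances, key=lambda u: u[0], reverse=True)
    return ranked[:int(len(ranked) * frac)]

print(levenshtein("speech model", "speach model"))   # 1 substitution -> 1
```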

Building robust Korean speech recognition model by fine-tuning large pretrained model (대형 사전훈련 모델의 파인튜닝을 통한 강건한 한국어 음성인식 모델 구축)

  • Changhan Oh;Cheongbin Kim;Kiyoung Park
    • Phonetics and Speech Sciences / v.15 no.3 / pp.75-82 / 2023
  • Automatic speech recognition (ASR) has been revolutionized by deep learning-based approaches, among which self-supervised learning methods have proven particularly effective. In this study, we aim to enhance the performance of OpenAI's Whisper model, a multilingual ASR system, on Korean. Whisper was pretrained on a large corpus (around 680,000 hours) of web speech data and has demonstrated strong recognition performance for major languages. However, it struggles with languages such as Korean that were not major languages in its training data. We address this issue by fine-tuning the Whisper model with an additional dataset of about 1,000 hours of Korean speech, and we compare its performance against a Transformer model trained from scratch on the same dataset. Our results indicate that fine-tuning significantly improved Whisper's Korean speech recognition in terms of character error rate (CER), and that performance improved with increasing model size. However, the Whisper model's performance on English deteriorated after fine-tuning, emphasizing the need for further research toward robust multilingual models. Our study demonstrates the potential of a fine-tuned Whisper model for Korean ASR applications (a CER-scoring sketch follows); future work will focus on multilingual recognition and optimization for real-time inference.
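
A hedged sketch of scoring a Whisper checkpoint by CER with the Hugging Face transformers and jiwer libraries; the whisper-small checkpoint, the silent dummy input, and the decoding options are assumptions for illustration, not the authors' fine-tuned setup.

```python
import numpy as np
import jiwer
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def transcribe(waveform_16k):
    """Decode one 16 kHz mono waveform to Korean text."""
    inputs = processor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    ids = model.generate(inputs.input_features, language="ko", task="transcribe")
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

waveform = np.zeros(16000, dtype=np.float32)   # dummy 1 s of silence
reference = "오늘 날씨가 좋습니다"                 # ground-truth transcript
hypothesis = transcribe(waveform)
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")
```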

A Real-Time Implementation of Speech Recognition System Using Oak DSP core in the Car Noise Environment (자동차 환경에서 Oak DSP 코어 기반 음성 인식 시스템 실시간 구현)

  • Woo, K.H.;Yang, T.Y.;Lee, C.;Youn, D.H.;Cha, I.H.
    • Speech Sciences / v.6 / pp.219-233 / 1999
  • This paper presents a real-time implementation of a speaker-independent speech recognition system based on a discrete hidden Markov model (DHMM). The system is developed for a car navigation application, targeting an on-chip VLSI speech recognition system built on the fixed-point Oak DSP core of DSP GROUP Ltd. We analyze the recognition procedure in C to implement fixed-point real-time algorithms. Based on these analyses, we improve the algorithms so that they run in real time and the recognition result is available as soon as the utterance ends, by processing all recognition routines within a single frame. Car noise is colored noise concentrated heavily below 400 Hz. For noise-robust processing, high-pass filtering and liftering of the distance measure of the feature vectors are applied to the recognition system (an illustrative filter follows). Recognition experiments on twelve isolated command words were performed. The recognition rates of the baseline recognizer were 98.68% with the car stopped and 80.7% while driving; with the noise processing methods, the rate while driving improved to 89.04%.
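
The low-frequency character of the car noise invites a simple high-pass front end. Below is an illustrative scipy version with an assumed 16 kHz sampling rate and filter order, not the paper's exact fixed-point configuration.

```python
import numpy as np
from scipy.signal import butter, lfilter

def highpass_400hz(signal, fs=16000, order=4):
    """Attenuate car-noise energy concentrated below 400 Hz."""
    b, a = butter(order, 400.0, btype="highpass", fs=fs)
    return lfilter(b, a, signal)

fs = 16000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 120 * t) + 0.1 * np.sin(2 * np.pi * 1000 * t)
clean = highpass_400hz(noisy, fs)      # 120 Hz rumble strongly attenuated
```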

Filtering of Filter-Bank Energies for Robust Speech Recognition

  • Jung, Ho-Young
    • ETRI Journal / v.26 no.3 / pp.273-276 / 2004
  • We propose a novel feature processing technique that provides a cepstral liftering effect in the log-spectral domain. Cepstral liftering equalizes the variances of cepstral coefficients for distance-based speech recognizers and, as a result, provides robustness to additive noise and speaker variability; in the popular hidden Markov model based framework, however, cepstral liftering has no effect on recognition performance. We derive a filtering method in the log-spectral domain that corresponds to cepstral liftering (illustrated below). The proposed method performs high-pass filtering based on the decorrelation of filter-bank energies. We show that in noisy speech recognition, the proposed method reduces the error rate by 52.7% relative to conventional features.
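
A numeric sketch of the underlying identity: liftering the cepstrum equals a fixed linear transform of the log filter-bank energies. The HTK-style sinusoidal lifter is an illustrative choice, not necessarily the paper's filter.

```python
import numpy as np
from scipy.fftpack import dct, idct

def lifter_weights(n, L=22):
    """Common sinusoidal cepstral lifter (illustrative)."""
    q = np.arange(n)
    return 1.0 + (L / 2.0) * np.sin(np.pi * q / L)

def lifter_in_log_spectral_domain(log_fbank, L=22):
    """log_fbank: (frames, n_filters) log filter-bank energies."""
    c = dct(log_fbank, type=2, norm="ortho", axis=-1)   # to cepstrum
    c *= lifter_weights(log_fbank.shape[-1], L)         # cepstral liftering
    return idct(c, type=2, norm="ortho", axis=-1)       # back to log spectrum

log_fbank = np.random.randn(3, 23)     # dummy energies, 23 filters
filtered = lifter_in_log_spectral_domain(log_fbank)
```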

Performance Comparison of Multiple-Model Speech Recognizer with Multi-Style Training Method Under Noisy Environments (잡음 환경하에서의 다 모델 기반인식기와 다 스타일 학습방법과의 성능비교)

  • Yoon, Jang-Hyuk;Chung, Young-Joo
    • The Journal of the Acoustical Society of Korea / v.29 no.2E / pp.100-106 / 2010
  • The multiple-model speech recognizer has been shown to be quite successful in noisy speech recognition. However, its performance has usually been tested with general speech front-ends that do not incorporate any noise-adaptive algorithms. For an accurate evaluation of the effectiveness of the multiple-model framework in noisy speech recognition, we used state-of-the-art front-ends and compared its performance with the well-known multi-style training method. In addition, we improved the multiple-model speech recognizer by employing the N-best reference HMMs for interpolation (sketched below) and by using multiple SNR levels for training each reference HMM.
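
A hypothetical sketch of the N-best interpolation idea: choose the N reference models whose training SNR is closest to the estimated test SNR and combine their parameters with distance-based weights. The inverse-distance weighting and flat parameter vectors are assumptions for illustration.

```python
import numpy as np

def interpolate_models(test_snr, ref_models, n_best=3):
    """ref_models: list of (training_snr_db, parameter_vector) pairs."""
    ranked = sorted(ref_models, key=lambda m: abs(m[0] - test_snr))[:n_best]
    dists = np.array([abs(snr - test_snr) for snr, _ in ranked])
    weights = 1.0 / (dists + 1e-6)        # closer reference, larger weight
    weights /= weights.sum()
    params = np.array([p for _, p in ranked])
    return weights @ params               # weighted parameter average

refs = [(0, np.array([1.0, 2.0])),
        (10, np.array([2.0, 1.0])),
        (20, np.array([3.0, 0.0]))]
print(interpolate_models(7.5, refs, n_best=2))
```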

A study on recognition improvement of velopharyngeal insufficiency patient's speech using various types of deep neural network (심층신경망 구조에 따른 구개인두부전증 환자 음성 인식 향상 연구)

  • Kim, Min-seok;Jung, Jae-hee;Jung, Bo-kyung;Yoon, Ki-mu;Bae, Ara;Kim, Wooil
    • The Journal of the Acoustical Society of Korea / v.38 no.6 / pp.703-709 / 2019
  • This paper proposes speech recognition systems that combine Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) structures with a Hidden Markov Model (HMM) to effectively recognize the speech of VeloPharyngeal Insufficiency (VPI) patients, and compares their recognition performance with Gaussian Mixture Model (GMM-HMM) and fully-connected Deep Neural Network (DNN-HMM) based systems. The initial model is trained on normal speakers' speech, and simulated VPI speech is used to generate a prior model for speaker adaptation. For VPI speaker adaptation, selected layers are retrained in the CNN-HMM based model (see the sketch below), and a dropout regularization technique is applied in the LSTM-HMM based model, yielding a 3.68% improvement in recognition accuracy. The experimental results demonstrate that the proposed LSTM-HMM based speech recognition system is effective for VPI speech with small amounts of speech data, compared to the conventional GMM-HMM and fully-connected DNN-HMM systems.
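
A hedged PyTorch sketch of layer-selective adaptation: freeze the pretrained acoustic model and retrain only chosen layers on the target speaker's data. The tiny CNN and the choice of which layer to unfreeze are illustrative, not the authors' architecture.

```python
import torch
import torch.nn as nn

# toy CNN acoustic model mapping 40-dim filter-bank frames to senones
model = nn.Sequential(
    nn.Conv1d(40, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 100, 512), nn.ReLU(),   # assumes 100-frame inputs
    nn.Linear(512, 1000),                  # senone posteriors
)

for p in model.parameters():               # freeze everything first
    p.requires_grad = False
for p in model[-1].parameters():           # adapt only the output layer
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)   # adaptation optimizer
```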