• Title/Summary/Keyword: Speech Learning Model


Telephone Digit Speech Recognition using Discriminant Learning (Discriminant 학습을 이용한 전화 숫자음 인식)

  • 한문성;최완수;권현직
    • Journal of the Institute of Electronics Engineers of Korea TE / v.37 no.3 / pp.16-20 / 2000
  • Most speech recognition systems use the Hidden Markov Model (HMM), which is based on statistical modeling. In Korean isolated telephone digit recognition, HMMs achieve a high recognition rate when a large amount of training data is available. In Korean continuous telephone digit recognition, however, HMMs have limitations in discriminating similar digits. In this paper we propose a way to overcome these limitations by applying discriminant learning based on the minimum classification error criterion to Korean continuous telephone digit recognition. The experimental results show that our method achieves a high recognition rate for similar telephone digits.
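The minimum classification error criterion mentioned above can be sketched briefly. This is a generic illustration of the MCE loss, not the paper's implementation; the discriminant scores and the smoothing constants `eta` and `gamma` are assumed values.

```python
import numpy as np

def mce_loss(scores, correct, eta=2.0, gamma=1.0):
    """Minimum classification error (MCE) loss for one utterance.

    scores:  discriminant score g_k for each digit class
    correct: index of the correct class
    eta:     smoothing exponent for the competing-class average
    gamma:   slope of the sigmoid smoothing function
    """
    scores = np.asarray(scores, dtype=float)
    others = np.delete(scores, correct)
    # anti-discriminant: a smoothed maximum over the competing classes
    g_anti = np.log(np.mean(np.exp(eta * others))) / eta
    d = -scores[correct] + g_anti             # misclassification measure
    return 1.0 / (1.0 + np.exp(-gamma * d))   # smooth 0/1 error count

# A correctly classified utterance yields a loss near 0,
# a misclassified one a loss near 1.
print(mce_loss([5.0, 1.0, 0.5], correct=0))
print(mce_loss([1.0, 5.0, 0.5], correct=0))
```

Minimizing this smoothed error directly penalizes confusable classes, which is why it helps with similar-sounding digits.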


Data Augmentation for DNN-based Speech Enhancement (딥 뉴럴 네트워크 기반의 음성 향상을 위한 데이터 증강)

  • Lee, Seung Gwan;Lee, Sangmin
    • Journal of Korea Multimedia Society / v.22 no.7 / pp.749-758 / 2019
  • This paper proposes a data augmentation algorithm to improve the performance of DNN (Deep Neural Network)-based speech enhancement. Many deep learning models seek to maximize performance from a limited amount of data; the most commonly used approach is data augmentation, a technique that artificially increases the amount of training data. For effective augmentation, we used a formant enhancement method that assigns different weights to the formant frequencies. The DNN model trained with the proposed augmentation algorithm was evaluated in various noise environments, and its speech enhancement performance was compared with that of a DNN model trained with conventional data augmentation and one trained without augmentation. As a result, the proposed data augmentation algorithm showed higher speech enhancement performance than the other algorithms.
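The formant-weighting idea can be illustrated with a toy sketch. The formant locations, the Gaussian weighting, and the gain below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def formant_enhance(mag_spec, freqs, formants=(500, 1500, 2500),
                    bandwidth=100.0, gain=1.5):
    """Toy formant-enhancement augmentation: boost spectral magnitude
    around assumed formant frequencies with Gaussian-shaped weights."""
    weight = np.ones_like(mag_spec)
    for f in formants:
        weight += (gain - 1.0) * np.exp(-0.5 * ((freqs - f) / bandwidth) ** 2)
    return mag_spec * weight

freqs = np.linspace(0, 4000, 257)   # bins of an 8 kHz, 512-point STFT
mag = np.ones_like(freqs)           # flat spectrum, just for illustration
aug = formant_enhance(mag, freqs)
print(aug[np.argmin(np.abs(freqs - 1500))])  # boosted near a formant
```

Each weighted copy of an utterance then serves as an additional training example for the enhancement DNN.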

Fatigue Classification Model Based On Machine Learning Using Speech Signals (음성신호를 이용한 기계학습 기반 피로도 분류 모델)

  • Lee, Soo Hwa;Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology / v.8 no.6 / pp.741-747 / 2022
  • Fatigue lowers an individual's ability and makes it difficult to perform work. As fatigue accumulates, concentration decreases and the possibility of causing a safety accident increases. Awareness of fatigue is subjective, but quantitative measurement of fatigue levels is needed in the field. Previous studies proposed measuring fatigue levels through expert judgment, combining objective indicators such as bio-signal analysis with subjective evaluations such as multidimensional fatigue scales. However, this approach makes it difficult to evaluate fatigue in real time in daily life. This paper studies a fatigue classification model that determines workers' fatigue levels in real time using speech data recorded in the field. Machine learning models such as logistic classification, support vector machine, and random forest were trained on speech data collected in the field. The performance evaluation showed good results, with accuracy ranging from 0.677 to 0.758; logistic classification performed best. The experimental results show that fatigue levels can be classified using speech signals.
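A minimal sketch of the classification setup, assuming synthetic stand-ins for acoustic features (the paper's actual features and data are not shown here), using the logistic classifier that performed best:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for acoustic features (e.g. pitch, jitter, energy);
# fatigued speech is assumed here to shift the feature means.
n = 200
X_rest = rng.normal(1.0, 0.5, size=(n, 3))
X_tired = rng.normal(0.0, 0.5, size=(n, 3))
X = np.vstack([X_rest, X_tired])
y = np.array([0] * n + [1] * n)

# Minimal logistic-regression classifier trained by gradient descent.
w, b = np.zeros(3), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((1 / (1 + np.exp(-(X @ w + b)))) > 0.5) == y)
print(f"training accuracy: {acc:.3f}")
```

Support vector machines and random forests slot into the same pipeline: extract per-utterance features, then fit a binary fatigued/rested classifier.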

Performance comparison evaluation of real and complex networks for deep neural network-based speech enhancement in the frequency domain (주파수 영역 심층 신경망 기반 음성 향상을 위한 실수 네트워크와 복소 네트워크 성능 비교 평가)

  • Hwang, Seo-Rim;Park, Sung Wook;Park, Youngcheol
    • The Journal of the Acoustical Society of Korea / v.41 no.1 / pp.30-37 / 2022
  • This paper compares and evaluates model performance from two perspectives, the learning target and the network structure, for training Deep Neural Network (DNN)-based speech enhancement models in the frequency domain. Spectrum mapping and Time-Frequency (T-F) masking techniques were used as learning targets, and a real network and a complex network were used as network structures. The performance of the speech enhancement models was evaluated through two objective metrics, Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI), as a function of dataset size. Test results show that the appropriate amount of training data differs depending on the type of network and the type of dataset. They also show that, in some cases, a real network may be the more practical choice when the total number of parameters is considered, since it achieves relatively higher performance than the complex network depending on the data size and the learning target.
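The real-vs-complex distinction above comes down to what a T-F mask can change. A minimal numpy illustration, where the mask values are random placeholders standing in for network outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
stft = rng.normal(size=8) + 1j * rng.normal(size=8)  # one frame of noisy STFT bins

# Real (magnitude) T-F mask: scales magnitudes, keeps the noisy phase.
real_mask = rng.uniform(0, 1, size=8)
enhanced_real = real_mask * stft

# Complex ratio mask: has real and imaginary parts, so the network
# can also correct the phase, at the cost of more parameters.
complex_mask = rng.uniform(-1, 1, size=8) + 1j * rng.uniform(-1, 1, size=8)
enhanced_complex = complex_mask * stft

# A positive real mask never changes the phase; a complex mask can.
print(np.allclose(np.angle(enhanced_real), np.angle(stft)))     # True
print(np.allclose(np.angle(enhanced_complex), np.angle(stft)))
```

This phase-modeling capacity is what the paper weighs against the extra parameters of the complex network.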

Exploring the feasibility of fine-tuning large-scale speech recognition models for domain-specific applications: A case study on Whisper model and KsponSpeech dataset

  • Jungwon Chang;Hosung Nam
    • Phonetics and Speech Sciences / v.15 no.3 / pp.83-88 / 2023
  • This study investigates the fine-tuning of large-scale Automatic Speech Recognition (ASR) models, specifically OpenAI's Whisper model, for domain-specific applications using the KsponSpeech dataset. The primary research questions address the effectiveness of targeted lexical item emphasis during fine-tuning, its impact on domain-specific performance, and whether the fine-tuned model can maintain generalization capabilities across different languages and environments. Experiments were conducted using two fine-tuning datasets: Set A, a small subset emphasizing specific lexical items, and Set B, consisting of the entire KsponSpeech dataset. Results showed that fine-tuning with targeted lexical items increased recognition accuracy and improved domain-specific performance, with generalization capabilities maintained when fine-tuned with a smaller dataset. For noisier environments, a trade-off between specificity and generalization capabilities was observed. This study highlights the potential of fine-tuning using minimal domain-specific data to achieve satisfactory results, emphasizing the importance of balancing specialization and generalization for ASR models. Future research could explore different fine-tuning strategies and novel technologies such as prompting to further enhance large-scale ASR models' domain-specific performance.
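The recognition accuracy discussed above is conventionally reported as word error rate (WER). A minimal edit-distance implementation of WER, for illustration only (not the paper's evaluation code):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions)
    divided by reference length, via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))       # 0.0
print(wer("the cat sat", "the bat sat down"))  # 2/3
```

Comparing WER on the targeted lexical items versus a general test set is one way to quantify the specialization/generalization trade-off the study describes.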

Training Method and Speaker Verification Measures for Recurrent Neural Network based Speaker Verification System

  • Kim, Tae-Hyung
    • The Journal of Korean Institute of Communications and Information Sciences / v.34 no.3C / pp.257-267 / 2009
  • This paper presents a training method for neural networks and the use of MSE (mean square error) values as the basis for deciding on a speaker's identity claim in a recurrent neural network based speaker verification system. Recurrent neural networks (RNNs) are employed to capture the temporally dynamic characteristics of the speech signal. In the supervised learning process for the RNNs, target outputs are generated automatically and made to represent the temporal variation of the input speech sounds. To increase the capability of discriminating between the true speaker and an impostor, a discriminative training method for RNNs is presented. This paper shows the use and effectiveness of the MSE value, obtained from the Euclidean distance between the target outputs and the network outputs for a speaker's test speech, as the basis of speaker verification. In terms of equal error rates, results of experiments performed on a Korean speech database show that the proposed speaker verification system outperforms a conventional hidden Markov model based speaker verification system.
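The MSE-based decision rule described above can be sketched in a few lines. The target trajectory and threshold below are illustrative placeholders; in practice the threshold would be tuned on development data (e.g. at the equal error rate point):

```python
import numpy as np

def verify(target_outputs, network_outputs, threshold):
    """Accept the identity claim if the mean square error between the
    speaker-specific target outputs and the RNN outputs for the test
    utterance falls below a threshold."""
    diff = np.asarray(target_outputs) - np.asarray(network_outputs)
    return float(np.mean(diff ** 2)) < threshold

targets = np.array([0.9, 0.1, 0.8, 0.2])   # illustrative target trajectory
true_speaker = targets + 0.05               # outputs close to the targets
impostor = 1.0 - targets                    # outputs far from the targets

print(verify(targets, true_speaker, threshold=0.05))  # True
print(verify(targets, impostor, threshold=0.05))      # False
```

Sweeping the threshold trades false acceptances against false rejections, which is how the equal error rate reported in the paper is obtained.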

Korean speech recognition based on grapheme (문자소 기반의 한국어 음성인식)

  • Lee, Mun-hak;Chang, Joon-Hyuk
    • The Journal of the Acoustical Society of Korea / v.38 no.5 / pp.601-606 / 2019
  • This paper is a study on Korean speech recognition using grapheme units (Cho-sung [onset], Jung-sung [nucleus], Jong-sung [coda]). We build an ASR (Automatic Speech Recognition) system without a G2P (Grapheme-to-Phoneme) step and show that deep learning based ASR systems can learn Korean pronunciation rules without it. The proposed model is shown to reduce the word error rate when sufficient training data is available.
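The grapheme units named above fall out of standard Unicode Hangul syllable arithmetic, sketched below; the paper's exact tokenization pipeline may differ:

```python
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")          # 19 onsets
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")    # 21 nuclei
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 codas incl. none

def to_graphemes(syllable):
    """Decompose one precomposed Hangul syllable into
    (onset, nucleus, coda) using Unicode syllable arithmetic."""
    code = ord(syllable) - 0xAC00
    assert 0 <= code < 11172, "not a precomposed Hangul syllable"
    onset, rest = divmod(code, 21 * 28)
    nucleus, coda = divmod(rest, 28)
    return CHOSEONG[onset], JUNGSEONG[nucleus], JONGSEONG[coda]

print(to_graphemes("한"))  # ('ㅎ', 'ㅏ', 'ㄴ')
print(to_graphemes("소"))  # ('ㅅ', 'ㅗ', '')
```

Modeling these grapheme sequences directly lets the acoustic model absorb pronunciation rules that a G2P module would otherwise have to encode by hand.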

Effective Recognition of Velopharyngeal Insufficiency (VPI) Patient's Speech Using DNN-HMM-based System (DNN-HMM 기반 시스템을 이용한 효과적인 구개인두부전증 환자 음성 인식)

  • Yoon, Ki-mu;Kim, Wooil
    • Journal of the Korea Institute of Information and Communication Engineering / v.23 no.1 / pp.33-38 / 2019
  • This paper proposes an effective method for recognizing VPI patients' speech using a DNN-HMM-based speech recognition system and evaluates its performance against a GMM-HMM-based system. The proposed method employs speaker adaptation to improve VPI speech recognition. To make effective use of the small amount of VPI speech available for model adaptation, we propose using simulated VPI speech to generate a prior model for speaker adaptation and to selectively train the weight matrices of the DNN. We also apply Linear Input Network (LIN)-based model adaptation to the DNN model. The proposed speaker adaptation method brings a 2.35% improvement in average accuracy over the GMM-HMM-based ASR system. The experimental results demonstrate that, given small-sized speech data, the proposed DNN-HMM-based speech recognition system is more effective for VPI speech than the conventional GMM-HMM system.

Korean Part-Of-Speech Tagging by using Head-Tail Tokenization (Head-Tail 토큰화 기법을 이용한 한국어 품사 태깅)

  • Suh, Hyun-Jae;Kim, Jung-Min;Kang, Seung-Shik
    • Smart Media Journal / v.11 no.5 / pp.17-25 / 2022
  • Korean part-of-speech taggers decompose a compound morpheme into unit morphemes and attach part-of-speech tags, with the disadvantage that morphemes are over-classified in fine detail and complex word types are generated depending on the purpose of the tagger. When a part-of-speech tagger is used for keyword extraction in deep learning based language processing, decomposing compound particles and verb endings is not required. In this study, the tagging problem is simplified with a Head-Tail tokenization technique that divides each word into only two kinds of tokens, a lexical morpheme part and a grammatical morpheme part, resolving the problem of excessive morpheme decomposition. Part-of-speech tagging was attempted on the Head-Tail tokenized corpus with a statistical technique and with a deep learning model, and the accuracy of each was evaluated: the TnT tagger, a statistics-based part-of-speech tagger, and a Bi-LSTM tagger, a deep learning based part-of-speech tagger, were both trained on the Head-Tail tokenized corpus. As a result, the Bi-LSTM tagger performed part-of-speech tagging with a high accuracy of 99.52%, compared to 97.00% for the TnT tagger.
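The Head-Tail split can be illustrated with a deliberately tiny toy: longest-suffix matching against a handful of particles. This is only a sketch of the idea; the paper's taggers learn the split statistically rather than from a fixed list.

```python
# Tiny, illustrative list of grammatical tails (particles); a real
# system would use a dictionary or a trained model instead.
TAILS = ["에서", "으로", "은", "는", "이", "가", "을", "를", "에", "도"]

def head_tail(word):
    """Split an eojeol into a lexical head and a grammatical tail
    by longest-suffix match against the toy tail list."""
    for tail in sorted(TAILS, key=len, reverse=True):
        if word.endswith(tail) and len(word) > len(tail):
            return word[:-len(tail)], tail
    return word, ""   # no grammatical tail found

print(head_tail("학교에서"))  # ('학교', '에서')
print(head_tail("사람"))      # ('사람', '')
```

Producing at most two tokens per word keeps the tag set small, which is what simplifies the downstream tagging task.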

Performance Comparison Analysis on Named Entity Recognition system with Bi-LSTM based Multi-task Learning (다중작업학습 기법을 적용한 Bi-LSTM 개체명 인식 시스템 성능 비교 분석)

  • Kim, GyeongMin;Han, Seunggnyu;Oh, Dongsuk;Lim, HeuiSeok
    • Journal of Digital Convergence / v.17 no.12 / pp.243-248 / 2019
  • Multi-Task Learning (MTL) is a training method that trains a single neural network on multiple tasks that influence each other. In this paper, we compare the performance of an MTL Named Entity Recognition (NER) model trained on a Korean traditional culture corpus with that of other NER models. During training, the task-specific Bi-LSTM layers for Part-of-Speech tagging (POS tagging) and NER are fed from a shared Bi-LSTM layer to obtain the joint loss. As a result, the MTL-based Bi-LSTM model shows a 1.1%~4.6% performance improvement over single-task Bi-LSTM models.
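The joint loss described above is simply the sum of the two task losses computed on top of a shared representation. A minimal numpy sketch, with a random vector standing in for the shared Bi-LSTM output and assumed label-set sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single token."""
    logits = logits - logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

# Shared Bi-LSTM output for one token, stood in for by a random vector.
shared = rng.normal(size=16)

# Two task-specific heads on top of the shared representation
# (4 NER tags and 10 POS tags are assumed sizes, for illustration).
W_ner = rng.normal(size=(4, 16))
W_pos = rng.normal(size=(10, 16))
ner_label, pos_label = 2, 7   # illustrative gold labels

# Joint loss: gradients from both tasks flow into the shared layer.
joint_loss = cross_entropy(W_ner @ shared, ner_label) \
           + cross_entropy(W_pos @ shared, pos_label)
print(float(joint_loss))
```

Because both heads backpropagate through the same shared layer, the POS signal regularizes the representation the NER head relies on, which is the mechanism behind the reported improvement.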