• Title/Summary/Keyword: Speech recognition model

Search results: 624

Emotion Recognition in Arabic Speech from Saudi Dialect Corpus Using Machine Learning and Deep Learning Algorithms

  • Hanaa Alamri;Hanan S. Alshanbari
    • International Journal of Computer Science & Network Security, v.23 no.8, pp.9-16, 2023
  • Speech can actively convey feelings and attitudes through words, so it is important for researchers to identify both the emotional content carried in speech signals and the kind of emotion a given utterance expresses. In this study, we investigated an emotion recognition system using an Arabic database, specifically in the Saudi dialect, drawn from the YouTube channel Telfaz11. The four emotions examined were anger, happiness, sadness, and neutral. In our experiments, we extracted features from the audio signals, such as Mel Frequency Cepstral Coefficients (MFCC) and the Zero-Crossing Rate (ZCR), and then classified emotions using several algorithms: machine learning algorithms (Support Vector Machine (SVM) and K-Nearest Neighbor (KNN)) and deep learning algorithms (Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM)). Our experiments showed that the MFCC feature extraction method combined with the CNN model obtained the best accuracy, 95%, demonstrating the effectiveness of this classification system in recognizing emotions in spoken Arabic.
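
A minimal sketch of the feature pipeline this abstract describes, assuming librosa and scikit-learn; the corpus paths, labels, and SVM settings are placeholders, not the authors' setup:

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def extract_features(path, sr=16000, n_mfcc=13):
    """Fixed-length clip representation: time-averaged MFCCs plus mean ZCR."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    zcr = librosa.feature.zero_crossing_rate(y)             # (1, n_frames)
    return np.concatenate([mfcc.mean(axis=1), zcr.mean(axis=1)])

# Hypothetical corpus of (wav_path, label) pairs over the four emotions:
# corpus = [("angry_001.wav", "anger"), ("happy_001.wav", "happiness"), ...]
# X = np.stack([extract_features(p) for p, _ in corpus])
# y = [label for _, label in corpus]
# clf = SVC(kernel="rbf").fit(X, y)
```

The paper's best result used a CNN over the MFCCs rather than the SVM shown here; the feature extraction step is the part the two approaches share.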

Enhancing Multimodal Emotion Recognition in Speech and Text with Integrated CNN, LSTM, and BERT Models (통합 CNN, LSTM, 및 BERT 모델 기반의 음성 및 텍스트 다중 모달 감정 인식 연구)

  • Edward Dwijayanto Cahyadi;Hans Nathaniel Hadi Soesilo;Mi-Hwa Song
    • The Journal of the Convergence on Culture Technology, v.10 no.1, pp.617-623, 2024
  • Identifying emotions from speech poses a significant challenge due to the complex relationship between language and emotion. Our paper takes on this challenge by employing feature engineering to identify emotions in speech through a multimodal classification task involving both speech and text data. We evaluated two classifiers, Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM), each integrated with a BERT-based pre-trained model. Our assessment covers various performance metrics (accuracy, F-score, precision, and recall) across different experimental setups. The findings highlight the proficiency of both models in accurately discerning emotions from text and speech data.
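
One plausible way to wire the fusion the abstract outlines, in PyTorch with Hugging Face transformers; the layer sizes, pooling choice, and class count are assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MultimodalEmotionNet(nn.Module):
    def __init__(self, n_audio_feats=40, hidden=128, n_classes=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(n_audio_feats, hidden, batch_first=True)
        self.head = nn.Linear(self.bert.config.hidden_size + hidden, n_classes)

    def forward(self, input_ids, attention_mask, audio_frames):
        # Text branch: pooled BERT embedding of the transcript.
        text_vec = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).pooler_output
        # Speech branch: last LSTM hidden state over per-frame audio features.
        _, (h_n, _) = self.lstm(audio_frames)  # audio_frames: (B, T, n_audio_feats)
        fused = torch.cat([text_vec, h_n[-1]], dim=-1)
        return self.head(fused)
```

The CNN variant the paper also evaluates would replace the LSTM branch with convolutions over the same frame sequence.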

A Study on Variation and Determination of Gaussian function Using SNR Criteria Function for Robust Speech Recognition (잡음에 강한 음성 인식에서 SNR 기준 함수를 사용한 가우시안 함수 변형 및 결정에 관한 연구)

  • 전선도;강철호
    • The Journal of the Acoustical Society of Korea, v.18 no.7, pp.112-117, 1999
  • Spectral subtraction for noise-robust speech recognition often causes loss of the speech signal. In this study, we propose a method in which the Gaussian functions of a semi-continuous HMM (Hidden Markov Model) are varied and determined on the basis of an SNR criteria function, where SNR denotes the signal-to-noise ratio between the estimated noise and the subtracted signal in each frame. To demonstrate the effectiveness of this method, we show through signal waveforms that the estimation error is related to the magnitude of the estimated noise; for this reason, the Gaussian functions are varied and determined by the SNR. In computer-simulated recognition tests under the noise of a car driving at over 80 km/h, the proposed SNR-based Gaussian decision method achieves a higher recognition rate than both the spectral-subtraction-only and non-subtracted cases.
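
A small sketch of the per-frame SNR criterion the abstract builds on: spectral subtraction followed by the ratio between the subtracted signal and the estimated noise for each frame. Variable shapes and the dB form are assumptions:

```python
import numpy as np

def frame_snr_after_subtraction(mag_frames, noise_mag, eps=1e-10):
    """mag_frames: (n_frames, n_bins) magnitude spectra; noise_mag: (n_bins,)."""
    subtracted = np.maximum(mag_frames - noise_mag, 0.0)  # basic spectral subtraction
    sig_energy = (subtracted ** 2).sum(axis=1)
    noise_energy = (noise_mag ** 2).sum() + eps
    return 10.0 * np.log10(sig_energy / noise_energy + eps)  # per-frame SNR in dB
```

In the paper this per-frame SNR then drives how the semi-continuous HMM's Gaussian functions are varied; that decision rule is not reproduced here.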


Phoneme segmentation and Recognition using Support Vector Machines (Support Vector Machines에 의한 음소 분할 및 인식)

  • Lee, Gwang-Seok;Kim, Deok-Hyun
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference, 2010.05a, pp.981-984, 2010
  • In this paper, we used Support Vector Machines (SVMs), a machine learning method, to segment continuous speech into phonemes (initial, medial, and final sounds) and then performed continuous speech recognition on the result. The phoneme decision boundary is determined by an algorithm that selects the most frequent label within a short interval. Recognition is performed with a Continuous Hidden Markov Model (CHMM), and we compared the result against phoneme segmentation obtained by visual inspection. The simulation results confirmed that the proposed SVM-based method is more effective for initial sounds than Gaussian Mixture Models (GMMs).
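
A sketch of the "most frequent label in a short interval" boundary rule, read here as a sliding-window majority vote over frame-level SVM predictions; the window length is an assumption:

```python
from collections import Counter

def smooth_labels(frame_labels, win=5):
    """Replace each frame label with the most frequent label in a short window."""
    half = win // 2
    out = []
    for i in range(len(frame_labels)):
        window = frame_labels[max(0, i - half): i + half + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return out

# With a trained frame-level classifier (e.g. sklearn.svm.SVC on hypothetical
# per-frame features), phoneme boundaries fall where the smoothed label changes:
# labels = smooth_labels(list(clf.predict(X_frames)))
# boundaries = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
```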


Fast computation of Observation Probability for Speaker-Independent Real-Time Speech Recognition (실시간 화자독립 음성인식을 위한 고속 확률계산)

  • Park Dong-Chul;Ahn Ju-Won
    • The Journal of Korean Institute of Communications and Information Sciences, v.30 no.9C, pp.907-912, 2005
  • An efficient method for calculating the observation probability in a CDHMM (Continuous Density Hidden Markov Model) is proposed in this paper. The proposed algorithm, called FCOP (Fast Computation of Observation Probability), approximates observation probabilities in the CDHMM by eliminating insignificant PDFs (Probability Density Functions), reducing the computational load. When applied to a speech recognition system, the FCOP algorithm reduces instruction cycles by 20-30% and increases recognition speed by about 30% while minimizing the loss in recognition rate. Implemented on a practical cellular phone, the FCOP algorithm increases recognition speed by about 30% at the cost of a 0.2% loss in recognition rate.
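
A guess at the pruning idea behind FCOP, shown as a Gaussian-mixture observation probability that skips components whose log-density falls below a floor; the paper's actual rule for selecting "insignificant PDFs" may differ:

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def pruned_obs_prob(x, weights, means, variances, floor=-50.0):
    """Mixture observation probability, dropping negligible components."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        lg = log_gauss(x, m, v)
        if lg < floor:   # insignificant component: contributes ~0, skip it
            continue
        total += w * np.exp(lg)
    return total
```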

Performance Improvement of Continuous Digits Speech Recognition Using the Transformed Successive State Splitting and Demi-syllable Pair (반음절쌍과 변형된 연쇄 상태 분할을 이용한 연속 숫자 음 인식의 성능 향상)

  • Seo Eun-Kyoung;Choi Gab-Keun;Kim Soon-Hyob;Lee Soo-Jeong
    • Journal of Korea Multimedia Society, v.9 no.1, pp.23-32, 2006
  • This paper describes the optimization of a language model and an acoustic model to improve the recognition of Korean connected digits. Since the language model is a finite state network (FSN) over demi-syllables, its recognition errors were reduced by analyzing the grammatical features of Korean digit strings. The acoustic models use demi-syllable pairs to reduce recognition errors caused by inaccurate segmentation of phones or monosyllables due to short pronunciation time and coarticulation. For efficient modelling of the recognition units at the feature level, we used the K-means clustering algorithm with transformed successive state splitting. In experiments, the proposed language model raised the recognition rate by 10.5%, the demi-syllable pair acoustic model raised it by a further 12.5%, and transformed successive state splitting improved it by another 1.5%.
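
A toy finite state network for connected digits, to make the FSN idea concrete; the grammar constraints the authors derive for Korean digit strings are not reproduced:

```python
# States are digit words; any digit may follow any digit, bracketed by
# sentence start/end symbols.
DIGITS = [str(d) for d in range(10)]

fsn = {"<s>": list(DIGITS)}
for d in DIGITS:
    fsn[d] = DIGITS + ["</s>"]

def accepts(seq):
    """Check whether a digit sequence is a complete path through the FSN."""
    state = "<s>"
    for sym in seq:
        if sym not in fsn.get(state, []):
            return False
        state = sym
    return "</s>" in fsn.get(state, [])

print(accepts(["3", "1", "4"]))  # True
```

The paper's contribution lies precisely in restricting this fully connected network using the grammar of Korean digit strings, which a sketch this small cannot show.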


Voice Activity Detection in Noisy Environment using Speech Energy Maximization and Silence Feature Normalization (음성 에너지 최대화와 묵음 특징 정규화를 이용한 잡음 환경에 강인한 음성 검출)

  • Ahn, Chan-Shik;Choi, Ki-Ho
    • Journal of Digital Convergence, v.11 no.6, pp.169-174, 2013
  • In speech recognition, a major cause of performance degradation is the mismatch between the model training environment and the recognition environment. Silence feature normalization is used as a way to reduce this environmental mismatch. However, at low signal-to-noise ratios, existing silence feature normalization methods raise the energy level of the silence interval, so the accuracy of speech/non-speech classification falls and recognition performance degrades. This paper proposes a voice activity detection method that is robust in noisy environments, combining speech energy maximization with silence feature normalization. At high signal-to-noise ratios, the proposed method maximizes the speech energy so that the features are less affected by noise; at low signal-to-noise ratios, it improves the cepstral feature distribution of the speech and non-speech classes and thereby the recognition performance. Recognition experiments show improved performance compared with the conventional method.
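
A compact sketch of the two ingredients the abstract combines: an energy-based speech/non-speech decision and mean normalization of the frames judged silent. The threshold rule and the normalization form are assumptions:

```python
import numpy as np

def simple_vad(frames, energy_quantile=0.4):
    """frames: (n_frames, frame_len) samples. Returns True for speech frames."""
    energy = (frames ** 2).sum(axis=1)
    threshold = np.quantile(energy, energy_quantile)
    return energy > threshold

def normalize_silence(features, is_speech):
    """Pull silence-frame features toward a common mean, reducing the
    train/test environment mismatch described in the abstract."""
    silence = features[~is_speech]
    if len(silence):
        features[~is_speech] = silence - silence.mean(axis=0)
    return features
```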

A Real-Time Implementation of Isolated Word Recognition System Based on a Hardware-Efficient Viterbi Scorer (효율적인 하드웨어 구조의 Viterbi Scorer를 이용한 실시간 격리단어 인식 시스템의 구현)

  • Cho, Yun-Seok;Kim, Jin-Yul;Oh, Kwang-Sok;Lee, Hwang-Soo
    • The Journal of the Acoustical Society of Korea, v.13 no.2E, pp.58-67, 1994
  • Hidden Markov Model (HMM)-based algorithms have been used successfully in many speech recognition systems, especially large-vocabulary systems. Although general-purpose processors can be employed, they inevitably suffer from the computational complexity and the enormous volume of data involved; specialized hardware to accelerate the recognition steps is therefore essential for real-time speech recognition. This paper concerns a real-time implementation of an HMM-based isolated word recognition system. The system consists of a host computer (PC), a DSP board, and a prototype Viterbi scoring board. The DSP board extracts feature vectors from the speech signal. The Viterbi scoring board, implemented with three field-programmable gate array chips, employs a hardware-efficient Viterbi scoring architecture and performs the Viterbi algorithm for HMM-based speech recognition. At a clock rate of 10 MHz, the system can update about 100,000 states within a single 10 ms frame.
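
At 10 MHz, a 10 ms frame spans 100,000 clock cycles, so the quoted figure amounts to roughly one state update per cycle. The per-frame update the scorer implements in hardware looks, in software form, like this vectorized sketch (shapes are assumptions):

```python
import numpy as np

def viterbi_frame_update(prev_scores, log_trans, log_obs):
    """One Viterbi time step over S states.
    prev_scores: (S,) best log scores from the previous frame
    log_trans:   (S, S) log transition matrix, from-state x to-state
    log_obs:     (S,) observation log-probabilities for the current frame
    """
    cand = prev_scores[:, None] + log_trans  # score of every from->to pair
    return cand.max(axis=0) + log_obs        # best predecessor per state
```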


Large Vocabulary Continuous Speech Recognition Based on Language Model Network (언어 모델 네트워크에 기반한 대어휘 연속 음성 인식)

  • 안동훈;정민화
    • The Journal of the Acoustical Society of Korea, v.21 no.6, pp.543-551, 2002
  • In this paper, we present an efficient decoding method that performs in real time on a 20k-word continuous speech recognition task. The basic search method is a one-pass Viterbi decoder over a search space constructed from a novel language model network. With the consistent search-space representation that the LM network derives from various language models, we incorporate basic pruning strategies, whereby the surviving tokens constitute a dynamic search space. To facilitate post-processing, the decoder subsequently produces a word graph and an N-best list. It is tested on a 20k-word database and evaluated with respect to accuracy and real-time factor (RTF).
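
A minimal sketch of the beam pruning step that keeps the token set, and hence the dynamic search space, small; the beam width is an arbitrary placeholder:

```python
def prune_tokens(tokens, beam=200.0):
    """tokens: dict mapping decoder state -> accumulated log score.
    Keep only tokens within `beam` of the current best score."""
    best = max(tokens.values())
    return {s: v for s, v in tokens.items() if v >= best - beam}
```

Tokens surviving this cut in one frame seed the expansions in the next, which is what the abstract means by a dynamic search space.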

Auditory Representations for Robust Speech Recognition in Noisy Environments (잡음 환경에서의 음성 인식을 위한 청각 표현)

  • Kim, Doh-Suk;Lee, Soo-Young;Kil, Rhee-M.
    • The Journal of the Acoustical Society of Korea, v.15 no.5, pp.90-98, 1996
  • An auditory model is proposed for robust speech recognition in noisy environments. The model consists of cochlear bandpass filters and nonlinear stages, and it represents frequency and intensity information efficiently even in noise. Frequency information is obtained from zero-crossing intervals, while intensity information is incorporated through peak detectors and saturating nonlinearities. The robustness of zero-crossings for frequency estimation is verified by an analytic relationship, derived here, between the variance of level-crossing interval perturbations and the crossing level. The proposed auditory model is computationally efficient and has fewer unknown parameters than other auditory models. Speaker-independent speech recognition experiments demonstrate the robustness of the proposed method.
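
A small sketch of the model's core frequency cue: estimating frequency from zero-crossing intervals, where successive crossings of a band-passed signal are half a period apart. The averaging step is an assumption:

```python
import numpy as np

def zc_interval_freq(x, sr):
    """Estimate the dominant frequency of a band-passed signal from its
    zero-crossing intervals."""
    signs = np.signbit(x)
    crossings = np.where(signs[1:] != signs[:-1])[0]  # indices of sign flips
    if len(crossings) < 2:
        return 0.0
    intervals = np.diff(crossings) / sr               # seconds between crossings
    return 1.0 / (2.0 * intervals.mean())             # two crossings per period
```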
