• Title/Summary/Keyword: Speaker Characteristics (화자 특징)

Search results: 299

Robust Feature Extraction Based on Image-based Approach for Visual Speech Recognition (시각 음성인식을 위한 영상 기반 접근방법에 기반한 강인한 시각 특징 파라미터의 추출 방법)

  • Gyu, Song-Min;Pham, Thanh Trung;Min, So-Hee;Kim, Jing-Young;Na, Seung-You;Hwang, Sung-Taek
    • Journal of the Korean Institute of Intelligent Systems / v.20 no.3 / pp.348-355 / 2010
  • Despite advances in speech recognition technology, recognition in noisy environments remains a difficult task. To address this problem, researchers have proposed methods that use visual information in addition to the audio signal. However, the visual channel carries its own noise, just as the audio channel does, and this visual noise degrades visual speech recognition. How to extract visual feature parameters that improve recognition performance is therefore an active area of interest. In this paper, we propose an image-based method of visual feature parameter extraction for improving the recognition performance of an HMM-based visual speech recognizer. For the experiments, we constructed an audio-visual database of 105 speakers, each of whom uttered 62 words, and applied histogram matching, lip folding, RASTA filtering, linear masking, DCT, and PCA. The experimental results show that the proposed method improves recognition performance by about 21% over the baseline method.
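
As a rough illustration of the pipeline's final stages, the sketch below applies a 2-D DCT to a grayscale lip region of interest and projects the retained coefficients onto a PCA basis. The function names, ROI size, and coefficient counts are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the image-based feature stages named above: a 2-D DCT
# of a grayscale lip ROI, then PCA over the retained low-frequency block.
# Shapes and coefficient counts are assumptions, not the paper's settings.
import numpy as np
from scipy.fft import dct

def dct_features(lip_roi: np.ndarray, n_coef: int = 64) -> np.ndarray:
    """2-D DCT of a grayscale lip ROI; keep the top-left (low-frequency) block."""
    c = dct(dct(lip_roi.astype(float), axis=0, norm="ortho"), axis=1, norm="ortho")
    k = int(np.sqrt(n_coef))
    return c[:k, :k].ravel()

def fit_pca(features: np.ndarray, n_components: int = 20):
    """PCA via SVD on mean-centered DCT features (rows = frames)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

# Usage: project each frame's DCT vector onto the PCA basis.
frames = np.random.rand(100, 32, 32)            # stand-in for tracked lip ROIs
X = np.stack([dct_features(f) for f in frames])
mean, basis = fit_pca(X)
visual_params = (X - mean) @ basis.T            # per-frame visual features for the HMM
```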

Automatic detection and severity prediction of chronic kidney disease using machine learning classifiers (머신러닝 분류기를 사용한 만성콩팥병 자동 진단 및 중증도 예측 연구)

  • Jihyun Mun;Sunhee Kim;Myeong Ju Kim;Jiwon Ryu;Sejoong Kim;Minhwa Chung
    • Phonetics and Speech Sciences / v.14 no.4 / pp.45-56 / 2022
  • This paper proposes an optimal methodology for automatically diagnosing and predicting the severity of chronic kidney disease (CKD) from patients' utterances. In patients with CKD, the voice changes due to weakening of the respiratory and laryngeal muscles and vocal fold edema. Previous studies have analyzed the voices of patients with CKD phonetically, but none have attempted to classify them automatically. In this paper, the utterances of patients with CKD were classified using a variety of utterance types (sustained vowel, sentence, general sentence), feature sets [handcrafted features, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), CNN-extracted features], and classifiers (SVM, XGBoost). A total of 1,523 utterances, amounting to 3 hours, 26 minutes, and 25 seconds of speech, were used. F1-scores of 0.93 for automatic diagnosis, 0.89 for the 3-class severity problem, and 0.84 for the 5-class severity problem were achieved. The highest performance was obtained with the combination of general sentence utterances, the handcrafted feature set, and XGBoost. The results suggest that general sentence utterances, which reflect each speaker's natural speech characteristics, together with an appropriate feature set extracted from them, are adequate for the automatic classification of CKD patients' utterances.
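
The following sketch shows one configuration from this comparison, eGeMAPS functionals fed to XGBoost, assuming the open-source opensmile and xgboost Python packages; the file names, labels, and hyperparameters are placeholders rather than the paper's data or settings.

```python
# A minimal sketch of the eGeMAPS + XGBoost configuration, assuming the
# `opensmile` (audEERING) and `xgboost` packages. Paths, labels, and
# hyperparameters below are placeholders, not the paper's corpus or setup.
import numpy as np
import opensmile
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 utterance-level functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

wav_paths = ["spk01_sent01.wav", "spk02_sent01.wav"]   # extend with the full corpus
labels = np.array([0, 1])                              # e.g., 0 = healthy, 1 = CKD

X = np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```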

Performance assessments of feature vectors and classification algorithms for amphibian sound classification (양서류 울음 소리 식별을 위한 특징 벡터 및 인식 알고리즘 성능 분석)

  • Park, Sangwook;Ko, Kyungdeuk;Ko, Hanseok
    • The Journal of the Acoustical Society of Korea / v.36 no.6 / pp.401-406 / 2017
  • This paper presents a performance assessment of several key algorithms for amphibian species sound classification. First, 9 target species, including endangered species, are defined and a database of their sounds is built. For the assessment, three feature vectors, MFCC (Mel Frequency Cepstral Coefficient), RCGCC (Robust Compressive Gammachirp filterbank Cepstral Coefficient), and SPCC (Subspace Projection Cepstral Coefficient), and three classifiers, GMM (Gaussian Mixture Model), SVM (Support Vector Machine), and DBN-DNN (Deep Belief Network - Deep Neural Network), are considered. In addition, an i-vector based classification system, widely used in speaker recognition, is assessed on this task. Experimental results indicate that SPCC-SVM achieved the best performance at 98.81%, while the other methods also performed well, at above 90%.
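
As a sketch of the MFCC-plus-SVM branch of this comparison, the snippet below pools frame-level MFCCs into one vector per clip and trains an SVM; the pooling scheme and hyperparameters are assumptions, and librosa stands in for whatever front end the authors used.

```python
# A minimal sketch of MFCC features + SVM for clip-level sound classification.
# Mean/std pooling over frames and the SVM settings are assumptions.
import librosa
import numpy as np
from sklearn.svm import SVC

def clip_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Frame-level MFCCs pooled to one fixed-length vector per sound clip."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

train_paths = ["frog_sp1_001.wav", "frog_sp2_001.wav"]  # placeholder clips
train_labels = [0, 1]                                   # species indices

X = np.stack([clip_features(p) for p in train_paths])
clf = SVC(kernel="rbf", C=10.0).fit(X, train_labels)
print(clf.predict(clip_features("unknown_call.wav")[None, :]))
```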

Implementation of the Speech Emotion Recognition System in the ARM Platform (ARM 플랫폼 기반의 음성 감성인식 시스템 구현)

  • Oh, Sang-Heon;Park, Kyu-Sik
    • Journal of Korea Multimedia Society / v.10 no.11 / pp.1530-1537 / 2007
  • In this paper, we implemented a speech emotion recognition system that distinguishes human emotional states from speech captured by a single microphone and classifies them into four categories: neutrality, happiness, sadness, and anger. In general, speech recorded with a microphone contains background noise arising from the speaker's environment and the microphone's characteristics, which can seriously degrade system performance. To minimize the effect of this noise and improve performance, an MA (moving average) filter with a relatively simple structure and low computational complexity was adopted. An SFS (sequential forward selection) feature optimization method was then implemented to further improve and stabilize performance. For emotion classification, an SVM pattern classifier is used. The experimental results indicate classification performance of around 65% in computer simulation and 62% on the ARM platform.
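
The sketch below illustrates the two preprocessing ideas named in the abstract, a moving-average filter over the raw signal and sequential forward selection ahead of an SVM, using scikit-learn's SequentialFeatureSelector; the feature matrix and labels are synthetic placeholders, not the paper's emotion features.

```python
# A minimal sketch of MA smoothing plus SFS feature selection before an SVM.
# X and y are synthetic placeholders for the paper's emotion features/labels.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector

def moving_average(signal: np.ndarray, width: int = 5) -> np.ndarray:
    """Simple MA smoothing; low complexity, suitable for embedded targets."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")

X = np.random.rand(200, 40)          # placeholder: 200 utterances x 40 features
y = np.random.randint(0, 4, 200)     # 0..3 = neutral/happy/sad/angry

svm = SVC(kernel="rbf")
sfs = SequentialFeatureSelector(svm, n_features_to_select=10, direction="forward")
X_sel = sfs.fit_transform(X, y)      # keep the 10 features that help the SVM most
svm.fit(X_sel, y)
```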

Word Recognition Using VQ and Fuzzy Theory (VQ와 Fuzzy 이론을 이용한 단어인식)

  • Kim, Ja-Ryong;Choi, Kap-Seok
    • The Journal of the Acoustical Society of Korea / v.10 no.4 / pp.38-47 / 1991
  • Variation in frequency features among speakers is one of the problems in speech recognition. This paper applies fuzzy theory to address it. Reference patterns are expressed as fuzzified patterns built from the peak frequency and peak energy extracted from codebooks, which are generated from training words uttered by several speakers so that they capture features common across speakers. Words are recognized by fuzzy inference using a certainty factor between the reference patterns and test patterns, the latter fuzzified from the peak frequency and peak energy of the power spectrum of the input speech. To reduce the memory and computation required when computing the certainty factor, we propose a new equation that calculates an improved certainty factor using only the difference between two fuzzy values. Experiments on Korean digits show that this fuzzy-inference word recognition method with the proposed equation mitigates the speaker-dependent variation of frequency features while reducing memory and computation requirements.
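
The abstract does not reproduce the paper's improved certainty-factor equation, so the sketch below uses an assumed difference-based form (one minus the mean absolute difference of membership values) purely to illustrate matching fuzzified patterns by certainty factor.

```python
# A minimal sketch of difference-based fuzzy matching in the spirit of the
# method above. The paper's actual equation is not given in the abstract;
# the form used here is an assumption for illustration only.
import numpy as np

def certainty_factor(ref: np.ndarray, test: np.ndarray) -> float:
    """Similarity of two fuzzy membership vectors from their difference only."""
    return float(1.0 - np.abs(ref - test).mean())

def recognize(test_pattern: np.ndarray, references: dict) -> str:
    """Pick the reference word whose fuzzified pattern matches best."""
    return max(references, key=lambda w: certainty_factor(references[w], test_pattern))

refs = {"il": np.array([0.9, 0.2, 0.1]),   # toy fuzzified (peak freq/energy) patterns
        "i":  np.array([0.3, 0.8, 0.4])}
print(recognize(np.array([0.85, 0.25, 0.15]), refs))  # -> "il"
```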

Robust Speech Recognition Parameters for Emotional Variation (감정 변화에 강인한 음성 인식 파라메터)

  • Kim Weon-Goo
    • Journal of the Korean Institute of Intelligent Systems / v.15 no.6 / pp.655-660 / 2005
  • This paper studied feature parameters that are less affected by emotional variation, for the development of robust speech recognition technologies. For this purpose, the effect of emotional variation on a speech recognition system, and feature parameters robust to it, were studied using a speech database containing various emotions. LPC cepstral coefficients, mel-cepstral coefficients, root-cepstral coefficients, PLP coefficients, and RASTA mel-cepstral coefficients were used as feature parameters, and CMS and SBR were used as signal bias removal techniques. Experimental results showed that an HMM-based speaker-independent word recognizer using RASTA mel-cepstral coefficients and their derivatives, with CMS for signal bias removal, performed best, with a 7.05% word error rate. This corresponds to about a 52% reduction in word error compared to the baseline system using mel-cepstral coefficients.
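
Of the techniques listed, CMS (cepstral mean subtraction) is simple enough to sketch: subtracting each utterance's mean cepstral vector removes slowly varying channel and speaker bias. The shapes below are illustrative.

```python
# A minimal sketch of CMS, the bias-removal step named above: subtract the
# per-utterance mean cepstral vector. Input shapes are assumptions.
import numpy as np

def cepstral_mean_subtraction(cepstra: np.ndarray) -> np.ndarray:
    """cepstra: (n_frames, n_coeffs) cepstral features for one utterance."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

utterance = np.random.rand(120, 13)                # placeholder mel-cepstral frames
normalized = cepstral_mean_subtraction(utterance)
assert np.allclose(normalized.mean(axis=0), 0.0)   # per-coefficient mean is now 0
```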

Robust Speech Recognition Using Missing Data Theory (손실 데이터 이론을 이용한 강인한 음성 인식)

  • 김락용;조훈영;오영환
    • The Journal of the Acoustical Society of Korea / v.20 no.3 / pp.56-62 / 2001
  • In this paper, we apply missing data theory to speech recognition so that a recognizer can maintain high performance when parts of the data are missing. In general, a hidden Markov model (HMM) is used as the stochastic classifier for speech recognition, with acoustic events represented by continuous probability density functions in a continuous density HMM (CDHMM). An advantage of missing data theory is that it is easily applicable to the CDHMM. A marginalization method is used to process the missing data because of its low complexity and easy application to automatic speech recognition (ASR). Spectral subtraction is used to detect missing data: if the difference between the energy of the speech and that of the background noise falls below a given threshold, the data is marked as missing. We also propose a new method that checks the reliability of the detected missing data using the voicing probability, which identifies voiced frames; it is used to process missing data in voiced regions, which carry more redundant information than consonants. The experimental results showed that our method outperforms a baseline system that uses spectral subtraction alone. In a 452-word isolated word recognition experiment, the proposed method using the voicing probability reduced the average word error rate by 12% in a typical noise situation.
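
The detection step lends itself to a short sketch: compare each time-frequency cell's energy against a noise estimate and mark low-SNR cells as missing. The noise estimate from leading frames and the threshold value below are assumptions, not the paper's settings.

```python
# A minimal sketch of spectral-subtraction-style missing-data detection:
# a binary reliability mask per time-frequency cell. The leading-frame
# noise estimate and the threshold are assumptions for illustration.
import numpy as np

def missing_data_mask(power_spec: np.ndarray,
                      n_noise_frames: int = 10,
                      threshold_db: float = 3.0) -> np.ndarray:
    """power_spec: (n_frames, n_bins). True = reliable, False = missing."""
    noise = power_spec[:n_noise_frames].mean(axis=0)      # noise from leading frames
    snr_db = 10.0 * np.log10(power_spec / np.maximum(noise, 1e-12))
    return snr_db > threshold_db

spec = np.abs(np.random.rand(100, 129)) ** 2    # placeholder power spectrogram
mask = missing_data_mask(spec)
# Marginalization would then integrate the CDHMM likelihood over cells where
# mask is False, scoring only the reliable (True) cells directly.
```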

A Study on Tourists Information and Language Transference (관광정보와 언어전환에 관한 연구)

  • Lee, Seung Jae
    • Journal of Digital Convergence / v.12 no.5 / pp.451-458 / 2014
  • The purpose of this paper is to examine website information and promotional texts, comparing Korean source texts with their translated English versions, and to draw out the characteristics of tourism texts from a discourse and communicative perspective. This study shows that websites and promotional texts are the first source of information in tourism, most frequently referred to by inbound tourists, and that the information given on the official homepage is the most trusted content on Korean tourism. Comparing the Korean source texts with the translated English versions, this paper shows that the Korean texts tend toward longer explication and more detailed information on scenic spots and attractions than the English translations. When translated into English, the texts do not follow a literal translation strategy; they are segmented for readers' understanding and adapted to the target language's communicative conventions and the target culture. Consequently, this study supports adaptation in the English translation of tourism promotion and finds that the communicative constraints of tourism, namely politeness and the Gricean maxims, are preserved even in this written form of communication.

A Real-Time Embedded Speech Recognition System (실시간 임베디드 음성 인식 시스템)

  • 남상엽;전은희;박인정
    • Journal of the Institute of Electronics Engineers of Korea CI / v.40 no.1 / pp.74-81 / 2003
  • In this study, we implemented a real-time embedded speech recognition system that requires minimal memory for the speech recognition engine and its DB. The words to be recognized consist of 40 commands used in a PCS phone and the 10 digits. Speech data spoken by 15 male and 15 female speakers was recorded and analyzed by the short-time analysis method with a window size of 256 samples. The LPC parameters of each frame were computed with the Levinson-Durbin algorithm and transformed into cepstral parameters. Before the analysis, the speech data was pre-emphasized to remove the DC component and emphasize the high-frequency band. The Baum-Welch re-estimation algorithm was used for HMM training. In the test phase, the recognition rate was obtained with the likelihood method. We implemented the embedded system by porting the speech recognition engine to an ARM core evaluation board. The overall recognition rate of the system was 95%: 96% on the 40 commands and 94% on the 10 digits.
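
The front end described here, pre-emphasis, LPC via Levinson-Durbin, and conversion to cepstral coefficients, can be sketched compactly; the analysis order and coefficient counts below are illustrative rather than the paper's exact settings.

```python
# A minimal sketch of the described front end: pre-emphasis, LPC by the
# Levinson-Durbin recursion on the frame autocorrelation, and the standard
# LPC-to-cepstrum conversion. Orders and counts are assumptions.
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """High-pass the signal: removes DC drift, boosts the high band."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def levinson_durbin(r: np.ndarray, order: int) -> np.ndarray:
    """LPC coefficients a[0..p] of A(z) = 1 + sum a_k z^-k from autocorrelation r."""
    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def lpc_to_cepstrum(a: np.ndarray, n_ceps: int) -> np.ndarray:
    """Standard LPC-to-cepstrum recursion (here n_ceps <= LPC order)."""
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = -a[n] - sum(k * c[k] * a[n - k] for k in range(1, n)) / n
    return c[1:]

frame = pre_emphasis(np.random.randn(256))          # one 256-sample analysis frame
r = np.correlate(frame, frame, mode="full")[255:255 + 13]
cepstrum = lpc_to_cepstrum(levinson_durbin(r, 12), 12)
```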

Emotion recognition in speech using hidden Markov model (은닉 마르코프 모델을 이용한 음성에서의 감정인식)

  • 김성일;정현열
    • Journal of the Institute of Convergence Signal Processing / v.3 no.3 / pp.21-26 / 2002
  • This paper presents a new approach to identifying human emotional states such as anger, happiness, neutrality, sadness, and surprise, using discrete duration continuous hidden Markov models (DDCHMM). For this, emotional feature parameters are first defined from the input speech signals; in this study we used prosodic parameters, namely pitch, energy, and their derivatives, which were then trained with HMMs for recognition. Speaker-adapted emotional models based on maximum a posteriori (MAP) estimation were also considered for speaker adaptation. The simulation results showed that the vocal emotion recognition rate gradually increased as the number of adaptation samples grew.
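
A sketch of the prosodic front end, pitch and energy contours with their derivatives stacked per frame, appears below; librosa's YIN tracker is an assumed stand-in for the paper's pitch extractor, and the HMM training and MAP adaptation stages are elided.

```python
# A minimal sketch of the prosodic features named above: pitch and energy
# contours plus their deltas, as a per-frame observation matrix for an HMM.
# The pitch range, file name, and frame alignment are assumptions.
import librosa
import numpy as np

def prosodic_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)          # pitch contour (Hz)
    energy = librosa.feature.rms(y=y)[0]                   # frame energy
    n = min(len(f0), len(energy))                          # align frame counts
    feats = np.stack([f0[:n], energy[:n]])
    deltas = librosa.feature.delta(feats)                  # first derivatives
    return np.vstack([feats, deltas]).T                    # (n_frames, 4)

obs = prosodic_features("angry_utt01.wav")  # placeholder file; feed frames to the HMM
```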
