• Title/Summary/Keyword: Baseline recognizer

An Implementation of the Baseline Recognizer Using the Segmental K-means Algorithm for the Noisy Speech Recognition Using the Aurora DB (Aurora DB를 이용한 잡음 음성 인식실험을 위한 Segmental K-means 훈련 방식의 기반인식기의 구현)

  • Kim Hee-Keun;Chung Young-Joo
    • MALSORI / no.57 / pp.113-122 / 2006
  • Many recent studies have addressed speech recognition in noisy environments. In particular, the Aurora DB was built as a common database for comparing feature extraction schemes. In general, however, the recognition models as well as the features have to be modified for effective noisy speech recognition. Because the structure of the HTK is very complex, its recognition engine is not easy to modify. In this paper, we implemented a baseline recognizer based on the segmental K-means algorithm; its performance is comparable to that of the HTK despite its simple implementation.
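The abstract does not describe the training procedure in detail, so the following is only a toy sketch of the general segmental K-means idea: alternate between a Viterbi segmentation of each training sequence and a re-estimation of each state's parameters from the frames assigned to it. The assumptions here are ours, not the paper's: observations are one-dimensional, states are strictly left-to-right, each state is modelled by a mean only (squared-distance cost), and the function names are hypothetical. A practical recognizer would typically use multi-dimensional cepstral features, Gaussian mixture densities, and transition probabilities.

```python
from typing import List, Tuple


def viterbi_segment(obs: List[float], means: List[float]) -> List[int]:
    """Left-to-right state sequence minimising the summed squared distance."""
    T, S = len(obs), len(means)
    if T < S:
        raise ValueError("need at least as many frames as states")
    INF = float("inf")
    cost = [[INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    cost[0][0] = (obs[0] - means[0]) ** 2  # every path starts in state 0
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1][s]
            move = cost[t - 1][s - 1] if s > 0 else INF
            if move < stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best < INF:
                cost[t][s] = best + (obs[t] - means[s]) ** 2
    # Backtrack from the final state, which every path must end in.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def segmental_kmeans(sequences: List[List[float]], n_states: int,
                     n_iter: int = 10) -> Tuple[List[float], List[List[int]]]:
    """Alternate mean re-estimation and Viterbi segmentation until stable."""
    # Initialise with a uniform segmentation of each sequence.
    paths = [[min(t * n_states // len(seq), n_states - 1)
              for t in range(len(seq))] for seq in sequences]
    means = [0.0] * n_states
    for _ in range(n_iter):
        # Re-estimate each state mean from the frames currently assigned to it.
        for s in range(n_states):
            frames = [x for seq, path in zip(sequences, paths)
                      for x, state in zip(seq, path) if state == s]
            if frames:
                means[s] = sum(frames) / len(frames)
        # Re-segment every sequence with the updated means.
        new_paths = [viterbi_segment(seq, means) for seq in sequences]
        if new_paths == paths:
            break
        paths = new_paths
    return means, paths
```

Calling `segmental_kmeans([[0.0, 0.1, 0.0, 5.0, 5.1, 4.9, 5.0, 5.2]], n_states=2)` moves the state boundary from the uniform midpoint to the point where the signal level changes.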

Semi-supervised learning of speech recognizers based on variational autoencoder and unsupervised data augmentation (변분 오토인코더와 비교사 데이터 증강을 이용한 음성인식기 준지도 학습)

  • Jo, Hyeon Ho;Kang, Byung Ok;Kwon, Oh-Wook
    • The Journal of the Acoustical Society of Korea / v.40 no.6 / pp.578-586 / 2021
  • We propose a semi-supervised learning method based on a Variational AutoEncoder (VAE) and Unsupervised Data Augmentation (UDA) to improve the performance of an end-to-end speech recognizer. In the proposed method, the VAE-based augmentation model and the baseline end-to-end speech recognizer are first trained on the original speech data. Then, the baseline end-to-end speech recognizer is trained again using data generated by the learned augmentation model. Finally, the learned augmentation model and the end-to-end speech recognizer are retrained using the UDA-based semi-supervised learning method. Computer simulations show that the augmentation model improves the Word Error Rate (WER) of the baseline end-to-end speech recognizer, and that combining it with the UDA-based learning method improves performance further.
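Several abstracts in this list report results as a Word Error Rate. As a reminder of how that metric is usually computed, here is a minimal sketch: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis transcript, divided by the number of reference words. The function name and the whitespace tokenisation are assumptions made for illustration; the papers' own scoring tools are not described in the abstracts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.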

Multi-stage Recognition for POI (다단계 인식기반의 POI 인식기 개발)

  • Jeon, Hyung-Bae;Hwang, Kyu-Woong;Chung, Hoon;Kim, Seung-Hi;Park, Jun;Lee, Yun-Keun
    • Proceedings of the KSPS conference / 2007.05a / pp.131-134 / 2007
  • We propose a multi-stage recognizer architecture that reduces the computational load and yields a fast recognizer. To improve the performance of the baseline multi-stage recognizer, we introduce a new feature: a confidence vector for each phone segment, used instead of the best phoneme sequence. The multi-stage recognizer with the new feature performs better on n-best results and is more robust.

A Hierarchical deep model for food classification from photographs

  • Yang, Heekyung;Kang, Sungyong;Park, Chanung;Lee, JeongWook;Yu, Kyungmin;Min, Kyungha
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.4 / pp.1704-1720 / 2020
  • Recognizing food from photographs has many applications in machine learning, computer vision, dietetics, and related fields. Recent progress in deep learning has accelerated food recognition on a large scale. We build a hierarchical structure composed of deep CNNs to recognize and classify food from photographs. We build a dataset of Korean food with 18 classes, which are grouped into 4 major classes. Our hierarchical recognizer classifies foods into the four major classes in the first step. Each food in a major class is then classified into its exact class in the second step. We employ the DenseNet structure as the baseline of our recognizer. The hierarchical structure provides higher accuracy and F1 score than the single-structured recognizer.
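The two-step decision described in this abstract can be sketched as a simple dispatch: a first classifier picks the major class, and a second classifier specific to that major class picks the exact class. Everything below is a hypothetical illustration; the classifiers are toy threshold functions standing in for trained CNNs, and the class names are placeholders rather than the paper's food categories.

```python
from typing import Callable, Dict, Sequence, Tuple

Image = Sequence[float]              # placeholder for real image data
Classifier = Callable[[Image], str]  # maps an image to a class label


def classify_hierarchically(image: Image,
                            major_classifier: Classifier,
                            fine_classifiers: Dict[str, Classifier]) -> Tuple[str, str]:
    """Step 1: choose the major class. Step 2: choose the exact class within it."""
    major = major_classifier(image)
    fine = fine_classifiers[major](image)
    return major, fine


# Toy stand-ins for trained models (hypothetical, for illustration only).
def toy_major(image: Image) -> str:
    return "major_a" if image[0] < 0.5 else "major_b"


toy_fine: Dict[str, Classifier] = {
    "major_a": lambda image: "a_1" if image[1] < 0.5 else "a_2",
    "major_b": lambda image: "b_1" if image[1] < 0.5 else "b_2",
}
```

One consequence of this design is that each second-step classifier only has to separate the classes inside its own major class, at the price that a first-step error cannot be corrected later.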

Robot User Control System using Hand Gesture Recognizer (수신호 인식기를 이용한 로봇 사용자 제어 시스템)

  • Shon, Su-Won;Beh, Joung-Hoon;Yang, Cheol-Jong;Wang, Han;Ko, Han-Seok
    • Journal of Institute of Control, Robotics and Systems / v.17 no.4 / pp.368-374 / 2011
  • This paper proposes a human interface for robot control using a hidden Markov model (HMM) based hand signal recognizer. The command-receiving humanoid robot sends webcam images to a client computer. The client computer then extracts hand motion descriptors of the human issuing the command. Once the features are acquired, the hand signal recognizer carries out the recognition procedure. The recognition result is then sent back to the robot for responsive actions. System performance is evaluated by measuring recognition on a '48 hand signal set', which is created randomly from a fundamental hand motion set. For isolated motion recognition, the '48 hand signal set' shows a 97.07% recognition rate, while the 'baseline hand signal set' shows 92.4%. This result validates that the proposed hand signal recognizer is indeed highly discernible. For connected motions, the '48 hand signal set' shows a 97.37% recognition rate. The experiments demonstrate that the proposed system is promising for real-world human-robot interface applications.

Speech Parameters for the Robust Emotional Speech Recognition (감정에 강인한 음성 인식을 위한 음성 파라메터)

  • Kim, Weon-Goo
    • Journal of Institute of Control, Robotics and Systems / v.16 no.12 / pp.1137-1142 / 2010
  • This paper studies speech parameters that are less affected by human emotion, with the aim of developing a robust speech recognition system. For this purpose, the effect of emotion on the speech recognition system and robust speech parameters were studied using a speech database containing various emotions. In this study, the mel-cepstral coefficient, delta-cepstral coefficient, RASTA mel-cepstral coefficient, and frequency-warped mel-cepstral coefficient were used as feature parameters, and the CMS (Cepstral Mean Subtraction) method was used as a signal bias removal technique. Experimental results showed that the HMM-based speaker-independent word recognizer using the vocal tract length normalized mel-cepstral coefficient, its derivatives, and CMS as signal bias removal achieved the best performance, a 0.78% word error rate. This corresponds to about a 50% word error reduction compared to the baseline system using the mel-cepstral coefficient, its derivatives, and CMS.
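As a brief illustration of the CMS step mentioned above: a stationary channel effect appears as an approximately constant offset in the cepstral domain, so subtracting the per-utterance mean of each cepstral coefficient removes it. The sketch below assumes an utterance is given as a list of frames, each a list of cepstral coefficients; the function name and data layout are our assumptions, not details from the paper.

```python
from typing import List


def cepstral_mean_subtraction(frames: List[List[float]]) -> List[List[float]]:
    """Subtract each coefficient's per-utterance mean from every frame."""
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient over all frames of the utterance.
    means = [sum(frame[k] for frame in frames) / n_frames for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]
```

After this step every coefficient has zero mean over the utterance, so adding a constant bias to all frames leaves the output unchanged.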

N-gram Based Robust Spoken Document Retrievals for Phoneme Recognition Errors (음소인식 오류에 강인한 N-gram 기반 음성 문서 검색)

  • Lee, Su-Jang;Park, Kyung-Mi;Oh, Yung-Hwan
    • MALSORI / no.67 / pp.149-166 / 2008
  • In spoken document retrieval (SDR), subword (typically phoneme) indexing terms are used to avoid the out-of-vocabulary (OOV) problem. This makes the indexing and retrieval process independent of any vocabulary, and it requires only a small corpus to train the acoustic model. However, the subword indexing approach has a major drawback: it shows higher word error rates than a large vocabulary continuous speech recognition (LVCSR) system. In this paper, we propose a probabilistic slot detection and n-gram based string matching method for phone-based spoken document retrieval to overcome the high error rates of the phone recognizer. Experimental results show a 9.25% relative improvement in mean average precision (mAP) with a 1.7 times speed-up in comparison with the baseline system.
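To make the idea of n-gram matching over phone sequences concrete, here is a minimal sketch that scores a document by the fraction of query phone n-grams it contains. This is a plain n-gram overlap chosen for illustration; it does not implement the paper's probabilistic slot detection, and the function names and phone symbols are hypothetical.

```python
from collections import Counter
from typing import Sequence


def phone_ngrams(phones: Sequence[str], n: int) -> Counter:
    """Count every contiguous n-gram in a phone sequence."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))


def ngram_match_score(query: Sequence[str], document: Sequence[str], n: int = 3) -> float:
    """Fraction of query n-grams that also occur in the document (multiset overlap)."""
    query_grams = phone_ngrams(query, n)
    total = sum(query_grams.values())
    if total == 0:
        return 0.0  # query is shorter than n phones
    doc_grams = phone_ngrams(document, n)
    overlap = sum(min(count, doc_grams[gram]) for gram, count in query_grams.items())
    return overlap / total
```

Shorter n-grams tolerate recognition errors better: with a single misrecognized phone in a four-phone query, every trigram is broken, while one of the three bigrams still matches.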

Isolated Word Recognition Using a Speaker-Adaptive Neural Network (화자적응 신경망을 이용한 고립단어 인식)

  • 이기희;임인칠
    • Journal of the Korean Institute of Telematics and Electronics B / v.32B no.5 / pp.765-776 / 1995
  • This paper describes a speaker adaptation method to improve the recognition performance of an MLP (multilayer perceptron) based HMM (hidden Markov model) speech recognizer. In this method, we use a 1st-order linear transformation network to fit the data of a new speaker to the MLP. Transformation parameters are adjusted by back-propagating the classification error to the transformation network while leaving the MLP classifier fixed. The recognition system is based on semicontinuous HMMs that use the MLP as a fuzzy vector quantizer. The experimental results show that this method achieves rapid speaker adaptation with high recognition performance. Namely, for supervised adaptation, the error rate is significantly reduced from 9.2% for the baseline system to 5.6% after speaker adaptation. For unsupervised adaptation, the error rate is reduced to 5.1%, without any information from new speakers.

Modified Phonetic Decision Tree For Continuous Speech Recognition

  • Kim, Sung-Ill;Kitazoe, Tetsuro;Chung, Hyun-Yeol
    • The Journal of the Acoustical Society of Korea / v.17 no.4E / pp.11-16 / 1998
  • For large vocabulary speech recognition using HMMs, context-dependent subword units have often been employed. However, context-dependent phone models result in a system with too many parameters to train. The problem of too many parameters and too little training data is crucial in the design of a statistical speech recognizer. Furthermore, when building large vocabulary speech recognition systems, the unseen triphone problem is unavoidable. In this paper, we propose a modified phonetic decision tree algorithm for the automatic prediction of unseen triphones, and show its advantages in solving these problems through two experiments in Japanese contexts. The baseline experimental results show that the modified tree-based clustering algorithm is effective for clustering and for reducing the number of states without any degradation in performance. The task experimental results show that the proposed algorithm also has the advantage of providing an automatic prediction of unseen triphones.

Pronunciation Variation Modeling for Korean Point-of-Interest Data Using Prosodic Information (운율 정보를 이용한 한국어 위치 정보 데이타의 발음 모델링)

  • Kim, Sun-He;Park, Jeon-Gue;Na, Min-Soo;Jeon, Je-Hun;Chung, Min-Wha
    • Journal of KIISE:Software and Applications / v.34 no.2 / pp.104-111 / 2007
  • This paper examines how the performance of an automatic speech recognizer was improved for Korean Point-of-Interest (POI) data by modeling pronunciation variation using structural prosodic information such as prosodic words and syllable length. First, multiple pronunciation variants are generated using prosodic words, given that each POI word can be broken down into prosodic words. Then, the cross-prosodic-word variations are modeled considering the syllable length of the word. A total of 81 experiments were conducted using 9 test sets (3 baseline and 6 proposed) on 9 trained sets (3 baseline, 6 proposed). The results show that: (i) performance improved when the pronunciation lexica were generated using prosodic words; (ii) the best performance was achieved when the maximum number of variants was constrained to 3 based on the syllable length; and (iii) compared to the baseline word error rate (WER) of 4.63%, a maximum WER reduction of 8.4% was achieved when both prosodic words and syllable length were considered.