• 제목/요약/키워드: Non-speech

검색결과 468건 처리시간 0.023초

Non-Intrusive Speech Intelligibility Estimation Using Autoencoder Features with Background Noise Information

  • Jeong, Yue Ri;Choi, Seung Ho
    • International Journal of Internet, Broadcasting and Communication
    • /
    • 제12권3호
    • /
    • pp.220-225
    • /
    • 2020
  • This paper investigates the non-intrusive speech intelligibility estimation method in noise environments when the bottleneck feature of autoencoder is used as an input to a neural network. The bottleneck feature-based method has the problem of severe performance degradation when the noise environment is changed. In order to overcome this problem, we propose a novel non-intrusive speech intelligibility estimation method that adds the noise environment information along with bottleneck feature to the input of long short-term memory (LSTM) neural network whose output is a short-time objective intelligence (STOI) score that is a standard tool for measuring intrusive speech intelligibility with reference speech signals. From the experiments in various noise environments, the proposed method showed improved performance when the noise environment is same. In particular, the performance was significant improved compared to that of the conventional methods in different environments. Therefore, we can conclude that the method proposed in this paper can be successfully used for estimating non-intrusive speech intelligibility in various noise environments.

Voice Activity Detection Based on SNR and Non-Intrusive Speech Intelligibility Estimation

  • An, Soo Jeong;Choi, Seung Ho
    • International Journal of Internet, Broadcasting and Communication
    • /
    • 제11권4호
    • /
    • pp.26-30
    • /
    • 2019
  • This paper proposes a new voice activity detection (VAD) method which is based on SNR and non-intrusive speech intelligibility estimation. In the conventional SNR-based VAD methods, voice activity probability is obtained by estimating frame-wise SNR at each spectral component. However these methods lack performance in various noisy environments. We devise a hybrid VAD method that uses non-intrusive speech intelligibility estimation as well as SNR estimation, where the speech intelligibility score is estimated based on deep neural network. In order to train model parameters of deep neural network, we use MFCC vector and the intrusive speech intelligibility score, STOI (Short-Time Objective Intelligent Measure), as input and output, respectively. We developed speech presence measure to classify each noisy frame as voice or non-voice by calculating the weighted average of the estimated STOI value and the conventional SNR-based VAD value at each frame. Experimental results show that the proposed method has better performance than the conventional VAD method in various noisy environments, especially when the SNR is very low.

Robust Non-negative Matrix Factorization with β-Divergence for Speech Separation

  • Li, Yinan;Zhang, Xiongwei;Sun, Meng
    • ETRI Journal
    • /
    • 제39권1호
    • /
    • pp.21-29
    • /
    • 2017
  • This paper addresses the problem of unsupervised speech separation based on robust non-negative matrix factorization (RNMF) with ${\beta}$-divergence, when neither speech nor noise training data is available beforehand. We propose a robust version of non-negative matrix factorization, inspired by the recently developed sparse and low-rank decomposition, in which the data matrix is decomposed into the sum of a low-rank matrix and a sparse matrix. Efficient multiplicative update rules to minimize the ${\beta}$-divergence-based cost function are derived. A convolutional extension of the proposed algorithm is also proposed, which considers the time dependency of the non-negative noise bases. Experimental speech separation results show that the proposed convolutional RNMF successfully separates the repeating time-varying spectral structures from the magnitude spectrum of the mixture, and does so without any prior training.

Considering Dynamic Non-Segmental Phonetics

  • Fujino, Yoshinari
    • 대한음성학회:학술대회논문집
    • /
    • 대한음성학회 2000년도 7월 학술대회지
    • /
    • pp.312-320
    • /
    • 2000
  • This presentation aims to explore some possibility of non-segmental phonetics usually ignored in phonetics education. In pedagogical phonetics, especially ESL/EFL oriented phonetics speech sounds tend to be classified in two criteria 1) 'pronunciation' which deals with segments and 2) 'prosody' or 'suprasegmentals', a criterion that deals with non-segmental elements such as stress and intonation. However, speech involves more dynamic processing. It is non-linear and multi-dimensional in spite of the linear sequence of symbols in phonetic/phonological transcriptions. No word is without pitch or voice quality apart from segmental characteristics whether it is spoken in isolation or cut out from continuous speech. This simply tells the dichotomy of pronunciation and prosody is merely a useful convention. There exists some room to consider dynamic non-segmental phonetics. Examples of non-segmental phonetic investigation, some of the analyses conducted within the frame of Firthian Prosodic Analysis, especially of the relation between vowel variants and foot types, are examined and we see what kind of auditory phonetic training is required to understand impressionistic transcriptions which lie behind the non-segmental phonetics.

  • PDF

A Fixed Rate Speech Coder Based on the Filter Bank Method and the Inflection Point Detection

  • Iem, Byeong-Gwan
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • 제16권4호
    • /
    • pp.276-280
    • /
    • 2016
  • A fixed rate speech coder based on the filter bank and the non-uniform sampling technique is proposed. The non-uniform sampling is achieved by the detection of inflection points (IPs). A speech block is band passed by the filter bank, and the subband signals are processed by the IP detector, and the detected IP patterns are compared with entries of the IP database. For each subband signal, the address of the closest member of the database and the energy of the IP pattern are transmitted through channel. In the receiver, the decoder recovers the subband signals using the received addresses and the energy information, and reconstructs the speech via the filter bank summation. As results, the coder shows fixed data rate contrary to the existing speech coders based on the non-uniform sampling. Through computer simulation, the usefulness of the proposed technique is confirmed. The signal-to-noise ratio (SNR) performance of the proposed method is comparable to that of the uniform sampled pulse code modulation (PCM) below 20 kbps data rate.

Parts-Based Feature Extraction of Spectrum of Speech Signal Using Non-Negative Matrix Factorization

  • Park, Jeong-Won;Kim, Chang-Keun;Lee, Kwang-Seok;Koh, Si-Young;Hur, Kang-In
    • Journal of information and communication convergence engineering
    • /
    • 제1권4호
    • /
    • pp.209-212
    • /
    • 2003
  • In this paper, we proposed new speech feature parameter through parts-based feature extraction of speech spectrum using Non-Negative Matrix Factorization (NMF). NMF can effectively reduce dimension for multi-dimensional data through matrix factorization under the non-negativity constraints, and dimensionally reduced data should be presented parts-based features of input data. For speech feature extraction, we applied Mel-scaled filter bank outputs to inputs of NMF, than used outputs of NMF for inputs of speech recognizer. From recognition experiment result, we could confirm that proposed feature parameter is superior in recognition performance than mel frequency cepstral coefficient (MFCC) that is used generally.

가변어휘 핵심어 검출을 위한 비핵심어 모델링 및 후처리 성능평가 (Performance Evaluation of Nonkeyword Modeling and Postprocessing for Vocabulary-independent Keyword Spotting)

  • 김형순;김영국;신영욱
    • 음성과학
    • /
    • 제10권3호
    • /
    • pp.225-239
    • /
    • 2003
  • In this paper, we develop a keyword spotting system using vocabulary-independent speech recognition technique, and investigate several non-keyword modeling and post-processing methods to improve its performance. In order to model non-keyword speech segments, monophone clustering and Gaussian Mixture Model (GMM) are considered. We employ likelihood ratio scoring method for the post-processing schemes to verify the recognition results, and filler models, anti-subword models and N-best decoding results are considered as an alternative hypothesis for likelihood ratio scoring. We also examine different methods to construct anti-subword models. We evaluate the performance of our system on the automatic telephone exchange service task. The results show that GMM-based non-keyword modeling yields better performance than that using monophone clustering. According to the post-processing experiment, the method using anti-keyword model based on Kullback-Leibler distance and N-best decoding method show better performance than other methods, and we could reduce more than 50% of keyword recognition errors with keyword rejection rate of 5%.

  • PDF

GMM을 이용한 응급 단어와 비응급 단어의 검출 및 인식 기법 (Detection and Recognition Method for Emergency and Non-emergency Speech by Gaussian Mixture Model)

  • 조영임;이대종
    • 한국지능시스템학회논문지
    • /
    • 제21권2호
    • /
    • pp.254-259
    • /
    • 2011
  • 일반적으로 어떤 순간에 발생할지 모르는 응급 상황을 CCTV의 영상 정보만으로 상황을 항상 모니터링하기에는 인력과 비용의문제점이 발생되고 있다. 본 논문에서는 응급상황을 동적으로 보여주는 CCTV환경에서 감지하기 위해 GMM을 이용한 응급단어와 비응급단어의 검출 및 인식기법을제안하고자 한다. 제안된 방법은 Global GMM 모델에 의해 응급단어와 일반단어를 검출하고 이 모델에 의해 응급단어라 판정된 경우에는 Local GMM 모델에 응급단어 인식을 수행하게 된다. 제안된 방법은 다양한 환경하에서 취득한 응급단어와 일반단어에 대해 적용하여 타당성을 검증하였다.

Automatic proficiency assessment of Korean speech read aloud by non-natives using bidirectional LSTM-based speech recognition

  • Oh, Yoo Rhee;Park, Kiyoung;Jeon, Hyung-Bae;Park, Jeon Gue
    • ETRI Journal
    • /
    • 제42권5호
    • /
    • pp.761-772
    • /
    • 2020
  • This paper presents an automatic proficiency assessment method for a non-native Korean read utterance using bidirectional long short-term memory (BLSTM)-based acoustic models (AMs) and speech data augmentation techniques. Specifically, the proposed method considers two scenarios, with and without prompted text. The proposed method with the prompted text performs (a) a speech feature extraction step, (b) a forced-alignment step using a native AM and non-native AM, and (c) a linear regression-based proficiency scoring step for the five proficiency scores. Meanwhile, the proposed method without the prompted text additionally performs Korean speech recognition and a subword un-segmentation for the missing text. The experimental results indicate that the proposed method with prompted text improves the performance for all scores when compared to a method employing conventional AMs. In addition, the proposed method without the prompted text has a fluency score performance comparable to that of the method with prompted text.

Adaptive Wavelet Based Speech Enhancement with Robust VAD in Non-stationary Noise Environment

  • Sungwook Chang;Sungil Jung;Younghun Kwon;Yang, Sung-il
    • The Journal of the Acoustical Society of Korea
    • /
    • 제22권4E호
    • /
    • pp.161-166
    • /
    • 2003
  • We present an adaptive wavelet packet based speech enhancement method with robust voice activity detection (VAD) in non-stationary noise environment. The proposed method can be divided into two main procedures. The first procedure is a VAD with adaptive wavelet packet transform. And the other is a speech enhancement procedure based on the proposed VAD method. The proposed VAD method shows remarkable performance even in low SNRs and non-stationary noise environment. And subjective evaluation shows that the performance of the proposed speech enhancement method with wavelet bases is better than that with Fourier basis.