• Title/Summary/Keyword: Speech signals

Conformer with lexicon transducer for Korean end-to-end speech recognition (Lexicon transducer를 적용한 conformer 기반 한국어 end-to-end 음성인식)

  • Son, Hyunsoo;Park, Hosung;Kim, Gyujin;Cho, Eunsoo;Kim, Ji-Hwan
    • The Journal of the Acoustical Society of Korea
    • /
    • v.40 no.5
    • /
    • pp.530-536
    • /
    • 2021
  • Recently, due to the development of deep learning, end-to-end speech recognition, which directly maps speech signals to graphemes, shows good performance. Among end-to-end models, the conformer shows the best performance. However, end-to-end models only consider the probability of which grapheme will appear at each time step, and the decoding process uses a greedy search or beam search. This decoding method is easily affected by the final probability output by the model. In addition, end-to-end models cannot use external pronunciation and language information due to structural constraints. Therefore, in this paper a conformer with a lexicon transducer is proposed. We compare a phoneme-based model with a lexicon transducer against a grapheme-based model with beam search. The test set consists of words that do not appear in the training data. The grapheme-based conformer with beam search shows a CER (Character Error Rate) of 3.8 %, while the phoneme-based conformer with a lexicon transducer shows a CER of 3.4 %.
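
The CER figures quoted above follow from a standard character-level edit distance; a minimal sketch (not from the paper, which does not specify its scoring code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds the diagonal cell d[i-1][j-1]
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution
    return d[len(hyp)]

def cer(ref, hyp):
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

For example, a hypothesis with one wrong character out of four gives `cer("abcd", "abcf") == 0.25`.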

Modified AWSSDR method for frequency-dependent reverberation time estimation (주파수 대역별 잔향시간 추정을 위한 변형된 AWSSDR 방식)

  • Min Sik Kim;Hyung Soon Kim
    • Phonetics and Speech Sciences
    • /
    • v.15 no.4
    • /
    • pp.91-100
    • /
    • 2023
  • Reverberation time (T60) is a typical acoustic parameter that provides information about reverberation. Since the impacts of reverberation vary depending on the frequency bands even in the same space, frequency-dependent (FD) T60, which offers detailed insights into the acoustic environments, can be useful. However, most conventional blind T60 estimation methods, which estimate the T60 from speech signals, focus on fullband T60 estimation, and a few blind FDT60 estimation methods commonly show poor performance in the low-frequency bands. This paper introduces a modified approach based on Attentive pooling based Weighted Sum of Spectral Decay Rates (AWSSDR), previously proposed for blind T60 estimation, by extending its target from fullband T60 to FDT60. The experimental results show that the proposed method outperforms conventional blind FDT60 estimation methods on the acoustic characterization of environments (ACE) challenge evaluation dataset. Notably, it consistently exhibits excellent estimation performance in all frequency bands. This demonstrates that the mechanism of the AWSSDR method is valuable for blind FDT60 estimation because it reflects the FD variations in the impact of reverberation, aggregating information about FDT60 from the speech signal by processing the spectral decay rates associated with the physical properties of reverberation in each frequency band.
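
The spectral decay rate that AWSSDR aggregates can be illustrated as the least-squares slope of the log-magnitude trajectory in each frequency band; a simplified sketch of that one ingredient (the attentive pooling and learned weighting of the actual method are omitted):

```python
import numpy as np

def spectral_decay_rates(log_spec):
    """Least-squares slope of the log-magnitude trajectory in each
    frequency band over a window of frames (input shape: frames x bands).
    More negative slopes mean faster energy decay, which relates to a
    shorter reverberation time in that band."""
    n_frames, _ = log_spec.shape
    t = np.arange(n_frames, dtype=float)
    t -= t.mean()
    # slope per band: cov(t, x) / var(t)
    return (t[:, None] * (log_spec - log_spec.mean(axis=0))).sum(axis=0) / (t ** 2).sum()
```

On a synthetic spectrogram whose band energies fall linearly in time, the function recovers the per-band decay slopes exactly.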

An Implementation of Acoustic Echo Canceller Using Adaptive Filtering in Modulated Lapped Transform Domain (Modulated Lapped Transform 영역에서 적응 필터링을 이용한 음향 반향 제거기의 구현)

  • 백수진;박규식
    • The Journal of the Acoustical Society of Korea
    • /
    • v.22 no.6
    • /
    • pp.425-433
    • /
    • 2003
  • An Acoustic Echo Canceller (AEC) is a signal processing system for removing unwanted echo signals in teleconferencing and hands-free communication. The least mean square (LMS) algorithm is one of the adaptive echo cancellation algorithms, and it has been the most attractive because of its simplicity and robustness. However, the convergence properties of the LMS algorithm degrade with highly correlated input signals such as speech. For this reason, transform-domain adaptive filtering was introduced: the colored input samples are decorrelated by an orthogonal transform such as the DCT or DFT, and then LMS adaptive filtering is applied. In this paper, we propose an adaptive echo canceller based on the Modulated Lapped Transform (MLT). The proposed algorithm achieves high decorrelation efficiency and fast convergence by using a modulated lapped transform of size 2N×N instead of an N×N unitary transform such as the DCT, DFT, or Hadamard transform, and it is applied to an acoustic echo cancellation system. From computer simulations with both synthetic and real speech, the proposed MLT-domain adaptive echo canceller shows approximately twice the convergence speed and 20-30 dB ERLE improvement over a DCT frequency-domain acoustic echo cancellation system.
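
The LMS baseline that the MLT scheme improves on can be sketched as a time-domain normalized LMS (NLMS) echo canceller; this is an illustrative reference implementation, not the paper's MLT-domain algorithm:

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, n_taps=64, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive filter: estimates the echo path from the
    far-end (loudspeaker) signal and subtracts the echo estimate from the
    microphone signal, returning the error (echo-cancelled) signal."""
    w = np.zeros(n_taps)
    err = np.zeros(len(mic))
    for n in range(n_taps - 1, len(mic)):
        x = far_end[n - n_taps + 1:n + 1][::-1]  # most recent sample first
        e = mic[n] - w @ x                       # residual after echo removal
        w += mu * e * x / (x @ x + eps)          # normalized weight update
        err[n] = e
    return err, w
```

With white-noise input and an echo path shorter than the filter, the residual echo decays toward zero; with correlated (speech-like) input the convergence slows, which is exactly what motivates the transform-domain variants discussed in the abstract.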

Implementation and Evaluation of Electroglottograph System (전기성문전도(EGG) 시스템의 개발 및 평가)

  • 김기련;김광년;왕수건;허승덕;이승훈;전계록;최병철;정동근
    • Journal of Biomedical Engineering Research
    • /
    • v.25 no.5
    • /
    • pp.343-349
    • /
    • 2004
  • The electroglottograph (EGG) is a signal recorded from vocal cord vibration by measuring the electrical impedance across the vocal folds through the neck skin. The purpose of this study was to develop an EGG system and to evaluate its applicability to speech analysis and laryngeal disease diagnosis. The EGG system was composed of two pairs of ring electrodes, a tuned amplifier, a phase-sensitive detector, a low-pass filter, and an auto-gain controller. It was designed to extract the electrical impedance by amplitude demodulation of a 2.7 MHz carrier signal. The extracted signals were transmitted through the line-in of a PC sound card, then sampled and quantized. Closed Quotient (CQ), Speed Quotient (SQ), Speed Index (SI), fundamental frequency of vocal cord vibration (F0), pitch variability of vocal fold vibration (Jitter), and peak-to-peak amplitude variability of vocal fold vibration (Shimmer) were analyzed as EGG parameters. Experimental results were as follows: the faster the vocal fold vibration, the higher the CQ value and the lower the SQ and SI values. EGG and speech signals had the same fundamental frequency. CQ, SQ, and SI were significantly different between normal subjects and patients with laryngeal cancer. These results suggest that it is possible to implement a portable EGG system to monitor vocal cord function and to test functional changes of the glottis.
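
Given per-cycle measurements from the EGG trace, the parameters named above reduce to simple ratios; a hypothetical sketch (cycle segmentation itself, the hard part, is assumed done):

```python
def egg_parameters(closed_durs, cycle_durs, amplitudes):
    """Per-cycle EGG parameters from lists of closed-phase durations,
    full cycle durations (seconds), and cycle peak amplitudes:
    mean Closed Quotient (CQ), mean F0, jitter (relative cycle-to-cycle
    period variability), and shimmer (relative amplitude variability)."""
    n = len(cycle_durs)
    cq = sum(c / t for c, t in zip(closed_durs, cycle_durs)) / n
    f0 = sum(1.0 / t for t in cycle_durs) / n
    jitter = (sum(abs(cycle_durs[i] - cycle_durs[i - 1]) for i in range(1, n))
              / (n - 1)) / (sum(cycle_durs) / n)
    shimmer = (sum(abs(amplitudes[i] - amplitudes[i - 1]) for i in range(1, n))
               / (n - 1)) / (sum(amplitudes) / n)
    return cq, f0, jitter, shimmer
```

A perfectly periodic 100 Hz train with a half-closed glottal cycle gives CQ = 0.5, F0 = 100 Hz, and zero jitter and shimmer.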

On-Line Audio Genre Classification using Spectrogram and Deep Neural Network (스펙트로그램과 심층 신경망을 이용한 온라인 오디오 장르 분류)

  • Yun, Ho-Won;Shin, Seong-Hyeon;Jang, Woo-Jin;Park, Hochong
    • Journal of Broadcast Engineering
    • /
    • v.21 no.6
    • /
    • pp.977-985
    • /
    • 2016
  • In this paper, we propose a new method for on-line genre classification using a spectrogram and a deep neural network. For on-line processing, the proposed method takes an audio signal over a time period of 1 s and classifies it into one of 3 genres: speech, music, and effect. In order to provide generality of processing, it uses the spectrogram as the feature vector instead of MFCC, which has been widely used for audio analysis. We measure the performance of genre classification using real TV audio signals and confirm that the proposed method has better performance than the conventional method for all genres. In particular, it decreases the rate of classification error between music and effect, which often occurs in the conventional method.
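
A spectrogram feature vector for a 1 s segment can be computed as below; frame length, hop, and sampling rate are illustrative assumptions, since the abstract does not give them:

```python
import numpy as np

def spectrogram_feature(signal, n_fft=512, hop=256):
    """Log-magnitude spectrogram of a short audio segment, flattened
    into a single feature vector suitable as input to a classifier."""
    win = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spec = np.fft.rfft(signal[start:start + n_fft] * win)
        frames.append(np.log(np.abs(spec) + 1e-10))  # floor avoids log(0)
    return np.array(frames).ravel()
```

At 16 kHz, a 1 s input (16000 samples) yields 61 frames of 257 bins, i.e. a 15677-dimensional feature vector.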

Factors for Speech Signal Time Delay Estimation (음성 신호를 이용한 시간지연 추정에 미치는 영향들에 관한 연구)

  • Kwon, Byoung-Ho;Park, Young-Jin;Park, Youn-Sik
    • Proceedings of the Korean Society for Noise and Vibration Engineering Conference
    • /
    • 2008.04a
    • /
    • pp.909-915
    • /
    • 2008
  • Time delay estimation has been studied extensively. However, studies of the factors affecting time delay estimation are insufficient, especially for real-environment applications. In 1997, Brandstein and Silverman reported that the performance of time delay estimation deteriorates as the reverberation time of a room increases. Even when the reverberation time of a room is the same, estimation performance differs depending on the specific part of the signal. To understand the reason, we studied and analyzed the factors affecting time delay estimation using speech signals and room impulse responses. As a result, we found that the performance of time delay estimation changes with the R/D ratio and signal characteristics in spite of the same reverberation time.
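
A standard estimator in this line of work (though the abstract does not name the one used) is GCC-PHAT, which whitens the cross-power spectrum so the correlation peak depends only on phase, i.e. on the delay:

```python
import numpy as np

def gcc_phat(sig, ref, fs=1):
    """Time delay of sig relative to ref (in samples when fs=1) via the
    Generalized Cross-Correlation with PHAT weighting."""
    n = len(sig) + len(ref)                 # zero-pad for linear correlation
    X = np.fft.rfft(sig, n=n)
    Y = np.fft.rfft(ref, n=n)
    r = X * np.conj(Y)
    cc = np.fft.irfft(r / (np.abs(r) + 1e-12), n=n)  # PHAT: unit magnitude
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

On a clean delayed copy the peak sits exactly at the true lag; reverberation smears the room response over many lags, which is why the R/D ratio matters.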

Fast Speech Recognition System using Classification of Energy Labeling (에너지 라벨링 그룹화를 이용한 고속 음성인식시스템)

  • Han Su-Young;Kim Hong-Ryul;Lee Kee-Hee
    • Journal of the Korea Society of Computer and Information
    • /
    • v.9 no.4 s.32
    • /
    • pp.77-83
    • /
    • 2004
  • In this paper, classification by energy labeling is proposed. Energy parameters extracted from each phoneme of the input signal are labeled, and labeling groups are formed according to the detected energies of the input signals. Next, DTW is processed only within the selected labeling group, which makes DTW processing faster than the previous algorithm. Because this method requires accurate parameter detection in the speech-duration and energy-parameter detection steps, variable windows determined by the pitch period are used: the pitch period is detected first, and then the window size is decided between 200 and 300 frames. The proposed method cancels the influence of the windows and reduces the computational complexity by 25 %.
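
The DTW step that the labeling groups accelerate is the classic dynamic-programming alignment; a minimal sketch for 1-D feature sequences (the paper's features and local cost are not specified):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two feature sequences,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion predecessors
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Restricting the search to one labeling group means calling this O(nm) routine against far fewer reference templates, which is where the speedup comes from.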

The Changes in the Closed Quotient of Trained Singers and Untrained Controls Under Varying Intensity at a Constant Vocal Pitch (음도 고정 시 강도 변화에 따른 일반인과 성악인 발성의 성대접촉률 변화 특성의 비교)

  • Kim, Han-Su;Jeon, Yong-Sun;Chung, Sung-Min;Cho, Kun-Kyung;Park, Eun-Hee
    • Journal of the Korean Society of Laryngology, Phoniatrics and Logopedics
    • /
    • v.16 no.1
    • /
    • pp.28-32
    • /
    • 2005
  • Background and Objectives : The two most important factors of voice production are the respiratory function, which is the power source of the voice, and the glottic closure, which transforms the airflow into sound signals. The purpose of this study was to investigate the differences between trained singers and untrained controls under varying intensity at a constant vocal pitch, by simultaneously using the airway interruption method and electroglottography (EGG). Materials and Methods : Under two different intensity conditions at a constant vocal pitch (/G/), 20 trained singers (10 male, 10 female) were studied. Mean flow rate (MFR), subglottic pressure (Psub), and intensity were measured with an aerodynamic test using the phonatory function analyzer. Closed quotient (CQ), jitter, and shimmer were also investigated by electroglottography using the Lx Speech Studio. These data were compared with those of normal controls. Results : MFR and Psub increased under the high-intensity condition in all subject groups, but without statistical significance. A statistically significant increase of CQ was observed in male trained singers under the high-intensity condition (untrained male: 51.31±3.70 %, trained male: 55.52±6.07 %, p=.039). Shimmer percent, one of the phonatory stability parameters, also decreased significantly in all subject groups (p<.001). Conclusion : The trained singers' phonation was more efficient than that of the untrained controls. This result means that trained singers can increase loudness with little change in mean flow rate and subglottic pressure but a greater increase in the glottic closed quotient.

A Study on the Performance of Noise Reduction using Multi-Microphones for Digital Hearing Aids (디지털 보청기를 위한 다중 마이크로폰을 이용한 잡음제거 성능 연구)

  • Kang, Hyun-Deok;Song, Young-Rok;Lee, Sang-Min
    • Journal of IKEEE
    • /
    • v.14 no.1
    • /
    • pp.47-54
    • /
    • 2010
  • In this study, we analyzed noise reduction in a noisy environment using 2, 3, 4, or 5 microphones in digital hearing aids. In order for this to be usable in actual digital hearing aids, we made the experimental microphone set similar to a behind-the-ear (BTE) type and recorded the signals accordingly in each situation. With the recorded signals, we reduced the noise in each signal by a noise reduction algorithm using multiple microphones. By comparing the SNR (Signal to Noise Ratio) and PESQ (Perceptual Evaluation of Speech Quality) measurements before and after noise reduction, the results showed that the improvement in performance was highest when three or four microphones were used. Generally, when two or more microphones were used, we found that performance increased as the number of microphones increased.
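
The SNR comparison before and after processing can be computed as below; this is a generic definition against a clean reference, not necessarily the exact variant used in the study:

```python
import numpy as np

def snr_db(clean, processed):
    """Signal-to-noise ratio in dB: energy of the clean reference over
    the energy of the residual (processed minus clean)."""
    noise = processed - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))
```

For example, a constant offset of 0.1 added to a unit-amplitude signal yields an SNR of 20 dB.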

HMM with Global Path Constraint in Viterbi Decoding for Isolated Word Recognition (전체 경로 제한 조건을 갖는 HMM을 이용한 단독음 인식)

  • Kim, Weon-Goo;Ahn, Dong-Soon;Youn, Dae-Hee
    • The Journal of the Acoustical Society of Korea
    • /
    • v.13 no.1E
    • /
    • pp.11-19
    • /
    • 1994
  • Hidden Markov Models (HMMs) with explicit state duration density (HMM/SD) can represent the time-varying characteristics of speech signals more accurately. However, this advantage is reduced with relatively smooth state duration densities or long bounded durations. To solve this problem, we propose HMMs with a global path constraint (HMM/GPC), where transitions between states occur only within prescribed time slots. HMM/GPC explicitly limits state durations and describes the temporal structure of speech simply and efficiently. HMMs formed by combining HMM/GPC with HMM/SD are also presented (HMM/SD+GPC), and their performances are compared. HMM/GPC can be implemented with slight modifications to the conventional Viterbi algorithm. HMM/GPC and HMM/SD+GPC not only show superior performance to the conventional HMM and HMM/SD but also require much less computation. In speaker-independent isolated word recognition experiments, the minimum recognition error rate of HMM/GPC (1.6 %) is 1.1 % lower than that of the conventional HMM, and the required computation decreased by about 57 %.
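
The "slight modification" to Viterbi decoding can be sketched as masking out states outside their prescribed time slots; this is an illustrative reading of the constraint, not the paper's exact formulation:

```python
import numpy as np

def viterbi_gpc(log_b, log_a, slots):
    """Viterbi decoding with a global path constraint: state s may only
    be occupied inside its time slot slots[s] = (t_min, t_max); scores
    outside the slot stay at -inf. log_b: (T, S) log observation scores;
    log_a: (S, S) log transition scores (left-to-right, start in state 0).
    Returns the best state sequence."""
    T, S = log_b.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    if slots[0][0] <= 0 <= slots[0][1]:
        delta[0, 0] = log_b[0, 0]          # path must start in state 0
    for t in range(1, T):
        for s in range(S):
            if not (slots[s][0] <= t <= slots[s][1]):
                continue                    # the global path constraint
            prev = delta[t - 1] + log_a[:, s]
            psi[t, s] = int(np.argmax(prev))
            delta[t, s] = prev[psi[t, s]] + log_b[t, s]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

Because whole (t, s) cells are skipped, the trellis shrinks, which is consistent with the reported reduction in computation.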
