• Title/Summary/Keyword: 화자 검출 (speaker detection)

A Study on the Automatic Speech Control System Using DMS model on Real-Time Windows Environment (실시간 윈도우 환경에서 DMS모델을 이용한 자동 음성 제어 시스템에 관한 연구)

  • 이정기;남동선;양진우;김순협
    • The Journal of the Acoustical Society of Korea / v.19 no.3 / pp.51-56 / 2000
  • In this paper, we study an automatic speech control system for a real-time Windows environment using voice recognition. The reference pattern is the variable DMS model, which is proposed to speed up execution, and a one-stage DP algorithm using this model serves as the recognition algorithm. The recognition vocabulary consists of control command words that are frequently used in the Windows environment. An automatic speech period detection algorithm for on-line voice processing in the Windows environment is implemented. The proposed variable DMS model uses a variable number of sections according to the duration of the input signal. Because unnecessary recognition target words are sometimes generated, the model is reconstructed on-line to handle them efficiently. Perceptual Linear Predictive analysis is applied to generate feature vectors from the extracted voice features. According to the experimental results, recognition is faster with the proposed model because of its small computational load. The multi-speaker-independent and multi-speaker-dependent recognition rates are 99.08% and 99.39%, respectively. In a noisy environment the recognition rate is 96.25%.

The Development of a Speech Recognition Method Robust to Channel Distortions and Noisy Environments for an Audio Response System(ARS) (잡음환경및 채널왜곡에 강인한 ARS용 전화음성인식 방식 연구)

  • Ahn, Jung-Mo;Yim, Kye-Jong;Kay, Young-Chul;Koo, Myoung-Wan
    • The Journal of the Acoustical Society of Korea / v.16 no.2 / pp.41-48 / 1997
  • This paper proposes methods for improving the recognition rate of an ARS equipped with speech recognition capability. Telephone speech, which is the input to the ARS, is usually affected by announcements from the system, channel noise, and channel distortion, so directly applying a recognition algorithm developed for clean speech to noisy telephone speech causes significant performance degradation. To cope with this problem, this paper proposes three methods: 1) accurate detection of the instant at which speech input begins, so that the system announcements can be turned off immediately at that instant; 2) effective end-point detection of noisy telephone speech on the basis of Teager energy; and 3) SDCN-based compensation of the channel distortion. Experiments on speaker-independent, noisy telephone speech show that the combination of the three proposed methods greatly improves the recognition rate over the conventional method, reaching about 77% in contrast to only 23%.
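As an illustration of the Teager-energy measure named in method 2, the following minimal Python sketch computes the discrete Teager energy operator, psi[n] = x[n]^2 - x[n-1] * x[n+1], and reports the first and last samples whose energy exceeds a threshold. The function names, the threshold value, and the per-sample decision rule are assumptions for illustration; the abstract does not describe the paper's actual end-point detection procedure.

```python
import math

def teager_energy(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n-1]*x[n+1] for interior samples."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def detect_endpoints(x, threshold):
    """Return (first, last) sample indices whose Teager energy exceeds threshold, or None."""
    psi = teager_energy(x)
    # psi[k] belongs to sample k + 1 because the first sample has no left neighbour.
    active = [k + 1 for k, e in enumerate(psi) if e > threshold]
    return (active[0], active[-1]) if active else None

# Toy signal (hypothetical): silence, a sinusoidal burst, then silence again.
omega = 0.3
signal = [0.0] * 50 + [math.cos(omega * n) for n in range(100)] + [0.0] * 50
endpoints = detect_endpoints(signal, threshold=0.01)
```

For a pure sinusoid A*cos(omega*n) the operator returns the constant A^2 * sin^2(omega), which is why it responds to both amplitude and frequency rather than to squared amplitude alone.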

A New Adaptive Echo Canceller with an Improved Convergence Speed and NET Detection Performance (향상된 수렴속도와 근단화자신호 검출능력을 갖는 적응반향제거기)

  • 김남선;박상택;차용훈;윤일화;윤대희
    • Journal of the Korean Institute of Telematics and Electronics B / v.30B no.12 / pp.12-20 / 1993
  • In a conventional adaptive echo canceller, an ADF (Adaptive Digital Filter) with a TDL (Tapped-Delay Line) structure modelling the echo path uses the LMS (Least Mean Square) algorithm to compute the coefficients, and a NET (near-end talker) detector based on energy comparison prevents the ADF from updating the coefficients while the NET signal is present. The convergence speed of the LMS algorithm depends on the eigenvalue spread ratio of the reference signal, and a NET detector based on energy comparison performs poorly if the magnitude of the NET signal is small. This paper presents a new adaptive echo canceller that uses a pre-whitening filter to improve the convergence speed of the LMS algorithm. The pre-whitening filter is realized with a low-order lattice predictor. A new NET signal detection algorithm is also presented, in which the start point of the NET signal is detected by computing the cross-correlation coefficient between the primary input and the ADF output, while the end point is detected by the energy comparison method. The simulation results show that the proposed adaptive echo canceller converges faster than the conventional echo canceller and that the cross-correlation coefficient yields more accurate detection of the start point of the NET signal.
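The conventional structure described above (a tapped-delay-line filter adapted by LMS, with coefficient updates inhibited while the near-end talker is active) can be sketched as follows. This is a minimal sketch, not the paper's pre-whitened canceller: the function name, the step size, the 3-tap echo path, and the `freeze` flags standing in for a NET detector are all assumptions for illustration.

```python
import random

def lms_echo_canceller(far_end, mic, num_taps, mu, freeze=None):
    """Adapt a tapped-delay-line FIR filter with LMS; return (weights, residual).

    freeze[n] == True skips the coefficient update at sample n, which is how a
    near-end-talker (NET) detector would inhibit adaptation.
    """
    w = [0.0] * num_taps
    delay_line = [0.0] * num_taps
    residual = []
    for n, (x, d) in enumerate(zip(far_end, mic)):
        delay_line = [x] + delay_line[:-1]            # shift in the newest far-end sample
        echo_estimate = sum(wi * xi for wi, xi in zip(w, delay_line))
        e = d - echo_estimate                         # residual after echo removal
        residual.append(e)
        if not (freeze and freeze[n]):
            # LMS update: w <- w + mu * e * x
            w = [wi + mu * e * xi for wi, xi in zip(w, delay_line)]
    return w, residual

# Toy experiment: hypothetical 3-tap echo path, white far-end signal, no near-end talker.
random.seed(0)
echo_path = [0.5, -0.3, 0.1]
far_end = [random.uniform(-1.0, 1.0) for _ in range(3000)]
mic = [sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(far_end))]
weights, residual = lms_echo_canceller(far_end, mic, num_taps=3, mu=0.05)
```

With a white far-end signal the eigenvalue spread is close to 1, so plain LMS converges quickly here; a strongly coloured input such as speech would slow it down, which is the case the pre-whitening filter targets.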

On Improving Convergence Speed and NET Detection Performance for Adaptive Echo Canceller (향상된 수렴 속도와 근단 화자 신호 검출능력을 갖는 적응 반향 제거기)

  • 김남선
    • Proceedings of the Acoustical Society of Korea Conference / 1992.06a / pp.23-28 / 1992
  • The purpose of this paper is to develop a new adaptive echo canceller that improves on the convergence speed and near-end-talker (NET) detection performance of the conventional echo canceller. In a conventional adaptive echo canceller, an adaptive digital filter with a TDL (Tapped-Delay Line) structure modelling the echo path uses the LMS (Least Mean Square) algorithm to compute the coefficients, and a NET detector based on energy comparison prevents the adaptive digital filter from updating the coefficients while the NET signal is present. The convergence speed of the LMS algorithm depends on the eigenvalue spread ratio of the reference signal, and a NET detector based on energy comparison performs poorly if the magnitude of the NET signal is small. This paper presents a new adaptive echo canceller that uses a pre-whitening filter to improve the convergence speed of the LMS algorithm. The pre-whitening filter is realized with a low-order lattice predictor. A new NET signal detection algorithm is also presented, in which the start point of the NET signal is detected by computing the cross-correlation coefficient between the primary input and the ADF (Adaptive Digital Filter) output, while the end point is detected by the energy comparison method. The simulation results show that the proposed adaptive echo canceller converges faster than the conventional echo canceller and that the cross-correlation coefficient yields more accurate detection of the start point of the NET signal.
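The start-point detector relies on a cross-correlation coefficient between the primary input and the filter output. A minimal sketch of a zero-lag normalized cross-correlation over one frame is shown below. The exact definition, frame length, and decision threshold used in the paper are not given in the abstract, so this form and the toy signals are assumptions.

```python
import math

def cross_correlation_coefficient(a, b):
    """Zero-lag normalized cross-correlation of two equal-length frames, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("frames must have the same length")
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0

# Toy setup: when the primary input contains only echo it matches the filter
# output, so the coefficient is 1; adding a hypothetical near-end signal lowers it.
echo_estimate = [math.sin(0.2 * n) for n in range(200)]
primary_echo_only = list(echo_estimate)
primary_with_net = [e + 2.0 * math.sin(1.3 * n + 0.5) for n, e in enumerate(echo_estimate)]
rho_echo_only = cross_correlation_coefficient(primary_echo_only, echo_estimate)
rho_with_net = cross_correlation_coefficient(primary_with_net, echo_estimate)
```

Because the coefficient is normalized by the energies of both frames, it measures similarity of shape rather than level, which is one plausible reason it can react to a weak near-end signal that an energy comparison would miss.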

A Study on the Segmentation of Speech Signal into Phonemic Units (음성 신호의 음소 단위 구분화에 관한 연구)

  • Lee, Yeui-Cheon;Lee, Gang-Sung;Kim, Soon-Hyon
    • The Journal of the Acoustical Society of Korea / v.10 no.4 / pp.5-11 / 1991
  • This paper suggests a method for segmenting a speech signal into phonemic units. The suggested segmentation system is speaker-independent and operates without any prior information about the speech signal. In the segmentation process, we first divide the input speech signal into pure voiced regions and non-pure-voiced regions. We then apply a second algorithm that segments each region into detailed phonemic units using the voiced detection parameters, i.e., the time variation of the 0th LPC cepstrum coefficient and the ZCR parameter. The speech used to validate the suggested segmentation algorithm is a vocabulary composed of isolated words and continuous words. According to the experiments, the successful segmentation rate for the 507 phonemic units in the total vocabulary is 91.7%.
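One of the parameters named above is the zero-crossing rate (ZCR). A minimal sketch of a per-frame ZCR follows. The frame length, hop size, and function names are assumptions, and the paper's 0th LPC cepstrum coefficient parameter is not reproduced here.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def framewise_zcr(signal, frame_len, hop):
    """ZCR of each full frame of length frame_len, advancing by hop samples."""
    return [zero_crossing_rate(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, hop)]

# Toy signal: a slowly varying segment followed by a rapidly alternating one.
slow = [1.0] * 8 + [-1.0] * 8
fast = [1.0, -1.0] * 8
rates = framewise_zcr(slow + fast, frame_len=16, hop=16)
```

Voiced speech tends to give a low ZCR and unvoiced, noise-like speech a high one, so a frame-by-frame ZCR track is a common cue for separating the two kinds of region.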

Auditory Representations for Robust Speech Recognition in Noisy Environments (잡음 환경에서의 음성 인식을 위한 청각 표현)

  • Kim, Doh-Suk;Lee, Soo-Young;Kil, Rhee-M.
    • The Journal of the Acoustical Society of Korea / v.15 no.5 / pp.90-98 / 1996
  • An auditory model is proposed for robust speech recognition in noisy environments. The model consists of cochlear bandpass filters and nonlinear stages, and represents frequency and intensity information efficiently even in noisy environments. Frequency information of the signal is obtained by zero-crossing intervals, and intensity information is also incorporated by peak detectors and saturating nonlinearities. Also, the robustness of the zero-crossings in estimating frequency is verified by the developed analytic relationship of the variance of the level-crossing interval perturbations as a function of the crossing level values. The proposed auditory model is computationally efficient and free from many unknown parameters compared with other auditory models. Speaker-independent speech recognition experiments demonstrate the robustness of the proposed method.

Speech Activity Decision with Lip Movement Image Signals (입술움직임 영상신호를 고려한 음성존재 검출)

  • Park, Jun;Lee, Young-Jik;Kim, Eung-Kyeu;Lee, Soo-Jong
    • The Journal of the Acoustical Society of Korea / v.26 no.1 / pp.25-31 / 2007
  • This paper describes an attempt to prevent external acoustic noise from being misrecognized as a speech recognition target. To this end, the speech activity detection process for speech recognition checks the lip movement image signal of the speaker in addition to the acoustic energy. First, successive images are obtained through a PC camera, and the presence or absence of lip movement is determined. The lip movement image signal data is then stored in shared memory and shared with the recognition process. Meanwhile, the speech activity detection process, which is the preprocessing phase of speech recognition, checks the data stored in shared memory to verify whether the acoustic energy comes from the speech of the speaker. The speech recognition processor and the image processor were connected and tested successfully. When the speaker faced the camera and spoke, the speech recognition result was output normally. When the speaker spoke without facing the camera, no speech recognition result was output. That is, if no lip movement is identified in the image even though acoustic energy is input, the input is regarded as acoustic noise.

Context-adaptive Phoneme Segmentation for a TTS Database (문자-음성 합성기의 데이터 베이스를 위한 문맥 적응 음소 분할)

  • 이기승;김정수
    • The Journal of the Acoustical Society of Korea / v.22 no.2 / pp.135-144 / 2003
  • A method for the automatic segmentation of speech signals is described. The method is intended for the construction of a large database for a Text-To-Speech (TTS) synthesis system. The main issue of the work is the refinement of initial estimates of phone boundaries provided by an alignment based on a Hidden Markov Model (HMM). A multi-layer perceptron (MLP) is used as a phone boundary detector. To increase segmentation performance, a technique that trains MLPs individually according to phonetic transition is proposed. The optimum partitioning of the entire phonetic transition space is constructed from the standpoint of minimizing the overall deviation from hand-labelled positions. With single-speaker stimuli, the experimental results show that more than 95% of all phone boundaries deviate from the reference position by less than 20 ms, and the refinement of the boundaries reduces the root mean square error by about 25%.

Robust Speech Recognition Using Missing Data Theory (손실 데이터 이론을 이용한 강인한 음성 인식)

  • 김락용;조훈영;오영환
    • The Journal of the Acoustical Society of Korea / v.20 no.3 / pp.56-62 / 2001
  • In this paper, we apply missing data theory to speech recognition. It can be used to maintain high speech recognizer performance when missing data occurs. In general, a hidden Markov model (HMM) is used as a stochastic classifier for the speech recognition task. In a continuous density HMM (CDHMM), acoustic events are represented by continuous probability density functions. An advantage of missing data theory is that it is easily applied to the CDHMM. A marginalization method is used for processing missing data because it has low complexity and is easy to apply to automatic speech recognition (ASR). Spectral subtraction is used for detecting missing data: if the difference between the energy of the speech and that of the background noise is below a given threshold value, we determine that missing data has occurred. We propose a new method that examines the reliability of detected missing data using voicing probability. The voicing probability is used to find voiced frames, and thus to process the missing data in voiced regions, which carry more redundant information than consonants. The experimental results show that our method improves performance over a baseline system that uses the spectral subtraction method only. In a 452-word isolated word recognition experiment, the proposed method using the voicing probability reduced the average word error rate by 12% in a typical noise situation.
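The detection rule stated above (a component is treated as missing when the speech energy exceeds the background-noise energy by less than a threshold) can be sketched as a reliability mask. The function name, the use of per-band energies in dB, and the example values are assumptions for illustration. The voicing-probability check proposed in the paper is not shown.

```python
def missing_data_mask(speech_energy_db, noise_energy_db, threshold_db):
    """Return True for bands judged missing (unreliable), False for reliable bands."""
    return [(s - n) < threshold_db
            for s, n in zip(speech_energy_db, noise_energy_db)]

# Hypothetical per-band energies for one frame, with a flat noise estimate.
observed = [30.0, 12.0, 25.0, 9.0]
noise = [10.0, 10.0, 10.0, 10.0]
mask = missing_data_mask(observed, noise, threshold_db=3.0)
```

In a marginalization-based recognizer, bands flagged True would be integrated out of the CDHMM state likelihood rather than scored, so only the reliable bands contribute to the decision.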

A Method of Generating Table-of-Contents for Educational Video (교육용 비디오의 ToC 자동 생성 방법)

  • Lee Gwang-Gook;Kang Jung-Won;Kim Jae-Gon;Kim Whoi-Yul
    • Journal of Broadcast Engineering / v.11 no.1 s.30 / pp.28-41 / 2006
  • With the rapid development of multimedia appliances, the increasing amount of multimedia data calls for automatic video analysis techniques. In this paper, a method of ToC generation is proposed for educational video content. The proposed method consists of two parts: scene segmentation followed by scene annotation. First, the video sequence is divided into scenes by the proposed scene segmentation algorithm, which exploits the characteristics of educational video. Then each shot in the scene is annotated in terms of scene type, existence of enclosed caption, and main speaker of the shot. The ToC generated by the proposed method represents the structure of a video as a hierarchy of scenes and shots and describes each scene and shot by its extracted features. Hence the generated ToC can help users to grasp the content of a video at a glance and to access a desired position in the video easily. Also, the ToC generated automatically by the system can be further edited manually for refinement, which effectively reduces the time required to achieve a more detailed description of the video content. The experimental results show that the proposed method can generate a ToC for educational video with high accuracy.