• Title/Summary/Keyword: voice extract

Real-Time Implementation of Acoustic Echo Canceller Using TMS320C6711 DSK

  • Heo, Won-Chul;Bae, Keun-Sung
    • Speech Sciences / v.15 no.1 / pp.75-83 / 2008
  • The interior of an automobile is a very noisy environment, with both stationary cruising noise and reverberated music or speech coming out of the audio system. For robust speech recognition in a car environment, it is necessary to extract the driver's voice commands cleanly by removing this background noise. Since the music and speech signals from the car's audio system are available as a reference, the reverberated music and speech sounds can be removed with an acoustic echo canceller. In this paper, we implement an acoustic echo canceller with a robust double-talk detection algorithm on the TMS320C6711 DSK. We first developed the echo canceller on a PC to verify its cancellation performance, then implemented it on the TMS320C6711 DSK. With an 8 kHz sampling rate and a 256-tap filter, the implemented system took only 0.035 ms to process one speech sample and achieved an ERLE of 20.73 dB.
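
The abstract does not name the adaptive algorithm, but acoustic echo cancellers of this kind are typically built around an adaptive FIR filter driven by the far-end (audio-system) reference. Below is a minimal NumPy sketch assuming NLMS adaptation with the stated 256 taps; the step size `mu`, the regularizer `eps`, and the function name are illustrative, and the paper's double-talk detector (which would freeze adaptation while the near-end talker is active) is omitted.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, taps=256, mu=0.5, eps=1e-8):
    """Remove the acoustic echo of `far_end` (the audio-system signal)
    from `mic`. Returns the residual, i.e. the microphone signal with
    the estimated echo subtracted. `mu` and `eps` are illustrative.
    """
    w = np.zeros(taps)                         # adaptive FIR filter weights
    x = np.zeros(taps)                         # most recent far-end samples
    e = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)                      # shift the delay line
        x[0] = far_end[n]
        y = w @ x                              # estimated echo
        e[n] = mic[n] - y                      # residual (near-end speech + noise)
        w += (mu / (x @ x + eps)) * e[n] * x   # NLMS weight update
    return e
```

The quoted ERLE figure is then 10·log10 of the ratio of microphone power to residual power, measured while only the far-end signal is active.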

Extraction of Unvoiced Consonant Regions from Fluent Korean Speech in Noisy Environments (잡음환경에서 우리말 연속음성의 무성자음 구간 추출 방법)

  • 박정임;하동경;신옥근
    • The Journal of the Acoustical Society of Korea / v.22 no.4 / pp.286-292 / 2003
  • Voice activity detection (VAD) is a process that separates speech regions from the silence or noise regions of an input speech signal. Since unvoiced consonant signals have characteristics very similar to those of noise, carrying out VAD without paying special attention to unvoiced consonants may result in serious distortion of those consonants or in erroneous noise estimation. In this paper, we propose a method that explicitly extracts the boundaries between unvoiced consonants and noise in fluent speech so that more exact VAD can be performed. The proposed method is based on the frequency-domain histogram that Hirsch used successfully for noise estimation, and also on a similarity measure between the frequency components of adjacent frames. To evaluate the performance of the proposed method, experiments on unvoiced consonant boundary extraction were performed on seven kinds of noisy speech signals at 10 dB and 15 dB SNR.
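
Hirsch's histogram method, which the paper builds on, estimates the noise floor in each frequency band as the mode of a histogram of subband energies over time. A rough NumPy sketch under that reading follows; the bin count and floor constant are illustrative, and the adjacent-frame similarity is shown as a plain cosine similarity, which may differ from the paper's exact measure.

```python
import numpy as np

def hirsch_noise_estimate(power_spec, n_bins=40):
    """Estimate the noise power spectrum from a sequence of STFT power
    spectra, shape (n_frames, n_freqs). Per frequency bin, the noise
    level is taken as the mode of a histogram of the frame energies in
    dB: low-energy (noise-only) frames dominate the histogram, so its
    peak tracks the noise floor.
    """
    log_p = 10.0 * np.log10(power_spec + 1e-12)
    noise = np.empty(power_spec.shape[1])
    for k in range(power_spec.shape[1]):
        hist, edges = np.histogram(log_p[:, k], bins=n_bins)
        m = np.argmax(hist)                       # modal bin
        center_db = 0.5 * (edges[m] + edges[m + 1])
        noise[k] = 10.0 ** (center_db / 10.0)     # back to linear power
    return noise

def frame_similarity(spec_a, spec_b):
    """Cosine similarity between the magnitude spectra of adjacent
    frames -- one plausible form of the paper's similarity measure."""
    denom = np.linalg.norm(spec_a) * np.linalg.norm(spec_b) + 1e-12
    return float(spec_a @ spec_b) / denom
```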

Abnormal SIP Packet Detection Mechanism using Co-occurrence Information (공기 정보를 이용한 비정상 SIP 패킷 공격탐지 기법)

  • Kim, Deuk-Young;Lee, Hyung-Woo
    • Journal of the Korea Academia-Industrial cooperation Society / v.11 no.1 / pp.130-140 / 2010
  • SIP (Session Initiation Protocol) is a signaling protocol for IP-based VoIP (Voice over IP) service. However, many security vulnerabilities exist because the SIP protocol runs over the existing IP network. Malformed SIP message attacks can cause VoIP services to malfunction by altering the transmitted SIP header information, and an attacker can also extract personal information from a SIP client system by inserting malicious code into the SIP header, so countermeasures are required. In this study, we analyze existing research on detecting anomalous SIP messages, and propose a co-occurrence-based SIP packet analysis mechanism borrowed from language processing techniques, together with association-rule generation and an attack detection technique that uses the actual SIP session state. Experimental results showed an average detection rate of 87% against SIP attacks when the proposed technique was used.
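
The abstract does not spell out the detection rule, but a co-occurrence approach of this kind can be sketched as follows: count how often pairs of SIP header tokens appear together in known-good traffic, then score incoming messages by the rarity of their token pairs. The function names, the rarity cutoff, and the scoring rule below are assumptions for illustration, not the paper's exact association rules.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(normal_messages):
    """Count co-occurring pairs of SIP header tokens in known-good
    traffic. `normal_messages` is a list of token lists, e.g.
    [["INVITE", "Via", "From", "To", "Call-ID"], ...]."""
    pairs = Counter()
    for tokens in normal_messages:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

def anomaly_score(tokens, pairs, n_normal, rare_freq=0.01):
    """Fraction of the message's token pairs that are rare in normal
    traffic; thresholding this score flags suspect messages."""
    cand = list(combinations(sorted(set(tokens)), 2))
    if not cand:
        return 1.0
    rare = sum(1 for p in cand if pairs.get(p, 0) / n_normal < rare_freq)
    return rare / len(cand)
```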

A Study on the Correlation between Body-Size and MDVP Parameters in the Normal Male and Female Korean Population (정상 한국인의 성별 체형정보와 MDVP 변수간의 상관관계 연구)

  • Kang, Jae-Hwan;Yoo, Jong-Hyang;Kim, Jong-Yeol
    • Speech Sciences / v.15 no.4 / pp.107-119 / 2008
  • This paper investigates the correlation of 12 MDVP measurements with the age, sex, and body size of sampled healthy subjects. To extract pitch and the 12 MDVP parameters efficiently, and to display the correlation of each parameter easily, we developed a speech analysis program using C/C++ and the MFC development toolkit. The sample group consists of 205 males and 343 females aged 9-81. From each subject we collected recordings of the vowel /a/ and eight body-size measurements, taken at eight different torso positions. We analyzed the matched voice samples and body-size measurements with the developed speech analysis program and SPSS. The results show a characteristic age-F0 pattern: the F0 of male subjects drops rapidly after the mutational (voice-change) period and then stays stable with age, while that of female subjects changes slowly across the whole age range. Among the MDVP parameters, the age-STD relationship in males and the age-sPPQ relationship in females are especially similar to the age-F0 relationship. In the male group, sPPQ (0.316%), Jitt (0.04%), Shim (0.25%), and APQ (0.28%) increase with age after the mutational period, and Jitt (0.042%) and sPPQ (0.219%) also increase with age in the female group. Height, weight, and BMI show only a weak correlation with the MDVP parameters, with correlation coefficients below 0.25 for both male and female groups. The correlations between the eight body-size measurements and the MDVP parameters are statistically insignificant, peaking at M8-8 vs. F0 (-0.394) for males and M8-6,7 (-0.368, -0.364) for females.
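
For orientation, two of the parameters discussed above can be computed from pitch-period estimates as follows. This uses the common textbook definitions of mean F0 and local jitter (Jitt); MDVP's exact implementation belongs to its vendor, so treat this as an approximation.

```python
import numpy as np

def f0_and_jitt(periods_s):
    """Mean F0 (Hz) and local jitter Jitt (%) from successive pitch
    periods (seconds) of a sustained /a/ vowel. Textbook definitions,
    which approximate but need not match MDVP's implementation."""
    p = np.asarray(periods_s, dtype=float)
    f0 = 1.0 / p.mean()
    # Jitt: mean absolute difference of consecutive periods,
    # relative to the mean period, expressed as a percentage.
    jitt = 100.0 * np.mean(np.abs(np.diff(p))) / p.mean()
    return f0, jitt
```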

Enhancement of Authentication Performance based on Multimodal Biometrics for Android Platform (안드로이드 환경의 다중생체인식 기술을 응용한 인증 성능 개선 연구)

  • Choi, Sungpil;Jeong, Kanghun;Moon, Hyeonjoon
    • Journal of Korea Multimedia Society / v.16 no.3 / pp.302-308 / 2013
  • In this research, we explore a personal authentication system using multimodal biometrics for a mobile computing environment, selecting face and speaker recognition for the implementation. For the face recognition part, we detect the face with the Modified Census Transform (MCT); the detected face is pre-processed by an eye detection module based on the k-means algorithm and then recognized with the Principal Component Analysis (PCA) algorithm. For the speaker recognition part, we extract features using voice end-point detection and Mel-Frequency Cepstral Coefficients (MFCC), then verify the speaker with the Dynamic Time Warping (DTW) algorithm. The proposed multimodal biometrics system shows an improved verification rate by combining the two biometrics described above. We implemented the proposed system in an Android environment on a Galaxy S Hoppin device. It achieves a reduced false acceptance rate (FAR) of 1.8%, an improvement over single-biometric systems using the face and the voice alone (4.6% and 6.7%, respectively).
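
The speaker-verification branch matches MFCC sequences with DTW. A generic dynamic-programming sketch is below; the paper's local path constraints and distance measure are not given in the abstract, so the Euclidean frame distance and length normalization here are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two MFCC sequences.

    a, b: arrays of shape (n_frames, n_coeffs). Standard DP recursion
    with symmetric step pattern; length-normalized at the end.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-level distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)
```

A claimed identity would then be accepted when the distance to the enrolled template falls below a threshold tuned for the desired FAR.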

Lip and Voice Synchronization Using Visual Attention (시각적 어텐션을 활용한 입술과 목소리의 동기화 연구)

  • Dongryun Yoon;Hyeonjoong Cho
    • The Transactions of the Korea Information Processing Society / v.13 no.4 / pp.166-173 / 2024
  • This study explores lip-sync detection, focusing on the synchronization between lip movements and voices in videos. Typically, lip-sync detection techniques crop the facial area of a given video and use the lower half of the cropped box as input to a visual encoder that extracts visual features. To put more emphasis on the articulatory region of the lips for more accurate lip-sync detection, we propose using a pre-trained visual attention-based encoder. The Visual Transformer Pooling (VTP) module, originally designed for the lip-reading task of predicting a transcript from visual information alone, is employed as the visual encoder. Our experimental results demonstrate that, despite having fewer learning parameters, the proposed method outperforms the latest model, VocaList, on the LRS2 dataset, achieving a lip-sync detection accuracy of 94.5% based on five context frames. Moreover, our approach exceeds VocaList in lip-sync detection accuracy by approximately 8% even on an unseen dataset, Acappella.
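
The abstract describes scoring audio-visual synchrony from encoder embeddings. A common formulation, in the style of SyncNet-type models and not necessarily this paper's exact head, compares the two embeddings of a context window by cosine similarity:

```python
import torch
import torch.nn.functional as F

def sync_score(visual_feats, audio_feats):
    """Cosine-similarity lip-sync score for a batch of context windows.

    visual_feats, audio_feats: tensors of shape (batch, dim), produced
    by a visual encoder (hypothetically, a VTP-style attention module)
    and an audio encoder. A window counts as 'in sync' when the score
    exceeds a learned threshold.
    """
    return F.cosine_similarity(visual_feats, audio_feats, dim=-1)

# Toy usage: embeddings for a batch of 5-frame context windows.
v = torch.randn(8, 512)    # from the visual encoder
a = torch.randn(8, 512)    # from the audio encoder
scores = sync_score(v, a)  # one synchrony score per window
```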

Recognition of Overlapped Sound and Influence Analysis Based on Wideband Spectrogram and Deep Neural Networks (광역 스펙트로그램과 심층신경망에 기반한 중첩된 소리의 인식과 영향 분석)

  • Kim, Young Eon;Park, Gooman
    • Journal of Broadcast Engineering / v.23 no.3 / pp.421-430 / 2018
  • Many voice recognition systems use methods such as MFCC and HMM to recognize the human voice. These methods are designed to analyze only a targeted sound, typically the voice passing between a human and a device. However, their recognition capability is limited for group sounds with diverse components over a wider frequency range, such as dog barking and indoor sounds. Such overlapped sounds occupy a wide frequency range, up to 20 kHz, which is higher than that of voice. This paper proposes a new recognition method that covers this wider frequency range by combining a Wideband Sound Spectrogram (WSS) with a DNN-based Keras Sequential Model (KSM). The WSS is adopted to analyze diverse sounds across a wide frequency range and to extract and classify their features. The KSM performs pattern recognition on the features extracted from the WSS to improve sound recognition quality. Experiments verified that the proposed WSS and KSM classify the targeted sound well in noisy environments with overlapped sounds such as dog barking and indoor noise. Furthermore, the paper gives a stage-by-stage analysis and comparison of the factors influencing recognition, and of its characteristics at various noise levels.
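
The abstract names the Keras Sequential Model but not its architecture. A minimal sketch of such a classifier over flattened wideband-spectrogram frames is shown below; the input size, layer widths, and class count are placeholders, not the paper's configuration.

```python
from tensorflow import keras

# Illustrative sizes: frequency bins per wideband-spectrogram frame
# and number of sound classes (e.g. voice, dog barking, indoor sounds).
n_features = 1025
n_classes = 5

model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_spectrogram_frames, train_labels, epochs=20)
```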

Voice Activity Detection using Motion and Variation of Intensity in The Mouth Region (입술 영역의 움직임과 밝기 변화를 이용한 음성구간 검출 알고리즘 개발)

  • Kim, Gi-Bak;Ryu, Je-Woong;Cho, Nam-Ik
    • Journal of Broadcast Engineering / v.17 no.3 / pp.519-528 / 2012
  • Voice activity detection (VAD) is generally conducted by extracting features from the acoustic signal and applying a decision rule. The performance of such acoustically driven VAD algorithms depends heavily on acoustic noise. When video signals are available as well, VAD performance can be enhanced by using visual information, which is not affected by acoustic noise. Previous visual VAD algorithms usually use a single visual feature to detect lip activity, such as active appearance models, optical flow, or intensity variation. Based on an analysis of the weakness of each feature, we propose to combine an intensity-change measure with the optical flow in the mouth region, so that the two compensate for each other's weaknesses. To minimize computational complexity, we develop simple measures that avoid statistical estimation or modeling: the optical flow is the averaged motion vector of some grid regions, and the intensity variation is detected by simple thresholding. To extract the mouth region, we propose a simple algorithm that first detects the two eyes and then uses the intensity profile to locate the center of the mouth. Experiments show that the proposed combination of two simple measures gives higher detection rates at a given false-positive rate than methods that use a single feature.
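
A rough sketch of the combined decision on a mouth-region crop follows. It substitutes OpenCV's Farnebäck dense optical flow for the paper's grid-averaged motion vectors, and both thresholds are hypothetical values for illustration.

```python
import cv2
import numpy as np

def mouth_activity(prev_gray, cur_gray,
                   motion_thresh=0.5, intensity_thresh=8.0):
    """Combine mean optical-flow magnitude and mean intensity change in
    a mouth ROI into a simple lip-activity decision.

    prev_gray, cur_gray: consecutive 8-bit grayscale mouth-region crops
    of identical size.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion = np.linalg.norm(flow, axis=2).mean()      # mean motion magnitude
    intensity = np.abs(cur_gray.astype(np.float32)
                       - prev_gray.astype(np.float32)).mean()
    # Declare activity when either cue fires, so the two measures
    # compensate for each other's weaknesses, as the paper argues.
    return motion > motion_thresh or intensity > intensity_thresh
```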

Isolated Word Recognition Using k-clustering Subspace Method and Discriminant Common Vector (k-clustering 부공간 기법과 판별 공통벡터를 이용한 고립단어 인식)

  • Nam, Myung-Woo
    • Journal of the Institute of Electronics Engineers of Korea TE / v.42 no.1 / pp.13-20 / 2005
  • In this paper, Korean isolated words are recognized using the CVEM suggested by M. Bilginer et al. CVEM is an algorithm that extracts the common properties of training voice signals easily and without complex calculation, and it shows high recognition accuracy. However, CVEM has two problems: it cannot be used with a large number of training voices, and the extracted common vectors carry no discriminant information. To obtain optimal common vectors for a given voice class, varied voices should be used for training; but because CVEM is limited in how many training voices it can use, and because the absence of discriminant information among common vectors can be a source of critical errors, it cannot sustain high recognition accuracy. To solve these problems and improve the recognition rate, a k-clustering subspace method and DCVEM are proposed. Various experiments on a voice signal database built by ETRI confirm the validity of the proposed methods: the results show improved performance, and the proposed methods resolve the CVEM problems without computational difficulties.
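
For context, the common vector of a word class in the generic Common Vector Approach is the component of any training sample orthogonal to the class's difference subspace. The sketch below follows that generic formulation; it is not the paper's exact CVEM, and it says nothing about the proposed k-clustering or discriminant extensions.

```python
import numpy as np

def common_vector(samples):
    """Common vector of one word class (generic Common Vector Approach).

    samples: array (m, d) of feature vectors for one class, with m <= d.
    The difference subspace is spanned by the differences from one
    reference sample; the common vector is the part of that reference
    orthogonal to this subspace -- the invariant component shared by
    all samples of the class.
    """
    x0 = samples[0]
    diffs = samples[1:] - x0                   # difference subspace vectors
    u, s, _ = np.linalg.svd(diffs.T, full_matrices=False)
    basis = u[:, s > 1e-10]                    # orthonormal subspace basis
    proj = basis @ (basis.T @ x0)              # projection onto the subspace
    return x0 - proj                           # common (invariant) component
```

Recognition then assigns a test vector to the class whose difference subspace leaves a residual closest to that class's common vector.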

Implementation of User Recommendation System based on Video Contents Story Analysis and Viewing Pattern Analysis (영상 스토리 분석과 시청 패턴 분석 기반의 추천 시스템 구현)

  • Lee, Hyoun-Sup;Kim, Minyoung;Lee, Ji-Hoon;Kim, Jin-Deog
    • Journal of the Korea Institute of Information and Communication Engineering / v.24 no.12 / pp.1567-1573 / 2020
  • The development of Internet technology has brought the era of one-person media. Individuals produce content on their own and upload it to online services, and many users watch that content on Internet-connected devices. Currently, most users find the content they want through the search functions provided by existing online services, which are driven by information entered by the user who uploaded the content. In an environment where content must be retrieved from such limited word data, unwanted information appears in users' search results. To solve this problem, this paper presents a way for the system to actively analyze videos in the online service and to extract and reflect the characteristics each video holds. The research extracts morphemes from the story content carried in a video's voice data and analyzes them with big data technology.
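
As a sketch of the final step described above, the KoNLPy morphological analyzer (Okt) can extract morphemes from a speech transcript and rank candidate story keywords. The transcript is assumed to come from a separate speech-to-text step, which is not shown, and restricting to nouns longer than one character is an illustrative choice, not the paper's stated rule.

```python
from collections import Counter
from konlpy.tag import Okt   # Korean morphological analyzer (KoNLPy)

def keyword_profile(transcript, top_k=20):
    """Extract content-bearing morphemes from a video's speech
    transcript and rank them by frequency -- a minimal sketch of
    characterizing a video by its spoken story content.
    """
    okt = Okt()
    # Keep multi-character nouns as a rough proxy for story keywords.
    nouns = [w for w in okt.nouns(transcript) if len(w) > 1]
    return Counter(nouns).most_common(top_k)
```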