• Title/Summary/Keyword: Speaker embedding

Search Result 15, Processing Time 0.024 seconds

I-vector similarity based speech segmentation for interested speaker to speaker diarization system (화자 구분 시스템의 관심 화자 추출을 위한 i-vector 유사도 기반의 음성 분할 기법)

  • Bae, Ara;Yoon, Ki-mu;Jung, Jaehee;Chung, Bokyung;Kim, Wooil
    • The Journal of the Acoustical Society of Korea
    • /
    • v.39 no.5
    • /
    • pp.461-467
    • /
    • 2020
  • In noisy and multi-speaker environments, the performance of speech recognition is unavoidably lower than in a clean environment. To improve speech recognition, in this paper, the signal of the speaker of interest is extracted from the mixed speech signals with multiple speakers. The VoiceFilter model is used to effectively separate overlapped speech signals. In this work, clustering by Probabilistic Linear Discriminant Analysis (PLDA) similarity score was employed to detect the speech signal of the interested speaker, which is used as the reference speaker to VoiceFilter-based separation. Therefore, by utilizing the speaker feature extracted from the detected speech by the proposed clustering method, this paper propose a speaker diarization system using only the mixed speech without an explicit reference speaker signal. We use phone-dataset consisting of two speakers to evaluate the performance of the speaker diarization system. Source to Distortion Ratio (SDR) of the operator (Rx) speech and customer speech (Tx) are 5.22 dB and -5.22 dB respectively before separation, and the results of the proposed separation system show 11.26 dB and 8.53 dB respectively.

Integrated receptive field diversification method for improving speaker verification performance for variable-length utterances (가변 길이 입력 발성에서의 화자 인증 성능 향상을 위한 통합된 수용 영역 다양화 기법)

  • Shin, Hyun-seo;Kim, Ju-ho;Heo, Jungwoo;Shim, Hye-jin;Yu, Ha-Jin
    • The Journal of the Acoustical Society of Korea
    • /
    • v.41 no.3
    • /
    • pp.319-325
    • /
    • 2022
  • The variation of utterance lengths is a representative factor that can degrade the performance of speaker verification systems. To handle this issue, previous studies had attempted to extract speaker features from various branches or to use convolution layers with different receptive fields. Combining the advantages of the previous two approaches for variable-length input, this paper proposes integrated receptive field diversification that extracts speaker features through more diverse receptive field. The proposed method processes the input features by convolutional layers with different receptive fields at multiple time-axis branches, and extracts speaker embedding by dynamically aggregating the processed features according to the lengths of input utterances. The deep neural networks in this study were trained on the VoxCeleb2 dataset and tested on the VoxCeleb1 evaluation dataset that divided into 1 s, 2 s, 5 s, and full-length. Experimental results demonstrated that the proposed method reduces the equal error rate by 19.7 % compared to the baseline.

Development of Deep Learning Models for Multi-class Sentiment Analysis (딥러닝 기반의 다범주 감성분석 모델 개발)

  • Syaekhoni, M. Alex;Seo, Sang Hyun;Kwon, Young S.
    • Journal of Information Technology Services
    • /
    • v.16 no.4
    • /
    • pp.149-160
    • /
    • 2017
  • Sentiment analysis is the process of determining whether a piece of document, text or conversation is positive, negative, neural or other emotion. Sentiment analysis has been applied for several real-world applications, such as chatbot. In the last five years, the practical use of the chatbot has been prevailing in many field of industry. In the chatbot applications, to recognize the user emotion, sentiment analysis must be performed in advance in order to understand the intent of speakers. The specific emotion is more than describing positive or negative sentences. In light of this context, we propose deep learning models for conducting multi-class sentiment analysis for identifying speaker's emotion which is categorized to be joy, fear, guilt, sad, shame, disgust, and anger. Thus, we develop convolutional neural network (CNN), long short term memory (LSTM), and multi-layer neural network models, as deep neural networks models, for detecting emotion in a sentence. In addition, word embedding process was also applied in our research. In our experiments, we have found that long short term memory (LSTM) model performs best compared to convolutional neural networks and multi-layer neural networks. Moreover, we also show the practical applicability of the deep learning models to the sentiment analysis for chatbot.

A Study on Digital Image Watermarking for Embedding Audio Logo (음성로고 삽입을 위한 디지털 영상 워터마킹에 관한 연구)

  • Cho, Gang-Seok;Koh, Sung-Shik
    • Journal of the Institute of Electronics Engineers of Korea TE
    • /
    • v.39 no.3
    • /
    • pp.21-27
    • /
    • 2002
  • The digital watermarking methods have been proposed as a solution for solving the illegal copying and proof of ownership problems in the context of multimedia data. But it is still difficult to have been overcame the problem of the protection of property to multimedia data, such as digital images, digital video, and digital audio. This paper describes a watermarking algorithm that embeds non-linearly audio logo watermark data which is converted from audio signal of the ownership in the components of pixel intensities in an original image and that insists of ownership by hearing the audio signal transformed from the extracted audio logo through the speaker. Experimental results show that our algorithm using audio logo proposed in this paper is robust against attacks such as particularly lossy JPEG image compression. 

Experimental Analysis on Vibration of Composite Plate by Using FBG Sensor System (브래그 격자 센서 시스템을 이용한 복합재 평판 진동의 실험적 해석)

  • Kim, Dae-Hyun
    • Journal of the Korean Society for Nondestructive Testing
    • /
    • v.29 no.5
    • /
    • pp.436-441
    • /
    • 2009
  • A fiber optic sensor is prospective to be applied to structural health monitoring. Especially, a fiber Bragg grating(FBG) sensor is one of the most popular sensors for the structural health monitoring. The FBG sensor has several demodulation systems for tracking the shift of the Bragg wavelength. The dynamic bandwidth is dependent on the demodulation system. In this paper, the sensing mechanism is that the slope of the optical spectrum of FBG could be used as its sensitivity when the tunable laser shot the monochromatic laser wavelength at the highest slope point. In this technique, the high sensitivity is guaranteed even though the sensing range is limited. In an example of the application, the composite plate embedding a FBG sensor was manufactured by using an autoclave method and the above sensing mechanism was applied to the composite plate. Firstly, the natural frequencies of the plate were successfully measured by the FBG sensor during the impact hammer test. Secondly, a high-power speaker was used to force the plate to be vibrated at the specific frequency that was one of the natural frequencies. During the shaking, the FBG sensor measures the dynamic characteristics and ESPI was also used to measure the mode shape. From the two dynamic tests, the availability of the FBG sensor system and the ESPI was proven as a technique for measuring the dynamic characteristics of composite structure.