• Title/Summary/Keyword: emotional speech (감정 음성)


Emotion Transfer with Strength Control for End-to-End TTS (감정 제어 가능한 종단 간 음성합성 시스템)

  • Jeon, Yejin;Lee, Gary Geunbae
    • Annual Conference on Human and Language Technology / 2021.10a / pp.423-426 / 2021
  • This paper introduces a method for controlling emotion strength based on Global Style Tokens (GST). Previous GST studies synthesized speech using reference audio that contains the desired style. However, because speech could only be synthesized in the style of the reference audio, fine-grained emotion control was difficult. To address this problem, we replace the reference encoder of the GST model with residual blocks and AlexNet, a network used in computer vision. AlexNet consists of five convolutional layers, but this paper uses only four of them, omitting one. A listening evaluation (Mean Opinion Score) shows that the proposed method makes emotion strength controllable.


An acoustic study of feeling information extracting method (음성을 이용한 감정 정보 추출 방법)

  • Lee, Yeon-Soo;Park, Young-B.
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.10 no.1 / pp.51-55 / 2010
  • Telemarketing services are provided through voice media in several settings, such as modern call centers. Modern call centers try to measure their service quality, and one measurement method is to extract the speaker's emotional information from their voice. This study proposes analyzing the speaker's voice in order to extract that information. For this purpose, a person's feeling is categorized by analyzing several types of signal parameters in the voice signal. A person's feeling can be categorized into four states: joy, sorrow, excitement, and normality. In a normal condition, an excited or angry state can be a major factor in service quality. This paper proposes selecting problematic conversations by extracting the speaker's emotional information from the pitch and amplitude of the voice.
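The abstract above names pitch and amplitude as the features but does not say how they are measured. As a minimal sketch only, the following estimates both from a single frame, assuming a plain autocorrelation pitch estimator and RMS amplitude (both swapped in for illustration, not taken from the paper). The function names, the 80 to 400 Hz search range, and the synthetic test frame are hypothetical.

```python
import math

def rms_amplitude(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def autocorr_pitch(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Estimate pitch (Hz) as the lag with the largest autocorrelation.

    The search range f_min..f_max is an assumed range for speech,
    not a value given in the abstract.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)

    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag

# Synthetic 40 ms frame: a 200 Hz tone with peak amplitude 0.5 (illustrative data).
sample_rate = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(320)]

pitch_hz = autocorr_pitch(frame, sample_rate)
level = rms_amplitude(frame)
print(pitch_hz, level)
```

On real recordings, these two values would be computed per frame and then summarized per utterance before any categorization step. Voiced/unvoiced detection is omitted here.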

Deep Learning-Based Speech Emotion Recognition Technology Using Voice Feature Filters (음성 특징 필터를 이용한 딥러닝 기반 음성 감정 인식 기술)

  • Shin Hyun Sam;Jun-Ki Hong
    • The Journal of Bigdata / v.8 no.2 / pp.223-231 / 2023
  • In this study, we propose a deep learning-based model that extracts and analyzes features from speech signals, generates filters, and uses these filters to recognize emotions in speech signals. We evaluate the emotion recognition accuracy of the proposed model. In simulations with the proposed model, the average emotion recognition accuracy of the DNN and the RNN was very similar, at 84.59% and 84.52%, respectively. However, the simulation time for the DNN was approximately 44.5% shorter than that of the RNN, enabling quicker emotion prediction.

On the Importance of Tonal Features for Speech Emotion Recognition (음성 감정인식에서의 톤 정보의 중요성 연구)

  • Lee, Jung-In;Kang, Hong-Goo
    • Journal of Broadcast Engineering / v.18 no.5 / pp.713-721 / 2013
  • This paper describes the effectiveness of chroma-based tonal features for speech emotion recognition. Just as the tonality of major or minor keys affects the perception of musical mood, speech tonality affects the perception of the emotional state of spoken utterances. To support this assertion about tonality and emotion, subjective listening tests are carried out using synthesized signals generated from chroma features. The tests show that tonality contributes especially to the perception of negative emotions such as anger and sadness. In automatic emotion recognition tests, the modified chroma-based tonal features produce a noticeable improvement in accuracy when they supplement conventional log-frequency power coefficient (LFPC)-based spectral features.

Enhancing Multimodal Emotion Recognition in Speech and Text with Integrated CNN, LSTM, and BERT Models (통합 CNN, LSTM, 및 BERT 모델 기반의 음성 및 텍스트 다중 모달 감정 인식 연구)

  • Edward Dwijayanto Cahyadi;Hans Nathaniel Hadi Soesilo;Mi-Hwa Song
    • The Journal of the Convergence on Culture Technology / v.10 no.1 / pp.617-623 / 2024
  • Identifying emotions through speech poses a significant challenge due to the complex relationship between language and emotions. Our paper takes on this challenge by employing feature engineering to identify emotions in speech through a multimodal classification task involving both speech and text data. We evaluated two classifiers, Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM), both integrated with a BERT-based pre-trained model. Our assessment covers various performance metrics (accuracy, F-score, precision, and recall) across different experimental setups. The findings highlight the proficiency of the two models in accurately discerning emotions from both text and speech data.

Analysis and Use of Intonation Features for Emotional States (감정 상태에 따른 발화문의 억양 특성 분석 및 활용)

  • Lee, Ho-Joon;Park, Jong C.
    • Annual Conference on Human and Language Technology / 2008.10a / pp.145-150 / 2008
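  • Using an emotional speech corpus of 240 utterances, in which six speakers each uttered eight sentences in five emotional states, this paper analyzes the intonation patterns characteristic of each emotional state and discusses how to apply these patterns to a speech synthesis system. To this end, we analyze the characteristic intonation patterns of each emotional state with a focus on the length of intonation phrases, the phrase-final boundary tones of intonation phrases, and declination, and we show the process of applying intonation features that can distinguish joy, sadness, anger, and fear to a speech synthesis system. Through this study, we confirmed a rise in intonation for the emotion of anger and identified characteristic intonation patterns for each emotion.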


Emotion Recognition Using Output Data of Image and Speech (영상과 음성의 출력 데이터를 이용한 감정인식)

  • Oh, Jae-Heung;Jeong, Keun-Ho;Joo, Young-Hoon;Park, Chang-Hyun;Sim, Kwee-Bo
    • Proceedings of the KIEE Conference / 2003.07d / pp.2097-2099 / 2003
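  • This paper proposes a method for recognizing human emotions using image and speech data. The proposed method is based on the recognition rates of the image and speech recognizers. When the output of only one of the two is used, it is difficult to recover from the consequences of a misrecognition. To compensate for this, we propose a method that uses the outputs of both image and speech and gives more weight to emotional states with a higher recognition rate, thereby reducing misrecognition. This requires that the image and speech recognition rates for each emotional state be obtained in advance, and we present a method for computing the weights from these recognition rates.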
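The abstract above does not give the weighting formula, so the following is only a sketch of one plausible reading: for each emotion, each modality's score is weighted by that modality's share of the two recognition rates. The function names, the normalization rule, and all numbers are assumptions for illustration, not values from the paper.

```python
def fuse_by_recognition_rate(image_scores, speech_scores, image_rates, speech_rates):
    """Fuse per-emotion scores from an image and a speech recognizer.

    For each emotion, the modality with the higher (previously measured)
    recognition rate receives the larger weight. The normalization used
    here is an assumed rule, not the paper's formula.
    """
    fused = {}
    for emotion in image_scores:
        total = image_rates[emotion] + speech_rates[emotion]
        w_image = image_rates[emotion] / total
        w_speech = speech_rates[emotion] / total
        fused[emotion] = (w_image * image_scores[emotion]
                          + w_speech * speech_scores[emotion])
    return fused

def predict(fused):
    """Return the emotion with the highest fused score."""
    return max(fused, key=fused.get)

# Hypothetical recognition rates measured beforehand for each modality.
image_rates = {"joy": 0.9, "anger": 0.5}
speech_rates = {"joy": 0.6, "anger": 0.8}

# Hypothetical recognizer outputs for one sample.
image_scores = {"joy": 0.55, "anger": 0.45}
speech_scores = {"joy": 0.30, "anger": 0.70}

fused = fuse_by_recognition_rate(image_scores, speech_scores, image_rates, speech_rates)
print(predict(fused))
```

In this example the image recognizer alone would pick joy, but the speech recognizer is more reliable for anger, so its vote counts for more and the fused decision is anger.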


Development of Emotion Recognition Model Using Audio-video Feature Extraction Multimodal Model (음성-영상 특징 추출 멀티모달 모델을 이용한 감정 인식 모델 개발)

  • Jong-Gu Kim;Jang-Woo Kwon
    • Journal of the Institute of Convergence Signal Processing / v.24 no.4 / pp.221-228 / 2023
  • Physical and mental changes caused by emotions can affect various behaviors, such as driving or learning. Recognizing these emotions is therefore an important task with applications in various industries, such as detecting and controlling dangerous emotions while driving. In this paper, we address the emotion recognition task by implementing a multimodal model that recognizes emotions using both audio and video data, which come from different domains. After extracting the audio from the video data of the RAVDESS dataset, audio features are extracted with a 2D-CNN model. The video features are extracted with a SlowFast feature extractor. The information contained in the audio and video data is then combined into a single feature, and emotion recognition is performed on the combined feature. Lastly, we evaluate conventional methods, which combine or vote on the results of two models, against a method that unifies the domains through feature extraction, then combines the features and classifies them with a classifier.

The Effect of the Verbal Emotional Context on the Serial Position Effect (음성으로 제시되는 감정 맥락이 서열 위치 효과에 미치는 영향)

  • Jinsun Suhr;Eunmi Oh;Kwanghee Han
    • Science of Emotion and Sensibility / v.27 no.2 / pp.3-14 / 2024
  • An understanding of the influence of emotional context on memory retrieval is crucial to our comprehensive understanding of human cognition. While previous research focused primarily on visual stimuli to address this relationship, this study ventures into the realm of speech-based emotional contexts. Building on previous findings, we examine the effects of arousal and the valence of verbal contexts on memory, with particular focus on mitigating the serial position effect. In Study 1, we investigated how the arousal level of verbal context in the middle of a word list affects memory retention. Our results demonstrated detriment to the memory of later parts of the word list when exposed to low-arousal contexts. In Study 2, we controlled for arousal levels and examined the impact of valence on memory. We found that negative verbal contexts impair the memory of the word when presented together. Our findings suggest that speech-based emotional contexts do not facilitate verbal memory processing. In particular, negative emotional contexts were found to reinforce the serial position effect. Negative emotional contexts tend to disrupt task performance and fail to elicit memory-enhancing effects, especially when both the context and memory stimulus are verbal. These insights offer a valuable contribution to our understanding of the nuances of auditorily delivered emotional context in verbal memory processes.

Emotion Recognition using Pitch Parameters of Speech (음성의 피치 파라메터를 사용한 감정 인식)

  • Lee, Guehyun;Kim, Weon-Goo
    • Journal of the Korean Institute of Intelligent Systems / v.25 no.3 / pp.272-278 / 2015
  • This paper studies various parameter extraction methods that use the pitch information of speech for the development of an emotion recognition system. For this purpose, pitch parameters were extracted from a Korean speech database containing various emotions using statistical information and numerical analysis techniques. A GMM-based emotion recognition system was used to compare the performance of the pitch parameters. A sequential feature selection method was used to select the parameters showing the best emotion recognition performance. Experiments on recognizing four emotions showed a 63.5% recognition rate using a combination of 15 parameters out of 56 pitch parameters. Experiments on detecting the presence of emotion showed an 80.3% recognition rate using a combination of 14 parameters.
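The sequential feature selection mentioned above can be sketched as a greedy forward search: start with no parameters and repeatedly add the one that most improves a score. The abstract does not say whether the search runs forward or backward, so forward is an assumption here. In the paper the score would be the accuracy of the GMM-based recognizer; below, a toy scoring function with made-up numbers stands in for it, and all feature names are hypothetical.

```python
def sequential_forward_selection(features, score, k):
    """Greedily pick k features, adding the best-scoring candidate each round.

    `score` takes a list of feature names and returns a number to maximize.
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy stand-in for recognizer accuracy: each feature adds a gain, and two
# redundant features (mean and median pitch) are penalized when used together.
GAINS = {"f0_mean": 0.30, "f0_median": 0.28, "f0_range": 0.25, "f0_slope": 0.10}

def toy_score(subset):
    total = sum(GAINS[f] for f in subset)
    if "f0_mean" in subset and "f0_median" in subset:
        total -= 0.25  # redundancy penalty (illustrative)
    return total

chosen = sequential_forward_selection(list(GAINS), toy_score, k=3)
print(chosen)
```

The redundancy penalty shows why a greedy search over subsets can differ from simply ranking parameters one by one: the second-best individual feature is skipped because it adds little once a similar feature is already selected.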