• Title/Abstract/Keyword: Human speech

Search results: 569

언어 사용력(Speech Register) 원리를 활용한 유아의 교육용 로봇 인식 (Applying the Speech Register Principle to Young Children's Perception of the Intelligent Service Robot)

  • 현은자;이하원;연혜민
    • 한국콘텐츠학회논문지 / Vol. 12 No. 10 / pp.532-540 / 2012
  • The purpose of this study was to investigate how five-year-old children who had experienced robots in early childhood education settings perceive those robots. The theoretical background was Speech Register theory, and the speech tones judged appropriate for a robot, a friend, and a doll were compared. Fifty five-year-old children from three kindergartens with two to three years of robot experience listened to a picture book read in both a natural, human-appropriate tone and a monotonous tone, and were asked which tone suited a person. They were then asked which of the three targets (robot, friend, doll) the human-appropriate tone suited, and their reasons were analyzed. The results show that 86% of the children judged the human-appropriate tone more suitable for the robot than for the doll, 74% judged it more suitable for a friend than for the robot, and 68% judged it more suitable for a friend than for the doll. In other words, the children perceived the robot not as a mere artifact but as something closer to a human, an intermediate hybrid being. The underlying mechanism of this perception reflected cognitive characteristics regarded as human uniqueness. We therefore suggest that the dichotomous ontological classification into living and non-living things needs to be replaced by a classification scheme that includes hybrids.

얼굴영상과 음성을 이용한 멀티모달 감정인식 (Multimodal Emotion Recognition using Face Image and Speech)

  • 이현구;김동주
    • 디지털산업정보학회논문지 / Vol. 8 No. 1 / pp.29-40 / 2012
  • Endowing machines with emotional intelligence is a challenging research issue of growing importance in human-computer interaction. Emotion recognition technology therefore plays an important role in this research area, enabling more natural, human-like communication between humans and computers. In this paper, we propose a multimodal emotion recognition system using face and speech to improve recognition performance. For face-based emotion recognition, a distance measure is computed by 2D-PCA of MCS-LBP images with a nearest-neighbor classifier; for speech-based emotion recognition, a likelihood measure is obtained from a Gaussian mixture model over pitch and mel-frequency cepstral coefficient features. The individual matching scores obtained from face and speech are combined by weighted summation, and the fused score is used to classify the emotion. Experimental results show that the proposed method improves recognition accuracy by about 11.25% to 19.75% over the best unimodal approach, confirming a significant performance improvement.
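
The weighted-summation fusion step lends itself to a short illustration. The sketch below is a minimal Python example under stated assumptions, not the authors' implementation: the face score is a distance (smaller is better) while the speech score is a GMM log-likelihood (larger is better), so both are min-max normalized before the weighted sum; the weight `w_face` is a hypothetical tuning parameter.

```python
import numpy as np

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    """Rescale a score vector to [0, 1]."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def fuse_scores(face_distances: np.ndarray,
                speech_loglikes: np.ndarray,
                w_face: float = 0.5) -> int:
    """Weighted-summation fusion of per-class matching scores.

    face_distances: distance to each emotion class (smaller = better).
    speech_loglikes: GMM log-likelihood per class (larger = better).
    Returns the index of the predicted emotion class.
    """
    # Convert distances to similarities so that larger is always better.
    face_sim = 1.0 - min_max_normalize(face_distances)
    speech_sim = min_max_normalize(speech_loglikes)
    fused = w_face * face_sim + (1.0 - w_face) * speech_sim
    return int(np.argmax(fused))

# Toy example with 6 emotion classes:
face_d = np.array([3.1, 2.2, 4.0, 1.5, 3.8, 2.9])
speech_ll = np.array([-110.0, -95.0, -120.0, -90.0, -130.0, -105.0])
print(fuse_scores(face_d, speech_ll, w_face=0.6))
```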

한국인의 외국어 발화오류검출 음성인식기에서 청취판단과 상관관계가 높은 기계 스코어링 기법 (Machine Scoring Methods Highly-correlated with Human Ratings in Speech Recognizer Detecting Mispronunciation of Foreign Language)

  • 배민영;권철홍
    • 음성과학 / Vol. 11 No. 2 / pp.217-226 / 2004
  • An automatic pronunciation correction system provides users with correction guidelines for each pronunciation error. For this purpose, we develop a speech recognition system that automatically classifies pronunciation errors when Koreans speak a foreign language. In this paper, we propose machine scoring methods for automatic assessment of pronunciation quality by the speech recognizer. Scores obtained from an expert human listener are used as the reference to evaluate the different machine scores and to provide targets when training some of the algorithms. We use a log-likelihood score and a normalized log-likelihood score as machine scoring methods. Experimental results show that the normalized log-likelihood score correlates more strongly with human scores than the log-likelihood score.
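
As a rough sketch of the two machine scores (an illustration under common assumptions, not the authors' implementation): the utterance log-likelihood from the recognizer can be normalized by the number of frames, and each score's agreement with the human listener can be checked with Pearson correlation.

```python
import numpy as np

def log_likelihood_score(frame_loglikes: np.ndarray) -> float:
    """Total acoustic log-likelihood of the utterance."""
    return float(frame_loglikes.sum())

def normalized_log_likelihood_score(frame_loglikes: np.ndarray) -> float:
    """Log-likelihood divided by utterance length in frames, which
    reduces the score's dependence on utterance duration."""
    return float(frame_loglikes.mean())

def correlation_with_human(machine_scores, human_scores) -> float:
    """Pearson correlation between machine scores and human ratings."""
    return float(np.corrcoef(machine_scores, human_scores)[0, 1])

# Hypothetical per-frame log-likelihoods for three utterances
rng = np.random.default_rng(0)
utts = [rng.normal(-5.0, 1.0, size=n) for n in (120, 80, 200)]
human = [3.0, 4.0, 2.0]  # hypothetical 5-point listener ratings
nll = [normalized_log_likelihood_score(u) for u in utts]
print(correlation_with_human(nll, human))
```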

Joint streaming model for backchannel prediction and automatic speech recognition

  • Yong-Seok Choi;Jeong-Uk Bang;Seung Hi Kim
    • ETRI Journal / Vol. 46 No. 1 / pp.118-126 / 2024
  • In human conversations, listeners often utter brief backchannels such as "uh-huh" or "yeah." Timely backchannels are crucial to understanding and to building trust among conversational partners. In human-machine conversation systems, users can engage in natural conversations when a conversational agent generates backchannels like a human listener. We propose a method that simultaneously predicts backchannels and recognizes speech in real time. We use a streaming transformer and adopt multitask learning for concurrent backchannel prediction and speech recognition. The experimental results demonstrate the superior performance of our method compared with previous work while maintaining similar single-task speech recognition performance. Owing to the extremely imbalanced training data distribution, a single-task backchannel prediction model fails to predict any of the backchannel categories, whereas the proposed multitask approach substantially enhances backchannel prediction performance. Notably, in the streaming prediction scenario, backchannel prediction performance improves by up to 18.7% over existing methods.
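
The multitask setup can be sketched as a shared streaming encoder with two heads, trained on a weighted sum of the two task losses. The PyTorch sketch below uses hypothetical module names and a class-weighted backchannel loss (one common remedy for the label imbalance described above); it is an illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class JointBackchannelASR(nn.Module):
    """Shared streaming encoder with an ASR (CTC) head and a
    backchannel prediction head. Hypothetical sketch."""

    def __init__(self, encoder: nn.Module, d_model: int, vocab_size: int,
                 n_bc_classes: int, bc_weight: float = 0.3,
                 bc_class_weights: torch.Tensor = None):
        super().__init__()
        self.encoder = encoder                        # e.g., streaming transformer
        self.asr_head = nn.Linear(d_model, vocab_size)
        self.bc_head = nn.Linear(d_model, n_bc_classes)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        # Class weights counteract the heavily imbalanced backchannel labels.
        self.bc_loss = nn.CrossEntropyLoss(weight=bc_class_weights)
        self.bc_weight = bc_weight

    def forward(self, feats, feat_lens, tokens, token_lens, bc_labels):
        h = self.encoder(feats)                       # (B, T, d_model)
        asr_logp = self.asr_head(h).log_softmax(-1)
        loss_asr = self.ctc(asr_logp.transpose(0, 1), tokens,
                            feat_lens, token_lens)
        bc_logits = self.bc_head(h)                   # per-frame backchannel logits
        loss_bc = self.bc_loss(bc_logits.reshape(-1, bc_logits.size(-1)),
                               bc_labels.reshape(-1))
        # Multitask objective: ASR loss plus weighted backchannel loss.
        return loss_asr + self.bc_weight * loss_bc
```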

Emotion Recognition based on Multiple Modalities

  • Kim, Dong-Ju;Lee, Hyeon-Gu;Hong, Kwang-Seok
    • 융합신호처리학회논문지 / Vol. 12 No. 4 / pp.228-236 / 2011
  • Emotion recognition plays an important role in human-computer interaction research, enabling more natural, human-like communication between humans and computers. Most previous work on emotion recognition focused on extracting emotion from face, speech, or EEG information separately. This paper therefore presents a novel approach that combines face, speech, and EEG to recognize human emotion. The individual matching scores obtained from face, speech, and EEG are combined by weighted summation, and the fused score is used to classify the emotion. Experimental results show that the proposed approach improves accuracy by more than 18.64% over the most successful unimodal approach and also outperforms approaches integrating any two of the modalities, confirming a significant performance improvement.
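
Extending weighted-summation fusion to three modalities mainly raises the question of how to choose the weights. The sketch below, a simple grid search over the weight simplex on validation data, is one plausible approach; the paper does not state how its weights were set.

```python
import itertools
import numpy as np

def fuse3(face, speech, eeg, w):
    """Weighted sum of three per-class score vectors (all assumed
    normalized so that larger = better); w = (w_face, w_speech, w_eeg)."""
    return w[0] * face + w[1] * speech + w[2] * eeg

def search_weights(val_scores, val_labels, step=0.1):
    """Grid search over the weight simplex for the fusion weights that
    maximize accuracy on (face, speech, eeg) validation score triples."""
    best_w, best_acc = None, -1.0
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for wf, ws in itertools.product(grid, grid):
        we = 1.0 - wf - ws
        if we < -1e-9:
            continue  # weights must stay non-negative and sum to 1
        preds = [int(np.argmax(fuse3(f, s, e, (wf, ws, we))))
                 for f, s, e in val_scores]
        acc = float(np.mean(np.array(preds) == np.array(val_labels)))
        if acc > best_acc:
            best_w, best_acc = (wf, ws, we), acc
    return best_w, best_acc
```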

음성인식용 인터페이스의 사용편의성 평가 방법론 (A Usability Evaluation Method for Speech Recognition Interfaces)

  • 한성호;김범수
    • 대한인간공학회지 / Vol. 18 No. 3 / pp.105-125 / 1999
  • As speech is the most natural human communication medium, using it offers many advantages. Currently, most computer user interfaces rely on a mouse and keyboard, but speech recognition interfaces are expected to replace them, or at least to serve as a supporting tool. Despite these advantages, speech recognition interfaces are not yet popular because of technical difficulties such as limited recognition accuracy and slow response time. Nevertheless, it is important to optimize human-computer system performance by improving usability. This paper presents a set of guidelines for designing speech recognition interfaces and provides a method for evaluating their usability. A total of 113 guidelines are suggested to improve the usability of speech recognition interfaces. The evaluation method consists of four major procedures: user interface evaluation, function evaluation, vocabulary estimation, and recognition speed/accuracy evaluation. Each procedure is described along with appropriate techniques for efficient evaluation.
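
Of the four procedures, the recognition speed/accuracy evaluation is the most mechanical; a minimal harness might look like the sketch below (a hypothetical illustration of such a measurement, not the paper's protocol).

```python
import time

def evaluate_recognizer(recognize, test_set):
    """Measure command recognition accuracy and mean response time.

    recognize: callable mapping an audio sample to a text hypothesis.
    test_set:  list of (audio, reference_text) pairs.
    """
    correct, latencies = 0, []
    for audio, ref in test_set:
        t0 = time.perf_counter()
        hyp = recognize(audio)
        latencies.append(time.perf_counter() - t0)
        correct += int(hyp.strip().lower() == ref.strip().lower())
    accuracy = correct / len(test_set)
    mean_latency = sum(latencies) / len(latencies)
    return accuracy, mean_latency
```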

PROSODY IN SPEECH TECHNOLOGY - National project and some of our related works -

  • Hirose Keikichi
    • 한국음향학회 학술대회논문집 / Proceedings of the 2002 Summer Meeting, Vol. 21 No. 1 / pp.15-18 / 2002
  • Prosodic features of speech are known to play an important role in the transmission of linguistic information in human conversation, and their role in the transmission of para- and non-linguistic information is even greater. Despite this importance, engineering research has focused mainly on segmental features rather than prosodic features. To promote research on prosody, a research project, 'Prosody and Speech Processing', is now under way. The paper first gives a rough sketch of the project and then introduces several prosody-related research efforts ongoing in our laboratory, including corpus-based fundamental frequency contour generation, speech rate control for dialogue-like speech synthesis, analysis of prosodic features of emotional speech, reply speech generation in spoken dialogue systems, and language modeling with prosodic boundaries.
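
For concreteness, fundamental frequency contour generation in this line of research is often formulated with the command-response (Fujisaki) model, which superposes phrase and accent components on a log-F0 baseline. The sketch below illustrates that model only; it is not the project's generation system, and the parameter values are arbitrary.

```python
import numpy as np

def phrase_component(t, alpha=2.0):
    """Phrase control response: Gp(t) = alpha^2 * t * exp(-alpha*t) for t >= 0."""
    tp = np.maximum(t, 0.0)
    return np.where(t >= 0, alpha**2 * tp * np.exp(-alpha * tp), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Accent control response: min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    tp = np.maximum(t, 0.0)
    g = np.where(t >= 0, 1.0 - (1.0 + beta * tp) * np.exp(-beta * tp), 0.0)
    return np.minimum(g, gamma)

def fujisaki_f0(t, fb=120.0, phrases=(), accents=()):
    """ln F0(t) = ln Fb + sum Ap*Gp(t - T0) + sum Aa*(Ga(t - T1) - Ga(t - T2))."""
    ln_f0 = np.full_like(t, np.log(fb))
    for ap, t0 in phrases:              # (amplitude, onset time)
        ln_f0 += ap * phrase_component(t - t0)
    for aa, t1, t2 in accents:          # (amplitude, onset, offset)
        ln_f0 += aa * (accent_component(t - t1) - accent_component(t - t2))
    return np.exp(ln_f0)

t = np.linspace(0.0, 3.0, 300)
f0 = fujisaki_f0(t, phrases=[(0.5, 0.0)],
                 accents=[(0.4, 0.3, 0.8), (0.3, 1.5, 2.1)])
```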

자동차 텔레매틱스용 내장형 음성 HMI시스템 (The Human-Machine Interface System with the Embedded Speech recognition for the telematics of the automobiles)

  • 권오일
    • 전자공학회논문지CI / Vol. 41 No. 2 / pp.1-8 / 2004
  • Speech HMI (Human Machine Interface) technology for automotive telematics encompasses the development of a speech-HMI-based DSP telematics system that integrates embedded speech technology robust to in-vehicle noise, so that speech information technology can be exploited inside the vehicle. Building on the embedded speech recognition engine we developed, this paper implements the DSP system for automotive telematics and carries out integrated testing; the work integrates the component technologies of automotive speech HMI and is central to the development of automotive speech HMI technology.

음질 개선을 통한 음성의 인식 (Speech Recognition through Speech Enhancement)

  • 조준희;이기성
    • 대한전기학회 학술대회논문집 / 2003 Conference Proceedings, Information and Control Section B / pp.511-514 / 2003
  • Human beings use speech signals to exchange information. When background noise is present, speech recognizers suffer performance degradation. We study speech recognition through speech enhancement in noisy environments. A histogram method is introduced as a reliable noise-estimation approach for spectral subtraction, with MFCC features used for recognition. The experimental results show the effectiveness of the proposed algorithm.
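
A minimal sketch of spectral subtraction with histogram-based noise estimation, consistent with the abstract but not the authors' code: for each frequency bin, the noise magnitude is taken as the mode of that bin's magnitude histogram over time (noise tends to dominate the most frequently occurring level), then subtracted with a spectral floor. The enhanced magnitudes would then be recombined with the noisy phase and inverted, and MFCCs computed from the enhanced signal for recognition.

```python
import numpy as np

def histogram_noise_estimate(mag: np.ndarray, n_bins: int = 40) -> np.ndarray:
    """Per-bin noise magnitude as the mode of the magnitude histogram.

    mag: |STFT| magnitudes, shape (n_freq, n_frames).
    """
    noise = np.empty(mag.shape[0])
    for k in range(mag.shape[0]):
        counts, edges = np.histogram(mag[k], bins=n_bins)
        m = counts.argmax()
        noise[k] = 0.5 * (edges[m] + edges[m + 1])  # center of the modal bin
    return noise

def spectral_subtraction(mag: np.ndarray, noise: np.ndarray,
                         over_sub: float = 1.0,
                         floor: float = 0.02) -> np.ndarray:
    """Subtract the noise estimate from every frame, keeping a spectral
    floor to avoid negative magnitudes and musical-noise artifacts."""
    cleaned = mag - over_sub * noise[:, None]
    return np.maximum(cleaned, floor * mag)
```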

모의 지능로봇에서의 음성 감정인식 (Speech Emotion Recognition on a Simulated Intelligent Robot)

  • 장광동;김남;권오욱
    • 대한음성학회지 말소리 / No. 56 / pp.173-183 / 2005
  • We propose a speech emotion recognition method for an affective human-robot interface. In the proposed method, emotion is classified into six classes: angry, bored, happy, neutral, sad, and surprised. Features for an input utterance are extracted from statistics of phonetic and prosodic information. Phonetic information includes log energy, shimmer, formant frequencies, and Teager energy; prosodic information includes pitch, jitter, duration, and rate of speech. Finally, a pattern classifier based on Gaussian-kernel support vector machines decides the emotion class of the utterance. We recorded speech commands and dialogs uttered 2 m away from microphones in five different directions. Experimental results show that the proposed method yields 48% classification accuracy, while human classifiers achieve 71%.
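
The classification stage maps per-utterance feature statistics to one of the six emotion classes with a Gaussian (RBF) kernel SVM. The scikit-learn sketch below uses random placeholder data in place of the real phonetic/prosodic statistics; the feature dimensionality and hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTIONS = ["angry", "bored", "happy", "neutral", "sad", "surprised"]

# Each row holds statistics (e.g., mean/std) of log energy, shimmer,
# formants, Teager energy, pitch, jitter, duration, and rate of speech.
# Random data stands in for real extracted features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
y = rng.integers(0, len(EMOTIONS), size=300)

# Gaussian (RBF) kernel SVM, matching the abstract's pattern classifier.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X[:240], y[:240])
print("held-out accuracy:", clf.score(X[240:], y[240:]))
print("predicted:", EMOTIONS[int(clf.predict(X[:1])[0])])
```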
