• Title/Summary/Keyword: Text-To-Speech synthesis

Analysis of Voice Color Similarity for the development of HMM Based Emotional Text to Speech Synthesis (HMM 기반 감정 음성 합성기 개발을 위한 감정 음성 데이터의 음색 유사도 분석)

  • Min, So-Yeon;Na, Deok-Su
    • Journal of the Korea Academia-Industrial cooperation Society / v.15 no.9 / pp.5763-5768 / 2014
  • Maintaining a consistent voice color is important when a single synthesizer combines a normal (emotionless) voice with various emotional voices. When a synthesizer is developed from recordings in which emotion is expressed too strongly, the voice color cannot be maintained and each synthetic voice can sound like a different speaker. In this paper, speech data was recorded and the change in voice color was analyzed in order to develop an emotional HMM-based speech synthesizer. To build a speech synthesizer, a voice is recorded and a database is constructed, and the recording process is particularly important for an emotional speech synthesizer. Monitoring is needed because it is quite difficult to define an emotion and hold it at a particular level. The synthesizer used a normal voice and three emotional voices (Happiness, Sadness, Anger), each emotional voice having two levels, High and Low. To analyze the voice color of the normal and emotional voices, the average spectrum, measured as the accumulated spectrum of vowels, was used, and the F1 (first formant) calculated from the average spectrum was compared. The voice similarity of the Low-level emotional data was higher than that of the High-level emotional data, and the proposed method makes it possible to monitor recordings through changes in voice similarity.

UA Tree-based Reduction of Speech DB in a Large Corpus-based Korean TTS (대용량 한국어 TTS의 결정트리기반 음성 DB 감축 방안)

  • Lee, Jung-Chul
    • Journal of the Korea Society of Computer and Information / v.15 no.7 / pp.91-98 / 2010
  • Large corpus-based concatenative Text-to-Speech (TTS) systems can generate natural synthetic speech without additional signal processing. Because improving the naturalness, personality, speaking style, and emotion of synthetic speech requires a larger speech DB, it is necessary to prune redundant speech segments from a large speech segment DB. In this paper, we propose a new method, based on a clustering algorithm, for constructing a downsized segmental speech DB for a Korean TTS system. For the performance test, synthetic speech was generated using a Korean TTS system consisting of a language processing module, prosody processing module, segment selection module, speech concatenation module, and segmental speech DB. A MOS test was conducted on a set of synthetic speech generated with 4 different segmental speech DBs, which we constructed by combining the CM1 (or CM2) tree clustering method with the full DB (or reduced DB). Experimental results show that the proposed method can reduce the size of the speech DB by 23% while achieving a high MOS in the perception test. Therefore, the proposed method can be applied to build a small-sized TTS system.

Implementation of Korean TTS Service on Android OS (안드로이드 OS 기반 한국어 TTS 서비스의 설계 및 구현)

  • Kim, Tae-Guon;Kim, Bong-Wan;Choi, Dae-Lim;Lee, Yong-Ju
    • The Journal of the Korea Contents Association / v.12 no.1 / pp.9-16 / 2012
  • Although Android-based smartphones are being released in Korea, no Korean TTS engine is built into them, and Google has not officially announced a service or software development kit for Korean TTS. As a result, application developers who want to include Korean TTS capability in their applications face difficulties. In this paper, we design and implement an Android OS-based Korean TTS system and service. For speed, the text preprocessing and synthesis libraries are implemented using the Android NDK. The response time of the TTS is minimized by using Java's thread mechanism and the AudioTrack class. To test the implemented service, an application that reads incoming SMS messages aloud was developed. The test shows that synthesized speech is generated in real time for random sentences. With the implemented Korean TTS service, Android application developers can easily convey information by voice. The Korean TTS service proposed and implemented in this paper overcomes the shortcomings of existing restrictive synthesis methods and benefits both application developers and users.

Robot Vision to Audio Description Based on Deep Learning for Effective Human-Robot Interaction (효과적인 인간-로봇 상호작용을 위한 딥러닝 기반 로봇 비전 자연어 설명문 생성 및 발화 기술)

  • Park, Dongkeon;Kang, Kyeong-Min;Bae, Jin-Woo;Han, Ji-Hyeong
    • The Journal of Korea Robotics Society / v.14 no.1 / pp.22-30 / 2019
  • For effective human-robot interaction, robots need not only to understand the current situational context well but also to convey that understanding to the human participant efficiently. The most convenient way for a robot to deliver its understanding to the human participant is to express it using voice and natural language. Recently, artificial intelligence for video understanding and natural language processing has developed very rapidly, especially on the basis of deep learning. This paper therefore proposes a robot vision to audio description method using deep learning. The applied model is a pipeline of two deep learning models: one generates a natural language sentence from robot vision, and the other generates voice from the generated sentence. We also conduct a real robot experiment to show the effectiveness of our method in human-robot interaction.

A Study on the optimal text corpus for company names (한국어최적상호명코퍼스설계에관한연구)

  • Lee, Sun-Jung
    • Journal of the Korea Computer Industry Society / v.5 no.5 / pp.747-754 / 2004
  • In this paper, we obtain an optimal corpus that represents the characteristics of a baseline corpus consisting of 1,566,943 unique names drawn from the company names in a directory assistance service (114). Two kinds of optimal solutions are considered. The first is a phonetically balanced corpus (PBC), which is the minimum set that includes all triphones occurring in the baseline corpus. The second is a phonetically distributed corpus (PDC), which is a minimum set that represents the frequency characteristics of triphones in the baseline corpus. We obtain 8,699 words as the PBC and 16,783 words (similarity measure R = 0.92) as the PDC. These corpora can be used for the development of speech recognition and speech synthesis.
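The abstract does not say how the minimum covering set is found. As a rough illustration only, a PBC-style selection can be approximated with a greedy set-cover heuristic, which repeatedly picks the entry covering the most triphones still missing. The function names, the `#` edge padding, and the toy lexicon below are assumptions made for this sketch, and a greedy pass gives a small covering set, not a guaranteed minimum.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, str, str]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the triphones of a phone sequence, padding both edges with '#'."""
    padded = ["#"] + list(phones) + ["#"]
    return {(padded[i], padded[i + 1], padded[i + 2]) for i in range(len(padded) - 2)}


def greedy_balanced_subset(lexicon: Dict[str, List[str]]) -> List[str]:
    """Greedily pick entries until every triphone in the lexicon is covered.

    This is a set-cover approximation (an assumption of this sketch), not
    necessarily the procedure used in the paper.
    """
    coverage = {name: triphones(phones) for name, phones in lexicon.items()}
    uncovered: Set[Triphone] = set()
    for units in coverage.values():
        uncovered |= units

    selected: List[str] = []
    candidates = sorted(coverage)  # sorted so that ties are broken deterministically
    while uncovered:
        # Entry that covers the largest number of still-uncovered triphones.
        best = max(candidates, key=lambda name: len(coverage[name] & uncovered))
        selected.append(best)
        uncovered -= coverage[best]
    return selected


# Hypothetical toy lexicon: entry name -> phone sequence.
toy_lexicon = {"ka": ["k", "a"], "ka2": ["k", "a"], "mi": ["m", "i"]}
subset = greedy_balanced_subset(toy_lexicon)
```

Here `ka2` adds no new triphones once `ka` is chosen, so it is left out of the subset. A PDC-style selection would additionally need a stopping rule based on how closely the subset's triphone frequencies match the baseline, which this sketch does not attempt.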

A Neural Network Based Korean Segmental Duration Modeling Using Tonal Information of Phonemes (음소별 성조 정보를 이용한 신경망 기반의 한국어 음소 지속시간 모델링)

  • 김은경;이상호;오영환
    • The Journal of the Acoustical Society of Korea / v.18 no.6 / pp.84-88 / 1999
  • Accurate estimation of segmental duration is crucial for natural-sounding text-to-speech synthesis. To predict Korean segmental durations, conventional methods have used phonemic context, part-of-speech context, and positional information within the prosodic phrase. In this paper, the tonal information of phonemes is employed for more accurate prediction. After defining two non-boundary tones and six boundary tones, we annotated each syllable of 400 sentences with a tonal label. To predict segmental duration using tonal information, we constructed neural networks with a real-valued output node that predicts phonemic duration and trained them with the backpropagation algorithm. Experimental results showed that the proposed features are effective for predicting Korean segmental durations, and we obtained a correlation coefficient of 0.863 between the observed and predicted durations.
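For readers unfamiliar with the reported metric, the sketch below shows how a correlation coefficient between observed and predicted durations is typically computed. It assumes the Pearson coefficient, which the abstract does not state explicitly, and the duration values are made up for illustration.

```python
import math
from typing import Sequence


def pearson(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_o * var_p)


# Made-up phoneme durations in milliseconds (illustrative only).
observed_ms = [62.0, 85.0, 110.0, 74.0, 130.0]
predicted_ms = [70.0, 80.0, 101.0, 79.0, 118.0]
r = pearson(observed_ms, predicted_ms)
```

A value near 1 means the predicted durations rise and fall with the observed ones. It does not by itself show that the absolute errors are small.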

VoiceXML Dialog System Based on RSS for Contents Syndication (콘텐츠 배급을 위한 RSS 기반의 VoiceXML 다이얼로그 시스템)

  • Kwon, Hyeong-Joon;Kim, Jung-Hyun;Lee, Hyon-Gu;Hong, Kwang-Seok
    • The KIPS Transactions:PartB / v.14B no.1 s.111 / pp.51-58 / 2007
  • This paper proposes a prototype dialog system that combines VXML (VoiceXML), the W3C's standard XML format for specifying interactive voice dialogues between a human and a computer, with RSS (RDF Site Summary or Really Simple Syndication), a representative semantic web technology for syndication and subscription of updated web content. The merits of the proposed system are as follows: 1) It is a new method that recognizes spoken content over wired and wireless telephone networks and then provides content to the user via STT (Speech-to-Text) and TTS (Text-to-Speech), instead of the traditional web-only method. 2) It retains the advantage of RSS, because subscribed, updated content is converted to VXML without modifying the traditional way of providing RSS service. 3) For users, it reduces the time and space restrictions on searching RSS-provided content, because it uses wired and wireless telephone networks rather than an internet environment. 4) For information providers, it needs no special component for syndicating the newest content using speech recognition and synthesis technology. We implemented a news service system using VXML and RSS to evaluate the performance of the proposed system. In the experiments, we estimated the response time and the speech recognition rate for subscription and search of actual content, and confirmed that the proposed system can provide the content supplied through an RSS feed.

The Modeling of Pause Duration For Text-To-Speech Synthesis System (TTS 시스템을 위한 휴지기간 모델링)

  • Chung Jihye;Lee Yanhee
    • Proceedings of the Acoustical Society of Korea Conference / spring / pp.83-86 / 2000
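  • This paper proposes a model for pause detection and pause duration prediction to improve the naturalness of synthetic speech in a speech synthesis system that uses non-uniform units. The proposed pause duration prediction model uses CART (Classification And Regression Trees), one of the tree-based modeling techniques. For this purpose, we built a sentence speech database of 400 sentences in total, containing 6,220 eojeol (word-phrase) boundaries and uttered by a single male speaker, and determined the optimal tree from this database by V-fold cross-validation. In the evaluation, the overall pause detection accuracy was 81%, with 83% accuracy for detecting the presence of a pause and 80% for detecting its absence. The multiple correlation coefficient between the actual and predicted pause durations was 0.84, and the accuracy within an error range of 20 ms was 88%. In addition, a listening test on synthetic speech with the predicted pause durations applied showed that it was largely similar to natural speech.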
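The accuracy within a 20 ms error range reported in this abstract can be read as the fraction of pauses whose predicted duration falls within 20 ms of the actual duration. The sketch below computes that fraction. Treating the boundary as inclusive is an assumption, and the sample values are invented.

```python
from typing import Sequence


def within_tolerance_rate(
    actual_ms: Sequence[float],
    predicted_ms: Sequence[float],
    tol_ms: float = 20.0,
) -> float:
    """Fraction of predictions whose absolute error is at most tol_ms."""
    if len(actual_ms) != len(predicted_ms) or not actual_ms:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for a, p in zip(actual_ms, predicted_ms) if abs(a - p) <= tol_ms)
    return hits / len(actual_ms)


# Invented pause durations in milliseconds; absolute errors are 10, 30, 0, 19.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 230.0, 300.0, 381.0]
rate = within_tolerance_rate(actual, predicted)
```

With these values three of the four predictions fall inside the tolerance, so `rate` is 0.75.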

Text-to-Speech System Using Variable Synthesis Units (가변합성단위를 사용한 문서 음성 변환 시스템)

  • 조관선;이철희
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 1998.06a / pp.99-102 / 1998
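  • This paper proposes a synthesis system that uses variable synthesis units to produce natural-sounding speech. Existing systems that use small units such as phonemes or diphones have the drawback of many concatenation points when speech segments are joined. Using large units such as words or compound phonemes, on the other hand, reduces the number of concatenation points and improves sound quality, but the larger number of units makes unlimited-vocabulary synthesis difficult. To solve this problem, this paper synthesizes speech by selectively using large units such as eojeol (word phrases) and CVC together with small units such as demisyllables, reducing the number of concatenation points and obtaining improved sound quality with a moderate amount of memory. In the experiment, for specific sentences, speech synthesized with demisyllables and with CVC units was compared with speech synthesized by mixing each of these with eojeol units. The results showed that speech synthesized with variable units was relatively natural.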
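The abstract does not describe the rule used to choose between large and small units. One simple way to picture the idea is a longest-match segmentation: at each position, take the longest unit available in the inventory and fall back to a single small unit otherwise. The sketch below illustrates that assumption with abstract symbols; the inventory, symbols, and function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Unit = Tuple[str, ...]


def segment(symbols: Sequence[str], inventory: Set[Unit], max_len: int) -> List[Unit]:
    """Split symbols into units, preferring the longest unit found in inventory.

    A single symbol is always accepted as the small fallback unit, so the
    segmentation never fails. Longest-match is an assumption of this sketch.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    units: List[Unit] = []
    i = 0
    while i < len(symbols):
        for n in range(min(max_len, len(symbols) - i), 0, -1):
            candidate = tuple(symbols[i:i + n])
            if n == 1 or candidate in inventory:
                units.append(candidate)
                i += n
                break
    return units


def concatenation_points(units: List[Unit]) -> int:
    """Number of joins needed to concatenate the units in order."""
    return max(len(units) - 1, 0)


# Hypothetical symbol string and large-unit inventory.
symbols = list("abcde")
large_units = {("a", "b", "c"), ("d", "e")}
mixed = segment(symbols, large_units, max_len=3)
small_only = segment(symbols, set(), max_len=3)
```

In this toy case the mixed inventory needs one join while the small-unit-only segmentation needs four, which mirrors the trade-off the abstract describes between concatenation points and inventory size.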

Using of The Korean Language Voice Synthesis For E-Mail Manager System (한국어 음성 합성을 이용한 이메일 매니저)

  • Jo, Gyu-Sang;Lee, Young-Hoon;Lee, Byeong-Ryeol;Seo, Dae-Young
    • Annual Conference on Human and Language Technology / 2009.10a / pp.266-270 / 2009
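  • As the growth of IT-related industries broadens the user base, demand for IT use among people with disabilities is increasing. This paper discusses techniques for developing an e-mail manager that lets visually impaired users use e-mail, the most basic IT application, without inconvenience. A TTS function (Text-To-Speech: converting text to speech and reading it aloud) and a voice keyboard function (announcing each typed character by voice during keyboard input) let visually impaired users use e-mail without inconvenience. The system's TTS algorithm was implemented in Java with reference to the Standard Korean Pronunciation rules.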
