Speech Synthesis using Diphone Clustering and Improved Spectral Smoothing

Jang, Hyo-Jong; Kim, Kwan-Jung; Kim, Gye-Young; Choi, Hyung-Il

doi:10.3745/KIPSTB.2003.10B.6.665

The KIPS Transactions:PartB (정보처리학회논문지B)

Volume 10B, Issue 6 / Pages 665-672 / 2003 / 1598-284X (pISSN)

Korea Information Processing Society (한국정보처리학회)


다이폰 군집화와 개선된 스펙트럼 완만화에 의한 음성합성

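Jang, Hyo-Jong (Department of Computer Science, Graduate School, Soongsil University);
Kim, Kwan-Jung (Department of Computer Information, Hanseo University);
Kim, Gye-Young (School of Computing, Soongsil University);
Choi, Hyung-Il (School of Media, Soongsil University)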

Published : 2003.10.01

https://doi.org/10.3745/KIPSTB.2003.10B.6.665

Abstract

This paper describes a speech synthesis technique based on concatenating unit phonemes. The major problem with this approach is the discontinuity that arises at the junction between unit phonemes, which is especially pronounced when the units were recorded by different speakers. To address it, the paper uses clustered diphones and proposes a spectral smoothing technique that draws on the formant trajectory and the distribution characteristics of the spectrum while also reflecting the characteristics of human hearing. Specifically, the proposed technique clusters unit phonemes by the similarity of their spectral distributions at the junction, determines the amount and range of smoothing from human auditory characteristics at the junction, and then smooths the spectrum using weights that vary along the time axis across the boundary between two diphones. The technique removes the discontinuity while minimizing the distortion that spectral smoothing can introduce. For performance evaluation, experiments were run on about five hundred diphones extracted from twenty sentences recorded by five speakers, and the experimental results are reported.
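The idea of smoothing with weights that vary along the time axis can be illustrated with a minimal sketch. The abstract does not give the paper's actual weight function, nor how the auditory model sets the amount and range of smoothing, so everything below is an assumption for illustration only: the function name `smooth_boundary`, the `span` parameter (number of frames modified on each side of the join), the linear ramp of weights, and the use of the midpoint of the two boundary frames as the target spectrum. Formant trajectories and clustering are not modeled.

```python
import numpy as np


def smooth_boundary(left, right, span):
    """Reduce the spectral jump where two diphones are joined.

    left, right: arrays of shape (frames, bins) holding spectral envelopes
                 (for example log-magnitude spectra) of the two diphones.
    span:        hypothetical parameter, the number of frames on each side
                 of the join that are modified.
    Returns the concatenated, smoothed envelope sequence.
    """
    left = np.array(left, dtype=float)    # copies, inputs stay untouched
    right = np.array(right, dtype=float)
    span = min(span, len(left), len(right))
    if span == 0:
        return np.vstack([left, right])

    # Assumed target at the join: midpoint of the two boundary frames.
    target = 0.5 * (left[-1] + right[0])

    # Linear weights that grow toward the join: 1/span, ..., 1.
    w = np.arange(1, span + 1) / span

    # Frames closer to the join receive a larger share of the correction,
    # so the boundary frames meet at `target` and distant frames barely move.
    left[-span:] += w[:, None] * (target - left[-1])
    right[:span] += w[::-1][:, None] * (target - right[0])

    return np.vstack([left, right])


# Two flat envelopes with a jump of 2.0 at the join.
out = smooth_boundary(np.zeros((4, 3)), np.full((4, 3), 2.0), span=2)
print(out[:, 0])  # first spectral bin over time
```

In this toy input the first bin goes from `0, 0, 0, 0, 2, 2, 2, 2` to `0, 0, 0.5, 1, 1, 1.5, 2, 2`: the jump at the boundary is spread over `span` frames on each side. A larger `span` gives a gentler transition but alters more of the original signal, which is the trade-off between removing the discontinuity and limiting distortion that the abstract refers to.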


