Improvement of Synthetic Speech Quality using a New Spectral Smoothing Technique

  • Published: 2003.12.01

Abstract

This paper describes a speech synthesis technique that uses the diphone as the unit phoneme. Speech synthesis is basically accomplished by concatenating unit phonemes, and its major problem is the discontinuity at the junction between unit phonemes. To solve this problem, this paper proposes a new spectral smoothing technique that reflects not only formant trajectories but also the distribution characteristics of the spectrum and human auditory characteristics. That is, the proposed technique determines the amount and extent of smoothing by considering human auditory characteristics at the junction of unit phonemes, and then performs spectral smoothing using weights that vary along the time axis at the border of the two diphones. The proposed technique reduces the discontinuity and minimizes the distortion that smoothing itself can introduce. For performance evaluation, we tested about five hundred diphones extracted from some twenty sentences, drawn from ETRI Voice DB samples and individually self-recorded samples.

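The boundary smoothing described in the abstract can be sketched as follows. This is a minimal illustration, assuming magnitude spectrograms and a linear weight ramp; the function name, the frame-count parameter, and the specific weight schedule are assumptions for illustration, not the authors' actual implementation (which derives the amount and extent of smoothing from auditory criteria).

```python
import numpy as np

def smooth_joint(spec_a, spec_b, n):
    """Smooth the spectral discontinuity where diphone A meets diphone B.

    spec_a, spec_b: magnitude spectrograms, shape (frames, bins).
    n: number of frames on each side of the joint to smooth
       (in the paper this extent follows auditory characteristics;
       here it is simply a parameter).
    """
    a, b = spec_a.copy(), spec_b.copy()
    edge_a, edge_b = spec_a[-1], spec_b[0]   # boundary frames at the joint
    for i in range(n):
        # Weight decays linearly with distance from the joint:
        # 0.5 at the boundary frames, approaching 0 further away.
        w = 0.5 * (n - i) / n
        a[-1 - i] = (1.0 - w) * spec_a[-1 - i] + w * edge_b
        b[i] = (1.0 - w) * spec_b[i] + w * edge_a
    return a, b
```

At the joint itself both boundary frames are pulled to the midpoint of the two spectra, and the correction fades out over n frames on each side, which keeps the smoothing local and limits distortion away from the junction.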
