The usefulness of the depth images in image-based speech synthesis

  • Ki-Seung Lee (Department of Electronic Engineering, Konkuk University)
  • Received : 2022.12.30
  • Accepted : 2023.01.25
  • Published : 2023.01.31

Abstract

Images acquired from the speaker's mouth region exhibit patterns specific to the sounds being uttered. Exploiting this principle, several methods have been proposed in which speech signals are recognized or synthesized from images of the speaker's lower face. In this study, an image-based speech synthesis method is proposed in which depth images are used cooperatively. Since depth images provide depth information that cannot be obtained from an optical image, they can supplement flat optical images. This paper evaluates the usefulness of depth images from the perspective of speech synthesis. A validation experiment was carried out on 60 Korean isolated words, and it was confirmed that performance in terms of both subjective and objective evaluation was comparable to that of the optical image-based method. When the two images were used in combination, performance improved compared with when either image was used alone.
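The abstract describes combining optical and depth frames of the mouth region to predict speech. One plausible way to realize "using the two images in combination" is early, channel-wise fusion of the two frames before regression to spectral features. The sketch below illustrates only that data flow; the function names, frame sizes, and the linear stand-in for a trained network are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def fuse_frames(optical, depth):
    """Early fusion: stack an optical frame and a depth frame channel-wise.

    optical: (H, W) grayscale mouth-region frame, values in [0, 1]
    depth:   (H, W) depth frame of the same region, normalized to [0, 1]
    Returns an (H, W, 2) array a downstream regressor can map to
    spectral features (e.g., one mel-spectrogram frame).
    """
    if optical.shape != depth.shape:
        raise ValueError("optical and depth frames must share spatial size")
    return np.stack([optical, depth], axis=-1)

def predict_spectral_frame(fused, W, b):
    """Toy linear map from a fused frame to a spectral-feature vector.

    W: (H*W*2, n_mels) weights, b: (n_mels,) bias -- stand-ins for a
    trained network; they only illustrate the shapes involved.
    """
    x = fused.reshape(-1)          # flatten the two-channel frame
    return x @ W + b

# Illustrative shapes: 32x32 mouth crops, 40 mel bins.
rng = np.random.default_rng(0)
opt = rng.random((32, 32))
dep = rng.random((32, 32))
fused = fuse_frames(opt, dep)
frame = predict_spectral_frame(
    fused, rng.standard_normal((32 * 32 * 2, 40)) * 0.01, np.zeros(40))
print(fused.shape, frame.shape)  # (32, 32, 2) (40,)
```

Late fusion (separate encoders per modality, combined afterwards) is an equally valid design; the abstract does not specify which the authors used.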

Acknowledgement

This paper is part of the results of the National Research Foundation of Korea research project "Development of silent speech communication technology using depth images" (Project No.: 2022R1F1A10689791120682073250101).

References

  1. B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, "Silent speech interfaces," Speech Comm. 52, 270-287 (2010). https://doi.org/10.1016/j.specom.2009.08.002
  2. K.-S. Lee, "EMG-based speech recognition using hidden Markov models with global control variables," IEEE Trans. Biomed. Eng. 55, 930-940 (2008). https://doi.org/10.1109/TBME.2008.915658
  3. I. Almajai and B. Milner, "Visually derived Wiener filters for speech enhancement," IEEE Trans. Audio, Speech, Language Proc. 19, 1642-1651 (2011). https://doi.org/10.1109/TASL.2010.2096212
  4. S. Li, Y. Tian, G. Lu, Y. Zhang, H. Lv, X. Yu, H. Xue, H. Zhang, J. Wang, and X. Jing, "A 94-GHz millimeter-wave sensor for speech signal acquisition," Sensors, 13, 14248-14260 (2013). https://doi.org/10.3390/s131114248
  5. K.-S. Lee, "Speech synthesis using Doppler signal" (in Korean), J. Acoust. Soc. Kr. 35, 134-142 (2016). https://doi.org/10.7776/ASK.2016.35.2.134
  6. K.-S. Lee, "Ultrasonic Doppler based silent speech interface using perceptual distance," Appl. Sci. 12, 827 (2022).
  7. M. A. Subhi, S. H. M. Ali, A. G. Ismail, and M. Othman, "Food volume estimation based on stereo image analysis," IEEE IMM, 6, 36-43 (2018).
  8. P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. IEEE CVPR, 511-518 (2001).
  9. D. W. Griffin and J. S. Lim, "Signal estimation from the modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Proc. 32, 236-243 (1984). https://doi.org/10.1109/TASSP.1984.1164317
  10. J. M. Martin-Donas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, "A deep learning loss function based on the perceptual evaluation of the speech quality," IEEE Signal Process. Lett. 25, 1680-1684 (2018). https://doi.org/10.1109/lsp.2018.2871419
  11. ITU-T, Rec. P.862, Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs, Int. Telecomm. Union-Telecomm. Stand. Sector, 2001.