Harmonic Structure Features for Robust Speaker Diarization

  • Zhou, Yu (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences)
  • Suo, Hongbin (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences)
  • Li, Junfeng (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences)
  • Yan, Yonghong (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences)
  • Received: 2011.07.18
  • Accepted: 2012.04.03
  • Published: 2012.08.30

Abstract

In this paper, we present a new approach to speaker diarization. First, we use prosodic information computed from the original speech to resynthesize new speech data with a spectrum modeling technique. The resynthesized data is modeled with sinusoids based on pitch, vibration amplitude, and phase bias. We then extract cepstral features from the resynthesized speech and integrate them with the cepstral features from the original speech. Finally, we show how the two streams of cepstral features can be combined to improve the robustness of speaker diarization. Experiments on standardized datasets (the US National Institute of Standards and Technology Rich Transcription 04-S multiple distant microphone conditions) show a significant improvement in diarization error rate over a system that uses only the feature stream from the original speech.
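
To make the resynthesis step more concrete, the following is a minimal sketch of harmonic sinusoidal synthesis from frame-level pitch, harmonic amplitudes, and phase offsets. It is not the authors' implementation: the function name `resynthesize_harmonics`, the parameter layout, the 16 kHz sample rate, the 160-sample hop, and the piecewise-constant handling of frame values are all assumptions made for illustration, and the abstract does not specify how these parameters are estimated.

```python
import numpy as np


def resynthesize_harmonics(f0, amps, phases, sample_rate=16000, hop=160):
    """Sum-of-harmonics resynthesis from frame-level parameters (illustrative).

    f0:     (n_frames,) fundamental frequency in Hz, 0 for unvoiced frames
    amps:   (n_frames, n_harmonics) amplitude of each harmonic per frame
    phases: (n_harmonics,) constant phase offset of each harmonic, in radians
    """
    f0 = np.asarray(f0, dtype=float)
    amps = np.asarray(amps, dtype=float)
    phases = np.asarray(phases, dtype=float)
    n_harm = amps.shape[1]

    # Assumption: hold each frame's values constant over its hop (no interpolation).
    f0_s = np.repeat(f0, hop)
    amps_s = np.repeat(amps, hop, axis=0)

    # Integrate the instantaneous frequency to get the fundamental's running phase.
    base_phase = 2.0 * np.pi * np.cumsum(f0_s) / sample_rate

    # Harmonic k runs at k times the fundamental.
    k = np.arange(1, n_harm + 1)
    harmonic_freq = f0_s[:, None] * k[None, :]

    # Silence unvoiced frames and any harmonic at or above the Nyquist frequency.
    mask = (harmonic_freq < sample_rate / 2) & (f0_s[:, None] > 0)

    return np.sum(mask * amps_s * np.cos(base_phase[:, None] * k + phases), axis=1)


# Usage: one second of a 100 Hz tone with three harmonics.
f0 = np.full(100, 100.0)
amps = np.tile([1.0, 0.5, 0.25], (100, 1))
signal = resynthesize_harmonics(f0, amps, np.zeros(3))
```

In a pipeline like the one described, cepstral features would then be extracted from `signal` in the same way as from the original speech, giving the second feature stream.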
