Proposal of speaker change detection system considering speaker overlap

  • Park, Jisu (Department of Information and Communication Engineering, Hannam University) ;
  • Yun, Young-Sun (Department of Information and Communication Engineering, Hannam University) ;
  • Cha, Shin (Department of Information and Communication Engineering, Hannam University) ;
  • Park, Jeon Gue (ETRI)
  • Received : 2021.07.09
  • Accepted : 2021.08.26
  • Published : 2021.09.30

Abstract

Speaker Change Detection (SCD) refers to finding the moments at which the active speaker changes from one person to another in a conversation. Speaker change detection is complicated by speaker overlap, inaccurate speaker labels, and data imbalance. To address these problems, utterances from the TIMIT corpus, which is widely used in speech recognition, were artificially concatenated to obtain a sufficient amount of training data, and speaker change detection was performed after first identifying overlapping speech. In this paper, we propose a speaker change detection system that takes speaker overlap into account, and we evaluate and verify its performance with various approaches. As a result, a detection system similar to the X-Vector structure was proposed to remove speaker overlap regions, and a Bi-LSTM model was selected to model speaker changes. The experimental results show relative performance improvements of 4.6 % and 13.8 %, respectively, over the baseline system. Based on these results, we expect that an even more robust speaker change detection system can be built in follow-up studies that additionally take text and speaker information into account.
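The abstract describes a two-stage pipeline: an X-Vector-like classifier first removes regions of overlapped speech, and a Bi-LSTM then labels speaker change points on the remaining frames. The following is a minimal PyTorch sketch of that idea; the feature type, layer sizes, and the `detect_changes` helper are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the two-stage idea in the abstract:
# (1) frames flagged as overlapped speech are discarded, then
# (2) a Bi-LSTM labels each remaining frame as "change" or "no change".
# Feature dimensions and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class BiLSTMChangeDetector(nn.Module):
    """Frame-level speaker change detector (hypothetical configuration)."""

    def __init__(self, feat_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # change / no-change per frame

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) acoustic features, e.g. MFCCs
        out, _ = self.lstm(feats)
        return self.head(out)  # (batch, frames, 2) logits


def detect_changes(feats: torch.Tensor,
                   overlap_mask: torch.Tensor,
                   model: BiLSTMChangeDetector) -> torch.Tensor:
    """Drop frames marked as overlapped speech, then run the detector.

    overlap_mask: (frames,) boolean, True where an upstream overlap
    detector (an X-Vector-like classifier in the paper) fired.
    """
    clean = feats[:, ~overlap_mask, :]           # keep non-overlapped frames
    logits = model(clean)                        # frame-wise change logits
    return logits.argmax(dim=-1)                 # 1 = speaker change frame


if __name__ == "__main__":
    model = BiLSTMChangeDetector()
    feats = torch.randn(1, 500, 40)              # 500 frames of 40-dim features
    overlap = torch.zeros(500, dtype=torch.bool) # pretend nothing overlaps
    print(detect_changes(feats, overlap, model).shape)
```

A bidirectional LSTM is a natural fit for this task because evidence for a change point comes from both the preceding and the following frames of the boundary.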



Acknowledgement

This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01376, Development of the multi-speaker conversational speech recognition technology).
