Group-based speaker embeddings for text-independent speaker verification

  • Received: 2021.07.16
  • Accepted: 2021.08.23
  • Published: 2021.09.30

Abstract

Recently, the deep speaker embedding approach has been widely used in text-independent speaker verification and shows better performance than the traditional i-vector approach. In this work, to improve the deep speaker embedding approach, we propose a novel method called group-based speaker embedding, which incorporates group information. We cluster all speakers in the training data into a predefined number of groups in an unsupervised manner, so that a fixed-length group embedding represents each group. A Group Decision Network (GDN) produces a weight for each group, and an aggregated group embedding is computed as the weighted sum of the group embeddings with these weights. Finally, we generate the group-based embedding by adding the aggregated group embedding to the deep speaker embedding. In this way, the speaker embedding reduces the search space over speaker identities by incorporating group information, and can therefore flexibly represent a large number of speakers. Experiments on the VoxCeleb1 database show that the proposed approach improves upon previous approaches.

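As a concrete illustration of the pipeline described above, the following is a minimal PyTorch sketch of the group-weighting and aggregation steps. The module name, layer sizes, number of groups, and the use of a softmax over the GDN outputs are assumptions for illustration, not the paper's exact configuration; the unsupervised clustering step that defines the groups (e.g., k-means over speaker-level embeddings) is omitted here.

```python
# Sketch of group-based speaker embedding (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupBasedEmbedding(nn.Module):
    def __init__(self, embed_dim=512, num_groups=32):
        super().__init__()
        # One fixed-length embedding vector per speaker group.
        self.group_embeddings = nn.Parameter(torch.randn(num_groups, embed_dim))
        # Group Decision Network (GDN): maps a deep speaker embedding to one
        # weight per group. A small MLP is an assumption made for this sketch.
        self.gdn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_groups),
        )

    def forward(self, speaker_embedding):
        # speaker_embedding: (batch, embed_dim) from the speaker encoder.
        group_weights = F.softmax(self.gdn(speaker_embedding), dim=-1)  # (batch, num_groups)
        # Aggregated group embedding: weighted sum of the group embeddings.
        aggregated = group_weights @ self.group_embeddings  # (batch, embed_dim)
        # Group-based embedding: add the aggregated group embedding to the
        # original deep speaker embedding.
        return speaker_embedding + aggregated

# Usage (batch of 8 embeddings from a hypothetical speaker encoder):
# gbe = GroupBasedEmbedding(embed_dim=512, num_groups=32)
# out = gbe(torch.randn(8, 512))  # (8, 512) group-based embeddings
```

Note that the group branch only adds a residual term to the deep speaker embedding, so under these assumptions it can be attached to any encoder that produces fixed-length embeddings without changing the encoder itself.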

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C1014044).
