Acknowledgement
The authors would like to thank the Kerala State Council for Science, Technology and Environment and Dr. Sri Rama Murty, Associate Professor, Indian Institute of Technology Hyderabad, for their support.
References
- J. H. L. Hansen and T. Hasan, Speaker recognition by machines and humans: a tutorial review, IEEE Signal Proc. Mag. 32 (2015), no. 6, 74-99. https://doi.org/10.1109/MSP.2015.2462851
- L. Mary, Significance of prosody for speaker, language, emotion, and speech recognition, in Extraction of Prosody for Automatic Speaker, Language, Emotion and Speech Recognition, Springer, 2018, pp. 1-22.
- E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, and A. Stolcke, Modeling prosodic feature sequences for speaker recognition, Speech Commun. 46 (2005), no. 3-4, 455-472. https://doi.org/10.1016/j.specom.2005.02.018
- N. Dehak, P. Dumouchel, and P. Kenny, Modeling prosodic features with joint factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. 15 (2007), no. 7, 2095-2103. https://doi.org/10.1109/TASL.2007.902758
- L. Mary and B. Yegnanarayana, Extraction and representation of prosodic features for language and speaker recognition, Speech Commun. 50 (2008), no. 10, 782-796. https://doi.org/10.1016/j.specom.2008.04.010
- G. Friedland, O. Vinyals, Y. Huang, and C. Muller, Prosodic and other long-term features for speaker diarization, IEEE Trans. Audio Speech Lang. Process. 17 (2009), no. 5, 985-993. https://doi.org/10.1109/TASL.2009.2015089
- C.-C. Leung, M. Ferras, C. Barras, and J.-L. Gauvain, Comparing prosodic models for speaker recognition, (Ninth Annual Conference of the International Speech Communication Association, Brisbane, Australia), 2008, pp. 1945-1948.
- A. G. Adami, R. Mihaescu, D. A. Reynolds, and J. J. Godfrey, Modeling prosodic dynamics for speaker recognition, (IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong, China), 2003. https://doi.org/10.1109/ICASSP.2003.1202761
- B. Peskin, J. Navratil, J. Abramson, D. Jones, D. Klusacek, D. A. Reynolds, and B. Xiang, Using prosodic and conversational features for high-performance speaker recognition: Report from JHU WS'02, (IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, China), 2003. https://doi.org/10.1109/ICASSP.2003.1202762
- D. Reynolds, B. Peskin, J. Navratil, J. Campbell, W. Andrews, D. Klusacek, A. Adami, Q. Jin, J. Abramson, R. Mihaescu, and J. Godfrey, SuperSID: exploiting high-level information for high-performance speaker recognition, (IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, China), 2003, pp. 784-787.
- S. Kajarekar, L. Ferrer, K. Sonmez, J. Zheng, E. Shriberg, and A. Stolcke, Modeling NERFs for speaker recognition, (Odyssey04-the speaker and language recognition workshop, Toledo, Spain), 2004, pp. 51-56.
- M. Kockmann, L. Ferrer, L. Burget, E. Shriberg, and J. Cernocky, Recent progress in prosodic speaker verification, (IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic), 2011, pp. 4556-4559.
- E. Shriberg, L. Ferrer, A. Venkataraman, and S. Kajarekar, SVM modeling of "SNERF-grams" for speaker recognition, (Eighth International Conference on Spoken Language Processing, Jeju, Rep. of Korea), 2004, pp. 1409-1412.
- S. B. Alex, B. P. Babu, and L. Mary, Utterance and syllable level prosodic features for automatic emotion recognition, (IEEE Recent Advances in Intelligent Computational Systems, Trivandrum, Kerala, India), 2018, pp. 31-35.
- N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. 19 (2011), no. 4, 788-798.
- W. H. Kang and N. S. Kim, Unsupervised learning of total variability embedding for speaker verification with random digit strings, Appl. Sci. 9 (2019), no. 8, 1597.
- P. Kenny, T. Stafylakis, P. Ouellet, V. Gupta, and M. J. Alam, Deep neural networks for extracting Baum-Welch statistics for speaker recognition, (Odyssey 2014-the Speaker and Language Recognition Workshop, Joensuu, Finland), 2014, pp. 293-298.
- Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, A novel scheme for speaker recognition using a phonetically-aware deep neural network, (IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy), 2014, pp. 1695-1699.
- D. Garcia-Romero and A. McCree, Insights into deep neural networks for speaker recognition, (Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany), 2015, pp. 1141-1145.
- M. McLaren, Y. Lei, and L. Ferrer, Advances in deep neural network approaches to speaker recognition, (IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, Australia), 2015, pp. 4814-4818.
- F. Richardson, D. Reynolds, and N. Dehak, Deep neural network approaches to speaker and language recognition, IEEE Signal Process. Lett. 22 (2015), no. 10, 1671-1675. https://doi.org/10.1109/LSP.2015.2420092
- E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, (IEEE International Conference on Acoustics, speech and signal processing, Florence, Italy), 2014, pp. 4052-4056.
- G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, End-to-end text-dependent speaker verification, (IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China), 2016, pp. 5115-5119.
- D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, Deep neural network embeddings for text-independent speaker verification, (Proc. Interspeech, Stockholm, Sweden), 2017, pp. 999-1003.
- D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, X-vectors: robust DNN embeddings for speaker recognition, (IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada), 2018, pp. 5329-5333.
- D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, Deep neural network-based speaker embeddings for end-to-end speaker verification, (IEEE Spoken Language Technology Workshop, San Diego, CA, USA), 2016, pp. 165-170.
- L. Li, Z. Tang, Y. Shi, and D. Wang, Gaussian-constrained training for speaker verification, (IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK), 2019, pp. 6036-6040.
- X. Wang, L. Li, and D. Wang, VAE-based domain adaptation for speaker verification, (Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Lanzhou, China), 2019, pp. 535-539.
- Y. Zhang, L. Li, and D. Wang, VAE-based regularization for deep speaker embedding, (Proc. Interspeech, Graz, Austria), 2019, pp. 4020-4024.
- K. S. R. Murty and B. Yegnanarayana, Combining evidence from residual phase and MFCC features for speaker recognition, IEEE Signal Process. Lett. 13 (2006), no. 1, 52-55.
- D. P. Kingma and M. Welling, Auto-encoding variational Bayes, arXiv preprint, 2013. https://doi.org/10.48550/arXiv.1312.6114
- D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint, 2014. https://doi.org/10.48550/arXiv.1401.4082
- K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, DRAW: a recurrent neural network for image generation, arXiv preprint, 2015. https://doi.org/10.48550/arXiv.1502.04623
- T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, Deep convolutional inverse graphics network, (Advances in Neural Information Processing Systems, Montreal, Quebec, Canada), 2015, pp. 2539-2547.
- Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, Variational autoencoder for deep learning of images, labels and captions, (Advances in Neural Information Processing Systems, Barcelona, Spain), 2016, pp. 2352-2360.
- S. Mohamed and D. J. Rezende, Variational information maximisation for intrinsically motivated reinforcement learning, (Advances in Neural Information Processing Systems, Montreal, Quebec, Canada), 2015, pp. 2125-2133.
- C. Doersch, Tutorial on variational autoencoders, arXiv preprint, 2016. https://doi.org/10.48550/arXiv.1606.05908
- M. Blaauw and J. Bonada, Modeling and transforming speech using variational autoencoders, (Proc. Interspeech, San Francisco, CA, USA), 2016, pp. 1770-1774.
- N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan, Deep unsupervised clustering with Gaussian mixture variational autoencoders, arXiv preprint, 2016. https://doi.org/10.48550/arXiv.1611.02648
- S. Tan and K. C. Sim, Learning utterance-level normalisation using variational autoencoders for robust automatic speech recognition, (IEEE Spoken Language Technology Workshop, San Diego, CA, USA), 2016, pp. 43-49.
- L. Mary and B. Yegnanarayana, Prosodic features for speaker verification, (Ninth International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA), 2006, pp. 917-920.
- K. Bartkova, D. L. Gac, D. Charlet, and D. Jouvet, Prosodic parameter for speaker identification, (Proc. Seventh International Conference on Spoken Language Processing (ICSLP), Denver, CO, USA), 2002, pp. 1197-1200.
- Y. Xu, Consistency of tone-syllable alignment across different syllable structures and speaking rates, Phonetica 55 (1998), no. 4, 179-203. https://doi.org/10.1159/000028432
- M. Atterer and D. R. Ladd, On the phonetics and phonology of "segmental anchoring" of F0: evidence from German, J. Phon. 32 (2004), no. 2, 177-197. https://doi.org/10.1016/S0095-4470(03)00039-1
- S. B. Alex, L. Mary, and B. P. Babu, Attention and feature selection for automatic speech emotion recognition using utterance and syllable-level prosodic features, Circ. Syst. Signal Process. 39 (2020), 5681-5709. https://doi.org/10.1007/s00034-020-01429-3
- L. Mary, A. P. Antony, B. P. Babu, and S. R. M. Prasanna, Automatic syllabification of speech signal using short time energy and vowel onset points, Int. J. Speech Technol. 21 (2018), no. 3, 571-579. https://doi.org/10.1007/s10772-018-9517-6
- K. Sonmez, E. Shriberg, L. Heck, and M. Weintraub, Modeling dynamic prosodic variation for speaker verification, (Proc. Fifth International Conference on Spoken Language Processing (ICSLP), Sydney, Australia), 1998, pp. 3189-3192.
- K. Sjölander, The Snack sound toolkit, 2004. http://www.speech.kth.se/snack
- C.-Y. Lin and H.-C. Wang, Language identification using pitch contour information, (IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, PA, USA), 2005, pp. I-601.
- C. Gussenhoven, B. H. Repp, A. Rietveld, H. H. Rump, and J. Terken, The perceptual prominence of fundamental frequency peaks, J. Acous. Soc. Am. 102 (1997), no. 5, 3009-3022. https://doi.org/10.1121/1.420355
- A. G. Adami, Modeling prosodic differences for speaker recognition, Speech Commun. 49 (2007), no. 4, 277-291. https://doi.org/10.1016/j.specom.2007.02.005
- S. J. Park, G. Yeung, J. Kreiman, P. A. Keating, and A. Alwan, Using voice quality features to improve short-utterance, text-independent speaker verification systems, (Proc. Interspeech, Stockholm, Sweden), 2017, pp. 1522-1526.
- M. Farrus, J. Hernando, and P. Ejarque, Jitter and shimmer measurements for speaker recognition, (Eighth Annual Conference of the International Speech Communication Association, Antwerp, Belgium), 2007, pp. 778-781.
- G. Deekshitha, K. R. Sreelakshmi, B. P. Babu, and L. Mary, Development of spoken story database in Malayalam language, (4th International Conference on Electrical Energy Systems, Chennai, India), 2018, pp. 530-533.
- The NIST year 2010 speaker recognition evaluation plan, 2010. http://www.itl.nist.gov/iad/mig/tests/sre/2010/
- The NIST year 2003 speaker recognition evaluation plan, 2003. http://www.nist.gov/speech/tests/spk/2003/
- F. Chollet, Keras, 2015. https://keras.io
- M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, and M. Kudlur, TensorFlow: a system for large-scale machine learning, (12th USENIX Symposium on Operating Systems Design and Implementation, Savannah, GA, USA), 2016, pp. 265-283.
- Y. Liu, L. He, J. Liu, and M. T. Johnson, Speaker embedding extraction with phonetic information, arXiv Preprint, 2018. https://doi.org/10.48550/arXiv.1804.04862
- T. Pekhovsky and M. Korenevsky, Investigation of using VAE for i-vector speaker verification, arXiv preprint, 2017. https://doi.org/10.48550/arXiv.1705.09185
- S. Thomas, S. H. Mallidi, S. Ganapathy, and H. Hermansky, Adaptation transforms of auto-associative neural networks as features for speaker verification, (Odyssey 2012-the Speaker and Language Recognition Workshop, Singapore), 2012, pp. 98-104.
- M. Diez, A. Varona, M. Penagarikano, L. J. Rodriguez-Fuentes, and G. Bordel, Using phone log-likelihood ratios as features for speaker recognition, (Proc. Interspeech, Lyon, France), 2013. https://doi.org/10.21437/Interspeech.2013-419
- M. Kockmann, L. Ferrer, L. Burget, and J. Cernocky, iVector fusion of prosodic and cepstral features for speaker verification, (Twelfth Annual Conference of the International Speech Communication Association, Florence, Italy), 2011, pp. 265-268.
- N. Scheffer, L. Ferrer, M. Graciarena, S. Kajarekar, E. Shriberg, and A. Stolcke, The SRI NIST 2010 speaker recognition evaluation system, (IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic), 2011, pp. 5292-5295.