Acknowledgement
The authors would like to thank the Kerala State Council for Science, Technology and Environment and Dr. Sri Rama Murty, Associate Professor, Indian Institute of Technology Hyderabad, for their support.
References
- J. H. L. Hansen and T. Hasan, Speaker recognition by machines and humans: a tutorial review, IEEE Signal Proc. Mag. 32 (2015), no. 6, 74-99. https://doi.org/10.1109/MSP.2015.2462851
- L. Mary, Significance of prosody for speaker, language, emotion, and speech recognition, in Extraction of Prosody for Automatic Speaker, Language, Emotion and Speech Recognition, Springer, 2018, pp. 1-22.
- E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, and A. Stolcke, Modeling prosodic feature sequences for speaker recognition, Speech Commun. 46 (2005), no. 3-4, 455-472. https://doi.org/10.1016/j.specom.2005.02.018
- N. Dehak, P. Dumouchel, and P. Kenny, Modeling prosodic features with joint factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. 15 (2007), no. 7, 2095-2103. https://doi.org/10.1109/TASL.2007.902758
- L. Mary and B. Yegnanarayana, Extraction and representation of prosodic features for language and speaker recognition, Speech Commun. 50 (2008), no. 10, 782-796. https://doi.org/10.1016/j.specom.2008.04.010
- G. Friedland, O. Vinyals, Y. Huang, and C. Muller, Prosodic and other long-term features for speaker diarization, IEEE Trans. Audio Speech Lang. Process. 17 (2009), no. 5, 985-993. https://doi.org/10.1109/TASL.2009.2015089
- C.-C. Leung, M. Ferras, C. Barras, and J.-L. Gauvain, Comparing prosodic models for speaker recognition, (Ninth Annual Conference of the International Speech Communication Association, Brisbane, Australia), 2008, pp. 1945-1948.
- A. G. Adami, R. Mihaescu, D. A. Reynolds, and J. J. Godfrey, Modeling prosodic dynamics for speaker recognition, (IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong, China), 2003. https://doi.org/10.1109/ICASSP.2003.1202761
- B. Peskin, J. Navratil, J. Abramson, D. Jones, D. Klusacek, D. A. Reynolds, and B. Xiang, Using prosodic and conversational features for high-performance speaker recognition: Report from JHU WS'02, (IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, China), 2003. https://doi.org/10.1109/ICASSP.2003.1202762
- D. Reynolds, B. Peskin, J. Navratil, J. Campbell, W. Andrews, D. Klusacek, A. Adami, Q. Jin, J. Abramson, R. Mihaescu, and J. Godfrey, SuperSID: exploiting high-level information for high-performance speaker recognition, (IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, China), 2003, pp. 784-787.
- S. Kajarekar, L. Ferrer, K. Sonmez, J. Zheng, E. Shriberg, and A. Stolcke, Modeling NERFs for speaker recognition, (Odyssey04-the speaker and language recognition workshop, Toledo, Spain), 2004, pp. 51-56.
- M. Kockmann, L. Ferrer, L. Burget, E. Shriberg, and J. Cernocky, Recent progress in prosodic speaker verification, (IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic), 2011, pp. 4556-4559.
- E. Shriberg, L. Ferrer, A. Venkataraman, and S. Kajarekar, SVM modeling of "SNERF-grams" for speaker recognition, (Eighth International Conference on Spoken Language Processing, Jeju, Rep. of Korea), 2004, pp. 1409-1412.
- S. B. Alex, B. P. Babu, and L. Mary, Utterance and syllable level prosodic features for automatic emotion recognition, (IEEE Recent Advances in Intelligent Computational Systems, Trivandrum, Kerala, India), 2018, pp. 31-35.
- N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. 19 (2011), no. 4, 788-798.
- W. H. Kang and N. S. Kim, Unsupervised learning of total variability embedding for speaker verification with random digit strings, Appl. Sci. 9 (2019), no. 8, 1597.
- P. Kenny, T. Stafylakis, P. Ouellet, V. Gupta, and M. J. Alam, Deep neural networks for extracting Baum-Welch statistics for speaker recognition, (Odyssey 2014-the Speaker and Language Recognition Workshop, Joensuu, Finland), 2014, pp. 293-298.
- Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, A novel scheme for speaker recognition using a phonetically-aware deep neural network, (IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy), 2014, pp. 1695-1699.
- D. Garcia-Romero and A. McCree, Insights into deep neural networks for speaker recognition, (Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany), 2015, pp. 1141-1145.
- M. McLaren, Y. Lei, and L. Ferrer, Advances in deep neural network approaches to speaker recognition, (IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, Australia), 2015, pp. 4814-4818.
- F. Richardson, D. Reynolds, and N. Dehak, Deep neural network approaches to speaker and language recognition, IEEE Signal Process. Lett. 22 (2015), no. 10, 1671-1675. https://doi.org/10.1109/LSP.2015.2420092
- E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, (IEEE International Conference on Acoustics, speech and signal processing, Florence, Italy), 2014, pp. 4052-4056.
- G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, End-to-end text-dependent speaker verification, (IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China), 2016, pp. 5115-5119.
- D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, Deep neural network embeddings for text-independent speaker verification, (Proc. Interspeech, Stockholm, Sweden), 2017, pp. 999-1003.
- D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, X-vectors: robust DNN embeddings for speaker recognition, (IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada), 2018, pp. 5329-5333.
- D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, Deep neural network-based speaker embeddings for end-to-end speaker verification, (IEEE Spoken Language Technology Workshop, San Diego, CA, USA), 2016, pp. 165-170.
- L. Li, Z. Tang, Y. Shi, and D. Wang, Gaussian-constrained training for speaker verification, (IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK), 2019, pp. 6036-6040.
- X. Wang, L. Li, and D. Wang, VAE-based domain adaptation for speaker verification, (Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Lanzhou, China), 2019, pp. 535-539.
- Y. Zhang, L. Li, and D. Wang, VAE-based regularization for deep speaker embedding, (Proc. Interspeech, Graz, Austria), 2019, pp. 4020-4024.
- K. S. R. Murty and B. Yegnanarayana, Combining evidence from residual phase and MFCC features for speaker recognition, IEEE Signal Process. Lett. 13 (2006), no. 1, 52-55.
- D. P. Kingma and M. Welling, Auto-encoding variational Bayes, arXiv preprint, 2013. https://doi.org/10.48550/arXiv.1312.6114
- D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint, 2014. https://doi.org/10.48550/arXiv.1401.4082
- K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, DRAW: a recurrent neural network for image generation, arXiv preprint, 2015. https://doi.org/10.48550/arXiv.1502.04623
- T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, Deep convolutional inverse graphics network, (Advances in Neural Information Processing Systems, Montreal, Quebec, Canada), 2015, pp. 2539-2547.
- Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, Variational autoencoder for deep learning of images, labels and captions, (Advances in Neural Information Processing Systems, Barcelona, Spain), 2016, pp. 2352-2360.
- S. Mohamed and D. J. Rezende, Variational information maximisation for intrinsically motivated reinforcement learning, (Advances in Neural Information Processing Systems, Montreal, Quebec, Canada), 2015, pp. 2125-2133.
- C. Doersch, Tutorial on variational autoencoders, arXiv preprint, 2016. https://doi.org/10.48550/arXiv.1606.05908
- M. Blaauw and J. Bonada, Modeling and transforming speech using variational autoencoders, (Proc. Interspeech, San Francisco, CA, USA), 2016, pp. 1770-1774.
- N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan, Deep unsupervised clustering with Gaussian mixture variational autoencoders, arXiv preprint, 2016. https://doi.org/10.48550/arXiv.1611.02648
- S. Tan and K. C. Sim, Learning utterance-level normalisation using variational autoencoders for robust automatic speech recognition, (IEEE Spoken Language Technology Workshop, San Diego, CA, USA), 2016, pp. 43-49.
- L. Mary and B. Yegnanarayana, Prosodic features for speaker verification, (Ninth International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA), 2006, pp. 917-920.
- K. Bartkova, D. L. Gac, D. Charlet, and D. Jouvet, Prosodic parameter for speaker identification, (Proc. Seventh International Conference on Spoken Language Processing (ICSLP), Denver, CO, USA), 2002, pp. 1197-1200.
- Y. Xu, Consistency of tone-syllable alignment across different syllable structures and speaking rates, Phonetica 55 (1998), no. 4, 179-203. https://doi.org/10.1159/000028432
- M. Atterer and D. R. Ladd, On the phonetics and phonology of "segmental anchoring" of F0: evidence from German, J. Phon. 32 (2004), no. 2, 177-197. https://doi.org/10.1016/S0095-4470(03)00039-1
- S. B. Alex, L. Mary, and B. P. Babu, Attention and feature selection for automatic speech emotion recognition using utterance and syllable-level prosodic features, Circ. Syst. Signal Process. 39 (2020), 5681-5709. https://doi.org/10.1007/s00034-020-01429-3
- L. Mary, A. P. Antony, B. P. Babu, and S. R. M. Prasanna, Automatic syllabification of speech signal using short time energy and vowel onset points, Int. J. Speech Technol. 21 (2018), no. 3, 571-579. https://doi.org/10.1007/s10772-018-9517-6
- K. Sonmez, E. Shriberg, L. Heck, and M. Weintraub, Modeling dynamic prosodic variation for speaker verification, (Proc. Fifth International Conference on Spoken Language Processing (ICSLP), Sydney, Australia), 1998, pp. 3189-3192.
- K. Sjölander, The Snack sound toolkit, 2004. http://www.speech.kth.se/snack
- C.-Y. Lin and H.-C. Wang, Language identification using pitch contour information, (IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, PA, USA), 2005, pp. I-601.
- C. Gussenhoven, B. H. Repp, A. Rietveld, H. H. Rump, and J. Terken, The perceptual prominence of fundamental frequency peaks, J. Acous. Soc. Am. 102 (1997), no. 5, 3009-3022. https://doi.org/10.1121/1.420355
- A. G. Adami, Modeling prosodic differences for speaker recognition, Speech Commun. 49 (2007), no. 4, 277-291. https://doi.org/10.1016/j.specom.2007.02.005
- S. J. Park, G. Yeung, J. Kreiman, P. A. Keating, and A. Alwan, Using voice quality features to improve short-utterance, text-independent speaker verification systems, (Proc. Interspeech, Stockholm, Sweden), 2017, pp. 1522-1526.
- M. Farrus, J. Hernando, and P. Ejarque, Jitter and shimmer measurements for speaker recognition, (Eighth Annual Conference of the International Speech Communication Association, Antwerp, Belgium), 2007, pp. 778-781.
- G. Deekshitha, K. R. Sreelakshmi, B. P. Babu, and L. Mary, Development of spoken story database in Malayalam language, (4th International Conference on Electrical Energy Systems, Chennai, India), 2018, pp. 530-533.
- The NIST year 2010 speaker recognition evaluation plan, 2010. http://www.itl.nist.gov/iad/mig/tests/sre/2010/
- The NIST year 2003 speaker recognition evaluation plan, 2003. http://www.nist.gov/speech/tests/spk/2003/
- F. Chollet, Keras, 2015. https://keras.io
- M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, and M. Kudlur, TensorFlow: a system for large-scale machine learning, (12th USENIX Symposium on Operating Systems Design and Implementation, Savannah, GA, USA), 2016, pp. 265-283.
- Y. Liu, L. He, J. Liu, and M. T. Johnson, Speaker embedding extraction with phonetic information, arXiv Preprint, 2018. https://doi.org/10.48550/arXiv.1804.04862
- T. Pekhovsky and M. Korenevsky, Investigation of using VAE for i-vector speaker verification, arXiv preprint, 2017. https://doi.org/10.48550/arXiv.1705.09185
- S. Thomas, S. H. Mallidi, S. Ganapathy, and H. Hermansky, Adaptation transforms of auto-associative neural networks as features for speaker verification, (Odyssey 2012-the Speaker and Language Recognition Workshop, Singapore), 2012, pp. 98-104.
- M. Diez, A. Varona, M. Penagarikano, L. J. Rodriguez-Fuentes, and G. Bordel, Using phone log-likelihood ratios as features for speaker recognition, (Proc. Interspeech, Lyon, France), 2013. https://doi.org/10.21437/Interspeech.2013-419
- M. Kockmann, L. Ferrer, L. Burget, and J. Cernocky, iVector fusion of prosodic and cepstral features for speaker verification, (Twelfth Annual Conference of the International Speech Communication Association, Florence, Italy), 2011, pp. 265-268.
- N. Scheffer, L. Ferrer, M. Graciarena, S. Kajarekar, E. Shriberg, and A. Stolcke, The SRI NIST 2010 speaker recognition evaluation system, (IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic), 2011, pp. 5292-5295.