
Automatic Speech Style Recognition Through Sentence Sequencing for Speaker Recognition in Bilateral Dialogue Situations


  • Kang, Garam (Department of Big Data Analytics, Kyung Hee University)
  • Kwon, Ohbyung (School of Management, Kyung Hee University)
  • Received : 2021.01.15
  • Accepted : 2021.06.08
  • Published : 2021.06.30

Abstract

Speaker recognition is generally divided into speaker identification and speaker verification. It plays an important role in automatic voice systems, and its importance has grown as portable devices, voice technology, and audio content continue to expand. Previous speaker recognition studies have aimed to determine automatically who a speaker is from voice files and to improve the accuracy of that determination. Speech style is an important sociolinguistic subject that is closely related to the speaker's social environment; it carries useful information revealing the speaker's attitude, conversational intention, and personality, which can serve as an important clue for speaker recognition. In particular, the sentence-final ending used in a speaker's utterance determines the sentence type and conveys information such as the speaker's intention, psychological attitude, and relationship to the listener. Because the use of final endings varies with the characteristics of the speaker, the types and distribution of the final endings produced by an unidentified speaker can help identify that speaker. However, few existing text-based speaker recognition studies have considered speech style, and adding speech-style information to signal-based speaker recognition techniques could further improve their accuracy. Hence, the purpose of this paper is to propose a novel method that uses speech style, expressed through sentence-final endings, to improve the accuracy of Korean speaker recognition. To this end, we propose a method called sentence sequencing, which generates vector values from the types and frequencies of the sentence-final endings appearing in a specific person's utterances. To evaluate the proposed method, training and performance evaluation were conducted on an actual drama script. The proposed method can serve as a means of improving the performance of existing Korean speech recognition services, and we expect it to be applied to intelligent dialogue systems and various voice-based services.
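To make the sentence-sequencing idea above concrete, the sketch below builds a normalized frequency vector over sentence-final ending types from one speaker's lines. This is a minimal illustration, not the paper's implementation: the ending inventory, the naive suffix matching (standing in for proper morphological analysis), and all function names are assumptions made for this example.

```python
from collections import Counter
from typing import List

# Illustrative inventory of Korean sentence-final endings (jongkyeol eomi).
# The paper's actual inventory is not given in the abstract; these entries
# are common endings used here only as examples.
ENDING_TYPES = ["습니다", "어요", "아요", "지요", "거든", "네요", "네", "니", "다"]

def final_ending(sentence: str) -> str:
    """Return the inventory ending the sentence ends with, or 'OTHER'.

    Naive suffix matching stands in for proper morphological analysis.
    """
    s = sentence.strip().rstrip(".!?…")
    for ending in sorted(ENDING_TYPES, key=len, reverse=True):  # longest match first
        if s.endswith(ending):
            return ending
    return "OTHER"

def sentence_sequencing_vector(utterances: List[str]) -> List[float]:
    """Map one speaker's utterances to a normalized ending-frequency vector."""
    counts = Counter(final_ending(u) for u in utterances)
    total = sum(counts.values()) or 1
    return [counts[e] / total for e in ENDING_TYPES + ["OTHER"]]

# Usage: style vectors from labeled lines (e.g., a drama script) per speaker.
lines_speaker_a = ["저는 잘 모르겠습니다.", "먼저 가 보겠습니다."]
lines_speaker_b = ["이제 가 봐야지요.", "날씨가 참 좋네요."]
print(sentence_sequencing_vector(lines_speaker_a))
print(sentence_sequencing_vector(lines_speaker_b))
```

A vector produced this way could then be fed to any standard classifier, or concatenated with a signal-based speaker embedding such as an x-vector, which is the kind of combination the abstract suggests.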



Acknowledgement

This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2018S1A5A2A03036394).
