Prediction of Closed Quotient During Vocal Phonation using GRU-type Neural Network with Audio Signals

  • Hyeonbin Han (Department of Mathematical Data Science, Hanyang University ERICA) ;
  • Keun Young Lee (Independent scholar) ;
  • Seong-Yoon Shin (School of Computer Science and Engineering, Kunsan National University) ;
  • Yoseup Kim (Digital Healthcare Research Center, Deltoid Inc.) ;
  • Gwanghyun Jo (Department of Mathematical Data Science, Hanyang University ERICA) ;
  • Jihoon Park (Division of Vocal Music, Nicedream Music Academy) ;
  • Young-Min Kim (Digital Health Research Divisions, Korea Institute of Oriental Medicine)
  • Received : 2024.03.30
  • Accepted : 2024.06.07
  • Published : 2024.06.30

Abstract

Closed quotient (CQ) represents the fraction of each glottal cycle during which the vocal folds remain in contact during voice production. Because CQ values serve as an important reference in vocal training for professional singers, they have traditionally been measured mechanically or electrically, either by inverse filtering of airflow captured with a circumferentially vented mask or by post-processing of electroglottography waveforms. In this study, we introduce a novel algorithm that predicts CQ values from audio signals alone, eliminating the need for mechanical or electrical measurement. The algorithm is based on a gated recurrent unit (GRU)-type neural network. To improve efficiency, the audio signal is first pre-processed with a pitch feature extraction algorithm; GRU layers then extract temporal features, and a dense layer produces the final prediction. The mean square error between the predicted and measured CQ values, reported in the Results section, demonstrates the capability of the proposed algorithm to predict CQ.
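The abstract specifies only the overall pipeline: pitch features extracted from audio, a GRU-type network for temporal feature extraction, a dense layer for the scalar prediction, and an MSE criterion. The sketch below is a minimal illustration of that shape, assuming PyTorch; the layer sizes, input dimensionality, and the name CQPredictor are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of a GRU-based CQ predictor (assumption: PyTorch;
# hidden sizes and feature dimension are hypothetical, chosen only
# to mirror the pipeline described in the abstract).
import torch
import torch.nn as nn

class CQPredictor(nn.Module):
    def __init__(self, n_features=1, hidden_size=64, num_layers=2):
        super().__init__()
        # GRU-type recurrent layers extract temporal features from the
        # pitch sequence pre-extracted from the audio signal.
        self.gru = nn.GRU(n_features, hidden_size,
                          num_layers=num_layers, batch_first=True)
        # A dense layer maps the final hidden state to a scalar CQ value.
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, time, n_features) sequence of pitch features
        out, _ = self.gru(x)
        return self.head(out[:, -1, :]).squeeze(-1)

model = CQPredictor()
criterion = nn.MSELoss()           # mean square error, as in the Results section
x = torch.randn(8, 100, 1)         # dummy batch: 8 sequences, 100 frames each
cq_true = torch.rand(8)            # CQ is a time ratio, so targets lie in [0, 1]
loss = criterion(model(x), cq_true)
loss.backward()
```

Training would then proceed with a standard optimizer loop over (pitch sequence, measured CQ) pairs, where the reference CQ values come from electroglottography as described above.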

Keywords

Acknowledgement

This study was supported by a grant (NRF KSN1824130) from the Korea Institute of Oriental Medicine.
