References
- J. S. Downie, "Music information retrieval," Annual Review of Information Science and Technology, vol. 37, pp. 295-340, 2003.
- R. Typke, F. Wiering and R. Veltkamp, "A survey of music information retrieval systems," in Proc. of ISMIR, pp. 153-160, 2005.
- D. Jang, C.-J. Song, S. Shin, S.-J. Park, S.-J. Jang and S.-P. Lee, "Implementation of a matching engine for a practical query-by-singing/humming system," in Proc. of ISSPIT, pp. 258-263, 2011.
- J. S. R. Jang and H. R. Lee, "A general framework of progressive filtering and its application to query by singing/humming," IEEE Trans. on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 350-358, Feb. 2008. https://doi.org/10.1109/TASL.2007.913035
- S. W. Hainsworth and M. D. Macleod, "Particle filtering applied to musical tempo tracking," EURASIP J. Applied Signal Processing, vol. 15, pp. 2385-2395, 2004.
- D. P. W. Ellis and G. E. Poliner, "Identifying cover songs with chroma features and dynamic programming beat tracking," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, HI, 2007.
- G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, 2002. https://doi.org/10.1109/TSA.2002.800560
- D. Jang, M. Jin and C. D. Yoo, "Music genre classification using novel features and a weighted voting method," in Proc. of ICME, 2008.
- T. L. H. Li, A. B. Chan, and A. H. W. Chun, "Automatic Musical Pattern Feature Extraction Using Convolutional Neural Network," in Proc. of IMECS, 2010.
- X. Hu and J. S. Downie, "Improving mood classification in music digital libraries by combining lyrics and audio," in Proc. of the 10th Annual Joint Conference on Digital Libraries (JCDL), pp. 159-168, 2010.
- J. H. Kim, S. Lee, S. M. Kim and W. Y. Yoo, "Music mood classification model based on Arousal-Valence values," in Proc. of ICACT, pp. 292-295, 2011.
- D. Jang, C. D. Yoo, S. Lee, S. Kim and T. Kalker, "Pairwise Boosted Audio Fingerprint," IEEE Trans. Information Forensics and Security, vol. 4, no. 4, pp. 995-1004, Dec. 2009. https://doi.org/10.1109/TIFS.2009.2034452
- J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system," in Proc. of ISMIR, 2002.
- S. Durand, J. P. Bello, B. David, and G. Richard, "Feature Adapted Convolutional Neural Networks for Downbeat Tracking," in Proc. of ICASSP, 2016.
- K. Choi, G. Fazekas, and M. Sandler, "Automatic tagging using deep convolutional neural networks," in Proc. of ISMIR, 2016.
- S. Jo and C. D. Yoo, "Melody extraction from polyphonic audio based on particle filter," in Proc. of ISMIR, pp. 357-362, 2010.
- D. P. W. Ellis and G. E. Poliner, "Classification-based melody transcription," Machine Learning, Vol. 65, pp. 439-456, 2006. https://doi.org/10.1007/s10994-006-8373-9
- J. Salamon, E. Gomez, D. P. W. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications, and challenges," IEEE Signal Processing Magazine, 2014.
- K. Dressler, "An auditory streaming approach for melody extraction from polyphonic music," in Proc. of 12th ISMIR, Miami, FL, pp. 19-24, Oct. 2011.
- V. Rao and P. Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music," IEEE Trans. Audio, Speech, Lang. Processing, vol. 18, no. 8, pp. 2145-2154, Nov. 2010. https://doi.org/10.1109/TASL.2010.2042124
- S. Jo, S. Joo and C. D. Yoo, "Melody pitch estimation based on range estimation and candidate extraction using harmonic structure model," in Proc. of InterSpeech, Makuhari, Japan, pp. 2902-2905, Sept. 2010.
- V. Arora and L. Behera, "On-line melody extraction from polyphonic audio using harmonic cluster tracking," IEEE Trans. Audio, Speech, Lang. Processing, vol. 21, no. 3, pp. 520-530, Mar. 2013. https://doi.org/10.1109/TASL.2012.2227731
- C. Hsu and J. S. R. Jang, "Singing pitch extraction by voice vibrato/tremolo estimation and instrument partial deletion," in Proc. of 11th ISMIR, Utrecht, The Netherlands, pp. 525-530, Aug. 2010. http://ismir2010.ismir.net/proceedings/ismir2010-89.pdf
- T.-C. Yeh, M.-J. Wu, J.-S. Jang, W.-L. Chang and I.-B. Liao, "A hybrid approach to singing pitch extraction based on trend estimation and hidden Markov models," in Proc. of IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, pp. 457-460, Mar. 2012.
- S. Kum, C. Oh, and J. Nam, "Melody Extraction on Vocal Segments using Multi-Column Deep Neural Networks," in Proc. of ISMIR, 2016.
- E. J. Humphrey, J. P. Bello, and Y. LeCun, "Moving Beyond Feature Design: Deep Architectures and Automatic Feature Learning in Music Informatics," in Proc. of ISMIR, 2012.
- Music Information Retrieval Evaluation eXchange (MIREX) [Online]. Available: http://www.music-ir.org/mirex/wiki/MIREX_HOME
- R. B. Palm, "Prediction as a candidate for learning deep hierarchical models of data," M.S. thesis, Technical University of Denmark, 2012.
- A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. of NIPS, 2012.
- D. C. Ciresan, U. Meier and L. M. Gambardella, "Convolutional neural network committees for handwritten character classification," in Proc. of International Conference on Document Analysis and Recognition (ICDAR), pp. 1250-1254, 2011.
- C. Zhang and Z. Zhang, "Improving multiview face detection with multi-task deep convolutional neural networks," in Proc. of IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1036-1041, 2014.
- J. Zbontar and Y. LeCun, "Computing the stereo matching cost with a convolutional neural network," in Proc. of CVPR, 2015.
- 2016: Audio Melody Extraction [Online]. Available: http://www.music-ir.org/mirex/wiki/2016:Audio_Melody_Extraction
- 2015: MIREX2015 Results [Online]. Available: http://www.music-ir.org/mirex/wiki/2015:MIREX2015_Results
- D. Hermes, "Measurement of pitch by subharmonic summation," Journal of the Acoustical Society of America, vol. 83, pp. 257-264, 1988. https://doi.org/10.1121/1.396427