Speech detection from broadcast contents using multi-scale time-dilated convolutional neural networks

  • Jang, Byeong-Yong (School of Electronics Engineering, Chungbuk National University)
  • Kwon, Oh-Wook (School of Electronics Engineering, Chungbuk National University)
  • Received : 2019.09.10
  • Accepted : 2019.11.08
  • Published : 2019.12.31

Abstract

In this paper, we propose a deep learning architecture that can effectively detect speech segments in broadcast contents. We also propose a multi-scale time-dilated convolutional layer for learning the temporal changes of feature vectors. To verify the performance of the proposed model, we implement several comparison models and compute the frame-level F-score, precision, and recall. The proposed model and the comparison models are all trained on the same training data: 32 hours of Korean broadcast data composed of various genres (drama, news, documentary, and so on). The proposed model shows the best performance on the Korean broadcast data, with an F-score of 91.7%; it also achieves the highest F-scores on British and Spanish broadcast data, 87.9% and 92.6% respectively. These results show that the proposed model improves speech detection performance by learning the temporal changes of the feature vectors.
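
A minimal PyTorch sketch of one plausible reading of the proposed layer is given below: parallel 1-D convolutions over the time axis with different dilation rates, concatenated along the channel axis so that each frame sees temporal context at several scales. The paper's actual configuration is not reproduced here; the feature dimension, channel counts, kernel size, and dilation rates in this sketch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleTimeDilatedConv(nn.Module):
    """Parallel time-dilated 1-D convolutions whose outputs are
    concatenated along the channel axis. All sizes below are
    illustrative assumptions, not the paper's actual values."""

    def __init__(self, in_channels=40, out_channels=32,
                 kernel_size=3, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_channels, out_channels, kernel_size,
                      dilation=d,
                      padding=d * (kernel_size - 1) // 2)  # keep num_frames
            for d in dilations
        ])

    def forward(self, x):
        # x: (batch, feature_dim, num_frames)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: a batch of 8 utterances, 40-dim frame features, 500 frames.
x = torch.randn(8, 40, 500)
y = MultiScaleTimeDilatedConv()(x)  # shape: (8, 32 * 4, 500)
```

Larger dilation rates widen the receptive field along the time axis without adding parameters, which is how such a layer can capture the temporal changes of feature vectors at several scales at once.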

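The frame-level precision, recall, and F-score reported above follow the standard definitions over per-frame binary speech/non-speech decisions. A self-contained sketch of those definitions (not code from the paper; the example labels are made up) is:

```python
import numpy as np

def frame_metrics(ref, hyp):
    """Frame-level precision, recall, and F-score for binary
    speech (1) / non-speech (0) labels."""
    ref = np.asarray(ref, dtype=bool)
    hyp = np.asarray(hyp, dtype=bool)
    tp = np.sum(ref & hyp)    # speech frames correctly detected
    fp = np.sum(~ref & hyp)   # non-speech frames labeled as speech
    fn = np.sum(ref & ~hyp)   # speech frames missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# 10 frames of reference and hypothesis labels (made-up example).
ref = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
hyp = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
print(frame_metrics(ref, hyp))  # (0.8, 0.8, 0.8)
```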
