• Title/Summary/Keyword: Time-domain based DNN structure


Time-domain Sound Event Detection Algorithm Using Deep Neural Network (심층신경망을 이용한 시간 영역 음향 이벤트 검출 알고리즘)

  • Kim, Bum-Jun; Moon, Hyeongi; Park, Sung-Wook; Jeong, Youngho; Park, Young-Cheol
    • Journal of Broadcast Engineering / v.24 no.3 / pp.472-484 / 2019
  • This paper proposes a time-domain sound event detection algorithm using a Deep Neural Network (DNN). In this system, the time-domain sound waveform, without conversion to the frequency domain, is used directly as the input to the DNN. The overall network follows a CRNN structure to which GLU, ResNet, and squeeze-and-excitation blocks are applied, and the proposed structure additionally combines features extracted from several layers. Under the assumption that training data with strong labels is practically difficult to obtain, training was conducted with a small amount of weakly labeled data and a large amount of unlabeled data. To use the small amount of labeled data efficiently, data augmentation methods such as time stretching, pitch change, Dynamic Range Compression (DRC), and block mixing were applied, and the unlabeled data supplemented the insufficient training set through pseudo-labeling. With the proposed network and data augmentation methods, sound event detection performance improves by about 6 % in f-score compared with a conventionally trained CRNN. The block structure and the augmentation methods are sketched below.
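The abstract names the building blocks but not their configuration, so the following is a minimal PyTorch sketch of a waveform-input CRNN combining a GLU-activated 1-D convolution, a squeeze-and-excitation gate, and a ResNet-style skip connection. All channel counts, kernel sizes, pooling factors, and the number of event classes are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class GLUConvSEBlock(nn.Module):
    """Illustrative 1-D conv block: GLU activation, squeeze-and-excitation, residual skip."""
    def __init__(self, in_ch, out_ch, kernel_size=3, se_ratio=8):
        super().__init__()
        # Conv produces 2*out_ch channels; GLU halves them back to out_ch.
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(2 * out_ch)
        self.glu = nn.GLU(dim=1)
        # Squeeze-and-excitation: global average pool, bottleneck MLP, channel-wise gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(out_ch, out_ch // se_ratio, 1),
            nn.ReLU(),
            nn.Conv1d(out_ch // se_ratio, out_ch, 1),
            nn.Sigmoid(),
        )
        # 1x1 projection so the ResNet-style skip matches channel counts.
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                      # x: (batch, channels, time)
        y = self.glu(self.bn(self.conv(x)))
        y = y * self.se(y)                     # channel re-weighting
        return y + self.skip(x)                # residual connection


class TimeDomainCRNN(nn.Module):
    """Waveform-in CRNN: stacked conv blocks, a GRU, then per-frame event scores.
    A real raw-audio system would need far more temporal downsampling; this only
    shows the block layout."""
    def __init__(self, n_events=10):
        super().__init__()
        self.blocks = nn.Sequential(
            GLUConvSEBlock(1, 64),
            nn.MaxPool1d(4),
            GLUConvSEBlock(64, 128),
            nn.MaxPool1d(4),
        )
        self.rnn = nn.GRU(128, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(128, n_events)

    def forward(self, wav):                    # wav: (batch, 1, samples)
        h = self.blocks(wav).transpose(1, 2)   # -> (batch, frames, channels)
        h, _ = self.rnn(h)
        return torch.sigmoid(self.head(h))     # frame-level event probabilities
```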
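Likewise, a hedged sketch of the listed augmentation methods (time stretching, pitch change, DRC, block mixing), assuming librosa for the resampling-based transforms. The parameter ranges, the power-law curve standing in for DRC, and the 50/50 block-mixing ratio are placeholders rather than the paper's choices.

```python
import numpy as np
import librosa

def augment_waveform(wav, other, sr, rng=None):
    """Apply one randomly chosen augmentation to `wav`.

    `other` is a second waveform used for block mixing. Parameter ranges and
    the DRC stand-in are illustrative placeholders, not the paper's settings.
    """
    rng = rng or np.random.default_rng()
    choice = rng.integers(4)
    if choice == 0:                                     # time stretching
        return librosa.effects.time_stretch(y=wav, rate=rng.uniform(0.9, 1.1))
    if choice == 1:                                     # pitch change
        return librosa.effects.pitch_shift(y=wav, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    if choice == 2:                                     # simple stand-in for DRC
        return np.sign(wav) * np.abs(wav) ** 0.7        # power-law amplitude compression
    # block mixing: blend a random block of `other` into `wav`
    n = min(len(wav), len(other))
    blk = n // 4
    start = rng.integers(0, n - blk + 1)
    out = wav.copy()
    out[start:start + blk] = 0.5 * out[start:start + blk] + 0.5 * other[start:start + blk]
    return out
```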

Performance comparison evaluation of real and complex networks for deep neural network-based speech enhancement in the frequency domain (주파수 영역 심층 신경망 기반 음성 향상을 위한 실수 네트워크와 복소 네트워크 성능 비교 평가)

  • Hwang, Seo-Rim; Park, Sung Wook; Park, Youngcheol
    • The Journal of the Acoustical Society of Korea / v.41 no.1 / pp.30-37 / 2022
  • This paper compares and evaluates the performance of Deep Neural Network (DNN)-based speech enhancement models in the frequency domain from two perspectives: the learning target and the network structure. Spectrum mapping and Time-Frequency (T-F) masking were used as learning targets, and a real-valued network and a complex-valued network were used as the network structures. Model performance was evaluated with two objective metrics, Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI), as a function of dataset size. Test results show that the appropriate amount of training data differs with the network type and the dataset type. They also show that, when the total number of parameters is considered, a real-valued network can in some cases be the more practical choice, because it achieves relatively higher performance than the complex network depending on the data size and the learning target. The two learning targets are sketched below.
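As a rough illustration of the two learning targets being compared, the sketch below derives a spectrum-mapping target and a ratio-mask (T-F masking) target from a noisy/clean pair using torch.stft. The FFT size, hop length, and mask definition are assumptions, since the paper's exact formulation is not given in the abstract.

```python
import torch

def training_targets(noisy_wav, clean_wav, n_fft=512, hop=128):
    """Compute two frequency-domain learning targets from time-aligned waveforms:
    spectrum mapping (the clean spectrum itself) and T-F masking (a ratio mask).
    Window/FFT sizes are illustrative, not the paper's configuration."""
    win = torch.hann_window(n_fft)
    noisy = torch.stft(noisy_wav, n_fft, hop, window=win, return_complex=True)
    clean = torch.stft(clean_wav, n_fft, hop, window=win, return_complex=True)

    mapping_target = clean                   # network regresses the clean spectrum directly
    mask_target = clean / (noisy + 1e-8)     # complex ratio mask applied to the noisy spectrum
    # A real-valued network would operate on magnitudes (noisy.abs()) or on stacked
    # real/imaginary parts; a complex-valued network consumes the complex tensors directly.
    return noisy, mapping_target, mask_target
```

At inference, a masking model multiplies its predicted mask with the noisy spectrum and inverts it with torch.istft, whereas a mapping model inverts its predicted spectrum directly; this is the target-side contrast the paper evaluates alongside the real-versus-complex network structures.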