Acknowledgement
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (Ministry of Science and ICT) in 2019 (2019-0-00004, Development of semi-supervised learning language intelligence technology and a Korean tutoring service for foreigners based on it).
References
- T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," Proc. Interspeech, 3586-3589 (2015).
- D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv:1904.08779 (2019).
- X. Song, Z. Wu, Y. Huang, D. Su, and H. Meng, "SpecSwap: A simple data augmentation method for end-to-end speech recognition," Proc. Interspeech, 581-585 (2020).
- D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," Proc. Speech and Natural Language Workshop, 357-362 (1992).
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," Proc. ICASSP, 5206-5210 (2015).
- W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," arXiv:1508.01211 (2015).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NIPS, 5998-6008 (2017).
- A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," Proc. ICML, 369-376 (2006).
- S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," Proc. ICASSP, 4835-4839 (2017).
- SoX, Audio Manipulation Tool, http://sox.sourceforge.net/ (Last viewed March 25, 2015).
- S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," arXiv:1804.00015 (2018).