Acknowledgement
This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01376, Development of the multi-speaker conversational speech recognition technology).
References
- H. Chung, J. G. Park, and H. Jung, Rank-weighted reconstruction feature for a robust deep neural network-based acoustic model, ETRI J. 41 (2019), no. 2, 235-241. https://doi.org/10.4218/etrij.2018-0189
- A. B. Nassif et al., Speech recognition using deep neural networks: A systematic review, IEEE Access 7 (2019), 19143-19165. https://doi.org/10.1109/access.2019.2896880
- J. Li et al., On the comparison of popular end-to-end models for large scale speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 1-5.
- V. Roger, J. Farinas, and J. Pinquier, Deep neural networks for automatic speech processing: A survey from large corpora to limited data, arXiv preprint, CoRR, 2020, arXiv: 2003.04241.
- D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, in Proc. Int. Conf. Learn. Represent. (San Diego, CA, USA), May 2015.
- I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, in Proc. Int. Conf. Neural Inf. Process. Syst. (Montreal, Canada), Dec. 2014.
- L. Dong, S. Xu, and B. Xu, Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Calgary, Canada), Apr. 2018, pp. 5884-5888.
- A. Vaswani et al., Attention is all you need, in Proc. Int. Conf. Neural Inf. Process. Syst. (Long Beach, CA, USA), Dec. 2017, pp. 5998-6008.
- H. Hwang and C. Lee, Linear-time Korean morphological analysis using an action-based local monotonic attention mechanism, ETRI J. 42 (2020), no. 1, 101-107. https://doi.org/10.4218/etrij.2018-0456
- N. Moritz, T. Hori, and J. Le Roux, Streaming automatic speech recognition with the transformer model, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 6074-6078.
- N. Moritz, T. Hori, and J. Le Roux, Streaming end-to-end speech recognition with joint CTC-attention based models, in Proc. IEEE Workshop Automat. Speech Recognit. Underst. (Singapore), Dec. 2019, pp. 936-943.
- S. H. K. Parthasarathi and N. Strom, Lessons from building acoustic models with a million hours of speech, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Brighton, UK), May 2019, pp. 6670-6674.
- S. Zhou et al., Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese, in Proc. Conf. Int. Speech Commun. Assoc. (Hyderabad, India), Sept. 2018, pp. 791-795.
- S. Zhou, S. Xu, and B. Xu, Multilingual end-to-end speech recognition with a single transformer on low-resource languages, arXiv preprint, CoRR, 2018, arXiv: 1806.05059.
- A. Gulati et al., Conformer: Convolution-augmented transformer for speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5036-5040.
- S. Karita et al., A comparative study on transformer vs RNN in speech applications, in Proc. IEEE Workshop Automat. Speech Recognit. Underst. (Singapore), Dec. 2019, pp. 449-456.
- W. Huang et al., Conv-transformer transducer: Low latency, low frame rate, streamable end-to-end speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5001-5005.
- S. Karita et al., Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration, in Proc. Conf. Int. Speech Commun. Assoc. (Graz, Austria), Sept. 2019, pp. 1408-1412.
- G. I. Winata et al., Adapt-and-adjust: Overcoming the long-tail problem of multilingual speech recognition, arXiv preprint, CoRR, 2020, arXiv: 2012.01687.
- H. Miao et al., Transformer-based online CTC/attention end-to-end speech recognition architecture, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 6084-6088.
- L. Kürzinger et al., Lightweight end-to-end speech recognition from raw audio data using Sinc-convolutions, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 1659-1663.
- S. Li et al., Improving transformer-based speech recognition with unsupervised pre-training and multi-task semantic knowledge learning, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5006-5010.
- T. Hori et al., Transformer-based long-context end-to-end speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5011-5015.
- T. Moriya et al., Self-distillation for improving CTC-transformer-based ASR systems, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 546-550.
- X. Chang et al., End-to-end multi-speaker speech recognition with transformer, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 6134-6138.
- X. Zhou et al., Self-and-mixed attention decoder with deep acoustic structure for transformer-based LVCSR, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5016-5020.
- Y. Fujita et al., Insertion-based modeling for end-to-end automatic speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 3660-3664.
- Y. Higuchi et al., Improved mask-CTC for non-autoregressive end-to-end ASR, arXiv preprint, CoRR, 2020, arXiv: 2010.13270.
- Y. Higuchi et al., Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 3655-3659.
- Y. Lu et al., Bi-encoder transformer network for Mandarin-English code-switching speech recognition using mixture of experts, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 4766-4770.
- Y. Zhao et al., Cross attention with monotonic alignment for speech Transformer, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5031-5035.
- T. Parcollet, M. Morchid, and G. Linares, E2E-SincNet: Toward fully end-to-end speech recognition, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 7714-7718.
- D. Amodei et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, in Proc. Int. Conf. Mach. Learn. (New York, NY, USA), June 2016, pp. 173-182.
- H. Braun et al., GPU-accelerated Viterbi exact lattice decoder for batched online and offline speech recognition, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 7874-7878.
- Y. R. Oh, K. Park, and J. G. Park, Online speech recognition using multichannel parallel acoustic score computation and deep neural network (DNN)-based voice-activity detector, Appl. Sci. 10 (2020), no. 12, art. no. 4091. https://doi.org/10.3390/app10124091
- H. Seki et al., Vectorized beam search for CTC-attention-based speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Graz, Austria), Sept. 2019, pp. 3825-3829.
- H. Seki, T. Hori, and S. Watanabe, Vectorization of hypotheses and speech for faster beam search in encoder-decoder-based speech recognition, arXiv preprint, CoRR, 2018, arXiv: 1811.04568.
- A. Graves et al., Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks, in Proc. Int. Conf. Mach. Learn. (Pittsburgh, PA, USA), June 2006, pp. 369-376.
- H. Miao et al., Online hybrid CTC/attention end-to-end automatic speech recognition architecture, IEEE/ACM Trans. Audio, Speech, Language Process. 28 (2020), 1452-1465. https://doi.org/10.1109/taslp.2020.2987752
- T. Yoshimura et al., End-to-end automatic speech recognition integrated with CTC-based voice activity detection, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 6999-7003.
- T. Hori, S. Watanabe, and J. Hershey, Joint CTC/attention decoding for end-to-end speech recognition, in Proc. Annu. Meet. Assoc. Comput. Linguistics (Vancouver, Canada), July 2017, pp. 518-529.
- C. Meister, T. Vieira, and R. Cotterell, Best-first beam search, Trans. Assoc. Comput. Linguistics 8 (2020), 795-809. https://doi.org/10.1162/tacl_a_00346
- S. Watanabe et al., Hybrid CTC/attention architecture for end-to-end speech recognition, IEEE J. Sel. Top. Signal Process. 11 (2017), no. 8, 1240-1253. https://doi.org/10.1109/JSTSP.2017.2763455
- P. Zhou et al., Improving generalization of transformer for speech recognition with parallel schedule sampling and relative positional embedding, arXiv preprint, CoRR, 2019, arXiv: 1911.00203.
- N. Kitaev, L. Kaiser, and A. Levskaya, Reformer: The efficient transformer, in Proc. Int. Conf. Learn. Represent. (Addis Ababa, Ethiopia), Apr. 2020.
- K. Park, A robust endpoint detection algorithm for the speech recognition in noisy environments, in Proc. Congr. Expos. Noise Control Eng. (Inter-Noise) (Innsbruck, Austria), Sept. 2013, pp. 5790-5795.
- S. Watanabe et al., ESPnet: End-to-end speech processing toolkit, in Proc. Conf. Int. Speech Commun. Assoc. (Hyderabad, India), Sept. 2018, pp. 2207-2211.
- T. Xiao et al., Sharing attention weights for fast transformer, in Proc. Int. Joint Conf. Artif. Intell. (Macao, China), Aug. 2019, pp. 5292-5298.
- M. Ott et al., Scaling neural machine translation, in Proc. Conf. Mach. Translation (Brussels, Belgium), Oct. 2018, pp. 1-9.
- J. U. Bang et al., Automatic construction of a large-scale speech recognition database using multi-genre broadcast data with inaccurate subtitle timestamps, IEICE Trans. Inf. Syst. E103.D (2020), no. 2, 406-415.