References
- Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. Retrieved from https://arxiv.org/abs/1701.07875
- Chen, J., Tan, X., Luan, J., Qin, T., & Liu, T. Y. (2020). HiFiSinger: Towards high-fidelity neural singing voice synthesis. Retrieved from https://arxiv.org/abs/2009.01776
- Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243. https://doi.org/10.1109/TASSP.1984.1164317
- Hsu, P., Wang, C., Liu, A. T., & Lee, H. (2020). Towards robust neural vocoding for speech generation: A survey. Retrieved from https://arxiv.org/abs/1912.02461
- Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brébisson, A., ... Courville, A. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. Retrieved from https://arxiv.org/abs/1910.06711
- Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M., & Zhou, M. (2019). Neural speech synthesis with transformer network. Retrieved from https://arxiv.org/abs/1809.08895
- Perraudin, N., Balazs, P., & Søndergaard, P. L. (2013, October). A fast Griffin-Lim algorithm. Proceedings of the 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 1-4). New Paltz, NY.
- Prenger, R., Valle, R., & Catanzaro, B. (2018). WaveGlow: A flow-based generative network for speech synthesis. Retrieved from https://arxiv.org/abs/1811.00002
- Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019, December). FastSpeech: Fast, robust and controllable text to speech. Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (pp. 3156-3164). Vancouver, BC.
- Sharma, A., Kumar, P., Maddukuri, V., Madamshetti, N., Kishore, K. G., Kavuru, S. S. S., Raman, B., ... Roy, P. P. (2020). Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis. Multimedia Tools and Applications, 79(41), 30205-30233. https://doi.org/10.1007/s11042-020-09321-7
- Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., ... Wu, Y. (2018, April). Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4779-4783). Calgary, AB.
- Song, W., Xu, G., Zhang, Z., Zhang, C., He, X., & Zhou, B. (2020, October). Efficient WaveGlow: An improved WaveGlow vocoder with enhanced speed. Proceedings of the 21st Annual Conference of the International Speech Communication Association (pp. 225-229). Shanghai, China.
- Tachibana, H., Uenoyama, K., & Aihara, S. (2018, April). Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4784-4788). Calgary, AB.
- van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., ... Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from https://arxiv.org/abs/1609.03499
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., ... Polosukhin, I. (2017). Attention is all you need. Retrieved from https://arxiv.org/abs/1706.03762
- Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., ... Saurous, R. A. (2017, August). Tacotron: Towards end-to-end speech synthesis. Proceedings of the 18th Annual Conference of the International Speech Communication Association (pp. 4006-4010). Stockholm, Sweden.
- Zhu, X., Beauregard, G. T., & Wyse, L. (2006, July). Real-time iterative spectrum inversion with look-ahead. Proceedings of the 2006 IEEE International Conference on Multimedia and Expo (pp. 229-232). Toronto, ON.