References
- G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, 14, 1771-1800 (2002). https://doi.org/10.1162/089976602760128018
- G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, 18, 1527-1554 (2006). https://doi.org/10.1162/neco.2006.18.7.1527
- A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," Proc. Advances in Neural Information Processing Systems, 1097-1105 (2012).
- G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, 29, 82-97 (2012).
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Proc. Advances in Neural Information Processing Systems, 2672-2680 (2014).
- D. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114 (2013).
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
- D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, 323, 533-536 (1986). https://doi.org/10.1038/323533a0
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," Proc. Advances in Neural Information Processing Systems, 8026-8037 (2019).
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv:1603.04467 (2016).
- D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," Proc. IEEE Automatic Speech Recognition and Understanding Workshop (2011).
- S. Smith and Q. Le, "A Bayesian perspective on generalization and stochastic gradient descent," Proc. Int. Conf. on Learning Representations (2018).
- T. Zhang, "Solving large scale linear prediction problems using stochastic gradient descent algorithms," Proc. Int. Conf. on Machine Learning (2004).
- F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "On parallelizability of stochastic gradient descent for speech DNNs," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 235-239 (2014).
- M. Zinkevich, M. Weimer, A. Smola, and L. Li, "Parallelized stochastic gradient descent," Proc. Advances in Neural Information Processing Systems, 2595-2603 (2010).
- L. Valiant, "A bridging model for parallel computation," Communications of the ACM, 33, 103-111 (1990). https://doi.org/10.1145/79173.79181
- H. Su, H. Chen, and H. Xu, "Experiments on parallel training of deep neural network using model averaging," arXiv:1507.01239 (2015).
- J. Hermans, "On scalable deep learning and parallelizing gradient descent," Master's thesis, Maastricht University (2017).
- S. Zhang, A. Choromanska, and Y. LeCun, "Deep learning with elastic averaging SGD," Proc. Advances in Neural Information Processing Systems, 685-693 (2015).
- K. Chen and Q. Huo, "Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 5880-5884 (2016).
- J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," Proc. Advances in Neural Information Processing Systems, 1223-1231 (2012).
- J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz, "Revisiting distributed synchronous SGD," arXiv:1604.00981 (2016).
- F. Niu, B. Recht, C. Re, and S. Wright, "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," Proc. Advances in Neural Information Processing Systems, 693-701 (2011).
- S. Sallinen, N. Satish, M. Smelyanskiy, S. Sury, and C. Re, "High performance parallel stochastic gradient descent in shared memory," Proc. IEEE International Parallel and Distributed Processing Symposium, 873-882 (2016).
- S. Zhao and W. Li, "Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee," Proc. AAAI Conference on Artificial Intelligence, 2379-2385 (2016).
- X. Lian, W. Zhang, C. Zhang, and J. Liu, "Asynchronous decentralized parallel stochastic gradient descent," Proc. Int. Conf. on Machine Learning, 3049-3058 (2018).
- I. Mitliagkas, C. Zhang, S. Hadjis, and C. Re, "Asynchrony begets momentum, with an application to deep learning," Proc. Annual Allerton Conference on Communication, Control, and Computing, 997-1004 (2016).
- S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z. Ma, and T. Liu, "Asynchronous stochastic gradient descent with delay compensation," Proc. Int. Conf. on Machine Learning, 4120-4129 (2017).
- O. Yadan, K. Adams, Y. Taigman, and M. Ranzato, "Multi-GPU training of ConvNets," arXiv:1312.5853 (2013).
- A. Petrowski, G. Dreyfus, and C. Girault, "Performance analysis of a pipelined backpropagation parallel algorithm," IEEE Trans. on Neural Networks, 4, 970-981 (1993). https://doi.org/10.1109/72.286892
- X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, "Pipelined back-propagation for context-dependent deep neural networks," Proc. Interspeech, 26-29 (2012).
- Z. Huo, B. Gu, Q. Yang, and H. Huang, "Decoupled parallel backpropagation with convergence guarantee," Proc. Int. Conf. on Machine Learning, 2098-2106 (2018).
- Z. Huo, B. Gu, and H. Huang, "Training neural networks using features replay," Proc. Advances in Neural Information Processing Systems, 6659-6668 (2018).
- M. Jaderberg, W. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu, "Decoupled neural interfaces using synthetic gradients," Proc. Int. Conf. on Machine Learning, 1627-1635 (2017).
- C. Chen, C. Yang, and H. Cheng, "Efficient and robust parallel DNN training through model parallelism on multi-GPU platform," arXiv:1809.02839 (2018).
- Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. Chen, D. Chen, H. Lee, J. Ngiam, Q. Le, Y. Wu, and Z. Chen, "GPipe: Efficient training of giant neural networks using pipeline parallelism," Proc. Advances in Neural Information Processing Systems, 103-112 (2019).
- H. Lee, K. Lee, I. Yoo, and D. Yook, "Analysis of parallel training algorithms for deep neural networks," Proc. Annual Conference on Computational Science and Computational Intelligence, 1462-1463 (2018).
- D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, P. Gibbons, and M. Zaharia, "PipeDream: Generalized pipeline parallelism for DNN training," Proc. ACM Symposium on Operating Systems Principles, 1-15 (2019).
- S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, D. Snyder, A. Subramanian, J. Trmal, B. Ben Yair, C. Boeddeker, Z. Ni, Y. Fujita, S. Horiguchi, N. Kanda, and T. Yoshioka, "CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings," Proc. Int. Workshop on Speech Processing in Everyday Environments (2020).