A statistical journey to DNN, the third trip: Language model and transformer

  • Received : 2024.07.31
  • Accepted : 2024.08.12
  • Published : 2024.10.31

Abstract

Over the past decade, the remarkable advances in deep neural networks have gone hand in hand with the development and evolution of language models. Language models were initially built as encoder-decoder models based on early RNNs, but with the introduction of the attention mechanism in 2015 and the emergence of the Transformer in 2017, the field underwent revolutionary growth. This study briefly reviews the development of language models and examines in detail the working mechanism and technical elements of the Transformer. It also explores statistical models and methodologies related to language models and the Transformer.
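The working mechanism examined in the paper centers on scaled dot-product attention, the Transformer's core operation. As a point of reference only (this sketch is not taken from the paper; the function name and toy dimensions are illustrative assumptions), a minimal NumPy version is:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Illustrative sketch: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # query-key similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                             # weighted average of the values

# Toy example (hypothetical sizes): 3 queries attending over 4 key/value tokens
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```

In the full Transformer this operation is applied in parallel over several heads (multi-head attention) and combined with positional encodings, residual connections, and layer normalization, which are the technical elements the study discusses.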
