A statistical journey to DNN, the third trip: Language model and transformer

  • Received : 2024.07.31
  • Accepted : 2024.08.12
  • Published : 2024.10.31

Abstract

Over the past decade, the remarkable advances in deep neural networks have gone hand in hand with the development and evolution of language models. Language models were initially built as encoder-decoder models based on early RNNs, but with the introduction of the attention mechanism in 2015 and the emergence of the Transformer in 2017, the field underwent revolutionary growth. This study briefly reviews the development of language models and examines in detail the working mechanism and technical elements of the Transformer. It also explores statistical models and methodologies related to language models and the Transformer.
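The working mechanism examined in the paper centers on scaled dot-product attention, the Transformer's core operation. As a point of reference only (this sketch is not taken from the paper; the function name and toy dimensions are illustrative assumptions), a minimal NumPy version is:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Illustrative sketch: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # query-key similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                             # weighted average of the values

# Toy example (hypothetical sizes): 3 queries attending over 4 key/value tokens
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```

In the full Transformer this operation is applied in parallel over several heads (multi-head attention) and combined with positional encodings, residual connections, and layer normalization, which are the technical elements the study discusses.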
