Understanding the semantic change of Hangeul using word embedding

  • Sun, Hyunseok (Department of Applied Statistics, Chung-Ang University)
  • Lee, Yung-Seop (Department of Statistics, Dongguk University)
  • Lim, Changwon (Department of Applied Statistics, Chung-Ang University)
  • Received : 2021.01.04
  • Accepted : 2021.02.15
  • Published : 2021.06.30

Abstract

In recent years, the amount of text data generated has exploded, as many people post their interests on social media and as advances in internet and computer technology have made it possible to store documents in digital form. Accordingly, demand is growing for technology that extracts valuable information from large document collections. In this study, we use statistical techniques to investigate how the meanings of Korean words change over time, drawing on public data from presidential speech records and newspaper articles. Based on this, we present an approach that can be used to study the diachronic change of Hangeul. The aim is to move beyond theoretical studies of linguistic phenomena in Hangeul, which have relied on the intuition of linguists or native speakers, by deriving numerical values from public documents that anyone can use and by using those values to explain changes in word meaning.
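
The abstract does not say which statistical technique is used to measure change. As an illustration only, one common way to quantify meaning change with word embeddings is to train embeddings separately for two time periods, align them with an orthogonal Procrustes rotation, and compare each word's two vectors by cosine distance. The sketch below assumes that approach; the function names (`procrustes_align`, `cosine_distance`, `semantic_change`) and the toy random data are hypothetical and are not taken from the paper.

```python
import numpy as np


def procrustes_align(source, target):
    """Rotate `source` onto `target` using the orthogonal matrix R
    that minimizes ||source @ R - target|| in the Frobenius norm."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return source @ (u @ vt)


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def semantic_change(emb_old, emb_new):
    """Per-word cosine distance after aligning the old space to the new one.

    Row i of both matrices is assumed to hold the vector of the same word.
    """
    aligned = procrustes_align(emb_old, emb_new)
    return np.array([cosine_distance(a, b) for a, b in zip(aligned, emb_new)])


# Toy data (assumption): 200 "words" in 8 dimensions. The second period is
# a rotated copy of the first, except that word 0 is flipped to simulate
# a change in meaning.
rng = np.random.default_rng(0)
emb_old = rng.normal(size=(200, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
emb_new = emb_old @ rotation
emb_new[0] = -emb_new[0]

scores = semantic_change(emb_old, emb_new)
print(int(np.argmax(scores)))  # index of the word with the largest change
```

With real data, `emb_old` and `emb_new` would be embedding matrices trained on documents from two periods and restricted to a shared vocabulary. Words with larger scores are candidates for meaning change, which would still need to be checked against the texts themselves.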

Acknowledgement

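This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea, funded by the Korean government (Ministry of Science and ICT) in 2017 (NRF-2017M3C4A7083281).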
