A Study on Word Vector Models for Representing Korean Semantic Information

  • Received : 2015.12.03
  • Accepted : 2015.12.21
  • Published : 2015.12.31

Abstract

This paper examines whether the Global Vectors (GloVe) model is applicable to Korean data as a universal learning algorithm. The main purpose of this study is to compare GloVe with the word2vec models, namely the continuous bag-of-words (CBOW) model and the skip-gram (SG) model. To this end, we conducted an experiment with an evaluation corpus consisting of 70 target words for the word similarity task and 819 pairs of Korean words for the word analogy task. In the word similarity task, the Pearson correlation coefficients with human judgements were 0.3133 for GloVe, 0.2637 for CBOW, and 0.2177 for SG. In the word analogy task, the overall accuracy rates on semantic and syntactic relations were 67% for GloVe, 66% for CBOW, and 57% for SG.
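
As a rough illustration of the evaluation pipeline the abstract describes, the sketch below trains the two word2vec variants and scores them on the two tasks. It is a minimal sketch assuming gensim and scipy; the tokenized sentences, word pairs, and analogy questions are hypothetical placeholders, not the authors' actual Sejong-corpus data or their 70-word / 819-pair evaluation set.

```python
# Minimal sketch of the two evaluation tasks described in the abstract,
# assuming gensim and scipy. All data below are illustrative placeholders,
# not the paper's actual evaluation materials.
from gensim.models import Word2Vec
from scipy.stats import pearsonr

# Placeholder tokenized Korean sentences standing in for the training corpus.
sentences = [["학생", "이", "학교", "에", "가다"],
             ["아이", "가", "책", "을", "읽다"]]

# The two word2vec variants compared in the paper: CBOW (sg=0), skip-gram (sg=1).
cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, min_count=1)
skipgram = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

def similarity_pearson(kv, pairs, human_scores):
    """Pearson correlation between model cosine similarities and human
    judgements -- the statistic behind the 0.3133 / 0.2637 / 0.2177 figures."""
    model_scores = [kv.similarity(w1, w2) for w1, w2 in pairs]
    r, _p = pearsonr(model_scores, human_scores)
    return r

def analogy_accuracy(kv, questions):
    """Accuracy on a:b :: c:d analogy questions using the vector-offset
    (3CosAdd) method: the vector b - a + c should be nearest to d."""
    correct = 0
    for a, b, c, d in questions:
        top_word, _score = kv.most_similar(positive=[b, c], negative=[a], topn=1)[0]
        correct += (top_word == d)
    return correct / len(questions)
```

GloVe vectors trained separately (e.g., with the Stanford GloVe toolkit) can be loaded as gensim KeyedVectors and passed to the same two functions, which keeps the evaluation uniform across all three models.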

References

  1. Clark, S. (2013). Vector space models of lexical meaning. In S. Lappin & C. Fox (Eds.), Handbook of Contemporary Semantics (2nd ed.). Malden, MA: Blackwell. http://www.cl.cam.ac.uk/~sc609/pubs/sem_handbook.pdf
  2. Erk, K. (2012). Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10), 635-653. https://doi.org/10.1002/lnco.362
  3. Turney, P., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141-188.
  4. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
  5. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. https://doi.org/10.1145/505282.505283
  6. Tellex, S., et al. (2003). Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 41-47). ACM.
  7. Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 384-394). Association for Computational Linguistics.
  8. Socher, R., et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1631-1642).
  9. Deerwester, S. C., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  10. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint. http://arxiv.org/abs/1301.3781/
  11. Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
  12. Li, H., Ma, B., & Lee, C.-H. (2007). A vector space modeling approach to spoken language identification. IEEE Transactions on Audio, Speech, and Language Processing, 15(1), 271-284. https://doi.org/10.1109/TASL.2006.876860
  13. The National Institute of Korean Language. (2007). The Sejong Corpus.
  14. Lee, Y.-I., Lee, H.-J., Koo, M.-W., & Cho, S. W. (2015). Korean semantic similarity measures for the vector space models. Phonetics and Speech Sciences, 7(4), 1-7.
  15. Levy, O., & Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL) (pp. 171-180).