A Comparative Study of Word Embedding Models for Arabic Text Processing

  • Assiri, Fatmah (University of Jeddah, College of Computer Science and Engineering)
  • Alghamdi, Nuha (University of Jeddah, College of Computer Science and Engineering)
  • Received : 2022.08.05
  • Published : 2022.08.30

Abstract

Natural-language texts are analyzed to extract their intended meaning so that they can be classified according to the problem under study. One way to represent words is to generate vectors of real values that encode their meaning; this is called word embedding. Similarities between word representations are then measured to identify the class of a text. Word embeddings can be created using the word2vec technique. More recently, fastText was introduced to provide better results when used with classifiers. In this paper, we study the performance of well-known classifiers when each of the two techniques is used for word embedding on an Arabic dataset. We applied them to real data collected from Wikipedia and found that word2vec and fastText achieved similar accuracy with all of the classifiers used.
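
As a rough illustration of the idea in the abstract (words as real-valued vectors, with similarity between vectors used to assign a class), the following minimal sketch may help. It is not the authors' pipeline: the three-dimensional vectors, the class labels, and the helper names (cosine_similarity, average_vector, nearest_class) are all hypothetical, and a simple nearest-centroid rule stands in for the classifiers evaluated in the paper. In practice the vectors would come from a trained word2vec or fastText model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def average_vector(tokens, embeddings):
    """Represent a text as the mean of its known word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no token of the text has an embedding
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def nearest_class(tokens, centroids, embeddings):
    """Pick the class whose centroid is most similar to the text vector."""
    doc = average_vector(tokens, embeddings)
    if doc is None:
        return None
    return max(centroids, key=lambda label: cosine_similarity(doc, centroids[label]))

# Hypothetical toy vectors; real ones would be learned from a corpus.
BOOK, LIBRARY, BALL, STADIUM = "كتاب", "مكتبة", "كرة", "ملعب"
TOY_EMBEDDINGS = {
    BOOK: [0.9, 0.1, 0.0],
    LIBRARY: [0.8, 0.2, 0.1],
    BALL: [0.1, 0.9, 0.2],
    STADIUM: [0.0, 0.8, 0.3],
}

# Hypothetical class centroids built from a few labeled words.
CENTROIDS = {
    "reading": average_vector([BOOK, LIBRARY], TOY_EMBEDDINGS),
    "sports": average_vector([BALL, STADIUM], TOY_EMBEDDINGS),
}

if __name__ == "__main__":
    print(nearest_class([BOOK], CENTROIDS, TOY_EMBEDDINGS))
    print(nearest_class([BALL, STADIUM], CENTROIDS, TOY_EMBEDDINGS))
```

Averaging word vectors is only one simple way to turn word embeddings into a text representation; it is used here because it keeps the example short, not because the paper states that it was used.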
