Performance Comparison of Automatic Classification Using Word Embeddings of Book Titles

  • Lee, Yong-Gu (Department of Library and Information Science, Kyungpook National University)
  • Received : 2023.11.20
  • Accepted : 2023.12.13
  • Published : 2023.12.30

Abstract

To analyze the impact of word embeddings on book titles, which are short texts, this study used three word embedding models (Word2vec, GloVe, fastText) to generate embedding vectors from book titles and applied these vectors as classification features for automatic classification. The classifier was the k-nearest neighbors (kNN) algorithm, and the classification categories were the divisions of DDC (Dewey Decimal Classification) main class 300 that libraries had assigned to the books. In the automatic classification experiments applying word embeddings to book titles, the Skip-gram architectures of Word2vec and fastText produced better kNN classification performance than TF-IDF features. In the optimization of hyperparameters across the three models, the Skip-gram architecture of fastText showed the best overall performance; in particular, performance improved when hierarchical softmax and larger embedding dimensions were used. From a performance perspective, fastText can generate embeddings for substrings or subwords using character n-grams, which was shown to increase recall. In contrast, the Skip-gram architecture of Word2vec generally performed well at a low dimensionality (size 300) and with small negative sampling sizes (3 or 5).
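
As a concrete illustration of the pipeline described in the abstract, the sketch below trains a fastText Skip-gram model with hierarchical softmax on tokenized titles, averages the token vectors into a single title vector, and classifies the result with a kNN classifier. This is a minimal sketch, not the authors' code: it assumes gensim and scikit-learn as implementations, and the toy titles, DDC labels, and hyperparameters not named in the abstract (window size, epochs, n-gram range, number of neighbors) are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' code): fastText Skip-gram title
# embeddings averaged per title and fed to a kNN classifier.
import numpy as np
from gensim.models import FastText
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical pre-tokenized book titles and their DDC 300-division labels.
titles = [["korean", "social", "welfare", "policy"],
          ["introduction", "to", "economics"],
          ["modern", "education", "theory"],
          ["principles", "of", "public", "administration"]]
labels = ["360", "330", "370", "350"]

# Train a fastText Skip-gram model with hierarchical softmax on the titles.
# sg=1 -> Skip-gram, hs=1 with negative=0 -> hierarchical softmax,
# min_n/max_n -> character n-gram range used for subword embeddings.
ft = FastText(sentences=titles, vector_size=300, sg=1, hs=1, negative=0,
              window=5, min_count=1, min_n=2, max_n=5, epochs=50)

def title_vector(tokens, model):
    """Represent a title as the mean of its token embeddings."""
    vecs = [model.wv[t] for t in tokens]  # subword n-grams cover OOV tokens
    return np.mean(vecs, axis=0)

X = np.vstack([title_vector(t, ft) for t in titles])

# kNN classifier over the embedded titles (cosine distance as one common choice).
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, labels)

# Classify an unseen title by embedding it the same way.
print(knn.predict(title_vector(["welfare", "economics"], ft).reshape(1, -1)))
```

The same pipeline can be repeated with Word2vec or GloVe vectors in place of the fastText model, or with TF-IDF vectors as the baseline features, to reproduce the kind of comparison reported above.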
