Word-Level Embedding to Improve Performance of Representative Spatio-temporal Document Classification

  • Byoungwook Kim (Dept. of Computer Science and Engineering, Dongshin University)
  • Hong-Jun Jang (Dept. of Computer Science and Engineering, Jeonju University)
  • Received : 2022.09.28
  • Accepted : 2023.01.25
  • Published : 2023.12.31

Abstract

Tokenization is the process of segmenting input text into smaller units, and it is a preprocessing step performed mainly to improve the efficiency of machine learning. Various tokenization methods have been proposed for natural language processing, but most studies have focused on segmenting text efficiently; few have examined which tokenization methods are suitable for document classification tasks in Korean. In this paper, an exploratory study was performed to find the tokenization method best suited to improving the performance of a representative spatio-temporal document classifier for Korean. A convolutional neural network model was used for the experiments, and the document classification task chosen for the final performance comparison is one whose performance depends heavily on the tokenization method. The commonly used jamo, character, and word units were adopted as the tokenization methods for the comparative experiments. The experimental results confirm that word-level tokenization performs best on the representative spatio-temporal document classification task, where the semantic embedding ability of the token itself is important.
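
For concreteness, the following minimal Python sketch (not from the paper; it assumes plain whitespace segmentation for word units and standard Unicode arithmetic for jamo decomposition, whereas the authors' actual preprocessing pipeline may differ) illustrates the three tokenization granularities compared in the experiments.

```python
# Minimal sketch of the three tokenization granularities compared in the paper
# (illustrative only; not the authors' implementation).

def jamo_tokenize(text):
    """Decompose Hangul syllables (U+AC00..U+D7A3) into their constituent jamo."""
    tokens = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:
            idx = code - 0xAC00
            lead, vowel, tail = idx // 588, (idx % 588) // 28, idx % 28
            tokens.append(chr(0x1100 + lead))      # leading consonant (choseong)
            tokens.append(chr(0x1161 + vowel))     # vowel (jungseong)
            if tail:
                tokens.append(chr(0x11A7 + tail))  # trailing consonant (jongseong)
        else:
            tokens.append(ch)  # keep non-Hangul characters (spaces, digits, ...) as-is
    return tokens

def char_tokenize(text):
    """Character-level tokens: each syllable block or symbol is one token."""
    return list(text)

def word_tokenize(text):
    """Word-level tokens: whitespace-delimited units (eojeol)."""
    return text.split()

sentence = "서울에서 집회가 열렸다"  # "A rally was held in Seoul"
print(jamo_tokenize(sentence))  # onset/vowel/coda jamo for every syllable
print(char_tokenize(sentence))  # individual syllable blocks (spaces kept as tokens)
print(word_tokenize(sentence))  # ['서울에서', '집회가', '열렸다']
```

Word-level tokens carry the most semantic content per token, which is the property the paper identifies as important for this classification task; the trade-off is a much larger vocabulary than character- or jamo-level tokenization.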

Acknowledgement

This research was funded by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Korean Government (MSIT) (No. 2021R1F1A1049387). This work was also supported by the "Regional Innovation Strategy (RIS)" program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (MOE) (2021RIS-002).
