Word-Level Embedding to Improve Performance of Representative Spatio-temporal Document Classification

Byoungwook Kim; Hong-Jun Jang

doi:10.3745/JIPS.04.0296

Journal of Information Processing Systems

Volume 19, Issue 6, pp. 830-841, 2023; pISSN 1976-913X; eISSN 2092-805X

Korea Information Processing Society (한국정보처리학회)

Byoungwook Kim (Dept. of Computer Science and Engineering, Dongshin University);
Hong-Jun Jang (Dept. of Computer Science and Engineering, Jeonju University)

Received : 2022.09.28
Accepted : 2023.01.25
Published : 2023.12.31

https://doi.org/10.3745/JIPS.04.0296

Abstract

Tokenization is the process of segmenting input text into smaller units; it is a preprocessing step performed mainly to improve the efficiency of machine learning. Various tokenization methods have been proposed in the field of natural language processing, but studies have focused primarily on segmenting text efficiently. Few studies have explored which tokenization methods are suitable for document classification tasks in Korean. In this paper, an exploratory study was performed to find the most suitable tokenization method for improving the performance of a representative spatio-temporal document classifier in Korean. A convolutional neural network model was used in the experiments, and the final performance comparison was carried out on document classification tasks whose performance depends largely on the tokenization method. Three commonly used tokenization units were compared: Jamo, character, and word. The experiments confirmed that word-unit tokenization showed excellent performance on the representative spatio-temporal document classification task, where the semantic embedding ability of the token itself is important.
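The three tokenization units compared in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names and the example sentence are hypothetical, whitespace splitting stands in for whatever word segmenter the study used, and Jamo units are obtained here through Unicode NFD normalization, which decomposes each precomposed Hangul syllable into its conjoining Jamo.

```python
import unicodedata


def word_tokens(text):
    """Word units: split on whitespace (assumption: a stand-in for a real word segmenter)."""
    return text.split()


def char_tokens(text):
    """Character units: one token per Hangul syllable block, whitespace dropped."""
    return [ch for ch in text if not ch.isspace()]


def jamo_tokens(text):
    """Jamo units: NFD normalization splits each syllable into its consonant and vowel letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return [ch for ch in decomposed if not ch.isspace()]


if __name__ == "__main__":
    # Hypothetical example sentence, not taken from the paper's dataset.
    sentence = "서울에서 회의가 열렸다"
    for name, tokenize in [("word", word_tokens),
                           ("character", char_tokens),
                           ("jamo", jamo_tokens)]:
        tokens = tokenize(sentence)
        print(f"{name}: {len(tokens)} tokens")
```

The same sentence yields progressively more and smaller tokens as the unit moves from word to character to Jamo, so each individual token carries less meaning on its own. This is the trade-off the abstract refers to when it stresses the semantic embedding ability of the token itself.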

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Korean Government (MSIT) (No. 2021R1F1A1049387), and by the "Regional Innovation Strategy (RIS)" through the NRF, funded by the Ministry of Education (MOE) (2021RIS-002).

