Implementation of Machine Learning for Spam Detection and Topic Modeling for Emails in Bahasa Indonesia

  • Received : 2024.10.29
  • Accepted : 2024.12.05
  • Published : 2024.12.30

Abstract

Indonesia ranks fifth among countries of origin for spammers. Spam therefore needs urgent attention, especially spam written in Bahasa Indonesia (the Indonesian language), and building an effective spam detection model is one way to address it. This study aims to compare machine learning models for spam detection, to model the topics of spam emails, and to design an implementation as a REST API. Spam detection is carried out using six machine learning algorithms: Long Short-Term Memory (LSTM), K-Nearest Neighbours (KNN), Naive Bayes, Random Forest, AdaBoost, and Support Vector Machine (SVM), each combined with two slang preprocessing steps, conversion and translation. Latent Dirichlet Allocation (LDA) is used for topic modeling of the spam emails. The results show that the slang conversion and translation steps can improve accuracy and F1-score, and that LSTM was the best method, with an accuracy of 93.15% and an F1-score of 93.01%. In addition, five main topics were found in the data categorized as spam: promotions, job vacancies, educational offers, bulletins and news, and investment and finance. A REST API model was successfully developed to separate spam into categories based on promotional and other topics.
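The following is a minimal sketch of the general pipeline the abstract describes: slang conversion, classification, and evaluation by accuracy and F1-score. It is not the study's implementation. The slang lexicon entries, the toy emails, and all function and class names are hypothetical, and a small hand-written Naive Bayes classifier (one of the six compared algorithms) stands in for the trained models, since the best-performing LSTM would need a deep learning library and real data.

```python
import math
from collections import Counter

# Hypothetical slang lexicon (illustrative entries only, not the study's resource).
SLANG_LEXICON = {"gk": "tidak", "yg": "yang", "dgn": "dengan", "utk": "untuk"}


def normalize_slang(text, lexicon=SLANG_LEXICON):
    """Lowercase, split on whitespace, and map slang tokens to formal forms."""
    return [lexicon.get(tok, tok) for tok in text.lower().split()]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive="spam"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy training data (invented for illustration; not from the study's dataset).
TRAIN = [
    ("promo diskon besar utk anda", "spam"),
    ("diskon promo gratis yg terbatas", "spam"),
    ("rapat tim besok pagi", "ham"),
    ("laporan rapat dgn tim", "ham"),
]

model = NaiveBayes().fit(
    [normalize_slang(text) for text, _ in TRAIN],
    [label for _, label in TRAIN],
)

if __name__ == "__main__":
    print(model.predict(normalize_slang("promo diskon utk tim")))
```

Normalizing slang before counting tokens means that "utk" and "untuk" contribute to the same feature, which is one plausible reason such preprocessing could help the scores reported above.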
