Implementation of Machine Learning for Spam Detection and Topic Modeling for Emails in Bahasa Indonesia

Masna Novita RAHMANIAR; Ahmad HARTIONO; Setia PRAMANA

doi:10.24225/kjai.2024.12.4.9

Korean Journal of Artificial Intelligence (한국인공지능학회지)

Volume 12, Issue 4, Pages 9-19, 2024, eISSN 2508-7894

Korea Artificial Intelligence Association (한국인공지능학회)

Masna Novita RAHMANIAR (BPS Statistics Indonesia)
Ahmad HARTIONO (BPS Statistics Indonesia)
Setia PRAMANA (STIS Polytechnic of Statistics)

Received : 2024.10.29
Accepted : 2024.12.05
Published : 2024.12.30

https://doi.org/10.24225/kjai.2024.12.4.9

Abstract

Indonesia ranks fifth among countries of origin for spammers. Spam therefore needs urgent attention, especially spam written in Bahasa Indonesia (the Indonesian language), and building an accurate spam detection model is one way to address it. This study compares machine learning models for spam detection, models the topics of spam emails, and designs an implementation as a REST API. Spam detection is carried out with six machine learning algorithms: Long Short-Term Memory (LSTM), K-Nearest Neighbours (KNN), Naive Bayes, Random Forest, AdaBoost, and Support Vector Machine (SVM), each combined with two slang preprocessing steps, convert and translate. Latent Dirichlet Allocation (LDA) is used for topic modeling of the spam emails. The results show that the convert and translate slang preprocessing steps can improve accuracy and F1-score, and that LSTM was the best of the compared methods, with an accuracy of 93.15% and an F1-score of 93.01%. In addition, five main topics emerged in the data categorized as spam: promotions, job vacancies, educational offers, bulletins and news, and investment and finance. A REST API model was successfully developed that separates spam into promotional and other topics.
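As a rough illustration of the pipeline the abstract outlines, the sketch below chains a slang "convert" step with Naive Bayes, one of the six algorithms named, and then computes accuracy and F1-score. It is a minimal standard-library sketch and not the authors' implementation: the `SLANG` lexicon, the toy emails, and all function names are assumptions made up for this example, and the abstract does not specify the paper's actual features, lexicon, or hyperparameters.

```python
import math
import re
from collections import Counter

# Hypothetical slang lexicon (assumption): informal token -> formal Indonesian.
SLANG = {"gk": "tidak", "yg": "yang", "dgn": "dengan"}


def convert_slang(text):
    """Lowercase, tokenize on letters, and replace known slang tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [SLANG.get(tok, tok) for tok in tokens]


def train_naive_bayes(docs, labels):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(convert_slang(doc))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, doc):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for tok in convert_slang(doc):
            if tok in vocab:  # ignore tokens never seen in training
                score += math.log((counts[tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy_and_f1(y_true, y_pred, positive="spam"):
    """Compute accuracy and the F1-score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy training data invented for this sketch (not from the paper's corpus).
train_docs = [
    "promo diskon besar yg murah",
    "diskon promo gratis hari ini",
    "rapat tim besok pagi",
    "laporan rapat dgn tim",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "promo diskon gratis"))
```

Because the same `convert_slang` function runs at training and prediction time, a message written with "dgn" and one written with "dengan" map to the same feature, which is the intuition behind normalizing slang before classification.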
