A Machine Learning Approach for Named Entity Recognition in Classical Arabic Natural Language Processing

  • Ramzi Salah (Faculty of Information Science and Technology, University Kebangsaan Malaysia)
  • Muaadh Mukred (Faculty of Information Science and Technology, University Kebangsaan Malaysia)
  • Lailatul Qadri binti Zakaria (Faculty of Information Science and Technology, University Kebangsaan Malaysia)
  • Fuad A. M. Al-Yarimi (Department of Computer Science, King Khalid University)
  • Received : 2024.03.27
  • Accepted : 2024.09.23
  • Published : 2024.10.31

Abstract

Named Entity Recognition (NER) is a key element of many Natural Language Processing (NLP) applications. It involves identifying spans of text and classifying them into predefined categories, such as a location or a person's name. Arabic NER (ANER) also supports numerous other Arabic NLP (ANLP) tasks, such as Machine Translation (MT), Question Answering (QA), and Information Extraction (IE). ANER systems are commonly classified into three major groups: rule-based, Machine Learning (ML), and hybrid. This study examines ML-based ANER, particularly for Classical Arabic, which presents unique challenges due to its complex morphological structure and limited linguistic resources. We propose a supervised approach that integrates word-level, morphological, and knowledge-based features to improve NER performance for Classical Arabic. Our method was evaluated on the CANERCorpus, a specialized dataset of annotated texts from Classical Arabic literature. The Naive Bayes (NB) approach achieved an F-measure of 80%, with a precision of 86% and a recall of 75%. These results indicate a significant improvement over traditional methods, particularly in handling the intricate structure of Classical Arabic. The study highlights the potential of ML for overcoming the challenges of ANER and provides directions for further research in this domain.
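
The reported figures can be cross-checked with a short calculation. The sketch below assumes that the F-measure in the abstract is the balanced F1 score, i.e. the harmonic mean of precision and recall; the abstract does not state which variant was used, and the function name `f_measure` is introduced here only for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is this balanced variant.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision (86%) and recall (75%) as reported in the abstract.
reported_f = f_measure(0.86, 0.75)
print(f"{reported_f:.3f}")  # prints 0.801
```

Under this assumption the result rounds to 80%, which agrees with the F-measure stated above.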

Acknowledgement

The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through Large Research Project under grant number RGP2/214/45.
