Comparative study of text representation and learning for Persian named entity recognition

  • Received : 2021.11.13
  • Accepted : 2022.03.29
  • Published : 2022.10.10

Abstract

Transformer models have had a great impact on natural language processing (NLP) in recent years by enabling powerful and efficient contextualized language models. Recent studies have applied transformer-based language models to various NLP tasks, including Persian named entity recognition (NER). However, for complex tasks such as NER, it is difficult to determine which contextualized embedding produces the best representation. Because comparative studies of different contextualized pretrained models combined with sequence modeling classifiers are lacking, we conducted a comparative study of different classifiers and embedding models. In this paper, we tune different transformer-based language models with different classifiers and evaluate them on the Persian NER task, analyzing how the choice of text representation and text classification method affects Persian NER performance. We train and evaluate the models on three Persian NER datasets: MoNa, Peyma, and Arman. Experimental results show that XLM-R with a linear layer and a conditional random field (CRF) layer performs best. This model achieves phrase-based F-measures of 70.04, 86.37, and 79.25 and word-based F-measures of 78, 84.02, and 89.73 on the MoNa, Peyma, and Arman datasets, respectively. These results represent state-of-the-art performance on the Persian NER task.
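The abstract reports both phrase-based and word-based F-measures without defining them. The sketch below shows one common way to compute the two, and it rests on assumptions that do not come from the paper: tags follow the BIO scheme, a phrase counts as correct only when its boundaries and type both match exactly, and a word counts as correct when its entity type matches regardless of the B/I prefix. The function names (`extract_spans`, `phrase_f1`, `word_f1`) are hypothetical, and the paper's own evaluation script may differ.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An "I-" tag with no open span of the same type is ignored here.
    return spans


def f1(true_pos: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def phrase_f1(gold: List[str], pred: List[str]) -> float:
    """Phrase-based F1: a predicted span must match boundaries and type."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    return f1(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))


def word_f1(gold: List[str], pred: List[str]) -> float:
    """Word-based F1: each token is scored on its entity type alone."""
    gold_types = [None if t == "O" else t[2:] for t in gold]
    pred_types = [None if t == "O" else t[2:] for t in pred]
    true_pos = sum(1 for g, p in zip(gold_types, pred_types)
                   if g is not None and g == p)
    n_pred = sum(p is not None for p in pred_types)
    n_gold = sum(g is not None for g in gold_types)
    return f1(true_pos, n_pred, n_gold)


# Toy example: the prediction truncates a two-word person name.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "O", "O", "B-LOC"]
phrase_score = phrase_f1(gold_tags, pred_tags)
word_score = word_f1(gold_tags, pred_tags)
```

In this toy example the truncated name counts as a complete miss at the phrase level but still earns partial credit at the word level, so the two scores diverge. This is one reason the two metrics are reported separately.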
