Comparative Study of Various Persian Stemmers in the Field of Information Retrieval

  • Received : 2014.06.12
  • Accepted : 2015.08.05
  • Published : 2015.09.30

Abstract

In linguistics, stemming is the operation of reducing words to a more general form, called the 'stem'. Stemming is an important step in information retrieval systems, natural language processing, and text mining. Information retrieval systems are evaluated by metrics such as precision and recall, and the superiority of one system over another is measured by these metrics. Stemmers reduce the size of the index file, increase the speed of information retrieval systems, and improve their performance by boosting precision and recall. There are few Persian stemmers, and most of them are based on morphological rules. In this paper we study Persian stemmers, which fall into three main classes: structural stemmers, lookup table stemmers, and statistical stemmers. We describe the algorithms of each class and present the strengths and weaknesses of each Persian stemmer. We also propose some metrics for comparing and evaluating the stemmers.
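As an illustration of the two evaluation metrics named above, the following minimal sketch computes set-based precision and recall for a single query. The function name `precision_recall` and the document identifiers are hypothetical and are not taken from the paper; the definitions are the standard ones (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document IDs returned by the system.
    relevant:  iterable of document IDs judged relevant for the query.
    Both names are illustrative assumptions, not part of the paper.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved

    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical run: 4 documents retrieved, 3 relevant, 2 in common.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d5"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Under this definition, a stemmer that conflates more word forms tends to retrieve more documents, which can raise recall; whether precision also rises depends on how many of the additional documents are relevant.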
