A Semi-supervised Learning of HMM to Build a POS Tagger for a Low Resourced Language

  • Pattnaik, Sagarika (Department of Computer Science and Engineering, SOA, Deemed to be University) ;
  • Nayak, Ajit Kumar (Department of Computer Science and Information Technology, SOA, Deemed to be University) ;
  • Patnaik, Srikanta (Department of Computer Science and Engineering, SOA, Deemed to be University)
  • Received : 2020.07.02
  • Reviewed : 2020.12.22
  • Published : 2020.12.31

Abstract

Part-of-speech (POS) tagging is an indispensable component of major NLP models. It has progressed in many languages around the world, especially European languages. For Indian languages, however, it has not seen a major breakthrough because of a lack of supporting tools and resources, and for Odia in particular it is not yet well established. To make Odia usable in different NLP operations, this paper attempts to develop a POS tagger for the language based on a Hidden Markov Model (HMM). The tagger uses a bigram HMM with the dynamic-programming Viterbi algorithm to produce annotated text with maximum accuracy. The model is evaluated on a tourism-domain corpus of approximately 0.2 million tokens. With a training-to-testing ratio of 3:1, the proposed model gives satisfactory results despite the limited training size.
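As a rough illustration of the technique named in the abstract, the following is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm. It is not the authors' implementation: the function names, the add-one smoothing, the three-tag tagset, and the tiny English placeholder corpus are all assumptions made for the example, and the paper's Odia corpus and tagset are not reproduced here.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag


def train_bigram_hmm(tagged_sentences, smoothing=1.0):
    """Estimate smoothed transition and emission log-probabilities.

    Add-k smoothing is an assumption of this sketch, not taken from the paper.
    """
    trans = defaultdict(lambda: defaultdict(float))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(float))   # emit[tag][word]
    tags, vocab = set(), set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
    tags = sorted(tags)
    vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + smoothing)
                        / (total + smoothing * len(tags)))

    def log_emit(tag, word):
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + smoothing)
                        / (total + smoothing * vocab_size))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    # best[i][t]: best log-probability of a tag path ending in t at position i
    best = [{t: log_trans(START, t) + log_emit(t, words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + log_trans(p, t)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log_emit(t, words[i])
            back[i][t] = prev
    # Follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Placeholder training data (English stand-ins, not the paper's corpus)
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
]
tags, log_trans, log_emit = train_bigram_hmm(toy_corpus)
print(viterbi(["the", "cat", "sleeps"], tags, log_trans, log_emit))
```

Working in log space avoids numerical underflow on long sentences, and the smoothing term lets the decoder assign a tag to words that never appeared in training, which matters when the training set is small.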


Cited by

  1. A method for classifying water pollutants using swimming behavior videos of Caenorhabditis elegans and machine learning models, vol.25, pp.7, 2021, https://doi.org/10.6109/jkiice.2021.25.7.903