A Methodology for Urdu Word Segmentation using Ligature and Word Probabilities

  • Received : 2011.12.10
  • Accepted : 2012.02.20
  • Published : 2012.02.29

Abstract

This paper introduces a word segmentation technique for handwriting recognition of Urdu script. Word segmentation, or word tokenization, is a primary step in understanding sentences written in Urdu. Several techniques exist for word segmentation in other languages, but little work has been done on word segmentation for Urdu Optical Character Recognition (OCR) systems. The proposed method finds word boundaries in a sequence of ligatures using probabilistic formulas that exploit the collocation of ligatures and words in a corpus. The technique achieves a word identification rate of 97.10%, with an unknown-word identification rate of 66.63%.
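
The abstract does not give the probabilistic formulas themselves, so the following is only a minimal sketch of the general idea: choosing word boundaries in a ligature sequence by maximizing corpus-derived probabilities. Everything here is an assumption made for illustration and not the authors' method: the toy corpus, the Latin placeholder tokens standing in for Urdu ligatures, the unigram word model, the add-one smoothed ligature back-off for unknown words, and the function names `word_log_prob` and `segment`.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each word is a tuple of ligature tokens.
# Latin placeholders stand in for Urdu ligatures.
CORPUS = [
    ("ka", "r"), ("ka", "r"), ("ka",), ("r", "na"), ("na",), ("ka", "r"),
]

word_counts = Counter(CORPUS)
ligature_counts = Counter(lig for word in CORPUS for lig in word)
TOTAL_WORDS = sum(word_counts.values())
TOTAL_LIGS = sum(ligature_counts.values())
MAX_WORD_LEN = max(len(word) for word in CORPUS)


def word_log_prob(candidate):
    """Log probability of a candidate word (a tuple of ligatures)."""
    if candidate in word_counts:
        return math.log(word_counts[candidate] / TOTAL_WORDS)
    # Unknown word (assumed back-off): a fixed unseen-word penalty times
    # add-one smoothed probabilities of the ligatures it contains.
    vocab = len(ligature_counts) + 1
    logp = math.log(1.0 / (TOTAL_WORDS + 1))
    for lig in candidate:
        logp += math.log((ligature_counts[lig] + 1) / (TOTAL_LIGS + vocab))
    return logp


def segment(ligatures):
    """Return the highest-scoring split of a ligature list into words."""
    n = len(ligatures)
    # best[i] = (best log score of the first i ligatures, start of last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + word_log_prob(tuple(ligatures[start:end]))
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end to recover the word boundaries.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(tuple(ligatures[start:end]))
        end = start
    return words[::-1]


print(segment(["ka", "r", "na"]))  # [('ka', 'r'), ('na',)]
```

With this toy corpus the split `ka r | na` wins because the two-ligature word `ka r` is frequent, while the alternatives need either the rarer `r na` or the unseen single-ligature word `r`. A model closer to the abstract's description would presumably also use collocation statistics between neighboring words and ligatures rather than independent unigram scores.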
