Optical Character Recognition for Hindi Language Using a Neural-network Approach

  • Yadav, Divakar (Department of Computer Science & Engineering, Jaypee Institute of Information Technology)
  • Sanchez-Cuadrado, Sonia (Department of Computer Science, Carlos III University)
  • Morato, Jorge (Department of Computer Science, Carlos III University)
  • Received: 2012.12.17
  • Accepted: 2013.01.17
  • Published: 2013.03.31

Abstract

Hindi is the most widely spoken language in India, with more than 300 million speakers. Unlike English text, Hindi text has no separation between characters, so the Optical Character Recognition (OCR) systems developed for Hindi have a very poor recognition rate. In this paper we propose an OCR system for printed Hindi text in Devanagari script that uses an Artificial Neural Network (ANN) to improve efficiency. One of the major reasons for the poor recognition rate is error in character segmentation. Touching characters in scanned documents complicate the segmentation process further and are a major obstacle to designing an effective character segmentation technique. A general OCR system follows four major steps: preprocessing, character segmentation, feature extraction, and finally classification and recognition. The preprocessing tasks considered in this paper are conversion of grayscale images to binary images, image rectification, and segmentation of the document's textual contents into paragraphs, lines, words, and finally basic symbols. The basic symbols, which are the fundamental units produced by the segmentation process, are recognized by the neural classifier. In this work, three feature extraction techniques are used to improve the recognition rate: histogram of projection based on mean distance, histogram of projection based on pixel value, and vertical zero crossing. These techniques are powerful enough to extract features even from distorted characters and symbols. The neural classifier is a back-propagation neural network with two hidden layers. The classifier is trained and tested on printed Hindi texts, and it achieves a correct recognition rate of approximately 90%.
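The abstract names projection histograms and vertical zero crossing as features but does not define them, so the following is only a minimal sketch under assumptions. It takes a binarized symbol as rows of 0/1 pixels (1 = ink), counts ink pixels per row and per column as a generic pixel-value projection histogram, and counts 0/1 transitions down each column as a vertical zero-crossing feature. The function names, the `Glyph` type, and the sample symbol are hypothetical, and the mean-distance variant of the histogram is not reproduced here.

```python
from typing import List

# Assumption: a segmented symbol is a list of rows, each a list of 0/1 pixels (1 = ink).
Glyph = List[List[int]]


def horizontal_projection(glyph: Glyph) -> List[int]:
    """Count ink pixels in each row (a generic projection histogram)."""
    return [sum(row) for row in glyph]


def vertical_projection(glyph: Glyph) -> List[int]:
    """Count ink pixels in each column."""
    return [sum(col) for col in zip(*glyph)]


def vertical_zero_crossings(glyph: Glyph) -> List[int]:
    """For each column, count top-to-bottom transitions between 0 and 1."""
    return [
        sum(1 for upper, lower in zip(col, col[1:]) if upper != lower)
        for col in zip(*glyph)
    ]


def feature_vector(glyph: Glyph) -> List[int]:
    """Concatenate the three feature lists into one input vector for a classifier."""
    return (
        horizontal_projection(glyph)
        + vertical_projection(glyph)
        + vertical_zero_crossings(glyph)
    )


# Hypothetical 4x4 symbol with a full top row, loosely like a Devanagari header line.
sample: Glyph = [
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]

print(feature_vector(sample))
```

A vector of this kind, computed on symbols scaled to a fixed size, is the sort of input a back-propagation network could be trained on; the feature definitions and network layout actually used are described only in the full paper.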
