Optical Character Recognition for Hindi Language Using a Neural-network Approach

Yadav, Divakar;Sanchez-Cuadrado, Sonia;Morato, Jorge;

doi:10.3745/JIPS.2013.9.1.117

Journal of Information Processing Systems

Volume 9 Issue 1 / Pages 117-140 / 2013 / 1976-913X (pISSN) / 2092-805X (eISSN)

Korea Information Processing Society (한국정보처리학회)

Yadav, Divakar (Department of Computer Science & Engineering, Jaypee Institute of Information Technology) ;
Sanchez-Cuadrado, Sonia (Department of Computer Science, Carlos III University) ;
Morato, Jorge (Department of Computer Science, Carlos III University)

Received : 2012.12.17
Accepted : 2013.01.17
Published : 2013.03.31

https://doi.org/10.3745/JIPS.2013.9.1.117

Abstract

Hindi is the most widely spoken language in India, with more than 300 million speakers. Unlike English text, text written in Hindi has no separation between its characters, so Optical Character Recognition (OCR) systems developed for Hindi have a very poor recognition rate. In this paper we propose an OCR system for printed Hindi text in Devanagari script that uses an Artificial Neural Network (ANN) to improve efficiency. One of the major reasons for the poor recognition rate is error in character segmentation. Touching characters in scanned documents further complicate segmentation and make it difficult to design an effective character segmentation technique. A general OCR system follows four major steps: preprocessing, character segmentation, feature extraction, and finally classification and recognition. The preprocessing tasks considered in the paper are conversion of grayscale images to binary images, image rectification, and segmentation of the document's textual content into paragraphs, lines, words, and finally basic symbols. The basic symbols, obtained as the fundamental units of the segmentation process, are recognized by the neural classifier. Three feature extraction techniques are used to improve the recognition rate: histogram of projection based on mean distance, histogram of projection based on pixel value, and vertical zero crossing. These techniques are powerful enough to extract features even from distorted characters and symbols. The neural classifier is a back-propagation neural network with two hidden layers. The classifier is trained and tested on printed Hindi text and achieves a correct recognition rate of approximately 90%.
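
The abstract names the preprocessing and feature extraction steps but does not define them, so the following is only a minimal sketch of how such steps are commonly implemented, not the authors' method. It shows fixed-threshold binarization, projection histograms based on pixel value, segmentation of a projection profile into runs (as might be used to split text into lines), and a per-column vertical zero-crossing count. All function names, the threshold of 128, and the convention that ink is 1 are assumptions made for illustration. The "mean distance" projection histogram is omitted because the abstract gives no definition for it.

```python
from typing import List, Tuple

# Assumed representation: an image is a list of rows of integer pixels.
Image = List[List[int]]


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Convert a grayscale image (0-255) to binary; dark pixels become ink (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def horizontal_projection(binary: Image) -> List[int]:
    """Count ink pixels in each row."""
    return [sum(row) for row in binary]


def vertical_projection(binary: Image) -> List[int]:
    """Count ink pixels in each column."""
    return [sum(col) for col in zip(*binary)]


def segment_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs of consecutive non-zero entries, end exclusive."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # a blank entry closes the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def vertical_zero_crossings(binary: Image) -> List[int]:
    """For each column, count ink/background transitions from top to bottom."""
    return [
        sum(1 for a, b in zip(col, col[1:]) if a != b)
        for col in zip(*binary)
    ]


if __name__ == "__main__":
    # Hypothetical 5x5 grayscale patch: 0 is black ink, 255 is white paper.
    gray = [
        [255, 255, 255, 255, 255],
        [255,   0,   0,   0, 255],
        [255,   0, 255,   0, 255],
        [255, 255, 255, 255, 255],
        [255,   0, 255, 255, 255],
    ]
    binary = binarize(gray)
    rows = horizontal_projection(binary)
    print("row profile:", rows)
    print("row runs:", segment_runs(rows))
    print("column profile:", vertical_projection(binary))
    print("zero crossings:", vertical_zero_crossings(binary))
```

In a pipeline of this kind, the row runs would delimit candidate text lines, and the column profile and zero-crossing counts of each segmented symbol would be concatenated into a feature vector for the classifier. A fixed threshold is the simplest choice; real scans often need an adaptive one.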
