Machine Learning Based Automatic Categorization Model for Text Lines in Invoice Documents

  • Received : 2010.10.05
  • Accepted : 2011.01.05
  • Published : 2010.12.30

Abstract

Automatic understanding of the contents of a document image is a very hard problem, because it involves mathematically challenging problems that originate mainly from the over-determined system induced by the document segmentation process. In both academia and industry, there have been continual and varied efforts to improve the core parts of content retrieval technology by setting aside segmentation-related issues through the use of semi-structured documents such as invoices. In this paper we propose classification models for text lines in invoice documents, in which text lines are grouped into five categories according to their contents: purchase order header, invoice header, summary header, surcharge header, and purchase items. Our investigation concentrates on the performance of machine learning based models in two groups: linear-discriminant-analysis (LDA) based and non-LDA (logic based). In the LDA group, naïve Bayesian, k-nearest neighbor, and SVM classifiers were used; in the non-LDA group, decision tree, random forest, and boosting classifiers were used. We describe the details of feature vector construction and the processes of model and parameter selection, including training and validation. We also present experimental results comparing the training and classification error levels of the models employed.
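
As a rough illustration of the workflow summarized in the abstract (feature vector construction, a classifier, and validation), the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the features in `line_features`, the toy text lines, the choice of three of the five categories, and the fold and neighbor counts are all assumptions made for illustration. It shows one of the listed classifiers, k-nearest neighbor, evaluated by a simple k-fold cross-validation.

```python
import math
from collections import Counter


def line_features(text):
    """Hypothetical feature vector for one text line (not from the paper):
    digit ratio, letter ratio, and a scaled token count."""
    n = max(len(text), 1)
    digits = sum(c.isdigit() for c in text)
    letters = sum(c.isalpha() for c in text)
    tokens = len(text.split())
    return [digits / n, letters / n, tokens / 10.0]


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to `query`.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_val_error(samples, n_folds=3, k=3):
    """Fraction of misclassified samples under interleaved n-fold
    cross-validation: each fold is held out once and predicted from the rest."""
    errors = 0
    for fold in range(n_folds):
        test = samples[fold::n_folds]
        train = [s for i, s in enumerate(samples) if i % n_folds != fold]
        errors += sum(knn_predict(train, x, k) != y for x, y in test)
    return errors / len(samples)


# Invented example lines; real invoice text lines would come from OCR output.
TOY_LINES = [
    ("PO Number: 4500123", "purchase order header"),
    ("Invoice No. INV-2010-0042", "invoice header"),
    ("Widget A 10 2.50 25.00", "purchase items"),
    ("Purchase Order 4500999", "purchase order header"),
    ("Invoice Date 2010-10-05", "invoice header"),
    ("Bolt M8 200 0.10 20.00", "purchase items"),
    ("Customer PO Ref 4500777", "purchase order header"),
    ("Invoice To: Example Corp", "invoice header"),
    ("Gasket 35 1.20 42.00", "purchase items"),
]

samples = [(line_features(text), label) for text, label in TOY_LINES]
print("cross-validation error:", cross_val_error(samples))
```

The same `cross_val_error` loop could wrap any of the other classifiers named in the abstract, which is one way to compare their error levels on equal terms.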

Acknowledgement

Supported by : Kyungwon University
