Real Scene Text Image Super-Resolution Based on Multi-Scale and Attention Fusion

  • Xinhua Lu (School of Information Engineering, Nanyang Institute of Technology)
  • Haihai Wei (School of Computer and Artificial Intelligence, Zhengzhou University)
  • Li Ma (School of Information Engineering, Nanyang Institute of Technology)
  • Qingji Xue (School of Information Engineering, Nanyang Institute of Technology)
  • Yonghui Fu (School of Computer and Artificial Intelligence, Zhengzhou University)
  • Received : 2022.07.26
  • Accepted : 2022.11.14
  • Published : 2023.08.31

Abstract

Many studies have shown that single image super-resolution (SISR) models trained on synthetic datasets are difficult to apply to real scene text image super-resolution (STISR), because real-world degradation is more complex. The most recent dataset for realistic STISR is TextZoom, but current methods trained on this dataset have not considered the effect of multi-scale features of text images. In this paper, a multi-scale and attention fusion model for realistic STISR is proposed. A multi-scale learning mechanism is introduced to acquire richer feature representations of text images; spatial and channel attention are introduced to capture the local information and inter-channel interaction information of text images; finally, a multi-scale residual attention module is designed by fusing multi-scale learning with the attention mechanisms. Experiments on TextZoom demonstrate that the proposed model increases the average recognition accuracy of the scene text recognizer ASTER by 1.2% compared with the text super-resolution network (TSRN).
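To make the idea of fusing multi-scale learning with channel and spatial attention inside a residual block more concrete, the following NumPy sketch may help. It is not the authors' module: the abstract does not give the architecture, so the two branch kernel sizes (3 and 5), the 1×1 fusion convolution, the CBAM-style attention layout, the channel-then-spatial ordering, the reduction ratio, and every name (`conv2d_same`, `msra_block`, `make_params`, and so on) are assumptions chosen only for illustration.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive zero-padded 'same' convolution (cross-correlation).
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k."""
    c_out, _, k, _ = w.shape
    p = k // 2
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + h, j:j + wd]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Reweight channels from globally pooled statistics (assumed layout)."""
    squeezed = x.mean(axis=(1, 2))              # (C,)
    weights = sigmoid(w2 @ relu(w1 @ squeezed)) # (C,)
    return x * weights[:, None, None]

def spatial_attention(x, w):
    """Reweight positions from channel-wise mean and max maps (assumed layout)."""
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    mask = sigmoid(conv2d_same(desc, w))              # (1, H, W)
    return x * mask

def msra_block(x, params):
    """Hypothetical multi-scale residual attention block."""
    b3 = relu(conv2d_same(x, params["w3"]))     # small receptive field
    b5 = relu(conv2d_same(x, params["w5"]))     # larger receptive field
    fused = conv2d_same(np.concatenate([b3, b5]), params["w_fuse"])  # 1x1 fusion
    fused = channel_attention(fused, params["w_ca1"], params["w_ca2"])
    fused = spatial_attention(fused, params["w_sa"])
    return x + fused                            # residual connection

def make_params(channels, reduction, rng):
    """Random weights for illustration only; nothing here is trained."""
    c, r = channels, channels // reduction
    return {
        "w3": 0.1 * rng.standard_normal((c, c, 3, 3)),
        "w5": 0.1 * rng.standard_normal((c, c, 5, 5)),
        "w_fuse": 0.1 * rng.standard_normal((c, 2 * c, 1, 1)),
        "w_ca1": 0.1 * rng.standard_normal((r, c)),
        "w_ca2": 0.1 * rng.standard_normal((c, r)),
        "w_sa": 0.1 * rng.standard_normal((1, 2, 3, 3)),
    }

rng = np.random.default_rng(0)
params = make_params(channels=4, reduction=2, rng=rng)
x = rng.standard_normal((4, 16, 64))  # arbitrary feature map: (C, H, W)
y = msra_block(x, params)
print(y.shape)
```

The block keeps the input shape, so several such blocks could be stacked before an upsampling stage. Because of the residual connection, the block reduces to the identity when the fusion weights are zero, which is one reason residual designs of this kind tend to be easy to train.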

Acknowledgement

This work was sponsored by the Natural Science Foundation of Henan (No. 222300420504) and the Academic Degrees & Graduate Education Reform Project of Henan Province (No. 2021SJGLX262Y).
