
Deep Learning-based Professional Image Interpretation Using Expertise Transplant

  • Kim, Taejin (Graduate School of Business IT, Kookmin University) ;
  • Kim, Namgyu (School of Management Information Systems, Kookmin University)
  • Received : 2020.05.12
  • Accepted : 2020.06.20
  • Published : 2020.06.30

Abstract

Recently, as deep learning has attracted attention, it is being considered as a method for solving problems in various fields. In particular, deep learning is known to perform well when applied to unstructured data such as text, sound, and images, and many studies have proven its effectiveness. Owing to the remarkable development of text and image deep learning technology, interest in image captioning and its applications is increasing rapidly. Image captioning is a technique that automatically generates relevant captions for a given image by handling image comprehension and text generation simultaneously. Despite its high entry barrier, which requires analysts to process both image and text data, image captioning has established itself as one of the key fields of AI research owing to its wide applicability, and many studies have been conducted to improve its performance in various respects. Recent studies attempt to create advanced captions that not only describe an image accurately but also convey the information contained in it more sophisticatedly.

Despite these efforts, it is difficult to find studies that interpret images from the perspective of domain experts rather than that of the general public. Even for the same image, the parts that attract attention differ according to the professional field of the viewer, and the way the image is interpreted and expressed also differs according to the level of expertise. The general public tends to recognize an image from a holistic perspective, that is, by identifying its constituent objects and their relationships. In contrast, domain experts tend to focus on the specific elements necessary to interpret the given image based on their expertise. This implies that the meaningful parts of an image differ depending on the viewer's perspective even for the same image, and image captioning should reflect this phenomenon.

Therefore, in this study, we propose a method to generate domain-specialized captions for an image by utilizing the expertise of experts in the corresponding domain. Specifically, after pre-training on a large amount of general data, the expertise of the field is transplanted through transfer learning with a small amount of expertise data. However, naive application of transfer learning to expertise data may introduce another problem: learning simultaneously with captions of various characteristics may cause the so-called 'inter-observation interference' problem, which makes it difficult to learn each characteristic viewpoint purely. When learning on a vast amount of data, most of this interference cancels itself out and has little impact on the results; in fine-tuning on a small amount of data, however, its impact can be relatively large. To solve this problem, we propose a novel 'Characteristic-Independent Transfer Learning' that performs transfer learning independently for each characteristic.
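The abstract describes the method only at a high level; the following is a minimal, hypothetical PyTorch sketch of the two-stage idea, not the authors' implementation. A toy encoder-decoder captioner is pre-trained on general image-caption pairs, and expertise is then transplanted by fine-tuning an independent copy of the model for each caption characteristic, so that captions of other characteristics cannot interfere. All module sizes, characteristic names, and data below are illustrative assumptions.

```python
import copy

import torch
import torch.nn as nn


class Captioner(nn.Module):
    """Toy encoder-decoder captioner: CNN-style encoder + LSTM decoder."""

    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(              # stand-in for a pretrained CNN
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        h0 = self.encoder(images).unsqueeze(0)     # image feature as initial state
        states, _ = self.decoder(self.embed(captions), (h0, torch.zeros_like(h0)))
        return self.out(states)                    # per-token vocabulary logits


def fine_tune(model, images, captions, epochs=5, lr=1e-4):
    """Plain cross-entropy training with teacher forcing on a caption set."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = model(images, captions[:, :-1])
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       captions[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return model


# 1) Pre-training on a large general corpus (MSCOCO in the paper); toy data here.
general = Captioner()
images = torch.randn(8, 3, 64, 64)
captions = torch.randint(0, 1000, (8, 12))
fine_tune(general, images, captions, epochs=1)

# 2) Characteristic-independent transfer: one independent copy per expertise
#    characteristic, each fine-tuned only on its own small caption set, so
#    captions of other characteristics cannot interfere with its learning.
experts = {}
for name in ["emotion", "relationship"]:           # hypothetical characteristics
    expert = copy.deepcopy(general)
    expert.encoder.requires_grad_(False)           # keep visual features fixed
    expert_captions = torch.randint(0, 1000, (4, 12))  # small expertise data
    experts[name] = fine_tune(expert, images[:4], expert_captions)
```

Because each characteristic gets its own copy of the model, gradients from one characteristic's captions never touch another's weights, which is the sense in which the transfer learning is characteristic-independent.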
To confirm the feasibility of the proposed methodology, we performed experiments using the results of pre-training on the MSCOCO dataset, which comprises about 120,000 images and roughly 600,000 general captions. Additionally, following the advice of an art therapist, about 300 'image / expertise caption' pairs were created and used for the expertise transplantation experiments. The experiments confirmed that captions generated according to the proposed methodology reflect the perspective of the transplanted expertise, whereas captions generated by learning on general data alone contain much content irrelevant to expert interpretation. In this paper, we propose specialized image interpretation as a novel research goal and, to achieve it, present a way of using transfer learning to generate captions specialized for a specific domain. By applying the proposed methodology to expertise transplantation in various fields, we expect future research to address the shortage of expertise data and to further improve the performance of image captioning.
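Continuing the sketch above, and under the same illustrative assumptions, the qualitative comparison reported in the experiments could be reproduced by decoding the same image with the general model and with an expertise-transplanted copy:

```python
@torch.no_grad()
def greedy_caption(model, image, start_id=1, end_id=2, max_len=15):
    """Greedy decoding: repeatedly feed back the argmax token until <end>."""
    tokens = [start_id]
    for _ in range(max_len):
        logits = model(image.unsqueeze(0), torch.tensor([tokens]))
        next_id = int(logits[0, -1].argmax())
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]                              # drop the <start> token


image = torch.randn(3, 64, 64)
print("general caption ids:", greedy_caption(general, image))
print("expert caption ids :", greedy_caption(experts["emotion"], image))
```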

