Multi-view learning review: understanding methods and their application

  • Bae, Kang Il (Department of Applied Statistics, Chung-Ang University) ;
  • Lee, Yung Seop (Department of Statistics, Dongguk University) ;
  • Lim, Changwon (Department of Applied Statistics, Chung-Ang University)
  • Received: 2018.11.06
  • Reviewed: 2019.01.03
  • Published: 2019.02.28

Abstract


Multi-view learning considers data from multiple viewpoints and attempts to integrate the different kinds of information those views carry. It has been studied actively in recent years and has often shown performance superior to models learned from a single view. With the introduction of deep learning techniques, multi-view approaches have achieved good results in fields such as image, text, voice, and video analysis. In this study, we introduce how multi-view learning methods solve problems faced in human behavior recognition, medicine, information retrieval, and facial expression recognition. We then review the data integration principles of multi-view learning by classifying traditional methods into data integration, classifier integration, and representation integration. Finally, we examine how CNN, RNN, RBM, Autoencoder, and GAN, the most commonly used deep learning methods, are applied to multi-view learning algorithms, categorizing CNN- and RNN-based methods as supervised learning and RBM-, Autoencoder-, and GAN-based methods as unsupervised learning.
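As a concrete illustration of the classifier-integration principle mentioned above, the following is a minimal Python sketch of co-training (Blum and Mitchell, 1998; see Figure 3.3 below), in which a classifier trained on each view passes its most confident pseudo-labels to the other view. The scikit-learn classifier, confidence threshold, and round counts are illustrative assumptions, not the settings of any method reviewed in the paper.

```python
# Minimal co-training sketch (cf. Blum and Mitchell, 1998); illustrative only.
# X1, X2: feature matrices for the two views (same rows, numpy arrays);
# y: integer labels with -1 marking unlabeled rows (>= 2 classes labeled).
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, rounds=5, per_round=10, threshold=0.95):
    y = y.copy()
    for _ in range(rounds):
        for X in (X1, X2):                   # alternate between the two views
            labeled = y != -1
            if labeled.all():
                return y
            clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
            pool = np.flatnonzero(~labeled)  # rows still unlabeled
            conf = clf.predict_proba(X[pool]).max(axis=1)
            order = np.argsort(-conf)[:per_round]
            sure = pool[order[conf[order] >= threshold]]
            # Confident pseudo-labels become training data for the other view.
            y[sure] = clf.predict(X[sure])
    return y
```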

Figure 3.1. Illustration of consistency and complementary principles (Liu et al., 2015a).

Figure 3.2. Data integration: projection from each view onto a common feature space.

Figure 3.3. Flow chart of co-training.

Figure 3.4. Two kinds of multi-view strategy: (a) one-view-one-network strategy, (b) multi-view-one-network strategy (Kang et al., 2017).

Figure 4.1. The word-level matching CNN (Ma et al., 2015).

Figure 4.2. (a) The simple RNN model, (b) the m-RNN model (Mao et al., 2014).

Figure 5.1. A schematic of DCCA (Andrew et al., 2013).

Figure 5.2. A schematic of DGCCA with deep networks for J views (Benton et al., 2017).

Figure 5.3. A schematic of the multimodal DBM (Srivastava and Salakhutdinov, 2012).

Figure 5.4. Autoencoder model (Jaques et al., 2017).

Figure 5.5. Correspondence autoencoder (Feng et al., 2014).

Figure 5.6. (a) Graphical model of the JMVAE, (b) two approaches to estimating the encoders with a single input, $q_{\phi_x}(z \mid x)$ and $q_{\phi_w}(z \mid w)$, on the JMVAE (Suzuki et al., 2016).

Figure 5.7. Architecture of the text-conditional convolutional GAN (Reed et al., 2016).
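To make the representation-integration idea behind the autoencoder models of Figures 5.4 and 5.5 concrete, here is a hedged PyTorch sketch of a correspondence-autoencoder-style network: each view has its own encoder and decoder, and a penalty on the distance between the two latent codes ties the representations together. The layer sizes, loss weight, and training loop are illustrative assumptions, not the exact architecture of Feng et al. (2014).

```python
# Correspondence-autoencoder-style sketch (cf. Feng et al., 2014); illustrative only.
import torch
import torch.nn as nn

class CorrAE(nn.Module):
    def __init__(self, d1, d2, latent=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(d1, 256), nn.ReLU(), nn.Linear(256, latent))
        self.enc2 = nn.Sequential(nn.Linear(d2, 256), nn.ReLU(), nn.Linear(256, latent))
        self.dec1 = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, d1))
        self.dec2 = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, d2))

    def forward(self, x1, x2):
        z1, z2 = self.enc1(x1), self.enc2(x2)
        return self.dec1(z1), self.dec2(z2), z1, z2

def corr_ae_loss(x1, x2, r1, r2, z1, z2, alpha=0.2):
    mse = nn.functional.mse_loss
    # Per-view reconstruction plus a correspondence term tying the two codes.
    return mse(r1, x1) + mse(r2, x2) + alpha * mse(z1, z2)

model = CorrAE(d1=100, d2=50)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 100), torch.randn(32, 50)  # paired toy batch
for _ in range(10):
    r1, r2, z1, z2 = model(x1, x2)
    loss = corr_ae_loss(x1, x2, r1, r2, z1, z2)
    opt.zero_grad(); loss.backward(); opt.step()
```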

Table 6.1. R & Python Implementations of Multi-view Learning Models

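As one usage example of the off-the-shelf tooling Table 6.1 catalogs, the sketch below applies scikit-learn's linear CCA, which finds projection directions $w_x$ and $w_y$ maximizing $\mathrm{corr}(w_x^\top X, w_y^\top Y)$, to map two synthetic views onto a common space (cf. Figure 3.2). The toy data and the package choice are our illustration, not necessarily an entry from the table.

```python
# Projecting two views onto a common space with linear CCA (scikit-learn).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 2))  # latent signal common to both views
X = shared @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))
Y = shared @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))

cca = CCA(n_components=2).fit(X, Y)
Xc, Yc = cca.transform(X, Y)        # coordinates in the shared space
# Canonical correlations between paired projected components.
print([round(np.corrcoef(Xc[:, k], Yc[:, k])[0, 1], 3) for k in range(2)])
```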

References

  1. Adetiba, E. and Olugbara, O. O. (2015). Lung cancer prediction using neural network ensemble with histogram of oriented gradient genomic features, The Scientific World Journal.
  2. Andrew, G., Arora, R., Bilmes, J., and Livescu, K. (2013). Deep canonical correlation analysis, In International Conference on Machine Learning, 1247-1255.
  3. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.
  4. Bai, B., Weston, J., Grangier, D., Collobert, R., Sadamasa, K., Qi, Y., and Weinberger, K. (2010). Learning to rank with (a lot of) word features, Information Retrieval, 13, 291-314. https://doi.org/10.1007/s10791-009-9117-9
  5. Bengio, Y. (2009). Learning deep architectures for AI, Foundations and Trends in Machine Learning, 2, 1-127. https://doi.org/10.1561/2200000006
  6. Benton, A., Khayrallah, H., Gujral, B., Reisinger, D., Zhang, S., and Arora, R. (2017). Deep generalized canonical correlation analysis, arXiv preprint arXiv:1702.02519.
  7. Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 92-100.
  8. Bokhari, M. U. and Hasan, F. (2013). Multimodal information retrieval: Challenges and future trends, International Journal of Computer Applications, 74(14).
  9. Cai, J., Tang, Y., and Wang, J. (2016). Kernel canonical correlation analysis via gradient descent, Neurocomputing, 182, 322-331. https://doi.org/10.1016/j.neucom.2015.12.039
  10. Cho, K., Van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014a). On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv:1409.1259.
  11. Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014b). Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078.
  12. Cristianini, N., Shawe-Taylor, J., and Lodhi, H. (2002). Latent semantic kernels, Journal of Intelligent Information Systems, 18, 127-152. https://doi.org/10.1023/A:1013625426931
  13. Dasgupta, S., Littman, M. L., and McAllester, D. A. (2002). PAC generalization bounds for co-training. In Advances in Neural Information Processing Systems, 375-382.
  14. Dey, A. (2016). Machine learning algorithms: a review, (IJCSIT) International Journal of Computer Science and Information Technologies, 7, 1174-1179.
  15. Ding, C. and Tao, D. (2015). Robust face recognition via multimodal deep face representation, IEEE Transactions on Multimedia, 17, 2049-2058. https://doi.org/10.1109/TMM.2015.2477042
  16. Doersch, C. (2016). Tutorial on variational autoencoders, arXiv preprint arXiv:1606.05908.
  17. Dou, Q., Chen, H., Yu, L., Qin, J., and Heng, P. A. (2017). Multilevel contextual 3-D CNNs for false positive reduction in pulmonary nodule detection, IEEE Transactions on Biomedical Engineering, 64, 1558-1567. https://doi.org/10.1109/TBME.2016.2613502
  18. Du, J., Ling, C. X., and Zhou, Z. H. (2011). When does cotraining work in real data?, IEEE Transactions on Knowledge and Data Engineering, 23, 788-799. https://doi.org/10.1109/TKDE.2010.158
  19. Elman, J. L. (1990). Finding structure in time, Cognitive Science, 14, 179-211. https://doi.org/10.1207/s15516709cog1402_1
  20. Farquhar, J., Hardoon, D., Meng, H., Shawe-Taylor, J. S., and Szedmak, S. (2006). Two view learning: SVM-2K, theory and practice. In Advances in Neural Information Processing Systems, 355-362.
  21. Feng, F., Wang, X., and Li, R. (2014). Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM International Conference on Multimedia, 7-16, ACM.
  22. Freund, Y. and Haussler, D. (1992). Unsupervised learning of distributions on binary vectors using two layer networks. In Advances in Neural Information Processing Systems, 912-919.
  23. Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., and Mikolov, T. (2013). DeViSE: a deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, 2121-2129.
  24. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672-2680.
  25. Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. (2004). Canonical correlation analysis: an overview with application to learning methods, Neural Computation, 16, 2639-2664. https://doi.org/10.1162/0899766042321814
  26. Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks, Science, 313, 504-507. https://doi.org/10.1126/science.1127647
  27. Hinton, G. E. and Salakhutdinov, R. R. (2009). Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, 1607-1614.
  28. Horst, P. (1961). Generalized canonical correlations and their applications to experimental data, Journal of Clinical Psychology, 17, 331-347. https://doi.org/10.1002/1097-4679(196110)17:4<331::AID-JCLP2270170402>3.0.CO;2-D
  29. Ioffe, S. and Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167.
  30. Jain, A., Zamir, A. R., Savarese, S., and Saxena, A. (2016). Structural-RNN: Deep learning on spatiotemporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5308-5317.
  31. Jaques, N., Taylor, S., Sano, A., and Picard, R. (2017). Multimodal autoencoder: a deep learning approach to filling in missing sensor data and enabling better mood prediction. In Proceedings of International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, Texas.
  32. Jeni, L. A., Girard, J. M., Cohn, J. F., and De La Torre, F. (2013). Continuous AU intensity estimation using localized, sparse facial feature space. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 1-7.
  33. Kang, G., Liu, K., Hou, B., and Zhang, N. (2017). 3D multi-view convolutional neural networks for lung nodule classification, PLoS One, 12(11).
  34. Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114.
  35. Kiros, R., Popuri, K., Cobzas, D., and Jagersand, M. (2014). Stacked multiscale feature learning for domain independent medical image segmentation. In International Workshop on Machine Learning in Medical Imaging, 25-32, Springer, Cham.
  36. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency, The Annals of Mathematical Statistics, 22, 79-86. https://doi.org/10.1214/aoms/1177729694
  37. Kumari, J., Rajesh, R., and Pooja, K. M. (2015). Facial expression recognition: a survey, Procedia Computer Science, 58, 486-491. https://doi.org/10.1016/j.procs.2015.08.011
  38. Lahat, D., Adali, T., and Jutten, C. (2015). Multimodal data fusion: an overview of methods, challenges, and prospects, Proceedings of the IEEE, 103, 1449-1477. https://doi.org/10.1109/JPROC.2015.2460697
  39. Li, Y., Yang, M., and Zhang, Z. (2016). Multi-view representation learning: a survey from shallow methods to deep methods, arXiv preprint arXiv:1610.01206.
  40. Liu, J., Jiang, Y., Li, Z., Zhou, Z. H., and Lu, H. (2015a). Partially shared latent factor learning with multiview data, IEEE Transactions on Neural Networks and Learning Systems, 26, 1233-1246. https://doi.org/10.1109/TNNLS.2014.2335234
  41. Liu, S., Liu, S., Cai, W., Che, H., Pujol, S., Kikinis, R., and Fulham, M. J. (2015b). Multimodal neuroimaging feature learning for multiclass diagnosis of Alzheimer's disease, IEEE Transactions on Biomedical Engineering, 62, 1132-1140. https://doi.org/10.1109/TBME.2014.2372011
  42. Lorincz, A., Jeni, L., Szabo, Z., Cohn, J., and Kanade, T. (2013). Emotional expression classification using time-series kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 889-895.
  43. Ma, L., Lu, Z., Shang, L., and Li, H. (2015). Multimodal convolutional neural networks for matching image and sentence. In Proceedings of the IEEE International Conference on Computer Vision, 2623-2631.
  44. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. (2014). Deep captioning with multimodal recurrent neural networks (m-RNN), arXiv preprint arXiv:1412.6632.
  45. Neverova, N., Wolf, C., Taylor, G., and Nebout, F. (2016). ModDrop: adaptive multi-modal gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 1692-1706. https://doi.org/10.1109/TPAMI.2015.2461544
  46. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. (2011). Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 689-696.
  47. Nigam, K. and Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, 86-93.
  48. Pohl, C., Ali, R. M., Chand, S. J. H., Tamin, S. S., Nazirun, N. N. N., and Supriyanto, E. (2014). Interdisciplinary approach to multimodal image fusion for vulnerable plaque detection, In Biomedical Engineering and Sciences (IECBES), 2014 IEEE Conference on, 11-16.
  49. Qi, C. R., Su, H., Nießner, M., Dai, A., Yan, M., and Guibas, L. J. (2016). Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5648-5656.
  50. Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434.
  51. Ramachandram, D. and Taylor, G. W. (2017). Deep multimodal learning: a survey on recent advances and trends, IEEE Signal Processing Magazine, 34, 96-108. https://doi.org/10.1109/MSP.2017.2738401
  52. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016). Generative adversarial text to image synthesis, arXiv preprint arXiv:1605.05396.
  53. Schmidhuber, J. (2015). Deep learning in neural networks: an overview, Neural Networks, 61, 85-117. https://doi.org/10.1016/j.neunet.2014.09.003
  54. Shen, D., Wu, G., and Suk, H. I. (2017). Deep learning in medical image analysis, Annual Review of Biomedical Engineering, 19, 221-248. https://doi.org/10.1146/annurev-bioeng-071516-044442
  55. Sindhwani, V., Niyogi, P., and Belkin, M. (2005). A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of ICML Workshop on Learning with Multiple Views, 74-79, Citeseer.
  56. Sitova, Z., Sedenka, J., Yang, Q., Peng, G., Zhou, G., Gasti, P., and Balagani, K. S. (2016). HMOG: new behavioral biometric features for continuous authentication of smartphone users, IEEE Transactions on Information Forensics and Security, 11, 877-892. https://doi.org/10.1109/TIFS.2015.2506542
  57. Smolensky, P. (1986). Information processing in dynamical systems: foundations of harmony theory (No. CU-CS-321-86), Department of Computer Science, University of Colorado at Boulder.
  58. Sousa, R. T. and Gama, J. (2017). Comparison between Co-training and Self-training for single-target regression in data streams using AMRules.
  59. Srivastava, N. and Salakhutdinov, R. R. (2012). Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, 2222-2230.
  60. Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E. (2015). Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, 945-953.
  61. Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104-3112.
  62. Suzuki, M., Nakayama, K., and Matsuo, Y. (2016). Joint multimodal learning with deep generative models, arXiv preprint arXiv:1611.01891.
  63. Suzuki, M., Nakayama, K., and Matsuo, Y. (2018). Improving Bi-directional Generation between Different Modalities with Variational Autoencoders, arXiv preprint arXiv:1801.08702.
  64. Tatulli, E. and Hueber, T. (2017). Feature extraction using multimodal convolutional neural networks for visual speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 2971-2975.
  65. Taylor, G. W., Fergus, R., LeCun, Y., and Bregler, C. (2010). Convolutional learning of spatio-temporal features. In European Conference on Computer Vision, 140-153, Springer, Berlin, Heidelberg.
  66. Uludag, K. and Roebroeck, A. (2014). General overview on the merits of multimodal neuroimaging data fusion, Neuroimage, 102, 3-10. https://doi.org/10.1016/j.neuroimage.2014.05.018
  67. Vargas, R., Mosavi, A., and Ruiz, L. (2017). Deep learning: a review, Advances in Intelligent Systems and Computing.
  68. Vrigkas, M., Nikou, C., and Kakadiaris, I. A. (2015). A review of human activity recognition methods, Frontiers in Robotics and AI, 2, 28.
  69. Wang, W., Ooi, B. C., Yang, X., Zhang, D., and Zhuang, Y. (2014). Effective multi-modal retrieval based on stacked auto-encoders, Proceedings of the VLDB Endowment, 7, 649-660.
  70. Xu, C., Tao, D., and Xu, C. (2013). A survey on multi-view learning, arXiv preprint arXiv:1304.5634.
  71. Yu, Y., Lin, H., Meng, J., Wei, X., Guo, H., and Zhao, Z. (2017). Deep transfer learning for modality classification of medical images, Information, 8, 91. https://doi.org/10.3390/info8030091
  72. Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., and Metaxas, D. (2017). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks, arXiv preprint.
  73. Zhang, W., Zhang, Y., Ma, L., Guan, J., and Gong, S. (2015). Multimodal learning for facial expression recognition, Pattern Recognition, 48, 3191-3202. https://doi.org/10.1016/j.patcog.2015.04.012
  74. Zhao, J., Xie, X., Xu, X., and Sun, S. (2017). Multi-view learning overview: recent progress and new challenges, Information Fusion, 38, 43-54. https://doi.org/10.1016/j.inffus.2017.02.007
  75. Zhou, Z. H. and Li, M. (2005). Tri-training: exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge and Data Engineering, 17, 1529-1541.
  76. Zhu, X. (2006). Semi-supervised learning literature survey, Computer Science, University of Wisconsin-Madison, 2, 4.
  77. Zhu, X. and Goldberg, A. B. (2007). Introduction to semi-supervised learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, 3, 1-130.