Natural Photography Generation with Text Guidance from Spherical Panorama Image

Kim, Beomseok;Jung, Jinwoong;Hong, Eunbin;Cho, Sunghyun;Lee, Seungyong;

doi:10.15701/kcgs.2017.23.3.65

Journal of the Korea Computer Graphics Society (한국컴퓨터그래픽스학회논문지)

Volume 23 Issue 3
/
Pages.65-75
/
2017
/
1975-7883(pISSN)
/
2383-529X(eISSN)

Korea Computer Graphics Society (한국컴퓨터그래픽스학회)

DOI QR Code

Natural Photography Generation with Text Guidance from Spherical Panorama Image

360 영상으로부터 텍스트 정보를 이용한 자연스러운 사진 생성

Kim, Beomseok (POSTECH) ;
Jung, Jinwoong (POSTECH) ;
Hong, Eunbin (POSTECH) ;
Cho, Sunghyun (DGIST) ;
Lee, Seungyong (POSTECH)

김범석 (포항공과대학교) ;
정진웅 (포항공과대학교) ;
홍은빈 (포항공과대학교) ;
조성현 (대구경북과학기술원) ;
이승용 (포항공과대학교)

Received : 2017.06.24
Accepted : 2017.07.06
Published : 2017.07.14

https://doi.org/10.15701/kcgs.2017.23.3.65 Citation PDF KSCI

Download PDF

⟨ Previous Next ⟩

Abstract

As a 360-degree image carries information of all directions, it often has too much information. Moreover, in order to investigate a 360-degree image on a 2D display, a user has to either click and drag the image with a mouse, or project it to a 2D panorama image, which inevitably introduces severe distortions. In consequence, investigating a 360-degree image and finding an object of interest in such a 360-degree image could be a tedious task. To resolve this issue, this paper proposes a method to find a region of interest and produces a 2D naturally looking image from a given 360-degree image that best matches a description given by a user in a natural language sentence. Our method also considers photo composition so that the resulting image is aesthetically pleasing. Our method first converts a 360-degree image to a 2D cubemap. As objects in a 360-degree image may appear distorted or split into multiple pieces in a typical cubemap, leading to failure of detection of such objects, we introduce a modified cubemap. Then our method applies a Long Short Term Memory (LSTM) network based object detection method to find a region of interest with a given natural language sentence. Finally, our method produces an image that contains the detected region, and also has aesthetically pleasing composition.

360 영상은 상하좌우 모든 영역에 대한 정보를 갖고 있기 때문에 종종 지나치게 많은 정보를 포함하게 된다. 또한 360 영상의 내용을 2D 모니터를 이용하여 확인하기 위해서는 마우스를 이용하여 360 영상을 돌려 봐야 하거나, 또는 심하게 왜곡된 2D 영상으로 변환해서 봐야 하는 문제가 있다. 따라서 360 영상에서 사용자가 원하는 물체를 찾는 것은 상당히 까다로운 일이 될 수 있다. 본 논문은 물체나 영역을 묘사하는 문장이 주어졌을 때, 360 영상 내에서 문장과 가장 잘 어울리는 영상을 추출해 내는 방법을 제시한다. 본 논문에서 제시한 방법은 주어진 문장 뿐 아니라 구도 역시 고려하여 구도 면에서도 보기 좋은 결과 영상을 생성한다. 본 논문에서 제시하는 방법은 우선 360 영상을 2D 큐브맵으로 변환한다. 일반적인 큐브맵은 큐브맵의 경계 부분에 걸쳐 있는 물체가 있을 경우, 이를 검출하기 어려운 문제가 있다. 따라서 더 정확한 물체 검출을 위해 본 논문에서는 변형된 큐브맵을 제시한다. 이렇게 변형된 큐브맵에 Long Short Term Memory (LSTM) 네트워크 기반의 자연어 문장을 이용한 물체 검출 방법을 적용한다. 최종적으로 원래의 360영상에서 검출된 영역을 포함하면서도 영상 구도 면에서 보기 좋은 영역을 찾아서 결과 영상을 생성한다.

Keywords

References

R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
J. Dai, "R-FCN: Object detection via region-based fully convolutional networks," arXiv preprint arXiv:1605.06409, 2016.
J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625-2634.
R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell, "Natural language object retrieval," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4555-4564.
J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013. https://doi.org/10.1007/s11263-013-0620-5
L. Liu, R. Chen, L. Wolf, and D. Cohen-Or, "Optimizing photo composition," in Computer Graphics Forum, vol. 29, no. 2. Wiley Online Library, 2010, pp. 469-478. https://doi.org/10.1111/j.1467-8659.2009.01616.x
C. L. Zitnick and P. Dollar, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision. Springer, 2014, pp. 391-405.
S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156-3164.
K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048-2057.
J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-rnn)," arXiv preprint arXiv:1412.6632, 2014.
S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, "Person search with natural language description," arXiv preprint arXiv:1702.05729, 2017.
M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, "Bing: Binarized normed gradients for objectness estimation at 300fps," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286-3293.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015. https://doi.org/10.1007/s11263-015-0816-y
K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740-755.
S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg, "Referitgame: Referring to objects in photographs of natural scenes." in EMNLP, 2014, pp. 787-798.
J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, "Recognizing scene viewpoint using panoramic place representation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2695-2702.