
Utilizing AI Foundation Models for Language-Driven Zero-Shot Object Navigation Tasks


  • Received : 2024.05.20
  • Accepted : 2024.08.16
  • Published : 2024.08.30

Abstract

In this paper, we propose an agent model for Language-Driven Zero-Shot Object Navigation (L-ZSON) tasks, in which the agent takes a free-form language description of an unseen target object and navigates an unexplored environment to find it. In general, an L-ZSON agent should be able to visually ground the target object by understanding its free-form language description and recognizing the corresponding visual object in camera images. Moreover, the agent should also be able to build a rich spatial context map of the unknown environment and decide on efficient exploration actions based on that map until the target object enters its field of view. To address these challenges, we propose AML (Agent Model for L-ZSON), a novel L-ZSON agent model that makes effective use of AI foundation models such as Large Language Models (LLMs) and Vision-Language Models (VLMs). To tackle the visual grounding of the target object description, our agent model employs GLEE, a VLM pretrained to locate and identify arbitrary objects in images and videos in open-world scenarios. To address the exploration policy, the proposed agent model leverages the commonsense knowledge of an LLM to make sequential navigational decisions. Through various quantitative and qualitative experiments on RoboTHOR, a 3D simulation platform, and PASTURE, an L-ZSON benchmark dataset, we demonstrate the superior performance of the proposed agent model.
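To make the two-stage design concrete, the following is a minimal Python sketch of the perceive-then-explore loop the abstract describes. All component names and interfaces here (the detector, map, and LLM wrappers, and the confidence threshold) are illustrative assumptions, not the authors' actual AML implementation.

```python
# A minimal sketch of the perceive-then-explore loop described in the abstract.
# All names and interfaces (detector.detect, context_map, llm.complete, the 0.5
# confidence threshold) are hypothetical stand-ins, not the authors' AML code.
from dataclasses import dataclass


@dataclass
class Detection:
    label: str          # object phrase matched by the detector
    confidence: float   # detector score in [0, 1]
    bbox: tuple         # (x1, y1, x2, y2) in image coordinates


class LZSONAgent:
    """Grounds a free-form target description with a VLM and delegates
    exploration decisions to an LLM, as the abstract outlines."""

    def __init__(self, detector, context_map, llm):
        self.detector = detector  # GLEE-style open-vocabulary detector (VLM)
        self.map = context_map    # spatial context map, built incrementally
        self.llm = llm            # LLM queried for commonsense exploration

    def step(self, rgb_frame, target_description):
        # 1. Visual grounding: check whether the free-form description
        #    matches any object in the current camera frame.
        detections = self.detector.detect(rgb_frame, prompt=target_description)
        hits = [d for d in detections if d.confidence > 0.5]
        if hits:
            # Target is in the field of view: navigate toward it.
            return ("NavigateTo", hits[0].bbox)

        # 2. Map update: fold the newly observed objects into the map.
        self.map.update(rgb_frame, detections)

        # 3. Exploration policy: ask the LLM to pick the next region,
        #    relying on commonsense priors (e.g., a mug near a kitchen).
        prompt = (
            f"Target object: {target_description}\n"
            f"Regions observed so far: {self.map.summary()}\n"
            "Name the single most promising unexplored region to visit next."
        )
        next_region = self.llm.complete(prompt).strip()
        return ("ExploreRegion", next_region)
```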

Keywords

References

  1. S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, "CoWs on Pasture: Baselines and benchmarks for language-driven zero-shot object navigation," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, pp. 23171-23181, 2023, DOI: 10.1109/CVPR52729.2023.02219.
  2. V. S. Dorbala, J. F. Mullen, and D. Manocha, "Can an embodied agent find your 'Cat-shaped Mug'? LLM-based zero-shot object navigation," IEEE Robotics and Automation Letters, vol. 9, no. 5, pp. 4083-4090, May 2024, DOI: 10.1109/LRA.2023.3346800.
  3. K. Zhou, K. Zheng, C. Pryor, Y. Shen, H. Jin, L. Getoor, and X. E. Wang, "ESC: Exploration with soft commonsense constraints for zero-shot object navigation," 40th International Conference on Machine Learning (ICML), Honolulu, Hawaii, USA, pp. 42829-42842, 2023, [Online], https://dl.acm.org/doi/10.5555/3618408.3620214.
  4. M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, X. Wang, X. Zhai, T. Kipf, and N. Houlsby, "Simple open-vocabulary object detection," European Conference on Computer Vision (ECCV), Tel Aviv, Israel, pp. 728-755, 2022, DOI: 10.1007/978-3-031-20080-9_42.
  5. L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, K.-W. Chang, and J. Gao, "Grounded language-image pre-training," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, Louisiana, USA, pp. 10965-10975, 2022, DOI: 10.1109/CVPR52688.2022.01069.
  6. P. Wu, Y. Mu, B. Wu, Y. Hou, J. Ma, S. Zhang, and C. Liu, "VoroNav: Voronoi-based zero-shot object navigation with large language model," arXiv preprint arXiv:2401.02695, 2024, DOI: 10.48550/arXiv.2401.02695.
  7. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," 34th International Conference on Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020, [Online], https://dl.acm.org/doi/abs/10.5555/3495724.3495883.
  8. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023, DOI: 10.48550/arXiv.2307.09288. 
  9. J. Wu, Y. Jiang, Q. Liu, Z. Yuan, X. Bai, and S. Bai, "General object foundation model for images and videos at scale," arXiv preprint arXiv:2312.09158, 2024, DOI: 10.48550/arXiv.2312.09158.
  10. M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford, L. Weihs, M. Yatskar, and A. Farhadi, "RoboTHOR: An open simulation-to-real Embodied AI platform," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 3164-3174, 2020, DOI: 10.1109/CVPR42600.2020.00323.
  11. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 779-788, 2016, DOI: 10.1109/CVPR.2016.91.
  12. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," European Conference on Computer Vision (ECCV), Glasgow, UK, pp. 213-229, 2020, DOI: 10.1007/978-3-030-58452-8_13.
  13. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: transformers for image recognition at scale," 2021 International Conference on Learning Representations (ICLR), 2021, [Online], https://openreview.net/forum?id=YicbFdNTTy. 
  14. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, pp. 10012-10022, 2021, DOI: 10.1109/ICCV48922.2021.00986.
  15. H. Du, X. Yu, and L. Zheng, "Learning object relation graph and tentative policy for visual navigation," European Conference on Computer Vision (ECCV), Glasgow, UK, pp. 19-34, 2020, DOI: 10.1007/978-3-030-58571-6_2.
  16. S. Zhang, X. Song, Y. Bai, W. Li, Y. Chu, and S. Jiang, "Hierarchical object-to-zone graph for object navigation," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, pp. 15130-15140, 2021, DOI: 10.1109/ICCV48922.2021.01485.