References
- S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, "Cows on pasture: baselines and benchmarks for language-driven zero-shot object navigation," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, pp. 23171-23181, 2023, DOI: 10.1109/CVPR52729.2023.02219.
- V. S. Dorbala, J. F. Mullen, and D. Manocha, "Can an embodied agent find your 'Cat-shaped Mug'? LLM-based zero-shot object navigation," IEEE Robotics and Automation Letters, vol. 9, no. 5, pp. 4083-4090, May 2024, DOI: 10.1109/LRA.2023.3346800.
- K. Zhou, K. Zheng, C. Pryor, Y. Shen, H. Jin, L. Getoor, and X. E. Wang, "ESC: exploration with soft commonsense constraints for zero-shot object navigation," 40th International Conference on Machine Learning (ICML), Honolulu, Hawaii, USA, pp. 42829-42842, 2023, [Online], https://dl.acm.org/doi/10.5555/3618408.3620214.
- M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, X. Wang, X. Zhai, T. Kipf, and N. Houlsby, "Simple open-vocabulary object detection," European Conference on Computer Vision (ECCV), Tel Aviv, Israel, pp. 728-755, 2022, DOI: 10.1007/978-3-031-20080-9_42.
- L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, K.-W. Chang, and J. Gao, "Grounded language-image pre-training," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, Louisiana, USA, pp. 10965-10975, 2022, DOI: 10.1109/CVPR52688.2022.01069.
- P. Wu, Y. Mu, B. Wu, Y. Hou, J. Ma, S. Zhang, and C. Liu, "VoroNav: Voronoi-based zero-shot object navigation with large language model," arXiv preprint arXiv:2401.02695, 2024, DOI: 10.48550/arXiv.2401.02695.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," 34th International Conference on Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020, [Online], https://dl.acm.org/doi/abs/10.5555/3495724.3495883.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023, DOI: 10.48550/arXiv.2307.09288.
- J. Wu, Y. Jiang, Q. Liu, Z. Yuan, X. Bai, and S. Bai, "General object foundation model for images and videos at scale," arXiv preprint arXiv:2312.09158, 2024, DOI: 10.48550/arXiv.2312.09158.
- M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford, L. Weihs, M. Yatskar, and A. Farhadi, "RoboTHOR: An open simulation-to-real embodied AI platform," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 3164-3174, 2020, DOI: 10.1109/CVPR42600.2020.00323.
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 779-788, 2016, DOI: 10.1109/CVPR.2016.91.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," European Conference on Computer Vision (ECCV), pp. 213-229, 2020, DOI: 10.1007/978-3-030-58452-8_13.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: transformers for image recognition at scale," 2021 International Conference on Learning Representations (ICLR), 2021, [Online], https://openreview.net/forum?id=YicbFdNTTy.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, pp. 10012-10022, 2021, DOI: 10.1109/ICCV48922.2021.00986.
- H. Du, X. Yu, and L. Zheng, "Learning object relation graph and tentative policy for visual navigation," European Conference on Computer Vision (ECCV), pp. 19-34, 2020, DOI: 10.1007/978-3-030-58571-6_2.
- S. Zhang, X. Song, Y. Bai, W. Li, Y. Chu, and S. Jiang, "Hierarchical object-to-zone graph for object navigation," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, pp. 15130-15140, 2021, DOI: 10.1109/ICCV48922.2021.01485.