A Study on the Web Building Assistant System Using GUI Object Detection and Large Language Model

  • Hyun-Cheol Jang (Dept. of Big Data Convergence, Korea University Graduate School of Computer and Information Technology) ;
  • Hyungkuk Jang (Dept. of Big Data Convergence, Korea University Graduate School of Computer and Information Technology)
  • Published : 2024.05.23

Abstract

As Large Language Models (LLMs) such as OpenAI's ChatGPT[1] continue to grow in popularity, new applications and services are expected to emerge. This paper presents an experimental study of a smart web-builder assistance system that combines computer-vision-based GUI object detection with an LLM (ChatGPT). The research strategy applies computer vision technology in conjunction with the design principles of Microsoft's "ChatGPT for Robotics: Design Principles and Model Abilities"[2]. In addition, this research explores the capabilities of LLMs like ChatGPT across a range of application design tasks, specifically in assisting with web-builder tasks. The study examines ChatGPT's ability to synthesize code through both directed prompts and free-form conversation strategies. The researchers also explored ChatGPT's ability to perform various tasks within the builder domain, including function composition, closed-loop inference, and basic logical and mathematical reasoning. Overall, this research proposes an efficient way to perform various application system tasks by combining natural-language commands with computer vision technology and an LLM (ChatGPT), allowing users to interact through natural-language commands while building applications.
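The pipeline described above can be sketched as follows: a GUI object detector (e.g. a YOLO-style model[10] plus OCR[11]) produces a structured description of the current page, which is serialized together with a small, named function library into a single prompt for the LLM, following the "ChatGPT for Robotics"[2] prompting pattern. The detector output format, the function names (`move_element`, `set_text`), and the prompt layout below are illustrative assumptions, not the paper's actual implementation; the detector and the LLM call itself are stubbed out.

```python
from dataclasses import dataclass

@dataclass
class GuiObject:
    """One detected GUI element (hypothetical detector output format)."""
    label: str            # element class from the detector, e.g. "button"
    text: str             # OCR text inside the element's bounding box
    box: tuple            # (x1, y1, x2, y2) pixel coordinates

# A small API exposed to the LLM, in the style of the "ChatGPT for Robotics"
# design principles: the model is told to answer only with calls to these
# named functions, which the builder then executes.
API_DOC = (
    "You can control the web builder only through these functions:\n"
    "  move_element(id, x, y)\n"
    "  set_text(id, text)\n"
    "Answer only with calls to these functions."
)

def build_prompt(objects: list, command: str) -> str:
    """Combine the detected GUI state and a user command into one prompt."""
    state = "\n".join(
        f"[{i}] {o.label} '{o.text}' at {o.box}" for i, o in enumerate(objects)
    )
    return f"{API_DOC}\n\nCurrent page elements:\n{state}\n\nUser command: {command}"

# Example: two detected elements and one natural-language command.
objects = [
    GuiObject("text_field", "Email", (40, 200, 360, 240)),
    GuiObject("button", "Sign up", (40, 300, 160, 340)),
]
prompt = build_prompt(objects, "Move the sign-up button below the email field.")
print(prompt)
```

The resulting string would then be sent to the LLM; its response (a sequence of `move_element`/`set_text` calls) can be parsed and executed against the builder, closing the loop between detection, language, and action.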

References

  1. OpenAI, "ChatGPT," 2022. URL: https://openai.com/blog/chatgpt.
  2. S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, "ChatGPT for robotics: Design principles and model abilities," Tech. Rep. MSR-TR-2023-8, Microsoft, February 2023.
  3. S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. Ben Amor, "Language-conditioned imitation learning for robot manipulation tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 13139-13150, 2020.
  4. B. Wang, G. Li, and Y. Li, "Enabling conversational interaction with mobile ui using large language models," arXiv preprint arXiv:2209.08655, 2022.
  5. J. Chen, M. Xie, Z. Xing, C. Chen, X. Xu, L. Zhu, and G. Li, "Object detection for graphical user interface: Old fashioned or deep learning or a combination?," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1202-1214, 2020.
  6. M. Xie, S. Feng, Z. Xing, J. Chen, and C. Chen, "UIED: A hybrid tool for GUI element detection," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1655-1659, 2020.
  7. M. S. Manikanta, J. Rushi, A. Lalitha, B. S. K. Goud, V. Suresh, and T. Daniya, "Web based e-commerce system integrated with chatbot,"
  8. I. Z. I. Elwazer, "Comparison of adi website builders," 2023.
  9. CAFE24, "D.collection," 2023. URL: https://dcollection.cafe24.com/.
  10. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," arXiv preprint arXiv:1506.02640, 2015.
  11. Google, "Google OCR." URL: https://cloud.google.com/vision/docs/ocr.
  12. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Roziere, N. Goyal, E. Hambro, F. Azhar, et al., "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.