Leveraging Reinforcement Learning for LLM-based Automated Software Vulnerability Repair

  • Woorim Han (Dept. of Electrical and Computer Engineering and Inter-University Semiconductor Research Center (ISRC), Seoul National University)
  • Miseon Yu (Dept. of Electrical and Computer Engineering and Inter-University Semiconductor Research Center (ISRC), Seoul National University)
  • Yunheung Paek (Dept. of Electrical and Computer Engineering and Inter-University Semiconductor Research Center (ISRC), Seoul National University)
  • Published: 2024.10.31

Abstract

Software vulnerabilities impose a significant burden on developers, particularly during debugging and maintenance. Automated software vulnerability repair has emerged as a promising way to mitigate these challenges. Recent advances have introduced learning-based approaches that take a vulnerable function and its Common Weakness Enumeration (CWE) type as input and generate a repaired function as output. These approaches typically fine-tune large pre-trained language models to produce vulnerability patches, with performance evaluated by Exact Match (EM) and CodeBLEU scores measuring similarity to ground-truth patches. However, current methods rely on teacher forcing during fine-tuning: at training time the decoder is conditioned on ground-truth prefix tokens, whereas at inference it must condition on tokens it has generated itself, a mismatch known as exposure bias. Additionally, models are trained with the cross-entropy loss but evaluated with discrete, non-differentiable metrics, so the training objective does not match the test objective. This mismatch can yield inconsistent results, because the model is never directly optimized for the metrics that determine test-time performance. To address these discrepancies, we propose the use of reinforcement learning (RL) to optimize patch generation. By using the CodeBLEU score directly as a reward signal during training, our approach encourages the generation of higher-quality patches that align more closely with the evaluation metrics, thereby improving overall performance.

Acknowledgement

This work was supported by the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2024, and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00277326). It was also supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the artificial intelligence semiconductor support program to nurture the best talents (IITP-2023-RS-2023-00256081) grant funded by the Korea government (MSIT), by the Inter-University Semiconductor Research Center (ISRC), and by the IITP grant funded by the Korea government (MSIT) (No. RS-2024-00438729, Development of Full Lifecycle Privacy-Preserving Techniques using Anonymized Confidential Computing).
