Abstract
Common Weakness Enumerations (CWEs) and Common Vulnerabilities and Exposures (CVEs) are open knowledge bases that provide definitions, descriptions, and samples of code vulnerabilities. Combining Large Language Models (LLMs) with vulnerability knowledge bases can enhance and automate code vulnerability repair. Several key factors come into play in this setting, including (1) retrieving the context most relevant to a specific vulnerable code snippet; (2) augmenting LLM prompts with the retrieved context; and (3) the form of the generated artifact, such as a code repair with a natural-language explanation or a code repair alone. Artifacts produced under these factors often lack transparency and explainability regarding the rationale behind the repair. In this paper, we propose an LLM-enabled framework for explainable recommendation of vulnerable-code repairs, with techniques addressing each factor. Our method is data-driven: the characteristics of the selected CWE and CVE datasets and of the knowledge base determine the best retrieval strategies. Across 100 experiments, we observe that state-of-the-art (SOTA) metrics are inadequate for differentiating low-quality from irrelevant repairs. To address this limitation, we design an LLM-as-a-Judge framework that strengthens the robustness of recommendation assessments. Compared to baselines from prior work, as well as to static code analysis and zero-shot LLMs, our findings show that multifaceted LLMs guided by retrieved context produce explainable and reliable recommendations with a small to mild level of self-alignment bias. Our work is built on open-source knowledge bases and models, making it reproducible and extensible to new datasets and retrieval strategies.