Small Language Models Achieve LLM-Level Accuracy in Bug Fixing, Significantly Reducing Resource Demands
Introduction
The field of automated program repair (APR) has witnessed remarkable advancements, largely driven by the capabilities of large language models (LLMs). However, the substantial computational resources required by LLMs have presented a significant barrier to their widespread adoption in practical software development. Addressing this challenge, researchers from Kyushu University, led by Kazuki Kusama, Honglin Shu, and Masanari Kondo, have presented compelling evidence that smaller language models (SLMs) can achieve comparable, and in some cases superior, bug-fixing accuracy. This research not only questions the necessity of massive parameter counts for effective APR but also highlights the significant efficiency gains achievable through techniques like int8 quantization, which reduces memory demands with minimal performance compromise.
Small Models Rival Large Models for Repair
Small Language Models (SLMs) are increasingly being recognized for their potential in APR due to their lower computational demands and reduced need for extensive training data compared to LLMs. This study specifically investigated whether SLMs could offer competitive performance in APR, presenting a viable alternative to more resource-intensive LLMs. Through experiments conducted on the QuixBugs benchmark, the researchers directly compared the bug-fixing accuracy of various SLMs and LLMs under identical conditions. The results were striking: state-of-the-art SLMs demonstrated the ability to fix bugs with an accuracy that matched, or even surpassed, that of their LLM counterparts. Furthermore, the application of int8 quantization, a method to reduce model precision, showed a negligible impact on APR accuracy while significantly decreasing memory requirements. These findings strongly suggest that SLMs, particularly when optimized through quantization, represent a practical and efficient solution for APR, offering competitive accuracy at a fraction of the computational cost.
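To make the int8 idea concrete, the sketch below applies symmetric per-tensor int8 quantization to a toy weight matrix with NumPy. This is a simplified illustration of the general technique, not the paper's exact procedure: weights are scaled into the int8 range, rounded, and stored in one byte each instead of four, with a reconstruction error bounded by one quantization step.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: map float32 weights to [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes // q.nbytes)                     # 4: int8 stores 1 byte per weight vs 4
print(float(np.abs(w - w_hat).max()) < scale)   # True: error within one quantization step
```

The 4x storage reduction shown here is exactly why int8 quantization cuts model memory demands; the study's finding is that, for APR, this precision loss costs almost nothing in repair accuracy.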
Quantized Small Language Models for Software Engineering
The research extends to evaluating the utility of quantized SLMs, such as specific versions of Phi-3, Llama 3, and Qwen2.5-coder, for broader software engineering tasks. The core argument is that these quantized SLMs can deliver performance comparable to larger models while substantially cutting down computational costs and environmental impact. The study found that SLMs with as few as 7 billion parameters could achieve performance levels on par with models exceeding 70 billion parameters across various software engineering benchmarks. Quantization techniques, such as GPTQ and 4-bit quantization, were identified as crucial for making SLMs practical, effectively reducing model size and inference costs without significant degradation in performance. The benefits of SLMs in terms of reduced memory footprint, faster inference speeds, and lower energy consumption make them particularly suitable for resource-constrained environments and edge device deployment. The research focused on tasks including code completion, bug fixing, code generation, and issue resolution, utilizing benchmarks like SWE-bench for performance evaluation.
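The memory savings from lower precision follow directly from bytes-per-parameter arithmetic. The short sketch below computes approximate weight-storage footprints for a 7-billion-parameter model; the figures are illustrative back-of-the-envelope numbers (weights only, ignoring activations and KV cache), not measurements from the study.

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight-storage footprint in GB, ignoring activations and KV cache."""
    return num_params * bits_per_param / 8 / 1e9

# A 7B-parameter model at different precisions (illustrative figures):
for bits, name in [(32, "float32"), (16, "float16"), (8, "int8"), (4, "4-bit")]:
    print(f"{name:>8}: {model_memory_gb(7e9, bits):.1f} GB")
# float32: 28.0 GB, float16: 14.0 GB, int8: 7.0 GB, 4-bit: 3.5 GB
```

This arithmetic is what makes quantized 7B-class SLMs viable on a single consumer GPU or edge device, whereas a 70B-parameter model at full precision is out of reach for most development environments.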
Empirical Results: 14 SLMs on QuixBugs
A key contribution of this research is the empirical validation that SLMs can rival LLMs in bug-fixing accuracy for automated program repair, directly addressing the efficiency bottleneck traditionally associated with LLM-based APR. Using the QuixBugs benchmark, the study evaluated 14 different SLMs against two LLMs. The top-performing SLM, Phi-3 (3.8 billion parameters), repaired 38 of 40 bugs, remarkably close to Codex, a leading LLM, which fixed 39 of 40. This finding underscores the viability of SLMs as a computationally efficient alternative for APR, offering competitive accuracy with substantially reduced resource demands. The investigation into quantization further revealed that int8 quantization had a minimal effect on repair accuracy, with differences of approximately 0.5 bugs fixed compared to the full-precision float32 representation. This implies that developers can gain significant memory savings and faster inference by adopting int8 quantization without substantially compromising bug-fixing effectiveness. This work is the first comprehensive evaluation of 14 SLMs for APR, providing crucial insights into their potential and limitations.
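A QuixBugs-style evaluation harness boils down to generating a candidate patch for a buggy function and checking whether the patched code passes the benchmark's test cases. The sketch below mocks the model call with a hard-coded patch (a real harness would query the SLM); the buggy `gcd`, which recurses on the wrong arguments, is representative of the single-line defects in QuixBugs, though the exact harness code here is an assumption, not the paper's implementation.

```python
# Minimal sketch of a QuixBugs-style evaluation loop. The "model output" is
# mocked with a hard-coded patch string; a real harness would query the SLM.

def run_tests(source: str, tests) -> bool:
    """Execute a candidate program in a fresh namespace and run the test cases."""
    ns = {}
    try:
        exec(source, ns)
        return all(ns["gcd"](a, b) == want for a, b, want in tests)
    except Exception:  # crashes and infinite recursion count as a failed repair
        return False

buggy = "def gcd(a, b):\n    return a if b == 0 else gcd(a % b, b)\n"  # wrong recursion
patch = "def gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)\n"  # proposed fix

tests = [(35, 21, 7), (17, 5, 1), (12, 0, 12)]
print(run_tests(buggy, tests))  # False: the buggy version recurses forever
print(run_tests(patch, tests))  # True: the patched version passes all test cases
```

Counting how many of the benchmark's 40 programs a model repairs this way yields exactly the 38/40 and 39/40 figures reported above.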
Conclusion: An Efficient Alternative for Program Repair
The research conclusively demonstrates that SLMs present a compelling alternative to LLMs for automated program repair. The experiments, conducted on standard benchmarks, revealed that the most effective SLMs achieved bug-fixing accuracy comparable to, and in some cases even better than, larger models. This is particularly significant given that SLMs require considerably fewer computational resources, making them practical for integration into everyday development environments. The study further validated that applying int8 quantization, a method for model compression, had a negligible impact on repair accuracy while substantially decreasing memory requirements. This suggests that SLMs can be further optimized for efficiency without sacrificing their effectiveness. The findings collectively advocate for the use of code-specific SLMs, enhanced with int8 quantization, as a powerful and efficient solution for automated program repair, capable of matching the performance of LLM-based APR methods while demanding significantly fewer computational resources.
Key Findings and Recommendations
The study's central findings are twofold: state-of-the-art SLMs can match or exceed LLM bug-fixing accuracy on the QuixBugs benchmark, and int8 quantization preserves that accuracy while substantially reducing memory requirements. On this basis, the researchers recommend code-specific SLMs with int8 quantization as a practical, resource-efficient foundation for automated program repair in everyday development workflows.
AI Summary
A groundbreaking study led by researchers at Kyushu University, including Kazuki Kusama, Honglin Shu, and Masanari Kondo, has demonstrated that small language models (SLMs) can achieve bug-fixing accuracy on par with, and sometimes exceeding, that of larger language models (LLMs). This challenges the long-held assumption that model size is directly correlated with performance in automated program repair (APR). The research highlights that carefully designed SLMs offer a practical and resource-efficient alternative to LLMs, which traditionally require substantial computing power. A key finding is the effectiveness of int8 quantization, a technique that reduces model precision. This method significantly lowers memory requirements with only a minor impact on repair accuracy, paving the way for more accessible and efficient APR tools. Experiments conducted on the QuixBugs benchmark showed that state-of-the-art SLMs, such as Phi-3 with 3.8 billion parameters, successfully fixed 38 out of 40 bugs, closely matching the performance of LLMs like Codex, which fixed 39 out of 40 bugs. The study also found that int8 quantization resulted in a minimal drop in accuracy, with an average decrease of only 0.25 bugs fixed compared to full precision models. This suggests that developers can leverage quantized SLMs for practical software development workflows, benefiting from reduced computational costs and faster inference times without significant performance degradation. The research provides a comprehensive evaluation of 14 SLMs in APR tasks, offering valuable insights into their capabilities and limitations, and advocating for the adoption of code-specific SLMs with int8 quantization as a viable and efficient solution for automated program repair.