Small Language Models Achieve LLM-Level Accuracy in Bug Fixing, Significantly Reducing Resource Demands
Introduction
The field of automated program repair (APR) has witnessed remarkable advancements, largely driven by the capabilities of large language models (LLMs). However, the substantial computational resources required by LLMs have presented a significant barrier to their widespread adoption in practical software development. Addressing this challenge, researchers from Kyushu University, led by Kazuki Kusama, Honglin Shu, and Masanari Kondo, have presented compelling evidence that smaller language models (SLMs) can achieve comparable, and in some cases superior, bug-fixing accuracy. This research not only questions the necessity of massive parameter counts for effective APR but also highlights the significant efficiency gains achievable through techniques like int8 quantization, which reduces memory demands with minimal performance compromise.
Small Models Rival Large Models for Repair
Small Language Models (SLMs) are increasingly being recognized for their potential in APR due to their lower computational demands and reduced need for extensive training data compared to LLMs. This study specifically investigated whether SLMs could offer competitive performance in APR, presenting a viable alternative to more resource-intensive LLMs. Through experiments conducted on the QuixBugs benchmark, the researchers directly compared the bug-fixing accuracy of various SLMs and LLMs under identical conditions. The results were striking: state-of-the-art SLMs demonstrated the ability to fix bugs with an accuracy that matched, or even surpassed, that of their LLM counterparts. Furthermore, the application of int8 quantization, a method to reduce model precision, showed a negligible impact on APR accuracy while significantly decreasing memory requirements. These findings strongly suggest that SLMs, particularly when optimized through quantization, represent a practical and efficient solution for APR, offering competitive accuracy at a fraction of the computational cost.
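To make the int8 idea concrete, the sketch below applies symmetric per-tensor int8 quantization to a toy weight matrix with NumPy. This is a simplified illustration of the general technique, not the paper's exact procedure: weights are scaled into the int8 range, rounded, and stored in one byte each instead of four, with a reconstruction error bounded by one quantization step.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: map float32 weights to [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes // q.nbytes)                     # 4: int8 stores 1 byte per weight vs 4
print(float(np.abs(w - w_hat).max()) < scale)   # True: error within one quantization step
```

The 4x storage reduction shown here is exactly why int8 quantization cuts model memory demands; the study's finding is that, for APR, this precision loss costs almost nothing in repair accuracy.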
Quantized Small Language Models for Software Engineering
The research extends to evaluating the utility of quantized SLMs, such as specific versions of Phi-3, Llama 3, and Qwen2.5-coder, for broader software engineering tasks. The core argument is that these quantized SLMs can deliver performance comparable to larger models while substantially cutting down computational costs and environmental impact. The study found that SLMs with as few as 7 billion parameters could achieve performance levels on par with models exceeding 70 billion parameters across various software engineering benchmarks. Quantization techniques, such as GPTQ and 4-bit quantization, were identified as crucial for making SLMs practical, effectively reducing model size and inference costs without significant degradation in performance. The benefits of SLMs in terms of reduced memory footprint, faster inference speeds, and lower energy consumption make them particularly suitable for resource-constrained environments and edge device deployment. The research focused on tasks including code completion, bug fixing, code generation, and issue resolution, utilizing benchmarks like SWE-bench for performance evaluation.
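The memory savings from lower precision follow directly from bytes-per-parameter arithmetic. The short sketch below computes approximate weight-storage footprints for a 7-billion-parameter model; the figures are illustrative back-of-the-envelope numbers (weights only, ignoring activations and KV cache), not measurements from the study.

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight-storage footprint in GB, ignoring activations and KV cache."""
    return num_params * bits_per_param / 8 / 1e9

# A 7B-parameter model at different precisions (illustrative figures):
for bits, name in [(32, "float32"), (16, "float16"), (8, "int8"), (4, "4-bit")]:
    print(f"{name:>8}: {model_memory_gb(7e9, bits):.1f} GB")
# float32: 28.0 GB, float16: 14.0 GB, int8: 7.0 GB, 4-bit: 3.5 GB
```

This arithmetic is what makes quantized 7B-class SLMs viable on a single consumer GPU or edge device, whereas a 70B-parameter model at full precision is out of reach for most development environments.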
Empirical Results: 14 SLMs on QuixBugs
A key contribution of this research is the empirical validation that SLMs can rival LLMs in bug-fixing accuracy for automated program repair, directly addressing the efficiency bottleneck traditionally associated with LLM-based APR. Using the QuixBugs benchmark, the study evaluated 14 different SLMs against two LLMs. The top-performing SLM, Phi-3 (3.8 billion parameters), repaired 38 of 40 bugs, remarkably close to Codex, a leading LLM, which fixed 39 of 40. This finding underscores the viability of SLMs as a computationally efficient alternative for APR, offering competitive accuracy with substantially reduced resource demands. The investigation into quantization further revealed that int8 quantization had a minimal effect on repair accuracy, with differences of approximately 0.5 bugs fixed compared to the full-precision float32 representation. This implies that developers can gain significant memory savings and faster inference by adopting int8 quantization without substantially compromising bug-fixing effectiveness. This work is the first comprehensive evaluation of 14 SLMs for APR, providing crucial insights into their potential and limitations.
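A QuixBugs-style evaluation harness boils down to generating a candidate patch for a buggy function and checking whether the patched code passes the benchmark's test cases. The sketch below mocks the model call with a hard-coded patch (a real harness would query the SLM); the buggy `gcd`, which recurses on the wrong arguments, is representative of the single-line defects in QuixBugs, though the exact harness code here is an assumption, not the paper's implementation.

```python
# Minimal sketch of a QuixBugs-style evaluation loop. The "model output" is
# mocked with a hard-coded patch string; a real harness would query the SLM.

def run_tests(source: str, tests) -> bool:
    """Execute a candidate program in a fresh namespace and run the test cases."""
    ns = {}
    try:
        exec(source, ns)
        return all(ns["gcd"](a, b) == want for a, b, want in tests)
    except Exception:  # crashes and infinite recursion count as a failed repair
        return False

buggy = "def gcd(a, b):\n    return a if b == 0 else gcd(a % b, b)\n"  # wrong recursion
patch = "def gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)\n"  # proposed fix

tests = [(35, 21, 7), (17, 5, 1), (12, 0, 12)]
print(run_tests(buggy, tests))  # False: the buggy version recurses forever
print(run_tests(patch, tests))  # True: the patched version passes all test cases
```

Counting how many of the benchmark's 40 programs a model repairs this way yields exactly the 38/40 and 39/40 figures reported above.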
Conclusion: An Efficient Alternative for Program Repair
The research conclusively demonstrates that SLMs present a compelling alternative to LLMs for automated program repair. The experiments, conducted on standard benchmarks, revealed that the most effective SLMs achieved bug-fixing accuracy comparable to, and in some cases even better than, larger models. This is particularly significant given that SLMs require considerably fewer computational resources, making them practical for integration into everyday development environments. The study further validated that applying int8 quantization, a method for model compression, had a negligible impact on repair accuracy while substantially decreasing memory requirements. This suggests that SLMs can be further optimized for efficiency without sacrificing their effectiveness. The findings collectively advocate for the use of code-specific SLMs, enhanced with int8 quantization, as a powerful and efficient solution for automated program repair, capable of matching the performance of LLM-based APR methods while demanding significantly fewer computational resources.
Key Findings and Recommendations
The study's central findings are twofold: state-of-the-art SLMs can match or exceed LLM bug-fixing accuracy on the QuixBugs benchmark, and int8 quantization preserves that accuracy while substantially reducing memory requirements. On this basis, the researchers recommend code-specific SLMs with int8 quantization as a practical, resource-efficient foundation for automated program repair in everyday development workflows.
AI Summary
A groundbreaking study led by researchers at Kyushu University, including Kazuki Kusama, Honglin Shu, and Masanari Kondo, has demonstrated that small language models (SLMs) can achieve bug-fixing accuracy on par with, and sometimes exceeding, that of larger language models (LLMs). This challenges the long-held assumption that model size is directly correlated with performance in automated program repair (APR). The research highlights that carefully designed SLMs offer a practical and resource-efficient alternative to LLMs, which traditionally require substantial computing power. A key finding is the effectiveness of int8 quantization, a technique that reduces model precision. This method significantly lowers memory requirements with only a minor impact on repair accuracy, paving the way for more accessible and efficient APR tools. Experiments conducted on the QuixBugs benchmark showed that state-of-the-art SLMs, such as Phi-3 with 3.8 billion parameters, successfully fixed 38 out of 40 bugs, closely matching the performance of LLMs like Codex, which fixed 39 out of 40 bugs. The study also found that int8 quantization resulted in a minimal drop in accuracy, with an average decrease of only 0.25 bugs fixed compared to full precision models. This suggests that developers can leverage quantized SLMs for practical software development workflows, benefiting from reduced computational costs and faster inference times without significant performance degradation. The research provides a comprehensive evaluation of 14 SLMs in APR tasks, offering valuable insights into their capabilities and limitations, and advocating for the adoption of code-specific SLMs with int8 quantization as a viable and efficient solution for automated program repair.