LLM as a Judge: Evaluating Accuracy in LLM Security Scans

Introduction: The Evolving Landscape of LLM Security

The rapid advancement and widespread adoption of Large Language Models (LLMs) have ushered in an era of unprecedented capabilities, but also introduced a complex web of security challenges. As these models become increasingly integrated into various applications and workflows, the potential for unintended or maliciously crafted outputs escalates, particularly in security-sensitive domains. To address this growing concern, researchers at Trend Micro have undertaken a comprehensive investigation into the efficacy of using LLMs themselves as automated judges for security scans. This approach aims to simulate adversarial attacks, thereby identifying and mitigating potential risks before they can be exploited.

The Dual Approach to LLM Response Evaluation

Evaluating the security implications of LLM-generated content typically involves two primary methods: human review and automated assessment using LLM judges. Human reviewers, with their nuanced understanding of context, tone, and intent, are adept at spotting subtle security flaws. However, the sheer volume of LLM outputs makes exclusive reliance on human expertise increasingly impractical and cost-prohibitive. This scalability issue paves the way for the "LLM as a judge" paradigm. In this model, a specialized LLM is tasked with evaluating the responses of another LLM, guided by a concise system prompt that defines the evaluation criteria. LLM judges offer significant advantages in terms of speed, consistency, and cost-effectiveness, enabling rapid assessment of vast datasets.

Methodology: Simulating Attacks and Automated Judgment

Trend Micro's evaluation process was meticulously designed to simulate adversarial behaviors and assess LLM responses at scale. The methodology involved a two-step approach:

Attack Simulation: Purpose-crafted prompts were sent to target LLMs (such as GPT-4 and Mistral) to elicit specific malicious responses. These prompts were designed to probe for vulnerabilities, aiming to generate outputs like malicious code, sensitive data leaks, or fabricated software packages.
Response Evaluation: The responses from the target LLMs were then fed into an LLM judge. This judge was configured with a tailored system prompt outlining the criteria for a successful attack, enabling it to provide a binary decision (true/false) on whether the response met the objective. This binary output, along with an evaluation reason, was designed for programmatic action, facilitating automated policy enforcement and tracking of model improvements over time.

This structured approach allows for the measurement of aggregate model risk and removes the bottleneck of manual review, which is critical for evaluating hundreds or thousands of responses efficiently.

Vulnerability Analysis: Key Security Threats

Malicious Code Generation

One of the most direct ways LLMs can be misused is by being prompted to generate malicious code. Such code, if executed, could compromise system security by creating backdoors, injecting harmful commands, or enabling unauthorized remote access. The LLM judge plays a crucial role here by being configured with criteria to detect such malicious code. The binary classification system enables high-throughput analysis, essential for integrating security checks into production pipelines and minimizing the risk of LLMs unintentionally aiding offensive actions.

Package Hallucination and Supply Chain Risks

Package hallucination, where an LLM invents non-existent software packages, poses a significant threat to software supply chains. Attackers can exploit this by registering malicious packages under the hallucinated names on public repositories. Developers relying on LLM coding assistants might inadvertently download and integrate these malicious packages, leading to data exfiltration or system compromise. Trend Micro's research identified instances of package hallucination in Python and Go that their initial LLM judge failed to detect. This failure stems from the LLM's training data potentially being outdated or incomplete. To address this, an enhanced method was developed, involving extracting package names from LLM responses, querying package repositories (like PyPI and Go proxy mirrors) to verify their existence, and inferring hallucination if a package is not found. While this improves detection of non-existent packages, it does not inherently identify malicious intent, necessitating further research into detecting AI-driven supply chain attacks.

System Prompt Leakage

System prompt leakage occurs when an LLM inadvertently reveals sensitive information embedded within its instructions. This can include details about system architecture, API keys, internal rules, filtering criteria, or user roles. Detecting such leakage is vital, as it can expose confidential operational details. The study demonstrated that a judge LLM, when not privy to the target model's system prompt, might fail to detect leakage. However, by updating the judge's system prompt to include a reference system prompt, semantic analysis capabilities allowed for more flexible detection compared to simple string matching. This underscores the importance of tailoring the judge's system prompt to specific objectives for effective detection.

Defending the LLM Judge Itself

The security evaluation process is not one-sided; the LLM judge itself can be a target. Adversarial attacks, such as role-playing attacks disguised in Base64 encoding, can be used to manipulate the judge. In one experiment, a target LLM decoded an attack string instead of falling for it, but the subsequent LLM judge, when presented with the decoded string, ignored its system prompt and revealed its own underlying model family. This highlights the critical need for guardrails to protect the LLM judge from unsanitized responses that could be attack strings themselves. Trend Micro's AI Guard is designed to mitigate such risks by protecting AI applications against harmful content generation, sensitive information leakage, and prompt injections.

Benchmarking and Fine-Tuning LLM Judges

To determine the most effective foundation models for use as judges, Trend Micro conducted a benchmarking study. A dataset of over 800 attack strings and responses was used, with human labeling to classify attack success. Various state-of-the-art foundation models were then evaluated based on accuracy, precision, recall, and F1 scores. The study also considered relative inference cost. The results indicated that while some models offered cost-effectiveness, they lacked accuracy. Trend Micro's own Primus-Labor-70B model emerged as a strong contender, balancing high accuracy and F1 scores with a relatively low cost, performing comparably to larger proprietary models. Furthermore, the research explored fine-tuning a judge model using the same dataset. The fine-tuned judge demonstrated accuracy comparable to leading models like GPT-4 and GPT-4o, while maintaining lower compute costs and offering tailored performance for specific evaluation objectives. This fine-tuning approach allows for continuous adaptation to new attack techniques.

Conclusion: Towards Robust LLM Security Evaluation

The research by Trend Micro underscores the potential of LLMs as judges for security scans, offering scalability and speed. However, it also reveals critical limitations, including susceptibility to adversarial attacks and challenges in detecting nuanced threats like package hallucinations. The findings emphasize the necessity of robust guardrails, tailored system prompts for judges, and external validation mechanisms to ensure the accuracy and reliability of LLM-based security assessments. Trend Micro's ongoing work in developing solutions like AI Guard and their Primus models aims to address these challenges, contributing to a more secure AI ecosystem by providing organizations with the tools to effectively monitor and secure their LLM deployments.