To ensure AI behaves as expected, you, your team, and your organization need to take deliberate action. This section covers AI quality fundamentals, along with guidance for teams and organizations.
Evaluation & Quality
- 1: AI Eval Methodology for Coding Tools
- 2: Team AI Evals for Coding Tools
- 3: AI Evals for AI Enablement Platforms
1 - AI Eval Methodology for Coding Tools
AI coding tools produce non-deterministic output. Evals make that output observable and measurable using three grading layers: deterministic checks, transcript analysis, and LLM rubrics.
This guide is for teams building AI coding tools and platform teams providing shared AI enablement infrastructure. For team-specific eval setup, see Team AI Evals. For platform-scale patterns, see AI Evals for AI Enablement Platforms.
Terminology
| Term | Definition |
|---|---|
| Task | A single work item given to the agent (one prompt + one fixture) |
| Trial | One execution of a task; multiple trials measure variance |
| Grader | An automated check that scores agent output (pass/fail or 0-1) |
| Transcript | The full agent conversation log: tool calls, reasoning, output |
| Outcome | The agent’s final output for a task |
| Evaluation harness | The framework that runs tasks, collects outcomes, applies graders |
| Agent harness | The runtime that executes the agent (e.g., Claude Code) |
| Evaluation suite | A collection of related tasks testing one capability dimension |
In the dev-plugins reference implementation: Promptfoo is the evaluation harness. Claude Code is the agent harness. YAML files in evals/<plugin>/suites/ are evaluation suites.
What Are AI Evals
AI coding tools produce non-deterministic output. The same prompt run twice can yield different code, different explanations, and different tool-use sequences. Traditional unit tests verify deterministic application logic. AI evals verify behavior: whether an agent finds the right issues, follows a sound process, and produces useful output.
Without evals, teams face:
- Silent regressions: A prompt change that improves one scenario quietly breaks three others. Nobody notices until a user reports it.
- Hallucination drift: The agent starts citing files that do not exist or inventing issues that are not present. Without negative tests, fabrication goes undetected.
- Unmeasurable improvement: Every change is a guess. You cannot tell whether a prompt edit actually improved capability or just shifted failure modes.
Evals make AI tool quality observable and measurable.
This guide focuses on evals for coding and code review agents: tools that read code, produce findings, and generate or modify source files. Conversational agents and research agents have different evaluation needs and may require adapted approaches.
What Evals Validate: ACD Artifacts
In the Agentic Continuous Delivery framework, software delivery is organized around six first-class artifacts. Evals validate agent behavior against these artifacts. Each grading layer maps naturally to different artifact types.
| Artifact | Description | Primary Eval Layer |
|---|---|---|
| Intent Description | What the user wants to achieve | LLM Rubric |
| User-Facing Behavior | Observable outcomes from the user’s perspective | LLM Rubric |
| Feature Description | Structured specification of a capability | Transcript |
| Executable Truth | Tests, build scripts, type checks | Deterministic |
| System Constraints | Security, performance, compliance rules | Transcript |
| Implementation | Source code and configuration | Deterministic |
Reading the table: Deterministic graders excel at checking artifacts with verifiable ground truth (code compiles, tests pass, files exist). Transcript graders verify the agent respected process constraints and addressed structured specifications. LLM rubrics evaluate alignment with intent and user-facing quality, the artifacts that require judgment.
This mapping guides grader selection: when you know which artifact type your eval targets, the table tells you which grading layer is the primary fit.
The Eval Development Cycle
Evals are a development tool, not a post-hoc quality gate. The cycle looks like this:
```mermaid
graph LR
    A[Write prompt or agent] --> B[Write eval]
    B --> C[Run eval]
    C --> D[Read transcripts]
    D --> E[Identify failure mode]
    E --> F[Improve prompt or agent]
    F --> C
```

The key insight: you write the eval before you consider the prompt done. Running the eval, reading the full agent transcript, and understanding why it failed teaches you more about your prompt than any amount of manual testing. The eval is your feedback loop.
Three-Layer Grading
A single grading approach cannot cover the full range of AI tool behaviors. Deterministic checks are fast but shallow. LLM judges catch nuance but are slow and expensive. Transcript analysis validates the agent’s process independent of its output. Combining all three layers gives you coverage, speed, and accuracy.
Layer 1: Deterministic Graders
Deterministic graders run fast, produce binary pass/fail results, and have near-zero false positive rates. They check structural properties of the output.
What they check:
- Report structure matches expected headings and sections
- Scores add up correctly (weighted arithmetic validation)
- Output references real files from the fixture, not hallucinated paths
- Specific keywords or patterns appear (or do not appear) in the output
A score arithmetic grader parses category scores and weights from agent output, computes the weighted average, and compares it to the reported overall score. A small tolerance (e.g., +/- 3 points) accommodates rounding. This catches a common failure mode: the agent reports individual category scores and a total that do not add up.
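As a sketch (assuming a report format like `Security: 80/100 (weight: 0.3)` with an `Overall:` line — adapt the regexes to your own report template), such a grader might look like:

```javascript
// Sketch of a score arithmetic grader. The category/overall line formats
// here are assumptions; match them to your actual report template.
function checkScoreArithmetic(output, tolerance = 3) {
  const categoryRe = /^\s*[-*]?\s*([A-Za-z ]+):\s*(\d+)\/100\s*\(weight:\s*([\d.]+)\)/gm;
  const categories = [];
  let m;
  while ((m = categoryRe.exec(output)) !== null) {
    categories.push({ name: m[1].trim(), score: Number(m[2]), weight: Number(m[3]) });
  }
  const overallMatch = output.match(/Overall:\s*(\d+)\/100/);
  if (categories.length === 0 || !overallMatch) {
    return { pass: false, reason: 'Could not parse category scores or overall score' };
  }
  // Weighted average of category scores, normalized by total weight.
  const totalWeight = categories.reduce((s, c) => s + c.weight, 0);
  const expected = categories.reduce((s, c) => s + c.score * c.weight, 0) / totalWeight;
  const reported = Number(overallMatch[1]);
  const pass = Math.abs(expected - reported) <= tolerance;
  return {
    pass,
    reason: pass ? 'OK' : `Expected ~${expected.toFixed(1)}, agent reported ${reported}`,
  };
}
```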
A report structure grader validates that the output contains required headings at the correct level, that headings match expected patterns, that required sections have non-empty content, and that the output falls within length bounds.
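A structure grader along these lines (the required headings and length bounds here are placeholders for your own template) might be:

```javascript
// Sketch of a report structure grader; headings and bounds are illustrative.
function checkReportStructure(output) {
  const required = ['## Summary', '## Findings', '## Recommendations'];
  const missing = required.filter((h) => !output.includes(h));
  if (missing.length > 0) {
    return { pass: false, reason: `Missing headings: ${missing.join(', ')}` };
  }
  // Each required section should have non-empty content before the next heading.
  const emptySections = required.filter((h) => {
    const rest = output.slice(output.indexOf(h) + h.length);
    const next = rest.search(/^## /m);
    const body = next === -1 ? rest : rest.slice(0, next);
    return body.trim().length === 0;
  });
  if (emptySections.length > 0) {
    return { pass: false, reason: `Empty sections: ${emptySections.join(', ')}` };
  }
  const withinBounds = output.length >= 200 && output.length <= 20000;
  return { pass: withinBounds, reason: withinBounds ? 'OK' : 'Output length out of bounds' };
}
```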
Layer 2: Transcript Graders
Transcript graders validate how the agent worked, not just what it produced. They parse the agent’s tool-call sequence and conversation turns to verify sound process.
What they check:
- The agent gathered evidence (Read, Glob, Grep, Bash) before stating findings
- The agent used multiple evidence sources, not just one
- Evidence-gathering actions make up a sufficient proportion of total actions
An evidence gathering grader checks three things: whether evidence-gathering tools (Read, Glob, Grep) were used before the agent stated findings, whether at least two different evidence tools were used, and whether evidence-gathering actions make up a sufficient proportion of total actions (e.g., at least 40%). This catches agents that jump to conclusions without reading the code, or that rely on a single tool without examining actual file contents.
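A minimal sketch of those three checks, assuming a simplified transcript shape of ordered `{ type, tool }` entries (real harness logs are richer):

```javascript
// Sketch of an evidence-gathering grader. The transcript shape is a
// simplifying assumption; adapt the parsing to your harness's log format.
const EVIDENCE_TOOLS = new Set(['Read', 'Glob', 'Grep', 'Bash']);

function checkEvidenceGathering(transcript, minRatio = 0.4) {
  const toolCalls = transcript.filter((t) => t.type === 'tool_call');
  const firstFinding = transcript.findIndex((t) => t.type === 'findings');
  // Evidence-gathering tool calls that happened before the agent stated findings.
  const evidenceBefore = transcript
    .slice(0, firstFinding === -1 ? transcript.length : firstFinding)
    .filter((t) => t.type === 'tool_call' && EVIDENCE_TOOLS.has(t.tool));
  const distinctTools = new Set(evidenceBefore.map((t) => t.tool));
  const ratio = toolCalls.length === 0
    ? 0
    : toolCalls.filter((t) => EVIDENCE_TOOLS.has(t.tool)).length / toolCalls.length;
  return {
    pass: evidenceBefore.length > 0 && distinctTools.size >= 2 && ratio >= minRatio,
    reason: `evidence-before-findings=${evidenceBefore.length}, distinct=${distinctTools.size}, ratio=${ratio.toFixed(2)}`,
  };
}
```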
Layer 3: LLM Rubrics
LLM rubrics use a language model as judge to evaluate qualities that resist deterministic checking: accuracy of findings, quality of recommendations, appropriate severity ratings, and absence of hallucination.
A typical code quality rubric defines weighted criteria such as correctness, readability, maintainability, idiomatic usage, and error handling, each scored on a 1-5 scale. The LLM judge scores each criterion, a weighted total is computed, and the result passes if it meets a threshold (e.g., 3.5 out of 5).
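The weighted-total step can be sketched as follows (criteria names and weights are illustrative, applied to per-criterion scores parsed from the judge model's response):

```javascript
// Sketch of the weighted-total computation for an LLM rubric verdict.
// Criteria and weights are illustrative; weights sum to 1.0.
const CRITERIA = [
  { name: 'correctness', weight: 0.3 },
  { name: 'readability', weight: 0.2 },
  { name: 'maintainability', weight: 0.2 },
  { name: 'idiomatic', weight: 0.15 },
  { name: 'errorHandling', weight: 0.15 },
];

function rubricVerdict(judgeScores, threshold = 3.5) {
  // judgeScores: { correctness: 4, readability: 3, ... } on a 1-5 scale.
  // Missing criteria default to the minimum score rather than passing silently.
  const total = CRITERIA.reduce((s, c) => s + (judgeScores[c.name] ?? 1) * c.weight, 0);
  return { score: total, pass: total >= threshold };
}
```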
LLM rubrics are the slowest and most expensive grading layer. Use them for qualities that the other layers cannot check.
Human Review as Calibration
Human review is a calibration tool, not a fourth runtime layer. You do not include human review in the automated eval pipeline. Instead, you use human review periodically to verify that your graders are correctly calibrated.
The CORE-Bench study found that fixing grader bugs improved measured performance from 42% to 95%. Uncalibrated graders waste prompt engineering effort on problems that do not exist.
For the hands-on calibration process and recalibration triggers, see Calibrating Graders.
Decision Table: When to Use Each Layer
| Question | Layer |
|---|---|
| Does the output have the right structure? | Deterministic |
| Do the numbers add up? | Deterministic |
| Does the output reference real files? | Deterministic |
| Did the agent read the code before judging? | Transcript |
| Did the agent use appropriate tools? | Transcript |
| Are the findings accurate and specific? | LLM Rubric |
| Are the recommendations actionable? | LLM Rubric |
| Is the severity rating appropriate? | LLM Rubric |
Worked Example: Three Layers Combined
Consider a code review eval that sends a messy codebase to the agent (mixed naming conventions, duplicated logic, dead code, a god class). A single test case uses all three layers:
Deterministic (high weight): Checks that findings reference specific files and line numbers from the fixture, and that the report has the expected heading structure.
Transcript (medium weight): Verifies the agent read code files before producing findings.
LLM Rubric (high weight): Judges whether findings include specific file references and accurate descriptions, and whether recommendations are actionable rather than generic.
The deterministic graders run in milliseconds and catch structural failures. The transcript grader catches agents that skip evidence gathering. The LLM rubrics evaluate the subjective quality that only another language model can assess. Together, they cover structure, process, and quality.
Positive and Negative Test Pairs
Every eval suite needs two types of tests:
Positive (capability) tests verify the tool finds real issues. You give the agent a fixture with planted problems and assert that it detects them.
Negative (regression) tests verify the tool does not fabricate findings. You give the agent a clean fixture and assert that it does not report false positives.
Without negative tests, you optimize for recall at the cost of precision. The agent learns to report everything as a problem, including things that are fine. Without positive tests, you have no idea whether the tool actually works.
Naming convention: every positive suite file (e.g., `review.yaml`) has a corresponding negative counterpart (`review-neg.yaml`).
For a step-by-step walkthrough of building positive and negative test pairs, see Writing Your First Eval.
Fixture Design
Fixtures are the codebases your agent evaluates during testing. Their quality determines your eval quality.
Principles:
Realistic, not toy. Use code structures that resemble real projects. A single file with one obvious bug teaches you nothing about agent behavior on real codebases.
One scenario per test case. Each test should exercise a single capability. This mirrors ACD’s small-batch session pattern: one scenario per session keeps signal clean.
Planted issues with documented intent. Every issue in a positive fixture should be deliberate and documented. List expected findings in the suite metadata or in a reference solution.
Clean fixtures for negative tests. Build fixtures that follow best practices so you can verify the agent does not fabricate findings.
Diverse fixture types. Different fixtures exercise different capabilities.
For a fixture portfolio example, see Building a Fixture Matrix.
Task Quality
Ambiguous task specifications are the primary source of eval noise. If two domain experts would disagree on whether an agent’s output passes or fails, the task is underspecified, not the agent.
The two-expert test: Before finalizing an eval case, ask whether two domain experts given the same output would independently reach the same pass/fail verdict. If not, tighten the specification.
Writing unambiguous assertions:
- Specific over generic. “Detects missing `<label>` on the email input” is testable. “Finds accessibility issues” is not.
- Observable criteria. Assert on things you can check (keywords present, files referenced, structure correct), not on vague quality.
- Reference solutions as disambiguation. When a task could be interpreted multiple ways, write a reference solution that documents the intended interpretation. The reference eliminates ambiguity.
When pass rates are zero: A 0% pass rate is usually a task bug, not an agent bug. Before blaming the agent, investigate the task specification and grader logic. Common causes: overly narrow regex assertions, graders checking the wrong field, or fixture content that does not match the prompt’s assumptions.
Metrics: pass@k and pass^k
Single-run pass rates are misleading for non-deterministic systems. A test that passes once might fail on the next run. Two metrics address this:
pass@k (capability ceiling): The probability that at least 1 of k independent runs passes. Computed as `1 - C(n-c, k) / C(n, k)`, where n is total runs and c is passing runs. This tells you what the agent can do on a good run.

pass^k (reliability floor): The probability that all k independent runs pass. Computed as `C(c, k) / C(n, k)`. This tells you how consistently the agent succeeds.
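The two formulas can be implemented directly (a sketch; `n` is total trials, `c` is passing trials):

```javascript
// Binomial coefficient C(n, k); returns 0 when k is out of range,
// which makes both estimators degrade gracefully.
function comb(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 0; i < k; i++) result = (result * (n - i)) / (i + 1);
  return result;
}

// pass@k: P(at least one of k sampled runs passes) = 1 - C(n-c, k) / C(n, k)
function passAtK(n, c, k) {
  return 1 - comb(n - c, k) / comb(n, k);
}

// pass^k: P(all k sampled runs pass) = C(c, k) / C(n, k)
function passHatK(n, c, k) {
  return comb(c, k) / comb(n, k);
}
```

For example, with 10 trials and 8 passes, `passAtK(10, 8, 5)` is 1.0 (every 5-run subset contains at least one pass, since there are only 2 failures), while `passHatK(10, 8, 5)` is 56/252, roughly 0.22: high capability ceiling, low reliability floor.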
Why you need both:
- High pass@k, low pass^k: The agent has the capability but is unreliable. Focus on reducing variance (better prompts, more constrained output format).
- Low pass@k, low pass^k: The agent lacks the capability. Focus on improving the prompt or agent architecture.
- High pass@k, high pass^k: The agent reliably performs this task. Move on.
Reference targets (from this repo’s eval philosophy):
| Metric | Target |
|---|---|
| pass@1 | > 80% |
| pass@5 | > 95% |
| pass^5 | > 60% |
| Negative suite pass@1 | > 90% |
Most eval frameworks support computing both metrics from multi-trial output, with optional grouping by suite or eval type.
Reference Solutions
Reference solutions are gold-standard outputs that document what a correct response looks like for each fixture. They serve two purposes:
Grader calibration. Compare agent output against the reference to verify your graders catch real failures and do not flag correct behavior.
LLM judge anchoring. Provide the reference solution to LLM rubric graders so they have a concrete standard to judge against, reducing variance in LLM-as-judge scoring.
Each reference solution covers one fixture and documents the expected findings, their severities, and the evidence that supports them.
Common Pitfalls
Only positive tests. The agent gets rewarded for finding issues everywhere, including in clean code. Add negative test suites.
Only LLM rubrics. Slow, expensive, and variable. Start with deterministic graders for structural checks and add LLM rubrics only for qualities that resist deterministic evaluation.
Toy fixtures. A 10-line file with one obvious bug does not test real-world agent behavior. Build fixtures that resemble actual codebases.
Single-run evaluation. One passing run does not mean the agent works. Use multi-trial execution and pass@k/pass^k metrics to measure true capability and reliability.
Not reading transcripts. The transcript shows you why the agent failed, not just that it failed. Read transcripts after every eval run. They are the primary debugging tool.
Related Content
- Team AI Evals for Coding Tools - Setting up evals for your team’s AI coding tools
- AI Evals for AI Enablement Platforms - Building shared eval infrastructure for reusable AI tools
- Agent Delivery Contract - ACD’s six artifact types that evals validate
- Pipeline Enforcement - How quality gates enforce ACD constraints
- Coding and Review Setup - Configuring AI agents for coding and review workflows
2 - Team AI Evals for Coding Tools
If you would notice a regression, it needs an eval. This page covers setting up eval infrastructure, writing your first positive and negative tests, choosing graders, and integrating evals into your pipeline.
Reference implementation: The dev-plugins repository demonstrates these patterns with Promptfoo, Claude Code, and custom graders.
What Needs Evals
Not every AI interaction needs an eval. Use this heuristic: if you would notice a regression, it needs an eval.
Artifacts that need evals:
- Custom prompts that guide agent behavior (code review checklists, scaffolding instructions, refactoring patterns)
- Slash commands that users invoke directly
- Agents that orchestrate multi-step workflows
- Skills (knowledge bases) that inform agent decisions
- Model migrations when you upgrade the underlying model version
- Configuration changes to temperature, timeout, or system prompts
Artifacts that do not need evals:
- One-off queries with no reuse expectation
- Simple wrappers around built-in IDE features
- Configuration that does not affect AI behavior (formatting, display preferences)
The decision comes down to blast radius. If a regression affects one developer once, skip the eval. If it affects every developer every time they use the tool, write the eval.
Setting Up Eval Infrastructure
Prerequisites: You need an eval framework (this guide uses Promptfoo), a working plugin to evaluate, and at least one realistic fixture. Start with a single real failure case: the last time your AI tool produced wrong output.
This walkthrough uses Promptfoo as the eval runner. The patterns apply to any eval framework that supports custom assertions and multi-trial execution.
Directory Structure
Mirror your plugin structure with an eval directory:
plugins/my-tool/ # Ships to users
commands/review.md
agents/reviewer.md
skills/patterns/SKILL.md
evals/my-tool/ # Stays in repo
promptfooconfig.yaml # Eval runner configuration
suites/ # Test case definitions
review.yaml
review-neg.yaml
graders/
deterministic/ # Fast structural checks
transcript/ # Agent process validation
llm-rubrics/ # LLM-as-judge quality checks
fixtures/ # Test input codebases
buggy-app/
clean-app/
reference-solutions/ # Gold-standard expected outputs
Configuration
The eval config sets the model provider, timeout, output format, and baseline assertions that apply to every test; see evals/frontend-dev/promptfooconfig.yaml in the reference implementation.
Key settings:
- Temperature 0: Reduces variance between runs, making evals more reproducible.
- Timeout 300000ms: Agents need time to read files, run tools, and compose output. Five minutes accommodates complex multi-step tasks.
- Baseline assertions: Every test automatically checks for non-empty output within length bounds. These catch complete failures without adding noise to individual test definitions.
Writing Your First Eval
Start with a real failure, not a hypothetical one. Think of the last time your AI tool produced wrong output. That failure becomes your first eval.
Step 1: Pick a Real Failure
Example: your accessibility audit command failed to flag missing `<label>` elements on a form.
Step 2: Build a Fixture That Reproduces It
Create a realistic component with the issue planted:
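A hypothetical fixture component might look like this (names and markup are illustrative, not from the reference repo; the planted issue is the unlabeled email input):

```javascript
// Hypothetical fixture sketch: a signup form with one planted accessibility
// issue. The password field is labeled correctly; the email field is not.
function renderSignupForm() {
  return `
    <form>
      <input type="email" id="email" placeholder="Email" />  <!-- planted: no <label> -->
      <label for="password">Password</label>
      <input type="password" id="password" />
      <button type="submit">Sign up</button>
    </form>`;
}
```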
Step 3: Write a Positive Test
Step 4: Build the Clean Counterpart
Create a component that does everything right:
Step 5: Write the Negative Test
You now have a positive test proving the tool finds real issues and a negative test proving it does not fabricate issues on clean code.
Choosing Graders
The AI Eval Methodology page details the three-layer grading framework. Here is the practical guidance for choosing graders on your team.
Start with deterministic graders. They run in milliseconds, produce clear pass/fail results, and are easy to debug. For most teams, deterministic graders cover 60-70% of what you need to check.
Minimal deterministic grader:
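For example, sketched as a Promptfoo-style custom JavaScript assertion (the fixture filename and the exact export signature are assumptions — check your framework's docs for the shape it expects):

```javascript
// Minimal deterministic grader sketch. Checks that the output cites the
// fixture file (hypothetical name) and mentions the planted label issue.
function minimalGrader(output) {
  const citesRealFile = /SignupForm\.(jsx|tsx)/.test(output); // hypothetical fixture file
  const mentionsLabel = /label/i.test(output);
  const pass = citesRealFile && mentionsLabel;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? 'Finding cites the fixture file and the missing label'
      : 'Output did not reference the fixture file or the label issue',
  };
}

// In a Promptfoo setup this function would be exported from the grader file
// referenced by the suite's assertion (e.g. a file:// JavaScript assert).
module.exports = minimalGrader;
```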
Add transcript graders for agent behavior. When your tool involves multi-step agent workflows (reading files, running tools, composing output), add transcript graders to verify the agent followed a sound process.
Add LLM rubrics for quality. When you need to evaluate subjective qualities like “are the recommendations actionable?” or “is the severity rating appropriate?”, use an LLM rubric.
Minimal LLM rubric:
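A rubric prompt template along these lines (criteria, weights, and response format are illustrative, not a file from the reference repo):

```markdown
Evaluate the code review output below against these criteria, scoring each 1-5:

1. **Accuracy (weight 0.4)** - Findings describe issues that actually exist
   in the cited files; no fabricated problems.
2. **Specificity (weight 0.3)** - Findings reference concrete files and lines,
   not generic advice.
3. **Actionability (weight 0.3)** - Each recommendation says what to change
   and why.

Compute the weighted total. Pass if the total is >= 3.5.
Respond as JSON: {"scores": {...}, "total": <number>, "pass": <boolean>}
```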
Calibrating Graders
A grader that consistently returns the wrong verdict is worse than no grader. It gives false confidence. Calibrate every grader before relying on it.
Three-step calibration process:
- Run against reference solutions. Your reference solution represents known-good output. Every grader should pass when given the reference solution. If a grader fails the reference, the grader is wrong.
- Run against known-bad output. Manually craft output with specific flaws (wrong files cited, missing findings, fabricated issues). Every grader should fail when given bad output. If a grader passes bad output, it is too permissive.
- Investigate disagreements. When a grader disagrees with your human judgment, diagnose whether the grader’s logic is wrong or your judgment needs updating. Usually, it is the grader.
The CORE-Bench lesson: A research benchmark found that fixing grader bugs improved measured agent performance from 42% to 95%. The agents were capable all along. The graders were miscalibrated. This is common and costly.
When to recalibrate:
- After modifying a grader’s logic
- After adding a new fixture to the suite
- Before and after a model migration
- When score distributions shift unexpectedly (sudden pass rate drop or spike)
Building Fixtures
Good fixtures determine good evals. Follow these principles:
Take real codebase snapshots. Copy actual code that triggered failures. Sanitize proprietary details but keep the structural complexity.
Plant issues deliberately. Document every planted issue so you can write targeted assertions. A fixture without documented issues is untestable.
Build clean counterparts. For every fixture with planted issues, build a clean version that follows best practices. The clean fixture drives your negative tests.
Keep them small but realistic. A fixture with 3-5 files covering 100-300 lines total is enough to test most agent behaviors without making eval runs slow.
Running and Interpreting Results
Single Run
This runs every test suite once and produces a scorecard.
Multi-Trial Execution
For pass@k metrics, run each test multiple times. Configure the `repeat` field in your promptfoo config or run the eval multiple times and aggregate results.
Reading the Scorecard
The scorecard shows each test case with its assertion results. Focus on:
- Which assertions failed? A failed deterministic check points to a specific structural problem. A failed LLM rubric suggests a quality issue.
- What is the score? Weighted assertions contribute proportionally. A test with a score of 0.7 passed most assertions but missed some.
- Are failures consistent? A test that fails the same way every run has a systematic problem. A test that fails intermittently has a variance problem.
The Transcript Viewer
The transcript viewer shows you the full agent conversation for failed tests:
Options:
- `-a/--all`: Show all transcripts, not just failures
- `-s/--short`: Abbreviated view (first/last 3 lines)
- `-t/--test <name>`: Filter by test description
- `-c/--count`: Print pass/fail counts only
Reading transcripts is the single most valuable debugging activity. The transcript shows you which files the agent read, which tools it used, where it got confused, and why it produced wrong output.
pass@k Computation
After running evals, compute capability and reliability metrics:
This groups results by `evalType` (capability vs. regression) and computes pass@k (capability ceiling) and pass^k (reliability floor) for k = 1, 3, and 5.
The Eval-Driven Development Loop
The eval is your development tool, not your release gate. The loop works like this:
Run the eval. Start with the full suite.
Read transcripts for failures. Open the transcript viewer and read the full agent conversation for every failed test. Do not skip this step.
Identify the failure mode. Common modes:
- Agent did not read the right files (prompt needs better guidance)
- Agent read the files but missed the issue (prompt needs more specific criteria)
- Agent found the issue but described it poorly (prompt needs output format guidance)
- Agent fabricated a finding (need a negative test to prevent this)
Improve the prompt or agent. Make one targeted change that addresses the identified failure mode.
Re-run the eval. Verify the fix works without breaking other tests.
When to add new test cases: When you discover a failure mode that no existing test covers.
When to improve existing tests: When a test passes for the wrong reasons or fails for reasons unrelated to the capability it tests.
Recording baselines: After a successful eval run, record the baseline:
This saves a timestamped record of pass@k metrics, git commit, and branch to evals/frontend-dev/eval-history.jsonl. Use baselines to track improvement over time and detect regressions across prompt changes.
Model Migration Testing
When upgrading the underlying model (e.g., from Claude Sonnet 4 to a newer version), use your eval suite to validate the migration systematically.
- Run the full suite on the current model. Record baselines for all metrics.
- Run the identical suite on the new model. Change only the provider; keep everything else constant.
- Run negative suites first. Regressions (fabricated findings on clean code) are the highest-risk failure mode in model migrations.
- Compare pass@k and pass^k. Look for changes in both capability ceiling and reliability floor.
- Read transcripts for new failure modes. A new model may fail differently than the old one: different tools used, different reasoning patterns, different output structure.
Watch for masked regressions: Aggregate pass rates can improve while individual tasks regress. Compare per-task results, not just suite-level metrics. A model that scores 85% overall but drops three previously-passing tasks may be worse for your users than the old model at 80%.
CI Integration
Run evals automatically when prompt or agent files change.
What to Run in CI
- Always run deterministic and transcript graders. They are fast and cheap.
- Run LLM rubrics on pull requests only. They are slow and cost real API tokens. Skip them on every commit to manage costs.
- Run the full suite with multi-trial on release branches. This gives you pass@k confidence before shipping.
Pipeline stage mapping:
| Pipeline Stage | Graders to Run | Rationale |
|---|---|---|
| Pre-commit | Deterministic only | Instant feedback, zero cost |
| CI (every push) | Deterministic + Transcript | Fast, free, catches process issues |
| Pull request | All layers including LLM rubrics | Full quality assessment at review |
| Release branch | Full suite, multi-trial | pass@k confidence before shipping |
Most commits get fast, cheap feedback. Expensive LLM rubric runs happen only at decision points where the full quality picture matters.
Gating Criteria
Gate on the regression (negative) suite:
Regression suite pass@1 >= 90%
This means: on a single run, at least 90% of negative tests pass. This prevents shipping prompts that fabricate findings on clean code.
Do not gate on capability suite pass rates during early development. Use capability metrics to track improvement, not to block merges.
Cost Management
| Grader Type | Speed | Cost | When to Run |
|---|---|---|---|
| Deterministic | < 1s | Free | Every commit |
| Transcript | < 1s | Free | Every commit |
| LLM Rubric | 5-30s | API cost | PR only |
| Full multi-trial | Minutes | API cost | Release branch |
Optimizing token costs:
- Cache eval prompts. Structure prompts with stable content (system prompts, instructions) first so provider caching can apply.
- Smaller models for initial passes. Use a faster, cheaper model for deterministic and transcript grading passes before running expensive LLM rubrics.
- Track per-run token costs. Log input and output tokens per eval run to identify cost trends and outliers.
- Batch rubric evaluations. Where possible, combine multiple rubric checks into a single LLM judge prompt to reduce per-call overhead.
For broader token optimization strategies, see Tokenomics.
Evals in the Quality Feedback Loop
Evals catch regressions before deployment. Production monitoring catches unanticipated failures after deployment. Together, they form a complete quality feedback loop.
The cycle:
Evals --> Deploy --> Monitor --> User reports --> New eval cases --> Evals
How production failures become eval cases:
- Capture the failure. Save the user’s input, the agent’s output, and the expected behavior.
- Build a fixture. Create a minimal reproduction from the captured input.
- Write the eval case. Add a test with `source: production-failure` in the metadata.
- Verify the failure. Run the eval and confirm the agent fails the same way.
- Fix and validate. Improve the prompt or agent, then re-run until the eval passes without breaking existing tests.
Complementary monitoring methods:
- Manual transcript review. Sample 5-10 transcripts per week from production to catch failure modes your evals miss.
- Expert validation agents. Deploy expert agents that validate agent output at runtime.
- User satisfaction signals. Track whether users accept, modify, or reject agent output. High rejection rates indicate undetected failure modes.
Related Content
- AI Eval Methodology - Three-layer grading framework and core eval concepts
- AI Evals for AI Enablement Platforms - Building shared eval infrastructure across multiple plugins
- Pipeline Enforcement - How quality gates enforce ACD constraints
- Small-Batch Sessions - Structuring focused agent work sessions
- Tokenomics - Optimizing token usage and costs for AI agent operations
3 - AI Evals for AI Enablement Platforms
Platform teams build reusable AI coding tools for multiple teams. Shared eval infrastructure (base configs, grader libraries, rubric templates) eliminates duplication and enforces consistency across the plugin portfolio.
Reference implementation: The dev-plugins repository demonstrates these patterns with Promptfoo, Claude Code, and custom graders.
What is an AI Enablement Platform
An AI enablement platform is the team that builds reusable AI coding tools (prompts, agents, plugins, skills) for multiple teams in an organization. Instead of every team writing their own code review agent or scaffolding command, the platform team builds these once and distributes them.
The platform challenge: your tools must work across diverse codebases with different languages, frameworks, and coding conventions. You cannot manually test against every consumer’s codebase. You cannot rely on consumer teams to report regressions.
The eval challenge compounds this: each tool in your portfolio needs its own eval suite, and those suites share common infrastructure. Without shared eval patterns, you duplicate graders, rubrics, and fixture conventions across every plugin.
Multi-Plugin Eval Architecture
The dev-plugins reference implementation demonstrates a monorepo structure that separates shipping artifacts from eval infrastructure. This example uses Claude Code plugins, but the same pattern applies to any collection of reusable AI tools:
plugins/frontend-dev/ # Ships to users
.claude-plugin/plugin.json
commands/*.md
agents/*.md
skills/*/SKILL.md
plugins/ai-readiness/ # Ships to users
.claude-plugin/plugin.json
commands/*.md
agents/*.md
skills/*/SKILL.md
evals/frontend-dev/ # Stays in repo
promptfooconfig.yaml
suites/ # 5 positive + 3 negative suites
graders/{deterministic,transcript,llm-rubrics}/
fixtures/
reference-solutions/
evals/ai-readiness/ # Stays in repo
promptfooconfig.yaml
suites/ # 7 positive + 7 negative suites
graders/{deterministic,transcript,llm-rubrics}/
fixtures/
reference-solutions/
eval-infra/ # Shared across all plugins
promptfoo-base.yaml
grader-lib/
rubric-templates/
scripts/
Running Evals Across the Portfolio
Run evals for a single plugin or the entire portfolio:
The eval-infra/scripts/run-plugin-evals.sh script iterates over a `KNOWN_PLUGINS` list, running each plugin's eval suite and aggregating results.
Plugin Validation
Before running evals, validate that a plugin has the required structure:
This checks for required directories, manifest fields, at least one command, at least one eval suite, and proper naming conventions. Validation catches structural problems before they cause confusing eval failures.
Shared Eval Infrastructure
Platform teams need a shared foundation that every plugin eval builds on. This eliminates duplication and enforces consistency.
Base Configuration
A single base config defines the provider, timeout, output format, and universal assertions. From `eval-infra/promptfoo-base.yaml` in the reference implementation:
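The base file itself is not reproduced in this guide; a sketch of what such a Promptfoo base config typically contains (the model id, concurrency, and assertion are assumptions, not the repo's actual values):

```yaml
# Hedged sketch of eval-infra/promptfoo-base.yaml; provider id, concurrency,
# and the universal assertion are assumptions.
providers:
  - id: anthropic:claude-sonnet-4-5   # assumed agent-backing model
evaluateOptions:
  maxConcurrency: 2                   # agentic tasks are slow; keep parallelism low
defaultTest:
  assert:
    - type: javascript                # universal assertion: non-empty output
      value: output.trim().length > 0
outputPath: results/latest.json
```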
Every plugin config replicates these defaults and adds plugin-specific variables (`pluginRoot`, `fixtureRoot`, `graderRoot`, `referenceRoot`). Shared variables (`evalInfraRoot`, `graderLibRoot`, `rubricRoot`) point back to the central infrastructure.
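For illustration, a plugin-level config following that pattern might look like this (the variable names come from the guide; the paths and suite file are assumptions):

```yaml
# Hedged sketch of evals/frontend-dev/promptfooconfig.yaml; only the variable
# names are from the guide, the values and suite file are assumptions.
description: frontend-dev plugin evals
defaultTest:
  vars:
    pluginRoot: ../../plugins/frontend-dev
    fixtureRoot: ./fixtures
    graderRoot: ./graders
    referenceRoot: ./reference-solutions
    evalInfraRoot: ../../eval-infra
    graderLibRoot: ../../eval-infra/grader-lib
    rubricRoot: ../../eval-infra/rubric-templates
tests:
  - file://suites/component-review.yaml   # hypothetical suite file
```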
Shared Grader Library
The grader library (`eval-infra/grader-lib/`) provides reusable grading functions that any plugin can use:
| Grader | Purpose |
|---|---|
| `report-schema.js` | Validates markdown heading structure or JSON fields |
| `finding-parser.js` | Extracts findings with severity and evidence |
| `hallucination-check.js` | Detects fabricated file path references |
| `transcript-utils.js` | Parses agent transcripts and tool-call sequences |
| `build-check.sh` | Runs `npm install && npm run build` in fixtures |
| `lint-check.sh` | Runs ESLint and reports error/warning counts |
Plugin-specific graders import from the shared library. For example, a transcript grader in `evals/ai-readiness/graders/transcript/evidence-gathering.js` loads `transcript-utils.js` from the shared `graderLibRoot` path to parse transcripts and count tool calls.
Shared Rubric Templates
LLM rubric templates in `eval-infra/rubric-templates/` provide consistent judging criteria across plugins:
- `code-quality-base.md` - Five weighted criteria (correctness, readability, maintainability, idiomatic usage, error handling) with a 3.5/5 pass threshold
- `over-engineering-base.md` - Checks for unnecessary abstraction and complexity
- `instruction-following.md` - Checks adherence to prompt instructions
- `report-quality.md` - Evaluates report structure and actionability
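As an illustration, the body of the five-criterion template might look like the following sketch (the weights and wording are assumptions; only the criteria names and the 3.5/5 threshold come from the guide):

```markdown
Score each criterion 0-5, then compute the weighted average:

- Correctness (weight 0.30): the code does what the task asked
- Readability (weight 0.20): clear naming, reasonable function size
- Maintainability (weight 0.20): low coupling, no needless duplication
- Idiomatic usage (weight 0.15): follows language and framework conventions
- Error handling (weight 0.15): failures are surfaced, not swallowed

Pass if the weighted average is at least 3.5.
```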
Plugin-specific rubrics extend or reference these templates. This prevents each plugin team from inventing their own quality criteria.
The Extend and Specialize Pattern
The shared infrastructure provides the foundation. Each plugin specializes it:
- Base config sets provider, timeout, and universal assertions
- Shared graders handle common structural checks
- Shared rubrics define baseline quality criteria
- Plugin config adds plugin-specific variables and test suites
- Plugin graders handle domain-specific checks (e.g., accessibility patterns, security vulnerability detection)
- Plugin rubrics add domain-specific quality criteria
This layering means a new plugin gets structural validation, hallucination detection, and transcript analysis for free. The plugin author only writes graders for the domain-specific checks their tool needs.
Fixture Diversity
Platform tools must handle diverse codebases. Your fixture portfolio should cover the range of code your tools will encounter in production.
Building a Fixture Matrix
From `evals/ai-readiness/fixtures/` in the reference implementation, seven fixture types exercise different tool capabilities:
| Fixture | What It Tests | Positive/Negative |
|---|---|---|
| `messy-repo/` | Naming, duplication, dead code detection | Positive |
| `insecure-repo/` | Security vulnerability detection | Positive |
| `bad-git-repo/` | Git hygiene assessment | Positive |
| `untested-repo/` | Test coverage gap detection | Positive |
| `bad-api-repo/` | API design issue detection | Positive |
| `spaghetti-arch-repo/` | Architecture problem detection | Positive |
| `clean-repo/` | False positive prevention | Negative |
Each positive fixture has documented, planted issues. The `clean-repo/` fixture follows best practices (clear naming, proper structure, tests, documentation) and drives negative tests across multiple suites.
Coverage Dimensions
Design fixtures to cover these dimensions:
- Language/framework diversity: If your tool supports React, Vue, and Angular, build fixtures for each.
- Problem type diversity: Naming issues, security vulnerabilities, architecture problems, missing tests, and API design flaws exercise different detection capabilities.
- Severity diversity: Include minor, moderate, and critical issues. Your tool should rate severity appropriately.
- Clean examples: At least one fixture per problem domain should be clean to drive negative tests.
Reference Solutions as Platform Artifacts
Reference solutions serve double duty on a platform team:
Grader calibration. Compare agent output against the reference to verify graders catch real failures and pass correct behavior. When a grader disagrees with the reference solution, the grader is probably wrong.
Onboarding material. New team members read reference solutions to understand what good output looks like for each tool. The reference solution `messy-repo-audit.md` documents exactly what findings the code review tool should produce, at what severity, with what evidence.
From `evals/ai-readiness/reference-solutions/` in the reference implementation, seven reference solutions cover the full fixture portfolio:
| Reference Solution | Fixture |
|---|---|
| `messy-repo-audit.md` | `messy-repo/` |
| `insecure-repo-findings.md` | `insecure-repo/` |
| `bad-git-health-report.md` | `bad-git-repo/` |
| `untested-repo-findings.md` | `untested-repo/` |
| `bad-api-findings.md` | `bad-api-repo/` |
| `spaghetti-arch-findings.md` | `spaghetti-arch-repo/` |
| `clean-repo-audit.md` | `clean-repo/` |
The `clean-repo-audit.md` reference solution documents what the agent should say about well-structured code: acknowledge what is done well, note minor improvement opportunities without raising false alarms, and assign no critical or major findings.
Meta-Evaluation
Platform teams face a second-order problem: how do you evaluate your eval infrastructure itself? If your graders are miscalibrated or your fixtures are unrealistic, your evals give false confidence.
The Eval-Rubric Pattern
The eval-rubric pattern uses a structured assessment against known best practices. This repo's `/eval-rubric` command scores the eval infrastructure against 12 dimensions from Anthropic's "Demystifying Evals for AI Agents" article, each scored 0-5:
- Start Early with Real Failures
- Source from Real User Behavior
- Unambiguous Tasks + Reference Solutions
- Balanced Problem Sets
- Robust Eval Harness + Stable Environment
- Thoughtful Grader Design
- Read Transcripts Regularly
- Monitor Capability Eval Saturation
- Maintain Evals Long-Term
- Non-Determinism Handling
- Agent-Specific Approaches
- Holistic Evaluation
Score thresholds: 0-2 requires an action plan to reach 3. 3-4 indicates adequate infrastructure with room for refinement. 5 indicates complete coverage.
The eval-rubric runs against the actual repo contents, reading suite files, grader implementations, fixture directories, and CI configuration before scoring. It produces evidence-based assessments, not opinions.
Running Meta-Evaluation Periodically
Run the eval-rubric after significant infrastructure changes:
- Adding a new grader type
- Expanding the fixture portfolio
- Changing the base configuration
- Adding a new plugin
Track scores over time. A dimension that drops below 3 after an infrastructure change indicates a regression in eval quality.
Expert Validation Agents
ACD defines expert validation agents that validate agent output at runtime; they are the production counterpart to offline evals. Where evals catch issues during development, expert agents catch issues during execution.
| Expert Agent | What It Validates | Offline Eval Counterpart |
|---|---|---|
| Intent Clarity Agent | Prompt matches user intent | LLM rubric (intent alignment) |
| Behavior Validation | Output matches expected behavior | LLM rubric (behavior quality) |
| Constraint Checker | Output respects system rules | Transcript grader (process) |
| Implementation Review | Code quality and correctness | Deterministic grader (structure) |
| Truth Verification | Output passes executable checks | Deterministic grader (build/test) |
Both need calibration. Expert agents, like graders, can be miscalibrated. Apply the same calibration discipline: test against known-good and known-bad outputs, investigate disagreements, and recalibrate periodically.
Both need negative testing. An expert agent that flags everything is as useless as a grader with 100% false positive rate. Test expert agents against clean inputs to verify they do not fabricate issues.
The division: Offline evals validate during development. They run in CI and during prompt engineering. Expert agents validate during execution. They run alongside the agent in production. A mature platform uses both.
The Eval Lifecycle at Scale
Adding New Plugins
When adding a new plugin to the platform, follow this checklist (detailed in `docs/ADDING_A_PLUGIN.md`):
- Create the plugin directory structure (`plugins/<name>/`)
- Create the plugin manifest (`plugin.json`)
- Write at least one command or agent
- Create the eval directory structure (`evals/<name>/`)
- Create the eval config replicating base defaults
- Write at least one positive suite with a fixture
- Write the corresponding negative suite
- Add the plugin to `KNOWN_PLUGINS` in the run script
- Validate structure with `validate-plugin.sh`
The checklist ensures every plugin ships with evals from day one. A plugin without evals does not ship.
Monitoring Capability Saturation
Track pass@k metrics over time for each plugin. When pass@5 consistently exceeds 95% across all capability suites, the current eval suite is saturated. The tool handles everything you test for. Either:
- The tool is genuinely excellent (check pass^5 to verify reliability)
- The tests are too easy (add harder fixtures, more subtle issues)
- The test suite has gaps (add new capability dimensions)
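The two metrics reward different things: pass@k asks whether any of k trials succeeded (capability), while pass^k asks whether all k did (reliability). A minimal sketch of computing both from per-task trial results (function names here are illustrative):

```javascript
// Hedged sketch: estimating pass@k and pass^k from recorded trial outcomes.
// Each task contributes an array of booleans, one per trial.
function passAtK(trials) {
  return trials.some(Boolean) ? 1 : 0;   // at least one trial passed
}

function passHatK(trials) {
  return trials.every(Boolean) ? 1 : 0;  // every trial passed (reliability)
}

function suiteMetrics(tasks) {
  // tasks: array of per-task trial arrays, e.g. [[true, false, ...], ...]
  const n = tasks.length;
  return {
    passAtK: tasks.reduce((sum, t) => sum + passAtK(t), 0) / n,
    passHatK: tasks.reduce((sum, t) => sum + passHatK(t), 0) / n,
  };
}
```

A suite where `passAtK` stays above 0.95 across runs is a saturation candidate; a gap between `passAtK` and `passHatK` signals flaky, non-deterministic behavior worth investigating.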
Saturation is a signal to expand the eval suite, not to stop evaluating.
Baseline Management
Record baselines after significant prompt or eval changes. The baseline-recording step appends a timestamped JSON record to `evals/ai-readiness/eval-history.jsonl` with pass@k metrics, git commit, and branch. Use the history to:
- Detect metric regressions across prompt changes
- Measure the impact of model migrations
- Report capability improvement over time to stakeholders
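The recording step is not reproduced in this guide; appending one JSONL record might look roughly like this sketch (the field names are assumptions beyond what the guide describes):

```shell
# Hedged sketch of a baseline-recording step; the repo's actual script and
# exact field names are assumptions.
record_baseline() {
  local history="$1" pass_at_5="$2"
  printf '{"timestamp":"%s","commit":"%s","branch":"%s","passAt5":%s}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(git rev-parse --short HEAD 2>/dev/null)" \
    "$(git rev-parse --abbrev-ref HEAD 2>/dev/null)" \
    "${pass_at_5}" >> "${history}"
}

# Usage: record_baseline evals/ai-readiness/eval-history.jsonl 0.92
```

Append-only JSONL keeps the history diff-friendly and trivially parseable for regression dashboards.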
Eval Maintenance and Retirement
Eval suites require ongoing maintenance. Without a retirement policy, suites accumulate stale tests that slow runs and obscure signal.
When to retire eval cases:
- Saturated AND stable. A test that passes consistently (pass^5 > 95%) across multiple model versions has served its purpose. Archive it, do not delete it. Archived tests can be re-activated if regressions appear.
When to split suites:
- Suites exceeding 20 cases. Large suites make it hard to identify which capability dimension failed. Split by capability dimension (e.g., separate “naming” from “architecture” tests).
Ownership model:
- Shared graders (in `eval-infra/grader-lib/`) - platform team owns
- Plugin-specific graders (in `evals/<plugin>/graders/`) - plugin team owns
- Rubric templates (in `eval-infra/rubric-templates/`) - platform team owns
Review cadence: Quarterly, aligned with model migration timelines. Review pass@k trends, retire saturated cases, split oversized suites, and recalibrate graders against updated reference solutions.
Common Platform Pitfalls
Building tools without evals first. The platform team ships a new plugin, gets user complaints, and then scrambles to build evals. Write the eval suite alongside the first command. Evals are the development tool, not an afterthought.
Overly generic shared graders. A grader that checks “output is valid markdown” adds little value. Shared graders should check specific structural properties (heading hierarchy, score arithmetic, hallucinated file paths) that apply across multiple plugins.
Overly specific shared graders. A grader that checks for React-specific patterns does not belong in the shared library. It belongs in the plugin’s grader directory. Keep the shared library domain-agnostic.
Uncalibrated rubric templates. LLM rubric templates need calibration against reference solutions. Run the rubric against your reference solution and verify it scores 4-5. Run it against a known-bad output and verify it scores 1-2. A rubric that gives 3.5 to everything is useless.
Not validating plugin structure. A missing grader directory or misconfigured manifest causes confusing eval failures. Run `validate-plugin.sh` before every eval run in CI.
Ignoring negative testing. Platform tools face the strongest temptation to over-report issues because they optimize for “finding things.” Negative test suites with clean fixtures are the only defense against false positive drift.
Related Content
- AI Eval Methodology - Three-layer grading framework and core eval concepts
- Team AI Evals for Coding Tools - Setting up evals for individual team AI tools
- Pipeline Enforcement - How quality gates enforce ACD constraints
- Tokenomics - Optimizing token usage in agent architecture