Evaluation & Quality

AI evaluations and quality for teams and organizations. Applying common CD practices to AI prompts, agents, skills, commands, etc.

To ensure AI behaves as expected, you, your team, and your organization need to take deliberate action. This section covers the basics of AI quality along with team-level and organizational guidance.

1 - AI Eval Methodology for Coding Tools

A three-layer grading framework and development cycle for evaluating non-deterministic AI coding tools with automated behavioral testing.

AI coding tools produce non-deterministic output. Evals make that output observable and measurable using three grading layers: deterministic checks, transcript analysis, and LLM rubrics.

This guide is for teams building AI coding tools and platform teams providing shared AI enablement infrastructure. For team-specific eval setup, see Team AI Evals. For platform-scale patterns, see AI Evals for AI Enablement Platforms.

Terminology

| Term | Definition |
| --- | --- |
| Task | A single work item given to the agent (one prompt + one fixture) |
| Trial | One execution of a task; multiple trials measure variance |
| Grader | An automated check that scores agent output (pass/fail or 0-1) |
| Transcript | The full agent conversation log: tool calls, reasoning, output |
| Outcome | The agent’s final output for a task |
| Evaluation harness | The framework that runs tasks, collects outcomes, applies graders |
| Agent harness | The runtime that executes the agent (e.g., Claude Code) |
| Evaluation suite | A collection of related tasks testing one capability dimension |

In the dev-plugins reference implementation: Promptfoo is the evaluation harness. Claude Code is the agent harness. YAML files in evals/<plugin>/suites/ are evaluation suites.

What Are AI Evals

AI coding tools produce non-deterministic output. The same prompt run twice can yield different code, different explanations, and different tool-use sequences. Traditional unit tests verify deterministic application logic. AI evals verify behavior: whether an agent finds the right issues, follows a sound process, and produces useful output.

Without evals, teams face:

  • Silent regressions: A prompt change that improves one scenario quietly breaks three others. Nobody notices until a user reports it.
  • Hallucination drift: The agent starts citing files that do not exist or inventing issues that are not present. Without negative tests, fabrication goes undetected.
  • Unmeasurable improvement: Every change is a guess. You cannot tell whether a prompt edit actually improved capability or just shifted failure modes.

Evals make AI tool quality observable and measurable.

This guide focuses on evals for coding and code review agents: tools that read code, produce findings, and generate or modify source files. Conversational agents and research agents have different evaluation needs and may require adapted approaches.

What Evals Validate: ACD Artifacts

In the Agentic Continuous Delivery framework, software delivery is organized around six first-class artifacts. Evals validate agent behavior against these artifacts. Each grading layer maps naturally to different artifact types.

| Artifact | Description | Primary Eval Layer |
| --- | --- | --- |
| Intent Description | What the user wants to achieve | LLM Rubric |
| User-Facing Behavior | Observable outcomes from the user’s perspective | LLM Rubric |
| Feature Description | Structured specification of a capability | Transcript |
| Executable Truth | Tests, build scripts, type checks | Deterministic |
| System Constraints | Security, performance, compliance rules | Transcript |
| Implementation | Source code and configuration | Deterministic |

Reading the table: Deterministic graders excel at checking artifacts with verifiable ground truth (code compiles, tests pass, files exist). Transcript graders verify the agent respected process constraints and addressed structured specifications. LLM rubrics evaluate alignment with intent and user-facing quality, the artifacts that require judgment.

This mapping guides grader selection: when you know which artifact type your eval targets, the table tells you which grading layer is the primary fit.

The Eval Development Cycle

Evals are a development tool, not a post-hoc quality gate. The cycle looks like this:

graph LR
    A[Write prompt or agent] --> B[Write eval]
    B --> C[Run eval]
    C --> D[Read transcripts]
    D --> E[Identify failure mode]
    E --> F[Improve prompt or agent]
    F --> C

The key insight: you write the eval before you consider the prompt done. Running the eval, reading the full agent transcript, and understanding why it failed teaches you more about your prompt than any amount of manual testing. The eval is your feedback loop.

Three-Layer Grading

A single grading approach cannot cover the full range of AI tool behaviors. Deterministic checks are fast but shallow. LLM judges catch nuance but are slow and expensive. Transcript analysis validates the agent’s process independent of its output. Combining all three layers gives you coverage, speed, and accuracy.

Layer 1: Deterministic Graders

Deterministic graders run fast, produce binary pass/fail results, and have near-zero false positive rates. They check structural properties of the output.

What they check:

  • Report structure matches expected headings and sections
  • Scores add up correctly (weighted arithmetic validation)
  • Output references real files from the fixture, not hallucinated paths
  • Specific keywords or patterns appear (or do not appear) in the output

A score arithmetic grader parses category scores and weights from agent output, computes the weighted average, and compares it to the reported overall score. A small tolerance (e.g., +/- 3 points) accommodates rounding. This catches a common failure mode: the agent reports individual category scores and a total that do not add up.
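
A minimal sketch of such a grader, assuming the report lists categories as lines like "Security: 72 (weight 0.3)" and a final "Overall: 78" line (the line patterns are assumptions; adjust them to your own report format):

```javascript
// graders/deterministic/score-arithmetic.js (illustrative sketch; the regexes
// assume a "Category: 72 (weight 0.3)" and "Overall: 78" line format)
module.exports = function (output) {
  const categories = [...output.matchAll(/^[-*]?\s*([\w ]+):\s*(\d+)\s*\(weight\s*([\d.]+)\)/gim)]
    .map((m) => ({ score: Number(m[2]), weight: Number(m[3]) }));
  const overallMatch = output.match(/overall(?:\s+score)?:\s*(\d+)/i);

  if (categories.length === 0 || !overallMatch) {
    return { pass: false, score: 0, reason: "Could not parse category scores or overall score" };
  }

  // Compute the weighted average and compare it to the reported overall score
  const totalWeight = categories.reduce((sum, c) => sum + c.weight, 0);
  const weighted = categories.reduce((sum, c) => sum + c.score * c.weight, 0) / totalWeight;
  const reported = Number(overallMatch[1]);
  const withinTolerance = Math.abs(weighted - reported) <= 3; // +/- 3 point tolerance

  return {
    pass: withinTolerance,
    score: withinTolerance ? 1 : 0,
    reason: `Reported ${reported}, computed ${weighted.toFixed(1)}`,
  };
};
```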

A report structure grader validates that the output contains required headings at the correct level, that headings match expected patterns, that required sections have non-empty content, and that the output falls within length bounds.
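
A sketch of a structure grader with placeholder heading names and the length bounds from the baseline config (your required sections will differ):

```javascript
// graders/deterministic/report-structure.js (illustrative sketch; heading names
// and length bounds are placeholders for your own report format)
const REQUIRED_HEADINGS = ["## Summary", "## Findings", "## Recommendations"];

module.exports = function (output) {
  const missing = REQUIRED_HEADINGS.filter((h) => !output.includes(h));
  const withinBounds = output.length >= 500 && output.length <= 50000;

  // Each required section must have non-empty content before the next heading
  const emptySections = REQUIRED_HEADINGS.filter((h) => {
    const start = output.indexOf(h);
    if (start === -1) return false; // already reported as missing
    const body = output.slice(start + h.length).split(/^## /m)[0];
    return body.trim().length === 0;
  });

  const pass = missing.length === 0 && emptySections.length === 0 && withinBounds;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? "Report structure matches expectations"
      : `Missing: [${missing.join(", ")}]; empty: [${emptySections.join(", ")}]; length OK: ${withinBounds}`,
  };
};
```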

Layer 2: Transcript Graders

Transcript graders validate how the agent worked, not just what it produced. They parse the agent’s tool-call sequence and conversation turns to verify sound process.

What they check:

  • The agent gathered evidence (Read, Glob, Grep, Bash) before stating findings
  • The agent used multiple evidence sources, not just one
  • Evidence-gathering actions make up a sufficient proportion of total actions

An evidence gathering grader checks three things: whether evidence-gathering tools (Read, Glob, Grep) were used before the agent stated findings, whether at least two different evidence tools were used, and whether evidence-gathering actions make up a sufficient proportion of total actions (e.g., at least 40%). This catches agents that jump to conclusions without reading the code, or that rely on a single tool without examining actual file contents.
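
A sketch of those three checks, assuming the transcript has already been parsed into an ordered list of events such as { type: "tool_call", tool: "Read" } (the event shape is an assumption; actual parsing depends on your agent harness):

```javascript
// graders/transcript/evidence-gathering.js (illustrative sketch; assumes a
// pre-parsed transcript of ordered events with tool_call and assistant_text types)
const EVIDENCE_TOOLS = new Set(["Read", "Glob", "Grep", "Bash"]);

module.exports = function gradeEvidenceGathering(events) {
  const toolCalls = events.filter((e) => e.type === "tool_call");
  const evidenceCalls = toolCalls.filter((e) => EVIDENCE_TOOLS.has(e.tool));

  // 1. Evidence was gathered before the first stated finding
  const firstFindingIdx = events.findIndex(
    (e) => e.type === "assistant_text" && /finding|issue|violation/i.test(e.text)
  );
  const firstEvidenceIdx = events.findIndex(
    (e) => e.type === "tool_call" && EVIDENCE_TOOLS.has(e.tool)
  );
  const evidenceBeforeFindings =
    firstEvidenceIdx !== -1 && (firstFindingIdx === -1 || firstEvidenceIdx < firstFindingIdx);

  // 2. At least two distinct evidence tools were used
  const distinctTools = new Set(evidenceCalls.map((e) => e.tool));
  const multipleSources = distinctTools.size >= 2;

  // 3. Evidence-gathering calls are a sufficient share of all tool calls (>= 40%)
  const sufficientShare =
    toolCalls.length > 0 && evidenceCalls.length / toolCalls.length >= 0.4;

  const pass = evidenceBeforeFindings && multipleSources && sufficientShare;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: `evidence-first=${evidenceBeforeFindings}, distinct-tools=${distinctTools.size}, evidence-share=${toolCalls.length ? (evidenceCalls.length / toolCalls.length).toFixed(2) : "n/a"}`,
  };
};
```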

Layer 3: LLM Rubrics

LLM rubrics use a language model as judge to evaluate qualities that resist deterministic checking: accuracy of findings, quality of recommendations, appropriate severity ratings, and absence of hallucination.

A typical code quality rubric defines weighted criteria such as correctness, readability, maintainability, idiomatic usage, and error handling, each scored on a 1-5 scale. The LLM judge scores each criterion, a weighted total is computed, and the result passes if it meets a threshold (e.g., 3.5 out of 5).
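
The weighted-total step is plain arithmetic. A small sketch using the criterion names above (the specific weights are illustrative assumptions; the 3.5/5 threshold follows the example):

```javascript
// Combining LLM-judge criterion scores into a pass/fail verdict (sketch).
// Weights are illustrative; they must sum to 1.0.
const WEIGHTS = {
  correctness: 0.3,
  readability: 0.2,
  maintainability: 0.2,
  idiomaticUsage: 0.15,
  errorHandling: 0.15,
};

function scoreRubric(criterionScores, threshold = 3.5) {
  // criterionScores: { correctness: 4, readability: 3, ... }, each on a 1-5 scale
  const weightedTotal = Object.entries(WEIGHTS).reduce(
    (sum, [criterion, weight]) => sum + weight * criterionScores[criterion],
    0
  );
  return { weightedTotal, pass: weightedTotal >= threshold };
}

// Example:
// scoreRubric({ correctness: 4, readability: 4, maintainability: 3, idiomaticUsage: 4, errorHandling: 3 })
// => { weightedTotal: 3.65, pass: true }
```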

LLM rubrics are the slowest and most expensive grading layer. Use them for qualities that the other layers cannot check.

Human Review as Calibration

Human review is a calibration tool, not a fourth runtime layer. You do not include human review in the automated eval pipeline. Instead, you use human review periodically to verify that your graders are correctly calibrated.

The CORE-Bench study found that fixing grader bugs improved measured performance from 42% to 95%. Uncalibrated graders waste prompt engineering effort on problems that do not exist.

For the hands-on calibration process and recalibration triggers, see Calibrating Graders.

Decision Table: When to Use Each Layer

| Question | Layer |
| --- | --- |
| Does the output have the right structure? | Deterministic |
| Do the numbers add up? | Deterministic |
| Does the output reference real files? | Deterministic |
| Did the agent read the code before judging? | Transcript |
| Did the agent use appropriate tools? | Transcript |
| Are the findings accurate and specific? | LLM Rubric |
| Are the recommendations actionable? | LLM Rubric |
| Is the severity rating appropriate? | LLM Rubric |

Worked Example: Three Layers Combined

Consider a code review eval that sends a messy codebase to the agent (mixed naming conventions, duplicated logic, dead code, a god class). A single test case uses all three layers:

  1. Deterministic (high weight): Checks that findings reference specific files and line numbers from the fixture, and that the report has the expected heading structure.

  2. Transcript (medium weight): Verifies the agent read code files before producing findings.

  3. LLM Rubric (high weight): Judges whether findings include specific file references and accurate descriptions, and whether recommendations are actionable rather than generic.

The deterministic graders run in milliseconds and catch structural failures. The transcript grader catches agents that skip evidence gathering. The LLM rubrics evaluate the subjective quality that only another language model can assess. Together, they cover structure, process, and quality.

Positive and Negative Test Pairs

Every eval suite needs two types of tests:

Positive (capability) tests verify the tool finds real issues. You give the agent a fixture with planted problems and assert that it detects them.

Negative (regression) tests verify the tool does not fabricate findings. You give the agent a clean fixture and assert that it does not report false positives.

Without negative tests, you optimize for recall at the cost of precision. The agent learns to report everything as a problem, including things that are fine. Without positive tests, you have no idea whether the tool actually works.

Naming convention: every positive suite file has a corresponding negative suite file with a -neg suffix (e.g., review.yaml and review-neg.yaml).

For a step-by-step walkthrough of building positive and negative test pairs, see Writing Your First Eval.

Fixture Design

Fixtures are the codebases your agent evaluates during testing. Their quality determines your eval quality.

Principles:

  • Realistic, not toy. Use code structures that resemble real projects. A single file with one obvious bug teaches you nothing about agent behavior on real codebases.

  • One scenario per test case. Each test should exercise a single capability. This mirrors ACD’s small-batch session pattern: one scenario per session keeps signal clean.

  • Planted issues with documented intent. Every issue in a positive fixture should be deliberate and documented. List expected findings in the suite metadata or in a reference solution.

  • Clean fixtures for negative tests. Build fixtures that follow best practices so you can verify the agent does not fabricate findings.

  • Diverse fixture types. Different fixtures exercise different capabilities.

For a fixture portfolio example, see Building a Fixture Matrix.

Task Quality

Ambiguous task specifications are the primary source of eval noise. If two domain experts would disagree on whether an agent’s output passes or fails, the task is underspecified, not the agent.

The two-expert test: Before finalizing an eval case, ask whether two domain experts given the same output would independently reach the same pass/fail verdict. If not, tighten the specification.

Writing unambiguous assertions:

  • Specific over generic. “Detects missing <label> on the email input” is testable. “Finds accessibility issues” is not.
  • Observable criteria. Assert on things you can check (keywords present, files referenced, structure correct), not on vague quality.
  • Reference solutions as disambiguation. When a task could be interpreted multiple ways, write a reference solution that documents the intended interpretation. The reference eliminates ambiguity.

When pass rates are zero: A 0% pass rate is usually a task bug, not an agent bug. Before blaming the agent, investigate the task specification and grader logic. Common causes: overly narrow regex assertions, graders checking the wrong field, or fixture content that does not match the prompt’s assumptions.

Metrics: pass@k and pass^k

Single-run pass rates are misleading for non-deterministic systems. A test that passes once might fail on the next run. Two metrics address this:

pass@k (capability ceiling): The probability that at least 1 of k independent runs passes. Computed as 1 - C(n-c, k) / C(n, k) where n is total runs and c is passing runs. This tells you what the agent can do on a good run.

pass^k (reliability floor): The probability that all k independent runs pass. Computed as C(c, k) / C(n, k). This tells you how consistently the agent succeeds.
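
Both metrics are straightforward to compute from trial counts. A small sketch following the formulas above:

```javascript
// pass@k and pass^k from n trials with c passes (sketch; returns values in [0, 1])
function comb(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

function passAtK(n, c, k) {
  // Probability that at least one of k sampled trials passes
  return 1 - comb(n - c, k) / comb(n, k);
}

function passHatK(n, c, k) {
  // Probability that all k sampled trials pass
  return comb(c, k) / comb(n, k);
}

// Example with 10 trials and 7 passes:
// passAtK(10, 7, 5)  -> 1 - C(3,5)/C(10,5) = 1.0
// passHatK(10, 7, 5) -> C(7,5)/C(10,5) = 21/252 = 0.083
```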

Why you need both:

  • High pass@k, low pass^k: The agent has the capability but is unreliable. Focus on reducing variance (better prompts, more constrained output format).
  • Low pass@k, low pass^k: The agent lacks the capability. Focus on improving the prompt or agent architecture.
  • High pass@k, high pass^k: The agent reliably performs this task. Move on.

Reference targets (from this repo’s eval philosophy):

| Metric | Target |
| --- | --- |
| pass@1 | > 80% |
| pass@5 | > 95% |
| pass^5 | > 60% |
| Negative suite pass@1 | > 90% |

Most eval frameworks support computing both metrics from multi-trial output, with optional grouping by suite or eval type.

Reference Solutions

Reference solutions are gold-standard outputs that document what a correct response looks like for each fixture. They serve two purposes:

  1. Grader calibration. Compare agent output against the reference to verify your graders catch real failures and do not flag correct behavior.

  2. LLM judge anchoring. Provide the reference solution to LLM rubric graders so they have a concrete standard to judge against, reducing variance in LLM-as-judge scoring.

Each reference solution covers one fixture and documents the expected findings, their severities, and the evidence that supports them.

Common Pitfalls

Only positive tests. The agent gets rewarded for finding issues everywhere, including in clean code. Add negative test suites.

Only LLM rubrics. Slow, expensive, and variable. Start with deterministic graders for structural checks and add LLM rubrics only for qualities that resist deterministic evaluation.

Toy fixtures. A 10-line file with one obvious bug does not test real-world agent behavior. Build fixtures that resemble actual codebases.

Single-run evaluation. One passing run does not mean the agent works. Use multi-trial execution and pass@k/pass^k metrics to measure true capability and reliability.

Not reading transcripts. The transcript shows you why the agent failed, not just that it failed. Read transcripts after every eval run. They are the primary debugging tool.

2 - Team AI Evals for Coding Tools

How individual teams set up, write, and run evals for their AI coding tools using eval-driven development.

If you would notice a regression, it needs an eval. This page covers setting up eval infrastructure, writing your first positive and negative tests, choosing graders, and integrating evals into your pipeline.

Reference implementation: The dev-plugins repository demonstrates these patterns with Promptfoo, Claude Code, and custom graders.

What Needs Evals

Not every AI interaction needs an eval. Use this heuristic: if you would notice a regression, it needs an eval.

Artifacts that need evals:

  • Custom prompts that guide agent behavior (code review checklists, scaffolding instructions, refactoring patterns)
  • Slash commands that users invoke directly
  • Agents that orchestrate multi-step workflows
  • Skills (knowledge bases) that inform agent decisions
  • Model migrations when you upgrade the underlying model version
  • Configuration changes to temperature, timeout, or system prompts

Artifacts that do not need evals:

  • One-off queries with no reuse expectation
  • Simple wrappers around built-in IDE features
  • Configuration that does not affect AI behavior (formatting, display preferences)

The decision comes down to blast radius. If a regression affects one developer once, skip the eval. If it affects every developer every time they use the tool, write the eval.

Setting Up Eval Infrastructure

Prerequisites: You need an eval framework (this guide uses Promptfoo), a working plugin to evaluate, and at least one realistic fixture. Start with a single real failure case: the last time your AI tool produced wrong output.

This walkthrough uses Promptfoo as the eval runner. The patterns apply to any eval framework that supports custom assertions and multi-trial execution.

Directory Structure

Mirror your plugin structure with an eval directory:

plugins/my-tool/           # Ships to users
  commands/review.md
  agents/reviewer.md
  skills/patterns/SKILL.md

evals/my-tool/             # Stays in repo
  promptfooconfig.yaml     # Eval runner configuration
  suites/                  # Test case definitions
    review.yaml
    review-neg.yaml
  graders/
    deterministic/         # Fast structural checks
    transcript/            # Agent process validation
    llm-rubrics/           # LLM-as-judge quality checks
  fixtures/                # Test input codebases
    buggy-app/
    clean-app/
  reference-solutions/     # Gold-standard expected outputs

Configuration

The eval config sets the model provider, timeout, output format, and baseline assertions that apply to every test. From evals/frontend-dev/promptfooconfig.yaml in the reference implementation:

providers:
  - id: anthropic:messages:claude-sonnet-4-20250514
    label: claude-sonnet
    config:
      temperature: 0
      max_tokens: 16384

defaultTest:
  options:
    timeout: 300000 # 5 minutes per test case
    transform: output.trim()
  assert:
    - type: javascript
      value: "output.length > 0" # Non-empty output
    - type: javascript
      value: "output.length >= 500" # Minimum substance
    - type: javascript
      value: "output.length <= 50000" # Maximum length bound

Key settings:

  • Temperature 0: Reduces variance between runs, making evals more reproducible.
  • Timeout 300000ms: Agents need time to read files, run tools, and compose output. Five minutes accommodates complex multi-step tasks.
  • Baseline assertions: Every test automatically checks for non-empty output within length bounds. These catch complete failures without adding noise to individual test definitions.

Writing Your First Eval

Start with a real failure, not a hypothetical one. Think of the last time your AI tool produced wrong output. That failure becomes your first eval.

Step 1: Pick a Real Failure

Example: your accessibility audit command missed missing <label> elements on a form.

Step 2: Build a Fixture That Reproduces It

Create a realistic component with the issue planted:

// fixtures/bad-form/BadForm.jsx
export default function BadForm() {
  return (
    <div>
      <h1>Contact Us</h1>
      <h4>Fill out the form</h4> {/* Heading level skip */}
      <input type="text" placeholder="Name" /> {/* No label */}
      <input type="email" placeholder="Email" /> {/* No label */}
      <div onClick={() => submit()}>Submit</div> {/* No keyboard handler */}
    </div>
  );
}

Step 3: Write a Positive Test

- description: "Accessibility audit detects missing labels and keyboard issues"
  metadata:
    suite: a11y
    case: bad-form-positive
    evalType: capability
    source: production-failure
  vars:
    fixture: "{{fixtureRoot}}/bad-form"
    prompt: |
      Review this component for accessibility violations:
      {% raw %}
      ```jsx
      // BadForm.jsx -- component source here
      ```
      {% endraw %}
  assert:
    - type: javascript
      value: "output.match(/label/i) !== null"
      metric: detects_missing_labels
      weight: 3
    - type: javascript
      value: "output.match(/keyboard|onClick.*div/i) !== null"
      metric: detects_keyboard_issues
      weight: 2

Step 4: Build the Clean Counterpart

Create a component that does everything right:

// fixtures/accessible-form/SearchBox.jsx
export default function SearchBox() {
  return (
    <form role="search">
      <label htmlFor="q">Search</label>
      <input id="q" type="search" aria-describedby="hint" />
      <p id="hint">Enter keywords to search</p>
      <button type="submit">Search</button>
    </form>
  );
}

Step 5: Write the Negative Test

- description: "Accessible form should not trigger false positives"
  metadata:
    suite: a11y-neg
    case: accessible-form-negative
    evalType: regression
    source: manual
  vars:
    fixture: "{{fixtureRoot}}/accessible-form"
    prompt: |
      Review this component for accessibility violations:
      {% raw %}
      ```jsx
      // SearchBox.jsx -- component source here
      ```
      {% endraw %}
  assert:
    - type: not-icontains
      value: "critical"
      weight: 2
    - type: llm-rubric
      value: >
        The output should not report false positive accessibility violations
        against this well-structured accessible form component.
      weight: 4

You now have a positive test proving the tool finds real issues and a negative test proving it does not fabricate issues on clean code.

Choosing Graders

The AI Eval Methodology page details the three-layer grading framework. Here is the practical guidance for choosing graders on your team.

Start with deterministic graders. They run in milliseconds, produce clear pass/fail results, and are easy to debug. For most teams, deterministic graders cover 60-70% of what you need to check.

Minimal deterministic grader:

// graders/deterministic/checks-labels.js
module.exports = function (output) {
  const mentionsLabels = /label|htmlFor|aria-label/i.test(output);
  return {
    pass: mentionsLabels,
    score: mentionsLabels ? 1 : 0,
    reason: mentionsLabels
      ? "Output discusses form labeling"
      : "Output does not mention labels, htmlFor, or aria-label",
  };
};

Add transcript graders for agent behavior. When your tool involves multi-step agent workflows (reading files, running tools, composing output), add transcript graders to verify the agent followed a sound process.

Add LLM rubrics for quality. When you need to evaluate subjective qualities like “are the recommendations actionable?” or “is the severity rating appropriate?”, use an LLM rubric.

Minimal LLM rubric:

- type: llm-rubric
  value: >
    The output identifies specific accessibility violations with WCAG success
    criterion references. Each finding includes the file location and a concrete
    fix recommendation. Generic advice like "add labels" without specifying which
    elements is insufficient.
  weight: 3

Calibrating Graders

A grader that consistently returns the wrong verdict is worse than no grader. It gives false confidence. Calibrate every grader before relying on it.

Three-step calibration process (a minimal harness sketch follows the list):

  1. Run against reference solutions. Your reference solution represents known-good output. Every grader should pass when given the reference solution. If a grader fails the reference, the grader is wrong.
  2. Run against known-bad output. Manually craft output with specific flaws (wrong files cited, missing findings, fabricated issues). Every grader should fail when given bad output. If a grader passes bad output, it is too permissive.
  3. Investigate disagreements. When a grader disagrees with your human judgment, diagnose whether the grader’s logic is wrong or your judgment needs updating. Usually, it is the grader.
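
A minimal calibration harness might look like this (a sketch; the file names are placeholders, and the grader is the checks-labels example from earlier):

```javascript
// calibrate-grader.js (illustrative sketch; paths are placeholders for your layout)
const fs = require("fs");
const grader = require("./graders/deterministic/checks-labels.js");

const reference = fs.readFileSync("reference-solutions/bad-form-audit.md", "utf8"); // known-good
const knownBad = fs.readFileSync("calibration/known-bad-output.md", "utf8"); // hand-crafted flaws

const onReference = grader(reference);
const onKnownBad = grader(knownBad);

// A calibrated grader passes the reference solution and fails the known-bad output
if (!onReference.pass) {
  console.error("Grader fails the reference solution -- the grader is wrong:", onReference.reason);
}
if (onKnownBad.pass) {
  console.error("Grader passes known-bad output -- the grader is too permissive");
}
```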

The CORE-Bench lesson: A research benchmark found that fixing grader bugs improved measured agent performance from 42% to 95%. The agents were capable all along. The graders were miscalibrated. This is common and costly.

When to recalibrate:

  • After modifying a grader’s logic
  • After adding a new fixture to the suite
  • Before and after a model migration
  • When score distributions shift unexpectedly (sudden pass rate drop or spike)

Building Fixtures

Good fixtures determine good evals. Follow these principles:

  • Take real codebase snapshots. Copy actual code that triggered failures. Sanitize proprietary details but keep the structural complexity.

  • Plant issues deliberately. Document every planted issue so you can write targeted assertions. A fixture without documented issues is untestable.

  • Build clean counterparts. For every fixture with planted issues, build a clean version that follows best practices. The clean fixture drives your negative tests.

  • Keep them small but realistic. A fixture with 3-5 files covering 100-300 lines total is enough to test most agent behaviors without making eval runs slow.

Running and Interpreting Results

Single Run

npm run eval:frontend

This runs every test suite once and produces a scorecard.

Multi-Trial Execution

For pass@k metrics, run each test multiple times. Configure the repeat field in your promptfoo config or run the eval multiple times and aggregate results.

Reading the Scorecard

The scorecard shows each test case with its assertion results. Focus on:

  • Which assertions failed? A failed deterministic check points to a specific structural problem. A failed LLM rubric suggests a quality issue.
  • What is the score? Weighted assertions contribute proportionally. A test with a score of 0.7 passed most assertions but missed some.
  • Are failures consistent? A test that fails the same way every run has a systematic problem. A test that fails intermittently has a variance problem.

The Transcript Viewer

The transcript viewer shows you the full agent conversation for failed tests:

./eval-infra/scripts/transcript-viewer.sh evals/frontend-dev/.promptfoo

Options:

  • -a / --all: Show all transcripts, not just failures
  • -s / --short: Abbreviated view (first/last 3 lines)
  • -t / --test <name>: Filter by test description
  • -c / --count: Print pass/fail counts only

Reading transcripts is the single most valuable debugging activity. The transcript shows you which files the agent read, which tools it used, where it got confused, and why it produced wrong output.

pass@k Computation

After running evals, compute capability and reliability metrics:

python eval-infra/scripts/compute-pass-at-k.py \
  --results evals/frontend-dev/.promptfoo/output.json \
  --k 1 3 5 \
  --group-by evalType

This groups results by evalType (capability vs. regression) and computes pass@k (capability ceiling) and pass^k (reliability floor) for k=1, 3, and 5.

The Eval-Driven Development Loop

The eval is your development tool, not your release gate. The loop works like this:

  1. Run the eval. Start with the full suite.

  2. Read transcripts for failures. Open the transcript viewer and read the full agent conversation for every failed test. Do not skip this step.

  3. Identify the failure mode. Common modes:

    • Agent did not read the right files (prompt needs better guidance)
    • Agent read the files but missed the issue (prompt needs more specific criteria)
    • Agent found the issue but described it poorly (prompt needs output format guidance)
    • Agent fabricated a finding (need a negative test to prevent this)
  4. Improve the prompt or agent. Make one targeted change that addresses the identified failure mode.

  5. Re-run the eval. Verify the fix works without breaking other tests.

When to add new test cases: When you discover a failure mode that no existing test covers.

When to improve existing tests: When a test passes for the wrong reasons or fails for reasons unrelated to the capability it tests.

Recording baselines: After a successful eval run, record the baseline:

./eval-infra/scripts/record-baseline.sh frontend-dev

This saves a timestamped record of pass@k metrics, git commit, and branch to evals/frontend-dev/eval-history.jsonl. Use baselines to track improvement over time and detect regressions across prompt changes.

Model Migration Testing

When upgrading the underlying model (e.g., from Claude Sonnet 4 to a newer version), use your eval suite to validate the migration systematically.

  1. Run the full suite on the current model. Record baselines for all metrics.
  2. Run the identical suite on the new model. Change only the provider; keep everything else constant.
  3. Run negative suites first. Regressions (fabricated findings on clean code) are the highest-risk failure mode in model migrations.
  4. Compare pass@k and pass^k. Look for changes in both capability ceiling and reliability floor.
  5. Read transcripts for new failure modes. A new model may fail differently than the old one: different tools used, different reasoning patterns, different output structure.

Watch for masked regressions: Aggregate pass rates can improve while individual tasks regress. Compare per-task results, not just suite-level metrics. A model that scores 85% overall but drops three previously-passing tasks may be worse for your users than the old model at 80%.
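
A small sketch of a per-task comparison, assuming each run's results have been reduced to a map from task description to pass/fail (the input shape is an assumption; adapt it to your framework's actual output format):

```javascript
// compare-per-task.js (sketch; expects two JSON files mapping task -> pass boolean)
const fs = require("fs");

const baseline = JSON.parse(fs.readFileSync(process.argv[2], "utf8")); // current model
const candidate = JSON.parse(fs.readFileSync(process.argv[3], "utf8")); // new model

const regressions = Object.keys(baseline).filter(
  (task) => baseline[task] && candidate[task] === false
);
const improvements = Object.keys(baseline).filter(
  (task) => !baseline[task] && candidate[task] === true
);

console.log(`Newly failing tasks: ${regressions.length}`);
regressions.forEach((task) => console.log(`  - ${task}`));
console.log(`Newly passing tasks: ${improvements.length}`);
// A higher aggregate pass rate with a non-empty regression list is a masked regression.
```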

CI Integration

Run evals automatically when prompt or agent files change.

What to Run in CI

  • Always run deterministic and transcript graders. They are fast and cheap.
  • Run LLM rubrics on pull requests only. They are slow and cost real API tokens. Skip them on every commit to manage costs.
  • Run the full suite with multi-trial on release branches. This gives you pass@k confidence before shipping.

Pipeline stage mapping:

| Pipeline Stage | Graders to Run | Rationale |
| --- | --- | --- |
| Pre-commit | Deterministic only | Instant feedback, zero cost |
| CI (every push) | Deterministic + Transcript | Fast, free, catches process issues |
| Pull request | All layers including LLM rubrics | Full quality assessment at review |
| Release branch | Full suite, multi-trial | pass@k confidence before shipping |

Most commits get fast, cheap feedback. Expensive LLM rubric runs happen only at decision points where the full quality picture matters.

Gating Criteria

Gate on the regression (negative) suite:

Regression suite pass@1 >= 90%

This means: on a single run, at least 90% of negative tests pass. This prevents shipping prompts that fabricate findings on clean code.
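
One way to enforce the gate in CI is a short script that fails the build when the regression suite drops below the threshold (a sketch; the aggregated results shape shown here is an assumption, not your framework's native format):

```javascript
// gate-regression-suite.js (sketch; assumes results aggregated into
// [{ evalType: "regression", pass: true }, ...])
const fs = require("fs");

const results = JSON.parse(fs.readFileSync(process.argv[2], "utf8"));
const regression = results.filter((r) => r.evalType === "regression");
const passRate = regression.filter((r) => r.pass).length / regression.length;

console.log(`Regression suite pass@1: ${(passRate * 100).toFixed(1)}%`);
if (passRate < 0.9) {
  console.error("Gate failed: regression suite pass@1 below 90%");
  process.exit(1); // fail the CI job
}
```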

Do not gate on capability suite pass rates during early development. Use capability metrics to track improvement, not to block merges.

Cost Management

| Grader Type | Speed | Cost | When to Run |
| --- | --- | --- | --- |
| Deterministic | < 1s | Free | Every commit |
| Transcript | < 1s | Free | Every commit |
| LLM Rubric | 5-30s | API cost | PR only |
| Full multi-trial | Minutes | API cost | Release branch |

Optimizing token costs:

  • Cache eval prompts. Structure prompts with stable content (system prompts, instructions) first so provider caching can apply.
  • Smaller models for initial passes. Use a faster, cheaper model for deterministic and transcript grading passes before running expensive LLM rubrics.
  • Track per-run token costs. Log input and output tokens per eval run to identify cost trends and outliers.
  • Batch rubric evaluations. Where possible, combine multiple rubric checks into a single LLM judge prompt to reduce per-call overhead.

For broader token optimization strategies, see Tokenomics.

Evals in the Quality Feedback Loop

Evals catch regressions before deployment. Production monitoring catches unanticipated failures after deployment. Together, they form a complete quality feedback loop.

The cycle:

Evals --> Deploy --> Monitor --> User reports --> New eval cases --> Evals

How production failures become eval cases:

  1. Capture the failure. Save the user’s input, the agent’s output, and the expected behavior.
  2. Build a fixture. Create a minimal reproduction from the captured input.
  3. Write the eval case. Add a test with source: production-failure in the metadata.
  4. Verify the failure. Run the eval and confirm the agent fails the same way.
  5. Fix and validate. Improve the prompt or agent, then re-run until the eval passes without breaking existing tests.

Complementary monitoring methods:

  • Manual transcript review. Sample 5-10 transcripts per week from production to catch failure modes your evals miss.
  • Expert validation agents. Deploy expert agents that validate agent output at runtime.
  • User satisfaction signals. Track whether users accept, modify, or reject agent output. High rejection rates indicate undetected failure modes.

3 - AI Evals for AI Enablement Platforms

How platform teams build shared eval infrastructure for reusable AI coding tools that serve multiple teams and diverse codebases.

Platform teams build reusable AI coding tools for multiple teams. Shared eval infrastructure (base configs, grader libraries, rubric templates) eliminates duplication and enforces consistency across the plugin portfolio.

Reference implementation: The dev-plugins repository demonstrates these patterns with Promptfoo, Claude Code, and custom graders.

What is an AI Enablement Platform

An AI enablement platform is the team that builds reusable AI coding tools (prompts, agents, plugins, skills) for multiple teams in an organization. Instead of every team writing their own code review agent or scaffolding command, the platform team builds these once and distributes them.

The platform challenge: your tools must work across diverse codebases with different languages, frameworks, and coding conventions. You cannot manually test against every consumer’s codebase. You cannot rely on consumer teams to report regressions.

The eval challenge compounds this: each tool in your portfolio needs its own eval suite, and those suites share common infrastructure. Without shared eval patterns, you duplicate graders, rubrics, and fixture conventions across every plugin.

Multi-Plugin Eval Architecture

The dev-plugins reference implementation demonstrates a monorepo structure that separates shipping artifacts from eval infrastructure. This example uses Claude Code plugins, but the same pattern applies to any collection of reusable AI tools:

plugins/frontend-dev/          # Ships to users
  .claude-plugin/plugin.json
  commands/*.md
  agents/*.md
  skills/*/SKILL.md

plugins/ai-readiness/          # Ships to users
  .claude-plugin/plugin.json
  commands/*.md
  agents/*.md
  skills/*/SKILL.md

evals/frontend-dev/            # Stays in repo
  promptfooconfig.yaml
  suites/                      # 5 positive + 3 negative suites
  graders/{deterministic,transcript,llm-rubrics}/
  fixtures/
  reference-solutions/

evals/ai-readiness/            # Stays in repo
  promptfooconfig.yaml
  suites/                      # 7 positive + 7 negative suites
  graders/{deterministic,transcript,llm-rubrics}/
  fixtures/
  reference-solutions/

eval-infra/                    # Shared across all plugins
  promptfoo-base.yaml
  grader-lib/
  rubric-templates/
  scripts/

Running Evals Across the Portfolio

Run evals for a single plugin or the entire portfolio:

npm run eval:frontend          # One plugin
npm run eval:readiness         # Another plugin
npm run eval:all               # All plugins

The eval-infra/scripts/run-plugin-evals.sh script iterates over a KNOWN_PLUGINS list, running each plugin’s eval suite and aggregating results.

Plugin Validation

Before running evals, validate that a plugin has the required structure:

./eval-infra/scripts/validate-plugin.sh frontend-dev

This checks for required directories, manifest fields, at least one command, at least one eval suite, and proper naming conventions. Validation catches structural problems before they cause confusing eval failures.

Shared Eval Infrastructure

Platform teams need a shared foundation that every plugin eval builds on. This eliminates duplication and enforces consistency.

Base Configuration

A single base config defines the provider, timeout, output format, and universal assertions. From eval-infra/promptfoo-base.yaml in the reference implementation:

providers:
  - id: anthropic:messages:claude-sonnet-4-20250514
    label: claude-sonnet
    config:
      temperature: 0
      max_tokens: 16384

defaultTest:
  options:
    timeout: 300000
    transform: output.trim()
  assert:
    # Universal assertions applied to every test case
    - type: javascript
      value: "output.length > 0"
      metric: non_empty_output
    - type: javascript
      value: "output.length >= 500"
      metric: min_output_length
    - type: javascript
      value: "output.length <= 50000"
      metric: max_output_length

Every plugin config replicates these defaults and adds plugin-specific variables (pluginRoot, fixtureRoot, graderRoot, referenceRoot). Shared variables (evalInfraRoot, graderLibRoot, rubricRoot) point back to the central infrastructure.

Shared Grader Library

The grader library (eval-infra/grader-lib/) provides reusable grading functions that any plugin can use:

| Grader | Purpose |
| --- | --- |
| report-schema.js | Validates markdown heading structure or JSON fields |
| finding-parser.js | Extracts findings with severity and evidence |
| hallucination-check.js | Detects fabricated file path references |
| transcript-utils.js | Parses agent transcripts and tool-call sequences |
| build-check.sh | Runs npm install && npm run build in fixtures |
| lint-check.sh | Runs ESLint and reports error/warning counts |

Plugin-specific graders import from the shared library. For example, a transcript grader in evals/ai-readiness/graders/transcript/evidence-gathering.js loads transcript-utils.js from the shared graderLibRoot path to parse transcripts and count tool calls.
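
A sketch of that wiring (the helpers exported by transcript-utils.js and the transcriptPath variable are assumptions; the point is resolving the shared library through the graderLibRoot variable):

```javascript
// evals/ai-readiness/graders/transcript/evidence-gathering.js (illustrative sketch;
// helper names are assumptions -- only the graderLibRoot wiring is the point)
const path = require("path");

module.exports = function (output, context) {
  const graderLibRoot = context.vars.graderLibRoot; // set in the plugin's promptfooconfig.yaml
  const transcriptUtils = require(path.join(graderLibRoot, "transcript-utils.js"));

  // Hypothetical helpers: parse the transcript and count evidence-gathering tool calls
  const events = transcriptUtils.parseTranscript(context.vars.transcriptPath);
  const evidenceCalls = events.filter((e) => ["Read", "Glob", "Grep"].includes(e.tool));

  const pass = evidenceCalls.length >= 2;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: `Agent made ${evidenceCalls.length} evidence-gathering tool calls`,
  };
};
```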

Shared Rubric Templates

LLM rubric templates in eval-infra/rubric-templates/ provide consistent judging criteria across plugins:

  • code-quality-base.md - Five weighted criteria (correctness, readability, maintainability, idiomatic usage, error handling) with a 3.5/5 pass threshold
  • over-engineering-base.md - Checks for unnecessary abstraction and complexity
  • instruction-following.md - Checks adherence to prompt instructions
  • report-quality.md - Evaluates report structure and actionability

Plugin-specific rubrics extend or reference these templates. This prevents each plugin team from inventing their own quality criteria.

The Extend and Specialize Pattern

The shared infrastructure provides the foundation. Each plugin specializes it:

  1. Base config sets provider, timeout, and universal assertions
  2. Shared graders handle common structural checks
  3. Shared rubrics define baseline quality criteria
  4. Plugin config adds plugin-specific variables and test suites
  5. Plugin graders handle domain-specific checks (e.g., accessibility patterns, security vulnerability detection)
  6. Plugin rubrics add domain-specific quality criteria

This layering means a new plugin gets structural validation, hallucination detection, and transcript analysis for free. The plugin author only writes graders for the domain-specific checks their tool needs.

Fixture Diversity

Platform tools must handle diverse codebases. Your fixture portfolio should cover the range of code your tools will encounter in production.

Building a Fixture Matrix

From evals/ai-readiness/fixtures/ in the reference implementation, seven fixture types exercise different tool capabilities:

| Fixture | What It Tests | Positive/Negative |
| --- | --- | --- |
| messy-repo/ | Naming, duplication, dead code detection | Positive |
| insecure-repo/ | Security vulnerability detection | Positive |
| bad-git-repo/ | Git hygiene assessment | Positive |
| untested-repo/ | Test coverage gap detection | Positive |
| bad-api-repo/ | API design issue detection | Positive |
| spaghetti-arch-repo/ | Architecture problem detection | Positive |
| clean-repo/ | False positive prevention | Negative |

Each positive fixture has documented, planted issues. The clean-repo/ fixture follows best practices (clear naming, proper structure, tests, documentation) and drives negative tests across multiple suites.

Coverage Dimensions

Design fixtures to cover these dimensions:

  • Language/framework diversity: If your tool supports React, Vue, and Angular, build fixtures for each.
  • Problem type diversity: Naming issues, security vulnerabilities, architecture problems, missing tests, and API design flaws exercise different detection capabilities.
  • Severity diversity: Include minor, moderate, and critical issues. Your tool should rate severity appropriately.
  • Clean examples: At least one fixture per problem domain should be clean to drive negative tests.

Reference Solutions as Platform Artifacts

Reference solutions serve double duty on a platform team:

  1. Grader calibration. Compare agent output against the reference to verify graders catch real failures and pass correct behavior. When a grader disagrees with the reference solution, the grader is probably wrong.

  2. Onboarding material. New team members read reference solutions to understand what good output looks like for each tool. The reference solution for messy-repo-audit.md documents exactly what findings the code review tool should produce, at what severity, with what evidence.

From evals/ai-readiness/reference-solutions/ in the reference implementation, seven reference solutions cover the full fixture portfolio:

| Reference Solution | Fixture |
| --- | --- |
| messy-repo-audit.md | messy-repo/ |
| insecure-repo-findings.md | insecure-repo/ |
| bad-git-health-report.md | bad-git-repo/ |
| untested-repo-findings.md | untested-repo/ |
| bad-api-findings.md | bad-api-repo/ |
| spaghetti-arch-findings.md | spaghetti-arch-repo/ |
| clean-repo-audit.md | clean-repo/ |

The clean-repo-audit.md reference solution documents what the agent should say about well-structured code: acknowledge what is done well, note minor improvement opportunities without false alarm, and assign no critical or major findings.

Meta-Evaluation

Platform teams face a second-order problem: how do you evaluate your eval infrastructure itself? If your graders are miscalibrated or your fixtures are unrealistic, your evals give false confidence.

The Eval-Rubric Pattern

The eval-rubric pattern uses a structured assessment against known best practices. This repo’s /eval-rubric command scores the eval infrastructure against 12 dimensions from Anthropic’s “Demystifying Evals for AI Agents” article, each scored 0-5:

  1. Start Early with Real Failures
  2. Source from Real User Behavior
  3. Unambiguous Tasks + Reference Solutions
  4. Balanced Problem Sets
  5. Robust Eval Harness + Stable Environment
  6. Thoughtful Grader Design
  7. Read Transcripts Regularly
  8. Monitor Capability Eval Saturation
  9. Maintain Evals Long-Term
  10. Non-Determinism Handling
  11. Agent-Specific Approaches
  12. Holistic Evaluation

Score thresholds: 0-2 requires an action plan to reach 3. 3-4 indicates adequate infrastructure with room for refinement. 5 indicates complete coverage.

The eval-rubric runs against the actual repo contents, reading suite files, grader implementations, fixture directories, and CI configuration before scoring. It produces evidence-based assessments, not opinions.

Running Meta-Evaluation Periodically

Run the eval-rubric after significant infrastructure changes:

  • Adding a new grader type
  • Expanding the fixture portfolio
  • Changing the base configuration
  • Adding a new plugin

Track scores over time. A dimension that drops below 3 after an infrastructure change indicates a regression in eval quality.

Expert Validation Agents

ACD defines expert validation agents that validate agent output at runtime, the production counterpart to offline evals. Where evals catch issues during development, expert agents catch issues during execution.

| Expert Agent | What It Validates | Offline Eval Counterpart |
| --- | --- | --- |
| Intent Clarity Agent | Prompt matches user intent | LLM rubric (intent alignment) |
| Behavior Validation | Output matches expected behavior | LLM rubric (behavior quality) |
| Constraint Checker | Output respects system rules | Transcript grader (process) |
| Implementation Review | Code quality and correctness | Deterministic grader (structure) |
| Truth Verification | Output passes executable checks | Deterministic grader (build/test) |

Both need calibration. Expert agents, like graders, can be miscalibrated. Apply the same calibration discipline: test against known-good and known-bad outputs, investigate disagreements, and recalibrate periodically.

Both need negative testing. An expert agent that flags everything is as useless as a grader with 100% false positive rate. Test expert agents against clean inputs to verify they do not fabricate issues.

The division: Offline evals validate during development. They run in CI and during prompt engineering. Expert agents validate during execution. They run alongside the agent in production. A mature platform uses both.

The Eval Lifecycle at Scale

Adding New Plugins

When adding a new plugin to the platform, follow this checklist (detailed in docs/ADDING_A_PLUGIN.md):

  1. Create the plugin directory structure (plugins/<name>/)
  2. Create the plugin manifest (plugin.json)
  3. Write at least one command or agent
  4. Create the eval directory structure (evals/<name>/)
  5. Create the eval config replicating base defaults
  6. Write at least one positive suite with a fixture
  7. Write the corresponding negative suite
  8. Add the plugin to KNOWN_PLUGINS in the run script
  9. Validate structure with validate-plugin.sh

The checklist ensures every plugin ships with evals from day one. A plugin without evals does not ship.

Monitoring Capability Saturation

Track pass@k metrics over time for each plugin. When pass@5 consistently exceeds 95% across all capability suites, the current eval suite is saturated. The tool handles everything you test for. Either:

  • The tool is genuinely excellent (check pass^5 to verify reliability)
  • The tests are too easy (add harder fixtures, more subtle issues)
  • The test suite has gaps (add new capability dimensions)

Saturation is a signal to expand the eval suite, not to stop evaluating.

Baseline Management

Record baselines after significant prompt or eval changes:

./eval-infra/scripts/record-baseline.sh ai-readiness

This appends a timestamped JSON record to evals/ai-readiness/eval-history.jsonl with pass@k metrics, git commit, and branch. Use the history to:

  • Detect metric regressions across prompt changes (see the sketch after this list)
  • Measure the impact of model migrations
  • Report capability improvement over time to stakeholders
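
For example, a small sketch that compares the two most recent baseline records (the field names are assumptions based on the record contents described above):

```javascript
// read-eval-history.js (sketch; field names in eval-history.jsonl are assumptions
// based on the description above: pass@k metrics, git commit, branch, timestamp)
const fs = require("fs");

const lines = fs
  .readFileSync("evals/ai-readiness/eval-history.jsonl", "utf8")
  .trim()
  .split("\n")
  .map(JSON.parse);

const [previous, latest] = lines.slice(-2);
const delta = latest["pass@1"] - previous["pass@1"];

console.log(`pass@1: ${previous["pass@1"]} -> ${latest["pass@1"]} (commit ${latest.commit})`);
if (delta < 0) {
  console.warn("pass@1 dropped since the previous baseline -- investigate before shipping");
}
```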

Eval Maintenance and Retirement

Eval suites require ongoing maintenance. Without a retirement policy, suites accumulate stale tests that slow runs and obscure signal.

When to retire eval cases:

  • Saturated AND stable. A test that passes consistently (pass^5 > 95%) across multiple model versions has served its purpose. Archive it, do not delete it. Archived tests can be re-activated if regressions appear.

When to split suites:

  • Suites exceeding 20 cases. Large suites make it hard to identify which capability dimension failed. Split by capability dimension (e.g., separate “naming” from “architecture” tests).

Ownership model:

  • Shared graders (in eval-infra/grader-lib/) - platform team owns
  • Plugin-specific graders (in evals/<plugin>/graders/) - plugin team owns
  • Rubric templates (in eval-infra/rubric-templates/) - platform team owns

Review cadence: Quarterly, aligned with model migration timelines. Review pass@k trends, retire saturated cases, split oversized suites, and recalibrate graders against updated reference solutions.

Common Platform Pitfalls

Building tools without evals first. The platform team ships a new plugin, gets user complaints, and then scrambles to build evals. Write the eval suite alongside the first command. Evals are the development tool, not an afterthought.

Overly generic shared graders. A grader that checks “output is valid markdown” adds little value. Shared graders should check specific structural properties (heading hierarchy, score arithmetic, hallucinated file paths) that apply across multiple plugins.

Overly specific shared graders. A grader that checks for React-specific patterns does not belong in the shared library. It belongs in the plugin’s grader directory. Keep the shared library domain-agnostic.

Uncalibrated rubric templates. LLM rubric templates need calibration against reference solutions. Run the rubric against your reference solution and verify it scores 4-5. Run it against a known-bad output and verify it scores 1-2. A rubric that gives 3.5 to everything is useless.

Not validating plugin structure. A missing grader directory or misconfigured manifest causes confusing eval failures. Run validate-plugin.sh before every eval run in CI.

Ignoring negative testing. Platform tools face the strongest temptation to over-report issues because they optimize for “finding things.” Negative test suites with clean fixtures are the only defense against false positive drift.