Architecting Tests for CD

Test architecture, types, and good practices for building confidence in your delivery pipeline.

1: Test Feedback Speed
2: Test Types

2.1: Component Tests
2.2: Contract Tests
2.3: End-to-End Tests
2.4: Integration Tests
2.5: Static Analysis
2.6: Unit Tests

3: Applied Testing Strategies

3.1: Pre-Ship Checklist
3.2: Patterns

3.2.1: API Provider
3.2.2: API Consumer
3.2.3: Scheduled Job
3.2.4: User Interface
3.2.5: Event Consumer
3.2.6: Event Producer
3.2.7: CLI Tool or Library
3.2.8: Stateful Service

3.3: Cross-Cutting Concerns

4: Testing Antipatterns
5: Testing Glossary

A test architecture that lets your pipeline deploy confidently, regardless of external system availability, is a core CD capability. The child pages cover each test type.

A CD pipeline’s job is to force every artifact to prove it is worthy of delivery. That proof only works when test changes ship with the code they validate. If a developer adds a feature but the corresponding tests arrive in a later commit, the pipeline approved an artifact it never actually verified. That is not a CD pipeline. It is a CI pipeline with a deploy step. Tests and production code must always travel together through the pipeline as a single unit of change.

Beyond the Test Pyramid

The test pyramid: a triangle with Unit Tests at the wide base (fast, cheap, many), Integration/Component in the middle, and End-to-End at the narrow top (slow, expensive, few). Arrows on the sides indicate cost and speed increase toward the top.

The test pyramid says: write many fast unit tests at the base, fewer integration tests in the middle, and only a handful of end-to-end tests at the top. The underlying principle is sound - lower-level tests are faster, more deterministic, and cheaper to maintain.

The principle behind the shape

The pyramid’s shape communicates a principle: prefer fast, deterministic tests that you fully control. Tests at the base are cheap to write, fast to run, and reliable. Tests at the top are slow, expensive, and depend on systems outside your control. The more weight you put at the base, the faster and more reliable your pipeline becomes - to a point. A second engineering goal is to achieve the most functional coverage with the fewest tests, because every test costs money to maintain and adds time to the pipeline.

The testing trophy

The testing trophy, popularized by Kent C. Dodds, rebalances the pyramid by putting component tests at the center. Where the pyramid emphasizes unit tests at the base, the trophy argues that component tests give you the most confidence per test because they exercise realistic user behavior through a component’s public interface while still using test doubles for external dependencies.

The trophy also makes static analysis explicit as the foundation. Linting, type checking, and formatting catch entire categories of defects for free - no test code to write or maintain.

Both models agree on the principle: keep end-to-end tests few and focused, and maximize fast, deterministic coverage. The trophy simply shifts where that coverage concentrates. For teams building component-heavy applications, the trophy distribution often produces better results than a strict pyramid.

Teams often miss this underlying principle and treat either shape as a metric. They count tests by type and debate ratios - “do we have enough unit tests?” or “do we have too many integration tests?” - when the real question is:

Can our pipeline determine that a change is safe to deploy without depending on any system we do not control?

A pipeline that answers yes can deploy at any time - even when a downstream service is down, a third-party API is slow, or a partner team hasn’t shipped yet. That independence is what CD requires, and it is the reason the pyramid favors the base.

What this looks like in practice

A test architecture that achieves this has three responsibilities:

Fast, deterministic tests - unit, component, and contract tests - run on every commit using test doubles for external dependencies. They give a reliable go/no-go signal in minutes.
Acceptance tests validate that a deployed artifact is deliverable. Acceptance testing is not a single test type. It is a pipeline stage that can include component tests, load tests, chaos tests, resilience tests, and compliance tests. Any test that runs after CI to gate promotion to production is an acceptance test.
Integration tests validate that contract test doubles still match the real external systems. They run in a dedicated test environment with versioned test data, on demand or on a schedule, providing monitoring rather than gating.

The anti-pattern: the ice cream cone

The ice cream cone anti-pattern: an inverted test distribution where most testing effort goes to manual and end-to-end tests at the top, with too few fast unit tests at the bottom

Most teams that struggle with CD have inverted the pyramid - too many slow, flaky end-to-end tests and too few fast, focused ones. Manual gates block every release. The pipeline cannot give a fast, reliable answer, so deployments become high-ceremony events.

Test Architecture

A test architecture is the deliberate structure of how different test types work together across your pipeline to give you deployment confidence. Use the table below to decide what type of test to write and where it runs. This is not a comprehensive list. It shows how common tests impact pipeline design and how teams should structure their suites. See the Pipeline Reference Architecture for a complete quality gate sequence.

Four-lane CD pipeline diagram. Pipeline lane: Commit triggers pre-merge and CI checks (Static Analysis, Unit Tests, Component Tests, Contract Tests - deterministic, blocks merge), then Build, Deploy to test environment, Acceptance Tests in test environment (Component, Load, Chaos, Resilience, Compliance - gates promotion to production), Deploy to production, and a green Live checkmark. Post-deploy lane: Production Verification (Health Checks, Real User Monitoring, SLO) triggered after production deploy - non-deterministic, triggers alerts, never blocks promotion. Async lane: Integration Tests validate contract test doubles against real systems - non-deterministic, post-deploy, failures trigger review. Continuous lane: Exploratory Testing and Usability Testing run continuously alongside delivery and never block.

Pipeline Stage	What You Need to Verify	Test Type	Speed	Deterministic?	Blocks Deploy?
CI	A function or method behaves correctly	Unit	Milliseconds	Yes	Yes
CI	A complete component or service works through its public interface	Component	Milliseconds to seconds	Yes	Yes
CI	Your code correctly interacts with external system interfaces	Contract	Milliseconds to seconds	Yes	Yes
CI	Code quality, security, and style compliance	Static Analysis	Seconds	Yes	Yes
CI	UI meets WCAG accessibility standards	Static Analysis + Component	Seconds	Yes	Yes
Acceptance Testing	Deployed artifact meets acceptance criteria	Deploy, Smoke, Load, Resilience, Compliance, etc.	Minutes	No	Yes - gates production
Post-deploy (production)	Critical user journeys work in production	E2E smoke	Seconds to minutes	No	No - triggers rollback
Post-deploy (production)	Production health and SLOs	Synthetic monitoring	Continuous	No	No - triggers alerts
On demand/scheduled	Contract test doubles still match real external systems	Integration	Seconds to minutes	No	No - triggers review
Continuous	Unexpected behavior, edge cases, real-world workflows	Exploratory Testing	Varies	No	Never
Continuous	Real users can accomplish goals effectively	Usability Testing	Varies	No	Never

The critical insight: everything that blocks merge is deterministic and under your control. Acceptance tests gate production promotion after verifying the deployed artifact. Everything that involves real external systems runs post-deployment. This is what gives you the independence to deploy any time, regardless of the state of the world around you.
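The rule in the paragraph above can be checked mechanically. The following is a minimal sketch, assuming the table is transcribed into a data structure; the `TestKind` type, the `blocks` values, and the `mergeGateViolations` function are illustrative names, not part of any pipeline tool.

```typescript
type Stage = "ci" | "acceptance" | "post-deploy" | "scheduled" | "continuous";

interface TestKind {
  name: string;
  stage: Stage;
  deterministic: boolean;
  blocks: "merge" | "promotion" | "nothing";
}

// Rows transcribed from the table above (the accessibility row is covered
// by the static-analysis and component entries).
const suite: TestKind[] = [
  { name: "unit", stage: "ci", deterministic: true, blocks: "merge" },
  { name: "component", stage: "ci", deterministic: true, blocks: "merge" },
  { name: "contract", stage: "ci", deterministic: true, blocks: "merge" },
  { name: "static-analysis", stage: "ci", deterministic: true, blocks: "merge" },
  { name: "acceptance", stage: "acceptance", deterministic: false, blocks: "promotion" },
  { name: "e2e-smoke", stage: "post-deploy", deterministic: false, blocks: "nothing" },
  { name: "synthetic-monitoring", stage: "post-deploy", deterministic: false, blocks: "nothing" },
  { name: "integration", stage: "scheduled", deterministic: false, blocks: "nothing" },
  { name: "exploratory", stage: "continuous", deterministic: false, blocks: "nothing" },
  { name: "usability", stage: "continuous", deterministic: false, blocks: "nothing" },
];

// Returns the names of test kinds that break the rule
// "everything that blocks merge is deterministic".
function mergeGateViolations(kinds: TestKind[]): string[] {
  return kinds
    .filter((kind) => kind.blocks === "merge" && !kind.deterministic)
    .map((kind) => kind.name);
}
```

A check like this could run as part of pipeline configuration review: an empty result means no non-deterministic test has crept into the merge gate.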

Acceptance tests can include non-deterministic activities (load, chaos, resilience), but the gate decision is still deterministic: it fires on a documented pass/fail threshold - a performance budget, an error-rate ceiling, a required compliance check - not on the raw variability of the measurement. That is different from gating on a flaky test whose pass/fail flips for reasons unrelated to the change, which the Do Not list below warns against.
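To make the distinction concrete, here is a minimal sketch of a gate that takes noisy load-test samples and reduces them to a stable pass/fail decision against documented thresholds. The `LoadSample` and `Budget` shapes, the nearest-rank percentile, and the choice of p95 latency plus error rate are assumptions for illustration; a real gate would use whatever budget the team has documented.

```typescript
interface LoadSample {
  latencyMs: number;
  ok: boolean;
}

interface Budget {
  p95LatencyMs: number; // performance budget
  maxErrorRate: number; // error-rate ceiling, as a fraction (0.02 = 2%)
}

// Nearest-rank percentile over a copy of the input.
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// The measurement varies from run to run; the decision only changes when a
// documented threshold is crossed.
function evaluateGate(samples: LoadSample[], budget: Budget): { pass: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (samples.length === 0) {
    return { pass: false, reasons: ["no samples collected"] };
  }
  const p95 = percentile(samples.map((s) => s.latencyMs), 95);
  const errorRate = samples.filter((s) => !s.ok).length / samples.length;
  if (p95 > budget.p95LatencyMs) {
    reasons.push(`p95 latency ${p95}ms exceeds budget ${budget.p95LatencyMs}ms`);
  }
  if (errorRate > budget.maxErrorRate) {
    reasons.push(`error rate ${errorRate} exceeds ceiling ${budget.maxErrorRate}`);
  }
  return { pass: reasons.length === 0, reasons };
}
```

Two runs with different outliers both pass as long as they stay inside the budget, which is what separates this gate from a flaky test.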

Pre-merge vs post-merge

The table maps to two distinct phases of your pipeline, each with different goals and constraints.

Pre-merge (before code lands on trunk): Run static analysis plus unit, component, and contract tests. These must all be deterministic and fast. Target: under 10 minutes total. This is the quality gate that every change must pass. If pre-merge tests are slow, developers batch up changes or skip local runs, both of which undermine continuous integration.

Post-merge (after code lands on trunk, before or after deployment): Re-run the full deterministic suite against the integrated trunk. Then run acceptance tests, E2E smoke tests, and synthetic monitoring post-deploy. Integration tests run separately in a test environment, on demand or on a schedule. Target: under 60 minutes for the full post-merge cycle.

Why re-run pre-merge tests post-merge? Two changes can each pass pre-merge independently but conflict when combined on trunk. The post-merge run catches these integration effects.
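A small illustration of how this happens, as a sketch with hypothetical function names. Change A alters the meaning of a parameter and updates every caller on its branch. Change B adds a new caller written against the old meaning. Neither branch touches the other's lines, so there is no textual merge conflict, and each suite is green on its own.

```typescript
type ApplyDiscount = (totalCents: number, discount: number) => number;

// Trunk today: the discount argument is a percentage (50 means 50%).
const applyDiscountOnTrunk: ApplyDiscount = (totalCents, percent) =>
  totalCents * (1 - percent / 100);

// Change A: the argument becomes a fraction (0.5 means 50%). Every caller
// that exists on A's branch is updated, so A's pre-merge suite passes.
const applyDiscountAfterA: ApplyDiscount = (totalCents, fraction) =>
  totalCents * (1 - fraction);

// Change B: a new caller written against trunk, still passing a percentage.
// B's pre-merge suite passes because B's branch has the percent version.
function holidayPriceCents(applyDiscount: ApplyDiscount, totalCents: number): number {
  return applyDiscount(totalCents, 50);
}

// What B's branch computes, and what trunk computes once both changes land.
const onBranchB = holidayPriceCents(applyDiscountOnTrunk, 20000); // 10000
const onMergedTrunk = holidayPriceCents(applyDiscountAfterA, 20000); // -980000
```

Only a run against the integrated trunk exercises B's caller with A's implementation, which is why the deterministic suite runs again post-merge.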

If a post-merge failure occurs, the team fixes it immediately. Trunk must always be releasable.

This post-merge re-run is what teams traditionally call regression testing: running all previous tests against the current artifact to confirm that existing behavior still works after a change. In CD, regression testing is not a separate test type or a special suite. Every test in the pipeline is a regression test. The deterministic suite runs on every commit, and the full suite runs post-merge. A green run means the artifact has been regression-tested against every behavior the suite encodes - no more and no less, which is why the suite’s coverage of prior behavior is what makes the signal trustworthy.

Good practices

Do

Run tests on every commit. If tests do not run automatically, they will be skipped.
Keep the deterministic suite under 10 minutes. If it is slower, developers will stop running it locally.
Fix broken tests immediately. A broken test is equivalent to a broken build.
Delete tests that do not provide value. A test that never fails and tests trivial behavior is maintenance cost with no benefit.
Test behavior, not implementation. Use a black box approach - verify what the code does, not how it does it. As Ham Vocke advises: “if I enter values x and y, will the result be z?” - not the sequence of internal calls that produce z. Avoid white box testing that asserts on internals.
Use test doubles for external dependencies. Your deterministic tests should run without network access to external systems.
Validate test doubles with contract tests. Test doubles that drift from reality give false confidence.
Treat test code as production code. Give it the same care, review, and refactoring attention.
Run automated accessibility checks on every commit. WCAG compliance scans are fast, deterministic, and catch violations that are invisible to sighted developers. Treat them like security scans: automate the detectable rules and reserve manual review for subjective judgment. See Accessibility testing for the full three-tier strategy and pipeline placement.
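Two of the practices above, using test doubles for external dependencies and validating those doubles, can be sketched together. This is a minimal illustration under stated assumptions: `InventoryGateway`, `fakeInventory`, and `contractViolations` are hypothetical names, and the gateway is synchronous only to keep the sketch short (a real one would be asynchronous).

```typescript
interface StockLevel {
  itemId: string;
  quantity: number;
}

// The boundary the team owns. Production code depends on this interface,
// not on the external inventory service directly.
interface InventoryGateway {
  getStock(itemId: string): StockLevel;
}

// Test double: answers from memory and never touches the network.
function fakeInventory(stock: { [itemId: string]: number }): InventoryGateway {
  return {
    getStock(itemId) {
      const known = Object.prototype.hasOwnProperty.call(stock, itemId);
      return { itemId, quantity: known ? stock[itemId] : 0 };
    },
  };
}

// Code under test: runs identically against the fake or a real adapter.
function canFulfil(gateway: InventoryGateway, itemId: string, wanted: number): boolean {
  return gateway.getStock(itemId).quantity >= wanted;
}

// Shared expectations any implementation of the gateway must satisfy.
// Run against the fake on every commit; the same check can run against the
// real adapter on a schedule to detect drift.
function contractViolations(gateway: InventoryGateway, knownItemId: string): string[] {
  const problems: string[] = [];
  const level = gateway.getStock(knownItemId);
  if (level.itemId !== knownItemId) {
    problems.push("itemId must echo the requested id");
  }
  if (level.quantity < 0 || Math.floor(level.quantity) !== level.quantity) {
    problems.push("quantity must be a non-negative integer");
  }
  return problems;
}
```

The design choice here is that the fake and the real adapter are held to one set of expectations, so a green deterministic suite keeps meaning something as the external system evolves.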

Do Not

Do not tolerate flaky tests. Quarantine or delete them immediately.
Do not gate your pipeline on flaky, non-deterministic test signals. E2E and integration test failures - pass/fail that flips for reasons unrelated to the change - should trigger review or alerts, not block deployment. (An acceptance gate that fires on a deterministic threshold, like a performance budget, is not this: the gate decision is stable even when the underlying measurement varies.)
Do not couple your deployment to external system availability. If a third-party API being down prevents you from deploying, your test architecture has a critical gap.
Do not write tests after the fact as a checkbox exercise. Tests written without understanding the behavior they verify add noise, not value.
Do not test private methods directly. Test the public interface; private methods are tested indirectly.
Do not share mutable state between tests. Each test should set up and tear down its own state.
Do not use fixed sleeps for timing-dependent tests. Use explicit waits on a condition, polling with a timeout, or event-driven assertions.
Do not let unit or component tests depend on a shared or external database or service. A real engine the team controls and isolates per test - a per-test testcontainer, or a transaction that rolls back at teardown - is fine in-band and stays deterministic. A shared, mutable database, or any service the team does not control, is not: that reintroduces non-determinism, so categorize the test as integration or end-to-end and run it post-deployment, not as a pre-merge gate.
Do not make exploratory or usability testing a release gate. These activities are continuous and inform product direction; they are not a pass/fail checkpoint before deployment.
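The rule about shared mutable state is easiest to see side by side. The sketch below uses a hypothetical `Cart` type and a toy runner rather than a real test framework: the isolated tests build their own state and pass in any order, while the coupled tests share one object and their outcome depends on execution order.

```typescript
interface Cart {
  items: string[];
}

// Factory: every test that needs a cart builds a fresh one.
function makeCart(): Cart {
  return { items: [] };
}

interface TestCase {
  name: string;
  run: () => boolean; // true means the test passed
}

// Toy runner: executes cases in order and returns the names that failed.
function failures(cases: TestCase[]): string[] {
  return cases.filter((testCase) => !testCase.run()).map((testCase) => testCase.name);
}

// Isolated: each test sets up its own state.
const isolated: TestCase[] = [
  {
    name: "adds one item",
    run: () => {
      const cart = makeCart();
      cart.items.push("sku-1");
      return cart.items.length === 1;
    },
  },
  { name: "starts empty", run: () => makeCart().items.length === 0 },
];

// Coupled: both tests read and write the same cart, so "starts empty"
// only passes if it happens to run first.
const sharedCart = makeCart();
const coupled: TestCase[] = [
  {
    name: "adds one item",
    run: () => {
      sharedCart.items.push("sku-1");
      return sharedCart.items.length === 1;
    },
  },
  { name: "starts empty", run: () => sharedCart.items.length === 0 },
];
```

In a real framework the factory call would typically live in a per-test setup hook; the point is that no test observes state left behind by another.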

ACD - How acceptance criteria make testing the constraint that governs agent-generated code
Testing Fundamentals - Establishing testing practices as part of CD migration
High Coverage but Ineffective Tests - When tests pass but do not catch real defects

Additional concepts drawn from Ham Vocke, The Practical Test Pyramid, and Toby Clemson, Testing Strategies in a Microservice Architecture.

1 - Test Feedback Speed

Why test suite speed matters for developer effectiveness and how cognitive limits set the targets.

Why speed has a threshold

The 10-minute CI target and the preference for sub-second unit tests are not arbitrary. They are long-standing conventions in CD practice, and they align with how human cognition handles interrupted work. When a developer makes a change and waits for test results, three things determine whether that feedback is useful: whether the developer still holds the mental model of the change, whether they can act on the result immediately, and whether the wait is short enough that they do not context-switch to something else.

Research on task interruption and working memory consistently shows that context switches are expensive. Gloria Mark’s research at UC Irvine found that it takes an average of 23 minutes for a person to fully regain deep focus after being interrupted during a task, and that interrupted tasks take twice as long and contain twice as many errors as uninterrupted ones.¹ If the test suite itself takes 30 minutes, the total cost of a single feedback cycle approaches an hour - and most of that time is spent re-loading context, not fixing code.

The cognitive breakpoints

Jakob Nielsen’s foundational research on response times identified three thresholds that govern how users perceive and respond to system delays: 0.1 seconds (feels instantaneous), 1 second (noticeable but flow is maintained), and 10 seconds (attention limit - the user starts thinking about other things).² These thresholds, rooted in human perceptual and cognitive limits, apply directly to developer tooling.

Different feedback speeds produce fundamentally different developer behaviors:

Feedback time	Developer behavior	Cognitive impact
Under 1 second	The delay is barely noticeable. The developer stays in flow, treating the test result as part of the editing cycle.²	Working memory is fully intact. The change and the result are experienced as a single action.
1 to 10 seconds	The developer waits. Attention may drift briefly but returns without effort.	Working memory is intact. The developer can act on the result immediately.
10 seconds to 2 minutes	The developer starts to feel the wait. They may glance at another window or check a message, but they do not start a new task.	Working memory begins to decay. Nielsen’s 10-second limit marks the point where attention starts to wander;² beyond it, each additional second increases the chance of distraction (extrapolated from the same perceptual thresholds).
2 to 10 minutes	The developer context-switches. They check email, review a PR, or start thinking about a different problem. When the result arrives, they must actively return to the original task.	Working memory is partially lost. Rebuilding context takes several minutes depending on the complexity of the change.¹
Over 10 minutes	The developer fully disengages and starts a different task. The test result arrives as an interruption to whatever they are now doing.	Working memory of the original change is gone. Rebuilding it takes upward of 23 minutes.¹ Investigating a failure means re-reading code they wrote an hour ago.
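The bands in the table can be expressed as a small lookup, which is handy for annotating build dashboards or CI timing reports. This is a sketch only: the band labels are shorthand invented here, and the handling of exact boundary values (for example, whether exactly 10 seconds counts as the second or third band) is an assumption the table does not specify.

```typescript
type FeedbackBand = "flow" | "waiting" | "drifting" | "context-switch" | "disengaged";

// Maps a feedback time in seconds to the band from the table above.
function feedbackBand(seconds: number): FeedbackBand {
  if (seconds < 1) return "flow"; // under 1 second
  if (seconds <= 10) return "waiting"; // 1 to 10 seconds
  if (seconds <= 120) return "drifting"; // 10 seconds to 2 minutes
  if (seconds <= 600) return "context-switch"; // 2 to 10 minutes
  return "disengaged"; // over 10 minutes
}
```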

The conventional 10-minute CI target lines up with the boundary between “developer waits and acts on the result” and “developer starts something else and pays a full context-switch penalty.” Below 10 minutes, feedback is actionable. Above 10 minutes, feedback becomes an interruption. The number itself is an established CD convention rather than a figure the cognitive research produces directly, but DORA’s research on continuous integration converges on the same target: tests should complete in under 10 minutes to support the fast feedback loops that high-performing teams depend on.³

What this means for test architecture

These cognitive breakpoints should drive how you structure your test suite:

Local development (under 1 second). Unit tests for the code you are actively changing should run in watch mode, re-executing on every save. At this speed, TDD becomes natural - the test result is part of the writing process, not a separate step. This is where you test complex logic with many permutations.

Pre-push verification (under 2 minutes). The full unit test suite and the component tests for the component you changed should complete before you push. At this speed, the developer stays engaged and acts on failures immediately. This is where you catch regressions.

CI pipeline (under 10 minutes). The full deterministic suite - all unit tests, all component tests, all contract tests - should complete within 10 minutes of commit. At this speed, the developer has not yet fully disengaged from the change. If CI fails, they can investigate while the code is still fresh.

Post-deploy verification (minutes to hours). E2E smoke tests and integration test validation run after deployment. These are non-deterministic, slower, and less frequent. Failures at this level trigger investigation, not immediate developer action.

When a test suite exceeds 10 minutes, the solution is not to accept slower feedback. It is to redesign the suite: replace E2E tests with component tests using test doubles, parallelize test execution, and move non-deterministic tests out of the gating path.
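One of the remedies above, parallelizing test execution, can be sketched as distributing test files across workers by their recorded duration. The approach shown is a greedy longest-first assignment, chosen here for simplicity; the `TestFile` shape and function names are hypothetical, and real CI systems usually provide their own sharding mechanism.

```typescript
interface TestFile {
  name: string;
  seconds: number; // duration recorded from a previous run
}

// Assigns each file, longest first, to the worker with the least total time.
function shard(files: TestFile[], workers: number): TestFile[][] {
  if (workers < 1) {
    throw new Error("need at least one worker");
  }
  const shards: TestFile[][] = [];
  const totals: number[] = [];
  for (let i = 0; i < workers; i++) {
    shards.push([]);
    totals.push(0);
  }
  const longestFirst = files.slice().sort((a, b) => b.seconds - a.seconds);
  longestFirst.forEach((file) => {
    let lightest = 0;
    for (let i = 1; i < workers; i++) {
      if (totals[i] < totals[lightest]) lightest = i;
    }
    shards[lightest].push(file);
    totals[lightest] += file.seconds;
  });
  return shards;
}

// Wall-clock time is the duration of the slowest shard.
function wallClockSeconds(shards: TestFile[][]): number {
  return shards.reduce(
    (slowest, files) => Math.max(slowest, files.reduce((sum, f) => sum + f.seconds, 0)),
    0,
  );
}
```

With six files totalling 16 minutes, two balanced workers bring the wall-clock time to about 8 minutes. Sharding only helps when tests are isolated from each other, which is one more reason to avoid shared mutable state.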

Impact on application architecture

Test feedback speed is not just a testing concern - it puts pressure on how you design your systems. A monolithic application with a single test suite that takes 40 minutes to run forces every developer to pay the full context-switch penalty on every change, regardless of which module they touched.

Breaking a system into smaller, independently testable components is often motivated as much by test speed as by deployment independence. When a component has its own focused test suite that runs in under 2 minutes, the developer working on that component gets fast, relevant feedback. They do not wait for tests in unrelated modules to finish.

This creates a virtuous cycle: smaller components with clear boundaries produce faster test suites, which enable more frequent integration, which encourages smaller changes, which are easier to test. Conversely, a tightly coupled monolith produces a slow, tangled test suite that discourages frequent integration, which leads to larger changes, which are harder to test and more likely to fail.

Architecture decisions that improve test feedback speed include:

Clear component boundaries with well-defined interfaces, so each component can be tested in isolation with test doubles for its dependencies.
Separating business logic from infrastructure so that core rules can be unit tested in milliseconds without databases, queues, or network calls.
Independently deployable services with their own test suites, so a change to one service does not require running the entire system’s tests.
Avoiding shared mutable state between components, which forces integration tests and introduces non-determinism.
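The second item in the list, separating business logic from infrastructure, might look like the following sketch. The shipping rule, the prices, and the names `shippingCents`, `ParcelStore`, and `quoteShipping` are invented for illustration. The rule is a pure function that a unit test can call directly; the storage dependency sits behind an interface in a thin outer function.

```typescript
// Pure core: no database, queue, or network. Testable in milliseconds.
// Assumed rule: a base fee plus a fee per started kilogram, doubled for express.
function shippingCents(weightGrams: number, express: boolean): number {
  const baseCents = 500;
  const perKgCents = 100;
  const startedKg = Math.ceil(weightGrams / 1000);
  const standard = baseCents + startedKg * perKgCents;
  return express ? standard * 2 : standard;
}

// Infrastructure boundary: how parcel weights are loaded is not the core's concern.
interface ParcelStore {
  weightGramsFor(parcelId: string): number;
}

// Thin shell: wires the infrastructure to the pure rule and adds no logic of its own.
function quoteShipping(store: ParcelStore, parcelId: string, express: boolean): number {
  return shippingCents(store.weightGramsFor(parcelId), express);
}
```

Because the permutations live in `shippingCents`, they can be covered exhaustively by unit tests, while the shell needs only a handful of component-level checks.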

If your test suite is slow and you cannot make it faster by optimizing test execution alone, the architecture is telling you something. A system that is hard to test quickly is also hard to change safely - and both problems have the same root cause.

The compounding cost of slow feedback

Slow feedback does not just waste time - it changes behavior. When the suite takes 40 minutes, developers adapt:

They batch changes to avoid running the suite more than necessary, creating larger and riskier commits.
They stop running tests locally because the wait is unacceptable during active development.
They push to CI and context-switch, paying the full rebuild penalty on every cycle.
They rerun failures instead of investigating, because re-reading the code they wrote an hour ago is expensive enough that “maybe it was flaky” feels like a reasonable bet.

Each of these behaviors degrades quality independently. Together, they make continuous integration impossible. A team that cannot get feedback on a change within 10 minutes cannot sustain the practice of integrating changes multiple times per day.⁴

Sources

2 - Test Types

Definitions of the test types used throughout this site: unit, component, contract, integration, end-to-end, and static analysis.

Each page covers what the type is, when it runs in the pipeline, what it asserts on, and what it does not.

The list isn’t exhaustive and the boundaries between types aren’t crisp in every codebase. Use these definitions as shared vocabulary for the rest of the testing section, especially Applied Testing Strategies and Testing Antipatterns.

2.1 - Component Tests

Deterministic tests that exercise a single component through its public interface, with systems the team doesn’t control replaced by test doubles.

Component test pattern: a test actor hits the public interface of a component boundary. Inside the boundary, real internal modules (API Layer, Business Logic, Data Adapter) are wired together. Outside the boundary, a Database and External API are represented by test doubles.

Definition

A component test exercises one component through its public interface: one backend service through its HTTP, gRPC, or GraphQL API, or one frontend component (or app shell) through its rendered DOM. The test treats that component as a black box: inputs go in through the public interface, observable outputs come out (response, persisted state, emitted event, rendered DOM, side effect), and the test asserts only on those outputs.

The component’s real internal modules are wired together - routing, validation, business logic, and persistence in a backend, or rendering, state management, and event handling in a UI. What gets replaced is whatever crosses the component’s boundary into a system the team doesn’t control: third-party APIs, downstream services owned by other teams, message brokers. Those become test doubles.

The component’s own persistence layer is the boundary that admits a choice. Two configurations are both valid component tests:

Doubled persistence: an in-memory repository or fake stands in for the database. Tests are fastest. Good for backend logic that doesn’t depend on SQL semantics.
Real production engine in a testcontainer: Postgres, MySQL, or whatever the production engine is, run in a per-test container or a transaction that rolls back at teardown. Slightly slower but exercises the real query plan, real constraints, real migration. The page on the API provider pattern covers when to prefer each.
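The first configuration, doubled persistence, can be as small as the sketch below. The `Order` shape and `OrderRepository` interface are hypothetical; the point is that the component depends on the interface, and the test wires in an in-memory implementation. As noted above, this does not exercise SQL semantics, constraints, or migrations.

```typescript
interface Order {
  id: string;
  itemId: string;
  quantity: number;
}

// The persistence boundary the component depends on.
interface OrderRepository {
  save(order: Order): void;
  findById(id: string): Order | undefined;
}

// In-memory double. Rows are copied on the way in and out so callers cannot
// mutate stored state by holding a reference, roughly as with a real database.
function inMemoryOrderRepository(): OrderRepository {
  const rows: { [id: string]: Order } = {};
  const copy = (order: Order): Order => ({
    id: order.id,
    itemId: order.itemId,
    quantity: order.quantity,
  });
  return {
    save(order) {
      rows[order.id] = copy(order);
    },
    findById(id) {
      return Object.prototype.hasOwnProperty.call(rows, id) ? copy(rows[id]) : undefined;
    },
  };
}
```

Creating a new repository per test gives each test its own empty store, which keeps the suite deterministic without any cleanup step.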

A component test does not exercise more than one component end-to-end. A test that drives a UI which calls a real backend which writes to a real database is a fullstack flow - that’s an end-to-end test. Each component gets its own component tests at its own boundary; the frontend has its tests against a doubled backend, the backend has its tests against a doubled downstream and a real-or-doubled DB.

This is broader than a sociable unit test: a sociable unit test exercises a single behavior through a few collaborators; a component test exercises the entire assembled component through its public interface.

When component tests earn their keep

A component test overlaps with the combination of provider contract tests, sociable unit tests, and spies on collaborators. Each of those layers covers part of what a component test asserts. Component tests pull their weight when they catch something the other layers can’t, or when they let a single test answer a single user-story-level question.

They earn their keep when the component has:

Cross-cutting behavior at the seams. Auth, multi-tenancy, persistence, and event emission interacting on a single request is where production bugs live. Each layer in isolation may pass; the seam between them is what a component test exercises.
Non-trivial framework wiring. Middleware ordering, error-handler mapping (does a domain exception become 409 or 500?), DI-container configuration, request-body limits. Spy-based unit tests bypass all of this. Contract tests bypass it unless they exercise the fully booted app.
Acceptance criteria you want to map 1:1 to tests. A test that says “POST /orders with valid payment returns 201 and emits OrderPlaced” reads as the user story. The fragmented equivalent (contract test for shape + unit test for domain + spy for delegation + unit test for emission) covers the same ground but no single test reads as the story.
Realistic UI flows. Keyboard navigation, focus management, and screen-reader announcements need the rendered DOM, not a unit test of a component class.
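The error-handler mapping mentioned in the list is a typical piece of wiring that only a test through the public interface will exercise. A minimal sketch of such a mapping follows; the error kinds and every status code other than 409 for insufficient stock are assumptions made for illustration.

```typescript
// Errors the domain layer can raise, modelled as a discriminated union.
type DomainError =
  | { kind: "insufficient-stock"; itemId: string }
  | { kind: "order-not-found"; orderId: string }
  | { kind: "validation"; field: string };

// Anything else that reaches the handler is treated as unexpected.
type HandledError = DomainError | { kind: "unexpected" };

// The mapping a framework error handler would apply before responding.
function toHttpStatus(error: HandledError): number {
  switch (error.kind) {
    case "insufficient-stock":
      return 409; // conflict with current resource state
    case "order-not-found":
      return 404;
    case "validation":
      return 400;
    default:
      return 500; // do not leak unexpected failures as client errors
  }
}
```

A unit test can cover this function directly, but only a component test confirms that the framework actually routes a domain error thrown deep in the stack through this mapping rather than through a default 500 handler.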

They overlap heavily with other layers when the component is:

Thin CRUD with no middleware to speak of. Provider contract verification against a booted app plus sociable unit tests of the domain cover most of what a component test would. Keep one per critical flow as smoke coverage; skip exhaustive component coverage.
Pure transformation logic. Parsers, calculators, scheduling math. Unit tests give better coverage per unit of effort.

If you’re choosing between an extra component test and an extra unit test for the same behavior, the unit test is cheaper to write, run, and maintain. Component tests earn their keep at the seams between layers, not in repeating ground that unit tests already cover.

Two boundary cases worth naming:

A test that needs to span more than one component (a real frontend driving a real backend) is an end-to-end test, not a component test.
A test that exercises a single unit of behavior through a few collaborators is a unit test, not a component test.

Characteristics

Property	Value
Speed	Milliseconds to seconds
Determinism	Deterministic (with per-test isolation when a real engine is used)
Scope	One backend service or one frontend component
Dependencies	Systems the team doesn’t control are doubled
Network	Localhost only (testcontainers permitted)
Database	Doubled (in-memory) or production engine in a per-test testcontainer
Breaks build	Yes

Examples

Backend Service

A component test for a REST API, exercising the full application stack with the downstream inventory service replaced by a test double:

Backend component test - order creation with stubbed inventory service

import request from "supertest";
import nock from "nock"; // one possible HTTP stubbing library; any equivalent works
import { app } from "../src/app"; // hypothetical path to the application under test

describe("POST /orders", () => {
  // Remove stubs after each test so no state leaks between tests
  afterEach(() => nock.cleanAll());

  it("should create an order and return 201", async () => {
    // Arrange: stub the inventory service response
    nock("https://inventory.internal")
      .get("/stock/item-42")
      .reply(200, { available: true, quantity: 10 });

    // Act: send a request through the full application stack
    const response = await request(app)
      .post("/orders")
      .send({ itemId: "item-42", quantity: 2 });

    // Assert: verify the public interface response
    expect(response.status).toBe(201);
    expect(response.body.orderId).toBeDefined();
    expect(response.body.status).toBe("confirmed");
  });

  it("should return 409 when inventory is insufficient", async () => {
    // Arrange: the inventory service reports no stock for the item
    nock("https://inventory.internal")
      .get("/stock/item-42")
      .reply(200, { available: false, quantity: 0 });

    const response = await request(app)
      .post("/orders")
      .send({ itemId: "item-42", quantity: 2 });

    expect(response.status).toBe(409);
    expect(response.body.error).toMatch(/insufficient/i);
  });
});

Frontend Component

A component test exercising a login flow with a stubbed authentication service:

Frontend component test - login flow with stubbed auth service

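import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
import "@testing-library/jest-dom";
import App from "./App"; // hypothetical path to the app shell
import * as authService from "./authService"; // hypothetical module wrapping the auth API

// Replace the auth module with an automatic Jest mock
jest.mock("./authService");
const mockAuthService = jest.mocked(authService);

describe("Login page", () => {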
  it("should redirect to the dashboard after successful login", async () => {
    mockAuthService.login.mockResolvedValue({ token: "abc123" });

    render(<App />);
    await userEvent.type(screen.getByLabelText("Email"), "ada@example.com");
    await userEvent.type(screen.getByLabelText("Password"), "s3cret");
    await userEvent.click(screen.getByRole("button", { name: "Sign in" }));

    expect(await screen.findByText("Dashboard")).toBeInTheDocument();
  });
});

Accessibility Verification

Component tests already exercise the UI from the actor’s perspective, making them the natural place to verify that interactions work for all users. Accessibility assertions fit alongside existing assertions rather than in a separate test suite.

This is the second of three tiers in the Accessibility testing strategy: static-analysis linting catches structural violations in source, component tests catch the rendered-only ones (computed contrast, focus order, keyboard operability), and manual audits cover the subjective remainder.

Accessibility component test - keyboard navigation and WCAG assertions

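// Accessibility scanner setup. jest-axe is used here as one possible scanner (an assumption).
import { axe, toHaveNoViolations } from "jest-axe";
expect.extend(toHaveNoViolations);
const accessibilityScanner = axe;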

describe("Checkout flow", () => {
  it("should be completable using only the keyboard", async () => {
    render(<CheckoutPage />);

    await userEvent.tab();
    expect(screen.getByLabelText("Card number")).toHaveFocus();

    await userEvent.type(screen.getByLabelText("Card number"), "4111111111111111");
    await userEvent.tab();
    await userEvent.type(screen.getByLabelText("Expiry"), "12/27");
    await userEvent.tab();
    await userEvent.keyboard("{Enter}");

    expect(await screen.findByText("Order confirmed")).toBeInTheDocument();

    const results = await accessibilityScanner(document.body);
    expect(results).toHaveNoViolations();
  });
});

Anti-Patterns

Calling a live external service the team doesn’t own: real network calls to a third-party API or another team’s service make the test non-deterministic and slow. Replace anything across the component boundary with a test double of a thin gateway you own.
Spanning more than one component: a test that drives a UI, makes a real network call to a backend, and waits for a real DB write is a fullstack flow, not a component test. Each component gets its own component tests at its own boundary; the cross-component flow belongs in end-to-end tests, and only for the few cases that can’t be covered any other way.
Sharing a live, mutable database between tests: leftover state and ordering dependencies produce flakes and “works on my machine” failures. The fix isn’t necessarily “no real DB”. A per-test testcontainer or a per-test transaction with rollback gives you the production engine and isolation. The anti-pattern is the shared, mutable part.
Ignoring the actor’s perspective: component tests should interact with the system the way a user or API consumer would. Reaching into internal state or bypassing the public interface defeats the purpose.
Duplicating unit test coverage: component tests should focus on feature-level behavior and happy/critical paths. Leave exhaustive edge case and permutation testing to unit tests.
Slow test setup: if bootstrapping the component takes too long, invest in faster initialization (in-memory stores, lazy loading) rather than skipping component tests.
Deferring accessibility testing to manual audits: automated WCAG checks in component tests catch violations on every commit. Quarterly audits find problems that are weeks old.
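
The fix named in the first anti-pattern, a test double standing in for a thin gateway you own, can be sketched as follows. Every name here (createCheckout, the gateway's charge method, fakePaymentGateway) is hypothetical and chosen for illustration. The point is only that the component depends on a small interface, and the test supplies a deterministic stand-in for it.

```javascript
// Hypothetical component: depends on a thin gateway, not on a vendor SDK.
function createCheckout(paymentGateway) {
  return {
    placeOrder(order) {
      const charge = paymentGateway.charge(order.totalCents, order.cardToken);
      return charge.approved
        ? { status: "confirmed", chargeId: charge.id }
        : { status: "declined", reason: charge.reason };
    },
  };
}

// Test double for the gateway: deterministic, in-memory, no network.
function fakePaymentGateway({ declineTokens = [] } = {}) {
  const charges = [];
  return {
    charges, // exposed so a test can inspect what crossed the boundary
    charge(amountCents, cardToken) {
      charges.push({ amountCents, cardToken });
      if (declineTokens.includes(cardToken)) {
        return { approved: false, reason: "card_declined" };
      }
      return { approved: true, id: `charge-${charges.length}` };
    },
  };
}

const gateway = fakePaymentGateway({ declineTokens: ["tok_bad"] });
const checkout = createCheckout(gateway);
const confirmed = checkout.placeOrder({ totalCents: 4200, cardToken: "tok_good" });
const declined = checkout.placeOrder({ totalCents: 4200, cardToken: "tok_bad" });
```

Because the double is created inside the test, each test gets its own isolated state, which also sidesteps the shared-mutable-state problem described above.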

Connection to CD Pipeline

Component tests run after unit tests in the pipeline and provide the broadest fast, deterministic feedback before code is promoted:

Local development: run before committing. Deterministic scope keeps them fast enough to run locally without slowing the development loop.
PR verification: CI executes the full suite; failures block merge.
Trunk verification: the same tests run on the merged HEAD to catch conflicts.
Pre-deployment gate: component tests can serve as the final deterministic gate before a build artifact is promoted.

Because component tests are deterministic, they should always break the build on failure. A healthy CD pipeline relies on a strong component test suite to verify assembled behavior - not just individual units - before any code reaches an environment with real dependencies.

2.2 - Contract Tests

Deterministic tests that verify interface boundaries with external systems using test doubles. Also called narrow integration tests. Validated by integration tests running against real systems.

Consumer-driven contract flow: consumer team runs a component test against a provider test double, generating a contract artifact. The provider team runs a verification step against the real service using the consumer contract. Both sides discover different things: consumers check for fields and types they depend on; providers check they have not broken any consumer.

Definition

A contract test (also called a narrow integration test) is a deterministic test that validates your code’s interaction with an external system’s interface using test doubles. It verifies that the boundary layer code - HTTP clients, database query layers, message producers - correctly handles the expected request/response shapes, field names, types, and status codes.

A contract test validates interface structure, not business behavior. It answers “does my code correctly interact with the interface I expect?” not “is the logic correct?” Business logic belongs in component tests.

Because contract tests use test doubles rather than live systems, they are deterministic and run on every commit as part of the pipeline. They block the build on failure, just like unit and component tests.

Integration tests validate that contract test doubles still match the real external systems by running against live dependencies post-deployment.

Consumer and Provider Perspectives

Every contract has two sides. The questions each side is trying to answer are different.

Consumer contract testing

The consumer is the service or component that depends on an external API. A consumer contract test asks:

“Do the fields I depend on still exist, in the types I expect, with the status codes I handle?”

Consumer tests assert only on the subset of the API the consumer actually uses - not everything the provider exposes. A consumer that only needs id and email from a user object should not assert on address or phone. This allows providers to add new fields freely without breaking consumers.

Following Postel’s Law (“be conservative in what you send, be liberal in what you accept”), consumer tests should accept any valid response that contains the fields they need and tolerate fields they do not use.

What a consumer is trying to discover:

Has the provider changed or removed a field I depend on?
Has the provider changed a type I expect (string to integer, object to array)?
Has the provider changed a status code I handle?
Does the provider still accept the request format I send?
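
The consumer-side rule of reading only the fields you depend on is sometimes called a tolerant reader. A minimal sketch, assuming a hypothetical parseUser function in the consumer's boundary layer that needs only id and email:

```javascript
// Hypothetical consumer-side parser: reads only `id` and `email`.
function parseUser(payload) {
  if (typeof payload !== "object" || payload === null) {
    throw new TypeError("user payload must be an object");
  }
  const { id, email } = payload; // everything else is ignored on purpose
  if (typeof id !== "string") throw new TypeError("user.id must be a string");
  if (typeof email !== "string") throw new TypeError("user.email must be a string");
  return { id, email };
}

// The provider added `phone` and `address`: the consumer is unaffected.
const user = parseUser({
  id: "u-1",
  email: "a@example.com",
  phone: "555-0100",
  address: { city: "Springfield" },
});

// The provider changed `id` from string to integer: the consumer notices.
let typeChangeDetected = false;
try {
  parseUser({ id: 1, email: "a@example.com" });
} catch (err) {
  typeChangeDetected = err instanceof TypeError;
}
```

Added fields pass through silently, while a removed field or a changed type fails loudly at the boundary, which is what the questions above are asking for.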

Provider contract testing

The provider is the service that owns the API. A provider contract test asks:

“Have my changes broken any of my consumers?”

A provider runs contract tests to verify that its API responses still satisfy the expectations of every known consumer. This warns the provider that a change is breaking before any consumer deploys and discovers the breakage.

What a provider is trying to discover:

Have I removed or renamed a field that a consumer depends on?
Have I changed a type in a way that breaks deserialization for a consumer?
Have I changed error behavior (status codes, error formats) that consumers handle?
Is my API still backward compatible with all published consumer expectations?
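
The provider-side questions can be sketched as a check of one response against every consumer's published expectations. The expectation format (field name mapped to a typeof result) and the consumer names are assumptions made for this sketch, not any tool's actual contract format.

```javascript
// Hypothetical published expectations: field name -> expected typeof result.
const consumerExpectations = {
  "order-service": { available: "boolean", quantity: "number" },
  "storefront-ui": { available: "boolean" },
};

// Returns one message per broken expectation; an empty list means compatible.
function findBrokenConsumers(response, expectations) {
  const failures = [];
  for (const [consumer, fields] of Object.entries(expectations)) {
    for (const [field, expectedType] of Object.entries(fields)) {
      if (!(field in response)) {
        failures.push(`${consumer}: missing field "${field}"`);
      } else if (typeof response[field] !== expectedType) {
        failures.push(
          `${consumer}: "${field}" is ${typeof response[field]}, expected ${expectedType}`
        );
      }
    }
  }
  return failures;
}

// Adding a field is backward compatible; renaming one is not.
const additive = findBrokenConsumers(
  { available: true, quantity: 10, warehouse: "east" },
  consumerExpectations
);
const renamed = findBrokenConsumers(
  { available: true, qty: 10 },
  consumerExpectations
);
```

Here the rename breaks only the order service, because the storefront never read quantity. That per-consumer precision is what lets a provider judge the real impact of a change.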

Approaches to Contract Testing

Consumer-driven contract development

In consumer-driven contracts (CDC), the consumer writes the contract. The consumer defines their expectations as executable tests - what request they will send and what response shape they require. These expectations are published to a shared contract broker and the provider runs them as part of their own build.

The flow:

Consumer team writes tests defining their expectations against a mock provider.
The consumer tests generate a contract artifact.
The contract is published to a shared contract broker.
The provider team runs the consumer’s contract expectations against their real implementation.
If the provider’s implementation satisfies the contract, the provider can deploy with confidence it will not break this consumer. If not, the teams negotiate before merging the breaking change.
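
The five steps above can be sketched end to end with an in-memory stand-in for the broker. This is an illustration of the flow under simplifying assumptions, not the API of any contract-testing tool: buildContract, verifyProvider, the bodyTypes format, and the handler are all hypothetical.

```javascript
// Steps 1-2: the consumer records its expectations, producing a contract artifact.
function buildContract(consumer, provider, interactions) {
  return { consumer, provider, interactions };
}

const contract = buildContract("order-service", "inventory-service", [
  {
    description: "a request for item-42 stock",
    request: { method: "GET", path: "/stock/item-42" },
    response: { status: 200, bodyTypes: { available: "boolean", quantity: "number" } },
  },
]);

// Step 3: publish to a broker (here just an in-memory Map keyed by provider).
const broker = new Map();
broker.set(contract.provider, [contract]);

// Step 4: the provider replays every interaction against its own handler.
function verifyProvider(providerName, handler) {
  const failures = [];
  for (const { consumer, interactions } of broker.get(providerName) || []) {
    for (const { description, request, response } of interactions) {
      const actual = handler(request);
      if (actual.status !== response.status) {
        failures.push(`${consumer}: ${description}: status ${actual.status}`);
        continue;
      }
      for (const [field, type] of Object.entries(response.bodyTypes)) {
        if (typeof actual.body[field] !== type) {
          failures.push(`${consumer}: ${description}: bad field "${field}"`);
        }
      }
    }
  }
  return failures;
}

// Stand-in for the provider's real request handler.
const inventoryHandler = (request) =>
  request.path === "/stock/item-42"
    ? { status: 200, body: { available: true, quantity: 7 } }
    : { status: 404, body: {} };

// Step 5: an empty list means this consumer's expectations are still met.
const verification = verifyProvider("inventory-service", inventoryHandler);
```

Note that the provider returns quantity 7 while the consumer's example may have used a different number. Only the shape is compared, so the two sides never need to agree on test data.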

CDC works well for evolving systems: it grounds the API design in actual consumer needs rather than the provider’s assumptions about what consumers will use.

Contract-first development

In contract-first development, the interface is defined as a formal artifact - an OpenAPI specification, a Protobuf schema, an Avro schema, or similar - before any implementation is written. Both the consumer and provider code are generated from or validated against that artifact.

The flow:

Teams agree on the interface contract (usually during design or story refinement).
The contract is committed to version control.
Consumer and provider teams develop independently, each generating or validating their code against the contract.
Tests on both sides verify conformance to the contract - not to each other’s implementation.

Contract-first works well for new APIs and parallel development: it lets consumer and provider teams work simultaneously without waiting for a real implementation, and makes the interface an explicit design decision rather than an emergent one.
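
In a contract-first setup, both sides validate against the shared artifact rather than against each other. The sketch below uses a hand-rolled validator and a heavily simplified schema object to stay self-contained; both are assumptions for illustration, and a real project would typically validate against the OpenAPI or JSON Schema document with an established library.

```javascript
// The agreed contract, committed to version control. Hypothetical and
// simplified: only `required`, per-field `type`, and `additionalProperties`.
const stockResponseSchema = {
  required: ["available", "quantity"],
  properties: {
    available: { type: "boolean" },
    quantity: { type: "integer" },
    warehouse: { type: "string" },
  },
  additionalProperties: false,
};

function matchesType(value, type) {
  if (type === "integer") return Number.isInteger(value);
  return typeof value === type; // "boolean", "string", "number"
}

// Returns a list of violations; an empty list means the value conforms.
function validate(value, schema) {
  const errors = [];
  for (const field of schema.required) {
    if (!(field in value)) errors.push(`missing required field "${field}"`);
  }
  for (const [field, fieldValue] of Object.entries(value)) {
    const rule = schema.properties[field];
    if (!rule) {
      if (schema.additionalProperties === false) {
        errors.push(`unexpected field "${field}"`);
      }
    } else if (!matchesType(fieldValue, rule.type)) {
      errors.push(`field "${field}" is not of type ${rule.type}`);
    }
  }
  return errors;
}

// Provider side: the response it produces must conform.
const providerErrors = validate({ available: true, quantity: 3 }, stockResponseSchema);
// Consumer side: the canned response in its test double must conform too.
const staleDoubleErrors = validate({ available: true, quantity: "3" }, stockResponseSchema);
```

Validating the consumer's test double against the same schema is what keeps the two teams honest while they work in parallel: neither side can drift without its own build noticing.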

Choosing between them

Situation	Prefer
Existing API with multiple consumers, evolving over time	Consumer-driven (CDC)
New API, teams working in parallel	Contract-first
Third-party API you do not control	Consumer-only contract tests (no provider side)
Public API with external consumers you cannot reach	Provider tests against published spec

The two approaches are not mutually exclusive. A team may define an initial contract-first schema and then adopt CDC tooling as the number of consumers grows.

Characteristics

Property	Value
Speed	Milliseconds to seconds
Determinism	Always deterministic (uses test doubles)
Scope	Interface boundary between two systems
Dependencies	All replaced with test doubles
Network	None or localhost only
Database	None
Breaks build	Yes

Examples

A consumer contract test using a consumer-driven contract tool:

Consumer contract test - order service consuming inventory API

describe("Order Service - Inventory Provider Contract", () => {
  it("should receive stock availability in the expected format", async () => {
    // Define what the consumer expects from the provider
    await contractTool.addInteraction({
      state: "item-42 is in stock",
      uponReceiving: "a request for item-42 stock",
      withRequest: { method: "GET", path: "/stock/item-42" },
      willRespondWith: {
        status: 200,
        body: {
          // Only assert on fields the consumer actually uses
          available: matchType(true),   // boolean
          quantity: matchType(10),      // integer
        },
      },
    });

    // Exercise the consumer code against the mock provider
    const result = await inventoryClient.checkStock("item-42");
    expect(result.available).toBe(true);
  });
});

A provider verification test that runs consumer expectations against the real implementation:

Provider verification - running consumer contracts against the real API

describe("Inventory Service - Provider Verification", () => {
  it("should satisfy all registered consumer contracts", async () => {
    await contractBroker.verifyProvider({
      provider: "InventoryService",
      providerBaseUrl: "http://localhost:3001",
      brokerUrl: "https://contract-broker.internal",
      providerVersion: process.env.GIT_SHA,
    });
  });
});

A contract-first schema validation test verifying a provider response against an OpenAPI spec:

Contract-first test - OpenAPI schema validation

// The OpenAPI document is the source of truth. Validate the whole response
// against the named schema rather than hand-checking individual fields - a
// field-by-field check drifts from the spec the moment the spec changes.
const validator = openApiValidator(openApiSpec);

describe("GET /stock/:id - OpenAPI contract", () => {
  it("should return a response conforming to the published schema", async () => {
    const response = await fetch("http://localhost:3001/stock/item-42");
    const body = await response.json();

    expect(response.status).toBe(200);

    // Asserts structure, types, required fields, and additionalProperties
    // rules exactly as the OpenAPI schema declares them.
    const result = validator.validate(body, "StockResponse");
    expect(result.errors).toEqual([]);
  });
});

Anti-Patterns

Asserting on business logic: contract tests verify structure, not behavior. A contract test that asserts quantity > 0 when in stock is crossing into business logic territory. That belongs in component tests.
Asserting on fields the consumer does not use: over-specified consumer contracts make providers brittle. Only assert on what your code actually reads.
Testing specific data values: asserting that name equals "Alice" makes the test brittle. Assert on types, required fields, and status codes instead.
Hitting live systems in contract tests: contract tests must use test doubles to stay deterministic. Validating doubles against live systems is the role of integration tests, which run post-deployment.
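Running infrequently: contract tests themselves run on every commit, but the integration tests that validate their test doubles against the live provider must also run often enough to catch drift before it causes a production incident. High-volatility APIs may need hourly runs.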
Skipping provider verification in CDC: publishing consumer expectations is only half the pattern. The provider must actually run those expectations for CDC to work.
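
The difference between asserting on values and asserting on types can be sketched with a small type matcher. The matchType and mismatches functions below are hypothetical helpers written for this illustration. They show the idea behind type-based matching, not the behavior of any specific contract tool.

```javascript
// Hypothetical matcher: remembers an example value, compares by type only.
function matchType(example) {
  const kind = (v) => (Array.isArray(v) ? "array" : v === null ? "null" : typeof v);
  return { matches: (actual) => kind(actual) === kind(example) };
}

// Returns the names of fields whose actual value fails its matcher.
function mismatches(expectedBody, actualBody) {
  return Object.entries(expectedBody)
    .filter(([field, matcher]) => !matcher.matches(actualBody[field]))
    .map(([field]) => field);
}

const expectedBody = { name: matchType("Alice"), quantity: matchType(10) };

// Different values, same types: the contract still holds.
const differentData = mismatches(expectedBody, { name: "Bob", quantity: 0 });
// A type change (integer to string) is what the contract should catch.
const changedType = mismatches(expectedBody, { name: "Bob", quantity: "0" });
```

A value-based assertion would have failed on "Bob" even though nothing about the interface changed. The type-based one fails only when the structure does.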

Connection to CD Pipeline

Contract tests run on every commit as part of the deterministic pipeline:

Contract tests in the pipeline

On every commit          Unit tests              Deterministic    Blocks
                         Component tests         Deterministic    Blocks
                         Contract tests          Deterministic    Blocks

Post-deployment          Integration tests       Non-deterministic   Validates contract doubles
                         E2E smoke tests         Non-deterministic   Triggers rollback

Contract tests verify that your boundary layer code correctly interacts with the interfaces you depend on. Integration tests validate that those test doubles still match the real external systems by running against live dependencies post-deployment.

2.3 - End-to-End Tests

Tests that exercise two or more real components up to the full system. Non-deterministic by nature; never a pre-merge gate.

End-to-end test scope spectrum. Narrow scope: a test drives a real service that calls a real database. Full-system scope: a browser drives a real frontend, which calls a real backend, which calls a real database. All components are real at every scope - no test doubles.

Definition

An end-to-end test exercises real components working together - no test doubles replace the dependencies under test. The scope ranges from two services calling each other, to a service talking to a real database, to a complete user journey through every layer of the system.

The defining characteristic is that real external dependencies are present: actual databases, live downstream services, real message brokers, or third-party APIs. Because those dependencies introduce timing, state, and availability factors outside the test’s control, end-to-end tests are typically non-deterministic. They fail for reasons unrelated to code correctness - network instability, service unavailability, test data collisions, or third-party rate limits.

Terminology note

“Integration test” and “end-to-end test” are often used interchangeably in the industry. Martin Fowler distinguishes between narrow integration tests (which use test doubles at the boundary - what this site calls contract tests) and broad integration tests (which use real dependencies). This site treats them as distinct categories: integration tests validate that contract test doubles still match the real external systems, while end-to-end tests exercise user journeys or multi-service flows through real systems.

Scope

End-to-end tests cover a spectrum based on how many components are real:

Scope	Example
Narrow	A service making real calls to a real database
Service-to-service	Order service calling the real inventory service
Multi-service	A user journey spanning three live services
Full system	A browser test through a staging environment with all dependencies live

All of these involve real external dependencies. All share the same fundamental non-determinism risk. Use the narrowest scope that gives you the confidence you need.

When to Use

Use end-to-end tests sparingly. They are the most expensive test type to write, run, and maintain. Use them for:

Smoke testing a deployed environment to verify that key integrations are functioning after a deployment.
Happy-path validation of critical business flows that cannot be verified any other way (e.g., a payment flow that depends on a real payment provider).
Cross-team workflows that span multiple deployables and cannot be isolated within a single component test.

Do not use end-to-end tests to cover edge cases, error handling, or input validation. Those scenarios belong in unit or component tests, which are faster, cheaper, and deterministic.

Vertical vs. horizontal

Vertical end-to-end tests target features owned by a single team:

An order is created and the confirmation email is sent.
A user uploads a file and it appears in their document list.

Horizontal end-to-end tests span multiple teams:

A user navigates from homepage through search, product detail, cart, and checkout.

Horizontal tests have a large failure surface and are significantly more fragile. They are not suitable for blocking the pipeline; run them on a schedule and review failures out-of-band.

Characteristics

Property	Value
Speed	Seconds to minutes per test
Determinism	Typically non-deterministic
Scope	Two or more real components, up to the full system
Dependencies	Real services, databases, brokers, third-party APIs
Network	Full network access
Database	Live databases
Breaks build	No - triggers review or rollback, not a pre-merge gate

Examples

A narrow end-to-end test verifying a service against a real database:

Narrow E2E - order service against a real database

describe("OrderRepository (real database)", () => {
  it("should persist and retrieve an order by ID", async () => {
    const order = await orderRepository.create({
      itemId: "item-42",
      quantity: 2,
      customerId: "cust-99",
    });

    const retrieved = await orderRepository.findById(order.id);
    expect(retrieved.itemId).toBe("item-42");
    expect(retrieved.status).toBe("pending");
  });
});

A full-system browser test using a browser automation framework:

Full-system E2E - add to cart and checkout with browser automation

test("user can add an item to cart and check out", async ({ page }) => {
  await page.goto("https://staging.example.com");
  await page.getByRole("link", { name: "Running Shoes" }).click();
  await page.getByRole("button", { name: "Add to Cart" }).click();

  await page.getByRole("link", { name: "Cart" }).click();
  await expect(page.getByText("Running Shoes")).toBeVisible();

  await page.getByRole("button", { name: "Checkout" }).click();
  await expect(page.getByText("Order confirmed")).toBeVisible();
});

Anti-Patterns

Using end-to-end tests as the primary safety net: this is the ice cream cone anti-pattern. The majority of your confidence should come from unit and component tests, which are fast and deterministic. End-to-end tests are expensive insurance for the gaps.
Blocking the pipeline: end-to-end tests must never be a pre-merge gate. Their non-determinism will eventually block a deploy for reasons unrelated to code quality.
Blocking on horizontal tests: horizontal tests span too many teams and failure surfaces. Run them on a schedule and review failures as a team.
Ignoring flaky failures: track frequency and root cause. A test that fails for environmental reasons is not providing a code quality signal - fix it or remove it.
Testing edge cases here: exhaustive permutation testing in end-to-end tests is slow, expensive, and duplicates what unit and component tests should cover.
Not capturing failure context: end-to-end failures are expensive to debug. Capture screenshots, network logs, and video recordings automatically on failure.
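
Tracking flake frequency does not need special tooling to get started. A minimal sketch, assuming run records with hypothetical test, sha, and passed fields exported from the CI system: a test counts as flaky on a commit when it both passed and failed against the same code.

```javascript
// Hypothetical run history: one record per test execution.
const runs = [
  { test: "checkout smoke", sha: "a1", passed: true },
  { test: "checkout smoke", sha: "a1", passed: false },
  { test: "checkout smoke", sha: "b2", passed: true },
  { test: "search smoke", sha: "a1", passed: true },
  { test: "search smoke", sha: "b2", passed: false },
  { test: "search smoke", sha: "b2", passed: false },
];

// A test is counted as flaky on a commit when it both passed and failed there.
function flakeReport(history) {
  const outcomes = new Map(); // "test@sha" -> Set of observed pass/fail values
  for (const { test, sha, passed } of history) {
    const key = `${test}@${sha}`;
    if (!outcomes.has(key)) outcomes.set(key, new Set());
    outcomes.get(key).add(passed);
  }
  const report = {};
  for (const [key, seen] of outcomes) {
    const test = key.slice(0, key.lastIndexOf("@"));
    if (!report[test]) report[test] = { commits: 0, flakyCommits: 0 };
    report[test].commits += 1;
    if (seen.size === 2) report[test].flakyCommits += 1;
  }
  return report;
}

const report = flakeReport(runs);
```

In this data the checkout test is flaky on one of two commits, while the search test fails consistently on b2. The second pattern is more likely a real defect than an environmental flake, and the two deserve different responses.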

Connection to CD Pipeline

End-to-end tests run after deployment, not before:

E2E tests in the pipeline

Stage 1 (every commit)    Unit tests              Deterministic    Blocks
                          Component tests         Deterministic    Blocks
                          Contract tests          Deterministic    Blocks

Post-deployment           Integration tests       Non-deterministic   Validates contract doubles
                          E2E smoke tests         Non-deterministic   Triggers rollback
                          Scheduled E2E suites    Non-deterministic   Review out-of-band
                          Synthetic monitoring    Non-deterministic   Triggers alerts

A team may choose to gate on a small, highly reliable set of vertical end-to-end smoke tests immediately after deployment. This is acceptable only if the team invests in keeping those tests stable. A flaky smoke gate is worse than no gate: it trains developers to ignore failures.

To keep the end-to-end suite small, cover interface boundaries with contract tests before merge and let post-deployment integration tests confirm that the test doubles still match reality. This gives you deterministic pre-merge confidence without depending on live external systems.

2.4 - Integration Tests

Tests that exercise real external dependencies to validate that contract test doubles still match reality. Non-deterministic; never a pre-merge gate.

“Integration test” is widely used but inconsistently defined. On this site, integration tests are tests that involve real external dependencies - actual databases, live downstream services, real message brokers, or third-party APIs. They are non-deterministic because those dependencies introduce timing, state, and availability factors outside the test’s control.

Integration tests serve a specific role in the test architecture: they validate that the test doubles used in your contract tests still match reality. Without integration tests, contract test doubles can silently drift from the real behavior of the systems they simulate - giving false confidence.

Because integration tests depend on live systems, they run post-deployment or on a schedule - never as a pre-merge gate. Failures trigger review or rollback decisions, not build failures.
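
One way to detect that kind of drift is to compare the structural shape of a test double's canned response with the shape of a response captured from the live system. The sketch below takes both as plain data, so the live call itself is out of scope; shapeOf and findDrift are hypothetical helpers written for this illustration.

```javascript
// Reduce a value to its structural shape: field names and types, no data.
function shapeOf(value) {
  if (Array.isArray(value)) return value.length ? [shapeOf(value[0])] : [];
  if (value === null) return "null";
  if (typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value).sort().map((key) => [key, shapeOf(value[key])])
    );
  }
  return typeof value;
}

// Report what the double promises that the live shape no longer honors.
// Extra live fields are ignored: consumers should tolerate additions.
function findDrift(expected, actual, path = "$") {
  if (typeof expected === "string") {
    return expected === actual ? [] : [`${path}: expected ${expected}`];
  }
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual)) return [`${path}: expected array`];
    if (expected.length === 0 || actual.length === 0) return [];
    return findDrift(expected[0], actual[0], `${path}[]`);
  }
  if (typeof actual !== "object" || actual === null || Array.isArray(actual)) {
    return [`${path}: expected object`];
  }
  return Object.keys(expected).flatMap((key) =>
    key in actual
      ? findDrift(expected[key], actual[key], `${path}.${key}`)
      : [`${path}.${key}: missing`]
  );
}

const cannedResponse = { available: true, quantity: 10 };
const liveResponse = { available: true, quantity: "10", warehouse: "east" };
const drift = findDrift(shapeOf(cannedResponse), shapeOf(liveResponse));
```

A non-empty drift list is a signal to update the test double and re-run the contract tests, not a reason to fail the build that already shipped.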

For tests that validate interface boundaries using test doubles (deterministic), see Contract Tests.

For full-system browser tests and multi-service smoke tests, see End-to-End Tests.

A note on the word “integration test”

The industry uses “integration test” for at least two different things, and this site keeps them separate. The page you are reading covers the out-of-band flavor: a non-deterministic check against real external systems that runs on a schedule or post-deploy and never gates the build.

There is also a deterministic, in-band flavor - an adapter integration test (Toby Clemson’s “gateway integration test”). It exercises a single boundary adapter against a dependency the team fully controls (typically a per-test testcontainer running the pinned production engine) and pins the adapter’s protocol behavior: serialization, deserialization, headers, error mapping. Because it is deterministic, it runs in the pre-merge suite and blocks the build, the same as a unit or contract test. When the dependency is not team-controlled - a third-party API, a shared environment - that same adapter test runs out-of-band, as described on this page.

So “integration test,” unqualified, is ambiguous on this site. When a page means the in-band adapter flavor, it says adapter integration test; when it means the out-of-band check, it links here.

2.5 - Static Analysis

Code analysis tools that evaluate non-running code for security vulnerabilities, complexity, and best practice violations.

Definition

Static analysis (also called static testing) evaluates non-running code against rules for known good practices. Unlike other test types that execute code and observe behavior, static analysis inspects source code, configuration files, and dependency manifests to detect problems before the code ever runs.

Static analysis serves several key purposes:

Catches errors that would otherwise surface at runtime.
Warns of excessive complexity that degrades the ability to change code safely.
Identifies security vulnerabilities and coding patterns that provide attack vectors.
Enforces coding standards by removing subjective style debates from code reviews.
Alerts to dependency issues such as outdated packages, known CVEs, license incompatibilities, or supply-chain compromises.

When to Use

Static analysis should run continuously, at every stage where feedback is possible:

In the IDE: real-time feedback as developers type, via editor plugins and language server integrations.
On save: format-on-save and lint-on-save catch issues immediately.
Pre-commit: hooks prevent problematic code from entering version control.
In CI: the full suite of static checks runs on every PR and on the trunk after merge, verifying that earlier local checks were not bypassed.

Static analysis is always applicable. Every project, regardless of language or platform, benefits from linting, formatting, and dependency scanning.

Characteristics

Property	Value
Speed	Seconds (typically the fastest test category)
Determinism	Always deterministic
Scope	Entire codebase (source, config, dependencies)
Dependencies	None (analyzes code at rest)
Network	None (except dependency scanners)
Database	None
Breaks build	Yes

Examples

Linting

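An .eslintrc.json configuration enforcing test quality rules. The test-specific rules shown here assume the eslint-plugin-jest plugin is installed; the last two are ESLint core rules:

Linter configuration for test quality rules

{
  "plugins": ["jest"],
  "rules": {
    "jest/no-disabled-tests": "warn",
    "jest/expect-expect": "error",
    "jest/no-commented-out-tests": "error",
    "jest/valid-expect": "error",
    "no-unused-vars": "error",
    "no-console": "warn"
  }
}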

Type Checking

Statically typed languages catch type mismatches at compile time, eliminating entire classes of runtime errors. Java, for example, rejects incompatible argument types before the code runs:

Java type checking example

public static double calculateTotal(double price, int quantity) {
    return price * quantity;
}

// Compiler error: incompatible types: String cannot be converted to double
calculateTotal("19.99", 3);

Dependency Scanning

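Dependency scanners compare the packages in your manifest and lockfile against databases of known vulnerabilities. A simplified illustration of the kind of findings npm audit reports:

npm audit output example

$ npm audit
found 2 vulnerabilities (1 moderate, 1 high)
  moderate: Regular Expression Denial of Service (ReDoS) in lodash < 4.17.21
  high:     Command Injection in lodash < 4.17.21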
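
Conceptually, a dependency scan is a version comparison against an advisory list followed by a gating decision. The sketch below uses made-up package names and advisories and a deliberately simplified version comparison, so treat it as an illustration of the idea rather than of how any real scanner works.

```javascript
// Compare two dotted numeric versions ("4.17.20" vs "4.17.21").
// Simplified: no pre-release tags or ranges, unlike full semver.
function compareVersions(a, b) {
  const left = a.split(".").map(Number);
  const right = b.split(".").map(Number);
  for (let i = 0; i < Math.max(left.length, right.length); i += 1) {
    const diff = (left[i] || 0) - (right[i] || 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

// Hypothetical advisory list and installed versions, for illustration only.
const advisories = [
  { pkg: "example-parser", fixedIn: "2.3.1", severity: "high" },
  { pkg: "example-logger", fixedIn: "1.0.5", severity: "moderate" },
];
const installed = { "example-parser": "2.3.0", "example-logger": "1.0.5" };

// A package is affected when its installed version is older than the fix.
function scan(installedVersions, knownAdvisories) {
  return knownAdvisories.filter(
    ({ pkg, fixedIn }) =>
      pkg in installedVersions &&
      compareVersions(installedVersions[pkg], fixedIn) < 0
  );
}

const findings = scan(installed, advisories);
// One possible policy: high or critical findings break the build.
const breaksBuild = findings.some((f) => ["high", "critical"].includes(f.severity));
```

Because new advisories are published against versions you already ship, the same scan can start failing without any code change, which is why scheduled runs matter as much as per-commit ones.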

Types of Static Analysis

Type	Purpose
Linting	Catches common errors and enforces good practices
Formatting	Enforces consistent code style, removing subjective debates
Complexity analysis	Flags overly deep or long code blocks that breed defects
Type checking	Prevents type-related bugs, replacing some unit tests
Security scanning	Detects known vulnerabilities and dangerous coding patterns
Dependency scanning	Checks for outdated, compromised, or incompatibly licensed dependencies
Accessibility linting	Detects missing alt text, ARIA violations, contrast failures, semantic HTML issues

Accessibility Linting

Accessibility linting catches deterministic WCAG violations the same way a security scanner catches known vulnerability patterns. Automated checks cover structural issues (missing alt text, invalid ARIA attributes, insufficient contrast ratios, broken heading hierarchy) while manual review covers subjective aspects like whether alt text is actually meaningful.
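
Checks such as contrast are deterministic because they reduce to arithmetic. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio, and applies the AA threshold of 4.5:1 for normal-size text. The function names are chosen for this illustration, and colors are assumed to be opaque sRGB triples.

```javascript
// Relative luminance of an sRGB color, per the WCAG 2.x definition.
function relativeLuminance([r, g, b]) {
  const linear = [r, g, b].map((channel) => {
    const c = channel / 255;
    return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
  });
  return 0.2126 * linear[0] + 0.7152 * linear[1] + 0.0722 * linear[2];
}

// Contrast ratio ranges from 1 (identical colors) to 21 (black on white).
function contrastRatio(foreground, background) {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// WCAG AA requires at least 4.5:1 for normal-size text.
const meetsAA = (fg, bg) => contrastRatio(fg, bg) >= 4.5;

const blackOnWhite = contrastRatio([0, 0, 0], [255, 255, 255]);
const lightGreyOnWhite = meetsAA([200, 200, 200], [255, 255, 255]);
```

Whether a given color pair is actually used together is only known once styles are computed, which is why some contrast checks need a rendered page rather than source alone.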

Linting is the first of three tiers. For how it fits with component-test DOM scans and manual audits across the pipeline - and the caveat that automated checks catch only a fraction of WCAG criteria - see Accessibility testing.

An accessibility checker configuration running WCAG 2.1 AA checks against rendered pages:

Accessibility checker configuration for WCAG 2.1 AA

{
  "defaults": {
    "standard": "WCAG2AA",
    "timeout": 10000,
    "wait": 1000
  },
  "urls": [
    "http://localhost:1313/docs/",
    "http://localhost:1313/docs/testing/"
  ]
}

An accessibility scanner test asserting that a rendered component has no violations:

Accessibility scanner test verifying no WCAG violations

// accessibility scanner setup (e.g. import scanner and extend assertions)

it("should have no accessibility violations", async () => {
  const { container } = render(<LoginForm />);
  const results = await accessibilityScanner(container);
  expect(results).toHaveNoViolations();
});

Anti-Patterns

Disabling rules instead of fixing code: suppressing linter warnings or ignoring security findings erodes the value of static analysis over time.
Not customizing rules: default rulesets are a starting point. Write custom rules for patterns that come up repeatedly in code reviews.
Running static analysis only in CI: by the time CI reports a formatting error, the developer has context-switched. IDE plugins and pre-commit hooks provide immediate feedback.
Ignoring dependency vulnerabilities: known CVEs in dependencies are a direct attack vector. Treat high-severity findings as build-breaking.
Treating static analysis as optional: static checks should be mandatory and enforced. If developers can bypass them, they will.
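
A custom rule does not have to be elaborate to be useful. The sketch below is a hypothetical standalone check that flags skipped or focused tests by scanning source text line by line. A regex scan keeps the example short; a production lint rule would normally inspect the syntax tree instead, so that strings and comments are not misread as code.

```javascript
// Hypothetical custom check: flag skipped or focused tests in test source.
const FORBIDDEN = [
  { pattern: /\b(?:it|test|describe)\.skip\s*\(/, message: "skipped test" },
  { pattern: /\bx(?:it|test|describe)\s*\(/, message: "skipped test" },
  { pattern: /\b(?:it|test|describe)\.only\s*\(/, message: "focused test" },
];

// Returns one finding per offending line, with a 1-based line number.
function lintTestSource(source) {
  const findings = [];
  source.split("\n").forEach((text, index) => {
    for (const { pattern, message } of FORBIDDEN) {
      if (pattern.test(text)) findings.push({ line: index + 1, message });
    }
  });
  return findings;
}

const sample = [
  'describe("cart", () => {',
  '  it.skip("applies coupon", () => {});',
  '  it("adds item", () => {});',
  '  xit("removes item", () => {});',
  "});",
].join("\n");

const lintFindings = lintTestSource(sample);
```

Run as a pre-commit hook or a CI step, a check like this turns a recurring code review comment into an automatic failure.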

Connection to CD Pipeline

Static analysis is the first gate in the CD pipeline, providing the fastest feedback:

IDE / local development: plugins run in real time as code is written.
Pre-commit: hooks run linters, formatters, and accessibility checks on changed components, blocking commits that violate rules.
PR verification: CI runs the full static analysis suite (linting, type checking, security scanning, dependency auditing, accessibility linting) and blocks merge on failure.
Trunk verification: the same checks re-run on the merged HEAD to catch anything missed.
Scheduled scans: dependency and security scanners run on a schedule to catch newly disclosed vulnerabilities in existing dependencies.

Because static analysis requires no running code, no test environment, and no external dependencies, it is the cheapest and fastest form of quality verification. A mature CD pipeline treats static analysis failures the same as test failures: they break the build.

2.6 - Unit Tests

Fast, deterministic tests that verify a unit of behavior through its public interface, asserting on what the code does rather than how it works.

Solitary unit test: test actor sends input to a Unit Under Test; all collaborators are replaced by test doubles. Sociable unit test: test actor sends input to a Unit Under Test that uses real in-process collaborators; only external I/O is replaced by a test double.

Definition

A unit test is a deterministic test that exercises a unit of behavior (a single meaningful action or decision your code makes) and verifies that the observable outcome is correct. The “unit” is not a function, method, or class. It is a behavior: given these inputs, the system produces this result. A single behavior may involve one function or several collaborating objects. What matters is that the test treats the code as a black box and asserts only on what it produces, not on how it produces it.

All external dependencies are replaced with test doubles so the test runs quickly and produces the same result every time.

Solitary vs. sociable unit tests

A solitary unit test replaces all collaborators with test doubles. A sociable unit test allows real in-process collaborators while still replacing any external I/O. Both styles are unit tests as long as no real external dependency is involved.
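
The two styles can be compared side by side on the same unit. In this sketch the invoicer, the tax calculator, and the rate source are all hypothetical; the rate source stands in for whatever external I/O the production code would perform.

```javascript
// Real in-process collaborator: pure logic, no I/O.
const taxCalculator = {
  taxFor: (subtotalCents, rate) => Math.round(subtotalCents * rate),
};

// Unit under test. `rateSource` is the only collaborator that would do
// external I/O in production (hypothetically, a call to a tax-rate service).
function createInvoicer(calculator, rateSource) {
  return {
    totalFor(subtotalCents, region) {
      const rate = rateSource.rateFor(region);
      return subtotalCents + calculator.taxFor(subtotalCents, rate);
    },
  };
}

// Test double for the external I/O, used by both styles.
const stubRates = { rateFor: (region) => (region === "CA" ? 0.25 : 0) };

// Sociable: the real calculator participates; only the I/O is replaced.
const sociableTotal = createInvoicer(taxCalculator, stubRates).totalFor(1000, "CA");

// Solitary: the calculator is replaced too, so only the invoicer's own logic runs.
const stubCalculator = { taxFor: () => 100 };
const solitaryTotal = createInvoicer(stubCalculator, stubRates).totalFor(1000, "CA");
```

The sociable version would catch a rounding bug in the calculator, while the solitary version would not. The solitary version, in turn, keeps passing if the calculator's rules change. Both assert on the returned total rather than on which methods were called.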

When the scope expands to an entire frontend component or a complete backend service exercised through its public API, that is a component test.

White box testing (asserting on internal method calls, call order, or private state) creates change-detector tests that break during routine refactoring without catching real defects. Prefer testing through the public interface (methods, APIs, exported functions) and asserting on return values, state changes visible to consumers, or observable side effects.

The purpose of unit tests is to:

Verify that a unit of behavior produces the correct observable outcome.
Cover high-complexity logic where many input permutations exist, such as business rules, calculations, and state transitions.
Keep cyclomatic complexity visible and manageable through good separation of concerns.

When to Use

During development: run the relevant subset of unit tests continuously while writing code. TDD (Red-Green-Refactor) is the most effective workflow.
On every commit: use pre-commit hooks or watch-mode test runners so broken tests never reach the remote repository.
In CI: execute the full unit test suite on every pull request and on the trunk after merge to verify nothing was missed locally.

Unit tests are the right choice when the behavior under test can be exercised without network access, file system access, or database connections. If you need any of those, you likely need a component test or an end-to-end test instead.

Characteristics

Property	Value
Speed	Milliseconds per test
Determinism	Always deterministic
Scope	A single unit of behavior
Dependencies	All replaced with test doubles
Network	None
Database	None
Breaks build	Yes

Examples

A JavaScript unit test verifying a pure utility function:

JavaScript unit test for castArray utility

// castArray.test.js
describe("castArray", () => {
  it("should wrap non-array items in an array", () => {
    expect(castArray(1)).toEqual([1]);
    expect(castArray("a")).toEqual(["a"]);
    expect(castArray({ a: 1 })).toEqual([{ a: 1 }]);
  });

  it("should return array values by reference", () => {
    const array = [1];
    expect(castArray(array)).toBe(array);
  });

  it("should return an empty array when no arguments are given", () => {
    expect(castArray()).toEqual([]);
  });
});
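
For reference, a minimal implementation that would satisfy the three tests above. This is a sketch written to match those assertions, not necessarily the implementation the tests were originally written against.

```javascript
// Wraps a value in an array unless it already is one.
function castArray(...args) {
  if (args.length === 0) return []; // castArray() with no arguments
  const [value] = args;
  return Array.isArray(value) ? value : [value]; // arrays come back by reference
}
```

Using rest parameters lets the function distinguish castArray() from castArray(undefined): the first returns an empty array, the second returns an array containing undefined.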

A Java sociable unit test exercising real domain logic through its public interface. The collaborators (the pricing policy and the order model) are real objects, not mocks, and the test asserts on the observable outcome - the computed total - rather than on which methods were called:

Java sociable unit test for a bulk-discount pricing rule

@Test
public void appliesBulkDiscountWhenQuantityReachesThreshold() {
    // Arrange: real collaborators, no test doubles - this is pure in-process logic
    PricingPolicy pricing = new PricingPolicy(
        bulkThreshold(10), bulkDiscountRate(0.15));
    Order order = new Order(new LineItem("widget", money("20.00"), quantity(12)));

    // Act
    Money total = pricing.totalFor(order);

    // Assert: the observable result, not the sequence of internal calls
    // 12 * 20.00 = 240.00, less 15% = 204.00
    assertEquals(money("204.00"), total);
}

@Test
public void chargesFullPriceBelowTheThreshold() {
    PricingPolicy pricing = new PricingPolicy(
        bulkThreshold(10), bulkDiscountRate(0.15));
    Order order = new Order(new LineItem("widget", money("20.00"), quantity(9)));

    assertEquals(money("180.00"), pricing.totalFor(order));
}

Anti-Patterns

White box testing: asserting on internal state, call order, or private method behavior rather than observable output. These change-detector tests break during refactoring without catching real defects. Test through the public interface instead.
Testing private methods: private implementations are meant to change. They are exercised indirectly through the behavior they support. Test the public interface instead.
No assertions: a test that runs code without asserting anything provides false confidence. Lint rules can catch this automatically.
Disabling or skipping tests: skipped tests erode confidence over time. Fix or remove them.
Confusing “unit” with “function”: a unit of behavior may span multiple collaborating objects. Forcing one-test-per-function creates brittle tests that mirror the implementation structure rather than verifying meaningful outcomes.
Ice cream cone testing: relying primarily on slow E2E tests while neglecting fast unit tests inverts the test pyramid and slows feedback.
Chasing coverage numbers: gaming coverage metrics (e.g., running code paths without meaningful assertions) creates a false sense of confidence. Focus on behavior coverage instead.
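
The difference between a change-detector test and a behavior test can be sketched as follows. This is a minimal illustration, assuming a hypothetical Cart class that is not part of any real codebase. The check reads only the public total, so it keeps passing if the private storage changes from an array to a Map. A white-box variant that inspected the private field or spied on the internal reduce call would break on that refactoring without having found a defect.

```javascript
// Hypothetical cart, used only for illustration.
class Cart {
  #lines = []; // private storage: free to change during refactoring

  add(sku, unitPriceCents, qty) {
    this.#lines.push({ sku, unitPriceCents, qty });
  }

  // Public, observable behavior: the total a caller can read.
  totalCents() {
    return this.#lines.reduce((sum, l) => sum + l.unitPriceCents * l.qty, 0);
  }
}

// Behavior check: drives the public interface and returns the observable outcome.
function totalAfterAddingTwoLines() {
  const cart = new Cart();
  cart.add("widget", 2000, 2);
  cart.add("gadget", 500, 1);
  return cart.totalCents();
}

console.log(totalAfterAddingTwoLines()); // 4500
```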

Connection to CD Pipeline

Unit tests occupy the base of the test pyramid. They run in the earliest stages of the CD pipeline and provide the fastest feedback loop:

Local development: watch mode reruns tests on every save.
Pre-commit: hooks run the suite before code reaches version control.
PR verification: CI runs the full suite and blocks merge on failure.
Trunk verification: CI reruns tests on the merged HEAD to catch integration issues.

Because unit tests are fast and deterministic, they should always break the build on failure. A healthy CD pipeline depends on a large, reliable suite of black box unit tests that verify behavior rather than implementation, giving developers the confidence to refactor freely and ship small changes frequently.

3 - Applied Testing Strategies

Practical guidance for fully testing eight common component patterns: API providers, API consumers, scheduled jobs, user interfaces, event consumers, event producers, CLI tools and libraries, and stateful services.

This guide builds on the test-type definitions in Architecting Tests for CD and the deterministic-pipeline model used throughout this site.

This is a set of recommended patterns to consider when designing a test suite, not a prescriptive checklist. The patterns describe shapes of components teams commonly build; the lists of positive cases, negative cases, and pipeline placements are common things to consider for that shape, not an all-inclusive set. Use them as a starting point for the conversation about what your component actually needs.

That said, three goals apply to every pattern:

Cover the positive paths - the component does what it should under expected inputs.
Cover the negative paths - the component fails safely, predictably, and observably under bad inputs, broken dependencies, and adverse conditions.
Validate the test doubles - every double used to keep deterministic tests fast must be backed by a non-deterministic check that the double still matches reality.

If the third point is missing, the first two lie to you over time.

How to use this section

New to the patterns? Start with Cross-cutting principles below, then Patterns.
Auditing a component before ship? Jump to the Pre-ship checklist.
Looking for a specific concern that crosses every pattern (authn, migrations, fixtures, observability, perf, mutation testing, flake handling, time budgets)? See Cross-cutting concerns.
Existing suite needs rework first? See Testing Antipatterns.

Terminology

Two phrases that look similar but mean different things:

Adapter integration test (Toby Clemson’s “integration test”): a narrow test of a single boundary adapter (HTTP client, DB query layer, message-broker client) exercised against the real external dependency or a high-fidelity stand-in. Pins the adapter’s protocol behavior - serialization, deserialization, headers, error mapping - not the behavior of the dependency itself. Runs in-band only when the team has full control over the dependency (typically a per-test testcontainer) and the test is fully deterministic; otherwise runs out-of-band on a schedule.
Out-of-band integration check (this site’s Integration Tests): runs out-of-band on a schedule or post-deploy against real external systems. Confirms that doubles used by in-band tests still match reality. Failures trigger review, not a build break.

When this section says bare “integration test,” it means the adapter integration test unless qualified.

Cross-cutting principles

Six principles apply to every pattern. The first three are short pointers to pages that own the topic; the last three are unique to this section.

1. In-band tests are deterministic; out-of-band checks confirm reality

In-band tests run in the commit-to-deploy pipeline and gate the build. They must be deterministic, which means test doubles replace anything that crosses the component boundary - downstream services, message brokers, schedulers, browsers talking to real backends. Out-of-band checks run on a schedule or post-deploy against the real systems those doubles stand in for. They confirm the doubles still match reality. Failures trigger review or rollback, not a build break. See the architecture in Architecting Tests for CD.

2. Test doubles need their own tests

Every double is traceable to a contract test pinning its claims and an out-of-band check confirming the claims still hold. The mechanics live in Test Doubles.
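
One way to make that traceability concrete is a shared contract function that runs against the double in-band and against the real adapter out-of-band. The sketch below assumes a hypothetical synchronous InMemoryOrderRepo and a hypothetical orderRepoContract helper; a real adapter would usually be asynchronous and the contract would pin more claims.

```javascript
// Hypothetical in-memory double for an order repository.
class InMemoryOrderRepo {
  #orders = new Map();

  save(order) {
    this.#orders.set(order.id, { ...order }); // store a copy
  }

  findById(id) {
    const found = this.#orders.get(id);
    return found ? { ...found } : undefined; // return a copy
  }
}

// The claims every implementation must satisfy. Returns a list of failures.
function orderRepoContract(makeRepo) {
  const failures = [];

  const repo = makeRepo();
  repo.save({ id: "o-1", status: "PLACED" });
  const found = repo.findById("o-1");
  if (!found || found.status !== "PLACED") {
    failures.push("a saved order must be readable by id");
  }

  if (makeRepo().findById("missing") !== undefined) {
    failures.push("an unknown id must return undefined");
  }

  return failures;
}

// In-band: run the contract against the double on every commit.
console.log(orderRepoContract(() => new InMemoryOrderRepo())); // []
```

The same function, passed a factory for the real adapter, becomes the out-of-band check. When the two runs disagree, the double has drifted.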

3. Test through the public interface

Public methods for classes; HTTP routing for services; rendered DOM for UIs; the entrypoint the scheduler invokes for jobs. See Component Tests. Reflection, package-private back doors, and assertions on private state are all signs that a test is verifying the implementation rather than the behavior.

4. Sociable unit tests dominate; solitary unit tests are the narrow exception

Domain logic in a real system lives in how behaviors collaborate, not in any single class. A sociable unit test drives the actual collaborators that implement a domain operation - validators, domain services, repositories backed by an in-memory or testcontainer double - and asserts on the observable outcome of that operation: the response, the persisted state, the event emitted. That is the bulk of the suite. Solitary unit tests are reserved for genuinely complex pure logic with no collaborators worth wiring up - pricing math, parsers, scheduling arithmetic.

Organize the suite around domain operations (“place an order,” “cancel a subscription within the grace period”), not around the classes or methods that happen to implement them. Tests written this way survive refactoring, catch bugs that live in the interactions between collaborators, and document what the component does to a stakeholder who can’t read the code. Tests written one-class-at-a-time with mocks for every collaborator do none of that.

5. Negative paths get equal weight

For every “it works” test, ask: malformed input, dependency timeout, dependency 500, dependency 200-with-malformed-body, slow response, partial write, duplicate request, missing or wrong authn/authz. Negative paths are where production incidents come from.

6. Name tests in domain terms, not implementation terms

A test name is documentation. places_order_with_valid_payment_creates_order_and_emits_OrderPlaced survives refactoring; OrderService.processPayment_returns_PaymentResult does not. The translation rule: if the name only makes sense to someone who has read the code, rewrite it. Renaming tests this way is the highest-ROI change a team can make to an existing suite, and it needs no new infrastructure. For more on what to avoid, see Testing Antipatterns.

Architecting Tests for CD - the section overview, with the do/do-not list and the architecture diagram.
Testing Antipatterns - common testing anti-patterns and a migration guide for teams whose suite needs rework.
Test Doubles - types, when to use, anti-patterns.
Contract Tests - consumer-driven and contract-first approaches.
Component Tests - exercising a deployable through its public interface with doubles for everything outside the boundary.
Integration Tests - the post-deploy check that keeps the deterministic pipeline honest.
Pipeline Reference Architecture - quality gates sequenced by defect detection priority.

The layered approach (unit, integration, component, contract, end-to-end) this section builds on comes from Toby Clemson, Testing Strategies in a Microservice Architecture.

3.1 - Pre-Ship Checklist

Quick audit for any component before it ships. Walk back to the section that needs attention for any item that fails.

Use this as a set of prompts for a quick self-audit, not a list of gates that must all pass. Items that don’t apply to a component can be ignored; items the list doesn’t mention but your component clearly needs should be added. Walk back to the pattern or cross-cutting concern that needs attention for any item that prompts a “we should fix that.”

The bulk of the suite is sociable unit tests that exercise how behaviors collaborate to deliver a domain operation. Solitary unit tests are reserved for genuinely complex pure logic.
Tests are organized around domain operations, not around classes or methods. Test names read as something a stakeholder would recognize.
Every public-interface contract (inbound and outbound) has a contract test running in the pipeline.
Classes are tested through their public methods only. No reflection, no test-only visibility relaxations, no asserting on private state.
Every consumed external dependency is wrapped in a gateway the team owns; doubles are of the gateway, not of the third-party library.
Every boundary adapter has an adapter integration test against the real dependency or a high-fidelity stand-in (testcontainer, WireMock with provider fixtures).
The bulk of testing runs in-band in the pipeline and gates the build; out-of-band checks against real systems run on a schedule and trigger review on failure, never a build break.
Every test double has a corresponding non-deterministic check that exercises the real dependency on a schedule or post-deploy.
Every documented failure mode has a negative test.
Every error response has a test that verifies the error envelope, status code, and any side effects (or absence thereof).
Time, randomness, and the network are injected, not called directly. No sleep in tests. Use bounded polling or a fake clock.
All deterministic tests run pre-commit and in CI Stage 1, and break the build when they fail.
All post-deploy integration checks run out of pipeline and trigger review on failure, never blocking a commit.
Pipeline gates map to defect sources from the Systemic Defect Fixes catalog. If a defect category has no automated check, that’s a known risk.
Authn and authz are tested across every protected endpoint, not as one-offs per feature.
Database migrations are tested forward, backward (where supported), and on representative data volume against the production engine.
Fixtures are generated from the schema or built through Object Mother / builder helpers, not inline literals.
Failure-path tests assert on observability (metric incremented, structured log emitted with correlation ID), not just the response.
Per-endpoint perf budgets exist for hot paths; load tests gate production promotion; soak tests run out of pipeline.
Flaky tests are quarantined with a dated owner and time-boxed remediation. No permanent quarantine list.
The deterministic suite respects the pattern’s time budget (under 5 to 8 minutes per component, under 10 minutes total).
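
The "time is injected" item can be sketched with a fake clock. This is a minimal illustration, assuming a hypothetical FakeClock and a hypothetical Token class; the point is only that the test advances time explicitly instead of sleeping.

```javascript
// A fake clock the test controls. Production code would receive a real clock
// exposing the same now() method.
class FakeClock {
  #nowMs;

  constructor(startIso) {
    this.#nowMs = Date.parse(startIso);
  }

  now() {
    return new Date(this.#nowMs);
  }

  advance(ms) {
    this.#nowMs += ms;
  }
}

// Hypothetical unit under test: a token that expires after a time-to-live.
class Token {
  constructor(clock, ttlMs) {
    this.clock = clock;
    this.expiresAtMs = clock.now().getTime() + ttlMs;
  }

  isExpired() {
    return this.clock.now().getTime() >= this.expiresAtMs;
  }
}

const clock = new FakeClock("2024-01-01T00:00:00Z");
const token = new Token(clock, 60_000);
console.log(token.isExpired()); // false
clock.advance(60_000);          // no sleep: the test moves time itself
console.log(token.isExpired()); // true
```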

3.2 - Patterns

Eight common component patterns and how to test each fully. Each page covers what to verify, positive and negative cases, double validation, pipeline placement, and a small code example.

Each page in this subsection covers one component pattern. The structure is the same on every page so you can scan-compare:

What needs covered - the layers of testing the pattern typically benefits from.
Positive test cases - common success behaviors worth testing.
Negative test cases - common failure modes that produce production incidents.
Test double validation - how the doubles in pipeline tests stay honest.
Pipeline placement - where each test type tends to run.
Example - a short code sample illustrating one of the harder cases for that pattern.

These are recommended starting points, not exhaustive lists or required gates. Real components have details these pages don’t capture; ignore items that don’t apply, and add items the pattern doesn’t mention but your component clearly needs. The goal is to prompt the conversation, not to constrain it.

API provider, API consumer, scheduled job, and user interface are covered in depth. Event consumer, event producer, CLI/library, and stateful service are deliberately briefer sketches: the same six principles apply, the same checklist still prompts useful questions, and the test double validation model is the same. Use the briefer sketches as a starting point and expand the depth in your own runbooks for the patterns your services actually use.

The patterns

API provider - a backend service exposing an HTTP/gRPC/GraphQL API and owning its own data.
API consumer - the above, plus outbound calls to other services. The most failure-prone pattern.
Scheduled job - a service triggered on a cron, queue, or external scheduler.
User interface - a UI that renders data and accepts user interaction.
Event consumer - a service that consumes messages from a broker.
Event producer - a service that produces messages to a broker.
CLI tool or library - a binary or package consumed by other developers.
Stateful service - a service that maintains long-lived in-memory state.

3.2.1 - API Provider

A backend service that exposes an HTTP/gRPC/GraphQL API and owns its own data. No outbound calls to other services in your control.

What needs covered

Layer	Concern	Test type
Domain logic	Business rules, invariants, state transitions	Solitary unit tests
Module collaboration	Validators + repositories + domain working together	Sociable unit tests
Persistence adapter	Query correctness, transaction boundaries, migrations against the real DB engine	Adapter integration tests (testcontainers running production engine and version)
Assembled component	Routing, validation, business logic, and persistence wired together through the controller layer	Component tests with persistence either real (testcontainers) or doubled (in-memory repository)
Served API	What downstream consumers depend on	Provider-side contract tests

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Documented endpoints: return the expected shape and status for valid input.
Auth: succeeds for valid credentials and tokens.
Pagination, filtering, sorting: all return the documented results.
Idempotency: repeating an idempotent operation leaves the same state as running it once; a non-idempotent operation creates exactly one record per request.
Success-path side effects: events are emitted and audit log entries are written on the success path.

Negative test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Malformed body: bad JSON, missing required fields, wrong types, extra fields handled per the documented policy (reject vs. ignore).
Out-of-range values: negatives where positives are expected, oversize strings, unicode edge cases.
Auth failures: missing token, expired token, valid token with insufficient scope, valid token for a different tenant.
Authorization boundaries: user A cannot read or modify user B’s resources.
Resource not found: referenced IDs don’t exist, return 404 not 500.
Concurrency: two writes to the same resource at once, optimistic-lock conflict handled with the documented status code.
Persistence failure: DB unavailable, deadlock, constraint violation. The error envelope is correct and no partial state is committed.
Rate limiting and request size limits: both are enforced as documented.
Idempotency under retry: same idempotency key within the window returns the original result, not a duplicate write.
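
The idempotency-under-retry case above can be sketched as follows. This is a simplified illustration, assuming a hypothetical in-process OrderService that remembers results by idempotency key; it ignores the expiry window and concurrency, both of which a real implementation would need to handle.

```javascript
// Hypothetical order service with an idempotency-key store.
class OrderService {
  #orders = [];
  #resultsByKey = new Map();

  placeOrder(idempotencyKey, request) {
    // A replayed key returns the original result without writing again.
    if (this.#resultsByKey.has(idempotencyKey)) {
      return this.#resultsByKey.get(idempotencyKey);
    }
    const order = { id: `o-${this.#orders.length + 1}`, items: request.items };
    this.#orders.push(order);
    const result = { status: 201, orderId: order.id };
    this.#resultsByKey.set(idempotencyKey, result);
    return result;
  }

  orderCount() {
    return this.#orders.length;
  }
}

const service = new OrderService();
const request = { items: [{ sku: "A1", qty: 2 }] };
const first = service.placeOrder("key-1", request);
const retry = service.placeOrder("key-1", request);
console.log(first.orderId === retry.orderId, service.orderCount()); // true 1
```

The assertion worth making is on both outcomes: the retry returns the original result, and the number of persisted orders is still one.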

Test double validation

Doubles in this pattern are mostly around persistence. Two layers keep them honest:

Adapter integration tests run against a real instance of your production database engine (the same major version, same extensions). If component tests use an in-memory SQLite shim while production runs Postgres, the shim is the lie. The adapter integration test exercises every query and migration against a Postgres testcontainer in CI.
Provider-side contract tests verify the API still satisfies every published consumer expectation. See Consumer and Provider Perspectives. Provider verification is where you discover that a “harmless” field rename broke a consumer before that consumer deploys.

Pipeline placement

Solitary and sociable unit tests: pre-commit and CI Stage 1.
Adapter integration tests against testcontainers: CI Stage 1 if fast, Stage 2 otherwise.
Component tests: CI Stage 1.
Provider-side contract verification: CD Stage 1 (Contract and Boundary Validation).

Example: component test

A flow-oriented component test for an order-placement endpoint. The full app is assembled with an in-memory order repository and an in-memory event bus. The test drives the assembled component through its HTTP handlers and asserts on observable outcomes (status, persisted state, emitted event):

Java (Spring Boot, MockMvc)

@SpringBootTest
@AutoConfigureMockMvc
class OrderPlacementTest {

  @Autowired MockMvc mvc;
  @Autowired InMemoryOrderRepo orderRepo;
  @Autowired InMemoryEventBus events;

  @Test
  void places_order_with_valid_payment_creates_order_and_emits_OrderPlaced() throws Exception {
    var body = """
      { "items": [{"sku": "A1", "qty": 2}], "paymentToken": "pm_ok" }
      """;

    var result = mvc.perform(post("/orders")
        .header("Authorization", "Bearer tok_valid")
        .contentType(APPLICATION_JSON)
        .content(body))
      .andExpect(status().isCreated())
      .andReturn();

    var orderId = JsonPath.<String>read(result.getResponse().getContentAsString(), "$.id");
    assertThat(orderRepo.findById(orderId)).isPresent();
    assertThat(events.published()).anyMatch(e ->
        e.type().equals("OrderPlaced") && e.orderId().equals(orderId));
  }
}

C# (ASP.NET Core, xUnit)

public class OrderPlacementTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly HttpClient client;
    private readonly InMemoryOrderRepo orderRepo = new();
    private readonly InMemoryEventBus events = new();

    public OrderPlacementTests(WebApplicationFactory<Program> factory)
    {
        client = factory.WithWebHostBuilder(b => b.ConfigureServices(s =>
        {
            s.AddSingleton<IOrderRepo>(orderRepo);
            s.AddSingleton<IEventBus>(events);
        })).CreateClient();
    }

    [Fact]
    public async Task Places_order_with_valid_payment_creates_order_and_emits_OrderPlaced()
    {
        client.DefaultRequestHeaders.Authorization = new("Bearer", "tok_valid");
        var body = new { items = new[] { new { sku = "A1", qty = 2 } }, paymentToken = "pm_ok" };

        var response = await client.PostAsJsonAsync("/orders", body);

        response.StatusCode.Should().Be(HttpStatusCode.Created);
        var created = await response.Content.ReadFromJsonAsync<OrderCreated>();
        orderRepo.FindById(created!.Id).Should().NotBeNull();
        events.Published.Should().Contain(e =>
            e.Type == "OrderPlaced" && e.OrderId == created.Id);
    }
}

JavaScript (supertest)

import request from "supertest";
import { buildApp } from "./app.js";
import { InMemoryOrderRepo } from "./test/in-memory-order-repo.js";
import { InMemoryEventBus } from "./test/in-memory-event-bus.js";

test("places order with valid payment creates order and emits OrderPlaced", async () => {
  const orderRepo = new InMemoryOrderRepo();
  const events = new InMemoryEventBus();
  const app = buildApp({ orderRepo, events });

  const res = await request(app)
    .post("/orders")
    .set("Authorization", "Bearer tok_valid")
    .send({ items: [{ sku: "A1", qty: 2 }], paymentToken: "pm_ok" });

  expect(res.status).toBe(201);
  expect(orderRepo.findById(res.body.id)).toBeDefined();
  expect(events.published).toContainEqual(
    expect.objectContaining({ type: "OrderPlaced", orderId: res.body.id })
  );
});

The test asserts on what a real caller can observe, not on private methods or call sequences inside the controller.

3.2.2 - API Consumer

An API provider that also consumes one or more downstream APIs. The most failure-prone pattern in distributed systems and the one that gets the most testing attention.

Same as API provider, plus outbound HTTP/gRPC calls to services the team does not own (or does own but deploys independently).

What needs covered

Everything from the API provider pattern, plus:

Layer	Concern	Test type
Outbound HTTP client	Request shape, response parsing, status code handling, header propagation, timeout enforcement	Adapter integration tests (against WireMock or, periodically, the real downstream)
Consumed API contract	The fields and status codes the consumer depends on	Consumer-side contract tests
Resilience under degraded dependencies	Retries, circuit breaking, backoff, fallback, partial-failure compensation	Component tests with fault-injecting client doubles
Composite behavior	The service still returns useful responses when downstreams misbehave	Component tests

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Outbound call: constructs the right URL, headers, body, auth, and timeout.
Success response: parsed correctly, including optional fields and unknown fields per Postel’s Law.
Multi-call composition: multiple downstream calls in sequence or parallel produce the documented composite response.
Caching: returns the cached value within TTL and refreshes after.
Trace context: propagates downstream.

Negative test cases

Common cases to consider, not an exhaustive list. The bulk of the negative testing happens here, and it’s where most production incidents originate. Drive each failure mode through a client double that simulates it.

Timeout (downstream exceeds configured deadline): the deadline is enforced; the upstream caller gets the documented response (e.g., 504); no partial state is committed. Use a client double that delays past the deadline.
Connection refused: the retry policy executes the documented count and backoff, then falls back to the documented fallback or returns an error. Use a client double that rejects the connection.
5xx responses (500, 502, 503): retry only on retryable codes. Use a client double that returns 5xx.
4xx responses (400, 401, 403, 404, 409, 422, 429): each maps to documented behavior; 4xx generally not retried; 429 respects Retry-After. Use a client double that returns each code.
Slow response within timeout: performance-budget assertions hold if the service has SLO commitments. Use a client double that delays within the deadline.
Malformed response body: the response is rejected, not silently coerced. Use a client double that returns a truncated or wrong-type body.
Schema drift (extra or missing fields): extra fields tolerated; missing required fields detected with a clear error. Use a client double that returns a drifted body.
Wrong status code (200 with error body, 500 with success body): the client trusts the status code, not the body. Use a client double that returns mismatched status and body.
Circuit open: the circuit opens under sustained failure; fast-fails subsequent calls; recovers on a half-open probe. Use a client double that sustains failures.
Partial multi-call failure: compensation, rollback, or documented partial-success behavior. First client double succeeds, second fails.
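
Several of the cases above (retry only on retryable codes, 4xx not retried, 429 respecting Retry-After) come down to one decision that can be isolated as pure logic and tested without any client double. The sketch below is an assumption-heavy illustration: the retryDecision function, the policy shape, and the choice of which status codes are retryable are all hypothetical, and it handles Retry-After only in its delay-in-seconds form.

```javascript
// Status codes this hypothetical policy treats as retryable.
// Which codes belong here is a per-service decision, not a universal rule.
const RETRYABLE = new Set([429, 502, 503, 504]);

// Decide what to do after a failed downstream call.
// attempt is 1-based; retryAfterSeconds is the parsed Retry-After header, if any.
function retryDecision({ status, attempt, retryAfterSeconds }, policy) {
  if (!RETRYABLE.has(status) || attempt >= policy.maxAttempts) {
    return { retry: false };
  }
  if (status === 429 && Number.isFinite(retryAfterSeconds)) {
    return { retry: true, delayMs: retryAfterSeconds * 1000 };
  }
  // Exponential backoff: base, 2x base, 4x base, ...
  return { retry: true, delayMs: policy.baseDelayMs * 2 ** (attempt - 1) };
}

const policy = { maxAttempts: 3, baseDelayMs: 100 };
console.log(retryDecision({ status: 503, attempt: 2 }, policy));
// { retry: true, delayMs: 200 }
console.log(retryDecision({ status: 404, attempt: 1 }, policy));
// { retry: false }
console.log(retryDecision({ status: 429, attempt: 1, retryAfterSeconds: 2 }, policy));
// { retry: true, delayMs: 2000 }
```

Component tests with a fault-injecting client double then only need to confirm that the assembled service follows this decision, rather than re-testing every status code through HTTP.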

Test double validation

This is where the “doubles need tests” rule lives or dies. Four layers:

Consumer-side contract tests run in the pipeline on every commit using doubles. They pin the request the consumer sends and the response shape the consumer depends on. Contract artifacts are published to a broker. Fast, deterministic, blocks the build.
Adapter integration tests exercise the outbound HTTP client against the real dependency in a controlled state - typically a testcontainer running an in-house service the team owns. They verify the adapter code correctly speaks the protocol: serialization, deserialization, header handling, timeout behavior, error mapping. The test asserts the adapter’s correctness, not the dependency’s behavior: if the test asks for a user, it validates that the response parses into a valid User, not which user was returned. For third-party dependencies the team can’t run in a controlled state, run these tests out-of-band on a schedule. WireMock loaded with provider-supplied fixtures is a useful complement but functions more like a contract test against recorded shapes than an integration test against the live protocol.
Provider-side contract verification runs in the provider’s pipeline. The provider executes every consumer’s published contract against the real provider implementation. Breaking changes are caught at the source before the provider deploys.
Post-deploy integration check runs periodically against the real downstream in a non-production environment. Same fixtures used in contract tests. Catches drift in fields the contract didn’t pin, version skew, environment differences. Failures trigger review, not a build break. See Out-of-Pipeline Verification.

For third-party APIs you do not control, there is no provider verification step. The post-deploy check against the live (or sandbox) API is the only mechanism keeping doubles honest. Run it more often than for in-house dependencies. Daily at minimum.

The anti-pattern to avoid: stubbing the third-party SDK directly. Always wrap third-party clients in a thin adapter the team owns, then double the adapter. This is called out explicitly as Mocking what you don’t own and is the single most common source of “but it worked in tests” incidents.
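
A thin adapter of this kind might look like the sketch below. Everything vendor-specific here is hypothetical: the sdk object, its createCharge method, and the card_declined error code stand in for whatever the real SDK exposes. The names PaymentsGateway and DecliningPaymentsGateway are likewise illustrative.

```javascript
// Gateway the team owns. It is the only place that knows the vendor SDK's shapes.
class PaymentsGateway {
  constructor(sdk) {
    this.sdk = sdk; // hypothetical vendor SDK client
  }

  async charge({ token, amountCents }) {
    try {
      const res = await this.sdk.createCharge({ source: token, amount: amountCents });
      return { ok: true, chargeId: res.id };
    } catch (err) {
      // Map vendor errors into the team's own vocabulary.
      const reason = err.code === "card_declined" ? "DECLINED" : "UNAVAILABLE";
      return { ok: false, reason };
    }
  }
}

// Double of the gateway (not of the SDK), used by in-band component tests.
class DecliningPaymentsGateway {
  async charge() {
    return { ok: false, reason: "DECLINED" };
  }
}
```

Component tests inject DecliningPaymentsGateway and never touch the SDK. The adapter integration test and the out-of-band check exercise PaymentsGateway itself, which is where a vendor-side change would surface.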

Pipeline placement

Same as the API provider pattern, plus:

Consumer-side contract tests: pre-commit and CI Stage 1.
Adapter integration tests for the outbound HTTP client against an in-house dependency the team controls (a testcontainer running the team’s own service in a known state): CI Stage 1 or Stage 2.
Adapter integration tests against a third-party API or a service owned by another team: out-of-band on a schedule, never in-band. The risk of a flaky external service blocking deploys outweighs any in-band coverage benefit, and adapter tests with WireMock fixtures already cover the team’s adapter code.
Resilience component tests with fault injection: CI Stage 1.
Post-deploy integration checks against real downstreams: out of pipeline, on a schedule.

Example: fault injection at the client double

A negative-path test for downstream timeout. The payment client double simulates a slow response; the test asserts that the deadline is enforced and that the upstream caller gets the documented error envelope:

Java (Spring Boot, MockMvc)

@SpringBootTest
@AutoConfigureMockMvc
class PaymentTimeoutTest {

  @Autowired MockMvc mvc;
  @Autowired InMemoryOrderRepo orderRepo;
  @MockBean PaymentsGateway payments;

  @Test
  void returns_504_when_payment_service_exceeds_deadline() throws Exception {
    when(payments.charge(any())).thenAnswer(inv -> {
      Thread.sleep(50);
      throw new UpstreamTimeoutException("payments");
    });

    var body = """
      { "items": [{"sku": "A1", "qty": 1}], "paymentToken": "pm_ok" }
      """;

    mvc.perform(post("/orders")
        .header("Authorization", "Bearer tok_valid")
        .contentType(APPLICATION_JSON)
        .content(body))
      .andExpect(status().isGatewayTimeout())
      .andExpect(jsonPath("$.error.code").value("UPSTREAM_TIMEOUT"));

    assertThat(orderRepo.all()).isEmpty();
  }
}

C# (ASP.NET Core, xUnit)

public class PaymentTimeoutTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly HttpClient client;
    private readonly InMemoryOrderRepo orderRepo = new();
    private readonly Mock<IPaymentsGateway> payments = new();

    public PaymentTimeoutTests(WebApplicationFactory<Program> factory)
    {
        payments.Setup(p => p.ChargeAsync(It.IsAny<ChargeRequest>()))
            .Returns(async () =>
            {
                await Task.Delay(50);
                throw new UpstreamTimeoutException("payments");
            });

        client = factory.WithWebHostBuilder(b => b.ConfigureServices(s =>
        {
            s.AddSingleton<IOrderRepo>(orderRepo);
            s.AddSingleton(payments.Object);
        })).CreateClient();
    }

    [Fact]
    public async Task Returns_504_when_payment_service_exceeds_deadline()
    {
        client.DefaultRequestHeaders.Authorization = new("Bearer", "tok_valid");
        var body = new { items = new[] { new { sku = "A1", qty = 1 } }, paymentToken = "pm_ok" };

        var response = await client.PostAsJsonAsync("/orders", body);

        response.StatusCode.Should().Be(HttpStatusCode.GatewayTimeout);
        var error = await response.Content.ReadFromJsonAsync<ErrorEnvelope>();
        error!.Error.Code.Should().Be("UPSTREAM_TIMEOUT");
        orderRepo.All().Should().BeEmpty();
    }
}

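JavaScript (supertest)

import request from "supertest";
import { buildApp } from "./app.js";
import { InMemoryOrderRepo } from "./test/in-memory-order-repo.js";

// Local stand-in for the error the payments client raises on timeout.
class TimeoutError extends Error {}

test("returns 504 when payment service exceeds deadline", async () => {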
  const slowPayments = {
    charge: () => new Promise((_, reject) => {
      setTimeout(() => reject(new TimeoutError("payments")), 50);
    })
  };
  const orderRepo = new InMemoryOrderRepo();
  const app = buildApp({ orderRepo, payments: slowPayments, deadlineMs: 30 });

  const res = await request(app)
    .post("/orders")
    .set("Authorization", "Bearer tok_valid")
    .send({ items: [{ sku: "A1", qty: 1 }], paymentToken: "pm_ok" });

  expect(res.status).toBe(504);
  expect(res.body.error.code).toBe("UPSTREAM_TIMEOUT");
  expect(orderRepo.all()).toHaveLength(0);
});

The test verifies three things at once: the documented status code, the structured error body the API contract promises, and that no partial state was committed.

3.2.3 - Scheduled Job

A service triggered on a cron, queue, or external scheduler. Reads from data sources, writes reports or updates state.

Such a job often has no inbound API surface; the entrypoint is the scheduler.

This pattern has two test design challenges that the API provider and API consumer patterns don’t have: time and data volume.

What needs covered

Layer	Concern	Test type
Pure transformation logic	The data calculation itself, with no I/O	Solitary unit tests
Source and sink adapters	Reading from sources, writing to sinks: protocol correctness, error mapping	Adapter integration tests against real source/sink containers or WireMock
Job orchestration	Idempotency, partial failure recovery, checkpointing, locking, time-window logic	Component tests through the job’s invocation entrypoint, with client doubles, source/sink doubles, and an injected clock
Process startup	Exit codes, signal handling, configuration loading, real environment wiring	Deployed-binary tests that invoke the real artifact
Scheduling integration	The scheduler triggers the right entrypoint with the right arguments, environment, secrets, and concurrency settings	Out-of-band integration check against the real scheduler in a non-prod environment
Observability	Job ran, succeeded/failed, duration, records processed, error count	Assertions in component tests

Process startup matters more here than for an API service, because scheduled jobs typically have non-trivial startup behavior (config loading, secret resolution, lock acquisition) that a component test with the SUT in-memory can bypass. The right shape is many component tests for behavior, plus one or two tests that invoke the actual deployed binary the scheduler will invoke.

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

End-to-end run: with representative input, produces the expected output (report file, database update, message published).
Idempotency: running the job twice for the same logical period produces the same result, not duplicates.
Checkpointing: a job that processes a stream resumes from the last checkpoint, not from scratch.
Time windows: “yesterday’s data” computes correctly for various reference times, especially around DST, month boundaries, and year boundaries.
Empty input: zero records produces a valid empty report, not an error.
Output format: the report or message conforms to the documented schema.
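
The idempotency case above can be pinned by running the job twice for the same logical period and comparing the sink contents. This is a minimal sketch, assuming a hypothetical runDailyReport function and in-memory source and sink fakes; none of these names come from a real library, and the record shape is invented for illustration.

```javascript
// Hypothetical in-memory fakes; names and record shape are illustrative only.
class InMemorySource {
  constructor(records) { this.records = records; }
  recordsForDay(day) { return this.records.filter(r => r.day === day); }
}

class InMemoryReportSink {
  constructor() { this.reports = new Map(); }
  // Upsert keyed by the logical period, so a re-run overwrites instead of appending.
  save(report) { this.reports.set(report.period, report); }
  all() { return [...this.reports.values()]; }
}

// The job derives everything from the period it is asked to process,
// not from how many times it has been invoked.
function runDailyReport({ source, sink, period }) {
  const records = source.recordsForDay(period);
  const total = records.reduce((sum, r) => sum + r.amount, 0);
  sink.save({ period, recordsProcessed: records.length, total });
}

const source = new InMemorySource([
  { day: "2026-03-08", amount: 100 },
  { day: "2026-03-08", amount: 250 },
  { day: "2026-03-09", amount: 999 },
]);
const sink = new InMemoryReportSink();

runDailyReport({ source, sink, period: "2026-03-08" });
const firstRun = JSON.stringify(sink.all());
runDailyReport({ source, sink, period: "2026-03-08" });
const secondRun = JSON.stringify(sink.all());
```

The comparison that matters is firstRun against secondRun: the second invocation must leave the sink byte-for-byte unchanged. A sink that appends rather than upserts would fail this check with two reports for the same period.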

Negative test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Source unavailable: DB down, source API returning 5xx. Verify the job fails cleanly with a documented exit code/status, doesn’t write partial output, and is safely re-runnable.
Sink unavailable: destination DB or message broker rejects writes. Verify no source state changes (e.g., “marked as processed”) happen if the sink fails.
Partial-write failure: half the batch writes successfully, then the connection drops. Verify the next run reprocesses the failed half without duplicating the successful half. This is where idempotency keys, transactional outboxes, or compensating reads earn their keep.
Slow job: job exceeds its expected runtime. Verify it surfaces as alertable, doesn’t silently overlap with the next scheduled run, and that the lock prevents concurrent execution.
Malformed source data: null where non-null was expected, wrong type, encoding issues. Verify the bad record is logged with enough context to investigate, and the job decides per its policy: skip, dead-letter, or fail the whole run. The choice is design; the test pins it.
Time-zone bugs: the job runs at 02:30 UTC for a “daily” report. What does it do on the day clocks shift? Test it. Use the injected clock so the test deterministically simulates the boundary.
Concurrent run: the previous run hadn’t finished when the next was triggered. Verify the lock prevents overlap or, if overlap is acceptable, that the work is partitioned correctly.
Crash mid-run: kill -9 in the middle of processing. Verify on restart the job resumes from a consistent state.
Schema drift on source: a new field appears or a field changes type. Verify per the contract policy.
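
The partial-write case above is easiest to reason about with a sink double that fails on demand. The sketch below assumes idempotency keys derived from the record itself; FlakySink and runBatch are hypothetical names, and a real job would map the failure to its documented exit code rather than a boolean.

```javascript
// Hypothetical sink fake that drops the connection after a set number of writes.
class FlakySink {
  constructor(failAfter) {
    this.rows = new Map(); // idempotency key -> row
    this.writesUntilFailure = failAfter;
  }
  write(key, row) {
    if (this.writesUntilFailure === 0) throw new Error("connection dropped");
    this.writesUntilFailure -= 1;
    this.rows.set(key, row); // the same key overwrites, it never duplicates
  }
  heal() { this.writesUntilFailure = Infinity; }
}

// Writes every record under a key derived from the record itself.
// Returns true when the whole batch landed, false when the run failed.
function runBatch(records, sink) {
  try {
    for (const record of records) {
      sink.write(`order-${record.id}`, record);
    }
    return true;
  } catch (err) {
    return false; // a real job would log and exit with its documented failure code
  }
}

const batch = [{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }];
const sink = new FlakySink(2);

const firstRunOk = runBatch(batch, sink); // fails after two writes
const rowsAfterFailure = sink.rows.size;
sink.heal();
const secondRunOk = runBatch(batch, sink); // the re-run covers the whole batch
```

The second run rewrites the first two records under the same keys, so the sink ends with four rows rather than six. That is the property the test pins; whether the job achieves it with idempotency keys, an outbox, or compensating reads is a design choice.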

Test double validation

Three classes of doubles need validation, each through a different mechanism:

The injected clock. Every in-band test that depends on “now” uses an injected clock. Validate it with one out-of-band check that runs against the real system clock, exercises a known time-window calculation, and confirms the production wiring of the clock dependency is correct. This catches the “tests use UTC, prod uses container local time” class of bug.
Source and sink gateways. Same model as the API consumer pattern. Adapter integration tests in the pipeline exercise each gateway against a real source/sink container or WireMock. Contract tests pin the shape. Post-deploy integration checks confirm the doubles still match the real systems on a schedule.
The scheduler trigger. The doubled trigger in component tests must match what the real scheduler invokes. Verify with a post-deploy integration check that runs the real scheduler against a deployed instance in a non-prod environment and confirms the entrypoint is found, the cron expression fires at the expected times, environment variables and secrets resolve, and the concurrency policy holds. This is the test that catches “passed in CI, didn’t run in prod because the cron expression had a typo.”

Pipeline placement

Unit and component tests: CI Stage 1.
Adapter integration tests for the source and sink adapters: CI Stage 1 or Stage 2.
Contract tests for each source and sink: CI Stage 1.
Deployed-binary tests that invoke the real artifact (small set): CI Stage 1 or Stage 2.
Real-clock and real-scheduler integration check: out of pipeline, scheduled, against a non-prod environment.
Post-deploy: a synthetic invocation of the job in production that verifies it ran, processed records, and met its SLO.

Example: time-window logic with an injected clock

A test that pins the daily-report window calculation around a DST boundary. The clock is injected so the test deterministically simulates the moment of interest. source and sink are field-level fakes set up in the test class with seeded data for 2026-03-08 and 2026-03-09. The expected instants assume the job reports in America/New_York, which the 05:00Z window start implies. Clocks there spring forward on 2026-03-08, so local midnight is 05:00Z before the change and 04:00Z after it, and the window for 2026-03-08 is 23 hours long.

@Test
void daily_report_run_after_DST_spring_forward_uses_correct_window() {
  // 07:30Z on 2026-03-09 is 03:30 EDT, the morning after the spring-forward.
  Clock fixedClock = Clock.fixed(
      Instant.parse("2026-03-09T07:30:00Z"),
      ZoneOffset.UTC);
  ReportJob job = new ReportJob(fixedClock, source, sink);

  job.run();

  Report emitted = sink.lastReport();
  // Local midnight on 2026-03-08 is still EST (UTC-5).
  assertThat(emitted.windowStart())
      .isEqualTo(Instant.parse("2026-03-08T05:00:00Z"));
  // Local midnight on 2026-03-09 is EDT (UTC-4), so the window is 23 hours.
  assertThat(emitted.windowEnd())
      .isEqualTo(Instant.parse("2026-03-09T04:00:00Z"));
  assertThat(emitted.recordsProcessed())
      .isEqualTo(source.recordsForDay("2026-03-08"));
}

[Fact]
public void Daily_report_run_after_DST_spring_forward_uses_correct_window()
{
    // 07:30Z on 2026-03-09 is 03:30 EDT, the morning after the spring-forward.
    var fixedClock = new FakeClock(DateTimeOffset.Parse("2026-03-09T07:30:00Z"));
    var job = new ReportJob(fixedClock, source, sink);

    job.Run();

    var emitted = sink.LastReport();
    // Window start is midnight EST (UTC-5); window end is midnight EDT (UTC-4).
    emitted.WindowStart.Should().Be(DateTimeOffset.Parse("2026-03-08T05:00:00Z"));
    emitted.WindowEnd.Should().Be(DateTimeOffset.Parse("2026-03-09T04:00:00Z"));
    emitted.RecordsProcessed.Should().Be(source.RecordsForDay("2026-03-08"));
}

test("daily report run after DST spring forward uses correct window", () => {
  // 07:30Z on 2026-03-09 is 03:30 EDT, the morning after the spring-forward.
  const fixedClock = { now: () => new Date("2026-03-09T07:30:00Z") };
  const job = new ReportJob({ clock: fixedClock, source, sink });

  job.run();

  const emitted = sink.lastReport();
  // Window start is midnight EST (UTC-5); window end is midnight EDT (UTC-4).
  expect(emitted.windowStart).toEqual(new Date("2026-03-08T05:00:00Z"));
  expect(emitted.windowEnd).toEqual(new Date("2026-03-09T04:00:00Z"));
  expect(emitted.recordsProcessed).toBe(source.recordsForDay("2026-03-08"));
});

A separate out-of-band check runs the deployed binary against the real system clock once, to verify the production wiring of the clock dependency matches the doubled clock used here.

3.2.4 - User Interface

A UI that renders data and accepts user interaction. Talks to one or more backend APIs.

What needs covered

Layer	Concern	Test type
Pure rendering	Component renders given props/state	Solitary unit tests
Component composition	Composed components wire correctly	Sociable unit tests
Feature behavior	A flow (login, checkout, search) works through the rendered DOM with the backend stubbed at the network layer	Component tests driven by Playwright with the team’s unit-testing framework as the runner
Backend contract	What the UI sends and expects from each backend endpoint	Consumer-side contract tests
End-to-end happy paths	A small number of critical journeys against real backends	E2E tests, post-deploy
Visual regression	The UI looks right	Snapshot or visual diff tests
Accessibility	The UI works for assistive tech and keyboard users	Assertions in component tests + automated WCAG scanning

UI component tests run in a real browser engine (Chromium, Firefox, WebKit) driven by Playwright, with the team’s existing unit-testing framework (Vitest, Jest, or whatever is already in the project) as the runner. In-memory DOM emulators like jsdom are rejected: they trade accuracy for speed and produce false greens around layout, focus, event timing, IntersectionObserver, and animations - exactly the surface where UI bugs live. Playwright’s headless Chromium typically launches in under a second and runs the suite fast enough to use as the default. Backends are stubbed at the network layer with page.route so the same fixtures drive component tests today and end-to-end smoke tests later.

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Critical flows: a user can complete each documented critical flow via keyboard and via mouse.
Forms: accept valid input, submit, and show success.
Loading states: render while the backend is in flight.
Empty, populated, and overflow states: all render correctly.
Internationalization: the UI renders with longer translations and right-to-left scripts.
Responsive layouts: render at the documented breakpoints.

Negative test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Backend errors: for every API call the UI makes, what does the user see for 4xx, 5xx, network failure, timeout? Test each. The most common UI bug is “spins forever on error.”
Form validation: required fields, format errors, length limits, cross-field rules. Each shows a specific, actionable message that’s announced to screen readers.
Authentication expiry: token expires mid-session. Verify the user is sent through the documented re-auth flow, not silently dropped.
Permission denied: the user navigates to a page they cannot access. Verify the documented response (redirect, “not authorized,” etc.).
Stale data: a list rendered, then a delete on another tab, then the user clicks the deleted item. Verify the documented refresh or error behavior.
Slow network: every interaction has a documented behavior at 3G speeds. Verify with throttled fixtures.
Concurrent edit: two users editing the same record. Verify the optimistic-lock UX behaves as documented.
Browser back button: the back button is a public interface. Test it.
Accessibility violations: automated WCAG scan in component tests catches missing labels, contrast failures, ARIA misuse on every commit. Don’t defer to quarterly audits.

Test double validation

Backend doubles in component tests must match the real backends. Same mechanism as the API consumer pattern: the UI is a consumer, every backend it talks to is a provider. Consumer-driven contracts run on every commit; provider verification runs in the backend’s pipeline. Post-deploy E2E smoke tests against the real backend close the loop on drift the contract didn’t pin.

Because UI component tests run in a real browser engine, there is no renderer-level double to validate. The browser is the production renderer, just headless. The remaining gap is between the stubbed backend and the real backend, which the out-of-band E2E suite covers. Out-of-band failures trigger review, not a build break.

Pipeline placement

Unit tests (rendering, composition): CI Stage 1.
Component tests in headless browser (including a11y assertions): CI Stage 1.
Visual regression: CI Stage 1 if fast, CI Stage 2 if slow.
Consumer-side contract tests for each backend: CI Stage 1.
E2E happy-path smoke tests against real backends: post-deploy, in a production-like environment, blocking the rollout but not the build.
Real user monitoring + synthetic transactions: continuously in production.

Example: UI component test for an error path

A flow-oriented test for the checkout error path. Playwright drives a headless browser; the backend is stubbed at the network layer with page.route; the team’s existing unit-testing framework (Vitest, JUnit, xUnit) runs the test. The assertion: the user sees a documented error message and the spinner does not get stuck.

@Test
void shows_error_and_clears_spinner_when_checkout_fails_with_500() {
  try (Playwright playwright = Playwright.create();
       Browser browser = playwright.chromium().launch()) {
    Page page = browser.newPage();

    page.route("**/api/checkout", route ->
        route.fulfill(new Route.FulfillOptions()
            .setStatus(500)
            .setContentType("application/json")
            .setBody("{\"error\":{\"code\":\"INTERNAL\"}}")));

    page.navigate("http://localhost:3000/checkout");
    page.getByRole(AriaRole.BUTTON,
        new Page.GetByRoleOptions().setName("Place order")).click();

    assertThat(page.getByRole(AriaRole.ALERT))
        .containsText("Something went wrong, please try again");
    assertThat(page.getByRole(AriaRole.STATUS)).not().isVisible();
  }
}

[Fact]
public async Task Shows_error_and_clears_spinner_when_checkout_fails_with_500()
{
    using var playwright = await Playwright.CreateAsync();
    await using var browser = await playwright.Chromium.LaunchAsync();
    var page = await browser.NewPageAsync();

    await page.RouteAsync("**/api/checkout", route => route.FulfillAsync(new()
    {
        Status = 500,
        ContentType = "application/json",
        Body = "{\"error\":{\"code\":\"INTERNAL\"}}"
    }));

    await page.GotoAsync("http://localhost:3000/checkout");
    await page.GetByRole(AriaRole.Button, new() { Name = "Place order" })
        .ClickAsync();

    await Expect(page.GetByRole(AriaRole.Alert))
        .ToContainTextAsync("Something went wrong, please try again");
    await Expect(page.GetByRole(AriaRole.Status)).Not.ToBeVisibleAsync();
}

import { test, expect, beforeAll, afterAll } from "vitest";
import { chromium } from "playwright";

let browser;

beforeAll(async () => { browser = await chromium.launch(); });
afterAll(async () => { await browser.close(); });

test("shows error and clears spinner when checkout fails with 500", async () => {
  const page = await browser.newPage();

  await page.route("**/api/checkout", route =>
    route.fulfill({
      status: 500,
      contentType: "application/json",
      body: JSON.stringify({ error: { code: "INTERNAL" } }),
    })
  );

  await page.goto("http://localhost:3000/checkout");
  await page.getByRole("button", { name: /place order/i }).click();

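  // Vitest's expect has no Playwright locator matchers, so wait on the locators directly.
  const alert = page.getByRole("alert");
  await alert.waitFor(); // resolves once the alert is visible
  expect(await alert.textContent())
    .toMatch(/something went wrong, please try again/i);
  await page.getByRole("status").waitFor({ state: "hidden" });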
});

The test exercises the rendered DOM the way a real user would. Intercepting at the network layer with page.route keeps the same fixtures reusable when the component test gets promoted to an end-to-end smoke test against the real backend.

3.2.5 - Event Consumer

A service that consumes messages from a broker (Kafka, SQS, RabbitMQ, Pub/Sub). Brief sketch.

A consumer of messages from Kafka, SQS, RabbitMQ, Pub/Sub, or similar. Reads messages, processes them, often updates state and produces downstream messages. The “public interface” is the topic or queue and the schema of messages on it.

This pattern has problems the API provider and API consumer patterns don’t have: ordering, replay, poison messages, dead-letter queues, and delivery semantics (at-most-once, at-least-once, exactly-once-with-effort).

What needs covered

Layer	Concern	Test type
Message handler	Pure transformation per message	Solitary unit tests
Idempotency	Same message twice produces the same effect	In-process component tests
Poison message handling	Malformed message goes to DLQ, doesn’t crash the consumer	In-process component tests
Ordering	Out-of-order messages produce documented outcomes	In-process component tests
Backpressure	Consumer slows when downstream is slow	Resilience component tests
Broker contract	Topic, schema, headers	Contract tests
Broker client	Real protocol behavior, offset commits, consumer group rebalancing	Adapter integration tests against a real broker container

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Well-formed message: produces the expected state change and the documented downstream events.
Batch processing: processes per documented policy.
Replay from offset: reproduces the same end state.
Documented schema versions: are accepted.

Negative test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Malformed message: routes to the DLQ with a correlation ID; the consumer survives.
Duplicate delivery: absorbed by idempotency.
Out-of-order delivery: follows the documented behavior.
Mid-batch downstream failure: the offset is left uncommitted.
Schema-version skew: handled per the documented policy.
Slow downstream: applies backpressure rather than OOM.
Consumer-group rebalance during processing: no in-flight messages are stranded.
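
The malformed-message case above can be sketched as a consumer loop that parks the poison message and keeps going. This is an illustration only: consumeBatch, the message shape, and the MALFORMED_JSON reason code are hypothetical, and a real consumer would commit offsets through its broker client rather than return a number.

```javascript
// Hypothetical consumer loop; names and message shape are illustrative only.
function consumeBatch(messages, { handle, deadLetter }) {
  let committedOffset = -1;
  for (const message of messages) {
    let payload;
    try {
      payload = JSON.parse(message.body);
    } catch (err) {
      // Poison message: park it with enough context to investigate, then move on.
      deadLetter.push({
        correlationId: message.correlationId,
        offset: message.offset,
        reason: "MALFORMED_JSON",
        body: message.body,
      });
      committedOffset = message.offset;
      continue;
    }
    // A handler failure propagates, leaving this offset uncommitted.
    handle(payload);
    committedOffset = message.offset;
  }
  return committedOffset;
}

const handled = [];
const deadLetter = [];
const lastOffset = consumeBatch(
  [
    { offset: 0, correlationId: "c-1", body: '{"orderId":"ord-001"}' },
    { offset: 1, correlationId: "c-2", body: "{not json" },
    { offset: 2, correlationId: "c-3", body: '{"orderId":"ord-002"}' },
  ],
  { handle: payload => handled.push(payload.orderId), deadLetter }
);
```

The sketch separates two failure classes on purpose. A message that cannot be parsed will never succeed, so it goes to the DLQ and the offset advances. A handler failure might succeed on retry, so the exception escapes and the offset stays where it was.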

Test double validation

The broker double in component tests is validated by adapter integration tests against a real broker container the team controls (Kafka in Docker, ElasticMQ for SQS, Redpanda in Docker). The test exercises the broker client adapter against that controlled instance and asserts the adapter speaks the protocol correctly - it does not assert anything about which messages the broker returns or in what order; that is the broker’s behavior, not the adapter’s. The schema registry double is validated by contract tests that pin each version, plus a post-deploy check against the real registry. A post-deploy synthetic check publishes a known message to the real topic in a non-prod environment.

Pipeline placement

Handler unit tests and component tests run in CI Stage 1; adapter integration tests against a team-controlled broker container in CI Stage 1 or Stage 2; adapter integration tests against a managed broker the team can’t pin to a known state run out-of-band on a schedule, alongside the post-deploy synthetic.

Example: idempotency under duplicate delivery

Money.usd takes minor units (cents); 4250 represents $42.50.

@Test
void same_message_processed_twice_creates_one_payment_record() {
  PaymentEvent event = new PaymentEvent(
      "evt-9f12", OrderId.of("ord-001"), Money.usd(4250));
  PaymentRepo repo = new InMemoryPaymentRepo();
  PaymentEventHandler handler = new PaymentEventHandler(repo);

  handler.handle(event);
  handler.handle(event);

  assertThat(repo.findByEventId("evt-9f12")).hasSize(1);
  assertThat(repo.totalForOrder(OrderId.of("ord-001"))).isEqualTo(Money.usd(4250));
}

[Fact]
public void Same_message_processed_twice_creates_one_payment_record()
{
    var evt = new PaymentEvent("evt-9f12", OrderId.Of("ord-001"), Money.Usd(4250));
    var repo = new InMemoryPaymentRepo();
    var handler = new PaymentEventHandler(repo);

    handler.Handle(evt);
    handler.Handle(evt);

    repo.FindByEventId("evt-9f12").Should().HaveCount(1);
    repo.TotalForOrder(OrderId.Of("ord-001")).Should().Be(Money.Usd(4250));
}

test("same message processed twice creates one payment record", () => {
  const event = new PaymentEvent(
    "evt-9f12", OrderId.of("ord-001"), Money.usd(4250));
  const repo = new InMemoryPaymentRepo();
  const handler = new PaymentEventHandler(repo);

  handler.handle(event);
  handler.handle(event);

  expect(repo.findByEventId("evt-9f12")).toHaveLength(1);
  expect(repo.totalForOrder(OrderId.of("ord-001"))).toEqual(Money.usd(4250));
});

3.2.6 - Event Producer

A service that produces messages to a broker. Often paired with the event consumer pattern in the same service. Brief sketch.

The producer side, often paired with the Event consumer pattern in the same service. After a state change, the service publishes a message that downstream consumers depend on.

The hard problems differ from the consumer side: atomicity with persistence (did the DB row commit and the message publish?), exactly-once semantics that require an outbox or two-phase commit, and downstream consumer dependence on schema, routing key, and headers.

What needs covered

Layer	Concern	Test type
Outbox / transactional emit	DB write and message emit happen as a unit	Component tests with real DB + broker double
Produced message contract	Schema, headers, routing	Provider-side contract tests
Routing	Right topic and key per event type	Component tests
Retry on broker unavailable	Outbox drains once broker recovers	Component tests with fault-injecting broker client double
Trace propagation	Trace context in headers matches the inbound request	Component tests

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

State change: produces the correct message on the correct topic with the correct routing key, headers, and schema version.
Outbox drain: drains in order.
Redelivery: does not reorder.

Negative test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

DB commits but broker fails: the message stays in the outbox and emits on the next drain. No event lost.
Broker accepts but DB rolls back: nothing is emitted. No phantom events.
Broker unavailable for an extended period: the outbox accumulates with bounded growth and alerts at a threshold.
Breaking schema change: fails provider-side contract verification before shipping.
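
The first two negative cases above both follow from writing the outbox row in the same transaction as the state change. The sketch below models that with in-memory stand-ins; FakeDb, FakeBroker, placeOrder, and drainOutbox are hypothetical names, and the snapshot-based rollback only imitates what a real database transaction provides.

```javascript
// Hypothetical in-memory stand-ins for the database and the broker client.
class FakeDb {
  constructor() { this.orders = []; this.outbox = []; }
  // Applies the state change and the outbox row together, or neither.
  transaction(work) {
    const snapshot = { orders: [...this.orders], outbox: [...this.outbox] };
    try {
      work(this);
    } catch (err) {
      this.orders = snapshot.orders;
      this.outbox = snapshot.outbox;
      throw err;
    }
  }
}

class FakeBroker {
  constructor() { this.available = true; this.published = []; }
  publish(message) {
    if (!this.available) throw new Error("broker unavailable");
    this.published.push(message);
  }
}

function placeOrder(db, order) {
  db.transaction(tx => {
    tx.orders.push(order);
    tx.outbox.push({ topic: "orders.placed", key: order.id, payload: order });
  });
}

// Publishes outbox rows oldest-first and removes each row only after the broker accepts it.
function drainOutbox(db, broker) {
  while (db.outbox.length > 0) {
    try {
      broker.publish(db.outbox[0]);
    } catch (err) {
      return false; // leave the remaining rows for the next drain
    }
    db.outbox.shift();
  }
  return true;
}

const db = new FakeDb();
const broker = new FakeBroker();

broker.available = false;
placeOrder(db, { id: "ord-001" });
placeOrder(db, { id: "ord-002" });
const drainedWhileDown = drainOutbox(db, broker);
const pendingWhileDown = db.outbox.length;

broker.available = true;
const drainedAfterRecovery = drainOutbox(db, broker);
```

While the broker is down, both orders commit and both messages wait in the outbox. Once it recovers, the drain publishes them in the order they were written. A transaction that throws leaves neither an order nor an outbox row behind, which is what rules out phantom events.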

Test double validation

The broker double in component tests is validated against a real broker container the team controls in adapter integration tests. The test asserts the adapter publishes with the right routing key, headers, and serialization - it does not assert which messages downstream consumers happen to read or in what order; those are downstream concerns. Provider-side contract verification runs in this service’s pipeline against every consumer’s published expectations.

Pipeline placement

Outbox component tests and routing tests run in CI Stage 1; adapter integration tests against a team-controlled broker container run in CI Stage 1 or Stage 2; adapter integration tests against a managed broker the team can’t pin run out-of-band on a schedule. Provider-side contract verification runs in CD Stage 1; a post-deploy synthetic state change verifies that the message arrives with the expected shape.

3.2.7 - CLI Tool or Library

A binary or package consumed by other developers. The public interface is the CLI invocation surface or the library’s exported API. Brief sketch.

A binary (CLI) or package (library) consumed by other developers. The “public interface” is the CLI invocation surface (argv, stdin, stdout, stderr, exit code) or the library’s exported API.

The pattern is different because the consumer is a developer or another program, not a user clicking a button. Cross-platform behavior, semantic versioning, and backward compatibility matter more than they do for a service.

What needs covered

Layer	Concern	Test type
Pure logic	Functions, classes, parsers	Solitary unit tests
CLI invocation	Argument parsing, exit codes, output streams	Component tests through the CLI entrypoint
Cross-platform	Path separators, line endings, signal handling	Cross-OS test matrix running the suite on every supported OS in CI
Public API surface	Library’s exported types and functions	API surface tests (snapshot of the public API; diff fails the build)
Documented examples	The README examples actually work	Doctests / executable docs
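
The API surface row in the table can be approximated with very little machinery. The sketch below is one way to do it, not a reference to any particular tool: the library object stands in for a package entrypoint, and describeSurface and surfaceDiff are hypothetical helpers.

```javascript
// Hypothetical library module; in a real project this would be the package entrypoint.
const library = {
  parse(input) { return input.trim().split(","); },
  format(items) { return items.join(","); },
  VERSION: "1.4.0",
};

// Describes each export by name and kind so renames, removals, and kind changes show up in a diff.
function describeSurface(moduleExports) {
  return Object.keys(moduleExports)
    .sort()
    .map(name => `${name}: ${typeof moduleExports[name]}`);
}

// The approved snapshot is checked in; changing it is a deliberate, reviewed act.
const approvedSurface = [
  "VERSION: string",
  "format: function",
  "parse: function",
];

function surfaceDiff(approved, actual) {
  return {
    removed: approved.filter(entry => !actual.includes(entry)),
    added: actual.filter(entry => !approved.includes(entry)),
  };
}

const diff = surfaceDiff(approvedSurface, describeSurface(library));
```

A build step would fail whenever diff.removed or diff.added is non-empty. Typed languages can snapshot full signatures instead of just the kind of each export, which catches parameter and return-type changes as well.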

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Valid arguments: produce documented stdout output, no stderr, and exit code 0.
Pipe-friendly mode: produces machine-readable output (JSON/NDJSON) when stdout is not a TTY.
Library API: returns documented values for valid input.

Negative test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Bad arguments: exit with the documented non-zero code and structured stderr.
Help text: reachable via --help.
Large input: does not OOM.
Interrupt (Ctrl-C, SIGTERM): runs cleanup and flushes or rolls back partial output.
Invalid arguments to the library: throws the documented error type.
Public symbol removed or renamed: the API-surface test fails the build.
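
Component tests through the CLI entrypoint are simpler when the entrypoint is a function of argv and its output streams. The sketch below assumes a hypothetical sum command; runCli, the exit codes, and the BAD_ARGS error code are invented for illustration.

```javascript
// Hypothetical CLI entrypoint, written as a function of its inputs so a test
// can call it without spawning a process. Exit codes here are illustrative.
const EXIT_OK = 0;
const EXIT_USAGE = 2;

function runCli(argv, io) {
  if (argv.includes("--help")) {
    io.stdout.push("usage: sum <number>...");
    return EXIT_OK;
  }
  const numbers = argv.map(Number);
  if (argv.length === 0 || numbers.some(Number.isNaN)) {
    // Structured stderr so callers can parse the failure.
    io.stderr.push(JSON.stringify({ error: { code: "BAD_ARGS", args: argv } }));
    return EXIT_USAGE;
  }
  io.stdout.push(String(numbers.reduce((a, b) => a + b, 0)));
  return EXIT_OK;
}

// Test helper: captures both streams and the exit code for one invocation.
function invoke(argv) {
  const io = { stdout: [], stderr: [] };
  const exitCode = runCli(argv, io);
  return { exitCode, ...io };
}

const ok = invoke(["2", "40"]);
const bad = invoke(["2", "forty"]);
const help = invoke(["--help"]);
```

Each invocation yields the three things the pattern treats as the public interface: stdout, stderr, and the exit code. A thin wrapper that maps process.argv and the real streams onto runCli still needs at least one test that spawns the actual binary.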

Test double validation

File system doubles are validated by integration tests against the real FS in a temp directory. Subprocess doubles are validated by tests that actually spawn the subprocess on each supported OS. Doctests validate README examples against the real binary or library on every build.

Pipeline placement

Unit and component tests run in CI Stage 1 on every supported OS; API surface diff and doctests in CI Stage 1; cross-platform integration tests in CI Stage 2 if slow.

3.2.8 - Stateful Service

A service that maintains long-lived in-memory state: caches, in-memory aggregates, leader-elected coordinators, websocket gateways, real-time engines. Brief sketch.

A service that maintains long-lived in-memory state: caches, in-memory aggregates, leader-elected coordinators, websocket gateways, real-time engines, sticky-session servers.

The hard problems are concurrency, recovery, and unbounded growth. Stateful services fail in ways stateless services do not.

What needs covered

Layer	Concern	Test type
State machine logic	Pure transitions	Solitary unit tests
Persistence and checkpointing	State survives restart or rebuilds correctly	Component tests with real persistence
Recovery from crash	Restart converges to a consistent state	Component tests that simulate crash mid-write
Leader election	Only one leader; transitions are observable; split-brain is impossible	Cluster tests with real consensus library
Replication	Followers stay in sync; backpressure is documented	Cluster tests
Memory bounds	State doesn’t grow unbounded; eviction policy holds	Long-running soak tests
Connection lifecycle	Sessions clean up on disconnect; reconnect is documented	Component tests

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

State transitions: follow the documented machine.
Restart: state rebuilds and behavior matches pre-restart.
Replication lag under expected load: stays within budget.

Negative test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

Crash mid-write: consistent state on restart. No torn writes.
Network partition: minority replicas step down with documented reconciliation on heal.
Slow replication: applies backpressure rather than silent divergence.
Memory pressure: evicts oldest entries per policy without OOM.
Idle long-running connections: close cleanly with documented reconnect behavior.
Concurrent state mutations: serialize without lost updates.
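
The memory-pressure case can be checked in a fast test long before a soak test runs. The sketch below assumes an eviction policy of oldest-inserted-first; BoundedCache is a hypothetical class, not a recommendation of a specific cache design.

```javascript
// Hypothetical bounded cache; the eviction policy here is "oldest inserted first".
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map(); // Map iterates in insertion order
  }
  set(key, value) {
    // Re-inserting an existing key moves it to the newest position.
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    while (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }
  get(key) { return this.entries.get(key); }
  get size() { return this.entries.size; }
}

const cache = new BoundedCache(3);
for (let i = 1; i <= 10000; i += 1) {
  cache.set(`session-${i}`, { connectedAt: i });
}
```

After ten thousand inserts the cache still holds three entries, and they are the three most recent. The soak test then only has to look for growth the unit-level bound cannot see, such as leaked listeners or buffers.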

Test double validation

Persistence doubles are validated by adapter integration tests against the real production engine. Consensus library doubles are validated by cluster tests against a multi-node testcontainer setup. Soak tests run out of pipeline against a deployed instance to catch slow leaks and unbounded growth.

Pipeline placement

State machine unit tests, recovery component tests, and single-node concurrency tests run in CI Stage 1; cluster tests with real consensus library in CI Stage 2; soak and chaos tests out of pipeline.

3.3 - Cross-Cutting Concerns

Concerns that cut across every pattern: authn/authz, database migrations, fixtures, observability, accessibility, performance, mutation testing, flake handling, and time budgets.

The patterns describe testing organized by component shape. The concerns below cut across all patterns and deserve dedicated coverage in any non-trivial system.

Authn and authz testing

Authentication and authorization deserve dedicated, exhaustive coverage. They are a major source of high-impact incidents and the failure modes are predictable:

Tenant isolation: tenant A’s queries never return tenant B’s data. Test every read path. Multi-tenant SaaS bugs are almost always missing isolation tests.
Scope or role escalation: a token with read:orders cannot perform write:orders. Test the matrix of scope and endpoint.
Expired tokens: rejected even if cached locally. Clock-skew tolerance is a property of the verifier, not a license to skip the test.
Forged tokens: signature validation actually validates. The classic JWT alg: none bug still ships periodically.
Missing auth: every protected endpoint returns 401, never 500 (information leak) and never 200 (catastrophic).
Service-to-service auth: machine identities respected, mTLS validated, token-swapping attacks detected.

The pattern: a parameterized test that takes (endpoint, method, expected-status-when-no-token, expected-status-when-wrong-scope) and runs across every endpoint in the OpenAPI or schema definition. New endpoints are covered automatically.
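
A sketch of that parameterized test follows. Everything in it is an assumption made for illustration: the endpoint list stands in for one derived from the OpenAPI document, handle stands in for the service under test, and 403 is assumed as the documented wrong-scope status.

```javascript
// Hypothetical endpoint list; in practice this would be derived from the OpenAPI document.
const endpoints = [
  { method: "GET", path: "/orders", requiredScope: "read:orders" },
  { method: "POST", path: "/orders", requiredScope: "write:orders" },
  { method: "DELETE", path: "/orders/{id}", requiredScope: "write:orders" },
];

// Stand-in for the service under test: returns only a status code.
function handle({ method, path, token }) {
  const endpoint = endpoints.find(e => e.method === method && e.path === path);
  if (!endpoint) return 404;
  if (!token) return 401;
  if (!token.scopes.includes(endpoint.requiredScope)) return 403;
  return 200;
}

// One row per endpoint: (endpoint, method, status with no token, status with wrong scope).
const matrix = endpoints.map(({ method, path }) => ({
  method,
  path,
  expectedNoToken: 401,
  expectedWrongScope: 403, // assumed here; use whatever the API documents
}));

const wrongScopeToken = { scopes: ["read:unrelated"] };
const failures = [];
for (const row of matrix) {
  const noToken = handle({ method: row.method, path: row.path, token: null });
  const wrongScope = handle({ method: row.method, path: row.path, token: wrongScopeToken });
  if (noToken !== row.expectedNoToken) {
    failures.push(`${row.method} ${row.path}: no token -> ${noToken}`);
  }
  if (wrongScope !== row.expectedWrongScope) {
    failures.push(`${row.method} ${row.path}: wrong scope -> ${wrongScope}`);
  }
}
```

Because the matrix is built from the endpoint list rather than written by hand, adding an endpoint to the definition adds its rows to the test. An empty failures array is the passing condition.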

Database migrations

Migrations have their own discipline. For every migration:

Forward on representative data: produces the expected schema and data.
Backward (where supported): returns to the previous schema with no data loss. Expand-contract migrations may not roll back; that’s a design choice the test pins.
Forward + backward + forward: the second forward run produces the same schema and data as the first.
Time on production-scale data: budget assertion. A 30-minute migration on a 50M-row table needs a different deploy strategy than a 30-second one.
Under traffic: the expand-contract pattern doesn’t break in-flight transactions.

Test against the real production database engine and version using testcontainers. SQLite-against-Postgres is a frequent source of “passed in CI, broke at 02:00 in prod” incidents.

Test data and fixtures

Fixtures rot faster than the code that uses them. Two principles keep them honest:

Generate fixtures from the schema, not by hand. When the schema is the source of truth (Avro, OpenAPI, SQL DDL, Protobuf), generate fixture builders from it. A type change breaks the build, not production.
Use Object Mother or builder patterns, not raw inline literals. A test that says placeOrder(buildValidOrder().withItem("A1", 2).build()) survives a schema change because the builder updates centrally. A test with 30 lines of raw JSON inline does not.

Avoid shared global fixtures that tests mutate. Each test creates the state it needs, names what is essential about that state, and discards the rest.
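
A builder along the lines of the buildValidOrder example might look like the sketch below. The field names and defaults are hypothetical, and withCustomer is an added illustration of a second override.

```javascript
// Hypothetical order builder; field names are illustrative, not a real schema.
function buildValidOrder() {
  // Defaults describe a valid order, so a test only states what it cares about.
  const order = {
    customerId: "cust-001",
    currency: "USD",
    items: [],
    paymentToken: "pm_ok",
  };
  const builder = {
    withItem(sku, qty) { order.items.push({ sku, qty }); return builder; },
    withCustomer(customerId) { order.customerId = customerId; return builder; },
    // Returns a copy so the caller cannot mutate the builder's internal state.
    build() { return { ...order, items: order.items.map(item => ({ ...item })) }; },
  };
  return builder;
}

const order = buildValidOrder().withItem("A1", 2).build();
const other = buildValidOrder().withItem("B7", 1).withCustomer("cust-002").build();
```

When the schema gains a required field, only the defaults inside buildValidOrder change. Every test that calls the builder keeps compiling and keeps naming only the fields that matter to it.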

Observability as a tested artifact

Logs, metrics, and traces are part of a service’s contract with operators. If an alert depends on a metric, the test for the failure path should assert the metric is emitted. If a runbook depends on a structured log line, the test should assert the line is produced with the right fields and correlation ID.

The pattern: in component tests, attach a metrics collector and a log capture to the assembled component. Failure-path tests assert three things at once:

The response status is correct.
The error metric is incremented with the right labels.
The structured log line is emitted with correlation ID, error code, and any fields the runbook depends on.

This prevents silent regressions where the code “works” but the operator can’t see what’s happening when it doesn’t.
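
The three assertions can share one failure-path test. The sketch below uses hypothetical in-memory collectors and a stand-in placeOrder handler; the metric name orders_failed_total and the log field names are assumptions, not a convention the source defines.

```javascript
// Hypothetical collectors a component test would attach to the assembled component.
class MetricsCollector {
  constructor() { this.counters = new Map(); }
  increment(name, labels) {
    const key = `${name}${JSON.stringify(labels)}`;
    this.counters.set(key, (this.counters.get(key) || 0) + 1);
  }
  count(name, labels) {
    return this.counters.get(`${name}${JSON.stringify(labels)}`) || 0;
  }
}

class LogCapture {
  constructor() { this.lines = []; }
  error(fields) { this.lines.push({ level: "error", ...fields }); }
}

// Stand-in handler: the payment dependency times out and the handler reports it three ways.
function placeOrder(request, { payments, metrics, log }) {
  try {
    payments.charge(request.paymentToken);
    return { status: 201 };
  } catch (err) {
    metrics.increment("orders_failed_total", { reason: "UPSTREAM_TIMEOUT" });
    log.error({ correlationId: request.correlationId, errorCode: "UPSTREAM_TIMEOUT" });
    return { status: 504, body: { error: { code: "UPSTREAM_TIMEOUT" } } };
  }
}

const metrics = new MetricsCollector();
const log = new LogCapture();
const timingOutPayments = { charge() { throw new Error("timeout"); } };

const response = placeOrder(
  { correlationId: "req-123", paymentToken: "pm_ok" },
  { payments: timingOutPayments, metrics, log }
);
```

A test over this setup checks response.status, the counter for the labeled metric, and the captured log line together. Deleting the metrics.increment call would leave the status assertion green and fail the metric assertion, which is the regression this pattern exists to catch.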

Accessibility testing

For any pattern that renders a user interface, accessibility is a functional requirement, not a finishing touch, and it belongs in the same in-band / out-of-band split as every other concern on this page. The dividing line is the one the whole test architecture uses: deterministic checks gate the build; subjective judgment runs continuously and never blocks.

The governing rule: automate the deterministic rules, reserve human judgment for the rest. A large share of WCAG success criteria are machine-checkable - missing alt attributes, invalid or contradictory ARIA, unlabeled form controls, insufficient color contrast, broken heading hierarchy, a missing document language. Those are deterministic and belong in the pipeline. The remainder - whether alt text is meaningful, whether the screen-reader narrative makes sense, whether a flow is actually operable with a keyboard or a switch device - cannot be settled by a tool and must not be faked with one.

Three tiers, mapped to pipeline placement:

Static analysis (in-band, blocks build). Accessibility linting catches structural violations in source without rendering: missing alt text, ARIA misuse, label associations, heading order. It runs in the IDE, pre-commit, and CI, exactly like any other static check. Cheapest and fastest; treat high-severity findings as build-breaking, the same as a security finding.
Component tests against the rendered DOM (in-band, blocks build). Some violations exist only in the rendered output: contrast computed after CSS resolves, focus order, dynamic ARIA state, keyboard operability. A scanner assertion inside a component test (expect(results).toHaveNoViolations()) plus explicit keyboard-navigation assertions cover these deterministically, on every commit. The user interface pattern shows the full shape.
Manual audit and assistive-technology testing (out-of-band, never a gate). Real screen-reader passes, keyboard-only walkthroughs, and expert review of whether the experience is coherent. This is continuous and informs the backlog; like exploratory testing, it is not a pass/fail checkpoint and must not gate a deploy.

The caveat that keeps tiers 1 and 2 honest: automated checks detect only a fraction of WCAG success criteria - industry estimates commonly land between a third and a half, depending on the tool and the page. A green automated scan means “no detectable violations,” not “accessible.” Wiring a scanner into the build is necessary and high-value, but a team that reads a passing scan as proof of accessibility has the same false-confidence problem as a team that reads high line coverage as proof of correctness. The deterministic tiers shrink the manual surface; they do not remove it.

This mirrors observability as a tested artifact above: the machine-verifiable part of a human contract gets pinned in the deterministic suite, and the judgment part stays with people.

Performance and load testing

Three classes of perf tests, each with a different home in the pipeline:

Per-endpoint perf budgets in component tests. Simple latency assertion under no load (assertThat(p99).isLessThan(50ms)). Catches algorithmic regressions cheaply. Fits in CI Stage 1 if the assertions are tight and the runtime is stable.
Load tests in acceptance. k6, Gatling, or Locust against a deployed instance. Validate p99 latency, throughput, and error rate at expected production load. Gates production promotion.
Soak tests out of pipeline. Long-running load to catch memory leaks, file handle leaks, and slow drift. Scheduled, non-blocking.

A perf regression that breaches a documented budget should block deploy. A regression within budget but worse than baseline should generate a finding for review, not a build failure: noisy alerts get ignored.
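
The block-versus-finding rule can be sketched as a small classifier. The nearest-rank percentile method, the 10% baseline tolerance, and the function names are assumptions for illustration; the source does not prescribe any of them.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def classify_latency(p99_ms, budget_ms, baseline_ms, tolerance=0.10):
    """Return 'block', 'finding', or 'pass' for a measured p99.

    - Above the documented budget: block the deploy.
    - Within budget but more than `tolerance` worse than baseline: raise a
      finding for review without failing the build.
    - Otherwise: pass.
    """
    if p99_ms > budget_ms:
        return "block"
    if p99_ms > baseline_ms * (1 + tolerance):
        return "finding"
    return "pass"
```

With a 50 ms budget and a 40 ms baseline, a measured p99 of 60 ms blocks, 48 ms produces a finding, and 41 ms passes. The tolerance band is what keeps ordinary run-to-run noise from generating findings that people learn to ignore.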

Mutation testing

Coverage % tells you what code ran. Mutation testing tells you whether the tests would have failed if the code had been wrong. Tools (Stryker for JS, PIT for Java) systematically change operators, return values, and conditionals, then re-run the test suite. A surviving mutant is a change the test suite failed to detect.

Each surviving mutant is one of three things:

A real test gap. Add a flow-oriented test that would have failed when the mutation was applied.
An equivalent mutant, semantically identical to the original. Mark and move on.
A trivially equivalent mutant (logging change, assertion message tweak). Configure the tool to skip.

Mutation testing is too slow to run on every commit. Run it nightly or weekly on the highest-value modules. Treat it as a periodic audit of test quality, not a gating check.
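
To make the idea of a surviving mutant concrete, here is a hand-written sketch. The free-shipping rule and both suites are invented for illustration, and the mutant is written by hand to stand in for what a tool such as Stryker or PIT would generate automatically.

```python
def qualifies_for_free_shipping(order_total: float) -> bool:
    """Original code: orders of 50.00 or more ship free."""
    return order_total >= 50.0


def mutant_qualifies_for_free_shipping(order_total: float) -> bool:
    """Hand-written mutant: the `>=` operator has been changed to `>`."""
    return order_total > 50.0


def weak_suite(fn) -> bool:
    """Executes every line (100% line coverage) but never probes the boundary."""
    return fn(80.0) is True and fn(10.0) is False


def boundary_suite(fn) -> bool:
    """Adds the behavior the mutant breaks: exactly 50.00 must ship free."""
    return weak_suite(fn) and fn(50.0) is True
```

The weak suite passes against both the original and the mutant, so the mutant survives. The boundary suite passes against the original and fails against the mutant, so the mutant is killed. Line coverage is identical in both cases.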

Flake handling protocol

A flaky test is a known unknown. Three rules keep flakes from rotting the suite:

Quarantine on detection. First flake gets the test moved to a quarantine lane that doesn’t block the build. Don’t ignore it; don’t keep failing builds for unrelated reasons.
Time-boxed remediation. Quarantined tests have a deadline (e.g., five business days) and an owner. After the deadline, fix or delete. No silent quarantine.
Track the cause. Most flakes share root causes: timing, shared state, network, ordering. The fix is usually structural (eliminate the timing dependency) rather than local (add a longer sleep).

A suite with a permanent quarantine list has lost its CD-ready quality. See also Tests Randomly Pass or Fail.
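
One way to enforce the "no silent quarantine" rule is a small check that fails once a quarantined test passes its deadline. The sketch below assumes a hypothetical in-code registry and the five-business-day example from the text; it ignores public holidays.

```python
from dataclasses import dataclass
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays (Monday to Friday), skipping weekends."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


@dataclass(frozen=True)
class QuarantinedTest:
    """Hypothetical registry entry: every quarantined test has an owner and a cause."""
    test_id: str
    owner: str
    quarantined_on: date
    suspected_cause: str  # e.g. timing, shared state, network, ordering

    def deadline(self, business_days: int = 5) -> date:
        return add_business_days(self.quarantined_on, business_days)


def overdue(entries, today: date, business_days: int = 5):
    """Entries past their deadline: each must now be fixed or deleted."""
    return [e for e in entries if today > e.deadline(business_days)]
```

A scheduled job can call overdue(...) with the current date and fail, or notify the owners, when the list is non-empty. The quarantine lane itself stays non-blocking; only an expired deadline makes noise.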

Cost and time budgets

Empirical starting points for in-band test budgets, based on typical service complexity. Adjust for your codebase, language, framework, and the size of the component under test.

Pattern	In-band suite budget	Notes
1 (API provider)	< 5 min	Most logic in unit and component tests
2 (API consumer)	< 5 min	More gateway and resilience tests than 1
3 (scheduled job)	< 3 min	Plus a small set of tests that exercise the deployed binary
4 (UI)	< 8 min	Component tests in headless browser via Playwright + the team’s unit-testing framework
5 (event consumer)	< 5 min	Real broker container for gateway tests
6 (event producer)	< 5 min	Same
7 (CLI / library)	< 3 min	One pass per supported OS in CI matrix
8 (stateful service)	< 8 min	Real persistence; cluster tests in Stage 2

Keeping the total in-band suite for the CD pipeline under 10 minutes is the gating constraint at the team level. The first lever for hitting that budget is parallel execution: the suite should fan out across cores or runners, not run serially. Parallelism only works when tests are independent of each other - no shared mutable state, no ordering dependencies, no global fixtures that one test mutates and another reads. Decoupling tests is a prerequisite for speed, not an optimization on top of it.

If a component’s tests still can’t fit the budget after the suite is running in parallel, the goal is to remediate the underlying cause - slow component startup, oversize fixtures, expensive setup duplicated per test, hidden serialization through a shared resource - not to declare the budget unreachable. While the remediation is underway, moving the offending tests out-of-band on a schedule is a reasonable stopgap so the in-band suite stays fast. Out-of-band placement here is a temporary mitigation, not the destination: those tests should come back in-band once the underlying speed issue is fixed.

4 - Testing Antipatterns

Common testing antipatterns that block CD, plus a migration guide for getting an existing suite back on track.

Most teams arrive at this section with a test suite that doesn’t match the Applied Testing Strategies guide. This page covers the failure modes that show up most often and the migration moves that get a suite back on track.

Common testing anti-patterns

Each entry below is a smell that the suite is testing the wrong thing, will erode trust over time, or will block refactoring instead of enabling it.

Reflection to reach private members

Using reflection (or language-equivalent escape hatches: @VisibleForTesting-only public access, friend classes, internal exposed only for tests) to read or invoke private members from a test. This couples the test to the exact internal structure of the class, breaks every time the implementation is refactored, and tests something the caller cannot observe, meaning the test can pass while the actual public behavior is broken.

If a private behavior is worth testing, it’s reachable through a public method that exercises it. If no public method exercises it, the private code is dead and should be deleted. Reflection in tests is a signal that either the design needs adjustment (the class is too large and a collaborator wants to come out) or the test is aimed at the wrong abstraction level.

Testing private methods directly

Same root cause as the reflection anti-pattern, but achieved by making methods package-private, protected, or otherwise reachable through a side door specifically so tests can call them. The method’s accessibility is now distorted by the test, not by the design. Drive private logic through the public method that uses it, or extract it into a collaborator with its own public surface and test that collaborator through its public interface.

One test class per production class, one test per method

Tests organized as a mirror of the production code structure, such as OrderServiceTest with testProcessPayment, testValidateOrder, testEmitEvent, produce a suite that documents the implementation and dies on contact with refactoring. Organize tests by behavior. An OrderPlacement test class with places_order_with_valid_payment, rejects_order_when_payment_declined, holds_order_when_inventory_unavailable is what survives, what reads well, and what catches integration bugs between methods.

Tests that mirror the implementation

A test that asserts “method A is called, then method B is called, then method C is called with these arguments” is testing the implementation, not the behavior. The same outcome could be achieved by a different sequence of calls, and if the test fails when the sequence changes but the outcome doesn’t, the test is wrong, not the code. Assert on observable outcomes (returned value, persisted state, emitted event, response status) and use mocks/spies sparingly, only for outbound interactions that are themselves part of the contract.

Mocking what you don’t own

Stubbing a third-party SDK, ORM, HTTP client, or cloud SDK directly in tests. The double is now a claim about a library the team has no control over and incomplete knowledge of. When the library updates or the team upgrades versions, the doubles are silently wrong and the tests still pass. Wrap third-party clients in a thin gateway the team owns, then double the gateway.
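
A minimal sketch of the gateway approach. PaymentGateway, ChargeResult, StubPaymentGateway, and place_order are hypothetical names; the production adapter that wraps the vendor SDK is deliberately not shown, because it is covered by adapter integration tests rather than doubled.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ChargeResult:
    approved: bool
    reference: str


class PaymentGateway(Protocol):
    """Team-owned interface. The vendor SDK adapter (not shown) implements it."""

    def charge(self, customer_id: str, amount_cents: int) -> ChargeResult: ...


class StubPaymentGateway:
    """Test double for the gateway the team owns, not for the vendor SDK."""

    def __init__(self, approved: bool) -> None:
        self._approved = approved

    def charge(self, customer_id: str, amount_cents: int) -> ChargeResult:
        return ChargeResult(approved=self._approved, reference="stub-ref")


def place_order(gateway: PaymentGateway, customer_id: str, amount_cents: int) -> str:
    """Production logic depends only on the gateway interface."""
    result = gateway.charge(customer_id, amount_cents)
    return "placed" if result.approved else "rejected"
```

The double now makes a claim only about an interface the team defines. When the vendor library changes, the change surfaces in the one adapter and its integration test, not as silently stale stubs spread across the suite.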

Doubles without validating tests

Any test double that has no corresponding mechanism (contract test, adapter integration test, post-deploy integration check) keeping it honest is a lie waiting to be discovered in production. If a double exists and there’s no traceable answer to “how would we know if this stopped matching reality?” that double is a known risk and should be tracked as one.

Over-mocking

Replacing every collaborator with a mock so the test sees only the system under test in isolation. The test now mirrors the implementation: every refactor that moves a method between collaborators breaks tests that didn’t fail for any production reason. Only mock what’s necessary to keep the test deterministic. Real in-process collaborators - value objects, domain models, in-memory repositories - belong in the test, not behind a mock.

Complex mock setup

If a single test needs dozens of lines to set up its mocks, the system under test probably has too many dependencies for one unit of behavior. Setup complexity is a smell pointing at the production design, not at the test. Refactor the production code (extract a collaborator, narrow the interface, push concerns into separate classes) before adding more mocks.

Sleeping in tests

Thread.sleep, await sleep(500), and friends to “wait for” an asynchronous operation. Sleeps are either too short (flaky) or too long (slow), and they ratchet upward over time as people debug flakes. Use the framework’s built-in waiting primitives (Awaitility, waitFor from Testing Library, eventually blocks) that poll until a condition is true with a bounded timeout. If the system under test depends on real wall-clock time, inject a fake clock. Never sleep.
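
The two replacements for a fixed sleep can be sketched in a few lines. wait_until is a hand-rolled stand-in for primitives like Awaitility or waitFor, and FakeClock and ExpiringToken are hypothetical; none of these names come from a real library.

```python
import time


def wait_until(condition, timeout=2.0, interval=0.01,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if condition():
            return
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)


class FakeClock:
    """Injectable clock: time moves only when the test says so."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class ExpiringToken:
    """Hypothetical production class that takes its clock as a dependency."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + ttl_seconds

    def is_valid(self) -> bool:
        return self._clock() < self._expires_at
```

The polling helper still pauses briefly between checks, but the wait is bounded and ends as soon as the condition holds, so it is neither too short nor too long. With the fake clock, a test of a 30-second expiry calls clock.advance(31) and finishes in microseconds.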

Shared mutable state between tests

Tests that depend on the order they run in, or that leak state through static singletons, shared databases without per-test isolation, or module-level caches. Each test should set up the state it needs and tear it down (or use a fresh isolated context). Order-dependent suites fail randomly when run in parallel and produce “works on my machine” failures that erode trust in the suite.

Skipping or muting tests instead of fixing them

A muted test is a known bug in the test or in the system, hidden. Either fix it now, delete it, or open a ticket and put a deadline on it. Suites with a steady population of @Ignore/@skip/xit decorations end up with a steady population of latent bugs.

Test code held to lower standards than production code

Copy-pasted setup blocks, string-typed assertions on JSON fragments, magic numbers, no abstractions, no review. Tests are production code. They’re how the team learns whether the system works. Refactor them, deduplicate them, name them well, and review them as carefully as the code they protect.

Testing through the UI when the same behavior is testable lower in the stack

UI tests are the slowest and most fragile layer. Pushing logic-only assertions into UI tests because “that’s where we’re set up to test” produces a brittle, slow suite that becomes a tax on every change. Test logic where the logic lives. Reserve UI tests for things that can only be observed at the UI layer.

“We’ll add tests later”

Tests added after the code is already in production, written by someone who didn’t write the code, asserting only what the code currently does, are not tests of the system’s intended behavior. They’re a snapshot of the current implementation, including its bugs. The team learns nothing from them and refactoring becomes risky in exactly the way tests are supposed to prevent. Tests written alongside the code (or before it, TDD-style) are the only ones that document intent.

Migrating an existing suite

The right first move depends on what the suite looks like now. Five common starting points and the first three steps for each:

If most coverage is end-to-end Selenium or Cypress against real backends

Inventory the flows the E2E suite exercises. Pick the top five that fail most often.
Build component tests for those flows. Double the backend through the gateway the team owns.
Once those component tests are green and the doubles they rely on are backed by a contract test plus an out-of-band check that is actually running and watched, delete the corresponding E2E tests. Don’t keep both: duplicated coverage doubles the maintenance cost without doubling the confidence. Until that out-of-band validation is in place and monitored, keep one real-integration smoke test per flow - the component test’s confidence rests on doubles, and deleting the last real-integration signal before anything proves those doubles still match reality just moves the risk somewhere you can’t see it.

If most “unit” tests mock third-party SDKs

Identify the third-party clients (HTTP, DB, cloud SDKs). For each, define a thin gateway interface owned by the team.
Replace direct SDK use in production code with the gateway. Tests now double the gateway, which the team controls.
Add adapter integration tests against the real dependency (testcontainer, sandbox account). The doubles are now backed by reality.

If line coverage is high but production keeps breaking

Run mutation testing on a high-traffic module. Each surviving mutant is a code change the suite failed to detect, which points at behavior the tests do not actually verify.
For each surviving mutant, add a flow-oriented test that would have caught it. Don’t add a test of the specific mutation: add the test of the behavior the mutation breaks.
Repeat module by module, prioritized by production incident frequency. Coverage % won’t change much. Defect-finding will.

If the suite has six figures of tests and runs for 90 minutes

Move tests that need a database or downstream into an integration lane on a different cadence (post-merge or scheduled), not the pre-commit gate.
Convert sociable unit tests to component tests where they exercise complete flows. Delete redundant unit-level duplicates.
Set a budget: deterministic suite under 10 minutes. Non-conforming tests get reviewed; if they can’t be made fast, they move to acceptance or get deleted.

If there are no tests at all

Don’t try to retrofit unit tests for existing code. You’ll write tests that pin the current bugs.
Start with a small set of component tests for the highest-value flows. They double as characterization tests for legacy behavior.
As the team changes code, write tests for the change first. The test base grows organically with the change set, and the parts of the code that change most are the parts that get tests soonest.

The pattern across all five: don’t try to convert the whole suite at once. Move flow by flow, module by module. The test that matters next is the one for the change you’re about to make.

Applied Testing Strategies - the patterns this page is helping teams migrate toward.
Architecting Tests for CD - the section overview, with the do/do-not list this page expands on.
Test Double - the glossary entry covering the five flavours and when to use each.

5 - Testing Glossary

Definitions for testing terms as they are used on this site.

These definitions reflect how this site uses each term. They are not universal definitions - other communities may use the same words differently.

Acceptance Tests

Automated tests that verify a system behaves as specified. Acceptance tests exercise user workflows in a production-like environment and confirm the implementation matches the acceptance criteria. They answer “did we build what was specified?” rather than “does the code work?” They do not validate whether the specification itself is correct - only real user feedback can confirm we are building the right thing.

In CD, acceptance testing is a pipeline stage, not a single test type. It can include component tests, load tests, chaos tests, resilience tests, and compliance tests. Any test that runs after CI to gate promotion to production is an acceptance test.

Referenced in: CD Testing, Pipeline Reference Architecture

Adapter Integration Test

A narrow test of a single boundary adapter - the team’s own HTTP client, database query layer, message-broker client, file-system adapter, or similar - exercised against either the real external dependency or a high-fidelity stand-in like a testcontainer running the production engine. (Legacy name from Toby Clemson: “gateway integration test.”)

What the test is for

The test asserts that the adapter correctly speaks the protocol: that it serializes the request the way the dependency expects, parses the response shape correctly, maps errors to the right exception types, propagates headers, enforces timeouts, and handles transactional semantics.

What the test is not for

It does not test the behavior of the dependency itself. If the adapter asks for a user, the test validates that the response parses into a valid User object - not which user comes back, not the dependency’s own business rules, not anything that the dependency owns. The dependency’s correctness is the dependency’s problem; the adapter’s job is to speak the protocol faithfully. Conflating the two produces brittle tests that fail on unrelated changes to the dependency’s data or logic.

Pipeline placement

Runs in-band only when both conditions hold:

The team has full control over the dependency - a database, broker, or service the team owns and can pin to a known version, typically via a per-test testcontainer.
The test is fully deterministic against that controlled instance.

For everything else - third-party APIs, services owned by another team, dependencies whose state the team can’t reset between runs - the test runs out-of-band on a schedule. Out-of-band placement is the default for any adapter test that touches a system outside the team’s full control. Failures trigger review, not a build break. Pulling these tests in-band is the most common cause of flaky pipelines.

Distinguishing from neighboring test types

Different from a broader end-to-end test: an adapter integration test isolates one boundary adapter, not a flow across multiple components. Different from a contract test at the same boundary: contract tests pin shape against doubles in the pipeline; adapter integration tests pin protocol against the real dependency.

Referenced in: API Consumer, API Provider, Applied Testing Strategies, Antipatterns, Event Consumer, Event Producer, Scheduled Job, Stateful Service

API Surface Test

A test that pins the public-facing API of a library or CLI - the exported symbols, their signatures, the documented arguments and exit codes. Typically a snapshot: the current public surface is captured to a file, and any diff fails the build. Catches accidental breaking changes (a renamed function, a removed flag, a tightened type) before they reach consumers. Distinct from a contract test, which pins the wire boundary between two services; an API surface test pins the source-level boundary between a library and its callers.
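
A small sketch of the snapshot idea for a Python library, using the standard inspect module. The exported functions and the approved snapshot are invented for illustration; a real setup would read the snapshot from a committed file and enumerate the package's actual exports.

```python
import inspect


def public_surface(namespace: dict) -> list[str]:
    """One line per public function: its exported name plus its signature."""
    lines = []
    for name in sorted(namespace):
        obj = namespace[name]
        if name.startswith("_") or not inspect.isfunction(obj):
            continue
        lines.append(f"{name}{inspect.signature(obj)}")
    return lines


# Hypothetical library exports.
def parse(text: str, strict: bool = False) -> dict:
    return {"text": text, "strict": strict}


def render(data: dict) -> str:
    return str(data)


def _internal_helper() -> None:
    return None


LIBRARY_EXPORTS = {
    "parse": parse,
    "render": render,
    "_internal_helper": _internal_helper,
}

# Hypothetical approved snapshot; normally stored in a file under version control.
APPROVED_SNAPSHOT = [
    "parse(text: str, strict: bool = False) -> dict",
    "render(data: dict) -> str",
]


def test_public_surface_matches_snapshot():
    assert public_surface(LIBRARY_EXPORTS) == APPROVED_SNAPSHOT
```

Removing the strict parameter, renaming render, or tightening a type changes the computed surface and fails the test. An intentional change is made by updating the snapshot in the same commit, which makes the break visible in review.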

Referenced in: CLI Tool or Library

Black Box Testing

A testing approach where the test exercises code through its public interface and asserts only on observable outputs - return values, state changes visible to consumers, or side effects such as messages sent. The test has no knowledge of internal implementation details. Black box tests are resilient to refactoring because they verify what the code does, not how it does it. Contrast with white box testing.

Referenced in: CD Testing, Unit Tests

Cluster Test

A test that exercises a stateful service across multiple nodes - replication, leader election, consensus, partition tolerance - against a real multi-node setup, typically via testcontainers running the production consensus library. Cluster tests catch behavior that only appears under a real cluster: split-brain, slow followers, leader transitions, partition reconciliation. Deterministic enough to run in-band but slower than single-node component tests, so usually relegated to a later CI stage.

Referenced in: Stateful Service

Component Test

A deterministic test that verifies a complete frontend component or backend service through its public interface, with test doubles for all external dependencies. See Component Tests for full definition and examples.

Referenced in: Component Tests, End-to-End Tests, Tests Randomly Pass or Fail, Unit Tests

Contract Test

A deterministic test that verifies the boundary between two systems using test doubles. Sometimes called a narrow integration test. A contract test has two perspectives. A consumer contract test asks “do the fields and status codes I depend on still exist?” and asserts only on the subset of the API the consumer actually uses. A provider contract test asks “have my changes broken any of my consumers?” and runs every consumer’s published expectations against the real provider implementation. The same shape applies to broker topics (a “broker contract”) and to source-and-sink schemas in pipelines (“source/sink contract”) - the test object is the boundary, the perspective is whichever side the test runs from.

Contract tests are deterministic and run pre-merge as in-band tests. They block the build like any other in-band test. See Contract Tests for the full discussion of consumer-driven contracts (CDC) and contract-first development.
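
The consumer perspective, asserting only on the subset of fields the consumer uses, can be sketched without any contract-testing framework. The field names and the contract_violations helper are hypothetical; dedicated tooling would normally publish and verify these expectations.

```python
# Hypothetical consumer expectations: only the fields this consumer reads.
CONSUMER_EXPECTATIONS = {"id": str, "email": str}


def contract_violations(payload: dict, expectations: dict = CONSUMER_EXPECTATIONS) -> list[str]:
    """Return a list of problems; an empty list means the contract holds."""
    problems = []
    for field, expected_type in expectations.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems
```

A provider response that adds a new field such as display_name still satisfies this contract, because the consumer never asserted on the full shape. A response that drops email or changes id to an integer does not. That asymmetry is what lets providers evolve without breaking consumers that do not care about the change.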

Referenced in: API Consumer, API Provider, Contract Tests, Event Consumer, Event Producer

Cross-OS Test Matrix

Not a separate test type but a CI configuration that runs the existing test suite on each supported operating system. The matrix catches platform-specific behavior single-OS tests can’t: path separators, line endings, signal-handling differences, locale defaults, file-system case sensitivity. Required for any deployable consumed across multiple OSes - CLI tools, libraries, cross-platform desktop or mobile apps.

Referenced in: CLI Tool or Library

Deployed-binary Test

A test that invokes the actual deployed artifact - the same binary, container image, or package the scheduler, orchestrator, or operator will invoke in production - and asserts on observable behavior at startup or first invocation. Catches what in-process component tests bypass: configuration loading, secret resolution, signal handling, exit codes, lock acquisition, dependency-version mismatches. Usually a small set; the bulk of behavior is tested in component tests against an in-memory assembled app.

Referenced in: CLI Tool or Library, Scheduled Job

Doctest

An executable test extracted from documentation - typically the README or inline code samples - that runs the documented examples against the real binary or library and fails the build if the examples are broken. Doctests close the gap between “the docs say X works” and “X actually works in the latest build”. Most languages have framework support: Python’s doctest module, Rust’s documentation tests (code examples in doc comments, compiled and run by cargo test), and Markdown-based runners for Node and Java.
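
As an illustration using Python's standard doctest module, the sketch below runs the examples embedded in a docstring and reports how many failed. The slugify function is a hypothetical library function; the doctest classes used (DocTestFinder, DocTestRunner) are part of the standard library.

```python
import doctest


def slugify(title: str) -> str:
    """Turn a title into a URL slug.

    >>> slugify("Hello World")
    'hello-world'
    >>> slugify("  Trunk   Based  ")
    'trunk-based'
    """
    return "-".join(title.lower().split())


def run_documented_examples():
    """Execute the docstring examples and return (failed, attempted) counts."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False skips module lookup; globs supplies the names the examples use.
    for test in finder.find(slugify, "slugify", module=False,
                            globs={"slugify": slugify}):
        runner.run(test)
    return runner.summarize(verbose=False)
```

A build step would fail when the returned failed count is non-zero. If someone changes slugify to use underscores without updating the documentation, the documented examples stop matching and the build breaks instead of the reader.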

Referenced in: CLI Tool or Library

In-Band Test

A test that runs in the delivery pipeline as part of the commit-to-deploy flow. In-band tests must be deterministic, which means test doubles replace anything that crosses the component boundary - downstream services, message brokers, schedulers, browsers talking to real backends. Failures block the build or the deployment.

The bulk of any project’s test suite is in-band: unit tests, component tests, contract tests, and adapter integration tests against team-controlled dependencies (testcontainers running an engine the team pins). Together they give a deterministic go/no-go signal in minutes. Adapter integration tests against third-party services or shared environments are the exception: they run out-of-band on a schedule, not in-band.

Contrast with out-of-band tests, which run on a schedule against real systems and never gate the build.

Referenced in: Applied Testing Strategies, Architecting Tests for CD

Out-of-Band Test

A test that runs outside the delivery pipeline on a schedule or post-deploy, exercising real external systems. Out-of-band tests are non-deterministic by design (they depend on the real world) and never gate a commit or merge. Failures trigger review, alerts, or rollback decisions.

Out-of-band checks are how teams confirm that the doubles used by in-band tests still match reality. Examples: post-deploy integration tests against the real downstream, synthetic monitoring of production, scheduled smoke checks against a sandbox API.

Referenced in: Applied Testing Strategies, Architecting Tests for CD, Integration Tests

Soak Test

A long-running test that exercises a deployed service for hours or days under representative load to catch behavior that only appears with time: memory leaks, unbounded growth, replication-lag drift, slow-burn resource exhaustion. Soak tests are out-of-band by design - they don’t fit a pre-merge budget. Failures trigger review, not a build break. Often paired with chaos testing (deliberate fault injection during the soak) to validate recovery behavior over time.

Referenced in: Stateful Service

Sociable Unit Test

A unit test that allows real collaborator objects to participate - for example, a service object calling a real domain model or value object - while still replacing any external I/O (network, database, file system) with test doubles. The “unit” being tested is a behavior that spans multiple in-process objects. When the scope expands to the entire public interface of a frontend component or backend service, that is a component test.

Referenced in: Unit Tests, Component Tests

Solitary Unit Test

A unit test that replaces all collaborators with test doubles and exercises a single class or function in complete isolation. Contrast with sociable unit test, which allows real collaborator objects while still replacing external I/O.

Referenced in: Unit Tests

Synthetic Monitoring

Automated scripts that continuously execute realistic user journeys or API calls against a live production (or production-like) environment and alert when those journeys fail or degrade. Unlike passive monitoring that watches for errors in real user traffic, synthetic monitoring proactively simulates user behavior on a schedule - so problems are detected even during low traffic periods. Synthetic monitors are non-deterministic (they depend on live external systems) and are never a pre-merge gate. Failures trigger alerts or rollback decisions, not build blocks.

Referenced in: Architecting Tests for CD, End-to-End Tests

TDD (Test-Driven Development)

A development practice where tests are written before the production code that makes them pass. TDD supports CD by ensuring high test coverage, driving simple design, and producing a fast, reliable test suite. TDD feeds into the testing fundamentals required in Phase 1.

Referenced in: CD for Greenfield Projects, Integration Frequency, Inverted Test Pyramid, Small Batches, TBD Migration Guide, Trunk-Based Development, Unit Tests

Test Double

A stand-in object that replaces a real production dependency during testing. The term comes from the film industry’s “stunt double”: just as a stunt double replaces an actor for dangerous scenes, a test double replaces a costly or non-deterministic dependency to make tests fast, isolated, and reliable.

Test doubles let you:

Remove non-determinism by replacing network calls, databases, and file systems with predictable substitutes.
Control test conditions by forcing specific states, error conditions, or edge cases that would be hard to reproduce with real dependencies.
Increase speed by eliminating slow I/O.
Isolate the system under test so failures point at the code being tested, not at an external dependency.

Types of test doubles

Type	Description	Example use case
Dummy	Passed around but never actually used. Fills parameter lists.	A required logger parameter in a constructor.
Stub	Provides canned answers to calls made during the test. Does not respond to anything outside what is programmed.	Returning a fixed user object from a repository.
Spy	A stub that also records information about how it was called (arguments, call count, order).	Verifying that an analytics event was sent once.
Mock	Pre-programmed with expectations about which calls will be made. Verification happens on the mock itself.	Asserting that `sendEmail()` was called with specific arguments.
Fake	Has a working implementation, but takes shortcuts not suitable for production.	An in-memory database replacing PostgreSQL.

Choosing the right double

Use a stub when you need to supply data but don’t care how it was requested.
Use a spy when you need to verify call arguments or call count.
Use a mock when the interaction itself is the primary thing being verified.
Use a fake when you need realistic behavior but can’t use the real system.
Use a dummy when a parameter is required by the interface but irrelevant to the test.
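
The five types can be seen side by side in one small test. Everything here is a hypothetical sketch: register_order and the collaborator classes are invented, and only the mock uses a real library (unittest.mock from the Python standard library).

```python
from unittest.mock import Mock


class StubUserRepository:
    """Stub: returns a canned answer and nothing else."""

    def find(self, user_id):
        return {"id": user_id, "email": "stub@example.com"}


class SpyAnalytics:
    """Spy: records how it was called so the test can inspect it afterwards."""

    def __init__(self):
        self.events = []

    def track(self, name, **props):
        self.events.append((name, props))


class FakeOrderStore:
    """Fake: a working in-memory implementation, unsuitable for production."""

    def __init__(self):
        self._rows = {}

    def save(self, order_id, order):
        self._rows[order_id] = order

    def get(self, order_id):
        return self._rows[order_id]


def register_order(order_id, user_id, users, store, analytics, mailer, logger):
    """Hypothetical code under test; `logger` is required but unused on this path."""
    user = users.find(user_id)
    store.save(order_id, {"user": user["id"]})
    analytics.track("order_registered", order_id=order_id)
    mailer.send_email(user["email"], "Order received")


def test_each_double_plays_its_role():
    store = FakeOrderStore()
    analytics = SpyAnalytics()
    mailer = Mock()          # Mock: the interaction itself is what gets verified.
    dummy_logger = object()  # Dummy: fills the parameter list, never used.

    register_order("o-1", "u-1", StubUserRepository(), store, analytics,
                   mailer, dummy_logger)

    assert store.get("o-1") == {"user": "u-1"}                           # fake
    assert analytics.events == [("order_registered", {"order_id": "o-1"})]  # spy
    mailer.send_email.assert_called_once_with("stub@example.com",
                                              "Order received")           # mock
```

Only the mailer assertion checks an interaction, because sending the email is the outcome. The store is checked through its state and the repository is not checked at all, which keeps the test tied to behavior instead of call sequence.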

Test doubles are heaviest in the early pipeline stages (unit, component, contract tests) where deterministic speed is the priority. They thin out as you move through the pipeline; end-to-end tests use no doubles by design. The guiding principle from Justin Searls: “Don’t poke too many holes in reality.” Use a double when you must, and prefer the real implementation when it’s fast and deterministic.

Doubles are only as good as the contract they encode. Every double in the suite should trace to a contract test pinning its claims and an out-of-band check confirming the claims still hold. See the Antipatterns page for the failure modes of unvalidated doubles.

Referenced in: Antipatterns, Applied Testing Strategies, Component Tests, Contract Tests, Unit Tests

Virtual Service

A test double that simulates a real external service over the network, responding to HTTP requests with pre-configured or recorded responses. Unlike in-process stubs or mocks, a virtual service runs as a standalone process and is accessed via real network calls, making it suitable for component testing and end-to-end testing where your application needs to make actual HTTP requests against a dependency. Service virtualization tools can create virtual services from recorded traffic or API specifications. See Test Doubles.
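
A minimal sketch of the idea, assuming Node.js and its built-in `node:http` module: a standalone process that answers real HTTP requests from a table of pre-configured responses. The routes, payloads, and the function names `lookup` and `startVirtualService` are hypothetical; dedicated service virtualization tools add recording, matching rules, and fault injection on top of this.

```typescript
import { createServer } from "node:http";

// Pre-configured responses keyed by "METHOD path". All routes are hypothetical.
const cannedResponses: Record<string, { status: number; body: unknown }> = {
  "GET /users/42": { status: 200, body: { id: 42, name: "Ada" } },
  "GET /users/404": { status: 404, body: { error: "not found" } },
};

// Pure lookup, kept separate from the server so it can be checked in isolation.
function lookup(method: string, path: string): { status: number; body: unknown } {
  return cannedResponses[`${method} ${path}`] ?? { status: 501, body: { error: "no canned response" } };
}

// Starts the virtual service on the given port and returns the server
// so the caller can close it when the test run ends.
function startVirtualService(port: number) {
  const server = createServer((req, res) => {
    const { status, body } = lookup(req.method ?? "GET", req.url ?? "/");
    res.writeHead(status, { "Content-Type": "application/json" });
    res.end(JSON.stringify(body));
  });
  server.listen(port);
  return server;
}
```

A test run would call `startVirtualService(8081)`, point the application's dependency URL at `http://localhost:8081`, and call `close()` on the returned server afterwards. Unknown routes answer 501 here so that an unplanned call is visible instead of silently succeeding.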

Referenced in: Component Tests, End-to-End Tests, Testing Fundamentals

White Box Testing

A testing approach where the test has knowledge of and asserts on internal implementation details - specific methods called, call order, internal state, or code paths taken. White box tests verify how the code works, not what it produces. These tests are fragile because any refactoring of internals breaks them, even when behavior is unchanged. Avoid white box testing in unit tests; prefer black box testing that asserts on observable outcomes.
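
The fragility can be sketched with two interchangeable implementations of a hypothetical `Cart` interface. All names here are assumptions made for the example. The white box check reaches into a private field and therefore breaks on a behavior-preserving refactor; the black box check asserts only on the observable total.

```typescript
interface Cart {
  add(priceCents: number, qty: number): void;
  totalCents(): number;
}

// Original implementation: keeps a list of line items.
class ListCart implements Cart {
  private items: Array<{ priceCents: number; qty: number }> = [];
  add(priceCents: number, qty: number): void { this.items.push({ priceCents, qty }); }
  totalCents(): number {
    return this.items.reduce((sum, item) => sum + item.priceCents * item.qty, 0);
  }
}

// Refactored implementation: same behavior, different internals.
class RunningTotalCart implements Cart {
  private total = 0;
  add(priceCents: number, qty: number): void { this.total += priceCents * qty; }
  totalCents(): number { return this.total; }
}

// White box: asserts on internal state (the private `items` array).
function whiteBoxCheck(cart: Cart): boolean {
  cart.add(500, 2);
  const internals = cart as unknown as { items?: unknown[] };
  return internals.items?.length === 1;
}

// Black box: asserts only on what the cart produces.
function blackBoxCheck(cart: Cart): boolean {
  cart.add(500, 2);
  cart.add(250, 1);
  return cart.totalCents() === 1250;
}
```

`blackBoxCheck` passes for both `ListCart` and `RunningTotalCart`. `whiteBoxCheck` passes only for `ListCart`, even though the refactored cart computes the same totals.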

Referenced in: CD Testing, Unit Tests
