Continuous delivery requires that trunk always be releasable, which means testing it automatically on every change. A collection of tests is not enough. You need a test architecture: different test types working together so the pipeline can confidently deploy any change, even when external systems are unavailable.
Testing Goals for CD
Your test suite must meet these goals before it can support continuous delivery.
| Goal | Target | How to Measure |
|------|--------|----------------|
| Fast | CI gating tests < 10 minutes; full acceptance suite < 1 hour | CI gating suite duration; full acceptance suite duration |
| Deterministic | Same code always produces the same result | Flaky test count: 0 in the gating suite |
| Catches real bugs | Tests fail when behavior is wrong, not when implementation changes | Defect escape rate trending down |
| Independent of external systems | Pipeline can determine deployability without any dependency being available | Trace defects to their origin and prevent entire categories of bugs |
The Ice Cream Cone: What to Avoid
An inverted test distribution, with too many slow end-to-end tests and too few fast unit tests, is the most common testing barrier to CD.
The ice cream cone makes CD impossible. Manual testing gates block every release. End-to-end tests
take hours, fail randomly, and depend on external systems being healthy. For the test architecture
that replaces this, see Pipeline Test Strategy
and the Testing reference.
Next Step
Automate your build process so that building, testing, and packaging happen with a single command. Continue to Build Automation.
Inverted Test Pyramid - Anti-pattern where too many slow E2E tests replace fast unit tests
Pressure to Skip Testing - Anti-pattern where testing is treated as optional under deadline pressure
1 - What to Test - and What Not To
The principles that determine what belongs in your test suite and what does not - focusing on interfaces, isolating what you control, and applying the same pattern to frontend and backend.
Three principles determine what belongs in your test suite and what does not.
If you cannot fix it, do not test for it
You should never test the behavior of
services you consume. Testing their behavior is the responsibility of the team that builds
them. If their service returns incorrect data, you cannot fix that, so testing for it is
waste.
What you should test is how your system responds when a consumed service is unstable or
unavailable. Can you degrade gracefully? Do you return a meaningful error? Do you retry
appropriately? These are behaviors you own and can fix, so they belong in your test suite.
This principle directly enables the pipeline test strategy. When you stop testing things you
cannot fix, you stop depending on external systems in your pipeline. Your tests become faster,
more deterministic, and more focused on the code your team actually ships.
Test interfaces first
Most integration failures originate at interfaces, the boundaries where your system talks to
other systems. These boundaries are the highest-risk areas in your codebase, and they deserve
the most testing attention. But testing interfaces does not require integrating with the real
system on the other side.
When you test an interface you consume, the question is: “Can I understand the response and
act accordingly?” If you send a request for a user’s information, you do not test that you
get that specific user back. You test that you receive and understand the properties you need -
that your code can parse the response structure and make correct decisions based on it. This
distinction matters because it keeps your tests deterministic and focused on what you control.
Use contract mocks, virtual services, or any
test double that faithfully represents the interface contract. The test validates your side of
the conversation, not theirs.
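As a sketch of this distinction, consider a consumed user-profile service. Everything below - the service, its method, and the response fields - is an illustrative assumption, not a real API; the point is that the test asserts on the properties we need and the logic we own, not on getting a specific user back:

```python
# Illustrative sketch: testing our side of a consumed interface.
# The service, method, and field names are assumptions for this example.
from unittest.mock import Mock

def display_name(profile: dict) -> str:
    # Logic we own: derive a display name from the provider's response.
    name = profile.get("name") or profile.get("email", "unknown")
    return name.strip()

# A test double standing in for the real user service.
user_service = Mock()
user_service.get_profile.return_value = {
    "id": "u1", "name": " Ada ", "email": "ada@example.com",
}

profile = user_service.get_profile("u1")
# Assert that we can parse the response structure and act on it,
# not that the provider returns any particular user.
assert display_name(profile) == "Ada"
assert display_name({"email": "ops@example.com"}) == "ops@example.com"
```

The test stays green regardless of whether the real service is up, slow, or returning different users - it validates only the conversation we control.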
Frontend and backend follow the same pattern
Both frontend and backend applications provide interfaces to consumers and consume interfaces
from providers. The only difference is the consumer: a frontend provides an interface for
humans, while a backend provides one for machines. The testing strategy is the same.
Test frontend code the same way you test backend code: validate the interface you provide,
test logic in isolation, and verify that user actions trigger the correct behavior. The only
difference is the consumer (a human instead of a machine).
For a frontend:
Validate the interface you provide. Verify that the UI contains the components it should and that they appear correctly. This is the equivalent of verifying your API returns the right response structure.
Test behavior isolated from presentation. Use your unit test framework to test the
logic that UI controls trigger, separated from the rendering layer. This gives you the same
speed and control you get from testing backend logic in isolation.
Verify that controls trigger the right logic. Confirm that user actions invoke the
correct behavior, without needing a running backend or browser-based E2E test.
This approach gives you targeted testing with far more control. Testing exception flows -
what happens when a service returns an error, when a network request times out, when data is
malformed - becomes straightforward instead of requiring elaborate E2E setups that are hard
to make fail on demand.
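A minimal sketch of the pattern (Python stands in for your frontend language here; the handler, service, and messages are hypothetical): the logic a button triggers is an ordinary function, so exception flows are one stub away instead of one E2E environment away.

```python
# Sketch: UI behavior tested in isolation from rendering. The handler and
# messages are hypothetical; Python stands in for your frontend language.
def submit_order(cart, place_order, notify):
    # The logic a "Submit" button would trigger, free of any rendering code.
    if not cart:
        notify("Your cart is empty")
        return None
    try:
        return place_order(cart)
    except TimeoutError:
        notify("The order service timed out - please retry")
        return None

messages = []

def timing_out_service(cart):
    # Exception flow on demand - no elaborate E2E setup required.
    raise TimeoutError

assert submit_order([], timing_out_service, messages.append) is None
assert submit_order(["book"], timing_out_service, messages.append) is None
assert messages == [
    "Your cart is empty",
    "The order service timed out - please retry",
]
```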
Test Quality Over Coverage Percentage
Code coverage tells you which lines executed during tests. It does not tell you whether the tests
verified anything meaningful. A test suite with 90% coverage and no assertions has high coverage
and zero value.
Better questions than “what is our coverage percentage?”:
When a test fails, does it point directly to the defect?
When we refactor, do tests break because behavior changed or because implementation details
shifted?
Do our tests catch the bugs that actually reach production?
Can a developer trust a green build enough to deploy immediately?
Why coverage mandates are harmful
When teams are required to hit a coverage target, they
write tests to satisfy the metric rather than to verify behavior. This produces:
Tests that exercise code paths without asserting outcomes
Tests that mirror implementation rather than specify behavior
Tests that inflate the number without improving confidence
The metric goes up while the defect escape rate stays the same. Worse, meaningless tests add
maintenance cost and slow down the suite.
Instead of mandating a coverage number, set a coverage floor (see
Getting Started)
and focus team attention on test quality: mutation testing scores, defect escape rates, and
whether developers actually trust the suite enough to deploy on green.
Test Doubles - Patterns for isolating dependencies in tests
Contract Tests - Verifying that test doubles match reality
2 - Pipeline Test Strategy
What tests run where in a CD pipeline, how contract tests validate the test doubles used inside the pipeline, and why everything that blocks deployment must be deterministic.
Everything that blocks deployment must be deterministic and under your control. Everything
that involves external systems runs asynchronously or post-deployment. This gives you the
independence to deploy any time, regardless of the state of the world around you.
Tests Inside the Pipeline
These tests run on every commit and block deployment if they fail. They must be fast,
deterministic, and free of external dependencies.
Every test in this pipeline uses test doubles for
external dependencies. No test calls a real external API, database, or third-party service. This
means:
A downstream outage cannot block your deployment. Your pipeline runs the same whether
external systems are healthy or down.
Tests are deterministic. The same code always produces the same result.
The suite is fast. No network latency, no waiting for external systems to respond.
Why re-run tests post-merge?
Two changes can each pass pre-merge independently but conflict when combined on trunk. The
post-merge run catches these integration effects. If a post-merge failure occurs, the team
fixes it immediately. Trunk must always be releasable.
Tests Outside the Pipeline
These tests involve real external systems and are therefore non-deterministic. They never
block deployment. Instead, they validate assumptions and monitor production health.
| Test Type | When It Runs | What It Does on Failure |
|-----------|--------------|--------------------------|
| Contract tests | On a schedule (hourly or daily) | Triggers review; team updates test doubles to match new reality |
The pipeline’s deterministic tests depend on test doubles to represent external systems. But
test doubles can drift from reality. An API adds a required field, changes a response format,
or deprecates an endpoint. Contract tests close this gap.
Pipeline tests use test doubles that encode your assumptions about external APIs -
response schemas, status codes, error formats.
Contract tests run on a schedule and send real requests to the actual external APIs.
Contract tests compare the real response against what your test doubles return. They
check structure and types, not specific data values.
When a contract test passes, your test doubles are confirmed accurate. The pipeline’s
deterministic tests are trustworthy.
When a contract test fails, the team is alerted. They update the test doubles to match
the new reality, then re-run component tests to verify nothing breaks.
This design means your pipeline never touches external systems, but you still catch when
external systems change. You get both speed and accuracy.
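A scheduled contract check can be sketched as a shape comparison. The endpoint and field names below are illustrative assumptions; in the real test, the response would come from an HTTP call to the actual provider:

```python
# Minimal sketch of a scheduled contract check: compare the shape of a real
# provider response against the shape our test doubles encode.
EXPECTED_SHAPE = {"id": str, "name": str, "active": bool}  # what our doubles return

def matches_shape(payload: dict, shape: dict) -> bool:
    # Check structure and types, never specific data values.
    return all(
        key in payload and isinstance(payload[key], expected_type)
        for key, expected_type in shape.items()
    )

# In the scheduled run, `real_response` would be fetched from the actual
# external API; a literal stands in here.
real_response = {"id": "42", "name": "Ada", "active": True, "plan": "pro"}
assert matches_shape(real_response, EXPECTED_SHAPE)   # doubles still accurate

drifted = {"id": 42, "name": "Ada", "active": True}   # id became a number
assert not matches_shape(drifted, EXPECTED_SHAPE)     # alert: update the doubles
```

Extra fields in the real response are ignored deliberately: the contract covers only the properties your code consumes.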
Consumer-driven contracts
When the external API is owned by another team in your organization, you can go further with
consumer-driven contracts. Instead of your team polling their API on a schedule, both teams
share a contract specification (using a tool like Pact):
You (the consumer) define the requests you send and the responses you expect.
They (the provider) run your contract as part of their build. If a change would break
your expectations, their build fails before they deploy.
Your test doubles are generated from the contract, guaranteeing they match what the
provider actually delivers.
This shifts contract validation from “detect and react” to “prevent.” See
Contract Tests for implementation details.
Summary: All Stages at a Glance
| Stage | Blocks Deployment? | Uses Test Doubles? | Deterministic? |
|-------|--------------------|--------------------|----------------|
| Every Commit | Yes | Yes - all external deps | Yes |
| Post-Merge | Yes | Yes - all external deps | Yes |
| Scheduled (Contract) | No - triggers review | No - hits real APIs | No |
| Post-Deploy (E2E) | No - triggers rollback | No - real system | No |
| Production (Monitoring) | No - triggers alerts | No - real system | No |
The Testing reference provides detailed documentation
for each test type, including code examples and anti-patterns.
3 - Getting Started
Practical steps to audit your test suite, fix flaky tests, decouple from external dependencies, and adopt test-driven development.
Starting Without Full Coverage
Teams often delay adopting CI because their existing code lacks tests. This is backwards. You do
not need tests for existing code to begin. You need one rule applied without exception:
Every new change gets a test. We will not go lower than the current level of code coverage.
Record your current coverage percentage as a baseline. Configure CI to fail if coverage drops
below that number. This does not mean the baseline is good enough. It means the trend only moves
in one direction. Every bug fix, every new feature, and every refactoring adds tests. Over time,
coverage grows organically in the areas that matter most: the code that is actively changing.
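The ratchet can be a small CI gate. A hedged sketch follows - the baseline file name and JSON shape are assumptions, and most coverage tools (coverage.py, Istanbul, JaCoCo) can emit an equivalent report or enforce the floor natively:

```python
# Sketch of a coverage-floor gate: fail CI when coverage drops below the
# recorded baseline. Baseline file name and JSON shape are assumptions.
import json
import os
import tempfile

def check_coverage_floor(current: float, baseline_file: str) -> bool:
    # True when current line coverage meets or exceeds the recorded baseline.
    with open(baseline_file) as f:
        baseline = json.load(f)["line_coverage"]
    return current >= baseline

# Record today's coverage once as the baseline...
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"line_coverage": 72.5}, f)
    baseline_path = f.name

# ...then gate every CI run against it: the trend only moves in one direction.
assert check_coverage_floor(73.1, baseline_path)      # coverage grew: build passes
assert not check_coverage_floor(70.0, baseline_path)  # coverage dropped: build fails
os.unlink(baseline_path)
```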
Do not attempt to retrofit tests across the entire codebase before starting CI. That approach
takes months and delivers no incremental value. It also produces low-quality tests written by
developers who are testing code they did not write and do not fully understand.
Quick-Start Action Plan
If your test suite is not yet ready to support CD, use this focused action plan to make immediate
progress.
1. Audit your current test suite
Assess where you stand before making changes.
Actions:
Run your full test suite 3 times. Note total duration and any tests that pass intermittently
(flaky tests).
Count tests by type: unit, integration, functional, end-to-end.
Identify tests that require external dependencies (databases, APIs, file systems) to run.
Record your baseline: total test count, pass rate, duration, flaky test count.
Map each test type to a pipeline stage. Which tests gate deployment? Which run asynchronously?
Which tests couple your deployment to external systems?
Output: A clear picture of your test distribution and the specific problems to address.
2. Fix or remove flaky tests
Flaky tests are worse than no tests. They train developers to ignore failures, which means real
failures also get ignored.
Actions:
Quarantine all flaky tests immediately. Move them to a separate suite that does not block the
build.
For each quarantined test, decide: fix it (if the behavior it tests matters) or delete it (if
it does not).
Common causes of flakiness: timing dependencies, shared mutable state, reliance on external
services, test order dependencies.
Target: zero flaky tests in your main test suite.
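Timing dependencies are often the easiest of these causes to eliminate. A sketch (the token-expiry function is hypothetical): inject the clock so the test controls time instead of sleeping and hoping.

```python
# Fixing a timing-dependent flaky test: replace real time with an
# injected clock. The token-expiry function is a hypothetical example.
import time

def is_token_expired(issued_at: float, ttl_seconds: float, now=time.time) -> bool:
    # `now` is injectable so tests control time instead of sleeping.
    return now() - issued_at >= ttl_seconds

# Deterministic tests: fixed clocks, no sleeps, no flakiness.
assert is_token_expired(issued_at=100.0, ttl_seconds=60.0, now=lambda: 200.0)
assert not is_token_expired(issued_at=100.0, ttl_seconds=60.0, now=lambda: 120.0)
```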
3. Decouple your pipeline from external dependencies
This is the highest-leverage change for CD. Identify every test that calls a real external service
and replace that dependency with a test double.
Actions:
List every external service your tests depend on: databases, APIs, message queues, file
storage, third-party services.
For each dependency, decide the right test double approach:
In-memory fakes for databases (e.g., SQLite, H2, testcontainers with local instances).
HTTP stubs for external APIs (e.g., WireMock, nock, MSW).
Fakes for message queues, email services, and other infrastructure.
Replace the dependencies in your unit and component tests.
Move the original tests that hit real services into a separate suite. These become your
starting contract tests or E2E smoke tests.
Output: A test suite where everything that blocks the build is deterministic and runs without
network access to external systems.
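As a sketch of the in-memory-fake approach (the repository interface and user fields are illustrative assumptions), the logic under test stays identical whether the dependency is real or fake:

```python
# Sketch: replacing a real database repository with an in-memory fake.
# The repository interface and user fields are illustrative assumptions.
class InMemoryUserRepo:
    # Fake with the same interface as the real repository; no network needed.
    def __init__(self):
        self._users = {}

    def save(self, user_id, data):
        self._users[user_id] = data

    def get(self, user_id):
        return self._users.get(user_id)

def deactivate_user(repo, user_id):
    # Logic under test: unchanged whether the repo is real or fake.
    user = repo.get(user_id)
    if user is None:
        return False
    user["active"] = False
    repo.save(user_id, user)
    return True

repo = InMemoryUserRepo()
repo.save("u1", {"active": True})
assert deactivate_user(repo, "u1")
assert repo.get("u1")["active"] is False
assert not deactivate_user(repo, "missing")
```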
4. Add component tests for critical paths
If you do not have component tests that exercise your whole service in
isolation, start with the most critical paths.
Actions:
Identify the 3-5 most critical user journeys or API endpoints in your application.
Write a component test for each: boot the application, stub external dependencies, send a
real request or simulate a real user action, verify the response.
Each component test should prove that the feature works correctly assuming external
dependencies behave as expected (which your test doubles encode).
Run these in CI on every commit.
Output: Component tests covering your critical paths, running in CI on every commit.
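A component test can be sketched as follows. Everything here - the route, the handler, and the payment gateway - is hypothetical; the shape to notice is that the whole service is driven through its real entry point with the external dependency injected as a stub:

```python
# Component-test sketch: drive the service through its entry point with the
# external payment gateway stubbed. Route, handler, and gateway are hypothetical.
def make_app(payment_gateway):
    # "Boot the application" with its external dependencies injected.
    def handle(method, path, body=None):
        if (method, path) == ("POST", "/checkout"):
            charged = payment_gateway.charge(body["amount"])
            return (200, {"status": "paid"}) if charged else (402, {"status": "declined"})
        return (404, {})
    return handle

class StubGateway:
    # Test double encoding our assumption: the gateway succeeds or fails.
    def __init__(self, succeed):
        self.succeed = succeed

    def charge(self, amount):
        return self.succeed

# Critical path, happy flow: the feature works assuming the gateway
# behaves as our double encodes.
status, body = make_app(StubGateway(succeed=True))("POST", "/checkout", {"amount": 10})
assert (status, body["status"]) == (200, "paid")

# Exception flow: a declined payment, produced on demand.
status, body = make_app(StubGateway(succeed=False))("POST", "/checkout", {"amount": 10})
assert status == 402
```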
5. Set up contract tests for your most important dependency
Pick the external dependency that changes most frequently or has caused the most production
issues. Set up a contract test for it.
Actions:
Write a contract test that validates the response structure (types, required fields, status
codes) of the dependency’s API.
Run it on a schedule (e.g., every hour or daily), not on every commit.
When it fails, update your test doubles to match the new reality and re-verify your
component tests.
If the dependency is owned by another team in your organization, explore consumer-driven
contracts with a tool like Pact.
Output: One contract test running on a schedule, with a process to update test doubles when it fails.
6. Adopt TDD for new code
Once your pipeline tests are reliable, adopt TDD for all new work. TDD is the practice of writing the test before the code. It ensures every
piece of behavior has a corresponding test.
The TDD cycle
Red: Write a failing test that describes the behavior you want.
Green: Write the minimum code to make the test pass.
Refactor: Improve the code without changing the behavior. The test ensures you do not
break anything.
Why TDD matters for CD
Every change is automatically covered by a test
The test suite grows proportionally with the codebase
Tests describe behavior, not implementation, making them more resilient to refactoring
Developers get immediate feedback on whether their change works
TDD is not mandatory for CD, but teams that practice TDD consistently have significantly faster
and more reliable test suites.
How to start: Pick one new feature or bug fix this week. Write the test first, watch it
fail, write the code to make it pass, then refactor. Do not try to retroactively TDD your
entire codebase. Apply TDD to new code and to any code you modify.
Output: Team members practicing TDD on new work, with at least one completed red-green-refactor cycle.
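One full cycle can be sketched with a hypothetical feature (the `slugify` function below is an illustration, not from any codebase):

```python
# One red-green-refactor cycle, sketched.
# Red: the assertions below fail before `slugify` exists.
# Green: this minimal implementation makes them pass.
# Refactor: now improve the code freely - the tests guard the behavior.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

assert slugify("Continuous Delivery") == "continuous-delivery"
assert slugify("  Trunk   Based  Development ") == "trunk-based-development"
```

Note that the tests describe behavior (input and expected output), not implementation, so a later refactor - say, switching to a regex - leaves them untouched.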
How to trace defects to their origin and make systemic changes that prevent entire categories of bugs from recurring.
Treat every test failure as diagnostic data about where your process breaks down, not just as
something to fix. When you identify the systemic source of defects, you can prevent entire
categories from recurring.
Two questions sharpen this thinking:
What is the earliest point we can detect this defect? The later a defect is found, the
more expensive it is to fix. A requirements defect caught during example mapping costs
minutes. The same defect caught in production costs days of incident response, rollback,
and rework.
Can AI help us detect it earlier? AI-assisted tools can now surface defects at stages
where only human review was previously possible, shifting detection left without adding
manual effort.
Trace Every Defect to Its Origin
When a test catches a defect (or worse, when a defect escapes to production), ask: where was
this defect introduced, and what would have prevented it from being created?
Defects do not originate randomly. They cluster around specific causes. The
CD Defect Detection and Remediation Catalog
documents over 30 defect types across eight categories, with detection methods, AI
opportunities, and systemic fixes for each.
| Category | Example Defects | Earliest Detection | Systemic Fix |
|----------|-----------------|--------------------|--------------|
| Requirements | Building the right thing wrong, or the wrong thing right | Discovery, during story refinement or example mapping | Acceptance criteria as user outcomes, Three Amigos sessions, example mapping |
| Missing domain knowledge | Business rules encoded incorrectly, tribal knowledge loss | During coding, when the developer writes the logic | Ubiquitous language (DDD), pair programming, rotate ownership |
| Integration boundaries | Interface mismatches, wrong assumptions about upstream behavior | During design, when defining the interface contract | Contract tests per boundary, API-first design, circuit breakers |
| Untested edge cases | Null handling, boundary values, error paths | Pre-commit, through null-safe type systems and static analysis | Property-based testing, boundary value analysis, test for every bug fix |
| | | Pre-commit for null safety; CI for schema compatibility | Null-safe types, expand-then-contract for schema changes, design for idempotency |
For the complete catalog covering all defect categories (including product and discovery,
dependency and infrastructure, testing and observability gaps, and more) see the
CD Defect Detection and Remediation Catalog.
Build a Defect Feedback Loop
You need a process that systematically connects test
failures to root causes and root causes to systemic fixes.
Classify every defect. When a test fails or a bug is reported, tag it with its origin
category from the tables above. This takes seconds and builds a dataset over time.
Look for patterns. Monthly (or during retrospectives), review the defect
classifications. Which categories appear most often? That is where your process is weakest.
Apply the systemic fix, not just the local fix. When you fix a bug, also ask: what
systemic change would prevent this entire category of bug? If most defects come from
integration boundaries, the fix is not “write more integration tests.” It is “make contract
tests mandatory for every new boundary.” If most defects come from untested edge cases, the
fix is not “increase code coverage.” It is “adopt property-based testing as a standard
practice.”
Measure whether the fix works. Track defect counts by category over time. If you
applied a systemic fix for integration boundary defects and the count does not drop, the fix
is not working and you need a different approach.
The Test-for-Every-Bug-Fix Rule
Every bug fix must include a test that reproduces the bug before the fix and passes after.
This is non-negotiable for CD because:
It proves the fix actually addresses the defect (not just the symptom).
It prevents the same defect from recurring.
It builds test coverage exactly where the codebase is weakest: the places where bugs actually
occur.
Over time, it shifts your test suite from “tests we thought to write” to “tests that cover
real failure modes.”
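The rule in miniature (a hypothetical bug, for illustration): the regression test reproduces the crash before the fix and pins the behavior after it.

```python
# Sketch of the test-for-every-bug-fix rule. Hypothetical bug: the discount
# lookup crashed when the code was None. The guard below is the fix; the first
# assertion reproduced the bug (AttributeError) before the fix existed.
DISCOUNTS = {"SAVE10": 10}

def discount_percent(code):
    if not code:              # the fix: guard the None/empty-string case
        return 0
    return DISCOUNTS.get(code.upper(), 0)

# Regression tests: fail on the pre-fix version, pass after, and prevent
# the same defect from recurring.
assert discount_percent(None) == 0
assert discount_percent("save10") == 10
assert discount_percent("unknown") == 0
```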
Advanced Detection Techniques
As your test architecture matures, add techniques that catch defects before manual review:
| Technique | What It Finds | When to Adopt |
|-----------|---------------|---------------|
| Mutation testing (Stryker, PIT) | Tests that pass but do not actually verify behavior (your test suite's blind spots) | When basic coverage is in place but defect escape rate is not dropping |
| Property-based testing | Edge cases and boundary conditions across large input spaces that example-based tests miss | When defects cluster around unexpected input combinations |
| Chaos engineering | Failure modes in distributed systems: what happens when a dependency is slow, returns errors, or disappears | When you have component tests and contract tests in place and need confidence in failure handling |
| Static analysis and linting | Null safety violations, type errors, security vulnerabilities, dead code | |