Testing Fundamentals
Phase 1 - Foundations | Adapted from Dojo Consortium
Before you can trust your pipeline, you need a test suite that is fast, deterministic, and catches real defects. But a collection of tests is not enough. You need a test architecture - a deliberate structure where different types of tests work together to give you the confidence to deploy every change, regardless of whether external systems are up, slow, or behaving unexpectedly.
Why Testing Is a Foundation
Continuous delivery requires that trunk always be releasable. The only way to know trunk is releasable is to test it - automatically, on every change. Without a reliable test suite, daily integration is just daily risk.
In many organizations, testing is the single biggest obstacle to CD adoption. Not because teams lack tests, but because the tests they have are slow, flaky, poorly structured, and - most critically - unable to give the pipeline a reliable answer to the question: is this change safe to deploy?
Testing Goals for CD
Your test suite must meet these criteria before it can support continuous delivery:
| Goal | Target | Why |
|---|---|---|
| Fast | Full suite completes in under 10 minutes | Developers need feedback before context-switching |
| Deterministic | Same code always produces the same test result | Flaky tests destroy trust and get ignored |
| Catches real bugs | Tests fail when behavior is wrong, not when implementation changes | Brittle tests create noise, not signal |
| Independent of external systems | Pipeline can determine deployability without any dependency being available | Your ability to deploy cannot be held hostage by someone else’s outage |
If your test suite does not meet these criteria today, improving it is your highest-priority foundation work.
Beyond the Test Pyramid
The test pyramid - many unit tests at the base, fewer integration tests in the middle, a handful of end-to-end tests at the top - has been the dominant mental model for test strategy since Mike Cohn introduced it. The core insight is sound: push testing as low as possible. Lower-level tests are faster, more deterministic, and cheaper to maintain. Higher-level tests are slower, more brittle, and more expensive.
But as a prescriptive model, the pyramid is overly simplistic. Teams that treat it as a rigid ratio end up in unproductive debates about whether they have “too many” integration tests or “not enough” unit tests. The shape of your test distribution matters far less than whether your tests, taken together, give you the confidence to deploy.
What actually matters
The pyramid’s principle - write tests with different granularity - remains correct. But for CD, the question is not “do we have the right pyramid shape?” The question is:
Can our pipeline determine that a change is safe to deploy without depending on any system we do not control?
This reframes the testing conversation. Instead of counting tests by type and trying to match a diagram, you design a test architecture where:
- Fast, deterministic tests catch the vast majority of defects and run on every commit. These tests use test doubles for anything outside the team's control. They give you a reliable go/no-go signal in minutes.
- Contract tests verify that your test doubles still match reality. They run asynchronously and catch drift between your assumptions and the real world - without blocking your pipeline.
- A small number of non-deterministic tests validate that the fully integrated system works. These run post-deployment and provide monitoring, not gating.
This structure means your pipeline can confidently say “yes, deploy this” even if a downstream API is having an outage, a third-party service is slow, or a partner team hasn’t deployed their latest changes yet. Your ability to deliver is decoupled from the reliability of systems you do not own.
The anti-pattern: the ice cream cone
Most teams that struggle with CD have an inverted test distribution - too many slow, expensive end-to-end tests and too few fast, focused tests.
```
┌─────────────────────────┐
│     Manual Testing      │  ← Most testing happens here
├─────────────────────────┤
│    End-to-End Tests     │  ← Slow, flaky, expensive
├─────────────────────────┤
│   Integration Tests     │  ← Some, but not enough
├───────────┬─────────────┘
│Unit Tests │                ← Too few
└───────────┘
```
The ice cream cone makes CD impossible. Manual testing gates block every release. End-to-end tests take hours, fail randomly, and depend on external systems being healthy. The pipeline cannot give a fast, reliable answer about deployability, so deployments become high-ceremony events.
Test Architecture for the CD Pipeline
A test architecture is the deliberate structure of how different test types work together across your pipeline to give you deployment confidence. Each layer has a specific role, and the layers reinforce each other.
Layer 1: Unit tests - verify logic in isolation
Unit tests exercise individual functions, methods, or components with all external dependencies replaced by test doubles. They are the fastest and most deterministic tests you have.
Role in CD: Catch logic errors, regressions, and edge cases instantly. Provide the tightest feedback loop - developers should see results in seconds while coding.
What they cannot do: Verify that components work together, that your code correctly calls external services, or that the system behaves correctly as a whole.
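As a minimal sketch of what this layer looks like, assuming a Vitest-style runner and a hypothetical `calculateDiscount` function (the names and values are illustrative, not from any specific codebase):

```typescript
// discount.test.ts - unit test sketch for a pure function (hypothetical names)
import { describe, it, expect } from "vitest";
import { calculateDiscount } from "./discount";

describe("calculateDiscount", () => {
  it("applies a 10% discount for orders over 100", () => {
    // Pure logic, no I/O: runs in milliseconds and always returns the same answer
    expect(calculateDiscount({ total: 150, loyaltyTier: "gold" })).toBe(15);
  });

  it("applies no discount for small orders", () => {
    expect(calculateDiscount({ total: 20, loyaltyTier: "none" })).toBe(0);
  });
});
```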
See Unit Tests for detailed guidance.
Layer 2: Integration tests - verify boundaries
Integration tests verify that components interact correctly at their boundaries: database queries return the expected data, HTTP clients serialize requests correctly, message producers format messages as expected. External systems are replaced with test doubles, but internal collaborators are real.
Role in CD: Catch the bugs that unit tests miss - mismatched interfaces, serialization errors, query bugs. These tests are fast enough to run on every commit but realistic enough to catch real integration failures.
What they cannot do: Verify that the system works end-to-end from a user’s perspective, or that your assumptions about external services are still correct.
The line between unit tests and integration tests is often debated. As Ham Vocke writes in The Practical Test Pyramid: the naming matters less than the discipline. The key question is whether the test is fast, deterministic, and tests something your unit tests cannot. If yes, it belongs here.
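For example, an integration test might verify that an HTTP client builds the request your code actually intends to send. The sketch below assumes a hypothetical `PaymentClient` and uses nock to intercept the outbound call, so it stays fast and needs no real network access:

```typescript
// payment-client.integration.test.ts - boundary test sketch (hypothetical client and URL)
import nock from "nock";
import { describe, it, expect, afterEach } from "vitest";
import { PaymentClient } from "./payment-client";

afterEach(() => nock.cleanAll());

describe("PaymentClient", () => {
  it("serializes the charge request the way the API expects", async () => {
    // Intercept the outbound HTTP call and assert on the request body
    const scope = nock("https://payments.internal.test")
      .post("/charges", (body) => body.amountCents === 4999 && body.currency === "USD")
      .reply(201, { id: "ch_123", status: "created" });

    const client = new PaymentClient("https://payments.internal.test");
    const result = await client.charge({ amountCents: 4999, currency: "USD" });

    expect(result.status).toBe("created");
    scope.done(); // fails the test if the expected request was never made
  });
});
```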
See Integration Tests for detailed guidance.
Layer 3: Functional tests - verify your system works in isolation
Functional tests (also called component tests) exercise your entire sub-system - your service, your application - from the outside, as a user or consumer would interact with it. All external dependencies are replaced with test doubles. The test boots your application, sends real HTTP requests or simulates real user interactions, and verifies the responses.
Role in CD: This is the layer that proves your system works as a complete unit, independent of everything else. Functional tests answer: “if we deploy this service right now, will it behave correctly for every interaction that is within our control?” Because all external dependencies are stubbed, these tests are deterministic and fast. They can run on every commit.
Why this layer is critical for CD: Functional tests are what allow you to deploy with confidence even when dependencies outside your control are unavailable. Your test doubles simulate the expected behavior of those dependencies. As long as your doubles are accurate (which is what contract tests verify), your functional tests prove your system handles those interactions correctly.
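One possible shape for such a test, assuming a hypothetical `createApp` factory that accepts injected dependencies and a supertest-style HTTP driver:

```typescript
// checkout.functional.test.ts - functional (component) test sketch (hypothetical app factory)
import request from "supertest";
import { describe, it, expect } from "vitest";
import { createApp } from "../src/app";

describe("POST /checkout", () => {
  it("completes checkout when the (stubbed) payment provider approves", async () => {
    // Boot the whole service, but inject a test double for the external payment provider
    const app = createApp({
      paymentProvider: { charge: async () => ({ approved: true, reference: "test-ref" }) },
    });

    const response = await request(app)
      .post("/checkout")
      .send({ cartId: "cart-42", paymentToken: "tok-test" });

    expect(response.status).toBe(200);
    expect(response.body.reference).toBe("test-ref");
  });
});
```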
See Functional Tests for detailed guidance.
Layer 4: Contract tests - verify your assumptions about others
Contract tests validate that the test doubles you use in layers 1-3 still accurately represent the real external systems. They run against live dependencies and check contract format - response structures, field names, types, and status codes - not specific data values.
Role in CD: Contract tests are the bridge between your fast, deterministic test suite and the real world. Without them, your test doubles can silently drift from reality, and your functional tests provide false confidence. With them, you know that the assumptions baked into your test doubles are still correct.
Consumer-driven contracts take this further: the consumer of an API publishes expectations (using tools like Pact), and the provider runs those expectations as part of their build. Both teams know immediately when a change would break the contract.
Contract tests are non-deterministic because they hit live systems. They should not block your pipeline. Instead, failures trigger a review: has the contract changed, or was it a transient network issue? If the contract has changed, update your test doubles and re-verify.
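A minimal sketch of a contract check against a hypothetical `/users/self` endpoint on a dependency you consume - it asserts on status code, field presence, and types rather than specific data values, and runs on a schedule rather than in the commit-gating suite:

```typescript
// user-api.contract.test.ts - contract test sketch (hypothetical dependency URL and fields)
import { describe, it, expect } from "vitest";

const BASE_URL = process.env.USER_API_URL ?? "https://users.internal.example";

describe("User API contract", () => {
  it("still returns the shape our test doubles assume", async () => {
    const response = await fetch(`${BASE_URL}/users/self`, {
      headers: { Authorization: `Bearer ${process.env.USER_API_TOKEN}` },
    });

    // Check the contract: status code, required fields, and types - not specific values
    expect(response.status).toBe(200);
    const body = await response.json();
    expect(typeof body.id).toBe("string");
    expect(typeof body.email).toBe("string");
    expect(Array.isArray(body.roles)).toBe(true);
  });
});
```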
See Contract Tests for detailed guidance.
Layer 5: End-to-end tests - verify the integrated system post-deployment
End-to-end tests validate complete user journeys through the fully integrated system with no test doubles. They run against real services, real databases, and real third-party integrations.
Role in CD: E2E tests are monitoring, not gating. They run after deployment to verify that the integrated system works. A small suite of smoke tests can run immediately post-deployment to catch gross integration failures. Broader E2E suites run on a schedule.
Why E2E tests should not gate your pipeline: E2E tests are non-deterministic. They fail for reasons unrelated to your change - network blips, third-party outages, shared environment instability. If your pipeline depends on E2E tests passing before you can deploy, your deployment frequency is limited by the reliability of every system in the chain. This is the opposite of the independence CD requires.
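As one possible shape for a post-deployment smoke check, the sketch below hits a hypothetical health endpoint and one critical journey in the deployed environment; failures here trigger alerts or rollback rather than blocking the next deployment:

```typescript
// smoke.e2e.test.ts - post-deployment smoke test sketch (hypothetical URLs)
import { describe, it, expect } from "vitest";

const BASE_URL = process.env.DEPLOYED_BASE_URL ?? "https://app.example.com";

describe("post-deployment smoke", () => {
  it("reports healthy", async () => {
    const res = await fetch(`${BASE_URL}/health`);
    expect(res.status).toBe(200);
  });

  it("serves the product catalog through the real integrated stack", async () => {
    // Real services, real data: non-deterministic by nature, so this runs after deploy
    const res = await fetch(`${BASE_URL}/api/products?limit=1`);
    expect(res.status).toBe(200);
  });
});
```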
See End-to-End Tests for detailed guidance.
How the layers work together
| Pipeline stage | Test layer | Deterministic? | Blocks deploy? |
|---|---|---|---|
| On every commit | Unit tests | Yes | Yes |
| On every commit | Integration tests | Yes | Yes |
| On every commit | Functional tests | Yes | Yes |
| Asynchronous | Contract tests | No | No (triggers review) |
| Post-deployment | E2E smoke tests | No | Triggers rollback if critical |
| Post-deployment | Synthetic monitoring | No | Triggers alerts |
The critical insight: everything that blocks deployment is deterministic and under your control. Everything that involves external systems runs asynchronously or post-deployment. This is what gives you the independence to deploy any time, regardless of the state of the world around you.
Week 1 Action Plan
If your test suite is not yet ready to support CD, use this focused action plan to make immediate progress.
Day 1-2: Audit your current test suite
Assess where you stand before making changes.
Actions:
- Run your full test suite 3 times. Note total duration and any tests that pass intermittently (flaky tests).
- Count tests by type: unit, integration, functional, end-to-end.
- Identify tests that require external dependencies (databases, APIs, file systems) to run.
- Record your baseline: total test count, pass rate, duration, flaky test count.
- Map each test type to a pipeline stage. Which tests gate deployment? Which run asynchronously? Which tests couple your deployment to external systems?
Output: A clear picture of your test distribution and the specific problems to address.
Day 2-3: Fix or remove flaky tests
Flaky tests are worse than no tests. They train developers to ignore failures, which means real failures also get ignored.
Actions:
- Quarantine all flaky tests immediately. Move them to a separate suite that does not block the build.
- For each quarantined test, decide: fix it (if the behavior it tests matters) or delete it (if it does not).
- Common causes of flakiness: timing dependencies, shared mutable state, reliance on external services, test order dependencies.
- Target: zero flaky tests in your main test suite by end of week.
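One of the most common causes listed above is sleeping for a fixed time and hoping an async operation has finished. A way to make such a test deterministic is to poll for the condition with a hard upper bound instead, as in this sketch (the `ordersProcessed` helper is hypothetical):

```typescript
// Before (flaky): fixed sleep, fails whenever processing takes longer than 500 ms
//   await new Promise((resolve) => setTimeout(resolve, 500));
//   expect(await ordersProcessed()).toBe(1);

// After (deterministic): poll for the condition, with a timeout as the only failure mode
async function waitFor(condition: () => Promise<boolean>, timeoutMs = 5000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, 50));
  }
  throw new Error(`condition not met within ${timeoutMs} ms`);
}

// Usage in the test body:
//   await waitFor(async () => (await ordersProcessed()) === 1);
```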
Day 3-4: Decouple your pipeline from external dependencies
This is the highest-leverage change for CD. Identify every test that calls a real external service and replace that dependency with a test double.
Actions:
- List every external service your tests depend on: databases, APIs, message queues, file storage, third-party services.
- For each dependency, decide the right test double approach:
  - In-memory or local fakes for databases (e.g., SQLite, H2, or Testcontainers running a throwaway local instance).
  - HTTP stubs for external APIs (e.g., WireMock, nock, MSW).
  - Fakes for message queues, email services, and other infrastructure.
- Replace the dependencies in your unit, integration, and functional tests.
- Move the original tests that hit real services into a separate suite - these become your starting contract tests or E2E smoke tests.
Output: A test suite where everything that blocks the build is deterministic and runs without network access to external systems.
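A common decoupling pattern is to define the dependency as an interface your code owns and provide an in-memory fake for tests. The sketch below uses hypothetical names and is not tied to any particular framework:

```typescript
// A port your code owns; production wires in a real database adapter (not shown)
export interface OrderRepository {
  save(order: { id: string; total: number }): Promise<void>;
  findById(id: string): Promise<{ id: string; total: number } | undefined>;
}

// An in-memory fake used by unit, integration, and functional tests
export class InMemoryOrderRepository implements OrderRepository {
  private orders = new Map<string, { id: string; total: number }>();

  async save(order: { id: string; total: number }): Promise<void> {
    this.orders.set(order.id, order);
  }

  async findById(id: string): Promise<{ id: string; total: number } | undefined> {
    return this.orders.get(id);
  }
}
```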
Day 4-5: Add functional tests for critical paths
If you don’t have functional tests (component tests) that exercise your whole service in isolation, start with the most critical paths.
Actions:
- Identify the 3-5 most critical user journeys or API endpoints in your application.
- Write a functional test for each: boot the application, stub external dependencies, send a real request or simulate a real user action, verify the response.
- Each functional test should prove that the feature works correctly assuming external dependencies behave as expected (which your test doubles encode).
- Run these in CI on every commit.
Day 5: Set up contract tests for your most important dependency
Pick the external dependency that changes most frequently or has caused the most production issues. Set up a contract test for it.
Actions:
- Write a contract test that validates the response structure (types, required fields, status codes) of the dependency’s API.
- Run it on a schedule (e.g., every hour or daily), not on every commit.
- When it fails, update your test doubles to match the new reality and re-verify your functional tests.
- If the dependency is owned by another team in your organization, explore consumer-driven contracts with a tool like Pact.
Test-Driven Development (TDD)
TDD is the practice of writing the test before the code. It is the most effective way to build a reliable test suite because it ensures every piece of behavior has a corresponding test.
The TDD cycle:
- Red: Write a failing test that describes the behavior you want.
- Green: Write the minimum code to make the test pass.
- Refactor: Improve the code without changing the behavior. The test ensures you do not break anything.
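A minimal sketch of one pass through the cycle, using hypothetical names - first the failing test, then just enough code to make it pass:

```typescript
// Red: this test fails because slugify does not exist yet
import { describe, it, expect } from "vitest";
import { slugify } from "./slugify";

describe("slugify", () => {
  it("lowercases and hyphenates a title", () => {
    expect(slugify("Testing Fundamentals 101")).toBe("testing-fundamentals-101");
  });
});

// Green: the minimum implementation that makes the test pass (slugify.ts)
export function slugify(title: string): string {
  return title.trim().toLowerCase().replace(/\s+/g, "-");
}
// Refactor: clean up with the test as a safety net, and only add behavior
// (e.g., stripping punctuation) when a new failing test demands it.
```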
Why TDD supports CD:
- Every change is automatically covered by a test
- The test suite grows proportionally with the codebase
- Tests describe behavior, not implementation, making them more resilient to refactoring
- Developers get immediate feedback on whether their change works
TDD is not mandatory for CD, but teams that practice TDD consistently have significantly faster and more reliable test suites.
Getting started with TDD
If your team is new to TDD, start small:
- Pick one new feature or bug fix this week.
- Write the test first, watch it fail.
- Write the code to make it pass.
- Refactor.
- Repeat for the next change.
Do not try to retroactively TDD your entire codebase. Apply TDD to new code and to any code you modify.
Testing Matrix
Use this reference to decide what type of test to write and where it runs in your pipeline.
| What You Need to Verify | Test Type | Speed | Deterministic? | Blocks Deploy? |
|---|---|---|---|---|
| A function or method behaves correctly | Unit | Milliseconds | Yes | Yes |
| Components interact correctly at a boundary | Integration | Milliseconds to seconds | Yes | Yes |
| Your whole service works in isolation | Functional | Seconds | Yes | Yes |
| Your test doubles match reality | Contract | Seconds | No | No |
| A critical user journey works end-to-end | E2E | Minutes | No | No |
| Code quality, security, and style compliance | Static Analysis | Seconds | Yes | Yes |
Best Practices Summary
Do
- Run tests on every commit. If tests do not run automatically, they will be skipped.
- Keep the deterministic suite under 10 minutes. If it is slower, developers will stop running it locally.
- Fix broken tests immediately. A broken test is equivalent to a broken build.
- Delete tests that do not provide value. A test that never fails and tests trivial behavior is maintenance cost with no benefit.
- Test behavior, not implementation. Tests should verify what the code does, not how it does it. As Ham Vocke advises: "if I enter values x and y, will the result be z?" - not the sequence of internal calls that produce z.
- Use test doubles for external dependencies. Your deterministic tests should run without network access to external systems.
- Validate test doubles with contract tests. Test doubles that drift from reality give false confidence.
- Treat test code as production code. Give it the same care, review, and refactoring attention.
Do Not
- Do not tolerate flaky tests. Quarantine or delete them immediately.
- Do not gate your pipeline on non-deterministic tests. E2E and contract test failures should trigger review or alerts, not block deployment.
- Do not couple your deployment to external system availability. If a third-party API being down prevents you from deploying, your test architecture has a critical gap.
- Do not write tests after the fact as a checkbox exercise. Tests written without understanding the behavior they verify add noise, not value.
- Do not test private methods directly. Test the public interface; private methods are tested indirectly.
- Do not share mutable state between tests. Each test should set up and tear down its own state.
- Do not use sleep/wait for timing-dependent tests. Use explicit waits, polling, or event-driven assertions.
- Do not require a running database or external service for unit tests. That makes them integration tests - which is fine, but categorize them correctly.
Using Tests to Find and Eliminate Defect Sources
A test suite that catches bugs is good. A test suite that helps you stop producing those bugs is transformational. Every test failure is evidence of a defect, and every defect has a source. If you treat test failures only as things to fix, you are doing rework. If you treat them as diagnostic data about where your process breaks down, you can make systemic changes that prevent entire categories of defects from occurring.
This is the difference between a team that writes more tests to catch more bugs and a team that changes how it works so that fewer bugs are created in the first place.
Trace every defect to its origin
When a test catches a defect - or worse, when a defect escapes to production - ask: where was this defect introduced, and what would have prevented it from being created?
Defects do not originate randomly. They cluster around specific causes, and each cause has a systemic fix:
| Where Defects Originate | Example Defects | Detection Method | Systemic Fix |
|---|---|---|---|
| Requirements | Building the right thing wrong, or the wrong thing right | UX analytics, task completion tracking, A/B testing | Acceptance criteria as user outcomes, not implementation tasks. Three Amigos sessions before work starts. Example mapping to surface edge cases before coding begins. |
| Missing domain knowledge | Business rules encoded incorrectly, implicit assumptions | Magic number detection, knowledge-concentration metrics | Embed domain rules in code using ubiquitous language (DDD). Pair programming to spread knowledge. Living documentation generated from code. |
| Integration boundaries | Interface mismatches, wrong assumptions about upstream behavior | Consumer-driven contract tests, schema validation | Contract tests mandatory per boundary. API-first design. Document behavioral contracts, not just data schemas. |
| Untested edge cases | Null handling, boundary values, error paths | Mutation testing, branch coverage thresholds, property-based testing | Require a test for every bug fix. Adopt property-based testing for logic with many input permutations. Boundary value analysis as a standard practice. |
| Unintended side effects | Change to module A breaks module B | Mutation testing, change impact analysis | Small focused commits. Trunk-based development (integrate daily so side effects surface immediately). Modular design with clear boundaries. |
| Accumulated complexity | Defects cluster in the most complex, most-changed files | Complexity trends, duplication scoring, dependency cycle detection | Refactoring as part of every story, not deferred to a “tech debt sprint.” Dedicated complexity budget. |
| Long-lived branches | Merge conflicts, integration failures, stale code | Branch age alerts, merge conflict frequency | Trunk-based development. Merge at least daily. CI rejects stale branches. |
| Configuration drift | Works in staging, fails in production | IaC drift detection, environment comparison, smoke tests | All infrastructure as code. Same provisioning for every environment. Immutable infrastructure. |
| Data assumptions | Null pointer exceptions, schema migration failures | Null safety static analysis, schema compatibility checks, migration dry-runs | Enforce null-safe types. Expand-then-contract for all schema changes. |
Build a defect feedback loop
Knowing the categories is not enough. You need a process that systematically connects test failures to root causes and root causes to systemic fixes.
Step 1: Classify every defect. When a test fails or a bug is reported, tag it with its origin category from the table above. This takes seconds and builds a dataset over time.
Step 2: Look for patterns. Monthly (or during retrospectives), review the defect classifications. Which categories appear most often? That is where your process is weakest.
Step 3: Apply the systemic fix, not just the local fix. When you fix a bug, also ask: what systemic change would prevent this entire category of bug? If most defects come from integration boundaries, the fix is not “write more integration tests” - it is “make contract tests mandatory for every new boundary.” If most defects come from untested edge cases, the fix is not “increase code coverage” - it is “adopt property-based testing as a standard practice.”
Step 4: Measure whether the fix works. Track defect counts by category over time. If you applied a systemic fix for integration boundary defects and the count does not drop, the fix is not working and you need a different approach.
The test-for-every-bug-fix rule
One of the most effective systemic practices: every bug fix must include a test that reproduces the bug before the fix and passes after. This is non-negotiable for CD because:
- It proves the fix actually addresses the defect (not just the symptom).
- It prevents the same defect from recurring.
- It builds test coverage exactly where the codebase is weakest - the places where bugs actually occur.
- Over time, it shifts your test suite from “tests we thought to write” to “tests that cover real failure modes.”
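For instance, a regression test for a hypothetical rounding bug might look like the sketch below: it is written to fail on the pre-fix behavior and then stays in the suite permanently.

```typescript
// invoice.regression.test.ts - regression test sketch (hypothetical bug and function)
// Bug: totals were rounded per line item instead of once per invoice
import { describe, it, expect } from "vitest";
import { invoiceTotal } from "./invoice";

describe("invoiceTotal", () => {
  it("rounds once per invoice, not per line item", () => {
    // Three items at 0.333 each: per-item rounding gives 0.99, correct behavior gives 1.00
    const total = invoiceTotal([{ price: 0.333 }, { price: 0.333 }, { price: 0.333 }]);
    expect(total).toBe(1.0);
  });
});
```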
Advanced detection techniques
As your test architecture matures, add techniques that find defects humans overlook:
| Technique | What It Finds | When to Adopt |
|---|---|---|
| Mutation testing (Stryker, PIT) | Tests that pass but do not actually verify behavior - your test suite’s blind spots | When basic coverage is in place but defect escape rate is not dropping |
| Property-based testing | Edge cases and boundary conditions across large input spaces that example-based tests miss | When defects cluster around unexpected input combinations |
| Chaos engineering | Failure modes in distributed systems - what happens when a dependency is slow, returns errors, or disappears | When you have functional tests and contract tests in place and need confidence in failure handling |
| Static analysis and linting | Null safety violations, type errors, security vulnerabilities, dead code | From day one - these are cheap and fast |
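As a sketch of what property-based testing looks like in practice, the example below uses fast-check to assert a property over many generated inputs rather than a handful of hand-picked examples; the `slugify` function is the hypothetical one from the TDD sketch above.

```typescript
// slugify.property.test.ts - property-based test sketch (illustrative property)
import fc from "fast-check";
import { describe, it } from "vitest";
import { slugify } from "./slugify";

describe("slugify properties", () => {
  it("never produces whitespace, for any input string", () => {
    fc.assert(
      fc.property(fc.string(), (input) => {
        // The property must hold for every generated string, not just chosen examples
        return !/\s/.test(slugify(input));
      })
    );
  });
});
```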
For more examples of mapping defect origins to detection methods and systemic corrections, see the CD Defect Detection and Remediation Patterns.
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Deterministic suite duration | < 10 minutes | Enables fast feedback loops |
| Flaky test count | 0 in pipeline-gating suite | Maintains trust in test results |
| External dependencies in gating tests | 0 | Ensures deployment independence |
| Test coverage trend | Increasing | Confirms new code is being tested |
| Defect escape rate | Decreasing | Confirms tests catch real bugs |
| Contract test freshness | All passing within last 24 hours | Confirms test doubles are current |
Next Step
With a reliable test suite in place, automate your build process so that building, testing, and packaging happens with a single command. Continue to Build Automation.
This content is adapted from the Dojo Consortium, licensed under CC BY 4.0. Additional concepts drawn from Ham Vocke, The Practical Test Pyramid, and Toby Clemson, Testing Strategies in a Microservice Architecture.